This problem is typical of an OCR. Known as character targeting ( char segmentation algorithms
, if you search on google), this problem has been studied since the 1950s.
In general, the algorithm that both tesseract and ocropus use (the 2 most well-known OCRs opensources) does the following:
Find the connected components - Scan the image and find all pixels that have neighbors. So characters that do not stick together can be easily separated.
For each connected component, calculate a box that fits all component pixels (bounding box).
Only, as you mentioned, this algorithm fails in 2 cases. When the characters are stuck and when a character breaks in 2.
To solve this problem, ocrs use a technique called Seam Carving . In general terms, it is an algorithm that calculates a path that spends less "energy", or goes through fewer obstacles. In the case of character segmentation, a path with fewer obstacles is a vertical path that least touches the character's boundary.
For example, let's say we have the following image:
Thepathswiththefewestobstaclesarethefollowing:
Inthisimage,thepossiblecutsareblackinthesentenceandatthebottomagraphshowingthenumberofobstaclescalculatedbythealgorithmforeachcolumnofpixels.
Theresultofthisalgorithmisnotthecutsitself,butthecuttingpossibilities.Somearerightandsomearenot.ThenextstepinOCRistoresolvethisproblem.
Inthisstep,OCRgeneratesseveralclippingpossibilities,joiningorseparatingthesegmentsfound,asintheimagebelow:
ThisgeneratesdozensofpossibilitiesforOCR.ButhowdoesOCRknowthecorrectclipping?Simple,itthrowseverypossibilityintoacharacterrecognizer,whichwillreturnadegreeofconfidencethatthatcutisaletter.
TheOCRthentakesallthecutoutsinwhichthesumofthedegreesofconfidenceisaslargeaspossible.
PS:I'mnotgoingtodetaileveryalgorithmhere,becauseitwouldgiveyoua5,000wordpost,butyoucanstudyallthosealgorithmsimplementedin OCRopus .