This problem is typical of an OCR. Known as character targeting ( char segmentation algorithms
, if you search on google), this problem has been studied since the 1950s.
In general, the algorithm that both tesseract and ocropus use (the 2 most well-known OCRs opensources) does the following:
Find the connected components - Scan the image and find all pixels that have neighbors. So characters that do not stick together can be easily separated.
For each connected component, calculate a box that fits all component pixels (bounding box).
Only, as you mentioned, this algorithm fails in 2 cases. When the characters are stuck and when a character breaks in 2.
To solve this problem, ocrs use a technique called Seam Carving . In general terms, it is an algorithm that calculates a path that spends less "energy", or goes through fewer obstacles. In the case of character segmentation, a path with fewer obstacles is a vertical path that least touches the character's boundary.
For example, let's say we have the following image:
data:image/s3,"s3://crabby-images/e0ca0/e0ca0877dff91b5a4c7faf30f00758c307a0a86d" alt=""
Thepathswiththefewestobstaclesarethefollowing:
data:image/s3,"s3://crabby-images/fafb0/fafb0a2a5985f690fe193fecfce6b7e0af8ba5c4" alt=""
Inthisimage,thepossiblecutsareblackinthesentenceandatthebottomagraphshowingthenumberofobstaclescalculatedbythealgorithmforeachcolumnofpixels.
Theresultofthisalgorithmisnotthecutsitself,butthecuttingpossibilities.Somearerightandsomearenot.ThenextstepinOCRistoresolvethisproblem.
Inthisstep,OCRgeneratesseveralclippingpossibilities,joiningorseparatingthesegmentsfound,asintheimagebelow:
data:image/s3,"s3://crabby-images/118e3/118e3d2949ff97006b5bb37a86d1f4a691091b39" alt=""
ThisgeneratesdozensofpossibilitiesforOCR.ButhowdoesOCRknowthecorrectclipping?Simple,itthrowseverypossibilityintoacharacterrecognizer,whichwillreturnadegreeofconfidencethatthatcutisaletter.
TheOCRthentakesallthecutoutsinwhichthesumofthedegreesofconfidenceisaslargeaspossible.
PS:I'mnotgoingtodetaileveryalgorithmhere,becauseitwouldgiveyoua5,000wordpost,butyoucanstudyallthosealgorithmsimplementedin OCRopus .