Breaking captchas is not a trivial problem; you need to know a bit about image processing, even for simple cases.
This captcha is simple to break compared to most captchas out there: the letters are well detached from the background and well separated from each other.
The algorithm would be, roughly:
1. Separate each letter
2. Recognize each letter - the letters do not seem to vary in rotation, so a simple match between the cut-out letter and a pre-sorted alphabet should be enough.
Unfortunately I do not know VB, so I'll write the code in Python, but I believe it is simple to adapt to Visual Basic using Emgu, a .NET wrapper for OpenCV.
Step 1 - Separate the letters
I'll take an example from the page and run the algorithm step by step.
These steps are known in image processing as background removal. The idea of this technique is to remove from the image what is not important, leaving it in a single color, and to highlight the important objects, leaving them in another color, so that it is simple to get the coordinates of the objects of interest. (Wikipedia has more information: Background Subtraction.)
The first step is to turn the color image into grayscale image.
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
The result is the image below:
After that, to highlight the white letters, we'll use a morphological operation called dilation. It will make the letters "chubby", reinforcing the area of the objects of interest.
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (3, 3))
dilated = cv2.dilate(gray, kernel)
The result applied to the grayscale image is the image below:
Now that we have the letters well highlighted, we can turn the background black and the letters white just by looking at their color: pixels with a value smaller than 127 are painted black, and larger ones are painted white. This technique is called thresholding.
_, bw = cv2.threshold(dilated, 127, 255, cv2.THRESH_BINARY)
The result of the threshold in the dilated image is as below:
Nice, now the background is all black and the letters all white. Now we have to separate each letter, and for this OpenCV already has a nice function, which gives the same value to all neighboring pixels that have the same color.
total, markers = cv2.connectedComponents(bw)
total will have the number of components, including the background, and markers will be an image with each connected component (each letter) painted in the same color. It is drawn below:
Now we just find the coordinates of each letter. For this we use a method called findContours, which finds the outline of each letter.
# filter the components, keeping only those with more than 10 px and fewer than 1000 px
letters = [numpy.uint8(markers == i) * 255
           for i in range(total)
           if 10 < numpy.uint8(markers == i).sum() < 1000]
# make a copy of the black-and-white image, just for visualization
img = cv2.cvtColor(bw, cv2.COLOR_GRAY2RGB)
# paint rectangles around each component
color = (255, 0, 0)
for label in letters:
    # find the contours of each component
    # (in OpenCV 3, findContours returns an extra value first)
    contours, _ = cv2.findContours(label, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    # compute the rectangle around the contours
    (x, y, w, h) = cv2.boundingRect(contours[0])
    # and paint it
    cv2.rectangle(img, (x, y), (x + w, y + h), color, 1)
The result is the rectangles drawn around each letter, as in the image below:
Now, with the x, y, width, and height coordinates of each letter, it is trivial to cut them out and save them to a directory. These letters should then be sorted in some way (all the 'a's in one directory, all the 'b's in another, and so on).
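Cutting a letter out is just array slicing once you have its bounding box. A minimal sketch with a made-up blob (the coordinates and the `letters/a/` path are only illustrative):

```python
import numpy

# fake binary image with a single white "letter" blob (made-up coordinates)
bw = numpy.zeros((50, 150), dtype=numpy.uint8)
bw[10:40, 20:45] = 255

# bounding box from the white pixels (the same x, y, w, h cv2.boundingRect gives)
ys, xs = numpy.nonzero(bw)
x, y = xs.min(), ys.min()
w, h = xs.max() - x + 1, ys.max() - y + 1

crop = bw[y:y + h, x:x + w]  # the cut-out letter
print(crop.shape)  # (30, 25)
# cv2.imwrite('letters/a/0001.png', crop)  # then file it under its class directory
```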
Step 2 - Recognize the letters
Now that you have a good base of sorted letters and know how to separate the letters of a new captcha, you simply cut out each letter and compare it against all the letters of your base, brute-force style. OpenCV has a function that does this, called matchTemplate. It supports several methods to compute the difference between two images; in my experience, the TM_CCOEFF_NORMED method usually works well.
Imagining that you have one cut-out letter that you want to recognize, and a list of images as templates, you can use the method below, which will give you the best match.
# Find the best template for a letter
def search_for_letter(image, letter, templates):
    best = -1  # TM_CCOEFF_NORMED scores are at most 1.0
    pos = None
    for template in templates:
        # search for the template in the image, using the chosen method
        match = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
        # get the score and the location of the template
        minVal, maxVal, minLoc, maxLoc = cv2.minMaxLoc(match)
        if maxVal > best:
            pos = {
                'error': maxVal,  # despite the name, this holds the similarity score
                'location': maxLoc,
                'letter': letter
            }
            best = maxVal
    return pos
Now you can use this method in a loop, to find all the letters contained in the image, something like the method below:
# iterate over all the letters to find
# the best result
def search(file, templates):
    matches = []
    # this cut_and_binarize is the whole of step 1
    image = cut_and_binarize(file)
    for letter in templates:
        pos = search_for_letter(image, letter, templates[letter])
        if pos is not None:
            matches.append(pos)
    # sort the matches, best (highest similarity) first
    matches = sorted(matches, key=lambda x: x['error'], reverse=True)
    # take the 6 best matches and sort them by x
    return sorted(matches[:6], key=lambda x: x['location'][0])
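The `templates` argument here is assumed to be a dict mapping each letter to its list of template images, loaded from the per-letter directories of step 1. A minimal loader sketch (the `load_templates` name and the `read` parameter are mine, not part of any library):

```python
import os

# hypothetical helper: build {letter: [template, ...]} from a directory tree
# laid out as base_dir/<letter>/<file>, as suggested in step 1
def load_templates(base_dir, read=None):
    templates = {}
    for letter in sorted(os.listdir(base_dir)):
        letter_dir = os.path.join(base_dir, letter)
        files = [os.path.join(letter_dir, f) for f in sorted(os.listdir(letter_dir))]
        # with no reader, keep the file paths; with one, load each image
        templates[letter] = [read(f) for f in files] if read else files
    return templates
```

With OpenCV available this becomes `load_templates('letters', read=lambda f: cv2.imread(f, cv2.IMREAD_GRAYSCALE))`, and the result plugs straight into `search`.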
With this, I believe you can achieve more than 90% accuracy in recognizing this captcha. As I said at the outset, understanding image processing is important when automating captcha recognition, but once you know the basic techniques, the work is simple.