Image processing and specifically OCR (Optical Character Recognition) has become an obsession of mine lately. A lot of research is being done in OCR for handwriting, digitizing books, cursive writing, and even CAPTCHA cracking. For those of you who may not know what a CAPTCHA is, it stands for Completely Automated Public Turing test to tell Computers and Humans Apart. It's those little images with letters and numbers in them that are used when registering on websites and even posting comments to blogs and forums.
The idea is that using a CAPTCHA will prevent computer programs from automatically registering or submitting comments on a given website. Breaking a CAPTCHA by using OCR renders these systems irrelevant. It's definitely a game of cat and mouse. When a CAPTCHA is cracked, the intelligent thing to do is replace it with a stronger CAPTCHA.
I'm kind of reluctant to post very much information on how to crack CAPTCHAs and I'm sure it's obvious why. I probably won't be posting full source code for any given CAPTCHA and the code I do give out will either be crippled or just be snippets of a larger OCR program. The techniques used for cracking CAPTCHAs are really just image processing algorithms that have been applied for this specific use.
In the future I will be posting techniques on how to crack specific CAPTCHAs. For example, in my next article I'll present algorithms for cracking the CAPTCHA at Bumpzee.com. Generally, if I post an article on how to crack a specific CAPTCHA it will probably be a site that isn't worth spamming.
Each CAPTCHA is unique and the techniques used to crack a specific CAPTCHA have to be altered slightly, but generally all CAPTCHAs are cracked using similar techniques. For example, you read the image into memory, eliminate any noise, separate each character into its own image, then perform some kind of pixel matching to determine what each character is. With most CAPTCHAs in the wild today you can train your OCR software to recognize characters by doing pixel matching against each letter in the CAPTCHA.
This approach is really brute force and doesn't work very well on the more advanced CAPTCHAs. For now I will be focusing on the brute force pixel matching techniques and maybe in later posts I will go into advanced techniques.
Using Python and PIL (Python Imaging Library), loading a CAPTCHA (or any image) is as simple as:
from PIL import Image
img = Image.open(filename)
Sometimes a CAPTCHA will have noise in the background. Since each site's CAPTCHA is unique, you have to come up with techniques to eliminate that noise. One of my favorite techniques is to convert the CAPTCHA to a greyscale image:
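def captcha_to_greyscale(captcha):
    # already greyscale, nothing to do
    if captcha.mode == 'L': return captcha
    # the matrix form of convert() is meant for RGB input, so normalize to RGB first
    captcha = captcha.convert('RGB').convert('L', (.4, .4, .4, 0))
    return captcha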
I like to use (.4, .4, .4, 0) for my conversion matrix when converting from 'RGB' to 'L' (greyscale). Past experience has shown this to be a decent conversion matrix but like I said earlier, all CAPTCHAs are different and some might not do well with that conversion matrix. You may even be able to get away without using a conversion matrix at all.
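To see what that matrix does, here is a small sketch on made-up pixels. As I understand PIL's 4-tuple form, each grey value is computed as 0.4*R + 0.4*G + 0.4*B + 0 and then clamped to the 0..255 range. The sample colors and the helper name grey_values are mine, purely for illustration.

```python
from PIL import Image

MATRIX = (.4, .4, .4, 0)

def grey_values(rgb):
    """Return (default_grey, matrix_grey) for a single RGB color."""
    img = Image.new('RGB', (1, 1), rgb)
    # convert('L') with no matrix uses PIL's built-in luma weights
    default_grey = img.convert('L').getpixel((0, 0))
    # convert('L', MATRIX) weights all three channels equally
    matrix_grey = img.convert('L', MATRIX).getpixel((0, 0))
    return default_grey, matrix_grey

for rgb in [(0, 0, 255), (100, 100, 100), (255, 255, 255)]:
    print(rgb, grey_values(rgb))
```

Because the three weights add up to 1.2 rather than 1, the matrix brightens the image a little, which pushes light background noise closer to white. It also treats blue the same as red and green, whereas the default conversion renders pure blue very dark.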
After converting the CAPTCHA to greyscale, another technique I use is to eliminate pixels that aren't part of the letters. A lot of times this means the letters have darker shades of grey in them and the background noise has lighter shades of grey. You can determine which pixels to eliminate by trial and error. PIL provides a method that will give you all the colors in an image:
print(captcha.getcolors())
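For a greyscale image, getcolors() returns a list of (count, value) pairs, so sorting by value gives a quick histogram to eyeball. A minimal sketch on a synthetic image; the image contents and the helper name grey_histogram are assumptions for illustration.

```python
from PIL import Image

def grey_histogram(captcha):
    """Return (grey_value, pixel_count) pairs sorted from darkest to lightest."""
    colors = captcha.getcolors()  # list of (count, value) for an 'L' image
    return sorted((value, count) for count, value in colors)

# Synthetic stand-in for a CAPTCHA: light background, a dark "letter", lighter noise.
demo = Image.new('L', (10, 10), 230)
demo.paste(40, (2, 2, 6, 8))      # dark block: 4 x 6 = 24 pixels
demo.paste(180, (7, 0, 10, 2))    # lighter noise: 3 x 2 = 6 pixels

for value, count in grey_histogram(demo):
    print(value, count)
```

A big gap between the dark values and the light ones is a good candidate for the cutoff. Note that on an RGB image with more than 256 distinct colors, getcolors() returns None unless you pass a larger maxcolors argument.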
Working out which colors to eliminate from the output of getcolors() is all trial and error. Here's a function you can play with for eliminating lighter-colored pixels:
# pixels comes from the image with: pixels = captcha.load()
# w and h come from: w, h = captcha.size
# Note: captcha.load() returns a pixel access object. If you alter a pixel using this object
# then you alter the captcha itself. There are other ways to load pixels but I like using
# the pixel access objects.
# captcha.size returns a (width, height) tuple for the image.
def light_pixels_to_white_pixels(pixels, w, h):
    for x in range(w):
        for y in range(h):
            if pixels[x, y] > 140: pixels[x, y] = 255
    return pixels
The function is straightforward: iterate through each pixel, check if its color is greater than 140 and set it to white if the check passes. The idea is that this eliminates the lighter background noise while leaving the darker character pixels.
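The same thresholding can also be written with PIL's point() method, which maps every pixel value through a function. This is a sketch that assumes a greyscale ('L') image; the cutoff of 140 is the one used above, and the function name light_pixels_to_white is hypothetical.

```python
from PIL import Image

def light_pixels_to_white(captcha, cutoff=140):
    """Return a copy of an 'L' image with every value above cutoff set to white."""
    return captcha.point(lambda value: 255 if value > cutoff else value)

demo = Image.new('L', (3, 1))
demo.putpixel((0, 0), 60)    # dark letter pixel: kept
demo.putpixel((1, 0), 140)   # exactly at the cutoff: kept
demo.putpixel((2, 0), 141)   # just above the cutoff: becomes white

cleaned = light_pixels_to_white(demo)
print([cleaned.getpixel((x, 0)) for x in range(3)])
```

Unlike the loop over a pixel access object, point() returns a new image and leaves the original untouched, which is handy while you're still experimenting with cutoffs.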
After eliminating the basic noise, there's another thing I like to do called 'skeletonization'. There are a few ways of doing it that give similar, but not identical, results. To put it plainly, skeletonization is a technique that takes an image and reduces the number of edge pixels. For some CAPTCHAs it's good enough to look at the pixels surrounding each dark pixel and eliminate the dark pixel if too many of its neighbors are white. Another skeletonization technique is more advanced and can trim strokes down to one-pixel widths. The technique I'm going to cover here is the simpler version, for getting rid of some noise in the CAPTCHA.
# This function uses two passes over the pixels: one for marking black pixels for removal
# and one for actually removing them. The two-pass approach is generally ideal because
# you don't want to be flipping black pixels to white pixels while you're still iterating
# over them looking for neighbors. That would artificially inflate the number of white
# pixels and we don't want that.
def skeletonize(pixels, w, h):
    # offsets of the eight neighbors around a pixel
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                 (1, 1), (1, 0), (1, -1), (0, -1)]
    marked = []
    for x in range(w):
        for y in range(h):
            # no point in processing white pixels since we only want to remove black pixels
            if pixels[x, y] == 255: continue
            count = 0
            for dx, dy in neighbors:
                nx, ny = x + dx, y + dy
                # Test that each neighbor is inside the image's borders before reading it.
                # Wrapping the lookups in try/except instead would skip the remaining
                # neighbors after the first out-of-bounds read and leave 'count' too low
                # for pixels on the border.
                if 0 <= nx < w and 0 <= ny < h and pixels[nx, ny] != 255:
                    count += 1
            # not enough neighbors are dark pixels so mark this pixel
            # to be changed to white
            if count < 4:
                marked.append((x, y))
    # second pass: set all marked pixels to 255 (white)
    for x, y in marked:
        pixels[x, y] = 255
    return pixels
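To get a feel for what the count < 4 rule removes, here is a toy run on a hand-made image. The helper names (dark_neighbors, render) and the image itself are made up for illustration, and the neighbor lookups here are bounds-checked.

```python
from PIL import Image

def dark_neighbors(pixels, w, h, x, y):
    """Count the non-white pixels among the (up to) eight neighbors of (x, y)."""
    count = 0
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if (dx, dy) == (0, 0):
                continue
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h and pixels[nx, ny] != 255:
                count += 1
    return count

def render(img):
    """Draw the image as text: '#' for dark pixels, '.' for white ones."""
    w, h = img.size
    px = img.load()
    return [''.join('#' if px[x, y] != 255 else '.' for x in range(w))
            for y in range(h)]

# 7x5 white canvas with a solid 3x3 block and one isolated speck.
demo = Image.new('L', (7, 5), 255)
demo.paste(0, (1, 1, 4, 4))
demo.putpixel((6, 0), 0)

w, h = demo.size
px = demo.load()
before = render(demo)
# first pass: collect the dark pixels with fewer than 4 dark neighbors
doomed = [(x, y) for x in range(w) for y in range(h)
          if px[x, y] != 255 and dark_neighbors(px, w, h, x, y) < 4]
# second pass: turn them white
for x, y in doomed:
    px[x, y] = 255
after = render(demo)

print('\n'.join(before))
print('')
print('\n'.join(after))
```

The isolated speck disappears, and the block loses its four corners because each corner has only three dark neighbors. That corner trimming is the "fewer edge pixels" effect described earlier.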
Now that the CAPTCHA is clean and the noise is removed, the next step is to separate the characters from the CAPTCHA. There are a bunch of techniques for splitting a CAPTCHA into its letters. One that I've seen and even used is very brute force. The algorithm scans the CAPTCHA column by column looking for non-white pixels. When it finds one, it updates the min and max x,y coordinates seen so far for the current letter. Those coordinates allow you to crop the CAPTCHA and pull out the letter. It decides a letter's bounding box is complete when it reaches a column that has only white pixels, since a column with zero black pixels contains no letter pixels.
This brute-force approach is problematic when two letters occupy the same X coordinates at different Y coordinates. As you can imagine, there is no all-white column between such letters, so the algorithm pulls them out as a single image.
I'll cover the brute force algorithm for now and in a later post I will go over the more elegant flood-fill algorithm that doesn't fail on overlapping X coordinates.
def split_captcha_letters(captcha):
    started = False
    letters = []
    width, height = captcha.size
    bottomY, topY = 0, height
    pixels = captcha.load()
    for x in range(width):
        black_pixel_in_col = False
        for y in range(height):
            if pixels[x, y] != 255:
                if not started:
                    started = True
                    firstX = x
                    lastX = x
                if y > bottomY: bottomY = y
                if y < topY: topY = y
                if x > lastX: lastX = x
                black_pixel_in_col = True
        # a letter ends at the first all-white column, or at the right edge of the image
        if started and (not black_pixel_in_col or x == width - 1):
            # crop() excludes the right and lower edges of the box, hence the + 1
            rect = (firstX, topY, lastX + 1, bottomY + 1)
            new_captcha = captcha.crop(rect)
            letters.append(new_captcha)
            started = False
            bottomY, topY = 0, height
    return letters
The function above iterates over all the pixels in the CAPTCHA, one column at a time, looking for pixels that aren't white. If it's the first non-white pixel of a letter, it records that pixel's X coordinate in firstX and uses it as the initial value of lastX. For every non-white pixel it then compares the coordinates against the current top and bottom Y values and lastX, and overwrites them if the new pixel extends the box.
As long as there is a black pixel in each column, we know we're still looking at a letter, so we only crop the CAPTCHA when we hit a column without any non-white pixels. That's where the bounding box variables (firstX, topY, lastX, bottomY) come into play: they set up the crop box for the CAPTCHA.
The cropped image (a letter) is appended to the letters list, the bounding box variables are reset, and the scan resumes looking for more letters.
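The column scan boils down to finding runs of adjacent columns that contain at least one dark pixel. The sketch below is not the function above, just a stripped-down version of the same idea that only reports the X ranges. The name column_runs and both test images are invented for illustration, and the second image shows the overlapping-X failure mode.

```python
from PIL import Image

def column_runs(captcha):
    """Return (first_x, last_x) for each run of adjacent columns with a dark pixel."""
    w, h = captcha.size
    px = captcha.load()
    runs, start = [], None
    for x in range(w):
        has_dark = any(px[x, y] != 255 for y in range(h))
        if has_dark and start is None:
            start = x                      # a new letter begins
        elif not has_dark and start is not None:
            runs.append((start, x - 1))    # all-white column: the letter ended
            start = None
    if start is not None:                  # a run that touches the right edge
        runs.append((start, w - 1))
    return runs

# Two "letters" separated by white columns: found as two runs.
separate = Image.new('L', (12, 6), 255)
separate.paste(0, (1, 1, 4, 5))      # columns 1..3
separate.paste(0, (7, 2, 12, 4))     # columns 7..11, touching the right edge

# Two "letters" whose X ranges overlap at different heights: found as one run.
overlapping = Image.new('L', (12, 6), 255)
overlapping.paste(0, (1, 0, 6, 2))   # columns 1..5, top rows
overlapping.paste(0, (4, 4, 9, 6))   # columns 4..8, bottom rows

print(column_runs(separate))
print(column_runs(overlapping))
```

The second image has two clearly distinct shapes, but because columns 4 and 5 contain pixels from both, the scan never sees an all-white column between them and reports a single run.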
The final step in brute force CAPTCHA cracking is pixel matching. I'll be exploring more advanced methods of OCRing CAPTCHAs, but for now the simplest method is doing a pixel-by-pixel match.
There is one thing I've left out until this point: OCR software has to be trained. For example, when you first run a CAPTCHA cracker you have to tell it which characters it's reading. You basically have to solve CAPTCHAs for all letters and numbers until the OCR can successfully match a significant portion of all CAPTCHAs on a site. Training it is just a matter of letting the OCR software split the CAPTCHA into letters and then you manually input which letter it is. The software then saves that letter either in a directory named after the letter you input or in some other way that it's easily identified as being the correct letter.
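As a rough idea of what the "directory named after the letter" storage could look like, here is a minimal sketch. Everything in it is an assumption for illustration: the function name save_training_sample, the numbered PNG file names, and the temporary directory standing in for a real training folder.

```python
import os
import tempfile
from PIL import Image

def save_training_sample(letter_img, label, root):
    """Save letter_img under root/<label>/ and return the path it was written to."""
    folder = os.path.join(root, label)
    if not os.path.isdir(folder):
        os.makedirs(folder)
    # number the samples so several examples of the same character can coexist
    path = os.path.join(folder, '%d.png' % len(os.listdir(folder)))
    letter_img.save(path)
    return path

root = tempfile.mkdtemp()
sample = Image.new('L', (8, 10), 255)   # stand-in for a cropped letter
first = save_training_sample(sample, 'a', root)
second = save_training_sample(sample, 'a', root)
print(first)
print(second)
```

In a real training session the label would be typed in by you after looking at the cropped letter. One thing to watch for: on case-insensitive filesystems the folders 'a' and 'A' are the same directory, so a case-sensitive CAPTCHA would need a different naming scheme.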
This is where the pixel matching comes into play. The software splits the live CAPTCHA into its letters, iterates over all the saved letters that you 'trained' it with, and finds the best match for each one by counting the number of pixels that match. Since it knows where each saved letter came from, such as a directory, it knows that the directory name of the best-matched letter is the correct value for that character.
For now I'll leave out this pixel matching function and I may post it at a later date.