Cracking CAPTCHAs for Fun and Profit
November 21st, 2007
Image processing and specifically OCR (Optical Character Recognition) has become an obsession of mine lately. A lot of research is being done in OCR for handwriting, digitizing books, cursive writing, and even CAPTCHA cracking. For those of you who may not know what a CAPTCHA is, it stands for Completely Automated Public Turing test to tell Computers and Humans Apart. It’s those little images with letters and numbers in them that are used when registering on websites and even posting comments to blogs and forums.
The idea is that using a CAPTCHA will prevent computer programs from automatically registering or submitting comments on a given website. Breaking a CAPTCHA by using OCR renders these systems irrelevant. It’s definitely a game of cat and mouse. When a CAPTCHA is cracked, the intelligent thing to do is replace it with a stronger CAPTCHA.
I’m kind of reluctant to post very much information on how to crack CAPTCHAs and I’m sure it’s obvious why. I probably won’t be posting full source code for any given CAPTCHA and the code I do give out will either be crippled or just be snippets of a larger OCR program. The techniques used for cracking CAPTCHAs are really just image processing algorithms that have been applied for this specific use.
In the future I will be posting techniques on how to crack specific CAPTCHAs. For example, in my next article I’ll present algorithms for cracking the CAPTCHA at Bumpzee.com. Generally, if I post an article on how to crack a specific CAPTCHA, it will probably be for a site that isn’t worth spamming.
Each CAPTCHA is unique and the techniques used to crack a specific CAPTCHA have to be altered slightly, but generally all CAPTCHAs are cracked using similar techniques. For example, you read the image into memory, eliminate any noise, separate each character into its own image, then perform some kind of pixel matching to determine what each character is. With most CAPTCHAs in the wild today you can train your OCR software to recognize characters by doing pixel matching against each letter in the CAPTCHA.
This approach is really brute force and doesn’t work very well on the more advanced CAPTCHAs. For now I will be focusing on the brute force pixel matching techniques and maybe in later posts I will go into advanced techniques.
Using Python and PIL (Python Imaging Library), loading a CAPTCHA (or any image) is as simple as:
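A minimal sketch of that step, assuming the Pillow fork of PIL is installed. The helper name `load_captcha` is hypothetical, and the in-memory PNG is only a stand-in so the example runs without a real CAPTCHA file; normally you would pass the path of a downloaded image.

```python
import io

from PIL import Image


def load_captcha(source):
    """Open an image from a file path or a binary file-like object."""
    img = Image.open(source)
    img.load()  # read the pixel data now instead of lazily
    return img


# Stand-in for a real CAPTCHA file: a blank PNG written to memory.
buffer = io.BytesIO()
Image.new("RGB", (60, 20), "white").save(buffer, format="PNG")
buffer.seek(0)

captcha = load_captcha(buffer)
print(captcha.size, captcha.mode)
```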
Sometimes a CAPTCHA will have noise in the background. Since each site’s CAPTCHA is unique, you have to come up with techniques to eliminate that noise. One of my favorite techniques is to convert the CAPTCHA to a greyscale image:
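A minimal sketch of the conversion, assuming Pillow. The helper name `to_greyscale` is hypothetical, and its default matrix is the (.4, .4, .4, 0) one discussed in this section. The tiny generated image stands in for a real CAPTCHA.

```python
from PIL import Image


def to_greyscale(img, matrix=(0.4, 0.4, 0.4, 0)):
    """Convert an image to mode 'L' using a 4-value conversion matrix."""
    # Matrix conversion needs an RGB source, so normalise the mode first.
    return img.convert("RGB").convert("L", matrix)


# Stand-in image: black background with one white pixel.
sample = Image.new("RGB", (4, 4), (0, 0, 0))
sample.putpixel((0, 0), (255, 255, 255))

grey = to_greyscale(sample)
print(grey.mode, grey.getpixel((0, 0)), grey.getpixel((1, 1)))
```

Calling `img.convert("L")` with no matrix uses PIL's built-in luminance weights instead, which is the "no conversion matrix" option.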
I like to use (.4, .4, .4, 0) for my conversion matrix when converting from ‘RGB’ to ‘L’ (greyscale). Past experience has shown this to be a decent conversion matrix but like I said earlier, all CAPTCHAs are different and some might not do well with that conversion matrix. You may even be able to get away without using a conversion matrix at all.
After converting the CAPTCHA to greyscale, another technique I use is to eliminate pixels that aren’t part of the letters. A lot of times this means the letters have darker shades of grey in them and the background noise has lighter shades of grey. You can determine which pixels to eliminate by trial and error. PIL provides a method that will give you all the colors in an image:
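As a rough illustration, assuming Pillow, the sketch below builds a small greyscale image with a dark "letter", one light noise pixel and a white background, then lists its grey values with `getcolors()`. The image and its values are made up for the example.

```python
from PIL import Image

# Stand-in greyscale CAPTCHA: white background, dark letter, light noise.
grey = Image.new("L", (10, 10), 255)
for x in range(2, 5):
    for y in range(2, 8):
        grey.putpixel((x, y), 40)   # dark "letter" pixels
grey.putpixel((7, 7), 200)          # one light noise pixel

# getcolors() returns (count, value) pairs, or None when the image
# holds more than maxcolors distinct values.
colors = grey.getcolors(maxcolors=256)
for count, value in sorted(colors, reverse=True):
    print("grey value %3d appears %d times" % (value, count))
```

Sorting by count makes the background value stand out, which helps when picking a cut-off between letter and noise shades.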
Working out which colors to eliminate is trial and error: look at the output of getcolors(), modify some pixels, and check the result. Here’s a function you can play with for eliminating lighter-colored pixels:
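A minimal sketch of such a function, assuming Pillow and a greyscale image. The name `remove_light_pixels` is hypothetical, and the default threshold of 140 is the value used in this article; other CAPTCHAs will likely need a different one.

```python
from PIL import Image


def remove_light_pixels(img, threshold=140):
    """Return a copy with every pixel lighter than threshold set to white."""
    cleaned = img.convert("L")  # convert() returns a new image
    pixels = cleaned.load()
    width, height = cleaned.size
    for x in range(width):
        for y in range(height):
            if pixels[x, y] > threshold:
                pixels[x, y] = 255
    return cleaned


# Stand-in image with a dark, a borderline and a light pixel.
sample = Image.new("L", (3, 1), 255)
sample.putpixel((0, 0), 40)
sample.putpixel((1, 0), 140)
sample.putpixel((2, 0), 200)

cleaned = remove_light_pixels(sample)
print([cleaned.getpixel((x, 0)) for x in range(3)])
```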
The function is straightforward: iterate through each pixel, check if its color is greater than 140 and set it to white if the check passes. The idea is that this eliminates the lighter background noise while leaving the darker character pixels.
After eliminating the basic noise, there’s another thing I like to do called ‘skeletonization’. There are a few different ways of doing it, and they give similar but not identical results. To put it plainly, skeletonization is a technique that takes an image and reduces the number of edge pixels in it. For some CAPTCHAs it’s good enough to check the pixels surrounding each dark pixel and eliminate the dark pixel if too many of its neighbors are white. A more advanced skeletonization technique trims edges down, in some cases to one-pixel widths. The technique I’m going to cover here is the simpler version, for getting rid of some noise in the CAPTCHA.
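One way the simpler version might look, assuming Pillow and an image whose background is already pure white. The name `thin_noise`, the 8-neighbor window and the default limit of 5 white neighbors are assumptions for illustration, not values from the original code.

```python
from PIL import Image

WHITE = 255


def thin_noise(img, max_white_neighbours=5):
    """Whiten dark pixels that have more than max_white_neighbours white neighbours."""
    source = img.convert("L")
    src = source.load()
    result = source.copy()  # write to a copy so earlier edits don't affect later checks
    dst = result.load()
    width, height = source.size
    for x in range(width):
        for y in range(height):
            if src[x, y] == WHITE:
                continue
            white = 0
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    if dx == 0 and dy == 0:
                        continue
                    nx, ny = x + dx, y + dy
                    # Neighbours outside the image count as white background.
                    outside = not (0 <= nx < width and 0 <= ny < height)
                    if outside or src[nx, ny] == WHITE:
                        white += 1
            if white > max_white_neighbours:
                dst[x, y] = WHITE
    return result


# Stand-in image: a solid 4x4 "letter" block plus one isolated speck.
sample = Image.new("L", (10, 10), WHITE)
for x in range(2, 6):
    for y in range(2, 6):
        sample.putpixel((x, y), 0)
sample.putpixel((8, 8), 0)

thinned = thin_noise(sample)
print(thinned.getpixel((8, 8)), thinned.getpixel((2, 2)))
```

With the limit at 5, the corners of a solid block survive (they have exactly 5 white neighbors) while isolated specks are removed; lowering the limit starts to erode letter edges.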
Now that the CAPTCHA is clean and noise is removed, the next step is to separate the characters from the CAPTCHA. There are a bunch of techniques for splitting a CAPTCHA into its letters. One that I’ve seen and even used is very brute force. The algorithm iterates over the CAPTCHA’s pixels, column by column, and looks for non-white pixels. When it finds one, it records the x,y coordinates and updates the stored min and max x,y values. Those coordinates allow you to crop the CAPTCHA and pull out the letter. The way it determines where a letter’s bounding box ends is by finding a column that only has white pixels. A column with zero black pixels contains no letter pixels, so the letter’s bounding box is complete. This brute-force approach is problematic when a CAPTCHA has letters that share X coordinates but sit at different Y coordinates. As you can imagine, using this algorithm to split such a CAPTCHA will pull out two or more letters as one whenever the X coordinates of the letters overlap.
I’ll cover the brute force algorithm for now and in a later post I will go over the more elegant flood-fill algorithm that doesn’t fail on overlapping X coordinates.
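A sketch of the brute-force splitter, assuming Pillow and a cleaned greyscale image with a pure white background. The function name `split_letters` is hypothetical; the bounding box variables use the names from this article (firstX, topY, lastX, bottomY). The extra check after the loop, for a letter touching the right edge, is an addition of this sketch.

```python
from PIL import Image

WHITE = 255


def split_letters(img):
    """Split a cleaned CAPTCHA into letter images at all-white columns."""
    grey = img.convert("L")
    pixels = grey.load()
    width, height = grey.size
    letters = []
    firstX = lastX = topY = bottomY = None

    for x in range(width):
        column_has_ink = False
        for y in range(height):
            if pixels[x, y] != WHITE:
                column_has_ink = True
                if firstX is None:
                    # First non-white pixel of a new letter.
                    firstX = x
                    topY = bottomY = y
                lastX = x
                topY = min(topY, y)
                bottomY = max(bottomY, y)
        # An all-white column after some ink closes the current letter.
        if not column_has_ink and firstX is not None:
            # crop() takes (left, upper, right, lower); right/lower are exclusive.
            letters.append(grey.crop((firstX, topY, lastX + 1, bottomY + 1)))
            firstX = lastX = topY = bottomY = None

    # A letter touching the right edge never meets a closing white column.
    if firstX is not None:
        letters.append(grey.crop((firstX, topY, lastX + 1, bottomY + 1)))
    return letters


# Stand-in CAPTCHA: two solid rectangles separated by white columns.
sample = Image.new("L", (20, 10), WHITE)
for x in range(2, 5):
    for y in range(3, 7):
        sample.putpixel((x, y), 0)
for x in range(8, 13):
    for y in range(1, 9):
        sample.putpixel((x, y), 0)

letters = split_letters(sample)
print([letter.size for letter in letters])
```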
The brute-force splitting function iterates over all the pixels in the CAPTCHA looking for pixels that aren’t white. If a pixel is the first non-white pixel found, the function records that pixel’s X coordinate in firstX and sets the initial value for lastX. It then compares the pixel against the current minimum and maximum for the top and bottom Y coordinates and against lastX, overwriting those variables with new values if necessary.
As long as there is a black pixel in each column, we know we’re looking at a letter in the CAPTCHA, so we only crop the CAPTCHA when we hit a column without any non-white pixels. Those bounding box variables (firstX, topY, lastX, bottomY) now come into play when setting up a crop box for the CAPTCHA.
Append this cropped image (a letter) to the letters list, reset the algorithm’s bounding box variables and resume scanning the CAPTCHA for more letters.
The final step in brute force CAPTCHA cracking is pixel matching. I’ll be exploring more advanced methods of OCRing CAPTCHAs, but for now the simplest method is doing a pixel-by-pixel match.
There is one thing I’ve left out until this point: OCR software has to be trained. For example, when you first run a CAPTCHA cracker you have to tell it which characters it’s reading. You basically have to solve CAPTCHAs for all letters and numbers until the OCR can successfully match a significant portion of all CAPTCHAs on a site. Training it is just a matter of letting the OCR software split the CAPTCHA into letters and then you manually input which letter it is. The software then saves that letter either in a directory named after the letter you input or in some other way that it’s easily identified as being the correct letter.
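The directory-per-letter layout could be sketched as follows, assuming Pillow. The helper name `save_training_letter`, the numbered PNG file names and the temporary directory are all assumptions made so the example is self-contained; the label would normally come from whoever is doing the training.

```python
import os
import tempfile

from PIL import Image


def save_training_letter(letter_img, label, root):
    """Save a letter image under root/<label>/ and return the file path."""
    folder = os.path.join(root, label)
    os.makedirs(folder, exist_ok=True)
    # Name each sample after how many the label already has: 0.png, 1.png, ...
    path = os.path.join(folder, "%d.png" % len(os.listdir(folder)))
    letter_img.save(path)
    return path


# Stand-in training run: store two samples labelled "a".
with tempfile.TemporaryDirectory() as root:
    first = save_training_letter(Image.new("L", (5, 8), 0), "a", root)
    second = save_training_letter(Image.new("L", (5, 8), 0), "a", root)
    saved = sorted(os.listdir(os.path.join(root, "a")))

print(saved)
```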
This is where the pixel matching comes into play. It splits the live CAPTCHA into its letters, iterates over all saved letters that you ‘trained’ the software with, and then finds the best match by counting the number of pixels that are matched. Since it knows where the letter came from, such as a directory, it knows that the directory name of the best-matched letter is the correct value for that character.
For now I’ll leave out this pixel matching function and I may post it at a later date.