Solving reCAPTCHA with image segmentation machine learning

Nov 13, 2023

CAPTCHAs are seemingly simple puzzles that separate humans from bots in the digital realm. They’re like the gatekeepers of the internet — they challenge us to prove our humanity one click at a time. While CAPTCHAs have been historically robust to automation software, recent advances in Machine Learning may pose a threat to their existence.

So is it possible to bypass CAPTCHA with machine learning? Well … kind of! What follows is a step-by-step tutorial on building a working automation that can solve Google reCAPTCHA v2. Note that this system is not very accurate (though it performs surprisingly well for a college term project).

Automating reCAPTCHA Solves

First, some formalities. We are going to use a CNN pre-trained on the MIT ADE20K image segmentation dataset. We will also use Selenium WebDriver to drive clicks and interact with challenges live on the web. By combining machine learning with browser automation, we can build a system that looks at each challenge image, analyzes it, and then clicks on it based on the predicted class.

The higher level approach

Take a look at the above reCAPTCHA challenge. Each square in this 4x4 grid contains a lot of information. Let's think about our overall goal here: we want to determine whether each square contains traffic lights. We're working with images (or one large image) that may or may not contain the object class we are looking for. To me this sounds like an image classification problem, or better yet an image segmentation problem. Once we know which squares contain our solution class, we just need to click them, and that's where Selenium comes in! To summarize, we need to do the following:

Split the challenge grid into individual images and process them separately
Feed each image to a segmentation CNN and make a prediction of all the object classes that it contains
If our solution class is found in the image, note that it should be clicked
Click the images with Selenium
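
The first step in the list above, splitting the grid, can be sketched without a browser. This is a minimal sketch that assumes the challenge arrives as one image covering the whole grid; `tile_boxes` is a hypothetical helper name used for illustration and is not taken from the project source.

```python
def tile_boxes(width, height, n):
    """Return (left, upper, right, lower) crop boxes for an n x n grid, row by row."""
    boxes = []
    for r in range(n):
        for c in range(n):
            left = c * width // n
            upper = r * height // n
            right = (c + 1) * width // n
            lower = (r + 1) * height // n
            boxes.append((left, upper, right, lower))
    return boxes


boxes = tile_boxes(400, 400, 4)
print(len(boxes))  # 16
print(boxes[0])    # (0, 0, 100, 100)
print(boxes[5])    # (100, 100, 200, 200)
```

Each box uses the same (left, upper, right, lower) order that PIL's Image.crop expects, so the tiles could be cut out one by one and processed separately.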

Selenium what?

Selenium is an open-source framework designed for automating web browsers. It provides a suite of tools and libraries that allow developers and testers to interact with web applications programmatically. The core component, Selenium WebDriver, enables the automation of various browser actions, such as clicking buttons, filling forms, and navigating between pages. Selenium supports multiple programming languages, making it versatile for developers using Java, Python, C#, and more. In our case we can use Selenium's Python bindings to interact with CAPTCHA challenges live on the web.

So what is image segmentation?

TLDR: image segmentation labels every pixel in an image with a class. For example, pixel one belongs to a chair, pixel two belongs to a bus, etc.

Image segmentation is a computer vision and image processing technique that involves dividing an image into multiple, meaningful regions or segments. These segments are typically homogeneous with respect to certain visual characteristics, such as color, intensity, texture, or other features. There are various techniques for image segmentation, including thresholding, edge-based methods, region-based methods, clustering, deep learning-based approaches, and more. For our automation we will be using a resnet50dilated architecture pre-trained on the MIT ADE20K image segmentation dataset. I highly recommend you check out their demo Google Colab notebook if you want more information on this model and the training process.

Below is an example of this model at work. The top row of images represents the input and the second row the output. Each object is labelled by color. You can see that the model assigns an object class to every pixel in the image. This is just what we need to tell whether a challenge solution object is in an image and, in turn, whether that image needs to be clicked!
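
To make "a class for every pixel" concrete, here is a toy label map built by hand. It is only an illustration: the three class ids and names are made up for this example, and a real ADE20K model predicts 150 classes over a much larger image.

```python
import numpy as np

# Hypothetical class ids, chosen for this example only.
CLASS_NAMES = {0: "sky", 1: "road", 2: "traffic light"}

# A 4x4 "image" where every pixel already carries its predicted class id.
label_map = np.array([
    [0, 0, 0, 0],
    [0, 2, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
])

# Count how many pixels belong to each class that appears in the map.
ids, counts = np.unique(label_map, return_counts=True)
present = {CLASS_NAMES[int(i)]: int(n) for i, n in zip(ids, counts)}
print(present)  # {'sky': 7, 'road': 8, 'traffic light': 1}
```

Even a single traffic-light pixel shows up in the result, which is why a segmentation output is enough to decide whether a square contains the target.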

Diving into the code

Now I'm not gonna lie, there is a lot of code here to walk through. For the sake of time, I will give a higher level overview of the important parts & make the full source available on GitHub.

Creating the web driver instance

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

captcha_site = 'https://www.google.com/recaptcha/api2/demo'
click_delay = .5
solved = False

# create a new instance of the Chrome browser
driver = webdriver.Chrome()

# navigate to the website
driver.get(captcha_site)

# wait for the reCAPTCHA checkbox iframe to load and switch to it
wait = WebDriverWait(driver, 10)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, 'iframe[title="reCAPTCHA"]')))

# wait for div#rc-anchor-container to load and click it to trigger the challenge
recaptcha = wait.until(EC.presence_of_element_located((By.ID, 'rc-anchor-container')))
recaptcha.click()

Okay, let's unpack this. First of all, Google conveniently provides a live reCAPTCHA demo, which we use here. We create a new Chrome browser instance with Selenium, navigate to the demo site, and click on the checkbox to trigger the challenge. Next we scrape the actual challenge image out of the page HTML so that we can feed it to our segmentation model for prediction.

import os
import random
from io import BytesIO

import requests
from PIL import Image

def scrape_image(driver, wait):
    # switch back to the main frame, then into the challenge iframe
    driver.switch_to.default_content()
    wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, 'iframe[title="recaptcha challenge expires in two minutes"]')))
    image = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'img[src*="api2/payload"]')))
    image_src = image.get_attribute('src')

    # the challenge target (e.g. "traffic lights") is the bold text in the prompt
    label = driver.find_element(By.TAG_NAME, 'strong').text

    # download the challenge image and open it from bytes
    response = requests.get(image_src)
    image = Image.open(BytesIO(response.content))

    # keep a copy on disk; the extension can be any image format you prefer
    os.makedirs("captchas/test/", exist_ok=True)
    file_name = f"{label}{random.randint(0, 10000000)}.png"
    file_path = os.path.join("captchas/test/", file_name)
    image.save(file_path)

    return image, label



image, label = scrape_image(driver, wait)
pil_image = image.convert('RGB')
img_original = np.array(pil_image)
img = cv2.resize(img_original, (120, 120))
img = np.expand_dims(img, axis=0)
drive_4x4(driver, image, label, pred_type, click_delay)

You can see above that we scrape the image and do some conversions to clean it up for prediction. We also scrape its corresponding solution (which is just a text label at the top of the challenge). Next, let's look at a function that drives the clicks and handles predictions at a higher level.

def drive_4x4(driver, image, label, pred_type, click_delay):
    # predict on our scraped captcha
    output_list, pred_df = pipeline(image, label, pred_type)
    if output_list is False:
        # pipeline returns (False, False) when it cannot handle the challenge
        return

    # find all elements with class name "rc-imageselect-tile"
    elements = driver.find_elements(By.CLASS_NAME, 'rc-imageselect-tile')

    # click every tile whose prediction contains the target class
    for element, should_click in zip(elements, output_list):
        if should_click:
            element.click()
            time.sleep(click_delay)

    verify_button = driver.find_element(By.ID, 'recaptcha-verify-button')
    verify_button.click()

Now I will admit, this isn’t the most interesting piece of code, but it is pretty necessary. Basically, it calls a black box pipeline function that magically predicts the classes in each square of our challenge. We then map those predictions to the tile elements on the reCAPTCHA demo site, which we find by their CSS class, and click each tile whose prediction matches our solution! But let's dive a bit deeper into where the magic happens.

def pipeline(image, label, pred_type):
    confidence = 10  # in solve_4x4: how many of the most frequent classes per square to check
    target = parse_target(label)
    print("Target", target)
    if not target_present(target):
        print(target, "is not found in our classes")
        return (False, False)
    pil_image = image.convert('RGB')
    
    if pred_type == "nonsegmentation":
        output_list, pred_df = solve_3_x_3(pil_image, target, confidence)
        return output_list, pred_df
    elif pred_type == "segmentation":
        output_list, pred_df = solve_4x4(pil_image, target, confidence)
        return output_list, pred_df
    else:
        print("Letters detected")
        return (False, False)

First we make sure that our scraped solution target is something the model can handle. parse_target extracts the object class from the challenge label, and target_present checks that this class is in our target list. The reasoning here is that ADE20K contains 150 object categories, so our pre-trained model can only predict those classes. If the CAPTCHA challenge asks for something outside these 150 classes, we have no chance of predicting it. Luckily, at the time of writing, most reCAPTCHA v2 objects were in this list (you can also fine-tune a model with custom data, but more on that later). We then call yet another black box function, solve_4x4, which, I promise, is the bottom of the stack this time.
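
The two helpers are not shown in this post, so here is one way they might look. This is a sketch under assumptions: `KNOWN_CLASSES` is a small hypothetical stand-in for the model's full class list, and the label normalization (lowercasing, dropping a leading article, naive singularization) is a guess at what is needed, not the project's actual logic.

```python
# Hypothetical stand-in for the class names the model can predict.
KNOWN_CLASSES = {"traffic light", "car", "bus", "bicycle", "stairs"}


def parse_target(label):
    """Normalize a challenge label such as 'Traffic lights' to a class name."""
    target = label.strip().lower()
    # Drop a leading article, e.g. "a fire hydrant" -> "fire hydrant".
    for article in ("a ", "an "):
        if target.startswith(article):
            target = target[len(article):]
    if target in KNOWN_CLASSES:
        return target
    # Naive singularization: try removing "es", then "s".
    for suffix in ("es", "s"):
        singular = target[:-len(suffix)]
        if target.endswith(suffix) and singular in KNOWN_CLASSES:
            return singular
    return target


def target_present(target):
    """Return True if the model has a class for this target."""
    return target in KNOWN_CLASSES


print(parse_target("Traffic lights"))                    # traffic light
print(target_present(parse_target("buses")))             # True
print(target_present(parse_target("a fire hydrant")))    # False
```

With this shape, an unsupported label falls through unchanged and target_present rejects it, which matches the early return in pipeline.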

def solve_4x4(image, target_name, confidence):
    # Define the transformation to apply to the image
    pil_to_tensor = torchvision.transforms.Compose([
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(
            mean=[0.485, 0.456, 0.406], # These are RGB mean+std values
            std=[0.229, 0.224, 0.225])  # across a large photo dataset.
    ])

    # Define the output list to return
    output_list = []
    # Define the dictionary of predictions to return
    pred_dict = {}

    # Convert the image to a numpy array
    img_original = np.array(image)
    # Convert the image to a tensor
    img_data = pil_to_tensor(image)
    # Create a singleton batch
    singleton_batch = {'img_data': img_data[None].cpu()}
    # Get the output size
    output_size = img_data.shape[1:]

    # Run the segmentation at the highest resolution.
    with torch.no_grad():
        scores = segmentation_module(singleton_batch, segSize=output_size)

    # Get the predicted scores for each pixel
    _, pred = torch.max(scores, dim=1)
    pred = pred.cpu()[0].numpy()

    # Create a grid of predictions
    pred_grid = create_grid_pred(pred, 4)

    # Look up the target class index once; names is indexed from 1, hence i+1
    target_class = -1
    for i in range(len(names)):
        if target_name in names[i+1]:
            target_class = i
            break
    if target_class == -1:
        raise ValueError('Target class not found in the class names.')

    for r in range(4):
        for c in range(4):
            # Get the prediction for the current square
            tile_pred = pred_grid[r][c]

            # Rank classes by pixel count and keep the top `confidence`,
            # skipping classes that do not occur in this square at all
            counts = np.bincount(tile_pred.flatten())
            ranked = np.argsort(counts)[::-1][:confidence]
            top_preds = [i for i in ranked if counts[i] > 0]
            top_classes = [names[i+1] for i in top_preds]
            is_present = target_class in top_preds

            output_list.append(is_present)
            pred_dict[f'{r},{c}'] = top_classes

    # Get maximum length of arrays in pred_dict
    max_len = max(len(v) for v in pred_dict.values())
    # Pad shorter arrays with NaN values
    d_padded = {k: v + [np.nan]*(max_len - len(v)) for k, v in pred_dict.items()}
    # Convert to dataframe
    pred_df = pd.DataFrame.from_dict(d_padded)
    return output_list, pred_df

Okay, there's a lot going on here. First we build a torchvision.transforms.Compose pipeline that converts our image to a tensor and normalizes its color channels. If you aren’t super familiar with machine learning, no need to worry about this step too much. Basically we are turning our image into something that our model can understand. Now, time for the magic!

    # Create a singleton batch
    singleton_batch = {'img_data': img_data[None].cpu()}
    # Get the output size
    output_size = img_data.shape[1:]

    # Run the segmentation at the highest resolution.
    with torch.no_grad():
        scores = segmentation_module(singleton_batch, segSize=output_size)

    # Get the predicted scores for each pixel
    _, pred = torch.max(scores, dim=1)
    pred = pred.cpu()[0].numpy()

    # Create a grid of predictions
    pred_grid = create_grid_pred(pred, 4)

The above code does the following. First we create a singleton_batch from our image. A batch is a common term in machine learning: it is the group of examples a model processes together (during training, the examples it learns from before adjusting its weights). In our case the batch contains a single image, because we are running predictions and not training the model: singleton_batch = {'img_data': img_data[None].cpu()}
Next the segmentation module is given this image to predict on, using PyTorch. Note that the call is wrapped in torch.no_grad(), which turns off gradient tracking, since we are only running inference and not updating the model: segmentation_module(singleton_batch, segSize=output_size)
Our model spits out 150 scores for each pixel, one per class. Each score is just a probability, so the class with the highest score is our prediction for that pixel. That's exactly what this code does! We now have an array of predicted classes for all the pixels in our image: _, pred = torch.max(scores, dim=1); pred = pred.cpu()[0].numpy()
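
The "pick the highest score per pixel" step is easy to check on a tiny array. The sketch below uses NumPy as a stand-in for PyTorch and three made-up classes instead of 150; argmax over the class axis gives the same indices that torch.max(scores, dim=1) returns as its second value.

```python
import numpy as np

# Scores for 3 hypothetical classes over a 2x2 image.
# Shape is (batch, classes, height, width), mirroring the model output layout.
scores = np.array([[
    [[0.7, 0.1], [0.2, 0.3]],  # class 0
    [[0.2, 0.8], [0.3, 0.1]],  # class 1
    [[0.1, 0.1], [0.5, 0.6]],  # class 2
]])

# For every pixel, take the index of the class with the highest score,
# then drop the batch dimension to get a 2-D label map.
pred = scores.argmax(axis=1)[0]
print(pred.tolist())  # [[0, 1], [2, 2]]
```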

Finally, let's take our pred_grid and iterate through the image squares on the page. If our solution class is present in the current square, we need to click that square!

    # Create a grid of predictions
    pred_grid = create_grid_pred(pred, 4)

    # Look up the target class index once; names is indexed from 1, hence i+1
    target_class = -1
    for i in range(len(names)):
        if target_name in names[i+1]:
            target_class = i
            break
    if target_class == -1:
        raise ValueError('Target class not found in the class names.')

    for r in range(4):
        for c in range(4):
            # Get the prediction for the current square
            tile_pred = pred_grid[r][c]

            # Rank classes by pixel count and keep the top `confidence`,
            # skipping classes that do not occur in this square at all
            counts = np.bincount(tile_pred.flatten())
            ranked = np.argsort(counts)[::-1][:confidence]
            top_preds = [i for i in ranked if counts[i] > 0]
            top_classes = [names[i+1] for i in top_preds]
            is_present = target_class in top_preds

            output_list.append(is_present)
            pred_dict[f'{r},{c}'] = top_classes
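
The create_grid_pred helper is not shown in this post. A plausible minimal version, assuming it only needs to cut the 2-D prediction array into an n x n nested list of squares, could look like this; the body is my own sketch, not the project's implementation.

```python
import numpy as np


def create_grid_pred(pred, n):
    """Split a 2-D label map into an n x n nested list of tiles (assumed behavior)."""
    rows = np.array_split(pred, n, axis=0)
    return [np.array_split(row, n, axis=1) for row in rows]


# Toy 8x8 prediction: everything is class 0 except a patch of class 5.
pred = np.zeros((8, 8), dtype=int)
pred[0:2, 2:4] = 5

grid = create_grid_pred(pred, 4)
print(grid[0][1].shape)  # (2, 2)

# Classes that actually occur in the square at row 0, column 1.
counts = np.bincount(grid[0][1].flatten())
print(np.flatnonzero(counts).tolist())  # [5]
```

np.array_split is used rather than np.split so the sketch also works when the image size is not an exact multiple of the grid size.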

Closing thoughts

While I only scratched the surface here, hopefully you got a taste of a real approach to solving CAPTCHA with machine learning. Think about how this could be improved to be pretty effective! Interestingly enough, as Google reCAPTCHA v3 slowly rolls out, techniques like this will no longer apply there, since v3 replaces challenge grids with background checks that score each request. See more on reCAPTCHA v3 here.

If you liked this post and would like a much more detailed guide on the approach, feel free to leave a comment & I will happily write a more extensive walkthrough. If you got here, thanks for reading :).