These examples show how to gather data efficiently from websites protected by advanced anti-bot systems such as Cloudflare Turnstile.
A more detailed description of this project is available on our website.
- I use Scrapy to gather data from the quotes.toscrape.com website.
- The spider crawls all 10 pages of data in 2.6 seconds, which makes this a very efficient technique.
- I show that Scrapy receives an HTTP 403 error when I try to scrape data from a website that is protected by an anti-bot system.
- In the second example I use Playwright to scrape the same dataset from the quotes.toscrape.com website.
- The headless browser has to render each page, which makes scraping slower. It takes about 6.4 seconds to gather the data.
- I show that Playwright gets stuck in an "infinite captcha" loop when I try to scrape data from a website that is protected by an anti-bot system.
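
As a rough, library-free illustration of the two points above (timing a crawl over the 10 pages and noticing a 403 block), here is a minimal sketch. It is not the Scrapy spider from this project: it uses only the Python standard library, and the `/page/N/` URL pattern, the `demo-crawler` user agent, and all function names are assumptions made for the example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List, Tuple

# Assumed URL pattern for the 10 listing pages; adjust if the site differs.
PAGE_URL = "https://quotes.toscrape.com/page/{}/"

# A fetch function takes a URL and returns (status_code, body).
Fetch = Callable[[str], Tuple[int, str]]


class BlockedError(Exception):
    """Raised when the server answers with HTTP 403."""


def page_urls(pages: int = 10) -> List[str]:
    """Build the list of page URLs to crawl."""
    return [PAGE_URL.format(n) for n in range(1, pages + 1)]


def fetch_with_urllib(url: str) -> Tuple[int, str]:
    """Plain HTTP fetch that returns the status instead of raising on 4xx/5xx."""
    request = urllib.request.Request(url, headers={"User-Agent": "demo-crawler"})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, ""


def timed_crawl(urls: List[str], fetch: Fetch) -> Tuple[List[str], float]:
    """Fetch every URL in order; stop with BlockedError on the first 403."""
    start = time.perf_counter()
    bodies = []
    for url in urls:
        status, body = fetch(url)
        if status == 403:
            raise BlockedError(url)
        bodies.append(body)
    return bodies, time.perf_counter() - start
```

Calling `timed_crawl(page_urls(), fetch_with_urllib)` returns the page bodies and the elapsed seconds. Passing the fetch function as a parameter keeps the timing logic independent of the HTTP client, so a browser-based fetcher could be measured the same way.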
Headless browsers come in handy when you scrape data from JavaScript-heavy websites, or when you want to interact more with the website. When data is protected by anti-bot systems, the best you can do is to use an anti-detect browser. Kameleo provides an undetectable web automation browser. It is not an open-source solution, but the platform provides unlimited fresh fingerprints and keeps its custom-built browsers (Chroma and Junglefox) constantly updated, so you stay on top of the anti-bot game without tiring maintenance overhead.
In the second part of the demo we try to scrape data from the review page of BurgerKing on indeed.com.
If you try to open the page, you will see the Cloudflare Turnstile challenge.
Most headless Chrome browsers fail to bypass this protection layer.
According to Pierluigi Vinciguerra from The Web Scraping Club, it is very hard to rely on open-source solutions such as Playwright and Cloudscraper. I couldn’t make it work with almost any open-source tool, such as Puppeteer Stealth or Playwright Stealth. Even when I found a working solution such as Botasaurus, it stopped working once I deployed my code, and Cloudflare blocked my scraper bot.
Kameleo is an anti-detect browser specialized for web scraping. We are constantly testing our custom-built browsers (Chroma and Junglefox) against anti-bot systems. Updates are quickly deployed to ensure you don't need to maintain your code to keep a high success rate.
- Kameleo launches its undetected Chrome build (called Chroma) with a fresh browser fingerprint.
- It bypasses the Cloudflare Turnstile and loads the BurgerKing review page on indeed.com.
- We export the `cf_clearance` cookie, which is our "pass-through ticket for future Cloudflare verifications".
- On its own, Scrapy can't scrape the data from the BurgerKing review page: Cloudflare's protection answers with a `403 Forbidden` error.
- So I add the `cf_clearance` cookie to the request.
- Note that I also need to set the same `user-agent` for Scrapy that I used with Kameleo when I obtained the `cf_clearance` cookie.
- This ensures I can scrape effectively behind Cloudflare's protection layer.
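
The cookie handoff described above can be sketched roughly as follows. This is a minimal illustration and not the code from this project: both helper names are hypothetical, and the cookie list is assumed to be a list of dicts with `name` and `value` keys (the shape that Playwright's `context.cookies()` returns). The resulting dict uses the `url`, `cookies`, and `headers` keyword arguments that `scrapy.Request` accepts.

```python
from typing import Any, Dict, Iterable, Mapping


def extract_cf_clearance(cookies: Iterable[Mapping[str, Any]]) -> str:
    """Pick the cf_clearance value out of a browser cookie export.

    Assumes each cookie is a mapping with "name" and "value" keys.
    """
    for cookie in cookies:
        if cookie.get("name") == "cf_clearance":
            return cookie["value"]
    raise LookupError("cf_clearance not found; was the challenge solved?")


def build_request_kwargs(url: str, cf_clearance: str, user_agent: str) -> Dict[str, Any]:
    """Build keyword arguments for a request that reuses the clearance.

    The User-Agent must match the browser that earned the cookie.
    """
    return {
        "url": url,
        "cookies": {"cf_clearance": cf_clearance},
        "headers": {"User-Agent": user_agent},
    }
```

Inside a spider, these could be used as `yield scrapy.Request(callback=self.parse, **build_request_kwargs(url, token, user_agent))`, where `token` comes from `extract_cf_clearance` and `user_agent` is the exact string the browser profile reported.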