Scrapy & Kameleo Integration for web scraping

These examples show how to gather data efficiently from websites protected by advanced anti-bot systems such as Cloudflare Turnstile.

A more detailed description of this project is available on our website.

1-a-showcase-the-speed-of-scrapy

I use Scrapy to gather data from the quotes.toscrape.com website.
The spider crawls all 10 pages of data in 2.6 seconds, which shows how fast plain HTTP scraping can be.

1-b-scrapy-cloudflare

I show that Scrapy receives an HTTP 403 error when I try to scrape data from a website that is protected by an anti-bot system.

2-a-compare-speed-with-headless-browser

In the second example, I use Playwright to scrape the same dataset from the quotes.toscrape.com website.
The headless browser needs to render each page, which makes the scraping slower. It takes about 6.4 seconds to gather the data.

2-b-playwright-cloudflare

I show that Playwright gets stuck in an "infinite captcha" loop when I try to scrape data from a website that is protected by an anti-bot system.

Headless browsers come in handy when you scrape data from JavaScript-heavy websites or want to interact more with the website. When data is protected by anti-bot systems, the best option is to use an anti-detect browser. Kameleo provides an undetectable web automation browser. It is not an open-source solution, but the platform provides unlimited fresh fingerprints and keeps its custom-built browsers (Chroma and Junglefox) constantly updated, so you stay ahead in the anti-bot game without tiresome maintenance overhead.

In the second part of the demo, we try to scrape data from the BurgerKing review page on indeed.com.

If you try to open the page, you will see the Cloudflare Turnstile challenge.

Most headless Chrome browsers fail to bypass this protection layer.

According to Pierluigi Vinciguerra from The Web Scraping Club, it is very hard to rely on open-source solutions such as Playwright and Cloudscraper. I couldn’t make it work with almost any open-source tools like Puppeteer Stealth or Playwright Stealth. When I found a working solution like Botasaurus, it stopped working later when I tried to deploy my code, and Cloudflare blocked my scraper bot.

Kameleo is an anti-detect browser specialized for web scraping. We are constantly testing our custom-built browsers (Chroma and Junglefox) against anti-bot systems. Updates are quickly deployed to ensure you don't need to maintain your code to keep a high success rate.

3-bypass-cloudflare-turnstile-with-Kameleo

Kameleo launches its undetected Chrome-based browser (called Chroma) with a fresh browser fingerprint.
It bypasses the Cloudflare Turnstile and loads the BurgerKing review page on indeed.com.
We export the cf_clearance cookie, which is our "pass-through ticket" for future Cloudflare verifications.
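The export step can be sketched as follows. The Kameleo-specific calls are deliberately left out. The sketch assumes the browser's cookies are already available as a list of dictionaries with `name` and `value` keys (the shape that Playwright's `context.cookies()` returns). The function names and the JSON layout are hypothetical.

```python
import json


def extract_cf_clearance(cookies):
    """Return the value of the cf_clearance cookie, or None if it is absent."""
    for cookie in cookies:
        if cookie.get("name") == "cf_clearance":
            return cookie.get("value")
    return None


def save_clearance(path, cookies, user_agent):
    """Store the cookie together with the User-Agent of the browser that earned it."""
    value = extract_cf_clearance(cookies)
    if value is None:
        raise ValueError("cf_clearance cookie not found; was the challenge passed?")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"cf_clearance": value, "user_agent": user_agent}, fh, indent=2)
    return value
```

Saving the User-Agent next to the cookie keeps the two values together for later requests.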

4-add-cf_clearance-cookie-to-scrapy

On its own, Scrapy cannot scrape the data from the BurgerKing review page because Cloudflare's protection answers with a 403 Forbidden error.
So I add the cf_clearance cookie to the request.
Note that I also need to set the same user-agent for Scrapy that I used with Kameleo when I obtained the cf_clearance cookie.
This lets me scrape effectively behind Cloudflare's protection layer.

kameleo-io/web-scraping-scrapy-kameleo-integration
