- Common Crawl: an open repository of web crawl data that can be accessed and analyzed by anyone.
- link: https://github.com/BruceDone/awesome-crawler
- author: Bruce Tang
- note: a collection of awesome web crawlers and spiders in different languages.
- link: https://github.com/shengqiangzhang/examples-of-web-crawlers
- author: Shengqiang Zhang
- note: some interesting, beginner-friendly examples of Python crawlers.
- link: https://github.com/xianyucoder/Crack-JS
- blog: http://xianyucoder.cn/
- author: huangjin
- note: advanced, hands-on Python 3 crawler projects.
- link: https://github.com/striver-ing/wechat-spider
- author: striver-ing
- note: an open-source WeChat spider that crawls all articles of an official account, along with their view counts, like counts, and comments.
- link: https://github.com/crawlab-team/crawlab
- author: Crawlab Team
- note: a distributed crawler management platform that supports any language and framework.
- link: https://github.com/PhosphorylatedRabbits/paperscraper
- author: PhosphorylatedRabbits
- note: tools to scrape publication metadata from PubMed, arXiv, medRxiv, and chemRxiv.
- link: https://github.com/liuyixin-louis/arxiv2latex
- author: Yixin Liu
- note: download the LaTeX source of multiple arXiv papers with one click.
- link: https://github.com/lixi5338619/magical_spider
- author: Li Xi (李玺)
- note: a magical spider 🕷: a scraping solution applicable to almost any web site.
- link: https://github.com/codelucas/newspaper
- author: Lucas Ou-Yang
- note: news, full-text, and article metadata extraction in Python 3.