
Add robots.txt.sample #1024


Merged · 4 commits merged into OpenMage:1.9.4.x on Jun 6, 2022

Conversation

@simbus82 (Contributor) commented on Jun 4, 2020

Old Magento has never provided a robots.txt to guide search engines through the many URLs a Magento installation can generate.

This robots.txt was born from some years of SEO experience and from plenty of reading in the best-known blogs on Magento (and on SEO in general).

The goal is to ensure that the search engine does not spend its entire crawl budget on pages that make no sense to index.
In the same way, it keeps numerous CMS assets from being crawled and indexed.

Search engines are quite smart, so a core CMS asset that has been scanned is unlikely to be shown in the SERP anyway.

But even better, we prevent the search engine from wasting time on files or URLs that would be rejected in the end regardless.

E-commerce sites with a large number of products and complex layered navigation should benefit: their pages get indexed sooner, and the search engine can crawl the site more often and more easily.

Bonus: robots.txt is the ideal place to declare the site's sitemap path.
Almost all search engines follow the directives in robots.txt, so pointing to sitemap.xml is very important for letting them crawl our most important pages first.
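
For example, the sitemap reference is a single directive (just a sketch, not the exact content of the sample file in this PR; example.com is a placeholder for the shop's real base URL):

## Tell crawlers where the sitemap lives so the most important pages are discovered first
Sitemap: https://www.example.com/sitemap.xml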

PS: opening this PR as a draft, because I don't know whether I need to document the presence of the two new files anywhere else...

@Flyingmana (Contributor)

I would prefer to have only one file, and preferably the sample one, to avoid accidentally overwriting a robots.txt that people have modified.

Disallowing the URLs with params means external links to these pages can no longer be crawled, so the canonical tag cannot be read either and the link cannot be attributed to the appropriate main page.
I believe it's better to handle these params in the search engine's configuration instead of disallowing them.
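
For reference, this is the kind of canonical tag that becomes unreachable once the parameterized URL is disallowed (illustrative markup only; the domain and category path are placeholders):

<!-- Rendered on e.g. /shoes?dir=desc&order=price to point crawlers back to the unfiltered page -->
<link rel="canonical" href="https://www.example.com/shoes" />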

While I see the possible benefits for bigger shops, we should focus the defaults on smaller shops, which could be hurt a lot by these defaults.

Delete robots.txt to avoid overwriting a pre-existing robots.txt.
robots.txt.sample can be renamed or used as a best practice.
@colinmollenhour (Member)

A friendly crawler should already never hit most of these URLs, because by definition a crawler only crawls things that are linked to, and many of the URLs that should never be visited are never linked from anywhere in the first place. So for malicious crawlers you are just providing a list of URLs that they might want to crawl. I've not investigated the real-world impact of designing a robots.txt this way, but I would venture to guess there would be an increase in traffic to these URLs, not a decrease.

However, some of these rules make perfect sense, such as disallowing crawling of links to alternative sort directions, filters, etc. So I think I'd be in favor if it were trimmed down to just the URLs that might actually appear as links somewhere.
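
For example, a trimmed-down version might keep only rules like these (a rough sketch of the idea, not a concrete proposal; the exact selection of rules is illustrative):

User-agent: *
## Sort/filter/pagination links that actually appear in category page markup
Disallow: /*?dir=
Disallow: /*&dir=
Disallow: /*?order=
Disallow: /*&order=
Disallow: /*?limit=
Disallow: /*&limit=
## Action links that appear on product pages
Disallow: /wishlist/
Disallow: /catalog/product_compare/
Disallow: /review/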

@addison74 (Contributor)

This is the robots.txt file content I am using in production on all my stores. Maybe you can take some ideas from it for this PR.

## Disable robots.txt rules for some crawlers
User-agent: AhrefsBot
User-agent: Alexibot
User-agent: AppEngine
User-agent: Aqua_Products
User-agent: archive.org_bot
User-agent: archive
User-agent: asterias
User-agent: b2w/0.1
User-agent: BackDoorBot/1.0
User-agent: BecomeBot
User-agent: BlekkoBot
User-agent: Blexbot
User-agent: BlowFish/1.0
User-agent: Bookmark search tool
User-agent: BotALot
User-agent: BuiltBotTough
User-agent: Bullseye/1.0
User-agent: BunnySlippers
User-agent: CCBot
User-agent: CheeseBot
User-agent: CherryPicker
User-agent: CherryPickerElite/1.0
User-agent: CherryPickerSE/1.0
User-agent: Copernic
User-agent: CopyRightCheck
User-agent: cosmos
User-agent: Crescent
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
User-agent: DittoSpyder
User-agent: dotbot
User-agent: dumbot
User-agent: EmailCollector
User-agent: EmailSiphon
User-agent: EmailWolf
User-agent: Enterprise_Search
User-agent: Enterprise_Search/1.0
User-agent: EroCrawler
User-agent: es
User-agent: exabot
User-agent: ExtractorPro
User-agent: FairAd Client
User-agent: Flaming AttackBot
User-agent: Foobot
User-agent: Gaisbot
User-agent: GetRight/4.2
User-agent: gigabot
User-agent: grub
User-agent: grub-client
User-agent: Go-http-client
User-agent: Harvest/1.5
User-agent: Hatena Antenna
User-agent: hloader
User-agent: http://www.SearchEngineWorld.com bot
User-agent: http://www.WebmasterWorld.com bot
User-agent: httplib
User-agent: humanlinks
User-agent: ia_archiver
User-agent: ia_archiver/1.6
User-agent: InfoNaviRobot
User-agent: Iron33/1.0.2
User-agent: JamesBOT
User-agent: JennyBot
User-agent: Jetbot
User-agent: Jetbot/1.0
User-agent: Kenjin Spider
User-agent: Keyword Density/0.9
User-agent: larbin
User-agent: LexiBot
User-agent: libWeb/clsHTTP
User-agent: LinkextractorPro
User-agent: LinkpadBot
User-agent: LinkScan/8.1a Unix
User-agent: LinkWalker
User-agent: LNSpiderguy
User-agent: looksmart
User-agent: lwp-trivial
User-agent: lwp-trivial/1.34
User-agent: Mata Hari
User-agent: Megalodon
User-agent: Microsoft URL Control
User-agent: Microsoft URL Control - 5.01.4511
User-agent: Microsoft URL Control - 6.00.8169
User-agent: MIIxpc
User-agent: MIIxpc/4.2
User-agent: Mister PiX
User-agent: MJ12bot
User-agent: moget
User-agent: moget/2.1
User-agent: MSIECrawler
User-agent: naver
User-agent: NerdyBot
User-agent: NetAnts
User-agent: NetMechanic
User-agent: NICErsPRO
User-agent: Nutch
User-agent: Offline Explorer
User-agent: Openbot
User-agent: Openfind
User-agent: Openfind data gathere
User-agent: Oracle Ultra Search
User-agent: PerMan
User-agent: ProPowerBot/2.14
User-agent: ProWebWalker
User-agent: psbot
User-agent: Python-urllib
User-agent: QueryN Metasearch
User-agent: Radiation Retriever 1.1
User-agent: RepoMonkey
User-agent: RepoMonkey Bait & Tackle/v1.01
User-agent: RMA
User-agent: rogerbot
User-agent: scooter
User-agent: Screaming Frog SEO Spider
User-agent: searchpreview
User-agent: SEMrushBot
User-agent: SemrushBot
User-agent: SemrushBot-SA
User-agent: SEOkicks-Robot
User-agent: SiteSnagger
User-agent: sootle
User-agent: SpankBot
User-agent: spanner
User-agent: spbot
User-agent: Stanford
User-agent: Stanford Comp Sci
User-agent: Stanford CompClub
User-agent: Stanford CompSciClub
User-agent: Stanford Spiderboys
User-agent: SurveyBot
User-agent: SurveyBot_IgnoreIP
User-agent: suzuran
User-agent: Szukacz/1.4
User-agent: Szukacz/1.4
User-agent: Teleport
User-agent: TeleportPro
User-agent: Telesoft
User-agent: Teoma
User-agent: The Intraformant
User-agent: TheNomad
User-agent: toCrawl/UrlDispatcher
User-agent: True_Robot
User-agent: True_Robot/1.0
User-agent: turingos
User-agent: Typhoeus
User-agent: URL Control
User-agent: URL_Spider_Pro
User-agent: URLy Warning
User-agent: VCI
User-agent: VCI WebViewer VCI WebViewer Win32
User-agent: Web Image Collector
User-agent: WebAuto
User-agent: WebBandit
User-agent: WebBandit/3.50
User-agent: WebCopier
User-agent: WebEnhancer
User-agent: WebmasterWorld Extractor
User-agent: WebmasterWorldForumBot
User-agent: WebSauger
User-agent: Website Quester
User-agent: Webster Pro
User-agent: WebStripper
User-agent: WebVac
User-agent: WebZip
User-agent: WebZip/4.0
User-agent: Wget
User-agent: Wget/1.5.3
User-agent: Wget/1.6
User-agent: WWW-Collector-E
User-agent: Xenu's
User-agent: Xenu's Link Sleuth 1.1c
User-agent: Zeus
User-agent: Zeus 32297 Webster Pro V2.9 Win32
User-agent: Zeus Link Scout
Disallow: /

## Enable next rules for all crawlers
User-agent: *
#Allow: /

## GENERAL SETTINGS
Sitemap: https://www.[YOURDOMAIN].[TLD]/sitemap.xml

## Crawl-delay parameter: number of seconds to wait between successive requests to the same server.
## Set a custom crawl rate if you're experiencing traffic problems with your server.
Crawl-delay: 10

## SEO IMPROVEMENTS

## Do not crawl sub category pages that are sorted or filtered.
Disallow: /*?*
#Disallow: /?q=
#Disallow: /*?p=
#Disallow: /*&p=
#Disallow: /*?dir=
#Disallow: /*&dir=
#Disallow: /*?limit=
#Disallow: /*&limit=
#Disallow: /*?order=
#Disallow: /*&order=

## Do not crawl product page links
Disallow: /wishlist/
Disallow: /product_compare/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /productalert/
Disallow: /tag/
 
## Do not crawl links with session IDs
Disallow: /*?SID=
 
## Do not crawl checkout and user account pages
Disallow: /checkout/
Disallow: /customer/
Disallow: /customer/account/
Disallow: /customer/account/login/
 
## Do not crawl search pages and non-SEO-optimized catalog links
Disallow: /catalogsearch/
Disallow: /searchautocomplete/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/

## IMAGE CRAWLERS SETTINGS
 
## Extra: Uncomment if you do not wish Google and Bing to index your images
# User-agent: Googlebot-Image
# Disallow: /
# User-agent: msnbot-media
# Disallow: /

@addison74 (Contributor)

I agree, @colinmollenhour. There are ethical crawlers that, once they find that a robots.txt file exists, follow its rules. But there are unethical crawlers that either ignore the robots.txt content and traverse all the links inside a website, or do the exact opposite of the disallow rules. I consider it more important to identify the IP the crawler runs from and block it than to rely on this file. I am using Fail2Ban to match User-Agents from a list and ban those IPs.
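
For reference, a minimal Fail2Ban setup for this approach could look like the following (a rough sketch under a few assumptions: Apache combined log format, Debian-style paths, and a placeholder filter name and bot list; adjust to your own setup):

# /etc/fail2ban/filter.d/apache-badbots-custom.conf (placeholder name)
[Definition]
# Match requests whose User-Agent (the last quoted field of the combined log format)
# contains one of the listed bot names
failregex = ^<HOST> .* "[^"]*(AhrefsBot|SemrushBot|MJ12bot|Blexbot)[^"]*"$
ignoreregex =

# /etc/fail2ban/jail.local (appended jail)
[apache-badbots-custom]
enabled  = true
port     = http,https
filter   = apache-badbots-custom
logpath  = /var/log/apache2/access.log
maxretry = 1
bantime  = 86400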

@fballiano changed the title from "[SEO] Add a robots.txt to OpenMage" to "Add robots.txt.sample" on May 29, 2022

@fballiano left a comment


I think this PR is a good robots.txt.sample and it could be added to our repo.

@fballiano marked this pull request as ready for review on May 29, 2022 09:22
@fballiano merged commit a75b06b into OpenMage:1.9.4.x on Jun 6, 2022
@github-actions bot commented on Jun 6, 2022

Unit Test Results

1 file, 1 suite, 0s ⏱️
0 tests: 0 ✔️, 0 💤, 0 ❌
7 runs: 5 ✔️, 2 💤, 0 ❌

Results for commit a75b06b.

@simbus82 (Contributor, Author) commented on Jun 6, 2022

This is the robots.txt file content I am using in production on all stores. Maybe you can take some ideas for this PR.

## Disable robots.txt rules for some crawlers
User-agent: AhrefsBot
User-agent: Alexibot
User-agent: AppEngine
User-agent: Aqua_Products
User-agent: archive.org_bot
...

Uhm... these rules are often useless: spambots don't follow robots.txt rules.
It's preferable to block spambot access via .htaccess rules (or directly in the Apache conf), so the bots can't eat server resources.

Indicative example:

<IfModule mod_setenvif.c>
   SetEnvIfNoCase User-Agent (archive.org|binlar|casper|checkpriv|choppy|clshttp|cmsworld|diavol|dotbot|extract|feedfinder|flicky|g00g1e|harvest|heritrix|httrack|kmccrew|loader|miner|nikto|nutch|planetwork|postrank|purebot|pycurl|python|seekerspider|siclab|skygrid|sqlmap|sucker|turnit|vikspider|winhttp|xxxyy|youda|zmeu|zune) bad_bot

   <IfModule mod_authz_core.c>
      <RequireAll>
         Require all granted
         Require not env bad_bot
      </RequireAll>
   </IfModule>
</IfModule>
