Add robots.txt.sample #1024
Conversation
I would prefer to ship only one file, and preferably the sample one, to avoid accidentally overwriting a robots.txt that people have already modified. Disallowing URLs with parameters means external links to these pages cannot be crawled, so the canonical tag on them cannot be read and attributed to the appropriate main page. While I see the possible benefits for bigger shops, we should focus the defaults on smaller shops, which could be hurt a lot by these defaults.
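For reference, the kind of broad parameter block being discussed looks roughly like the sketch below (illustrative only, not necessarily the rule proposed in this PR). A compliant crawler never fetches a disallowed URL, so it also never sees the canonical tag inside it:

```
# Illustrative only: block every URL that carries a query string.
# A crawler obeying this never fetches such pages, so it cannot
# read their rel="canonical" tag either.
User-agent: *
Disallow: /*?
```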
Delete robots.txt to avoid overwriting a previously existing robots.txt. robots.txt.sample can be renamed or used as a best practice.
A friendly crawler should already never hit most of these URLs, because by definition a crawler only crawls things that are linked to, and many of these URLs that should never be visited are already never linked from anywhere. For malicious crawlers you are just providing a list of URLs they might want to crawl. I have not investigated the real-world impact of designing a robots.txt this way, but I would venture to guess there would be an increase in traffic to these URLs, not a decrease. However, some of these rules make perfect sense, such as disallowing links to alternative sort directions, filters, etc. So I would be in favor if it were trimmed down to just the URLs that might actually appear as links somewhere.
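A trimmed-down version along those lines might look like the sketch below. The parameter names are the typical Magento 1 toolbar, layered-navigation, and session parameters and are only an assumption; the exact set depends on the theme and installed modules:

```
User-agent: *
# Only block toolbar/sort/filter variations that actually get linked
# from category pages; leave plain product and category URLs crawlable.
Disallow: /*?dir=
Disallow: /*?order=
Disallow: /*?mode=
Disallow: /*?limit=
Disallow: /*SID=
```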
As suggested by Colin
This is the robots.txt file content I am using in production on all stores. Maybe you can take some ideas from it for this PR.
I agree @colinmollenhour. There are ethical crawlers that, once they find that a robots.txt file exists, follow its rules. But there are unethical crawlers that either ignore the robots.txt content and traverse all the links inside a website, or do the exact opposite of the disallow rules. I consider it more important to identify the IP the crawler is running from and block it than to rely on this file. I am using Fail2Ban to identify User-Agents from a list and ban those IPs.
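As a rough illustration of that Fail2Ban approach (a sketch only: the filter name, bot list, log path, and ban time are assumptions and depend on the web server and its log format):

```
# /etc/fail2ban/filter.d/nginx-badbots.conf  (hypothetical filter name)
# Matches access-log lines whose User-Agent field contains one of the
# listed bot names; <HOST> captures the IP to ban.
[Definition]
badbots = BadBot|EvilScraper|SiteSnagger
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*".*"[^"]*(?:%(badbots)s)[^"]*"$
ignoreregex =

# /etc/fail2ban/jail.local  (values are examples, not recommendations)
[nginx-badbots]
enabled  = true
filter   = nginx-badbots
port     = http,https
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400
```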
I think this PR provides a good robots.txt.sample, and it could be added to our repo.
Unit Test Results: 1 files, 1 suites, 0s ⏱️. Results for commit a75b06b.
Uhm... these rules are often useless; spambots don't follow robots.txt rules. Indicative example:
The old Magento never provided a robots.txt to guide search engines through the many URLs that a Magento installation can generate.
This robots.txt grew out of several years of SEO experience and from reading the best-known blogs on Magento (and on SEO).
The goal is to ensure that the search engine does not spend its whole crawl budget on pages that make no sense to index.
It likewise prevents numerous CMS assets from being crawled and indexed.
Search engines are quite smart, so it is unlikely that a core CMS asset, once scanned, will actually be shown in the SERP.
But even better, we prevent the search engine from wasting time on files or URLs that would be rejected anyway.
E-commerce sites with a large number of products and complex layered navigation should benefit, reducing the time to index their pages and allowing the search engine to crawl the site more often and more easily.
Bonus: robots.txt is the ideal place to declare the site's sitemap path.
Almost all search engines honor the directives in robots.txt, so pointing to sitemap.xml there is very important for letting search engines crawl all our most important pages first.
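For illustration, the sitemap declaration is a single line in robots.txt. The URL below is just a placeholder; the real path depends on how and where the store generates its sitemap:

```
# Hypothetical example: point crawlers at the XML sitemap.
Sitemap: https://www.example.com/sitemap.xml
```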
PS: PR is a draft because I don't know whether I need to document the presence of the two new files anywhere else...