How Robots.txt Works
2024-12-04 · en-j3PyPqV-e1s manual
MARTIN SPLITT: Hello, and welcome to another Search Central Lightning Talk. This time, we will talk about robots.txt files: when to use them, how to use them, and how you can test them with Google Search Console. [UPBEAT MUSIC]

When you have a website, you probably want it indexed in Google Search so people can find it and all of its pages when searching for something online. But sometimes, there might be some things you don't want to see in Google Search, or you might not want Googlebot to spend time on them. Then the robots meta tag or robots.txt file might be what you're looking for.

Let's start with the robots meta tag. It's an HTML meta element that you add to your page's head. Its name is robots, and it can take a number of different values as its content. Let's keep this one simple for now. We can just set it to noindex to keep this page out of the Google Search index. We can also be more granular and keep specific bots from indexing this page. Say we are OK with Googlebot indexing the page for search, but we don't want Googlebot for Google News to index the page. Then we can specify this in the name. Instead of robots, we will call it googlebot-news. We can also specify multiple things in one tag, such as nosnippet and notranslate: we don't want snippets for this page, and we also do not want translations in the search result.

Alternatively, we can also use an HTTP header instead of a meta tag. In this case, the header would be called X-Robots-Tag, and it can contain the exact same values as the robots meta tag. For more information on this, check out the link to our robots meta tag documentation below.

All right. Now we've discussed how to keep a page out of the index. But sometimes, you want to do something slightly different. You want to tell Googlebot not to even retrieve a specific page. That can be done with what's called a robots.txt file. It lives on the root path of your domain, so, let's say, example.com/robots.txt. It can't be in another directory.
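To illustrate the root-path rule, here is a minimal Python sketch that works out where a crawler would look for the robots.txt file that governs a given page. The function name robots_txt_url and the sample URLs are made up for this illustration; the sketch only uses Python's standard urllib.parse module and is not how Googlebot itself is implemented.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url.

    The file is always looked up at the root path of the host,
    no matter how deep the page sits in the directory tree.
    """
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URLs, used only for illustration.
for url in (
    "https://example.com/",
    "https://example.com/products/widget?id=1",
):
    print(url, "is governed by", robots_txt_url(url))
```

Both sample pages map to the same file at the root of example.com, because the directory part of the page URL plays no role in the lookup.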
So example.com/products/robots.txt wouldn't work, for instance. However, if you use subdomains like shop.example.com, then shop.example.com/robots.txt is fine.

These files are relatively simple. They contain text in a specific format that many bots on the internet, like Googlebot, understand. And by the way, if you use a website builder or content management system, there likely is a plugin, a setting, or some other way to manage the robots.txt file content.

Here is an example. This file disallows every URL that starts with /no-touchy on this domain from being accessed by any bot that follows the robots.txt standard. This is called a rule. Rules can allow or disallow URLs or patterns of URLs for bots. Note that not all bots on the internet will follow this, but Googlebot and most other search engines will do so. You can also address a specific bot by its user agent name and give it specific instructions. Say, for example, you would like to allow a bot called SteveBot to access the directory we've excluded from the other bots earlier. You can also use the asterisk character as a wildcard to make your rules a bit simpler. In addition, you can use robots.txt to point the bots to your sitemap if you use the sitemap directive. If you want to learn more about robots.txt, check out the links to our documentation and the robots.txt standard documentation as well.

I would like to point something out while we are here. Sometimes, people use both robots.txt and robots meta tags or headers to stop a page from showing up in Google Search results, but then wonder why that doesn't work well. The problem here is that in order to see the robots meta tag or header, Googlebot would have to retrieve and access the page first. But it cannot do that if you prevent Googlebot from doing it in robots.txt.
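The ordering problem can be sketched in a few lines of Python. This is a simplified model of a well-behaved crawler, not Googlebot's actual logic: the crawl_outcome function, the ExampleBot user agent, and the sample page are all assumptions made for this illustration. It uses the standard library's urllib.robotparser and html.parser modules.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def crawl_outcome(robots_txt, path, page_html, user_agent="ExampleBot"):
    """Return 'blocked', 'noindex', or 'indexable' for one hypothetical page."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, path):
        # The page is never fetched, so a noindex inside it stays unseen.
        return "blocked"
    meta = RobotsMetaParser()
    meta.feed(page_html)
    return "noindex" if "noindex" in meta.directives else "indexable"


# Hypothetical page that asks not to be indexed.
PAGE = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
BLOCKING_RULES = "User-agent: *\nDisallow: /no-touchy\n"

print(crawl_outcome(BLOCKING_RULES, "/no-touchy/page.html", PAGE))  # robots.txt wins
print(crawl_outcome("", "/no-touchy/page.html", PAGE))  # noindex can be read
```

With the disallow rule in place, the sketch stops before the page is ever read, so the noindex directive has no chance to take effect. With an empty robots.txt, the page is fetched and the noindex is honored.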
The issue then is that Googlebot might find a link to that page somewhere and try to crawl it, but find out it is not allowed to. It then knows the page exists, but it doesn't see what's on it, and that includes the robots tag. So it might actually put the page in the index, albeit with only the limited information there is for this page, due to it being blocked in robots.txt. So to stop a page from getting into the index, use the robots meta tag or the X-Robots-Tag header, but do not disallow it in robots.txt.

If you want to see how your robots.txt influences Google Search, you can check out the robots.txt report in Google Search Console, or use the open source robots.txt tester I've linked below in the description.

I do hope this video helped you get a better idea of the different mechanisms that influence how robots interact with a page and when to use which one. Please do leave us a comment below if you would like to learn more about technical topics, and leave a like or subscribe to our channel if you want more content around Google Search from us. Anyway, thank you so much for watching, and bye-bye. [UPBEAT MUSIC]
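For a quick local check of the kind of rules described in this talk, a rough sketch with Python's standard urllib.robotparser module is shown below. The robots.txt content is an assumed reconstruction of the example rules (the /no-touchy rule, the SteveBot exception, and a sitemap line with a made-up URL), not the exact file from the video. Python's parser is also not Google's parser: it matches plain path prefixes in file order and does not expand wildcards inside paths, so the Search Console report or Google's open source tester remains the reference for how Googlebot reads a file.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content, modeled on the rules described in the talk.
ROBOTS_TXT = """\
User-agent: *
Disallow: /no-touchy

User-agent: SteveBot
Allow: /no-touchy

Sitemap: https://example.com/sitemap.xml
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask the same question a polite crawler would ask before fetching a URL.
for agent, path in (
    ("Googlebot", "/no-touchy/page.html"),
    ("SteveBot", "/no-touchy/page.html"),
    ("Googlebot", "/products/"),
):
    allowed = parser.can_fetch(agent, "https://example.com" + path)
    print(f"{agent:10} {path:22} allowed={allowed}")

# site_maps() needs Python 3.8 or newer; it returns None if no sitemap is listed.
print("Sitemaps:", parser.site_maps())
```

Under these assumptions, Googlebot is refused /no-touchy/page.html while SteveBot is allowed to fetch it, and URLs outside /no-touchy stay open to everyone.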