How Robots.txt Works

2024-12-04 · en-j3PyPqV-e1s manual
MARTIN SPLITT: Hello, and welcome to another Search
Central Lightning Talk.
This time, we will talk about robots.txt files--
when to use them, how to use them,
and how you can test them with Google Search Console.
[UPBEAT MUSIC]
When you have a website, you probably
want it indexed in Google Search so people
can find it and all your pages when
searching for something online.
But sometimes, there might be some things
you don't want to see in Google Search,
or you might not want Googlebot to spend time on them.
Then the robots meta tag or robots.txt file
might be what you're looking for.
Let's start with the robots meta tag.
It's an HTML meta element that you add to your site's head.
Its name is robots, and it can take a bunch of different things
as its content.
Let's keep this one simple for now.
We can just set it to noindex to keep this page out
of the Google Search index.
We can also be more granular and keep specific bots
from indexing this page.
Say we are OK with Googlebot indexing the page for search,
but we don't want Googlebot for Google News to index the page.
Then we can specify this in the name.
Instead of robots, we will call it googlebot-news.
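To make the two tags described here concrete, the following Python sketch writes them out as strings and reads them back with the standard library's html.parser. The helper names (RobotsMetaCollector, robots_directives) and the sample pages are assumptions made for this illustration, not part of any Google tooling.

```python
from html.parser import HTMLParser


class RobotsMetaCollector(HTMLParser):
    """Collects robots-style meta tags as {name: [directives]}."""

    def __init__(self, bot_names):
        super().__init__()
        self.bot_names = {name.lower() for name in bot_names}
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in self.bot_names:
            content = attributes.get("content") or ""
            values = [v.strip().lower() for v in content.split(",") if v.strip()]
            self.directives.setdefault(name, []).extend(values)


def robots_directives(html, bot_names=("robots", "googlebot", "googlebot-news")):
    """Return the robots directives found in the given HTML (hypothetical helper)."""
    collector = RobotsMetaCollector(bot_names)
    collector.feed(html)
    return collector.directives


# Applies to every bot that honors the robots meta tag.
PAGE_FOR_ALL_BOTS = (
    '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
)

# Applies only to Googlebot for Google News; regular Search is unaffected.
PAGE_FOR_NEWS_ONLY = (
    '<html><head><meta name="googlebot-news" content="noindex"></head><body></body></html>'
)

print(robots_directives(PAGE_FOR_ALL_BOTS))
print(robots_directives(PAGE_FOR_NEWS_ONLY))
```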
We can also specify multiple things in one tag, like here.
We don't want snippets for this page,
and we also do not want translations
in the search result. Alternatively,
we can also use an HTTP header instead of a meta tag.
In this case, the header would be called X-Robots-Tag,
and it can contain the exact same values as the robots meta
tag.
For more information on this, check out
the link to our robots meta tag documentation below.
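As a rough sketch of the multi-value case, the Python below shows the same two directives (nosnippet and notranslate) once as a meta tag and once as an X-Robots-Tag response header, and splits the header value into its parts. The function names are hypothetical, and the parsing is deliberately simplified: it assumes a plain comma-separated list and ignores any other forms the header can take.

```python
# The same directives, expressed as a meta tag and as an HTTP response header.
META_TAG = '<meta name="robots" content="nosnippet, notranslate">'
HEADER_LINE = "X-Robots-Tag: nosnippet, notranslate"


def parse_robots_directives(value: str) -> set[str]:
    """Split a comma-separated directive list into a set of lowercase names."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def header_directives(header_line: str) -> set[str]:
    """Return the directives of an X-Robots-Tag header line, or an empty set."""
    name, _, value = header_line.partition(":")
    if name.strip().lower() != "x-robots-tag":
        return set()
    return parse_robots_directives(value)


print(sorted(header_directives(HEADER_LINE)))
```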
All right.
Now we've discussed how to keep a page out of the index.
But sometimes, you want to do something slightly different.
You want to tell Googlebot not to even retrieve
a specific page.
That can be done with what's called a robots.txt file.
It lives on the root path of your domain.
So let's say, example.com/robots.txt.
It can't be in another directory.
So example.com/products/robots.txt
wouldn't work, for instance.
However, if you use subdomains like shop.example.com, then
shop.example.com/robots.txt is fine.
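The location rule can be sketched in a few lines of Python: whatever page you start from, the robots.txt that applies sits at the root of that page's host. The function name robots_txt_url is an assumption for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any subdomain); replace the path,
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/products/item?id=1"))
print(robots_txt_url("https://shop.example.com/cart"))
```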
These files are relatively simple.
They contain text in a specific format
that many bots on the internet, like Googlebot, understand.
And by the way, if you use a website builder or content
management system, there likely is a plugin setting or some way
to manage the robots.txt file content.
Here is an example.
This file disallows every URL that starts with /no-touchy
on this domain from being accessed by any bot that follows
the robots.txt standard.
This is called a rule.
Rules can allow or disallow URLs or patterns of URLs for bots.
Note that not all bots on the internet will follow this,
but Googlebot and most other search engines will do so.
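A file with the rule described here might look like the string in the sketch below. The check uses Python's standard urllib.robotparser, which is only one of many parsers that follow the standard and may differ from Googlebot in details; the URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# One rule: no bot may access URLs whose path starts with /no-touchy.
ROBOTS_TXT = """\
User-agent: *
Disallow: /no-touchy
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked = parser.can_fetch("Googlebot", "https://example.com/no-touchy/secret.html")
allowed = parser.can_fetch("Googlebot", "https://example.com/about.html")
print(blocked, allowed)
```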
You can also specify a specific bot by its user agent name
and give it specific instructions.
Say for example, you would like to allow a bot called SteveBot
to access the directory we've excluded
from the other bots earlier.
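One way such a file could be written is with a separate group for SteveBot ahead of the catch-all group. The sketch below checks it with Python's standard urllib.robotparser; the exact file contents are an assumption, since the transcript does not capture what was shown on screen.

```python
from urllib.robotparser import RobotFileParser

# SteveBot gets its own group; every other bot falls back to the * group.
ROBOTS_TXT = """\
User-agent: SteveBot
Allow: /no-touchy

User-agent: *
Disallow: /no-touchy
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/no-touchy/page.html"
steve_allowed = parser.can_fetch("SteveBot", url)
others_allowed = parser.can_fetch("OtherBot", url)
print(steve_allowed, others_allowed)
```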
You can also use the asterisk character
as a wildcard to make your rules a bit simpler.
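To show what the wildcard does, here is a small hand-written matcher in Python. It is a sketch under the assumption that a rule matches from the start of the path and that each asterisk stands for any run of characters; rule_matches is a hypothetical helper, not the matching code of any real crawler.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches the start of the path."""
    # Escape the literal pieces and let each * match any sequence of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # re.match anchors at the beginning, so the pattern works as a prefix rule.
    return re.match(regex, path) is not None


print(rule_matches("/*/drafts/", "/blog/drafts/post-1"))
print(rule_matches("/*/drafts/", "/blog/published/post-1"))
```

With a rule like this, a single line can cover a drafts directory under every section of the site instead of listing each section separately.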
In addition, you can use robots.txt to point
the bots to your sitemap if you use the sitemap directive.
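The sitemap directive is a single line with the full URL of the sitemap. As a sketch, Python's standard urllib.robotparser (version 3.8 or later) can read it back with site_maps(); the sitemap URL here is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /no-touchy

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file has none.
sitemaps = parser.site_maps()
print(sitemaps)
```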
If you want to learn more about robots.txt,
check out the links to our documentation and the robots.txt
standard documentation as well.
I would like to point something out while we are here.
Sometimes, people use both robots.txt and robots meta tags
or headers to stop a page from showing up
in Google Search results, but then
wonder why that doesn't work well.
The problem here is that in order
to see the robots meta tag or header,
Googlebot would have to retrieve and access the page first.
But it cannot do that if you prevent Googlebot from doing it
in robots.txt.
The issue then is that Googlebot might find the link to that page
somewhere and then try to crawl it,
but it finds out it is not allowed to crawl it,
and then it knows the page exists,
but it doesn't see what's on it.
And that includes the robots tag.
So it might actually put it in the index,
albeit with the limited information there is for this page,
due to it being blocked in robots.txt.
So to stop it from getting into the index,
use the robots meta tag or the X-Robots-Tag header,
but do not disallow it in robots.txt.
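The conflict can be checked mechanically: a noindex on a page only has a chance of being seen if robots.txt lets the crawler fetch that page. The Python sketch below expresses this with the standard urllib.robotparser; noindex_can_be_seen is a hypothetical helper, and Python's parser is only an approximation of how Googlebot evaluates rules.

```python
from urllib.robotparser import RobotFileParser


def noindex_can_be_seen(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the crawler may fetch the URL and so can read its noindex."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


BLOCKING = "User-agent: *\nDisallow: /private"
NOT_BLOCKING = "User-agent: *\nAllow: /"

url = "https://example.com/private/page.html"
# Blocked in robots.txt: the noindex on the page is never read.
print(noindex_can_be_seen(BLOCKING, url))
# Not blocked: the crawler can fetch the page and honor the noindex.
print(noindex_can_be_seen(NOT_BLOCKING, url))
```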
If you want to see how your robots.txt influences Google
Search, you can check out the robots.txt report in Google
Search Console, or use the open source robots.txt
tester I've linked below in the description.
I do hope this video helped you get a better
idea of the different mechanisms that
influence how robots interact with a page
and when to use which one.
Please do leave us a comment below
if you would like to learn more about technical topics,
and leave a like or subscribe to our channel
if you want more content around Google Search from us.
Anyway, thank you so much for watching, and bye-bye.
[UPBEAT MUSIC]