Analysing Robots.txt at scale with HTTP Archive and BigQuery

2026-04-23 · en-j3PyPqV-e1s manual

[MUSIC PLAYING]
GARY ILLYES: Hello, it is I, Gary, from Google Search.
And you're hearing me today because we have yet another
episode of "Search Off the Record,"
the podcast from the Google Search team discussing all
things digital and shedding some light on how Google Search
works, sometimes, or how the internet works sometimes.
Lots of things.
Let's see.
Am I alone?
I'm not alone.
Martin, hello.
MARTIN SPLITT: Hello, did you almost forget about me?
Am I that forgettable to you?
GARY ILLYES: Yes.
MARTIN SPLITT: Ouch!
Wow.
GARY ILLYES: Well, don't ask questions
if you're not prepared to be hurt by the answers.
MARTIN SPLITT: OK, fine.
Hello.
GARY ILLYES: Hello.
It's been a while since I've seen you, like less than 24 hours.
MARTIN SPLITT: Oh, true.
Yeah, we've seen each other yesterday.
That's true, yeah.
GARY ILLYES: Do you have anything exciting coming up?
MARTIN SPLITT: Search Central Lives are coming up.
We are visiting the world once again, yes.
GARY ILLYES: Where are you going first?
MARTIN SPLITT: My first one is going to be Brazil.
GARY ILLYES: Oh, please eat some acai for me.
MARTIN SPLITT: Oh, I will.
Oh, god, yes, thank you.
Oh, yes, so good.
GARY ILLYES: I very much appreciate that.
It's so good.
MARTIN SPLITT: But I guess you don't
want to talk about acai today.
GARY ILLYES: Oh, I could talk about acai.
I read so much about acai the past few years.
The first time I went to Brazil for a conference,
it was an external conference.
Someone introduced me to acai.
I never had that before.
And it was basically just from the hotel, walking one block.
I think it was with Pedro Dias, former Googler.
Now I think he's a developer or something,
or he has a company where they are developing stuff and also
some SEO.
Anyway, and he just took me to this very small corner shop.
And he was like, you have to try acai.
And I'm like, uh, no, leave me alone.
No, no.
And, 'no, no, no, you have to try it.'
And then I tried it, and then I kept eating acai.
And then I kept eating acai.
On one day, I had, like, five bowls
of acai because it was so good.
And I kept ordering via room service.
And at one point, the room service person was like, yeah,
we think you should stop.
And I'm like, no.
And the person was like, yeah, it's already, like,
7:00 PM or something.
You should really stop because you're not going to sleep.
And I'm like, what do you mean you are not going to sleep?
It's full of, not caffeine, but some other-- one
of those components that just make you not sleep great.
I haven't slept for two days.
But you're right, I'm not here to talk about acai today.
I had a saga, and you helped me a little bit with that saga.
And I thought that we could talk about that saga.
MARTIN SPLITT: Oh, the--
GARY ILLYES: You know what I'm talking about.
MARTIN SPLITT: I think the SQL stuff we did for Web Almanac?
GARY ILLYES: Well, yeah, but the project was bigger.
So to give some background, we received pull requests
on the official robots.txt repository
to add two new rules/directives to the unsupported tags list.
Basically, Search Console would report
that it recognizes these tags, but Google doesn't support them.
And the pull request was great.
It was a very good idea.
The person who sent it, well, the username is 3x10raisedto8.
I don't know who that is, but the pull request was great.
And it was a good idea.
But I don't know about other companies,
but at Google, we try to not do things arbitrarily, but rather
collect data and then say that, yes, this makes sense
based on the data.
And John Mueller, our manager, had this idea
that, how about we don't just add this one tag,
but look through the, let's say, top 10 or top 15 tags
and add the ones that we don't have in that list yet?
Because that would give us a decent starting
point, a decent baseline, and be able to say that,
OK, we are documenting the top 10 of these tags
that we don't support.
And fast forward two days, I'm struggling
finding a public repository of robots.txt files
that we could use to identify these tags.
And he suggests the HTTP Archive.
I have never used the HTTP Archive before,
other than looking at the reports that they do--
I think it's called the Almanac or something.
MARTIN SPLITT: Yeah, the Web Almanac.
GARY ILLYES: So I don't know how it works.
I don't know where the data lives.
I don't know anything about it; in fact, I still don't.
Do you?
MARTIN SPLITT: [CHUCKLES] Yeah, I do.
I used to contribute to the SEO chapter a few times.
GARY ILLYES: Really?
MARTIN SPLITT: Mhm.
GARY ILLYES: OK, want to teach me how it works?
MARTIN SPLITT: OK, yeah, sure.
So you know nothing about it, I'm assuming?
GARY ILLYES: I honestly have literally nothing.
We did some code for it the past few days,
but I don't know how it's used or why it's used.
I don't know the data sets.
I don't-- literally nothing.
MARTIN SPLITT: OK, so it has been running for the last,
I don't know how many years.
It's definitely been around since 2019,
it must have been, because I think I was involved in the 2019
edition.
I believe that it has been running before that, as well,
but I'm not sure about that.
But the idea is that you basically
look at a large number of websites, or web pages
more specifically, and look at how the web changes, or things
that you can learn from looking at large quantities of websites.
For instance, what language are they in?
Are they mobile friendly?
Are they using HTTPS?
Are they, I don't know, using canonicals,
these kind of things-- all sorts of stuff
that you can basically infer from looking at the source
code of a website or from things that you can infer
from the behavior of a website.
So to do so, it has to do a crawl.
So it has to basically know all these things.
And then it also has to, quote, unquote, "render" them.
So it has to do some sort of analysis on them
to get some additional data, for instance performance,
like how fast is this loading, how do the Core Web Vitals look
like, and so on and so forth.
You can't get that from just a crawl.
You have to actually run the website in a browser
to get this kind of data.
And these two things can be combined.
And then a bunch of people set out every year
to ask questions about the big data set of information
they have.
So that's two stages.
In stage one, they're like, hm, I wonder,
I don't know-- for instance, I wonder how many words per page
there are for all the websites that we will look at.
And then they write some script that
gets this data from either the crawls
or with words from a page.
You probably can get some of that information from a crawl.
But if it uses JavaScript, you would have to get that also
from the rendered version.
GARY ILLYES: So when you say a crawl, is that, what-- like,
who's crawling what?
MARTIN SPLITT: OK, so we start with a bunch of URLs
that we know exist.
And I believe that these URLs are coming from the Chrome UX
report, if I remember correctly.
GARY ILLYES: OK.
MARTIN SPLITT: So if you opt into it,
your Chrome browser sends data to an aggregate report
basically that says, like, hey, here
is what we've seen in terms of performance data,
for instance, from real users opening this website.
It doesn't say, like, Martin or Gary have seen these numbers,
but it basically aggregates it.
So on average, across all the people
who have visited this website until now or in the last year--
I'm actually not exactly sure how Chrome UX
report segments the data.
But basically, all these URLs, all these websites
have been visited by someone who sends the data into the Chrome
UX report.
And then you can query the Chrome UX report.
So it's a public data set of aggregated user experience
metrics for websites.
GARY ILLYES: Interesting.
MARTIN SPLITT: And in this set are millions of URLs.
I believe it's 16-point-something million.
That's a huge data set.
Historically, those have mostly been home pages.
So they kind of filtered it out to only get home-page data,
arguing like, oh, it's probably the more popular part
of every website, to go to the home page.
You go to ebay.com or to amazon.com or to google.com
rather than to google.com/howsearchworks.
That is a page on this website, but it's probably not one
that we have a lot of data on.
So historically it has been focusing on the home pages.
But in the recent couple of years-- and I'm not sure
when they started this, but at some point
they expanded to what they call secondary pages.
So you can say, oh, we are only interested in home pages,
or we are also interested in, how do home pages perform
or look like compared to, quote, unquote, "secondary" pages?
Because usually we have this kind of stuff in the Chrome UX
report.
And for some websites, they might also
be much more popular than the homepage, for instance.
And then the homepage gets a bit neglected.
And then whatever secondary page you have is more popular,
so you put more effort into it.
And then they basically run a crawl.
I'm not exactly sure how they're crawling it.
But they're basically doing a bigger run.
I think what they do is they run through WebPageTest.
I'm not sure-- are you familiar with WebPageTest?
GARY ILLYES: That's some service--
MARTIN SPLITT: Yes.
GARY ILLYES: OK.
MARTIN SPLITT: Yes, webpagetest.org, you
can go there.
You can type in a URL, and it runs your website
in an actual browser.
And I believe what they do is they do that.
They have their own instance.
They're probably paying for that,
I'm not sure, or have some sort of collaboration
with webpagetest.org.
And then they put these URLs that they
got from the list from URLs-- or the list of URLs
from Chrome UX report.
And they basically run these through a browser
instance on a server hosted by WebPageTest.
And that's what they do to crawl, I believe.
GARY ILLYES: Cool.
MARTIN SPLITT: But as you run it in a browser,
you get a bunch of information that you
don't get if you were basically using curl or wget
or whatever on the command line to just download the HTML.
For instance, you can tell what amount of CSS
has been actually used and how much is unused.
You can run a Lighthouse test on it.
And you can run some JavaScript that you can control,
and that's what we wrote.
Remember?
That's the JavaScript that we created.
GARY ILLYES: Oh, I remember.
MARTIN SPLITT: Yes.
GARY ILLYES: I remember.
MARTIN SPLITT: Yes, of course you
do because you love JavaScript so much.
GARY ILLYES: I love JavaScript.
So that was also weird to me because I
didn't realize that you can use JavaScript
for this kind of stuff.
But anyway, the way I discovered this whole thing works-- well,
not how it works, but how the data is stored--
I don't know if you can download it or not,
but there's also a BigQuery data set or data sets.
And then you can query--
write basically SQL queries to query those data sets.
MARTIN SPLITT: Yeah, that's the second step, yeah.
GARY ILLYES: Which can be very harsh on your wallet,
as I learned.
MARTIN SPLITT: [LAUGHS] That is true
because the data is relatively large, I guess.
GARY ILLYES: I literally remember
that Daniel Waisberg, our teammate, he
wrote a blog post about how to avoid large charges on BigQuery
[CHUCKLES] when you are digging into Search Console data.
And when I got the charge--
I ran one query, one large query.
And I got hundreds of dollars' worth of charge
for that one particular query.
And yes, it was running for quite a while.
But still, it's hundreds of dollars?
So yeah.
MARTIN SPLITT: Yay.
[LAUGHTER]
It happens.
And it's an open-source project, so yeah, I would just absorb it,
I guess.
But it was painful.
Anyway-- what?
GARY ILLYES: Reminds me of the, "what can a banana cost,
Michael? $10?"
MARTIN SPLITT: Oh, yeah, I love that.
It's so good.
It's a great show, as well, yeah.
GARY ILLYES: That was a coffee, no?
From "Mean Girls" or something?
Anyway--
MARTIN SPLITT: "Arrested Development."
It was--
GARY ILLYES: Ah, yeah, yeah!
MARTIN SPLITT: --"Arrested Development."
GARY ILLYES: Anyway, and we quickly
figured out that no one is actually
requesting robots.txt files.
So the data sets don't typically have robots.txt files in them,
which was also very painful because I already
paid hundreds of dollars for that one particular query.
[LAUGHS] It's great.
Great.
But don't-- stop laughing.
MARTIN SPLITT: I'm so sorry.
GARY ILLYES: You're not.
MARTIN SPLITT: No.
GARY ILLYES: And then more internal discussions.
And then we realized that, why don't we
just put this in the custom metrics data
set, which, again, is not something that I knew of.
If I knew of that thing, then probably
I wouldn't have run that initial query that cost me so much.
Do you know about the custom metrics?
MARTIN SPLITT: Yeah, so step number one
is kind of exactly what you then did,
the custom metrics bit, where we take the URLs,
and we run them through WebPageTest.
And as WebPageTest says, OK, this page is now done,
there's nothing happening anymore in this test browser,
you can run some JavaScript on whatever you got.
And I believe that there are some URLs in there that
are robots.txt URLs because I think, in the SEO chapter,
there is a robots.txt analysis.
I'm not exactly sure how you can filter for robots.txt
specifically from all the URLs that we have.
But basically, that's step one.
You gather these metrics.
And they are custom metrics because they are not,
by default, exposed.
For instance, if you run a Lighthouse test,
you have certain things like, I don't know--
I think Lighthouse tests for the Core Web Vitals.
So you can basically say, like, hey,
from this database that is created from all the things that
we run through WebPageTest, I want to see from each
of the pages the Lighthouse.CoreWebVitals.--
I don't know what else--
LargestContentfulPaint.
And then you get the numbers.
And then you can do things, like you can tell, hey,
so what's the average?
What's the maximum?
What's the minimum?
Blah, blah blah.
What's the 90th percentile?
You can do these kind of things.
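
The aggregation described here would normally run as a SQL query in BigQuery. Purely as an illustration of the arithmetic, a minimal JavaScript sketch follows. The LCP values are invented, and the nearest-rank percentile is one common definition, not necessarily the one the HTTP Archive queries use.

```javascript
// Hypothetical per-page Largest Contentful Paint values in milliseconds.
// These numbers are made up for illustration.
const lcpMs = [1200, 950, 3100, 2400, 1800, 4200, 1500, 2000, 2750, 1100];

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Average, minimum, maximum and 90th percentile of a list of numbers.
function summarize(values) {
  const sum = values.reduce((acc, v) => acc + v, 0);
  return {
    average: sum / values.length,
    minimum: Math.min(...values),
    maximum: Math.max(...values),
    p90: percentile(values, 90),
  };
}

console.log(summarize(lcpMs));
```
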
But that's not a custom metric because these metrics are
default, and you can get them just by running
the page through the browser.
But then you can run these extra JavaScripts that
are looking at the content.
And you can do things like, for instance, you
can say, hey, give me all the children elements
that are in the head.
And then, later on in the queries part,
you get a list of all the things that are in the head.
Let's say you call it custom.head-invalid-elements
or something.
And then you can say, OK, so for all of the things
that we have in this database of these head
elements, which are ones that don't belong there?
Or which is the most likely head element that we are seeing?
Or, I don't know, what's the charset
that people are setting in their meta
char set element that they have in the head?
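
A rough sketch of what such a head-inspecting snippet could look like. The function names and the stand-in document object are hypothetical and are not taken from the HTTP Archive custom metrics repository. The allowed list is the set of metadata elements that the HTML standard permits inside the head.

```javascript
// Elements the HTML standard allows as children of <head>
// (the "metadata content" category).
const ALLOWED_IN_HEAD = new Set([
  "base", "link", "meta", "noscript", "script", "style", "template", "title",
]);

// Lowercase tag names of the direct children of <head>.
// `doc` is a Document in a browser; any object with the same shape works.
function headChildTags(doc) {
  return Array.from(doc.head.children, (el) => el.tagName.toLowerCase());
}

// Tag names found in <head> that do not belong there.
function invalidHeadTags(doc) {
  return headChildTags(doc).filter((tag) => !ALLOWED_IN_HEAD.has(tag));
}

// Stand-in for a real page so the sketch runs outside a browser.
const fakeDocument = {
  head: { children: [{ tagName: "TITLE" }, { tagName: "META" }, { tagName: "DIV" }] },
};

console.log(invalidHeadTags(fakeDocument));
```

One caveat: when a browser parses HTML, it usually moves elements that are not allowed in the head into the body, so a real metric may have to look at the raw markup as well. The sketch only shows the shape of the idea.
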
And to have any sort of metric that
isn't by default available to a browser or Lighthouse
or whatever other tools we are running--
I think there's another one, Wappalyzer or something
like that, that gives you information,
like what framework has this been built with,
or what content management system is this using?
So if it's not in these default tools that are running,
then you can add custom code to get out what you need.
And that's what you did, right?
GARY ILLYES: Yeah, I mean, we did it.
MARTIN SPLITT: OK, fair enough-- we did.
Yeah, you wrote the code.
I looked at it and cried only a little.
GARY ILLYES: So that's the suggestion
that we got from Barry Pollard.
So he pointed us to their GitHub repository for custom metrics.
And then there we found this weirdo JavaScript function,
or class--
I don't remember what it is-- anyway,
that is actually extracting some limited number of rules,
but they were hard-coded.
So basically, it was noindex and noarchive, I don't know,
crawl-delay, whatever; basically just counting those
that they knew of already.
And we needed the exact opposite.
We wanted to learn of all the rules that people are using,
not just the ones that we know about.
MARTIN SPLITT: Mhm.
GARY ILLYES: So we twisted it around,
and we got some really good comments
from Barry and some other folks in the GitHub community.
And then we started collecting data.
I think we submitted it February 3 or something like that.
And then it was merged a bit later.
But it was submitted right before the next run, so
basically the next crawl, I don't know.
MARTIN SPLITT: Yeah, the next run basically.
GARY ILLYES: Probably using the wrong terminology.
But basically, we managed to get in data for the February 1 data
set.
MARTIN SPLITT: Ah, nice.
OK, that's really nice.
GARY ILLYES: Yeah, and then, again, go back to BigQuery--
or wait for the run to complete.
Go back to BigQuery, and then run the query again.
Get heart attack.
And then just use that data.
And yeah, that's the story.
Do you want to talk about our JavaScript or not?
MARTIN SPLITT: We can talk about it
a little bit, about the JavaScript,
because I basically start to remember things, which
is, I think, generally good.
GARY ILLYES: Great.
So you know what would have helped a lot, Martin?
MARTIN SPLITT: What?
GARY ILLYES: If we had a JavaScript parser.
MARTIN SPLITT: You were less than
enthusiastic and interested, and now it's
the most important thing.
OK, fine.
Fine.
GARY ILLYES: No, I told you a bunch of times that there are
people who actually need it.
And finally, it was me who needed it.
And I was very disappointed that I didn't--
MARTIN SPLITT: I'm so sorry, sweetie.
I apologize.
GARY ILLYES: Do you?
MARTIN SPLITT: Yes, I actually do.
GARY ILLYES: We are going to include a link
to that JavaScript function.
And I'm going to ping you this so
you can also see it because you probably don't have it.
MARTIN SPLITT: I have it open.
GARY ILLYES: Oh, you do?
MARTIN SPLITT: Mhm.
GARY ILLYES: How did you find it?
It's very secret.
MARTIN SPLITT: No, it's not.
GARY ILLYES: So some discoveries.
Basically what I was trying to do,
and then you confirmed that we can do that,
is to roughly imitate what the C++ parser is doing.
And that is basically going line by line.
And then I thought about going character by character,
but it doesn't make sense when you are doing this kind of stuff
because you are not looking for one specific tag or rule.
You are looking for anything that looks like a rule.
MARTIN SPLITT: Yes.
GARY ILLYES: Right?
MARTIN SPLITT: Yes.
GARY ILLYES: So I am really, really,
really bad at writing regex, or "reh-jex."
So I asked the toaster or the AI chatbot to write me a regex
because it is really good at writing regexes for some reason.
I don't know why.
Maybe there's lots of training data for it.
But it came up with this monstrosity of a regex
on line 58.
MARTIN SPLITT: Yeah, that one was scary.
I mean, regex in general-- difficult, difficult.
But this one-- jeez.
GARY ILLYES: Yeah, and basically just came up with that.
I tested it over and over and over again.
I actually ran it through a fuzzer, so basically just
to try to break it-- basically test its limits.
And it didn't break, so I was happy with it.
And then we are just matching each line that we extracted,
that starts with something that resembles a key-value pair.
MARTIN SPLITT: Mhm, separated by a colon.
GARY ILLYES: Separated by a colon.
And then we are just extracting that.
And that will produce lots of weird stuff.
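
The line-by-line matching described here can be sketched roughly as follows. The regular expression is a deliberately simplified stand-in written for this illustration, not the one in the actual custom metric, and the function name and sample file are hypothetical.

```javascript
// Simplified pattern, an assumption for illustration: optional leading
// whitespace, a key made of letters, digits, underscores or hyphens,
// then a colon, then the rest of the line as the value.
const RULE_LINE = /^\s*([A-Za-z][A-Za-z0-9_-]*)\s*:\s*(.*)$/;

// Counts how often each key appears, for any line that looks like "key: value".
function extractRules(robotsTxt) {
  const counts = new Map();
  for (const rawLine of robotsTxt.split(/\r\n|\r|\n/)) {
    // Drop comments: everything from the first "#" onward.
    const line = rawLine.split("#")[0];
    const match = RULE_LINE.exec(line);
    if (!match) continue;
    const key = match[1].toLowerCase();
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

const sample = [
  "User-agent: *",
  "Disallow: /private/  # keep out",
  "disallow: /tmp/",
  "Crawl-delay: 10",
  "# a comment: with a colon",
  "Sitemap: https://example.com/sitemap.xml",
  "body { color: red }",
  "  padding: 0;",
].join("\n");

console.log(Object.fromEntries(extractRules(sample)));
```

Because the pattern accepts anything shaped like a key followed by a colon, a stray CSS declaration such as the padding line is counted as a rule too.
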
If you look at the distribution-- maybe
I will put this on LinkedIn or something.
Or maybe not LinkedIn, but what's
the bird-- the new bird thing?
MARTIN SPLITT: Bluesky.
GARY ILLYES: Bluesky.
If you look at the distribution of rules that it extracted,
it is--
how do I show it to you?
MARTIN SPLITT: I don't know.
Send me a link.
GARY ILLYES: OK, where's Martin?
MARTIN SPLITT: I'm here.
GARY ILLYES: Martin's the--
OK, link, "Mortimer."
"Mortimer."
So if you look at the distribution--
MARTIN SPLITT: Oh, yeah.
Ooh.
GARY ILLYES: --it's basically an extremely sharp drop-off
after the really popular ones.
MARTIN SPLITT: Mhm.
GARY ILLYES: So basically, you can
see that we have the other bucket, which is basically
all the lines that had a colon in them or something like that.
But after allow and disallow and user agent,
the drop is extremely drastic.
Even if you put it in log scale--
I have one in log scale, as well,
because that's showing it better.
And also, people can extract this from BigQuery,
as well, from the [INAUDIBLE].
MARTIN SPLITT: Yeah, from BigQuery.
This is now in the latest crawl data.
GARY ILLYES: It's in the custom metrics.
If you look at this one, you can see that, even on log scale,
the drop-off is extremely sharp.
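
To make the log-scale point concrete, here is a small sketch that ranks rules by how many files contain them and prints a crude log-scale bar. The counts are invented toy numbers chosen only to show the shape of a sharp drop-off; they are not values from the data set.

```javascript
// Hypothetical number of robots.txt files containing each rule.
// Toy numbers for illustration only, not measured results.
const filesPerRule = new Map([
  ["disallow", 90000],
  ["user-agent", 100000],
  ["allow", 60000],
  ["sitemap", 10000],
  ["crawl-delay", 1000],
  ["padding", 10],
]);

// Sort rules by count, most common first, and attach log10 of each count.
function rankRules(counts) {
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([rule, count]) => ({ rule, count, log10: Math.log10(count) }));
}

for (const { rule, count, log10 } of rankRules(filesPerRule)) {
  // One "#" per order of magnitude gives a rough log-scale bar.
  console.log(rule.padEnd(12), String(count).padStart(7), "#".repeat(Math.round(log10)));
}
```
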
So basically, there is a large chunk of robots.txt files
that contain these tags.
And then there's broken files like johnmueller.com/robots.txt,
or garyillyes.com/robots.txt, which contain just fun stuff,
so to say.
MARTIN SPLITT: Actually, there's a bunch
of pages that probably don't have the robots.txt
and give us some sort of error page here.
GARY ILLYES: Yeah, yeah, lots of HTML pages with--
MARTIN SPLITT: CSS in it, yeah.
GARY ILLYES: --with CSS in it, yeah, exactly.
That's why you see all those padding and IMG and--
MARTIN SPLITT: A color, width.
GARY ILLYES: Yeah, we can also use this to identify
the typos of the disallows.
So I'm probably going to expand the typos that we accept.
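
One possible way to surface likely misspellings from the extracted keys is an edit-distance filter. This is an assumption about how it could be done, not a description of how Google picks the typos it accepts, and the list of observed keys is hypothetical.

```javascript
// Classic dynamic-programming Levenshtein edit distance between two strings.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Keys within `maxDistance` edits of the target, excluding exact matches.
function likelyTypos(keys, target, maxDistance = 2) {
  return keys.filter((key) => {
    const distance = editDistance(key.toLowerCase(), target);
    return distance > 0 && distance <= maxDistance;
  });
}

// Hypothetical keys, as they might come out of the rule extraction.
const observedKeys = ["disallow", "dissallow", "disalow", "allow", "padding", "disallaw"];
console.log(likelyTypos(observedKeys, "disallow"));
```

A threshold of two edits is arbitrary. A lower one misses more typos, and a higher one starts to pull in unrelated keys.
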
MARTIN SPLITT: I just realized we
might be able to filter these out in the query in the custom
metric.
GARY ILLYES: OK, if you have ideas,
I'm happy to review it because I'm so good at JavaScript,
as you know.
MARTIN SPLITT: Yeah.
I mean, logically speaking, we have
to check that we get a 200 status back.
So we will avoid all the 404 pages.
GARY ILLYES: Sure.
MARTIN SPLITT: We can probably tell
if its content type is text/html and then just not deal with it.
GARY ILLYES: Well, if you are strict with it, then it's fine.
If you're strict with the parsing,
then it's fine to ignore those.
But technically, Google does want
to parse out rules from normal HTML files, as well, if--
MARTIN SPLITT: OK, but we are not
doing that if the HTTP status is not 200, right?
GARY ILLYES: That's correct, yeah.
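
The filtering discussed in this exchange could look something like the sketch below. The response object and its field names are hypothetical. The strict flag reflects the two positions in the conversation: skip HTML responses entirely, or parse anything that came back with a 200.

```javascript
// Decide whether a fetched robots.txt response is worth parsing.
// `response` is a plain object with hypothetical field names:
//   status: the HTTP status code
//   contentType: the Content-Type header value, if any
function shouldParse(response, { strict = true } = {}) {
  // Anything other than 200 (for example a 404 error page) is skipped.
  if (response.status !== 200) return false;
  // Lenient mode: parse whatever came back with a 200, HTML included.
  if (!strict) return true;
  // Strict mode: ignore responses served as HTML.
  const mediaType = (response.contentType || "").split(";")[0].trim().toLowerCase();
  return mediaType !== "text/html";
}

console.log(shouldParse({ status: 200, contentType: "text/plain" }));
console.log(shouldParse({ status: 200, contentType: "text/html; charset=utf-8" }));
console.log(shouldParse({ status: 200, contentType: "text/html" }, { strict: false }));
console.log(shouldParse({ status: 404, contentType: "text/plain" }));
```
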
Anyway, and then all these things that it extracted,
plus some additional data that was always there,
like the size of the-- the raw byte size of the thing,
the "thing" being the robots.txt file--
those are put in a JSON file and then
basically put in the data set-- custom metrics data set?
Is it a data set?
What is it?
MARTIN SPLITT: Mm, data set, I would say.
GARY ILLYES: OK.
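
As a rough picture of the kind of JSON record being described, here is a minimal sketch. The field names are hypothetical and are not the custom metric's actual schema. The byte size is measured by encoding the text as UTF-8, which only approximates the raw size if the file was served in another encoding.

```javascript
// Build the JSON payload for one robots.txt file.
// Field names are hypothetical, chosen for this illustration.
function buildRecord(robotsTxt, ruleCounts) {
  return JSON.stringify({
    // UTF-8 byte length, which can differ from the character count.
    size_bytes: new TextEncoder().encode(robotsTxt).length,
    rule_counts: ruleCounts,
  });
}

const exampleFile = "User-agent: *\nDisallow: /caf\u00e9/\n";
const exampleRecord = buildRecord(exampleFile, { "user-agent": 1, disallow: 1 });

console.log(exampleRecord);
console.log("characters:", exampleFile.length);
```
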
Yeah, and that's how we expanded our understanding
of robots.txt rules with data.
MARTIN SPLITT: That's really cool.
Wow.
And I think it's really nice because that
might make its way into the SEO chapter for this year's Web
Almanac because they just have more information available.
GARY ILLYES: Oh, ah, yeah.
I did not know that.
MARTIN SPLITT: Yeah, I think they have-- let me check.
I think, the Web Almanac in the SEO chapter, it's brilliant.
I definitely highly recommend reading it.
I think they do discuss-- yeah, here.
So robots.txt is discussed.
For instance, the status codes--
84.9% of the URLs that they had looked at from the crawl set,
basically, have a 200.
13% have a 404.
And then others are weird timeouts--
403, 500-- are negligible, basically, less than a percent
each.
Robots.txt size in kilobytes-- most of them
are between 0 and 100 kilobytes.
GARY ILLYES: Huh, yeah.
I mean, that makes sense, that you
can put that much stuff in it.
MARTIN SPLITT: A lot of them contain asterisks
as the user agent.
GARY ILLYES: Makes sense.
MARTIN SPLITT: AdsBot-Google is the more-often mentioned.
Googlebot only appears in 6.2% of the robots.txt
files they looked at, but AdsBot-Google in 9.8%,
last year.
GARY ILLYES: Oh.
MARTIN SPLITT: Interesting.
GARY ILLYES: Yeah, huh.
MARTIN SPLITT: [CHUCKLES] Interesting.
So yeah, they have a bunch of stuff here.
GARY ILLYES: Cool.
MARTIN SPLITT: Nice.
Has been fun, though.
GARY ILLYES: Well, Martin, guess what.
Do you want to talk about something else?
MARTIN SPLITT: Oh, yes, but we cannot.
GARY ILLYES: Aw.
MARTIN SPLITT: So you leave me?
GARY ILLYES: Well, you are leaving me, quite literally.
You are moving to--
MARTIN SPLITT: Fine.
GARY ILLYES: --a different country.
MARTIN SPLITT: Temporarily.
I'll be back.
You don't have to worry about that.
GARY ILLYES: Will you?
MARTIN SPLITT: Yes.
GARY ILLYES: Will you?
MARTIN SPLITT: Yes, of course.
And we will be back to all of you
out there, as well, with a new episode soon, as well.
GARY ILLYES: You are saying my line.
MARTIN SPLITT: Yeah, because I'm a sweetheart like that.
I reduce your work.
GARY ILLYES: Oh, fantastic.
Now I don't have to deal with AI anymore.
I have to deal with you taking my line.
MARTIN SPLITT: [LAUGHS] That's worse.
Less predictable, more unstable.
GARY ILLYES: Well, Martin, thank you so much
for chatting with me.
You are the only one who's still chatting with me.
And for the listeners, thank you also for listening to us.
Please like and subscribe wherever you get your podcasts.
And please do, because if you want to listen to more episodes,
then we need numbers because we are data-driven,
Martin and Gary.
Well, Martin, again, nice chatting with you.
Goodbye.
Nice talking to you.
MARTIN SPLITT: Bye, bye.
[MUSIC PLAYING]
GARY ILLYES: We've been having fun with these podcast episodes.
I hope you, the listener, have found
them both entertaining and insightful, too.
Feel free to drop us a note on LinkedIn,
or chat with us at one of the next events
that we go to if you have any thoughts.
And of course, don't forget to like and subscribe.
Thank you, and goodbye.
[MUSIC PLAYING]