Server logs provide the true view of how search engines crawl your website. Google Search Console and crawling tools provide useful data, but they don’t tell the full story.
I opened the Coverage report in Google Search Console and found this.
The report provides insights into what pages are excluded and the reasons for excluding them, which is amazing. However, the report shows only a sample of 1,000 URLs, which is insufficient for large sites. There are three options:
Analyze the 1,000 URL sample and hope the main problems can be spotted there.
Create new properties in GSC for each section of a site.
Get server logs and blend them with crawl data, sitemaps, Google Analytics, etc.
I like the third option!
Not familiar with logs or need a quick refresher? No problem. Read this post about analyzing log files first and then come back.
Blending data
My tool of choice is Power BI, but you can use any tool that allows you to blend multiple data sources.
If you want to learn about Power BI, you can check:
Microsoft’s Learn Power BI
Power BI courses via edX
Wil Reynolds’ Power BI for Digital Marketers series
I will use seven data sources and create relationships among them:
Log files
Crawl data from Screaming Frog
Organic landing page data from Google Analytics
Performance data from Google Search Console
URLs in the Sitemap.xml file
Prerendering data
List of crawlers (this helps me translate user agents from logs to more friendly names, e.g. Googlebot Smartphone)
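My blending happens in Power BI, but the underlying logic is easy to prototype in code, too. Here’s a minimal pandas sketch of how these sources could be loaded and related on a shared URL key; every file and column name below is a placeholder, so swap in whatever your own exports use.

```python
import pandas as pd

# All file and column names are placeholders for your own exports
logs = pd.read_csv("logs.csv")                       # url, status, user_agent, timestamp
crawl = pd.read_csv("screaming_frog_internal.csv")   # url, indexability, word count, inlinks, ...
sitemap = pd.read_csv("sitemap_urls.csv")            # url
analytics = pd.read_csv("ga_landing_pages.csv")      # landing_page, users, revenue

# URLs are the key that relates the tables, so normalise them consistently
def normalise(urls):
    return urls.str.lower().str.rstrip("/")

for df, col in [(logs, "url"), (crawl, "url"), (sitemap, "url"), (analytics, "landing_page")]:
    df[col] = normalise(df[col])

# Example relationship: crawl data enriched with request counts from the logs
requests_per_url = logs.groupby("url").size().rename("log_requests").reset_index()
blended = crawl.merge(requests_per_url, on="url", how="left").fillna({"log_requests": 0})
```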
Initially, I recommend starting small with only a couple of data sources to learn how the tool works.
But don’t limit yourself in the long run. There are other data sources you may use, such as PPC data from AdWords or keyword ranking data.
In my examples, I’m using only small samples and modified subsets of the datasets as I can’t show the real numbers. If some numbers don’t add up or look off, now you know why.
Start with the basics
It’s always a good idea to aggregate numbers only from logs to get a better understanding of what you’re working with.
There are a few insights to be found in this overview:
Most Googlebot requests return 200, which is good. The site doesn’t waste crawl activity with non-existing pages.
Googlebot crawls important URLs multiple times per day.
Googlebot (Desktop) crawls the site more often than Googlebot Smartphone (60.59% vs. 39.41%), which suggests the site hasn’t been migrated to mobile-first indexing yet.
Remember, each crawler may behave differently (Bingbot vs. Googlebot) so slice your data by user-agent to discover meaningful insights.
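If you want to reproduce this kind of overview outside Power BI, a rough pandas sketch could look like this (the column names status, user_agent, and timestamp are assumptions about how your logs were parsed):

```python
import pandas as pd

logs = pd.read_csv("logs.csv")  # assumed columns: url, status, user_agent, timestamp

# Share of requests per status code -- is crawl activity wasted on 404s?
print((logs["status"].value_counts(normalize=True) * 100).round(2))

# Share of requests per crawler, e.g. Googlebot Desktop vs. Googlebot Smartphone
print((logs["user_agent"].value_counts(normalize=True) * 100).round(2))

# How many times each URL is requested per day
logs["date"] = pd.to_datetime(logs["timestamp"]).dt.date
print(logs.groupby(["url", "date"]).size().sort_values(ascending=False).head(10))
```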
Now that we know what we’re working with, let’s blend data, because that’s where the magic happens.
Sitemap.xml and logs
Sitemap.xml files contain URLs that should be discovered and crawled by search engines. However, are they crawled?
You are only a few clicks away from finding out.
Having URLs from the sitemap.xml file and log data allows you to easily create two tables:
Requests for URLs in the sitemap
Requests for URLs not in the sitemap
Only 38% (398) of crawled URLs are listed in the sitemap.xml. The other 62% of URLs? You can review them in the table on the right-hand side. It helps you answer questions like…
What URLs are crawled often but are not in the sitemap.xml?
What URLs should be in the sitemap.xml but are not there?
Tip: Use logs that cover a bigger time frame before drawing conclusions. Not all pages are crawled every day.
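Here’s a minimal sketch of the same in-sitemap vs. not-in-sitemap split in pandas, assuming the sitemap.xml has already been flattened into a plain list of URLs and the log columns are named as shown:

```python
import pandas as pd

logs = pd.read_csv("logs.csv")             # assumed columns: url, user_agent, ...
sitemap = pd.read_csv("sitemap_urls.csv")  # assumed column: url

googlebot = logs[logs["user_agent"].str.contains("Googlebot", na=False)]
crawled = googlebot.groupby("url").size().rename("requests").reset_index()

in_sitemap = crawled[crawled["url"].isin(sitemap["url"])]
not_in_sitemap = crawled[~crawled["url"].isin(sitemap["url"])]
print(f"Crawled URLs in the sitemap: {len(in_sitemap) / len(crawled):.0%}")

# URLs crawled often but missing from the sitemap -- candidates for review
print(not_in_sitemap.sort_values("requests", ascending=False).head(20))

# The reverse question: sitemap URLs Googlebot never requested in this time frame
print(sitemap[~sitemap["url"].isin(crawled["url"])])
```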
Google Analytics and logs
Ideally, you want Googlebot to crawl pages that drive revenue. It makes sense because you want search engines to display the most up-to-date title, meta description, and structured data. If you’re an e-commerce site, you don’t want to show an incorrect price in the search results.
Creating a scatter chart with Revenue from Google Analytics and crawl frequency from logs will reveal if your most important pages are crawled often enough.
How to build the scatter chart
Visualizations > Scatter Chart
Add “Landing Page” (from Google Analytics) to “Details”
Add “Revenue” (from Google Analytics) to “X Axis”
Create a new measure to get the number of crawl requests by Googlebot (from logs)
Add the new measure to “Y Axis”
Pages with extreme values often make the chart hard to read, so I filter those outliers out.
Don’t have revenue data? Use any other goal you’ve set up or traffic data (e.g., Users).
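If you’d like to prototype this chart outside Power BI, a rough pandas and matplotlib sketch could look like the one below; the file and column names (landing_page, revenue, and so on) are assumptions about your own exports:

```python
import pandas as pd
import matplotlib.pyplot as plt

analytics = pd.read_csv("ga_landing_pages.csv")  # assumed columns: landing_page, revenue
logs = pd.read_csv("logs.csv")                   # assumed columns: url, user_agent

googlebot = logs[logs["user_agent"].str.contains("Googlebot", na=False)]
crawl_freq = googlebot.groupby("url").size().rename("googlebot_requests").reset_index()

blended = analytics.merge(crawl_freq, left_on="landing_page", right_on="url", how="left")
blended["googlebot_requests"] = blended["googlebot_requests"].fillna(0)

# Filter out extreme outliers so the chart stays readable
blended = blended[blended["revenue"] <= blended["revenue"].quantile(0.99)]

plt.scatter(blended["revenue"], blended["googlebot_requests"], alpha=0.5)
plt.xlabel("Revenue (Google Analytics)")
plt.ylabel("Googlebot requests (logs)")
plt.show()
```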
Crawl data and logs
Googlebot should spend the majority of its time crawling pages that are indexable.
Are the requested pages indexable?
Connect data from your favorite crawler with logs, and you may find that almost 69% of Googlebot requests are for URLs not meant to be indexed.
Wondering which pages contain a canonical tag pointing to another page? Click the “Canonicalised” bar to see all of those pages in the table.
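Here’s a quick sketch of that indexability split in pandas, assuming a Screaming Frog internal export (with its Indexability and Indexability Status columns) joined to parsed logs:

```python
import pandas as pd

crawl = pd.read_csv("screaming_frog_internal.csv")  # Address, Indexability, Indexability Status
logs = pd.read_csv("logs.csv")                      # assumed columns: url, user_agent

googlebot = logs[logs["user_agent"].str.contains("Googlebot", na=False)]
blended = googlebot.merge(crawl, left_on="url", right_on="Address", how="left")

# Share of Googlebot requests hitting non-indexable URLs
share = (blended["Indexability"] == "Non-Indexable").mean()
print(f"Requests for non-indexable URLs: {share:.0%}")

# Break those requests down by reason, e.g. Canonicalised or noindex
print(blended.loc[blended["Indexability"] == "Non-Indexable", "Indexability Status"].value_counts())
```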
Factors affecting crawl frequency?
Have you ever wondered if longer content is crawled more often? Plot the word count and the number of Googlebot crawl requests in a scatter chart.
Does the number of internal links pointing to a page affect Googlebot’s crawl frequency? Simply replace “word count” with “inlinks” (metric from Screaming Frog) to find out.
A few tips here:
Use filters to remove outliers.
Be mindful of correlation vs. causation before drawing conclusions.
I’m only scratching the surface here. Do you have a feeling that structured data affects the crawl rate? Cool, test your hypothesis! Your crawler should be capable of extracting structured data while crawling (look for custom extraction in the crawl settings).
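A minimal sketch of this kind of hypothesis test, assuming Screaming Frog’s Word Count and Inlinks columns and parsed logs with a user_agent column:

```python
import pandas as pd

crawl = pd.read_csv("screaming_frog_internal.csv")  # Address, Word Count, Inlinks
logs = pd.read_csv("logs.csv")                      # assumed columns: url, user_agent

googlebot = logs[logs["user_agent"].str.contains("Googlebot", na=False)]
freq = googlebot.groupby("url").size().rename("googlebot_requests").reset_index()

blended = crawl.merge(freq, left_on="Address", right_on="url", how="left")
blended["googlebot_requests"] = blended["googlebot_requests"].fillna(0)

# Trim outliers before looking at the relationship
blended = blended[blended["Word Count"] <= blended["Word Count"].quantile(0.99)]

# Correlation is only a starting point -- it says nothing about causation
print(blended[["Word Count", "Inlinks", "googlebot_requests"]].corr())
```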
Prerendering success rate
Websites without the luxury of full server-side rendering often rely on Dynamic rendering. They detect crawlers (by checking the user-agent) and serve them with a prerendered static HTML version while requests from users are served normally with code full of JavaScript.
Having a high prerendering success rate is critical. If prerendering fails, the crawlers (Google, Bing, Facebook, Twitter, etc.) may receive a blank page with no content.
Combining logs and prerendering data reveals the number of Googlebot requests.
Fully prerendered HTML was returned to 80.51% of Googlebot requests
A blank page with no content was returned to 19.49% of Googlebot requests
This insight helps you to prioritize this issue. You can take this even further and plug the revenue numbers in.
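Here’s a rough sketch of that calculation, assuming your prerendering service logs each request with a status; the prerender_status column and its values are made up, so adapt them to whatever your setup records:

```python
import pandas as pd

logs = pd.read_csv("logs.csv")                # assumed columns: url, user_agent
prerender = pd.read_csv("prerender_log.csv")  # assumed columns: url, prerender_status ("ok" / "blank")

googlebot = logs[logs["user_agent"].str.contains("Googlebot", na=False)]

# A real join would also match on the request timestamp; URL alone keeps the sketch simple
blended = googlebot.merge(prerender, on="url", how="left")

success_rate = (blended["prerender_status"] == "ok").mean()
print(f"Fully prerendered HTML returned: {success_rate:.2%}")
print(f"Blank page returned: {1 - success_rate:.2%}")
```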
Don’ts
Please don’t consider the info from my screenshots as universal facts. My dataset is not even close to the size I’d need to make claims that apply to all websites. Every site is different, so analyze your own data. I bet you’ll discover very interesting insights.
Don’t look at all the data in your logs at once; it rarely makes sense. You want to investigate the behavior of a specific bot, so slice your logs by user agent.
Conclusion
Some great paid tools do parts of this, but sometimes you don’t have the budget or the time to implement them. It’s always about people, not tools, and you can conduct these analyses on your own.
All the examples I’ve shown can be done with no additional investment. Power BI Desktop is free. Everyone has a web crawler, web analytics platform, and Google Search Console. You can start when you get access to log files. So, you can start today.