Server logs are a powerful source of information that can help you understand complex SEO problems with crawling and indexing. Every technical SEO should be able to perform a log analysis. It’s a must-do if you manage a large website.
I will start with some basics, but I’ll also provide advanced tips and tricks, including a Google spreadsheet that helps you discover when other bots or crawlers spoofed the Googlebot.
Scroll to the bottom if you want to watch a video where I cover most of this post.
Why should I care about how search engines crawl my site?
Before we get into detail, I want to mention a couple of reasons you should care and pay attention to how search engines go through your website.
From my experience, I’ve seen a correlation between pages that are crawled often and pages that rank high. This is only a correlation, not causation.
Staying relevant in search results
Most e-commerce websites use structured data to enrich their snippets in search results. Rich snippets for products may contain information such as product rating, price, availability, and more.
Regardless of your position in the company, you want to display the correct information. If you run a product promotion, you want to show the discounted price in search results to acquire more clicks (because you sell the product for less than competitors). However, the discounted price won’t be reflected in search results until that product page is recrawled.
Once the promotion is over, you want to recrawl that page as fast as possible, so as not to disappoint users who saw your discounted price in the search results, only to discover the real price is higher. It’s not a good experience.
Finding challenges that search engines face
Log file analysis helped us discover crawling and indexing issues on an e-commerce site. The insights from the analysis led to a set of actions that resulted in a 55% increase in revenue from organic traffic to PLPs (category pages).
The analysis itself doesn’t help with anything; the actions you take based on its insights are what drive the impact.
SEO crawlers don’t reveal how search engines really behave
As much as I love using SEO crawlers such as Screaming Frog, it’s important to realize that they see only a subset of the URLs that search engines know of. There are URLs SEO crawlers don’t discover because they:
- Crawl only your website, so they don’t see external links pointing to your site,
- Provide a snapshot of your website at a certain time (search engines have crawled your website for years and remember legacy URLs).
What’s crawl budget?
Every SEO has heard about crawl budget, yet many of us cannot explain it.
Google has no single term that would describe everything that “crawl budget” stands for externally (source). Google uses crawl rate limit and crawl demand to prioritize what to crawl and when.
- Crawl rate limit depends mostly on your servers. Google wants to be a good citizen of the web, so they try to figure out how many URLs they can crawl without impacting the overall performance of the website.
- Crawl demand is determined by popularity (more popular URLs tend to be crawled more often) and staleness (Google attempts to prevent URLs from becoming stale in the index).
The problem occurs when the crawl rate limit is lower than the crawl demand. There are two solutions to that:
- Increase the crawl rate limit by improving your servers.
- Lower crawl demand by cleaning up your site, consolidating duplicate pages, disallowing crawling of some combinations in faceted navigation, etc.
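As a sketch of the second option, a minimal robots.txt might block filtered facet combinations while leaving the base category pages crawlable (the URL parameters here are hypothetical; adapt them to your own faceted navigation):

```
User-agent: *
# Block filtered facet combinations (hypothetical parameters)
Disallow: /*?color=
Disallow: /*?sort=
```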
That being said, most websites don’t have problems with “crawl budget”. Learn to prioritize, because there’s not much value in it for small sites.
Optimizing crawl budget for a 100-page website doesn’t make sense.
Challenges of log analysis
Many people don’t dive into log files because they think it’s difficult. However, it’s not!
You don’t need a degree in statistics or knowledge of a programming language. It’s not rocket science.
The hardest thing is getting access to logs.
Talk to your developers, engineers, web ops and make friends.
I don’t want to go through the pros and cons of each tool in this post, but it’s important to know there are two ways you can analyze logs.
Static vs. Real-Time
If a developer sends you a log file from last week, you work with a static file. If you want to see logs from this week, you must request another file. Static files can be opened in Excel or tools like Screaming Frog Log Analyzer. You can use Google BigQuery or MySQL for larger log files.
Some solutions give you direct access to logs and allow you to change the time frame, so you don’t have to request log files again and again. One of them is the ELK Stack (Elasticsearch, Logstash, and Kibana). Once you get access to Kibana, you’re good to go and don’t need to bother a developer again.
Breakdown of a log file
Regardless of the tool of your choice, the basic logic is like using an Excel sheet with filters and pivot tables. Think of each request as a row in a table and each type of information as a column.
What’s a log file?
A log file is a record of everything that goes in and out of a server.
It’s a ledger of all requests made to the server and the responses returned (including failed requests).
It logs all interactions, and you can find all requests from users and bots there.
If you open a log file, you’ll see something like this.
Let’s deconstruct that.
126.96.36.199 – – [08/Dec/2017:04:54:20 -0400] “GET /contact/ HTTP/1.1” 200 11179 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”
This is one row of the file. One row means one request.
- 126.96.36.199 – IP address (who)
- [08/Dec/2017:04:54:20 -0400] – Timestamp (when)
- “GET /contact/ HTTP/1.1” – Access request (what)
- 200 – Status code (result of the request)
- 11179 – Bytes transferred (size)
- “-” – Referrer URL (source)*
- Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) – User agent (signature)
* This is a request from the Googlebot and, in general, requests from search engine bots don’t include referrer URL.
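The breakdown above can be automated. Here is a minimal Python sketch (assuming the combined log format shown in the example line) that splits a log line into those same fields with a regular expression:

```python
import re

# Combined Log Format: IP, identity, user, [timestamp], "request",
# status, bytes, "referrer", "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('126.96.36.199 - - [08/Dec/2017:04:54:20 -0400] "GET /contact/ HTTP/1.1" '
        '200 11179 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
        '+http://www.google.com/bot.html)"')

match = LOG_PATTERN.match(line)
print(match.group('ip'))      # 126.96.36.199
print(match.group('status'))  # 200
```

Run this over every line of the file and you have the table of rows and columns described above.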
Some bots pretend to be the Googlebot. This is called spoofing, and they do it to ensure you don’t block them.
Often, you want to block those bots, because too many requests may slow down your server and/or distort your data (log files showing higher Googlebot activity than there actually was).
To verify if a bot accessing your server really is the Googlebot, you may run a reverse DNS lookup and a forward DNS lookup on the IP address from the request.
DNS (Domain Name System) is the Internet equivalent of a phone book. It ties an IP address and domain name in the same way a phone book ties a phone number and phone owner together.
Tools dedicated to log analysis do it automatically for you. If that’s not the case, you may want to verify Googlebot requests on your own.
Let’s use Google Sheets combined with Hacker Target’s API. The IP tools have a limit of 100 queries per day, or you can sign up for a membership to increase the daily quota.
I want to help you get started today, so I’ve created a spreadsheet that does it for you.
How to verify Googlebot
- Download your logs – Ask your developers to do that (or most hosting providers allow you to download them via cPanel).
- Import Googlebot requests to a Google Spreadsheet – Make sure that each type of information has its own column (column 1: IP address, column 2: Timestamp, column 3: Request, etc.).
- Extract all IP addresses and order them by the number of requests – You can use pivot tables, but I prefer using QUERY function* in this case.
- Run a reverse DNS lookup – Use Reverse DNS API to convert IP addresses to domain names. You’ll need to combine CONCATENATE and IMPORTDATA functions.
- Run a forward DNS lookup – Use DNS Lookup API to convert domain names back to IP addresses. Again, you can use CONCATENATE and IMPORTDATA functions.
- Compare IP addresses – Take the final IP address and compare it with the initial IP address. If the IPs are identical, you successfully verified the Googlebot. If IPs don’t match, you know the requests from that IP are not from Googlebot.
* QUERY function uses the principles of SQL to do searches so if you’re familiar with SQL, you’ll love this function.
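If you prefer a script to a spreadsheet, the same reverse-plus-forward verification can be sketched with Python’s standard library; no external API, so no daily quota (the genuine-Googlebot host suffixes below follow Google’s documented verification method):

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Verify that an IP really belongs to Googlebot via reverse + forward DNS."""
    try:
        # Reverse lookup: IP -> host name
        host, _, _ = socket.gethostbyaddr(ip)
        # Genuine Googlebot hosts end in googlebot.com or google.com
        if not host.endswith(('.googlebot.com', '.google.com')):
            return False
        # Forward lookup: host name -> IPs; the original IP must be among them
        _, _, forward_ips = socket.gethostbyname_ex(host)
        return ip in forward_ips
    except OSError:  # no DNS record / lookup failed -> cannot verify
        return False
```

Calling `is_verified_googlebot()` on each distinct IP from your logs (it needs network access for the DNS lookups) flags the spoofers for you.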
The IP tools have a limit of 100 queries per day, so the sheet may return errors when you open it because the limit might have been exceeded by other users opening that sheet. Therefore, I created two sheets:
- IPs (Dynamic) – This sheet includes all the functions but may return errors.
- IPs (Static) – This sheet includes only the values returned via the APIs, to demonstrate how it’s supposed to look if you don’t go over the limit.
*If you want to use the sheet, recreate the sheet (you can see all functions) or make your own copy (FILE -> MAKE A COPY). Use it as an inspiration.
Analyzing log files
As with all analysis, don’t blindly dive into numbers with a hope they will reveal all the secrets. If you do that, you will get overwhelmed by the number of insights and small things to fix. You may easily overlook bigger problems.
Ask questions and find answers. This usually leads to more questions so repeat the process a few times.
Not sure which questions to start with? Let’s look at a handful of examples:
- What search engines crawl your website? If you are an international business, you want to make sure that not only Google crawls your site. Don’t forget about Baidu (China), Yandex (Russia), Naver (South Korea), and Seznam (Czech Republic).
- What URLs are crawled most often? If a URL is often crawled, it may indicate that search engines consider that URL as important. See if any weird or unimportant URLs are often crawled.
- Which content types are crawled most often? Step back and look for bigger patterns. If product pages are crawled more often than category pages, it may be an indication of issues with internal linking.
- Which status codes are returned? You don’t want search engines to be hitting 404s, unless it’s intentional. Discover how often your servers have performance issues by looking at spikes with 500 status codes.
Returning 404s to search engines is usually undesirable. Generate a list of URLs that return 404 and sort that list by the number of requests (DESC).
| Path | # of requests |
| --- | --- |
You don’t have to redirect all 404s, that’s very time consuming especially for large sites, but this list helps to prioritize. It’s a good idea to review and redirect top 50 URLs from the list once in a while.
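Assuming you’ve already parsed your logs into (path, status) pairs, a few lines of Python produce that prioritized 404 list (the sample requests here are made up for illustration):

```python
from collections import Counter

# Hypothetical parsed requests: (path, status_code) tuples from your log file
requests = [
    ('/old-product/', 404),
    ('/contact/', 200),
    ('/old-product/', 404),
    ('/legacy-category/', 404),
]

# Count 404s per path
not_found = Counter(path for path, status in requests if status == 404)

# Top 404 paths, most-requested first
for path, count in not_found.most_common(50):
    print(path, count)
```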
Everyone has heard about mobile-first indexing. Google has migrated some websites to mobile-first indexing, and you should know if your site is one of them.
Google started sending notifications via Google Search Console, but those notifications come in waves and are delayed. This means your site might have been migrated and you have no idea. But no worries, log files can save the day.
- If about 80% of Googlebot requests come from the desktop crawler, the site hasn’t been migrated.
- If about 80% of Googlebot requests come from the smartphone crawler, the site has been migrated.
Segmenting or grouping data from logs helps you discover bigger problems, such as a navigation that isn’t crawlable or a certain type of content that is hidden from search engines.
You can group requests by page type and compare PDPs (product detail pages) versus PLPs (product listing pages), or group them by language (English pages vs. French pages) to get a better understanding of bigger trends.
Grouping URLs is simple if you have URLs that provide key information.
A URL such as /en/pdp/adidas/shoes/stan-smith tells you the language (English), the type of page (PDP), the brand (Adidas), the type of product (shoe), and the product name (Stan Smith). You can use all that information to segment your data set with a little help from regular expressions. If you are not familiar with RegEx yet, stop reading this article and come back once you know them. Seriously!
Here are three great sources to start with:
- Regular Expressions for Regular Joes by Paul Shapiro
- Principles of Regular Expressions for an SEO by Tomasz Rudzki
- Regular expressions and XPaths alternatives every SEO needs by Tobias Willmann
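With a descriptive URL structure, a single regular expression segments requests by language, page type, and so on. A sketch, assuming the hypothetical /en/pdp/adidas/shoes/stan-smith pattern:

```python
import re

# Hypothetical pattern: /<language>/<page type>/<brand>/<product type>/<product name>
URL_PATTERN = re.compile(
    r'^/(?P<lang>[a-z]{2})'
    r'/(?P<page_type>pdp|plp)'
    r'/(?P<brand>[^/]+)'
    r'/(?P<product_type>[^/]+)'
    r'/(?P<name>[^/]+)/?$'
)

m = URL_PATTERN.match('/en/pdp/adidas/shoes/stan-smith')
print(m.group('lang'), m.group('page_type'))  # en pdp
```

Apply the pattern to the request path of every log row and pivot on the captured groups to compare segments.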
However, it may not be that simple; you may encounter opaque URLs built from IDs and parameters that reveal nothing about the page. Is this your case? Don’t feel bad.
There are ways to get around this. Be creative or just read the next tip!
Merge log files with other data sources
This one is big!
Log files give you valuable insights but merging them with other data sources unlocks another level of insights. Merge logs with:
- Google Analytics data – Do bots crawl your money making pages? If not, why?
- SEO crawls – Do bots crawl only URLs that are supposed to be indexed? What about those with noindex or non-canonical URLs? You may be surprised to see over 50% of Googlebot requests going to URLs that are not meant to be indexed.
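That second merge can be sketched with pandas, assuming you’ve exported Googlebot hits per URL from your logs and indexability per URL from an SEO crawler (the URLs and numbers below are made up):

```python
import pandas as pd

# Hypothetical exports: Googlebot hits per URL (logs) and indexability (SEO crawl)
logs = pd.DataFrame({
    'url': ['/a/', '/b/', '/c/'],
    'googlebot_hits': [120, 45, 300],
})
crawl = pd.DataFrame({
    'url': ['/a/', '/b/', '/c/'],
    'indexable': [True, False, False],
})

# Join the two data sources on the URL
merged = logs.merge(crawl, on='url', how='left')

# Share of Googlebot requests wasted on non-indexable URLs
wasted = merged.loc[~merged['indexable'], 'googlebot_hits'].sum()
share = wasted / merged['googlebot_hits'].sum()
print(f'{share:.0%} of Googlebot requests hit non-indexable URLs')  # 74%
```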
Think of logs as a source of information
Don’t see logs as an SEO tool. See them as a source of information. Adopting this mindset allows you to use them in different scenarios, such as debugging Google Analytics.
After changing one simple thing on the homepage of a website, organic traffic to the homepage dropped significantly. Thanks to log files, we found an issue and applied a hotfix.
Spoiler alert: The traffic didn’t drop but was only misclassified as Direct instead of “Organic Search”. You can read the entire story with a detailed explanation and screenshots on my blog.
Analyzing logs is not as difficult as you may think. It’s easy and extremely valuable. Logs tell you how search engines see your website. They tell you the full story.
You can unlock the next level of insights by merging logs with your Google Analytics or data from an SEO crawler of your choice.
Log files are not only useful for SEO; they may also be used to debug Google Analytics. Think of them as a source of information.
STAT City Crawl 2018 Vancouver