top of page
Writer's pictureRobin Rozhon

How to Extract Data from Wikipedia and Wikidata

Wikipedia is a trusted source of information that ranks for the majority of two-word search queries, so looking at its data can reveal trends and popularity. With this post, you are a few clicks away from creating a simple data mining tool. This handy add-on gives you the power to mine Wikipedia and Wikidata.



There may be reasons not to work with Google Sheets, and that’s ok. Just extract Wikipedia data via Google Spreadsheets, download all the data from the sheet to your laptop, and open it in Excel or LibreOffice.


Google AdWords Keyword Planner suggests keywords with the commercial or transactional intent, unless you dig deep and use highly specific keywords in the input. Wikipedia can give you a large list of related topics. That list can reveal topics you would never think relate to your area of interest and that can fuel your content strategy.


If you’re creating keyword research or looking for new content ideas, Wikipedia is there to help you and reveal additional insights to your targeted niche.


The best thing is it’s created by humans to humans. Editors use natural language and are interested in areas they contribute to. Just reading articles give you a sense of language and vocabulary that your targeted audience uses.


In summer 2015, there were reports about a decline in organic traffic, later confirmed by Jimmy Wales, the co-founder of Wikipedia, as a long-term issue, rather than a sudden drop.


However, it seems like the trend has changed and the visibility grew again.



Wikipedia’s organic traffic according to SEMrush for desktop

How Much Data Does Wikipedia Store?

The data size of all pages with complete edit history in XML format and compressed was about 100 GB in June 2015. Uncompressed size was 10 TB.


By March 6th, 2017, 5,350,828 English articles are managed by 1,269 administrators and 30,381,154 users. Wikipedia contains 41,612,715 pages.


The entire Wikipedia database can be downloaded via Wikipedia dumps in SQL and XML.


Wikipedia Tools for Google Spreadsheets

This add-on allows you to extract data, not only from Wikipedia, but also from Wikidata. Wikidata is a free knowledge database that stores data in a machine-friendly format. It’s a collection of structured data that may be used by other projects.


Note: Google Knowledge graph was powered, in part, by Freebase, but in 2015, the data from Freebase was migrated to Wikidata.


This add-on extends Google Sheets with 21 new custom functions through Google Apps Scripts. The functions are written in JavaScript and extract data via three APIs:


  • Wikipedia API

  • Pageviews API

  • Wikidata API


A paper with use case descriptions was published by Thomas Steiner, a solutions engineer at Google, on February 8th, 2016.


The opportunities are endless and only limited by your creativity, so I’ll mention only some functions in which I see the biggest potential for SEO and marketing.


Unfortunately, all the formulas return only the first 500 values based on alphabetical order.


WIKIDATAFACTS

Returns Wikidata facts for a Wikipedia article.


Using the formula is straightforward.


=WIKIDATAFACTS(“article”)

The name of the Wikipedia article is case sensitive; therefore, use uppercase if you search for acronyms, such as NASA, IBM, etc.


=WIKIDATAFACTS(“en:NASA”)

Another option to get facts is using the Wikidata item’s “QID”, a unique identificator for each item.


=WIKIDATAFACTS(“Q23548”)

There are a few ways to find “QID” for any Wikidata item:


  • In a link connecting the Wikipedia page to the Wikidata item (at the end of the Wikidata URL)




  • Right after the main headline on a Wikidata page





WIKIINBOUNDLINKS

Returns Wikipedia inbound links for a Wikipedia article.


=WIKIINBOUNDLINKS(“language:article”)

Language represents one of the 291 Wikipedia languages. If you don’t specify the language, English results will be returned by default.


=WIKIINBOUNDLINKS(“en:NASA”)
=WIKIINBOUNDLINKS(“de:NASA”)

When to use it: This formula is handy for initial research on a topic you are unfamiliar with. It instantly gives you a high overview of articles related to your topic.


WIKILINKSSEARCH

Returns Wikipedia articles with a link that matches a link pattern.


=WIKILINKSEARCH(“language:linkPatern”,”protocol”)

Specifying the protocol is optional, and it’s set to “http” by default.


If you want to get links to your site, including all your subdomains, use “*.site.com”.


=WIKILINKSEARCH(“en:*.apple.com”,”http”)
=WIKILINKSEARCH(“en:*.apple.com”,”https”)

Use the formula for both protocols. As we know, many websites switched from HTTP to HTTPS in the last couple of years, so you can miss a lot of links. In the example above, “http” returned 224 links and “https” returned 369 links.


When to use it: Have you ever wondered if your website has any links from Wikipedia? With this formula, you can find out in a few seconds. This can reveal interesting connections. Why does a Wikipedia article about Coldplay link to apple.com?


WIKIOUTBOUNDLINKS

Returns Wikipedia outbound links for a Wikipedia article.


=WIKIOUTBOUNDLINKS(“language:article”)

This is a very similar formula to WIKIINBOUNDLINKS I mentioned above.


=WIKIOUTBOUNDLINKS(“en:NASA”)

When to use it: Great for discovering connections about entities and associated information. You may be writing an article about mountain biking. Simply using this formula, you can get a list of mountain bike types, events, related sports, or famous riders. Every item on that list is related to that topic, so you can use them in your article or save for next articles.


WIKIPAGEEDITS

Returns Wikipedia page edits statistics for a Wikipedia article.


=WIKIPAGEEDITS(“article”, “start date”, “end date”)

The start date and end data are optional, and the function will return data for the last 30 days if you don’t use them. The dates can be also specified by using TODAY() function.


=WIKIPAGEEDITS(“Mountain bike” ,TODAY()-365,TODAY())

When to use it: Does your business have a Wikipedia page? If so, you can monitor changes on your page using this formula.


Also, if you see a lot of edits in the last months, you may assume people are interested in that topic and looking for more information. Identify missing info in an article and cover it on your site.


WIKIPAGEVIEWS

Returns Wikipedia pageviews statistics for a Wikipedia article.


=WIKIPAGEVIEWS(“article”, “start date”, “end date”)

This is my favorite formula. It returns the number of pageviews per day, which gives you the ability to find trends.


=WIKIPAGEVIEWS(“Glamping”,TODAY()-730,TODAY())

When to use it: Seek spikes in pageviews, find out the reason it triggered interest, and try to replicate it. If you’re unable to replicate it, you can produce content related to that trigger.


I requested data for “Glamping” and saw an enormous spike on December 19, 2016, an increase of 7600% compared to the previous day (254 vs. 19,558 pageviews).



Pageviews for Wikipedia article about glamping (June 30th, 2015 – March 20th, 2017)

If your business has anything to do with glamping, you have to know what the hell happened that day. For a spike of interest like that, it’s easy to find the reason with a little help from Google.


Google Search for Glamping

Search for “glamping” in Google and set the date range from the day of the spike to a day after (sometimes, it takes a bit to write and publish the story).


The reason? Oprah suggested to Michelle Obama to go glamping together on her show.


WIKISEARCH

Returns Wikipedia article results for a query.


=WIKISEARCH(“language:article”, “did you mean suggestions”)

“Did you mean suggestions” is a Boolean, so use TRUE or FALSE. If it’s not specified, the default is set to FALSE.


When to use it: This is another function that helps discover similar topics.


Conclusion

There are more functions, such as WIKISYNONYMS and WIKISUBCATEGORIES, not covered but may be leveraged for SEO and marketing; however, I rarely found any useful insights in returned data.


Let me know if you have any creative ways to use the functions. Don’t limit yourself, because very interesting things happen, once you chain these formulas.




bottom of page