Basic Web Scraping With Google Sheets

This article serves two purposes. 1) To show you that there are basically no limits to what can be done with Google Sheets. 2) To give you a glimpse into web scraping.

The key takeaway for you should be that there is almost always a way of automating a task, just make sure not to fall into this trap.

What we’re doing

Basically we’re building a way to extract structured data from seemingly unstructured data (a website).

Example

  • I extract all links from a given website.
  • I filter the links to only look at specific links (in the example case: product pages)
  • From each given filtered URL I grab a specific item on the page and extract it’s value to the sheet. (in the example the product price of the item)

Check out the example sheet

I’m utilising a few advanced functions that we will partly come back to later:

  • importxml() in conjunction with XPath Query Language – this is the secret sauce to the example. This function enables us to import contents of a remote website (or other files)
  • regexmatch() is one of three regex-related functions google sheets offers which I’ll explain in a future lesson.
  • filter() is a basic function and should definitely be in your personal tool-stack.

With these functions in place we could in theory build a product feed from a website that we don’t even own. And with a bit of google apps script we could utilise these functions to build a price tracker for a website, pretty neat, right?

Update: I built another example – a reddit rank tracker. You’ll have copy it and set up a time trigger in Tools -> Script Editor. Also set the spreadsheet recalculation settings to „On change and every hour“ (File -> Spreadsheet settings -> Calculation)

Update 2: Looking for more advanced web scraping techniques. Check out my article on web scraping with Ruby and Nokogiri.