Advanced Web Scraping

In one of my previous posts I showed you how to do basic web scraping with a combination of Google Sheets, the importxml() formula and XPath.

As soon as your project gets a bit bigger and you use more and more importxml() queries in your sheet, you will eventually hit Google’s limit on the formula. The content just won’t load anymore.

Enter Nokogiri

Luckily there are plenty of other options for web scraping, and you’ll be able to reuse your knowledge of XPath. Nokogiri is a so-called Ruby “gem” (a library) that you use in a script you run from the terminal.

An example of what can be done with Nokogiri & Ruby

  • Scrape multiple elements of a URL as columns in a CSV
  • Scrape multiple (similar) URLs at once
  • Save the whole dataset to one CSV

An image tells a thousand words

The scraper in action

Pretty satisfying to watch this script do its job.

Not sharing the code

Unfortunately, scraping a website is an extremely “grey” area of the law – see this great writeup on the topic. That’s why I won’t share the code here. Think carefully about the consequences of your actions and ask the content owner for permission before you do anything.

Why write this post without an exact guide?

Scraping is a great thing to have in your toolbelt as a technical marketer, especially as you will probably find yourself in a situation where you want to scrape your own content for one reason or another. Without knowing about (and searching for) other options, you’ll never question and improve your own approach.

How do you get started?

Check out this awesome tutorial. It’s very detailed and allowed me to finish my project.