Web Scraping with R

Web scraping pulls data from web pages when an API is unavailable.  Imagine copying the data by hand (a horrible chore); web scraping is essentially that, automated.

I’ve wanted to get my head around this for a while and see if modern sites are structured in ways to make this easier.

When I was researching wine types for another project an opportunity came up.

Wine.com has a stellar dataset and they actually have a good API but I’ll ignore that for this demo.

The packages we can use to make our life easier are:

rvest   – this is from Hadley Wickham and makes scraping web data simple.

stringi – this comes from Marek Gagolewski, Bartek Tartanus et al and gives us many functions to manipulate text from any locale.  In Excel we have functions like Trim(), Left(), Right(), Mid() – well, after reading the docs for stringi my mind was truly blown.

There’s one other tool that’s recommended, Selectorgadget.

The first thing to do when scraping a page is to browse to the page.

http://www.wine.com/v6/wineshop/list.aspx?N=7155+2096&pagelength=25

We can then use the Selectorgadget tool to detect the right CSS selector. In the screenshot below I’ve clicked over the abstract and the class .productAbstract has been selected.  We know it’s the right selector as the other items in the list are highlighted.  A little patience is required here as there can be different selectors in the CSS which can give erroneous data.

[Screenshot: Selectorgadget with the .productAbstract class selected]

The other selector we need is .listProductName.

With this information we’re good to go.

The script starts with loading the libraries and setting the URL we found earlier. I’ve increased the pagelength for illustration purposes.

library(rvest)
library(stringi)

url <- "http://www.wine.com/v6/wineshop/list.aspx?N=7155+2096&pagelength=100"

To pull the web page into R’s memory we simply use read_html() (older versions of rvest called this function html()):

page <- read_html(url)

We can then use the selector we found earlier to extract the data we’re interested in.

product_name <- page %>%
  html_nodes(".listProductName") %>%
  html_text()

product_name
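To try the selector step without depending on the live site, here is a minimal sketch that parses a small, made-up HTML fragment.  The markup, wine names and abstracts are invented for illustration; only the class names .listProductName and .productAbstract come from the page above.

```r
library(rvest)

# Hypothetical HTML standing in for the real product list (invented data)
sample_html <- paste0(
  '<div>',
  '<a class="listProductName">Example Cabernet 2012</a>',
  '<span class="productAbstract">Cabernet Sauvignon from Napa Valley, California</span>',
  '<a class="listProductName">Example Malbec 2013</a>',
  '<span class="productAbstract">Malbec from Mendoza, Argentina</span>',
  '</div>'
)

# Parse the string just as we would parse a downloaded page
sample_page <- read_html(sample_html)

# Each CSS class selects one "column" of the listing
sample_names <- sample_page %>%
  html_nodes(".listProductName") %>%
  html_text()

sample_abstracts <- sample_page %>%
  html_nodes(".productAbstract") %>%
  html_text()

sample_names
sample_abstracts
```

Both vectors come back in page order, which is what lets us bind them together as columns later on.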

We can do a similar thing to extract the Abstract.

abstract <- page %>%
  html_nodes(".productAbstract") %>%
  html_text()

abstract

That is not helpful at all.  We have \r\n and too much space.  Fear not, we can use stringi to clean up the data.  When I look at a problem like this I break it down into smaller bits and solve these individually.  So in this spirit we’ll handle the \r\n problem first.

abstract <- stri_replace_all_regex(abstract, "\r\n", "")

This simply replaces every “\r\n” with an empty string, and we assign the result back to abstract so the change sticks.  Of course with Regex there are many ways to skin a cat, so feel free to explore the alternatives.

We have spaces left to eliminate.  Notice I’m searching for a double space, as I don’t want to remove the single spaces within the abstract itself.

abstract <- stri_replace_all_regex(abstract, "  ", "")
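As a quick check of the two clean-up steps, here is a small sketch run on a made-up string that mimics a scraped abstract.  The sample text and the variable names are assumptions for illustration.  It also shows one possible alternative, collapsing every run of whitespace to a single space and then trimming with stri_trim_both(), which copes with padding that has an odd number of spaces.

```r
library(stringi)

# Invented sample: line breaks plus padding around the useful text
raw_abstract <- "\r\n    Malbec from Mendoza, Argentina\r\n  "

# Step 1: drop the carriage return / line feed pairs
cleaned <- stri_replace_all_regex(raw_abstract, "\r\n", "")

# Step 2: drop double spaces, leaving single spaces inside the text alone
cleaned <- stri_replace_all_regex(cleaned, "  ", "")
cleaned

# Alternative sketch: collapse all whitespace runs, then trim both ends
alt_cleaned <- stri_trim_both(
  stri_replace_all_regex(raw_abstract, "\\s+", " ")
)
alt_cleaned
```

On this sample both approaches give the same text.  The double-space approach would leave a stray space behind if the padding had an odd length, which is why the trimming variant may be the safer choice on messier pages.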

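Each abstract reads “Variety from Origin”, so we split it on “ from ” to get two columns.  We then bind the product and abstract columns together, add column names and export to CSV.  Job done.

abstract <- stri_split_fixed(abstract, " from ", simplify = TRUE)
products <- cbind(product_name, abstract)
colnames(products) <- c("ProductName", "Variety", "Origin")
write.csv(products, file = "Product.csv")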

[Screenshot: the final data set]
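The complete listing splits each abstract on “ from ” with stri_split_fixed() so that it fills the Variety and Origin columns.  Here is a minimal sketch of that step on two invented abstracts; the sample strings and variable names are assumptions.

```r
library(stringi)

# Invented abstracts in the "Variety from Origin" shape
sample_abstracts <- c(
  "Cabernet Sauvignon from Napa Valley, California",
  "Malbec from Mendoza, Argentina"
)

# simplify = TRUE returns a character matrix: one row per abstract,
# one column per piece
parts <- stri_split_fixed(sample_abstracts, " from ", simplify = TRUE)
colnames(parts) <- c("Variety", "Origin")
parts
```

If an abstract could contain “ from ” more than once, passing n = 2 to stri_split_fixed() should keep the result to two columns by splitting only at the first match.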

Complete code is listed below

library(rvest)
library(stringi)

url <- "http://www.wine.com/v6/wineshop/list.aspx?N=7155+2096&pagelength=300"

page <- read_html(url)

product_name <- page %>%
  html_nodes(".listProductName") %>%
  html_text()

abstract <- page %>%
  html_nodes(".productAbstract") %>%
  html_text()

abstract <- stri_replace_all_regex(abstract, "\r\n", "") %>%
  stri_replace_all_regex("  ", "") %>%
  stri_split_fixed(" from ", simplify = TRUE)

products <- cbind(product_name,abstract)

colnames(products) <- c("ProductName", "Variety", "Origin")
write.csv(products,file="Product.csv")

16 lines of code.  In summary, we’ve scraped the page and selected the pertinent data and used basic Regex to clean up.    Data cleaning isn’t so bad after all 🙂


The Ultimate Tour

Here’s an example of using web services to source data and processing from third parties.  I queried Yelp for top venues, used Google Maps to retrieve a distance matrix, and then used the R package TSP to calculate the optimal route.  Finally, I plotted the route on Google Maps.
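To give a feel for the routing step, here is a minimal sketch using the TSP package on a made-up distance matrix.  The venue labels and travel times are invented, and the matrix is assumed to be symmetric; this is not the code from the talk.  Note that solve_TSP() uses a heuristic, so the tour it returns is a good route rather than a guaranteed optimum.

```r
library(TSP)

# Hypothetical symmetric travel times (minutes) between four venues
venues <- c("A", "B", "C", "D")
travel <- matrix(
  c( 0, 10, 15, 20,
    10,  0, 35, 25,
    15, 35,  0, 30,
    20, 25, 30,  0),
  nrow = 4, byrow = TRUE,
  dimnames = list(venues, venues)
)

# Build the TSP object from the distances and solve it heuristically
tsp  <- TSP(as.dist(travel))
tour <- solve_TSP(tsp)

labels(tour)       # venues in visiting order
tour_length(tour)  # total travel time of the round trip
```

Real driving times are often not symmetric; the package also offers an ATSP class for that case.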

I presented this at ManchesterR in Feb-15.  The code walk-through can be found on RPubs.


Predictive Analytics slides from Nov-14 SQL Sat London

Here are my slides from the Predictive Analytics talk I gave in Nov 14 at the top of Smithfield Meat Market in London.  What a memorable day that was.


Data Analysis prep with Linux Mint & R

When I embarked on learning R last year I decided to jump wholeheartedly into the open source stack with Linux, using a virtual machine running Linux Mint.

There were various reasons for this, such as productivity with the terminal and easier access to Linux data sources, not to mention that it’s much easier to deploy R scripts with Linux.

During my Data Analysis adventures I found I had to customise Linux to make it more suitable for Data Analysis with R.

I had to rebuild the VM this week, don’t ask! Of course I had to scratch around for the configs again.

Here are the steps I used to get Mint 17.1/Ubuntu 14.04 LTS primed for Data Analysis with R.

From the terminal:

1. Install the latest version of R, Git, SQLite and libraries that I’ve found are required for data & web munging. Be sure to change the URL to your local mirror found on this page.

sudo add-apt-repository "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/"
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install r-base r-base-dev r-cran-rodbc libxml2-dev libcurl4-openssl-dev git-core sqlite3

2. Set up Git

git config --global user.name "Your Name"
git config --global user.email "your@email.com"

3. Install RStudio. You’ll have to change the URL for later versions if required.

cd ~
wget http://download1.rstudio.org/rstudio-0.98.1091-amd64.deb
sudo dpkg -i rstudio-0.98.1091-amd64.deb
sudo rm rstudio-0.98.1091-amd64.deb

This should get you going. I’ll add to this if I discover any new config on my travels.


Johns Hopkins and Coursera MOOC review : Data Science

When I started learning R as a means to boost my analytical skills I was hit by the steep learning curve despite having half-decent coding skills.  I needed to dig deep.

After a couple of books and blogs I was off the starting line but it soon became clear that this was going to take more effort.
