RvsPython #1: Webscraping

Webscraping is a powerful tool for efficient data collection, and there are ways to do it in both R and Python. I’ve built the same scraper in both languages; it gathers information about all the White House briefings available on www.whitehouse.gov (don’t worry guys, it’s legal).

This is based on what I learned from FreeCodeCamp about webscraping in Python (here’s the link: https://www.youtube.com/watch?v=87Gx3U0BDlo).

This blog is about approaches I naturally used with R’s rvest package and Python’s BeautifulSoup library.

Here are the two versions of the code I use to scrape all the briefings.

This webscraper extracts:

1) The date of the briefing
2) The title of the briefing
3) The URL of the briefing
4) The issue type

and puts them in a data frame.
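For reference, the target structure looks something like this (the row below is a made-up placeholder, not a real briefing):

```python
import pandas as pd

# Hypothetical sample row showing the shape of the scraper's output
df = pd.DataFrame(
    [["January 20, 2021",
      "Example Briefing Title",
      "https://www.whitehouse.gov/briefings-statements/example/",
      "Statements & Releases"]],
    columns=["Date", "Title", "URL", "Issue Type"],
)
```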

The differences between the way I did this in Python vs R:

Python

(a) I grabbed the data using the lxml parser
(b) Parsing was done with the HTML classes (and cleaned with a small amount of regex)
(c) I used for loops
(d) I had to import other libraries besides bs4
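To illustrate (b): the idea is to locate elements by their HTML class and then strip whitespace with regex. Here’s a minimal sketch against a made-up fragment (the class names mirror the ones on whitehouse.gov, but the content is invented):

```python
import re
from bs4 import BeautifulSoup

# A made-up fragment shaped like one briefing block on the listing page
html = """
<div class="briefing-statement__content">
  <p>\n\tStatements &amp; Releases\n</p>
  <h2><a href="https://www.whitehouse.gov/briefings-statements/example/">Example Briefing</a></h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
block = soup.find("div", {"class": "briefing-statement__content"})

# Pull the issue type out of the <p> tag, then clean tabs/newlines with regex
issue_type = re.sub(r"[\t\n]", "", block.find("p").get_text())
title = block.find("a").get_text()
url = block.find("a").attrs["href"]
```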

R

(a) I used a CSS selector to get the raw data.
(b) The data was parsed using good ol’ regular expressions.
(c) I used sapply().
(d) I just used rvest and base R.

This is a comparison between how I learned to webscrape in Python vs. how I learned to do it in R. Let’s jump in and see which one was faster!

Python Version with BeautifulSoup

# A simple webscraper providing a dataset of all White House briefings
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import lxml  # parser backend used by BeautifulSoup below; imported to fail fast if it's missing


def get_whitehouse_breifings():
    # Generalize to all pages

    orig_link = requests.get("https://www.whitehouse.gov/briefings-statements/")

    orig_content = orig_link.content

    sp = BeautifulSoup(orig_content, 'lxml')

    pages = sp.find_all('a', {'class': 'page-numbers'})

    the_pages = []

    for pg in pages:
        the_pages.append(pg.get_text())

    # Now make set of links

    the_links = []

    # max() on strings compares lexicographically ('9' > '10'), so cast to int first
    page_numbers = [int(pg) for pg in the_pages if pg.strip().isdigit()]

    for num in range(1, max(page_numbers) + 1):
        the_links.append('https://www.whitehouse.gov/briefings-statements/' + 'page/' + str(num) + '/')

    dat = pd.DataFrame()
    for link in the_links:
        link_content = requests.get(link)
        link_content = link_content.content
        sp = BeautifulSoup(link_content, 'lxml')
        h2_links = sp.find_all('h2')
        date_links = sp.find_all('p', {"class": "meta__date"})
        breif_links = sp.find_all('div', {"class": "briefing-statement__content"})

        title = []
        urls = []
        date = []
        breifing_type = []
        for i in h2_links:
            a_tag = i.find('a')
            urls.append(a_tag.attrs['href'])
            title.append(a_tag.get_text())
        for j in date_links:
            d_tag = j.find('time')
            date.append(d_tag.get_text())
        for k in breif_links:
            b_tag = k.find('p')
            b_tag = b_tag.get_text()
            b_tag = re.sub('\\t', '', b_tag)
            b_tag = re.sub('\\n', '', b_tag)
            breifing_type.append(b_tag)

        # Name the columns at construction; DataFrame.rename() expects
        # {old: new} pairs and returns a copy, so the original rename call did nothing
        dt = pd.DataFrame(list(zip(date, title, urls, breifing_type)),
                          columns=["Date", "Title", "URL", "Issue Type"])

        dat = pd.concat([dat, dt])

    return dat

Running the code: Python’s time

import time
start_time = time.time()

pdt = get_whitehouse_breifings()


# Time taken to run code
print("--- %s seconds ---" % (time.time() - start_time))

## --- 162.8423991203308 seconds ---
 

R Version with rvest

library(rvest)

get_whitehouse_breifings <- function() {
  # Helper functions: pull text / href attributes for a CSS selector
  pipeit <- function(url, code) {
    read_html(url) %>% html_nodes(code) %>% html_text()
  }

  pipelink <- function(url, code) {
    read_html(url) %>% html_nodes(code) %>% html_attr("href")
  }

  first_link <- "https://www.whitehouse.gov/briefings-statements/"

  # Get total number of pages
  pages <- pipeit(first_link, ".page-numbers")
  pages <- as.numeric(pages[length(pages)])

  # Build all the page links
  all_pages <- c()
  for (i in 1:pages) {
    all_pages[i] <- paste0(first_link, "page/", i, "/")
  }

  urls <- unname(sapply(all_pages, function(x) {
    pipelink(x, ".briefing-statement__title a")
  })) %>% unlist()

  breifing_content <- unname(sapply(all_pages, function(x) {
    pipeit(x, ".briefing-statement__content")
  })) %>% unlist()

  # Data wrangling: mark tabs/newlines, split on them, drop empty pieces
  test <- unname(sapply(breifing_content, function(x) gsub("\\n|\\t", "_", x)))
  test <- unname(sapply(test, function(x) strsplit(x, "_")))
  test <- unname(sapply(test, function(x) x[x != ""]))

  breifing_type <- unname(sapply(test, function(x) x[1])) %>% unlist()
  title <- unname(sapply(test, function(x) x[2])) %>% unlist()
  dat <- unname(sapply(test, function(x) x[length(x)])) %>% unlist()

  dt <- data.frame("Date" = dat, "Title" = title, "URL" = urls, "Issue Type" = breifing_type)

  dt
}

Running the code: R’s time

##    user  system elapsed 
##   16.77    4.22  415.95

Analysis and Conclusion:

On my machine Python was waaaaay faster than R. This was primarily because the function I wrote in R had to go over the website a second time to extract links. Could it be sped up if I wrote the code extracting text and links in one step? Very likely. But I would have to change the approach to be similar to how I did it in Python.
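For what it’s worth, the single-pass idea in Python looks like this: grab each content block once, and pull both the text and the link out of the same node (again, a toy fragment rather than the live site):

```python
from bs4 import BeautifulSoup

# Toy fragment standing in for one page of the live listing
html = ('<div class="briefing-statement__content">'
        '<h2><a href="/briefings-statements/example/">Example Briefing</a></h2>'
        '</div>')

soup = BeautifulSoup(html, "html.parser")

rows = []
for block in soup.find_all("div", {"class": "briefing-statement__content"}):
    a_tag = block.find("a")
    # Title and URL come from the same node, so each page is parsed only once
    rows.append((a_tag.get_text(), a_tag.attrs["href"]))
```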

For me, rvest seems great for “quick and dirty” code (point and click with a CSS selector, put it in a function, iterate across pages; repeat for the next field). BeautifulSoup seems better suited to more methodical scraping, since the approach is naturally more HTML-heavy.

Python requires you to reference the library every time you call a function from it, which, as a native R user, I find frustrating compared to just attaching the library to the script.

In R you have to play with the data structures (from lists to vectors) to get the data coerced into a data frame. I didn’t need to do any of that in Python.
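In Python the coercion is a one-liner: zip the parallel lists and hand them to pandas (the values here are samples, not scraped data):

```python
import pandas as pd

# Parallel lists, as produced inside the scraper's loop (values are made up)
date = ["January 20, 2021"]
title = ["Example Briefing"]
urls = ["https://www.whitehouse.gov/briefings-statements/example/"]

# zip() pairs the lists row-wise; pandas handles the rest
dt = pd.DataFrame(list(zip(date, title, urls)),
                  columns=["Date", "Title", "URL"])
```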

I’m sure there’s more to write about these libraries (and about better ways to do this in both languages), but I’m happy that I’m acquainted with them both!

Let me know what you think!

P.S. This was uploaded with the RWordPress package. Check out my LinkedIn post on the topic here.

7 thoughts on “RvsPython #1: Webscraping”

  1. It’s interesting that you chose BeautifulSoup; I wouldn’t consider that the equivalent of rvest! For less code I would at least check out https://scrapy.org/, you might get speed and elegance 🙂 Thanks for the post!


  2. A more elegant way to do it in R:

    library(rvest)
    library(tidyverse)

    pages <- read_html("https://www.whitehouse.gov/briefings-statements/") %>%
    html_nodes(".page-numbers") %>%
    html_text()

    pages <- as.numeric(pages[length(pages)])

    dt <- tibble(pages = paste0("https://www.whitehouse.gov/briefings-statements/page/", 1:pages, "/")) %>%
    slice(1:2) %>%
    mutate(html = map(pages, . %>% read_html()),
    URL = map(html, . %>% html_nodes(".briefing-statement__title a") %>% html_attr("href")),
    briefing_type = map(html, . %>% html_nodes(".briefing-statement__type") %>% html_text()),
    Issue_type = map(html, . %>% html_nodes(".briefing-statement__content") %>% html_node(".issue-flag--left") %>% html_text()),
    Date = map(html, . %>% html_nodes("time") %>% html_text()),
    Title = map(html, . %>% html_nodes("h2 a") %>% html_text())) %>%
    unnest(cols = c(URL, Title, Issue_type, briefing_type, Date)) %>%
    mutate_at(vars(briefing_type, Issue_type), ~str_remove_all(., "[\t\n]")) %>%
    select(Date, Title, URL, briefing_type, Issue_type)

    glimpse(dt)


    1. The challenge for me is that, as efficient as this may be, I struggle (presently) to see it intuitively. Maybe it’s because I learned this stuff “base R” style, but I prefer to stick to for loops and apply() functions.

      Thank you for sharing! I wonder how it times against the Python script!

