Web Scraping: Tables, PDFs, OCR
Cleo O’Brien-Udry
Yale University
25 May 2020
Cleo O’Brien-Udry (Yale University)        Web Scraping    25 May 2020   1 / 11
Plan
   1   Short review of HTML code and basic web-scraping techniques
   2   Scraping tables from a webpage
   3   Importing PDFs into R
   4   Optical character recognition (pulling text from images into R)
Tools
       RStudio, with the R packages rvest, pdftools, tesseract, magick,
       tidyverse, plyr, and data.table
       GitHub: script, slides, and additional resources
       (https://github.com/cobrienudry/webscrape)
       SelectorGadget Chrome extension
       (https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb?hl=en)
Quick review
Web scraping: extracting data from websites and storing it on your computer
(or on an external server)
   1   Find the web page
   2   Identify the location of the relevant data on the page
   3   Import the data into R
   4   Clean the data
   5   Repeat for each additional page
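The five steps above can be sketched in R with rvest. This is a minimal illustration, not the script from the GitHub repository: the inline HTML strings, the `p.turnout` selector, and the helper name `scrape_turnout` are all hypothetical stand-ins. For a live page you would pass a URL to `read_html()` and use a selector found with SelectorGadget.

```r
library(rvest)  # attaches read_html(), html_nodes(), html_text()

# Hypothetical helper: steps 2-4 for one page.
scrape_turnout <- function(source) {
  page <- read_html(source)                         # step 3: import into R
  raw  <- html_text(html_nodes(page, "p.turnout"))  # step 2: locate data by CSS selector
  as.numeric(gsub("[^0-9.]", "", raw))              # step 4: keep only digits and "."
}

# Step 1: find the pages. Inline HTML (made-up content) stands in for real URLs.
pages <- c(
  page_a = '<html><body><p class="turnout"> Turnout: 61.4% </p></body></html>',
  page_b = '<html><body><p class="turnout"> Turnout: 58.9% </p></body></html>'
)

# Step 5: repeat the same procedure for every page.
results <- lapply(pages, scrape_turnout)
print(results)
```

When looping over real URLs, it is usually polite to pause between requests, for example with `Sys.sleep(1)` inside the helper.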
Research Question: Global Voting Patterns
How have global levels of voting changed over the last 50 years? Which
countries show similar patterns of turnout and registration; which show
different patterns?
Data we need:
       Country voter turnout data
       Covariates (country development indicators, V-Dem indicators, etc.)
Source: https://www.idea.int/data-tools (International IDEA's data tools,
including its Voter Turnout Database).
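Turnout data of this kind is typically published as an HTML table, which is plan item 2. The sketch below shows how such a table could be pulled into a data frame with rvest. The table contents, column names, and country labels are invented for illustration and do not reflect the actual layout of the idea.int pages, which should be inspected before choosing a selector.

```r
library(rvest)

# Made-up table standing in for a page of turnout data.
page <- read_html('
  <table id="turnout">
    <tr><th>Country</th><th>Year</th><th>Voter Turnout</th></tr>
    <tr><td>Country A</td><td>2016</td><td>61.4%</td></tr>
    <tr><td>Country B</td><td>2018</td><td>58.9%</td></tr>
  </table>')

# html_table() on a set of <table> nodes returns a list with one table each.
tables  <- html_table(html_nodes(page, "table"))
turnout <- as.data.frame(tables[[1]])

# Clean: simpler column names, and percentages as numbers.
names(turnout) <- c("country", "year", "turnout_pct")
turnout$country     <- as.character(turnout$country)
turnout$turnout_pct <- as.numeric(sub("%", "", as.character(turnout$turnout_pct), fixed = TRUE))

str(turnout)
```

If a page holds several tables, a narrower selector such as `"table#turnout"` (hypothetical id) avoids having to pick the right list element by position.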
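For plan item 3 (importing PDFs into R), a minimal sketch using `pdftools::pdf_text()` is shown here. So that the example runs without downloading anything, it first writes a small PDF with R's own `pdf()` graphics device. That file and its contents are invented stand-ins for a real report.

```r
library(pdftools)

# Create a one-page PDF to read back in (stand-in for a downloaded file).
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path, width = 6, height = 4)
plot.new()
text(0.5, 0.6, "Country A 2016 61.4")
text(0.5, 0.4, "Country B 2018 58.9")
dev.off()

# pdf_text() returns a character vector with one element per page.
pages <- pdf_text(pdf_path)

# Split the first page into lines and drop the empty ones.
lines <- trimws(strsplit(pages[1], "\n")[[1]])
lines <- lines[nzchar(lines)]
print(lines)
```

`pdf_text()` only recovers text that is stored as text in the PDF. Scanned documents are images and need OCR instead.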
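For plan item 4 (optical character recognition), the sketch below combines magick and tesseract. The image is generated on the fly with made-up text so that the example is self-contained. A scanned page or screenshot would be read with `image_read()` instead. It assumes the English tesseract language data is installed, and OCR output is not guaranteed to match the source text exactly.

```r
library(magick)
library(tesseract)

# Build a simple test image containing text (stand-in for a scan).
img <- image_blank(width = 600, height = 120, color = "white")
img <- image_annotate(img, "Turnout 2016", size = 60,
                      color = "black", gravity = "center")

# Light preprocessing: grayscale images are often easier to recognise.
img <- image_convert(img, type = "Grayscale")

# Write the image to disk and run OCR on the file.
img_path <- tempfile(fileext = ".png")
image_write(img, path = img_path, format = "png")
ocr_text <- ocr(img_path, engine = tesseract("eng"))

cat(ocr_text)
```

For real scans, further magick steps such as `image_resize()` or `image_trim()` may improve recognition. The results should still be checked by hand.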
Other web scraping topics
       Python for web scraping
       Clicking links programmatically
       Running scrapers on remote servers
Thank you!
                                      cleo.obrien-udry@yale.edu