Beautifulsoup: Web Scraping With Python

This document provides an overview of using BeautifulSoup, a Python library for parsing and scraping HTML and XML documents. It begins with an introduction to BeautifulSoup and outlines the topics to be covered, including getting started, examples of parsing HTML tables and using regular expressions to extract data, and outputting scraped data. The document then covers HTML and XML basics, navigating the parse tree in BeautifulSoup, and functions for extracting elements like headers, titles and text. It also discusses using regular expressions to match strings and extract needed values, and outputting scraped data to CSV files.

Uploaded by

ZahidRafique
BeautifulSoup: Web Scraping with Python

Apr 9, 2020

BeautifulSoup
Roadmap

Uses: data types, examples...


Getting Started
downloading files with wget
BeautifulSoup: in depth example - election results table
Additional commands, approaches
PDFminer
(time permitting) additional examples

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Etiquette/Ethics

Similar rules of etiquette apply as Pablo mentioned:


Limit requests, protect privacy, play nice...

Data/Page formats on the web

HTML, HTML5 (<!DOCTYPE html>)

data formats: XML, JSON
PDF
APIs
other languages of the web: CSS, Java, PHP, ASP.NET...
(don’t forget existing datasets)

BeautifulSoup

General purpose, robust, works with broken tags


Parses html and xml, including fixing asymmetric tags, etc.
Returns unicode text strings
Alternatives: lxml (also parses html), Scrapy
Faster alternatives: ElementTree, SGMLParser (custom)
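
A minimal sketch of the "works with broken tags" point, assuming the bs4 package is installed and using Python's built-in "html.parser" backend. The markup is a made-up sample, not data from the election pages.

```python
from bs4 import BeautifulSoup

# Made-up, deliberately malformed markup: <p> and <b> are never closed.
broken_html = "<p>Results for <b>Sample County"

soup = BeautifulSoup(broken_html, "html.parser")

# The dangling tags are closed when the tree is built,
# so the document can still be navigated and printed.
bold_text = soup.b.get_text()
repaired_html = str(soup)

print(bold_text)
print(repaired_html)
```

The extracted text comes back as an ordinary Python (unicode) string.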

Installation

pip install beautifulsoup4 or
easy_install beautifulsoup4
See: http://www.crummy.com/software/BeautifulSoup/
On installing libraries:
http://docs.python.org/2/install/

HTML Table basics

<table> Defines a table


<th> Defines a header cell in a table
<tr> Defines a row in a table
<td> Defines a cell in a table

HTML Tables

<h4>Simple table:</h4>
<table>
<tr>
<td>[r1, c1] </td>
<td>[r1, c2] </td>
</tr>
<tr>
<td>[r2, c1]</td>
<td>[r2, c2]</td>
</tr>
</table>
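
As a rough illustration, the simple table above can be read into a list of rows like this. This is a sketch assuming bs4 is installed; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<h4>Simple table:</h4>
<table>
<tr>
<td>[r1, c1] </td>
<td>[r1, c2] </td>
</tr>
<tr>
<td>[r2, c1]</td>
<td>[r2, c2]</td>
</tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

rows = []
for tr in soup.find("table").find_all("tr"):      # one <tr> per table row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append(cells)                            # one list of cell texts per row

print(rows)
```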

Example: Election data from html table

election results spread across hundreds of pages


want to quickly put them in a usable format (e.g. csv)

Download relevant pages

website might change at any moment


ability to replicate research
limits page requests

I use wget (GNU), which can be called from within python


alternatively cURL may be better for macs, or scrapy

wget: note the --no-parent option!


import os

# The URL is cut off in the source slide; it is kept exactly as shown.
os.system("wget --convert-links --no-clobber \
--wait=4 \
--limit-rate=10K \
-r --no-parent http://www.necliberia.org/results2011/results")
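
One possible variation, sketched under assumptions: build the same wget call as an argument list, so it can be inspected before anything is downloaded. The helper name build_wget_command and the example.com URL are hypothetical; the flags are the ones shown on the slide.

```python
import shlex


def build_wget_command(url, wait_seconds=4, rate_limit="10K"):
    """Return the wget argument list for a polite, recursive download."""
    return [
        "wget",
        "--convert-links",              # rewrite links so pages work offline
        "--no-clobber",                 # do not re-download existing files
        f"--wait={wait_seconds}",       # pause between requests
        f"--limit-rate={rate_limit}",   # cap the bandwidth used
        "-r",                           # recursive download
        "--no-parent",                  # never climb above the start directory
        url,
    ]


cmd = build_wget_command("http://example.com/results/")  # placeholder URL
print(shlex.join(cmd))

# To actually run it (requires wget on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```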

Step one: view page source

Outline of Our Approach

1 identify the county and precinct number
2 get the table:
    identify the correct table
    put the rows into a list
    for each row, identify cells
    use regular expressions to identify the party & lastname
3 write a row to the csv file

Open a page

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
Print all: print(soup.prettify())
Print text: print(soup.get_text())
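
A small self-contained sketch of opening a page, assuming bs4 is installed. The html_doc string is a made-up stand-in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Made-up page; in practice html_doc would be read from a downloaded file.
html_doc = (
    "<html><head><title>Results</title></head>"
    "<body><p>Total votes: <b>120</b></p></body></html>"
)

soup = BeautifulSoup(html_doc, "html.parser")

pretty = soup.prettify()   # indented markup, one tag per line
text = soup.get_text()     # all text with the tags stripped out

print(pretty)
print(text)
```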

Navigating the Page Structure

some sites use div, others put everything in tables.

find_all

Finds all the Tag and NavigableString objects that match the criteria you give.
find table rows: find_all("tr")
e.g.:

for link in soup.find_all('a'):
    print(link.get('href'))

Let’s try it out

We’ll run through the code step-by-step

Regular Expressions

Allow precise and flexible matching of strings


precise: i.e. character-by-character (including spaces, etc)
flexible: specify a set of allowable characters, unknown
quantities
import re
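
A quick sketch of "precise" versus "flexible" matching with the re module. The sample strings are invented for illustration.

```python
import re

line = "Precinct 30017 County A"

# Precise: every character, including the single space, must match.
exact = re.search(r"Precinct 30017", line)
too_many_spaces = re.search(r"Precinct  30017", line)   # two spaces: no match

# Flexible: any run of digits is accepted after "Precinct ".
flexible = re.search(r"Precinct [0-9]+", "Precinct 42 County B")

print(exact.group())
print(too_many_spaces)
print(flexible.group())
```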

from xkcd

(Licensed under Creative Commons Attribution-NonCommercial 2.5 License.)


Regular Expressions: metacharacters

Metacharacters:
. ^ $ * + ? { } [ ] \ | ( )
escape metacharacters with backslash \
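
A short sketch of why escaping matters, using invented strings. re.escape is the standard-library helper that adds the backslashes for you.

```python
import re

# Unescaped, the dot is a metacharacter and matches any character.
unescaped = re.search(r"3.50", "3x50")

# Escaped, it only matches a literal dot.
escaped_miss = re.search(r"3\.50", "3x50")
escaped_hit = re.search(r"3\.50", "cost: 3.50")

# re.escape backslash-escapes the metacharacters in a literal string.
auto_escaped = re.escape("(valid)")

print(unescaped.group())
print(escaped_miss)
print(escaped_hit.group())
print(auto_escaped)
```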

Regular Expressions: Character class

brackets [ ] allow matching of any element they contain


[A-Z] matches a capital letter, [0-9] matches a digit
[a-z][0-9] matches a lowercase letter followed by a number
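
The character classes above can be tried directly with re.findall. The test string Xpr8r is borrowed from the next slide.

```python
import re

sample = "Xpr8r"

capitals = re.findall(r"[A-Z]", sample)              # capital letters
digits = re.findall(r"[0-9]", sample)                # digits
letter_digit = re.findall(r"[a-z][0-9]", sample)     # lowercase letter then digit

print(capitals, digits, letter_digit)
```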

Regular Expressions: Repeat

star * matches the previous item 0 or more times


plus + matches the previous item 1 or more times
[A-Za-z]* would match only the first 3 chars of Xpr8r
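
A small sketch of the difference between * and +, using re.match (which anchors at the start of the string):

```python
import re

# * allows zero or more letters: it stops at the first non-letter.
star_match = re.match(r"[A-Za-z]*", "Xpr8r").group()

# With no leading letters, * still succeeds with an empty match...
star_empty = re.match(r"[A-Za-z]*", "8r").group()

# ...while + needs at least one letter and therefore fails.
plus_fail = re.match(r"[A-Za-z]+", "8r")

print(repr(star_match), repr(star_empty), plus_fail)
```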

Regular Expressions: match anything

dot . matches any character except a newline \n (in Python, unless the re.DOTALL flag is set)
combined with * or + it is very hungry (greedy)!
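
A sketch of how "hungry" the dot can be, on an invented row of two cells. The non-greedy form .*? is not on the slide; it is shown here only as one common way to rein the match in.

```python
import re

html_row = "<td>a</td><td>b</td>"

# Greedy: .* runs to the LAST </td> it can find.
greedy = re.search(r"<td>.*</td>", html_row).group()

# Non-greedy: .*? stops at the FIRST </td>.
lazy = re.search(r"<td>.*?</td>", html_row).group()

# By default the dot does not match a newline.
across_lines = re.search(r"a.b", "a\nb")

print(greedy)
print(lazy)
print(across_lines)
```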

Regular Expressions: or, optional

pipe | is for ‘or’
‘abc|123’ matches ‘abc’ or ‘123’ but not ‘ab3’
question mark ? makes the preceding item optional: c3?[a-z]+ would match c3po and also cpu
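
The two patterns from this slide, checked with re.fullmatch so that the whole string has to match:

```python
import re

# Alternation: either side of the pipe may match.
abc_ok = re.fullmatch(r"abc|123", "abc") is not None
num_ok = re.fullmatch(r"abc|123", "123") is not None
mixed_ok = re.fullmatch(r"abc|123", "ab3") is not None

# Optional item: the 3 may be present or absent.
c3po_ok = re.fullmatch(r"c3?[a-z]+", "c3po") is not None
cpu_ok = re.fullmatch(r"c3?[a-z]+", "cpu") is not None

print(abc_ok, num_ok, mixed_ok, c3po_ok, cpu_ok)
```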

Regular Expressions: in reverse

the regex engine scans from the beginning of the string
$ anchors the pattern to the end of the string, so the match must finish there
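
A brief sketch of anchoring with $, on invented strings. The ^ anchor (from the metacharacter list) is included for contrast.

```python
import re

# $ requires the digits to sit at the very end of the string.
ends_with_digits = re.search(r"[0-9]+$", "Precinct 30017").group()
not_at_end = re.search(r"[0-9]+$", "30017 Precinct")

# ^ requires the match to begin at the start of the string.
starts_with_digits = re.search(r"^[0-9]+", "30017 Precinct").group()

print(ends_with_digits, not_at_end, starts_with_digits)
```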

Regular Expressions

\d, \w and \s: digit, word character and whitespace
\D, \W and \S: NOT digit, NOT word character, NOT whitespace (use outside char class)
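
A few hedged one-liners showing these shorthand classes on invented strings:

```python
import re

numbers = re.findall(r"\d+", "Precinct 12, votes 340")   # runs of digits
words = re.findall(r"\w+", "DOE, Jane")                  # runs of word characters
fields = re.split(r"\s+", "DOE   Jane\tABC")             # split on any whitespace
digits_only = re.sub(r"\D", "", "1,234 votes")           # drop every non-digit

print(numbers, words, fields, digits_only)
```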

Now let’s see some examples and put this to use to get the
party.
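
A sketch of pulling out the party and last name, under a loud assumption: the cell format "LASTNAME, Firstname (PARTY)" is hypothetical, since the real page layout is not shown on the slides. The pattern would need adjusting to the actual cell text.

```python
import re

# Hypothetical cell format: "LASTNAME, Firstname (PARTY)".
candidate_pattern = re.compile(
    r"^(?P<lastname>[A-Za-z'-]+),"   # last name up to the comma
    r".*"                            # first names, spaces, etc.
    r"\((?P<party>[A-Z]+)\)$"        # party abbreviation in parentheses at the end
)

match = candidate_pattern.search("DOE, Jane (ABC)")   # invented sample cell
lastname = match.group("lastname")
party = match.group("party")

# A row without a candidate (e.g. a totals row) simply does not match.
no_match = candidate_pattern.search("Total valid votes")

print(lastname, party, no_match)
```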

Basic functions: Getting headers, titles, body

soup.head
soup.title
soup.body
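
These three attributes can be tried on a tiny made-up page (a sketch assuming bs4 is installed):

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>Results</title></head>"
    "<body><p>Total votes: <b>120</b></p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

head_name = soup.head.name             # the <head> tag itself
title_text = str(soup.title.string)    # text inside <title>
body_text = soup.body.get_text()       # all text inside <body>

print(head_name, title_text, body_text)
```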

Basic functions

soup.b
id: soup.find_all(id="link2")
eliminate from the tree: decompose()
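
A sketch of these three calls on an invented snippet (bs4 assumed installed). The span stands in for any element you want removed from the tree.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p><b>Totals</b> '
    '<a id="link1" href="/page1">one</a> '
    '<a id="link2" href="/page2">two</a> '
    '<span>advert</span></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

first_bold = soup.b                       # first <b> tag in the document
matches = soup.find_all(id="link2")       # every tag whose id is "link2"

soup.span.decompose()                     # remove the <span> and its contents

print(first_bold.get_text(), matches[0]["href"], soup.get_text())
```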

Other Methods: Navigating the Parse Tree

With parent you move up the parse tree. With contents you move down the tree.
contents is an ordered list of the Tag and NavigableString objects contained within a page element.
next_sibling and previous_sibling (nextSibling and previousSibling in BeautifulSoup 3): skip to the next or previous thing on the same level of the parse tree
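
A minimal sketch of moving around the parse tree, assuming bs4 (BeautifulSoup 4) is installed, where the sibling attributes are spelled next_sibling and previous_sibling. The one-row table is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<table><tr><td>A</td><td>B</td></tr></table>", "html.parser")

first_cell = soup.td                                 # first <td> in the document

parent_name = first_cell.parent.name                 # up: the enclosing <tr>
row_children = soup.tr.contents                      # down: the two <td> tags
cell_text = str(first_cell.contents[0])              # down again: the string "A"
next_text = first_cell.next_sibling.get_text()       # sideways: the second cell
back_again = first_cell.next_sibling.previous_sibling

print(parent_name, len(row_children), cell_text, next_text)
```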

Data output

Create simple csv files: import csv


many other possible methods: e.g. use within a pandas
DataFrame (cf Wes McKinney)
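
A small sketch of csv output with invented rows. It writes to an in-memory buffer so it runs anywhere; for a real file, open(path, "w", newline="") would take the buffer's place.

```python
import csv
import io

# Invented rows standing in for scraped results.
rows = [
    ["County A", "1001", "ABC", "120"],
    ["County A", "1001", "XYZ", "95"],
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["county", "precinct", "party", "votes"])   # header row
writer.writerows(rows)                                      # one line per result

csv_text = buffer.getvalue()
print(csv_text)
```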

Putting it all together

Loop over files
for vote total rows, make the party empty
print each row with the county and precinct number as columns
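
The steps above might be sketched as follows. Everything here is an assumption for illustration: the in-memory pages list stands in for the downloaded files, the two-column table layout and the "(PARTY)" suffix are hypothetical, and parse_page is an invented helper, not the presenter's code.

```python
import csv
import io
import re

from bs4 import BeautifulSoup

# Hypothetical stand-ins for downloaded pages: (county, precinct, html).
pages = [
    (
        "County A",
        "1001",
        "<table>"
        "<tr><td>DOE, Jane (ABC)</td><td>120</td></tr>"
        "<tr><td>Total valid votes</td><td>120</td></tr>"
        "</table>",
    ),
]

# Assumed format: the party abbreviation sits in parentheses at the end.
party_pattern = re.compile(r"\(([A-Z]+)\)\s*$")


def parse_page(county, precinct, html):
    """Return one output row per table row of a single page."""
    soup = BeautifulSoup(html, "html.parser")
    out = []
    for tr in soup.find("table").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != 2:
            continue                      # skip rows that do not fit the layout
        label, votes = cells
        match = party_pattern.search(label)
        party = match.group(1) if match else ""   # vote total rows: empty party
        out.append([county, precinct, label, party, votes])
    return out


all_rows = []
for county, precinct, html in pages:      # loop over "files"
    all_rows.extend(parse_page(county, precinct, html))

buffer = io.StringIO()                    # swap for a real file when needed
writer = csv.writer(buffer)
writer.writerow(["county", "precinct", "label", "party", "votes"])
writer.writerows(all_rows)
print(buffer.getvalue())
```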

PDFs

Can extract text, looping over 100s or 1,000s of pdfs.


not based on character recognition (OCR)

pdfminer

There are other packages, but pdfminer is focused more directly on scraping (rather than creating) pdfs.
Can be executed in a single command, or step-by-step

We’ll look at just using it within python in a single command, outputting to a .txt file.
Sample pdfs from the National Security Archive Iraq War:
http://www.gwu.edu/~nsarchiv/NSAEBB/NSAEBB418/

Performing an action over all files

Often useful to do something over all files in a folder.
One way to do this is with glob:
import glob
for filename in glob.glob('/filepath/*.pdf'):
    print(filename)
see also an example file with pdfminer

Additional Examples

(time permitting): newspapers, output to pandas...
