BeautifulSoup: Web Scraping with Python

Apr 9, 2020

BeautifulSoup
Roadmap

Uses: data types, examples...

Getting Started
downloading files with wget
BeautifulSoup: in depth example - election results table
Additional commands, approaches
PDFminer
(time permitting) additional examples

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Etiquette/ Ethics

Similar rules of etiquette apply as Pablo mentioned:

Limit requests, protect privacy, play nice...

Data/Page formats on the web

HTML, HTML5 (<!DOCTYPE html>)

data formats: XML, JSON
PDF
APIs
other languages of the web: CSS, Java, PHP, ASP.NET...
(don’t forget existing datasets)

BeautifulSoup

General purpose, robust, works with broken tags

Parses html and xml, including fixing asymmetric tags, etc.
Returns unicode text strings
Alternatives: lxml (also parses html), Scrapy
Faster alternatives: ElementTree, SGMLParser (custom)

Installation

pip install beautifulsoup4
or, on older setups, easy_install beautifulsoup4 (easy_install is deprecated in favor of pip)
See: http://www.crummy.com/software/BeautifulSoup/
On installing libraries:
http://docs.python.org/2/install/

HTML Table basics

<table> Defines a table

<th> Defines a header cell in a table
<tr> Defines a row in a table
<td> Defines a cell in a table

HTML Tables

<h4>Simple table:</h4>
<table>
<tr>
<td>[r1, c1] </td>
<td>[r1, c2] </td>
</tr>
<tr>
<td>[r2, c1]</td>
<td>[r2, c2]</td>
</tr>
</table>
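
As a rough illustration of how BeautifulSoup sees such a table, the sketch below parses the simple table above into a list of rows. It assumes the bs4 package is installed and uses Python's built-in "html.parser"; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<h4>Simple table:</h4>
<table>
<tr>
<td>[r1, c1] </td>
<td>[r1, c2] </td>
</tr>
<tr>
<td>[r2, c1]</td>
<td>[r2, c2]</td>
</tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
table = soup.find("table")

# One inner list per <tr>, one string per <td>, with stray whitespace stripped.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all("td")]
    for tr in table.find_all("tr")
]
print(rows)
```

Each inner list corresponds to one row of the table, which is already close to the shape a csv writer expects.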

Example: Election data from html table

election results spread across hundreds of pages

want to quickly put in useable format (e.g. csv)

Download relevant pages

website might change at any moment

ability to replicate research
limits page requests

Download relevant pages

I use wget (GNU), which can be called from within python

alternatively cURL may be better for macs, or scrapy

Download relevant pages

wget: note the --no-parent option!

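import os

# Wait between requests, cap the download rate, and stay below the start directory.
# The URL is reproduced as it appears on the slide, where it is cut off.
os.system("wget --convert-links --no-clobber \
--wait=4 \
--limit-rate=10K \
-r --no-parent http://www.necliberia.org/results2011/results")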
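
An alternative sketch, not the presenter's code: the same wget call can be built as an argument list and run with subprocess.run, which avoids shell quoting issues and raises an error if wget fails. The helper names build_wget_command and mirror are hypothetical, the flags are the ones from the slide, and wget is assumed to be on the PATH.

```python
import subprocess


def build_wget_command(url, wait_seconds=4, rate_limit="10K"):
    """Return the wget argument list used on the slide (hypothetical helper)."""
    return [
        "wget",
        "--convert-links",
        "--no-clobber",
        f"--wait={wait_seconds}",
        f"--limit-rate={rate_limit}",
        "-r",
        "--no-parent",
        url,
    ]


def mirror(url):
    """Run wget and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_wget_command(url), check=True)


# Example with a placeholder URL; nothing is downloaded until mirror() is called.
print(" ".join(build_wget_command("http://example.com/results/")))
```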

Step one: view page source

Outline of Our Approach

1. identify the county and precinct number
2. get the table:
   - identify the correct table
   - put the rows into a list
   - for each row, identify cells
   - use regular expressions to identify the party & last name
3. write a row to the csv file

Open a page

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
Print all: print(soup.prettify())
Print text: print(soup.get_text())
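
A minimal runnable sketch of these calls, assuming bs4 is installed. The tiny html_doc string is a made-up stand-in for a downloaded page, and "html.parser" is Python's built-in parser.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a downloaded page (hypothetical content).
html_doc = (
    "<html><head><title>Results</title></head>"
    "<body><p>Precinct 1</p></body></html>"
)

# Naming the parser explicitly keeps results consistent across machines.
soup = BeautifulSoup(html_doc, "html.parser")

pretty = soup.prettify()  # indented markup, one tag per line
text = soup.get_text()    # all text with the tags removed

print(pretty)
print(text)
```

For a page saved to disk, an open file handle can be passed to BeautifulSoup in place of the string.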

Navigating the Page Structure

some sites use div, others put everything in tables.

find_all

finds all the Tag and NavigableString objects that match the
criteria you give.
find table rows: find_all("tr")
e.g.:

for link in soup.find_all('a'):
    print(link.get('href'))
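
A few more find_all criteria, as a hedged sketch: the HTML snippet and the class names in it are invented for illustration, and bs4 is assumed to be installed.

```python
from bs4 import BeautifulSoup

html_doc = """
<table>
<tr><th>Candidate</th><th>Votes</th></tr>
<tr><td class="name">A. Example</td><td class="votes">120</td></tr>
</table>
<a href="page2.html">next</a> <a name="top">anchor without href</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

rows = soup.find_all("tr")                   # by tag name
cells = soup.find_all(["th", "td"])          # any of several tag names
votes = soup.find_all("td", class_="votes")  # by tag name and CSS class
# href=True keeps only the <a> tags that actually have an href attribute.
links = [a.get("href") for a in soup.find_all("a", href=True)]

print(len(rows), len(cells), votes[0].get_text(), links)
```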

Let’s try it out

We’ll run through the code step-by-step

Regular Expressions

Allow precise and flexible matching of strings

precise: i.e. character-by-character (including spaces, etc)
flexible: specify a set of allowable characters, unknown
quantities
import re

[Comic from xkcd, licensed under the Creative Commons Attribution-NonCommercial 2.5 License.]

Regular Expressions: metacharacters

Metacharacters:
. ^ $ * + ? { } [ ] \ | ( )
escape metacharacters with a backslash \
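
A small sketch of why escaping matters, using only the standard re module. The sample string is made up.

```python
import re

price = "Total: $4.50 (approx.)"

# Unescaped, "$" anchors to the end of the string and "." matches any
# character, so this pattern does not find the literal text "$4.50".
unescaped = re.search("$4.50", price)

# Escaping by hand with backslashes (in a raw string) matches the literal text.
by_hand = re.search(r"\$4\.50", price)

# re.escape() adds the backslashes for you.
automatic = re.search(re.escape("$4.50"), price)

print(unescaped, by_hand.group(), automatic.group())
```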

Regular Expressions: Character class

brackets [ ] allow matching of any element they contain

[A-Z] matches a capital letter, [0-9] matches a digit
[a-z][0-9] matches a lowercase letter followed by a number
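
The character classes above can be tried out directly. This is a quick sketch with made-up input strings.

```python
import re

# [A-Z] is one capital letter.
capitals = re.findall(r"[A-Z]", "Bong County")

# [a-z][0-9] is one lowercase letter immediately followed by one digit.
pairs = re.findall(r"[a-z][0-9]", "a1 B2 c3 dd4")

print(capitals)  # ['B', 'C']
print(pairs)     # ['a1', 'c3', 'd4']
```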

Regular Expressions: Repeat

star * matches the previous item 0 or more times

plus + matches the previous item 1 or more times
[A-Za-z]* would match only the first 3 chars of Xpr8r
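
A short sketch of the difference between * and +, using re.match, which always starts at the beginning of the string.

```python
import re

# The leading run of letters in "Xpr8r" stops at the digit.
letters = re.match(r"[A-Za-z]*", "Xpr8r").group()

# * accepts zero repetitions, so it still "matches" (an empty string) here.
star = re.match(r"[A-Za-z]*", "8ball")

# + needs at least one repetition, so the same text gives no match at all.
plus = re.match(r"[A-Za-z]+", "8ball")

print(repr(letters), repr(star.group()), plus)
```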

Regular Expressions: match anything

dot . matches any character except a newline \n (in Python, pass re.DOTALL to match newlines too)

combined with * or + it is very hungry (greedy): it matches as much as it can
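
To see how hungry .* is, the sketch below compares it with the non-greedy form .*? (a standard re feature that the slide does not mention) and checks how the dot treats line breaks in Python. The sample strings are made up.

```python
import re

html = "<td>Smith</td><td>ABC</td>"

# Greedy: .* runs to the LAST </td> it can reach.
greedy = re.search(r"<td>.*</td>", html).group()

# Non-greedy: .*? stops at the FIRST </td>.
lazy = re.search(r"<td>.*?</td>", html).group()

# In Python the dot skips only "\n" unless re.DOTALL is given.
across_newline = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"a.b", "a\nb", re.DOTALL)
across_cr = re.search(r"a.b", "a\rb")

print(greedy)
print(lazy)
```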

Regular Expressions: or, optional

pipe | is for ‘or’

‘abc|123’ matches ‘abc’ or ‘123’ but not ‘ab3’
question mark ? makes the preceding item optional: c3?[a-z]+
would match c3po and also cpu
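
The two claims above can be checked with re.fullmatch, which succeeds only if the whole string fits the pattern.

```python
import re

either = re.compile(r"abc|123")
optional = re.compile(r"c3?[a-z]+")

# True where the whole string matches the pattern, False otherwise.
either_results = {s: bool(either.fullmatch(s)) for s in ["abc", "123", "ab3"]}
optional_results = {s: bool(optional.fullmatch(s)) for s in ["c3po", "cpu"]}

print(either_results)    # {'abc': True, '123': True, 'ab3': False}
print(optional_results)  # {'c3po': True, 'cpu': True}
```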

Regular Expressions: in reverse

the regex engine scans from the beginning of the string

$ anchors the match to the end of the string, e.g. [0-9]+$ finds a number only at the very end
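
A brief sketch of what $ does in practice. The row text is invented for illustration.

```python
import re

row = "12 Monrovia 2011 total 4512"

# Without an anchor, search() returns the leftmost match.
first_number = re.search(r"[0-9]+", row).group()

# With $, only a run of digits that ends the string can match.
last_number = re.search(r"[0-9]+$", row).group()

print(first_number, last_number)  # 12 4512
```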

Regular Expressions

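\d, \w and \s match a digit, a word character (letter, digit or underscore) and a whitespace character
\D, \W and \S match the opposite: NOT a digit, NOT a word character, NOT whitespace (use outside a character class)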
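
These shorthand classes are easiest to understand by example. The line of text below is made up.

```python
import re

line = "Precinct 30045:  1,204 votes"

digits = re.findall(r"\d+", line)     # runs of digits
words = re.findall(r"\w+", line)      # runs of letters, digits, underscore
pieces = re.split(r"\s+", line)       # split on any run of whitespace
only_digits = re.sub(r"\D", "", line)  # drop everything that is NOT a digit

print(digits)       # ['30045', '1', '204']
print(words)        # ['Precinct', '30045', '1', '204', 'votes']
print(pieces)       # ['Precinct', '30045:', '1,204', 'votes']
print(only_digits)  # 300451204
```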

Regular Expressions

Now let’s see some examples and put this to use to get the
party.
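
One possible way to pull a last name and party out of a table cell is sketched below. The cell format "LASTNAME, Firstname (PARTY)" is an assumption made for illustration, the helper name split_candidate is hypothetical, and the real election pages may need a different pattern.

```python
import re

# Assumed cell format: "LASTNAME, Firstname (PARTY)".
CANDIDATE = re.compile(r"^\s*(?P<lastname>[^,]+),.*\((?P<party>[A-Z]+)\)\s*$")


def split_candidate(cell_text):
    """Return (lastname, party), or (None, None) if the cell does not fit."""
    match = CANDIDATE.match(cell_text)
    if match is None:
        return None, None
    return match.group("lastname").strip(), match.group("party")


print(split_candidate("DOE, Jane (ABC)"))    # ('DOE', 'ABC')
print(split_candidate("Total Valid Votes"))  # (None, None)
```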

Basic functions: Getting headers, titles, body

soup.head
soup.title
soup.body
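
A minimal sketch of these attributes on a made-up page, assuming bs4 is installed.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>Election Results</title></head>"
    "<body><h1>Precinct 1</h1></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.title              # the whole <title> tag
title_text = soup.title.string      # just the text inside it
heading_text = soup.body.h1.string  # attribute access can be chained

print(title_tag)
print(title_text)
print(heading_text)
```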

Basic functions

soup.b
id: soup.find_all(id="link2")
eliminate from the tree: decompose()
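
The three calls above, tried on a small invented page. decompose() is used here to strip a script tag before extracting text, which is one common use and not necessarily the presenter's.

```python
from bs4 import BeautifulSoup

html_doc = """
<p>Results for <b>Precinct 1</b></p>
<a id="link1" href="a.html">A</a>
<a id="link2" href="b.html">B</a>
<script>var tracking = 1;</script>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_bold = soup.b.string          # text of the first <b> tag
link2 = soup.find_all(id="link2")   # a list, even for a single hit

# Remove every <script> tag (and its contents) from the tree.
for script in soup.find_all("script"):
    script.decompose()

text = soup.get_text()
print(first_bold, link2[0]["href"])
```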

Other Methods: Navigating the Parse Tree

With parent you move up the parse tree. With contents
you move down the tree.
contents is an ordered list of the Tag and NavigableString
objects contained within a page element.
next_sibling and previous_sibling (nextSibling and previousSibling in
Beautiful Soup 3): skip to the next or previous thing on the same
level of the parse tree
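
A small sketch of moving up, down and sideways, using the Beautiful Soup 4 spellings next_sibling and previous_sibling. The one-row table is made up and is written without whitespace between tags, so every sibling is a tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<table><tr><td>Doe</td><td>ABC</td><td>120</td></tr></table>",
    "html.parser",
)
first_cell = soup.td

parent_name = first_cell.parent.name       # up: the enclosing <tr>
row_children = first_cell.parent.contents  # down: the three <td> tags
next_cell = first_cell.next_sibling        # sideways: the second <td>
before_first = first_cell.previous_sibling  # nothing comes before the first cell

print(parent_name, len(row_children), next_cell.string, before_first)
```

On real pages the whitespace between tags shows up as NavigableString siblings, so a sibling is not always a tag.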

Data output

Create simple csv files: import csv

many other possible methods: e.g. use within a pandas
DataFrame (cf Wes McKinney)
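
A minimal sketch of writing rows with the csv module. The column names and the sample row are invented, and an in-memory buffer stands in for a real file so the example leaves nothing on disk.

```python
import csv
import io

rows = [
    ["county", "precinct", "lastname", "party", "votes"],
    ["Example County", "1001", "DOE", "ABC", "120"],
]


def write_rows(rows, handle):
    """Write each inner list as one line of comma-separated values."""
    writer = csv.writer(handle)
    writer.writerows(rows)


# In a script, use a real file: open("results.csv", "w", newline="")
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```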

Putting it all together

Loop over files

for vote total rows, make the party empty
print each row with the county and precinct number as
columns
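
The steps above could be combined roughly as follows. This is a sketch under stated assumptions, not the presenter's script: the page layout (a two-column table of candidate and votes), the cell format "LASTNAME, Firstname (PARTY)", and the helper name rows_from_page are all hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Assumed cell format: "LASTNAME, Firstname (PARTY)".
CANDIDATE = re.compile(r"^\s*(?P<lastname>[^,]+),.*\((?P<party>[A-Z]+)\)\s*$")


def rows_from_page(html, county, precinct):
    """Return rows of [county, precinct, name, party, votes] for one page."""
    soup = BeautifulSoup(html, "html.parser")
    out = []
    for tr in soup.find("table").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != 2:
            continue  # header rows (only <th>) or anything unexpected
        label, votes = cells
        match = CANDIDATE.match(label)
        if match:
            name, party = match.group("lastname").strip(), match.group("party")
        else:
            name, party = label, ""  # vote-total rows: leave the party empty
        out.append([county, precinct, name, party, votes])
    return out


sample = """<table>
<tr><th>Candidate</th><th>Votes</th></tr>
<tr><td>DOE, Jane (ABC)</td><td>120</td></tr>
<tr><td>Total Valid Votes</td><td>120</td></tr>
</table>"""

result = rows_from_page(sample, "Example County", "1001")
for row in result:
    print(row)
```

To cover every downloaded file, one would call rows_from_page inside a loop over the file names and pass each returned row to a csv writer.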

PDFs

Can extract text, looping over 100s or 1,000s of pdfs.

not based on character recognition (OCR)

pdfminer

There are other packages, but pdfminer is focused more
directly on scraping (rather than creating) pdfs.
Can be executed in a single command, or step-by-step

PDFs

We’ll look at just using it within python in a single command,
outputting to a .txt file.
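
The exact command used in the talk is not shown on the slide. As one hedged possibility, the sketch below assumes the pdfminer.six package and its pdfminer.high_level.extract_text function; the helper names txt_path_for and pdf_to_txt are hypothetical.

```python
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the output path: same name as the PDF, with a .txt suffix."""
    return Path(pdf_path).with_suffix(".txt")


def pdf_to_txt(pdf_path):
    """Extract the text layer of one PDF and save it next to the original."""
    # Imported here so the rest of a script still loads without pdfminer.six.
    from pdfminer.high_level import extract_text

    text = extract_text(str(pdf_path))
    out_path = txt_path_for(pdf_path)
    out_path.write_text(text, encoding="utf-8")
    return out_path


# Only the path helper runs here; pdf_to_txt() needs a real PDF on disk.
print(txt_path_for("docs/memo.pdf").name)
```

Because no OCR is involved, scanned PDFs without a text layer will yield little or no text.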
Sample pdfs from the National Security Archive Iraq War:
http://www.gwu.edu/~nsarchiv/NSAEBB/NSAEBB418/

Performing an action over all files

Often useful to do something over all files in a folder.

One way to do this is with glob:
import glob
for filename in glob.glob('/filepath/*.pdf'):
    print(filename)
see also an example file with pdfminer

Additional Examples

(time permitting): newspapers, output to pandas...
