Getting data
The problem!
Data exists in different sources and different formats
and we have to work with whatever format we get
or
We go to data analysis with data in the format we have,
not the format we want!
Sources of data
local data
csv files
pdf files
xls files
web data
json
xml
html
database servers
mysql
postgres
mongoDB
RESTful Web Services
REST: Representational State Transfer
“A network of web pages connected through
links and HTTP commands (GET, POST,
etc.)”
RESTful: A web service that conforms to the
REST standards
RESTful Web Services
URLs: RESTful Web Services deliver
resources to the client. Each resource (html,
json, image, etc.) is associated with a URL
and an HTTP method
RESTful: A web service that conforms to the
REST standards
Example: Epicurious
http://www.epicurious.com
type "Tofu Chili" in the search box
http GET
new url:
http://www.epicurious.com/search/Tofu%20Chili
Example: NYTIMES login
https://myaccount.nytimes.com/auth/login
type Email address and
Password in the form
http POST
new url:
http://www.nytimes.com/
Example: Google GEOCODING API
HTTP GET request with a JSON response
https://maps.googleapis.com/maps/api/geocode/json?address=Columbia_University,_New_York,_NY
All Google API requests take the form:
<api_url>/<response_type>?<parameters>
api_url: https://maps.googleapis.com/maps/api/geocode/
response_type: json (or xml)
parameters: address=Columbia_University,_New_York,_NY
What we need
The ability to
* create and send HTTP requests
* receive and process HTTP responses
* convert data residing in JSON/XML/HTML
format into python objects
Python libraries for
getting web data
* Send an http request and get an http response
* requests
* urllib.request (urllib2 in Python 2)
* parse the response and extract data
* json
* lxml
* BeautifulSoup, Selenium (for html data)
http requests
requests: Python library for handling http requests and responses
http://docs.python-requests.org/en/master/
using requests
* Import the library
import requests
* Construct the url
url = "http://www.epicurious.com/search/Tofu+Chili"
* Send the request and get a response
response = requests.get(url)
* Check if the request was successful
if response.status_code == 200:
    print("SUCCESS!")
else:
    print("FAILURE!")
response status codes
* 200 or 201
the request response cycle worked as planned
* Other 200s
the request response cycle worked but there is
additional information associated with the response
* 400s
there was an error (page not found/malformed
request/etc.)
* General rule of thumb
check if the status code was 200 for accessing data
through a GET or POST HTTP request
response content
* response.content
returns the content of the HTTP response
* response.content.decode('utf-8')
if the content is byte encoded (which it usually is!),
converts it into unicode - a python str
* What is unicode?
http://unicode.org/standard/WhatIsUnicode.html
* General rule of thumb
web pages are usually returned as byte strings and need
to be decoded. utf-8 is the usual encoding (but not always!)
* What is utf-8?
https://www.w3schools.com/charsets/ref_html_utf8.asp
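A minimal sketch of the content/decode step, reusing the Epicurious search url from earlier (the utf-8 assumption holds for most, but not all, pages):

import requests

url = "http://www.epicurious.com/search/Tofu%20Chili"
response = requests.get(url)
raw_bytes = response.content             # bytes
page_text = raw_bytes.decode('utf-8')    # python str (unicode)
print(type(raw_bytes), type(page_text))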
Try this
1. Open https://en.wikipedia.org/wiki/main_page using the
requests library
2. Check the status code. Did your request work?
3. Get the content. Decode it. Then search the page for the string
"Did you know" using the str find function
4. If your find function returned a positive number - Great!
5. If it returned -1 (that means it was not found), you've done
something wrong. Try figuring out what went wrong
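One possible solution sketch for this exercise (whether the string is found depends on the page content at the time you fetch it):

import requests

response = requests.get("https://en.wikipedia.org/wiki/main_page")
print(response.status_code)                  # expect 200
page_text = response.content.decode('utf-8')
position = page_text.find("Did you know")    # -1 means not found
print(position)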
web data formats
* HTML
the common format when scraping web pages for data
* JSON or XML
usually when accessing data through an API or when
the server is explicitly sharing data with you
JSON
JavaScript Object Notation
- Standard for "serializing" data objects for
storage or transmission
- Human-readable, useful for data
interchange
- Also useful for representing and storing
semistructured data
- Stored as plain text (byte strings or utf-8 strings)
JSON constructs and
Python equivalents
JSON          Python
number        int, float
string        str
null          None
true/false    True/False
object        dict
array         list
python json library
json.loads(<str>): converts a JSON string to
python objects
json.dumps(<python_object>): converts a
python object into a JSON formatted string
python json library
Converts a json string to an equivalent
python type
import json
json_data = '[{"b": [2, 4], "c": 3.0, "a": "A"}]'    # str
python_data = json.loads(json_data)                  # list
python json library
Converts a python object to an equivalent
JSON formatted string
import json
data_string = json.dumps(python_data)    # list in, str out
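A small round-trip sketch (illustrative values) showing the type mapping from the table above:

import json

json_text = '{"n": 3, "x": 2.5, "s": "hi", "flag": true, "nothing": null, "arr": [1, 2]}'
obj = json.loads(json_text)
print(obj)               # {'n': 3, 'x': 2.5, 's': 'hi', 'flag': True, 'nothing': None, 'arr': [1, 2]}
print(json.dumps(obj))   # back to a JSON formatted string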
requests and json
The response object handles json
the requests library handles
spaces in a url for you
address="Columbia University, New York, NY"
url="https://maps.googleapis.com/maps/api/geocode/json?address=%s" % (address)
response = requests.get(url).json()
response now points to a
python data object
requests and json
response.json() raises an exception if the
JSON object is ill-formed
note that some http errors are returned as
JSON
always check for exceptions!
requests and json
address="Columbia University, New York, NY"
url="https://maps.googleapis.com/maps/api/geocode/json?address=%s" % (address)
try:
response = requests.get(url)
if not response.status_code == 200:
print("HTTP error",response.status_code)
else:
try:
response_data = response.json()
except:
print("Response not in valid JSON format")
except:
print("Something went wrong with requests.get")
print(type(response_data))
requests and json: example
Let’s take a look at the JSON object returned by Google Geocoding API
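An abbreviated sketch of the shape of that JSON object (only the fields used in the solution to Problem 1 below are shown; the real response contains many more):

{
  "results": [
    {
      "geometry": {
        "location": {"lat": ..., "lng": ...}
      },
      ...
    }
  ],
  ...
}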
Working with json
Problem 1: Write a function that takes an
address as an argument and returns a
(latitude, longitude) tuple
def get_lat_lng(address_string):
    #python code goes here
Solution to problem 1
def get_lat_lng(address):
    import requests
    url="https://maps.googleapis.com/maps/api/geocode/json?address=%s"%(address)
    response = requests.get(url)
    if response.status_code == 200:
        lat = response.json()['results'][0]['geometry']['location']['lat']
        lng = response.json()['results'][0]['geometry']['location']['lng']
        return lat,lng
    else:
        return None
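A quick usage sketch (the coordinates come back from the live API, so the output, or whether the call succeeds at all, depends on the service):

print(get_lat_lng("Columbia University, New York, NY"))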
xml
Extensible Markup Language
- Tree structure
- Tagged elements (nested)
- Attributes
- Text (leaves of the tree)
xml: Example
...
      <Last_Name>Berenholtz</Last_Name>
    </Author>
  </Authors>
</Book>
<Book ISBN="ISBN-13:978-1579128562" Price="15.80">
  <Remark>
    Five Hundred Buildings of New York and over one million other books are available for Amazon Kindle.
  </Remark>
  <Title>Five Hundred Buildings of New York</Title>
  <Authors>
    <Author Residence="Beijing">
      <First_Name>Bill</First_Name>
      <Last_Name>Harris</Last_Name>
    </Author>
...
XML Tree
[Tree diagram: a Bookstore root containing Book elements. Each Book carries
attributes (e.g. ISBN="ISBN-13:978-1599620787", Price="15.23", Weight="1.5")
and has Title and Authors subtrees; each Author has first name and last name
leaves, with text such as "New York Deco" (a Title) and "Bill Harris" (an Author).]
lxml: Python xml library
xml tree object definition
from lxml import etree
root = etree.XML(data)
print(root.tag)    # prints the root tag (Bookstore in our example)
http://lxml.de/1.3/tutorial.html
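A self-contained sketch, assuming a small Bookstore document built from the fragment shown above (element and attribute names are taken from that example):

from lxml import etree

data = """<Bookstore>
  <Book ISBN="ISBN-13:978-1579128562" Price="15.80">
    <Title>Five Hundred Buildings of New York</Title>
    <Authors>
      <Author Residence="Beijing">
        <First_Name>Bill</First_Name>
        <Last_Name>Harris</Last_Name>
      </Author>
    </Authors>
  </Book>
</Bookstore>"""

root = etree.XML(data)
print(root.tag)    # Bookstore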
lxml: Python xml library
Examining the tree
print(etree.tostring(root, pretty_print=True).decode("utf-8"))
lxml: Iterating over
children of a tag
root is a collection of children so we can iterate over it
for child in root:
    print(child.tag)
lxml: Iterating over elements
iter is an 'iterator': it generates a sequence of elements in the order
they appear in the xml code
for element in root.iter():
    print(element.tag)
lxml: Iterating over elements
root.iter("Author") iterates over the tree matching only those elements
that have "Author" as the tag
element.find('First_Name') returns the first element that has "First_Name"
as the tag
for element in root.iter("Author"):
    print(element.find('First_Name').text,element.find('Last_Name').text)
lxml: using XPath
XPath: expression for navigating through an xml tree
for element in root.findall('Book/Title'):
    print(element.text)
Try this
Find the last names of all authors in the tree "root" using XPath
Solution
for element in root.findall('Book/Authors/Author/Last_Name'):
    print(element.text)
lxml: Finding by attribute value
root.find('Book[@Weight="1.5"]/Authors/Author/First_Name').text
Problem 3
Print first and last names of all authors who live in
New York City
from lxml import etree
root = etree.XML(data_string)
for author_element in root.findall('Book/Authors/Author[@Residence="New York City"]'):
    f_name = author_element.find('First_Name').text
    l_name = author_element.find('Last_Name').text
    print(f_name,l_name)
HTML
HyperText Markup Language
- Formats text
- Tagged elements (nested)
- Attributes
- Derived from SGML (but who
cares!)
- Closely related to XML
- Can contain runnable scripts
HTML/CSS
Study this on your own!
Make sure you’ve reviewed the first two topics (“Intro to html” and
“Intro to CSS”) on Khan Academy
https://www.khanacademy.org/computing/computer-programming/html-css
getting data from the
web
creeping, crawling, pouncing!
Web scraping: Automating the process of extracting information from web pages
* for data collection and analysis
* for incorporating in a web app
APIs (Application Programming Interface): Functions and libraries for communicating with
specific web servers
* for data collection and analysis
* for incorporating in a web app
Web crawling: Automating the process of traversing links on web pages
* for indexing the web
* for collecting data from multiple web sites
Legal and ethical issues
Legal issues
➡ Often against the 'Terms of Use' of a web site
➡ but, regardless, murky and not fully settled
➡ See: https://en.wikipedia.org/wiki/Web_scraping#Legal_issues
➡ probably depends upon three things:
๏ factual, non-proprietary data is generally ok
๏ proprietary data scraping depends on what you do with it
๏ potential or actual damage to the scrapee
Ethical issues
➡ Public vs. private information. Scraping public information is rarely unethical
➡ Purpose. Ethical or unethical depends on why you're scraping the web site
➡ Always better to try and get the information openly using APIs or contacting the
owner of the server
➡ Is there a public interest involved? If yes, it's probably ethical to scrape
Libraries for web scraping
requests: Python library for connecting to a web page, managing
the connection and retrieving contents of the page
Beautiful Soup: A library that utilizes the 'tag structure' of
an html page to quickly parse the contents of a page and retrieve data
Selenium: A browser automation library that loads a page so that any
javascript on it actually runs, then lets you use the rendered 'tag structure' to retrieve data.
Slower than Beautiful Soup but gets around the 'javascript' problem
BeautifulSoup4
➡ HTML (and XML) parser
➡ Uses ‘tags’
➡ Creates a parse tree (using lxml/html5lib or other
python parser)
➡ Can handle incomplete tagging
➡ tags are organized in hierarchical dictionaries
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
bs4
initialize bs4 object: BeautifulSoup(document,parser)
parser: lxml (fast) or html5lib (slower but more robust)
import requests
from bs4 import BeautifulSoup
url = "http://www.epicurious.com/search/Tofu%20Chili"
response = requests.get(url)
page_soup = BeautifulSoup(response.content,'lxml')
print(page_soup.prettify())
page_soup is the object from
which we will extract the
data we need
Unique data identifiers
➡ We want to create a list of recipes and links to the
recipes
➡ We need to figure out how to ‘programmatically’
extract each recipe name and recipe link
➡ Search for the tag with a unique attribute value that
identifies recipes and recipe links
➡ The easiest way is to examine the page source on a
browser
Finding unique data identifiers
for tag in page_soup.find_all('article'):
    print(tag.get('class'))
This gets the innermost tags with the recipe name.
prints:
['article-content-card']
['gallery-content-card']
['recipe-content-card']
['article-content-card']
['recipe-content-card']
['recipe-content-card']
['recipe-content-card']
['article-content-card']
['recipe-content-card']
['recipe-content-card']
['recipe-content-card']
['recipe-content-card']
['article-content-card']
['article-content-card']
['recipe-content-card']
looks like class='recipe-content-card' gives us the recipes
bs4 functions
<tag>.find(<tag_name>,attribute=value) finds the first matching child tag (recursively)
<tag>.find_all(<tag_name>,attribute=value) finds all matching child tags (recursively)
<tag>.get_text() returns the text inside the tag, with the markup stripped
<tag>.parent returns the (immediate) parent
<tag>.parents returns all parents (recursively)
<tag>.children returns the (direct) children
<tag>.descendants returns all children (recursively)
<tag>.get(attribute) returns the value of the specified attribute
<tag>.name returns the name of a tag
<tag>.attrs returns all the attributes of a tag
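A small sketch of these functions in action, assuming the page_soup object from the Epicurious example above and that the page still uses the recipe-content-card class:

first_recipe = page_soup.find('article', class_='recipe-content-card')
if first_recipe:
    print(first_recipe.name)               # 'article'
    print(first_recipe.attrs)              # dict of all its attributes
    print(first_recipe.get('class'))       # ['recipe-content-card']
    print(first_recipe.get_text()[:80])    # its text, markup stripped
    link = first_recipe.find('a')          # first matching <a> child
    if link:
        print(link.get('href'))            # value of the href attribute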
Problem 4: Extract recipes and recipe links
Write a function epicurious_recipes(search_string) that returns
the list of recipes and links associated with search_string
Call the function with a search_string, open the link associated
with the first recipe, then return the ingredients and preparation
instructions associated with that link
Problem 4: Step 1
given a recipe url, get the description and the ingredients
def get_recipe_detail(url):
    import requests
    from bs4 import BeautifulSoup
    html_data = requests.get(url)
    if not html_data.status_code == 200:
        return '',[]
    recipe_data = BeautifulSoup(html_data.content,'lxml')
    description = get_description(recipe_data)
    ing_list = get_ingredients(recipe_data)
    return description,ing_list
need to write these two functions
Problem 4: Step 2
write the get_description function
def get_description(recipe_page_data):
    description_tag = recipe_page_data.find('div',itemprop = 'description')
    if description_tag:
        return description_tag.get_text()
    return ''
Problem 4: Step 3
write the get_ingredients function
def get_ingredients(ing_page_data):
    ing_list = list()
    for item in ing_page_data.find_all('li',class_='ingredient'):
        ing_list.append(item.get_text())
    return ing_list
Problem 4: Step 4
write the function that:
takes key words as an argument
and returns a list of tuples
(name,link,description)
def get_recipes(keywords):
    recipe_list = list()
Problem 4: Step 4 code
(the body of get_recipes, continued from the previous slide)
    import requests
    from bs4 import BeautifulSoup
    url = "http://www.epicurious.com/search/" + keywords
    response = requests.get(url)
    if not response.status_code == 200:
        return None
    try:
        results_page = BeautifulSoup(response.content,'lxml')
        recipes = results_page.find_all('article',class_="recipe-content-card")
        for recipe in recipes:
            recipe_link = "http://www.epicurious.com" + recipe.find('a').get('href')
            recipe_name = recipe.find('a').get_text()
            try:
                recipe_description = recipe.find('p',class_='dek').get_text()
            except:
                recipe_description = ''
            recipe_list.append((recipe_name,recipe_link,recipe_description))
        return recipe_list
    except:
        return None
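Putting the steps together, a usage sketch following the problem statement (what comes back depends on what the live site returns):

recipes = get_recipes("Tofu Chili")
if recipes:
    name, link, description = recipes[0]
    print(name)
    detail_description, ingredients = get_recipe_detail(link)
    print(detail_description)
    print(ingredients)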
logging in with requests and beautifulsoup
➡ Figure out the login url
➡ https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page
➡ Look for the login form in the html source
➡ form_tag = page_soup.find('form')
➡ Look for ALL the inputs in the login form (some may be tricky!)
➡ input_tags = form_tag.find_all('input')
➡ Create a Python dict object with key,value pairs for each input
➡ Use requests.session to create an open session object
➡ Send the login request (POST)
➡ Send followup requests keeping the sessions object open
Setting up the inputs
payload = {
    'wpName': username,
    'wpPassword': password,
    'wploginattempt': 'Log in',
    'wpEditToken': "+\\",
    'title': "Special:UserLogin",
    'authAction': "login",
    'force': "",
    'wpForceHttps': "1",
    'wpFromhttp': "1",
    #'wpLoginToken': '', #We need to read this from the page
}
Extracting token information
wpLoginToken: the value of this attribute is provided by the page. we need
to extract it.
login_page_response = s.get('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page')
soup = BeautifulSoup(login_page_response.content,'lxml')
token = soup.find('input',{'name':"wpLoginToken"}).get('value')
Finalizing session parameters
username=<your username>
password=<your password>
def get_login_token(response):
    soup = BeautifulSoup(response.text, 'lxml')
    token = soup.find('input',{'name':"wpLoginToken"}).get('value')
    return token
payload = {
    'wpName': username,
    'wpPassword': password,
    'wploginattempt': 'Log in',
    'wpEditToken': "+\\",
    'title': "Special:UserLogin",
    'authAction': "login",
    'force': "",
    'wpForceHttps': "1",
    'wpFromhttp': "1",
    #'wpLoginToken': '',
}
Activating session
with requests.session() as s:
    response = s.get('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page')
    payload['wpLoginToken'] = get_login_token(response)
    #Send the login request
    response_post = s.post('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    #Get another page and check if we're still logged in
    response = s.get('https://en.wikipedia.org/wiki/Special:Watchlist')
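A simple follow-up check sketch, still inside the session block, assuming the watchlist page mentions your username when you are logged in (that assumption may not hold for every version of the site):

    soup = BeautifulSoup(response.content,'lxml')
    if username in soup.get_text():
        print("Logged in")
    else:
        print("Login appears to have failed")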