Unit I: Web Scraping in Python - A Usha Kumari
Course Overview:
This unit covers the fundamentals of web scraping using Python, from making your first HTTP request to
building sophisticated web crawlers. Students will learn both theory and practical implementation with clear,
simple examples.
1. Introduction to Web Scraping
What is Web Scraping?
Web scraping is the process of automatically extracting data from websites [1]. Think of it as teaching your
computer to read and collect information from web pages just like you would manually, but much faster.
Real-world applications:
• Price comparison websites collecting product prices
• News aggregators gathering articles from multiple sources
• Research data collection from academic websites
• Job boards collecting listings from company websites
The Web Scraping Process
Web scraping follows four main steps [2]:
1. Send HTTP Request: Ask the website for the page content
2. Parse HTML Content: Convert raw HTML into a structured format
3. Extract Data: Find and collect the specific information you need
4. Store Data: Save the extracted data in a useful format (CSV, JSON, database)
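Before looking at each step in detail, here is a minimal sketch of how the four steps fit together; it uses the requests and BeautifulSoup libraries installed in the next section and the practice site https://quotes.toscrape.com/ that is used throughout this unit:
import requests
from bs4 import BeautifulSoup

# Step 1: Send HTTP Request
response = requests.get("https://quotes.toscrape.com/")

# Step 2: Parse HTML Content
soup = BeautifulSoup(response.content, 'html.parser')

# Step 3: Extract Data (here, the text of every quote on the page)
texts = [span.text for span in soup.find_all('span', class_='text')]

# Step 4: Store Data (here, a plain text file; later sections use CSV)
with open('quotes_preview.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(texts))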
2. Setting Up Your Environment
Installing Required Libraries
# Install the essential libraries
pip install requests beautifulsoup4
# Optional: Install better parsers for improved performance
pip install lxml html5lib
What each library does:
• requests: Handles HTTP requests to websites [3]
• beautifulsoup4: Parses and navigates HTML documents [4]
• lxml: Fast XML/HTML parser (optional but recommended)
• html5lib: Most lenient parser for broken HTML
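If the optional parsers are installed, you can tell BeautifulSoup which one to use by passing its name as the second argument; a quick sketch (the tiny HTML string is just an illustration):
from bs4 import BeautifulSoup

html = "<p class='intro'>Hello</p>"                # tiny example document
soup_default = BeautifulSoup(html, 'html.parser')  # built-in parser, no extra install
soup_lxml = BeautifulSoup(html, 'lxml')            # fast, requires lxml
soup_html5 = BeautifulSoup(html, 'html5lib')       # most lenient, requires html5lib
print(soup_default.p.text)  # Hello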
Simple Installation Test
# Test if everything is working
import requests
from bs4 import BeautifulSoup
print("✓ All libraries installed successfully!")
3. Understanding HTML Structure
HTML Basics
Before scraping, you need to understand HTML structure. HTML is like a family tree [5]:
<!DOCTYPE html>
<html> <!-- Root element -->
<head> <!-- Page metadata -->
<title>Page Title</title>
</head>
<body> <!-- Visible content -->
<div class="container">
<h1>Main Heading</h1> <!-- Child of div -->
<p class="intro">Text</p> <!-- Sibling of h1 -->
</div>
</body>
</html>
Key Concepts:
• Tags: <div>, <p>, <a> (the building blocks)
• Attributes: class="intro", id="main" (additional information)
• Content: Text between opening and closing tags
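A short sketch that parses a fragment of the example page above and pulls out each of these three pieces:
from bs4 import BeautifulSoup

html = """
<div class="container">
  <h1>Main Heading</h1>
  <p class="intro">Text</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

p = soup.find('p')
print(p.name)          # tag name: p
print(p.get('class'))  # attribute value: ['intro']
print(p.text)          # content: Text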
4. Your First Web Scraper
Step 1: Making HTTP Requests
import requests
# Connect to a website
url = "https://quotes.toscrape.com/"
response = requests.get(url)
# Check if request was successful
print(f"Status Code: {response.status_code}") # 200 = success
print(f"Content Length: {len(response.text)} characters")
Important HTTP Status Codes:
• 200: Success
• 404: Page not found
• 403: Access forbidden
• 500: Server error
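A small sketch of how these codes are usually checked before parsing a page:
import requests

response = requests.get("https://quotes.toscrape.com/")

if response.status_code == 200:
    print("Success - safe to parse the page")
elif response.status_code == 404:
    print("Page not found - check the URL")
else:
    # raise_for_status() turns 4xx/5xx status codes into exceptions
    response.raise_for_status()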
Step 2: Introduction to BeautifulSoup
from bs4 import BeautifulSoup
# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')
# Get basic page information
print("Page title:", soup.title.text)
print("Number of paragraphs:", len(soup. nd_all('p')))
Step 3: Extracting Your First Data
# Extract all quotes from the page
quotes = soup.find_all('div', class_='quote')
for quote in quotes:
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    print(f'"{text}" - {author}')
Output Example:
"The world as we have created it is a process of our thinking..." - Albert Einstein
"It is our choices, Harry, that show what we truly are..." - J.K. Rowling
5. Advanced HTML Parsing with BeautifulSoup
Different Ways to Find Elements
1. Basic Find Methods:
# Find first occurrence
first_quote = soup.find('div', class_='quote')
# Find all occurrences
all_quotes = soup.find_all('div', class_='quote')
# Find by ID
footer = soup.find('div', id='footer')
2. CSS Selectors (More Powerful):
# Select by class
quotes = soup.select('.quote')
# Select by ID
footer = soup.select('#footer')
# Select nested elements
quote_texts = soup.select('div.quote span.text')
# Select direct children only
authors = soup.select('div.quote > small.author')
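select() always returns a list; when you only need the first match, the related select_one() method is a convenient shorthand. A brief sketch, continuing with the same soup object from the quotes page:
# select_one() returns the first matching element, or None if nothing matches
first_author = soup.select_one('div.quote > small.author')
if first_author is not None:
    print(first_author.text)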
Accessing HTML Attributes
# Find a link element
link = soup.find('a')
# Get attribute value (safe method)
href = link.get('href')
print(f"Link URL: {href}")
# Get all attributes as dictionary
all_attributes = link.attrs
print(f"All attributes: {all_attributes}")
Navigating the HTML Tree
# Find a quote element
quote = soup.find('span', class_='text')
# Move up the tree (parent)
quote_container = quote.parent
print(f"Parent tag: {quote_container.name}")
# Move sideways (siblings)
next_element = quote.find_next_sibling()
print(f"Next sibling: {next_element.name if next_element else None}")
6. Regular Expressions with BeautifulSoup
What are Regular Expressions?
Regular expressions (regex) are patterns that help you find specific text [6]. Think of them as advanced search
patterns.
Common Patterns:
import re
# Find all numbers
numbers = re.findall(r'\d+', 'I have 25 apples and 10 oranges')
print(numbers) # ['25', '10']
# Find email addresses
emails = re.findall(r'\w+@\w+\.\w+', 'Contact: john@email.com or support@site.org')
print(emails) # ['john@email.com', 'support@site.org']
# Find prices
prices = re.findall(r'\$\d+\.\d{2}', 'Items cost $12.99 and $45.50')
print(prices) # ['$12.99', '$45.50']
Using Regex with BeautifulSoup
# Find text strings matching a pattern (string= is the current name of the older text= argument)
price_elements = soup.find_all(string=re.compile(r'\$\d+'))
# Find links with specific URL patterns
external_links = soup.find_all('a', href=re.compile(r'^https://'))
# Find elements with class names containing 'product'
product_elements = soup.find_all(class_=re.compile(r'product'))
7. Lambda Expressions for Advanced Filtering
What are Lambda Functions?
Lambda functions are small, anonymous functions that can have any number of arguments but only one
expression [7].
Basic Syntax:
# Regular function
def double(x):
    return x * 2
# Equivalent lambda function
double_lambda = lambda x: x * 2
print(double_lambda(5)) # Output: 10
Lambda with BeautifulSoup
# Find elements with more than 2 attributes
multi_attr = soup.find_all(lambda tag: len(tag.attrs) > 2)
# Find div elements that contain 'quote' in their class
quote_divs = soup.find_all(lambda tag:
    tag.name == 'div' and
    tag.get('class') and
    'quote' in tag.get('class', []))
# Find elements with specific text length
long_text = soup.find_all(lambda tag:
    tag.text and len(tag.text.strip()) > 100)
8. Building Web Crawlers
What is Web Crawling?
Web crawling is the process of automatically browsing through multiple web pages by following links [8]. It's
like having a robot that clicks through a website systematically.
Basic Crawler Components:
1. URL Queue: List of pages to visit
2. Visited Set: Track pages already crawled
3. Parser: Extract data and find new links
4. Politeness: Don't overwhelm the server
Simple Crawler Example
from urllib.parse import urljoin, urlparse
from collections import deque
import time
def simple_crawler(start_url, max_pages=5):
    # Initialize data structures
    to_visit = deque([start_url])
    visited = set()
    scraped_data = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        if url in visited:
            continue
        print(f"Crawling: {url}")
        try:
            # Get page content
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'html.parser')
            # Extract data
            title = soup.title.text if soup.title else "No title"
            scraped_data.append({'url': url, 'title': title})
            # Find new links to crawl
            for link in soup.find_all('a', href=True):
                new_url = urljoin(url, link['href'])
                if new_url not in visited:
                    to_visit.append(new_url)
            visited.add(url)
            time.sleep(1)  # Be polite to the server
        except Exception as e:
            print(f"Error crawling {url}: {e}")
    return scraped_data

# Run the crawler
results = simple_crawler("https://quotes.toscrape.com/")
for item in results:
    print(f"Title: {item['title'][:50]}...")
Crawling Within a Single Domain
def same_domain_crawler(start_url, max_pages=10):
    domain = urlparse(start_url).netloc
    to_visit = deque([start_url])
    visited = set()
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        # Skip if already visited
        if url in visited:
            continue
        # Only crawl same domain
        if urlparse(url).netloc != domain:
            continue
        print(f"Crawling same domain: {url}")
        try:
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'html.parser')
            # Extract links from same domain only
            for link in soup.find_all('a', href=True):
                new_url = urljoin(url, link['href'])
                if urlparse(new_url).netloc == domain:
                    to_visit.append(new_url)
            visited.add(url)
            time.sleep(1)
        except Exception as e:
            print(f"Error: {e}")
    print(f"Crawled {len(visited)} pages from domain: {domain}")
9. Best Practices and Ethics
Important Guidelines
1. Respect robots.txt: Check website.com/robots.txt before crawling (see the sketch after this list)
2. Use delays: Add time.sleep(1) between requests
3. Handle errors gracefully: Use try/except blocks
4. Don't overload servers: Limit concurrent requests
5. Respect terms of service: Read and follow website rules
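Guideline 1 can be automated with the standard-library module urllib.robotparser; a minimal sketch, assuming the same practice site and an example user-agent string:
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://quotes.toscrape.com/robots.txt")
robots.read()

# can_fetch() reports whether your crawler may request a given URL
user_agent = "Educational Web Scraper 1.0"  # example user-agent, matching the scraper below
if robots.can_fetch(user_agent, "https://quotes.toscrape.com/page/2/"):
    print("Allowed to crawl this page")
else:
    print("robots.txt disallows this page - skip it")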
Polite Scraping Example
def polite_scraper(url):
    headers = {
        'User-Agent': 'Educational Web Scraper 1.0'
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raises exception for HTTP errors
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extract data here
        time.sleep(1)  # Be polite
        return soup
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None
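Because polite_scraper() returns None when the request fails, callers should check the result before parsing; a brief usage sketch:
soup = polite_scraper("https://quotes.toscrape.com/")
if soup is not None:
    print("Page title:", soup.title.text)
else:
    print("Could not fetch the page - nothing to parse")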
10. Putting It All Together: Complete Project
Project: Quote Collector
Let's build a complete scraper that collects quotes and saves them to a file:
import requests
from bs4 import BeautifulSoup
import csv
import time
def scrape_quotes():
    quotes_data = []
    url = "https://quotes.toscrape.com/"
    while url:
        print(f"Scraping: {url}")
        # Get page content
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extract quotes from current page
        quotes = soup.find_all('div', class_='quote')
        for quote in quotes:
            text = quote.find('span', class_='text').text
            author = quote.find('small', class_='author').text
            # Extract tags
            tags = [tag.text for tag in quote.find_all('a', class_='tag')]
            quotes_data.append({
                'text': text,
                'author': author,
                'tags': ', '.join(tags)
            })
        # Find next page link
        next_btn = soup.find('li', class_='next')
        if next_btn:
            next_url = next_btn.find('a')['href']
            url = f"https://quotes.toscrape.com{next_url}"
        else:
            url = None
        time.sleep(1)  # Be polite
    return quotes_data

def save_to_csv(data, filename='quotes.csv'):
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['text', 'author', 'tags'])
        writer.writeheader()
        writer.writerows(data)
    print(f"Saved {len(data)} quotes to {filename}")
# Run the complete project
if __name__ == "__main__":
    quotes = scrape_quotes()
    save_to_csv(quotes)
    print("Quote collection complete!")
11. Common Debugging Tips
When Things Go Wrong:
1. Check the HTML source: Use print(soup.prettify()) to see the structure
2. Verify your selectors: Test CSS selectors in browser developer tools
3. Handle missing elements: Use .get() method or check if element exists
4. Check for dynamic content: Some data loads with JavaScript (need Selenium)
5. Add error handling: Always use try/except blocks
Debugging Example:
def debug_scraper(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Debug: Print page structure
    print("Page title:", soup.title.text if soup.title else "No title")
    print("Number of divs:", len(soup.find_all('div')))
    # Debug: Check if specific elements exist
    quotes = soup.find_all('div', class_='quote')
    print(f"Found {len(quotes)} quotes")
    if quotes:
        print("First quote structure:")
        print(quotes[0].prettify()[:200] + "...")