Unit I: Web Scraping in Python - A Usha Kumari
Course Overview:
This unit covers the fundamentals of web scraping using Python, from making your first HTTP request to
building sophisticated web crawlers. Students will learn both theory and practical implementation with clear,
simple examples.
1. Introduction to Web Scraping
What is Web Scraping?
Web scraping is the process of automatically extracting data from websites [1]. Think of it as teaching your
computer to read and collect information from web pages just like you would manually, but much faster.
Real-world applications:
• Price comparison websites collecting product prices
• News aggregators gathering articles from multiple sources
• Research data collection from academic websites
• Job boards collecting listings from company websites
The Web Scraping Process
Web scraping follows four main steps [2]:
1. Send HTTP Request: Ask the website for the page content
2. Parse HTML Content: Convert raw HTML into a structured format
3. Extract Data: Find and collect the specific information you need
4. Store Data: Save the extracted data in a useful format (CSV, JSON, database)
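Before looking at each step in detail, here is a minimal sketch of how the four steps fit together; it uses the requests and BeautifulSoup libraries installed in the next section and the practice site https://quotes.toscrape.com/ that is used throughout this unit:
import requests
from bs4 import BeautifulSoup

# Step 1: Send HTTP Request
response = requests.get("https://quotes.toscrape.com/")

# Step 2: Parse HTML Content
soup = BeautifulSoup(response.content, 'html.parser')

# Step 3: Extract Data (here, the text of every quote on the page)
texts = [span.text for span in soup.find_all('span', class_='text')]

# Step 4: Store Data (here, a plain text file; later sections use CSV)
with open('quotes_preview.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(texts))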
2. Setting Up Your Environment
Installing Required Libraries
# Install the essential libraries
pip install requests beautifulsoup4
# Optional: Install better parsers for improved performance
pip install lxml html5lib
What each library does:
• requests: Handles HTTP requests to websites [3]
• beautifulsoup4: Parses and navigates HTML documents [4]
• lxml: Fast XML/HTML parser (optional but recommended)
• html5lib: Most lenient parser for broken HTML
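If the optional parsers are installed, you can tell BeautifulSoup which one to use by passing its name as the second argument; a quick sketch (the tiny HTML string is just an illustration):
from bs4 import BeautifulSoup

html = "<p class='intro'>Hello</p>"                # tiny example document
soup_default = BeautifulSoup(html, 'html.parser')  # built-in parser, no extra install
soup_lxml = BeautifulSoup(html, 'lxml')            # fast, requires lxml
soup_html5 = BeautifulSoup(html, 'html5lib')       # most lenient, requires html5lib
print(soup_default.p.text)  # Hello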
Simple Installation Test
# Test if everything is working
import requests
from bs4 import BeautifulSoup
print("✓ All libraries installed successfully!")
3. Understanding HTML Structure
HTML Basics
Before scraping, you need to understand HTML structure. HTML is like a family tree [5]:
<!DOCTYPE html>
<html> <!-- Root element -->
<head> <!-- Page metadata -->
<title>Page Title</title>
</head>
<body> <!-- Visible content -->
<div class="container">
<h1>Main Heading</h1> <!-- Child of div -->
<p class="intro">Text</p> <!-- Sibling of h1 -->
</div>
</body>
</html>
Key Concepts:
• Tags: <div>, <p>, <a> (the building blocks)
• Attributes: class="intro", id="main" (additional information)
• Content: Text between opening and closing tags
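A short sketch that parses a fragment of the example page above and pulls out each of these three pieces:
from bs4 import BeautifulSoup

html = """
<div class="container">
  <h1>Main Heading</h1>
  <p class="intro">Text</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

p = soup.find('p')
print(p.name)          # tag name: p
print(p.get('class'))  # attribute value: ['intro']
print(p.text)          # content: Text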
4. Your First Web Scraper
Step 1: Making HTTP Requests
import requests
# Connect to a website
url = "https://quotes.toscrape.com/"
response = requests.get(url)
# Check if request was successful
print(f"Status Code: {response.status_code}") # 200 = success
print(f"Content Length: {len(response.text)} characters")
Important HTTP Status Codes:
• 200: Success
• 404: Page not found
• 403: Access forbidden
• 500: Server error
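A small sketch of how these codes are usually checked before parsing a page:
import requests

response = requests.get("https://quotes.toscrape.com/")

if response.status_code == 200:
    print("Success - safe to parse the page")
elif response.status_code == 404:
    print("Page not found - check the URL")
else:
    # raise_for_status() turns 4xx/5xx status codes into exceptions
    response.raise_for_status()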
Step 2: Introduction to BeautifulSoup
from bs4 import BeautifulSoup
# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')
# Get basic page information
print("Page title:", soup.title.text)
print("Number of paragraphs:", len(soup. nd_all('p')))
Step 3: Extracting Your First Data
# Extract all quotes from the page
quotes = soup.find_all('div', class_='quote')
for quote in quotes:
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    print(f'"{text}" - {author}')
Output Example:
"The world as we have created it is a process of our thinking..." - Albert Einstein
"It is our choices, Harry, that show what we truly are..." - J.K. Rowling
5. Advanced HTML Parsing with BeautifulSoup
Different Ways to Find Elements
1. Basic Find Methods:
# Find first occurrence
first_quote = soup.find('div', class_='quote')
# Find all occurrences
all_quotes = soup.find_all('div', class_='quote')
# Find by ID
footer = soup.find('div', id='footer')
2. CSS Selectors (More Powerful):
# Select by class
quotes = soup.select('.quote')
# Select by ID
footer = soup.select('#footer')
# Select nested elements
quote_texts = soup.select('div.quote span.text')
# Select direct children only
authors = soup.select('div.quote > small.author')
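select() always returns a list; when you only need the first match, the related select_one() method is a convenient shorthand. A brief sketch, continuing with the same soup object from the quotes page:
# select_one() returns the first matching element, or None if nothing matches
first_author = soup.select_one('div.quote > small.author')
if first_author is not None:
    print(first_author.text)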
Accessing HTML Attributes
# Find a link element
link = soup.find('a')
# Get attribute value (safe method)
href = link.get('href')
print(f"Link URL: {href}")
# Get all attributes as dictionary
all_attributes = link.attrs
print(f"All attributes: {all_attributes}")
Navigating the HTML Tree
# Find a quote element
quote = soup.find('span', class_='text')
# Move up the tree (parent)
quote_container = quote.parent
print(f"Parent tag: {quote_container.name}")
# Move sideways (siblings)
next_element = quote.find_next_sibling()
print(f"Next sibling: {next_element.name if next_element else None}")
6. Regular Expressions with BeautifulSoup
What are Regular Expressions?
Regular expressions (regex) are patterns that help you find specific text [6]. Think of them as advanced search
patterns.
Common Patterns:
import re
# Find all numbers
numbers = re.findall(r'\d+', 'I have 25 apples and 10 oranges')
print(numbers) # ['25', '10']
# Find email addresses
emails = re.findall(r'\w+@\w+\.\w+', 'Contact: john@email.com or support@site.org')
print(emails) # ['john@email.com', 'support@site.org']
# Find prices
prices = re.findall(r'\$\d+\.\d{2}', 'Items cost $12.99 and $45.50')
print(prices) # ['$12.99', '$45.50']
Using Regex with BeautifulSoup
# Find text strings matching a pattern (string= is the current name of the older text= argument)
price_elements = soup.find_all(string=re.compile(r'\$\d+'))
# Find links with specific URL patterns
external_links = soup.find_all('a', href=re.compile(r'^https://'))
# Find elements with class names containing 'product'
product_elements = soup.find_all(class_=re.compile(r'product'))
7. Lambda Expressions for Advanced Filtering
What are Lambda Functions?
Lambda functions are small, anonymous functions that can have any number of arguments but only one
expression [7].
Basic Syntax:
# Regular function
def double(x):
    return x * 2
# Equivalent lambda function
double_lambda = lambda x: x * 2
print(double_lambda(5)) # Output: 10
Lambda with BeautifulSoup
# Find elements with more than 2 attributes
multi_attr = soup.find_all(lambda tag: len(tag.attrs) > 2)
# Find div elements that contain 'quote' in their class
quote_divs = soup.find_all(lambda tag:
    tag.name == 'div' and
    tag.get('class') and
    'quote' in tag.get('class', []))
# Find elements with specific text length
long_text = soup.find_all(lambda tag:
    tag.text and len(tag.text.strip()) > 100)
8. Building Web Crawlers
What is Web Crawling?
Web crawling is the process of automatically browsing through multiple web pages by following links [8]. It's
like having a robot that clicks through a website systematically.
Basic Crawler Components:
1. URL Queue: List of pages to visit
2. Visited Set: Track pages already crawled
3. Parser: Extract data and find new links
4. Politeness: Don't overwhelm the server
Simple Crawler Example
from urllib.parse import urljoin, urlparse
from collections import deque
import time
def simple_crawler(start_url, max_pages=5):
    # Initialize data structures
    to_visit = deque([start_url])
    visited = set()
    scraped_data = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        if url in visited:
            continue
        print(f"Crawling: {url}")
        try:
            # Get page content
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'html.parser')
            # Extract data
            title = soup.title.text if soup.title else "No title"
            scraped_data.append({'url': url, 'title': title})
            # Find new links to crawl
            for link in soup.find_all('a', href=True):
                new_url = urljoin(url, link['href'])
                if new_url not in visited:
                    to_visit.append(new_url)
            visited.add(url)
            time.sleep(1)  # Be polite to the server
        except Exception as e:
            print(f"Error crawling {url}: {e}")
    return scraped_data

# Run the crawler
results = simple_crawler("https://quotes.toscrape.com/")
for item in results:
    print(f"Title: {item['title'][:50]}...")
Crawling Within a Single Domain
def same_domain_crawler(start_url, max_pages=10):
    domain = urlparse(start_url).netloc
    to_visit = deque([start_url])
    visited = set()
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        # Skip if already visited
        if url in visited:
            continue
        # Only crawl same domain
        if urlparse(url).netloc != domain:
            continue
        print(f"Crawling same domain: {url}")
        try:
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'html.parser')
            # Extract links from same domain only
            for link in soup.find_all('a', href=True):
                new_url = urljoin(url, link['href'])
                if urlparse(new_url).netloc == domain:
                    to_visit.append(new_url)
            visited.add(url)
            time.sleep(1)
        except Exception as e:
            print(f"Error: {e}")
    print(f"Crawled {len(visited)} pages from domain: {domain}")
9. Best Practices and Ethics
Important Guidelines
1. Respect robots.txt: Check website.com/robots.txt before crawling (see the sketch after this list)
2. Use delays: Add time.sleep(1) between requests
3. Handle errors gracefully: Use try/except blocks
4. Don't overload servers: Limit concurrent requests
5. Respect terms of service: Read and follow website rules
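Guideline 1 can be automated with the standard-library module urllib.robotparser; a minimal sketch, assuming the same practice site and an example user-agent string:
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://quotes.toscrape.com/robots.txt")
robots.read()

# can_fetch() reports whether your crawler may request a given URL
user_agent = "Educational Web Scraper 1.0"  # example user-agent, matching the scraper below
if robots.can_fetch(user_agent, "https://quotes.toscrape.com/page/2/"):
    print("Allowed to crawl this page")
else:
    print("robots.txt disallows this page - skip it")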
Polite Scraping Example
def polite_scraper(url):
    headers = {
        'User-Agent': 'Educational Web Scraper 1.0'
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raises exception for HTTP errors
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extract data here
        time.sleep(1)  # Be polite
        return soup
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None
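Because polite_scraper() returns None when the request fails, callers should check the result before parsing; a brief usage sketch:
soup = polite_scraper("https://quotes.toscrape.com/")
if soup is not None:
    print("Page title:", soup.title.text)
else:
    print("Could not fetch the page - nothing to parse")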
10. Putting It All Together: Complete Project
Project: Quote Collector
Let's build a complete scraper that collects quotes and saves them to a file:
import requests
from bs4 import BeautifulSoup
import csv
import time
def scrape_quotes():
    quotes_data = []
    url = "https://quotes.toscrape.com/"
    while url:
        print(f"Scraping: {url}")
        # Get page content
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extract quotes from current page
        quotes = soup.find_all('div', class_='quote')
        for quote in quotes:
            text = quote.find('span', class_='text').text
            author = quote.find('small', class_='author').text
            # Extract tags
            tags = [tag.text for tag in quote.find_all('a', class_='tag')]
            quotes_data.append({
                'text': text,
                'author': author,
                'tags': ', '.join(tags)
            })
        # Find next page link
        next_btn = soup.find('li', class_='next')
        if next_btn:
            next_url = next_btn.find('a')['href']
            url = f"https://quotes.toscrape.com{next_url}"
        else:
            url = None
        time.sleep(1)  # Be polite
    return quotes_data

def save_to_csv(data, filename='quotes.csv'):
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['text', 'author', 'tags'])
        writer.writeheader()
        writer.writerows(data)
    print(f"Saved {len(data)} quotes to {filename}")
# Run the complete project
if __name__ == "__main__":
    quotes = scrape_quotes()
    save_to_csv(quotes)
    print("Quote collection complete!")
11. Common Debugging Tips
When Things Go Wrong:
1. Check the HTML source: Use print(soup.prettify()) to see the structure
2. Verify your selectors: Test CSS selectors in browser developer tools
3. Handle missing elements: Use .get() method or check if element exists
4. Check for dynamic content: Some data loads with JavaScript (need Selenium)
5. Add error handling: Always use try/except blocks
Debugging Example:
def debug_scraper(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Debug: Print page structure
    print("Page title:", soup.title.text if soup.title else "No title")
    print("Number of divs:", len(soup.find_all('div')))
    # Debug: Check if specific elements exist
    quotes = soup.find_all('div', class_='quote')
    print(f"Found {len(quotes)} quotes")
    if quotes:
        print("First quote structure:")
        print(quotes[0].prettify()[:200] + "...")