
Unit I: Web Scraping in Python - A Usha Kumari

Course Overview:
This unit covers the fundamentals of web scraping using Python, from making your first HTTP request to
building sophisticated web crawlers. Students will learn both theory and practical implementation with clear,
simple examples.

1. Introduction to Web Scraping

What is Web Scraping?

Web scraping is the process of automatically extracting data from websites [1]. Think of it as teaching your
computer to read and collect information from web pages just like you would manually, but much faster.

Real-world applications:

• Price comparison websites collecting product prices

• News aggregators gathering articles from multiple sources

• Research data collection from academic websites

• Job boards collecting listings from company websites

The Web Scraping Process

Web scraping follows four main steps [2]:

1. Send HTTP Request: Ask the website for the page content

2. Parse HTML Content: Convert raw HTML into a structured format

3. Extract Data: Find and collect the specific information you need

4. Store Data: Save the extracted data in a useful format (CSV, JSON, database)
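A minimal sketch of these four steps end to end, using the requests and BeautifulSoup libraries installed in the next section and the practice site quotes.toscrape.com that this unit scrapes throughout (the CSV filename is just an example):

import csv
import requests
from bs4 import BeautifulSoup

# 1. Send HTTP Request: ask the site for a page
response = requests.get("https://quotes.toscrape.com/")

# 2. Parse HTML Content: turn the raw HTML into a navigable object
soup = BeautifulSoup(response.content, 'html.parser')

# 3. Extract Data: collect the text of every quote on the page
quote_texts = [span.text for span in soup.find_all('span', class_='text')]

# 4. Store Data: save the results to a CSV file
with open('quotes_preview.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows([[text] for text in quote_texts])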

2. Setting Up Your Environment

Installing Required Libraries

# Install the essential libraries
pip install requests beautifulsoup4

# Optional: Install better parsers for improved performance
pip install lxml html5lib

What each library does:

• requests: Handles HTTP requests to websites [3]

• beautifulsoup4: Parses and navigates HTML documents [4]

• lxml: Fast XML/HTML parser (optional but recommended)

• html5lib: Most lenient parser for broken HTML
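The parser is chosen when the soup object is created. A small sketch, assuming the optional parsers above are installed (each parser may repair broken markup slightly differently):

from bs4 import BeautifulSoup

html = "<p>Hello<p>World"  # deliberately sloppy HTML with unclosed tags

# Same document, three different parsers
print(BeautifulSoup(html, 'html.parser').prettify())
print(BeautifulSoup(html, 'lxml').prettify())
print(BeautifulSoup(html, 'html5lib').prettify())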

Simple Installation Test


# Test if everything is working
import requests
from bs4 import BeautifulSoup

print("✓ All libraries installed successfully!")

3. Understanding HTML Structure

HTML Basics

Before scraping, you need to understand HTML structure. HTML is like a family tree [5]:

<!DOCTYPE html>
<html> <!-- Root element -->
<head> <!-- Page metadata -->
<title>Page Title</title>
</head>
<body> <!-- Visible content -->
<div class="container">
<h1>Main Heading</h1> <!-- Child of div -->
<p class="intro">Text</p> <!-- Sibling of h1 -->
</div>
</body>
</html>

Key Concepts:
• Tags: <div>, <p>, <a> (the building blocks)

• Attributes: class="intro", id="main" (additional information)

• Content: Text between opening and closing tags
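As a quick preview of how these three concepts map onto code (BeautifulSoup itself is introduced in the next sections), here is one tag taken from the snippet above:

from bs4 import BeautifulSoup

# A single tag from the example page above
soup = BeautifulSoup('<p class="intro">Text</p>', 'html.parser')
p = soup.find('p')

print(p.name)      # tag name: 'p'
print(p['class'])  # attribute value: ['intro']
print(p.text)      # content between the tags: 'Text'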

4. Your First Web Scraper

Step 1: Making HTTP Requests

import requests

# Connect to a website
url = "https://quotes.toscrape.com/"
response = requests.get(url)

# Check if request was successful
print(f"Status Code: {response.status_code}")  # 200 = success
print(f"Content Length: {len(response.text)} characters")

Important HTTP Status Codes:

• 200: Success

• 404: Page not found

• 403: Access forbidden

• 500: Server error
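A short sketch of acting on the status code before parsing, reusing the response object from the request above:

# Only parse the page when the request actually succeeded
if response.status_code == 200:
    print("Success - safe to parse the page")
elif response.status_code == 404:
    print("Page not found - check the URL")
else:
    print(f"Unexpected status: {response.status_code}")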

Step 2: Introduction to BeautifulSoup

from bs4 import BeautifulSoup

# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Get basic page information
print("Page title:", soup.title.text)
print("Number of paragraphs:", len(soup.find_all('p')))

Step 3: Extracting Your First Data


# Extract all quotes from the page
quotes = soup.find_all('div', class_='quote')

for quote in quotes:
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    print(f'"{text}" - {author}')

Output Example:

"The world as we have created it is a process of our thinking..." - Albert Einstein


"It is our choices, Harry, that show what we truly are..." - J.K. Rowling

5. Advanced HTML Parsing with BeautifulSoup

Different Ways to Find Elements

1. Basic Find Methods:

# Find first occurrence
first_quote = soup.find('div', class_='quote')

# Find all occurrences
all_quotes = soup.find_all('div', class_='quote')

# Find by ID
footer = soup.find('div', id='footer')

2. CSS Selectors (More Powerful):

# Select by class
quotes = soup.select('.quote')

# Select by ID
footer = soup.select('#footer')

# Select nested elements
quote_texts = soup.select('div.quote span.text')

# Select direct children only
authors = soup.select('div.quote > small.author')

Accessing HTML Attributes

# Find a link element
link = soup.find('a')

# Get attribute value (safe method)
href = link.get('href')
print(f"Link URL: {href}")

# Get all attributes as a dictionary
all_attributes = link.attrs
print(f"All attributes: {all_attributes}")

Navigating the HTML Tree

# Find a quote element
quote = soup.find('span', class_='text')

# Move up the tree (parent)
quote_container = quote.parent
print(f"Parent tag: {quote_container.name}")

# Move sideways (siblings)
next_element = quote.find_next_sibling()
print(f"Next sibling: {next_element.name if next_element else None}")

6. Regular Expressions with BeautifulSoup

What are Regular Expressions?

Regular expressions (regex) are patterns that help you find specific text [6]. Think of them as advanced search
patterns.

Common Patterns:

import re

# Find all numbers
numbers = re.findall(r'\d+', 'I have 25 apples and 10 oranges')
print(numbers)  # ['25', '10']

# Find email addresses
emails = re.findall(r'\w+@\w+\.\w+', 'Contact: john@email.com or support@site.org')
print(emails)  # ['john@email.com', 'support@site.org']

# Find prices
prices = re.findall(r'\$\d+\.\d{2}', 'Items cost $12.99 and $45.50')
print(prices)  # ['$12.99', '$45.50']

Using Regex with BeautifulSoup

# Find elements with text matching a pattern
price_elements = soup.find_all(text=re.compile(r'\$\d+'))

# Find links with specific URL patterns
external_links = soup.find_all('a', href=re.compile(r'^https://'))

# Find elements with class names containing 'product'
product_elements = soup.find_all(class_=re.compile(r'product'))

7. Lambda Expressions for Advanced Filtering

What are Lambda Functions?

Lambda functions are small, anonymous functions that can have any number of arguments but only one
expression [7].

Basic Syntax:

# Regular function
def double(x):
    return x * 2

# Equivalent lambda function
double_lambda = lambda x: x * 2

print(double_lambda(5))  # Output: 10

Lambda with BeautifulSoup

# Find elements with more than 2 attributes
multi_attr = soup.find_all(lambda tag: len(tag.attrs) > 2)

# Find div elements that contain 'quote' in their class
quote_divs = soup.find_all(lambda tag:
    tag.name == 'div' and
    tag.get('class') and
    'quote' in tag.get('class', []))

# Find elements with specific text length
long_text = soup.find_all(lambda tag:
    tag.text and len(tag.text.strip()) > 100)

8. Building Web Crawlers

What is Web Crawling?

Web crawling is the process of automatically browsing through multiple web pages by following links [8]. It's
like having a robot that clicks through a website systematically.

Basic Crawler Components:

1. URL Queue: List of pages to visit

2. Visited Set: Track pages already crawled

3. Parser: Extract data and find new links

4. Politeness: Don't overwhelm the server

Simple Crawler Example

from urllib.parse import urljoin, urlparse
from collections import deque
import time

def simple_crawler(start_url, max_pages=5):
    # Initialize data structures
    to_visit = deque([start_url])
    visited = set()
    scraped_data = []

    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()

        if url in visited:
            continue

        print(f"Crawling: {url}")

        try:
            # Get page content
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'html.parser')

            # Extract data
            title = soup.title.text if soup.title else "No title"
            scraped_data.append({'url': url, 'title': title})

            # Find new links to crawl
            for link in soup.find_all('a', href=True):
                new_url = urljoin(url, link['href'])
                if new_url not in visited:
                    to_visit.append(new_url)

            visited.add(url)
            time.sleep(1)  # Be polite to the server

        except Exception as e:
            print(f"Error crawling {url}: {e}")

    return scraped_data

# Run the crawler
results = simple_crawler("https://quotes.toscrape.com/")
for item in results:
    print(f"Title: {item['title'][:50]}...")

Crawling Within a Single Domain

def same_domain_crawler(start_url, max_pages=10):
    domain = urlparse(start_url).netloc
    to_visit = deque([start_url])
    visited = set()

    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()

        # Skip if already visited
        if url in visited:
            continue

        # Only crawl same domain
        if urlparse(url).netloc != domain:
            continue

        print(f"Crawling same domain: {url}")

        try:
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'html.parser')

            # Extract links from same domain only
            for link in soup.find_all('a', href=True):
                new_url = urljoin(url, link['href'])
                if urlparse(new_url).netloc == domain:
                    to_visit.append(new_url)

            visited.add(url)
            time.sleep(1)

        except Exception as e:
            print(f"Error: {e}")

    print(f"Crawled {len(visited)} pages from domain: {domain}")

9. Best Practices and Ethics

Important Guidelines
1. Respect robots.txt: Check website.com/robots.txt before crawling (see the sketch after this list)

2. Use delays: Add time.sleep(1) between requests

3. Handle errors gracefully: Use try/except blocks

4. Don't overload servers: Limit concurrent requests

5. Respect terms of service: Read and follow website rules
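As a sketch of guideline 1, Python's standard library can read and query a site's robots.txt before any crawling starts (using the practice site from earlier; the user-agent string is just the example one used below):

from urllib import robotparser

# Load the site's robots.txt once, then ask it about individual URLs
rp = robotparser.RobotFileParser()
rp.set_url("https://quotes.toscrape.com/robots.txt")
rp.read()

allowed = rp.can_fetch("Educational Web Scraper 1.0", "https://quotes.toscrape.com/page/2/")
print("Allowed to fetch:", allowed)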

Polite Scraping Example

def polite_scraper(url):
    headers = {
        'User-Agent': 'Educational Web Scraper 1.0'
    }

    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raises an exception for HTTP errors

        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract data here

        time.sleep(1)  # Be polite
        return soup

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

10. Putting It All Together: Complete Project

Project: Quote Collector

Let's build a complete scraper that collects quotes and saves them to a file:

import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_quotes():
    quotes_data = []
    url = "https://quotes.toscrape.com/"

    while url:
        print(f"Scraping: {url}")

        # Get page content
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract quotes from current page
        quotes = soup.find_all('div', class_='quote')

        for quote in quotes:
            text = quote.find('span', class_='text').text
            author = quote.find('small', class_='author').text

            # Extract tags
            tags = [tag.text for tag in quote.find_all('a', class_='tag')]

            quotes_data.append({
                'text': text,
                'author': author,
                'tags': ', '.join(tags)
            })

        # Find next page link
        next_btn = soup.find('li', class_='next')
        if next_btn:
            next_url = next_btn.find('a')['href']
            url = f"https://quotes.toscrape.com{next_url}"
        else:
            url = None

        time.sleep(1)  # Be polite

    return quotes_data

def save_to_csv(data, filename='quotes.csv'):
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['text', 'author', 'tags'])
        writer.writeheader()
        writer.writerows(data)

    print(f"Saved {len(data)} quotes to {filename}")

# Run the complete project
if __name__ == "__main__":
    quotes = scrape_quotes()
    save_to_csv(quotes)
    print("Quote collection complete!")

11. Common Debugging Tips

When Things Go Wrong:

1. Check the HTML source: Use print(soup.prettify()) to see the structure

2. Verify your selectors: Test CSS selectors in browser developer tools

3. Handle missing elements: Use .get() method or check if element exists

4. Check for dynamic content: Some data loads with JavaScript (need Selenium)

5. Add error handling: Always use try/except blocks

Debugging Example:

def debug_scraper(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Debug: Print page structure
    print("Page title:", soup.title.text if soup.title else "No title")
    print("Number of divs:", len(soup.find_all('div')))

    # Debug: Check if specific elements exist
    quotes = soup.find_all('div', class_='quote')
    print(f"Found {len(quotes)} quotes")

    if quotes:
        print("First quote structure:")
        print(quotes[0].prettify()[:200] + "...")
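And a short sketch of tip 3 (handling elements that may be missing), assuming a soup object parsed from the quotes page as in the examples above:

# Check for missing elements instead of letting the scraper crash
quote = soup.find('div', class_='quote')

if quote is not None:
    author_tag = quote.find('small', class_='author')
    author = author_tag.text if author_tag else "Unknown author"

    link = quote.find('a')
    href = link.get('href') if link else None  # .get() returns None instead of raising an error
    print(author, href)
else:
    print("No quote element found on this page")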
