ML Week 6

The document discusses the importance of data collection in machine learning, highlighting common data formats such as CSV, JSON, and XML. It explains web scraping as a method to efficiently gather large amounts of data from websites, detailing the components and types of web scrapers, as well as the popular use of Python for this task. Additionally, it covers data handling techniques, including data description, handling missing values, and data normalization, which are crucial for preparing data for machine learning models.


Why Is Data Collection Important?

In machine learning, data is like fuel. Before you can train a model, you need to collect the right
data from files, databases, or the web.
Common Data Formats
Let’s understand the most popular data formats used in data science.
1. CSV (Comma-Separated Values)
CSV is a simple text format where each row is a record, and values are separated by commas.
Example:
Name, Age, City
Alice, 24, New York
Bob, 30, London
Easy to read and write
Ideal for spreadsheets and tabular data
Load CSV in Python:
import pandas as pd
data = pd.read_csv("data.csv")
2. JSON (JavaScript Object Notation)
JSON stores data as key-value pairs, which makes it great for structured or nested data (like API responses).
Example:
{
"name": "Alice",
"age": 24,
"location": "New York"
}
Works well for hierarchical data
Common in web APIs
Load JSON in Python:
import json
with open("data.json") as f:
data = json.load(f)
Or using pandas:
data = pd.read_json("data.json")
3. XML (eXtensible Markup Language)
XML is similar to HTML, using tags to represent structured data.
Example:
<person>
<name>Alice</name>
<age>24</age>
<location>New York</location>
</person>
Used in web services, legacy systems
More complex to parse than JSON
Parse XML in Python:
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
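To actually read the values, you can take each child element's text from the parsed tree. A minimal sketch continuing the example above (the tag names match the sample <person> record):
name = root.find('name').text          # "Alice"
age = int(root.find('age').text)       # 24
location = root.find('location').text  # "New York"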
Suppose You Want Information from a Website…
Let’s say you want a paragraph about Donald Trump from a website like Wikipedia. You could
simply copy and paste it. Easy, right?
But what if you need a lot of information, say thousands of articles or product listings, for something
like training a machine learning model? Copy-pasting won’t help anymore. It’s slow and boring!
This is where Web Scraping comes in. Instead of doing things manually, web scraping uses
smart tools to automatically collect large amounts of data from websites, fast and efficiently.
Web Scraping:
Web Scraping is a way to automatically collect data from websites. Most of the data you see
online is in HTML format (webpage code). A scraper pulls this data and turns it into something
organized, like an Excel sheet or database, so it can be used for analysis, research, or machine
learning.
You can scrape data using:
 Online tools
 APIs (when websites offer structured access to their data)
 Writing your own code
Some websites like Google, Twitter, and Facebook offer APIs. But if a site doesn’t offer one, or
only shows limited data, then scraping is the best solution.
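When a site does offer an API, the data usually comes back already structured (often JSON), so no HTML parsing is needed. A minimal sketch using the requests library (the endpoint and parameters here are hypothetical; real APIs usually require an API key and enforce rate limits):
import requests
# Hypothetical endpoint and query parameter, shown only for illustration
resp = requests.get("https://api.example.com/articles", params={"q": "juicers"}, timeout=10)
articles = resp.json()  # parsed JSON, ready to use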
How Does Web Scraping Work?
Web scraping usually has two parts:
1. Crawler: This part goes from page to page looking for the data (much like how Google's
crawler explores the internet).
2. Scraper: This part actually extracts the data you want from the page.
Example:
If you're scraping Amazon for juicers, you might want to collect only the product names and
prices, not all the customer reviews. The scraper will go through the HTML code, find what you
asked for, and save it, often in a CSV, Excel, or JSON file.
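As a rough sketch of how a self-written scraper might collect those names and prices in Python (the URL and CSS class names below are hypothetical; a real page's HTML must be inspected first, and the site's terms of service respected):
import csv
import requests
from bs4 import BeautifulSoup

url = "https://example.com/search?q=juicer"   # placeholder URL
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.select("div.product"):       # made-up selector; inspect the real page
    name = item.select_one("span.product-name").get_text(strip=True)
    price = item.select_one("span.price").get_text(strip=True)
    rows.append([name, price])

# Save the extracted fields to a CSV file
with open("juicers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price"])
    writer.writerows(rows)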
Types of Web Scrapers
There are several types, depending on your skill and needs:
1. Self-Built Scrapers
 You write your own code (e.g. in Python)
 Very flexible, but requires programming skills
2. Pre-Built Scrapers
 Ready-made tools you can install and use
 Easier, but less customizable
3. Software Scrapers
 Programs you install on your computer
 More powerful than browser extensions
4. Cloud-Based Scrapers
 Run on online servers
 Don’t slow down your computer, great for big tasks
5. Local Scrapers
 Run on your own PC
 Use your system’s memory and CPU
Why is Python Popular for Web Scraping?
Python is the most-used language for web scraping because:
 It’s easy to learn
 It has powerful libraries made just for scraping
Common Python Libraries:
 BeautifulSoup: Reads and parses HTML easily
 Scrapy: A complete scraping framework for big projects
 Selenium: Automates browsers for scraping JavaScript-heavy websites
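For JavaScript-heavy pages, a minimal Selenium sketch might look like this (assumes Chrome and a matching driver are available; the URL is a placeholder):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                 # Selenium 4 can manage the driver itself
driver.get("https://example.com")           # placeholder URL
# Read text from the rendered page; the tag name is only an illustration
headings = [h.text for h in driver.find_elements(By.TAG_NAME, "h2")]
driver.quit()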
What is Web Scraping Used For?
1. Price Monitoring
Businesses track product prices (their own and their competitors') to set the best prices.
2. Market Research
Companies collect large amounts of data to analyze trends and make smart decisions.
3. News Monitoring
Scraping news sites helps companies track headlines about them or their industry.
4. Sentiment Analysis
Companies scrape data from social media to find out what people are saying about their
products (positive or negative).
5. Email Marketing
Some companies collect email addresses from websites to send promotions (though this must
follow privacy laws!).
Tools Like Smartproxy
Tools like Smartproxy offer ready-to-use scraping solutions:
 Use proxies (different IP addresses) to avoid getting blocked
 Handle even complex JavaScript websites
They offer scrapers for:
 eCommerce
 Social media
 Search engine results (SERPs)
SQL:
1. Saving Web Scraped Data in SQL
Suppose you scraped product data from an e-commerce site (e.g., name, price, category):
CREATE TABLE Products (
id INT PRIMARY KEY,
name VARCHAR(100),
price DECIMAL(10, 2),
category VARCHAR(50)
);
Add a product:
INSERT INTO Products (id, name, price, category)
VALUES (1, 'Juicer 3000', 89.99, 'Kitchen Appliance');
2. Using SQL to Explore Your Data (for ML)
Get all juicers under $100:
SELECT * FROM Products
WHERE price < 100 AND category = 'Kitchen Appliance';
This helps you understand your data before feeding it into a machine learning model.
3. SQL + Feature Engineering
Suppose you have a table of user reviews you scraped:
CREATE TABLE Reviews (
id INT,
product_id INT,
review_text TEXT,
rating INT
);
Now you want to compute the average rating per product as a new feature:
SELECT product_id, AVG(rating) AS avg_rating
FROM Reviews
GROUP BY product_id;
This is an example of using SQL to engineer new features (like average rating) for ML.
4. Collecting Data from Multiple Sources (JOIN)
You have scraped two types of data:
 Products
 Reviews
Let’s join them to build a training dataset:
SELECT Products.name, Products.price, Reviews.rating, Reviews.review_text
FROM Products
JOIN Reviews ON Products.id = Reviews.product_id;
This gives you one clean table combining product info and user reviews — perfect for machine
learning.
5. Exporting Data for ML
After cleaning and joining your data with SQL, you can export it as a CSV for machine learning:
.mode csv
.output ml_dataset.csv
SELECT * FROM Products;
(These dot-commands work in the SQLite command-line shell.)
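Alternatively, you can pull the joined data straight into Python. A minimal sketch with sqlite3 and pandas (the database file name scraped.db is an assumption):
import sqlite3
import pandas as pd

conn = sqlite3.connect("scraped.db")        # assumed SQLite database file
query = """
SELECT Products.name, Products.price, Reviews.rating, Reviews.review_text
FROM Products
JOIN Reviews ON Products.id = Reviews.product_id;
"""
df = pd.read_sql_query(query, conn)
conn.close()
df.to_csv("ml_dataset.csv", index=False)    # ready for a machine learning pipeline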
What is Data Description?
Before training a machine learning model, you must understand your data. This is called data
description or data exploration.
You look at:
 What types of data you have
 How clean or messy the data is
 What transformations are needed
1. Numeric Data
Example:

Age   Salary
25    50000
30    60000
28    NaN

These are numbers. You can:
 Describe them using:
o Mean (average)
o Median
o Min, Max, Std (spread)
 Visualize using histograms or box plots
 Transform using:
o Normalization (0–1 scale)
o Standardization (mean = 0, std = 1)
Example in Python (pandas):
df['Salary'].mean()
df['Salary'] = df['Salary'].fillna(df['Salary'].mean()) # impute missing values with the mean
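A slightly fuller sketch covering the describe/visualize/transform bullets above (assumes the same df; the histogram needs matplotlib installed):
df['Salary'].describe()        # count, mean, std, min, quartiles, max
df['Salary'].hist()            # quick histogram of the distribution

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['Salary_std'] = scaler.fit_transform(df[['Salary']])   # standardization: mean 0, std 1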
2. Text Data

Example:

Review
"Great product"
"Not worth the price"
NaN
Text is unstructured and needs:
 Cleaning: remove punctuation, lower case, stop words
 Tokenization: break into words
 Vectorization: convert to numbers using:
o Bag of Words
o TF-IDF
o Word Embeddings (e.g., Word2Vec)
Example:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(df['Review'].fillna(''))
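A comparable sketch for the TF-IDF option mentioned above, using scikit-learn's TfidfVectorizer (same df assumed):
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english', lowercase=True)
X_tfidf = tfidf.fit_transform(df['Review'].fillna(''))   # sparse matrix of TF-IDF weights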
3. Categorical Data
Example:

Gender   Country
Male     USA
Female   Canada
NaN      Australia

These are labels or categories. You can:
 Describe using value counts
 Transform using:
o Label Encoding (e.g., Male = 0, Female = 1)
o One-Hot Encoding (turn each category into a separate column)
Example:
df['Gender'].fillna('Unknown', inplace=True)
pd.get_dummies(df['Country'])
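To describe the categories before encoding them (same df assumed):
df['Gender'].value_counts()                # frequency of each category
df['Country'].value_counts(dropna=False)   # include missing values in the counts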
4. Handling Missing Values
Missing Values Look Like:
 NaN (Not a Number)
 Null
 Empty cells
Ways to Handle:
1. Remove rows/columns (if too many missing)
2. Fill/Impute:
o For numbers: mean, median
o For categories: most common value
o For time series: forward/backward fill
Example:
df['Age'].fillna(df['Age'].median(), inplace=True)
df['Gender'].fillna(df['Gender'].mode()[0], inplace=True)
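For time-series data, a forward/backward fill sketch (the column name Temperature is only illustrative; the rows must be ordered by time):
df['Temperature'] = df['Temperature'].ffill()   # carry the last known value forward
df['Temperature'] = df['Temperature'].bfill()   # fill any remaining leading gaps backward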
Data Handling Techniques
1. Handling Duplications
Duplicates are exact copies of data rows that appear more than once in your dataset. They can
cause bias and affect your model’s accuracy.
How to handle?
 Find duplicates
 Remove duplicates (keep only one copy)
Example (Python pandas):
# Find duplicates
duplicates = df.duplicated()
# Remove duplicates
df = df.drop_duplicates()
2. Handling Categorical Data
Categorical data means data that represents categories or labels, like Gender (Male/Female) or
Country (USA, India).
How to handle?
 Label Encoding: Convert categories into numbers (e.g., Male = 0, Female = 1)
 One-Hot Encoding: Convert categories into separate binary columns (e.g., Country_USA
= 1, Country_India = 0)
Example (Python pandas):
# Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])
# One-Hot Encoding
df = pd.get_dummies(df, columns=['Country'])
3. Normalizing Values
Normalization means scaling numeric data so that values are between 0 and 1 (or sometimes -1
to 1). This helps many ML algorithms work better.
How to do it?
 Use Min-Max scaling:
x_norm = (x - x_min) / (x_max - x_min)
Example (Python sklearn):
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['Age_normalized'] = scaler.fit_transform(df[['Age']])
4. String Manipulations
Cleaning and changing text data to make it consistent and usable. Examples include:
 Lowercasing
 Removing extra spaces
 Removing punctuation
 Extracting parts of strings
Example (Python pandas):
df['Name'] = df['Name'].str.lower() # lowercase
df['Name'] = df['Name'].str.strip() # remove extra spaces
df['Name'] = df['Name'].str.replace(r'[^\w\s]', '', regex=True) # remove punctuation (regex)
