ML Week 6

The document discusses the importance of data collection in machine learning, highlighting common data formats such as CSV, JSON, and XML. It explains web scraping as a method to efficiently gather large amounts of data from websites, detailing the components and types of web scrapers, as well as the popular use of Python for this task. Additionally, it covers data handling techniques, including data description, handling missing values, and data normalization, which are crucial for preparing data for machine learning models.


Why Is Data Collection Important?

In machine learning, data is like fuel. Before you can train a model, you need to collect the right
data from files, databases, or the web.
Common Data Formats
Let’s understand the most popular data formats used in data science.
1. CSV (Comma-Separated Values)
CSV is a simple text format where each row is a record, and values are separated by commas.
Example:
Name, Age, City
Alice, 24, New York
Bob, 30, London
Easy to read and write
Ideal for spreadsheets and tabular data
Load CSV in Python:
import pandas as pd
data = pd.read_csv("data.csv")
2. JSON (JavaScript Object Notation)
JSON stores data as key-value pairs, which makes it great for structured or nested data (like API responses).
Example:
{
"name": "Alice",
"age": 24,
"location": "New York"
}
Works well for hierarchical data
Common in web APIs
Load JSON in Python:
import json
with open("data.json") as f:
data = json.load(f)
Or using pandas:
data = pd.read_json("data.json")
3. XML (eXtensible Markup Language)
XML is similar to HTML, using tags to represent structured data.
Example:
<person>
<name>Alice</name>
<age>24</age>
<location>New York</location>
</person>
Used in web services, legacy systems
More complex to parse than JSON
Parse XML in Python:
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
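To actually read the values, you can take each child element's text from the parsed tree. A minimal sketch continuing the example above (the tag names match the sample <person> record):
name = root.find('name').text          # "Alice"
age = int(root.find('age').text)       # 24
location = root.find('location').text  # "New York"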
Suppose You Want Information from a Website…
Let’s say you want a paragraph about Donald Trump from a website like Wikipedia. You could
simply copy and paste it. Easy, right?
But what if you need a lot of information, say thousands of articles or product listings, for something
like training a machine learning model? Copy-pasting won’t help anymore. It’s slow and boring!
This is where Web Scraping comes in. Instead of doing things manually, web scraping uses
smart tools to automatically collect large amounts of data from websites, fast and efficiently.
Web Scraping:
Web Scraping is a way to automatically collect data from websites. Most of the data you see
online is in HTML format (webpage code). A scraper pulls this data and turns it into something
organized, like an Excel sheet or database, so it can be used for analysis, research, or machine
learning.
You can scrape data using:
 Online tools
 APIs (when websites offer structured access to their data)
 Writing your own code
Some websites like Google, Twitter, and Facebook offer APIs. But if a site doesn’t offer one, or
only shows limited data, then scraping is the best solution.
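When a site does offer an API, the data usually comes back already structured (often JSON), so no HTML parsing is needed. A minimal sketch using the requests library (the endpoint and parameters here are hypothetical; real APIs usually require an API key and enforce rate limits):
import requests
# Hypothetical endpoint and query parameter, shown only for illustration
resp = requests.get("https://api.example.com/articles", params={"q": "juicers"}, timeout=10)
articles = resp.json()  # parsed JSON, ready to use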
How Does Web Scraping Work?
Web scraping usually has two parts:
1. Crawler: This part goes from page to page looking for the data (much like how Google's
crawler explores the internet).
2. Scraper: This part actually extracts the data you want from the page.
Example:
If you're scraping Amazon for juicers, you might want to collect only the product names and
prices, not all the customer reviews. The scraper will go through the HTML code, find what you
asked for, and save it, often in a CSV, Excel, or JSON file.
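As a rough sketch of how a self-written scraper might collect those names and prices in Python (the URL and CSS class names below are hypothetical; a real page's HTML must be inspected first, and the site's terms of service respected):
import csv
import requests
from bs4 import BeautifulSoup

url = "https://example.com/search?q=juicer"   # placeholder URL
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.select("div.product"):       # made-up selector; inspect the real page
    name = item.select_one("span.product-name").get_text(strip=True)
    price = item.select_one("span.price").get_text(strip=True)
    rows.append([name, price])

# Save the extracted fields to a CSV file
with open("juicers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price"])
    writer.writerows(rows)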
Types of Web Scrapers
There are several types, depending on your skill and needs:
1. Self-Built Scrapers
 You write your own code (e.g. in Python)
 Very flexible, but requires programming skills
2. Pre-Built Scrapers
 Ready-made tools you can install and use
 Easier, but less customizable
3. Software Scrapers
 Programs you install on your computer
 More powerful than browser extensions
4. Cloud-Based Scrapers
 Run on online servers
 Don’t slow down your computer, great for big tasks
5. Local Scrapers
 Run on your own PC
 Use your system’s memory and CPU
Why is Python Popular for Web Scraping?
Python is the most-used language for web scraping because:
 It’s easy to learn
 It has powerful libraries made just for scraping
Common Python Libraries:
 BeautifulSoup: Reads and parses HTML easily
 Scrapy: A complete scraping framework for big projects
 Selenium: Automates browsers for scraping JavaScript-heavy websites
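For JavaScript-heavy pages, a minimal Selenium sketch might look like this (assumes Chrome and a matching driver are available; the URL is a placeholder):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                 # Selenium 4 can manage the driver itself
driver.get("https://example.com")           # placeholder URL
# Read text from the rendered page; the tag name is only an illustration
headings = [h.text for h in driver.find_elements(By.TAG_NAME, "h2")]
driver.quit()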
What is Web Scraping Used For?
1. Price Monitoring
Businesses track product prices (their own and their competitors') to set the best prices.
2. Market Research
Companies collect large amounts of data to analyze trends and make smart decisions.
3. News Monitoring
Scraping news sites helps companies track headlines about them or their industry.
4. Sentiment Analysis
Companies scrape data from social media to find out what people are saying about their
products (positive or negative).
5. Email Marketing
Some companies collect email addresses from websites to send promotions (though this must
follow privacy laws!).
Tools Like Smartproxy
Tools like Smartproxy offer ready-to-use scraping solutions:
 Use proxies (different IP addresses) to avoid getting blocked
 Handle even complex JavaScript websites
They offer scrapers for:
 eCommerce
 Social media
 Search engine results (SERPs)
SQL:
1. Saving Web Scraped Data in SQL
Suppose you scraped product data from an e-commerce site (e.g., name, price, category):
CREATE TABLE Products (
id INT PRIMARY KEY,
name VARCHAR(100),
price DECIMAL(10, 2),
category VARCHAR(50)
);
Add a product:
INSERT INTO Products (id, name, price, category)
VALUES (1, 'Juicer 3000', 89.99, 'Kitchen Appliance');
2. Using SQL to Explore Your Data (for ML)
Get all juicers under $100:
SELECT * FROM Products
WHERE price < 100 AND category = 'Kitchen Appliance';
This helps you understand your data before feeding it into a machine learning model.
3. SQL + Feature Engineering
Suppose you have a table of user reviews you scraped:
CREATE TABLE Reviews (
id INT,
product_id INT,
review_text TEXT,
rating INT
);
Now you want to compute the average rating per product as a new feature:
SELECT product_id, AVG(rating) AS avg_rating
FROM Reviews
GROUP BY product_id;
This is an example of using SQL to engineer new features (like average rating) for ML.
4. Collecting Data from Multiple Sources (JOIN)
You have scraped two types of data:
 Products
 Reviews
Let’s join them to build a training dataset:
SELECT Products.name, Products.price, Reviews.rating, Reviews.review_text
FROM Products
JOIN Reviews ON Products.id = Reviews.product_id;
This gives you one clean table combining product info and user reviews — perfect for machine
learning.
5. Exporting Data for ML
After cleaning and joining your data with SQL, you can export it as a CSV for machine learning:
.mode csv
.output ml_dataset.csv
SELECT * FROM Products;
(These dot-commands work in the SQLite command-line shell.)
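Alternatively, you can pull the joined data straight into Python. A minimal sketch with sqlite3 and pandas (the database file name scraped.db is an assumption):
import sqlite3
import pandas as pd

conn = sqlite3.connect("scraped.db")        # assumed SQLite database file
query = """
SELECT Products.name, Products.price, Reviews.rating, Reviews.review_text
FROM Products
JOIN Reviews ON Products.id = Reviews.product_id;
"""
df = pd.read_sql_query(query, conn)
conn.close()
df.to_csv("ml_dataset.csv", index=False)    # ready for a machine learning pipeline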
What is Data Description?
Before training a machine learning model, you must understand your data. This is called data
description or data exploration.
You look at:
 What types of data you have
 How clean or messy the data is
 What transformations are needed
1. Numeric Data
Example:

Age   Salary
25    50000
30    60000
28    NaN

These are numbers. You can:
 Describe them using:
o Mean (average)
o Median
o Min, Max, Std (spread)
 Visualize using histograms or box plots
 Transform using:
o Normalization (0–1 scale)
o Standardization (mean = 0, std = 1)
Example in Python (pandas):
df['Salary'].mean()
df['Salary'] = df['Salary'].fillna(df['Salary'].mean()) # impute missing values with the mean
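A slightly fuller sketch covering the describe/visualize/transform bullets above (assumes the same df; the histogram needs matplotlib installed):
df['Salary'].describe()        # count, mean, std, min, quartiles, max
df['Salary'].hist()            # quick histogram of the distribution

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['Salary_std'] = scaler.fit_transform(df[['Salary']])   # standardization: mean 0, std 1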
2. Text Data

Example:

Review
"Great product"
"Not worth the price"
NaN
Text is unstructured and needs:
 Cleaning: remove punctuation, lower case, stop words
 Tokenization: break into words
 Vectorization: convert to numbers using:
o Bag of Words
o TF-IDF
o Word Embeddings (e.g., Word2Vec)
Example:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(df['Review'].fillna(''))
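A comparable sketch for the TF-IDF option mentioned above, using scikit-learn's TfidfVectorizer (same df assumed):
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english', lowercase=True)
X_tfidf = tfidf.fit_transform(df['Review'].fillna(''))   # sparse matrix of TF-IDF weights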
3. Categorical Data
Example:

Gender   Country
Male     USA
Female   Canada
NaN      Australia

These are labels or categories. You can:
 Describe using value counts
 Transform using:
o Label Encoding (e.g., Male = 0, Female = 1)
o One-Hot Encoding (turn each category into a separate column)
Example:
df['Gender'].fillna('Unknown', inplace=True)
pd.get_dummies(df['Country'])
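To describe the categories before encoding them (same df assumed):
df['Gender'].value_counts()                # frequency of each category
df['Country'].value_counts(dropna=False)   # include missing values in the counts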
4. Handling Missing Values
Missing Values Look Like:
 NaN (Not a Number)
 Null
 Empty cells
Ways to Handle:
1. Remove rows/columns (if too many missing)
2. Fill/Impute:
o For numbers: mean, median
o For categories: most common value
o For time series: forward/backward fill
Example:
df['Age'].fillna(df['Age'].median(), inplace=True)
df['Gender'].fillna(df['Gender'].mode()[0], inplace=True)
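For time-series data, a forward/backward fill sketch (the column name Temperature is only illustrative; the rows must be ordered by time):
df['Temperature'] = df['Temperature'].ffill()   # carry the last known value forward
df['Temperature'] = df['Temperature'].bfill()   # fill any remaining leading gaps backward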
Data Handling Techniques
1. Handling Duplications
Duplicates are exact copies of data rows that appear more than once in your dataset. They can
cause bias and affect your model’s accuracy.
How to handle?
 Find duplicates
 Remove duplicates (keep only one copy)
Example (Python pandas):
# Find duplicates
duplicates = df.duplicated()
# Remove duplicates
df = df.drop_duplicates()
2. Handling Categorical Data
Categorical data means data that represents categories or labels, like Gender (Male/Female) or
Country (USA, India).
How to handle?
 Label Encoding: Convert categories into numbers (e.g., Male = 0, Female = 1)
 One-Hot Encoding: Convert categories into separate binary columns (e.g., Country_USA
= 1, Country_India = 0)
Example (Python pandas):
# Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])
# One-Hot Encoding
df = pd.get_dummies(df, columns=['Country'])
3. Normalizing Values
Normalization means scaling numeric data so that values are between 0 and 1 (or sometimes -1
to 1). This helps many ML algorithms work better.
How to do it?
 Use Min-Max scaling:
x_norm = (x - x_min) / (x_max - x_min)
Example (Python sklearn):
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['Age_normalized'] = scaler.fit_transform(df[['Age']])
4. String Manipulations
Cleaning and changing text data to make it consistent and usable. Examples include:
 Lowercasing
 Removing extra spaces
 Removing punctuation
 Extracting parts of strings
Example (Python pandas):
df['Name'] = df['Name'].str.lower() # lowercase
df['Name'] = df['Name'].str.strip() # remove extra spaces
df['Name'] = df['Name'].str.replace(r'[^\w\s]', '', regex=True) # remove punctuation (regex)
