### QUIZ 2 Instructions
- Fill all the incomplete functions. Strictly follow the function specs.
- Do not copy or plagiarize. IIPE, VIZAG has a very strict policy against plagiarism.
Download and read file
### Download data from google drive. You need not mess with this code.
import requests
def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)
    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)
    save_response_content(response, destination)
def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None
def save_response_content(response, destination):
    CHUNK_SIZE = 32768
    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
if __name__ == "__main__":
    file_id = '1e_Azf9zGvSWsDhM9PP2sfMNKC72-iWAK'
    destination = 'data.txt'
    download_file_from_google_drive(file_id, destination)
with open('data.txt', 'r') as f:
  data_raw = f.readlines()
1. Data preparation
Now the entire data is stored in the list data_raw.
Every line in the file is a different element of the list.
First let us look at the first five elements of the list.
1.1
Write a function that returns the first five elements of the list if the length of the list is greater than or equal to 5, and None otherwise.
def first_five_in_list(l):
  """
  Inputs:
  l: Python list
  Outputs:
  l_5 : python list, first five elements of the list if its length is greater than or equal to 5; None otherwise
  """
  ### Your code here
  return l_5
1.2
def remove_trailing_newlines(s):
  """
  Function that removes all trailing newline characters from the end of a string
  Inputs:
    s : string
  Outputs:
    s_clean : string, string s but without newline characters at the end
  """
  ### Write your code here
  return s_clean
If we apply remove_trailing_newlines to the first element of data_raw, we can see that the newline at the end has disappeared.
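As an illustration of the expected behaviour, here is a sketch using Python's built-in str.rstrip restricted to newline characters (one possible approach, not necessarily the required implementation; the name is illustrative):

```python
def demo_remove_trailing_newlines(s):
    # str.rstrip('\n') removes only newline characters,
    # and only from the right-hand end of the string
    return s.rstrip('\n')

print(demo_remove_trailing_newlines("Go until jurong point\n\n"))  # Go until jurong point
```

Note that newlines in the middle of the string are left untouched; only the trailing ones are removed.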
1.3
But now we need to apply this function to the whole list.
Write a function named mapl that takes two arguments - a function on elements of type t and a list l of elements of type t - applies the function over all elements of the list l, and returns the results as a list.
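For illustration, mapl should behave like Python's built-in map materialised into a list (a toy example of the expected behaviour, not necessarily the required implementation):

```python
# Applying len over a list of strings with the built-in map,
# materialised as a list - mapl(len, l) should produce the same result
result = list(map(len, ['spam', 'ham', 'hello']))
print(result)  # [4, 3, 5]
```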
def mapl(f, l):
  """
  Function that applies f over all elements of l
  Inputs:
    f : function, f takes elements of type t1 and returns elements of type t2
    l : list, list of elements of type t1
  Outputs:
    f_l : list, list of elements of type t2 obtained by applying f over each element of l
  """
  ### Write your code here
  return f_l
Now we can use mapl to apply remove_trailing_newlines to all lines in data_raw
data_clean = mapl(remove_trailing_newlines, data_raw)
First five elements of data_clean look like this:
This is a dataset of text messages. We have to classify them into spam or ham. Ham means non-spam, i.e. relevant text messages. More details can be found here -
http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
You can see that each line starts by specifying whether the message is ham or spam; then there is a tab character, \t, followed by the actual text message.
Now we need to split the lines to extract the two components - data label (ham or spam) and data
sample (the text message).
1.4
Write a function split_at_s that takes two strings - text and s.
It splits the string text into two parts at the first occurrence of s.
Then it wraps both parts in a tuple and returns it.
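For illustration, the expected behaviour can be sketched with str.partition, which splits a string at the first occurrence of a separator and returns a (before, separator, after) triple (one possible approach, not necessarily the required one; the name is illustrative):

```python
def demo_split_at_s(text, s):
    # partition at the first occurrence of s, then drop the separator itself
    before, _, after = text.partition(s)
    return (before, after)

print(demo_split_at_s("ham\tHello there", "\t"))  # ('ham', 'Hello there')
```

Note that only the first occurrence of s matters; later occurrences stay inside the second part.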
def split_at_s(text, s):
  """Function that splits string text into two parts at the first occurenc
e of string s
  Inputs:
    text: string, string to be split
    s : string, string of length 1 at which to split
  Outputs:
    split_text: tuple of size 2, contains text split in two (do not include the string s at which the split occurs in either part)
  """
  ### Write your code here
  return split_text
Python has a very handy feature used to define short functions, called lambda expressions (see the official Python docs).
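For example, a lambda expression binds a small anonymous function to a name, equivalent to a one-line def:

```python
# A lambda expression defines a small anonymous function inline
add_one = lambda x: x + 1

# Equivalent to the usual definition:
def add_one_def(x):
    return x + 1

print(add_one(41), add_one_def(41))  # 42 42
```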
Use lambda expressions and split_at_s to write a function, split_at_tab, that takes only one argument - text - and splits at the first occurrence of the '\t' character. (If you can't understand lambda expressions, just define the function in the usual way.)
### Write your code here
1.5
Now apply the split_at_tab function over the elements of the list data_clean and assign the result to a variable named data_clean2.
#### Write your code here
After splitting at the '\t' character, one data point looks like this -
Now let us remove the punctuation in an sms.
import string
def remove_punctuations_and_lower(text):
  """Function that removes punctuation from a text and converts it to lowercase
  Inputs:
    text: string
  Outputs:
    text_wo_punctuations: string, text without punctuation, lowercased
  """
  return text.translate(str.maketrans("", "", string.punctuation)).lower()
1.6
Now use the function remove_punctuations_and_lower to remove punctuation from the text part of all of the tuples in data_clean2 and assign the result to a variable named dataset.
### Write your code here
First 5 elements of dataset look like this now.
Now let us count the number of occurrences of ham and spam in our dataset.
1.7
Write a function counter that takes two arguments -
- a list l of elements of type t
- a function f: t → u (meaning f takes an argument of type t and returns values of type u)
counter returns a dictionary whose keys are u1, u2, … - the unique values of type u obtained by applying f over elements of l.
The value corresponding to a key, say u1, is the number of times that key is obtained when we apply f over elements of l.
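For illustration, here is a straightforward sketch of such a counting function with a toy example (one possible approach; the name is illustrative):

```python
def demo_counter(l, f):
    # tally how many times each value f(x) occurs across x in l
    count_dict = {}
    for x in l:
        key = f(x)
        count_dict[key] = count_dict.get(key, 0) + 1
    return count_dict

# len maps 'hi' -> 2, 'to' -> 2, 'you' -> 3, so the counts are:
print(demo_counter(['hi', 'to', 'you'], len))  # {2: 2, 3: 1}
```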
def counter(l, f):
  """
  Function that returns a dictionary of counts of unique values obtained by applying f over elements of l
  Inputs:
    l: list; list of elements of type t
    f: function; f takes arguments of type t and returns values of type u
  Outputs:
    count_dict: dictionary; keys are elements of type u, values are ints
  """
  ### Write your code here
  return count_dict
1.8
Write a function named aux_func that can be passed to counter along with the list dataset to
get a dictionary containing counts of ham and spam
#### Write your code here
The counts of ham and spam as we can see are {'ham': 4827, 'spam': 747}
Now let us split our dataset into training and test sets. We'll first shuffle the elements of the
dataset, then we'll use 80% of data for training and 20% for testing.
1.9
Write a function that takes a list, randomly shuffles it and then returns it.
Hint: Use the random library of python - https://docs.python.org/3/library/random.html
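One thing worth knowing from the random library: random.shuffle shuffles a list in place and returns None, so to return a shuffled list you can shuffle a copy (a sketch, not necessarily the required implementation):

```python
import random

l = [1, 2, 3, 4, 5]
l_shuffled = l[:]           # shallow copy, so l itself is untouched
random.shuffle(l_shuffled)  # shuffles in place, returns None

# Same elements, possibly in a new order; the original list is unchanged
print(sorted(l_shuffled) == l, l == [1, 2, 3, 4, 5])  # True True
```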
def random_shuffle(l):
  """Function that returns a randomly shuffled list
  Inputs:
    l: list
  Outputs:
    l_shuffled: list, contains same elements as l but randomly shuffled
  """
  ### Write your code here
  return l_shuffled
1.10
Now split the shuffled list. Take 80% (4459) of the samples and assign them to a variable called data_train. Put the rest in a variable called data_test.
### Write your code here
2. Data Modeling
We shall use Naive Bayes for modelling our classifier. You can read about Naive Bayes from
here (https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes). But you
don't actually need to read it, because we are going to move step by step in building this
classifier.
First we need to find the probabilities P(wi|C)
We read P(A|B) as probability of event A, given event B.
P(wi|C) is the probability that word wi occurs in the sms given that the sms belongs to class C, where C can be either spam or ham.
But we will be finding P~(wi|C) which is the smoothed probability function to take care of
words with 0 probabilities that may cause problems.
P~(wi|C) = (number of occurrences of wi in all samples of class C + 1) / (total number of words in all samples of class C + vocabulary size)
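As a worked example with made-up numbers: suppose a word wi occurs 30 times across all spam samples, spam samples contain 1000 words in total, and the vocabulary has 500 words (all counts hypothetical, for illustration only):

```python
# All counts below are hypothetical, for illustration only
count_wi_in_spam = 30    # occurrences of wi in all spam samples
total_spam_words = 1000  # total number of words in all spam samples
vocab_size = 500         # vocabulary size

# Smoothed (add-one / Laplace) probability
p_wi_given_spam = (count_wi_in_spam + 1) / (total_spam_words + vocab_size)
print(p_wi_given_spam)  # 31/1500 ≈ 0.0207
```

The +1 in the numerator guarantees that a word seen zero times in a class still gets a small non-zero probability.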
2.1
Find the vocabulary - the list of unique words in all smses of data_train - and assign it to the variable vocab.
### Write your code here
2.2
For every word wi in vocab, find the count (total number of occurrences) of wi in all smses of type spam. Put these counts in a dictionary and assign it to a variable named dict_spam, where the key is the word wi and the value is the count.
In a similar way, create a variable called dict_ham which contains counts of each word in the vocabulary in smses of type ham. (This is only w.r.t. samples in data_train.)
### Write your code here
2.3
For every word wi in vocab, find the smoothed probability P~(wi|spam) and put it in a dictionary named dict_prob_spam. In a similar way, define the dictionary dict_prob_ham which contains the smoothed probabilities P~(wi|ham).
### Write your code here
3. Prediction
We need to test our model on data_test. For each sample of data_test, the prediction procedure is as follows:
- For all words common to the sample and vocabulary, find spam_score and ham_score.
- If spam_score is higher than ham_score, then we predict the sample to be spam, and vice versa.
- spam_score = P(spam) * P~(w1|spam) * P~(w2|spam) * … where w1, w2, … are words which occur both in the test sms and the vocabulary.
- Similarly, ham_score = P(ham) * P~(w1|ham) * P~(w2|ham) * … where w1, w2, … are words which occur both in the test sms and the vocabulary.
- Here P(spam) = (number of samples of type spam in training set) / (total number of samples in training set), and similarly P(ham) = (number of samples of type ham in training set) / (total number of samples in training set).

(Note: the above is the prediction procedure for a single sample in data_test.)
Write a function predict which does this.
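A toy illustration of the scoring rule, with made-up numbers (the prior and the smoothed word probabilities below are all hypothetical):

```python
# Hypothetical prior and smoothed word probabilities, for illustration only
p_spam = 0.15
prob_given_spam = {'free': 0.02, 'win': 0.01}

# Words of the test sms that also appear in the vocabulary
common_words = ['free', 'win']

# Multiply the prior by the smoothed probability of each common word
spam_score = p_spam
for w in common_words:
    spam_score *= prob_given_spam[w]

print(spam_score)  # 0.15 * 0.02 * 0.01, i.e. about 3e-05
```

ham_score is computed the same way with P(ham) and dict_prob_ham, and the larger of the two scores decides the predicted label.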
3.1
def predict(text, dict_prob_spam, dict_prob_ham, data_train):
  """Function which predicts the label of the sms
  Inputs:
    text: string, sms
    dict_prob_spam: dictionary, contains dict_prob_spam as defined above
    dict_prob_ham: dictionary, contains dict_prob_ham as defined above
    data_train: list, list of tuples of type (label, sms), contains training dataset
  Outputs:
    prediction: string, one of two strings - either 'spam' or 'ham'
  """
  ### Write your code here
  return prediction
3.2
Now find the accuracy of the model. Apply the function predict to all the samples in data_test.
accuracy = (number of correct predictions) / (size of test set)
Write the function accuracy, which applies predict to all samples in data_test and returns the accuracy.
def accuracy(data_test, dict_prob_spam, dict_prob_ham, data_train):
  """Function which finds accuracy of model
  Inputs:
    data_test: list, contains tuples of data (label, sms)
    dict_prob_spam: dictionary, contains dict_prob_spam as defined above
    dict_prob_ham: dictionary, contains dict_prob_ham as defined above
    data_train: list, list of tuples of type (label, sms), contains training dataset
  Outputs:
    accuracy: float, value of accuracy
  """
  ### Write your code here
  return accuracy