### QUIZ 2 Instructions
- Fill all the incomplete functions. Strictly follow the function specs.
- Do not copy or plagiarize. IIPE, VIZAG has a very strict policy against plagiarism.
Download and read file
### Download data from google drive. You need not mess with this code.
import requests
def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)
    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)
    save_response_content(response, destination)
def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None
def save_response_content(response, destination):
    CHUNK_SIZE = 32768
    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
if __name__ == "__main__":
    file_id = '1e_Azf9zGvSWsDhM9PP2sfMNKC72-iWAK'
    destination = 'data.txt'
    download_file_from_google_drive(file_id, destination)
with open('data.txt', 'r') as f:
  data_raw = f.readlines()
1. Data preparation
Now the entire data is stored in the list data_raw.
Every line in the file is a different element of the list.
First let us look at the first five elements of the list.
1.1
Write a function that returns the first five elements of the list if the length of the list is greater than or equal to 5, and None otherwise.
def first_five_in_list(l):
  """
  Inputs:
  l: Python list
  Outputs:
  l_5 : python list, first five elements of the list if its length is greater than or equal to 5; None otherwise
  """
  ### Your code here
  return l_5
1.2
def remove_trailing_newlines(s):
  """
  Function that removes all trailing newline characters from the end of a string
  Inputs:
    s : string
  Outputs:
    s_clean : string, string s but without newline characters at the end
  """
  ### Write your code here
  return s_clean
If we apply remove_trailing_newlines to the first element of data_raw, we can see that the newline at the end has disappeared.
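As an illustration of the expected behaviour, here is a sketch using Python's built-in str.rstrip restricted to newline characters (one possible approach, not necessarily the required implementation; the name is illustrative):

```python
def demo_remove_trailing_newlines(s):
    # str.rstrip('\n') removes only newline characters,
    # and only from the right-hand end of the string
    return s.rstrip('\n')

print(demo_remove_trailing_newlines("Go until jurong point\n\n"))  # Go until jurong point
```

Note that newlines in the middle of the string are left untouched; only the trailing ones are removed.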
1.3
But now we need to apply this function to the whole list.
Write a function named mapl that takes two arguments - a function on elements of type t and a list l of elements of type t - applies the function over all elements of the list l, and returns the results as a list.
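For illustration, mapl should behave like Python's built-in map materialised into a list (a toy example of the expected behaviour, not necessarily the required implementation):

```python
# Applying len over a list of strings with the built-in map,
# materialised as a list - mapl(len, l) should produce the same result
result = list(map(len, ['spam', 'ham', 'hello']))
print(result)  # [4, 3, 5]
```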
def mapl(f, l):
  """
  Function that applies f over all elements of l
  Inputs:
    f : function, f takes elements of type t1 and returns elements of type t2
    l : list, list of elements of type t1
  Outputs:
    f_l : list, list of elements of type t2 obtained by applying f over each element of l
  """
  ### Write your code here
  return f_l
Now we can use mapl to apply remove_trailing_newlines to all lines in data_raw
data_clean = mapl(remove_trailing_newlines, data_raw)
First five elements of data_clean look like this:
This is a dataset of text messages. We have to classify them into spam or ham. Ham means non-spam, i.e. relevant text messages. More details can be found here -
http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
You can see that each line starts by specifying whether the message is ham or spam; then there is a tab character, \t, followed by the actual text message.
Now we need to split the lines to extract the two components - data label (ham or spam) and data
sample (the text message).
1.4
Write a function split_at_s that takes two strings - text and s.
It splits the string text into two parts at the first occurrence of s.
Then it wraps both parts in a tuple and returns it.
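For illustration, the expected behaviour can be sketched with str.partition, which splits a string at the first occurrence of a separator and returns a (before, separator, after) triple (one possible approach, not necessarily the required one; the name is illustrative):

```python
def demo_split_at_s(text, s):
    # partition at the first occurrence of s, then drop the separator itself
    before, _, after = text.partition(s)
    return (before, after)

print(demo_split_at_s("ham\tHello there", "\t"))  # ('ham', 'Hello there')
```

Note that only the first occurrence of s matters; later occurrences stay inside the second part.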
def split_at_s(text, s):
  """Function that splits string text into two parts at the first occurenc
e of string s
  Inputs:
    text: string, string to be split
    s : string, string of length 1 at which to split
  Outputs:
    split_text: tuple of size 2, contains text split in two (do not include the string s at which the split occurs in either part)
  """
  ### Write your code here
  return split_text
Python has a very handy feature used to define short functions, called lambda expressions (see the official Python docs).
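For example, a lambda expression binds a small anonymous function to a name, equivalent to a one-line def:

```python
# A lambda expression defines a small anonymous function inline
add_one = lambda x: x + 1

# Equivalent to the usual definition:
def add_one_def(x):
    return x + 1

print(add_one(41), add_one_def(41))  # 42 42
```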
Use lambda expressions and split_at_s to write a function, split_at_tab, that takes only one argument - text - and splits at the first occurrence of the '\t' character. (If you can't understand lambda expressions, just define the function in the usual way.)
### Write your code here
1.5
Now apply the split_at_tab function over the elements of the list data_clean and assign the result to a variable named data_clean2.
#### Write your code here
After splitting at the '\t' character, one data point looks like this -
Now let us remove the punctuation in an sms.
import string
def remove_punctuations_and_lower(text):
  """Function that removes punctuation from a text and converts it to lowercase
  Inputs:
    text: string
  Outputs:
    text_wo_punctuations: string, text without punctuation, lowercased
  """
  return text.translate(str.maketrans("", "", string.punctuation)).lower()
1.6
Now use the function remove_punctuations_and_lower to remove punctuation from the text part of all of the tuples in data_clean2 and assign the result to a variable named dataset.
### Write your code here
First 5 elements of dataset look like this now.
Now let us count the number of occurrences of ham and spam in our dataset.
1.7
Write a function counter that takes two arguments -
- a list l of elements of type t
- a function f: t → u (meaning f takes an argument of type t and returns values of type u)
counter returns a dictionary whose keys are u1, u2, … - the unique values of type u obtained by applying f over elements of l.
The value corresponding to a key, say u1, is the number of times that key is obtained when we apply f over elements of l.
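For illustration, here is a straightforward sketch of such a counting function with a toy example (one possible approach; the name is illustrative):

```python
def demo_counter(l, f):
    # tally how many times each value f(x) occurs across x in l
    count_dict = {}
    for x in l:
        key = f(x)
        count_dict[key] = count_dict.get(key, 0) + 1
    return count_dict

# len maps 'hi' -> 2, 'to' -> 2, 'you' -> 3, so the counts are:
print(demo_counter(['hi', 'to', 'you'], len))  # {2: 2, 3: 1}
```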
def counter(l, f):
  """
  Function that returns a dictionary of counts of unique values obtained by applying f over elements of l
  Inputs:
    l: list; list of elements of type t
    f: function; f takes arguments of type t and returns values of type u
  Outputs:
    count_dict: dictionary; keys are elements of type u, values are ints
  """
  ### Write your code here
  return count_dict
1.8
Write a function named aux_func that can be passed to counter along with the list dataset to
get a dictionary containing counts of ham and spam
#### Write your code here
The counts of ham and spam as we can see are {'ham': 4827, 'spam': 747}
Now let us split our dataset into training and test sets. We'll first shuffle the elements of the
dataset, then we'll use 80% of data for training and 20% for testing.
1.9
Write a function that takes a list, randomly shuffles it and then returns it.
Hint: Use the random library of python - https://docs.python.org/3/library/random.html
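One thing worth knowing from the random library: random.shuffle shuffles a list in place and returns None, so to return a shuffled list you can shuffle a copy (a sketch, not necessarily the required implementation):

```python
import random

l = [1, 2, 3, 4, 5]
l_shuffled = l[:]           # shallow copy, so l itself is untouched
random.shuffle(l_shuffled)  # shuffles in place, returns None

# Same elements, possibly in a new order; the original list is unchanged
print(sorted(l_shuffled) == l, l == [1, 2, 3, 4, 5])  # True True
```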
def random_shuffle(l):
  """Function that returns a randomly shuffled list
  Inputs:
    l: list
  Outputs:
    l_shuffled: list, contains same elements as l but randomly shuffled
  """
  ### Write your code here
  return l_shuffled
1.10
Now split the shuffled list. Take 80% (4459) of the samples and assign them to a variable called data_train. Put the rest in a variable called data_test.
### Write your code here
2. Data Modeling
We shall use Naive Bayes for modelling our classifier. You can read about Naive Bayes from
here (https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes). But you
don't actually need to read it, because we are going to move step by step in building this
classifier.
First we need to find the probabilities P(wi|C)
We read P(A|B) as probability of event A, given event B.
P(wi|C) is the probability that word wi occurs in the sms given that the sms belongs to class C, where C can be either spam or ham.
But we will be finding P~(wi|C) which is the smoothed probability function to take care of
words with 0 probabilities that may cause problems.
P~(wi|C) = (number of occurrences of wi in all samples of class C + 1) / (total number of words in all samples of class C + vocabulary size)
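As a worked example with made-up numbers: suppose a word wi occurs 30 times across all spam samples, spam samples contain 1000 words in total, and the vocabulary has 500 words (all counts hypothetical, for illustration only):

```python
# All counts below are hypothetical, for illustration only
count_wi_in_spam = 30    # occurrences of wi in all spam samples
total_spam_words = 1000  # total number of words in all spam samples
vocab_size = 500         # vocabulary size

# Smoothed (add-one / Laplace) probability
p_wi_given_spam = (count_wi_in_spam + 1) / (total_spam_words + vocab_size)
print(p_wi_given_spam)  # 31/1500 ≈ 0.0207
```

The +1 in the numerator guarantees that a word seen zero times in a class still gets a small non-zero probability.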
2.1
Find the vocabulary - the list of unique words in all smses of data_train - and assign it to the variable vocab.
### Write your code here
2.2
For every word wi in vocab, find the count (total number of occurrences) of wi in all smses of type spam. Put these counts in a dictionary and assign it to a variable named dict_spam, where the key is the word wi and the value is the count.
In a similar way, create a variable called dict_ham which contains counts of each word in the vocabulary in smses of type ham. (This is only w.r.t. samples in data_train.)
### Write your code here
2.3
For every word wi in vocab, find the smoothed probability P~(wi|spam) and put it in a dictionary named dict_prob_spam. In a similar way, define the dictionary dict_prob_ham which contains the smoothed probabilities P~(wi|ham).
### Write your code here
3. Prediction
We need to test our model on data_test. For each sample of data_test, the prediction procedure is as follows:
- For all words common to the sample and vocabulary, find spam_score and ham_score.
- If spam_score is higher than ham_score, then we predict the sample to be spam, and vice versa.
- spam_score = P(spam) * P~(w1|spam) * P~(w2|spam) * … where w1, w2, … are words which occur both in the test sms and the vocabulary.
- Similarly, ham_score = P(ham) * P~(w1|ham) * P~(w2|ham) * … where w1, w2, … are words which occur both in the test sms and the vocabulary.
- Here P(spam) = (number of samples of type spam in training set) / (total number of samples in training set), and similarly P(ham) = (number of samples of type ham in training set) / (total number of samples in training set).

(Note: the above is the prediction procedure for a single sample in data_test.)
Write a function predict which does this.
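A toy illustration of the scoring rule, with made-up numbers (the prior and the smoothed word probabilities below are all hypothetical):

```python
# Hypothetical prior and smoothed word probabilities, for illustration only
p_spam = 0.15
prob_given_spam = {'free': 0.02, 'win': 0.01}

# Words of the test sms that also appear in the vocabulary
common_words = ['free', 'win']

# Multiply the prior by the smoothed probability of each common word
spam_score = p_spam
for w in common_words:
    spam_score *= prob_given_spam[w]

print(spam_score)  # 0.15 * 0.02 * 0.01, i.e. about 3e-05
```

ham_score is computed the same way with P(ham) and dict_prob_ham, and the larger of the two scores decides the predicted label.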
3.1
def predict(text, dict_prob_spam, dict_prob_ham, data_train):
  """Function which predicts the label of the sms
  Inputs:
    text: string, sms
    dict_prob_spam: dictionary, contains dict_prob_spam as defined above
    dict_prob_ham: dictionary, contains dict_prob_ham as defined above
    data_train: list, list of tuples of type (label, sms), contains training dataset
  Outputs:
    prediction: string, one of two strings - either 'spam' or 'ham'
  """
  ### Write your code here
  return prediction
3.2
Now find the accuracy of the model. Apply the function predict to all the samples in data_test.
accuracy = (number of correct predictions) / (size of test set)
Write the function accuracy, which applies predict to all samples in data_test and returns the accuracy.
def accuracy(data_test, dict_prob_spam, dict_prob_ham, data_train):
  """Function which finds accuracy of model
  Inputs:
    data_test: list, contains tuples of data (label, sms)
    dict_prob_spam: dictionary, contains dict_prob_spam as defined above
    dict_prob_ham: dictionary, contains dict_prob_ham as defined above
    data_train: list, list of tuples of type (label, sms), contains training dataset
  Outputs:
    accuracy: float, value of accuracy
  """
  ### Write your code here
  return accuracy