AML Unit-3 Material
Semi-Supervised Learning:
    Intuitively, one may imagine the three types of learning algorithms as Supervised learning
    where a student is under the supervision of a teacher at both home and school, Unsupervised
    learning where a student has to figure out a concept himself and Semi-Supervised learning
    where a teacher teaches a few concepts in class and gives questions as homework which are
    based on similar concepts.
• Text classification: In text classification, the goal is to classify a given text into one or more predefined categories. Semi-supervised learning can be used to train a text classification model using a small amount of labeled data and a large amount of unlabeled text data.
• Image classification: In image classification, the goal is to classify a given image into one or more predefined categories. Semi-supervised learning can be used to train an image classification model using a small amount of labeled data and a large amount of unlabeled image data.
• Anomaly detection: In anomaly detection, the goal is to detect patterns or observations that are unusual or different from the norm.
Real-World Applications of Semi-Supervised Learning
1. Speech Analysis: Since labeling audio files is a very labour-intensive task, semi-supervised learning is a natural approach to this problem.
2. Internet Content Classification: Labeling each webpage is an impractical and infeasible process, so semi-supervised learning algorithms are used instead. Even the Google search algorithm uses a variant of semi-supervised learning to rank the relevance of a webpage for a given query.
3. Protein Sequence Classification: Since DNA strands are typically very large, semi-supervised learning has become increasingly important in this field.
• The most basic disadvantage of any Supervised Learning algorithm is that the dataset has to be hand-labeled either by a Machine Learning Engineer or a Data Scientist. This is a very costly process, especially when dealing with large volumes of data.
• The most basic disadvantage of any Unsupervised Learning is that its application spectrum is limited.
Advantages of Semi-Supervised Learning
• It is simple to comprehend.
• Semi-supervised learning is powerful when labels are limited and unlabeled data is plentiful.
• Your model's performance and generalization can be enhanced: without spending time and money labeling tens of thousands of additional photos, your model gets exposure to situations it might see during deployment.
• In innumerable circumstances, labeled data is not easily accessible. With only a small portion of the data labeled, semi-supervised learning can complete typical tasks with state-of-the-art outcomes.
• From crawling engines and information aggregation systems to image and speech recognition, semi-supervised learning is used everywhere.
How Semi-Supervised Learning Works
Suppose you have collected a large set of unlabeled data that you want to train a model on. Manually labeling all this information will probably cost a fortune, besides taking months to complete the annotations. That's when the semi-supervised machine learning method comes to the rescue.
The working principle is quite simple. Instead of adding tags to the entire dataset, you go through
and hand-label just a small part of the data and use it to train a model, which then is applied to
the ocean of unlabeled data.
Self-training
    Self-training is the simplest semi-supervised learning method and can also be the
    fastest. Self-training algorithms see an application in multiple contexts, including
    NLP and computer vision.
    The objective of self-training is to combine information from unlabeled cases
    with that of labeled cases to iteratively identify labels for the dataset's
    unlabeled examples. On each iteration, the labeled training set is enlarged until
    the entire dataset is labeled.
    The self-training algorithm is typically applied as a wrapper to a base model.
    In this section, we'll be using an SVM as the base for our self-training model.
    The self-training algorithm is quite simple and contains very few steps, as
    follows:
        1. A set of labeled data is used to predict labels for a set of unlabeled
            data. (This may be all unlabeled data or part of it.)
        2. Confidence is calculated for all newly labeled cases.
        3. Cases are selected from the newly labeled data to be kept for the next iteration.
        4. The model trains on all labeled cases, including cases selected in previous
            iterations.
        5. The model iterates through steps 1 to 4 until it successfully converges.
Upon completing training, the self-trained model would be tested and validated. This
may be done via cross-validation or even using held-out, labeled data, should this
exist.
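For reference, recent versions of scikit-learn (0.24 and later) ship a generic implementation of this wrapper pattern, SelfTrainingClassifier, in which unlabeled cases are marked with a label of -1. The following is a minimal sketch on toy data rather than the wrapper built in this section; the SVC base model and the 0.8 threshold are arbitrary choices for illustration:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.svm import SVC

    # Toy two-class data; hide 90% of the training labels by setting them to -1
    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    rng = np.random.RandomState(0)
    y_partial = np.copy(y_train)
    y_partial[rng.rand(len(y_train)) < 0.9] = -1   # -1 marks "unlabeled"

    # Steps 1 to 4 above are handled inside the wrapper's fit loop
    self_trained = SelfTrainingClassifier(SVC(probability=True, gamma="scale"),
                                          threshold=0.8)
    self_trained.fit(X_train, y_partial)

    # Validate on held-out, labeled data
    print("held-out accuracy:", accuracy_score(y_test, self_trained.predict(X_test)))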
Self-training provides real power and time saving, but is also a risky process.
Implementing Self-Training:
    The first step in each iteration of self-training is one in which class labels are generated
    for unlabeled cases. This is achieved by first creating a SelfLearningModel class, which
    takes a base supervised model (basemodel) and an iteration limit as arguments.
The next step is to define functions for the process of semi-supervised model fitting:
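A sketch of what this wrapper might look like is shown below. It is an illustrative reconstruction rather than the exact code: the argument names basemodel, max_iter, and prob_threshold follow the text, while the convention of marking unlabeled cases with -1 is an assumption.

    import numpy as np
    from sklearn.base import BaseEstimator

    class SelfLearningModel(BaseEstimator):
        """Wrap a base supervised model in a self-training loop (sketch)."""

        def __init__(self, basemodel, max_iter=200, prob_threshold=0.8):
            self.model = basemodel                 # base supervised model
            self.max_iter = max_iter               # iteration limit
            self.prob_threshold = prob_threshold   # confidence cut-off

        def fit(self, X, y):
            # Cases marked with y == -1 are treated as unlabeled
            unlabeledX = X[y == -1, :]
            labeledX = X[y != -1, :]
            labeledy = y[y != -1]

            # Initial fit on the labeled cases only, then label the rest
            self.model.fit(labeledX, labeledy)
            unlabeledy = self.model.predict(unlabeledX)
            unlabeledprob = self.model.predict_proba(unlabeledX)
            # ...the iteration loop sketched below continues from here...
            return self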
Next, we need a loop for iteration. The following code describes a while loop that executes until there are no cases left in unlabeledy_old (a copy of unlabeledy) or until the maximum iteration count is reached. On each iteration, a labeling attempt is made for every case that is still unlabeled, and only those labels whose predicted probability exceeds the probability threshold (prob_threshold) are kept:
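Continuing the sketch above, that loop might look as follows. The stopping test (labels no longer changing, or the iteration cap being reached) and the two-column indexing of predict_proba for this two-class problem are assumptions:

    # (continuing inside SelfLearningModel.fit)
    unlabeledy_old = []   # labels from the previous iteration
    i = 0
    while (len(unlabeledy_old) == 0
           or np.any(unlabeledy != unlabeledy_old)) and i < self.max_iter:
        unlabeledy_old = np.copy(unlabeledy)
        # Keep only the newly labeled cases whose predicted probability
        # exceeds the confidence threshold
        uidx = np.where((unlabeledprob[:, 0] > self.prob_threshold)
                        | (unlabeledprob[:, 1] > self.prob_threshold))[0]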
The self.model.fit method then attempts to fit a model to the unlabeled data. This data is presented in a matrix of size [n_samples, n_features] (as referred to earlier in this section). The training matrix is created by appending the confidently labeled cases to the labeled data (with vstack for the features and hstack for the labels):
Finally, the iteration performs label predictions, followed by probability predictions for
those labels.
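Putting those two steps together, the remainder of the loop body in the sketch might read:

        # (still inside the while loop started above)
        # Refit on the original labeled cases plus the confidently labeled ones
        self.model.fit(np.vstack((labeledX, unlabeledX[uidx, :])),
                       np.hstack((labeledy, unlabeledy_old[uidx])))
        # Re-predict labels and probabilities ready for the next iteration
        unlabeledy = self.model.predict(unlabeledX)
        unlabeledprob = self.model.predict_proba(unlabeledX)
        i += 1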
On the next iteration, the model will perform the same process, this time taking the newly
labeled data whose probability predictions exceeded the threshold as part of the dataset used
in the model.fit step.
If one's model does not already include classification methods that can generate label predictions and class probabilities (such as scikit-learn's predict and predict_proba), these need to be supplied, for example by wrapping the model so that probability estimates can be derived from its decision scores.
Example:
Heart Dataset:
The heart dataset is a two-class dataset, where the classes are the absence or presence of heart disease. There are no missing values across the 270 cases for any of its 13 features. This data is unlabeled, and many of the variables needed are usually captured via expensive and sometimes inconvenient tests. The 13 variables are age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, oldpeak (ST depression induced by exercise), the slope of the peak exercise ST segment, the number of major vessels colored by fluoroscopy, and thal.
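A minimal sketch of applying the wrapper to this dataset follows. The file name heart.dat, the space-separated Statlog layout (13 feature columns plus a class column), and the choice to keep only 20 labels are assumptions for illustration:

    import numpy as np
    from sklearn.svm import SVC

    # Assumed local copy of the UCI Statlog heart file: 13 features + class (1/2)
    data = np.loadtxt("heart.dat")
    X, y = data[:, :13], (data[:, 13] == 2).astype(int)

    # Pretend most labels are unknown: keep ~20 labeled cases, mark the rest -1
    rng = np.random.RandomState(0)
    y_partial = np.copy(y)
    y_partial[rng.permutation(len(y))[:-20]] = -1

    model = SelfLearningModel(SVC(probability=True, gamma="scale"))
    model.fit(X, y_partial)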
Finessing the Self-Training Implementation:
• Self-training can be a fragile process. If an element of the algorithm is ill-configured or the input data contains peculiarities, it is very likely that the iterative process will fail once and continue to compound that error by reintroducing incorrectly labeled data to future labeling steps.
• As the self-training algorithm iteratively feeds itself, garbage in, garbage out is a very real concern.
• In some cases, the newly labeled data may not add useful information. This is a situation in which cases are being added that have no real effect on classification, while classification accuracy in general deteriorates. Even worse, adding cases that are similar enough to pre-existing cases to be easy to label, but that actually misguide the classifier's decision boundary, can increase misclassification.
• Diagnosing what went wrong with a self-training model can sometimes be difficult, but as always, a few well-chosen plots add a lot of clarity to the situation. As this type of error occurs particularly often within the first few iterations, simply adding an element to the label prediction loop that writes out the current classification accuracy allows us to understand how accuracy trended during early iterations, as sketched below.
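One minimal way to add that instrumentation, assuming a held-out labeled set (X_val, y_val) is available inside the loop, is simply to print the current accuracy on every pass:

    from sklearn.metrics import accuracy_score

    # Inside the labeling loop: watch how accuracy trends across iterations
    print("iteration %d: accuracy %.3f"
          % (i, accuracy_score(y_val, self.model.predict(X_val))))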
• Once the issue has been identified, there are a few possible solutions. If enough labeled data exists, a simple solution is to attempt to use a more diverse set of labeled data to kick-start the process.
• Another class of error that a self-training model is particularly vulnerable to is biased selection.
• If the dataset as a whole, or the labeled subsets used, are biased toward one class, then the risk increases that your self-training classifier will overfit. This only compounds the problem, as the cases provided for the next iteration are liable to be insufficiently diverse to solve the problem; whatever incorrect decision boundary was set up by the self-training algorithm will stay where it is, overfit to a subset of the data.
• Numerical disparity between each class's count of cases is the main symptom here, but the more usual methods of spotting overfitting can also be helpful in diagnosing problems around selection bias.
• A further risk introduced by self-training is that the introduction of unlabeled data almost always introduces noise. If dealing with datasets where part or all of the unlabeled cases are highly noisy, the amount of noise introduced may be sufficient to degrade classification accuracy.
• The key to the self-training algorithm working correctly is the accurate calculation of confidence for each label projection; confidence calculation is what makes or breaks a self-training run.
• During our first explanation of self-training, we used some simplistic values for certain parameters, including a parameter closely tied to confidence calculation. In selecting our labeled cases, we used a fixed confidence level for comparison against predicted probabilities, where we could have adopted any one of several different strategies (a short sketch follows the list below):
       •   Adding all of the projected labels to the set of labeled data
       •   Using a confidence threshold to select only the few most confident labels to the
           set
       •   Adding all the projected labels to the labeled dataset and weighing each label by
           confidence
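A rough sketch of these three options is shown below; probs is assumed to be the matrix returned by predict_proba for the unlabeled cases, and the threshold and k values are arbitrary:

    import numpy as np

    confidence = probs.max(axis=1)                # confidence of the most probable class

    # 1. Add every projected label, regardless of confidence
    keep_all = np.arange(len(confidence))

    # 2. Keep only the most confident labels, via a fixed threshold or a top-k cut
    keep_thresholded = np.where(confidence > 0.8)[0]
    keep_top_k = np.argsort(confidence)[-50:]

    # 3. Add everything, but weight each case by its confidence
    sample_weight = confidence                    # e.g. model.fit(..., sample_weight=confidence)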
Contrastive Pessimistic Likelihood Estimation (CPLE)
• CPLE calculates the relative improvement of any semi-supervised estimate over the supervised solution.
• Where the supervised solution outperforms the semi-supervised estimate, the loss function shows this and the model can train to adjust the semi-supervised model to reduce this loss.
• Where the semi-supervised solution outperforms the supervised solution, the model can learn from the semi-supervised model by adjusting model parameters.
• The CPLE algorithm takes the Cartesian product of all label/prediction combinations and then, pessimistically, selects the posterior distribution that minimizes the gain in likelihood; the model parameters are then chosen to maximize this worst-case gain.
• CPLE can deliver particular performance improvements on some of the most challenging semi-supervised learning cases, where the labeled data is a poor representation of the unlabeled data.
Example:
The 10000_songs dataset is considered. In this analysis, we'll be attempting to predict genre from the genre tags provided as targets. We'll take a subset of tags as the labeled data used to kick-start our learning and will attempt to generate tags for the unlabeled data.
In this iteration, we're going to raise our game as follows:
• Using more labeled data. This time, we'll use 1% of the total dataset size (100 songs), taken at random, as labeled data.
• Using an SVM with a linear kernel as our classifier, rather than the simple linear discriminant analysis used previously.
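A sketch of this setup is shown below. It assumes the open-source semisup-learn package (which provides the CPLELearningModel wrapper used here), and that songs and genre_labels are the feature matrix and integer-encoded tag vector already loaded from the 10000_songs data:

    import numpy as np
    from sklearn.svm import SVC
    # CPLELearningModel is provided by the semisup-learn package (an assumed dependency)
    from frameworks.CPLELearning import CPLELearningModel

    rng = np.random.RandomState(0)
    n = len(genre_labels)
    labeled_idx = rng.choice(n, size=n // 100, replace=False)   # 1% of the data, at random

    y_partial = np.full(n, -1)            # -1 marks unlabeled cases
    y_partial[labeled_idx] = genre_labels[labeled_idx]

    # Linear-kernel SVM as the base classifier for the CPLE wrapper
    model = CPLELearningModel(SVC(kernel="linear", probability=True))
    model.fit(songs, y_partial)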
                               TEXT FEATURE ENGINEERING
Feature Engineering
Feature Engineering is the process of creating new features or transforming existing features to improve
the performance of a machine-learning model. It involves selecting relevant information from raw data
and transforming it into a format that can be easily understood by a model. The goal is to improve
model accuracy by providing more meaningful and relevant information.
Feature:
In the context of machine learning, a feature (also known as a variable or attribute) is an individual
measurable property or characteristic of a data point that is used as input for a machine learning
algorithm.
Features can be numerical, categorical, or text-based, and they represent different aspects of the data
that are relevant to the problem at hand.
• For example, in a dataset of housing prices, features could include the number of bedrooms, the square footage, the location, and the age of the property. In a dataset of customer demographics, features could include age, gender, income level, and occupation.
• The choice and quality of features are critical in machine learning, as they can greatly impact the accuracy and performance of the model.
• Feature engineering is the pre-processing step of machine learning, which extracts features from raw data. It helps to represent the underlying problem to predictive models in a better way, which, as a result, improves the accuracy of the model on unseen data.
• The predictive model contains predictor variables and an outcome variable, and the feature engineering process selects the most useful predictor variables for the model.
• The first step should be manually checking the input data. This is pretty critical; with text data, one needs to try to understand what issues exist in the data initially so as to identify the cleaning needed.
    Example: It's kind of painful to read through a dataset full of hateful Internet commentary,
    so here's an example entry:
      ID         Date                     Comment
      132        20120531031917Z          """\xa0@Flip\xa0how are you not ded"""
• It has an ID field and a date field, which don't seem to need much work. The text field, however, is quite challenging. From this one case, misspellings and HTML inclusions can be observed. Furthermore, many entries in the dataset contain attempts to bypass swear filtering, usually by including a space or punctuation element mid-word. Other data quality issues include multiple vowels (to extend a word), non-ASCII characters, hyperlinks... the list goes on.
• One option for cleaning this dataset is to use regular expressions, which run over the input data to scrub out data quality issues. However, the quantity and variety of problem formats make it impractical to use a regex-based approach, at least to begin with.
• We're likely both to miss a lot of cases and to misjudge the amount of preparation needed, leading us to clean too aggressively, or not aggressively enough; in specific terms, we risk cutting into real text content or leaving parts of tags in place. What we need is a solution that will wash out the majority of common data quality problems to begin with, so that we can focus on the remaining issues with a script-based approach.
• BeautifulSoup is a very powerful text cleaning library which can, among other things, remove HTML markup.
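A minimal example of this first-pass cleaning, assuming the raw comments have been read into a pandas DataFrame named comments with a Comment column:

    from bs4 import BeautifulSoup

    def strip_markup(raw_comment):
        """Remove HTML tags and entities, keeping only the visible text."""
        return BeautifulSoup(raw_comment, "html.parser").get_text()

    comments["Comment"] = comments["Comment"].apply(strip_markup)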
Tokenisation is the process of creating a set of tokens from a stream of text. Many tokens are words, while others might be character strings (such as smileys or other punctuation sequences).
The re module is used to remove a lot of the HTML ugliness from the initial dataset, via operations over regular expressions such as substring replacement. We perform a series of operations over our input text on this pass, mostly focused on replacing variable or problematic text elements with tokens.
Example: Replacing e-mail addresses with an _EM token:
         text = re.sub(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}', '_EM', text)
We can automatically remove extra or problematic whitespace and newline characters, hyphens, and underscores. Extended series of punctuation characters are encoded here using codes such as _BQ and _BX; these longer tags are used as a means of differentiating them from the more straightforward _Q and _X tags (which refer to the use of a question mark and an exclamation mark, respectively).
We can also use regular expressions to manage extra letters; by cutting such strings down to two characters at most, we're able to reduce the number of combinations to a manageable amount and tokenize that reduced group using the _EL token:
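One possible regular expression for this step is sketched below; the exact pattern used in practice may differ:

    # Collapse runs of three or more repeated letters ("soooo") down to two,
    # appending the _EL token so that the emphasis is still recorded
    text = re.sub(r'(\w)\1\1+', r'\1\1 _EL', text)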
    # Format whitespace
    text = text.replace('"', ' ')
    text = re.sub(r'\s+', ' ', text)
    # Manage punctuation (illustrative): tag extended runs of question/exclamation marks
    text = re.sub(r'\?{2,}', ' _BQ ', text)
    text = re.sub(r'!{2,}', ' _BX ', text)
One of the more helpful indicators available is the _SW token for swearing. We'll also use regular expressions
to help identify and tokenize smileys into one of four buckets; big and happy smileys (_BS), small and happy
ones (_S), big and sad ones (_BF), and small and sad ones (_F):
         text = re.sub(r'([#%&\*\$]{2,})(\w*)', r'\1\2 _SW', text)
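The smiley bucketing can be handled with a handful of similar substitutions. The patterns below are simplified illustrations rather than the exact expressions used; real emoticon handling usually needs many more cases:

    # Very rough emoticon buckets (illustrative patterns only)
    text = re.sub(r'[8xX=][\'`\-]?[)D]+|:-?D+', ' _BS ', text)   # big, happy
    text = re.sub(r'[8xX=][\'`\-]?\(+', ' _BF ', text)           # big, sad
    text = re.sub(r':-?\)+', ' _S ', text)                       # small, happy
    text = re.sub(r':-?\(+', ' _F ', text)                       # small, sad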
This is a simple splitting operation, which enables the input to be treated as a vector of words (words) rather than as one long string:

    phrases = re.split(r'[;:\.()\n]', text)
    phrases = [re.findall(r'[\w%\*&#]+', ph) for ph in phrases]
    phrases = [ph for ph in phrases if ph]

    words = []
    for ph in phrases:
        words.extend(ph)
      ID             Date                    Comments
      132            20120531031917Z         [['Flip', 'how', 'are', 'you', 'not', 'ded']]
Next, we perform a search for single-letter sequences. Sometimes, for emphasis, Internet communication
involves the use of spaced single-letter chains. This may be attempted as a method of avoiding curse
word detection:
    # Collapse runs of single letters (for example, "d e d") back into one word
    tmp = words
    words = []
    new_word = ''
    for word in tmp:
        if len(word) == 1:
            new_word = new_word + word
        else:
            if new_word:
                words.append(new_word)
                new_word = ''
            words.append(word)
    if new_word:
        words.append(new_word)
      ID             Date                    Words
      132            20120531031917Z         ['_F', 'how', 'are', 'you', 'not', 'ded']
In the first case, we have a misspelled word; we need to find a way to eliminate this.
Secondly, a lot of the words in both examples (for example, are, pm) aren't terribly informative in
and of themselves. The problem we find, particularly for shorter text samples, is that what's left after
cleaning may contain only one or two meaningful terms. If these terms are not terribly common in
the corpus as a whole, it can prove to be very difficult to train a classifier to recognize these terms'
significance.
If we can perform part-of-speech tagging by identifying and encoding word classes as categorical variables, we're able to improve the quality of our data by retaining only the valuable content. Specifically, we focus on n-gram tagging and backoff taggers, a pair of complementary techniques that allow us to create powerful recursive tagging algorithms.
However, even brief consideration will make it obvious that our use of language is a lot more
complicated than this allows. We may use a word (such as ferry) as one of several parts of speech
and it may not be straightforward to decide how to treat each word in every utterance. A lot of the
time, the correct tag can only be understood contextually given the other words and their positioning
within the phrase.
Sequential tagging
A sequential tagging algorithm is one that works by running through the input dataset, left-to-right and
token-by-token (hence sequential!), tagging each token in succession. The decision over which token to
assign is made based on that token, the tokens that preceded it, and the predicted tags for those preceding
tokens.
An n-gram tagger is a type of sequential tagger, which is pretrained to identify appropriate tags. The n-gram tagger takes (n-1)-many preceding POS tags and the current token into consideration in producing a tag.
The simplest form of n-gram tagger is one where n = 1, referred to as a unigram tagger. A unigram
tagger operates quite simply, by maintaining a conditional frequency distribution for each token.
The tagger assumes that the tag which occurs most frequently for a given token in a given sequence is
likely to be the correct tag for that token. If the term carp is in the training corpus as a noun four times
and as a verb twice, a unigram tagger will assign the noun tag to any token whose type is carp.
This might suffice for a first-pass tagging attempt but clearly, a solution that only ever serves up one tag
for each set of homonyms isn't always going to be ideal. The solution we can tap into is using n-grams
with a larger value of n. With n = 3 (a trigram tagger), for instance, we can see how the tagger might
more easily distinguish the input He tends to carp on a lot from He caught a magnificent carp.
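A small NLTK illustration of the difference, trained on the Brown corpus (the category name and the example sentences are arbitrary, and nltk.download('brown') is assumed to have been run):

    import nltk
    from nltk.corpus import brown

    train_sents = brown.tagged_sents(categories="news")

    # A unigram tagger always gives 'carp' its single most frequent tag
    unigram = nltk.UnigramTagger(train_sents)
    print(unigram.tag("He caught a magnificent carp".split()))

    # A trigram tagger can use the two preceding tags to disambiguate,
    # backing off to the unigram tagger for unseen contexts (see the next section)
    trigram = nltk.TrigramTagger(train_sents, backoff=unigram)
    print(trigram.tag("He tends to carp on a lot".split()))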
Backoff tagging
Sometimes, a given tagger may not perform reliably. This is particularly common when the tagger has
high accuracy demands and limited training data. At such times, we usually want to build an ensemble
structure that lets us use several taggers simultaneously.
To do this, we create a distinction between two types of taggers: subtaggers and backoff taggers.
Subtaggers are taggers like the ones we saw previously, sequential and Brill taggers. Tagging structures
may contain one or multiple of each kind of tagger.
If a subtagger is unable to determine a tag for a given token, then a backoff tagger may be consulted instead. A backoff tagger is specifically used to combine the results of an ensemble of (one or more) subtaggers.
In simple implementations, the backoff tagger will simply poll the subtaggers in order, accepting the first non-null tag provided. If all subtaggers return null for a given token, the backoff tagger will assign a None tag to that token. The order in which subtaggers are polled can be configured.
Backoffs are typically used with multiple subtaggers of different types; this enables a data scientist to harness the benefits of multiple types of tagger simultaneously. Backoffs may refer to other backoffs as needed, potentially creating highly redundant or sophisticated tagging structures.
In general terms, backoff taggers provide redundancy and enable you to use multiple taggers in a
composite solution.
    # Chain unigram, bigram, and trigram taggers, each backing off to the previous one
    brown_a = nltk.corpus.brown.tagged_sents(categories='news')  # 'a' in older NLTK releases
    tagger = None
    for n in range(1, 4):
        tagger = nltk.NgramTagger(n, brown_a, backoff=tagger)
    words = tagger.tag(words)
The remaining preparation techniques we'll look at are:
         •    Stemming
         •    Lemmatising
         •    Bagging using random forests
Stemming
Another challenge when working with linguistic datasets is that multiple word forms exist for many word
stems. For example, the root dance is the stem of multiple other words—dancing, dancer, dances, and so
on. By finding a way to reduce this plurality of forms into stems, we find ourselves able to improve our n-
gram tagging and apply new techniques such as lemmatisation.
The techniques that enable us to reduce words to their stems are called stemmers. Stemmers work by parsing words as consonant/vowel strings and applying a series of rules. The most popular stemmer is the Porter stemmer, which works by performing the following steps:
         1. Simplifying the range of suffixes by reducing (for example, ies becomes i) to a smaller
            set.
         2. Removing suffixes in several passes, with each pass removing a set of suffix types (for
            example, past participle or plural suffixes such as ousness or alism).
         3. Once all suffixes are removed, cleaning up word endings by adding 'e's
            where needed (for example, ceas becomes cease).
         4. Removing double 'l's.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(word) for word in words]
The output of this stemmer, as demonstrated on our pre-existing example, is the root form of each word. This may be a real word, or it may not; dancing, for instance, becomes danc.
Lemmatisation is a more complex process of determining word stems; unlike Porter stemming, it uses a different normalisation process for different parts of speech. Unlike Porter stemming, it also seeks to find actual roots for words: where a stem does not have to be a real word, a lemma does. Lemmatisation also takes on the challenge of reducing synonyms down to their roots.
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
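For example (the part-of-speech hint is supplied by hand here, and the WordNet corpus is assumed to be downloaded; in practice the hint would come from the tagger built earlier):

    print(lemmatizer.lemmatize("dances"))            # -> 'dance' (default POS is noun)
    print(lemmatizer.lemmatize("dancing", pos="v"))  # -> 'dance', a real word, unlike the stem 'danc'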
We have removed stop words and tokenized a range of other noise elements with regex methods. We've also removed any HTML tagging. Our text data has reached a reasonably processed state. Next, we can use bagging (a bag-of-words representation) to help quantify the use of terms.
"_F"
"how"
"are"
"you"
"not"
    "ded" "living"
    "proof" "that"
"bath"
    "salts" "effect"
    "thinking"
Using the indices of this list, we can create a vector for each of the preceding sentences, with one element per vocabulary term. Each vector's values are filled by traversing the preceding list and counting the number of times each term occurs in each sentence in the dataset. Given our pre-existing example sentences and the vocabulary we created from them, we end up creating the following bags:
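Under the vocabulary ordering listed above, the two sentences become the following count vectors (1 where the term occurs, 0 otherwise; the second sentence is reconstructed from the vocabulary):

    Sentence 1 ('_F how are you not ded'):    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    Sentence 2 (the 'bath salts' comment):    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]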
We can use a term weighting scheme to modify the values within each vector so that terms that are indicative or helpful for classification are emphasized. Weighting schemes may be straightforward masks, such as a binary mask that indicates presence versus absence.
Binary masking can be useful if certain terms are used much more frequently than normal; in such
cases, specific scaling (for example, log-scaling) may be needed if a binary mask is not used. At the
same time, though, frequency of term use can be informative (it may indicate emphasis, for instance)
and the decision over whether to apply a binary mask is not always made simply.
Another weighting option is term frequency-inverse document frequency, or tf-idf. This scheme compares frequency of usage within a specific sentence against the dataset as a whole, and uses values that increase if a term is used more frequently within a given sample than within the whole corpus. Variations on tf-idf are frequently used in text mining contexts, including search engines. Scikit-learn provides a tf-idf implementation, TfidfVectorizer, which we'll shortly use to employ tf-idf for ourselves.
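In its most common form, the weight of term t in document d is the product of the term frequency and the inverse document frequency:

    tf-idf(t, d) = tf(t, d) × log(N / df(t))

where N is the number of documents (here, comments) in the corpus and df(t) is the number of documents containing t. Scikit-learn's default is a smoothed variant, idf(t) = ln((1 + N) / (1 + df(t))) + 1, with the resulting vectors normalised to unit length.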
The process of implementing bag of words is, again, fairly straightforward. We initialize our bagging tool (matter-of-factly referred to as a vectorizer). Note that for this example, we're putting a limit on the size of the feature vector.
         from sklearn.feature_extraction.text import TfidfVectorizer
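The initialization might look like the following; the max_features cap of 5,000 is an arbitrary illustration of limiting the size of the feature vector:

    # Cap the vocabulary so the feature vectors stay a manageable size
    vectorizer = TfidfVectorizer(analyzer="word", max_features=5000)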
Our next step is to fit the vectorizer on our word data via fit_transform; as part of the fitting process, our data is transformed into feature vectors:

    train_data_features = vectorizer.fit_transform(words)
    train_data_features = train_data_features.toarray()
The labeling guidelines for the insult-detection task we're working on are as follows:
• We are looking for comments that are intended to be insulting to a person who is a part of the larger blog/forum conversation.
• We are NOT looking for insults directed at non-participants (such as celebrities, public figures, etc.).
• Insults could contain profanity, racial slurs, or other offensive language. But oftentimes, they do not.
• Comments which contain profanity or racial slurs, but are not necessarily insulting to another person, are considered not insulting.
• The insulting nature of the comment should be obvious, and not subtle.
• There may be a small amount of noise in the labels as they have not been meticulously cleaned. However, contestants can be confident the error in the training and testing data is < 1%.
Contestants should also be warned that this problem tends to strongly overfit. The provided data is generally
representative of the full test set, but not exhaustive by any measure. Impermium will be conducting final
evaluations based on an unpublished set of data drawn from a wide sample.
The evaluation metric is the area under the ROC curve (AUC), a measure that is sensitive both to false positives and to false negatives (that is, to specificity and sensitivity).
Specifically, the top 14 participants on the private (test) leaderboard managed to reach an AUC score of
over 0.8. The top scorer managed a pretty impressive 0.84, while over half of the 50 teams who entered
scored above 0.77.
    We then grab the test data and apply our model to predict a score for each test case. We rescale
    these scores using a simple stretching technique:
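A sketch of this step is shown below; test_features is assumed to be the vectorised test comments, and the 'stretching' is taken to be a simple min-max rescale of the predicted probabilities onto [0, 1]:

    # Predict insult probabilities for the test cases
    scores = model.predict_proba(test_features)[:, 1]

    # Simple stretching: min-max rescale the scores onto [0, 1]
    scores = (scores - scores.min()) / (scores.max() - scores.min())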
Finally, we apply the roc_auc function to calculate an AUC score for the model:
0.537894912105