AML Unit-3 Material

Semi-supervised learning is a machine learning approach that combines a small amount of labeled data with a large amount of unlabeled data to improve model training. It is particularly useful in scenarios where labeling data is expensive or impractical, such as text and image classification, and anomaly detection. The document discusses the advantages, disadvantages, assumptions, algorithms, and applications of semi-supervised learning, including self-training and contrastive pessimistic likelihood estimation.


UNIT - III

Semi-Supervised Learning:

 Semi-supervised learning is a type of machine learning that falls in between supervised and unsupervised learning.
 It is a method that uses a small amount of labeled data and a large amount of unlabeled data to train a model.
 The goal of semi-supervised learning is to learn a function that can accurately predict the output variable based on the input variables, similar to supervised learning. However, unlike supervised learning, the algorithm is trained on a dataset that contains both labeled and unlabeled data.
 Semi-supervised learning is particularly useful when there is a large amount of unlabeled data available, but it's too expensive or difficult to label all of it.

Fig: Semi-Supervised Learning Flow Chart

Intuitively, one may imagine the three types of learning algorithms as follows: supervised learning, where a student is under the supervision of a teacher at both home and school; unsupervised learning, where a student has to figure out a concept on their own; and semi-supervised learning, where a teacher teaches a few concepts in class and gives homework questions based on similar concepts.

Examples of Semi-Supervised Learning

 Text classification: In text classification, the goal is to classify a given text into one or
more predefined categories. Semi-supervised learning can be used to train a text
classification model using a small amount of labeled data and a large amount of unlabeled
text data.
 Image classification : In image classification, the goal is to classify a given image into one
or more predefined categories. Semi-supervised learning can be used to train an image
classification model using a small amount of labeled data and a large amount of unlabeled
image data.
 Anomaly detection: In anomaly detection, the goal is to detect patterns or observations that are unusual or different from the norm.
Assumptions followed by Semi-Supervised Learning

A Semi-Supervised algorithm assumes the following about the data:


1. Continuity Assumption: The algorithm assumes that the points which are closer to each
other are more likely to have the same output label.
2. Cluster Assumption: The data can be divided into discrete clusters and points in the same
cluster are more likely to share an output label.
3. Manifold Assumption: The data lie approximately on a manifold of a much lower
dimension than the input space. This assumption allows the use of distances and densities
which are defined on a manifold.

Applications of Semi-Supervised Learning

1. Speech Analysis: Since labeling audio files is a very intensive task, Semi-Supervised
learning is a very natural approach to solve this problem.
2. Internet Content Classification: Labeling each webpage is an impractical and infeasible process, so Semi-Supervised learning algorithms are used instead. Even the Google search algorithm uses a variant of Semi-Supervised learning to rank the relevance of a webpage for a given query.
3. Protein Sequence Classification: Since DNA strands are typically very large, Semi-Supervised learning has become prominent in this field.

 The most basic disadvantage of any Supervised Learning algorithm is that the dataset
has to be hand-labeled either by a Machine Learning Engineer or a Data Scientist.

 This is a very costly process, especially when dealing with large volumes of data. The
most basic disadvantage of any Unsupervised Learning is that its application spectrum
is limited.

Advantages of Semi-supervised learning

 It is simple to comprehend.
 Semi-supervised learning is powerful when labels are limited and unlabeled data is
plentiful.
 Your model’s performance and generalization can be enhanced. Without spending time
and money classifying tens of thousands of additional photos, your model gets exposure
to situations it might see during deployment.
 In innumerable circumstances, labeled data is not easily accessible. With only a small
portion of the labeled data, semi-supervised learning can complete typical tasks with
state-of-the-art outcomes.
 From crawling engines and information aggregation systems to picture and speech
recognition, semi-supervised learning is used everywhere.
Disadvantages of Semi-supervised learning

 The outcomes of iterations are unstable.
 It is not well suited to network-level data.
 Because there is no way to confirm that the algorithm has generated labels that are 100% accurate, it produces less reliable results than conventional supervised procedures.

Semi – Supervised Algorithms:

Suppose you have collected a large set of unlabeled data that you want to train a model on. Manually labeling all of this information will probably cost a fortune, besides taking months to complete the annotations. That's when the semi-supervised machine learning method comes to the rescue.

The working principle is quite simple. Instead of adding tags to the entire dataset, you hand-label just a small part of the data and use it to train a model, which is then applied to the ocean of unlabeled data.

Self-training
Self-training is the simplest semi-supervised learning method and can also be the
fastest. Self-training algorithms see an application in multiple contexts, including
NLP and computer vision.
The objective of self-training is to combine information from unlabeled cases
with that of labeled cases to iteratively identify labels for the dataset's
unlabeled examples. On each iteration, the labeled training set is enlarged until
the entire dataset is labeled.
The self-training algorithm is typically applied as a wrapper to a base model.
In this chapter, we'll be using an SVM as the base for our self-training model.
The self-training algorithm is quite simple and contains very few steps, as
follows:
1. A set of labeled data is used to predict labels for a set of unlabeled
data. (This may be all unlabeled data or part of it.)
2. Confidence is calculated for all newly labeled cases.
3. Cases are selected from the newly labeled data to be kept for the next iteration.
4. The model trains on all labeled cases, including cases selected in previous
iterations.
5. The model iterates through steps 1 to 4 until it successfully converges.

Upon completing training, the self-trained model would be tested and validated. This
may be done via cross-validation or even using held-out, labeled data, should this
exist.
Self-training provides real power and time saving, but is also a risky process.
Implementing Self – Training:

The first step in each iteration of self-training is one in which class labels are generated
for unlabeled cases. This is achieved by first creating a SelfLearningModel class, which
takes a base supervised model (basemodel) and an iteration limit as arguments.
The next step is to define functions for the process of semi-supervised model fitting:
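A minimal sketch of what such a wrapper might look like is given below; the names (SelfLearningModel, basemodel, max_iter, prob_threshold) follow the description in this section, while the implementation details are illustrative rather than the exact code being described.

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class SelfLearningModel(BaseEstimator, ClassifierMixin):
    """Illustrative self-training wrapper around a supervised base model."""

    def __init__(self, basemodel, max_iter=200, prob_threshold=0.8):
        self.model = basemodel
        self.max_iter = max_iter
        self.prob_threshold = prob_threshold

    def fit(self, X, y):
        # cases labeled -1 are treated as unlabeled
        unlabeledX = X[y == -1]
        labeledX = X[y != -1]
        labeledy = y[y != -1]

        # initial fit on the labeled data only, then predict labels for the unlabeled cases
        self.model.fit(labeledX, labeledy)
        unlabeledy = self.model.predict(unlabeledX)
        unlabeledprob = self.model.predict_proba(unlabeledX)

        unlabeledy_old = np.array([])
        i = 0
        # iterate until the predicted labels stop changing or the iteration limit is reached
        while (len(unlabeledy_old) == 0 or
               np.any(unlabeledy != unlabeledy_old)) and i < self.max_iter:
            unlabeledy_old = np.copy(unlabeledy)

            # keep only the unlabeled cases predicted with high confidence
            uidx = np.where(unlabeledprob.max(axis=1) > self.prob_threshold)[0]

            # refit on the original labeled cases plus the confident pseudo-labeled ones
            self.model.fit(np.vstack((labeledX, unlabeledX[uidx, :])),
                           np.hstack((labeledy, unlabeledy_old[uidx])))

            unlabeledy = self.model.predict(unlabeledX)
            unlabeledprob = self.model.predict_proba(unlabeledX)
            i += 1

        return self

    def predict(self, X):
        return self.model.predict(X)

    def predict_proba(self, X):
        return self.model.predict_proba(X)

For the examples in this chapter, an SVM such as sklearn.svm.SVC(probability=True) would be passed in as basemodel, so that predict_proba is available for the confidence calculation.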

Next, we need a loop for iteration. The fitting process centres on a while loop that executes until there are no cases left to relabel in unlabeledy_old (a copy of unlabeledy) or until the maximum iteration count is reached. On each iteration, a labeling attempt is made for each still-unlabeled case whose predicted probability exceeds the probability threshold (prob_threshold).

The self.model.fit method then attempts to fit a model to this data, which is presented as a matrix of size [n_samples, n_features] (as referred to earlier in this chapter). The training matrix is created by appending (with vstack and hstack) the confidently labeled cases to the original labeled cases.

Finally, the iteration performs label predictions, followed by probability predictions for
those labels.
On the next iteration, the model will perform the same process, this time taking the newly
labeled data whose probability predictions exceeded the threshold as part of the dataset used
in the model.fit step.

If one's model does not already include a classification method that can generate label predictions (and prediction probabilities), such a method must be added before the model can serve as the base for self-training.

Example:

Heart Dataset:

The Heart dataset is a two-class dataset, where the classes are the absence or presence of heart disease. There are no missing values across the 270 cases for any of its 13 features. This data is unlabeled, and many of the variables needed are usually captured via expensive and sometimes inconvenient tests. The variables are as follows:
Finessing the self-training implementation:
 Self-training can be a fragile process. If an element of the algorithm is ill-configured
or the input data contains peculiarities, it is very likely that the iterative process will
fail once and continue to compound that error by reintroducing incorrectly labeled
data to future labeling steps.
 As the self-training algorithm iteratively feeds itself, garbage in, garbage out is a
very real concern.
 In some cases, newly labeled data may add no useful information: cases can be added that have no real effect on classification while overall classification accuracy deteriorates. Even worse, adding cases that are similar enough to pre-existing cases to be easy to label, but that actually misguide the classifier's decision boundary, can increase misclassification.
 Diagnosing what went wrong with a self-training model can sometimes be difficult,
but as always, a few well-chosen plots add a lot of clarity to the situation. As this type
of error occurs particularly often within the first few iterations, simply adding an
element to the label prediction loop that writes the current classification accuracy
allows us to understand how accuracy trended during early iterations.
 Once the issue has been identified, there are a few possible solutions. If enough labeled
data exists, a simple solution is to attempt to use a more diverse set of labeled data to
kick-start the process.
 Another class of error that a self-training model is particularly vulnerable to is biased
selection.
 If the dataset as a whole, or the labeled subsets used, are biased toward one class,
then the risk increases that your self-training classifier will overfit. This only
compounds the problem as the cases provided for the next iteration are liable to be
insufficiently diverse to solve the problem; whatever incorrect decision boundary
was set up by the self-training algorithm will be set where it is—overfit to a subset of
the data.
 Numerical disparity between each class' count of cases is the main symptom here, but
the more usual methods to spot overfitting can also be helpful in diagnosing
problems around selection bias.
 A further class of risk introduced by self-training is that the introduction of unlabeled
data almost always introduces noise. If dealing with datasets where part or all of the
unlabeled cases are highly noisy, the amount of noise introduced may be sufficient to
degrade classification accuracy.

Improving the Selection Process:

The key to the self-training algorithm working correctly is the accurate calculation of
confidence for each label projection.
During our first explanation of self-training, we used some simplistic values for certain
parameters, including a parameter closely tied to confidence calculation. In selecting our
labeled cases, we used a fixed confidence level for comparison against predicted
probabilities, where we could've adopted any one of several different strategies:
• Adding all of the projected labels to the set of labeled data
• Using a confidence threshold to add only the most confident labels to the set
• Adding all of the projected labels to the labeled dataset and weighting each label by confidence
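As a rough illustration of these three options, assume probs holds the base model's predicted class probabilities for the currently unlabeled cases and pseudo_labels holds the corresponding predicted labels (both names are illustrative):

import numpy as np

confidence = probs.max(axis=1)              # per-case confidence of the predicted label

# 1. add every projected label
selected = np.arange(len(pseudo_labels))

# 2. add only the labels whose confidence exceeds a fixed threshold
selected = np.where(confidence > 0.8)[0]

# 3. add every label, but weight each case by its confidence
#    (passed as sample_weight to base models that support it)
sample_weight = confidence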

Contrastive Pessimistic Likelihood Estimation:

 A very recent (May 2015) approach to semi-supervised learning, Contrastive Pessimistic Likelihood Estimation (CPLE), provides a more general way to perform semi-supervised parameter estimation.
 CPLE provides a rather remarkable advantage: it produces label predictions that have
been demonstrated to consistently outperform those created by equivalent semi-
supervised classifiers or by supervised classifiers working from the labeled data! In
other words, when performing a linear discriminant analysis, for instance, it is
advised that you perform a CPLE-based, semi-supervised analysis instead of a
supervised one, as you will always obtain at least equivalent performance.
 CPLE uses the familiar measure of maximized log-likelihood for parameter
optimization. It is the specific guarantees and assumptions that CPLE incorporates
that make the technique effective.
 CPLE takes the supervised estimates into account explicitly, using the loss incurred
between the semi-supervised and supervised models as a training performance
measure:
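A sketch of this objective, roughly following Loog's (2015) formulation (the notation here is illustrative):

\[
\mathrm{CPL}(\theta \mid X, \mathbf{y}, U) = \min_{q} \Big[ L(\theta \mid X, \mathbf{y}, U, q) - L(\hat{\theta}_{\mathrm{sup}} \mid X, \mathbf{y}, U, q) \Big],
\qquad
\hat{\theta}_{\mathrm{semi}} = \arg\max_{\theta} \, \mathrm{CPL}(\theta \mid X, \mathbf{y}, U)
\]

where \(X, \mathbf{y}\) are the labeled data, \(U\) the unlabeled data, \(q\) a hypothesized soft labeling of \(U\), \(L\) the log-likelihood, and \(\hat{\theta}_{\mathrm{sup}}\) the purely supervised estimate; taking the minimum over \(q\) is what makes the estimate "pessimistic".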

 CPLE calculates the relative improvement of any semi-supervised estimate over the
supervised solution.
 Where the supervised solution outperforms the semi-supervised estimate, the loss
function shows this and the model can train to adjust the semi-supervised model to reduce
this loss.
 Where the semi-supervised solution outperforms the supervised solution, the model can
learn from the semi-supervised model by adjusting model parameters.
 The CPLE algorithm takes the Cartesian product of all label/prediction
combinations and then selects the posterior distribution that minimizes the gain in
likelihood.
 CPLE can deliver particular performance improvements on some of the most challenging semi-supervised learning cases, where the labeled data is a poor representation of the unlabeled data.
Example:

The 10000_songs dataset is considered. In this analysis, we'll be attempting to predict genre
from the genre tags provided as targets. We'll take a subset of tags as the labeled data used to
kick-start our learning and will attempt to generate tags for unlabelled data.
In this iteration, we're going to raise our game as follows:
• Using more labeled data. This time, we'll use 1% of the total dataset size (100
songs), taken at random, as labeled data.
• Using an SVM with a linear kernel as our classifier, rather than the simple linear
discriminant analysis
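A minimal sketch of this setup, assuming the song features and genre tags have already been loaded into the arrays X and genres (the loading step and variable names are illustrative):

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(42)

# mark everything as unlabeled (-1), then reveal the genre tags for a random 1% of songs
y = -1 * np.ones(len(genres), dtype=int)
labeled_idx = rng.choice(len(genres), size=int(0.01 * len(genres)), replace=False)
y[labeled_idx] = genres[labeled_idx]

# a linear-kernel SVM as the base classifier; the semi-supervised wrapper
# (for example, a CPLE implementation) would then be fit on X and y
basemodel = SVC(kernel='linear', probability=True)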
TEXT FEATURE ENGINEERING

Feature Engineering

Feature Engineering is the process of creating new features or transforming existing features to improve
the performance of a machine-learning model. It involves selecting relevant information from raw data
and transforming it into a format that can be easily understood by a model. The goal is to improve
model accuracy by providing more meaningful and relevant information.

Feature:

In the context of machine learning, a feature (also known as a variable or attribute) is an individual
measurable property or characteristic of a data point that is used as input for a machine learning
algorithm.
Features can be numerical, categorical, or text-based, and they represent different aspects of the data
that are relevant to the problem at hand.
 For example, in a dataset of housing prices, features could include the number of bedrooms,
the square footage, the location, and the age of the property. In a dataset of customer
demographics, features could include age, gender, income level, and occupation.
 The choice and quality of features are critical in machine learning, as they can greatly
impact the accuracy and performance of the model.

 Feature engineering is the pre-processing step of machine learning, which extracts features
from raw data. It helps to represent an underlying problem to predictive models in a better
way, which as a result, improve the accuracy of the model for unseen data.

 A predictive model contains predictor variables and an outcome variable, and the feature engineering process selects the most useful predictor variables for the model.

Cleaning Text Data


In real-world contexts, the idea of a naturally clean text dataset is pretty unsafe; text data is
rife with misspellings, non-dictionary constructs like emoticons, and in some cases, HTML
tagging.

Text cleaning with BeautifulSoup

 The first step should be to manually check the input data. This is pretty critical; with text data, one needs to try and understand what issues exist in the data initially so as to identify the cleaning needed.
Example: It's kind of painful to read through a dataset full of hateful Internet commentary,
so here's an example entry:

ID Date Comment
132 20120531031917Z """\xa0@Flip\xa0how are you not ded"""

It has an ID field and a date field, which don't seem to need much work. The text field, however, is quite challenging: from this one case alone, misspelling and HTML inclusions can be observed. Furthermore, many entries in the dataset contain attempts to bypass swear filtering, usually by including a space or punctuation element mid-word. Other data quality issues include multiple vowels (used to extend a word), non-ASCII characters, hyperlinks... the list goes on.

 One option for cleaning this dataset is to use regular expressions, which run over the input data to scrub out data quality issues. However, the quantity and variety of problem formats make it impractical to use a regex-based approach, at least to begin with.
 We're likely both to miss a lot of cases and also to misjudge the amount of
preparation needed, leading us to clean too aggressively, or not aggressively
enough; in specific terms we risk cutting into real text content or leaving parts of
tags in place. What we need is a solution that will wash out the majority of common
data quality problems to begin with so that we can focus on the remaining issues
with a script-based approach.
 BeautifulSoup is a very powerful text cleaning library which can, among other things,
remove HTML markup.
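A minimal sketch of this first cleaning pass, assuming BeautifulSoup 4 is installed (the sample comment is illustrative):

from bs4 import BeautifulSoup

raw_comment = "<b>@Flip</b> how are you not ded"             # illustrative input
text = BeautifulSoup(raw_comment, "html.parser").get_text()  # strips the HTML markup
print(text)                                                   # "@Flip how are you not ded"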

Managing Punctuation and Tokenizing

Tokenisation is the process of creating a set of tokens from a stream of text. Many tokens are words, while others might be character sets (such as smileys or other punctuation strings).
The re module (which provides operations over regular expressions, such as substring replacement) is used to remove a lot of the remaining ugliness from the initial dataset. On this pass we perform a series of operations over our input text, mostly focused on replacing variable or problematic text elements with tokens.
Example: Replacing e-mail addresses with an _EM token:
text = re.sub(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}', '_EM', text)

Similarly, we can remove URLs, replacing them with the _U token:


text = re.sub(r'\w+:\/\/\S+', r'_U', text)

We can automatically remove extra or problematic whitespace and newline characters, hyphens, and underscores. Extended series of punctuation characters are encoded here using codes such as _BQ and _BX; these longer tags are used as a means of differentiating them from the more straightforward _Q and _X tags (which refer to the use of a question mark and exclamation mark, respectively).

We can also use regular expressions to manage extra letters; by cutting down such strings to two
characters at most, we're able to reduce the number of combinations to a manageable amount and tokenize
that reduced group using the _EL token:
# Format whitespace
text = text.replace('"', ' ')
text = text.replace('\'', ' ')
text = text.replace('_', ' ')
text = text.replace('-', ' ')
text = text.replace('\n', ' ')
text = text.replace('\\n', ' ')
text = re.sub(' +', ' ', text)

# Manage punctuation: runs of ?, ., or mixed ?/! become _BQ, _SS and _BX tokens,
# while single question and exclamation marks become _Q and _X
text = re.sub(r'([^!\?])(\?{2,})(\Z|[^!\?])', r'\1 _BQ\n\3', text)
text = re.sub(r'([^\.])(\.{2,})', r'\1 _SS\n', text)
text = re.sub(r'([^!\?])(\?|!){2,}(\Z|[^!\?])', r'\1 _BX\n\3', text)
text = re.sub(r'([^!\?])\?(\Z|[^!\?])', r'\1 _Q\n\2', text)
text = re.sub(r'([^!\?])!(\Z|[^!\?])', r'\1 _X\n\2', text)

# Cut runs of repeated letters down to two characters and tag them with _EL
text = re.sub(r'([a-zA-Z])\1\1+(\w*)', r'\1\1\2 _EL', text)

# Join words split by a period, then strip remaining special characters
# (keeping spaces and the underscore used by the substitution tokens)
text = re.sub(r'(\w+)\.(\w+)', r'\1\2', text)
text = re.sub(r'[^a-zA-Z_ ]', ' ', text)

One of the more helpful indicators available is the _SW token for swearing. We'll also use regular expressions
to help identify and tokenize smileys into one of four buckets; big and happy smileys (_BS), small and happy
ones (_S), big and sad ones (_BF), and small and sad ones (_F):
text = re.sub(r'([#%&\*\$]{2,})(\w*)', r'\1\2 _SW', text)

text = re.sub(r' [8x;:=]-?(?:\)|\}|\]|>){2,}', r' _BS', text)

text = re.sub(r' (?:[;:=]-?[\)\}\]d>])|(?:<3)', r' _S', text)

text = re.sub(r' [x:=]-?(?:\(|\[|\||\\|/|\{|<){2,}', r' _BF', text)

text = re.sub(r' [x:=]-?[\(\[\|\\/\{<]', r' _F', text)

This is a simple splitting step (using re.split and re.findall rather than plain str.split), which enables the input to be treated as a vector of words (words) rather than as one long string:

phrases = re.split(r'[;:\.()\n]', text)
phrases = [re.findall(r'[\w%\*&#]+', ph) for ph in phrases]
phrases = [ph for ph in phrases if ph]

words = []
for ph in phrases:
    words.extend(ph)

ID Date Comments
132 20120531031917Z [['Flip', 'how', 'are', 'you', 'not', 'ded']]

Next, we perform a search for single-letter sequences. Sometimes, for emphasis, Internet communication
involves the use of spaced single-letter chains. This may be attempted as a method of avoiding curse
word detection:
tmp = words
words = []
new_word = ''

for word in tmp:
    if len(word) == 1:
        # accumulate runs of single letters into one word
        new_word = new_word + word
    else:
        if new_word:
            words.append(new_word)
            new_word = ''
        words.append(word)

ID Date Words
132 20120531031917Z ['_F', 'how', 'are', 'you', 'not', 'ded']

Raw:
GALLUP DAILY\nMay 24-26, 2012 \u2013 Updates daily at 1 p.m. ET; reflects one-day change\nNo updates Monday, May 28; next update will be Tuesday, May 29.\nObama Approval48%-\nObama Disapproval45%-1\nPRESIDENTIAL ELECTION\nObama47%-\nRomney45%-\n7-day rolling average\n\nIt seems the bump Romney got is over and the president is on his game.

Cleaned and split:
['GALLUP', 'DAILY', 'May', 'u', 'Updates', 'daily', 'pm', 'ET', 'reflects', 'one', 'day', 'change', 'No', 'updates', 'Monday', 'May', 'next', 'update', 'Tuesday', 'May', 'Obama', 'Approval', 'Obama', 'Disapproval', 'PRESIDENTIAL', 'ELECTION', 'Obama', 'Romney', 'day', 'rolling', 'average', 'It', 'seems', 'bump', 'Romney', 'got', 'president', 'game']

In the first case, we have a misspelled word; we need to find a way to eliminate this.
Secondly, a lot of the words in both examples (for example, are, pm) aren't terribly informative in
and of themselves. The problem we find, particularly for shorter text samples, is that what's left after
cleaning may contain only one or two meaningful terms. If these terms are not terribly common in
the corpus as a whole, it can prove to be very difficult to train a classifier to recognize these terms'
significance.

Tagging and categorizing words:


English language words come in several types: nouns, verbs, adverbs, and so on. These are commonly referred to as parts of speech. If we know whether a certain word is an adjective, as opposed to a verb or a stop word (such as a, the, or of), we can treat it differently when building features.

If we can perform part-of-speech tagging by identifying and encoding word classes as categorical variables, we're able to improve the quality of our data by retaining only the valuable content. Specifically, we will focus on n-gram tagging and backoff taggers, a pair of complementary techniques that allow us to create powerful recursive tagging algorithms.

import nltk
nltk.download('stopwords')   # fetch the NLTK stopwords corpus

from nltk.corpus import stopwords

words = [w for w in words if w not in stopwords.words("english")]

Tagging with NLTK


Tagging is the process of identifying parts of speech, as we described previously, and applying tags to
each term.
In its simplest form, tagging can be as straightforward as applying a dictionary over our input data,
just as we did previously with stopwords:
tagged = nltk.pos_tag(words)   # assigns a part-of-speech tag to each token

However, even brief consideration will make it obvious that our use of language is a lot more
complicated than this allows. We may use a word (such as ferry) as one of several parts of speech
and it may not be straightforward to decide how to treat each word in every utterance. A lot of the
time, the correct tag can only be understood contextually given the other words and their positioning
within the phrase.
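NLTK's default part-of-speech tagger illustrates this context sensitivity; a small sketch (the example sentences are illustrative, and the exact tags produced depend on the tagger model downloaded):

import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# the same word can receive different tags depending on its context
print(nltk.pos_tag(nltk.word_tokenize("They took the ferry across the river")))
print(nltk.pos_tag(nltk.word_tokenize("They ferry passengers across the river")))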

Sequential tagging
A sequential tagging algorithm is one that works by running through the input dataset, left-to-right and
token-by-token (hence sequential!), tagging each token in succession. The decision over which token to
assign is made based on that token, the tokens that preceded it, and the predicted tags for those preceding
tokens.

An n-gram tagger is a type of sequential tagger, which is pretrained to identify appropriate tags. The n-
gram tagger takes (n-1)-many preceding POS tags and the current token into consideration in producing a
tag.
The simplest form of n-gram tagger is one where n = 1, referred to as a unigram tagger. A unigram
tagger operates quite simply, by maintaining a conditional frequency distribution for each token.

The tagger assumes that the tag which occurs most frequently for a given token in a given sequence is
likely to be the correct tag for that token. If the term carp is in the training corpus as a noun four times
and as a verb twice, a unigram tagger will assign the noun tag to any token whose type is carp.
This might suffice for a first-pass tagging attempt but clearly, a solution that only ever serves up one tag
for each set of homonyms isn't always going to be ideal. The solution we can tap into is using n-grams
with a larger value of n. With n = 3 (a trigram tagger), for instance, we can see how the tagger might
more easily distinguish the input He tends to carp on a lot from He caught a magnificent carp.
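A small sketch of a unigram tagger trained on the Brown corpus (the category choice and test sentence are illustrative):

import nltk
from nltk.tag import UnigramTagger

nltk.download('brown')
train_sents = nltk.corpus.brown.tagged_sents(categories='news')

# a unigram tagger simply assigns each token its most frequent tag in the training data
unigram = UnigramTagger(train_sents)
print(unigram.tag(['He', 'tends', 'to', 'carp', 'on', 'a', 'lot']))
# tokens never seen during training are tagged None, which motivates the backoff
# taggers discussed in the next section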

Backoff tagging
Sometimes, a given tagger may not perform reliably. This is particularly common when the tagger has
high accuracy demands and limited training data. At such times, we usually want to build an ensemble
structure that lets us use several taggers simultaneously.

To do this, we create a distinction between two types of taggers: subtaggers and backoff taggers.
Subtaggers are taggers like the ones we saw previously, sequential and Brill taggers. Tagging structures
may contain one or multiple of each kind of tagger.

If a subtagger is unable to determine a tag for a given token, then a backoff tagger may be referred to
instead. A backoff tagger is specifically used to combine the results of an ensemble of (one or more)
subtaggers, as shown in the following example diagram:

In simple implementations, the backoff tagger will simply poll the subtaggers in order, accepting the first non-null tag provided. If all subtaggers return null for a given token, the backoff tagger will assign a None tag to that token. The polling order can be configured.

Backoffs are typically used with multiple subtaggers of different types; this enables a data scientist
to harness the benefits of multiple types of tagger simultaneously. Backoffs may refer to other
backoffs as needed, potentially creating highly redundant or sophisticated tagging structures:
In general terms, backoff taggers provide redundancy and enable you to use multiple taggers in a
composite solution.
import nltk
from nltk.tag import NgramTagger

nltk.download('brown')
# tagged sentences from the Brown corpus news category, used as training data
brown_a = nltk.corpus.brown.tagged_sents(categories='news')

tagger = None
for n in range(1, 4):
    # each n-gram tagger backs off to the previously built (n-1)-gram tagger
    tagger = NgramTagger(n, brown_a, backoff=tagger)

words = tagger.tag(words)

Creating features from Text Data:

• Stemming
• Lemmatising
• Bagging using random forests

Stemming
Another challenge when working with linguistic datasets is that multiple word forms exist for many word
stems. For example, the root dance is the stem of multiple other words—dancing, dancer, dances, and so
on. By finding a way to reduce this plurality of forms into stems, we find ourselves able to improve our n-
gram tagging and apply new techniques such as lemmatisation.

The techniques that enable us to reduce words to their stems are called stemmers. Stemmers work by parsing words as consonant/vowel strings and applying a series of rules. The most popular stemmer is the Porter stemmer, which works by performing the following steps:

1. Simplifying the range of suffixes by reducing them (for example, ies becomes i) to a smaller set.
2. Removing suffixes in several passes, with each pass removing a set of suffix types (for example, past participle or plural suffixes such as ousness or alism).
3. Once all suffixes are removed, cleaning up word endings by adding 'e's where needed (for example, ceas becomes cease).
4. Removing double 'l's.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# apply the stemmer to each token (stem operates on a single word at a time)
words = [stemmer.stem(w) for w in words]

The output of this stemmer, as demonstrated on our pre-existing example, is the root form of the word. This may be a real word, or it may not; dancing, for instance, becomes danc.
Lemmatisation is a more complex process for determining word stems; unlike Porter stemming, it uses a different normalisation process for different parts of speech, and it also seeks to find actual roots for words. Where a stem does not have to be a real word, a lemma does. Lemmatisation also takes on the challenge of reducing synonyms down to their roots.

As a necessary prerequisite, we need the POS for each input token.
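Since the WordNet lemmatizer expects its own POS codes ('n', 'v', 'a', 'r') rather than the Penn Treebank tags produced by the taggers above, a small mapping helper is typically needed; a minimal sketch (the wordnet_pos name is illustrative):

from nltk.corpus import wordnet

def wordnet_pos(treebank_tag):
    # map a Penn Treebank tag from the tagging step to the code the WordNet
    # lemmatizer expects, defaulting to noun
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN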


from nltk.stem import PorterStemmer, WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# lemmatize each tagged token; the wordnet_pos helper sketched above maps the
# Penn Treebank tag produced by the tagger to the POS code the lemmatizer expects
words = [lemmatizer.lemmatize(word, pos=wordnet_pos(tag)) for word, tag in tagged]

The output is now what we'd expect to see:

Source text:
The laughs you two heard were triggered by memories of his own high-flying exits off moving beasts

Post-lemmatisation:
['The', 'laugh', 'two', 'hear', 'trigger', 'memory', 'high', 'fly', 'exit', 'move', 'beast']

We have removed stop words, tokenized a range of other noise elements with regex methods, and removed any HTML tagging. Our text data has now reached a reasonably processed state; next, we can use bagging to help quantify the use of terms.

Bagging and Random Forests:

Bagging is part of a family of techniques that may collectively be referred to as subspace methods. There are several forms, each with a separate name. If we draw random subsets from the sample cases, then we're performing pasting. If we sample from cases with replacement, it's referred to as bagging. If, instead of drawing from cases, we work with a subset of features, then we're performing attribute bagging. Finally, if we choose to draw from both sample cases and features, we're employing what's known as a random patches technique.
The feature-based techniques, attribute bagging, and Random Patch methods are very valuable
in certain contexts, particularly very high-dimensional ones. Medical and genetics contexts both
tend to see a lot of extremely high-dimensional data, so feature-based methods are highly
effective within those contexts.
In NLP contexts, it's common to work with bagging specifically. In the context of linguistic
data, what we'll be dealing with is properly called a bag of words. A bag of words is an
approach to text data preparation that works by identifying all of the distinct words (or tokens)
in a dataset and then counting their occurrence in each sample. Let's begin with a
demonstration, performed over a couple of example cases from our dataset:
ID   Date              Words
132  20120531031917Z   ['_F', 'how', 'are', 'you', 'not', 'ded']
69   20120531173030Z   ['you', 'are', 'living', 'proof', 'that', 'bath', 'salts', 'effect', 'thinking']

This gives us the following 13-part list of terms:

["_F", "how", "are", "you", "not", "ded", "living", "proof", "that", "bath", "salts", "effect", "thinking"]
Using the indices of this list, we can create a 13-part vector for each of the preceding sentences. This vector's values are filled by traversing the preceding list and counting the number of times each term occurs in each sentence in the dataset. Given our pre-existing example sentences and the list we created from them, we end up creating the following bags:

ID   Date              Comment                                                Bag of words
132  20120531031917Z   _F how are you not ded                                 [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
69   20120531173030Z   you are living proof that bath salts effect thinking   [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1]

We can use a term weighting scheme to modify the values within each vector so that terms that are
indicative or helpful for classification are emphasized. Weighting schemes may be straightforward
masks, such as a binary mask that indicates presence versus absence.

Binary masking can be useful if certain terms are used much more frequently than normal; in such
cases, specific scaling (for example, log-scaling) may be needed if a binary mask is not used. At the
same time, though, frequency of term use can be informative (it may indicate emphasis, for instance)
and the decision over whether to apply a binary mask is not always made simply.
Another weighting option is term frequency-inverse document frequency, or tf-idf. This scheme
compares frequency of usage within a specific sentence and the dataset as a whole and uses values
that increase if a term is used more frequently within a given sample than within the whole corpus.
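One common formulation (scikit-learn's implementation adds smoothing, so its values differ slightly) is:

\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}
\]

where \(\mathrm{tf}(t, d)\) is the number of times term \(t\) appears in comment \(d\), \(N\) is the number of comments in the corpus, and \(\mathrm{df}(t)\) is the number of comments containing \(t\).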

Variations on tf-idf are frequently used in text mining contexts, including search engines. Scikit-learn provides a tf-idf implementation, TfidfVectorizer, which we'll shortly use to apply tf-idf ourselves.
The process of implementing bag of words is, again, fairly straightforward. We initialize our
bagging tool (matter-of-factly referred to as a vectorizer). Note that for this example, we're putting a
limit on the size of the feature vector.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

Our next step is to fit the vectorizer on our word data via fit_transform; as part of the fitting process,
our data is transformed into feature vectors:
train_data_features = vectorizer.fit_transform(words)
train_data_features = train_data_features.toarray()

Testing our prepared data


This is a single-class classification problem. The label is either 0, meaning a neutral comment, or 1, meaning an insulting comment (neutral can be considered as not belonging to the insult class). Your predictions must be a real number in the range [0, 1], where 1 indicates a 100% confident prediction that the comment is an insult.

 We are looking for comments that are intended to be insulting to a person who is a part of the larger
blog/forum conversation.
 We are NOT looking for insults directed to non-participants (such as celebrities, public figures etc.).
 Insults could contain profanity, racial slurs, or other offensive language. But often times, they do
not.
 Comments which contain profanity or racial slurs, but are not necessarily insulting to another
person are considered not insulting.
 The insulting nature of the comment should be obvious, and not subtle.
 There may be a small amount of noise in the labels as they have not been meticulously cleaned.
However, contestants can be confident the error in the training and testing data is < 1%.
Contestants should also be warned that this problem tends to strongly overfit. The provided data is generally
representative of the full test set, but not exhaustive by any measure. Impermium will be conducting final
evaluations based on an unpublished set of data drawn from a wide sample.

The desired score is the area under the curve (AUC), which is a measure that is very sensitive both to
false positives and to incorrect negative results (specificity and sensitivity).
Specifically, the top 14 participants on the private (test) leaderboard managed to reach an AUC score of
over 0.8. The top scorer managed a pretty impressive 0.84, while over half of the 50 teams who entered
scored above 0.77.

Random forest regression model:

We then grab the test data and apply our model to predict a score for each test case. We rescale these scores using a simple stretching technique and, finally, apply the roc_auc function to calculate an AUC score for the model:
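A minimal sketch of this benchmark, assuming the tf-idf features and labels for the training and test splits have been prepared as above (train_labels, test_data_features and test_labels are illustrative names), with scikit-learn's roc_auc_score standing in for the roc_auc function mentioned:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score

# fit a 100-estimator random forest on the tf-idf features
forest = RandomForestRegressor(n_estimators=100)
forest.fit(train_data_features, train_labels)

preds = forest.predict(test_data_features)

# simple stretching: rescale the predicted scores linearly into [0, 1]
preds = (preds - preds.min()) / (preds.max() - preds.min())

print("Random Forest benchmark AUC score, 100 estimators")
print(roc_auc_score(test_labels, preds))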

Result: Random Forest benchmark AUC score, 100 estimators

0.537894912105
