Neural Networks
Dataset Description
This dataset comprises 50,000 movie reviews. It is designed for binary sentiment
classification, focusing on predicting whether a review is positive or negative. With a
substantial volume of highly polar reviews, this dataset is suitable for tasks involving
sentiment analysis.
Hyperparameters Used
The hyperparameters used while creating the neural network in the provided code
include:
1. max_words :
Description: Maximum number of unique words considered as features in the
dataset.
Purpose: It limits the vocabulary size and is used in the Tokenizer to build the
word index.
2. embedding_dim :
Description: Dimensionality of the dense vectors representing words in the
embedding layer.
Purpose: Determines the size of the word embeddings, influencing the
complexity and detail in representing words.
3. epochs :
Description: Number of times the entire training dataset is processed by the
neural network during training.
Purpose: Defines the number of training iterations, impacting the model's
learning.
4. batch_size :
Description: Number of samples processed in each iteration during training.
Purpose: Affects how the model's weights are updated; larger batches can speed up
training, while smaller batches give noisier but more frequent gradient updates.
5. validation_split :
Description: Fraction of the training data used for validation during training.
Purpose: Monitors the model's performance on unseen data during training,
helping to detect overfitting.
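For reference, the concrete values these hyperparameters take later in this notebook, gathered here as a single sketch (each value also appears in the corresponding code cell below):
# Hyperparameter values used in this notebook
max_words = 10000        # vocabulary size kept by the Tokenizer
embedding_dim = 128      # size of each word-embedding vector
epochs = 5               # full passes over the training data
batch_size = 32          # samples per gradient update
validation_split = 0.2   # fraction of training data held out for validation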
Description of the Code
Link to the Google Colab notebook:
https://colab.research.google.com/drive/18yCvD9kqRos1vsuktoqbdxJnXU2rO7ao?usp=sharing
Importing all the necessary libraries:
pandas is used for data manipulation.
train_test_split from sklearn.model_selection is used to split the dataset into training
and testing sets.
LabelEncoder from sklearn.preprocessing is used to encode the target labels
(positive/negative sentiments).
Tokenizer from tensorflow.keras.preprocessing.text and pad_sequences from
tensorflow.keras.preprocessing.sequence are used to tokenize and pad the input text data.
Sequential from tensorflow.keras.models and Embedding, Flatten, Dense,
and Dropout from tensorflow.keras.layers are used to define the FNN model.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout
Loading the IMDb movie reviews dataset using pd.read_csv
df = pd.read_csv("/content/drive/MyDrive/IMDB Dataset.csv")
This part of the code transforms the target variable into numerical values.
It starts by assigning numerical values to the unique categorical values of the target
label (the encoder is only used on the target labels).
In this case, since there are two unique values in the target variable, positive and
negative, it will assign them 0 and 1.
The fit_transform function fits the label encoder and returns the encoded labels.
# Encode target labels (positive/negative)
le = LabelEncoder()
df['sentiment'] = le.fit_transform(df['sentiment'])
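As a quick illustration (assuming the sentiment column contains only the strings 'positive' and 'negative'), LabelEncoder orders the classes alphabetically, so 'negative' maps to 0 and 'positive' maps to 1:
# Inspect the learned label mapping (classes_ is sorted alphabetically)
print(le.classes_)                              # ['negative' 'positive']
print(le.transform(['negative', 'positive']))   # [0 1]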
This part of the code assigns a fixed integer ID to each word occurring in any
document (any piece of text treated as a single entity) in the training set.
max_words = 10000 :
This sets the maximum number of unique words to consider in the tokenizer. Only the
most frequent max_words words will be kept during tokenization, and less frequent words
will be ignored.
tokenizer = Tokenizer(num_words=max_words, split=' ') :
Initializes a tokenizer from the Tokenizer class. Two parameters are passed:
num_words specifies the maximum number of words to keep.
split=' ' indicates that words will be split on spaces.
All punctuation is removed by default.
tokenizer.fit_on_texts(df['review'].values) :
The fit_on_texts method fits the tokenizer on the review column of the dataframe.
The .values attribute converts the 'review' column into a NumPy array.
This step builds the vocabulary and assigns a unique numerical index to each word in
the corpus (the data).
X = tokenizer.texts_to_sequences(df['review'].values) :
Each review (sentence) is now transformed into a sequence of numbers. If one of the
reviews is 'apple orange banana', it becomes [1, 3, 2], where 1 corresponds to 'apple', 3 to
'orange', and 2 to 'banana'. A small toy demonstration of this follows the code block below.
X = pad_sequences(X) :
Pads the sequences to ensure they all have the same length. This is necessary for
feeding the data into a neural network with fixed input size. If a review has fewer words
than the maximum sequence length, it is padded with zeros at the beginning; if it is
longer, it is truncated.
# Tokenize the text
max_words = 10000
tokenizer = Tokenizer(num_words=max_words, split=' ')
tokenizer.fit_on_texts(df['review'].values)
X = tokenizer.texts_to_sequences(df['review'].values)
X = pad_sequences(X)
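A minimal toy illustration of what texts_to_sequences and pad_sequences do; the example sentences below are made up for demonstration and are not taken from the IMDb data:
# Hypothetical toy corpus, only for illustration
toy_tokenizer = Tokenizer(num_words=50, split=' ')
toy_tokenizer.fit_on_texts(['apple orange banana', 'apple banana'])
toy_seqs = toy_tokenizer.texts_to_sequences(['apple orange banana', 'apple banana'])
print(toy_seqs)                 # [[1, 3, 2], [1, 2]] -- indices depend on word frequency
print(pad_sequences(toy_seqs))  # shorter sequences are left-padded with zeros: [[1 3 2] [0 1 2]]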
This part of the code splits the dataset.
Inputs:
X : The input data, which represents the features or independent variables.
df['sentiment'] : The target variable, which is the sentiment label associated with
each input.
Parameters:
test_size=0.2 : Specifies that 20% of the data will be used as the test set, and the
remaining 80% will be used as the training set.
random_state=42 : Sets a random seed for reproducibility. The same seed ensures
that the random splitting of data is consistent across runs.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, df['sentiment'], test_size=0.2, random_state=42)
1. Embedding Layer: model.add(Embedding(max_words, embedding_dim,
input_length=X.shape[1]))
The Embedding layer is used for word embedding, which represents each word in
the input sequence as a dense vector of fixed size (embedding_dim).
max_words : The maximum number of words to consider as features.
embedding_dim : The dimension of the dense embedding.
input_length=X.shape[1] : The length of the input sequences (number of features in
each sequence).
2. Flatten Layer: model.add(Flatten())
The Flatten layer is used to flatten the output of the embedding layer into a one-
dimensional array. It prepares the data for the fully connected layers.
3. Dense Layer (ReLU Activation): model.add(Dense(256, activation='relu'))
This dense layer has 256 units and uses the Rectified Linear Unit (ReLU) activation
function. It introduces non-linearity to the model.
4. Dropout Layer: model.add(Dropout(0.5))
The Dropout layer helps prevent overfitting by randomly setting a fraction of input units
to zero during training (here, 50%).
5. Dense Output Layer (Sigmoid Activation): model.add(Dense(1, activation='sigmoid'))
The final dense layer has 1 unit (output node) with a sigmoid activation function. This is
common for binary classification problems.
6. Model Compilation: model.compile(loss='binary_crossentropy', optimizer='adam',
metrics=['accuracy'])
The model is compiled with the specified loss function (binary_crossentropy for binary
classification), optimizer (adam), and metrics (accuracy for evaluation).
# Build the FNN model
embedding_dim = 128
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=X.shape[1]))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
epochs and batch_size :
epochs is the number of times the model will be trained on the entire training
dataset.
batch_size is the number of samples (data points) used in each iteration of training.
model.fit():
The fit method trains the model by adjusting internal parameters based on provided
training data (X_train and y_train) to minimize the specified loss function. It requires
inputs like the number of epochs, batch size, and an optional validation split.
validation_split:
The validation split, set as validation_split=0.2, designates 20% of the training data for
validation. Monitoring the model's performance on this set during training offers insights
into its generalization to unseen data.
history:
The method returns a History object (history) with training process details, including loss
and accuracy over each epoch. This information aids analysis and visualization of the
model's performance during and after training.
# Train the model
epochs = 5
batch_size = 32
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.2)
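One possible way to use the returned History object is to plot the per-epoch metrics; this is a minimal sketch assuming matplotlib is available (as in a standard Colab environment) and that the metric keys follow recent TensorFlow/Keras naming ('accuracy' / 'val_accuracy'):
import matplotlib.pyplot as plt

# Plot training vs. validation accuracy recorded for each epoch
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()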
This code evaluates the trained model on the test dataset and prints the resulting loss
and accuracy metrics. The evaluation provides insights into how well the model
generalizes to new, unseen data.
# Evaluate the model
score = model.evaluate(X_test, y_test, verbose=0)
print(f'Test loss: {score[0]}, Test accuracy: {score[1]}')
This step is crucial for preserving the trained model so that it can be later loaded and
used for making predictions on new data without having to retrain the model from
scratch. The saved model file ('imdb_sentiment_analysis_fnn_model.h5') will contain the
architecture, weights, and configuration of the trained neural network.
# Save the model
model.save('imdb_sentiment_analysis_fnn_model.h5')
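A minimal sketch of how the saved model could later be loaded and used on new text; the example review below is hypothetical, and any new text must be converted with the same tokenizer and padded to the same sequence length used during training:
from tensorflow.keras.models import load_model

# Load the previously saved model from disk
loaded_model = load_model('imdb_sentiment_analysis_fnn_model.h5')

# Prepare a new (hypothetical) review with the same tokenizer and padding length
new_review = ["The movie was absolutely wonderful"]
new_seq = tokenizer.texts_to_sequences(new_review)
new_seq = pad_sequences(new_seq, maxlen=X.shape[1])

# The sigmoid output is close to 1 for positive reviews and close to 0 for negative ones
print(loaded_model.predict(new_seq))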
Results
Test loss: 0.6220757961273193, Test accuracy: 0.87739998
Training Metrics:
Epoch 1: Achieved a training accuracy of approximately 74.4% with a loss of
0.5067.
Epoch 2: Improved to a training accuracy of around 93.9% with a significantly
reduced loss of 0.1683.
Epochs 3 and 4: Achieved even higher training accuracy, reaching 99.0% and
99.7%, respectively. Loss values continued to decrease.
Validation Metrics:
Epoch 1: Validation accuracy was around 88.4%, and the validation loss was
0.2885.
Epoch 2: Validation accuracy remained high at 88.2%, with a slightly increased
loss of 0.2888.
Epochs 3, 4, and 5: Validation accuracy remained stable, ranging from 88.0%
to 88.5%. Validation loss increased slightly.
Test Metrics:
After 5 epochs, the model was evaluated on the test dataset, resulting in a test
loss of approximately 0.6221 and a test accuracy of about 87.7%.