Duplicate Question Detection
Using Random Forest Algorithm
Research Presentation
Team Members:
1. Arjun Shrestha
2. Sanjeev Roka
3. Sushant Khakurel
4. Vijay Dhakal
Under the supervision of
Surya Bam
CONTENTS
1 Introduction
-Problem Definition -Objective -Limitations
2 Methodology
-Data Collection -Algorithm Used
3 Implementation
-Architectural Design -Use Case Diagram -Sequence Diagram
4 Demonstration
5 Conclusion
INTRODUCTION
Problem Definition
• With duplicate questions:
  • The database carries extra load.
  • Answerers have to give the same answers repeatedly.
Objective
• Allow the user to ask a question.
• Predict whether a similar question has been asked previously.
Limitations
• Difficult to capture the semantics of a question
• Ambiguity in natural language
METHODOLOGY
Data Collection
• Collected from Kaggle; dataset released by Quora.
• Used only a fraction of it, i.e. 8000.
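Taking a fraction of a larger CSV dataset can be sketched as below. This is a minimal illustration, not the procedure the team used: the random sampling, the seed, the function name sample_rows and the column names (question1, question2, is_duplicate) are all assumptions, and the tiny in-memory CSV stands in for the real file.

```python
import csv
import io
import random

def sample_rows(csv_file, n, seed=42):
    """Return up to n randomly chosen rows (as dicts) from an open CSV file.

    Random sampling is an assumption; the slides only say a fraction was used.
    """
    rows = list(csv.DictReader(csv_file))
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

# Tiny in-memory CSV standing in for the real dataset (hypothetical columns).
demo_csv = io.StringIO(
    "question1,question2,is_duplicate\n"
    + "".join(f"question {i},other question {i},0\n" for i in range(20))
)
subset = sample_rows(demo_csv, 5)
print(len(subset))
```

With a real file, the same function would be called on an open file handle and n set to 8000.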
Random Forest Algorithm
• Supervised machine learning algorithm
• Ensemble of multiple decision trees
• Each tree is built with the CART algorithm
How does random forest work?
1. Randomly select “k” features from the total “m” features, where k << m.
2. Among the “k” features, calculate the node “d” using the best split point.
3. Split the node into child nodes using the best split.
4. Repeat steps 1 to 3 until “l” number of nodes has been reached.
5. Build the forest by repeating steps 1 to 4 “n” times to create “n” trees.
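The five steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the team's implementation: a depth limit (max_depth) stands in for the “l” node limit, each tree is trained on a bootstrap sample (standard random forest practice, not stated on the slide), and all function names are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_ids):
    """Step 2: find the (score, feature, threshold) with the lowest
    weighted Gini impurity among the given features, or None."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, k, max_depth, rng):
    """Steps 1 to 4: grow one tree, drawing k random features at each node."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority                                  # leaf node
    feature_ids = rng.sample(range(len(rows[0])), k)     # step 1
    split = best_split(rows, labels, feature_ids)        # step 2
    if split is None:
        return majority
    _, f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]  # step 3
    right = [i for i, r in enumerate(rows) if r[f] > t]

    def grow(idx):
        return build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                          k, max_depth - 1, rng)

    return {"feature": f, "threshold": t, "left": grow(left), "right": grow(right)}

def build_forest(rows, labels, n_trees, k, max_depth=3, seed=0):
    """Step 5: build n_trees trees, each on a bootstrap sample (assumption)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                 k, max_depth, rng))
    return forest

def predict(forest, row):
    """Each tree votes; the majority class wins."""
    votes = []
    for node in forest:
        while isinstance(node, dict):
            branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
            node = node[branch]
        votes.append(node)
    return Counter(votes).most_common(1)[0][0]

# Toy data: three features per question pair, label 1 = duplicate.
rows = [[0.1, 0.2, 0.1], [0.2, 0.1, 0.2], [0.15, 0.15, 0.1],
        [0.9, 0.8, 0.9], [0.8, 0.9, 0.8], [0.85, 0.9, 0.95]]
labels = [0, 0, 0, 1, 1, 1]
forest = build_forest(rows, labels, n_trees=15, k=2)
print(predict(forest, [0.0, 0.0, 0.0]), predict(forest, [1.0, 1.0, 1.0]))
```

In practice a library implementation such as scikit-learn's RandomForestClassifier would normally be used instead of hand-written trees.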
Split the dataset
Selection of features
• Features for each tree are selected at random
• We used,
-
Finding Best Split
• For each selected feature, calculate the Gini index.
• Select the feature with the minimum Gini index.
• Split the node on that feature.
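As a small worked example of this selection rule, the sketch below computes the weighted Gini index of a split for two candidate features and picks the smaller one. The feature names (fuzz_ratio, len_diff), the values and the fixed thresholds are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(values, labels, threshold):
    """Weighted Gini impurity of the two groups produced by a threshold split."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return sum(len(part) / n * gini(part) for part in (left, right) if part)

labels = [1, 1, 1, 0, 0, 0]            # 1 = duplicate, 0 = not duplicate
candidates = {                          # hypothetical features and thresholds
    "fuzz_ratio": ([90, 85, 80, 40, 30, 20], 50),
    "len_diff": ([1, 5, 2, 4, 3, 6], 3),
}
scores = {name: gini_index(vals, labels, t) for name, (vals, t) in candidates.items()}
best_feature = min(scores, key=scores.get)
print(scores, best_feature)
```

Here fuzz_ratio separates the two classes perfectly (Gini index 0), while len_diff leaves both groups mixed (Gini index 4/9), so the node would be split on fuzz_ratio.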
IMPLEMENTATION
Architectural Design
[Figure: architecture diagram]
Components: User, User Interface, Duplicate Question Detection, Pre-Processor, Feature Extractor, Random Forest Model, Questions Collection
Arrow labels: Inputs Question, Result, Pre-process, Get Features, Predict, Fetch
Data Preprocessing
• Lowercasing
• Noise Removal
• Tokenization
• Stop Word Removal
• Lemmatization
• Conversion into vectors
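The preprocessing steps listed above can be sketched with the standard library alone. This is an illustration under assumptions: the stop word list is a tiny sample, naive_lemma is a toy stand-in for a real lemmatizer (it only strips a plural "s"), and the bag-of-words count vector is one possible way to turn tokens into vectors; the slides do not say which tools or vectorization were used.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "do", "does", "i",
              "to", "of", "in", "how", "what"}

def naive_lemma(token):
    """Toy stand-in for a lemmatizer: strips a trailing plural "s" only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(question):
    text = question.lower()                              # lowercasing
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # noise removal
    tokens = text.split()                                # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [naive_lemma(t) for t in tokens]              # lemmatization (toy)

def to_vector(tokens, vocabulary):
    """Bag-of-words counts over a fixed vocabulary (assumed vectorization)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

tokens = preprocess("What are the best Books to learn Python?")
print(tokens)                                            # ['best', 'book', 'learn', 'python']
print(to_vector(tokens, ["book", "java", "python"]))     # [1, 0, 1]
```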
Feature Extraction
• Simple features
• FuzzyWuzzy features (based on edit distance)
• Distance-based features
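One feature from each of the three groups can be sketched as below. The specific features are assumptions chosen for illustration, since the slides do not list them. difflib.SequenceMatcher from the standard library is used here as a stand-in for a FuzzyWuzzy similarity ratio, and Jaccard distance on word sets is one example of a distance-based feature.

```python
from difflib import SequenceMatcher

def extract_features(q1, q2):
    """Return a dict of illustrative features for a pair of questions."""
    a, b = q1.lower(), q2.lower()
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    shared = words_a & words_b
    return {
        # Simple features
        "len_diff": abs(len(a) - len(b)),
        "common_words": len(shared),
        # Fuzzy feature: 0-100 similarity score (stand-in for FuzzyWuzzy)
        "fuzz_ratio": round(100 * SequenceMatcher(None, a, b).ratio()),
        # Distance-based feature: Jaccard distance between the word sets
        "jaccard_distance": 1 - len(shared) / len(union) if union else 0.0,
    }

features = extract_features("How to learn Python", "How to learn Java")
print(features)
```

The resulting values, one row per question pair, would form the feature vector passed to the random forest model.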
[Figure: three Input blocks]
Use Case Diagram
Sequence Diagram:
Tools Used
Demonstration
Conclusion
Any Questions?
[Figure: Input blocks with Tree 1, Tree 2 and Tree 3]