Optimizing Resume Screening with Machine Learning: An NLP Approach


2024 6th International Conference on Computational Intelligence and Networks (CINE) | 979-8-3315-1679-6/24/$31.00 ©2024 IEEE | DOI: 10.1109/CINE63708.2024.10881885

Amiya Ranjan Panda, School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, Odisha, amiya.pandafcs@kiit.ac.in
Rohan Kumar*, School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, Odisha, 2105139@kiit.ac.in
Ahona Ghosh, School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, Odisha, 2105098@kiit.ac.in
Lucky Das, School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, Odisha, 2105126@kiit.ac.in
Manoj Kumar Mishra, School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, Odisha, manojfcs@kiit.ac.in
Mahendra Kumar Gourisaria, School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, Odisha, mkgourisaria2010@gmail.com

Abstract: In the digital hiring landscape, managers face challenges in resume screening due to time constraints, biases, and high application volumes. This study takes a closer look at how natural language processing and machine learning are being used to speed up resume screening. Drawing on a survey of prior research, it shows how promising natural language processing systems can be. The methodology section gives details about the data set, including how the data is prepared and the different machine learning models tried out for classifying resumes. Using the Bag-of-Words method with the Gaussian Naive Bayes model, we achieved an accuracy of almost 99.5%. These findings indicate that NLP algorithms can effectively filter resumes, aiding recruiters in identifying top candidates efficiently while minimizing biases and saving time and costs. Ultimately, we highlight the potential of these systems to enhance the hiring process.

Index Terms: Classification, Machine Learning, Natural Language Processing, Resume Screening

I. INTRODUCTION

Nowadays there is a lot of competition in the employment market, and the process of resume screening poses significant challenges for both employers and job seekers. For hiring managers and recruiters, the task of screening resumes has grown more difficult in the digital age. Sorting through tons of resumes takes a long time. It's tricky too, with bias and mistakes creeping in. Job seekers face their own tough hurdle. Crafting resumes that show off their skills and experiences is hard. Plus, they need to stand out in a sea of applications.

Now, things are changing. Hiring has seen a real makeover with the adoption of natural language processing (NLP) and machine learning (ML). With these tools, companies can speed up hiring. They can find better candidates and reduce bias in their choices. It's all about using data and language tools to make things easier for everyone.

In this paper, we examine the fundamentals of machine learning and natural language processing and how they relate to resume screening. We explore the main obstacles that conventional resume screening techniques encounter. We also look at how machine learning models can help human recruiters by analyzing and classifying resumes automatically using already established criteria. These algorithms are trained on large data sets of past hiring outcomes.

The objective of this research is to identify an optimal ML and NLP approach for resume screening, provide a thorough examination of its effects on talent acquisition effectiveness, and expedite the hiring process.

The problem statement for the research and the introduction to the study are covered in Section I, which also states the suggested approaches and their associated benefits. Section II addresses the literature survey relevant to the study and the different approaches considered by other researchers for resume classification and ranking. Section III discusses the experimental design. It includes a description of the training data, system analysis, classification techniques, feature extraction, classifiers used, and performance metrics. Section IV presents the findings of the various machine learning models, with both graphical and tabular representations for a better understanding of the performance of each model. Section V concludes the paper and discusses the scope for further research.

II. LITERATURE SURVEY

Screening resumes is tough for recruiters. It can be really overwhelming. That's where natural language processing (NLP) comes in handy. It helps automate the resume screening process by pulling out important information like skills, experience, and education.

With so many jobs needing to be filled, hiring managers often find it hard to pick the right candidates based on what they actually need. This challenge is growing.

Recently, optimizing resume screening with machine learning and NLP has caught a lot of interest[1]. There are many new ideas and methods popping up. Researchers have looked

Authorized licensed use limited to: BRACT's Vishwakarma Institute Pune. Downloaded on March 27,2025 at 10:41:19 UTC from IEEE Xplore. Restrictions apply.
closely at employment recommendation services. For example, Mujtaba et al. created a detailed resume classification system that uses NLP and machine learning techniques. This system shows better accuracy and efficiency when sorting resumes according to job needs.

The perks of using an e-recruitment portal for companies are big. Plus, there are factors that could sway a candidate's selection, along with other key recruitment processes that matter a lot too.

Going further in the field, Rajath and his team tried out K-nearest neighbors (KNN) along with cosine similarity measures. They used these for classifying and ranking resumes. They showed that these methods can give accurate rankings based on how well a candidate fits the job[2].

Moreover, Ramos et al. examined the Term Frequency-Inverse Document Frequency (TF-IDF) method, which is used for figuring out which words are important in document queries. It's still a key technique in NLP-based resume screening systems. It helps pick out important terms and makes finding information more accurate[3]. Roy et al. explored an approach using machine learning for automating resume recommendation systems, presenting an efficient methodology for matching resumes to job descriptions using various ML algorithms[4].

To assess job titles, Chandak et al. also used the TF-IDF vectorization approach. The authors use cosine similarity as a criterion to assess how close candidate profiles are to open positions in order to further hone their recommendations. This architecture enhances the relevancy of employment recommendations given to users while also expediting the resume parsing process[6]. Further, in order to forecast appropriate job responsibilities, P. V. J. et al. investigated a method utilizing a custom Convolutional Neural Network (CNN) in conjunction with word2vec. They evaluated its performance against cosine similarity, Support Vector Machines (SVM), Random Forest, and a pre-trained BERT model[7].

In order to determine the best approach for precise classification, Surendiran et al. presented a thorough solution that investigates a number of machine learning techniques, such as Random Forests, Decision Trees, K-Nearest Neighbors (KNN), and Support Vector Machines (SVM)[8].

These collective efforts reflect the ongoing advancements and critical developments in optimizing resume screening through the integration of NLP and machine learning techniques.

Here, we try to increase the accuracy by utilizing additional pre-processing and classification methods.

III. METHODOLOGY

This section describes the dataset used, the preprocessing techniques, the model evaluation techniques, the machine learning models, and the statistical techniques carried out for the research.

Fig. 1. Block diagram showing all stages of the study

A. Dataset

The dataset used is a popular open-source dataset available on Kaggle. It is a collection of 960 resumes. The data consists of one independent variable and one dependent variable. The independent variable, labeled "resume," contains the resumes associated with different job categories. The dependent variable, labeled "Category," gives the job category of each resume. Table I lists the job categories in the data set, the number of resumes in each, and the label associated with each category after performing label encoding.

TABLE I
RESUME INSTANCES OF EVERY JOB CATEGORY

Job Category                 Resume Instances   Category Label
Advocate                     20                 0
Arts                         36                 1
Automation Testing           26                 2
Blockchain                   40                 3
Business Analyst             28                 4
Civil Engineer               24                 5
Data Science                 40                 6
Database                     33                 7
DevOps Engineer              55                 8
DotNet Developer             28                 9
Electrical Engineering       30                 10
ETL Developer                40                 11
Hadoop                       42                 12
Health and fitness           30                 13
HR                           44                 14
Java Developer               84                 15
Mechanical Engineer          40                 16
Network Security Engineer    25                 17
Operations Manager           40                 18
PMO                          30                 19
Python Developer             48                 20
Sales                        40                 21
SAP Developer                24                 22
Testing                      70                 23
Web Designing                45                 24
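As a rough illustration of the label encoding mentioned for Table I, the sketch below maps category names to integer labels and counts the resumes per category. The short `categories` list is made-up sample data, not the Kaggle file, and the case-insensitive sort is an assumption, chosen because it reproduces the ordering seen in Table I (for example, Electrical Engineering before ETL Developer). A plain code-point sort in Python would order some of these names differently.

```python
from collections import Counter

# Hypothetical sample of the "Category" column (not the real dataset).
categories = [
    "Java Developer", "HR", "Hadoop", "ETL Developer",
    "Electrical Engineering", "HR", "Java Developer",
    "Testing", "Java Developer",
]

# Assumption: order class names case-insensitively, as Table I appears to do.
classes = sorted(set(categories), key=str.casefold)
label_of = {name: label for label, name in enumerate(classes)}

# Replace every category name by its integer label.
encoded = [label_of[name] for name in categories]

# Number of resumes per category, most frequent first.
instances = Counter(categories).most_common()

print(label_of)
print(encoded)
print(instances[0])
```

The same counting step, applied to the full "Category" column, would give the "Resume Instances" column of Table I.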

Fig. 2. Distribution of Job Categories

Our data comprises 25 different categories. The top three job categories in our data are Java Developer, Testing, and DevOps Engineer.

B. Data Preprocessing

The data set is a CSV (Comma Separated Values) file. It contains resumes parsed from many sources. The resume column holds all the unfiltered resume text. Preparing this data for text classification takes a lot of time and effort. When preprocessing the data, we get rid of less useful text. We used the Natural Language Toolkit (NLTK) and Python regular expressions to do this.

The data pre-processing component cleans and prepares the resume data for machine learning. This includes the following steps:
• Stop word removal: Words like "the", "is" and "of" are considered stop words. They're very common and don't add much meaning. By taking these stop words out of the resume text, we can make the machine learning model more accurate.
• Stemming: Stemming reduces a word to its base form by stripping suffixes. For instance, "running" and "runs" both reduce to "run." This method helps machines learn better because it groups similar words together.
• Lemmatization: It's similar to stemming, but with linguistic knowledge. It finds a word's root form based on how the word is used. So "running" and also the irregular form "ran" both become "run." Lemmatization sometimes takes more computing power than stemming, but it often gives better results.

C. Natural Language Processing Models

Following pre-processing, the resume text is transformed into a collection of features that the machine learning model can use. Features can be extracted from text in a variety of ways.
• Bag of Words: Represents the text as a vector of word counts.
• Term Frequency-Inverse Document Frequency (TF-IDF): Represents the text as a vector of scores. Each score signifies a word's importance in the document relative to the whole collection.
• Word2Vec: Represents words as vectors of real numbers. Word2Vec is a neural network model designed to learn word representations that reflect the semantic links between words.

D. Machine Learning Models

The original data was split into two parts: 80% for training and 20% for testing. The features from the training data were used to train the different models. Predictions were then made on the test data, and the accuracy metrics were recorded.
• Logistic Regression
The logistic function is applied to the classification task, and a threshold value turns its output into a class decision. It is regarded as one of the simplest classification methods to implement.
• Decision Tree Classifier
In a decision tree classifier, each internal node represents a test of an attribute, every branch represents an outcome of that test, and every leaf node gives the final prediction, such as a job category. After the tree is built, it classifies new data by following the branches that match the feature values in the data.
• Random Forest Classifier
Draws multiple samples from the training data (with replacement) and independently trains a model on each sampled data set. The forecasts from all submodels are then combined, by averaging or voting, to predict the final result.
• k-Nearest Neighbors (KNN)
A machine learning algorithm from the instance-based learning family. This non-parametric method classifies a new data point by comparing it to its nearest neighbors in the training data set.
• Support Vector Machine (SVM)
A supervised learning model that works on both regression and classification tasks. The support vectors are key here: they are the data points closest to the decision boundary. They define the boundary between categories so that the margin between them is as wide as possible[9].
• Naive Bayes
A probabilistic algorithm that makes predictions based on the likelihoods that specific events will occur. It first learns the likelihood of the different feature values within each job category. Bayes' theorem is then applied to ascertain the job category to which a cleaned resume belongs.
It has two main advantages: it is simple, and it is efficient, because it requires relatively little training data and can be trained quickly. This makes it suitable for data with many features and speeds up the computation of probabilities. Naive Bayes classifiers are of three types:
– Multinomial Naive Bayes: Mostly used for document classification problems. Assumes the features are drawn from a multinomial distribution.
– Bernoulli Naive Bayes: Works similarly to the Multinomial Naive Bayes classifier. Assumes that the features are binary variables.
– Gaussian Naive Bayes: Assumes that continuous features follow a Gaussian distribution within each class, so it is used when the predictors take continuous rather than discrete values.
Here, we use the Gaussian Naive Bayes classifier for the classification task because it gave better accuracy in our case.

TABLE II
METRIC VALUES WITH BAG OF WORDS MODEL

Model Name             Accuracy   Precision   Recall   F1-Score
Decision Tree          0.9948     0.9958      0.9948   0.9949
K-Nearest Neighbor     0.9222     0.9571      0.9222   0.9283
Logistic Regression    0.9948     0.9954      0.9948   0.9948
Gaussian Naive Bayes   0.9948     0.9956      0.9948   0.9949
Random Forest          0.9948     0.9950      0.9948   0.9947
SVM Classifier         0.9948     0.9955      0.9948   0.9949
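The pipeline described in Sections III-B to III-D (cleaning, feature extraction, an 80/20 split, and Gaussian Naive Bayes) can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the ten one-line "resumes", the tiny stop-word list, the fixed variance floor `eps`, the `log(N / df)` form of IDF, and every function and class name are made up for the example. Stemming and lemmatization are left out because they need a linguistic library such as NLTK. The split is deterministic (every fifth resume is held out) so that the result is reproducible; on real data a random, stratified split would be the usual choice.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption); NLTK ships a longer one.
STOP_WORDS = {"a", "and", "in", "is", "of", "the", "with"}

def clean(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def fit_vocabulary(token_lists):
    """Assign a column index to every token seen in the training resumes."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    return {t: i for i, t in enumerate(vocab)}

def to_counts(tokens, vocab):
    """Bag of Words: word-count vector; unseen tokens are ignored."""
    vec = [0.0] * len(vocab)
    for token, count in Counter(tokens).items():
        if token in vocab:
            vec[vocab[token]] = float(count)
    return vec

def idf_weights(token_lists):
    """One common IDF variant, log(N / df), for every training term."""
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log(len(token_lists) / n) for t, n in df.items()}

class TinyGaussianNB:
    """Gaussian Naive Bayes with a fixed variance floor (eps)."""

    def __init__(self, eps=1e-3):
        self.eps = eps

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.stats = {}
        for label, rows in by_class.items():
            n = len(rows)
            means = [sum(col) / n for col in zip(*rows)]
            variances = [sum((v - m) ** 2 for v in col) / n + self.eps
                         for col, m in zip(zip(*rows), means)]
            self.stats[label] = (math.log(n / len(X)), means, variances)
        return self

    def log_posterior(self, label, row):
        """Unnormalized log posterior: log prior plus Gaussian log-likelihoods."""
        log_prior, means, variances = self.stats[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances))

    def predict(self, X):
        return [max(self.stats, key=lambda c: self.log_posterior(c, row))
                for row in X]

resumes = [  # invented one-line "resumes" with their job categories
    ("Java developer with Spring and Hibernate", "Java Developer"),
    ("Java developer, Spring and Maven", "Java Developer"),
    ("Java developer: Hibernate and REST", "Java Developer"),
    ("Java, Spring, REST API", "Java Developer"),
    ("Java developer with Spring API", "Java Developer"),
    ("Python, Pandas and machine learning", "Data Science"),
    ("Python with NumPy, statistics and learning", "Data Science"),
    ("Python, Pandas, statistics and models", "Data Science"),
    ("Machine learning models in Python", "Data Science"),
    ("Python and Pandas learning models", "Data Science"),
]

# Deterministic 80/20 split: every fifth resume goes to the test set.
train = [r for i, r in enumerate(resumes) if i % 5 != 4]
test = [r for i, r in enumerate(resumes) if i % 5 == 4]

train_tokens = [clean(text) for text, _ in train]
vocab = fit_vocabulary(train_tokens)
X_train = [to_counts(tokens, vocab) for tokens in train_tokens]
y_train = [label for _, label in train]
X_test = [to_counts(clean(text), vocab) for text, _ in test]
y_test = [label for _, label in test]

# TF-IDF alternative: scale each count by the term's IDF weight.
idf = idf_weights(train_tokens)
tfidf_first = [c * idf[t] for t, c in zip(sorted(vocab), X_train[0])]

model = TinyGaussianNB().fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print(predictions, accuracy)
```

On this toy data both held-out resumes are classified correctly. Rare terms such as "maven" receive a larger IDF weight than terms that occur in many resumes, such as "java", which is the sense in which TF-IDF scores signal a word's importance.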
IV. IMPLEMENTATION AND RESULTS

The outcomes of the various prediction models are discussed in this section, and different metrics have been used to assess the results.

A. Accuracy

It measures the fraction of all predictions, positive and negative, that the model gets right.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Here TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives.

B. Precision

It helps to evaluate the reliability of the model's positive predictions: the fraction of predicted positives that are truly positive.

$$\text{Precision} = \frac{TP}{TP + FP}$$

C. Recall

A performance statistic that assesses how well a classifier can identify the positive occurrences in the data set: the fraction of all real positive examples that it finds.

$$\text{Recall} = \frac{TP}{TP + FN}$$

D. F1-Score

It combines precision and recall into one number, which gives a balanced view of how the classifier is doing. It is the harmonic mean of precision and recall, treating them equally.

$$\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Table II provides a summary of the metric values for every prediction model used in conjunction with the Bag of Words model, and Figure 3 graphically compares these values.

Fig. 3. Classification models performance with Bag of Words model

Table III provides a summary of the metric values for every prediction model used in conjunction with the TF-IDF model, and Figure 4 graphically compares these values.

TABLE III
METRIC VALUES WITH TF-IDF MODEL

Model Name             Accuracy   Precision   Recall   F1-Score
Decision Tree          0.9948     0.9955      0.9948   0.9949
K-Nearest Neighbor     0.9792     0.9813      0.9792   0.9787
Logistic Regression    0.9948     0.9953      0.9948   0.9948
Gaussian Naive Bayes   0.9948     0.9953      0.9948   0.9948
Random Forest          0.9948     0.9961      0.9948   0.9950
SVM Classifier         1.0000     1.0000      1.0000   1.0000
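To make the four metric definitions concrete, here is a small sketch that computes them for a multi-class prediction. Two things are assumptions rather than statements from the paper: the labels in `y_true` and `y_pred` are invented, and the per-class scores are combined with a support-weighted average (the function name `weighted_metrics` is likewise hypothetical). That averaging choice is at least consistent with Tables II and III, where recall equals accuracy in every row, which is what a support-weighted recall always gives.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    total = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / total
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        # One-vs-rest counts for this class.
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = support[label] / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Invented example: one HR resume is misclassified as Sales.
y_true = ["HR", "HR", "HR", "HR", "Sales", "Sales", "Sales", "Sales",
          "Testing", "Testing"]
y_pred = ["HR", "HR", "HR", "Sales", "Sales", "Sales", "Sales", "Sales",
          "Testing", "Testing"]

scores = weighted_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
# accuracy 0.9000, precision 0.9200, recall 0.9000, f1 0.8984
```

With one error out of ten, precision drops only for the class that received the wrong prediction (Sales) and recall only for the class that lost it (HR), which is why the weighted precision and F1 differ slightly from the accuracy.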

Fig. 4. Classification models performance with TF-IDF model

Table IV provides a summary of the metric values for every prediction model used in conjunction with the Word2Vec model, and Figure 5 graphically compares these values.

TABLE IV
METRIC VALUES WITH WORD2VEC MODEL

Model Name             Accuracy   Precision   Recall   F1-Score
Decision Tree          0.9948     0.9953      0.9948   0.9948
K-Nearest Neighbor     0.9533     0.9584      0.9533   0.9528
Logistic Regression    0.9015     0.9212      0.9015   0.8940
Gaussian Naive Bayes   0.8808     0.9192      0.8808   0.8836
Random Forest          0.9948     0.9958      0.9948   0.9950
SVM Classifier         0.9585     0.9653      0.9585   0.9574

Fig. 5. Classification models performance with Word2Vec model

Fig. 6. Confusion Matrix with Bag of Words model

Fig. 7. Confusion Matrix with TF-IDF model

Fig. 8. Confusion Matrix with Word2Vec model
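A confusion matrix like the ones in Figs. 6 to 8 tabulates true categories against predicted categories. The sketch below builds one in plain Python. The labels and predictions are invented for illustration and are not the paper's test data, and the helper name `confusion_matrix` is simply chosen for this example.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true categories, columns are predicted categories."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]

labels = ["HR", "Sales", "Testing"]
y_true = ["HR", "HR", "HR", "HR", "Sales", "Sales", "Sales", "Sales",
          "Testing", "Testing"]
y_pred = ["HR", "HR", "HR", "Sales", "Sales", "Sales", "Sales", "Sales",
          "Testing", "Testing"]

matrix = confusion_matrix(y_true, y_pred, labels)
for name, row in zip(labels, matrix):
    print(f"{name:8s} {row}")

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(labels)))
print("accuracy:", correct / len(y_true))
```

Off-diagonal entries show which categories get mixed up: here the single nonzero off-diagonal cell is one HR resume predicted as Sales.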


In Figures 6, 7, and 8, you can see the confusion matrices for the Bag of Words, TF-IDF, and Word2Vec models, respectively. These come from using the Gaussian Naive Bayes model to make predictions on the test data set.

V. CONCLUSION

This study puts forward an approach to resume screening. It's based on NLP and ML techniques, using a data set from the
real world. The approach reached an accuracy of about 99.5% using the Bag-of-Words model with the Gaussian Naive Bayes model for making predictions. This shows that NLP can help filter resumes effectively for job openings. Recruiters can feel more confident that they're finding the best candidates. Plus, they can save time and money using the suggested system.

The suggested system has several practical implications. First, by automating the resume screening procedure, it can save recruiters time and money. Second, by assessing resumes on the basis of qualifications rather than on irrelevant details like the candidate's name, gender, or age, it can aid in reducing prejudice in the screening process. Third, by taking a broader look at a candidate's background, education, and experience, among other things, recruiters may be able to find more qualified applicants.

This research project can be extended in two ways. First, the effectiveness of the model can be increased using better data cleaning approaches. Second, deep learning models can be used to categorize applicants for jobs and optimize the hiring process.

REFERENCES

[1] G. Mujtaba, I. Ali, J. Ahmed, N. Mughal and Z. H. Khand, "Resume Classification System using Natural Language Processing and Machine Learning Techniques," Mehran University Research Journal of Engineering and Technology, Vol. 1, 65-79, January 2022.
[2] Rajath V, Riza Tanaz Fareed, Sharadadevi Kaganurmath, "Resume Classification and Ranking using KNN and Cosine Similarity," International Journal of Engineering Research & Technology (IJERT), Volume 10, Issue 08, August 2021.
[3] Ramos, J., et al., "Using tf-idf to determine word relevance in document queries," Proceedings of the first instructional conference on machine learning, Piscataway, NJ, pp. 133-142.
[4] Pradeep Kumar Roy, Sarabjeet Singh Chowdhary, Rocky Bhatia, "A Machine Learning approach for automation of Resume Recommendation system," Procedia Computer Science, Volume 167, 2020, Pages 2318-2327, ISSN 1877-0509.
[5] S. Pujari, "Resume Screening with Natural Language Processing in Python," Department of Computer Engineering, Vidyalankar Institute of Technology, Mumbai, India, September 2023.
[6] A. V. Chandak, H. Pandey, G. Rushiya and H. Sharma, "Resume Parser and Job Recommendation System using Machine Learning," 2024 International Conference on Emerging Systems and Intelligent Computing (ESIC), Bhubaneswar, India, 2024, pp. 157-162, doi: 10.1109/ESIC60604.2024.10481635.
[7] P. V. J, S. N. J. P, S. Gopinath, U. S and K. C.R., "Resume Analyzer and Skill Enhancement Recommender System," 2024 Asia Pacific Conference on Innovation in Technology (APCIT), Mysore, India, 2024, pp. 1-6, doi: 10.1109/APCIT62007.2024.10673530.
[8] B. Surendiran, T. Paturu, H. V. Chirumamilla and M. N. R. Reddy, "Resume Classification Using ML Techniques," 2023 International Conference on Signal Processing, Computation, Electronics, Power and Telecommunication (IConSCEPT), Karaikal, India, 2023, pp. 1-5, doi: 10.1109/IConSCEPT57958.2023.10169907.
[9] Gopal Kamineni, Kandula Akhil Sai, G. SivaNageswara Rao, "Resume Classification using Support Vector Machine," 2023 3rd International Conference on Pervasive Computing and Social Networking (ICPCSN), 2023.
[10] Srijita Chakraborty, Amiya Ranjan Panda, Priyal Vadiya, Sayam Samal, Ishita Gupta, Niranjan Kumar Ray, "Predicting Diabetes: A Comparative Study of Machine Learning Models," 2023 OITS International Conference on Information Technology (OCIT), 2023.
[11] D. Jagan Mohan Reddy, S. Regella and S. R. Seelam, "Recruitment Prediction using Machine Learning," 2020 5th International Conference on Computing, Communication and Security (ICCCS), Patna, India, 2020, pp. 1-4.
[12] Bhushan Kinge, Shrinivas Mandhare, Pranali Chavan, S. M. Chaware, "Resume Screening using Machine Learning and NLP: A proposed system," International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 8, Issue 2, pp. 253-258, March-April 2022.
[13] A. Jivtode, K. Jadhav, and D. Kandhare, "Resume Analysis using Machine Learning and NLP," in International Research Journal of Modernization in Engineering Technology and Science, 2023.
[14] A. Jivtode, K. Jadhav, and D. Kandhare, "Resume Analysis using Machine Learning and NLP," in International Research Journal of Modernization in Engineering Technology and Science, 2023.
