DATA COLLECTION
I began the project by identifying reliable sources for gender pay gap data. I used datasets from
platforms such as Kaggle and government labor databases. These datasets included features
like:
DATA ATTRIBUTE         DESCRIPTION
Gender                 Indicates the gender of the employee (e.g., Male, Female)
Salary                 Annual salary earned by the employee
Job Title              The designation or position held by the employee
Years of Experience    Number of years the employee has worked
Education Level        Highest qualification achieved (e.g., Bachelor's, Master's)
Location               Geographical location of the job or employee
Q: How did you clean the data?
I removed missing values and duplicates, encoded categorical variables, and normalized
numerical fields.
Data Cleaning & Preparation:
- Removed null or inconsistent records.
- Converted categorical data (e.g., gender, education level) using label encoding.
- Applied feature scaling to numeric columns for consistent model input.
- Separated data into training and test sets (80:20 split).
This preprocessing ensured the dataset was clean, balanced, and suitable for AI modeling.
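The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the small DataFrame is made up, the column names are borrowed from the attribute table, and fitting the scaler on the training split only is an assumption about the order of operations.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up sample rows for illustration only (not the project's dataset).
raw = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male",
               "Female", "Male", "Female", "Male", None, "Male"],
    "Education Level": ["Bachelor's", "Master's", "Bachelor's", "Master's",
                        "Bachelor's", "Bachelor's", "Master's", "Master's",
                        "Bachelor's", "Master's", "Bachelor's", "Bachelor's"],
    "Years of Experience": [2.0, 5.0, 3.0, 8.0, 1.0, 6.0,
                            7.0, 10.0, 4.0, 9.0, 3.0, 2.0],
    "Salary": [50000, 72000, 52000, 95000, 45000, 70000,
               80000, 110000, 58000, 100000, 51000, 50000],
})

# Remove rows with missing values, then exact duplicate rows.
clean = raw.dropna().drop_duplicates().reset_index(drop=True)

# Label-encode the categorical columns (classes are sorted alphabetically).
for col in ["Gender", "Education Level"]:
    clean[col] = LabelEncoder().fit_transform(clean[col])

# Separate features and target, then make the 80:20 split.
X = clean[["Gender", "Education Level", "Years of Experience"]]
y = clean["Salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale the numeric column; the scaler is fitted on the training rows only
# (an assumption) so that test data does not influence the scaling.
num_cols = ["Years of Experience"]
scaler = StandardScaler()
X_train, X_test = X_train.copy(), X_test.copy()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print(len(clean), len(X_train), len(X_test))
```

Label encoding assigns arbitrary integers to categories, which works for two-valued columns such as the sample Gender field here; for columns with many unordered categories (for example Job Title), one-hot encoding may be a safer choice with a linear model.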
BUILD YOUR PROTOTYPE
To analyze patterns and predict salary based on features like experience, job title, and
education, I developed a Linear Regression Model using Python.
Tools & Libraries Used:
- Pandas for data manipulation
- Scikit-learn for building and training the model
- Matplotlib and Seaborn for visualization
Steps Taken:
- Defined features (X) and target (y as salary).
- Trained a LinearRegression model using sklearn.
- Evaluated performance using R² Score and Mean Absolute Error (MAE).
- Visualized actual vs predicted values to understand how the model fits.
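A minimal sketch of these steps is shown below. The data is randomly generated for illustration, and the column names (experience, education, salary) and the numbers used to generate salaries are hypothetical, so the printed scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "experience": rng.integers(0, 20, n),   # years of experience
    "education": rng.integers(0, 2, n),     # label-encoded education level
})
data["salary"] = (
    40000
    + 3000 * data["experience"]
    + 8000 * data["education"]
    + rng.normal(0, 2000, n)                # random noise
)

# Define features (X) and target (y as salary).
X = data[["experience", "education"]]
y = data["salary"]

# 80:20 train/test split, then fit the linear regression model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R2: {r2:.3f}, MAE: {mae:.0f}")
```

R² measures the share of salary variance the model explains, while MAE reports the average prediction error in salary units, so the two together give a relative and an absolute view of fit.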
The prototype highlighted pay disparity patterns by comparing salaries across genders for
employees with similar experience and roles.
Q: What model did you build and why?
I built a Linear Regression model to predict salary based on factors like experience, education,
and job role.
Q: What tools did you use?
Python, Pandas, Scikit-learn, Matplotlib, and Seaborn for model development and visualization.
TEST YOUR SOLUTION
Q: How did you evaluate the model?
I used R² score and Mean Absolute Error (MAE) on test data to assess model performance.
Q: What did the results show?
The model revealed a noticeable salary gap even after controlling for experience and education,
supporting our hypothesis.
To validate the effectiveness of the model, I tested it using unseen test data.
Testing Approach:
- Compared predicted salaries against actual salaries.
- Analyzed whether salary predictions showed consistent discrepancies by gender.
- Created scatter plots and regression lines to visually compare actual vs predicted values.
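One way to check for consistent discrepancies by gender is to fit the model without the gender column and then compare the average prediction error (actual minus predicted) for each gender on the test set. The sketch below assumes that approach; it is not necessarily how the project did it. The data is synthetic, the column names are hypothetical, and the 5000 gap is deliberately built into the toy data so the effect is visible.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only; the pay gap is planted on purpose.
rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "experience": rng.integers(0, 20, n),
    "education": rng.integers(0, 2, n),
    "gender": rng.choice(["Female", "Male"], n),
})
planted_gap = np.where(df["gender"] == "Male", 5000, 0)
df["salary"] = (
    40000
    + 3000 * df["experience"]
    + 8000 * df["education"]
    + planted_gap
    + rng.normal(0, 2000, n)
)

# Train on experience and education only, leaving gender out of the model.
features = ["experience", "education"]
train, test = train_test_split(df, test_size=0.2, random_state=42)
model = LinearRegression().fit(train[features], train["salary"])

# Residual = actual salary minus predicted salary on unseen test rows.
test = test.copy()
test["residual"] = test["salary"] - model.predict(test[features])

# A group that is consistently paid above or below the prediction shows up
# as a positive or negative mean residual.
mean_residual = test.groupby("gender")["residual"].mean()
print(mean_residual)
```

If the mean residuals for the two groups differ clearly, employees with similar experience and education are being paid differently on average, which is the kind of pattern the testing approach looks for.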
Findings:
- The model successfully captured general salary trends.
- Predicted values exposed subtle gender-based gaps, even when controlling for other variables.
Conclusion of Testing:
The results confirmed the existence of wage disparity and supported our hypothesis that AI can
assist in identifying such patterns. Though the model is a prototype, it can be refined further to
improve accuracy and to support fairness audits.