Scaler Case Study
Introduction
Scaler, an emerging tech-versity, endeavors to provide world-class education to its learners. Understanding learners, especially in terms of their current roles, companies, and experience, is key: clustering similar learners helps in customizing the learning experience, thereby increasing retention and satisfaction.
Analyzing the vast data of learners can uncover patterns in their professional backgrounds and preferences. This allows Scaler to make tailored content recommendations and provide specialized mentorship. Using clustering techniques, Scaler can group learners with similar profiles, aiding in delivering more personalized learning journeys.
What is Expected?
Assuming you're a data scientist at Scaler, you're tasked with analyzing the dataset to profile the best companies and job positions from Scaler's database. Your primary goal is to execute clustering techniques, evaluate the coherence of your clusters, and provide actionable insights for enhanced learner profiling and course tailoring.
1. Data
The analysis was done on the data located at - https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/002/856/original/scaler_clustering.csv
2. Libraries
Below are the libraries required
import re
import umap
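Only two imports survive in this export. Based on the calls used later in the notebook (pd, np, plt, sns, KNNImputer, NearestNeighbors, and the clustering steps), the remaining imports would look roughly like this (a reconstruction, not the original cell):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans, AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram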
3. Data Loading
Loading the data into a Pandas DataFrame for easy handling of the data
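A minimal loading step, assuming pandas reads the CSV directly from the public URL given above:
df = pd.read_csv(
    'https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/002/856/original/scaler_clustering.csv'
)
df.info()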
*************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205843 entries, 0 to 205842
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 205843 non-null int64
1 company_hash 205799 non-null object
2 email_hash 205843 non-null object
3 orgyear 205757 non-null float64
4 ctc 205843 non-null int64
5 job_position 153279 non-null object
6 ctc_updated_year 205843 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 11.0+ MB
None
*************************************************
*************************************************
Shape of the dataset is (205843, 7)
*************************************************
*************************************************
Number of nan/null values in each column:
Unnamed: 0 0
company_hash 44
email_hash 0
orgyear 86
ctc 0
job_position 52564
ctc_updated_year 0
dtype: int64
*************************************************
*************************************************
Number of unique values in each column:
Unnamed: 0 205843
company_hash 37299
email_hash 153443
orgyear 77
ctc 3360
job_position 1016
ctc_updated_year 7
dtype: int64
*************************************************
*************************************************
Duplicate entries:
False 205843
Name: count, dtype: int64
In [4]: df.describe()
In [5]: df.describe(include='object')
Insight
There are 205843 entries with 7 columns
There are 44 null/missing values in company_hash, 86 in orgyear and 52564 in job_position
There are no duplicates
There are 1016 unique job_position
The column Unnamed: 0 can be dropped as it doesn't provide any new information
def preprocess_string(string):
    # keep only letters and spaces, lowercase the result, and strip surrounding whitespace
    new_string = re.sub('[^A-Za-z ]+', '', string).lower().strip()
    return new_string
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205843 entries, 0 to 205842
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company_hash 205843 non-null object
1 email_hash 205843 non-null object
2 orgyear 205757 non-null float64
3 ctc 205843 non-null int64
4 job_position 205843 non-null object
5 ctc_updated_year 205843 non-null float64
dtypes: float64(2), int64(1), object(3)
memory usage: 9.4+ MB
Sample row from df.head(): 5 yvuuxrj hzbvqqxta bvqptnxzs ucn rna 18f2c4aa2ac9dd3ae8ff74f32d30413f5165565b90d8f2... 2018.0 700000 fullstack engineer 2020.0
# keep orgyear values within the 0.1%-99.9% quantile range to drop the obvious outliers
data = df['orgyear'].dropna()
lower_bound = data.quantile(0.001)
upper_bound = data.quantile(0.999)
data = data[(data >= lower_bound) & (data <= upper_bound)]
fig, axs = plt.subplots(2, 1, figsize=(15, 4))
sns.countplot(ax=axs[0], x=data)
axs[0].tick_params(labelrotation=90)
sns.boxplot(ax=axs[1], x=data)
fig.suptitle('Orgyear distribution without outliers', fontsize=15)
plt.tight_layout()
plt.show()
Insight
The column orgyear has a lot of errors. The years close to 0 and the years greater than the current year are all outliers
Most learners began their employment at their current company in 2018
The distribution is left skewed, which is expected since some learners have been working for a long time
plt.tight_layout()
plt.show()
Insight
Most learners got their CTC updated in 2019, 2020, or 2021
# remove CTC values beyond 3 standard deviations from the mean before plotting
data = df['ctc']
mean = data.mean()
std = data.std()
lower_bound = mean - (3 * std)
upper_bound = mean + (3 * std)
data = data[(data > lower_bound) & (data < upper_bound)]
fig, axs = plt.subplots(2, 1, figsize=(15, 4), sharex=True)
sns.histplot(ax=axs[0], x=data)
sns.boxplot(ax=axs[1], x=data)
fig.suptitle('CTC distribution without outliers', fontsize=15)
plt.tight_layout()
plt.show()
Insight
The distribution of CTC is extremely right skewed, with an obvious outlier at a CTC of around 1.0e9
Even without this outlier, CTC remains right skewed, as a good number of learners have high CTCs
In [13]: df['company_hash'].value_counts()[:10]
Out[13]: company_hash
nvnv wgzohrnvzwj otqcxwto 8266
xzegojo 5348
vbvkgz 3446
zgn vuurxwvmrt vwwghzn 3356
wgszxkvzn 3212
vwwtznhqt 2833
fxuqg rxbxnta 2622
gqvwrt 2496
bxwqgogen 2121
wvustbxzx 2026
Name: count, dtype: int64
Insight
The largest number of learners have the company with hash nvnv wgzohrnvzwj otqcxwto as their current employer
In [14]: df['email_hash'].value_counts()[:10]
Out[14]: email_hash
bbace3cc586400bbc65765bc6a16b77d8913836cfc98b77c05488f02f5714a4b 10
298528ce3160cc761e4dc37a07337ee2e0589df251d73645aae209b010210eee 9
6842660273f70e9aa239026ba33bfe82275d6ab0d20124021b952b5bc3d07e6c 9
3e5e49daa5527a6d5a33599b238bf9bf31e85b9efa9a94f1c88c5e15a6f31378 9
d598d6f1fb21b45593c2afc1c2f76ae9f4cb7167156cdf93246d4192a89d8065 8
4818edfd67ed8563dde5d083306485d91d19f4f1c95d193a1700e79dd245b75c 8
d15041f58bb01c8ee29f72e33b136e26bc32f3169a40b53d75fe7ae9cbb9a551 8
faf40195f8c58d5c7edc758cc725a762d51920da996410b80ac4a4d85c803da0 8
c0eb129061675da412b0deb15871dd06ef0d7cd86eb5f7e8cc6a20b0d1938183 8
b4d5afa09bec8689017d8b29701b80d664ca37b83cb883376b2e95191320da66 8
Name: count, dtype: int64
Insight
It is surprising to see that many learners share the same email id, with a maximum of 10 learners having the email hash bbace3cc586400bbc65765bc6a16b77d8913836cfc98b77c05488f02f5714a4b
In [15]: df['job_position'].value_counts()
Out[15]: job_position
nan 52166
backend engineer 43336
fullstack engineer 25826
other 17628
frontend engineer 10341
...
traineeintern 1
staff consultant 1
java devloper 1
associate l 1
azure data factory 1
Name: count, Length: 848, dtype: int64
Insight
Most learners are Backend Engineers
Insight
Learners who joined or changed companies recently tend to have higher CTCs
Insight
The above plot shows a few of the positions with the top CTCs
It is a mix of software developers, data analysts/scientists, leadership roles, etc.
Insight
There are around 900+ companies in which, more than 50% of the time, the average CTC of a Data Scientist is greater than that of other roles
5. Data Preprocessing
In [20]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203956 entries, 0 to 203955
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company_hash 203956 non-null object
1 email_hash 203956 non-null object
2 orgyear 203956 non-null float64
3 ctc 203956 non-null int64
4 job_position 203956 non-null object
5 ctc_updated_year 203956 non-null float64
dtypes: float64(2), int64(1), object(3)
memory usage: 9.3+ MB
45062 oxej ntwyzgrgsxto rxbxnta bbace3cc586400bbc65765bc6a16b77d8913836cfc98b7... 2018.0 720000 support engineer 2020.0
101498 oxej ntwyzgrgsxto rxbxnta bbace3cc586400bbc65765bc6a16b77d8913836cfc98b7... 2018.0 720000 fullstack engineer 2020.0
116226 oxej ntwyzgrgsxto rxbxnta bbace3cc586400bbc65765bc6a16b77d8913836cfc98b7... 2018.0 720000 data analyst 2020.0
122888 oxej ntwyzgrgsxto rxbxnta bbace3cc586400bbc65765bc6a16b77d8913836cfc98b7... 2018.0 660000 support engineer 2019.0
142804 oxej ntwyzgrgsxto rxbxnta bbace3cc586400bbc65765bc6a16b77d8913836cfc98b7... 2018.0 660000 fullstack engineer 2019.0
151096 oxej ntwyzgrgsxto rxbxnta bbace3cc586400bbc65765bc6a16b77d8913836cfc98b7... 2018.0 660000 devops engineer 2019.0
There should be a unique entry for each combination of an employee's e-mail and CTC. I will remove the duplicates, keeping only the first entry, as the first entry appears to be the latest one.
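A minimal sketch of that de-duplication step (assuming drop_duplicates on the e-mail + CTC pair):
# keep only the first (latest) record for each email_hash + ctc combination
df = df.drop_duplicates(subset=['email_hash', 'ctc'], keep='first').reset_index(drop=True)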
In [23]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162004 entries, 0 to 162003
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company_hash 162004 non-null object
1 email_hash 162004 non-null object
2 orgyear 162004 non-null float64
3 ctc 162004 non-null int64
4 job_position 162004 non-null object
5 ctc_updated_year 162004 non-null float64
dtypes: float64(2), int64(1), object(3)
memory usage: 7.4+ MB
In [24]: df['email_hash'].value_counts()[:10]
Out[24]: email_hash
58ae1bae06ebf94f022cc06962029090b67e1e0a19c9b367426a0478ed349a41 2
f33f83536090140ab11955aea0c1940d24e52944e22d6b1ac83c07f77da04b45 2
ffa1726ba8fbf5c3824d00f6d311384f3b6873a82ba4c299392e956a0af14f88 2
02ba5874e4bdd9952a6b1a518d15628946cb7c1d72861adbd4f513b8da1fd52f 2
bb318b24ecb7b28951bc201e00cceb3b4de49c47539b4e70cff4d66b7d0e3951 2
85b377da8855513ddc45b010a6eaf02bd1581605d4533a1d8bbe9025aedf005c 2
79a18f18303458d0a393607f951ed8088a696e540812350fb98686c558c6bd73 2
9ce698befbf7aa7fc6fc468e8c7d98c3e069b9f5c176eec471f6d2b3e621fe28 2
7f334761242f8e2bd159707e166fbf0a380700a9fbef62e70c09f4a9e878690c 2
902b1c1a83fbffd623ae728394e826ca6b72c6b2a8e3b1f5bee13761dcca1cf5 2
Name: count, dtype: int64
Out[26]: company_hash 0
email_hash 0
orgyear 0
ctc 0
job_position 0
ctc_updated_year 0
dtype: int64
Insight
I will remove records where company hash is empty or "nan"
I will remove records where job position is empty
I will use imputation for job position with "nan" values
# treat the literal string 'nan' in job_position as a real missing value
df.loc[df['job_position'] == 'nan', 'job_position'] = np.nan
df.isna().sum()
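The cells that build temp_df and encoders for the KNNImputer step below are not visible in this export; a hedged sketch of how the categorical column might have been label-encoded before imputation (the column selection is an assumption):
from sklearn.preprocessing import LabelEncoder

# numeric columns plus the categorical column to impute -- an assumed selection
temp_df = df[['orgyear', 'ctc', 'ctc_updated_year', 'job_position']].copy()

encoders = {}
le = LabelEncoder().fit(temp_df['job_position'].dropna())
encoders['job_position'] = le
# map known labels to integer codes; missing values stay NaN for KNNImputer to fill
code_map = {label: code for code, label in enumerate(le.classes_)}
temp_df['job_position'] = temp_df['job_position'].map(code_map)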
# impute missing (label-encoded) job positions using the single nearest neighbour
imputer = KNNImputer(n_neighbors=1)
temp_df = pd.DataFrame(imputer.fit_transform(temp_df), columns=temp_df.columns)
# map the imputed integer codes back to job_position labels
temp_df['job_position'] = encoders['job_position'].inverse_transform(temp_df['job_position'].astype('int32'))
df.isna().sum()
Before -> Number of entries where ctc_updated_year is less than orgyear: 7331
After -> Number of entries where ctc_updated_year is less than orgyear: 0
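A hedged sketch of how that correction might have been applied (assuming the offending rows were simply dropped; the notebook could instead have clipped the years):
mask = df['ctc_updated_year'] < df['orgyear']
print(f"Before -> Number of entries where ctc_updated_year is less than orgyear: {mask.sum()}")
df = df[~mask].reset_index(drop=True)
print("After -> Number of entries where ctc_updated_year is less than orgyear:",
      (df['ctc_updated_year'] < df['orgyear']).sum())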
Calculate a CTC rank by comparing each learner's CTC separately against the average CTC per company, the average CTC per job position, and the average CTC per years of experience (a sketch of one way to compute this follows the list):
Value 1 - CTC is greater than all three average CTCs
Value 2 - CTC is greater than at least 2 of the average CTCs
Value 3 - CTC is greater than at least 1 of the average CTCs
Value 4 - CTC is less than all three average CTCs
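The cell that computes ctc_rnk is not visible in this export; a minimal sketch of one way to derive it with group-wise transforms (the avg_ctc_* helper column names are assumptions):
df['avg_ctc_company'] = df.groupby('company_hash')['ctc'].transform('mean')
df['avg_ctc_position'] = df.groupby('job_position')['ctc'].transform('mean')
df['avg_ctc_experience'] = df.groupby('years_of_experience')['ctc'].transform('mean')

# count how many of the three group averages each learner's CTC exceeds
beats = (
    (df['ctc'] > df['avg_ctc_company']).astype(int)
    + (df['ctc'] > df['avg_ctc_position']).astype(int)
    + (df['ctc'] > df['avg_ctc_experience']).astype(int)
)
# 3 wins -> rank 1, 2 wins -> rank 2, 1 win -> rank 3, 0 wins -> rank 4
df['ctc_rnk'] = 4 - beats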
In [34]: df.head()
Out[34]: company_hash email_hash orgyear ctc job_position ctc_updated_year years_of_experience years_since_increment ctc_rnk
1 qtrxvzwt xzegwgbb rxbxnta b0aaf1ac138b53cb6e039ba2c3d6604a250d02d5145c10... 2018.0 449999 fullstack engineer 2019.0 5.0 4.0 4
4 qxen sqghu 6ff54e709262f55cb999a1c1db8436cb2055d8f79ab520... 2017.0 1400000 fullstack engineer 2019.0 6.0 4.0 1
Calculate designation, class and tier values based on CTC statistics at the company-experience level, company-position level and company level respectively:
Value 1 - CTC is greater than the 75th percentile of the group
Value 2 - CTC is between the 50th and 75th percentiles of the group
Value 3 - CTC is less than the 50th percentile of the group
def get_level(x, x50, x75):
    # compare a CTC value with the group's 50th and 75th percentile CTCs
    # (the function name and def line are assumed; only the body survives in this export)
    if x < x50:
        return 3
    elif x >= x50 and x <= x75:
        return 2
    elif x > x75:
        return 1
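A hedged sketch of how such a helper could be applied per group to build the tier column (company level); designation and class would follow the same pattern, with ['company_hash', 'years_of_experience'] and ['company_hash', 'job_position'] as the grouping keys:
grp = df.groupby('company_hash')['ctc']
x50 = grp.transform(lambda s: s.quantile(0.50))
x75 = grp.transform(lambda s: s.quantile(0.75))
df['tier'] = [get_level(c, lo, hi) for c, lo, hi in zip(df['ctc'], x50, x75)]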
Out[37]: company_hash email_hash orgyear ctc job_position ctc_updated_year years_of_experience years_since_increment ctc_rnk designation class tier
2 ojzwnvwnxw vx 4860c670bcd48fb96c02a4b0ae3608ae6fdd98176112e9... 2015.0 2000000 backend engineer 2020.0 8.0 3.0 2 2 2 2
3 ngpgutaxv effdede7a2e7c2af664c8a31d9346385016128d66bbc58... 2017.0 700000 backend engineer 2019.0 6.0 4.0 4 3 3 3
4 qxen sqghu 6ff54e709262f55cb999a1c1db8436cb2055d8f79ab520... 2017.0 1400000 fullstack engineer 2019.0 6.0 4.0 1 2 1 1
Insight
The mean CTC of designation 1 > 2 > 3
The mean years of experience of designation 2 > 1 ~= 3
The top 4 positions for all the designations are backend engineer, fullstack engineer, other and frontend engineer
The clustering based on designation is able to differentiate between CTCs but not years of experience or job position
Insight
The mean CTC of class 1 > 2 > 3
The mean years of experience of class 1 > 2 > 3
The top 4 positions for all the classes are backend engineer, fullstack engineer, other and frontend engineer
The clustering based on class is able to differentiate between CTCs and between years of experience, but not job position
Insight
The mean CTC of tier 1 > 2 > 3
The mean years of experience of tier 1 > 2 > 3
The top 4 positions for tier 1 are backend engineer, fullstack engineer, engineering leadership and other, while for all the other tiers they are backend engineer, fullstack engineer, other and frontend engineer
The clustering based on tier is able to differentiate between CTCs and years of experience, and partially between job positions
In [42]: temp_df['job_position'].value_counts()[:10]
Out[42]: job_position
backend engineer 10498
fullstack engineer 5236
engineering leadership 3876
other 3148
frontend engineer 2061
data scientist 1637
android engineer 1011
qa engineer 983
devops engineer 959
backend architect 736
Name: count, dtype: int64
Insight
Above are the top 10 employees earning more than most of the employees in the company
The top employees belong to companies which provide software developer roles
(Only one row of the bottom-earners table survives in this export: 145643 xm b8a0bb340583936b5a7923947e9aec21add5ebc50cd60b... 15)
Insight
Above are the bottom 10 employees earning less than most of the employees in the company
Insight
Above are the top 10 employees of data scientist role earning more than their peers
Insight
Above are the bottom 10 employees of data scientist role earning less than their peers
In [47]: df[(df['tier'] == 1) & ((df['years_of_experience'] >= 5) & (df['years_of_experience'] <= 7))].sort_values(by='ctc', ascending=False).head(10)[['company_hash', 'email_hash', 'ctc']]
Insight
Above are the top 10 employees with 5,6 or 7 years of experience earning more than most of the employees in the company
Out[48]: company_hash ctc
mxsmvoptnwgb 37400000
stztojo 37200000
xmb 36100000
ofxssj 36000000
Insight
Above are the top 10 companies based on their CTC
C:\Users\dz31jl\AppData\Local\Temp\ipykernel_4344\1812531026.py:1: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
temp_df = df[['company_hash', 'job_position', 'ctc']].groupby('company_hash').apply(lambda x: x.nlargest(2, 'ctc')).reset_index(drop=True).sort_values(by='ctc', ascending=False)
Out[49]: job_position
backend engineer 9470
fullstack engineer 7214
other 4869
frontend engineer 3610
engineering leadership 3183
...
software qa engineer 1
credit risk 1
assistant manager 1
web ui designer 1
fullstack web developer 1
Name: count, Length: 255, dtype: int64
Insight
Backend engineer and fullstack engineer are the top 2 job positions with the highest CTCs in most of the companies
In [51]: X.head()
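The construction of the feature matrix X is not shown in this export; judging from the columns profiled per cluster later in the notebook, it presumably holds the engineered numeric features, e.g.:
# assumed feature set for clustering (column names taken from the cluster profiling below)
X = df[['ctc', 'years_of_experience', 'years_since_increment',
        'ctc_rnk', 'designation', 'class', 'tier']].copy()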
In [54]: X_scaled.head()
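X_scaled is likewise not defined in the surviving cells; assuming standardization with scikit-learn's StandardScaler:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)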
In [55]: sns.pairplot(X_scaled)
plt.show()
6. Model building
#w_distances = nearest neighbour distances from a sample of points from original data X
w_distances , w_indices = nbrs.kneighbors(X_sample , n_neighbors=2)
#distance to the second nearest neighbour (as the first neighbour will be the point itself, with distance = 0)
w_distances = w_distances[: , 1]
u_sum = np.sum(u_distances)
w_sum = np.sum(w_distances)
Out[57]: 0.9875647762150388
As per the Hopkins statistic, with a value of ~0.98, the dataset exhibits a very good clustering tendency
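Only a fragment of the Hopkins computation is visible above (X_sample, nbrs and u_distances come from cells that did not survive the export). For reference, a self-contained sketch of the statistic under those assumptions:
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, sample_frac=0.05, random_state=42):
    # values close to 1 indicate a strong clustering tendency, ~0.5 indicates random data
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    n = max(1, int(sample_frac * len(X)))

    nbrs = NearestNeighbors(n_neighbors=2).fit(X)

    # w: distance from each sampled real point to its nearest real neighbour (skip the self-distance of 0)
    X_sample = X[rng.choice(len(X), n, replace=False)]
    w_distances, _ = nbrs.kneighbors(X_sample, n_neighbors=2)
    w_distances = w_distances[:, 1]

    # u: distance from uniformly random points to their nearest real neighbour
    X_uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n, X.shape[1]))
    u_distances, _ = nbrs.kneighbors(X_uniform, n_neighbors=1)
    u_distances = u_distances[:, 0]

    return np.sum(u_distances) / (np.sum(u_distances) + np.sum(w_distances))

hopkins_statistic(X_scaled)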
# stack merge pairs, merge distances, and subtree sizes into a scipy-style linkage matrix
linkage_matrix = np.column_stack(
    [model.children_, model.distances_, counts]
).astype(float)
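The cells that build model and counts are not visible in this export; the fragment above matches the scikit-learn documentation recipe for plotting a dendrogram from AgglomerativeClustering, which (as an assumed reconstruction) looks roughly like this:
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

# fit with distance_threshold=0 so the full merge tree and distances_ are computed
# (a random sample of X_scaled may have been used, as agglomerative clustering is costly on 160k rows)
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X_scaled)

# count the number of original samples under each merge node
counts = np.zeros(model.children_.shape[0])
n_samples = len(model.labels_)
for i, merge in enumerate(model.children_):
    current_count = 0
    for child_idx in merge:
        if child_idx < n_samples:
            current_count += 1                      # leaf node
        else:
            current_count += counts[child_idx - n_samples]
    counts[i] = current_count

linkage_matrix = np.column_stack([model.children_, model.distances_, counts]).astype(float)
dendrogram(linkage_matrix, truncate_mode='level', p=3)
plt.show()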
Insight
From the above dendrogram we can see a clear pattern: learners with few years of experience are grouped together on the extreme left, learners with a medium number of years of experience are grouped together in the middle, and learners with many years of experience are grouped together on the extreme right
In [66]: clusters.head()
C:\ProgramData\anaconda3\Lib\site-packages\umap\umap_.py:1945: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")
C:\ProgramData\anaconda3\Lib\site-packages\umap\spectral.py:550: UserWarning: Spectral initialisation failed! The eigenvector solver failed. This is likely due to too small an eigengap. Consider adding some noise or jitter to your data.
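The cell that assigns cluster labels is not visible in this export. Assuming K-Means with k = 4 (consistent with the elbow method and dendrogram mentioned in the insights below), the labels could have been produced roughly like this:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
clusters = X_scaled.copy()
clusters['label'] = kmeans.fit_predict(X_scaled)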
In [70]: clusters.head()
In [72]: clusters.groupby(['label']).agg({'ctc':'mean',
'years_of_experience':'mean',
'years_since_increment':'mean',
'ctc_rnk':pd.Series.mode,
'designation':pd.Series.mode,
'class':pd.Series.mode,
'tier':pd.Series.mode,}).reset_index()
label
0 0.144530
1 0.338950
2 0.319407
3 0.197113
Name: count, dtype: float64
7. Insights
4 clusters were created based on the elbow method and dendrogram chart
Cluster 3 has the highest mean CTC, highest CTC rank, highest designation, highest class and highest tier. It also has the second highest years of experience
Cluster 2 has lowest mean CTC, lowest CTC rank, lowest designation, lowest class and lowest tier. It also has the lowest years of experience
Cluster 1 is similar to Cluster 2, with slightly better CTC, designation, class and tier
Cluster 0 has good CTC, CTC rank, designation, class and tier. It has the highest years of experience
From these observations, it looks like Cluster 0 comprises highly experienced people with good CTCs, Clusters 1 and 2 comprise entry-level to mid-senior-level people with average CTCs, and Cluster 3 comprises senior-level people with great CTCs
Most learners belong to Cluster 1, followed by Cluster 2.
8. Recommendation
Scaler has a lot of learners belonging to junior/mid-senior roles and hence should design more courses which will help these learners enhance their skills and move up the career ladder.
Scaler can attract more people to its learning platform by running ads showcasing how Software Development and Data Analyst/Scientist roles command high salaries
Scaler should make efforts to pull in more people from academia (both students and teachers) to grow its learner base, as well as to make students industry-ready and employable at high salaries.
The clustering algorithm can be improved by asking learners to state their job position more precisely and specifically instead of just selecting "Others"
Scaler should also ask learners to mention their domain of study/work, which again will help improve the clustering algorithm
9. Questionnaire
1. What percentage of users fall into the largest cluster?
Ans: Around 34% of learners fall into the largest cluster, Cluster 1
2. Comment on the characteristics that differentiate the primary clusters from each other.
Ans: CTC, Years of experience, CTC rank, Designation, Class and Tier are the most important characteristics that differentiate the clusters
3. Is it always true that with an increase in years of experience, the CTC increases? Provide a case where this isn't true.
Ans: No, it is not always true that CTC increases with experience. In this dataset, the maximum CTC belongs to a learner with 7 years of experience, while the minimum CTC belongs to a learner with 29 years of experience
4. Name a job position that is commonly considered entry-level but has a few learners with unusually high CTCs in the dataset.
Ans: Data Analyst is usually considered an entry-level job, yet there is a learner in a Data Analyst position whose CTC is higher than that of a learner in an engineering leadership position
6. For a given company, how does the average CTC of a Data Scientist compare with other roles?
Ans: There are around 900+ companies in which, more than 50% of the time, the average CTC of a Data Scientist is greater than that of other roles