ACA-NOC: Development of an Automated Coding Algorithm for the Canadian National Occupation Classification
In many research studies where examining social determinants is important, occupational information is often needed to augment existing data sets. Such information is usually solicited during interviews with open-ended questions, like “what is your job?” and “what industry sector do you work in?” Before this information can be used for further analysis, the responses need to be categorized using a coding system, like the Canadian National Occupational Classification (NOC). Manual coding is the usual method, but it is time-consuming and error-prone, which makes it a good candidate for automation.
To facilitate automated coding, we proposed to introduce a rigorous algorithm that can identify NOC (2016) codes using only a job title and industry information as input. Using manually coded data sets, we sought to benchmark and iteratively improve the performance of the algorithm.
We developed the ACA-NOC (Automated Coding Algorithm for the Canadian National Occupation Classification) algorithm, based on the National Occupational Classification (NOC) 2016, which allows users to match NOC codes with job titles and industry titles. We employed several different search strategies in the ACA-NOC algorithm to find the best match, including: Exact Search, Minor Exact Search, Like Search, Near (same order) Search, Near (different order) Search, Any Search, and Weak Match Search. In addition, a filtering step based on the hierarchical structure of the NOC data was applied in the algorithm to select the best matching codes.
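The cascade of search strategies can be pictured as trying progressively looser matches until one of them returns a result. The following is a minimal sketch under stated assumptions, not the ACA-NOC implementation: the miniature TITLES index, the function names, and the matching rule used for each strategy are hypothetical, and only three of the listed strategies are shown.

```python
import re

# Hypothetical miniature index of job title -> 4-digit NOC code (illustration only).
TITLES = {
    "janitor": "6733",
    "registered nurse": "3012",
    "software engineer": "2173",
}

def normalize(text):
    """Lower-case the text and keep only letter/digit runs, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def exact_search(query):
    """Match only when the normalized query equals an indexed title."""
    code = TITLES.get(query)
    return [code] if code else []

def like_search(query):
    """Match titles that contain the query as a substring (assumed rule)."""
    return [code for title, code in TITLES.items() if query in title]

def any_search(query):
    """Match titles sharing at least one word with the query (assumed rule)."""
    words = set(query.split())
    return [code for title, code in TITLES.items() if words & set(title.split())]

def find_codes(job_title):
    """Try strategies from strictest to loosest; return the first non-empty result."""
    query = normalize(job_title)
    if not query:
        return None, []
    strategies = (("exact", exact_search), ("like", like_search), ("any", any_search))
    for name, strategy in strategies:
        codes = strategy(query)
        if codes:
            return name, codes
    return None, []
```

Ordering the strategies from strict to loose means a precise match is never overridden by a weaker one, and the name of the strategy that fired can be kept alongside the code as a rough indicator of match quality.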
ACA-NOC was applied to over 500 manually coded job titles and industry titles. The accuracy rate at the 4-digit NOC code level was 58.66% and improved when broader job-categories were considered (65.01% at the 3-digit NOC code level, 72.26% at the 2-digit NOC code level, 81.63% at the 1-digit NOC code level).
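Because NOC codes are hierarchical, agreement at a broader level can be measured by comparing only the leading digits of the manual and automated codes. A minimal sketch of that calculation follows; the function name and the sample code pairs are hypothetical and are not the study data.

```python
def accuracy_by_level(pairs):
    """Return percent agreement on the first 4, 3, 2, and 1 digits.

    pairs is a list of (manual_code, predicted_code) strings.
    """
    if not pairs:
        raise ValueError("at least one coded pair is required")
    results = {}
    for digits in (4, 3, 2, 1):
        hits = sum(
            1 for manual, predicted in pairs
            if manual[:digits] == predicted[:digits]
        )
        results[digits] = round(100.0 * hits / len(pairs), 2)
    return results

# Made-up example pairs: (manual code, predicted code).
sample = [
    ("6733", "6733"),  # agrees at every level
    ("6733", "6731"),  # agrees down to the 3-digit level
    ("3012", "3233"),  # agrees only on the first digit
    ("2173", "0213"),  # disagrees at every level
]
print(accuracy_by_level(sample))
```

Accuracy can only stay the same or rise as fewer digits are compared, which is the pattern seen in the reported figures.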
ACA-NOC is a rigorous algorithm for automatically coding to the Canadian National Occupational Classification system, and has been evaluated using real-world data. It allows researchers to code moderate-sized data sets with occupation in a timely and cost-efficient manner, so that further analytics are possible. Initial assessments indicate it has state-of-the-art performance and is readily extensible upon further benchmarking on larger data sets.
- The program is written in Python 3.x. See https://docs.anaconda.com/anaconda/install/ for installation instructions.
- The data directory contains the input data file in a spreadsheet.
- There are three Python scripts in the src directory named preprocess.py, NOC_Code_Auto.py and result_analysis.py.
  - The preprocessing step
    - either replaces the Current Job Title with the Preferred Job Title if there is no Current Industry specified for that Current Job Title. This happens when no_replace_if_industry_exist = True.
    - or replaces the Current Job Title with the Preferred Job Title regardless of the Current Industry specified for that Current Job Title. This happens when no_replace_if_industry_exist = False.
  - NOC_Code_Auto.py will generate the NOC codes in a CSV file called title_noc_result_byprogram.csv.
  - result_analysis.py will analyse this generated CSV file and print the results of the analysis.
- The Canadian National Occupational Classification (NOC) comprises more than 30,000 occupational titles
gathered into 500 Unit Groups, organized according to 4 skill levels and 10 skill types. Unit Groups
are based on similarity of skills, defined primarily by functions and employment requirements.
Each Unit Group describes main duties and employment requirements as well as detailing examples
of occupational titles. Each unit group has a unique four-digit code. The first three digits of
this code indicate the major and minor groups to which the unit group belongs.
- NOC-2016 is organized in a four level hierarchy, and there are 10 broad occupational categories (first level), 46 major groups (second level), 140 minor groups (third level), and 500 unit groups (fourth level).
- The resources directory contains 5 files illustrating such organizations based on NOC-2016.
- The data directory contains 2 files: the preprocessing_candidates.xlsx spreadsheet listing the candidates to process and NOC-spreadsheet.xlsx as the input data.
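Since the four levels of the hierarchy are nested, the groups that a unit group belongs to can be read directly from the leading digits of its code. A small sketch, with a hypothetical function name and dictionary keys chosen for illustration:

```python
def noc_levels(unit_code):
    """Split a four-digit NOC 2016 unit group code into its four hierarchy levels."""
    code = str(unit_code).strip()
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"expected a four-digit NOC code, got {unit_code!r}")
    return {
        "broad_category": code[:1],  # first level
        "major_group": code[:2],     # second level
        "minor_group": code[:3],     # third level
        "unit_group": code,          # fourth level
    }

print(noc_levels("6733"))
```

Codes that begin with 0 should be kept as text. A spreadsheet that stores them as numbers drops the leading zero, and this sketch rejects such a value rather than guessing the missing digit.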
.
+-- src
|   +-- preprocess.py
|   +-- NOC_Code_Auto.py
|   +-- result_analysis.py
+-- resources
|   +-- nocjobtitle.txt
|   +-- noc_data_get_byws_dealing_slash.csv
|   +-- NOC_skilltype.csv
|   +-- NOC_majorgroup.csv
|   +-- NOC_minorgroup.csv
+-- data
|   +-- preprocessing_candidates.xlsx
|   +-- NOC-spreadsheet_BACKUP.xlsx
|   +-- NOC-spreadsheet.xlsx
+-- README.md
Go to the project directory:
$ cd /path/to/noccodeproject
A sample file NOC-spreadsheet_BACKUP.xlsx is provided with existing data. If NOC-spreadsheet.xlsx does not exist, rename the sample file to NOC-spreadsheet.xlsx.
Based on the recommended job titles and industries, this step will replace the Current Job Title and Current Industry inside NOC-spreadsheet.xlsx.
Two input files are required:
- DATA_FILE = NOC-spreadsheet.xlsx containing spreadsheets with column headers Participant ID, Current Job Title, Current Industry, and NOC code.
- PREPROCESS_CANDIDATE_FILE = preprocessing_candidates.xlsx containing two spreadsheets:
  - one with column headers Current Job Title, Preferred Job Title, and Preferred NOC code.
  - another with column headers Current Industry, Preferred Industry, and Preferred NOC code.
- Set the flag no_replace_if_industry_exist = True. The default is True.
For each spreadsheet NAME_OF_THE_SHEET in NOC-spreadsheet.xlsx, a new p_NAME_OF_THE_SHEET will be generated only if a match exists.
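The effect of the no_replace_if_industry_exist flag on a single record can be summarized as one rule. The sketch below is an assumption about how the flag behaves, based on the description in this README, and is not the code in preprocess.py; the function name and the sample mapping are hypothetical.

```python
def preprocess_title(current_title, current_industry, preferred_titles,
                     no_replace_if_industry_exist=True):
    """Return the job title to use after preprocessing.

    preferred_titles maps a Current Job Title to its Preferred Job Title.
    """
    preferred = preferred_titles.get(current_title)
    if preferred is None:
        return current_title  # no candidate is listed for this title
    has_industry = bool(current_industry and current_industry.strip())
    if no_replace_if_industry_exist and has_industry:
        return current_title  # an industry is present, so keep the original title
    return preferred

# Hypothetical candidate mapping (Current Job Title -> Preferred Job Title).
candidates = {"custodian": "janitor"}

print(preprocess_title("custodian", "", candidates))           # replaced
print(preprocess_title("custodian", "education", candidates))  # kept as-is
```

With the flag set to False, the second call would also return the preferred title, because the industry is then ignored when deciding whether to replace.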
Execute the following command:
$ python src/preprocess.py
Open NOC-spreadsheet.xlsx and review the spreadsheets titled p_NAME_OF_THE_SHEET to view the results of the preprocessing step.
The source data is an Excel file titled NOC-spreadsheet.xlsx which contains specific column-headers in no specific order: Participant ID, Current Job Title, NOC code, and Current Industry.
Edit NOC_Code_Auto.py and set SHEET_TITLE = p_Janitors to run on the processed Janitors spreadsheet.
Run NOC_Code_Auto.py using the following command:
$ python NOC_Code_Auto.py
Alternatively, create a NOC-spreadsheet.xlsx file with the columns:
- Column-1 header: Participant ID
- Column-2 header: Current Job Title
- Column-3 header: NOC code
- Column-4 header: Current Industry
Populate the spreadsheet with data.
- When Current Industry is left empty for a record, the algorithm considers only Current Job Title.
- NOC code cannot be left empty because result_analysis.py requires these codes to run the analysis.
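The two rules above can be checked before running the scripts. The sketch below is a hypothetical helper, not part of the project: it assumes the spreadsheet rows have already been loaded into a list of dictionaries keyed by column header, and the name check_rows is made up for illustration.

```python
REQUIRED_COLUMNS = ("Participant ID", "Current Job Title", "NOC code", "Current Industry")

def check_rows(rows):
    """Return a list of problems found in rows (dicts keyed by column header)."""
    problems = []
    for number, row in enumerate(rows, start=1):
        missing = [name for name in REQUIRED_COLUMNS if name not in row]
        if missing:
            problems.append(f"row {number}: missing column(s) {', '.join(missing)}")
            continue
        value = row["NOC code"]
        if value is None or not str(value).strip():
            problems.append(f"row {number}: NOC code is empty")
        # An empty Current Industry is allowed, so it is not reported.
    return problems

# Made-up example rows.
rows = [
    {"Participant ID": 1, "Current Job Title": "janitor", "NOC code": "6733", "Current Industry": ""},
    {"Participant ID": 2, "Current Job Title": "nurse", "NOC code": "", "Current Industry": "health"},
    {"Participant ID": 3, "Current Job Title": "clerk"},
]
for problem in check_rows(rows):
    print(problem)
```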
Run NOC_Code_Auto.py using the following command:
$ python NOC_Code_Auto.py
The results are generated in the title_noc_result_byprogram.csv file.
Run result_analysis.py to analyze the precision of automated coding:
$ python result_analysis.py
Install Docker and Docker Compose. For more information, see the Docker and Docker Desktop documentation.
Go to the project directory:
$ cd /path/to/noccodeproject
To run preprocess.py, edit Dockerfile and enable the line CMD [ "python", "./src/preprocess.py" ] by removing #, and run the following commands:
$ docker compose build
$ docker compose up
To run NOC_Code_Auto.py, edit Dockerfile and enable the line CMD [ "python", "./src/NOC_Code_Auto.py" ] by removing #, and run the following commands:
$ docker compose build
$ docker compose up
To run result_analysis.py, edit Dockerfile and enable the line CMD [ "python", "./src/result_analysis.py" ] by removing #, and run the following commands:
$ docker compose build
$ docker compose up