This project builds a data analysis pipeline for data science job opportunities posted on Indeed. The pipeline is not limited to data science, however: it can classify job opportunities from any sector.
This pipeline generates a .html file with:
- Clusters 2D Graph
- Clusters Keywords Ranking
- TF-IDF Ranking
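
To give an idea of what a TF-IDF ranking measures, here is a minimal plain-Python sketch. It is not the project's actual implementation, which is not shown in this README. The function name `tfidf_ranking`, the whitespace tokenizer, the `tf * log(N / df)` weighting, and the use of an `excluded` word list (playing the role of the toxic words file) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def tfidf_ranking(
    documents: List[str],
    excluded: Iterable[str] = (),
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Rank terms by their TF-IDF score summed over all documents.

    Hypothetical helper: tokenizes on whitespace and lowercases everything.
    """
    excluded_set = {word.lower() for word in excluded}
    tokenized = [
        [token for token in doc.lower().split() if token not in excluded_set]
        for doc in documents
    ]

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = len(tokenized)
    scores: Dict[str, float] = defaultdict(float)
    for tokens in tokenized:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)                   # term frequency
            idf = math.log(n_docs / doc_freq[term])    # inverse document frequency
            scores[term] += tf * idf

    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


if __name__ == "__main__":
    ads = ["python sql python", "python spark", "python excel"]
    for term, score in tfidf_ranking(ads, top_n=4):
        print(f"{term}: {score:.3f}")
```

A term that appears in every job ad (here `python`) scores zero, which is why TF-IDF tends to surface the words that distinguish one group of ads from another.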
Check out "Brazilian Data Science Jobs Market: A Deep Analysis" on the web!
| Folder | Description |
|---|---|
| db/ | Folder where your Scrapy database will be saved |
| output/ | Folder where your graphs and results will be saved |
| Argument | Usage |
|---|---|
| [db-title] | Title of your Scrapy database (e.g., datascience_db) |
| [urls-file] | Name of the file with your Indeed URLs (take a look at sample.urls) |
| [toxicwords-file] | Name of the file listing the words to exclude from the analysis (take a look at sample.toxicwords) |
| [num-clusters] | Number of clusters to identify, given as a range (e.g., 2-8) or a single value (e.g., 8) |
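
As an illustration of the two accepted `[num-clusters]` forms, the sketch below turns either a single value or a range into a list of candidate cluster counts. The function name `parse_num_clusters` is hypothetical, and treating the range as inclusive on both ends is an assumption; the parsing code in `app.py` may differ.

```python
from typing import List


def parse_num_clusters(value: str) -> List[int]:
    """Return the cluster counts described by '8' or '2-8'.

    Assumption: a range such as '2-8' includes both endpoints.
    """
    text = value.strip()
    if "-" in text:
        low_text, high_text = text.split("-", 1)
        low, high = int(low_text), int(high_text)
    else:
        low = high = int(text)
    if low < 1 or high < low:
        raise ValueError(f"invalid cluster specification: {value!r}")
    return list(range(low, high + 1))


if __name__ == "__main__":
    print(parse_num_clusters("2-8"))
    print(parse_num_clusters("8"))
```

With a range, a pipeline could fit one clustering per candidate count and compare the results; with a single value, only that count is used.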
Paraphrasing The Beatles: "All you need is Docker 🐳"
git clone https://github.com/HelioNeves/mut.git
cd mut
docker build . -t mut
docker run -ti --name MUT-env mut /bin/bash
python3 scraper.py [db-title] [urls-file]
python3 app.py [db-title] [toxicwords-file] [num-clusters]