Web scraping the popular job listing site "Glassdoor" with Python and BeautifulSoup.
- Intended to work without sign-in. User to provide a 'base url' to scrape from, based on desired job role and country.
- User to set a 'target job size' i.e. number of individual job listings to scrape from.
- Python script scrapes the job link, role, company and job description from Glassdoor results.
- Scraped information is returned to the user as an output CSV file.
- This script serves as a means of collecting unstructured data of job descriptions provided in job listings.
- With some programming knowledge, one can easily modify the script to work for job listing sites with similar layouts.
- Output data can then be analysed and visualised to generate useful insights.
- This repository is intended for people with some programming experience who want to improve the script and/or incorporate it into their own data science pipelines.
- Script has been tested and verified to work for target job sizes below 2000, spanning more than 10 pages of job listing links.
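As a rough illustration of the CSV output step described in the list above, the sketch below writes scraped records with Python's standard csv module. The column names and the sample record are assumptions made for this example and may not match the header the script actually produces.

```python
import csv
import io

# Hypothetical column names; the script's real CSV header may differ.
FIELDNAMES = ["link", "role", "company", "description"]


def write_listings(listings, handle):
    """Write a list of listing dicts to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for listing in listings:
        writer.writerow(listing)


# Illustrative data only, not real scraped output.
sample = [
    {
        "link": "https://example.com/job/1",
        "role": "Data Analyst",
        "company": "Example Co",
        "description": "Analyse data,\nbuild dashboards.",
    }
]

# An in-memory buffer stands in for the output file here.
buffer = io.StringIO()
write_listings(sample, buffer)
csv_text = buffer.getvalue()
```

Job descriptions usually contain commas and line breaks, so letting the csv module handle quoting is safer than joining fields by hand. When writing to a real file, open it with newline="" as the csv documentation recommends.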
Core Library: Beautiful Soup
Please refer to requirements.txt for the full list of requirements.
- HTML parser (Beautiful Soup) extracts job listing links (to individual job listing pages) from result page(s).
- HTML parser extracts information from individual job listing pages.
- Loop conditions control the 'movement' from job listing page-to-page.
- Loop conditions control the 'movement' from result page-to-page.
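The four steps above can be sketched as two nested loops around two Beautiful Soup parsing helpers. This is a minimal sketch, not the repository's actual code: the CSS class names, the URL scheme and the in-memory PAGES dictionary (which stands in for HTTP requests) are all assumptions made so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Tiny in-memory "site" standing in for HTTP responses.
# The markup and class names are illustrative, not Glassdoor's real layout.
PAGES = {
    "results?page=1": '<a class="job-link" href="job/1">A</a>'
                      '<a class="job-link" href="job/2">B</a>',
    "results?page=2": '<a class="job-link" href="job/3">C</a>',
    "job/1": '<h1 class="role">Data Analyst</h1>'
             '<span class="company">Acme</span><div class="desc">SQL</div>',
    "job/2": '<h1 class="role">Data Engineer</h1>'
             '<span class="company">Initech</span><div class="desc">ETL</div>',
    "job/3": '<h1 class="role">ML Engineer</h1>'
             '<span class="company">Globex</span><div class="desc">Models</div>',
}


def fetch(url):
    """Hypothetical fetcher; a real script would issue an HTTP request here."""
    return PAGES.get(url, "")


def extract_links(html):
    """Return the job listing links found on one result page."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("a.job-link[href]")]


def extract_listing(url, html):
    """Return link, role, company and description from one listing page."""
    soup = BeautifulSoup(html, "html.parser")

    def text_of(selector):
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else ""

    return {
        "link": url,
        "role": text_of(".role"),
        "company": text_of(".company"),
        "description": text_of(".desc"),
    }


def scrape(target_size):
    """Collect up to target_size listings, moving across result pages."""
    listings = []
    page = 1
    while len(listings) < target_size:        # result page-to-page loop
        links = extract_links(fetch(f"results?page={page}"))
        if not links:                         # no more result pages
            break
        for link in links:                    # job listing page-to-page loop
            if len(listings) >= target_size:
                break
            listings.append(extract_listing(link, fetch(link)))
        page += 1
    return listings


jobs = scrape(target_size=2)
```

Both loops check the target size, so scraping stops as soon as enough listings are collected, even in the middle of a result page.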
The original configuration.json file is set up to run tests.
- output_sample.txt contains expected results from tests.
- Run command to install prerequisites:
    pip install -r requirements.txt
- Run command to execute script:
    python main.py
- Verify that the resulting output.txt file is as expected.
- Modify the configuration.json file as necessary for deployment.
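Before deployment it may help to validate the edited configuration. The sketch below assumes configuration.json holds two keys, here called base_url and target_num. Both key names are hypothetical, so check the actual file for the names the script expects.

```python
import json


def load_config(text):
    """Parse and validate configuration JSON (key names are assumptions)."""
    config = json.loads(text)
    base_url = config["base_url"]            # hypothetical key name
    target_num = int(config["target_num"])   # hypothetical key name
    if not base_url.startswith(("http://", "https://")):
        raise ValueError("base_url must be an absolute http(s) URL")
    if target_num <= 0:
        raise ValueError("target_num must be a positive integer")
    return base_url, target_num


# Illustrative values only.
example = '{"base_url": "https://example.com/jobs?page=1", "target_num": 50}'
base_url, target_num = load_config(example)
```

In practice the text would come from the file itself, for example by reading configuration.json with open() and passing its contents to load_config.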
The following gif shows how a base_url can be obtained.
There are plans to build a data processing pipeline that analyses and visualises the extracted data to generate useful insights. Feel free to collaborate and contribute to this project, or open an issue to suggest useful features for implementation.