AJPOC stands for Airbus Jobs POrtal Crawler: this project aims to implement a customizable web crawler for the vacancies listed in the Airbus careers portal.
- Implementation of a Scrapy spider capable of:
- Following the "Next Page" hyperlink in the vacancy listing pages.
- Following the vacancy links to access their details.
- Parsing the vacancy contents and showing them formatted in the terminal.
- Customizable logging, both to the terminal and to a file.
- Implement the classes needed to store the information of the parsed vacancies as model objects.
- Implement a custom Scrapy pipeline to process the parsed data into these new objects.
- Adapt the spider to use the new pipeline.
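A minimal sketch of such a model and pipeline is shown below. The field names are assumptions, not the project's actual schema; a Scrapy item pipeline is just a class whose `process_item(item, spider)` method is called for every item the spider yields.

```python
from dataclasses import dataclass


@dataclass
class Vacancy:
    """Model object holding one parsed vacancy."""
    id: str
    title: str
    location: str = ""


class VacancyPipeline:
    """Turn the raw dicts yielded by the spider into Vacancy objects."""

    def process_item(self, item, spider):
        return Vacancy(
            id=item["id"],
            title=item["title"],
            location=item.get("location", ""),
        )
```

Adapting the spider to use the pipeline is then a settings change, e.g. `ITEM_PIPELINES = {"ajpoc.pipelines.VacancyPipeline": 300}` in `settings.py` (the module path is an assumption about the project layout).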
- Implement the mapping of the objects into a relational database structure (most likely using a database-agnostic Object-Relational Mapper such as SQLAlchemy).
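With SQLAlchemy, the mapping could look like the sketch below; the table and column names are assumptions, and the in-memory SQLite URL stands in for the real database. Note that with SQLite, connecting also creates the database file if it does not exist, which covers the "create it and connect" startup requirement.

```python
from sqlalchemy import Column, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Vacancy(Base):
    """ORM mapping of a vacancy to the 'vacancies' table."""
    __tablename__ = "vacancies"
    id = Column(String, primary_key=True)
    title = Column(String, nullable=False)
    location = Column(String)


# A file-based URL like "sqlite:///ajpoc.db" would persist across runs.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Vacancy(id="42", title="Software Engineer", location="Toulouse"))
    session.commit()
```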
- Provide persistence for the scraped data by storing the obtained information into a relational database:
- Connect to an existing DB on startup, or create it first if it does not exist.
- Adapt the pipeline to add the parsed data into the database.
- Compute the deltas with the previous existing data in the DB.
- Look for the presence of an object with the same Id in the DB.
- If the object already exists and has not been modified, skip it.
- If the object already exists and any field has been modified, update it and mark it as "Modified".
- If the object is not present, add it and mark it as "Added".
- If an object used to exist but is no longer present, remove it and mark it as "Deleted".
- Provide a summary report with the "Added", "Modified" and "Deleted" elements.
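The delta rules above can be sketched as a pure function; for clarity, both the previous DB snapshot and the fresh scrape are represented here as `{id: fields}` dicts rather than ORM queries.

```python
def compute_deltas(previous, current):
    """Classify vacancy ids as added, modified or deleted."""
    added = [vid for vid in current if vid not in previous]
    deleted = [vid for vid in previous if vid not in current]
    modified = [
        vid for vid in current
        if vid in previous and previous[vid] != current[vid]
    ]
    return {"added": added, "modified": modified, "deleted": deleted}


# Example: vacancy 1 is unchanged, 2 was modified, 3 removed, 4 added.
report = compute_deltas(
    previous={"1": {"title": "A"}, "2": {"title": "B"}, "3": {"title": "C"}},
    current={"1": {"title": "A"}, "2": {"title": "B2"}, "4": {"title": "D"}},
)
# report == {"added": ["4"], "modified": ["2"], "deleted": ["3"]}
```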
- Implement a filtering mechanism so the user can tweak the vacancies processed:
- Add support for the filters of the Airbus Job Site.
- Add support for custom filters based on keywords of the vacancies contents.
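The custom keyword filter could be as simple as the sketch below: a vacancy passes if any of the user-supplied keywords appears in its searchable text, matched case-insensitively. The field names are the same assumed schema as above.

```python
def matches_keywords(vacancy, keywords):
    """Return True if any keyword occurs in the vacancy's text fields."""
    text = " ".join(str(v) for v in vacancy.values()).lower()
    return any(kw.lower() in text for kw in keywords)


vacancies = [
    {"id": "1", "title": "Embedded Software Engineer", "location": "Hamburg"},
    {"id": "2", "title": "Finance Controller", "location": "Toulouse"},
]
selected = [v for v in vacancies if matches_keywords(v, ["software", "python"])]
# Only vacancy "1" matches.
```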
- Provide a push notification mechanism through Telegram with the python-telegram-bot library.
- Create a new Telegram bot.
- Integrate the python-telegram-bot API in the code.
- Send a message with the report summary to the subscribed users after each scraping run completes.
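The notification step could be sketched as below using the Telegram Bot HTTP API directly (`python-telegram-bot` wraps the same `sendMessage` endpoint). The token and chat id are placeholders; `send_message` performs a real network call and is not exercised here.

```python
import json
import urllib.parse
import urllib.request


def format_report(deltas):
    """Render the delta summary as a short Telegram message body."""
    return "\n".join(
        f"{kind.capitalize()}: {len(ids)}" for kind, ids in deltas.items()
    )


def send_message(token, chat_id, text):
    """POST the text to one subscriber via the Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    with urllib.request.urlopen(url, data=data) as resp:
        return json.load(resp)


message = format_report({"added": ["4"], "modified": ["2"], "deleted": []})
# message == "Added: 1\nModified: 1\nDeleted: 0"
```

After each run, the crawler would loop over the subscribed chat ids and call `send_message` once per user.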
- At this stage, the program should be capable of running in a loop on a Raspberry Pi, sending a notification to the user every time a vacancy is published, modified or removed.
- Testing, tweaking and bug fixing...