AJPOC

AJPOC stands for Airbus Jobs POrtal Crawler; this project implements a customizable web crawler for the vacancies listed in the Airbus careers portal.


![Status: in development](https://img.shields.io/badge/status-in%20development-blue)

Project stages:

Stage 1 (Work-in-progress)

  • Implementation of a Scrapy spider capable of:
    • Following the "Next Page" hyperlink in the vacancy listing pages.
    • Following each vacancy's link to access its details.
    • Parsing the vacancy contents and showing them formatted in the terminal.
  • Customizable logging capabilities, both to the terminal and to a file.
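The dual logging described above can be set up with Python's standard library; a minimal sketch (the logger name, format string and defaults are illustrative assumptions, not the project's actual configuration):

```python
import logging

def configure_logging(logfile="ajpoc.log", level=logging.INFO):
    """Set up logging to both the terminal and a file (illustrative sketch)."""
    logger = logging.getLogger("ajpoc")
    logger.setLevel(level)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    console = logging.StreamHandler()  # messages shown in the terminal
    console.setFormatter(fmt)
    logger.addHandler(console)

    to_file = logging.FileHandler(logfile)  # same messages persisted to disk
    to_file.setFormatter(fmt)
    logger.addHandler(to_file)
    return logger
```

Because each handler carries its own level and formatter, terminal and file output can later be tuned independently.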

Stage 2 (TODO)

  • Implement the classes needed to store the information of the parsed vacancies into model objects.
  • Implement a custom Scrapy pipeline to process the parsed data into these new objects.
  • Adapt the spider to use the new pipeline.
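One plausible shape for such a model and pipeline is sketched below with a plain dataclass; the field names and pipeline interface details are assumptions for illustration, not the project's final design:

```python
from dataclasses import dataclass

@dataclass
class Vacancy:
    """Model holding one parsed vacancy (field names are illustrative)."""
    vacancy_id: str
    title: str
    location: str = ""
    description: str = ""

class VacancyPipeline:
    """Minimal Scrapy-style pipeline sketch: turn parsed dicts into Vacancy objects."""
    def __init__(self):
        self.items = []

    def process_item(self, item, spider=None):
        vacancy = Vacancy(**item)  # map the raw parsed fields onto the model
        self.items.append(vacancy)
        return vacancy
```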

Stage 3 (TODO)

  • Implement the mapping of the objects into a relational database structure (most likely with a database-agnostic Object-Relational Mapper such as SQLAlchemy).
  • Provide persistence for the scraped data by storing the obtained information in a relational database:
    • On startup, connect to an existing DB (or create it first if it does not exist).
    • Adapt the pipeline to add the parsed data to the database.
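The project plans to use an ORM such as SQLAlchemy for this stage; as a dependency-free illustration of the same idea (connect or create on startup, then insert parsed rows), here is a sketch using the standard library's sqlite3 module. The table and column names are assumptions:

```python
import sqlite3

def open_db(path=":memory:"):
    """Connect to the DB, creating the schema first if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS vacancies (
               vacancy_id TEXT PRIMARY KEY,
               title      TEXT NOT NULL,
               location   TEXT
           )"""
    )
    return conn

def store_vacancy(conn, vacancy_id, title, location=""):
    """Insert (or replace) one parsed vacancy row."""
    conn.execute(
        "INSERT OR REPLACE INTO vacancies VALUES (?, ?, ?)",
        (vacancy_id, title, location),
    )
    conn.commit()
```

With an ORM the schema would instead be declared on the model classes, but the startup behaviour (create-if-missing, then insert from the pipeline) stays the same.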

Stage 4 (TODO)

  • Compute the deltas against the data already present in the DB.
    • Look for an object with the same Id in the DB.
    • If the object already exists and nothing has changed, skip it.
    • If the object already exists and any field has been modified, update it and mark it as "Modified".
    • If the object is not present, add it and mark it as "Added".
    • If an object used to exist but is no longer present, remove it and mark it as "Deleted".
    • Provide a summary report with the "Added", "Modified" and "Deleted" elements.
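The delta rules above can be sketched as a pure function over two id-keyed mappings (the previous DB snapshot versus the newly scraped data); representing the snapshots as dicts is an assumption for illustration:

```python
def compute_deltas(old, new):
    """Classify vacancy ids into Added / Modified / Deleted per the rules above.

    `old` and `new` map vacancy id -> vacancy data; unchanged ids are skipped.
    """
    added = [vid for vid in new if vid not in old]
    deleted = [vid for vid in old if vid not in new]
    modified = [vid for vid in new if vid in old and new[vid] != old[vid]]
    return {"Added": added, "Modified": modified, "Deleted": deleted}
```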

Stage 5 (TODO)

  • Implement a filtering mechanism so the user can control which vacancies are processed:
    • Add support for the filters of the Airbus Job Site.
    • Add support for custom filters based on keywords in the vacancy contents.
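The custom keyword filter could be as simple as a substring match over the vacancy text; a minimal sketch (case-insensitive matching is an assumed design choice, not a stated requirement):

```python
def keyword_filter(vacancy_texts, keywords):
    """Keep only vacancies whose text contains at least one keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [
        text for text in vacancy_texts
        if any(k in text.lower() for k in lowered)
    ]
```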

Stage 6 (TODO)

  • Provide a push-notification mechanism through Telegram using the python-telegram-bot library.
    • Create a new Telegram bot.
    • Integrate the python-telegram-bot API into the code.
    • Send the resulting report to the subscribed users once scraping completes.
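The notification payload itself is just text; a sketch of building the summary string that would then be handed to python-telegram-bot for sending (the report structure and labels are assumptions based on the Stage 4 categories):

```python
def format_report(deltas):
    """Render the scraping summary as a plain-text message for the bot to send.

    `deltas` maps "Added"/"Modified"/"Deleted" to lists of vacancy ids.
    """
    lines = ["AJPOC scraping report:"]
    for label in ("Added", "Modified", "Deleted"):
        ids = deltas.get(label, [])
        lines.append(f"{label}: {len(ids)}")
        lines.extend(f"  - {vid}" for vid in ids)
    return "\n".join(lines)
```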

Stage 7 (TODO)

  • At this stage, the program should be capable of running in a loop on a Raspberry Pi while sending notifications to the user every time a new vacancy is published, modified or removed.
  • Testing, tweaking and bug fixing...
