Tired of manually scrolling through every co-op posting on DrexelOne within a single week? Now you don't have to!
, \ / ,
/ \ )\__/( / \
/ \ (_\ /_) / \
__________________/_____\__\@ @/___/_____\_________________
| |\../| |
| \VV/ |
| D-R-E-X-E-L C-O-O-P M-A-T-C-H-E-R |
|__________________________________________________________|
| /\ / \\ \ /\ |
| / V )) V \ |
|/ ` // ' \|
` V '
1. Web Scraping
- Uses Selenium to log into DrexelOne and scrape available co-op postings.
- Saves all the static HTML pages in a directory.
- Uses BeautifulSoup to parse the saved pages into a single JSON file.
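The parsing step above can be sketched roughly as follows. The project itself uses BeautifulSoup; to stay dependency-free this sketch swaps in the standard library's html.parser instead. The assumption that a posting's title sits in an h1 element, and the names TitleParser, parse_page and parse_directory, are hypothetical and not taken from parsing_htmls.py.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collects the text of the first <h1> element (assumed markup)."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self._done:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self._done = True  # ignore any later <h1> elements

    def handle_data(self, data):
        text = data.strip()
        if self._in_h1 and text:
            self._parts.append(text)

    @property
    def title(self):
        return " ".join(self._parts)


def parse_page(html):
    """Turn one saved HTML page into a dictionary record."""
    parser = TitleParser()
    parser.feed(html)
    return {"title": parser.title}


def parse_directory(html_dir, out_path):
    """Parse every .html file in html_dir and write one JSON list to out_path."""
    postings = []
    for page in sorted(Path(html_dir).glob("*.html")):
        record = parse_page(page.read_text(encoding="utf-8"))
        record["source_file"] = page.name
        postings.append(record)
    Path(out_path).write_text(json.dumps(postings, indent=2), encoding="utf-8")
    return postings
```

A real parser would pull more fields (description, qualifications) with selectors that match the actual DrexelOne markup.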
2. LLM pipeline
- Data Preparation:
  - Reads the scraped co-op postings from the JSON file and extracts key information (title, description, qualifications).
  - Reads a user's resume from a PDF file, extracting all text content.
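As a minimal sketch of the posting side of data preparation, the snippet below assumes the JSON file holds a list of objects keyed by title, description and qualifications (the fields named above). The function names are hypothetical, and PDF text extraction is left out because it needs a third-party library.

```python
import json

# Fields named in the data preparation step; the JSON layout is an assumption.
FIELDS = ("title", "description", "qualifications")


def posting_to_text(posting):
    """Join the non-empty key fields of one posting into a single text block."""
    parts = []
    for field in FIELDS:
        value = posting.get(field, "")
        if value:
            parts.append(f"{field.capitalize()}: {value}")
    return "\n".join(parts)


def load_postings(json_path):
    """Read the scraped postings and return one text block per posting."""
    with open(json_path, encoding="utf-8") as fh:
        return [posting_to_text(p) for p in json.load(fh)]
```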
- Embedding & Indexing:
  - Uses the Google Gemini embedding model (models/embedding-001) to generate vector embeddings for both the resume and co-op postings.
  - Chunks text before embedding to create more meaningful representations.
  - Calculates average embeddings for all the co-op posting data.
  - Creates a FAISS index from the embeddings of all the scraped co-op postings.
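The chunking and averaging steps can be illustrated without calling the embedding model. This is a sketch under assumptions: chunks are measured in characters, the default sizes are placeholders, and the vectors passed to average_embedding stand in for whatever models/embedding-001 returns per chunk.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def average_embedding(vectors):
    """Element-wise mean of equal-length vectors, e.g. one vector per chunk."""
    if not vectors:
        raise ValueError("no vectors to average")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

Averaging collapses a variable number of chunk vectors into one fixed-size vector per posting, which is the shape a vector index expects.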
- Similarity Search:
  - Uses the FAISS index to perform a similarity search, finding the top k co-op postings that are most similar to the resume embedding.
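To show what the top-k lookup computes, here is a brute-force stand-in written in plain Python rather than FAISS. It assumes squared Euclidean distance as the similarity measure (the README does not say which FAISS index type or metric is used), and the names squared_l2 and top_k are hypothetical.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, vectors, k=5):
    """Return (index, distance) pairs for the k vectors closest to query."""
    scored = ((squared_l2(query, v), i) for i, v in enumerate(vectors))
    # nsmallest sorts by distance first, then by index when distances tie
    return [(i, d) for d, i in heapq.nsmallest(k, scored)]
```

For example, with the resume vector as query and one vector per posting:

```python
postings = [[0, 0], [1, 1], [5, 5]]
print(top_k([1, 1], postings, k=2))  # [(1, 0), (0, 2)]
```

An index library does the same job far faster on large collections; the brute-force version is only meant to make the result format concrete.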
- LLM Ranking:
  - Constructs a prompt for a Google Gemini Pro model, including:
    - The full text content of the user's resume.
    - The top k co-op postings that are most similar to the user's resume based on the FAISS search.
  - Outputs the top positions the user should apply for.
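A prompt of the kind described above might be assembled like this. The wording of the instructions, the top_n parameter and the function name are illustrative assumptions, not the prompt used in the notebook; the call to the Gemini model itself is omitted.

```python
def build_ranking_prompt(resume_text, postings, top_n=3):
    """Assemble one prompt string from the resume and the candidate postings."""
    lines = [
        "You are helping a student choose co-op positions.",
        "Resume:",
        resume_text.strip(),
        "",
        "Candidate postings:",
    ]
    # Number the postings so the model can refer to them unambiguously.
    for number, posting in enumerate(postings, start=1):
        lines.append(f"[{number}] {posting.strip()}")
    lines.append("")
    lines.append(
        f"List the top {top_n} postings this student should apply for, "
        "with a one-line reason each."
    )
    return "\n".join(lines)
```

Passing only the top k postings from the similarity search keeps the prompt short while still letting the model make the final ranking.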
- Clone this repository:
    git clone https://github.com/key-r-code/drexel-co-op-matcher.git
    cd drexel-co-op-matcher
- Set up a virtual environment:
    python3 -m venv venv
    source venv/bin/activate # On macOS/Linux
    venv\Scripts\activate # On Windows
- Install dependencies:
    pip install -r requirements.txt
- Create a .env file and add your Gemini API key. See .env.example
    touch .env
- Add your resume PDF in the same directory.
- Add your Drexel credentials in main.py.
- Create a list of the majors you are interested in in main.py. See majors.json for all major abbreviations used by the portal.
- dragonScraper.py currently uses the Safari WebDriver. Uncomment lines 17-23 to use Chrome or Firefox.
- Run main.py:
    python3 main.py
  This will create a subdirectory and save all the static HTML files.
  Change the name of the HTML directory in dragonScraper.py if you have used the scraper before.
- Run parsing_htmls.py:
    python3 parsing_htmls.py
  This will parse all the HTML files and create a single JSON file.
- Run gemini-analysis-starter-nb.ipynb
dragonScraper.py
- Add pagination handling
- Add upcoming co-op postings and previously applied co-ops
- Replace all time.sleep() calls with self.wait.until
- Add Chrome and Firefox support (currently only supports the Safari WebDriver)
LLM-pipeline
- Add a geminiPipeline class
- Find optimal chunk_size and chunk_overlap
CLI App
- Add API key handling
- Add A/B/C round navigation
- Add major navigation
Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch (git checkout -b feature/my-new-feature).
- Make your changes.
- Commit your changes (git commit -am 'Add new feature').
- Push to the branch (git push origin feature/my-new-feature).
- Create a pull request.