Releases: Riminder/jobcurator
vjobcurator-v0.1.12
vjobcurator-v0.1.11
vjobcurator-v0.1.10
vjobcurator-v0.1.1
joburator-v1
Full Changelog: vjobcurator-v0.1.7...joburator-v1
jobcurator
Full Changelog: https://github.com/Riminder/jobcurator/commits/jobcurator
# jobcurator v0.1.0
Initial public release of **jobcurator** – a hash-based job deduplication and compression library with quality and diversity preservation.
---
## ✨ Highlights
- 📦 **`JobCurator` class**
  - Single entrypoint: `JobCurator.dedupe_and_compress(jobs, ratio=...)`
  - Compression ratio: keep `ratio` × N jobs (e.g. `0.4` keeps ~40% of jobs)
- 🧱 **Typed job schema**
  - `Job` – canonical job object
  - `Category` – hierarchical taxonomy (multi-level, multi-dimension)
  - `SalaryField` – structured salary (min/max, currency, period)
  - `Location3DField` – lat/lon/alt + computed 3D `x,y,z` coordinates
- 🧠 **Quality scoring (no embeddings)**
  - Length / job description richness
  - Completion (presence of title, text, location, salary, categories, etc.)
  - Optional freshness & source quality
- 🧬 **Deduplication & diversity**
  - Exact hash based on title + categories + location bucket + salary + text
  - SimHash on text + feature-hash on metadata → 128-bit signature
  - LSH clustering + union–find to group near-duplicates
  - Greedy selection balancing:
    - **Quality** (score)
    - **Diversity** (Hamming distance between signatures)
- 🌍 **Geo-aware clustering**
  - Converts GeoPoints to 3D coordinates
  - Optional max distance between cities within the same cluster
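
To illustrate the geo-aware clustering highlight, the sketch below converts latitude/longitude/altitude to Cartesian coordinates and checks a distance cap between two cities. It assumes a spherical Earth with a mean radius of 6,371 km. The names `to_xyz`, `distance_km`, and `max_cluster_distance_km`, along with the 50 km threshold and the Lyon coordinates, are hypothetical and chosen for illustration; they are not taken from the library, whose actual conversion may differ.

```python
import math

# Assumption: spherical Earth with the mean radius, not an ellipsoid model.
EARTH_RADIUS_M = 6_371_000.0


def to_xyz(lat_deg, lon_deg, alt_m=0.0):
    """Convert lat/lon (degrees) and altitude (metres) to x, y, z in metres."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    r = EARTH_RADIUS_M + alt_m
    return (
        r * math.cos(lat) * math.cos(lon),
        r * math.cos(lat) * math.sin(lon),
        r * math.sin(lat),
    )


def distance_km(a, b):
    """Straight-line (chord) distance between two 3D points, in kilometres."""
    return math.dist(a, b) / 1000.0


paris = to_xyz(48.8566, 2.3522, 35)
lyon = to_xyz(45.7640, 4.8357, 173)

# Hypothetical cap: two cities may share a cluster only if they are this close.
max_cluster_distance_km = 50
same_cluster_ok = distance_km(paris, lyon) <= max_cluster_distance_km
print(round(distance_km(paris, lyon)), "km apart; same cluster allowed:", same_cluster_ok)
```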
---
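The deduplication and diversity highlights above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: it builds a 64-bit SimHash from text only (the release describes a 128-bit signature that also hashes metadata), skips the LSH and union–find grouping, and uses a hypothetical `alpha` weight to trade quality against diversity. The function names `simhash`, `hamming`, and `greedy_select` are assumptions made for this example.

```python
import hashlib
import re

BITS = 64  # assumption: text-only 64-bit signature for brevity


def _hash64(token):
    """Stable 64-bit hash of a token (Python's built-in hash() is salted per run)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text):
    """SimHash: each token votes +1/-1 on every bit; keep the bits with a positive tally."""
    counts = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        h = _hash64(token)
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def greedy_select(items, k, alpha=0.6):
    """Pick k ids from (id, quality, signature) tuples.

    Starts from the highest-quality item, then repeatedly adds the item that
    maximises alpha * quality + (1 - alpha) * normalised distance to the
    nearest already-selected signature.
    """
    remaining = sorted(items, key=lambda it: -it[1])
    selected = [remaining.pop(0)] if remaining and k > 0 else []
    while remaining and len(selected) < k:
        def gain(it):
            diversity = min(hamming(it[2], s[2]) for s in selected) / BITS
            return alpha * it[1] + (1 - alpha) * diversity
        best = max(remaining, key=gain)
        remaining.remove(best)
        selected.append(best)
    return [it[0] for it in selected]


text = "Senior Backend Engineer in Paris, Python and PostgreSQL"
print(hamming(simhash(text), simhash(text)))  # identical text gives distance 0

# Hand-made signatures: "b" duplicates "a", "c" is maximally different.
items = [("a", 0.9, 0), ("b", 0.85, 0), ("c", 0.6, (1 << BITS) - 1)]
print(greedy_select(items, k=2))
```

With these hand-made signatures the selection keeps `a` and `c`: the exact duplicate `b` adds no diversity, so the lower-quality but distinct `c` scores higher.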
## 🚀 Installation
```bash
pip install jobcurator
```

For local dev (from repo root):

```bash
pip install -e .
```

## 🧪 Quick Usage

```python
from datetime import datetime

from jobcurator import JobCurator, Job, Category, SalaryField, Location3DField

# Each Job carries free text plus structured metadata used for hashing and scoring.
jobs = [
    Job(
        id="job-1",
        title="Senior Backend Engineer",
        text="Full description...",
        categories={
            "job_function": [
                Category(
                    id="backend",
                    label="Backend",
                    level=1,
                    parent_id="eng",
                    level_path=["Engineering", "Software", "Backend"],
                )
            ]
        },
        location=Location3DField(
            lat=48.8566,
            lon=2.3522,
            alt_m=35,
            city="Paris",
            country_code="FR",
        ),
        salary=SalaryField(
            min_value=60000,
            max_value=80000,
            currency="EUR",
            period="year",
        ),
        company="HrFlow.ai",
        contract_type="Full-time",
        source="direct",
        created_at=datetime.utcnow(),
    ),
    # ... more jobs
]

# ratio=0.4 keeps roughly 40% of the input jobs.
curator = JobCurator(ratio=0.4)
compressed_jobs = curator.dedupe_and_compress(jobs)
print(len(jobs), "→", len(compressed_jobs))
```

## ⚠️ Breaking Changes
- None – this is the initial release.
## ✅ TODO (next versions)
- Add proper test suite and CI
- Expose more tuning hooks (weights, distance thresholds) via config
- Add serialization / JSON helpers for `Job` and related classes
- Provide benchmarks and example notebooks
## 🙌 Contributors
- @your-github-handle
- Contact: [mouhidine.seiv@hrflow.ai](mailto:mouhidine.seiv@hrflow.ai)