Releases: Riminder/jobcurator

vjobcurator-v0.1.12

11 Nov 05:26

Release vjobcurator-v0.1.12

vjobcurator-v0.1.11

11 Nov 05:20

Release vjobcurator-v0.1.11

vjobcurator-v0.1.10

11 Nov 05:11

Release vjobcurator-v0.1.10

vjobcurator-v0.1.1

11 Nov 04:01

Release vjobcurator-v0.1.1

joburator-v1

11 Nov 04:52

jobcurator

08 Nov 20:23
4e24e5e

Full Changelog: https://github.com/Riminder/jobcurator/commits/jobcurator

# jobcurator v0.1.0

Initial public release of **jobcurator** – a hash-based job deduplication and compression library with quality and diversity preservation.

---

## ✨ Highlights

- 📦 **`JobCurator` class**
  - Single entrypoint: `JobCurator.dedupe_and_compress(jobs, ratio=...)`
  - Compression ratio: keep `ratio` × N jobs (e.g. `0.4` keeps ~40% of jobs)
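
To illustrate how a ratio could translate into a kept set, here is a minimal sketch of a greedy selection that trades quality against diversity. It is not the library's implementation: the function `greedy_select`, the `alpha` weight, the tuple layout, and the use of rounding to turn `ratio` × N into a target count are all assumptions made for illustration.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer signatures."""
    return bin(a ^ b).count("1")


def greedy_select(items, ratio, alpha=0.6, bits=64):
    """Keep about ratio * N items (hypothetical helper, not the jobcurator API).

    items: list of (job_id, quality in [0, 1], signature as int).
    alpha: assumed weight of quality versus diversity.
    """
    if not items:
        return []
    target = max(1, round(ratio * len(items)))
    remaining = sorted(items, key=lambda it: it[1], reverse=True)
    selected = [remaining.pop(0)]  # seed with the highest-quality item

    def gain(it):
        # Diversity: normalized distance to the closest already-selected item.
        diversity = min(hamming(it[2], s[2]) for s in selected) / bits
        return alpha * it[1] + (1 - alpha) * diversity

    while remaining and len(selected) < target:
        best = max(remaining, key=gain)
        remaining.remove(best)
        selected.append(best)
    return selected


items = [
    ("a", 0.90, 0b0000),
    ("b", 0.85, 0b0001),  # near-duplicate of "a"
    ("c", 0.50, 0b1111),
    ("d", 0.40, 0b1110),
    ("e", 0.30, 0b0011),
]
kept = greedy_select(items, ratio=0.4, bits=4)
print([job_id for job_id, _, _ in kept])
```

In this toy input, `"b"` scores higher than `"c"` on quality alone, but it sits one bit away from `"a"`, so the diversity term favors `"c"`.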

- 🧱 **Typed job schema**
  - `Job` – canonical job object
  - `Category` – hierarchical taxonomy (multi-level, multi-dimension)
  - `SalaryField` – structured salary (min/max, currency, period)
  - `Location3DField` – lat/lon/alt + computed 3D `x,y,z` coordinates
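
As a rough illustration of what "computed 3D `x,y,z` coordinates" can mean, the sketch below converts latitude, longitude, and altitude to Cartesian coordinates on a spherical Earth. The function name `to_xyz`, the spherical approximation, and the radius constant are assumptions; the library may use a different model (for example an ellipsoid).

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation (assumption)


def to_xyz(lat_deg: float, lon_deg: float, alt_m: float = 0.0):
    """Convert lat/lon in degrees and altitude in meters to x, y, z in meters."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    r = EARTH_RADIUS_M + alt_m
    x = r * math.cos(lat) * math.cos(lon)
    y = r * math.cos(lat) * math.sin(lon)
    z = r * math.sin(lat)
    return x, y, z


# Paris, using the same values as the Quick Usage example.
x, y, z = to_xyz(48.8566, 2.3522, 35)
print(round(math.sqrt(x * x + y * y + z * z)))  # distance from Earth's center, in meters
```

Working in Cartesian space lets straight-line (chord) distances between points be computed with plain Euclidean arithmetic.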

- 🧠 **Quality scoring (no embeddings)**
  - Length / job description richness
  - Completion (presence of title, text, location, salary, categories, etc.)
  - Optional freshness & source quality
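
A minimal sketch of an embedding-free quality score along these lines is shown below. The `JobLike` stand-in class, the weights, and the 2000-character saturation point are hypothetical choices, not values taken from the library; freshness and source quality are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class JobLike:
    """Hypothetical stand-in for a job record (not the library's Job class)."""
    title: str = ""
    text: str = ""
    city: Optional[str] = None
    salary_min: Optional[float] = None
    categories: Tuple[str, ...] = ()


def length_score(text: str, saturation_chars: int = 2000) -> float:
    """Longer descriptions score higher, capped at 1.0."""
    return min(len(text) / saturation_chars, 1.0)


def completion_score(job: JobLike) -> float:
    """Fraction of the checked fields that are present."""
    checks = [
        bool(job.title),
        bool(job.text),
        job.city is not None,
        job.salary_min is not None,
        bool(job.categories),
    ]
    return sum(checks) / len(checks)


def quality(job: JobLike, w_length: float = 0.4, w_completion: float = 0.6) -> float:
    """Weighted sum of the two signals; the weights are illustrative assumptions."""
    return w_length * length_score(job.text) + w_completion * completion_score(job)


sparse = JobLike(title="Backend Engineer", text="a" * 1000)
print(quality(sparse))
```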

- 🧬 **Deduplication & diversity**
  - Exact hash based on title + categories + location bucket + salary + text
  - SimHash on text + feature-hash on metadata → 128-bit signature
  - LSH clustering + union–find to group near-duplicates
  - Greedy selection balancing:
    - **Quality** (score)
    - **Diversity** (Hamming distance between signatures)
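
The near-duplicate grouping described above can be sketched as follows. This is a simplified illustration, not the library's code: it builds a 64-bit SimHash from text only (the release notes describe a 128-bit signature that also includes metadata), uses bit-band bucketing as the LSH step, and the names `simhash`, `cluster`, `bands`, and `max_distance` are assumptions.

```python
import hashlib
from collections import defaultdict

BITS = 64


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text: str) -> int:
    """Each token votes +1/-1 per bit; the sign of each tally gives the signature bit."""
    counts = [0] * BITS
    for token in text.lower().split():
        h = token_hash(token)
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def cluster(signatures, bands=4, max_distance=3):
    """Return one cluster label per signature, using banded LSH plus union-find."""
    parent = list(range(len(signatures)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # LSH step: signatures that agree on a whole band land in the same bucket.
    width = BITS // bands
    mask = (1 << width) - 1
    buckets = defaultdict(list)
    for idx, sig in enumerate(signatures):
        for band in range(bands):
            buckets[(band, (sig >> (band * width)) & mask)].append(idx)

    # Verify candidate pairs, then merge confirmed near-duplicates.
    for members in buckets.values():
        for pos, a in enumerate(members):
            for b in members[pos + 1:]:
                if hamming(signatures[a], signatures[b]) <= max_distance:
                    parent[find(a)] = find(b)

    return [find(i) for i in range(len(signatures))]


texts = ["Senior Backend Engineer in Paris", "senior backend engineer in paris", "Pastry chef"]
print(cluster([simhash(t) for t in texts]))
```

With 4 bands, any two signatures that differ in at most 3 bits must agree on at least one full band, so the bucketing step cannot miss a pair within the default `max_distance`.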

- 🌍 **Geo-aware clustering**
  - Converts GeoPoints to 3D coordinates
  - Optional max distance between cities within the same cluster
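
One way to picture the optional distance cap is a check that two jobs may only share a cluster when their cities are close enough. The sketch below uses the haversine great-circle distance rather than 3D coordinates; `may_share_cluster` and the 50 km default are hypothetical and only illustrate the idea.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def may_share_cluster(a, b, max_km=50.0):
    """a and b are (lat, lon) tuples; max_km is an assumed threshold."""
    return haversine_km(a[0], a[1], b[0], b[1]) <= max_km


paris = (48.8566, 2.3522)
versailles = (48.8049, 2.1204)
lyon = (45.7640, 4.8357)
print(may_share_cluster(paris, versailles), may_share_cluster(paris, lyon))
```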

---

## 🚀 Installation

```bash
pip install jobcurator
```

For local dev (from repo root):

```bash
pip install -e .
```

## 🧪 Quick Usage

```python
from datetime import datetime
from jobcurator import JobCurator, Job, Category, SalaryField, Location3DField

jobs = [
    Job(
        id="job-1",
        title="Senior Backend Engineer",
        text="Full description...",
        categories={
            "job_function": [
                Category(
                    id="backend",
                    label="Backend",
                    level=1,
                    parent_id="eng",
                    level_path=["Engineering", "Software", "Backend"],
                )
            ]
        },
        location=Location3DField(
            lat=48.8566,
            lon=2.3522,
            alt_m=35,
            city="Paris",
            country_code="FR",
        ),
        salary=SalaryField(
            min_value=60000,
            max_value=80000,
            currency="EUR",
            period="year",
        ),
        company="HrFlow.ai",
        contract_type="Full-time",
        source="direct",
        created_at=datetime.utcnow(),
    ),
    # ... more jobs
]

curator = JobCurator(ratio=0.4)
compressed_jobs = curator.dedupe_and_compress(jobs)
print(len(jobs), "→", len(compressed_jobs))
```

## ⚠️ Breaking Changes

- None – this is the initial release.

## ✅ TODO (next versions)

- Add proper test suite and CI
- Expose more tuning hooks (weights, distance thresholds) via config
- Add serialization / JSON helpers for `Job` and related classes
- Provide benchmarks and example notebooks
