qing10101/JNER

JNER — Named Entity Recognition Training Pipeline

NER pipeline for detecting minor children and author gender indications in product reviews, with an optional medical-entity extension. Multiple model backends are provided so accuracy, speed, and zero-shot generalization trade-offs can be compared directly.


Labels

minor_gender_only/ (primary)

| Label | Meaning |
| --- | --- |
| NonfictionalChildRelated | Text span indicating a real (non-fictional) minor child, e.g. "my 8-year-old son", "toddler", "kindergarten" |
| AuthorGenderIndication | Text span indicating the reviewer's gender, e.g. "wife", "husband", "mom" |

with_medical/ (extended)

Adds three biomedical labels from clinical case reports:

| Label | Source |
| --- | --- |
| MedicalCondition | MACCROBAT, Corona2 |
| ClinicalProcedure | MACCROBAT |
| ClinicalEvent | MACCROBAT |

Data Files

| File | Description |
| --- | --- |
| combined_upload.csv | Primary training data (~14k rows). Columns: ori_review, minor_col, gender_col, medical_col, note_col |
| mydata.csv | Earlier, smaller annotation export; the default for some scripts |
| mydata_nomy.csv | Variant of mydata.csv with reviewer pronouns removed from gender labels |
| eval_sample.csv | Hand-curated evaluation sample |
| eval_results_spacy.csv | Per-row spaCy evaluation output |
| eval_results_keyword_baseline.csv | Per-row keyword-baseline evaluation output |
| synthetic_200_edge_cases.txt | 200 synthetic edge cases (Python dict format) |
| synthetic_500_edge_cases.txt | 500 synthetic edge cases |
| synthetic_700_edge_cases.txt | Combined 700-entry edge-case set (200 + 500, renumbered) |
| Corona2.json | Medical NER dataset (JLiNER format); required for with_medical/ only |
| 9764942/ | MACCROBAT clinical corpus; required for with_medical/ only |

CSV column format (minor_col, gender_col): semicolon-separated span phrases, each of which appears verbatim in ori_review. An empty cell means the row has no entity of that type.
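
The column format above can be turned into character offsets with a few lines of Python. This is a minimal sketch, not the repository's own loader: the helper names parse_spans and locate_spans are hypothetical, and it assumes that only the first verbatim occurrence of each phrase is wanted.

```python
def parse_spans(cell):
    """Split a semicolon-separated cell into stripped, non-empty phrases."""
    if not cell:
        return []
    return [part.strip() for part in cell.split(";") if part.strip()]


def locate_spans(review, cell, label):
    """Return (start, end, label) for the first occurrence of each phrase.

    Assumption: phrases appear verbatim in the review, as the column
    format requires; a missing phrase is reported as an error.
    """
    spans = []
    for phrase in parse_spans(cell):
        start = review.find(phrase)
        if start == -1:
            raise ValueError(f"phrase not found verbatim: {phrase!r}")
        spans.append((start, start + len(phrase), label))
    return spans


review = "My 8-year-old son loved it, and my wife agrees."
minor = locate_spans(review, "8-year-old son", "NonfictionalChildRelated")
gender = locate_spans(review, "wife", "AuthorGenderIndication")
empty = locate_spans(review, "", "NonfictionalChildRelated")  # empty cell
```

Raising on a missing phrase, rather than skipping it silently, makes annotation drift visible early; a real loader might prefer to log and continue.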


minor_gender_only/ — Training Scripts

Eight backends are implemented. train_deberta_csv.py defaults to combined_upload.csv; every other script defaults to mydata.csv. Pass --csv to override.

Installation

Install only the backend you intend to run:

pip install -r requirements-spacy.txt      # spaCy
pip install -r requirements-spanmarker.txt # SpanMarker
pip install -r requirements-gliner.txt     # GLiNER / GLiNER2
pip install -r requirements-flant5.txt     # FLAN-T5
pip install -r requirements-deberta.txt    # DeBERTa
pip install -r requirements-llm.txt        # Qwen LLM (LoRA)
pip install -r requirements-setfit.txt     # SetFit

train_spacy_csv.py — spaCy

Token-classification NER using a spaCy pipeline (BIO tagging). Inference is fast, but the base model's general-purpose NER labels are not retained after fine-tuning.

python minor_gender_only/train_spacy_csv.py [options]

| Argument | Default | Description |
| --- | --- | --- |
| --model | en_core_web_trf | spaCy base model |
| --epochs | 15 | Training epochs |
| --batch-size | 8 | Batch size |
| --dropout | 0.2 | Dropout |
| --output-dir | spacy_model_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | mydata.csv | Training data path |
| --minor-oversample | 0 | Extra copies of minor-label examples |
| --use-gpu | off | Enable GPU |

Outputs: best/, final/, checkpoint-epoch-N/
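
To illustrate the BIO supervision mentioned above, the sketch below converts character-offset spans into per-token BIO tags. It assumes plain whitespace tokenization and a hypothetical helper named bio_tags; the training script itself relies on spaCy's tokenizer and data structures, which may split tokens differently.

```python
import re


def bio_tags(text, spans):
    """Tag whitespace tokens with B-/I-/O given (start, end, label) spans.

    A token is assigned to a span when their character ranges overlap;
    the first token of a span gets B-, later tokens get I-.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if match.start() < end and match.end() > start:
                prefix = "B-" if match.start() <= start else "I-"
                tag = prefix + label
                break
        tagged.append((match.group(), tag))
    return tagged


text = "my 8-year-old son loves it"
tags = bio_tags(text, [(3, 17, "NonfictionalChildRelated")])
```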


train_spanmarker_csv.py — SpanMarker

Span-level BERT classifier using BIO supervision. Best accuracy on fixed label sets; saves best/ by eval_overall_f1 (SpanMarker's built-in seqeval metric).

python minor_gender_only/train_spanmarker_csv.py [options]

| Argument | Default | Description |
| --- | --- | --- |
| --model | microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext | Encoder model ID |
| --epochs | 5 | Training epochs |
| --batch-size | 8 | Batch size |
| --lr | 5e-5 | Learning rate |
| --output-dir | spanmarker_finetuned_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | mydata.csv | Training data path |
| --minor-oversample | 0 | Extra copies of minor-label examples |
| --entity-max-length | 8 | Max words per entity span |
| --model-max-length | 256 | Max input sequence length |

Outputs: best/, final/


train_gliner_csv.py — GLiNER

Fine-tunes a GLiNER span model. Retains zero-shot generalization for unseen entity types after fine-tuning.

python minor_gender_only/train_gliner_csv.py [options]

| Argument | Default | Description |
| --- | --- | --- |
| --model | EmergentMethods/gliner_medium_news-v2.1 | GLiNER model ID |
| --epochs | 10 | Training epochs |
| --batch-size | 8 | Batch size |
| --lr | 3e-5 | Learning rate |
| --output-dir | gliner_finetuned_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | mydata.csv | Training data path |
| --minor-oversample | 0 | Extra copies of minor-label examples |
| --gliner-threshold | 0.5 | Prediction confidence threshold |

Outputs: best/, final/


train_gliner2_csv.py — GLiNER2

Fine-tunes GLiNER2 with LoRA adapters. Uses natural-language label descriptions instead of bare label names, which improves implicit entity detection.

python minor_gender_only/train_gliner2_csv.py [options]

| Argument | Default | Description |
| --- | --- | --- |
| --model | fastino/gliner2-large-v1 | GLiNER2 model ID |
| --epochs | 10 | Training epochs |
| --batch-size | 4 | Batch size |
| --grad-accum | 2 | Gradient accumulation steps |
| --encoder-lr | 1e-5 | Encoder learning rate |
| --task-lr | 5e-4 | Task head learning rate |
| --lora-r | 8 | LoRA rank |
| --output-dir | gliner2_finetuned_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | mydata.csv | Training data path |
| --minor-oversample | 0 | Extra copies of minor-label examples |
| --gender-oversample | 0 | Extra copies of gender-label examples |

Outputs: best/, final/
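
The --val-split and --minor-oversample / --gender-oversample options can be pictured with the sketch below. It is an illustration under stated assumptions, not the scripts' actual code: the function names, the fixed seed, and the choice to split before oversampling (so duplicated rows never reach the validation set) are all assumptions.

```python
import random


def split_rows(rows, val_split=0.1, seed=42):
    """Shuffle a copy of the rows and hold out a validation fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_split)
    return shuffled[n_val:], shuffled[:n_val]


def oversample(rows, column, extra_copies):
    """Append extra_copies duplicates of every row whose column is non-empty."""
    labeled = [row for row in rows if (row.get(column) or "").strip()]
    return list(rows) + labeled * extra_copies


rows = [{"ori_review": f"review {i}", "minor_col": "son" if i < 2 else ""}
        for i in range(10)]
train, val = split_rows(rows, val_split=0.1)
boosted = oversample(train, "minor_col", extra_copies=2)
```

With extra_copies=0 the training set is returned unchanged, which matches the documented default of 0.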


train_flant5_csv.py — FLAN-T5

Seq2seq instruction-tuned model that generates a JSON array of entities. The descriptive label names in the prompt give the model a semantic head-start.

python minor_gender_only/train_flant5_csv.py [options]

| Argument | Default | Description |
| --- | --- | --- |
| --model | google/flan-t5-base | HuggingFace model ID |
| --epochs | 10 | Training epochs |
| --batch-size | 8 | Batch size |
| --lr | 5e-5 | Learning rate |
| --max-input-length | 512 | Max input tokens |
| --max-target-length | 256 | Max output tokens |
| --output-dir | flant5_finetuned_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | mydata.csv | Training data path |
| --minor-oversample | 0 | Extra copies of minor-label examples |

Outputs: best/, final/


train_deberta_csv.py — DeBERTa-v3

Token-classification NER using DeBERTa-v3-base. Uses character-offset-based label alignment with SentencePiece tokenization. Defaults to combined_upload.csv.

python minor_gender_only/train_deberta_csv.py [options]

| Argument | Default | Description |
| --- | --- | --- |
| --model | microsoft/deberta-v3-base | HuggingFace model ID |
| --epochs | 10 | Training epochs |
| --batch-size | 8 | Batch size |
| --lr | 2e-5 | Learning rate |
| --max-length | 512 | Max input sequence length |
| --output-dir | deberta_finetuned_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | combined_upload.csv | Training data path |
| --minor-oversample | 0 | Extra copies of minor-label examples |

Outputs: best/, final/

Note: DeBERTa-v3's SentencePiece tokenizer includes the preceding space in subword offsets (e.g. ▁baby → offset (2, 7) rather than (3, 7)). The eval callback accounts for this by comparing stripped span text rather than raw character indices.
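
The stripped-text comparison described in the note can be sketched as follows. The helper names are hypothetical and the offsets are written by hand to mirror the example above; in the real callback they would come from the tokenizer's offset mapping.

```python
def span_text(text, start, end):
    """Return the text covered by a character span, ignoring edge whitespace."""
    return text[start:end].strip()


def spans_match(text, predicted, gold):
    """Compare two (start, end) spans by stripped text, not raw indices."""
    return span_text(text, *predicted) == span_text(text, *gold)


text = "my baby slept"
predicted = (2, 7)  # tokenizer offset that includes the leading space
gold = (3, 7)       # annotation offset for "baby"
same = spans_match(text, predicted, gold)
```

Comparing raw indices would count (2, 7) against (3, 7) as a miss even though both spans cover the same word.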


train_llm_csv.py — Qwen LLM (LoRA)

Fine-tunes Qwen3.5-9B with 4-bit quantization and LoRA. The model generates a JSON array of {"text": ..., "label": ...} objects. Slowest to train; best at implicit/contextual entity detection.

python minor_gender_only/train_llm_csv.py [options]

| Argument | Default | Description |
| --- | --- | --- |
| --model | Qwen/Qwen3.5-9B | Base model ID |
| --epochs | 3 | Training epochs |
| --batch-size | 2 | Per-device batch size |
| --grad-accum | 8 | Gradient accumulation steps |
| --lr | 2e-4 | Learning rate |
| --lora-r | 16 | LoRA rank |
| --lora-alpha | 32 | LoRA alpha |
| --max-seq-length | 768 | Max sequence length |
| --output-dir | llm_finetuned_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | mydata.csv | Training data path |
| --eval-samples | 100 | Samples for per-epoch eval (LLM inference is slow) |
| --minor-oversample | 0 | Extra copies of minor-label examples |
| --no-4bit | off | Disable 4-bit quantization |

Outputs: best/, final/, checkpoint-epoch-N/
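
Generated JSON is not guaranteed to be well-formed, so a consumer of the {"text": ..., "label": ...} output will usually want defensive parsing. The sketch below is one possible approach, not the script's own post-processing: parse_entities is a hypothetical helper, and dropping entities whose text does not occur in the review is an assumption.

```python
import json

ALLOWED_LABELS = {"NonfictionalChildRelated", "AuthorGenderIndication"}


def parse_entities(generated, review):
    """Extract valid {"text", "label"} entities from raw model output."""
    start, end = generated.find("["), generated.rfind("]")
    if start == -1 or end <= start:
        return []  # no JSON array in the output
    try:
        items = json.loads(generated[start:end + 1])
    except json.JSONDecodeError:
        return []  # malformed JSON is treated as "no entities"
    entities = []
    for item in items:
        if not isinstance(item, dict):
            continue
        text, label = item.get("text"), item.get("label")
        if (isinstance(text, str) and isinstance(label, str)
                and label in ALLOWED_LABELS and text in review):
            entities.append({"text": text, "label": label})
    return entities


review = "My wife bought this for our toddler."
raw = ('Entities: [{"text": "toddler", "label": "NonfictionalChildRelated"}, '
       '{"text": "grandma", "label": "AuthorGenderIndication"}]')
entities = parse_entities(raw, review)
```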


train_setfit_csv.py — SetFit

Contrastive sentence-embedding classifier over candidate n-gram spans. Trains a span-level classifier: each candidate span is embedded and classified as NonfictionalChildRelated, AuthorGenderIndication, or O. No generative decoding; very fast inference.

python minor_gender_only/train_setfit_csv.py [options]

| Argument | Default | Description |
| --- | --- | --- |
| --model | sentence-transformers/paraphrase-mpnet-base-v2 | Sentence encoder |
| --epochs | 5 | Contrastive training epochs |
| --batch-size | 16 | Batch size |
| --max-span-words | 6 | Max n-gram length for candidate spans |
| --neg-ratio | 3 | Negative spans per positive |
| --output-dir | setfit_finetuned_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | mydata.csv | Training data path |
| --minor-oversample | 0 | Extra copies of minor-label examples |

Outputs: best/, final/
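
The candidate-span idea behind --max-span-words and --neg-ratio can be sketched as below. This is an illustration only: whitespace tokenization, the helper names, and random sampling of negatives are assumptions, and the actual script may choose negatives differently.

```python
import random


def candidate_spans(text, max_span_words=6):
    """Enumerate every word n-gram of up to max_span_words words."""
    words = text.split()
    return [" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, min(i + max_span_words, len(words)) + 1)]


def training_spans(text, gold, max_span_words=6, neg_ratio=3, seed=0):
    """Pair gold spans with up to neg_ratio sampled "O" spans per positive.

    gold maps a span phrase to its label.
    """
    candidates = candidate_spans(text, max_span_words)
    positives = [(span, gold[span]) for span in candidates if span in gold]
    negatives = [span for span in candidates if span not in gold]
    k = min(len(negatives), neg_ratio * len(positives))
    sampled = random.Random(seed).sample(negatives, k)
    return positives + [(span, "O") for span in sampled]


text = "my toddler loves this toy"
examples = training_spans(text, {"toddler": "NonfictionalChildRelated"},
                          max_span_words=2, neg_ratio=3)
```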


Evaluation

minor_gender_only/evaluate_csv.py — Multi-model evaluation on a CSV

Evaluates any combination of trained models on a CSV file. Uses corpus-level and row-average soft F1 (exact match + token-Jaccard partial credit).

python minor_gender_only/evaluate_csv.py \
  --csv combined_upload.csv \
  --spacy minor_gender_only/spacy_model_combined_v2/best \
  --gliner gliner_finetuned_csv/best \
  --spanmarker spanmarker_finetuned_csv/best \
  --keyword-baseline \
  --row-output eval_results.csv

| Argument | Description |
| --- | --- |
| --csv | Evaluation CSV (default: mydata.csv) |
| --spacy PATH | spaCy model directory |
| --gliner PATH | GLiNER model directory |
| --spanmarker PATH | SpanMarker model directory |
| --llm PATH | LLM LoRA adapter directory |
| --llm-base MODEL | Base model ID for LLM (default: Qwen/Qwen3.5-9B) |
| --gliner-threshold | GLiNER confidence threshold (default: 0.5) |
| --spanmarker-threshold | SpanMarker confidence threshold (default: 0.5) |
| --keyword-baseline | Also run keyword-matching baseline |
| --annotated-only | Restrict to rows with at least one gold entity |
| --row-output PATH | Write per-row results to CSV |
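
The exact scoring rule inside evaluate_csv.py is not spelled out here, so the following is only one plausible reading of "exact match + token-Jaccard partial credit": an exact match earns full credit, and any other span earns its best token-level Jaccard overlap. All function names are hypothetical, and scoring a row with no spans on either side as 0 is an arbitrary choice of this sketch.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two span strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def credit(span, others):
    """1.0 for an exact match, otherwise the best partial overlap."""
    return max((1.0 if span == o else jaccard(span, o) for o in others),
               default=0.0)


def soft_counts(pred, gold):
    """Return (precision credit, recall credit, n_pred, n_gold) for one row."""
    return (sum(credit(p, gold) for p in pred),
            sum(credit(g, pred) for g in gold),
            len(pred), len(gold))


def f1(p_credit, r_credit, n_pred, n_gold):
    precision = p_credit / n_pred if n_pred else 0.0
    recall = r_credit / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(rows):
    """rows: list of (predicted spans, gold spans). Returns (corpus, row_avg)."""
    counts = [soft_counts(pred, gold) for pred, gold in rows]
    row_avg = sum(f1(*c) for c in counts) / len(counts)
    corpus = f1(*(sum(column) for column in zip(*counts)))
    return corpus, row_avg


rows = [(["my son"], ["my 8-year-old son"]), (["wife"], ["wife"])]
corpus_f1, row_avg_f1 = evaluate(rows)
```

Corpus-level F1 pools credit across all rows before computing precision and recall, while the row average weights every row equally regardless of how many entities it contains.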

minor_gender_only/eval_edge_cases_spacy.py — Edge-case evaluation

Evaluates the spaCy model against one of the synthetic_*_edge_cases.txt files, which test known hard cases (pet vs. child language, fictional vs. real children, age edge cases, etc.).

python minor_gender_only/eval_edge_cases_spacy.py \
  --model minor_gender_only/spacy_model_combined_v2/best \
  --edge-cases synthetic_700_edge_cases.txt \
  --errors \
  --row-output edge_results.csv

| Argument | Description |
| --- | --- |
| --model | spaCy model directory (default: spacy_model_combined_v2/best) |
| --edge-cases | Path to edge-cases .txt file (default: synthetic_700_edge_cases.txt) |
| --errors | Print false-positive / false-negative error analysis |
| --n-worst | Number of lowest-F1 examples to show with --errors (default: 20) |
| --annotated-only | Evaluate only rows with at least one gold entity |
| --row-output PATH | Write per-row results to CSV |

minor_gender_only/zero_shot_bart_mnli_csv.py — BART MNLI baseline

Zero-shot baseline: uses BART-large-MNLI to detect whether a sentence contains a minor/gender entity, then falls back to keyword extraction for the span. No fine-tuning required.

python minor_gender_only/zero_shot_bart_mnli_csv.py --csv combined_upload.csv

with_medical/ — Full Label Set (Minor + Gender + Medical)

Trains on MACCROBAT clinical case reports + Corona2.json + mydata.csv in a single pass, adding MedicalCondition, ClinicalProcedure, and ClinicalEvent labels.

Required data:

  • 9764942/MACCROBAT2018/: brat standoff .txt + .ann files
  • Corona2.json: place next to the scripts

python with_medical/train_spacy.py      # spaCy
python with_medical/train_gliner.py     # GLiNER
python with_medical/train_spanmarker.py # SpanMarker
python with_medical/train_llm.py        # Qwen LLM

For per-script arguments, run python <script> --help. All four scripts accept --data-dir, --corona, --csv, and --minor-oversample; --csv and --minor-oversample work as in the minor_gender_only/ counterparts.


Data Processing (data_processing/)

Utility scripts for preparing and auditing the annotation data.

| Script | Purpose |
| --- | --- |
| combine_upload.py | Merges annotation batches into combined_upload.csv |
| check_labels.py | Validates label consistency across rows |
| check_overlap.py | Flags overlapping or conflicting span annotations |
| entity_diversity.py | Reports entity surface-form distribution |
| sample_reviews.py | Samples rows for manual QA |

remove_minor_pronouns.py (project root) strips reviewer pronouns from gender/minor columns to reduce noise from ambiguous pronoun labels.


Edge Case Files

The synthetic_*_edge_cases.txt files are Python dicts (loadable with ast.literal_eval) covering known ambiguity categories:

| Category | Examples |
| --- | --- |
| Human child vs. young pet | "my fur-baby" vs. "my baby outgrew her bassinet" |
| Fictional vs. real child | "the protagonist's son" vs. "my son loved this" |
| Age boundary | 17-year-old (minor) vs. 18-year-old (adult) |
| Numeric age in context | "my 5-year-old" vs. "my 25-year-old" |
| Fictional vs. real gender | "the character's wife" vs. "my wife" |
| Institutional role | "elementary school teacher" (not a child) vs. "in elementary school" |
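
Loading such a file with ast.literal_eval might look like the sketch below. The dict layout shown (numeric IDs mapping to "text", "minor", and "gender" keys) is purely hypothetical, since the real schema is not documented here; only the use of ast.literal_eval comes from the description above.

```python
import ast


def load_edge_cases(path):
    """Read a file containing a Python dict literal without executing code."""
    with open(path, encoding="utf-8") as handle:
        return ast.literal_eval(handle.read())


def misaligned(cases):
    """List (case_id, key, phrase) for gold phrases missing from the text."""
    problems = []
    for case_id, case in cases.items():
        for key in ("minor", "gender"):  # hypothetical key names
            for phrase in case.get(key, []):
                if phrase not in case["text"]:
                    problems.append((case_id, key, phrase))
    return problems


# Inline stand-in for file contents, using the assumed layout.
sample = """{
    1: {"text": "My fur-baby loves this bed.", "minor": [], "gender": []},
    2: {"text": "My son loved this.", "minor": ["son"], "gender": []},
}"""
cases = ast.literal_eval(sample)
problems = misaligned(cases)
```

ast.literal_eval only accepts literals (dicts, lists, strings, numbers, and similar), so it is a safer choice than eval for data files of this kind.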

Backend Comparison

| Criterion | spaCy | SpanMarker | GLiNER | GLiNER2 | DeBERTa | FLAN-T5 | SetFit | LLM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Architecture | Token classifier | Span classifier | Span model | Span + LoRA | Token classifier | Seq2seq | Embedding classifier | Decoder + LoRA |
| Retains zero-shot | No | No | Yes | Yes | No | Partial | No | Yes |
| Inference speed | Fast | Medium | Medium | Medium | Medium | Slow | Fast | Very slow |
| Implicit entity detection | Weak | Medium | Good | Good | Medium | Good | Weak | Best |
| GPU required | No | Recommended | Recommended | Required | Recommended | Recommended | No | Required |

About

An in-development NER toolkit for privacy-risk detection
