JNER — Named Entity Recognition Training Pipeline

NER pipeline that detects mentions of minor children and indications of the author's gender in product reviews, with an optional medical-entity extension. Multiple model backends are provided so that accuracy, speed, and zero-shot generalization can be compared directly.

Labels

`minor_gender_only/` (primary)

Label	Meaning
`NonfictionalChildRelated`	Text span indicating a real (non-fictional) minor child — e.g. "my 8-year-old son", "toddler", "kindergarten"
`AuthorGenderIndication`	Text span indicating the reviewer's gender — e.g. "wife", "husband", "mom"

`with_medical/` (extended)

Adds three biomedical labels from clinical case reports:

Label	Source
`MedicalCondition`	MACCROBAT, Corona2
`ClinicalProcedure`	MACCROBAT
`ClinicalEvent`	MACCROBAT

Data Files

File	Description
`combined_upload.csv`	Primary training data (~14 k rows). Columns: `ori_review`, `minor_col`, `gender_col`, `medical_col`, `note_col`
`mydata.csv`	Earlier/smaller annotation export, used by some scripts as default
`mydata_nomy.csv`	Variant of mydata.csv with reviewer pronouns removed from gender labels
`eval_sample.csv`	Hand-curated evaluation sample
`eval_results_spacy.csv`	Per-row spaCy evaluation output
`eval_results_keyword_baseline.csv`	Per-row keyword-baseline evaluation output
`synthetic_200_edge_cases.txt`	200 synthetic edge cases (Python dict format)
`synthetic_500_edge_cases.txt`	500 synthetic edge cases
`synthetic_700_edge_cases.txt`	Combined 700-entry edge-case set (200 + 500, renumbered)
`Corona2.json`	Medical NER dataset (JLiNER format) — required for `with_medical/` only
`9764942/`	MACCROBAT clinical corpus — required for `with_medical/` only

CSV column format (minor_col, gender_col): semicolon-separated span phrases that appear verbatim in ori_review. Empty = no entity of that type in the row.
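
The column format above can be turned into character spans with a few lines of Python. This is a minimal sketch, not the loader used by the training scripts: `parse_spans` is a hypothetical helper, the sample row is made up, and the choices to take only the first occurrence of each phrase and to skip phrases that are missing from the text are assumptions.

```python
import csv
import io

def parse_spans(text, cell):
    """Split a semicolon-separated cell and locate each phrase in the text."""
    spans = []
    for phrase in (p.strip() for p in (cell or "").split(";")):
        if not phrase:
            continue  # empty cell or stray semicolon
        start = text.find(phrase)  # first occurrence only (assumption)
        if start == -1:
            continue  # assumption: phrases not found verbatim are skipped
        spans.append((start, start + len(phrase), phrase))
    return spans

# A tiny in-memory CSV standing in for the real file (made-up row).
sample = io.StringIO(
    "ori_review,minor_col,gender_col\n"
    '"My wife bought this for our toddler.",toddler,wife\n'
)
for row in csv.DictReader(sample):
    print(parse_spans(row["ori_review"], row["minor_col"]))
    print(parse_spans(row["ori_review"], row["gender_col"]))
```
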

`minor_gender_only/` — Training Scripts

Eight backends are implemented. Each script reads mydata.csv by default, except train_deberta_csv.py, which defaults to combined_upload.csv; pass --csv to override.

Installation

Install only the backend you intend to run:

pip install -r requirements-spacy.txt      # spaCy
pip install -r requirements-spanmarker.txt # SpanMarker
pip install -r requirements-gliner.txt     # GLiNER / GLiNER2
pip install -r requirements-flant5.txt     # FLAN-T5
pip install -r requirements-deberta.txt    # DeBERTa
pip install -r requirements-llm.txt        # Qwen LLM (LoRA)
pip install -r requirements-setfit.txt     # SetFit
pip install -r requirements-bart-mnli.txt  # BART MNLI zero-shot baseline

`train_spacy_csv.py` — spaCy

Token-classification NER using a spaCy pipeline (BIO tagging). Inference is fast, but the base model's general-purpose NER labels are not retained after fine-tuning.

python minor_gender_only/train_spacy_csv.py [options]

Argument	Default	Description
`--model`	`en_core_web_trf`	spaCy base model
`--epochs`	`15`	Training epochs
`--batch-size`	`8`	Batch size
`--dropout`	`0.2`	Dropout
`--output-dir`	`spacy_model_csv`	Output directory
`--val-split`	`0.1`	Validation fraction
`--csv`	`mydata.csv`	Training data path
`--minor-oversample`	`0`	Extra copies of minor-label examples
`--use-gpu`	off	Enable GPU

Outputs: best/, final/, checkpoint-epoch-N/
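
To illustrate the BIO tagging mentioned in this section, the sketch below converts character spans into per-token tags. It is a simplified illustration only: it tokenizes on whitespace, whereas spaCy uses its own tokenizer, and `whitespace_tokens` and `bio_tags` are hypothetical helper names that do not come from the training script.

```python
import re

def whitespace_tokens(text):
    """Return (token, start, end) triples for whitespace-separated tokens."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def bio_tags(tokens, spans):
    """Tag tokens that lie fully inside a (start, end, label) span."""
    tags = ["O"] * len(tokens)
    for s_start, s_end, label in spans:
        first = True
        for i, (_, t_start, t_end) in enumerate(tokens):
            if t_start >= s_start and t_end <= s_end:
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return tags

text = "my 8-year-old son loved it"
tokens = whitespace_tokens(text)
tags = bio_tags(tokens, [(0, 17, "NonfictionalChildRelated")])
for (word, _, _), tag in zip(tokens, tags):
    print(f"{word}\t{tag}")
```
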

`train_spanmarker_csv.py` — SpanMarker

Span-level BERT classifier using BIO supervision. Best accuracy on fixed label sets; saves best/ by eval_overall_f1 (SpanMarker's built-in seqeval metric).

python minor_gender_only/train_spanmarker_csv.py [options]

Argument	Default	Description
`--model`	`microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext`	Encoder model ID
`--epochs`	`5`	Training epochs
`--batch-size`	`8`	Batch size
`--lr`	`5e-5`	Learning rate
`--output-dir`	`spanmarker_finetuned_csv`	Output directory
`--val-split`	`0.1`	Validation fraction
`--csv`	`mydata.csv`	Training data path
`--minor-oversample`	`0`	Extra copies of minor-label examples
`--entity-max-length`	`8`	Max words per entity span
`--model-max-length`	`256`	Max input sequence length

Outputs: best/, final/
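
Because `--entity-max-length` caps the number of words per entity span, gold phrases longer than the cap presumably cannot be predicted. A quick audit of the annotation columns can show how many phrases would be affected. The sketch below assumes that words are whitespace-separated; `split_by_length` is a hypothetical helper and the example phrases are made up.

```python
def split_by_length(phrases, entity_max_length=8):
    """Partition phrases into those within and those over the word limit."""
    kept, too_long = [], []
    for phrase in phrases:
        if len(phrase.split()) <= entity_max_length:
            kept.append(phrase)
        else:
            too_long.append(phrase)
    return kept, too_long

phrases = [
    "my 8-year-old son",
    "my two little ones who are both still in elementary school",
]
kept, too_long = split_by_length(phrases, entity_max_length=8)
print(len(kept), "within limit;", len(too_long), "over limit")
```
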

`train_gliner_csv.py` — GLiNER

Fine-tunes a GLiNER span model. Retains zero-shot generalization for unseen entity types after fine-tuning.

python minor_gender_only/train_gliner_csv.py [options]

Argument	Default	Description
`--model`	`EmergentMethods/gliner_medium_news-v2.1`	GLiNER model ID
`--epochs`	`10`	Training epochs
`--batch-size`	`8`	Batch size
`--lr`	`3e-5`	Learning rate
`--output-dir`	`gliner_finetuned_csv`	Output directory
`--val-split`	`0.1`	Validation fraction
`--csv`	`mydata.csv`	Training data path
`--minor-oversample`	`0`	Extra copies of minor-label examples
`--gliner-threshold`	`0.5`	Prediction confidence threshold

Outputs: best/, final/
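
GLiNER is commonly trained on token-indexed examples rather than character spans. The sketch below converts character spans into a `tokenized_text` / `ner` record. The record layout (inclusive end token index) and the tokenization regex are assumptions that should be checked against the installed GLiNER version, and `to_gliner_example` is a hypothetical helper rather than a function from the training script.

```python
import re

# Assumption: words (optionally joined by "-" or "_") and single punctuation marks.
TOKEN_RE = re.compile(r"\w+(?:[-_]\w+)*|\S")

def to_gliner_example(text, spans):
    """Convert (char_start, char_end, label) spans to token-index spans."""
    matches = list(TOKEN_RE.finditer(text))
    ner = []
    for c_start, c_end, label in spans:
        idx = [i for i, m in enumerate(matches)
               if m.start() >= c_start and m.end() <= c_end]
        if idx:
            ner.append([idx[0], idx[-1], label])  # inclusive end index (assumption)
    return {"tokenized_text": [m.group() for m in matches], "ner": ner}

example = to_gliner_example(
    "Bought it for my toddler.",
    [(17, 24, "NonfictionalChildRelated")],
)
print(example)
```
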

`train_gliner2_csv.py` — GLiNER2

Fine-tunes GLiNER2 with LoRA adapters. Uses natural-language label descriptions instead of bare label names, which improves implicit entity detection.

python minor_gender_only/train_gliner2_csv.py [options]

Argument	Default	Description
`--model`	`fastino/gliner2-large-v1`	GLiNER2 model ID
`--epochs`	`10`	Training epochs
`--batch-size`	`4`	Batch size
`--grad-accum`	`2`	Gradient accumulation steps
`--encoder-lr`	`1e-5`	Encoder learning rate
`--task-lr`	`5e-4`	Task head learning rate
`--lora-r`	`8`	LoRA rank
`--output-dir`	`gliner2_finetuned_csv`	Output directory
`--val-split`	`0.1`	Validation fraction
`--csv`	`mydata.csv`	Training data path
`--minor-oversample`	`0`	Extra copies of minor-label examples
`--gender-oversample`	`0`	Extra copies of gender-label examples

Outputs: best/, final/
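
The `--minor-oversample` and `--gender-oversample` options add extra copies of rows that carry the corresponding label. A minimal sketch of that idea follows. The actual scripts may implement it differently; `oversample` is a hypothetical helper, and the shuffle with a fixed seed is an assumption.

```python
import random

def oversample(rows, column, extra_copies, seed=0):
    """Append extra_copies duplicates of every row whose column is non-empty."""
    out = list(rows)
    for row in rows:
        if (row.get(column) or "").strip():
            out.extend([row] * extra_copies)
    random.Random(seed).shuffle(out)  # avoid long runs of identical rows
    return out

rows = [
    {"ori_review": "My toddler loves it.", "minor_col": "toddler", "gender_col": ""},
    {"ori_review": "Works fine.", "minor_col": "", "gender_col": ""},
]
train = oversample(rows, "minor_col", extra_copies=2)
print(len(train))
```

With `extra_copies=0` the input is returned unchanged apart from shuffling, which matches the default of `0` in the argument tables.
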

`train_flant5_csv.py` — FLAN-T5

Seq2seq instruction-tuned model that generates a JSON array of entities. The descriptive label names in the prompt give the model a semantic head-start.

python minor_gender_only/train_flant5_csv.py [options]

Argument	Default	Description
`--model`	`google/flan-t5-base`	HuggingFace model ID
`--epochs`	`10`	Training epochs
`--batch-size`	`8`	Batch size
`--lr`	`5e-5`	Learning rate
`--max-input-length`	`512`	Max input tokens
`--max-target-length`	`256`	Max output tokens
`--output-dir`	`flant5_finetuned_csv`	Output directory
`--val-split`	`0.1`	Validation fraction
`--csv`	`mydata.csv`	Training data path
`--minor-oversample`	`0`	Extra copies of minor-label examples

Outputs: best/, final/
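
For a seq2seq backend like this one, each training row becomes a (prompt, target) pair in which the target is a JSON array of entities. The sketch below shows one way such a pair could be built. The prompt wording and the `{"text", "label"}` object layout are assumptions for illustration and may not match what train_flant5_csv.py uses; `build_pair` is a hypothetical helper.

```python
import json

LABELS = ["NonfictionalChildRelated", "AuthorGenderIndication"]

def build_pair(review, entities):
    """Return (prompt, target) for one review; entities is a list of (text, label)."""
    prompt = (
        "Extract entities labeled " + ", ".join(LABELS)
        + " from the review as a JSON array.\nReview: " + review
    )
    target = json.dumps([{"text": t, "label": l} for t, l in entities])
    return prompt, target

prompt, target = build_pair(
    "My wife bought this for our toddler.",
    [("wife", "AuthorGenderIndication"), ("toddler", "NonfictionalChildRelated")],
)
print(prompt)
print(target)
```
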

`train_deberta_csv.py` — DeBERTa-v3

Token-classification NER using DeBERTa-v3-base. Uses character-offset-based label alignment with SentencePiece tokenization. Defaults to combined_upload.csv.

python minor_gender_only/train_deberta_csv.py [options]

Argument	Default	Description
`--model`	`microsoft/deberta-v3-base`	HuggingFace model ID
`--epochs`	`10`	Training epochs
`--batch-size`	`8`	Batch size
`--lr`	`2e-5`	Learning rate
`--max-length`	`512`	Max input sequence length
`--output-dir`	`deberta_finetuned_csv`	Output directory
`--val-split`	`0.1`	Validation fraction
`--csv`	`combined_upload.csv`	Training data path
`--minor-oversample`	`0`	Extra copies of minor-label examples

Outputs: best/, final/

Note: DeBERTa-v3's SentencePiece tokenizer includes the preceding space in subword offsets (e.g. ▁baby → offset (2, 7) rather than (3, 7)). The eval callback accounts for this by comparing stripped span text rather than raw character indices.
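
The offset issue described in the note can be handled by trimming whitespace from the sliced text and adjusting the indices to match. The sketch below is illustrative only, and `stripped_span` is a hypothetical helper rather than the eval callback's actual code.

```python
def stripped_span(text, start, end):
    """Return (span_text, start, end) with surrounding whitespace trimmed."""
    raw = text[start:end]
    if not raw.strip():
        return "", start, start  # whitespace-only slice collapses to empty
    lead = len(raw) - len(raw.lstrip())
    trail = len(raw) - len(raw.rstrip())
    return raw.strip(), start + lead, end - trail

# The tokenizer reports (2, 7) for "baby" because it includes the space.
print(stripped_span("my baby", 2, 7))
```

Comparing the stripped text of predicted and gold spans, rather than the raw indices, makes the comparison insensitive to whether the leading space was included.
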

`train_llm_csv.py` — Qwen LLM (LoRA)

Fine-tunes Qwen3.5-9B with 4-bit quantization and LoRA. The model generates a JSON array of {"text": ..., "label": ...} objects. Slowest to train; best at implicit/contextual entity detection.

python minor_gender_only/train_llm_csv.py [options]

Argument	Default	Description
`--model`	`Qwen/Qwen3.5-9B`	Base model ID
`--epochs`	`3`	Training epochs
`--batch-size`	`2`	Per-device batch size
`--grad-accum`	`8`	Gradient accumulation steps
`--lr`	`2e-4`	Learning rate
`--lora-r`	`16`	LoRA rank
`--lora-alpha`	`32`	LoRA alpha
`--max-seq-length`	`768`	Max sequence length
`--output-dir`	`llm_finetuned_csv`	Output directory
`--val-split`	`0.1`	Validation fraction
`--csv`	`mydata.csv`	Training data path
`--eval-samples`	`100`	Samples for per-epoch eval (LLM inference is slow)
`--minor-oversample`	`0`	Extra copies of minor-label examples
`--no-4bit`	off	Disable 4-bit quantization

Outputs: best/, final/, checkpoint-epoch-N/
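
Generated text does not always contain clean JSON, so the output usually needs defensive parsing before scoring. The following sketch extracts the `{"text": ..., "label": ...}` objects described above. The regex-based extraction and the decision to silently drop malformed items are assumptions, and `parse_entities` is a hypothetical helper.

```python
import json
import re

LABELS = ("NonfictionalChildRelated", "AuthorGenderIndication")

def parse_entities(generated, allowed_labels=LABELS):
    """Extract (text, label) pairs from the first-to-last bracketed region."""
    match = re.search(r"\[.*\]", generated, flags=re.DOTALL)
    if not match:
        return []
    try:
        items = json.loads(match.group())
    except json.JSONDecodeError:
        return []  # assumption: unparseable output counts as no prediction
    if not isinstance(items, list):
        return []
    entities = []
    for item in items:
        if not isinstance(item, dict):
            continue
        text, label = item.get("text"), item.get("label")
        if isinstance(text, str) and isinstance(label, str) and label in allowed_labels:
            entities.append((text, label))
    return entities

print(parse_entities('Answer: [{"text": "my son", "label": "NonfictionalChildRelated"}]'))
```
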

`train_setfit_csv.py` — SetFit

Contrastive sentence-embedding classifier over candidate n-gram spans: each candidate span is embedded and classified as NonfictionalChildRelated, AuthorGenderIndication, or O. There is no generative decoding, so inference is very fast.

python minor_gender_only/train_setfit_csv.py [options]

Argument	Default	Description
`--model`	`sentence-transformers/paraphrase-mpnet-base-v2`	Sentence encoder
`--epochs`	`5`	Contrastive training epochs
`--batch-size`	`16`	Batch size
`--max-span-words`	`6`	Max n-gram length for candidate spans
`--neg-ratio`	`3`	Negative spans per positive
`--output-dir`	`setfit_finetuned_csv`	Output directory
`--val-split`	`0.1`	Validation fraction
`--csv`	`mydata.csv`	Training data path
`--minor-oversample`	`0`	Extra copies of minor-label examples

Outputs: best/, final/
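
The candidate-span idea behind `--max-span-words` and `--neg-ratio` can be sketched as follows: enumerate every n-gram up to the word limit, keep those matching a gold phrase as positives, and sample a fixed number of the rest as `O` negatives. This is an illustration under assumptions (whitespace tokenization, exact string matching, random negative sampling), and `candidate_spans` and `build_examples` are hypothetical helpers.

```python
import random

def candidate_spans(words, max_span_words=6):
    """All (start, end) word-index pairs with 1..max_span_words words."""
    return [
        (i, j)
        for i in range(len(words))
        for j in range(i + 1, min(i + max_span_words, len(words)) + 1)
    ]

def build_examples(text, gold, max_span_words=6, neg_ratio=3, seed=0):
    """gold maps an exact phrase to its label; returns (phrase, label) pairs."""
    words = text.split()
    positives, negatives = [], []
    for i, j in candidate_spans(words, max_span_words):
        phrase = " ".join(words[i:j])
        if phrase in gold:
            positives.append((phrase, gold[phrase]))
        else:
            negatives.append((phrase, "O"))
    keep = min(len(negatives), neg_ratio * len(positives))
    return positives + random.Random(seed).sample(negatives, keep)

examples = build_examples("my son loved it", {"my son": "NonfictionalChildRelated"})
for phrase, label in examples:
    print(f"{label}\t{phrase}")
```
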

Evaluation

`minor_gender_only/evaluate_csv.py` — Multi-model evaluation on a CSV

Evaluates any combination of trained models on a CSV file. Reports both corpus-level and row-averaged soft F1, in which an exact match earns full credit and a partial match earns token-Jaccard credit.

python minor_gender_only/evaluate_csv.py \
  --csv combined_upload.csv \
  --spacy minor_gender_only/spacy_model_combined_v2/best \
  --gliner gliner_finetuned_csv/best \
  --spanmarker spanmarker_finetuned_csv/best \
  --keyword-baseline \
  --row-output eval_results.csv

Argument	Description
`--csv`	Evaluation CSV (default: `mydata.csv`)
`--spacy PATH`	spaCy model directory
`--gliner PATH`	GLiNER model directory
`--spanmarker PATH`	SpanMarker model directory
`--llm PATH`	LLM LoRA adapter directory
`--llm-base MODEL`	Base model ID for LLM (default: `Qwen/Qwen3.5-9B`)
`--gliner-threshold`	GLiNER confidence threshold (default: `0.5`)
`--spanmarker-threshold`	SpanMarker confidence threshold (default: `0.5`)
`--keyword-baseline`	Also run keyword-matching baseline
`--annotated-only`	Restrict to rows with at least one gold entity
`--row-output PATH`	Write per-row results to CSV
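
One plausible reading of the soft F1 used here is sketched below: each predicted phrase earns its best token-Jaccard overlap with any gold phrase (1.0 for an exact match), and likewise for each gold phrase against the predictions. The exact formula in evaluate_csv.py may differ. The treatment of rows with no gold and no predicted entities as a perfect score is an assumption, and all function names are hypothetical.

```python
def jaccard(a, b):
    """Token-level Jaccard overlap between two phrases."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def credits(pred, gold):
    """Summed best-match credit for precision and for recall."""
    p = sum(max((jaccard(x, g) for g in gold), default=0.0) for x in pred)
    r = sum(max((jaccard(g, x) for x in pred), default=0.0) for g in gold)
    return p, r

def f1(p_credit, n_pred, r_credit, n_gold):
    if n_pred == 0 and n_gold == 0:
        return 1.0  # assumption: nothing to find, nothing predicted
    if n_pred == 0 or n_gold == 0:
        return 0.0
    precision, recall = p_credit / n_pred, r_credit / n_gold
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def evaluate(rows):
    """rows is a list of (pred_phrases, gold_phrases); returns (corpus, row_avg)."""
    if not rows:
        return 0.0, 0.0
    p_sum = r_sum = 0.0
    n_pred = n_gold = 0
    row_scores = []
    for pred, gold in rows:
        p, r = credits(pred, gold)
        row_scores.append(f1(p, len(pred), r, len(gold)))
        p_sum, r_sum = p_sum + p, r_sum + r
        n_pred, n_gold = n_pred + len(pred), n_gold + len(gold)
    return f1(p_sum, n_pred, r_sum, n_gold), sum(row_scores) / len(row_scores)

print(evaluate([(["my son"], ["my son"]), (["son"], ["my son"])]))
```

The corpus-level score pools credit across all rows before computing F1, while the row average computes F1 per row first, so rows with many entities weigh more in the former.
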

`minor_gender_only/eval_edge_cases_spacy.py` — Edge-case evaluation

Evaluates the spaCy model against one of the synthetic_*_edge_cases.txt files, which test known hard cases (pet vs. child language, fictional vs. real children, age edge cases, etc.).

python minor_gender_only/eval_edge_cases_spacy.py \
  --model minor_gender_only/spacy_model_combined_v2/best \
  --edge-cases synthetic_700_edge_cases.txt \
  --errors \
  --row-output edge_results.csv

Argument	Description
`--model`	spaCy model directory (default: `spacy_model_combined_v2/best`)
`--edge-cases`	Path to edge-cases `.txt` file (default: `synthetic_700_edge_cases.txt`)
`--errors`	Print false-positive / false-negative error analysis
`--n-worst`	Number of lowest-F1 examples to show with `--errors` (default: 20)
`--annotated-only`	Evaluate only rows with at least one gold entity
`--row-output PATH`	Write per-row results to CSV

`minor_gender_only/zero_shot_bart_mnli_csv.py` — BART MNLI baseline

Zero-shot baseline: uses BART-large-MNLI to detect whether a sentence contains a minor/gender entity, then falls back to keyword extraction for the span. No fine-tuning required.

python minor_gender_only/zero_shot_bart_mnli_csv.py --csv combined_upload.csv
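
The two-stage logic of this baseline (an entailment check on the sentence, then keyword extraction for the span) can be sketched without loading a model by passing the scorer in as a function. Everything here is hypothetical: the keyword list, the hypothesis wording, the threshold, and the `detect_minor_spans` helper are illustrative assumptions, not the script's actual values.

```python
import re

MINOR_KEYWORDS = ("toddler", "son", "daughter", "kindergarten")  # illustrative only

def detect_minor_spans(text, score_fn,
                       hypothesis="This review mentions the author's child.",
                       threshold=0.5, keywords=MINOR_KEYWORDS):
    """Return keyword matches only if score_fn judges the hypothesis likely."""
    if score_fn(text, hypothesis) < threshold:
        return []
    pattern = r"\b(" + "|".join(map(re.escape, keywords)) + r")\b"
    return [m.group(1) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

# Stand-in scorer so the sketch runs without downloading a model.
def fake_score(text, hypothesis):
    return 0.9 if "my" in text.lower() else 0.1

print(detect_minor_spans("My toddler loves it", fake_score))
print(detect_minor_spans("A person bought it", fake_score))
```

In practice `score_fn` would wrap an NLI model, for example the Hugging Face `zero-shot-classification` pipeline with `facebook/bart-large-mnli`, and return the entailment score for the hypothesis.
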

`with_medical/` — Full Label Set (Minor + Gender + Medical)

Trains on MACCROBAT clinical case reports + Corona2.json + mydata.csv in a single pass, adding MedicalCondition, ClinicalProcedure, and ClinicalEvent labels.

Required data:

9764942/MACCROBAT2018/ — BratStandoff .txt + .ann files
Corona2.json — place next to the scripts

python with_medical/train_spacy.py      # spaCy
python with_medical/train_gliner.py     # GLiNER
python with_medical/train_spanmarker.py # SpanMarker
python with_medical/train_llm.py        # Qwen LLM

Argument tables are not repeated here; run python <script> --help for the full list. All four scripts accept --data-dir, --corona, --csv, and --minor-oversample; --csv and --minor-oversample behave as in their minor_gender_only/ counterparts.

Data Processing (`data_processing/`)

Utility scripts for preparing and auditing the annotation data.

Script	Purpose
`combine_upload.py`	Merges annotation batches into `combined_upload.csv`
`check_labels.py`	Validates label consistency across rows
`check_overlap.py`	Flags overlapping or conflicting span annotations
`entity_diversity.py`	Reports entity surface-form distribution
`sample_reviews.py`	Samples rows for manual QA

remove_minor_pronouns.py (project root) strips reviewer pronouns from gender/minor columns to reduce noise from ambiguous pronoun labels.

Edge Case Files

The synthetic_*_edge_cases.txt files are Python dicts (loadable with ast.literal_eval) covering known ambiguity categories:

Category	Examples
Human child vs. young pet	"my fur-baby" vs. "my baby outgrew her bassinet"
Fictional vs. real child	"the protagonist's son" vs. "my son loved this"
Age boundary	17-year-old (minor) vs. 18-year-old (adult)
Numeric age in context	"my 5-year-old" vs. "my 25-year-old"
Fictional vs. real gender	"the character's wife" vs. "my wife"
Institutional role	"elementary school teacher" (not a child) vs. "in elementary school"
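
Loading one of these files with `ast.literal_eval` might look like the sketch below. The dictionary layout shown (integer IDs mapping to `text`, `minor`, and `gender` keys) is a hypothetical stand-in, since the real schema is not described here; only the use of `ast.literal_eval` comes from this README.

```python
import ast

# Hypothetical layout: the real files may use different keys.
raw = """{
    1: {"text": "My fur-baby loves this bed.", "minor": [], "gender": []},
    2: {"text": "My baby outgrew her bassinet.", "minor": ["My baby"], "gender": []},
}"""

def load_edge_cases(source):
    """Parse a Python dict literal safely, without executing any code."""
    cases = ast.literal_eval(source)
    if not isinstance(cases, dict):
        raise ValueError("expected a dict literal")
    return cases

cases = load_edge_cases(raw)
for case_id, case in cases.items():
    print(case_id, case["text"], case["minor"])
```

To read a real file, pass its contents instead of `raw`, for example `load_edge_cases(open("synthetic_700_edge_cases.txt", encoding="utf-8").read())`.
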

Backend Comparison

Property	spaCy	SpanMarker	GLiNER	GLiNER2	DeBERTa	FLAN-T5	SetFit	LLM
Architecture	Token classifier	Span classifier	Span model	Span + LoRA	Token classifier	Seq2seq	Embedding classifier	Decoder + LoRA
Retains zero-shot	No	No	Yes	Yes	No	Partial	No	Yes
Inference speed	Fast	Medium	Medium	Medium	Medium	Slow	Fast	Very slow
Implicit entity detection	Weak	Medium	Good	Good	Medium	Good	Weak	Best
GPU required	No	Recommended	Recommended	Required	Recommended	Recommended	No	Required
