NER pipeline for detecting minor children and author gender indications in product reviews, with an optional medical-entity extension. Multiple model backends are provided so accuracy, speed, and zero-shot generalization trade-offs can be compared directly.
| Label | Meaning |
|---|---|
| NonfictionalChildRelated | Text span indicating a real (non-fictional) minor child, e.g. "my 8-year-old son", "toddler", "kindergarten" |
| AuthorGenderIndication | Text span indicating the reviewer's gender, e.g. "wife", "husband", "mom" |
The optional medical extension adds three biomedical labels from clinical case reports:
| Label | Source |
|---|---|
| MedicalCondition | MACCROBAT, Corona2 |
| ClinicalProcedure | MACCROBAT |
| ClinicalEvent | MACCROBAT |
| File | Description |
|---|---|
| combined_upload.csv | Primary training data (~14k rows). Columns: ori_review, minor_col, gender_col, medical_col, note_col |
| mydata.csv | Earlier/smaller annotation export, used by some scripts as default |
| mydata_nomy.csv | Variant of mydata.csv with reviewer pronouns removed from gender labels |
| eval_sample.csv | Hand-curated evaluation sample |
| eval_results_spacy.csv | Per-row spaCy evaluation output |
| eval_results_keyword_baseline.csv | Per-row keyword-baseline evaluation output |
| synthetic_200_edge_cases.txt | 200 synthetic edge cases (Python dict format) |
| synthetic_500_edge_cases.txt | 500 synthetic edge cases |
| synthetic_700_edge_cases.txt | Combined 700-entry edge-case set (200 + 500, renumbered) |
| Corona2.json | Medical NER dataset (JLiNER format); required for with_medical/ only |
| 9764942/ | MACCROBAT clinical corpus; required for with_medical/ only |

CSV column format (minor_col, gender_col): semicolon-separated span phrases that appear verbatim in ori_review. Empty = no entity of that type in the row.
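
As a rough illustration of the column format above, the sketch below turns one row's semicolon-separated cells into character spans. The helper name `parse_spans` is hypothetical, and marking every verbatim occurrence of a phrase is an assumption; the training scripts may resolve repeated phrases differently.

```python
import csv
import io


def parse_spans(review, cell, label):
    """Turn a semicolon-separated cell into (start, end, label) character spans."""
    spans = []
    for phrase in (p.strip() for p in cell.split(";")):
        if not phrase:
            continue  # an empty cell means no entity of this type
        start = review.find(phrase)
        while start != -1:  # assumption: label every verbatim occurrence
            spans.append((start, start + len(phrase), label))
            start = review.find(phrase, start + len(phrase))
    return spans


# Tiny in-memory CSV standing in for combined_upload.csv (illustrative row only).
sample = io.StringIO(
    "ori_review,minor_col,gender_col\n"
    '"My wife bought this for our toddler.",toddler,wife\n'
)
for row in csv.DictReader(sample):
    text = row["ori_review"]
    spans = parse_spans(text, row["minor_col"], "NonfictionalChildRelated")
    spans += parse_spans(text, row["gender_col"], "AuthorGenderIndication")
    for start, end, label in sorted(spans):
        print(label, repr(text[start:end]))
```
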
Eight backends are implemented. All scripts default to combined_upload.csv (DeBERTa) or mydata.csv (others); pass --csv to override.
Install only the backend you intend to run:
pip install -r requirements-spacy.txt # spaCy
pip install -r requirements-spanmarker.txt # SpanMarker
pip install -r requirements-gliner.txt # GLiNER / GLiNER2
pip install -r requirements-flant5.txt # FLAN-T5
pip install -r requirements-deberta.txt # DeBERTa
pip install -r requirements-llm.txt # Qwen LLM (LoRA)
pip install -r requirements-setfit.txt # SetFit

Token-classification NER using a spaCy pipeline (BIO tagging). Fast inference; base-model general NER is not retained after fine-tuning.
python minor_gender_only/train_spacy_csv.py [options]

| Argument | Default | Description |
|---|---|---|
| --model | en_core_web_trf | spaCy base model |
| --epochs | 15 | Training epochs |
| --batch-size | 8 | Batch size |
| --dropout | 0.2 | Dropout |
| --output-dir | spacy_model_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | mydata.csv | Training data path |
| --minor-oversample | 0 | Extra copies of minor-label examples |
| --use-gpu | off | Enable GPU |

Outputs: best/, final/, checkpoint-epoch-N/
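
For readers unfamiliar with BIO tagging, here is a minimal sketch of how character spans can be projected onto tokens. It splits on whitespace for simplicity, which is an assumption; spaCy uses its own tokenizer, and the function name `bio_tags` is hypothetical.

```python
import re


def bio_tags(text, spans):
    """Label whitespace-delimited tokens with B-/I-/O tags from character spans."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            if tok_start < end and tok_end > start:  # token overlaps the span
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return [(tok, tag) for (tok, _, _), tag in zip(tokens, tags)]


# "my 8-year-old son" covers characters 0..17 of the sentence.
for token, tag in bio_tags(
    "my 8-year-old son loves it", [(0, 17, "NonfictionalChildRelated")]
):
    print(f"{token}\t{tag}")
```
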
Span-level BERT classifier using BIO supervision. Best accuracy on fixed label sets; saves best/ by eval_overall_f1 (SpanMarker's built-in seqeval metric).
python minor_gender_only/train_spanmarker_csv.py [options]

| Argument | Default | Description |
|---|---|---|
| --model | microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext | Encoder model ID |
| --epochs | 5 | Training epochs |
| --batch-size | 8 | Batch size |
| --lr | 5e-5 | Learning rate |
| --output-dir | spanmarker_finetuned_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | mydata.csv | Training data path |
| --minor-oversample | 0 | Extra copies of minor-label examples |
| --entity-max-length | 8 | Max words per entity span |
| --model-max-length | 256 | Max input sequence length |

Outputs: best/, final/
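
The --minor-oversample option appears in every training script. One plausible reading of "extra copies of minor-label examples" is sketched below; the function name `oversample_minor` is hypothetical, and the actual scripts may duplicate rows differently (for example after shuffling).

```python
def oversample_minor(rows, extra_copies):
    """Return rows plus `extra_copies` duplicates of each row that has a minor label."""
    out = list(rows)
    for row in rows:
        if row.get("minor_col", "").strip():  # non-empty cell = has a minor entity
            out.extend([row] * extra_copies)
    return out


train_rows = [
    {"ori_review": "My toddler loves it.", "minor_col": "toddler"},
    {"ori_review": "Works as described.", "minor_col": ""},
]
print(len(oversample_minor(train_rows, extra_copies=2)))
```

If duplication is used, it would normally be applied to the training split only, so that copies of the same review do not leak into validation.
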
Fine-tunes a GLiNER span model. Retains zero-shot generalization for unseen entity types after fine-tuning.
python minor_gender_only/train_gliner_csv.py [options]

| Argument | Default | Description |
|---|---|---|
| --model | EmergentMethods/gliner_medium_news-v2.1 | GLiNER model ID |
| --epochs | 10 | Training epochs |
| --batch-size | 8 | Batch size |
| --lr | 3e-5 | Learning rate |
| --output-dir | gliner_finetuned_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | mydata.csv | Training data path |
| --minor-oversample | 0 | Extra copies of minor-label examples |
| --gliner-threshold | 0.5 | Prediction confidence threshold |

Outputs: best/, final/
Fine-tunes GLiNER2 with LoRA adapters. Uses natural-language label descriptions instead of bare label names, which improves implicit entity detection.
python minor_gender_only/train_gliner2_csv.py [options]

| Argument | Default | Description |
|---|---|---|
| --model | fastino/gliner2-large-v1 | GLiNER2 model ID |
| --epochs | 10 | Training epochs |
| --batch-size | 4 | Batch size |
| --grad-accum | 2 | Gradient accumulation steps |
| --encoder-lr | 1e-5 | Encoder learning rate |
| --task-lr | 5e-4 | Task head learning rate |
| --lora-r | 8 | LoRA rank |
| --output-dir | gliner2_finetuned_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | mydata.csv | Training data path |
| --minor-oversample | 0 | Extra copies of minor-label examples |
| --gender-oversample | 0 | Extra copies of gender-label examples |

Outputs: best/, final/
Seq2seq instruction-tuned model that generates a JSON array of entities. The descriptive label names in the prompt give the model a semantic head-start.
python minor_gender_only/train_flant5_csv.py [options]

| Argument | Default | Description |
|---|---|---|
| --model | google/flan-t5-base | HuggingFace model ID |
| --epochs | 10 | Training epochs |
| --batch-size | 8 | Batch size |
| --lr | 5e-5 | Learning rate |
| --max-input-length | 512 | Max input tokens |
| --max-target-length | 256 | Max output tokens |
| --output-dir | flant5_finetuned_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | mydata.csv | Training data path |
| --minor-oversample | 0 | Extra copies of minor-label examples |

Outputs: best/, final/
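
To make the seq2seq framing concrete, the sketch below builds one (input, target) text pair. The prompt wording, the `build_example` helper, and the {"text", "label"} object shape are assumptions for illustration; the training script defines its own prompt template.

```python
import json

LABELS = ["NonfictionalChildRelated", "AuthorGenderIndication"]


def build_example(review, entities):
    """Build a hypothetical (input, target) text pair for seq2seq fine-tuning."""
    prompt = (
        "Extract entities with labels "
        + ", ".join(LABELS)
        + " from the review. Answer with a JSON array.\n\nReview: "
        + review
    )
    # The target is the JSON array the model should learn to generate.
    target = json.dumps(
        [{"text": text, "label": label} for text, label in entities],
        ensure_ascii=False,
    )
    return prompt, target


prompt, target = build_example(
    "my son loved it", [("my son", "NonfictionalChildRelated")]
)
print(prompt)
print(target)
```

A review without entities maps to the target `[]`, which keeps the output format uniform across rows.
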
Token-classification NER using DeBERTa-v3-base. Uses character-offset-based label alignment with SentencePiece tokenization. Defaults to combined_upload.csv.
python minor_gender_only/train_deberta_csv.py [options]

| Argument | Default | Description |
|---|---|---|
| --model | microsoft/deberta-v3-base | HuggingFace model ID |
| --epochs | 10 | Training epochs |
| --batch-size | 8 | Batch size |
| --lr | 2e-5 | Learning rate |
| --max-length | 512 | Max input sequence length |
| --output-dir | deberta_finetuned_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | combined_upload.csv | Training data path |
| --minor-oversample | 0 | Extra copies of minor-label examples |

Outputs: best/, final/
Note: DeBERTa-v3's SentencePiece tokenizer includes the preceding space in subword offsets (e.g. ▁baby → offset (2, 7) rather than (3, 7)). The eval callback accounts for this by comparing stripped span text rather than raw character indices.
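
The effect described in the note can be reproduced with plain string slicing. This is only a sketch of the comparison idea, not the eval callback itself; `span_text` is a hypothetical helper.

```python
def span_text(text, start, end):
    """Return the surface text of a character span with surrounding whitespace removed."""
    return text[start:end].strip()


text = "my baby outgrew her bassinet"
gold = (3, 7)       # annotated span for "baby"
predicted = (2, 7)  # tokenizer offset that includes the preceding space

# Raw indices disagree, but the stripped surface text is identical.
assert gold != predicted
assert span_text(text, *gold) == span_text(text, *predicted) == "baby"
```
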
Fine-tunes Qwen3.5-9B with 4-bit quantization and LoRA. The model generates a JSON array of {"text": ..., "label": ...} objects. Slowest to train; best at implicit/contextual entity detection.
python minor_gender_only/train_llm_csv.py [options]

| Argument | Default | Description |
|---|---|---|
| --model | Qwen/Qwen3.5-9B | Base model ID |
| --epochs | 3 | Training epochs |
| --batch-size | 2 | Per-device batch size |
| --grad-accum | 8 | Gradient accumulation steps |
| --lr | 2e-4 | Learning rate |
| --lora-r | 16 | LoRA rank |
| --lora-alpha | 32 | LoRA alpha |
| --max-seq-length | 768 | Max sequence length |
| --output-dir | llm_finetuned_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | mydata.csv | Training data path |
| --eval-samples | 100 | Samples for per-epoch eval (LLM inference is slow) |
| --minor-oversample | 0 | Extra copies of minor-label examples |
| --no-4bit | off | Disable 4-bit quantization |

Outputs: best/, final/, checkpoint-epoch-N/
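
Generated JSON is not guaranteed to be well-formed, so the output usually needs defensive parsing. The sketch below is one possible approach, not the script's actual post-processing: `parse_entities` is a hypothetical helper, and dropping entities whose text does not appear verbatim in the review is an assumption.

```python
import json
import re

VALID_LABELS = {"NonfictionalChildRelated", "AuthorGenderIndication"}


def parse_entities(generated, source_text):
    """Extract valid {"text", "label"} objects from raw model output."""
    match = re.search(r"\[.*\]", generated, flags=re.DOTALL)
    if match is None:
        return []  # no JSON array in the output
    try:
        items = json.loads(match.group())
    except json.JSONDecodeError:
        return []  # malformed JSON counts as no prediction
    entities = []
    for item in items:
        if not isinstance(item, dict):
            continue
        text, label = item.get("text"), item.get("label")
        if (
            isinstance(text, str)
            and isinstance(label, str)
            and label in VALID_LABELS
            and text in source_text  # assumption: spans must occur verbatim
        ):
            entities.append({"text": text, "label": label})
    return entities


raw = 'Sure: [{"text": "my wife", "label": "AuthorGenderIndication"}]'
print(parse_entities(raw, "my wife loves it"))
```
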
Contrastive sentence-embedding classifier over candidate n-gram spans. Trains a span-level classifier: each candidate span is embedded and classified as NonfictionalChildRelated, AuthorGenderIndication, or O. No generative decoding; very fast inference.
python minor_gender_only/train_setfit_csv.py [options]

| Argument | Default | Description |
|---|---|---|
| --model | sentence-transformers/paraphrase-mpnet-base-v2 | Sentence encoder |
| --epochs | 5 | Contrastive training epochs |
| --batch-size | 16 | Batch size |
| --max-span-words | 6 | Max n-gram length for candidate spans |
| --neg-ratio | 3 | Negative spans per positive |
| --output-dir | setfit_finetuned_csv | Output directory |
| --val-split | 0.1 | Validation fraction |
| --csv | mydata.csv | Training data path |
| --minor-oversample | 0 | Extra copies of minor-label examples |

Outputs: best/, final/
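
The candidate-span idea behind --max-span-words and --neg-ratio can be sketched without any model. Both function names are hypothetical, splitting on whitespace is an assumption, and the real script may sample negatives differently.

```python
import random


def candidate_spans(text, max_span_words=6):
    """Enumerate every contiguous word n-gram of up to max_span_words words."""
    words = text.split()
    return [
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, min(i + max_span_words, len(words)) + 1)
    ]


def sample_training_spans(text, gold, neg_ratio=3, seed=0):
    """Pair gold spans with up to neg_ratio random 'O' spans per positive."""
    positives = list(gold.items())
    negatives = [s for s in candidate_spans(text) if s not in gold]
    rng = random.Random(seed)
    k = min(len(negatives), neg_ratio * len(positives))
    return positives + [(span, "O") for span in rng.sample(negatives, k)]


pairs = sample_training_spans(
    "my son loved this toy", {"my son": "NonfictionalChildRelated"}
)
for span, label in pairs:
    print(f"{label}\t{span}")
```

At inference time every candidate span would be embedded and classified, which is why the span length cap matters: the number of candidates grows with both sentence length and --max-span-words.
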
Evaluates any combination of trained models on a CSV file. Uses corpus-level and row-average soft F1 (exact match + token-Jaccard partial credit).
python minor_gender_only/evaluate_csv.py \
--csv combined_upload.csv \
--spacy minor_gender_only/spacy_model_combined_v2/best \
--gliner gliner_finetuned_csv/best \
--spanmarker spanmarker_finetuned_csv/best \
--keyword-baseline \
--row-output eval_results.csv

| Argument | Description |
|---|---|
| --csv | Evaluation CSV (default: mydata.csv) |
| --spacy PATH | spaCy model directory |
| --gliner PATH | GLiNER model directory |
| --spanmarker PATH | SpanMarker model directory |
| --llm PATH | LLM LoRA adapter directory |
| --llm-base MODEL | Base model ID for LLM (default: Qwen/Qwen3.5-9B) |
| --gliner-threshold | GLiNER confidence threshold (default: 0.5) |
| --spanmarker-threshold | SpanMarker confidence threshold (default: 0.5) |
| --keyword-baseline | Also run keyword-matching baseline |
| --annotated-only | Restrict to rows with at least one gold entity |
| --row-output PATH | Write per-row results to CSV |
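
The exact soft-F1 definition lives in evaluate_csv.py. As an illustration only, the sketch below gives one plausible formulation: a span earns credit 1.0 for an exact match and otherwise its best token-Jaccard overlap, and the corpus-level score pools credits across rows while the row-average score averages per-row F1. All function names, the handling of empty rows, and the omission of labels are assumptions.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two span strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def credit(span, others):
    """1.0 for an exact match, otherwise the best token-Jaccard overlap."""
    return max((1.0 if span == o else jaccard(span, o) for o in others), default=0.0)


def f1(p_credit, n_pred, r_credit, n_gold):
    if n_pred == 0 and n_gold == 0:
        return 1.0  # assumption: nothing predicted and nothing expected is perfect
    precision = p_credit / n_pred if n_pred else 0.0
    recall = r_credit / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(rows):
    """rows: list of (predicted spans, gold spans). Returns (corpus F1, row-average F1)."""
    tot_p = tot_r = 0.0
    n_pred = n_gold = 0
    row_scores = []
    for pred, gold in rows:
        p = sum(credit(s, gold) for s in pred)  # precision-side credit
        r = sum(credit(s, pred) for s in gold)  # recall-side credit
        row_scores.append(f1(p, len(pred), r, len(gold)))
        tot_p, tot_r = tot_p + p, tot_r + r
        n_pred, n_gold = n_pred + len(pred), n_gold + len(gold)
    corpus = f1(tot_p, n_pred, tot_r, n_gold)
    row_avg = sum(row_scores) / len(row_scores) if row_scores else 0.0
    return corpus, row_avg


print(evaluate([(["my son"], ["my son"]), (["son"], ["my son"])]))
```
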
Evaluates the spaCy model against one of the synthetic_*_edge_cases.txt files, which test known hard cases (pet vs. child language, fictional vs. real children, age edge cases, etc.).
python minor_gender_only/eval_edge_cases_spacy.py \
--model minor_gender_only/spacy_model_combined_v2/best \
--edge-cases synthetic_700_edge_cases.txt \
--errors \
--row-output edge_results.csv

| Argument | Description |
|---|---|
| --model | spaCy model directory (default: spacy_model_combined_v2/best) |
| --edge-cases | Path to edge-cases .txt file (default: synthetic_700_edge_cases.txt) |
| --errors | Print false-positive / false-negative error analysis |
| --n-worst | Number of lowest-F1 examples to show with --errors (default: 20) |
| --annotated-only | Evaluate only rows with at least one gold entity |
| --row-output PATH | Write per-row results to CSV |

Zero-shot baseline: uses BART-large-MNLI to detect whether a sentence contains a minor/gender entity, then falls back to keyword extraction for the span. No fine-tuning required.
python minor_gender_only/zero_shot_bart_mnli_csv.py --csv combined_upload.csv

The with_medical/ scripts train on MACCROBAT clinical case reports, Corona2.json, and mydata.csv in a single pass, adding the MedicalCondition, ClinicalProcedure, and ClinicalEvent labels.

Required data:

- 9764942/MACCROBAT2018/: BratStandoff .txt + .ann files
- Corona2.json: place next to the scripts

python with_medical/train_spacy.py # spaCy
python with_medical/train_gliner.py # GLiNER
python with_medical/train_spanmarker.py # SpanMarker
python with_medical/train_llm.py # Qwen LLM

See the argument tables in the old README section or run `python <script> --help`. All scripts share the same --data-dir, --corona, --csv, and --minor-oversample arguments as their minor_gender_only/ counterparts.

Utility scripts for preparing and auditing the annotation data.
| Script | Purpose |
|---|---|
| combine_upload.py | Merges annotation batches into combined_upload.csv |
| check_labels.py | Validates label consistency across rows |
| check_overlap.py | Flags overlapping or conflicting span annotations |
| entity_diversity.py | Reports entity surface-form distribution |
| sample_reviews.py | Samples rows for manual QA |

remove_minor_pronouns.py (project root) strips reviewer pronouns from gender/minor columns to reduce noise from ambiguous pronoun labels.
The synthetic_*_edge_cases.txt files are Python dicts (loadable with ast.literal_eval) covering known ambiguity categories:
| Category | Examples |
|---|---|
| Human child vs. young pet | "my fur-baby" vs. "my baby outgrew her bassinet" |
| Fictional vs. real child | "the protagonist's son" vs. "my son loved this" |
| Age boundary | 17-year-old (minor) vs. 18-year-old (adult) |
| Numeric age in context | "my 5-year-old" vs. "my 25-year-old" |
| Fictional vs. real gender | "the character's wife" vs. "my wife" |
| Institutional role | "elementary school teacher" (not a child) vs. "in elementary school" |
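
Since the edge-case files are Python literals, they can be loaded without executing arbitrary code. The sketch below assumes each file holds a single literal; the `load_edge_cases` helper and the keys in the demo entry are hypothetical and only illustrate the loading step.

```python
import ast
import tempfile
from pathlib import Path


def load_edge_cases(path):
    """Parse a file containing one Python literal using ast.literal_eval."""
    source = Path(path).read_text(encoding="utf-8")
    return ast.literal_eval(source)  # evaluates literals only, never runs code


# Demo with a throwaway file; the entry layout here is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_edge_cases.txt"
    demo.write_text('{1: {"text": "my fur-baby loves this bed"}}', encoding="utf-8")
    cases = load_edge_cases(demo)
    print(len(cases), "edge case(s) loaded")
```
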

| | spaCy | SpanMarker | GLiNER | GLiNER2 | DeBERTa | FLAN-T5 | SetFit | LLM |
|---|---|---|---|---|---|---|---|---|
| Architecture | Token classifier | Span classifier | Span model | Span + LoRA | Token classifier | Seq2seq | Embedding classifier | Decoder + LoRA |
| Retains zero-shot | No | No | Yes | Yes | No | Partial | No | Yes |
| Inference speed | Fast | Medium | Medium | Medium | Medium | Slow | Fast | Very slow |
| Implicit entity detection | Weak | Medium | Good | Good | Medium | Good | Weak | Best |
| GPU required | No | Recommended | Recommended | Required | Recommended | Recommended | No | Required |