hulln/unidive-cocos


unidive-cocos

Pipeline for adding Backchannel= and Coconstruct= annotations to SST speaker-view CoNLL-U files.
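As a rough illustration of what adding such an annotation involves, the sketch below appends a key=value pair to one CoNLL-U token line. It assumes the annotations are stored in the MISC column (the tenth column), which this README does not state. The function name add_misc_attribute, the sample token, and the value "Yes" are hypothetical placeholders; the actual values are defined in the process documentation under docs/.

```python
def add_misc_attribute(line: str, key: str, value: str) -> str:
    """Return a CoNLL-U token line with key=value added to the MISC column.

    Assumption: annotations live in MISC (column 10). Hypothetical helper,
    not taken from the repository's scripts.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated CoNLL-U columns")
    # "_" marks an empty MISC column in CoNLL-U
    misc = [] if cols[9] == "_" else cols[9].split("|")
    # Drop any existing entry for the same key so it is not duplicated
    misc = [item for item in misc if not item.startswith(key + "=")]
    misc.append(f"{key}={value}")
    cols[9] = "|".join(misc)
    return "\t".join(cols)


# Hypothetical token line; "Yes" is a placeholder value
token = "1\tmhm\tmhm\tINTJ\t_\t_\t0\troot\t_\t_"
annotated = add_misc_attribute(token, "Backchannel", "Yes")
```

Existing MISC entries keep their original order and the new pair is appended at the end, so all other columns stay byte-identical to the source line.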

Repository structure

  • src/sst/
    • source SST CoNLL-U files (train, dev, test, merged)
  • lexicon/
    • lexical resources (e.g., lexicon/sl_backchannels.txt)
  • scripts/
    • numbered workflow scripts (01 to 07)
    • see scripts/README.md for script-level details
  • docs/
    • process documentation
    • docs/BACKCHANNELS_EXTRACTION.md
    • docs/COCONSTRUCTIONS_EXTRACTION.md
  • output/sst/
    • extracted candidates and generated annotated corpora
    • final package: output/sst/final_bc_coco/

End-to-end workflow

  1. Merge corpus
     python3 scripts/01_merge_sst.py
  2. Extract backchannel candidates
     python3 scripts/02_extract_backchannel_candidates.py
  3. Apply backchannels (uses filtered rows from the candidate table)
     python3 scripts/03_apply_backchannel_annotations.py
  4. Extract coconstruction candidates
     python3 scripts/04_extract_coconstruction_candidates.py
  5. Manual coconstruction annotation (outside script)
     • fill is_coconstruction, coconstruct_deprel, governor_token_id for YES cases
  6. Apply coconstructions
     python3 scripts/05_apply_coconstruction_annotations.py
  7. Split final merged file back to train/dev/test
     python3 scripts/06_split_final_corpus.py
  8. Run strict diff checks
     python3 scripts/07_diffcheck_final_vs_src.py
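
The scripted steps above could be chained with a small runner such as the sketch below. The runner is not part of the repository; the names BEFORE_MANUAL, AFTER_MANUAL, build_commands, and run_steps are hypothetical. It assumes each script takes no arguments (as in the commands above), is run from the repository root, and exits with a non-zero status on failure. The manual annotation step sits between the two phases, so they are run separately.

```python
import subprocess
import sys

# Scripts that run before the manual coconstruction annotation
BEFORE_MANUAL = [
    "scripts/01_merge_sst.py",
    "scripts/02_extract_backchannel_candidates.py",
    "scripts/03_apply_backchannel_annotations.py",
    "scripts/04_extract_coconstruction_candidates.py",
]

# Scripts that run once the candidate table has been annotated by hand
AFTER_MANUAL = [
    "scripts/05_apply_coconstruction_annotations.py",
    "scripts/06_split_final_corpus.py",
    "scripts/07_diffcheck_final_vs_src.py",
]


def build_commands(scripts, python=sys.executable):
    """Turn script paths into argument lists suitable for subprocess.run."""
    return [[python, script] for script in scripts]


def run_steps(scripts):
    """Run scripts in order; stop at the first one that exits non-zero."""
    for cmd in build_commands(scripts):
        print("running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(cmd, check=True)


# Usage, from the repository root:
#   run_steps(BEFORE_MANUAL)   # then annotate the candidate table by hand
#   run_steps(AFTER_MANUAL)
```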

Main docs

  • Backchannels workflow: docs/BACKCHANNELS_EXTRACTION.md
  • Coconstructions workflow + manual annotation: docs/COCONSTRUCTIONS_EXTRACTION.md
  • Docs index: docs/README.md
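
Before applying the manual coconstruction annotations, it may help to check that every YES row has the other two fields filled in. The sketch below is one way to do that. It assumes the candidate table is a tab-separated file with a header row and that positive cases are marked with the literal value YES; the column names come from the workflow above, while the function find_incomplete_rows is hypothetical. The file format is described in docs/COCONSTRUCTIONS_EXTRACTION.md and may differ.

```python
import csv
import io

# Columns that must be non-empty when is_coconstruction is YES
REQUIRED_FOR_YES = ("coconstruct_deprel", "governor_token_id")


def find_incomplete_rows(handle, delimiter="\t"):
    """Return (row_number, missing_columns) for YES rows with empty fields."""
    problems = []
    reader = csv.DictReader(handle, delimiter=delimiter)
    # Row 1 is the header, so data rows start at 2
    for row_number, row in enumerate(reader, start=2):
        if (row.get("is_coconstruction") or "").strip().upper() != "YES":
            continue
        missing = [c for c in REQUIRED_FOR_YES if not (row.get(c) or "").strip()]
        if missing:
            problems.append((row_number, missing))
    return problems


# Hypothetical three-row table: the second data row lacks a deprel
sample = (
    "is_coconstruction\tcoconstruct_deprel\tgovernor_token_id\n"
    "YES\tobj\t3\n"
    "YES\t\t5\n"
    "NO\t\t\n"
)
problems = find_incomplete_rows(io.StringIO(sample))
```

For a real file, pass an open handle instead, for example `open(path, newline="", encoding="utf-8")`.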

Current final output location

  • output/sst/final_bc_coco/conllu/sl_sst-ud-merged.conllu
  • output/sst/final_bc_coco/conllu/sl_sst-ud-train.conllu
  • output/sst/final_bc_coco/conllu/sl_sst-ud-dev.conllu
  • output/sst/final_bc_coco/conllu/sl_sst-ud-test.conllu
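
A quick way to confirm that all four files are in place after a run is sketched below. The helper names expected_outputs and missing_outputs are hypothetical, and the check assumes it is executed from the repository root.

```python
from pathlib import Path

FINAL_DIR = Path("output/sst/final_bc_coco/conllu")
SPLITS = ("merged", "train", "dev", "test")


def expected_outputs(base=FINAL_DIR):
    """List the four final CoNLL-U paths named in this README."""
    return [Path(base) / f"sl_sst-ud-{split}.conllu" for split in SPLITS]


def missing_outputs(base=FINAL_DIR):
    """Return the expected paths that do not exist as regular files."""
    return [path for path in expected_outputs(base) if not path.is_file()]


# Usage, from the repository root:
#   for path in missing_outputs():
#       print("missing:", path)
```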
