TODO #1

@pdasigi

High priority:

  • Look into MOCHA and finalize how much of it we can use for evaluation.
  • If we decide we need additional human annotations like MOCHA's, finalize the annotation scheme.
  • Finalize OOD evaluation sets for each of the datasets we are training quality estimators for (MS MARCO, NarrativeQA, Qasper).

Medium priority:

  • Augment training data for the (question, context, prediction) -> F1 score model and see if we can do better on Qasper (see the sketch after this list).
  • Train an LSTM with attention over final-layer features -> F1 score for Qasper.
  • Train the QCP -> F1 model on MS MARCO and NarrativeQA.
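
A minimal sketch of what the (question, context, prediction) -> F1 regressor could look like, assuming a Hugging Face transformer encoder. The class name `QCPF1Regressor`, the `roberta-base` encoder, and the regression head are illustrative assumptions, not the project's actual model.

```python
# Illustrative (question, context, prediction) -> F1 regressor.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class QCPF1Regressor(nn.Module):
    def __init__(self, encoder_name="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Regression head: pooled representation -> scalar in [0, 1].
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # first-token pooling
        return self.head(pooled).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
question, context, prediction = "Who wrote it?", "The paper was written by X.", "X"
# Crude concatenation of question and prediction against the context, for illustration.
inputs = tokenizer(question + " " + prediction, context,
                   truncation=True, return_tensors="pt")
model = QCPF1Regressor()
predicted_f1 = model(inputs["input_ids"], inputs["attention_mask"])
# Train with MSE against the token-level F1 of each (prediction, reference) pair.
loss = nn.MSELoss()(predicted_f1, torch.tensor([0.8]))
```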

Low priority:

  • Include LM score as an additional feature in the MLP quality estimator (see the sketch below).
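
A minimal sketch of how an LM score could be appended to the input of the MLP quality estimator. The feature dimension, hidden size, and treating the LM score as a per-example log-probability are illustrative assumptions.

```python
# Illustrative MLP quality estimator with an extra LM-score feature.
import torch
import torch.nn as nn

class MLPQualityEstimator(nn.Module):
    def __init__(self, feature_dim, hidden_dim=128):
        super().__init__()
        # +1 input dimension for the appended language-model score.
        self.mlp = nn.Sequential(nn.Linear(feature_dim + 1, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, features, lm_score):
        # lm_score: per-example log-probability of the predicted answer.
        x = torch.cat([features, lm_score.unsqueeze(-1)], dim=-1)
        return self.mlp(x).squeeze(-1)

estimator = MLPQualityEstimator(feature_dim=768)
features = torch.randn(4, 768)                      # existing estimator features
lm_score = torch.tensor([-3.2, -1.1, -7.5, -0.4])   # LM log-probs of the predictions
quality = estimator(features, lm_score)
```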

After we have an evaluation setup:

  • Compare different target metrics for quality estimation (e.g., is regressing against ROUGE better than regressing against F1?). The evaluation metric will be rank correlation (possibly binned) against human annotations; see the sketch after this list.
  • Compare multi-tasking on regressing against several metrics with calibrating on individual metrics, again evaluated by rank correlation against human annotations.
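
A minimal sketch of how the rank-correlation evaluation could be computed against human annotations. Spearman correlation, the bin edges, and the 1-5 annotation scale are illustrative assumptions.

```python
# Illustrative evaluation: rank correlation between estimator scores and
# human annotations, optionally after binning the annotations.
import numpy as np
from scipy.stats import spearmanr

def rank_correlation(estimator_scores, human_scores, bins=None):
    human = np.asarray(human_scores, dtype=float)
    if bins is not None:
        # Binning coarsens the human annotations before correlating.
        human = np.digitize(human, bins)
    corr, _ = spearmanr(estimator_scores, human)
    return corr

est = [0.91, 0.40, 0.75, 0.10]   # quality estimator outputs
ann = [5, 2, 4, 1]               # e.g. 1-5 human quality ratings
print(rank_correlation(est, ann))               # unbinned
print(rank_correlation(est, ann, bins=[2, 4]))  # binned into low / mid / high
```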
