High priority:
- Look into MOCHA and finalize how much we can use it for evaluation.
- If we decide we need additional human annotations like MOCHA's, finalize the annotation scheme.
- Finalize OOD evaluation sets for each of the datasets we are training quality estimators for (MS MARCO, NarrativeQA, Qasper).
Medium priority:
- Augment training data for the (question, context, prediction) -> F1 score model and see if we can do better on Qasper
- LSTM with attention over final-layer features -> F1 score for Qasper (see the sketch after this list)
- Train QCP -> F1 model on MS MARCO and NarrativeQA
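
A minimal PyTorch sketch of the LSTM-with-attention regressor mentioned above, assuming the "final layer features" are per-token hidden states from the QA model; the class name, dimensions, and pooling choices are placeholders, not the actual implementation:

```python
import torch
import torch.nn as nn

class LstmAttentionF1Regressor(nn.Module):
    """Regress predicted-answer quality (F1) from per-token final-layer
    features of the QA model. Shapes and names are assumptions."""

    def __init__(self, feature_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # per-token attention scores
        self.out = nn.Linear(2 * hidden_dim, 1)    # scalar F1 estimate

    def forward(self, features: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # features: (batch, seq_len, feature_dim); mask: (batch, seq_len) bool
        encoded, _ = self.lstm(features)
        scores = self.attn(encoded).squeeze(-1)
        scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)            # attention over tokens
        pooled = (weights.unsqueeze(-1) * encoded).sum(dim=1)
        return torch.sigmoid(self.out(pooled)).squeeze(-1)  # F1 in [0, 1]
```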
Low priority:
- Include the LM score as an additional feature in the MLP quality estimator (see the sketch below)
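
A sketch of how the LM score could be concatenated onto the MLP quality estimator's input, assuming the estimator already consumes a pooled feature vector for the (question, context, prediction) tuple; the feature layout and score normalization are assumptions:

```python
import torch
import torch.nn as nn

class MlpQualityEstimator(nn.Module):
    """MLP over pooled (question, context, prediction) features, with the
    language-model score of the prediction appended as one extra input."""

    def __init__(self, feature_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim + 1, hidden_dim),  # +1 for the LM score
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features: torch.Tensor, lm_score: torch.Tensor) -> torch.Tensor:
        # features: (batch, feature_dim); lm_score: (batch,),
        # e.g. a length-normalized log probability of the prediction
        x = torch.cat([features, lm_score.unsqueeze(-1)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)  # predicted F1 in [0, 1]
```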
After we have an evaluation setup:
- Compare different target metrics for quality estimation (e.g., is regressing against ROUGE better than regressing against F1?). The evaluation metric will be rank correlation (possibly binned) against human annotations; see the sketch after this list.
- Compare multi-task regression against all of the metrics with calibrating on individual metrics, also evaluated by rank correlation against human annotations.
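
A minimal sketch of the proposed evaluation using Spearman's rank correlation from scipy; the optional equal-width binning of the human scores is one possible reading of "possibly binned" and is an assumption:

```python
import numpy as np
from scipy.stats import spearmanr

def rank_correlation(estimator_scores, human_scores, n_bins=None):
    """Spearman rank correlation between quality-estimator scores and human
    annotations; optionally bin the human scores first (e.g. MOCHA-style
    judgments collapsed into coarser bins) before ranking."""
    human = np.asarray(human_scores, dtype=float)
    if n_bins is not None:
        # Equal-width bins over the observed range of human scores.
        edges = np.linspace(human.min(), human.max(), n_bins + 1)[1:-1]
        human = np.digitize(human, edges)
    rho, p_value = spearmanr(estimator_scores, human)
    return rho, p_value
```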