High priority:
- Look into MOCHA and finalize how much we can use it for evaluation.
- If we decide we need additional human annotations like MOCHA's, finalize the annotation scheme.
- Finalize OOD evaluation sets for each of the datasets we are training quality estimators for (MS MARCO, NarrativeQA, Qasper).
Medium priority:
- Augment training data for the (question, context, prediction) -> F1 score model and see if we can do better on Qasper
- LSTM with attention over final-layer features -> F1 score for Qasper (see the sketch after this list)
- Train QCP -> F1 model on MS MARCO and NarrativeQA
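
A minimal PyTorch sketch of the LSTM-with-attention regressor mentioned above, assuming the "final layer features" are per-token hidden states from the QA model; the class name, dimensions, and pooling choices are placeholders, not the actual implementation:

```python
import torch
import torch.nn as nn

class LstmAttentionF1Regressor(nn.Module):
    """Regress predicted-answer quality (F1) from per-token final-layer
    features of the QA model. Shapes and names are assumptions."""

    def __init__(self, feature_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # per-token attention scores
        self.out = nn.Linear(2 * hidden_dim, 1)    # scalar F1 estimate

    def forward(self, features: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # features: (batch, seq_len, feature_dim); mask: (batch, seq_len) bool
        encoded, _ = self.lstm(features)
        scores = self.attn(encoded).squeeze(-1)
        scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)            # attention over tokens
        pooled = (weights.unsqueeze(-1) * encoded).sum(dim=1)
        return torch.sigmoid(self.out(pooled)).squeeze(-1)  # F1 in [0, 1]
```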
Low priority:
- Include the LM score as an additional feature in the MLP quality estimator (see the sketch below)
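
A sketch of how the LM score could be concatenated onto the MLP quality estimator's input, assuming the estimator already consumes a pooled feature vector for the (question, context, prediction) tuple; the feature layout and score normalization are assumptions:

```python
import torch
import torch.nn as nn

class MlpQualityEstimator(nn.Module):
    """MLP over pooled (question, context, prediction) features, with the
    language-model score of the prediction appended as one extra input."""

    def __init__(self, feature_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim + 1, hidden_dim),  # +1 for the LM score
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features: torch.Tensor, lm_score: torch.Tensor) -> torch.Tensor:
        # features: (batch, feature_dim); lm_score: (batch,),
        # e.g. a length-normalized log probability of the prediction
        x = torch.cat([features, lm_score.unsqueeze(-1)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)  # predicted F1 in [0, 1]
```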
After we have an evaluation setup:
- Compare different target metrics for quality estimation (e.g., is regressing against ROUGE better than regressing against F1?). The evaluation metric will be rank correlation (possibly binned) against human annotations; see the sketch after this list.
- Compare multi-task regression against all of the metrics with calibrating on individual metrics, also evaluated by rank correlation against human annotations.
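
A minimal sketch of the proposed evaluation using Spearman's rank correlation from scipy; the optional equal-width binning of the human scores is one possible reading of "possibly binned" and is an assumption:

```python
import numpy as np
from scipy.stats import spearmanr

def rank_correlation(estimator_scores, human_scores, n_bins=None):
    """Spearman rank correlation between quality-estimator scores and human
    annotations; optionally bin the human scores first (e.g. MOCHA-style
    judgments collapsed into coarser bins) before ranking."""
    human = np.asarray(human_scores, dtype=float)
    if n_bins is not None:
        # Equal-width bins over the observed range of human scores.
        edges = np.linspace(human.min(), human.max(), n_bins + 1)[1:-1]
        human = np.digitize(human, edges)
    rho, p_value = spearmanr(estimator_scores, human)
    return rho, p_value
```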