JongSuk1/KorQuad

Language Models Are Specialists: Rethinking Fine-tuning Language Models from Diverse Sources

This repository implements the NLP task project conducted at the KAIST School of Computing for CS492(H): Special Topics in Computer Science: with NAVER.

We used the KorQuad-open dataset, collected from NAVER wiki, blog, web, kin, and news sources. The whole project was run using only NSML resources, and we thank NSML for providing the GPU resources.

Customize open_squad and open_squad_metric

Multiple Paragraph

In the general SQuAD dataset, each QA pair maps to exactly one paragraph. In this dataset, however, one QA pair comes with multiple paragraphs, so multiple SQuAD examples must be created per QA pair. We implemented this by modifying the existing official code. Because this produced too many SQuAD examples, we also limit the number of examples created per QA pair.
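As a rough illustration of the idea above, the following sketch expands one QA pair into several examples and caps their number. It is not the code in open_squad; the field names (`id`, `question`, `answer`, `paragraphs`, `text`), the `QAExample` class, and the `max_per_qa` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QAExample:
    qa_id: str
    question: str
    context: str
    answer_text: str  # empty string means "no answer" in this paragraph


def build_examples(qa: Dict, max_per_qa: int = 3) -> List[QAExample]:
    """Create one example per paragraph, keeping at most max_per_qa of them."""
    examples = []
    # Hypothetical cap: keep only the first max_per_qa paragraphs.
    for paragraph in qa["paragraphs"][:max_per_qa]:
        text = paragraph["text"]
        # A paragraph that does not contain the answer becomes a "no answer" example.
        answer = qa["answer"] if qa["answer"] and qa["answer"] in text else ""
        examples.append(QAExample(qa["id"], qa["question"], text, answer))
    return examples
```

With this layout, a QA pair that has four paragraphs and `max_per_qa=3` yields three examples, and only the paragraphs that contain the answer string are treated as 'Has Answer'.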

Use Only Majority Class

We found that the minority classes are mostly not useful and prevent the model from being optimized well when they are included in training. We therefore added an option for choosing which source to train on. If you activate the --only_wiki option in the run_nsml shell file, training uses only the wiki source. We reached the best accuracy with this option.
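The effect of such an option can be sketched as a simple filter over paragraphs. This is only an assumption about how the data could be organized: the `source` key, the `"wiki"` label, and the `filter_paragraphs` helper are hypothetical and may not match the repository's code.

```python
from typing import Dict, List


def filter_paragraphs(paragraphs: List[Dict], only_wiki: bool = False) -> List[Dict]:
    """Return all paragraphs, or only the wiki ones when only_wiki is set."""
    if not only_wiki:
        return list(paragraphs)
    # Hypothetical source label; the real dataset may name it differently.
    return [p for p in paragraphs if p.get("source") == "wiki"]
```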

Number of paragraphs on inference

The number of paragraphs used at inference significantly affects performance, as shown below. The highest accuracy is reached with 5 paragraphs, so inference uses 5 paragraphs by default.

| Model \ Number of paragraphs | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Electra with Wiki | 62.82 | 67.44 | 69.47 | 70.31 | 70.98 | 50.29 | 50.37 |
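One way to picture the paragraph limit at inference is the sketch below. It assumes that paragraphs arrive in a fixed order and that the reader returns an (answer, score) pair per paragraph; `predict`, `score_fn`, and `max_paragraphs` are hypothetical names, not the logic in open_squad.py.

```python
from typing import Callable, List, Tuple


def predict(
    paragraphs: List[str],
    score_fn: Callable[[str], Tuple[str, float]],
    max_paragraphs: int = 5,
) -> str:
    """Read at most max_paragraphs paragraphs and keep the best-scoring answer."""
    best_text, best_score = "", float("-inf")
    for paragraph in paragraphs[:max_paragraphs]:
        text, score = score_fn(paragraph)  # assumed reader output: (answer, score)
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```

Under these assumptions, raising `max_paragraphs` lets more candidate answers compete, which can either surface the right answer or admit a confident wrong one.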

Final Results

The final results, including the ensemble method, are as follows.

| Model | Single Model | Ensemble |
|---|---|---|
| Full | 68.40 (±0.91) | 70.48 |
| Wiki | 70.54 (±0.36) | 72.16 |

Dataset

Distribution

Distribution of paragraphs per source. Different colors indicate the position of the paragraphs, and the distribution is very imbalanced. Also, there are many more 'No Answer' questions than 'Has Answer' questions.

Example

Question: 에버그린이 왕관을 사용하기 전 실험실을 엉망으로 만든 괴물은?
Answer: 마그마
“kdc”: 하지만 왕관을 사용하기 전, 코끼리 같이 생긴 마그마 괴물이 들어닥치고 실험실을 엉망으로 만든다. 
건물이 무너지고 잔해에 깔린 에버그린은 자신이 왕관을 쓸 수 없게 되자, 건터에게 왕관을 쓴 뒤 마음 깊은 
속에 있던 소원(에버그린에게는 ‘혜성을 파괴하는 것’)을 빌라고...

“kin”: 러이트 노벨에 대해선 거의 매니아지만 더 다양한것이 읽고싶네요. 지금까지 읽은건 데어라 내청코 
내여귀 에망선 주문토끼 냐루코 사쿠라장 중2코이 중고코이 나에게 천사 귀여우면 변태라도 간단하게 이정도입니다 
러브코미디.하렘물을 좋아하고 되도록이면 그런쪽으로 추천 받으면 좋겠네요. 최근 나오고 있는것이면 더 좋고 
옛날것도 좋습니다... 이능을 사용하며 평범한 일상을 보내는 모습을 그린 일상물 작품이지요...

Train and Inference in NSML

Train Model

In this project, we tested two types of models: run_squad.py and run_squad_multihead.py. You can use either a single head or a separate head for each source (multi-head). To train the multi-head model, change run_squad.py to run_squad_multihead.py in the run_nsml.sh file and delete the only_wiki option.

> sh run_nsml.sh

> nsml submit {SESSION NAME} {CHECKPOINT} #submit directly

Inference

Ensemble

You can run inference with an ensemble of models. Choose three trained models in NSML and set their checkpoint{i} and session{i} (i = 1, 2, 3) in submit_ensemble.sh. Note that the ensemble is supported only for single-head models.

> sh submit_ensemble.sh

> nsml submit {SESSION NAME} total_best 
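The README does not state how the three models are combined. As one plausible illustration only, the sketch below averages per-answer scores across models and picks the highest; the score dictionaries and the `ensemble_answer` helper are hypothetical, and the actual submit_ensemble.sh pipeline may work differently.

```python
from collections import defaultdict
from typing import Dict, List


def ensemble_answer(model_predictions: List[Dict[str, float]]) -> str:
    """Average each candidate answer's score over all models and return the best."""
    totals: Dict[str, float] = defaultdict(float)
    for predictions in model_predictions:
        for answer, score in predictions.items():
            totals[answer] += score
    if not totals:
        return ""
    n_models = len(model_predictions)
    # An answer missing from a model contributes a score of 0 for that model.
    return max(totals, key=lambda answer: totals[answer] / n_models)
```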

Single Model

You can also run inference with a single model.

> sh submit.sh

> nsml submit {SESSION NAME} best 

To control the number of paragraphs used at inference for a trained model, edit line 607 in open_squad.py and use the submit shell file.

Original Author

Seonhoon Kim (Naver)

About

Multiple paragraph for one QA problem. (KorQuad dataset collected from NAVER, project conducted in KAIST CS492H)
