forked from Marker-Inc-Korea/AutoRAG
Finish new data creation documentation (Marker-Inc-Korea#711)
* add answer generation docs
* complete qa creation documentation
* finish documentation of beta version data creation

Co-authored-by: jeffrey <vkefhdl1@gmail.com>
Showing 7 changed files with 343 additions and 59 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,165 @@ | ||
# QA creation | ||
|
||
In this section, we will cover how to create QA data for AutoRAG. | ||
|
||
Creating good QA data is a crucial step: if the QA data is bad, the RAG pipeline will not be optimized well. | ||
|
||
## Overview | ||
|
||
A sample QA creation pipeline looks like this. | ||
|
||
```python | ||
from llama_index.llms.openai import OpenAI | ||
|
||
from autorag.data.beta.filter.dontknow import dontknow_filter_rule_based | ||
from autorag.data.beta.generation_gt.llama_index_gen_gt import ( | ||
make_basic_gen_gt, | ||
make_concise_gen_gt, | ||
) | ||
from autorag.data.beta.query.llama_gen_query import factoid_query_gen | ||
from autorag.data.beta.sample import random_single_hop | ||
|
||
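llm = OpenAI() | ||
# `initial_raw` is not defined in this snippet; it is assumed to be the raw data object prepared beforehand | ||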
initial_corpus = initial_raw.chunk( | ||
"llama_index_chunk", chunk_method="token", chunk_size=128, chunk_overlap=5 | ||
) | ||
initial_qa = ( | ||
initial_corpus.sample(random_single_hop, n=3) | ||
.map( | ||
lambda df: df.reset_index(drop=True), | ||
) | ||
.make_retrieval_gt_contents() | ||
.batch_apply( | ||
factoid_query_gen, # query generation | ||
llm=llm, | ||
) | ||
.batch_apply( | ||
make_basic_gen_gt, # answer generation (basic) | ||
llm=llm, | ||
) | ||
.batch_apply( | ||
make_concise_gen_gt, # answer generation (concise) | ||
llm=llm, | ||
) | ||
.filter( | ||
dontknow_filter_rule_based, # filter don't know | ||
lang="en", | ||
) | ||
) | ||
|
||
initial_qa.to_parquet('./qa.parquet', './corpus.parquet') | ||
``` | ||
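The pipeline above chains `map`, `batch_apply`, and `filter` calls, each returning a new object. The following is a minimal standard-library sketch of that chaining pattern, not AutoRAG's implementation. The `TinyQA` class, the `batch_size` parameter, the `fake_query_gen` function, and the assumption that `batch_apply` runs an async per-row function in batches are all hypothetical and used for illustration only.

```python
import asyncio


class TinyQA:
    """Hypothetical stand-in for the QA object; holds rows as a list of dicts."""

    def __init__(self, rows):
        self.rows = rows

    def map(self, fn):
        # Transform the whole row collection at once.
        return TinyQA(fn(self.rows))

    def batch_apply(self, fn, batch_size=2, **kwargs):
        # Run an async per-row function, batch_size rows at a time.
        async def run():
            results = []
            for start in range(0, len(self.rows), batch_size):
                batch = self.rows[start:start + batch_size]
                results.extend(
                    await asyncio.gather(*(fn(row, **kwargs) for row in batch))
                )
            return results

        return TinyQA(asyncio.run(run()))

    def filter(self, fn, **kwargs):
        # Keep only the rows for which the predicate returns True.
        return TinyQA([row for row in self.rows if fn(row, **kwargs)])


async def fake_query_gen(row, prefix):
    # Placeholder for an LLM call; it only builds a string.
    return {**row, "query": f"{prefix} {row['contents']}?"}


qa = (
    TinyQA([{"contents": "alpha"}, {"contents": ""}, {"contents": "gamma"}])
    .filter(lambda row: bool(row["contents"]))
    .batch_apply(fake_query_gen, prefix="What is")
)
```

Because every method returns a new object, steps can be added, removed, or reordered without touching the others.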
|
||
### 1. Sample retrieval gt | ||
|
||
To create questions and answers, you first have to sample the retrieval gt from the corpus data. | ||
First, chunk the raw data to get the initial corpus. | ||
Then, sample from it using the `sample` method. | ||
|
||
```python | ||
from autorag.data.beta.sample import random_single_hop | ||
|
||
qa = initial_corpus.sample(random_single_hop, n=3).map( | ||
lambda df: df.reset_index(drop=True), | ||
) | ||
``` | ||
|
||
You can change the sampling method by passing a different function. | ||
The supported methods are listed below. | ||
|
||
| Method | Description | | ||
|:-----------------:|:------------------------------------------:| | ||
| random_single_hop | Randomly sample one hop from the corpus | | ||
| range_single_hop | Sample single hop with range in the corpus | | ||
|
||
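To make the difference between the two methods concrete, here is a minimal standard-library sketch. It is not AutoRAG's implementation: the functions `sample_random` and `sample_range`, the `doc_id` and `contents` keys, the nested-list shape of `retrieval_gt`, and the reading of "range" as a set of positions in the corpus are assumptions made for illustration only.

```python
import random

# Toy corpus: each chunk has an id and its text (hypothetical layout).
corpus = [{"doc_id": f"doc-{i}", "contents": f"chunk {i}"} for i in range(10)]


def sample_random(corpus, n, seed=42):
    """Pick n distinct chunks at random; each becomes one single-hop retrieval gt."""
    rng = random.Random(seed)
    picked = rng.sample(corpus, n)
    # One-element inner list: a single-hop question is grounded in one chunk.
    return [{"retrieval_gt": [[row["doc_id"]]]} for row in picked]


def sample_range(corpus, idx_range):
    """Pick the chunks at the given positions instead of random ones."""
    return [{"retrieval_gt": [[corpus[i]["doc_id"]]]} for i in idx_range]


random_rows = sample_random(corpus, n=3)
range_rows = sample_range(corpus, range(2, 5))
```

Random sampling is useful for broad coverage of the corpus, while range-based sampling makes the selection reproducible and easy to inspect.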
|
||
### 2. Get retrieval gt contents to generate questions | ||
|
||
In the first step, you only sampled retrieval gt ids. To generate questions, however, you need the contents of the retrieval gt. | ||
To get them, use the `make_retrieval_gt_contents` method. | ||
|
||
```python | ||
qa = qa.make_retrieval_gt_contents() | ||
``` | ||
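Conceptually, this step is an id-to-text lookup. The sketch below shows the idea with plain dictionaries. It is an illustration under assumptions, not the library's code: the `attach_contents` helper, the `retrieval_gt_contents` key, and the nested-list shape of `retrieval_gt` are hypothetical.

```python
# Toy corpus keyed by chunk id (hypothetical layout).
corpus = {
    "doc-1": "Paris is the capital of France.",
    "doc-2": "The Nile flows north.",
}
qa_rows = [{"retrieval_gt": [["doc-2"]]}, {"retrieval_gt": [["doc-1"]]}]


def attach_contents(rows, corpus):
    """Return new rows with the text of every retrieval gt id attached."""
    out = []
    for row in rows:
        # Mirror the nesting of retrieval_gt so ids and texts stay aligned.
        contents = [
            [corpus[doc_id] for doc_id in group] for group in row["retrieval_gt"]
        ]
        out.append({**row, "retrieval_gt_contents": contents})
    return out


rows_with_contents = attach_contents(qa_rows, corpus)
```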
|
||
### 3. Generate queries | ||
|
||
Now, use an LLM to generate queries. | ||
In this example, we use the `factoid_query_gen` function to generate factoid questions. | ||
|
||
```python | ||
from llama_index.llms.openai import OpenAI | ||
|
||
from autorag.data.beta.query.llama_gen_query import factoid_query_gen | ||
|
||
llm = OpenAI() | ||
qa = qa.batch_apply( | ||
factoid_query_gen, # query generation | ||
llm=llm, | ||
) | ||
``` | ||
|
||
To learn about more query generation methods, see this [page](./query_gen.md). | ||
|
||
### 4. Generate answers | ||
|
||
After generating questions, you have to generate answers (generation gt). | ||
|
||
```python | ||
from llama_index.llms.openai import OpenAI | ||
|
||
from autorag.data.beta.generation_gt.llama_index_gen_gt import ( | ||
make_basic_gen_gt, | ||
make_concise_gen_gt, | ||
) | ||
|
||
llm = OpenAI() | ||
|
||
qa = qa.batch_apply( | ||
make_basic_gen_gt, # answer generation (basic) | ||
llm=llm, | ||
).batch_apply( | ||
make_concise_gen_gt, # answer generation (concise) | ||
llm=llm, | ||
) | ||
``` | ||
|
||
To learn about more answer generation methods, see this [page](./answer_gen.md). | ||
|
||
### 5. Filtering questions | ||
|
||
It is natural for an LLM to generate some bad questions. | ||
So it is better to filter out bad questions with classification models or LLMs. | ||
|
||
To filter, use the `filter` method. | ||
|
||
```python | ||
from autorag.data.beta.filter.dontknow import dontknow_filter_rule_based | ||
|
||
qa = qa.filter( | ||
dontknow_filter_rule_based, # filter don't know | ||
lang="en", | ||
) | ||
``` | ||
|
||
To learn about more filtering methods, see this [page](./filter.md). | ||
|
||
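A rule-based "don't know" filter can be as simple as a phrase match on the generated answers. The following sketch shows that idea and should not be read as the behavior of `dontknow_filter_rule_based`: the phrase list, the `is_answerable` helper, and the `generation_gt` list-of-strings layout are assumptions for illustration.

```python
# Hypothetical phrase list; a real filter would ship a longer one per language.
DONT_KNOW_PHRASES = {
    "en": ["i don't know", "i do not know", "cannot be answered"],
}


def is_answerable(row, lang="en"):
    """Return True when none of the row's answers contains a "don't know" phrase."""
    answers = [answer.lower() for answer in row["generation_gt"]]
    return not any(
        phrase in answer for answer in answers for phrase in DONT_KNOW_PHRASES[lang]
    )


rows = [
    {"query": "Where does the Nile flow?", "generation_gt": ["It flows north."]},
    {"query": "Who wrote it?", "generation_gt": ["I don't know."]},
]
kept = [row for row in rows if is_answerable(row)]
```

A rule-based check like this is cheap and deterministic, but it misses paraphrased refusals, which is where an LLM-based or classifier-based filter can help.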
### 6. Save the QA data | ||
|
||
Now you can save the QA data and the corpus as parquet files and use them to run AutoRAG. | ||
|
||
```python | ||
qa.to_parquet('./qa.parquet', './corpus.parquet') | ||
``` | ||
|
||
```{toctree} | ||
--- | ||
maxdepth: 1 | ||
--- | ||
query_gen.md | ||
answer_gen.md | ||
filter.md | ||
``` |