Commit
Finish new data creation documentation (Marker-Inc-Korea#711)
* add answer generation docs

* complete qa creation documentation

* finish documentation of beta version data creation

---------

Co-authored-by: jeffrey <vkefhdl1@gmail.com>
vkehfdl1 and jeffrey authored Sep 16, 2024
1 parent f095369 commit 57eafc9
Showing 7 changed files with 343 additions and 59 deletions.
14 changes: 1 addition & 13 deletions docs/source/data_creation/beta/data_creation.md
@@ -22,19 +22,7 @@ In this new data creation pipeline, we have three schemas: `Raw`, `QA`, and `Corpus`.
You can use the corpus to generate answers for the questions.
You have to build the corpus data from your documents using parsing and chunking.

### Functions for data creation

We provide some functions for customizing and running a data creation process.
Here are the basic concepts of each function.

### `QA` and `Corpus`

- `batch_apply`: Applies the function to each row of the dataset, running in parallel using `asyncio`.
You have to pass an `async` function to use it.
Plus, you can specify the batch size.
- `map`: Use this function to transform the underlying pd.DataFrame. The input function must take a pd.DataFrame as input and return a pd.DataFrame. A minimal sketch of both helpers follows below.
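
Here is a minimal sketch of both helpers. It assumes `qa_df` is your QA pd.DataFrame; the exact row type handed to the async function (a dict-like record here) and the `batch_size` keyword are assumptions based on the description above.

```python
import pandas as pd

from autorag.data.beta.schema import QA

# Hypothetical per-row async function; `batch_apply` requires async functions.
# Assumption: each row is passed as a dict-like record.
async def tag_language(row):
    row["lang"] = "en"  # purely illustrative annotation
    return row

qa = QA(qa_df)  # qa_df: your QA pd.DataFrame
qa = qa.batch_apply(tag_language, batch_size=16)  # rows run concurrently via asyncio
qa = qa.map(lambda df: df.reset_index(drop=True))  # pd.DataFrame in, pd.DataFrame out
```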


To see the data creation tutorial, check [here](./tutorial.md).

```{toctree}
---
46 changes: 42 additions & 4 deletions docs/source/data_creation/beta/qa_creation/answer_gen.md
@@ -1,8 +1,18 @@
# Answer Generation

## Overview
This step generates the 'generation ground truth', i.e., the reference answer.
It uses an LLM to generate the answer from the question and the given context (retrieval gt).

## Usage
The answer generation methods that can be used in AutoRAG are listed below.

1. [Basic Generation](#basic-generation)
2. [Concise Generation](#concise-generation)

## Basic Generation
This is just a basic generation of the answer.
It places no specific constraints on how the answer is generated.

### OpenAI

```python
from autorag.data.beta.schema import QA
from autorag.data.beta.generation_gt.openai_gen_gt import make_basic_gen_gt
from openai import AsyncOpenAI

client = AsyncOpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(make_basic_gen_gt, client=client)
```

### LlamaIndex

```python
from autorag.data.beta.schema import QA
from autorag.data.beta.generation_gt.llama_index_gen_gt import make_basic_gen_gt
from llama_index.llms.openai import OpenAI

llm = OpenAI()

qa = QA(qa_df)
result_qa = qa.batch_apply(make_basic_gen_gt, llm=llm)
```

## Concise Generation
This generates a concise answer.
Concise means the answer is short and clear, like a summary: usually just the word or phrase that answers the question.

### OpenAI

```python
from autorag.data.beta.schema import QA
from autorag.data.beta.generation_gt.openai_gen_gt import make_concise_gen_gt
from openai import AsyncOpenAI

client = AsyncOpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(make_concise_gen_gt, client=client)
```

### LlamaIndex

```python
from autorag.data.beta.schema import QA
from autorag.data.beta.generation_gt.llama_index_gen_gt import make_concise_gen_gt
from llama_index.llms.openai import OpenAI

llm = OpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(make_concise_gen_gt, llm=llm)
```
5 changes: 5 additions & 0 deletions docs/source/data_creation/beta/qa_creation/filter.md
@@ -4,6 +4,11 @@ After generating a QA dataset, you will want to filter some of the generation results.
Because LLMs are not perfect and make many mistakes while generating datasets,
it is good practice to use filtering methods to remove bad results.

The supported filtering methods are below.

1. [Rule-based Don't know Filter](#rule-based-dont-know-filter)
2. [LLM-based Don't know Filter](#llm-based-dont-know-filter)

# 1. Unanswerable question filtering

Sometimes the LLM generates questions that cannot be answered from the given passage.
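
As a preview, the rule-based variant is applied via the `filter` method on a `QA` instance (a minimal sketch following the usage shown in the QA creation guide):

```python
from autorag.data.beta.filter.dontknow import dontknow_filter_rule_based
from autorag.data.beta.schema import QA

qa = QA(qa_df)  # qa_df: your generated QA pd.DataFrame
filtered_qa = qa.filter(dontknow_filter_rule_based, lang="en")  # drops "don't know" answers
```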
165 changes: 165 additions & 0 deletions docs/source/data_creation/beta/qa_creation/qa_creation.md
@@ -0,0 +1,165 @@
# QA creation

In this section, we will cover how to create QA data for AutoRAG.

Creating good QA data is a crucial step, because if the QA data is bad, the RAG pipeline will not be optimized well.

## Overview

A sample QA creation pipeline looks like this.

```python
from llama_index.llms.openai import OpenAI

from autorag.data.beta.filter.dontknow import dontknow_filter_rule_based
from autorag.data.beta.generation_gt.llama_index_gen_gt import (
    make_basic_gen_gt,
    make_concise_gen_gt,
)
from autorag.data.beta.query.llama_gen_query import factoid_query_gen
from autorag.data.beta.sample import random_single_hop

llm = OpenAI()
# `initial_raw` is a Raw instance built from your parsed documents,
# e.g. Raw(pd.read_parquet('./parse.parquet'))
initial_corpus = initial_raw.chunk(
    "llama_index_chunk", chunk_method="token", chunk_size=128, chunk_overlap=5
)
initial_qa = (
    initial_corpus.sample(random_single_hop, n=3)
    .map(
        lambda df: df.reset_index(drop=True),
    )
    .make_retrieval_gt_contents()
    .batch_apply(
        factoid_query_gen,  # query generation
        llm=llm,
    )
    .batch_apply(
        make_basic_gen_gt,  # answer generation (basic)
        llm=llm,
    )
    .batch_apply(
        make_concise_gen_gt,  # answer generation (concise)
        llm=llm,
    )
    .filter(
        dontknow_filter_rule_based,  # filter don't know
        lang="en",
    )
)

initial_qa.to_parquet('./qa.parquet', './corpus.parquet')
```

### 1. Sample retrieval gt

To create questions and answers, you have to sample retrieval gt from the corpus data.
You can get the initial chunk data from the raw data,
and then sample it using the `sample` function.

```python
from autorag.data.beta.sample import random_single_hop

qa = initial_corpus.sample(random_single_hop, n=3).map(
    lambda df: df.reset_index(drop=True),
)
```

You can change the sampling method by passing a different function.
Supported methods are below; a usage sketch for `range_single_hop` follows the table.

| Method | Description |
|:-----------------:|:------------------------------------------:|
| random_single_hop | Randomly sample one hop from the corpus |
| range_single_hop | Sample single hop with range in the corpus |
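
For example, a hypothetical swap to `range_single_hop`; the `idx_range` keyword is an assumption here, so check the function signature before use.

```python
from autorag.data.beta.sample import range_single_hop

# Assumption: range_single_hop accepts a range of corpus indices (idx_range).
qa = initial_corpus.sample(range_single_hop, idx_range=range(0, 50)).map(
    lambda df: df.reset_index(drop=True),
)
```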


### 2. Get retrieval gt contents to generate questions

In the first step, you only sampled retrieval gt ids. But to generate questions, you need the contents of the retrieval gt.
To fetch them, use the `make_retrieval_gt_contents` function.

```python
qa = qa.make_retrieval_gt_contents()
```

### 3. Generate queries

Now, you use an LLM to generate queries.
In this example, we use the `factoid_query_gen` function to generate factoid questions.

```python
from llama_index.llms.openai import OpenAI

from autorag.data.beta.query.llama_gen_query import factoid_query_gen

llm = OpenAI()
qa = qa.batch_apply(
    factoid_query_gen,  # query generation
    llm=llm,
)
```

To learn more query generation methods, check this [page](./query_gen.md).

### 4. Generate answers

After generating questions, you have to generate answers (generation gt).

```python
from llama_index.llms.openai import OpenAI

from autorag.data.beta.generation_gt.llama_index_gen_gt import (
    make_basic_gen_gt,
    make_concise_gen_gt,
)

llm = OpenAI()

qa = qa.batch_apply(
    make_basic_gen_gt,  # answer generation (basic)
    llm=llm,
).batch_apply(
    make_concise_gen_gt,  # answer generation (concise)
    llm=llm,
)
```

To learn more answer generation methods, check this [page](./answer_gen.md).

### 5. Filtering questions

It is natural that an LLM generates some bad questions.
So, it is better to filter out bad questions with classification models or LLMs.

To filter, we use the `filter` method.

```python
from autorag.data.beta.filter.dontknow import dontknow_filter_rule_based

qa = qa.filter(
    dontknow_filter_rule_based,  # filter don't know
    lang="en",
)
```

To learn more filtering methods, check this [page](./filter.md).

### 6. Save the QA data

Save the QA and corpus data as parquet files; you can then use them to run AutoRAG.

```python
qa.to_parquet('./qa.parquet', './corpus.parquet')
```
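
After saving, you can sanity-check the files with pandas (a quick sketch, nothing AutoRAG-specific):

```python
import pandas as pd

qa_df = pd.read_parquet('./qa.parquet')
corpus_df = pd.read_parquet('./corpus.parquet')
print(len(qa_df), qa_df.columns.tolist())  # verify row count and columns
```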

```{toctree}
---
maxdepth: 1
---
query_gen.md
answer_gen.md
filter.md
```
32 changes: 2 additions & 30 deletions docs/source/data_creation/beta/qa_creation/query_gen.md
@@ -4,38 +4,10 @@ In this document, we will cover how to generate questions for the QA dataset.

## Overview

You can use the `batch_apply` function on a `QA` instance to generate questions.
Before generating questions, the `QA` must have the `qid` and `retrieval_gt` columns.
You can get those by using `sample` on the `Corpus` instance.

```python
from openai import AsyncOpenAI
import pandas as pd
from typing import List

from autorag.data.beta.schema import Raw, QA, Corpus
from autorag.data.beta.sample import random_single_hop
from autorag.data.beta.query.openai_gen_query import factoid_query_gen
from autorag.data.beta.generation_gt.openai_gen_gt import make_concise_gen_gt

openai_client = AsyncOpenAI()
parsing_result = Raw(pd.read_parquet('./parse.parquet'))
# `recursive_split` is a chunking function; import it from the chunking module you use.
initial_corpus = parsing_result.chunk(lambda data: recursive_split(data, chunk_size=128, chunk_overlap=24))
initial_qa = initial_corpus.sample(
    random_single_hop, n=50
).batch_apply(
    factoid_query_gen, client=openai_client, lang='ko'
).batch_apply(
    make_concise_gen_gt, client=openai_client, lang='ko'
)

# Make many corpus and QA instances from the same QA set.
corpus_list: List[Corpus] = parsing_result.chunk_pipeline('./chunk.yaml')
qa_list: List[QA] = list(map(lambda corpus: initial_qa.update_corpus(corpus), corpus_list))
for i, qa in enumerate(qa_list):
    qa.to_parquet(qa_path=f'./qa_{i}.parquet', corpus_path=f'./corpus_{i}.parquet')
```
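
The `chunk_pipeline` call above reads a YAML config. Below is a hypothetical `chunk.yaml` sketch; the exact schema (the `modules` and `module_type` keys and the value formats) is an assumption, so check the chunking documentation for the real format.

```yaml
# Hypothetical sketch of chunk.yaml; key names are assumptions.
modules:
  - module_type: llama_index_chunk
    chunk_method: token
    chunk_size: [ 128, 256 ]
    chunk_overlap: 24
```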

```{attention}
In the OpenAI version of data creation, you can use only 'gpt-4o-2024-08-06' and 'gpt-4o-mini-2024-07-18'.
If you want to use another model, use the llama_index version instead.
```

1. [Factoid](#1-factoid)
2. [Concept Completion](#2-concept-completion)
3. [Two-hop Incremental](#3-two-hop-incremental)


## 1. Factoid
