Commit
Finish new data creation documentation (Marker-Inc-Korea#711)
* add answer generation docs

* complete qa creation documentation

* finish documentation of beta version data creation

---------

Co-authored-by: jeffrey <vkefhdl1@gmail.com>
vkehfdl1 and jeffrey authored Sep 16, 2024
1 parent f095369 commit 57eafc9
Showing 7 changed files with 343 additions and 59 deletions.
14 changes: 1 addition & 13 deletions docs/source/data_creation/beta/data_creation.md
@@ -22,19 +22,7 @@ In this new data creation pipeline, we have three schemas: `Raw`, `QA`, and `Corpus`.
You can use the corpus to generate answers for the questions.
You have to build the corpus data from your documents using parsing and chunking.

### Functions for data creation

We provide some functions for customizing and running a data creation process.
Here are the basic concepts of each function.

### `QA` and `Corpus`

- `batch_apply`: Applies the function to each row of the dataset, running in parallel using `asyncio`.
You have to pass an `async` function to use it.
Plus, you can specify the batch size.
- `map`: Use this function to transform the underlying pd.DataFrame. The input function must take a pd.DataFrame as input and return a pd.DataFrame. A minimal sketch of both helpers follows below.
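
Here is a minimal sketch of both helpers. It assumes `qa_df` is your QA pd.DataFrame; the exact row type handed to the async function (a dict-like record here) and the `batch_size` keyword are assumptions based on the description above.

```python
import pandas as pd

from autorag.data.beta.schema import QA

# Hypothetical per-row async function; `batch_apply` requires async functions.
# Assumption: each row is passed as a dict-like record.
async def tag_language(row):
    row["lang"] = "en"  # purely illustrative annotation
    return row

qa = QA(qa_df)  # qa_df: your QA pd.DataFrame
qa = qa.batch_apply(tag_language, batch_size=16)  # rows run concurrently via asyncio
qa = qa.map(lambda df: df.reset_index(drop=True))  # pd.DataFrame in, pd.DataFrame out
```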


To see the data creation tutorial, check [here](./tutorial.md).

```{toctree}
---
46 changes: 42 additions & 4 deletions docs/source/data_creation/beta/qa_creation/answer_gen.md
@@ -1,8 +1,18 @@
# Answer Generation

## Overview
This step generates the 'generation ground truth', i.e., the reference answer.
It uses an LLM to generate the answer from the question and the given context (retrieval gt).

## Usage
The answer generation methods that can be used in AutoRAG are listed below.

1. [Basic Generation](#basic-generation)
2. [Concise Generation](#concise-generation)

## Basic Generation
This is just a basic generation of the answer.
It places no specific constraints on how the answer is generated.

### OpenAI

```python
from autorag.data.beta.schema import QA
from autorag.data.beta.generation_gt.openai_gen_gt import make_basic_gen_gt
from openai import AsyncOpenAI

client = AsyncOpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(make_basic_gen_gt, client=client)
```

### LlamaIndex

```python
from autorag.data.beta.schema import QA
from autorag.data.beta.generation_gt.llama_index_gen_gt import make_basic_gen_gt
from llama_index.llms.openai import OpenAI

llm = OpenAI()

qa = QA(qa_df)
result_qa = qa.batch_apply(make_basic_gen_gt, llm=llm)
```

## Concise Generation
This generates a concise answer.
Concise means the answer is short and clear, like a summary: usually just the word or phrase that answers the question.

### OpenAI

```python
from autorag.data.beta.schema import QA
from autorag.data.beta.generation_gt.openai_gen_gt import make_concise_gen_gt
from openai import AsyncOpenAI

client = AsyncOpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(make_concise_gen_gt, client=client)
```

### LlamaIndex

```python
from autorag.data.beta.schema import QA
from autorag.data.beta.generation_gt.llama_index_gen_gt import make_concise_gen_gt
from llama_index.llms.openai import OpenAI

llm = OpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(make_concise_gen_gt, llm=llm)
```
5 changes: 5 additions & 0 deletions docs/source/data_creation/beta/qa_creation/filter.md
@@ -4,6 +4,11 @@ After generating a QA dataset, you will want to filter some of the generation results.
Because LLMs are not perfect and make many mistakes while generating datasets,
it is good practice to use filtering methods to remove bad results.

The supported filtering methods are below.

1. [Rule-based Don't know Filter](#rule-based-dont-know-filter)
2. [LLM-based Don't know Filter](#llm-based-dont-know-filter)

# 1. Unanswerable question filtering

Sometimes the LLM generates questions that cannot be answered from the given passage.
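
As a preview, the rule-based variant is applied via the `filter` method on a `QA` instance (a minimal sketch following the usage shown in the QA creation guide):

```python
from autorag.data.beta.filter.dontknow import dontknow_filter_rule_based
from autorag.data.beta.schema import QA

qa = QA(qa_df)  # qa_df: your generated QA pd.DataFrame
filtered_qa = qa.filter(dontknow_filter_rule_based, lang="en")  # drops "don't know" answers
```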
165 changes: 165 additions & 0 deletions docs/source/data_creation/beta/qa_creation/qa_creation.md
@@ -0,0 +1,165 @@
# QA creation

In this section, we will cover how to create QA data for AutoRAG.

Creating good QA data is a crucial step, because if the QA data is bad, the RAG pipeline will not be optimized well.

## Overview

A sample QA creation pipeline looks like this.

```python
from llama_index.llms.openai import OpenAI

from autorag.data.beta.filter.dontknow import dontknow_filter_rule_based
from autorag.data.beta.generation_gt.llama_index_gen_gt import (
    make_basic_gen_gt,
    make_concise_gen_gt,
)
from autorag.data.beta.query.llama_gen_query import factoid_query_gen
from autorag.data.beta.sample import random_single_hop

llm = OpenAI()
# `initial_raw` is a Raw instance built from your parsed documents,
# e.g. Raw(pd.read_parquet('./parse.parquet'))
initial_corpus = initial_raw.chunk(
    "llama_index_chunk", chunk_method="token", chunk_size=128, chunk_overlap=5
)
initial_qa = (
    initial_corpus.sample(random_single_hop, n=3)
    .map(
        lambda df: df.reset_index(drop=True),
    )
    .make_retrieval_gt_contents()
    .batch_apply(
        factoid_query_gen,  # query generation
        llm=llm,
    )
    .batch_apply(
        make_basic_gen_gt,  # answer generation (basic)
        llm=llm,
    )
    .batch_apply(
        make_concise_gen_gt,  # answer generation (concise)
        llm=llm,
    )
    .filter(
        dontknow_filter_rule_based,  # filter don't know
        lang="en",
    )
)

initial_qa.to_parquet('./qa.parquet', './corpus.parquet')
```

### 1. Sample retrieval gt

To create questions and answers, you have to sample retrieval gt from the corpus data.
You can get the initial chunk data from the raw data,
and then sample it using the `sample` function.

```python
from autorag.data.beta.sample import random_single_hop

qa = initial_corpus.sample(random_single_hop, n=3).map(
    lambda df: df.reset_index(drop=True),
)
```

You can change the sampling method by passing a different function.
Supported methods are below; a usage sketch for `range_single_hop` follows the table.

| Method | Description |
|:-----------------:|:------------------------------------------:|
| random_single_hop | Randomly sample one hop from the corpus |
| range_single_hop | Sample single hop with range in the corpus |
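
For example, a hypothetical swap to `range_single_hop`; the `idx_range` keyword is an assumption here, so check the function signature before use.

```python
from autorag.data.beta.sample import range_single_hop

# Assumption: range_single_hop accepts a range of corpus indices (idx_range).
qa = initial_corpus.sample(range_single_hop, idx_range=range(0, 50)).map(
    lambda df: df.reset_index(drop=True),
)
```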


### 2. Get retrieval gt contents to generate questions

In the first step, you only sampled retrieval gt ids. But to generate questions, you need the contents of the retrieval gt.
To fetch them, use the `make_retrieval_gt_contents` function.

```python
qa = qa.make_retrieval_gt_contents()
```

### 3. Generate queries

Now, you use an LLM to generate queries.
In this example, we use the `factoid_query_gen` function to generate factoid questions.

```python
from llama_index.llms.openai import OpenAI

from autorag.data.beta.query.llama_gen_query import factoid_query_gen

llm = OpenAI()
qa = qa.batch_apply(
    factoid_query_gen,  # query generation
    llm=llm,
)
```

To learn more query generation methods, check this [page](./query_gen.md).

### 4. Generate answers

After generating questions, you have to generate answers (generation gt).

```python
from llama_index.llms.openai import OpenAI

from autorag.data.beta.generation_gt.llama_index_gen_gt import (
    make_basic_gen_gt,
    make_concise_gen_gt,
)

llm = OpenAI()

qa = qa.batch_apply(
    make_basic_gen_gt,  # answer generation (basic)
    llm=llm,
).batch_apply(
    make_concise_gen_gt,  # answer generation (concise)
    llm=llm,
)
```

To learn more answer generation methods, check this [page](./answer_gen.md).

### 5. Filtering questions

It is natural that an LLM generates some bad questions.
So, it is better to filter out bad questions with classification models or LLMs.

To filter, we use the `filter` method.

```python
from autorag.data.beta.filter.dontknow import dontknow_filter_rule_based

qa = qa.filter(
    dontknow_filter_rule_based,  # filter don't know
    lang="en",
)
```

To learn more filtering methods, check this [page](./filter.md).

### 6. Save the QA data

Save the QA and corpus data as parquet files; you can then use them to run AutoRAG.

```python
qa.to_parquet('./qa.parquet', './corpus.parquet')
```
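
After saving, you can sanity-check the files with pandas (a quick sketch, nothing AutoRAG-specific):

```python
import pandas as pd

qa_df = pd.read_parquet('./qa.parquet')
corpus_df = pd.read_parquet('./corpus.parquet')
print(len(qa_df), qa_df.columns.tolist())  # verify row count and columns
```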

```{toctree}
---
maxdepth: 1
---
query_gen.md
answer_gen.md
filter.md
```
32 changes: 2 additions & 30 deletions docs/source/data_creation/beta/qa_creation/query_gen.md
@@ -4,38 +4,10 @@ In this document, we will cover how to generate questions for the QA dataset.

## Overview

You can use the `batch_apply` function on a `QA` instance to generate questions.
Before generating questions, the `QA` must have the `qid` and `retrieval_gt` columns.
You can get those by using `sample` on the `Corpus` instance.

```python
from openai import AsyncOpenAI
import pandas as pd
from typing import List

from autorag.data.beta.schema import Raw, QA, Corpus
from autorag.data.beta.sample import random_single_hop
from autorag.data.beta.query.openai_gen_query import factoid_query_gen
from autorag.data.beta.generation_gt.openai_gen_gt import make_concise_gen_gt

openai_client = AsyncOpenAI()
parsing_result = Raw(pd.read_parquet('./parse.parquet'))
# `recursive_split` is a chunking function; import it from the chunking module you use.
initial_corpus = parsing_result.chunk(lambda data: recursive_split(data, chunk_size=128, chunk_overlap=24))
initial_qa = initial_corpus.sample(
    random_single_hop, n=50
).batch_apply(
    factoid_query_gen, client=openai_client, lang='ko'
).batch_apply(
    make_concise_gen_gt, client=openai_client, lang='ko'
)

# Make many corpus and QA instances from the same QA set.
corpus_list: List[Corpus] = parsing_result.chunk_pipeline('./chunk.yaml')
qa_list: List[QA] = list(map(lambda corpus: initial_qa.update_corpus(corpus), corpus_list))
for i, qa in enumerate(qa_list):
    qa.to_parquet(qa_path=f'./qa_{i}.parquet', corpus_path=f'./corpus_{i}.parquet')
```
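
The `chunk_pipeline` call above reads a YAML config. Below is a hypothetical `chunk.yaml` sketch; the exact schema (the `modules` and `module_type` keys and the value formats) is an assumption, so check the chunking documentation for the real format.

```yaml
# Hypothetical sketch of chunk.yaml; key names are assumptions.
modules:
  - module_type: llama_index_chunk
    chunk_method: token
    chunk_size: [ 128, 256 ]
    chunk_overlap: 24
```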

```{attention}
In the OpenAI version of data creation, you can use only 'gpt-4o-2024-08-06' and 'gpt-4o-mini-2024-07-18'.
If you want to use another model, use the llama_index version instead.
```

1. [Factoid](#1-factoid)
2. [Concept Completion](#2-concept-completion)
3. [Two-hop Incremental](#3-two-hop-incremental)


## 1. Factoid
