diff --git a/docs/source/data_creation/beta/data_creation.md b/docs/source/data_creation/beta/data_creation.md
index 6dc1f1217..775c13cf3 100644
--- a/docs/source/data_creation/beta/data_creation.md
+++ b/docs/source/data_creation/beta/data_creation.md
@@ -22,19 +22,7 @@ In this new data creation pipeline, we have three schemas. `Raw`, `QA`, and `Cor
 You can use the corpus to generate the answer for the question.
 You have to make corpus data from your documents using parsing and chunking.
 
-### Functions for data creation
-
-We provide some functions for customizing and running a data creation process.
-Here are the basic concepts of each function.
-
-### `QA` and `Corpus`
-
-- `batch_apply`: Apply the function to each row of the dataset, but run in parallel using `asyncio`.
-You have to use `async` functions for using this function.
-Plus, you can specify the batch size.
-- `map`: You can use this function to do something to its pd.DataFrame value. The input function must be get input of pd.DataFrame and return pd.DataFrame.
-
-
+To see the data creation tutorial, check [here](./tutorial.md).
 
 ```{toctree}
 ---
diff --git a/docs/source/data_creation/beta/qa_creation/answer_gen.md b/docs/source/data_creation/beta/qa_creation/answer_gen.md
index 0e902f0a2..80ffb6858 100644
--- a/docs/source/data_creation/beta/qa_creation/answer_gen.md
+++ b/docs/source/data_creation/beta/qa_creation/answer_gen.md
@@ -1,8 +1,18 @@
 # Answer Generation
 
-## Overview
+This step creates the 'generation ground truth.'
+It uses the LLM to generate the answer for the question and the given context (retrieval gt).
 
-## Usage
+The answer generation methods that can be used in AutoRAG are listed below.
+
+1. [Basic Generation](#basic-generation)
+2. [Concise Generation](#concise-generation)
+
+## Basic Generation
+This is just a basic generation for the answer.
+It does not have specific constraints on how it generates the answer.
+
+### OpenAI
 
 ```python
 from autorag.data.beta.schema import QA
@@ -14,7 +24,7 @@ qa = QA(qa_df)
 result_qa = qa.batch_apply(make_basic_gen_gt, client=client)
 ```
 
-Or using LlamaIndex
+### LlamaIndex
 
 ```python
 from autorag.data.beta.schema import QA
@@ -22,7 +32,35 @@ from autorag.data.beta.generation_gt.llama_index_gen_gt import make_basic_gen_gt
 from llama_index.llms.openai import OpenAI
 
 llm = OpenAI()
-
 qa = QA(qa_df)
 result_qa = qa.batch_apply(make_basic_gen_gt, llm=llm)
 ```
+
+## Concise Generation
+This is a concise generation for the answer.
+Concise means that the answer is short and clear, just like a summary.
+It is usually just a word that is the answer to the question.
+
+### OpenAI
+
+```python
+from autorag.data.beta.schema import QA
+from autorag.data.beta.generation_gt.openai_gen_gt import make_concise_gen_gt
+from openai import AsyncOpenAI
+
+client = AsyncOpenAI()
+qa = QA(qa_df)
+result_qa = qa.batch_apply(make_concise_gen_gt, client=client)
+```
+
+### LlamaIndex
+
+```python
+from autorag.data.beta.schema import QA
+from autorag.data.beta.generation_gt.llama_index_gen_gt import make_concise_gen_gt
+from llama_index.llms.openai import OpenAI
+
+llm = OpenAI()
+qa = QA(qa_df)
+result_qa = qa.batch_apply(make_concise_gen_gt, llm=llm)
+```
diff --git a/docs/source/data_creation/beta/qa_creation/filter.md b/docs/source/data_creation/beta/qa_creation/filter.md
index 6e53e6db4..d63c40730 100644
--- a/docs/source/data_creation/beta/qa_creation/filter.md
+++ b/docs/source/data_creation/beta/qa_creation/filter.md
@@ -4,6 +4,11 @@ After generating QA dataset, you want to filter some generation results.
 Because LLM is not perfect and has a lot of mistakes while generating datasets,
 it is good if you use some filtering methods to remove some bad results.
 
+The supported filtering methods are below.
+
+1. [Rule-based Don't know Filter](#rule-based-dont-know-filter)
+2. [LLM-based Don't know Filter](#llm-based-dont-know-filter)
+
 # 1. Unanswerable question filtering
 
 Sometimes LLM generates unanswerable questions from the given passage.
diff --git a/docs/source/data_creation/beta/qa_creation/qa_creation.md b/docs/source/data_creation/beta/qa_creation/qa_creation.md
index e69de29bb..4299a44de 100644
--- a/docs/source/data_creation/beta/qa_creation/qa_creation.md
+++ b/docs/source/data_creation/beta/qa_creation/qa_creation.md
@@ -0,0 +1,165 @@
+# QA creation
+
+In this section, we will cover how to create QA data for AutoRAG.
+
+Creating good QA data is a crucial step, because if the QA data is bad, the RAG will not be optimized well.
+
+## Overview
+
+The sample QA creation pipeline looks like this.
+
+```python
+from llama_index.llms.openai import OpenAI
+
+from autorag.data.beta.filter.dontknow import dontknow_filter_rule_based
+from autorag.data.beta.generation_gt.llama_index_gen_gt import (
+    make_basic_gen_gt,
+    make_concise_gen_gt,
+)
+from autorag.data.beta.query.llama_gen_query import factoid_query_gen
+from autorag.data.beta.sample import random_single_hop
+
+llm = OpenAI()
+initial_corpus = initial_raw.chunk(  # initial_raw is a Raw instance of the parsed data
+    "llama_index_chunk", chunk_method="token", chunk_size=128, chunk_overlap=5
+)
+initial_qa = (
+    initial_corpus.sample(random_single_hop, n=3)
+    .map(
+        lambda df: df.reset_index(drop=True),
+    )
+    .make_retrieval_gt_contents()
+    .batch_apply(
+        factoid_query_gen,  # query generation
+        llm=llm,
+    )
+    .batch_apply(
+        make_basic_gen_gt,  # answer generation (basic)
+        llm=llm,
+    )
+    .batch_apply(
+        make_concise_gen_gt,  # answer generation (concise)
+        llm=llm,
+    )
+    .filter(
+        dontknow_filter_rule_based,  # filter don't know
+        lang="en",
+    )
+)
+
+initial_qa.to_parquet('./qa.parquet', './corpus.parquet')
+```
+
+### 1. Sample retrieval gt
+
+To create questions and answers, you have to sample retrieval gt from the corpus data.
+You can get the initial chunk data from the raw data.
+And then sample it using the `sample` function.
+
+```python
+from autorag.data.beta.sample import random_single_hop
+
+qa = initial_corpus.sample(random_single_hop, n=3).map(
+    lambda df: df.reset_index(drop=True),
+)
+```
+
+You can change the sampling method by passing a different function.
+Supported methods are below.
+
+|      Method       |                Description                 |
+|:-----------------:|:------------------------------------------:|
+| random_single_hop |  Randomly sample one hop from the corpus   |
+| range_single_hop  | Sample single hop with range in the corpus |
+
+
+### 2. Get retrieval gt contents to generate questions
+
+In the first step, you only sample retrieval gt ids. But to generate questions, you have to get the contents of the retrieval gt.
+To achieve this, you can use the `make_retrieval_gt_contents` function.
+
+```python
+qa = qa.make_retrieval_gt_contents()
+```
+
+### 3. Generate queries
+
+Now, you use an LLM to generate queries.
+In this example, we use the `factoid_query_gen` function to generate factoid questions.
+
+```python
+from llama_index.llms.openai import OpenAI
+
+from autorag.data.beta.query.llama_gen_query import factoid_query_gen
+
+llm = OpenAI()
+qa = qa.batch_apply(
+    factoid_query_gen,  # query generation
+    llm=llm,
+)
+```
+
+To learn about more query generation methods, check this [page](./query_gen.md).
+
+### 4. Generate answers
+
+After generating questions, you have to generate answers (generation gt).
+
+```python
+from llama_index.llms.openai import OpenAI
+
+from autorag.data.beta.generation_gt.llama_index_gen_gt import (
+    make_basic_gen_gt,
+    make_concise_gen_gt,
+)
+
+llm = OpenAI()
+
+qa = qa.batch_apply(
+    make_basic_gen_gt,  # answer generation (basic)
+    llm=llm,
+).batch_apply(
+    make_concise_gen_gt,  # answer generation (concise)
+    llm=llm,
+)
+```
+
+To learn about more answer generation methods, check this [page](./answer_gen.md).
+
+### 5. Filtering questions
+
+It is natural that an LLM generates some bad questions.
+So, it is better to filter out bad questions with classification models or LLMs.
+
+To filter, use the `filter` method.
+
+```python
+from llama_index.llms.openai import OpenAI
+
+from autorag.data.beta.filter.dontknow import dontknow_filter_rule_based
+
+llm = OpenAI()
+qa = qa.filter(
+    dontknow_filter_rule_based,  # filter don't know
+    lang="en",
+    )
+```
+
+To learn about more filtering methods, check this [page](./filter.md).
+
+### 6. Save the QA data
+
+Now you can use the QA data for running AutoRAG.
+
+```python
+qa.to_parquet('./qa.parquet', './corpus.parquet')
+```
+
+```{toctree}
+---
+maxdepth: 1
+---
+query_gen.md
+answer_gen.md
+filter.md
+```
diff --git a/docs/source/data_creation/beta/qa_creation/query_gen.md b/docs/source/data_creation/beta/qa_creation/query_gen.md
index 3fe27a3b7..998584741 100644
--- a/docs/source/data_creation/beta/qa_creation/query_gen.md
+++ b/docs/source/data_creation/beta/qa_creation/query_gen.md
@@ -4,38 +4,10 @@ In this document, we will cover how to generate questions for the QA dataset.
 
 ## Overview
 
-You can use `basic_apply` function at `QA` instance to generate questions.
+You can use the `batch_apply` function of a `QA` instance to generate questions.
 Before generating a question, the `QA` must have the `qid` and `retrieval_gt` columns.
 You can get those to use `sample` at the `Corpus` instance.
 
-```python
-from openai import AsyncOpenAI
-import pandas as pd
-from typing import List
-
-from autorag.data.beta.schema import Raw, QA, Corpus
-from autorag.data.beta.sample import random_single_hop
-from autorag.data.beta.query.openai_gen_query import factoid_query_gen
-from autorag.data.beta.generation_gt.openai_gen_gt import make_concise_gen_gt
-
-openai_client = AsyncOpenAI()
-parsing_result = Raw(pd.read_parquet('./parse.parquet'))
-initial_corpus = parsing_result.chunk(lambda data: recursive_split(data, chunk_size=128, chunk_overlap=24))
-initial_qa = initial_corpus.sample(
-    random_single_hop, n=50
-).batch_apply(
-    factoid_query_gen, client=openai_client, lang='ko'
-).batch_apply(
-    make_concise_gen_gt, client=openai_client, lang='ko'
-)
-
-# Make many corpus and QA instances from the same QA set.
-corpus_list: List[Corpus] = parsing_result.chunk_pipeline('./chunk.yaml')
-qa_list: List[QA] = list(map(lambda corpus: initial_qa.update_corpus(corpus), corpus_list))
-for i, qa in enumerate(qa_list):
-    qa.to_parquet(qa_path=f'./qa_{i}.parquet', corpus_path=f'./corpus_{i}.parquet')
-```
-
 ```{attention}
 In OpenAI version of data creation, you can use only 'gpt-4o-2024-08-06' and 'gpt-4o-mini-2024-07-18'.
 If you want to use another model, use llama_index version instead.
@@ -45,7 +17,7 @@ If you want to use another model, use llama_index version instead.
 
 1. [Factoid](#1-factoid)
 2. [Concept Completion](#2-concept-completion)
-3. Two-hop Incremental
+3. [Two-hop Incremental](#3-two-hop-incremental)
 
 ## 1. Factoid
 
diff --git a/docs/source/data_creation/beta/tutorial.md b/docs/source/data_creation/beta/tutorial.md
index f0f8146fd..3e4c3bce8 100644
--- a/docs/source/data_creation/beta/tutorial.md
+++ b/docs/source/data_creation/beta/tutorial.md
@@ -1,4 +1,4 @@
-# Start creating your own evaluation data
+# Evaluation data creation tutorial
 
 ## Overview
 For the evaluation of RAGs we need data, but in most cases we have little or no satisfactory data.
@@ -10,27 +10,143 @@ The following guide covers how to use LLM to create data in a form that AutoRAG
 ---
 ![Data Creation](../../_static/data_creation.png)
 
-AutoRAG aims to work with Python’s ‘primitive data types’ for scalability and convenience.
+## 1. Parse
+You can make different parsing results from the raw data using the parsing YAML file.
+The sample parsing YAML file looks like this.
 
-Therefore, to use AutoRAG, you need to convert your raw data into `corpus data` and `qa data` to our [data format](./data_format.md).
+```yaml
+modules:
+  - module_type: langchain_parse
+    parse_method: [pdfminer, pdfplumber]
+```
 
-## 1. Parse
+With this YAML file, you can get the parsed data with pdfminer and pdfplumber.
 
-## 2. QA Creation
+You can execute this parsing YAML file by using the following code.
+
+```python
+from autorag.parser import Parser
+
+filepaths = "./data/*.pdf"
+parser = Parser(filepaths, "./parse_project_dir")
+parser.start_parsing("./parsing.yaml")
+```
 
-### Question Generation
+Then you can check out the parsing result in the `./parse_project_dir` directory.
 
-If you want to learn about more question generation type, check [here](./query_gen.md).
+For more details about parsing, please see [here](./parse/parse.md).
+## 2. QA Creation
 
 
-### Answer Creation
+From the parsed results, you can select the best parsed data for AutoRAG.
+After selecting it, you can create QA data for AutoRAG.
+
+The example is shown below; `initial_raw_df` is the selected raw data.
+
+```python
+from llama_index.llms.openai import OpenAI
+
+from autorag.data.beta.filter.dontknow import dontknow_filter_rule_based
+from autorag.data.beta.generation_gt.llama_index_gen_gt import (
+    make_basic_gen_gt,
+    make_concise_gen_gt,
+)
+from autorag.data.beta.query.llama_gen_query import factoid_query_gen
+from autorag.data.beta.sample import random_single_hop
+from autorag.data.beta.schema import Raw
+
+initial_raw = Raw(initial_raw_df)
+initial_corpus = initial_raw.chunk(
+    "llama_index_chunk", chunk_method="token", chunk_size=128, chunk_overlap=5
+)
+llm = OpenAI()
+initial_qa = (
+    initial_corpus.sample(random_single_hop, n=3)
+    .map(
+        lambda df: df.reset_index(drop=True),
+    )
+    .make_retrieval_gt_contents()
+    .batch_apply(
+        factoid_query_gen,
+        llm=llm,
+    )
+    .batch_apply(
+        make_basic_gen_gt,
+        llm=llm,
+    )
+    .batch_apply(
+        make_concise_gen_gt,
+        llm=llm,
+    )
+    .filter(
+        dontknow_filter_rule_based,
+        lang="en",
+    )
+)
+initial_qa.to_parquet("./initial_qa.parquet", "./initial_corpus.parquet")
+```
+
+We recommend that you first find the optimal pipeline from this initial data.
+Check out [here](../../tutorial.md) to see the optimization tutorial.
+
+## 3. Chunking Optimization
+
+After finding the initial optimal pipeline, the next step is to optimize the chunking method.
+First, you can create various chunking results from the parsed data.
+
+The chunking YAML file looks like this.
+
+```yaml
+modules:
+  - module_type: llama_index_chunk
+    chunk_method: [ Token, Sentence ]
+    chunk_size: [ 1024, 512 ]
+    chunk_overlap: 24
+    add_file_name: english
+  - module_type: llama_index_chunk
+    chunk_method: [ SentenceWindow ]
+    sentence_splitter: kiwi
+    window_size: 3
+    add_file_name: english
+```
+
+With this YAML file, you can get the chunked data with Token, Sentence, and SentenceWindow with different chunk sizes.
+
+You can execute this chunking YAML file by using the following code.
+
+```python
+from autorag.chunker import Chunker
+
+chunker = Chunker.from_parquet("./initial_raw.parquet", "./chunk_project_dir")
+chunker.start_chunking("./chunking.yaml")
+```
+
+Then you can check out the chunking result in the `./chunk_project_dir` directory.
+
+For more details about chunking, please see [here](./chunk/chunk.md).
+## 4. QA - Corpus mapping
 
 
-### Filter
+For the chunking optimization, you can evaluate RAG performance with different corpus data.
+You already have the optimal pipeline from the initial QA data,
+so you can use this pipeline to evaluate the RAG performance with different corpus data.
+Before that, you must update all QA data with the new corpus data.
+This uses the `update_corpus` method.
 
 
 
-## 3. Chunk
+It is highly recommended that you keep the initial `QA` instance.
+If not, you need to build the `QA` instance again from the initial raw (parsed) data and corpus data.
+```python
+from autorag.data.beta.schema import Raw, Corpus, QA
 
+raw = Raw(initial_raw_df)
+corpus = Corpus(initial_corpus_df, raw)
+qa = QA(initial_qa_df, corpus)
 
-## 4. QA - Corpus mapping
+new_qa = qa.update_corpus(Corpus(new_corpus_df, raw))
+```
+
+Now `new_qa` has new `retrieval_gt` data for the new corpus.
+
+Now, with the new corpus data and the new QA data, you can evaluate the RAG performance with different corpus data.
diff --git a/docs/source/tutorial.md b/docs/source/tutorial.md
index 3f8049494..5cd2f4820 100644
--- a/docs/source/tutorial.md
+++ b/docs/source/tutorial.md
@@ -25,7 +25,7 @@ So, you need to focus on the quality of your evaluation dataset.
 Once you have it, the optimal RAG pipeline can be found using AutoRAG easily.
 
 So, for users who want to make a good evaluation dataset,
-we provide a detailed guide at [here](data_creation/tutorial.md).
+we provide a detailed guide at [here](data_creation/beta/data_creation.md).
 
 For users who want to use a pre-made evaluation dataset,
 we provide example datasets at [here](data_creation/data_format.md#samples).
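
Review note (not part of the patch): the `data_creation.md` hunk removes the only prose description of `batch_apply` (apply an `async` function to each row, run in parallel with `asyncio`, with a configurable batch size), while the new pages call it repeatedly. For readers of the updated pages, that behavior can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not AutoRAG's implementation: `batch_apply_rows`, `fake_query_gen`, and the list-of-dicts row layout are hypothetical stand-ins for the `QA.batch_apply` method, an LLM-backed generator, and the underlying DataFrame.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Row = Dict[str, Any]
RowFn = Callable[..., Awaitable[Row]]


async def fake_query_gen(row: Row, lang: str = "en") -> Row:
    """Hypothetical stand-in for an async LLM call; returns a new row with a 'query' key."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {**row, "query": f"[{lang}] question about {row['retrieval_gt']}"}


async def _run_in_batches(
    fn: RowFn, rows: List[Row], batch_size: int, **kwargs: Any
) -> List[Row]:
    results: List[Row] = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start : start + batch_size]
        # Rows inside one batch run concurrently; gather keeps the input order.
        results.extend(await asyncio.gather(*(fn(row, **kwargs) for row in batch)))
    return results


def batch_apply_rows(
    fn: RowFn, rows: List[Row], batch_size: int = 32, **kwargs: Any
) -> List[Row]:
    """Apply an async function to every row, one batch at a time (illustrative only)."""
    return asyncio.run(_run_in_batches(fn, rows, batch_size, **kwargs))


# Hypothetical sampled rows: only ids and retrieval gt are present at this stage.
rows = [{"qid": f"q{i}", "retrieval_gt": f"chunk-{i}"} for i in range(5)]
result = batch_apply_rows(fake_query_gen, rows, batch_size=2, lang="en")
```

In this sketch batches run one after another while the rows inside a batch run concurrently, so `batch_size` bounds the number of in-flight calls. Whether AutoRAG batches in exactly this way is an assumption; the removed text only states that the function must be `async` and that the batch size is configurable.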