Add a few descriptions for better understanding of AutoRAG (Marker-Inc-Korea#596)

* add additional information to the prompt maker node description

* add a description of the retrieval gt and the corpus dataset

* update the diagram to the latest version and add a description of essential and optional nodes

* update the LlamaIndex link
---------

Co-authored-by: jeffrey <vkefhdl1@gmail.com>
vkehfdl1 and jeffrey authored Aug 4, 2024
1 parent f01cce0 commit e38ca5e
Showing 5 changed files with 39 additions and 9 deletions.
10 changes: 10 additions & 0 deletions docs/source/data_creation/data_format.md
@@ -76,6 +76,14 @@
it is okay to save it as a 1-d list or just a string.
If you save it as a 1-d list, it is treated as an 'and' operation.
```

```{attention}
The ids in retrieval_gt must be included in the corpus dataset as `doc_id`.
When AutoRAG starts an evaluation, it checks that the retrieval gt ids exist in the corpus dataset.
You MUST match the retrieval_gt ids with the corpus dataset.
```

This column is crucial because AutoRAG evaluates retrieval performance with it.
It can hugely affect optimization performance for nodes like retrieval, query expansion, or passage reranker.
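
As a quick pre-flight check, you can verify this constraint yourself before optimization. Below is a minimal sketch, not AutoRAG's own code; the parquet file names are hypothetical:

```python
# Sketch: check that every retrieval_gt id in the QA dataset
# exists in the corpus dataset as a doc_id.
import pandas as pd

qa = pd.read_parquet("qa.parquet")          # hypothetical path
corpus = pd.read_parquet("corpus.parquet")  # hypothetical path

corpus_ids = set(corpus["doc_id"])

def flatten(gt):
    """Yield doc ids from a string, 1-d, or 2-d retrieval_gt value."""
    if isinstance(gt, str):
        yield gt
    else:
        for item in gt:
            yield from flatten(item)

for gt in qa["retrieval_gt"]:
    missing = [i for i in flatten(gt) if i not in corpus_ids]
    assert not missing, f"retrieval_gt ids missing from corpus: {missing}"
```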

@@ -111,6 +119,8 @@
A unique identifier for each passage. Its type is `string`.

```{warning}
Do not create duplicate doc ids; duplicates can cause unexpected behavior.
Also, we suggest double-checking that the retrieval_gt ids in the QA dataset are included in the corpus dataset as `doc_id`.
```
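
A one-off uniqueness check is cheap; here is a hedged sketch, again assuming a hypothetical `corpus.parquet` in AutoRAG's data format:

```python
# Sketch: fail fast if the corpus contains duplicate doc ids.
import pandas as pd

corpus = pd.read_parquet("corpus.parquet")  # hypothetical path
dupes = corpus.loc[corpus["doc_id"].duplicated(), "doc_id"].unique().tolist()
assert not dupes, f"duplicate doc ids found: {dupes}"
```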

### contents
2 changes: 1 addition & 1 deletion docs/source/local_model.md
@@ -66,7 +66,7 @@
This is the parameter for the LLM model.
You can set the model parameter for LlamaIndex LLM initialization.
The most frequently used parameters are `model`, `max_token`, and `temperature`.
Please check what you can set for the model parameter
at [LlamaIndex LLM](https://docs.llamaindex.ai/en/stable/module_guides/models/llms/).
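
For example, a generator entry in `config.yaml` might look like the sketch below. Treat the exact keys (`llm: openai`, `max_token`) as assumptions that depend on the LlamaIndex LLM class you choose:

```yaml
modules:
  - module_type: llama_index_llm
    llm: openai                # assumed LlamaIndex LLM key
    model: gpt-3.5-turbo
    temperature: 0.8
    max_token: 512
```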

### Add more LLM models

4 changes: 3 additions & 1 deletion docs/source/nodes/generator/llama_index_llm.md
@@ -7,7 +7,9 @@
myst:
---
# llama_index LLM

The `llama_index_llm` module is a generator based on [llama_index](https://docs.llamaindex.ai/en/stable/module_guides/models/llms/). It gets an LLM instance from LlamaIndex and returns the text generated from the input prompt.
It does not generate log probs.
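
Conceptually, the module wraps plain LlamaIndex usage like the sketch below (an illustration assuming the OpenAI backend, not AutoRAG's internal code):

```python
# Sketch: get an LLM instance from LlamaIndex and generate text.
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.2)
response = llm.complete("What is AutoRAG?")
print(response.text)  # generated text only; no log probs here
```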

## **Module Parameters**
24 changes: 18 additions & 6 deletions docs/source/nodes/prompt_maker/prompt_maker.md
@@ -25,18 +25,30 @@ Please refer to the parameter of [Generator Node](../generator/generator.md) for

#### **Strategy Parameters**:

1. **Metrics**: (Essential) Metrics such as `bleu`, `meteor`, and `rouge` are used to evaluate the performance of the prompt maker process through its impact on generator (LLM) outcomes.
2. **Speed Threshold**: (Optional) `speed_threshold` is applied across all nodes, ensuring that any method exceeding the average processing time for a query is not utilized.
3. **Token Threshold**: (Optional) `token_threshold` ensures that the average token length of the output prompts does not exceed the threshold.
4. **tokenizer**: (Optional) Since you don't know which LLM model you will use in the next nodes, you can specify the tokenizer name to use in the `token_threshold` strategy.
   You can use OpenAI model names or Huggingface model names that support `AutoTokenizer`; the tokenizer for the model name you specify is found automatically (see the sketch after this list).
   The default is 'gpt2'.
5. **Generator Modules**: (Optional, but recommended to set) The prompt maker node can use all modules and module parameters from the generator node, including:
   - [llama_index_llm](../generator/llama_index_llm.md): with `llm` and additional llm parameters
   - [openai_llm](../generator/openai_llm.md): with `llm` and additional openai `AsyncOpenAI` parameters
   - [vllm](../generator/vllm.md): with `llm` and additional vllm parameters

The default generator module for evaluating the prompt maker uses OpenAI's gpt-3.5-turbo model.

Also, the prompt maker evaluation is skipped when there is only one combination of prompt maker modules and options.
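
How a tokenizer name might be resolved for `token_threshold` can be pictured with the sketch below. This is an illustration of the behavior described above, not AutoRAG's actual implementation; `count_tokens` is a hypothetical helper:

```python
# Sketch: resolve a tokenizer by name and count prompt tokens.
import tiktoken
from transformers import AutoTokenizer

def count_tokens(text: str, tokenizer_name: str = "gpt2") -> int:
    try:
        # OpenAI model names (e.g. "gpt-3.5-turbo") resolve via tiktoken.
        enc = tiktoken.encoding_for_model(tokenizer_name)
        return len(enc.encode(text))
    except KeyError:
        # Fall back to any Huggingface name that AutoTokenizer supports.
        tok = AutoTokenizer.from_pretrained(tokenizer_name)
        return len(tok.encode(text))

print(count_tokens("What is AutoRAG?", "gpt-3.5-turbo"))
```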

### Example config.yaml file
```yaml
@@ -69,4 +81,4 @@
maxdepth: 1
fstring.md
long_context_reorder.md
window_replacement.md
```
8 changes: 7 additions & 1 deletion docs/source/optimization/optimization.md
@@ -12,11 +12,17 @@
In this documentation, you can learn about how AutoRAG works under the hood.

## Swapping modules in Node

![Advanced RAG](https://github.com/Marker-Inc-Korea/AutoRAG/assets/96727832/79dda7ba-e9d8-4552-9e7b-6a5f9edc4c1a)

Here is a diagram of the overall AutoRAG pipeline.
Each box represents a node, and each node's result is passed to the next node.

```{admonition} Do I need to use all nodes?
No. The essential nodes for a working RAG pipeline are `retrieval`, `prompt maker`, and `generator`.
The other nodes are optional; you can add them for better performance.
```

But remember, you can set multiple modules and multiple parameters in each node,
and AutoRAG keeps the best result among them, as in the sketch below.
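
For instance, a node entry along these lines would make AutoRAG try both retrieval modules and keep the better-performing one. Treat it as a hedged sketch; the module names and parameters follow the style of the docs above rather than a verified config:

```yaml
node_lines:
  - node_line_name: retrieve_node_line
    nodes:
      - node_type: retrieval
        strategy:
          metrics: [retrieval_f1, retrieval_recall]
        top_k: 3
        modules:                      # two candidates swapped in one node
          - module_type: bm25
          - module_type: vectordb
            embedding_model: openai   # assumed embedding model name
```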

