
Maximizing RAG efficiency: A comparative analysis of RAG methods

Published online by Cambridge University Press: 30 October 2024

Tolga Şakar*
Affiliation:
Applied Data Science, TED Üniversitesi, Ankara, Turkey
Hakan Emekci
Affiliation:
Applied Data Science, TED Üniversitesi, Ankara, Turkey
*Corresponding author: Tolga Şakar; Email: tolga.sakar@tedu.edu.tr

Abstract

This paper addresses the optimization of retrieval-augmented generation (RAG) processes by exploring various methodologies, including advanced RAG methods. The research, driven by the need to enhance RAG processes as highlighted by recent studies, involved a grid-search optimization of 23,625 iterations. We evaluated multiple RAG methods across different vectorstores, embedding models, and large language models, using cross-domain datasets and contextual compression filters. The findings emphasize the importance of balancing context quality with similarity-based ranking methods, as well as understanding tradeoffs between similarity scores, token usage, runtime, and hardware utilization. Additionally, contextual compression filters were found to be crucial for efficient hardware utilization and reduced token consumption, despite the evident impacts on similarity scores, which may be acceptable depending on specific use cases and RAG methods.
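
The grid search described in the abstract can be pictured as a Cartesian product over configuration axes. The following is a minimal sketch under stated assumptions, not the authors' implementation: the option values (`store_a`, `embed_a`, `llm_a`, and so on), the `dummy_evaluate` function, and the metric names are hypothetical placeholders, and the grid size here does not correspond to the 23,625 iterations reported. Only the four RAG method names are taken from the figure captions.

```python
from itertools import product

# Hypothetical option lists: placeholders only, not the paper's actual grid.
GRID = {
    "rag_method": ["stuff", "refine", "map_reduce", "map_rerank"],
    "vectorstore": ["store_a", "store_b"],
    "embedding_model": ["embed_a", "embed_b"],
    "llm": ["llm_a", "llm_b"],
    "compression": [False, True],
}


def iter_configs(grid):
    """Yield one dict per combination of option values (Cartesian product)."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def run_grid(grid, evaluate):
    """Call evaluate(config) for every configuration and collect result rows."""
    rows = []
    for config in iter_configs(grid):
        metrics = evaluate(config)
        rows.append({**config, **metrics})
    return rows


def dummy_evaluate(config):
    # Assumed stand-in for a real RAG run; a real version would query the
    # vectorstore, call the LLM, and measure the metrics named in the abstract.
    return {"similarity": 0.0, "tokens": 0, "runtime_sec": 0.0}


results = run_grid(GRID, dummy_evaluate)
```

Each row pairs one configuration with its measured metrics, which is the shape needed to compare similarity score, token usage, and run time across methods afterwards.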

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press
Figure 1. First-generation retrieval-augmented generation (RAG) methodology: Non-parametric RAG with a parametric sequence-to-sequence model. (Refer to RAG for knowledge-intensive natural language processing tasks.)

Figure 2. Schema for searching and matching similar context from vector database based on user query.

Figure 3. Semantic similarity search process when finding top-K documents in a vector space based on input query to assign score-weighted ranks.

Figure 4. Stuff retrieval-augmented generation method.

Figure 5. Refine retrieval-augmented generation method.

Figure 6. Map Reduce method.

Figure 7. Map Re-rank method.

Figure 8. Impact of ambiguous vs specific query on retrieval.

Figure 9. Contextual Compressor method applied on top-K documents.

Figure 10. Query Step-Down method.

Figure 11. Reciprocal retrieval-augmented generation method.

Figure 12. Median run time (sec) comparisons by retrieval-augmented generation methods, datasets, embedding models, and large language models.

Table 1. Similarity scores, run time, and token usage for various retrieval-augmented generation methods across different datasets. The datasets used in this study differ in complexity and domain specificity, so the results are reported and evaluated separately for each dataset.

Figure 13. Median run time (sec) comparisons by retrieval-augmented generation methods, datasets, embedding models, and large language models.

Table 2. Utilization metrics for various embedding models and large language models

Figure 14. Median CPU usage (%) comparisons by retrieval-augmented generation methods, datasets, embedding models, and large language models.

Table 3. Performance metrics for vectorstore systems with different embedding models

Figure 15. Median run time (sec), CPU, and memory usage comparisons by retrieval-augmented generation methods on different vectorstores.

Table 4. Performance metrics for vectorstore systems

Table 5. Comparison of Contextual Compression on median and standard deviation of token usage, run time, and score

Table 6. Comparison of run time, score, and similarity threshold

Supplementary material: File

Şakar and Emekci supplementary material

File 4.8 MB