The Needle In a Haystack Test: Evaluating the Performance of LLM RAG Systems

Introduction

Retrieval-augmented generation (RAG) underpins many of the LLM applications in the real world today, from companies generating headlines to solo developers solving problems for small businesses. With RAG’s importance likely to grow, ensuring its effectiveness is paramount.

The evaluation of RAG has therefore become a critical part of developing and deploying these systems. One innovative approach to this challenge is the “Needle in a Haystack” test, first outlined by Greg Kamradt in this X post and discussed in detail on his YouTube here.

What Is the Needle In a Haystack Test for LLMs?

The “Needle In a Haystack” test is designed to evaluate the performance of LLM RAG systems across different sizes of context. It works by embedding specific, targeted information (the “needle”) within a larger, more complex body of text (the “haystack”). The goal is to assess an LLM’s ability to identify and utilize this specific piece of information amidst a vast amount of data.
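The embedding step can be sketched in a few lines of Python. This is a minimal illustration, not the code used in any of the experiments described here: the function name `insert_needle` is hypothetical, and it assumes sentences are separated by ". " and measures depth in sentences, whereas a real harness would measure depth in tokens.

```python
def insert_needle(haystack: str, needle: str, depth_percent: float) -> str:
    """Insert `needle` as its own sentence about `depth_percent` of the way into `haystack`.

    Assumption: sentences are separated by ". ". A real harness would
    measure depth in tokens using the model's tokenizer.
    """
    if not 0 <= depth_percent <= 100:
        raise ValueError("depth_percent must be between 0 and 100")
    # Drop the final period so every sentence can be re-joined uniformly.
    sentences = haystack.strip().rstrip(".").split(". ")
    # 0% puts the needle first, 100% puts it last.
    index = round(len(sentences) * depth_percent / 100)
    sentences.insert(index, needle.strip().rstrip("."))
    return ". ".join(sentences) + "."


haystack = "One. Two. Three. Four."
needle = "The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day."
print(insert_needle(haystack, needle, 50))
```

Inserting on a sentence boundary keeps the needle intact as a standalone statement, so a miss reflects the model's recall and not a needle that was split in half.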

Often in RAG systems, the context window is absolutely overflowing with information. Large pieces of context returned from a vector database are cluttered together with instructions for the language model, templating, and anything else that might exist in the prompt. The Needle in a Haystack evaluation tests the capabilities of an LLM to pinpoint specifics in amongst this mess. Your RAG system might do a stellar job of retrieving the most relevant context, but what use is this if the granular specifics within are overlooked?

We ran this test multiple times across several market-leading language models. Let’s take a closer look at the process and overall results, first documented in this X thread.

What Are the Main Takeaways from The Needle In a Haystack Research?

  • Not all LLMs are the same. Models are trained with different objectives and requirements in mind. For example, Anthropic’s Claude is known for being a slightly wordier model, which often stems from its objective to not make unsubstantiated claims.
  • Because of these differences, minute changes in a prompt can lead to drastically different outcomes across models. Some LLMs need more tailored prompting to perform well at specific tasks.
  • When building on top of LLMs – especially when those models are connected to private data – it is necessary to evaluate retrieval and model performance throughout development and deployment. Seemingly insignificant differences can lead to incredibly large differences in performance, and in turn, customer satisfaction.

The Creation of The Needle In a Haystack Test

The Needle in a Haystack test was first used to evaluate the recall of two popular LLMs, OpenAI’s ChatGPT-4 and Anthropic’s Claude 2.1. An out-of-place statement, “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day,” was placed at varying depths within snippets of varying lengths taken from essays by Paul Graham, similar to this:

Figure 1: About 120 tokens and 50% depth

The models were then prompted to say what the best thing to do in San Francisco was, using only the provided context. This was repeated for different depths between 0% (top of document) and 100% (bottom of document) and for different context lengths between 1k tokens and the token limit of each model (128k for GPT-4 and 200k for Claude 2.1). The graphs below document the performance of these two models:
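The sweep over depths and context lengths amounts to a grid of trials. The sketch below shows one way to build and run such a grid. The names `build_grid`, `sweep`, and `run_trial` are hypothetical, the evenly spaced steps are an assumption (the original experiments may have spaced them differently), and `run_trial` stands in for whatever code builds the prompt, calls the model, and grades the answer.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def build_grid(min_tokens: int, max_tokens: int,
               n_lengths: int, n_depths: int) -> List[Tuple[int, float]]:
    """Return every (context_length, depth_percent) pair to test.

    Assumption: both axes are evenly spaced.
    """
    if n_lengths < 2 or n_depths < 2:
        raise ValueError("need at least two points on each axis")
    step = (max_tokens - min_tokens) / (n_lengths - 1)
    lengths = [round(min_tokens + i * step) for i in range(n_lengths)]
    depths = [round(100 * j / (n_depths - 1), 1) for j in range(n_depths)]
    return list(product(lengths, depths))


def sweep(grid: List[Tuple[int, float]],
          run_trial: Callable[[int, float], bool]) -> Dict[Tuple[int, float], bool]:
    """Run one trial per grid cell and record whether the needle was retrieved."""
    return {(length, depth): run_trial(length, depth) for length, depth in grid}


# Stand-in trial that "fails" on long contexts, just to exercise the loop.
grid = build_grid(min_tokens=1_000, max_tokens=128_000, n_lengths=3, n_depths=5)
results = sweep(grid, run_trial=lambda length, depth: length < 100_000)
print(sum(results.values()), "of", len(results), "cells retrieved")
```

Each cell of the resulting dictionary corresponds to one square in heatmaps like the ones shown in the figures.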

Figure 2: ChatGPT-4’s performance

As you can see, ChatGPT’s performance begins to decline at <64k tokens and sharply falls at 100k and over. Interestingly, if the ‘needle’ is positioned towards the beginning of the context, the model tends to overlook or “forget” it, whereas if it’s placed towards the end or as the very first sentence, the model’s performance remains solid. This is truly fascinating stuff.

Figure 3: Claude 2.1’s performance

As for Claude, initial testing did not go as smoothly, finishing with an overall retrieval accuracy of 27%. A similar pattern emerged: performance declined as context length increased, generally improved as the needle was placed closer to the bottom of the document, and reached 100% retrieval accuracy when the needle was the first sentence of the context.

Anthropic’s Response

In response to these findings, Anthropic published an article detailing their re-run of this test with a few key changes.

First, they changed the needle to more closely mirror the topic of the haystack. Claude 2.1 was trained to “not [answer] a question based on a document if it doesn’t contain enough information to justify that answer.” Thus, Claude may well have correctly identified eating a sandwich in Dolores Park as the best thing to do in San Francisco. However, in the middle of an essay about doing great work, this small piece of information may have appeared unsubstantiated. This could have led either to a verbose response explaining that Claude cannot confirm that eating a sandwich is the best thing to do in San Francisco, or to the detail being omitted entirely. When re-running the experiments, researchers at Anthropic found that changing the needle to a small detail originally mentioned in the essay led to significantly better results.

Second, a small edit was made to the prompt template used to query the model.

Figure 4: Anthropic’s Prompt Template Update

As you can see, a single line was added to the end of the template, directing the model to simply return the most relevant sentence provided in the context. Like the first change, this one circumvents the model’s propensity to avoid unsubstantiated claims by directing it to return a sentence rather than make an assertion.
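To make the template change concrete, here is a small sketch of a prompt builder with an optional trailing instruction. Both strings are placeholders written for illustration; they are not the wording used by Greg Kamradt or by Anthropic, and `build_prompt` is a hypothetical helper.

```python
# Placeholder wording, NOT the templates used in the experiments.
BASE_TEMPLATE = (
    "Here is some context:\n"
    "<context>\n{context}\n</context>\n\n"
    "Question: {question}\n"
    "Answer using only the context above."
)

# Hypothetical one-line addition that asks for a quoted sentence
# instead of an assertion.
RETRIEVAL_SUFFIX = "Start by quoting the single most relevant sentence from the context."


def build_prompt(context: str, question: str, add_suffix: bool = False) -> str:
    """Fill the base template and optionally append the retrieval instruction."""
    prompt = BASE_TEMPLATE.format(context=context, question=question)
    if add_suffix:
        prompt += "\n" + RETRIEVAL_SUFFIX
    return prompt


print(build_prompt("...essay text...", "What is the best thing to do in San Francisco?", add_suffix=True))
```

Keeping the suffix behind a flag makes it easy to run the same grid of trials with and without the extra instruction and compare the two.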

These changes led to a significant jump in Claude’s overall retrieval accuracy: from 27% to 98%! Our team found this initial research fascinating and decided to run our own set of experiments using the Needle in a Haystack test.

Our Research

In conducting our own series of tests, we implemented several modifications to the original experiments. The needle we used was a random number that changed on each iteration, eliminating the possibility of caching. Additionally, we used our own evaluation library. In doing so we were able to:

  1. reduce the testing time from three days to just two hours, and
  2. use rails to search directly for the random number in the output, cutting through any possible wordiness that would decrease a retrieval score.
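The idea of a random-number needle paired with a direct search of the output can be sketched as follows. This is an illustration under assumptions, not the evaluation library mentioned above: `make_needle` and `retrieved` are hypothetical names, the needle sentence is placeholder wording, and the seven-digit range is arbitrary.

```python
import random
import re
from typing import Tuple


def make_needle(rng: random.Random) -> Tuple[str, str]:
    """Return (secret, needle_sentence) with a fresh random number each call."""
    secret = str(rng.randint(1_000_000, 9_999_999))
    # Placeholder wording for the needle sentence.
    return secret, f"The special magic number is: {secret}."


def retrieved(output: str, secret: str) -> bool:
    """True if `secret` appears in `output` as a standalone number.

    The lookarounds stop the secret from matching inside a longer digit run.
    Numbers reformatted by the model (e.g. with thousands separators) are
    not handled here.
    """
    pattern = rf"(?<!\d){re.escape(secret)}(?!\d)"
    return re.search(pattern, output) is not None


rng = random.Random()
secret, needle = make_needle(rng)
wordy_answer = f"Based on the context provided, the number appears to be {secret}, as stated."
print(retrieved(wordy_answer, secret))
```

Because the check looks only for the number itself, a wordy but correct answer is scored the same as a terse one.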

Finally, we considered the negative case where the system fails to retrieve the results, marking it as unanswerable. We ran a separate test for this negative case to assess how well the system recognizes when it can’t retrieve the data. These modifications allowed us to conduct a more rigorous and comprehensive evaluation.
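One possible reading of the negative case is a haystack that contains no needle at all, where the desired behavior is for the model to say the question is unanswerable. The sketch below grades both cases under that assumption. The `UNANSWERABLE` sentinel, the `grade` function, and the label names are all hypothetical and not taken from the tests described above.

```python
from typing import Optional

# Hypothetical sentinel the prompt would ask the model to emit
# when the answer is not in the context.
UNANSWERABLE = "UNANSWERABLE"


def grade(output: str, secret: Optional[str]) -> str:
    """Label one trial.

    `secret` is the needle's number, or None when no needle was inserted
    (the negative case).
    """
    abstained = UNANSWERABLE in output.upper()
    if secret is None:
        # Negative case: the only correct behavior is to abstain.
        return "correct_abstain" if abstained else "false_answer"
    if secret in output:
        return "retrieved"
    # Positive case without the secret: either gave up or answered wrongly.
    return "wrong_abstain" if abstained else "missed"


print(grade("The number is 4821937.", "4821937"))
print(grade("Unanswerable: the context does not mention a number.", None))
```

Separating `wrong_abstain` from `missed` distinguishes a model that gives up too early from one that answers confidently with the wrong value.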

The updated tests were run across several different configurations using four different large language models: ChatGPT-4, Claude 2.1 (with and without the aforementioned change to the prompt that Anthropic suggested), and Mistral AI’s Mixtral 8x7B-v0.1 and Mistral 7B Instruct. Given that small nuances in prompting can lead to vastly different results across models, our team used several prompt templates in an attempt to compare these models performing at their best. The simple template we used for ChatGPT and Mixtral was as follows:

Figure 5: ChatGPT and Mixtral templating

And for Claude, we tested both previously discussed templates.

Figure 6: Claude templating used by Greg Kamradt
Figure 7: Revised Claude templating from Anthropic

All code run to complete these tests can be found in this GitHub repository.

Results

Figure 8: Comparison of GPT-4 results between the initial research (Run #1) and our testing (Run #2)
Figure 9: Comparison of Claude 2.1 (without prompting guidance) results between Run #1 and Run #2

Our results for ChatGPT and Claude (without prompting guidance) did not stray far from Mr. Kamradt’s findings, and the generated graphs appear relatively similar: the upper right (long context, needle near the beginning of the context) is where LLM information retrieval suffers.

Figure 10: Comparison of Claude 2.1 results with and without prompting guidance

As for Claude 2.1 with prompting guidance, although we were not able to replicate Anthropic’s result of 98% retrieval accuracy, we did see a significant decrease in total misses when the prompt was updated (from 165 to 74). This improvement was achieved by simply adding a 10-word instruction to the end of the existing prompt, highlighting that small differences in prompts can have drastically different outcomes for LLMs.
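For a sense of scale, the drop from 165 to 74 misses can be expressed as a relative reduction. The helper below is a hypothetical convenience function. Since the total number of trials is not stated here, this gives only the change in misses, not a retrieval accuracy.

```python
def relative_reduction(before: int, after: int) -> float:
    """Fraction by which a count fell, e.g. 0.25 for a 25% reduction."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Misses without and with the updated prompt, as reported above.
print(f"{relative_reduction(165, 74):.1%}")  # 55.2%
```

In other words, the updated prompt removed a little over half of the misses.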

Last but certainly not least, it is interesting to see how well Mixtral performed at this task, even though the Mistral models were by far the smallest tested. The Mixture of Experts (MoE) model was far better than 7B-Instruct, and we are finding that MoEs do much better on retrieval evals.

Conclusion

As these LLMs become integral to an increasing number of products and services, our ability to evaluate and understand their retrieval capabilities will take on elevated importance.

The Needle in a Haystack test is a clever way to quantify an LLM’s ability to parse context to find needed information. Our research concluded with a few main takeaways. First, ChatGPT-4 is the industry’s current leader in this arena, as it is in many other evaluations that we and others have carried out. Second, Claude 2.1 at first seemed to underperform on this test, but with tweaks to the prompt structure the model showed significant improvement. Claude is a bit wordier than some other models, and taking extra care to direct it can go a long way in terms of results. Finally, Mixtral 8x7B MoE greatly outperformed our expectations, and we are excited to see Mistral models continue to exceed expectations across our research.

Further articles detailing LLM evaluation methods to follow.
