This post discusses the performance of GPT-4-128K on long-context recall. The findings reveal that recall performance starts to degrade above roughly 73K tokens, that low recall correlates with facts placed between 7% and 50% of document depth, and that facts placed at the beginning or in the second half of the document are recalled better. The advice is not to assume that fact retrieval is guaranteed, to reduce context length where accuracy matters, and to consider where facts sit in the document. The test process involved using Paul Graham essays as background tokens, placing a fact at varying depths, and evaluating GPT-4's answers. Further steps include using a sigmoid distribution of document depths and key:value retrieval. More testing is needed to fully understand GPT-4's abilities.
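As a rough illustration of the test procedure described above, a single trial could look like the following sketch. It assumes the OpenAI Python SDK, a local file of Paul Graham essays, and an illustrative needle sentence and substring check; the function names, the word-count token proxy, and the scoring are assumptions for illustration, not the original harness (which used GPT-4 itself to grade answers).

```python
# Minimal sketch of one needle-in-a-haystack trial (illustrative, not the original code).
from openai import OpenAI

NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

def build_context(background: str, depth_fraction: float, length_words: int) -> str:
    """Trim the background text and insert the needle at a fractional depth
    (0.0 = start of the document, 1.0 = end). Word count is a rough token proxy."""
    words = background.split()[:length_words]
    insert_at = int(len(words) * depth_fraction)
    return " ".join(words[:insert_at] + [NEEDLE] + words[insert_at:])

def run_trial(client: OpenAI, background: str, depth: float, length_words: int) -> bool:
    context = build_context(background, depth, length_words)
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",  # 128K-context GPT-4 Turbo
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"{context}\n\n{QUESTION}"},
        ],
    )
    answer = resp.choices[0].message.content or ""
    return "Dolores Park" in answer  # crude recall check; the original test graded answers with GPT-4

if __name__ == "__main__":
    essays = open("paul_graham_essays.txt").read()
    client = OpenAI()
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(depth, run_trial(client, essays, depth, 80_000))
```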
This post discusses the performance of Claude 2.1, an LLM, at recalling facts placed at different document depths. The findings indicate that facts at the top and bottom of the document were recalled with high accuracy, while performance decreased toward the middle. The suggestions are to experiment with prompts and run A/B tests to improve retrieval accuracy, not to assume that facts are guaranteed to be retrieved, to reduce context length for better accuracy, and to consider the position of facts within the document. The test aimed to gain insights into LLM performance and transfer that knowledge to practical use cases.
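For the suggested prompt experiments, a lightweight A/B harness might look like the sketch below; `ask`, the two prompt templates, the depth grid, and the substring scoring are all assumptions made for illustration rather than code from the post.

```python
# Sketch of an A/B comparison between two retrieval prompts across document depths.
from typing import Callable

PROMPT_A = "Answer the question using only the document above.\n\nQuestion: {q}"
PROMPT_B = "The answer is stated verbatim in the document above. Quote it.\n\nQuestion: {q}"

def recall_rate(
    ask: Callable[[str], str],                 # wraps whatever model call you use
    document_builder: Callable[[float], str],  # returns a document with the needle at a given depth
    question: str,
    expected: str,
    template: str,
    depths=(0.1, 0.3, 0.5, 0.7, 0.9),
) -> float:
    hits = 0
    for depth in depths:
        doc = document_builder(depth)
        answer = ask(doc + "\n\n" + template.format(q=question))
        hits += expected.lower() in answer.lower()  # simple substring match as a recall proxy
    return hits / len(depths)

# Usage: run both templates against the same documents and keep the better one.
# rate_a = recall_rate(ask, build_doc, QUESTION, "Dolores Park", PROMPT_A)
# rate_b = recall_rate(ask, build_doc, QUESTION, "Dolores Park", PROMPT_B)
```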
This article describes a stress test that pushes large models to their context limits and shows that the performance of GPT-4 and Claude 2.1 improves significantly when a specific prompt statement is prepended to the model's response. The test results show that large models struggle to locate specific sentences deep in the context, but this method mitigates the issue. In addition, the Kimi team at Moonshot AI (Dark Side of the Moon) proposes a different solution and also achieves good results. The overall experiment demonstrates that large-model recall is subject to real limitations, but it can be improved with appropriate prompting and adjustments.
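A minimal sketch of the response-prefill trick, assuming the Anthropic Messages API: the prefill sentence follows Anthropic's published guidance for Claude 2.1, while the helper name and parameters here are illustrative.

```python
# Sketch: prepend a sentence to the assistant's response to steer long-context recall.
import anthropic

client = anthropic.Anthropic()

def ask_with_prefill(document: str, question: str) -> str:
    resp = client.messages.create(
        model="claude-2.1",
        max_tokens=300,
        messages=[
            {"role": "user", "content": f"{document}\n\n{question}"},
            # Prefilling the assistant turn nudges the model to scan the context
            # for the relevant sentence instead of hedging or refusing.
            {"role": "assistant",
             "content": "Here is the most relevant sentence in the context:"},
        ],
    )
    return resp.content[0].text
```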