<aside> 📘 TL;DR
RCG goes a step further than RAG: the LLM relies entirely on retrieved data to generate its answers.
But RCG is not just RAG with more data; it implies some fundamental shifts in the design direction of the underlying LLM:
the research focus of LLMs moves from embedding more data in the model to the cognitive ability to recognize abstract patterns in data.
</aside>
To transition from consumer to business deployment for GenAI, solutions should be built primarily around information external to the model using retrieval-centric generation (RCG).
As generative AI (GenAI) begins deployment throughout industries for a wide range of business usages, companies need models that provide efficiency, accuracy, security, and traceability. The original architecture of ChatGPT-like models has demonstrated a major gap in meeting these key requirements. In early GenAI models, retrieval was used as an afterthought to address the shortcomings of models that rely on memorized information from parametric memory. Current models have made significant progress on that issue by enhancing solution platforms with a retrieval-augmented generation (RAG) front-end that extracts information external to the model. Perhaps it’s time to rethink the architecture of generative AI further and move from RAG systems, where retrieval is an addendum, to retrieval-centric generation (RCG) models built around retrieval as the core access to information.
Retrieval-centric generation can be defined as a generative AI approach designed for systems where the vast majority of data resides outside the model’s parametric memory and is mostly not seen in pre-training or fine-tuning. With RCG, the primary role of the GenAI model is to interpret rich retrieved information from a company’s indexed data corpus or other curated content. Rather than memorizing data, the model focuses on fine-tuning for targeted constructs, relationships, and functionality. The quality of data in generated output is expected to approach 100% accuracy and timeliness. The ability to properly interpret and use large amounts of data not seen in pre-training requires increased abstraction in the model and the use of schemas as a key cognitive capability to identify complex patterns and relationships in information. These new requirements of retrieval coupled with automated learning of schemas will lead to further evolution in the pre-training and fine-tuning of large language models (LLMs).
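To make that division of labor concrete, here is a minimal sketch of an RCG-style generation step in Python. The `retrieve` and `llm_complete` functions are hypothetical placeholders for a vector-store lookup and an LLM completion call (they are not from any particular library); the point is that the prompt confines the model to interpreting retrieved passages rather than recalling from parametric memory.

```python
# Minimal RCG-style generation step (illustrative sketch).
# `retrieve` and `llm_complete` are hypothetical placeholders for a
# vector-store lookup and an LLM completion call, respectively.

def build_rcg_prompt(question: str, passages: list[str]) -> str:
    # The instruction confines the model to the retrieved context:
    # its job is interpretation, not recall from parametric memory.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using ONLY the passages below. "
        "If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def rcg_answer(question: str) -> str:
    passages = retrieve(question, top_k=5)  # hypothetical: indexed-corpus lookup
    return llm_complete(build_rcg_prompt(question, passages))
```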
Figure 1. Advantages and challenges of retrieval-centric generation (RCG) versus retrieval-augmented generation (RAG). Image credit: Intel Labs.
Substantially reducing the use of memorized data from parametric memory in GenAI models and instead relying on verifiable indexed sources will improve provenance and play an important role in enhancing accuracy and performance. The prevalent assumption in GenAI architectures up to now has been that more data in the model is better. Based on this currently predominant structure, it is expected that most tokens and concepts have been ingested and cross-mapped so that models can generate better answers from their parametric memory. However, in the common business scenario, the large majority of data used for the generated output is expected to come from retrieved inputs. We’re now observing that keeping more data in the model while relying on retrieved knowledge causes conflicts of information, or the inclusion of data that can’t be traced or verified against its source. As I outlined in my last blog, Survival of the Fittest, smaller, nimble targeted models designed to use RCG don’t need to store as much data in parametric memory.
In business settings where the data will come primarily from retrieval, the targeted system needs to excel in interpreting unseen relevant information to meet company requirements. In addition, the prevalence of large vector databases and an increase in context window size (for example, OpenAI has recently increased the context window in GPT-4 Turbo from 32K to 128K) are shifting models toward reasoning and the interpretation of unseen complex data. Models now require intelligence to turn broad data into effective knowledge by utilizing a combination of sophisticated retrieval and fine-tuning. As models become retrieval-centric, cognitive competencies for creating and utilizing schemas will take center stage.
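As a rough illustration of what larger context windows enable, the sketch below greedily packs the highest-ranked retrieved chunks into a fixed token budget. The budget value and the whitespace-based token count are simplifying assumptions for this sketch; a real system would use the model’s own tokenizer and reserve room for the prompt and the answer.

```python
def pack_context(ranked_chunks: list[str], budget_tokens: int = 128_000) -> list[str]:
    """Greedily pack the highest-ranked retrieved chunks into a context budget.

    Assumptions for this sketch: chunks arrive sorted by retrieval score, and
    token counts are approximated by whitespace splitting; a real system would
    use the model's own tokenizer.
    """
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())
        if used + cost > budget_tokens:
            break  # stop before overflowing the window
        packed.append(chunk)
        used += cost
    return packed
```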
After a decade of rapid growth in AI model size and complexity, 2023 marks a shift in focus to efficiency and the targeted application of generative AI. The transition from a consumer focus to business usage is one of the key factors driving this change on three levels: quality of data, source of data, and targeted usages.
● Quality of data: When generating content and analysis for companies, 95% accuracy is insufficient. Businesses need near or at full accuracy. Fine-tuning for high performance on specific tasks and managing the quality of data used are both required for ensuring quality of output. Furthermore, data needs to be traceable and verifiable. Provenance matters, and retrieval is central for determining the source of content (see the sketch after this list).
● Source of data: The vast majority of the data in business applications is expected to be curated from trusted external sources as well as proprietary business/enterprise data, including information about products, resources, customers, supply chain, internal operations, and more. Retrieval is central to accessing the latest and broadest set of proprietary data not pre-trained in the model. Models large and small can have problems with provenance when using data from their own internal memory versus verifiable, traceable data extracted from business sources. If the data conflicts, it can confuse the model.
● Targeted usages: The constructs and functions of models for companies tend to be specialized for a set of usages and types of data. When GenAI functionality is deployed in a specific workflow or business application, it is unlikely to require all-in-one functionality. And because its data comes primarily from retrieval, the targeted system must excel at interpreting relevant information unseen by the model in the particular ways the company requires.
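As a hedged sketch of the provenance point above, the snippet below shows one way retrieved chunks can carry source metadata through to the generated output, so answers remain traceable to verifiable business data. The `Chunk` shape and the `llm_complete` call are illustrative assumptions, not an API described in this article.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str      # e.g., document URI or record ID in the business corpus
    indexed_at: str  # when this source was last indexed

def answer_with_citations(question: str, chunks: list[Chunk]) -> dict:
    # Every passage handed to the model keeps its source metadata, so the
    # generated output can be traced back to verifiable business data.
    context = "\n\n".join(f"[{i + 1}] {c.text}" for i, c in enumerate(chunks))
    prompt = (
        "Answer using only the numbered passages and cite passage numbers.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    answer = llm_complete(prompt)  # hypothetical completion call, as earlier
    return {
        "answer": answer,
        "citations": [
            {"source": c.source, "indexed_at": c.indexed_at} for c in chunks
        ],
    }
```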
Consider, for example, a financial or healthcare company pursuing a GenAI model to improve its services: it will focus on the family of functions needed for its intended use. It has the option to pre-train a model from scratch and try to include all its proprietary information. However, such an effort is likely to be expensive, to require deep expertise, and to fall behind quickly as the technology evolves and the company’s data continuously changes. Furthermore, it will need to rely on retrieval anyway for access to the latest concrete information. A more effective path is to take an existing pre-trained base model (like Meta’s Llama 2) and customize it through fine-tuning and indexing for retrieval. Fine-tuning uses just a small fraction of the information and tasks to refine the behavior of the model, while the extensive proprietary business information itself can be indexed and made available for retrieval as needed. As the base model gets updated with the latest GenAI technology, refreshing the target model should be a relatively straightforward process of repeating the fine-tuning flow.
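The indexing half of that path can be sketched with off-the-shelf components. The snippet below uses sentence-transformers and FAISS as one plausible choice; the embedding model and the toy corpus are placeholders, not a recommendation from this article. Note that nothing here touches the model’s weights: the index can be refreshed whenever business data changes, independently of fine-tuning.

```python
# One plausible indexing flow: sentence-transformers + FAISS.
# The embedding model and the toy corpus below are placeholder choices.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Product X supports remote reset via the admin console.",
    "Support policy: tickets are triaged within four business hours.",
]  # stand-in for the proprietary business corpus

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works
embeddings = encoder.encode(documents, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product ~ cosine (normalized)
index.add(np.asarray(embeddings, dtype="float32"))

# Query time: embed the question and retrieve the nearest documents.
query = encoder.encode(["How do I reset the device?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), k=2)
```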
Meta AI and university collaborators introduced retrieval-augmented generation in 2021 to address issues of provenance and updating world knowledge in LLMs. Researchers used RAG as a general-purpose approach to add non-parametric memory to pre-trained, parametric-memory generation models. The non-parametric memory used a Wikipedia dense vector index accessed by a pre-trained retriever. In a compact model with less memorized data, there is a strong emphasis on the breadth and quality of the indexed data referenced by the vector database because the model cannot rely on memorized information for business needs. Both RAG and RCG can use the same retriever approach, pulling relevant knowledge from curated corpora on the fly at inference time (see Figure 2). They differ in where the GenAI system places its information as well as in the interpretation expectations for previously unseen data. With RAG, the model itself is a major source of information, and it’s aided by retrieved data. In contrast, with RCG the vast majority of data resides outside the model’s parametric memory, making the interpretation of unseen data the model’s primary role.
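One way to picture that difference in interpretation expectations is in how the prompt frames the retrieved passages. The two templates below are an illustrative sketch only; they are not drawn from the RAG paper or from any production system.

```python
# Illustrative contrast only (not from the RAG paper or any product):
# the same retrieved passages, framed per RAG vs. RCG expectations.
RAG_TEMPLATE = (
    "Use the passages below to help answer the question; you may also draw "
    "on your own knowledge if the passages are incomplete.\n\n"
    "{passages}\n\nQuestion: {question}"
)

RCG_TEMPLATE = (
    "Answer strictly from the passages below and do not use prior knowledge. "
    "If the answer is not present, reply that it was not found in the "
    "indexed corpus.\n\n"
    "{passages}\n\nQuestion: {question}"
)
```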
It should be noted that many current RAG solutions rely on orchestration flows like LangChain or Haystack to concatenate a retrieval front-end and an independent vector store with a GenAI model that was not pre-trained with retrieval. These solutions provide an environment for indexing data sources, model choice, and model behavioral training. Other approaches, such as REALM by Google Research, experiment with end-to-end pre-training with integrated retrieval. Currently, OpenAI is optimizing its retrieval GenAI path itself rather than leaving it to the ecosystem to create the flow for ChatGPT. The company recently released the Assistants API, which retrieves proprietary domain data, product information, or user documents external to the model.
Figure 2. Both RCG and RAG retrieve public and private data during inference, but they differ in how they place and interpret unseen data. Image credit: Intel Labs.
In other examples, fast retriever models like Intel Labs’ fastRAG use pre-trained small foundation models to extract requested information from a knowledge base without any additional training, providing a more sustainable solution. Built as an extension to the open-source Haystack GenAI framework, fastRAG uses a retriever model to generate conversational answers by retrieving current documents from an external knowledge base. In addition, a team of researchers from Meta recently published a paper introducing Retrieval-Augmented Dual Instruction Tuning (RA-DIT), “a lightweight fine-tuning methodology that provides a third option by retrofitting any large language model with retrieval capabilities.”