Context
The LLM generation process consists of two key stages: the prefilling stage and the decoding stage.
Prefilling Stage: During this stage, the LLM computes the key-value (KV) caches, which store intermediate results that can be reused during the subsequent decoding stage.
Decoding Stage: In this stage, the LLM generates tokens progressively, working in an autoregressive manner to predict each next token based on the previously generated ones.
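To make the two stages concrete, the following is a minimal sketch using the Hugging Face transformers library with GPT-2 (an illustrative model choice, not tied to this project). A single forward pass over the prompt performs prefilling and builds the KV caches, and the loop then decodes greedily one token at a time while reusing those caches:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefilling: one forward pass over the full prompt computes the KV caches.
    out = model(prompt_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    generated = [next_id]
    # Decoding: feed one new token per step, reusing the cached keys/values.
    for _ in range(10):
        out = model(next_id, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```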
To optimize handling of the decoding process, the context manager is designed to save intermediate results produced during the decoding stage. Storing the entire KV caches as intermediate results would be the most complete option, but it is still under development and overhead evaluation, so two lighter-weight approaches are provided:
Text-Based Context Switch: This approach stores the generated tokens (decoded text) as intermediate results, and is compatible with both API-based and local models (see the first sketch below).
Logits-Based Context Switch: Designed for more advanced needs, this approach saves the generated search tree rather than the decoded tokens. It is particularly useful for handling different decoding strategies and currently supports only beam search (see the second sketch below).
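As a rough illustration of the text-based approach, the intermediate result is simply the text generated so far, which can be stored under an identifier and prepended to the prompt when generation resumes. The `TextContextManager` class below is a hypothetical sketch, not this project's actual interface:

```python
class TextContextManager:
    """Hypothetical sketch: store decoded text so generation can be resumed later."""

    def __init__(self):
        self._store = {}

    def save(self, context_id: str, generated_text: str) -> None:
        # The intermediate result is plain text, so the same mechanism works
        # for API-based models (e.g. OpenAI) and local Huggingface models.
        self._store[context_id] = generated_text

    def restore(self, context_id: str) -> str:
        return self._store.get(context_id, "")


manager = TextContextManager()
manager.save("task-1", "The capital of France is Paris.")

# Later: resume generation by prepending the saved text to the new prompt.
resumed_prompt = manager.restore("task-1") + " Its population is"
```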
Currently, the text-based context switch is supported only for OpenAI models and Huggingface native models, while the logits-based context switch supports only Huggingface native models. TODO: support for other model providers (e.g., anthropic, google vertex, and vllm) is under development.
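For the logits-based approach with a Huggingface native model, the saved state consists of the beam-search candidates and their scores rather than a single decoded string. The following is a hedged sketch of what such intermediate state could look like; the actual search-tree format saved by this project may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
out = model.generate(
    **inputs,
    num_beams=4,
    num_return_sequences=4,
    max_new_tokens=8,
    return_dict_in_generate=True,
    output_scores=True,
)

# Keep every beam candidate and its score instead of only the decoded top-1
# text, so a later step can continue from these candidates.
beam_state = {
    "sequences": out.sequences,                # (num_return_sequences, seq_len)
    "sequences_scores": out.sequences_scores,  # cumulative log-prob per beam
}
torch.save(beam_state, "beam_state.pt")
```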