diff --git a/README.md b/README.md
index 7589a3a6..60fe83cd 100644
--- a/README.md
+++ b/README.md
@@ -52,6 +52,8 @@ Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs)
 - Logits warper (temperature scaling, top-p, top-k, repetition penalty, more details see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))
 - Stop sequences
 - Log probabilities
+- [Speculation](https://huggingface.co/docs/text-generation-inference/conceptual/speculation) (~2x latency)
+- [Guidance/JSON](https://huggingface.co/docs/text-generation-inference/conceptual/guidance). Specify the output format to speed up inference and make sure the output is valid according to some specification.
 - Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
 - Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 964a743a..73c88ccc 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -39,4 +39,8 @@
     title: Safetensors
   - local: conceptual/flash_attention
     title: Flash Attention
+  - local: conceptual/speculation
+    title: Speculation (Medusa, ngram)
+  - local: conceptual/guidance
+    title: Guidance, JSON, tools (using outlines)
   title: Conceptual Guides
diff --git a/docs/source/conceptual/guidance.md b/docs/source/conceptual/guidance.md
new file mode 100644
index 00000000..8fb46466
--- /dev/null
+++ b/docs/source/conceptual/guidance.md
@@ -0,0 +1 @@
+## Guidance
diff --git a/docs/source/conceptual/speculation.md b/docs/source/conceptual/speculation.md
new file mode 100644
index 00000000..071b7b68
--- /dev/null
+++ b/docs/source/conceptual/speculation.md
@@ -0,0 +1,48 @@
+## Speculation
+
+Speculative decoding, assisted generation, Medusa, and others are a few different names for the same idea.
+The idea is to generate tokens *before* the large model actually runs, and only *check* whether those tokens were valid.
+
+So you are doing *more* computation on your LLM, but if your guesses are correct you produce 1, 2, 3, etc. tokens in a single LLM pass. Since LLMs are usually memory bound (and not compute bound), provided your guesses are correct often enough, this yields 2-3x faster inference (it can be much more for code-oriented tasks, for instance).
+
+You can read a more [detailed explanation](https://huggingface.co/blog/assisted-generation).
+
+Text Generation Inference supports two main speculative methods:
+
+- Medusa
+- N-gram
+
+
+### Medusa
+
+
+Medusa is a [simple method](https://arxiv.org/abs/2401.10774) to create many tokens in a single pass, using fine-tuned LM heads in addition to your existing model.
+
+
+You can check out a few existing fine-tunes for popular models:
+
+- [text-generation-inference/gemma-7b-it-medusa](https://huggingface.co/text-generation-inference/gemma-7b-it-medusa)
+- [text-generation-inference/Mixtral-8x7B-Instruct-v0.1-medusa](https://huggingface.co/text-generation-inference/Mixtral-8x7B-Instruct-v0.1-medusa)
+- [text-generation-inference/Mistral-7B-Instruct-v0.2-medusa](https://huggingface.co/text-generation-inference/Mistral-7B-Instruct-v0.2-medusa)
+
+
+In order to create your own Medusa heads for your own fine-tune, you should check out the original Medusa repo:
+[https://github.com/FasterDecoding/Medusa](https://github.com/FasterDecoding/Medusa)
+
+
+In order to use Medusa models in TGI, simply point to a Medusa-enabled model and everything will load automatically.
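+For example, here is a minimal launch sketch using one of the fine-tunes above (the Docker image tag, port, and volume path are illustrative; adjust them for your setup):
+
+```shell
+model=text-generation-inference/Mistral-7B-Instruct-v0.2-medusa
+volume=$PWD/data # share a volume with the container to avoid downloading weights every run
+
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
+    ghcr.io/huggingface/text-generation-inference:latest \
+    --model-id $model
+```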
+### N-gram
+
+
+If you don't have a Medusa model, or don't have the resources to fine-tune one, you can try to use `n-gram`.
+N-gram works by trying to find matching tokens in the previous sequence, and using those as speculation.
+
+This is an extremely simple method, which works best for code or highly repetitive text. It might not be beneficial if the speculation misses too often.
+
+
+In order to enable n-gram speculation, simply use
+
+`--speculate 2` in your flags.
+
+[Details about the flag](https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#speculate)
diff --git a/router/src/server.rs b/router/src/server.rs
index 2efa9284..9c7046d9 100644
--- a/router/src/server.rs
+++ b/router/src/server.rs
@@ -242,7 +242,7 @@ async fn generate(
     headers.insert("x-compute-type", compute_type.parse().unwrap());
     headers.insert(
         "x-compute-time",
-        total_time.as_millis().to_string().parse().unwrap(),
+        total_time.as_secs_f64().to_string().parse().unwrap(),
     );
     headers.insert(
         "x-compute-characters",
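With the `server.rs` change above, the `x-compute-time` response header is reported in (fractional) seconds rather than milliseconds. A quick, illustrative way to inspect it against a local TGI instance (the endpoint and payload are just an example):

```shell
# POST a generation request and print only the x-compute-time response header
curl -si http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}}' \
    | grep -i '^x-compute-time'
```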