hf_text-generation-inference/docs/source/index.md

# Text Generation Inference

Text-Generation-Inference is, an open-source, purpose-built solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. Text Generation Inference implements optimization for all supported model architectures, including:

- Tensor Parallelism and custom cuda kernels
- Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures
- Quantization with bitsandbytes or gptq
- Continuous batching of incoming requests for increased total throughput
- Accelerated weight loading (start-up time) with safetensors
- Logits warpers (temperature scaling, topk, repetition penalty ...)
- Watermarking with A Watermark for Large Language Models
- Stop sequences, Log probabilities
- Token streaming using Server-Sent Events (SSE)
Added index.md and other initial files 2023-07-31 06:56:29 -06:00			`# Text Generation Inference`

			`Text-Generation-Inference is, an open-source, purpose-built solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. Text Generation Inference implements optimization for all supported model architectures, including:`

			`- Tensor Parallelism and custom cuda kernels`
			`- Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures`
			`- Quantization with bitsandbytes or gptq`
			`- Continuous batching of incoming requests for increased total throughput`
			`- Accelerated weight loading (start-up time) with safetensors`
			`- Logits warpers (temperature scaling, topk, repetition penalty ...)`
			`- Watermarking with A Watermark for Large Language Models`
			`- Stop sequences, Log probabilities`
			`- Token streaming using Server-Sent Events (SSE)`