doc: Add metrics documentation and add a 'Reference' section (#2230)

* doc: Add metrics documentation and add a 'Reference' section

* doc: Add API reference

* doc: Refactor API reference

* fix: Message API link

* Bad rebase

* Moving the docs.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
This commit is contained in:
Hugo Larcher 2024-08-16 19:43:30 +02:00 committed by GitHub
parent cb0a29484d
commit 53729b74ac
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
8 changed files with 179 additions and 18 deletions

View File

@ -11,7 +11,7 @@ concurrency:
jobs: jobs:
build: build:
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yaml@main uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
with: with:
commit_sha: ${{ github.event.pull_request.head.sha }} commit_sha: ${{ github.event.pull_request.head.sha }}
pr_number: ${{ github.event.number }} pr_number: ${{ github.event.number }}

View File

@ -5,7 +5,7 @@ repos:
- id: check-yaml - id: check-yaml
- id: end-of-file-fixer - id: end-of-file-fixer
- id: trailing-whitespace - id: trailing-whitespace
exclude: docs/source/basic_tutorials/launcher.md exclude: docs/source/reference/launcher.md
- repo: https://github.com/psf/black - repo: https://github.com/psf/black
rev: 24.2.0 rev: 24.2.0
hooks: hooks:

View File

@ -17,8 +17,6 @@
title: Installation from source title: Installation from source
- local: supported_models - local: supported_models
title: Supported Models and Hardware title: Supported Models and Hardware
- local: messages_api
title: Messages API
- local: architecture - local: architecture
title: Internal Architecture title: Internal Architecture
- local: usage_statistics - local: usage_statistics
@ -33,8 +31,6 @@
title: Serving Private & Gated Models title: Serving Private & Gated Models
- local: basic_tutorials/using_cli - local: basic_tutorials/using_cli
title: Using TGI CLI title: Using TGI CLI
- local: basic_tutorials/launcher
title: All TGI CLI options
- local: basic_tutorials/non_core_models - local: basic_tutorials/non_core_models
title: Non-core Model Serving title: Non-core Model Serving
- local: basic_tutorials/safety - local: basic_tutorials/safety
@ -48,6 +44,14 @@
- local: basic_tutorials/train_medusa - local: basic_tutorials/train_medusa
title: Train Medusa title: Train Medusa
title: Tutorials title: Tutorials
- sections:
- local: reference/launcher
title: All TGI CLI options
- local: reference/metrics
title: Exported Metrics
- local: reference/api_reference
title: API Reference
title: Reference
- sections: - sections:
- local: conceptual/streaming - local: conceptual/streaming
title: Streaming title: Streaming
@ -64,7 +68,7 @@
- local: conceptual/speculation - local: conceptual/speculation
title: Speculation (Medusa, ngram) title: Speculation (Medusa, ngram)
- local: conceptual/guidance - local: conceptual/guidance
title: How Guidance Works (via outlines title: How Guidance Works (via outlines)
- local: conceptual/lora - local: conceptual/lora
title: LoRA (Low-Rank Adaptation) title: LoRA (Low-Rank Adaptation)

View File

@ -1,11 +1,9 @@
# Messages API # HTTP API Reference
Text Generation Inference (TGI) now supports the Messages API, which is fully compatible with the OpenAI Chat Completion API. This feature is available starting from version 1.4.0. You can use OpenAI's client libraries or third-party libraries expecting OpenAI schema to interact with TGI's Messages API. Below are some examples of how to utilize this compatibility.
> **Note:** The Messages API is supported from TGI version 1.4.0 and above. Ensure you are using a compatible version to access this feature.
#### Table of Contents #### Table of Contents
- [Text Generation Inference custom API](#text-generation-inference-custom-api)
- [OpenAI Messages API](#openai-messages-api)
- [Making a Request](#making-a-request) - [Making a Request](#making-a-request)
- [Streaming](#streaming) - [Streaming](#streaming)
- [Synchronous](#synchronous) - [Synchronous](#synchronous)
@ -13,6 +11,21 @@ Text Generation Inference (TGI) now supports the Messages API, which is fully co
- [Cloud Providers](#cloud-providers) - [Cloud Providers](#cloud-providers)
- [Amazon SageMaker](#amazon-sagemaker) - [Amazon SageMaker](#amazon-sagemaker)
The HTTP API is a RESTful API that allows you to interact with the text-generation-inference component. Two endpoints are available:
* Text Generation Inference [custom API](https://huggingface.github.io/text-generation-inference/)
* OpenAI's [Messages API](#openai-messages-api)
## Text Generation Inference custom API
Check the [API documentation](https://huggingface.github.io/text-generation-inference/) for more information on how to interact with the Text Generation Inference API.
## OpenAI Messages API
Text Generation Inference (TGI) now supports the Messages API, which is fully compatible with the OpenAI Chat Completion API. This feature is available starting from version 1.4.0. You can use OpenAI's client libraries or third-party libraries expecting OpenAI schema to interact with TGI's Messages API. Below are some examples of how to utilize this compatibility.
> **Note:** The Messages API is supported from TGI version 1.4.0 and above. Ensure you are using a compatible version to access this feature.
## Making a Request ## Making a Request
You can make a request to TGI's Messages API using `curl`. Here's an example: You can make a request to TGI's Messages API using `curl`. Here's an example:

View File

@ -0,0 +1,30 @@
# Metrics
TGI exposes multiple metrics that can be collected via the `/metrics` Prometheus endpoint.
These metrics can be used to monitor the performance of TGI, autoscale deployment and to help identify bottlenecks.
The following metrics are exposed:
| Metric Name | Description | Type | Unit |
|--------------------------------------------|------------------------------------------------------------------------------------------|-----------|---------|
| `tgi_batch_current_max_tokens` | Maximum tokens for the current batch | Gauge | Count |
| `tgi_batch_current_size` | Current batch size | Gauge | Count |
| `tgi_batch_decode_duration` | Time spent decoding a batch per method (prefill or decode) | Histogram | Seconds |
| `tgi_batch_filter_duration` | Time spent filtering batches and sending generated tokens per method (prefill or decode) | Histogram | Seconds |
| `tgi_batch_forward_duration` | Batch forward duration per method (prefill or decode) | Histogram | Seconds |
| `tgi_batch_inference_count` | Inference calls per method (prefill or decode) | Counter | Count |
| `tgi_batch_inference_duration` | Batch inference duration | Histogram | Seconds |
| `tgi_batch_inference_success` | Number of successful inference calls per method (prefill or decode) | Counter | Count |
| `tgi_batch_next_size` | Batch size of the next batch | Histogram | Count |
| `tgi_queue_size` | Current queue size | Gauge | Count |
| `tgi_request_count` | Total number of requests | Counter | Count |
| `tgi_request_duration` | Total time spent processing the request (e2e latency) | Histogram | Seconds |
| `tgi_request_generated_tokens` | Generated tokens per request | Histogram | Count |
| `tgi_request_inference_duration` | Request inference duration | Histogram | Seconds |
| `tgi_request_input_length` | Input token length per request | Histogram | Count |
| `tgi_request_max_new_tokens` | Maximum new tokens per request | Histogram | Count |
| `tgi_request_mean_time_per_token_duration` | Mean time per token per request (inter-token latency) | Histogram | Seconds |
| `tgi_request_queue_duration` | Time spent in the queue per request | Histogram | Seconds |
| `tgi_request_skipped_tokens` | Speculated tokens per request | Histogram | Count |
| `tgi_request_success` | Number of successful requests | Counter | |
| `tgi_request_validation_duration` | Time spent validating the request | Histogram | Seconds |

View File

@ -2003,6 +2003,120 @@ async fn start(
.install_recorder() .install_recorder()
.expect("failed to install metrics recorder"); .expect("failed to install metrics recorder");
// Metrics descriptions
metrics::describe_counter!("tgi_request_success", "Number of successful requests");
metrics::describe_histogram!(
"tgi_request_duration",
metrics::Unit::Seconds,
"Request duration"
);
metrics::describe_histogram!(
"tgi_request_validation_duration",
metrics::Unit::Seconds,
"Request validation duration"
);
metrics::describe_histogram!(
"tgi_request_queue_duration",
metrics::Unit::Seconds,
"Request queue duration"
);
metrics::describe_histogram!(
"tgi_request_inference_duration",
metrics::Unit::Seconds,
"Request inference duration"
);
metrics::describe_histogram!(
"tgi_request_mean_time_per_token_duration",
metrics::Unit::Seconds,
"Mean time per token per request"
);
metrics::describe_histogram!(
"tgi_request_generated_tokens",
metrics::Unit::Count,
"Generated tokens per request"
);
metrics::describe_counter!(
"tgi_batch_inference_count",
metrics::Unit::Count,
"Inference calls per method (prefill or decode)"
);
metrics::describe_counter!(
"tgi_request_count",
metrics::Unit::Count,
"Total number of requests"
);
metrics::describe_counter!(
"tgi_batch_inference_success",
metrics::Unit::Count,
"Number of successful inference calls per method (prefill or decode)"
);
metrics::describe_gauge!(
"tgi_batch_current_size",
metrics::Unit::Count,
"Current batch size"
);
metrics::describe_gauge!("tgi_queue_size", metrics::Unit::Count, "Current queue size");
metrics::describe_gauge!(
"tgi_batch_current_max_tokens",
metrics::Unit::Count,
"Maximum tokens for the current batch"
);
metrics::describe_histogram!(
"tgi_request_max_new_tokens",
metrics::Unit::Count,
"Maximum new tokens per request"
);
metrics::describe_histogram!(
"tgi_batch_inference_duration",
metrics::Unit::Seconds,
"Batch inference duration"
);
metrics::describe_histogram!(
"tgi_batch_forward_duration",
metrics::Unit::Seconds,
"Batch forward duration per method (prefill or decode)"
);
metrics::describe_histogram!(
"tgi_request_skipped_tokens",
metrics::Unit::Count,
"Speculated tokens per request"
);
metrics::describe_histogram!(
"tgi_batch_filter_duration",
metrics::Unit::Seconds,
"Time spent filtering batches and sending generated tokens per method (prefill or decode)"
);
metrics::describe_histogram!(
"tgi_request_queue_duration",
metrics::Unit::Seconds,
"Time spent in the queue per request"
);
metrics::describe_histogram!(
"tgi_request_validation_duration",
metrics::Unit::Seconds,
"Time spent validating the request"
);
metrics::describe_histogram!(
"tgi_request_duration",
metrics::Unit::Seconds,
"Total time spent processing the request"
);
metrics::describe_histogram!(
"tgi_batch_decode_duration",
metrics::Unit::Seconds,
"Time spent decoding a batch per method (prefill or decode)"
);
metrics::describe_histogram!(
"tgi_request_input_length",
metrics::Unit::Count,
"Input token length per request"
);
metrics::describe_histogram!(
"tgi_batch_next_size",
metrics::Unit::Count,
"Batch size of the next batch"
);
// CORS layer // CORS layer
let allow_origin = allow_origin.unwrap_or(AllowOrigin::any()); let allow_origin = allow_origin.unwrap_or(AllowOrigin::any());
let cors_layer = CorsLayer::new() let cors_layer = CorsLayer::new()

View File

@ -63,7 +63,7 @@ def check_cli(check: bool):
final_doc += f"## {header}\n```shell\n{rendered_block}\n```\n" final_doc += f"## {header}\n```shell\n{rendered_block}\n```\n"
block = [] block = []
filename = "docs/source/basic_tutorials/launcher.md" filename = "docs/source/reference/launcher.md"
if check: if check:
with open(filename, "r") as f: with open(filename, "r") as f:
doc = f.read() doc = f.read()