hf_text-generation-inference/docs/source/reference/metrics.md

# Metrics

TGI exposes multiple metrics that can be collected via the `/metrics` Prometheus endpoint.
These metrics can be used to monitor the performance of TGI, autoscale deployment and to help identify bottlenecks.

The following metrics are exposed:

| Metric Name                                | Description                                                                              | Type      | Unit    |
|--------------------------------------------|------------------------------------------------------------------------------------------|-----------|---------|
| `tgi_batch_current_max_tokens`             | Maximum tokens for the current batch                                                     | Gauge     | Count   |
| `tgi_batch_current_size`                   | Current batch size                                                                       | Gauge     | Count   |
| `tgi_batch_decode_duration`                | Time spent decoding a batch per method (prefill or decode)                               | Histogram | Seconds |
| `tgi_batch_filter_duration`                | Time spent filtering batches and sending generated tokens per method (prefill or decode) | Histogram | Seconds |
| `tgi_batch_forward_duration`               | Batch forward duration per method (prefill or decode)                                    | Histogram | Seconds |
| `tgi_batch_inference_count`                | Inference calls per method (prefill or decode)                                           | Counter   | Count   |
| `tgi_batch_inference_duration`             | Batch inference duration                                                                 | Histogram | Seconds |
| `tgi_batch_inference_success`              | Number of successful inference calls per method (prefill or decode)                      | Counter   | Count   |
| `tgi_batch_next_size`                      | Batch size of the next batch                                                             | Histogram | Count   |
| `tgi_queue_size`                           | Current queue size                                                                       | Gauge     | Count   |
| `tgi_request_count`                        | Total number of requests                                                                 | Counter   | Count   |
| `tgi_request_duration`                     | Total time spent processing the request (e2e latency)                                    | Histogram | Seconds |
| `tgi_request_generated_tokens`             | Generated tokens per request                                                             | Histogram | Count   |
| `tgi_request_inference_duration`           | Request inference duration                                                               | Histogram | Seconds |
| `tgi_request_input_length`                 | Input token length per request                                                           | Histogram | Count   |
| `tgi_request_max_new_tokens`               | Maximum new tokens per request                                                           | Histogram | Count   |
| `tgi_request_mean_time_per_token_duration` | Mean time per token per request (inter-token latency)                                    | Histogram | Seconds |
| `tgi_request_queue_duration`               | Time spent in the queue per request                                                      | Histogram | Seconds |
| `tgi_request_skipped_tokens`               | Speculated tokens per request                                                            | Histogram | Count   |
| `tgi_request_success`                      | Number of successful requests                                                            | Counter   |         |
| `tgi_request_validation_duration`          | Time spent validating the request                                                        | Histogram | Seconds |
doc: Add metrics documentation and add a 'Reference' section (#2230) * doc: Add metrics documentation and add a 'Reference' section * doc: Add API reference * doc: Refactor API reference * fix: Message API link * Bad rebase * Moving the docs. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> 2024-08-16 11:43:30 -06:00			`# Metrics`

			TGI exposes multiple metrics that can be collected via the `/metrics` Prometheus endpoint.
			`These metrics can be used to monitor the performance of TGI, autoscale deployment and to help identify bottlenecks.`

			`The following metrics are exposed:`

			`\| Metric Name \| Description \| Type \| Unit \|`
			`\|--------------------------------------------\|------------------------------------------------------------------------------------------\|-----------\|---------\|`
			\| `tgi_batch_current_max_tokens` \| Maximum tokens for the current batch \| Gauge \| Count \|
			\| `tgi_batch_current_size` \| Current batch size \| Gauge \| Count \|
			\| `tgi_batch_decode_duration` \| Time spent decoding a batch per method (prefill or decode) \| Histogram \| Seconds \|
			\| `tgi_batch_filter_duration` \| Time spent filtering batches and sending generated tokens per method (prefill or decode) \| Histogram \| Seconds \|
			\| `tgi_batch_forward_duration` \| Batch forward duration per method (prefill or decode) \| Histogram \| Seconds \|
			\| `tgi_batch_inference_count` \| Inference calls per method (prefill or decode) \| Counter \| Count \|
			\| `tgi_batch_inference_duration` \| Batch inference duration \| Histogram \| Seconds \|
			\| `tgi_batch_inference_success` \| Number of successful inference calls per method (prefill or decode) \| Counter \| Count \|
			\| `tgi_batch_next_size` \| Batch size of the next batch \| Histogram \| Count \|
			\| `tgi_queue_size` \| Current queue size \| Gauge \| Count \|
			\| `tgi_request_count` \| Total number of requests \| Counter \| Count \|
			\| `tgi_request_duration` \| Total time spent processing the request (e2e latency) \| Histogram \| Seconds \|
			\| `tgi_request_generated_tokens` \| Generated tokens per request \| Histogram \| Count \|
			\| `tgi_request_inference_duration` \| Request inference duration \| Histogram \| Seconds \|
			\| `tgi_request_input_length` \| Input token length per request \| Histogram \| Count \|
			\| `tgi_request_max_new_tokens` \| Maximum new tokens per request \| Histogram \| Count \|
			\| `tgi_request_mean_time_per_token_duration` \| Mean time per token per request (inter-token latency) \| Histogram \| Seconds \|
			\| `tgi_request_queue_duration` \| Time spent in the queue per request \| Histogram \| Seconds \|
			\| `tgi_request_skipped_tokens` \| Speculated tokens per request \| Histogram \| Count \|
			\| `tgi_request_success` \| Number of successful requests \| Counter \| \|
			\| `tgi_request_validation_duration` \| Time spent validating the request \| Histogram \| Seconds \|