synapse-old/docs/metrics-howto.md

19 KiB

How to monitor Synapse metrics using Prometheus

  1. Install Prometheus:

    Follow instructions at http://prometheus.io/docs/introduction/install/

  2. Enable Synapse metrics:

    In homeserver.yaml, make sure enable_metrics is set to True.

  3. Enable the /_synapse/metrics Synapse endpoint that Prometheus uses to collect data:

    There are two methods of enabling the metrics endpoint in Synapse.

    The first serves the metrics as a part of the usual web server and can be enabled by adding the metrics resource to the existing listener as such as in this example:

    listeners:
      - port: 8008
        tls: false
        type: http
        x_forwarded: true
        bind_addresses: ['::1', '127.0.0.1']
    
        resources:
          # added "metrics" in this line
          - names: [client, federation, metrics]
            compress: false
    

    This provides a simple way of adding metrics to your Synapse installation, and serves under /_synapse/metrics. If you do not wish your metrics be publicly exposed, you will need to either filter it out at your load balancer, or use the second method.

    The second method runs the metrics server on a different port, in a different thread to Synapse. This can make it more resilient to heavy load meaning metrics cannot be retrieved, and can be exposed to just internal networks easier. The served metrics are available over HTTP only, and will be available at /_synapse/metrics.

    Add a new listener to homeserver.yaml as in this example:

    listeners:
      - port: 8008
        tls: false
        type: http
        x_forwarded: true
        bind_addresses: ['::1', '127.0.0.1']
    
        resources:
          - names: [client, federation]
            compress: false
    
      # beginning of the new metrics listener
      - port: 9000
        type: metrics
        bind_addresses: ['::1', '127.0.0.1']
    
  4. Restart Synapse.

  5. Add a Prometheus target for Synapse.

    It needs to set the metrics_path to a non-default value (under scrape_configs):

      - job_name: "synapse"
        scrape_interval: 15s
        metrics_path: "/_synapse/metrics"
        static_configs:
          - targets: ["my.server.here:port"]
    

    where my.server.here is the IP address of Synapse, and port is the listener port configured with the metrics resource.

    If your prometheus is older than 1.5.2, you will need to replace static_configs in the above with target_groups.

  6. Restart Prometheus.

  7. Consider using the grafana dashboard and required recording rules

Monitoring workers

To monitor a Synapse installation using workers, every worker needs to be monitored independently, in addition to the main homeserver process. This is because workers don't send their metrics to the main homeserver process, but expose them directly (if they are configured to do so).

To allow collecting metrics from a worker, you need to add a metrics listener to its configuration, by adding the following under worker_listeners:

  - type: metrics
    bind_address: ''
    port: 9101

The bind_address and port parameters should be set so that the resulting listener can be reached by prometheus, and they don't clash with an existing worker. With this example, the worker's metrics would then be available on http://127.0.0.1:9101.

Example Prometheus target for Synapse with workers:

  - job_name: "synapse"
    scrape_interval: 15s
    metrics_path: "/_synapse/metrics"
    static_configs:
      - targets: ["my.server.here:port"]
        labels:
          instance: "my.server"
          job: "master"
          index: 1
      - targets: ["my.workerserver.here:port"]
        labels:
          instance: "my.server"
          job: "generic_worker"
          index: 1
      - targets: ["my.workerserver.here:port"]
        labels:
          instance: "my.server"
          job: "generic_worker"
          index: 2
      - targets: ["my.workerserver.here:port"]
        labels:
          instance: "my.server"
          job: "media_repository"
          index: 1

Labels (instance, job, index) can be defined as anything. The labels are used to group graphs in grafana.

Renaming of metrics & deprecation of old names in 1.2

Synapse 1.2 updates the Prometheus metrics to match the naming convention of the upstream prometheus_client. The old names are considered deprecated and will be removed in a future version of Synapse. The old names will be disabled by default in Synapse v1.71.0 and removed altogether in Synapse v1.73.0.

New Name Old Name
python_gc_objects_collected_total python_gc_objects_collected
python_gc_objects_uncollectable_total python_gc_objects_uncollectable
python_gc_collections_total python_gc_collections
process_cpu_seconds_total process_cpu_seconds
synapse_federation_client_sent_transactions_total synapse_federation_client_sent_transactions
synapse_federation_client_events_processed_total synapse_federation_client_events_processed
synapse_event_processing_loop_count_total synapse_event_processing_loop_count
synapse_event_processing_loop_room_count_total synapse_event_processing_loop_room_count
synapse_util_caches_cache_hits synapse_util_caches_cache:hits
synapse_util_caches_cache_size synapse_util_caches_cache:size
synapse_util_caches_cache_evicted_size synapse_util_caches_cache:evicted_size
synapse_util_caches_cache synapse_util_caches_cache:total
synapse_util_caches_response_cache_size synapse_util_caches_response_cache:size
synapse_util_caches_response_cache_hits synapse_util_caches_response_cache:hits
synapse_util_caches_response_cache_evicted_size synapse_util_caches_response_cache:evicted_size
synapse_util_metrics_block_count_total synapse_util_metrics_block_count
synapse_util_metrics_block_time_seconds_total synapse_util_metrics_block_time_seconds
synapse_util_metrics_block_ru_utime_seconds_total synapse_util_metrics_block_ru_utime_seconds
synapse_util_metrics_block_ru_stime_seconds_total synapse_util_metrics_block_ru_stime_seconds
synapse_util_metrics_block_db_txn_count_total synapse_util_metrics_block_db_txn_count
synapse_util_metrics_block_db_txn_duration_seconds_total synapse_util_metrics_block_db_txn_duration_seconds
synapse_util_metrics_block_db_sched_duration_seconds_total synapse_util_metrics_block_db_sched_duration_seconds
synapse_background_process_start_count_total synapse_background_process_start_count
synapse_background_process_ru_utime_seconds_total synapse_background_process_ru_utime_seconds
synapse_background_process_ru_stime_seconds_total synapse_background_process_ru_stime_seconds
synapse_background_process_db_txn_count_total synapse_background_process_db_txn_count
synapse_background_process_db_txn_duration_seconds_total synapse_background_process_db_txn_duration_seconds
synapse_background_process_db_sched_duration_seconds_total synapse_background_process_db_sched_duration_seconds
synapse_storage_events_persisted_events_total synapse_storage_events_persisted_events
synapse_storage_events_persisted_events_sep_total synapse_storage_events_persisted_events_sep
synapse_storage_events_state_delta_total synapse_storage_events_state_delta
synapse_storage_events_state_delta_single_event_total synapse_storage_events_state_delta_single_event
synapse_storage_events_state_delta_reuse_delta_total synapse_storage_events_state_delta_reuse_delta
synapse_federation_server_received_pdus_total synapse_federation_server_received_pdus
synapse_federation_server_received_edus_total synapse_federation_server_received_edus
synapse_handler_presence_notified_presence_total synapse_handler_presence_notified_presence
synapse_handler_presence_federation_presence_out_total synapse_handler_presence_federation_presence_out
synapse_handler_presence_presence_updates_total synapse_handler_presence_presence_updates
synapse_handler_presence_timers_fired_total synapse_handler_presence_timers_fired
synapse_handler_presence_federation_presence_total synapse_handler_presence_federation_presence
synapse_handler_presence_bump_active_time_total synapse_handler_presence_bump_active_time
synapse_federation_client_sent_edus_total synapse_federation_client_sent_edus
synapse_federation_client_sent_pdu_destinations_count_total synapse_federation_client_sent_pdu_destinations:count
synapse_federation_client_sent_pdu_destinations_total synapse_federation_client_sent_pdu_destinations:total
synapse_handlers_appservice_events_processed_total synapse_handlers_appservice_events_processed
synapse_notifier_notified_events_total synapse_notifier_notified_events
synapse_push_bulk_push_rule_evaluator_push_rules_invalidation_counter_total synapse_push_bulk_push_rule_evaluator_push_rules_invalidation_counter
synapse_push_bulk_push_rule_evaluator_push_rules_state_size_counter_total synapse_push_bulk_push_rule_evaluator_push_rules_state_size_counter
synapse_http_httppusher_http_pushes_processed_total synapse_http_httppusher_http_pushes_processed
synapse_http_httppusher_http_pushes_failed_total synapse_http_httppusher_http_pushes_failed
synapse_http_httppusher_badge_updates_processed_total synapse_http_httppusher_badge_updates_processed
synapse_http_httppusher_badge_updates_failed_total synapse_http_httppusher_badge_updates_failed
synapse_admin_mau_current synapse_admin_mau:current
synapse_admin_mau_max synapse_admin_mau:max
synapse_admin_mau_registered_reserved_users synapse_admin_mau:registered_reserved_users

Removal of deprecated metrics & time based counters becoming histograms in 0.31.0

The duplicated metrics deprecated in Synapse 0.27.0 have been removed.

All time duration-based metrics have been changed to be seconds. This affects:

msec -> sec metrics
python_gc_time
python_twisted_reactor_tick_time
synapse_storage_query_time
synapse_storage_schedule_time
synapse_storage_transaction_time

Several metrics have been changed to be histograms, which sort entries into buckets and allow better analysis. The following metrics are now histograms:

Altered metrics
python_gc_time
python_twisted_reactor_pending_calls
python_twisted_reactor_tick_time
synapse_http_server_response_time_seconds
synapse_storage_query_time
synapse_storage_schedule_time
synapse_storage_transaction_time

Block and response metrics renamed for 0.27.0

Synapse 0.27.0 begins the process of rationalising the duplicate *:count metrics reported for the resource tracking for code blocks and HTTP requests.

At the same time, the corresponding *:total metrics are being renamed, as the :total suffix no longer makes sense in the absence of a corresponding :count metric.

To enable a graceful migration path, this release just adds new names for the metrics being renamed. A future release will remove the old ones.

The following table shows the new metrics, and the old metrics which they are replacing.

New name Old name
synapse_util_metrics_block_count synapse_util_metrics_block_timer:count
synapse_util_metrics_block_count synapse_util_metrics_block_ru_utime:count
synapse_util_metrics_block_count synapse_util_metrics_block_ru_stime:count
synapse_util_metrics_block_count synapse_util_metrics_block_db_txn_count:count
synapse_util_metrics_block_count synapse_util_metrics_block_db_txn_duration:count
synapse_util_metrics_block_time_seconds synapse_util_metrics_block_timer:total
synapse_util_metrics_block_ru_utime_seconds synapse_util_metrics_block_ru_utime:total
synapse_util_metrics_block_ru_stime_seconds synapse_util_metrics_block_ru_stime:total
synapse_util_metrics_block_db_txn_count synapse_util_metrics_block_db_txn_count:total
synapse_util_metrics_block_db_txn_duration_seconds synapse_util_metrics_block_db_txn_duration:total
synapse_http_server_response_count synapse_http_server_requests
synapse_http_server_response_count synapse_http_server_response_time:count
synapse_http_server_response_count synapse_http_server_response_ru_utime:count
synapse_http_server_response_count synapse_http_server_response_ru_stime:count
synapse_http_server_response_count synapse_http_server_response_db_txn_count:count
synapse_http_server_response_count synapse_http_server_response_db_txn_duration:count
synapse_http_server_response_time_seconds synapse_http_server_response_time:total
synapse_http_server_response_ru_utime_seconds synapse_http_server_response_ru_utime:total
synapse_http_server_response_ru_stime_seconds synapse_http_server_response_ru_stime:total
synapse_http_server_response_db_txn_count synapse_http_server_response_db_txn_count:total
synapse_http_server_response_db_txn_duration_seconds synapse_http_server_response_db_txn_duration:total

Standard Metric Names

As of synapse version 0.18.2, the format of the process-wide metrics has been changed to fit prometheus standard naming conventions. Additionally the units have been changed to seconds, from milliseconds.

New name Old name
process_cpu_user_seconds_total process_resource_utime / 1000
process_cpu_system_seconds_total process_resource_stime / 1000
process_open_fds (no 'type' label) process_fds

The python-specific counts of garbage collector performance have been renamed.

New name Old name
python_gc_time reactor_gc_time
python_gc_unreachable_total reactor_gc_unreachable
python_gc_counts reactor_gc_counts

The twisted-specific reactor metrics have been renamed.

New name Old name
python_twisted_reactor_pending_calls reactor_pending_calls
python_twisted_reactor_tick_time reactor_tick_time