<h2 id="understanding-synapse-through-grafana-graphs"><a class="header" href="#understanding-synapse-through-grafana-graphs">Understanding Synapse through Grafana graphs</a></h2>
<p>It is possible to monitor much of the internal state of Synapse using <a href="https://prometheus.io">Prometheus</a>
metrics and <a href="https://grafana.com/">Grafana</a>.
A guide for configuring Synapse to provide metrics is available <a href="../../metrics-howto.html">here</a>
and information on setting up Grafana is <a href="https://github.com/matrix-org/synapse/tree/master/contrib/grafana">here</a>.
In this setup, Prometheus will periodically scrape the information Synapse provides and
store a record of it over time. Grafana is then used as an interface to query and
present this information through a series of pretty graphs.</p>
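<p>As a rough sketch of how the two pieces fit together (the port, bind address and scrape target below are illustrative assumptions; follow the linked guides for the authoritative settings), Synapse exposes a metrics listener and Prometheus scrapes it:</p>
<pre><code class="language-yaml"># homeserver.yaml - expose a Prometheus metrics endpoint
# (port 9000 and the bind address are example values)
enable_metrics: true
listeners:
  - port: 9000
    type: metrics
    bind_addresses: ['127.0.0.1']

# prometheus.yml - scrape that endpoint on Prometheus's regular interval
scrape_configs:
  - job_name: "synapse"
    metrics_path: "/_synapse/metrics"
    static_configs:
      - targets: ["127.0.0.1:9000"]
</code></pre>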
<p>Once you have Grafana set up, and assuming you're using <a href="https://github.com/matrix-org/synapse/blob/master/contrib/grafana/synapse.json">our Grafana dashboard template</a>, look for the following graphs when debugging a slow/overloaded Synapse:</p>
<h2 id="message-event-send-time"><a class="header" href="#message-event-send-time">Message Event Send Time</a></h2>
<p>This graph, along with the CPU and Memory graphs, is a good way to check the general health of your Synapse instance. It represents how long it takes for a user on your homeserver to send a message.</p>
<h2 id="transaction-count-and-transaction-duration"><a class="header" href="#transaction-count-and-transaction-duration">Transaction Count and Transaction Duration</a></h2>
<p>These graphs show the database transactions that are occurring most frequently, as well as those that are taking the longest to execute.</p>
<p>In the first graph, we can see obvious spikes corresponding to lots of <code>get_user_by_id</code> transactions. This is useful information for figuring out which part of the Synapse codebase is potentially creating a heavy load on the system. However, be sure to cross-reference this with Transaction Duration, which shows that <code>get_user_by_id</code> is actually a very quick database transaction and isn't causing as much load as others, like <code>persist_events</code>.</p>
<p>Still, it's probably worth investigating why we're getting users from the database that often, and whether it's possible to reduce the number of queries we make by adjusting our cache factor(s).</p>
<p>The <code>persist_events</code> transaction is responsible for saving new room events to the Synapse database, so it can often show a high transaction duration.</p>
<h2 id="federation"><a class="header" href="#federation">Federation</a></h2>
<p>The charts in the "Federation" section show information about incoming and outgoing federation requests. Federation data can be divided into two basic types:</p>
<ul>
<li>PDU (Persistent Data Unit) - room events: messages, state events (join/leave), etc. These are permanently stored in the database.</li>
<li>EDU (Ephemeral Data Unit) - other data, which need not be stored permanently, such as read receipts and typing notifications.</li>
</ul>
<p>The "Outgoing EDUs by type" chart shows the EDUs within outgoing federation requests by type: <code>m.device_list_update</code>, <code>m.direct_to_device</code>, <code>m.presence</code>, <code>m.receipt</code>, <code>m.typing</code>.</p>
<p>If you see a large number of <code>m.presence</code> EDUs and are having trouble with too much CPU load, you can disable <code>presence</code> in the Synapse config. See also <a href="https://github.com/matrix-org/synapse/issues/3971">#3971</a>.</p>
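<p>As a minimal sketch of that change (the exact option depends on your Synapse version; older releases used a top-level <code>use_presence</code> setting instead):</p>
<pre><code class="language-yaml"># homeserver.yaml - disable presence tracking to cut CPU and federation
# traffic; users will no longer see each other's online/offline status
presence:
  enabled: false
</code></pre>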
<h2 id="caches"><a class="header" href="#caches">Caches</a></h2>
<p>This is quite a useful graph. It shows how many times Synapse attempts to retrieve a piece of data from a cache which the cache did not contain, thus resulting in a call to the database. We can see here that the <code>_get_joined_profile_from_event_id</code> cache is being requested a lot, and often the data we're after is not cached.</p>
<p>Cross-referencing this with the Eviction Rate graph, which shows that entries are being evicted from <code>_get_joined_profile_from_event_id</code> quite often, we should probably consider raising the size of that cache by raising its cache factor (a multiplier value for the size of an individual cache). Information on doing so is available <a href="https://github.com/matrix-org/synapse/blob/ee421e524478c1ad8d43741c27379499c2f6135c/docs/sample_config.yaml#L608-L642">here</a> (note that configuring individual cache factors through the configuration file is only possible in Synapse v1.14.0+, whereas doing so through environment variables has been supported for a very long time). Note that this will increase Synapse's overall memory usage.</p>
<h2 id="forward-extremities"><a class="header" href="#forward-extremities">Forward Extremities</a></h2>
<p>Forward extremities are the leaf events at the end of a DAG in a room, aka events that have no children. The more that exist in a room, the more <a href="https://spec.matrix.org/v1.1/server-server-api/#room-state-resolution">state resolution</a> Synapse needs to perform (hint: it's an expensive operation). While Synapse has code to prevent too many of these existing at one time in a room, bugs can sometimes make them crop up again.</p>
<p>If a room has >10 forward extremities, it's worth checking which room is the culprit and potentially removing them using the SQL queries mentioned in <a href="https://github.com/matrix-org/synapse/issues/1760">#1760</a>.</p>