Commit Graph

704 Commits

Author SHA1 Message Date
David Robertson c3627d0f99
Correctly read to-device stream pos on SQLite (#16682) 2023-11-24 13:42:38 +00:00
Erik Johnston 6fec2d035f
Also discard 'caches' and 'backfill' stream POSITIONS (#16655)
Follow on from #16640
2023-11-17 14:14:29 +00:00
Erik Johnston fef08cbee8
Fix sending out of order `POSITION` over replication (#16639)
If a worker reconnects to Redis we send out the current positions of all our streams. However, if we're also trying to send out a backlog of RDATA at the same time then we can end up sending a `POSITION` with the current token *before* we've sent all the RDATA before the current token.

This doesn't cause actual bugs as the receiving servers see the POSITION, fetch the relevant rows from the DB, and then ignore the old RDATA as they come in. However, this is inefficient so it'd be better if we didn't  send out-of-order positions
2023-11-16 13:05:09 +00:00
Erik Johnston 898655fd12
More efficiently handle no-op POSITION (#16640)
We may receive `POSITION` commands where we already know that worker has
advanced past that position, so there is no point in handling it.
2023-11-16 12:32:17 +00:00
Erik Johnston 408c13801a
Add fast path for replication events stream fetch (#16580)
We can bail early if the from token is greater than or equal to the
current token.
2023-10-30 14:47:57 +00:00
Erik Johnston 8c63e93286
Fix HTTP repl response to use minimum token (#16578) 2023-10-30 12:27:14 +00:00
Erik Johnston 5413cefe32
Reduce amount of caches POSITIONS we send (#16561)
Follow on from / actually correctly does #16557
2023-10-27 16:07:11 +01:00
Erik Johnston 89dbbd68e1
Reduce spurious replication catchup (#16555) 2023-10-27 13:27:20 +00:00
Erik Johnston 0680d76659
Reduce replication traffic due to reflected cache stream POSITION (#16557) 2023-10-27 12:51:08 +01:00
Erik Johnston ba47fea528
Allow multiple workers to write to receipts stream. (#16432)
Fixes #16417
2023-10-25 16:16:19 +01:00
Jason Little ffbe9b7666
Remove duplicate call to wake a remote destination when using federation sending worker (#16515) 2023-10-24 08:09:59 -04:00
Erik Johnston 8f35f8148e
Fix bug where a new writer advances their token too quickly (#16473)
* Fix bug where a new writer advances their token too quickly

When starting a new writer (for e.g. persisting events), the
`MultiWriterIdGenerator` doesn't have a minimum token for it as there
are no rows matching that new writer in the DB.

This results in the the first stream ID it acquired being announced as
persisted *before* it actually finishes persisting, if another writer
gets and persists a subsequent stream ID. This is due to the logic of
setting the minimum persisted position to the minimum known position of
across all writers, and the new writer starts off not being considered.

* Fix sending out POSITIONs when our token advances without update

Broke in #14820

* For replication HTTP requests, only wait for minimal position
2023-10-23 16:57:30 +01:00
Patrick Cloke 49c9745b45
Avoid sending massive replication updates when purging a room. (#16510) 2023-10-18 12:26:01 -04:00
Richard van der Hoff 109882230c
Clean up logging on event persister endpoints (#16488) 2023-10-14 17:57:27 +01:00
Patrick Cloke ae5b997cfa
Fix comments related to replication. (#16428) 2023-10-06 07:25:44 -04:00
Patrick Cloke 4e302b30b6
Add __slots__ to replication commands. (#16429)
To slightly reduce the amount of memory each command takes.
2023-10-05 07:38:55 -04:00
Erik Johnston 80ec81dcc5
Some refactors around receipts stream (#16426) 2023-10-04 16:28:40 +01:00
Erik Johnston 20fb08ec80
Downgrade repl stream time out error to warning (#16401)
This is because if a worker reaches ~100% CPU then everything starts
lagging and we hit the log line a lot. When at error we invoke sentry
and that has a lot of overhead, which then puts even more pressure on
the worker.
2023-09-29 11:52:48 +00:00
Patrick Cloke f84da3c32e
Add a cache around server ACL checking (#16360)
* Pre-compiles the server ACLs onto an object per room and
  invalidates them when new events come in.
* Converts the server ACL checking into Rust.
2023-09-26 11:57:50 -04:00
Erik Johnston 329597022e
Some minor performance fixes for task schedular (#16313) 2023-09-14 16:20:47 +01:00
Erik Johnston ab13fb08bf
Improve logging of replication (#16309) 2023-09-13 09:51:50 +00:00
Erik Johnston 1cd410a783
Recheck if remote device is cached before requesting it (#16252)
This fixes a bug where we could get stuck re-requesting the device over
replication again and again.
2023-09-07 12:45:43 +00:00
Patrick Cloke 55c20da4a3 Merge remote-tracking branch 'origin/release-v1.91' into release-v1.92 2023-09-06 11:25:28 -04:00
Quentin Gliech 1940d990a3
Revert MSC3861 introspection cache, admin impersonation and account lock (#16258) 2023-09-06 15:19:51 +01:00
Erik Johnston d35bed8369
Don't wake up destination transaction queue if they're not due for retry. (#16223) 2023-09-04 17:14:09 +01:00
David Robertson e9eb26e3af
Cache device resync requests over replication (#16241) 2023-09-04 11:57:59 +01:00
Patrick Cloke e9235d92f2
Track currently syncing users by device for presence (#16172)
Refactoring to use both the user ID & the device ID when tracking
the currently syncing users in the presence handler.

This is done both locally and over replication. Note that the device
ID is discarded but will be used in a future change.
2023-08-29 11:44:07 -04:00
Patrick Cloke 40901af5e0
Pass the device ID around in the presence handler (#16171)
Refactoring to pass the device ID (in addition to the user ID) through
the presence handler (specifically the `user_syncing`, `set_state`,
and `bump_presence_active_time` methods and their replication
versions).
2023-08-28 13:08:49 -04:00
Patrick Cloke 1bf143699c
Combine logic about not overriding BUSY presence. (#16170)
Simplify some of the presence code by reducing duplicated code between
worker & non-worker modes.

The main change is to push some of the logic from `user_syncing` into
`set_state`. This is done by passing whether the user is setting the presence
via a `/sync` with a new `is_sync` flag to `set_state`. If this is `true` some
additional logic is performed:

* Don't override `busy` presence.
* Update the `last_user_sync_ts`.
* Never update the status message.
2023-08-28 11:03:23 -04:00
Mathieu Velten 501da8ecd8
Task scheduler: add replication notify for new task to launch ASAP (#16184) 2023-08-28 14:03:51 +00:00
Erik Johnston 803f63df1c
Fix perf of `wait_for_stream_positions` (#16148) 2023-08-22 15:11:22 +00:00
Shay 69048f7b48
Add an admin endpoint to allow authorizing server to signal token revocations (#16125) 2023-08-22 14:15:34 +00:00
Patrick Cloke ad3f43be9a
Run pyupgrade for python 3.7 & 3.8. (#16110) 2023-08-15 08:11:20 -04:00
Erik Johnston ae55cc1e6b
Add ability to wait for locks and add locks to purge history / room deletion (#15791)
c.f. #13476
2023-07-31 10:58:03 +01:00
Shay 68b2611783
Clarify comment on key uploads over replication (#16016) 2023-07-27 15:08:46 -07:00
Jason Little c835befd10
Add Unix socket support for Redis connections (#15644)
Adds a new configuration setting to connect to Redis via a Unix
socket instead of over TCP. Disabled by default.
2023-05-26 15:28:39 -04:00
Jason Little 1df0221bda
Use a custom scheme & the worker name for replication requests. (#15578)
All the information needed is already in the `instance_map`, so
use that instead of passing the hostname / IP & port manually
for each replication request.

This consolidates logic for future improvements of using e.g.
UNIX sockets for workers.
2023-05-23 09:05:30 -04:00
Patrick Cloke 375b0a8a11
Update code to refer to "workers". (#15606)
A bunch of comments and variables are out of date and use
obsolete terms.
2023-05-16 15:56:38 -04:00
Roel ter Maat 2611433b70
Add redis SSL configuration options (#15312)
* Add SSL options to redis config

* fix lint issues

* Add documentation and changelog file

* add missing . at the end of the changelog

* Move client context factory to new file

* Rename ssl to tls and fix typo

* fix lint issues

* Added when redis attributes were added
2023-05-11 13:02:51 +01:00
Jason Little e4f545c452
Remove `worker_replication_*` settings (#15491)
* Add master to the instance_map as part of Complement, have ReplicationEndpoint look at instance_map for master.

* Fix typo in drive by.

* Remove unnecessary worker_replication_* bits from unit tests and add master to instance_map(hopefully in the right place)

* Several updates:

1. Switch from master to main for naming the main process in the instance_map. Add useful constants for easier adjustment of names in the future.
2. Add backwards compatibility for worker_replication_* to allow time to transition to new style. Make sure to prioritize declaring main directly on the instance_map.
3. Clean up old comments/commented out code.
4. Adjust unit tests to match with new code.
5. Adjust Complement setup infrastructure to only add main to the instance_map if workers are used and remove now unused options from the worker.yaml template.

* Initial Docs upload

* Changelog

* Missed some commented out code that can go now

* Remove TODO comment that no longer holds true.

* Fix links in docs

* More docs

* Remove debug logging

* Apply suggestions from code review

Co-authored-by: reivilibre <olivier@librepush.net>

* Apply suggestions from code review

Co-authored-by: reivilibre <olivier@librepush.net>

* Update version to latest, include completeish before/after examples in upgrade notes.

* Fix up and docs too

---------

Co-authored-by: reivilibre <olivier@librepush.net>
2023-05-11 11:30:56 +01:00
Jason Little d3bd03559b
HTTP Replication Client (#15470)
Separate out a HTTP client for replication in preparation for
also supporting using UNIX sockets. The major difference from
the base class is that this does not use treq to handle HTTP
requests.
2023-05-09 14:25:20 -04:00
Alok Kumar Singh 197fbb123b
Remove legacy code of single user device resync api (#15418)
* Removed single-user resync usage and updated it to use multi-user counterpart

Signed-off-by: Alok Kumar Singh alokaks601@gmail.com
2023-04-21 12:06:39 +01:00
Mathieu Velten 9228ae633f
Add some clarification to the doc/comments regarding TCP replication (#15354) 2023-03-30 12:51:35 +02:00
David Robertson 1bc9985eb7
Have replication clients remove _INT_STREAM_POS (#15309)
* Have replication clients remove _INT_STREAM_POS

Suppose worker A makes an internal http request from worker B. B may
make changes that A later learns about over replication. We want A's
request to block until it has seen those changes—mainly to ensure A's
caches are invalidated promptly. This helps provide read-after-write
consistency, eliminating entire categories of races and test flakes.

To implement this, B includes a top-level field `_INT_STREAM_POS` in its
response JSON. Roughly speaking, the field's value tells A what to wait
for. But we weren't removing that internal field before A's request
completed!

Introduced in https://github.com/matrix-org/synapse/pull/14820.
Fixes #15308.

* Changelog
2023-03-22 12:53:55 +00:00
Patrick Cloke afb216c202
Remove no-op send_command for Redis replication. (#15274)
With Redis commands do not need to be re-issued by the main
process (they fan-out to all processes at once) and thus it is no
longer necessary to worry about them reflecting recursively forever.
2023-03-16 11:13:30 -04:00
Patrick Cloke 3bf973edc7
Remove unused class: DirectTcpReplicationClientFactory. (#15272) 2023-03-15 15:42:20 -04:00
Dirk Klimpel ecbe0ddbe7
Add support for knocking to workers. (#15133) 2023-03-02 12:59:53 -05:00
H. Shay b2fd03d075 Merge branch 'master' into develop 2023-02-28 10:14:20 -08:00
Erik Johnston b2357a898c
Fix bug where 5s delays would occasionally happen. (#15150)
This only affects deployments using workers.
2023-02-24 14:39:50 +00:00
dependabot[bot] 9bb2eac719
Bump black from 22.12.0 to 23.1.0 (#15103) 2023-02-22 15:29:09 -05:00