Commit Graph

191 Commits

Author SHA1 Message Date
Richard van der Hoff 2e0c46ca07 Merge branch 'release-v1.13.0' into develop 2020-05-06 11:58:31 +01:00
Richard van der Hoff a8c17da245 Merge branch 'release-v1.13.0' into rav/fix_dropped_messages 2020-05-05 23:01:12 +01:00
Richard van der Hoff 1242267316 Merge branch 'release-v1.13.0' into rav/fix_dropped_messages 2020-05-05 22:38:44 +01:00
Richard van der Hoff 7f7eedbebb Wait for a POSITION on the right connection before accepting RDATA
... otherwise we can believe we're up to date when we're not.
2020-05-05 22:38:16 +01:00
Brendan Abolivier 5b8023dc7f
Move logs about discarded RDATA to debug (#7421) 2020-05-05 21:07:33 +02:00
Richard van der Hoff d78265af0c Wait to subscribe before sending REPLICATE 2020-05-05 19:31:37 +01:00
Richard van der Hoff d5aa7d93ed
Fix catchup-on-reconnect for the Federation Stream (#7374)
looks like we managed to break this during the refactorathon.
2020-05-05 14:15:57 +01:00
Erik Johnston 350421e058
Fix redis password support. (#7401)
We forgot to set the password on the subscriber connection, as well as
not calling super methods for overridden connectionMade/connectionLost
functions.
2020-05-04 14:04:09 +01:00
Erik Johnston 0e719f2398
Thread through instance name to replication client. (#7369)
For in memory streams when fetching updates on workers we need to query the source of the stream, which currently is hard coded to be master. This PR threads through the source instance we received via `POSITION` through to the update function in each stream, which can then be passed to the replication client for in memory streams.
2020-05-01 17:19:56 +01:00
Erik Johnston 3085cde577
Use `stream.current_token()` and remove `stream_positions()` (#7172)
We move the processing of typing and federation replication traffic into their handlers so that `Stream.current_token()` points to a valid token. This allows us to remove `get_streams_to_replicate()` and `stream_positions()`.
2020-05-01 15:21:35 +01:00
Richard van der Hoff b2dba06079
Workaround for assertion errors from db_query_to_update_function (#7378)
Hopefully this is no worse than what we have on master...
2020-05-01 09:25:16 +01:00
Erik Johnston 37f6823f5b
Add instance name to RDATA/POSITION commands (#7364)
This is primarily for allowing us to send those commands from workers, but for now simply allows us to ignore echoed RDATA/POSITION commands that we sent (we get echoes of sent commands when using redis). Currently we log a WARNING on the master process every time we receive an echoed RDATA.
2020-04-29 16:23:08 +01:00
Erik Johnston 3eab76ad43
Don't relay REMOTE_SERVER_UP cmds to same conn. (#7352)
For direct TCP connections we need the master to relay REMOTE_SERVER_UP
commands to the other connections so that all instances get notified
about it. The old implementation just relayed to all connections,
assuming that sending back to the original sender of the command was
safe. This is not true for redis, where commands sent get echoed back to
the sender, which was causing master to effectively infinite loop
sending and then re-receiving REMOTE_SERVER_UP commands that it sent.

The fix is to ensure that we only relay to *other* connections and not
to the connection we received the notification from.

Fixes #7334.
2020-04-29 14:10:59 +01:00
Richard van der Hoff c2e1a2110f
Fix limit logic for EventsStream (#7358)
* Factor out functions for injecting events into database

I want to add some more flexibility to the tools for injecting events into the
database, and I don't want to clutter up HomeserverTestCase with them, so let's
factor them out to a new file.

* Rework TestReplicationDataHandler

This wasn't very easy to work with: the mock wrapping was largely superfluous,
and it's useful to be able to inspect the received rows, and clear out the
received list.

* Fix AssertionErrors being thrown by EventsStream

Part of the problem was that there was an off-by-one error in the assertion,
but also the limit logic was too simple. Fix it all up and add some tests.
2020-04-29 12:30:36 +01:00
Erik Johnston 38919b521e
Run replication streamers on workers (#7146)
Currently we never write to streams from workers, but that will change soon
2020-04-28 13:34:12 +01:00
Richard van der Hoff ce428a1abe Fix EventsStream raising assertions when it falls behind
Figuring out how to correctly limit updates from this stream without dropping
entries is far more complicated than just counting the number of rows being
returned. We need to consider each query separately and, if any one query hits
the limit, truncate the results from the others.

I think this also fixes some potentially long-standing bugs where events or
state changes could get missed if we hit the limit on either query.
2020-04-24 13:59:21 +01:00
Richard van der Hoff 9cbdfb3a2f Make it clear that the limit for an update_function is a target 2020-04-23 15:45:12 +01:00
Richard van der Hoff 23b28266ac Remove 'limit' param from `get_repl_stream_updates` API
there doesn't seem to be much point in passing this limit all around, since
both sides agree it's meant to be 100.
2020-04-23 15:44:35 +01:00
Richard van der Hoff 71a1abb8a1
Stop the master relaying USER_SYNC for other workers (#7318)
Long story short: if we're handling presence on the current worker, we shouldn't be sending USER_SYNC commands over replication.

In an attempt to figure out what is going on here, I ended up refactoring some bits of the presencehandler code, so the first 4 commits here are non-functional refactors to move this code slightly closer to sanity. (There's still plenty to do here :/). Suggest reviewing individual commits.

Fixes (I hope) #7257.
2020-04-22 22:39:04 +01:00
Erik Johnston 841c581c40
Fix replication metrics when using redis (#7325) 2020-04-22 16:26:19 +01:00
Richard van der Hoff 82d8b1dd1f
Another go at fixing one-word commands (#7326)
I messed this up last time I tried (#7239 / e13c6c7).
2020-04-22 14:34:31 +01:00
Erik Johnston 51f7eaf908
Add ability to run replication protocol over redis. (#7040)
This is configured via the `redis` config options.
2020-04-22 13:07:41 +01:00
Richard van der Hoff 0f8f02bc39
On catchup, process each row with its own stream id (#7286)
Other parts of the code (such as the StreamChangeCache) assume that there will
not be multiple changes with the same stream id.

This code was introduced in #7024, and I hope this fixes #7206.
2020-04-20 11:43:29 +01:00
Richard van der Hoff 67ff7b8ba0
Improve type checking in `replication.tcp.Stream` (#7291)
The general idea here is to get rid of the type: ignore annotations on all of the current_token and update_function assignments, which would have caught #7290.

After a bit of experimentation, it seems like the least-awful way to do this is to pass the offending functions in as parameters to the Stream constructor. Unfortunately that means that the concrete implementations no longer have the same constructor signature as Stream itself, which means that it gets hard to correctly annotate STREAMS_MAP.

I've also introduced a couple of new types, to take out some duplication.
2020-04-17 14:49:55 +01:00
Richard van der Hoff d7d42387f5
Fix 'generator object is not subscriptable' error (#7290)
Some of the query functions return generators rather than lists, so we can't
index into the result. Happily we already have a copy of the results.

(think this was introduced in #7024)
2020-04-16 14:37:06 +01:00
Richard van der Hoff e13c6c7a96 Handle one-word replication commands correctly
`REPLICATE` is now a valid command, and it's nice if you can issue it from the
console without remembering to call it `REPLICATE ` with a trailing space.
2020-04-07 17:43:46 +01:00
Richard van der Hoff c3e4b4edb2 Fix warnings about not calling superclass constructor
Separate `SimpleCommand` from `Command`, so that things which don't want to use
the `data` property don't have to, and thus fix the warnings PyCharm was giving
me about not calling `__init__` in the base class.
2020-04-07 17:40:22 +01:00
Richard van der Hoff 6a519a0ca0 Remove vestigal references to SYNC replication command
We've ripped pretty much all of this out: let's remove the remains.
2020-04-07 17:40:07 +01:00
Erik Johnston ce72355d7f
Fix race in replication (#7226)
Fixes a race between handling `POSITION` and `RDATA` commands. We do this by simply linearizing handling of them.
2020-04-07 11:01:04 +01:00
Erik Johnston 82498ee901
Move server command handling out of TCP protocol (#7187)
This completes the merging of server and client command processing.
2020-04-07 10:51:07 +01:00
Erik Johnston 5016b162fc
Move client command handling out of TCP protocol (#7185)
The aim here is to move the command handling out of the TCP protocol classes and to also merge the client and server command handling (so that we can reuse them for redis protocol). This PR simply moves the client paths to the new `ReplicationCommandHandler`, a future PR will move the server paths too.
2020-04-06 09:58:42 +01:00
Erik Johnston dfa0782254
Remove connections per replication stream metric. (#7195)
This broke in a recent PR (#7024) and is no longer useful due to all
replication clients implicitly subscribing to all streams, so let's
just remove it.
2020-04-01 10:40:46 +01:00
Erik Johnston 4f21c33be3
Remove usage of "conn_id" for presence. (#7128)
* Remove `conn_id` usage for UserSyncCommand.

Each tcp replication connection is assigned a "conn_id", which is used
to give an ID to a remotely connected worker. In a redis world, there
will no longer be a one to one mapping between connection and instance,
so instead we need to replace such usages with an ID generated by the
remote instances and included in the replicaiton commands.

This really only effects UserSyncCommand.

* Add CLEAR_USER_SYNCS command that is sent on shutdown.

This should help with the case where a synchrotron gets restarted
gracefully, rather than rely on 5 minute timeout.
2020-03-30 16:37:24 +01:00
Erik Johnston 4cff617df1
Move catchup of replication streams to worker. (#7024)
This changes the replication protocol so that the server does not send down `RDATA` for rows that happened before the client connected. Instead, the server will send a `POSITION` and clients then query the database (or master out of band) to get up to date.
2020-03-25 14:54:01 +00:00
Richard van der Hoff a564b92d37
Convert `*StreamRow` classes to inner classes (#7116)
This just helps keep the rows closer to their streams, so that it's easier to
see what the format of each stream is.
2020-03-23 13:59:11 +00:00
Richard van der Hoff b3cee0ce67
Fix processing of `groups` stream, and use symbolic names for streams (#7117)
`groups` != `receipts`

Introduced in #6964
2020-03-23 11:39:36 +00:00
Erik Johnston fdb1344716
Remove concept of a non-limited stream. (#7011) 2020-03-20 14:40:47 +00:00
Erik Johnston 9ce4e344a8 Change device list replication to match new semantics.
Instead of sending down batches of user ID/host tuples, send down a row
per entity (user ID or host).
2020-02-28 11:25:34 +00:00
Erik Johnston 1f773eec91
Port PresenceHandler to async/await (#6991) 2020-02-26 15:33:26 +00:00
Erik Johnston 0bd8cf435e Increase MAX_EVENTS_BEHIND for replication clients 2020-02-21 09:04:33 +00:00
Erik Johnston c3d4ad8afd
Fix sending server up commands from workers (#6811)
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
2020-01-30 16:42:11 +00:00
Erik Johnston d5275fc55f
Propagate cache invalidates from workers to other workers. (#6748)
Currently if a worker invalidates a cache it will be streamed to master, which then didn't forward those to other workers.
2020-01-27 13:47:50 +00:00
Erik Johnston 5d7a6ad223
Allow streaming cache invalidate all to workers. (#6749) 2020-01-22 10:37:00 +00:00
Erik Johnston a8a50f5b57
Wake up transaction queue when remote server comes back online (#6706)
This will be used to retry outbound transactions to a remote server if
we think it might have come back up.
2020-01-17 10:27:19 +00:00
Erik Johnston 48c3a96886
Port synapse.replication.tcp to async/await (#6666)
* Port synapse.replication.tcp to async/await

* Newsfile

* Correctly document type of on_<FOO> functions as async

* Don't be overenthusiastic with the asyncing....
2020-01-16 09:16:12 +00:00
Erik Johnston e8b68a4e4b
Fixup synapse.replication to pass mypy checks (#6667) 2020-01-14 14:08:06 +00:00
Richard van der Hoff 6964ea095b
Reduce the reconnect time when replication fails. (#6617) 2020-01-03 14:19:09 +00:00
Andrew Morgan cd96b4586f lint 2019-11-08 15:45:45 +00:00
Andrew Morgan c4bdf2d785 Remove content from being sent for account data rdata stream 2019-11-08 15:44:02 +00:00
Richard van der Hoff cc6243b4c0
document the REPLICATE command a bit better (#6305)
since I found myself wonder how it works
2019-11-04 12:40:18 +00:00
Hubert Chathi 9c94b48bf1 Merge branch 'develop' into uhoreg/cross_signing_fix_workers_notify 2019-10-31 12:32:07 -04:00
Andrew Morgan 54fef094b3
Remove usage of deprecated logger.warn method from codebase (#6271)
Replace every instance of `logger.warn` with `logger.warning` as the former is deprecated.
2019-10-31 10:23:24 +00:00
Hubert Chathi 998f7fe7d4 make user signatures a separate stream 2019-10-30 17:22:52 -04:00
Andrew Morgan 4548d1f87e
Remove unnecessary parentheses around return statements (#5931)
Python will return a tuple whether there are parentheses around the returned values or not.

I'm just sick of my editor complaining about this all over the place :)
2019-08-30 16:28:26 +01:00
Amber Brown 4806651744
Replace returnValue with return (#5736) 2019-07-23 23:00:55 +10:00
Amber Brown 463b072b12
Move logging utilities out of the side drawer of util/ and into logging/ (#5606) 2019-07-04 00:07:04 +10:00
Amber Brown 32e7c9e7f2
Run Black. (#5482) 2019-06-20 19:32:02 +10:00
Erik Johnston b5c62c6b26 Fix relations in worker mode 2019-05-16 10:38:13 +01:00
Richard van der Hoff 4b91c313a9 Combine the CurrentStateDeltaStream into the EventStream 2019-03-27 22:07:05 +00:00
Richard van der Hoff 1f6d6f918a Make EventStream rows have a type
... as a precursor to combining it with the CurrentStateDelta stream.
2019-03-27 22:07:05 +00:00
Richard van der Hoff 015b3622eb Skip building a ROW_TYPE when building updates
We're about to turn it straight into a JSON object anyway so building a
ROW_TYPE is a bit pointless, and reduces flexibility in the update_function.
2019-03-27 21:58:03 +00:00
Richard van der Hoff f570916a3e Add parse_row method to replication stream class
This will allow individual stream classes to override how a row is parsed.
2019-03-27 21:32:33 +00:00
Richard van der Hoff 71dcb275f1 move FederationStream out to its own file 2019-03-27 21:13:14 +00:00
Richard van der Hoff aa1e017864 move EventsStream out to its own file 2019-03-27 21:13:14 +00:00
Richard van der Hoff a5798de067 Move replication.tcp.streams into a package 2019-03-27 21:13:14 +00:00
Richard van der Hoff acaa18f7dd
Fix/improve some docstrings in the replication code. (#4949) 2019-03-27 21:12:36 +00:00
Richard van der Hoff 8cbbedaa2b
Fix ClientReplicationStreamProtocol.__str__ (#4929)
`__str__` depended on `self.addr`, which was absent from
ClientReplicationStreamProtocol, so attempting to call str on such an object
would raise an exception.

We can calculate the peer addr from the transport, so there is no need for addr
anyway.
2019-03-25 16:41:51 +00:00
Richard van der Hoff 9bde730ef8
Fix bug where read-receipts lost their timestamps (#4927)
Make sure that they are sent correctly over the replication stream.

Fixes: #4898
2019-03-25 16:38:05 +00:00
Richard van der Hoff cdb8036161
Add a config option for torture-testing worker replication. (#4902)
Setting this to 50 or so makes a bunch of sytests fail in worker mode.
2019-03-20 16:04:35 +00:00
Andrew Morgan b9f6163092 Simplify token replication logic 2019-03-05 13:58:30 +00:00
Andrew Morgan fe7bd23a85 Clean up logic and add comments 2019-03-04 15:08:15 +00:00
Andrew Morgan 9f7cdf3da1 Clearer branching, fix missing list clear 2019-03-04 14:36:52 +00:00
Andrew Morgan 5f0c449dd5 Prevent replication wedging 2019-03-04 14:03:18 +00:00
Erik Johnston 7590e9fa28
Merge pull request #4749 from matrix-org/erikj/replication_connection_backoff
Fix tightloop over connecting to replication server
2019-02-27 11:00:59 +00:00
Erik Johnston 6bb1c028f1 Limit cache invalidation replication line length (#4748) 2019-02-27 10:28:37 +00:00
Erik Johnston 6870fc496f Move connecting logic into ClientReplicationStreamProtocol 2019-02-27 10:23:51 +00:00
Erik Johnston 25814921f1 Increase the max delay between retry attempts
Otherwise if you have many workers they can easily take out master with
their connection attempts
2019-02-26 15:12:33 +00:00
Erik Johnston 313987187e Fix tightloop over connecting to replication server
If the client failed to process incoming commands during the initial set
up of the replication connection it would immediately disconnect and
reconnect, resulting in a tightloop.

This can happen, for example, when subscribing to a stream that has a
row that is too long in the backlog.

The fix here is to not consider the connection successfully set up until
the client has succesfully subscribed and caught up with the streams.
This ensures that the retry logic timers aren't reset until then,
meaning that if an error does happen during start up the client will
continue backing off before retrying again.
2019-02-26 15:05:41 +00:00
Erik Johnston a163b748a5 Don't truncate command name in metrics 2018-10-29 17:34:21 +00:00
Amber Brown c4b3698a80
Make the replication logger quieter (#4108) 2018-10-29 22:59:44 +11:00
Travis Ralston f1a7264663
Fix minor typo in exception 2018-09-13 11:51:12 -06:00
Erik Johnston 3e242dc149 Remove conn_id 2018-09-04 11:45:52 +01:00
Erik Johnston b13836da7f Remove conn_id from repl prometheus metrics
`conn_id` gets set to a random string, and so we end up filling up
prometheus with tonnes of data series, which is bad.
2018-09-03 17:22:49 +01:00
Richard van der Hoff 0e8d78f6aa Logcontexts for replication command handlers
Run the handlers for replication commands as background processes. This should
improve the visibility in our metrics, and reduce the number of "running db
transaction from sentinel context" warnings.

Ideally it means converting the things that fire off deferreds into the night
into things that actually return a Deferred when they are done. I've made a bit
of a stab at this, but it will probably be leaky.
2018-08-17 00:43:43 +01:00
Richard van der Hoff f59be4eb0e Fix unit tests
on_notifier_poke no longer runs synchonously, so we have to do a different hack
to make sure that the replication data has been sent. Let's actually listen for
its arrival.
2018-07-25 10:30:36 +01:00
Richard van der Hoff 371da42ae4 Wrap a number of things that run in the background
This will reduce the number of "Starting db connection from sentinel context"
warnings, and will help with our metrics.
2018-07-25 09:41:12 +01:00
Amber Brown 49af402019 run isort 2018-07-09 16:09:20 +10:00
Amber Brown 6350bf925e
Attempt to be more performant on PyPy (#3462) 2018-06-28 14:49:57 +01:00
Amber Brown 07cad26d65
Remove all global reactor imports & pass it around explicitly (#3424) 2018-06-25 14:08:28 +01:00
Amber Brown 99b77aa829
Fix tcp protocol metrics naming (#3410) 2018-06-21 09:39:27 +01:00
Richard van der Hoff b7e7fd2d0e Fix replication metrics
fix bug introduced in #3256
2018-06-04 16:23:05 +01:00
Amber Brown 754826a830 Merge remote-tracking branch 'origin/develop' into 3218-official-prom 2018-05-28 18:57:23 +10:00
Amber Brown 1f69693347
Merge pull request #3244 from NotAFile/py3-six-4
replace some iteritems with six
2018-05-24 13:04:07 -05:00
Amber Brown b6063631c3 more cleanup 2018-05-22 17:36:20 -05:00
Amber Brown 228f1f584e fix the test failures 2018-05-22 15:02:38 -05:00
Amber Brown 8f5a688d42 cleanups, self-registration 2018-05-22 10:56:03 -05:00
Amber Brown a8990fa2ec Merge remote-tracking branch 'origin/develop' into 3218-official-prom 2018-05-22 10:50:26 -05:00
Richard van der Hoff 9ea219c514 Send users a server notice about consent
When a user first syncs, we will send them a server notice asking them to
consent to the privacy policy if they have not already done so.
2018-05-22 11:54:51 +01:00
Amber Brown fcc525b0b7 rest of the changes 2018-05-21 19:48:57 -05:00
Amber Brown df9f72d9e5 replacing portions 2018-05-21 19:47:37 -05:00