* Limit AS transactions to 100 events
* Update changelog.d/8606.feature
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
* Add tests
* Update synapse/appservice/scheduler.py
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
I noticed in https://github.com/matrix-org/synapse/issues/8575 that the `end_error` variable in `synapse_port_db` is set to an `Exception`, even though later we expect it to be a `str`.
This PR simply casts an exception raised to a string. I'm doing this instead of having `end_error` be of type exception as we explicitly set `end_error` to a str here:
d25eb8f370/scripts/synapse_port_db (L542-L547)
This whole file could probably use some heavy refactoring, but until then at least this fix will prevent exception contents from being hidden from us and users.
* Add `DeferredCache.get_immediate` method
A bunch of things that are currently calling `DeferredCache.get` are only
really interested in the result if it's completed. We can optimise and simplify
this case.
* Remove unused 'default' parameter to DeferredCache.get()
* another get_immediate instance
* type annotations for LruCache
* changelog
* Apply suggestions from code review
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
* review comments
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
This implements a more standard API for instantiating a homeserver and
moves some of the dependency injection into the test suite.
More concretely this stops using `setattr` on all `kwargs` passed to `HomeServer`.
This PR makes several changes to the `./scripts-dev/lint.sh` script, which lints the codebase with a number of tools:
* Adds usage information, with `-h` flag to show it. Otherwise it will show when providing an unknown flag.
* Adds option `-d` which will check both staged and unstaged files that have changed since the last commit and add them to the list of files to lint.
- Note that only files without an extension, or with a `.py` extension will be allowed. This prevents editing bash scripts causing the linters to break on non-python files.
* Improves the print-out of which files/directories are being linted.
We asserted that the IDs returned by postgres sequence was greater than
any we had seen, however this is technically racey as we may update the
current positions out of order.
We now assert that the sequences are correct on startup, so the
assertion is no longer really required, so we remove them.
Autocommit means that we don't wrap the functions in transactions, and instead get executed directly. Introduced in #8456. This will help:
1. reduce the number of `could not serialize access due to concurrent delete` errors that we see (though there are a few functions that often cause serialization errors that we don't fix here);
2. improve the DB performance, as it no longer needs to deal with the overhead of `REPEATABLE READ` isolation levels; and
3. improve wall clock speed of these functions, as we no longer need to send `BEGIN` and `COMMIT` to the DB.
Some notes about the differences between autocommit mode and our default `REPEATABLE READ` transactions:
1. Currently `autocommit` only applies when using PostgreSQL, and is ignored when using SQLite (due to silliness with [Twisted DB classes](https://twistedmatrix.com/trac/ticket/9998)).
2. Autocommit functions may get retried on error, which means they can get applied *twice* (or more) to the DB (since they are not in a transaction the previous call would not get rolled back). This means that the functions need to be idempotent (or otherwise not care about being called multiple times). Read queries, simple deletes, and updates/upserts that replace rows (rather than generating new values from existing rows) are all idempotent.
3. Autocommit functions no longer get executed in [`REPEATABLE READ`](https://www.postgresql.org/docs/current/transaction-iso.html) isolation level, and so data can change queries, which is fine for single statement queries.
We asserted that the IDs returned by postgres sequence was greater than
any we had seen, however this is technically racey as we may update the
current positions out of order.
We now assert that the sequences are correct on startup, so the
assertion is no longer really required, so we remove them.
* Fix outbound federaion with multiple event persisters.
We incorrectly notified federation senders that the minimum persisted
stream position had advanced when we got an `RDATA` from an event
persister.
Notifying of federation senders already correctly happens in the
notifier, so we just delete the offending line.
* Change some interfaces to use RoomStreamToken.
By enforcing use of `RoomStreamTokens` we make it less likely that
people pass in random ints that they got from somewhere random.
Currently background proccesses stream the events stream use the "minimum persisted position" (i.e. `get_current_token()`) rather than the vector clock style tokens. This is broadly fine as it doesn't matter if the background processes lag a small amount. However, in extreme cases (i.e. SyTests) where we only write to one event persister the background processes will never make progress.
This PR changes it so that the `MultiWriterIDGenerator` keeps the current position of a given instance as up to date as possible (i.e using the latest token it sees if its not in the process of persisting anything), and then periodically announces that over replication. This then allows the "minimum persisted position" to advance, albeit with a small lag.
This could, very occasionally, cause:
```
tests.test_visibility.FilterEventsForServerTestCase.test_large_room
===============================================================================
[ERROR]
Traceback (most recent call last):
File "/src/tests/rest/media/v1/test_media_storage.py", line 86, in test_ensure_media_is_in_local_cache
self.wait_on_thread(x)
File "/src/tests/unittest.py", line 296, in wait_on_thread
self.reactor.advance(0.01)
File "/src/.tox/py35/lib/python3.5/site-packages/twisted/internet/task.py", line 826, in advance
self._sortCalls()
File "/src/.tox/py35/lib/python3.5/site-packages/twisted/internet/task.py", line 787, in _sortCalls
self.calls.sort(key=lambda a: a.getTime())
builtins.ValueError: list modified during sort
tests.rest.media.v1.test_media_storage.MediaStorageTests.test_ensure_media_is_in_local_cache
```
This PR allows Synapse modules making use of the `ModuleApi` to create and send non-membership events into a room. This can useful to have modules send messages, or change power levels in a room etc. Note that they must send event through a user that's already in the room.
The non-membership event limitation is currently arbitrary, as it's another chunk of work and not necessary at the moment.
==============================
Bugfixes
--------
- Fix duplication of events on high traffic servers, caused by PostgreSQL `could not serialize access due to concurrent update` errors. ([\#8456](https://github.com/matrix-org/synapse/issues/8456))
Internal Changes
----------------
- Add Groovy Gorilla to the list of distributions we build `.deb`s for. ([\#8475](https://github.com/matrix-org/synapse/issues/8475))
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEBTGR3/RnAzBGUif3pULk7RsPrAkFAl9+7JIQHGVyaWtAbWF0
cml4Lm9yZwAKCRClQuTtGw+sCUc1B/9Bz8uAEJWX9qahuWZGrsvj0ukF/9O6YWrU
ZZGHnyaAXg4MF0lR6DVCV2NhW/l3RWO01G/LgGT//0vac+EFxWDbjuaHP9pRVBln
02TQpn12ceZtW/n+X+8wmQzfSn1zNWRUycV43MyGu49z3IAG7KVjjVDQPkY0mEHC
h6THy30SYIOKDCdXUUL9i+sV1JLzKPX0NlzO3ONKe2LbfLiBTvdfDRlZGq6rJu5H
GAsIPwgadhanhXBPhG4WrealG31TkcPbZCwFMv9aFzltU841jps+QN6OKvbGFVer
FboD2df0ZJqKNJOJVVlOMs7I5uSYbrg+n1zolZfrbo0lk/xEyIAT
=F5cw
-----END PGP SIGNATURE-----
Merge tag 'v1.21.0rc3' into develop
Synapse 1.21.0rc3 (2020-10-08)
==============================
Bugfixes
--------
- Fix duplication of events on high traffic servers, caused by PostgreSQL `could not serialize access due to concurrent update` errors. ([\#8456](https://github.com/matrix-org/synapse/issues/8456))
Internal Changes
----------------
- Add Groovy Gorilla to the list of distributions we build `.deb`s for. ([\#8475](https://github.com/matrix-org/synapse/issues/8475))
Added shields directing to synapse-dev room, showing license, latest version on PyPi and supported Python versions.
I've moved substitution definitions to the bottom to improve readability.
Signed-off-by: Mateusz Przybyłowicz <uamfhq@gmail.com>
We call `_update_stream_positions_table_txn` a lot, which is an UPSERT
that can conflict in `REPEATABLE READ` isolation level. Instead of doing
a transaction consisting of a single query we may as well run it outside
of a transaction.
We call `_update_stream_positions_table_txn` a lot, which is an UPSERT
that can conflict in `REPEATABLE READ` isolation level. Instead of doing
a transaction consisting of a single query we may as well run it outside
of a transaction.
Currently when using multiple event persisters we (in the worst case) don't tell clients about events until all event persisters have persisted new events after the original event. This is a suboptimal, especially if one of the event persisters goes down.
To handle this, we encode the position of each event persister in the room tokens so that we can send events to clients immediately. To reduce the size of the token we do two things:
1. We create a unique immutable persistent mapping between instance names and a generated small integer ID, which we can encode in the tokens instead of the instance name; and
2. We encode the "persisted upto position" of the room token and then only explicitly include instances that have positions strictly greater than that.
The new tokens look something like: `m3478~1.3488~2.3489`, where the first number is the min position, and the subsequent `-` separated pairs are the instance ID to positions map. (We use `.` and `~` as separators as they're URL safe and not already used by `StreamToken`).
It seems most of these blacklisted tests do actually pass most of the time.
I'm of the opinion that having them blacklisted here means there is very little incentive for us to deflake any flaky tests, and meanwhile any value in those tests is completely lost.
Lots of different module apis is not easy to maintain.
Rather than adding yet another ModuleApi(hs, hs.get_auth_handler()) incantation, first add an hs.get_module_api() method and use it where possible.
https://github.com/matrix-org/synapse/tree/develop/docs/sphinx doesn't seem to really be utilised or changed recently since the initial commit. I like the idea of exportable documentation of the codebase, but at the moment after running through the build instructions the generated website wasn't very useful...
* Optimise and test state fetching for 3p event rules
Getting all the events at once is much more efficient than getting them
individually
* Test that 3p event rules can modify events
PR #8292 tried to maintain backwards compat with modules which don't provide a
`check_visibility_can_be_modified` method, but the tests weren't being run,
and the check didn't work.
This PR allows `ThirdPartyEventRules` modules to view, manipulate and block changes to the state of whether a room is published in the public rooms directory.
While the idea of whether a room is in the public rooms list is not kept within an event in the room, `ThirdPartyEventRules` generally deal with controlling which modifications can happen to a room. Public rooms fits within that idea, even if its toggle state isn't controlled through a state event.
There's no need for it to be in the dict as well as the events table. Instead,
we store it in a separate attribute in the EventInternalMetadata object, and
populate that on load.
This means that we can rely on it being correctly populated for any event which
has been persited to the database.
This is so we can tell what is going on when things are taking a while to start up.
The main change here is to ensure that transactions that are created during startup get correctly logged like normal transactions.
#7124 changed the behaviour of remote thumbnails so that the thumbnailing method was included in the filename of the thumbnail. To support existing files, it included a fallback so that we would check the old filename if the new filename didn't exist.
Unfortunately, it didn't apply this logic to storage providers, so any thumbnails stored on such a storage provider was broken.
For negative streams we have to negate the internal stream ID before
querying the DB.
The effect of this bug was to query far too many rows, slowing start up
time, but we would correctly filter the results afterwards so there was
no ill effect.
This converts a few more of our inline HTML templates to Jinja. This is somewhat part of #7280 and should make it a bit easier to customize these in the future.
The idea is that in future tokens will encode a mapping of instance to position. However, we don't want to include the full instance name in the string representation, so instead we'll have a mapping between instance name and an immutable integer ID in the DB that we can use instead. We'll then do the lookup when we serialize/deserialize the token (we could alternatively pass around an `Instance` type that includes both the name and ID, but that turns out to be a lot more invasive).
* Don't check whether a 3pid is allowed to register during password reset
This endpoint should only deal with emails that have already been approved, and
are attached with user's account. There's no need to re-check them here.
* Changelog
* Fix table scan of events on worker startup.
This happened because we assumed "new" writers had an initial stream
position of 0, so the replication code tried to fetch all events written
by the instance between 0 and the current position.
Instead, set the initial position of new writers to the current
persisted up to position, on the assumption that new writers won't have
written anything before that point.
* Consider old writers coming back as "new".
Otherwise we'd try and fetch entries between the old stale token and the
current position, even though it won't have written any rows.
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
This PR adds a script that:
* Builds the local Synapse checkout using our existing `docker/Dockerfile` image.
* Downloads [Complement](https://github.com/matrix-org/complement/)'s source code.
* Builds the [Synapse.Dockerfile](https://github.com/matrix-org/complement/blob/master/dockerfiles/Synapse.Dockerfile) using the above dockerfile as a base.
* Builds and runs Complement against it.
This set up differs slightly from [that of the dendrite repo](https://github.com/matrix-org/dendrite/blob/master/build/scripts/complement.sh) (`complement.sh`, `Complement.Dockerfile`), which instead stores a separate, but slightly modified, dockerfile in Dendrite's repo rather than running the one stored in Complement's repo. That synapse equivalent to that dockerfile (`Synapse.Dockerfile`) in Complement's repo is just based on top of `matrixdotorg/synapse:latest`, which we opt to build here locally.
Thus copying over the files from Complement's repo wouldn't change any functionality, and would result in two instances of the same files. So just using the dockerfile in Complement's repo was decided upon instead.
Broken in https://github.com/matrix-org/synapse/pull/8275 and has yet to be put in a release. Fixes https://github.com/matrix-org/synapse/issues/8418.
`next_link` is an optional parameter. However, we were checking whether the `next_link` param was valid, even if it wasn't provided. In that case, `next_link` was `None`, which would clearly not be a valid URL.
This would prevent password reset and other operations if `next_link` was not provided, and the `next_link_domain_whitelist` config option was set.
* Remove `on_timeout_cancel` from `timeout_deferred`
The `on_timeout_cancel` param to `timeout_deferred` wasn't always called on a
timeout (in particular if the canceller raised an exception), so it was
unreliable. It was also only used in one place, and to be honest it's easier to
do what it does a different way.
* Fix handling of connection timeouts in outgoing http requests
Turns out that if we get a timeout during connection, then a different
exception is raised, which wasn't always handled correctly.
To fix it, catch the exception in SimpleHttpClient and turn it into a
RequestTimedOutError (which is already a documented exception).
Also add a description to RequestTimedOutError so that we can see which stage
it failed at.
* Fix incorrect handling of timeouts reading federation responses
This was trapping the wrong sort of TimeoutError, so was never being hit.
The effect was relatively minor, but we should fix this so that it does the
expected thing.
* Fix inconsistent handling of `timeout` param between methods
`get_json`, `put_json` and `delete_json` were applying a different timeout to
the response body to `post_json`; bring them in line and test.
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
Co-authored-by: Erik Johnston <erik@matrix.org>
This table was created in #8034 (1.20.0). It references
`ui_auth_sessions`, which is ignored, so this one should be too.
Signed-off-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
* Fix test_verify_json_objects_for_server_awaits_previous_requests
It turns out that this wasn't really testing what it thought it was testing
(in particular, `check_context` was turning failures into success, which was
making the tests pass even though it wasn't clear they should have been.
It was also somewhat overcomplex - we can test what it was trying to test
without mocking out perspectives servers.
* Fix warnings about finished logcontexts in the keyring
We need to make sure that we finish the key fetching magic before we run the
verifying code, to ensure that we don't mess up our logcontexts.
Co-authored-by: Benjamin Koch <bbbsnowball@gmail.com>
This adds configuration flags that will match a user to pre-existing users
when logging in via OpenID Connect. This is useful when switching to
an existing SSO system.
On startup `MultiWriteIdGenerator` fetches the maximum stream ID for
each instance from the table and uses that as its initial "current
position" for each writer. This is problematic as a) it involves either
a scan of events table or an index (neither of which is ideal), and b)
if rows are being persisted out of order elsewhere while the process
restarts then using the maximum stream ID is not correct. This could
theoretically lead to race conditions where e.g. events that are
persisted out of order are not sent down sync streams.
We fix this by creating a new table that tracks the current positions of
each writer to the stream, and update it each time we finish persisting
a new entry. This is a relatively small overhead when persisting events.
However for the cache invalidation stream this is a much bigger relative
overhead, so instead we note that for invalidation we don't actually
care about reliability over restarts (as there's no caches to
invalidate) and simply don't bother reading and writing to the new table
in that particular case.
The idea is to remove some of the places we pass around `int`, where it can represent one of two things:
1. the position of an event in the stream; or
2. a token that partitions the stream, used as part of the stream tokens.
The valid operations are then:
1. did a position happen before or after a token;
2. get all events that happened before or after a token; and
3. get all events between two tokens.
(Note that we don't want to allow other operations as we want to change the tokens to be vector clocks rather than simple ints)
I'd like to get a better insight into what we are doing with respect to state
res. The list of state groups we are resolving across should be short (if it
isn't, that's a massive problem in itself), so it should be fine to log it in
ite entiretly.
I've done some grepping and found approximately zero cases in which the
"shortcut" code delivered the result, so I've ripped that out too.
When updating the `room_stats_state` table, we try to check for null bytes slipping in to the content for state events. It turns out we had added `guest_access` as a field to room_stats_state without including it in the null byte check.
Lo and behold, a null byte in a `m.room.guest_access` event then breaks `room_stats_state` updates.
This PR adds the check for `guest_access`.