Add reason why the archive bot is joining the room (#262)

Uses the join `reason` field added in [MSC2367](https://github.com/matrix-org/matrix-spec-proposals/pull/2367). Unfortunately, this PR doesn't have much effect yet because few clients appear to display join reasons (Element, for example, doesn't).

Part of https://github.com/matrix-org/matrix-public-archive/issues/257
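
For illustration, here is a minimal sketch of a join request that carries a `reason`, assuming the standard client-server `POST /_matrix/client/v3/join/{roomIdOrAlias}` endpoint. The `buildJoinRequest` helper, the homeserver URL, and the token are hypothetical placeholders and not part of this change; the actual code goes through `fetchEndpointAsJson` instead.

```javascript
// Hypothetical helper (not from the archive codebase): builds the URL and
// fetch-style options for a join request that includes a `reason`.
function buildJoinRequest({ matrixServerUrl, accessToken, roomIdOrAlias, reason }) {
  // Aliases start with `#`, which would otherwise be parsed as a URL fragment,
  // so the room ID or alias is encoded as a single path segment.
  const url = `${matrixServerUrl.replace(/\/+$/, '')}/_matrix/client/v3/join/${encodeURIComponent(
    roomIdOrAlias
  )}`;

  return {
    url,
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${accessToken}`,
        'Content-Type': 'application/json',
      },
      // The homeserver copies `reason` into the resulting `m.room.member` event.
      body: JSON.stringify({ reason }),
    },
  };
}

const request = buildJoinRequest({
  matrixServerUrl: 'https://matrix.example.org', // placeholder homeserver
  accessToken: 'placeholder_access_token', // placeholder token
  roomIdOrAlias: '#some-room:example.org', // placeholder alias
  reason: 'Joining room to check history visibility.',
});
console.log(request.url);
```

The returned `url` and `options` could be passed to `fetch(request.url, request.options)`; the sketch stops short of sending anything so it can run without a homeserver.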
Eric Eastwood 2023-06-09 16:05:20 -05:00 committed by GitHub
parent 8da9b3d957
commit 1dd63212c0
3 changed files with 63 additions and 19 deletions


@@ -17,19 +17,32 @@ And with the introduction of the jump to date API via
 [MSC3030](https://github.com/matrix-org/matrix-spec-proposals/pull/3030), we could show
 messages from any given date and day-by-day navigation.

-## How do I opt out and keep my room from being indexed by search engines?
+## Why did the archive bot join my room?

-All public Matrix rooms are accessible to view in the Matrix Public Archive. But only
-rooms with history visibility set to `world_readable` are indexable by search engines.
+Only public Matrix rooms with `shared` or `world_readable` [history
+visibility](https://spec.matrix.org/latest/client-server-api/#room-history-visibility) are
+accessible in the Matrix Public Archive. In some clients like Element, the `shared`
+option equates to "Members only (since the point in time of selecting this option)" and
+`world_readable` to "Anyone" under the **room settings** -> **Security & Privacy** ->
+**Who can read history?**.

-Also see https://github.com/matrix-org/matrix-public-archive/issues/47 to track better
-opt out controls.
+But the archive bot (`@archive:matrix.org`) will join any public room because it doesn't
+know the history visibility without first joining. Any room without `world_readable` or
+`shared` history visibility will lead to a `403 Forbidden`. And if the public room is in
+the room directory, it will be listed in the archive but will still lead to a `403
+Forbidden` in that case.

-For [archive.matrix.org](https://archive.matrix.org/), you can ban the
-`@archive:matrix.org` user if you don't want your room content to be shown in the
-archive at all.
+The Matrix Public Archive doesn't hold onto any data (it's
+stateless) and requests the messages from the homeserver every time. The
+[archive.matrix.org](https://archive.matrix.org/) instance has some caching in place, 5
+minutes for the current day, and 2 days for past content.

-## Why does the archive user join rooms instead of browsing them as a guest?
+The Matrix Public Archive only allows rooms with `world_readable` history visibility to
+be indexed by search engines. See the [opt
+out](#how-do-i-opt-out-and-keep-my-room-from-being-indexed-by-search-engines) topic
+below for more details.
+
+### Why does the archive user join rooms instead of browsing them as a guest?

 Guests require `m.room.guest_access` to access a room. Most public rooms do not allow
 guests because even the `public_chat` preset when creating a room does not allow guest
@@ -37,11 +50,22 @@ access. Not being able to view most public rooms is the major blocker on being a
 use guest access. The idea is if I can view the messages from a Matrix client as a
 random user, I should also be able to see the messages in the archive.

-Keep in mind that only rooms with history visibility set to `world_readable` are
-indexable by search engines. The Matrix Public Archive doesn't hold onto any data (it's
-stateless) and requests the messages from the homeserver every time. The
-[archive.matrix.org](https://archive.matrix.org/) instance has some caching in place, 5
-minutes for the current day, and 2 days for past content.
+Guest access is also a much different ask than read-only access since guests can also
+send messages in the room which isn't always desirable. The archive bot is read-only and
+does not send messages.
+
+## How do I opt out and keep my room from being indexed by search engines?
+
+Only public Matrix rooms with `shared` or `world_readable` history visibility are
+accessible to view in the Matrix Public Archive. But only rooms with history visibility
+set to `world_readable` are indexable by search engines.
+
+Also see https://github.com/matrix-org/matrix-public-archive/issues/47 to track better
+opt out controls.
+
+As a workaround for [archive.matrix.org](https://archive.matrix.org/) today, you can ban
+the `@archive:matrix.org` user if you don't want your room content to be shown in the
+archive at all.

 ## Technical details
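
The access rules described in the FAQ changes above can be summarized in a short sketch. `describeArchiveAccess` is a hypothetical helper written only to illustrate the rules as the FAQ states them (it is not how the archive implements them): `shared` and `world_readable` rooms are viewable, only `world_readable` rooms are indexable, and anything else results in a `403 Forbidden`.

```javascript
// Hypothetical helper (not from the archive codebase): maps a room's
// `m.room.history_visibility` value to the behavior the FAQ describes.
function describeArchiveAccess(historyVisibility) {
  // Only `shared` and `world_readable` rooms are viewable in the archive.
  const accessible = historyVisibility === 'shared' || historyVisibility === 'world_readable';

  return {
    // Any other visibility (`invited`, `joined`) leads to a 403 Forbidden.
    statusCode: accessible ? 200 : 403,
    // Only `world_readable` rooms may be indexed by search engines.
    indexableBySearchEngines: historyVisibility === 'world_readable',
  };
}

for (const visibility of ['world_readable', 'shared', 'invited', 'joined']) {
  console.log(visibility, describeArchiveAccess(visibility));
}
```

Note that the bot still has to join the room before it can read the history visibility at all, which is why the join happens even for rooms that end up returning `403 Forbidden`.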


@@ -3,14 +3,19 @@
 const assert = require('assert');
 const urlJoin = require('url-join');
+const StatusError = require('../errors/status-error');
 const { fetchEndpointAsJson } = require('../fetch-endpoint');
 const getServerNameFromMatrixRoomIdOrAlias = require('./get-server-name-from-matrix-room-id-or-alias');
+const MatrixPublicArchiveURLCreator = require('matrix-public-archive-shared/lib/url-creator');
 const config = require('../config');
-const StatusError = require('../errors/status-error');
+const basePath = config.get('basePath');
+assert(basePath);
 const matrixServerUrl = config.get('matrixServerUrl');
 assert(matrixServerUrl);
+const matrixPublicArchiveURLCreator = new MatrixPublicArchiveURLCreator(basePath);
 async function ensureRoomJoined(
   accessToken,
   roomIdOrAlias,
@@ -43,6 +48,19 @@ async function ensureRoomJoined(
   method: 'POST',
   accessToken,
   abortSignal,
+  body: {
+    reason:
+      `Joining room to check history visibility. ` +
+      `If your room is public with shared or world readable history visibility, ` +
+      `it will be accessible at ${matrixPublicArchiveURLCreator.archiveUrlForRoom(
+        roomIdOrAlias
+        // We don't need to include the `viaServers` option here because the archive
+        // will already be joined to the room from this request itself and we don't
+        // need to make the URL any longer/noisier than it needs to be.
+      )}. ` +
+      `See the FAQ for more details: ` +
+      `https://github.com/matrix-org/matrix-public-archive/blob/main/docs/faq.md#why-did-the-archive-bot-join-my-room`,
+  },
 });
 assert(
   joinData.room_id,


@@ -14,6 +14,7 @@ const chalk = require('chalk');
 const RethrownError = require('../server/lib/errors/rethrown-error');
 const MatrixPublicArchiveURLCreator = require('matrix-public-archive-shared/lib/url-creator');
 const { fetchEndpointAsText, fetchEndpointAsJson } = require('../server/lib/fetch-endpoint');
+const ensureRoomJoined = require('../server/lib/matrix-utils/ensure-room-joined');
 const config = require('../server/lib/config');
 const {
   MS_LOOKUP,
@@ -999,10 +1000,11 @@ describe('matrix-public-archive', () => {
 // avoid problems jumping to the latest activity since we can't control the
 // timestamp of the membership event.
 const archiveAppServiceUserClient = await getTestClientForAs();
-await joinRoom({
-  client: archiveAppServiceUserClient,
-  roomId: roomId,
-});
+// We use `ensureRoomJoined` instead of `joinRoom` because we're joining
+// the archive user here and want the same join `reason` to avoid a new
+// state event being created (`joinRoom` -> `{ displayname, membership }`
+// whereas `ensureRoomJoined` -> `{ reason, displayname, membership }`)
+await ensureRoomJoined(archiveAppServiceUserClient.accessToken, roomId);
 // Just spread things out a bit so the event times are more obvious
 // and stand out from each other while debugging and so we just have