MSC3752: Markup locations for text #3752
Draft: gleachkr wants to merge 8 commits into matrix-org:main from opentower:text-markup
Changes from 6 commits
Commits
871c5f3 markup locations for text, initial commit (gleachkr)
4291834 rename msc file (gleachkr)
7269502 Add Text Position, Text Quote locations (gleachkr)
d35d167 Add text ranges (gleachkr)
3411691 Balance brackets (gleachkr)
3e4a47e Add WADM serialization and unstable prefix (gleachkr)
562ba45 Use PR links (gleachkr)
bf9b7c8 Enquote JSON strings. (gleachkr)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,230 @@ | ||
# Markup locations for Text | ||
|
||
[MSC3574](https://github.com/opentower/matrix-doc/blob/main/proposals/3574-resource-markup.md) | ||
proposes a mechanism for marking up resources (webpages, documents, videos, and | ||
other files) using Matrix. The proposed mechanism requires an | ||
`m.markup.location` schema for representing the location of annotations within | ||
different kinds of resources. MSC3574 punts on what standard location types | ||
might be available, deferring that large family of questions to other MSCs. | ||
This MSC aims to provide basic location types for marking up textual resources. | ||
|
||
## Proposal | ||
|
||
Markup locations for text should approximately follow the format for textual | ||
annotations provided by the W3C's [web annotation data | ||
model](https://www.w3.org/TR/annotation-model/). This will simplify | ||
interoperability with WADM-based annotation systems like | ||
[hypothes.is](https://hypothes.is). | ||
|
||
Markup locations for text should be applicable to `text/*` Media Types, including | ||
markdown and html. They should also be at least partly applicable to formats that | ||
provide an associated text stream, such as `application/pdf`, | ||
`application/epub+zip`, and video or audio files with embedded lyrics or | ||
captions. | ||
|
||
WADM provides two basic notions of locations in text: "Text Position" | ||
(roughly, an offset) and "Text Quote" (roughly, a search query). In practice, | ||
both should be provided for a given text location whenever possible, for robust | ||
anchoring in contexts where the underlying text may change (for example, on the | ||
web). In these cases, clients can use the Text Position offset to find an | ||
approximate position, and look for the nearest approximately matching Text | ||
Quote. | ||
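
The two-step anchoring strategy described above can be sketched roughly as follows. This is a simplified illustration, not part of the proposal: the helper names `find_all` and `reanchor` are hypothetical, and the sketch only considers exact occurrences of the quote, whereas a real client would probably also want approximate (fuzzy) matching.

```python
from typing import List, Optional, Tuple


def find_all(text: str, needle: str) -> List[int]:
    """Return the start offset of every (possibly overlapping) occurrence."""
    if not needle:
        return []
    starts = []
    i = text.find(needle)
    while i != -1:
        starts.append(i)
        i = text.find(needle, i + 1)
    return starts


def reanchor(text: str, exact: str, hint_start: int) -> Optional[Tuple[int, int]]:
    """Pick the occurrence of `exact` closest to the recorded start offset.

    `hint_start` is the Text Position `start` stored with the annotation.
    Returns (start, end) in the current text, or None if the quote is gone.
    """
    starts = find_all(text, exact)
    if not starts:
        return None
    best = min(starts, key=lambda s: abs(s - hint_start))
    return best, best + len(exact)
```

For instance, if an annotation was recorded at offset 13 and four characters were later inserted at the top of the document, `reanchor` would prefer the occurrence at offset 17 over an earlier one at offset 4.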
|
||
### Text Positions | ||
|
||
Text Positions will be represented within an `m.markup.location` as follows: | ||
|
||
``` | ||
m.markup.location: { | ||
m.markup.text.position: { | ||
start: .. | ||
end: .. | ||
} | ||
.. | ||
} | ||
``` | ||
|
||
The `start` and `end` values should be non-negative integers, with 0 indicating | ||
a position before the first character of the document's text, 1 indicating the | ||
position after the first character and before the second, and so on. | ||
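
As a rough illustration of this offset convention, the sketch below treats `start` and `end` as slice boundaries. Python string indices already count Unicode code points, which is the unit the WADM requirements quoted in this section call for. Clients written in languages whose strings are indexed by UTF-16 code units would need a conversion step; the hypothetical helper `utf16_offset_to_code_point` shows one way this might look. Both function names are assumptions made for this example.

```python
def select(text: str, start: int, end: int) -> str:
    """Return the text between two positions, counted in code points."""
    if not 0 <= start <= end <= len(text):
        raise ValueError("offsets out of range")
    return text[start:end]


def utf16_offset_to_code_point(text: str, utf16_offset: int) -> int:
    """Convert an offset counted in UTF-16 code units to code points.

    An offset that falls inside a surrogate pair is rounded up to the
    position after that character.
    """
    units = 0
    for index, ch in enumerate(text):
        if units >= utf16_offset:
            return index
        # Characters outside the Basic Multilingual Plane take two units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)
```

With the text `"a😀b"`, the emoji occupies positions 1 to 2 in code points, while a UTF-16 based client would see the following `b` at offset 3 rather than 2.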
|
||
The following requirements from the web annotation data model must be | ||
respected: | ||
|
||
> The selection of the text must be in terms of unicode code points (the | ||
"character number"), not in terms of code units (that number expressed using a | ||
selected data type). Selections should not start or end in the middle of a | ||
grapheme cluster. The selection must be based on the logical order of the text, | ||
rather than the visual order, especially for bidirectional text. | ||
|
||
> The text must be normalized before recording in the Annotation. Thus HTML/XML | ||
tags should be removed, and character entities should be replaced with the | ||
character that they encode. | ||
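
For HTML, the normalization step might look something like the following sketch, which uses the standard library's `html.parser` to drop tags and decode character entities. The class and function names are hypothetical, and the sketch makes no attempt to skip the contents of `<script>` or `<style>` elements, which a real client would likely want to exclude.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collects text content while discarding tags."""

    def __init__(self) -> None:
        # convert_charrefs=True decodes entities such as &amp; before
        # the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def normalize_html(markup: str) -> str:
    """Return the text of an HTML fragment with tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

Offsets are then computed against the normalized string: in `<p>Fish &amp; <b>chips</b></p>`, the word "chips" starts at position 7 of the normalized text `Fish & chips`.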
|
||
In view of the ambiguity of the markdown format (and similar text formats), and | ||
the resulting complexity of normalization, special markdown characters should | ||
*not* be removed before generating a text position. | ||
|
||
### Text Quotes | ||
|
||
Text Quotes will be represented within an `m.markup.location` as follows: | ||
|
||
``` | ||
m.markup.location: { | ||
m.markup.text.quote: { | ||
exact: ... | ||
prefix: ... | ||
suffix: ... | ||
} | ||
.. | ||
} | ||
``` | ||
|
||
The `exact` value should be the text occupying the designated location. The | ||
`prefix` should be a snippet of text occurring before the designated location, | ||
and the `suffix` should be a snippet occurring after the designated location. | ||
`prefix` and `suffix` may be omitted in cases where they're clearly unnecessary | ||
to disambiguate the location. | ||
|
||
Text should be normalized as above. In the case of multiple matches, all | ||
matches should be treated as part of the location. | ||
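
A minimal sketch of Text Quote resolution, assuming exact string matching on already-normalized text, could look like this. The function name `find_quote` is hypothetical. It returns every match, in line with the rule that multiple matches are all part of the location.

```python
from typing import List, Tuple


def find_quote(
    text: str, exact: str, prefix: str = "", suffix: str = ""
) -> List[Tuple[int, int]]:
    """Return (start, end) for every occurrence of `exact` in `text`
    that is immediately preceded by `prefix` and followed by `suffix`."""
    if not exact:
        return []
    matches = []
    i = text.find(exact)
    while i != -1:
        end = i + len(exact)
        # endswith/startswith accept bounds, so no slices are copied.
        if text.endswith(prefix, 0, i) and text.startswith(suffix, end):
            matches.append((i, end))
        i = text.find(exact, i + 1)
    return matches
```

Omitting `prefix` and `suffix` matches every occurrence of `exact`, which is why they are only safe to leave out when the quoted text is unambiguous.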
|
||
### Text Ranges | ||
|
||
There may be cases in which we want to use the selectors above to indicate the | ||
endpoints of a text range, because we want, for example, to select from the | ||
beginning of a document to a certain phrase, or because we want to select a | ||
long quote without including the contents of the quote in the `exact` value. | ||
|
||
In these cases, we can use a Text Range location, `m.markup.text.range`. Each | ||
endpoint of the range should be given either as a non-negative integer, or as a | ||
`prefix`/`suffix` pair. So for example, | ||
|
||
``` | ||
m.markup.location: { | ||
m.markup.text.range: { | ||
start: 0, | ||
end: { | ||
prefix: "the", | ||
suffix: " end" | ||
} | ||
} | ||
} | ||
``` | ||
|
||
would indicate all of "this is the end" except " end". | ||
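
The example above can be checked with a small sketch. The helper names are hypothetical, and the sketch assumes that a `prefix`/`suffix` endpoint resolves to the first place in the text where the prefix is immediately followed by the suffix; the proposal does not say how ambiguous endpoints should be handled.

```python
from typing import Dict, Union

Endpoint = Union[int, Dict[str, str]]


def resolve_endpoint(text: str, endpoint: Endpoint) -> int:
    """Turn a range endpoint into a code point offset."""
    if isinstance(endpoint, int):
        if not 0 <= endpoint <= len(text):
            raise ValueError("offset out of range")
        return endpoint
    prefix, suffix = endpoint["prefix"], endpoint["suffix"]
    # Assumption: take the first place where prefix is followed by suffix.
    i = text.find(prefix + suffix)
    if i == -1:
        raise ValueError("endpoint not found")
    return i + len(prefix)


def resolve_range(text: str, location: Dict[str, Endpoint]) -> str:
    """Return the text selected by an m.markup.text.range value."""
    start = resolve_endpoint(text, location["start"])
    end = resolve_endpoint(text, location["end"])
    if start > end:
        raise ValueError("range ends before it starts")
    return text[start:end]
```

Applied to "this is the end" with the range shown above, `resolve_range` returns "this is the".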
|
||
### Web Annotation Data Model Serialization | ||
|
||
[MSC3574](https://github.com/opentower/matrix-doc/blob/main/proposals/3574-resource-markup.md) | ||
includes a scheme for serializing Matrix markup events as web annotations in | ||
the web annotation data model. The scheme requires each markup location type to | ||
have a canonical serialization as [a web annotation | ||
selector](https://www.w3.org/TR/annotation-model/#selectors). In this section, | ||
we describe how to serialize `m.markup.text.range`, `m.markup.text.quote` and | ||
`m.markup.text.position` as WADM selectors. | ||
|
||
The correspondence between `m.markup.text.quote` and `m.markup.text.position` | ||
and WADM | ||
[TextQuoteSelector](https://www.w3.org/TR/annotation-model/#text-quote-selector) | ||
and | ||
[TextPositionSelector](https://www.w3.org/TR/annotation-model/#text-position-selector) | ||
selectors is very direct. In each case, we only need to add a field indicating | ||
the selector type. So a location like: | ||
|
||
``` | ||
m.markup.text.quote: { | ||
exact: ... | ||
prefix: ... | ||
suffix: ... | ||
} | ||
``` | ||
|
||
becomes | ||
|
||
``` | ||
{ | ||
type: "TextQuoteSelector" | ||
exact: ... | ||
prefix: ... | ||
suffix: ... | ||
} | ||
``` | ||
|
||
and | ||
|
||
``` | ||
m.markup.text.position: { | ||
start: ... | ||
end: ... | ||
} | ||
``` | ||
|
||
becomes | ||
|
||
``` | ||
{ | ||
type: "TextPositionSelector" | ||
start: ... | ||
end: ... | ||
} | ||
``` | ||
|
||
The more complicated `m.markup.text.range` should be serialized via the WADM | ||
[RangeSelector](https://www.w3.org/TR/annotation-model/#range-selector) selector, which | ||
combines two WADM selectors to designate an area reaching from the beginning of | ||
the area designated by the first selector to the beginning of the area | ||
designated by the second selector. | ||
|
||
If either endpoint of the `m.markup.text.range` location is an offset, that | ||
endpoint should be represented by a WADM `TextPositionSelector` with both the | ||
start and end values equal to the offset. If either endpoint of an | ||
`m.markup.text.range` location is a `prefix`/`suffix` pair, it should be | ||
represented by a `TextQuoteSelector` with the corresponding `prefix`, but with | ||
the `exact` value equal to the suffix, and with no suffix provided. | ||
|
||
So, for example, | ||
|
||
``` | ||
m.markup.text.range: { | ||
start: 0, | ||
end: { | ||
prefix: "the", | ||
suffix: " end" | ||
} | ||
} | ||
``` | ||
|
||
becomes | ||
|
||
``` | ||
{ | ||
type: "RangeSelector", | ||
startSelector: { | ||
type: "TextPositionSelector", | ||
start: 0, | ||
end: 0 | ||
}, | ||
endSelector: { | ||
type: "TextQuoteSelector", | ||
prefix: "the", | ||
exact: " end" | ||
} | ||
} | ||
``` | ||
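
The serialization rules in this section could be implemented along the following lines. This is a sketch under the assumption that locations arrive as plain dictionaries keyed by the stable identifiers; the function names are hypothetical.

```python
from typing import Any, Dict, Union

Endpoint = Union[int, Dict[str, str]]


def endpoint_to_selector(endpoint: Endpoint) -> Dict[str, Any]:
    """Serialize one endpoint of an m.markup.text.range."""
    if isinstance(endpoint, int):
        # An offset becomes a zero-width TextPositionSelector.
        return {"type": "TextPositionSelector", "start": endpoint, "end": endpoint}
    # A prefix/suffix pair becomes a TextQuoteSelector whose `exact`
    # is the suffix, with no suffix of its own.
    return {
        "type": "TextQuoteSelector",
        "prefix": endpoint["prefix"],
        "exact": endpoint["suffix"],
    }


def to_wadm_selector(key: str, value: Dict[str, Any]) -> Dict[str, Any]:
    """Serialize a single m.markup.location entry as a WADM selector."""
    if key == "m.markup.text.quote":
        return {"type": "TextQuoteSelector", **value}
    if key == "m.markup.text.position":
        return {"type": "TextPositionSelector", **value}
    if key == "m.markup.text.range":
        return {
            "type": "RangeSelector",
            "startSelector": endpoint_to_selector(value["start"]),
            "endSelector": endpoint_to_selector(value["end"]),
        }
    raise ValueError(f"unsupported location type: {key}")
```

Because a `RangeSelector` runs from the beginning of its start selector to the beginning of its end selector, using the suffix as the `exact` value places the end of the range exactly between the prefix and the suffix.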
|
||
## Security considerations | ||
|
||
Because room state is unencrypted, `m.space.child` events conveying locations | ||
via `m.markup.text.quote` could leak information about the text of an encrypted | ||
resource. This is part of a more general problem with state events | ||
potentially leaking information, and deserves a general resolution, à la | ||
[MSC3414](https://github.com/matrix-org/matrix-spec-proposals/blob/travis/msc/encrypted-state/proposals/3414-encrypted-state.md). | ||
|
||
## Unstable prefix | ||
|
||
| Proposed Final Identifier | Purpose | Development Identifier | | ||
| ------------------------- | ---------------------------------------------------------- | --------------------------------------------- | | ||
| `m.markup.text.quote` | key in `m.markup.location` | `com.open-tower.msc3752.markup.text.quote` | | ||
| `m.markup.text.position` | key in `m.markup.location` | `com.open-tower.msc3752.markup.text.position` | | ||
| `m.markup.text.range` | key in `m.markup.location` | `com.open-tower.msc3752.markup.text.range` | |