
[ty] First cut at semantic token provider #19108

Open · wants to merge 21 commits into main

Conversation

@UnboundVariable (Collaborator) commented Jul 2, 2025

This PR implements a basic semantic token provider for ty's language server. This allows for more accurate semantic highlighting / coloring within editors that support this LSP functionality.

Here are screenshots that show how code appears in VS Code using the "rainbow" theme both before and after this change.

[Screenshot: VS Code with the "rainbow" theme before this change]

[Screenshot: VS Code with the "rainbow" theme after this change]

The token types and modifier tags in this implementation largely mirror those used in Microsoft's default language server for Python.

The implementation supports two LSP interfaces. The first provides semantic tokens for an entire document, and the second returns semantic tokens for a requested range within a document.
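
For context on what these responses look like on the wire: the LSP requires semantic tokens to be returned as a flat array of integers, five per token, with the line and start column delta-encoded relative to the previous token. Below is a minimal sketch of that encoding in Rust (the AbsoluteToken type and encode_tokens function are illustrative, not ty's actual API):

    // Illustrative token with absolute 0-based line/column positions.
    struct AbsoluteToken {
        line: u32,
        start: u32,
        length: u32,
        token_type: u32,
        modifiers: u32,
    }

    /// Encode tokens (already sorted by position) into the flat LSP format:
    /// five u32s per token, with line/start given as deltas from the previous token.
    fn encode_tokens(tokens: &[AbsoluteToken]) -> Vec<u32> {
        let mut data = Vec::with_capacity(tokens.len() * 5);
        let (mut prev_line, mut prev_start) = (0u32, 0u32);
        for token in tokens {
            let delta_line = token.line - prev_line;
            // The start column is relative to the previous token only when
            // both tokens are on the same line.
            let delta_start = if delta_line == 0 {
                token.start - prev_start
            } else {
                token.start
            };
            data.extend([
                delta_line,
                delta_start,
                token.length,
                token.token_type,
                token.modifiers,
            ]);
            prev_line = token.line;
            prev_start = token.start;
        }
        data
    }

ty's real implementation works from its own token and range types, but the resulting Vec<u32> layout is what the textDocument/semanticTokens/full and textDocument/semanticTokens/range responses carry.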

The PR includes unit tests, as well as comments that document known limitations and areas for future improvement.

github-actions bot commented Jul 2, 2025

mypy_primer results

Changes were detected when running on open source projects
parso (https://github.com/davidhalter/parso)
-     memo fields = ~49MB
+     memo fields = ~54MB

beartype (https://github.com/beartype/beartype)
- TOTAL MEMORY USAGE: ~88MB
+ TOTAL MEMORY USAGE: ~97MB

Expression (https://github.com/cognitedata/Expression)
-     memo fields = ~54MB
+     memo fields = ~49MB

ignite (https://github.com/pytorch/ignite)
-     memo fields = ~171MB
+     memo fields = ~189MB

cki-lib (https://gitlab.com/cki-project/cki-lib)
-     memo fields = ~72MB
+     memo fields = ~80MB

altair (https://github.com/vega/altair)
-     memo fields = ~228MB
+     memo fields = ~251MB

pytest (https://github.com/pytest-dev/pytest)
- TOTAL MEMORY USAGE: ~276MB
+ TOTAL MEMORY USAGE: ~251MB

bokeh (https://github.com/bokeh/bokeh)
- TOTAL MEMORY USAGE: ~251MB
+ TOTAL MEMORY USAGE: ~276MB

scikit-learn (https://github.com/scikit-learn/scikit-learn)
-     memo fields = ~593MB
+     memo fields = ~539MB

manticore (https://github.com/trailofbits/manticore)
-     memo fields = ~593MB
+     memo fields = ~652MB

codspeed-hq bot commented Jul 2, 2025

CodSpeed WallTime Performance Report

Merging #19108 will not alter performance

Comparing UnboundVariable:semantic_tokens (b50a4f1) with main (44f2f77)

Summary

✅ 7 untouched benchmarks

UnboundVariable changed the title from "First cut at semantic token provider." to "[ty] First cut at semantic token provider" on Jul 2, 2025
@carljm (Contributor) left a comment

Thanks for the PR! I'd prefer for Micha or Dhruv (who know the LSP and our plans in that area better) to review this; just one initial comment.

@dhruvmanila (Member) left a comment

This is a great start, thank you for doing this!

I've been reviewing this for an hour and I need to take a lunch break, but here are my initial comments. I'll look at the remaining parts after the lunch break.

AlexWaygood added the server (Related to the LSP server) and ty (Multi-file analysis & type inference) labels on Jul 3, 2025
@dhruvmanila (Member) left a comment

Ok, my review is finally done, this is an awesome PR!

The test cases are extremely thorough, thank you for writing them!

I think the only required changes are to make sure that we convert the ty location values back to the LSP values in the common function for the semantic tokens handler, and to remove some of the calls in the visitor implementation that might lead to duplicate tokens.
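
(For reference, LSP positions are line/character pairs with the character counted in UTF-16 code units by default, so the conversion from ty's byte offsets looks roughly like the sketch below; the helper name is illustrative, and ty's actual code goes through its line-index machinery.)

    // Illustrative only: convert a byte offset in `text` to an LSP
    // (line, character) pair, counting characters in UTF-16 code units
    // as the LSP default encoding requires.
    fn offset_to_lsp_position(text: &str, offset: usize) -> (u32, u32) {
        let (mut line, mut character) = (0u32, 0u32);
        for (i, ch) in text.char_indices() {
            if i >= offset {
                break;
            }
            if ch == '\n' {
                line += 1;
                character = 0;
            } else {
                character += ch.len_utf16() as u32;
            }
        }
        (line, character)
    }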

A lot of my review comments could be considered suggestions, so feel free to either apply them or discard them. It's totally fine if those are done as a follow-up. I can also own them as a follow-up to this PR if you prefer.

AlexWaygood removed their request for review on July 3, 2025 at 10:54
@UnboundVariable (Collaborator, Author) commented

@dhruvmanila, thanks for the thorough and insightful code review comments! I think I've addressed them all.

@dhruvmanila (Member) left a comment

This is great! Thank you for taking this on and addressing all the review comments!

}

/// Convert to LSP modifier indices for encoding
pub fn to_lsp_indices(self) -> Vec<u32> {
Member

I don't really understand why we need this method. Can we not just use the contained u32 directly, which is what the LSP response is expecting? The usage of to_lsp_indices is constructing the same u32 as the one that's contained in SemanticTokenModifier:

        let token_modifiers = token
            .modifiers
            .to_lsp_indices()
            .into_iter()
            .fold(0u32, |acc, modifier_index| acc | (1 << modifier_index));
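
In other words, assuming SemanticTokenModifier is a newtype over that u32 bitset (the bits() accessor below is hypothetical), the handler could simply do:

        let token_modifiers = token.modifiers.bits();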

}
}

fn visit_body(&mut self, body: &[Stmt]) {
Member

This isn't required because the SourceOrderVisitor already has the visit_body method with the same logic as this.

Comment on lines +623 to +644
ast::Expr::StringLiteral(string_literal) => {
// For implicitly concatenated strings, emit separate tokens for each string part
for string_part in &string_literal.value {
self.add_token(
string_part.range(),
SemanticTokenType::String,
SemanticTokenModifier::empty(),
);
}
walk_expr(self, expr);
}
ast::Expr::BytesLiteral(bytes_literal) => {
// For implicitly concatenated bytes, emit separate tokens for each bytes part
for bytes_part in &bytes_literal.value {
self.add_token(
bytes_part.range(),
SemanticTokenType::String,
SemanticTokenModifier::empty(),
);
}
walk_expr(self, expr);
}
Member

I think we should implement the visit_string_literal and visit_bytes_literal methods, which are part of the SourceOrderVisitor trait, here instead, because otherwise this is going to miss adding string literal tokens that are inside an f-string. For example, in f"foo {'nested'}", the 'nested' is a StringLiteral, but this logic won't emit the string token because it's not part of an Expr::StringLiteral but an Expr::FString:

    fn visit_string_literal(&mut self, string_literal: &ast::StringLiteral) {
        self.add_token(
            string_literal.range(),
            SemanticTokenType::String,
            SemanticTokenModifier::empty(),
        );
    }

    fn visit_bytes_literal(&mut self, bytes_literal: &ast::BytesLiteral) {
        self.add_token(
            bytes_literal.range(),
            SemanticTokenType::String,
            SemanticTokenModifier::empty(),
        );
    }

Can we also add a test case using f-strings if it's not already present?

Comment on lines +645 to +668
ast::Expr::NumberLiteral(_) => {
self.add_token(
expr.range(),
SemanticTokenType::Number,
SemanticTokenModifier::empty(),
);
walk_expr(self, expr);
}
ast::Expr::BooleanLiteral(_) => {
self.add_token(
expr.range(),
SemanticTokenType::BuiltinConstant,
SemanticTokenModifier::empty(),
);
walk_expr(self, expr);
}
ast::Expr::NoneLiteral(_) => {
self.add_token(
expr.range(),
SemanticTokenType::BuiltinConstant,
SemanticTokenModifier::empty(),
);
walk_expr(self, expr);
}
Member

These walk_expr calls look redundant, as there are no expressions nested inside any of these expression variants.

Suggested change
- ast::Expr::NumberLiteral(_) => {
-     self.add_token(
-         expr.range(),
-         SemanticTokenType::Number,
-         SemanticTokenModifier::empty(),
-     );
-     walk_expr(self, expr);
- }
- ast::Expr::BooleanLiteral(_) => {
-     self.add_token(
-         expr.range(),
-         SemanticTokenType::BuiltinConstant,
-         SemanticTokenModifier::empty(),
-     );
-     walk_expr(self, expr);
- }
- ast::Expr::NoneLiteral(_) => {
-     self.add_token(
-         expr.range(),
-         SemanticTokenType::BuiltinConstant,
-         SemanticTokenModifier::empty(),
-     );
-     walk_expr(self, expr);
- }
+ ast::Expr::NumberLiteral(_) => {
+     self.add_token(
+         expr.range(),
+         SemanticTokenType::Number,
+         SemanticTokenModifier::empty(),
+     );
+ }
+ ast::Expr::BooleanLiteral(_) => {
+     self.add_token(
+         expr.range(),
+         SemanticTokenType::BuiltinConstant,
+         SemanticTokenModifier::empty(),
+     );
+ }
+ ast::Expr::NoneLiteral(_) => {
+     self.add_token(
+         expr.range(),
+         SemanticTokenType::BuiltinConstant,
+         SemanticTokenModifier::empty(),
+     );
+ }

Comment on lines +793 to +800
#[test]
fn test_semantic_tokens_variables() {
let test = cursor_test(
"
x = 42
y = 'hello'<CURSOR>
",
);
Member

Sorry, I totally missed this in my first review and it would've probably saved you some time. We've been using snapshot-based testing for individual IDE capabilities, like the hover example below:

#[test]
fn hover_member() {
let test = cursor_test(
r#"
class Foo:
a: int = 10
def __init__(a: int, b: str):
self.a = a
self.b: str = b
foo = Foo()
foo.<CURSOR>a
"#,
);
assert_snapshot!(test.hover(), @r"
int
---------------------------------------------
```text
int
```
---------------------------------------------
info[hover]: Hovered content is
--> main.py:10:9
|
9 | foo = Foo()
10 | foo.a
| ^^^^-
| | |
| | Cursor offset
| source
|
");
}

The assert_snapshot! macro is important here. What happens is that we create a human-readable output format for these requests and use assert_snapshot!(test.hover(), ""). Then, running the tests will automatically replace the "" (second argument) with the generated content from the test.hover() call. Finally, we verify that those snapshots are correct. The output format is decided by us and is based on what information is required to verify the implementation.

I don't think we need to change anything in this PR but it'd be useful to convert these tests into snapshots instead. I'll open a new issue to keep track of it and it would be a good "help wanted" issue a contributor could pick up.

Comment on lines +215 to +220
debug_assert!(
self.tokens.is_empty() || self.tokens.last().unwrap().start() <= range.start(),
"Tokens must be added in file order: previous token ends at {:?}, new token starts at {:?}",
self.tokens.last().map(SemanticToken::start),
range.start()
);
Member

Should we have a strict < check here instead of <=? I'm asking because adding duplicate tokens (same range) wouldn't trigger an assertion failure as written.
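
Something like the following, i.e. the same check with the comparison tightened so that a duplicate range also trips it:

    debug_assert!(
        self.tokens.is_empty() || self.tokens.last().unwrap().start() < range.start(),
        "Tokens must be added in strictly increasing file order: previous token starts at {:?}, new token starts at {:?}",
        self.tokens.last().map(SemanticToken::start),
        range.start()
    );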

Labels: server (Related to the LSP server), ty (Multi-file analysis & type inference)
Projects: None yet
4 participants