
[ty] First cut at semantic token provider #19108

Open · wants to merge 21 commits into main

Conversation

@UnboundVariable (Collaborator) commented Jul 2, 2025

This PR implements a basic semantic token provider for ty's language server. This allows for more accurate semantic highlighting / coloring within editors that support this LSP functionality.

Here are screenshots that show how code appears in VS Code using the "rainbow" theme both before and after this change.

[Screenshot: VS Code with the "rainbow" theme before this change]

[Screenshot: VS Code with the "rainbow" theme after this change]

The token types and modifier tags in this implementation largely mirror those used in Microsoft's default language server for Python.

The implementation supports two LSP interfaces. The first provides semantic tokens for an entire document, and the second returns semantic tokens for a requested range within a document.
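
For context on what these responses look like on the wire: the LSP requires semantic tokens to be returned as a flat array of integers, five per token, with the line and start column delta-encoded relative to the previous token. Below is a minimal sketch of that encoding in Rust (the AbsoluteToken type and encode_tokens function are illustrative, not ty's actual API):

    // Illustrative token with absolute 0-based line/column positions.
    struct AbsoluteToken {
        line: u32,
        start: u32,
        length: u32,
        token_type: u32,
        modifiers: u32,
    }

    /// Encode tokens (already sorted by position) into the flat LSP format:
    /// five u32s per token, with line/start given as deltas from the previous token.
    fn encode_tokens(tokens: &[AbsoluteToken]) -> Vec<u32> {
        let mut data = Vec::with_capacity(tokens.len() * 5);
        let (mut prev_line, mut prev_start) = (0u32, 0u32);
        for token in tokens {
            let delta_line = token.line - prev_line;
            // The start column is relative to the previous token only when
            // both tokens are on the same line.
            let delta_start = if delta_line == 0 {
                token.start - prev_start
            } else {
                token.start
            };
            data.extend([
                delta_line,
                delta_start,
                token.length,
                token.token_type,
                token.modifiers,
            ]);
            prev_line = token.line;
            prev_start = token.start;
        }
        data
    }

ty's real implementation works from its own token and range types, but the resulting Vec<u32> layout is what the textDocument/semanticTokens/full and textDocument/semanticTokens/range responses carry.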

The PR includes unit tests, as well as comments that document known limitations and areas for future improvement.

github-actions bot commented Jul 2, 2025

mypy_primer results

Changes were detected when running on open source projects
parso (https://github.com/davidhalter/parso)
-     memo fields = ~49MB
+     memo fields = ~54MB

beartype (https://github.com/beartype/beartype)
- TOTAL MEMORY USAGE: ~88MB
+ TOTAL MEMORY USAGE: ~97MB

Expression (https://github.com/cognitedata/Expression)
-     memo fields = ~54MB
+     memo fields = ~49MB

ignite (https://github.com/pytorch/ignite)
-     memo fields = ~171MB
+     memo fields = ~189MB

cki-lib (https://gitlab.com/cki-project/cki-lib)
-     memo fields = ~72MB
+     memo fields = ~80MB

altair (https://github.com/vega/altair)
-     memo fields = ~228MB
+     memo fields = ~251MB

pytest (https://github.com/pytest-dev/pytest)
- TOTAL MEMORY USAGE: ~276MB
+ TOTAL MEMORY USAGE: ~251MB

bokeh (https://github.com/bokeh/bokeh)
- TOTAL MEMORY USAGE: ~251MB
+ TOTAL MEMORY USAGE: ~276MB

scikit-learn (https://github.com/scikit-learn/scikit-learn)
-     memo fields = ~593MB
+     memo fields = ~539MB

manticore (https://github.com/trailofbits/manticore)
-     memo fields = ~593MB
+     memo fields = ~652MB

codspeed-hq bot commented Jul 2, 2025

CodSpeed WallTime Performance Report

Merging #19108 will not alter performance

Comparing UnboundVariable:semantic_tokens (b50a4f1) with main (44f2f77)

Summary

✅ 7 untouched benchmarks

UnboundVariable changed the title from "First cut at semantic token provider." to "[ty] First cut at semantic token provider" on Jul 2, 2025
@carljm (Contributor) left a comment

Thanks for the PR! I'd prefer for Micha or Dhruv (who know the LSP and our plans in that area better) to review this; just one initial comment.

@dhruvmanila (Member) left a comment

This is a great start, thank you for doing this!

I've been reviewing this for an hour and I need to take a lunch break, but here are my initial comments. I'll look at the remaining parts after the lunch break.

AlexWaygood added the server (Related to the LSP server) and ty (Multi-file analysis & type inference) labels on Jul 3, 2025
@dhruvmanila (Member) left a comment

Ok, my review is finally done, this is an awesome PR!

The test cases are extremely thorough, thank you for writing them!

I think the only required changes are to make sure that we convert the ty location values back to the LSP values in the common function for the semantic tokens handler, and to remove some of the calls in the visitor implementation that might lead to duplicate tokens.
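
(For reference, LSP positions are line/character pairs with the character counted in UTF-16 code units by default, so the conversion from ty's byte offsets looks roughly like the sketch below; the helper name is illustrative, and ty's actual code goes through its line-index machinery.)

    // Illustrative only: convert a byte offset in `text` to an LSP
    // (line, character) pair, counting characters in UTF-16 code units
    // as the LSP default encoding requires.
    fn offset_to_lsp_position(text: &str, offset: usize) -> (u32, u32) {
        let (mut line, mut character) = (0u32, 0u32);
        for (i, ch) in text.char_indices() {
            if i >= offset {
                break;
            }
            if ch == '\n' {
                line += 1;
                character = 0;
            } else {
                character += ch.len_utf16() as u32;
            }
        }
        (line, character)
    }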

A lot of my review comments could be considered suggestions, so feel free to either apply them or discard them. It's totally fine if those are done as a follow-up. I can also own them as a follow-up to this PR if you prefer.

AlexWaygood removed their request for review on July 3, 2025 at 10:54
@UnboundVariable (Collaborator, Author) commented

@dhruvmanila, thanks for the thorough and insightful code review comments! I think I've addressed them all.

@dhruvmanila (Member) left a comment

This is great! Thank you for taking this on and addressing all the review comments!

}

/// Convert to LSP modifier indices for encoding
pub fn to_lsp_indices(self) -> Vec<u32> {
Member

I don't really understand why we need this method. Can we not just use the contained u32 directly, which is what the LSP response is expecting? The usage of to_lsp_indices is constructing the same u32 as the one that's contained in SemanticTokenModifier:

        let token_modifiers = token
            .modifiers
            .to_lsp_indices()
            .into_iter()
            .fold(0u32, |acc, modifier_index| acc | (1 << modifier_index));
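
In other words, assuming SemanticTokenModifier is a newtype over that u32 bitset (the bits() accessor below is hypothetical), the handler could simply do:

        let token_modifiers = token.modifiers.bits();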

}
}

fn visit_body(&mut self, body: &[Stmt]) {
Member

This isn't required because the SourceOrderVisitor already has the visit_body method with the same logic as this.

Comment on lines +623 to +644
ast::Expr::StringLiteral(string_literal) => {
// For implicitly concatenated strings, emit separate tokens for each string part
for string_part in &string_literal.value {
self.add_token(
string_part.range(),
SemanticTokenType::String,
SemanticTokenModifier::empty(),
);
}
walk_expr(self, expr);
}
ast::Expr::BytesLiteral(bytes_literal) => {
// For implicitly concatenated bytes, emit separate tokens for each bytes part
for bytes_part in &bytes_literal.value {
self.add_token(
bytes_part.range(),
SemanticTokenType::String,
SemanticTokenModifier::empty(),
);
}
walk_expr(self, expr);
}
Member

I think we should implement the visit_string_literal and visit_bytes_literal methods, which are part of the SourceOrderVisitor trait, here instead, because otherwise this is going to miss adding string literal tokens that are inside an f-string. For example, in f"foo {'nested'}", the 'nested' is a StringLiteral, but this logic won't emit the string token because it's not part of an Expr::StringLiteral but an Expr::FString:

    fn visit_string_literal(&mut self, string_literal: &ast::StringLiteral) {
        self.add_token(
            string_literal.range(),
            SemanticTokenType::String,
            SemanticTokenModifier::empty(),
        );
    }

    fn visit_bytes_literal(&mut self, bytes_literal: &ast::BytesLiteral) {
        self.add_token(
            bytes_literal.range(),
            SemanticTokenType::String,
            SemanticTokenModifier::empty(),
        );
    }

Can we also add a test case using f-strings if it's not already present?

Comment on lines +645 to +668
ast::Expr::NumberLiteral(_) => {
self.add_token(
expr.range(),
SemanticTokenType::Number,
SemanticTokenModifier::empty(),
);
walk_expr(self, expr);
}
ast::Expr::BooleanLiteral(_) => {
self.add_token(
expr.range(),
SemanticTokenType::BuiltinConstant,
SemanticTokenModifier::empty(),
);
walk_expr(self, expr);
}
ast::Expr::NoneLiteral(_) => {
self.add_token(
expr.range(),
SemanticTokenType::BuiltinConstant,
SemanticTokenModifier::empty(),
);
walk_expr(self, expr);
}
Member

These walk_expr calls look redundant, as there are no expressions nested inside any of these expression variants.

Suggested change
- ast::Expr::NumberLiteral(_) => {
-     self.add_token(
-         expr.range(),
-         SemanticTokenType::Number,
-         SemanticTokenModifier::empty(),
-     );
-     walk_expr(self, expr);
- }
- ast::Expr::BooleanLiteral(_) => {
-     self.add_token(
-         expr.range(),
-         SemanticTokenType::BuiltinConstant,
-         SemanticTokenModifier::empty(),
-     );
-     walk_expr(self, expr);
- }
- ast::Expr::NoneLiteral(_) => {
-     self.add_token(
-         expr.range(),
-         SemanticTokenType::BuiltinConstant,
-         SemanticTokenModifier::empty(),
-     );
-     walk_expr(self, expr);
- }
+ ast::Expr::NumberLiteral(_) => {
+     self.add_token(
+         expr.range(),
+         SemanticTokenType::Number,
+         SemanticTokenModifier::empty(),
+     );
+ }
+ ast::Expr::BooleanLiteral(_) => {
+     self.add_token(
+         expr.range(),
+         SemanticTokenType::BuiltinConstant,
+         SemanticTokenModifier::empty(),
+     );
+ }
+ ast::Expr::NoneLiteral(_) => {
+     self.add_token(
+         expr.range(),
+         SemanticTokenType::BuiltinConstant,
+         SemanticTokenModifier::empty(),
+     );
+ }

Comment on lines +793 to +800
#[test]
fn test_semantic_tokens_variables() {
let test = cursor_test(
"
x = 42
y = 'hello'<CURSOR>
",
);
Member

Sorry, I totally missed this in my first review and it would've probably saved you some time. We've been using snapshot-based testing for individual IDE capabilities, like the hover example below:

#[test]
fn hover_member() {
let test = cursor_test(
r#"
class Foo:
a: int = 10
def __init__(a: int, b: str):
self.a = a
self.b: str = b
foo = Foo()
foo.<CURSOR>a
"#,
);
assert_snapshot!(test.hover(), @r"
int
---------------------------------------------
```text
int
```
---------------------------------------------
info[hover]: Hovered content is
--> main.py:10:9
|
9 | foo = Foo()
10 | foo.a
| ^^^^-
| | |
| | Cursor offset
| source
|
");
}

The assert_snapshot! macro is important here. What happens is that we create a human-readable output format for these requests and use assert_snapshot!(test.hover(), ""). Then, running the tests will automatically replace the "" (second argument) with the generated content from the test.hover() call. Finally, we verify that those snapshots are correct. The output format is decided by us and is based on what information is required to verify the implementation.

I don't think we need to change anything in this PR but it'd be useful to convert these tests into snapshots instead. I'll open a new issue to keep track of it and it would be a good "help wanted" issue a contributor could pick up.

Comment on lines +215 to +220
debug_assert!(
self.tokens.is_empty() || self.tokens.last().unwrap().start() <= range.start(),
"Tokens must be added in file order: previous token ends at {:?}, new token starts at {:?}",
self.tokens.last().map(SemanticToken::start),
range.start()
);
Member

Should we have a strict < check here instead of <=? I'm asking because adding duplicate tokens (same range) wouldn't trigger an assertion failure as written.
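
Something like the following, i.e. the same check with the comparison tightened so that a duplicate range also trips it:

    debug_assert!(
        self.tokens.is_empty() || self.tokens.last().unwrap().start() < range.start(),
        "Tokens must be added in strictly increasing file order: previous token starts at {:?}, new token starts at {:?}",
        self.tokens.last().map(SemanticToken::start),
        range.start()
    );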

Labels: server (Related to the LSP server), ty (Multi-file analysis & type inference)
Projects: None yet
4 participants