Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize TextColor::GetColor #10779

Merged
1 commit merged into from
Aug 2, 2021
Merged

Vectorize TextColor::GetColor #10779

1 commit merged into from
Aug 2, 2021

Conversation

lhecker
Copy link
Member

@lhecker lhecker commented Jul 24, 2021

I was watching a video about vectorized instructions and I wanted to
try out some new things, as I had never written AVX code before.
This commit is the result of this tiny Thursday morning detour into
AVX land. It improves performance of TextColor::GetColor by about 3x.

Validation Steps Performed

  • Default colors are still properly shifted +8 ✔️

@lhecker lhecker force-pushed the dev/lhecker/vectorize-text-color branch from 316a285 to 69adabb Compare July 24, 2021 01:27
Copy link

@zadjii zadjii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😱

@lhecker lhecker force-pushed the dev/lhecker/vectorize-text-color branch from 69adabb to 4f96a19 Compare July 24, 2021 02:51
@j4james
Copy link
Collaborator

j4james commented Jul 24, 2021

FYI, one of things I've been working on is a refactoring of the color handling, which would include removing that looping color match altogether. Essentially there would be a separate color table entry for default bold which would get set automatically whenever the color table is updated. This is necessary to support a user-chosen bold color (issue #5682), as well as some of the DEC VT52x color sequences that can set the bold color.

Technically we might still want to do a looping color match if the bold color is set to "auto", but that would happen only when the color table is changed, rather than on every color lookup, so performance is less of an issue.

@lhecker
Copy link
Member Author

lhecker commented Jul 24, 2021

This PR doesn't noticeably improve performance. I just did it because I was trying to get back into vectorized programming. The value I see in this is that it sets a precedent of being less afraid of low-level operations.

Copy link
Member

@miniksa miniksa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this.

I do like that you're dipping our toes into being able to do these sorts of things. We leveraged stuff like this with libpopcnt, but it's cool to see us do it ourselves too.

If you didn't "just write it for a sunny day"... I probably wouldn't advise writing it. But it exists. So let's check it in.

@lhecker lhecker marked this pull request as ready for review July 27, 2021 21:08
Copy link
Member

@zadjii-msft zadjii-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I'm, not really sure if I'm qualified to sign off on this, but the comments are excellent so I trust that it works.

const auto needle = _mm256_set1_epi32(__builtin_bit_cast(int, defaultColor)); // 2.
const auto result = _mm256_cmpeq_epi32(haystack, needle); // 3.
const auto mask = _mm256_movemask_ps(_mm256_castsi256_ps(result)); // 4.
unsigned long index;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to get initialized to 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_BitScanForward compiles to the assembly instruction bsf, whose resulting value (the one stored in index) is undefined if the given value has no set bits. Since I only use index when _BitScanForward return true (meaning at least 1 bit was set) I'm never reading any uninitialized values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okeydokey good enough for me!

@lhecker lhecker added Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Area-Performance Performance-related issue AutoMerge Marked for automatic merge by the bot when requirements are met labels Aug 2, 2021
@ghost
Copy link

ghost commented Aug 2, 2021

Hello @lhecker!

Because this pull request has the AutoMerge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@ghost ghost merged commit fc64ff3 into main Aug 2, 2021
@ghost ghost deleted the dev/lhecker/vectorize-text-color branch August 2, 2021 19:03
@kasper93
Copy link
Contributor

kasper93 commented Aug 7, 2021

@lhecker You might want to also add run-time dispatch to one of the version, because you won't ship AVX2 built terminal, do you? So this code is basically dead coz no one will build/use it.

@lhecker
Copy link
Member Author

lhecker commented Aug 7, 2021

@kasper93 The difference between the SSE2 and AVX2 variant are only 1.5ns/call vs. 1.6ns/call. The only reason the AVX2 variant exists is because it made explaining the code easier (and because I build my own terminal with /arch:AVX2). This code in particular isn't all that important to be honest.

But I'm adding more important & impactful vectorized code in the future, which I've already planned to use dynamic dispatch. Currently I've got my own dispatch logic but I might use Google's highway or the SIMD-everywhere lib when I finalize the code.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Area-Performance Performance-related issue AutoMerge Marked for automatic merge by the bot when requirements are met
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants