-
Notifications
You must be signed in to change notification settings - Fork 5k
Interlocked.Read from 32-bit apps are much slower on Intel Sapphire Rapids CPUs #93624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Intel provided the following links. https://lwn.net/Articles/790464 Bit 29 in MSR 0x33 can be used to disable exception generation that is used by kernel/hypervisor to rate limit the bus locks Do we really need to take a lock to read a 64-bit int on 64-bit processors? |
Tagging subscribers to this area: @mangod9 Issue DetailsDescriptionInterlocked.Read from 32-bit apps are much slower (100x) on Intel Sapphire Rapids CPUs. using System; namespace UnmanagedMemoryStreamPerfTest
} Configuration.NET 4.8 and 6.0 Regression?No, new CPU behavior. DataWPR of real world apps are available. AnalysisThis was found tuning WPF applications on VDI and seeing a noticeable performance drop opening new screens on these new processors. Many large enterprise apps have 32-bit 3rd-party dependencies, so cannot be rebuilt as 64-bit processes to avoid this perf issue.
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsDescriptionInterlocked.Read from 32-bit apps are much slower (100x) on Intel Sapphire Rapids CPUs. using System; namespace UnmanagedMemoryStreamPerfTest
} Configuration.NET 4.8 and 6.0 Regression?No, new CPU behavior. DataWPR of real world apps are available. AnalysisThis was found tuning WPF applications on VDI and seeing a noticeable performance drop opening new screens on these new processors. Many large enterprise apps have 32-bit 3rd-party dependencies, so cannot be rebuilt as 64-bit processes to avoid this perf issue.
|
This might be true, but it always bugs me. |
@DeepakRajendrakumaran is that the issue we dicussed? Long story short - the perf bottle-neck is in runtime/src/libraries/System.Private.CoreLib/src/System/IO/UnmanagedMemoryStream.cs Line 262 in 45ed0ab
the long field is not guaranteed to be aligned to 8-bytes on 32bit and that causes the perf regression. In theory, we could offer 8-byte alignment for such types (all types with long fields?) just like we do it for some cases with doubles (aligning to 8-bytes via fake objects) but that probably a bit of work, as, presumably, VM will have to align fields to 8-bytes then + potential heap size regression on 32bit. Anyway we don't think it's a codegen issue in any way. |
Yes. This is the same issue @EgorBo : Is using a 4 byte data type instead of |
Not for the |
is there a better implementation for Interlocked.Read that is better for modern hardware? MemoryStream doesn't
FileStream doesn't either
|
I am not aware of simpler ways to perform a volatile/interlocked 8-byte load on x86 (32bit) other than https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange64
Presumably, to make it thread-safe so when some other thread is changing length, the value won't be torn if read by other thread - I don't know why it's done for PS: ah, MemoryStream's length is actually |
Just curious, wouldn't it work if the field was changed to a double making it aligned allowing a 8 byte load to vector register? However it would lead to increased complexity and worse code on x64 (which might be somewhat negated by different code paths based on size of intptr) so it is not be a generic solution |
Regarding SIMD, I don't think a simd load can guarantee us atomicity? |
There are other places we do something similar(End up doing a CompareExchange on long) - For eg.
runtime/src/libraries/System.Private.CoreLib/src/System/Threading/Interlocked.cs Line 347 in fa3babd
|
|
...on 32bit+ systems. |
When data is aligned then it's atomic. Otherwise no guarantee, and for cacheline split it's not atomic. |
Only on AVX+ hardware, if aligned to 16b and special aligned loads/stores are used. It makes it useless for us for atomicity |
FYI - I locally switched the p.s. This was just an expt to verify and confirm that this is indeed the issue. |
UnmanagedMemoryStream used Interlocked operations to update its state to prevent tearing of 64-bit values on 32-bit platforms. This pattern is expensive in general and it was found to be prohibitively expensive on recent hardware.. This change removes the expensive Interlocked operations and addresses the tearing issues in alternative way: - The _length field is converted to nuint that is guaranteed to be updated atomically. - Writes to _length field are volatile to guaranteed the unininitialized memory cannot be read. - The _position field remains long and it has a risk of tearing. It is not a problem since tearing of this field cannot lead to buffer overruns. Fixes dotnet#93624
Tagging subscribers to this area: @dotnet/area-system-io Issue DetailsDescriptionInterlocked.Read from 32-bit apps are much slower (100x) on Intel Sapphire Rapids CPUs. using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
namespace UnmanagedMemoryStreamPerfTest
{
class Program
{
static bool _Is32BitProcess = true;
static Stream _Stream = null;
static void Main(string[] args)
{
_Is32BitProcess = GetIs32BitProcess();
_Stream = GetStream();
Console.WriteLine("Is 32 bit process: " + _Is32BitProcess);
Stopwatch sw = new Stopwatch();
sw.Start();
MainAsync();
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
Console.ReadLine();
}
static async void MainAsync()
{
//100 threads to make it eaven more obvious
List<Task> tasks = new List<Task>();
for (int i = 0; i < 100; i++)
tasks.Add(DoStuff(GetStream()));
await Task.WhenAll(tasks);
}
static unsafe async Task DoStuff(Stream stream)
{
//contrived, but this is just so we can see it.
//for every XAML file, WPF will read baml streams from the assembly.
//for apps with dense UIs, that can thousands of streams to open a screen
//the problem, 32-bit will be about 3-20 times slower. 200 times slower on high end 2023 AMD and Intel chips due to a throttling feature for the type of lock UnmangedMemoryStream takes.
int count = 1000000;
for (int i = 0; i < count; i++)
{
DoStuffStream(stream);
//DoStuffInterlocked()
//DoStuffInterlockedTreF()
}
}
static unsafe long DoStuffStream(Stream stream)
{
return stream.Length;
}
static unsafe long DoStuffInterlocked()
{
long len = 0;
Interlocked.Read(ref len);
return len;
}
static unsafe long DoStuffInterlockedTreF()
{
// inspriation from Parallel.For() fix //https://devdiv.visualstudio.com/DevDiv/_search?text=969699&type=workitem&pageSize=25&filters=Projects%7BDevDiv%7D
//Read uses Add under the hood, so this should be ok
//gotcha is that 32-bit apps could then only use 2GB streams, limited by 32-bit int
//maybe another Flag _BigStream
long len = 0;
if (_Is32BitProcess /*&& !_BigStream*/)
{
long* indexPtr = &len;
{
Interlocked.Add(ref *(int*)indexPtr, 0);
}
}
else
{
Interlocked.Read(ref len);
}
return len;
}
static unsafe Stream GetStream()
{
//https://learn.microsoft.com/en-us/dotnet/api/system.io.unmanagedmemorystream?view=net-7.0
//this is over kill, but its copy/paste MS sample code...
byte[] message = UnicodeEncoding.Unicode.GetBytes(GetMessage());
// Allocate a block of unmanaged memory and return an IntPtr object.
IntPtr memIntPtr = Marshal.AllocHGlobal(message.Length);
// Get a byte pointer from the IntPtr object.
byte* memBytePtr = (byte*)memIntPtr.ToPointer();
// Create an UnmanagedMemoryStream object using a pointer to unmanaged memory.
UnmanagedMemoryStream writeStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Write);
// Write the data.
writeStream.Write(message, 0, message.Length);
// Close the stream.
writeStream.Close();
writeStream.Dispose();
// Create another UnmanagedMemoryStream object using a pointer to unmanaged memory.
UnmanagedMemoryStream readStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Read);
// Create a byte array to hold data from unmanaged memory.
byte[] outMessage = new byte[message.Length];
// Read from unmanaged memory to the byte array.
readStream.Read(outMessage, 0, message.Length);
// Close the stream.
//readStream.Close();
//readStream.Dispose();
return readStream;
}
static string GetMessage()
{
return "Here is some data.";
}
static bool GetIs32BitProcess()
{
return IntPtr.Size == 4;
}
}
} Configuration.NET 4.8 and 6.0 Regression?No, new CPU behavior. DataWPR of real world apps are available. AnalysisThis was found tuning WPF applications on VDI and seeing a noticeable performance drop opening new screens on these new processors. Many large enterprise apps have 32-bit 3rd-party dependencies, so cannot be rebuilt as 64-bit processes to avoid this perf issue.
|
* Improve performance of UnmanagedMemoryStream UnmanagedMemoryStream used Interlocked operations to update its state to prevent tearing of 64-bit values on 32-bit platforms. This pattern is expensive in general and it was found to be prohibitively expensive on recent hardware.. This change removes the expensive Interlocked operations and addresses the tearing issues in alternative way: - The _length field is converted to nuint that is guaranteed to be updated atomically. - Writes to _length field are volatile to guaranteed the unininitialized memory cannot be read. - The _position field remains long and it has a risk of tearing. It is not a problem since tearing of this field cannot lead to buffer overruns. Fixes #93624 * Add comment
UnmanagedMemoryStream used Interlocked operations to update its state to prevent tearing of 64-bit values on 32-bit platforms. This pattern is expensive in general and it was found to be prohibitively expensive on recent hardware.. This change removes the expensive Interlocked operations and addresses the tearing issues in alternative way: - The _length field is converted to nuint that is guaranteed to be updated atomically. - Writes to _length field are volatile to guaranteed the unininitialized memory cannot be read. - The _position field remains long and it has a risk of tearing. It is not a problem since tearing of this field cannot lead to buffer overruns. Fixes #93624
* Improve performance of UnmanagedMemoryStream UnmanagedMemoryStream used Interlocked operations to update its state to prevent tearing of 64-bit values on 32-bit platforms. This pattern is expensive in general and it was found to be prohibitively expensive on recent hardware.. This change removes the expensive Interlocked operations and addresses the tearing issues in alternative way: - The _length field is converted to nuint that is guaranteed to be updated atomically. - Writes to _length field are volatile to guaranteed the unininitialized memory cannot be read. - The _position field remains long and it has a risk of tearing. It is not a problem since tearing of this field cannot lead to buffer overruns. Fixes #93624 * Add comment --------- Co-authored-by: Jan Kotas <[email protected]>
* [release/8.0] Stable branding for .NET 8 GA * Handle an empty bandPreleaseVersion * [release/8.0] Update dependencies from dotnet/emsdk (#93801) * Update dependencies from https://github.com/dotnet/emsdk build 2023102.2 Microsoft.NET.Workload.Emscripten.Current.Manifest-8.0.100.Transport From Version 8.0.0-rtm.23511.3 -> To Version 8.0.0-rtm.23520.2 * switch to the stable version now * Update dependencies * Also fix the trigger * pin the wbt sdk to avoid the latest analizer nonsense * Use source generation for the serializer * Try to make sourcebuild happy * Try again to make sourcebuild happy * Use reflection and suppress trim analysis instead This reverts commit 768b65b. * Fix reverting too much --------- Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com> Co-authored-by: Larry Ewing <[email protected]> * [release/8.0] Update APICompat settings under source build (#93865) * Update APICompat settings under source build * Update resolveContract.targets --------- Co-authored-by: Viktor Hofer <[email protected]> * Override InformationalVersion for NativeAOT corelib too * [release/8.0] Improve performance of UnmanagedMemoryStream (#93812) * Improve performance of UnmanagedMemoryStream UnmanagedMemoryStream used Interlocked operations to update its state to prevent tearing of 64-bit values on 32-bit platforms. This pattern is expensive in general and it was found to be prohibitively expensive on recent hardware.. This change removes the expensive Interlocked operations and addresses the tearing issues in alternative way: - The _length field is converted to nuint that is guaranteed to be updated atomically. - Writes to _length field are volatile to guaranteed the unininitialized memory cannot be read. - The _position field remains long and it has a risk of tearing. It is not a problem since tearing of this field cannot lead to buffer overruns. Fixes #93624 * Add comment --------- Co-authored-by: Jan Kotas <[email protected]> * Update dependencies from https://github.com/dotnet/emsdk build 20231023.2 (#93881) Microsoft.SourceBuild.Intermediate.emsdk , Microsoft.NET.Workload.Emscripten.Current.Manifest-8.0.100 From Version 8.0.0-rtm.23520.2 -> To Version 8.0.0-rtm.23523.2 Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com> * [release/8.0] Update dependencies from dnceng/internal/dotnet-optimization (#93827) * Update dependencies from https://dev.azure.com/dnceng/internal/_git/dotnet-optimization build 20231021.3 optimization.linux-arm64.MIBC.Runtime , optimization.linux-x64.MIBC.Runtime , optimization.windows_nt-arm64.MIBC.Runtime , optimization.windows_nt-x64.MIBC.Runtime , optimization.windows_nt-x86.MIBC.Runtime , optimization.PGO.CoreCLR From Version 1.0.0-prerelease.23519.5 -> To Version 1.0.0-prerelease.23521.3 * Update dependencies from https://dev.azure.com/dnceng/internal/_git/dotnet-optimization build 20231021.3 optimization.linux-arm64.MIBC.Runtime , optimization.linux-x64.MIBC.Runtime , optimization.windows_nt-arm64.MIBC.Runtime , optimization.windows_nt-x64.MIBC.Runtime , optimization.windows_nt-x86.MIBC.Runtime , optimization.PGO.CoreCLR From Version 1.0.0-prerelease.23519.5 -> To Version 1.0.0-prerelease.23521.3 --------- Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com> * [release/8.0][wasm] Fix perf pipeline runs (#93888) * Remove --experimental-wasm-eh argument from the wasm_args used for wasm performance runs. (#93357) (cherry picked from commit a770017) * performance-setup.sh: Use `release/8.0` as the default channel * performance-setup.ps1: use release/8.0 as the default channel * Fix passing wasmArgs for bdn --------- Co-authored-by: Parker Bibus <[email protected]> * [release/8.0] Honor JsonSerializerOptions.PropertyNameCaseInsensitive in property name conflict resolution. (#93935) * Honor JsonSerializerOptions.PropertyNameCaseInsensitive in property name conflict resolution. * Update src/libraries/System.Text.Json/tests/Common/PropertyNameTests.cs Co-authored-by: Jeff Handley <[email protected]> * Address feedback --------- Co-authored-by: Eirik Tsarpalis <[email protected]> Co-authored-by: Jeff Handley <[email protected]> * [8.0] Update MsQuic (#93979) * Try pinning the installer version to a 8.01xx sdk * Target net8.0 in SatelliteAssemblyFromProjectRef --------- Co-authored-by: Carlos Sánchez López <[email protected]> Co-authored-by: Larry Ewing <[email protected]> Co-authored-by: dotnet-maestro[bot] <42748379+dotnet-maestro[bot]@users.noreply.github.com> Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Viktor Hofer <[email protected]> Co-authored-by: Alexander Köplinger <[email protected]> Co-authored-by: Jan Kotas <[email protected]> Co-authored-by: Ankit Jain <[email protected]> Co-authored-by: Parker Bibus <[email protected]> Co-authored-by: Eirik Tsarpalis <[email protected]> Co-authored-by: Jeff Handley <[email protected]>
Description
Interlocked.Read from 32-bit apps are much slower (100x) on Intel Sapphire Rapids CPUs.
Configuration
.NET 4.8 and 6.0
Regression?
No, new CPU behavior.
Data
WPR of real world apps are available.
Analysis
This was found tuning WPF applications on VDI and seeing a noticeable performance drop opening new screens on these new processors.
Interlocked.Read is used by a lot of APIs and apps. If we can improve perf here, a lot of apps will magically run faster on modern hardware. Otherwise, this issue will have to be addressed on an individual basis.
Many large enterprise apps have 32-bit 3rd-party dependencies, so cannot be rebuilt as 64-bit processes to avoid this perf issue.
The text was updated successfully, but these errors were encountered: