Fast Hashing with BLAKE2 Part 2: Proper SIMD Comes to .NET
In Part 1 of this post, I evaluated the 8 NuGet packages currently available that list support for the BLAKE2 hashing algorithm(s). My original goal was to identify the best ones, to see if I could improve on them by using the new X86 SIMD Intrinsics support in .NET Core 2.1. I found that of those 8 published packages, exactly one was both RFC-compliant and acceptably fast. That was an implementation of BLAKE2s (blake2s-net), which is the lesser-used of the BLAKE2 variants. There were zero acceptable implementations of BLAKE2b available on NuGet, although there was a good implementation (Blake2Sharp) available in source form on GitHub.
With the current state being such a mess, it would be difficult to not improve on it, but I’ll have a go anyway.
How fast is BLAKE2?
To start, though, I want to take a step back and look at where things stand using those current best .NET implementations of BLAKE2. If you read up on the benefits of BLAKE2, one of its selling points is that it is simple to implement and has excellent performance in software-only implementations. In fact, it’s claimed to be faster than the much-less-secure MD5, which is still commonly used for file fingerprinting because of its low calculation cost.
I figured I would check that claim out by comparing those winning C# implementations of BLAKE2 against MD5, SHA-256 and SHA-512 to see how they stack up. One factor that’s often overlooked is that different hashing algorithms are designed for optimal processing on different architectures. BLAKE2b is supposed to be faster than BLAKE2s on 64-bit platforms and the opposite should be true on 32-bit. Similarly, SHA512 should be faster in 64-bit while SHA256 should be faster in 32-bit. I decided to test all the algorithms head-to-head on both platforms using BenchmarkDotNet to see if all those assumptions were true in .NET land.
My environment (updated to .NET Core 2.1 final after Part 1):
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134 Intel Xeon CPU E3-1505M v6 3.00GHz, 1 CPU, 8 logical and 4 physical cores Frequency=2929685 Hz, Resolution=341.3336 ns, Timer=TSC .NET Core SDK=2.1.300 [Host] : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT DefaultJob : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT
And the results:
Method | Platform | Mean | Error | StdDev | Allocated | ------------ |--------- |----------:|----------:|----------:|----------:| Blake2Sharp | X64 | 16.91 ms | 0.3498 ms | 0.4164 ms | 864 B | blake2s-net | X64 | 22.54 ms | 0.1468 ms | 0.1301 ms | 536 B | MD5 | X64 | 21.13 ms | 0.2073 ms | 0.1939 ms | 0 B | SHA256 | X64 | 46.57 ms | 0.4872 ms | 0.4557 ms | 0 B | SHA512 | X64 | 28.16 ms | 0.2036 ms | 0.1905 ms | 304 B | | | | | | | Blake2Sharp | X86 | 170.02 ms | 1.0865 ms | 1.0163 ms | 712 B | blake2s-net | X86 | 37.44 ms | 0.1919 ms | 0.1602 ms | 0 B | MD5 | X86 | 20.21 ms | 0.1491 ms | 0.1395 ms | 0 B | SHA256 | X86 | 52.78 ms | 0.4144 ms | 0.3876 ms | 0 B | SHA512 | X86 | 44.76 ms | 0.4982 ms | 0.4416 ms | 0 B |
I ran only the 10MiB data set here to keep the test results brief. The x64 results are in line with expectations. BLAKE2b (using Blake2Sharp) is, in fact, faster than anything else in 64-bit, but it performs particularly poorly in 32-bit. BLAKE2s (using blake2s-net) does admirably on both platforms, but it still trails the performance of MD5, especially on 32-bit.
One wrinkle in these tests is that the built-in hashing algorithms in .NET Core use platform-native implementations. See the HashProviderDispenser variants for details. I’m on Windows, so I’m getting the CNG implementations.
I can’t find any references at the moment, but I’m fairly certain I’ve read before that Windows CNG has at least some SIMD-accelerated code in some of the algorithms. That might explain why SHA512 is faster on both platforms than SHA256. Either way, since those native implementations are what you get in the box in .NET Core, I think you have to try to beat them if you’re considering BLAKE2 for its performance benefits.
The other wrinkle is that RyuJIT-32 in .NET Core 2.1 performs particularly poorly at generating optimal code for BLAKE2b. I raised an issue regarding that on the coreclr repo, and it is currently under investigation. The Blake2Sharp timings with the legacy jit32 were closer to 100ms.
And keep in mind that because not a single one of the BLAKE2b implementations available on NuGet was properly optimized, they came in anywhere from 3-17000x(!) slower than Blake2Sharp on x64, meaning they aren’t actually competitive with SHA512 at all.
Finally, it’s important to remember that when it comes to cryptographic hashing, performance problems can be security problems. It’s common when hashing passwords, for example, to use an iterative hashing function like PBKDF2 and to base the iteration count on a balance between making things difficult for an attacker to brute-force your hashes and making the application responsive to users attempting to log on. Someone attacking your password hashes will most certainly be using optimized code, so if you’re not, you’re giving the attacker the advantage.
Let’s Get Optimizin’
As I mentioned in Part 1, the Blake2Sharp reference repo has some incomplete code related to optional BLAKE2b features, and those extra features complicate the code unnecessarily. The blake2s-net code is derived from that same base, and it has some of the same issues. For my versions, I decided to start from the RFC implementations and optimize from there, keeping things as simple as possible. My first goal was to be at least as fast as those benchmark reference versions with only scalar code.
I was able to make some nice improvements by simplifying the initialization code, by using a bit of unsafe code to avoid making an unnecessary copy of the data during hash updates, and by re-ordering some operations to avoid CPU pipeline stalls.
Here are the numbers from my scalar implementations (I call them Blake2Fast) compared with the fastest versions previously available.
3-byte input
Method | Hash | Mean | Error | StdDev | Gen 0 | Allocated | ------------ |----------- |---------:|----------:|----------:|-------:|----------:| Blake2Sharp | 44229FC0EF | 521.9 ns | 5.5792 ns | 5.2188 ns | 0.2050 | 864 B | Blake2bFast | 44229FC0EF | 225.8 ns | 1.3251 ns | 1.2395 ns | 0.0074 | 32 B | | | | | | | | blake2s-net | FE4D57BA07 | 359.6 ns | 3.0930 ns | 2.8932 ns | 0.1273 | 536 B | Blake2sFast | FE4D57BA07 | 180.1 ns | 0.5917 ns | 0.5245 ns | 0.0074 | 32 B |
3.19KiB input
Method | Hash | Mean | Error | StdDev | Gen 0 | Allocated | ------------ |----------- |---------:|----------:|----------:|-------:|----------:| Blake2Sharp | 61EB59036B | 5.537 us | 0.0286 us | 0.0267 us | 0.1984 | 864 B | Blake2bFast | 61EB59036B | 4.243 us | 0.0222 us | 0.0186 us | - | 32 B | | | | | | | | blake2s-net | 62320CA3FC | 7.331 us | 0.0593 us | 0.0555 us | 0.1221 | 536 B | Blake2sFast | 62320CA3FC | 6.554 us | 0.0296 us | 0.0276 us | - | 32 B |
10MiB input
Method | Hash | Mean | Error | StdDev | Allocated | ------------ |----------- |---------:|----------:|----------:|----------:| Blake2Sharp | 7B6AB409B7 | 16.60 ms | 0.1319 ms | 0.1234 ms | 864 B | Blake2bFast | 7B6AB409B7 | 13.19 ms | 0.0533 ms | 0.0472 ms | 0 B | | | | | | | blake2s-net | 6500962DE3 | 22.24 ms | 0.1427 ms | 0.1335 ms | 536 B | Blake2sFast | 6500962DE3 | 20.66 ms | 0.1719 ms | 0.1436 ms | 0 B |
As you can see, the changes give my Blake2Fast versions a nice speed advantage over the existing best implementations, using only scalar code. The lower initialization overhead makes them roughly twice as fast with the smallest input, and the other optimizations show their benefits on the larger inputs.
One other change I made was to store the hash working state in a struct rather than a class. This makes Blake2Fast allocation-free (except for the array it allocates for the hash result itself) when using it in an all-at-once call. BLAKE2 is optimized to use very little memory for its hash state, so there’s no risk to keeping it on the stack when possible.
Bring on the Intrinsics
Having made the scalar code as fast as I could get it, it was time to see what could be done with the new Intrinsics support in .NET Core 2.1. But First a bit of background for those not familiar…
In .NET, a JIT Intrinsic is a method that is specially recognized by RyuJIT and has its IL implementation replaced with an optimized bit of machine code during JIT compilation. These first came into wide use in the System.Numerics.Vectors assembly, where Vector3, Vector4, Matrix4x4, Vector<T> and friends have Intrinsic methods that are replaced by the JIT with SIMD instructions on platforms that support them. System.Numerics.Vectors opened up a new world of performance in .NET code, and I made heavy use of its Intrinsics in the resizing, sharpening, and pixel format conversion code in MagicScaler. But it wasn’t without its problems.
First, not everything in System.Numerics.Vectors is JIT Intrinsic. For example, Vector4.Min() is implemented with a single SSE instruction that operates on 4 floats at once, as is Vector4.Max(). But Vector4.Clamp(), rather than using those two SSE instructions, has a complicated (and much slower) managed implementation designed to preserve compatibility with the Direct3D HLSL behavior. The documentation makes no mention of the difference, so the only way to know what you’re getting is to look at the source code for the method or to view the generated assembly language from the JIT. Those sorts of performance traps can be very confusing.
Second, the optimized versions of the methods are only emitted by the JIT if optimizations are enabled when you build your assembly. This means that normally, you’ll get a managed (and very slow) version in Debug mode and the optimized SIMD instructions in Release mode. Further, there can be differences in the results between the managed and Intrinsic versions of the code, so you may get different results in Debug and Release builds.
Third, Vector<T> can be very complicated to use because its size is different in different environments. Vector<float>, for example, holds 4 floats in its managed implementation or in the Intrinsic implementation on older hardware. On newer hardware that supports the AVX2 instruction set, Vector<float> holds 8 floats. That makes it difficult to design algorithms since you have to account for both possible sizes.
And finally, System.Numerics.Vectors implements only a tiny fraction of the SIMD instructions available on modern processors. Its API surface was designed with 3D game development in mind, so anything not useful for 3D graphics is almost certainly absent.
Enter System.Runtime.Intrinsics
In order to properly expose the complete set of SIMD instructions supported by modern processors, the CoreCLR team, along with help from Intel and Samsung (hooray for open source!), have been working on a lower-level set of APIs for .NET, implemented as JIT Intrinsics. Unlike the abstracted classes and methods in System.Numerics.Vectors, these new Intrinsics map directly to individual SIMD instruction sets and instructions. These are beginning to come together, and .NET Core 2.1 has the first useable bits in it, although they are designated ‘Experimental’ at this time.
Interestingly, the new Intrinsics support wasn’t listed among the new features in the .NET Core 2.1 RC1 release announcement, but Microsoft did publish a NuGet package with the reference assemblies, and the support is present in the JIT in both the RC1 and RTM/RTW versions of .NET Core 2.1.
Unfortunately, it appears the NuGet package was published to nuget.org by mistake, and now that .NET Core 2.1 has been released in its final version, the package has been de-listed. Its RTM version exists only on myget.org.
Let that serve as a warning to you; this is all experimental. The APIs that work work, but not all of them do work. And the APIs may change in .NET Core 2.2 or 3.0.
Unlike System.Numerics.Vectors, nothing in System.Runtime.Intrinsics has a managed implementation. The NuGet package contains only a reference assembly, and all the methods in that assembly will throw a PlatformNotSupportedException unless the JIT recognizes and substitutes them with the appropriate instructions. This means that the new Intrinsics can’t be used without a compatible JIT, which means they will only work in .NET Core 2.1 (and the 2.2 previews) for now.
Fortunately, the Experimental version does have nearly complete support for SSE-SSE4.1, and quite a bit of AVX is present. That allows for a lot of algorithms to be implemented, including some existing optimized versions of BLAKE2. Since there’s already a good SSE4.1 implementation of BLAKE2 available in the reference repo, all I had to do was port the existing code over to see how well it performed on .NET Core 2.1.
I’ve published that code on GitHub, so I’ll jump straight into the benchmarks, comparing with the previous best BLAKE2 implementations and the built-in hashing algorithms. This is the same 10MiB benchmark from the beginning of this post, on both 32-bit and 64-bit versions of the .NET Core 2.1 runtime.
Method | Platform | Mean | Error | StdDev | Allocated | ------------ |--------- |----------:|----------:|----------:|----------:| Blake2Sharp | X64 | 16.56 ms | 0.1586 ms | 0.1484 ms | 864 B | *Blake2bFast | X64 | 12.13 ms | 0.0870 ms | 0.0771 ms | 0 B | blake2s-net | X64 | 22.26 ms | 0.1443 ms | 0.1350 ms | 536 B | *Blake2sFast | X64 | 16.27 ms | 0.1362 ms | 0.1274 ms | 0 B | MD5 | X64 | 21.22 ms | 0.1190 ms | 0.1113 ms | 0 B | SHA256 | X64 | 46.16 ms | 0.2564 ms | 0.2398 ms | 0 B | SHA512 | X64 | 27.89 ms | 0.0982 ms | 0.0871 ms | 304 B | | | | | | | Blake2Sharp | X86 | 168.31 ms | 0.5426 ms | 0.4810 ms | 712 B | *Blake2bFast | X86 | 16.56 ms | 0.0879 ms | 0.0779 ms | 0 B | blake2s-net | X86 | 37.46 ms | 0.2728 ms | 0.2552 ms | 0 B | *Blake2sFast | X86 | 16.36 ms | 0.1103 ms | 0.1032 ms | 0 B | MD5 | X86 | 20.06 ms | 0.0996 ms | 0.0931 ms | 0 B | SHA256 | X86 | 52.47 ms | 0.3252 ms | 0.3042 ms | 0 B | SHA512 | X86 | 44.07 ms | 0.1643 ms | 0.1372 ms | 0 B |
The SSE4.1 versions (marked with *) of both Blake2Fast algorithms improve on the previous best versions and are faster than all the common built-in hashing algorithms from .NET (Windows CNG). The 32-bit runtime is where the SIMD advantage really shows up, though. Blake2sFast with SIMD is over twice as fast as blake2s-net, and Blake2bFast is 10x faster than Blake2Sharp. Both are faster than even CNG’s MD5.
So, there you have it. Proper SIMD is coming to .NET, and you can get started experimenting with it today. System.Runtime.Intrinsics will not be officially supported by Microsoft until at least .NET Core 2.2, but the more useful feedback they get now, the sooner they can be sure they’ve got it right.
Update: I’ve published my Blake2Fast implementation to NuGet since it’s a significant improvement over anything else previously available there. Because the Intrinsics support makes such a large performance difference and because RyuJIT-32 in .NET Core 2.1 does so poorly with BLAKE2b, I’ve included the SIMD version of the code in the .NET Core 2.1 build. Other build targets will get the optimized scalar code. I’ve tested it quite thoroughly and am confident that it’s reliable (on Windows, at least), but I must reiterate that the Intrinsics support is not officially supported, so if you pick up the .NET Core 2.1 version, you do so at your own risk.