<rss xmlns:a10="http://www.w3.org/2005/Atom" version="2.0">
  <channel xml:base="https://photosauce.net/">
    <title>PhotoSauce Blog</title>
    <description>Latest blog posts</description>
    <a10:link href="https://photosauce.net/" />
    <item>
      <guid isPermaLink="true">https://photosauce.net/blog/post/what-makes-srgb-a-special-color-space</guid>
      <link>https://photosauce.net/blog/post/what-makes-srgb-a-special-color-space</link>
      <a10:author>
        <a10:name />
      </a10:author>
      <title>What Makes sRGB a Special Color Space?</title>
      <description>&lt;p&gt;It’s been a long time since I updated the blog, but I was making some new color profiles a few days ago and remembered I never did resolve the small disagreement I had with other people in the color management community around primary colorant values I calculated for my compact sRGB ICC profiles.&lt;/p&gt;&lt;p&gt;I wrote a series of posts about the process I used to create my profiles, which included comparison to other sRGB profiles in common use, and I arrived at the conclusion that every other profile in circulation was using the wrong (not by much, but still…) colorant values.&amp;nbsp; There was a lot of hand-waving around the fact I was writing about the sRGB spec without having seen it, and a lot of the content in &lt;a href="/blog/post/making-a-minimal-srgb-icc-profile-part-3-choose-your-colors-carefully"&gt;my post on that topic&lt;/a&gt; was inferred from information on Wikipedia and in the &lt;em&gt;draft&lt;/em&gt; spec for sRGB.&amp;nbsp; One of the people who disagreed with my interpretation was none other than Graeme Gill, author of ArgyllCMS and acknowledged subject matter expert, who commented on the post.&lt;/p&gt;&lt;p&gt;I was confident enough in my understanding to stick with my calculated values, but they were still based on inferences. Well, in the intervening years, I was finally able to read the specs for both sRGB and scRGB, and it turns out I did, in fact, have the right idea and the right numbers.&amp;nbsp; I want to clarify a few points from the specs for posterity, however, because sRGB is unlike any other ‘normal’ color space.&lt;/p&gt;&lt;p&gt;The key to this difference can be found in the following quote from the sRGB spec (Section 3.1):&lt;/p&gt;
&lt;blockquote class="blockquote"&gt;The encoding transformations between 1931 CIEXYZ values and 8 bit RGB values provide unambiguous methods to represent optimum image colorimetry when viewed on the reference display in the reference viewing conditions by the reference observer.&lt;/blockquote&gt;&lt;p&gt;There are 3 very important bits of information communicated in that sentence.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;sRGB defines an &lt;em&gt;encoding&lt;/em&gt; rather than a color space in the traditional abstract sense.&lt;/li&gt;&lt;li&gt;The encoding given in the spec is &lt;em&gt;unambiguous&lt;/em&gt;, i.e. not open to interpretation.&lt;/li&gt;&lt;li&gt;It is intended for 8-bit values.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;I think those points resolve the discrepancy between the fact that sRGB gives the Rec.709 primaries in its reference display description and the fact that its actual listed matrix values are intentionally imprecise derivations of the values from Rec.709.&amp;nbsp; The distinction between the reference display environment -- which is the only place x,y coordinates are given in the spec -- and the actual primary colorants defined in the spec’s encoding is quite clear when reading the actual spec, whereas the Wikipedia summary conflates them.&lt;/p&gt;&lt;p&gt;That does leave a question, though: What happens when you need more than 8-bit precision?&amp;nbsp; I showed that calculating the transform matrix with greater precision, by starting from the Rec.709 x,y coordinates, yields numbers that don’t agree with the rounded numbers in the spec, specifically in the XYZ-&amp;gt;RGB matrix.&amp;nbsp; I suggested, however, that one could invert the given RGB-&amp;gt;XYZ matrix to arbitrary precision as long as the starting values were the rounded ones in the spec.&lt;/p&gt;&lt;p&gt;Well, it turns out that the &lt;a href="https://en.wikipedia.org/wiki/ScRGB"&gt;scRGB&lt;/a&gt; spec, which was written to cover images with higher bit 
depths, does exactly that.&amp;nbsp; It includes the following statement (IEC 61966-2-2:2003, Section 4.1):&lt;/p&gt;
&lt;blockquote class="blockquote"&gt;The encoding transformations provide unambiguous methods to transform between CIE 1931 XYZ tristimulus values and 16-bit values for each channel of scRGB.&lt;/blockquote&gt;&lt;p&gt;Again, an &lt;em&gt;unambiguous&lt;/em&gt; encoding, but this time for 16-bit precision. The forward scRGB-&amp;gt;XYZ matrix is exactly the same as given for sRGB-&amp;gt;XYZ, with the exact same 4 decimal places of precision.&lt;/p&gt;

&lt;pre style="font-size: 75%;"&gt;0.4124 0.3576 0.1805
0.2126 0.7152 0.0722
0.0193 0.1192 0.9505&lt;/pre&gt;
&lt;p&gt;The inverse matrix, however, is given with 6 decimal places of precision.&lt;/p&gt;

&lt;pre style="font-size: 75%;"&gt; 3.240625 -1.537208 -0.498629
-0.968931  1.875756  0.041518
 0.055710 -0.204021  1.056996&lt;/pre&gt;
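As a sanity check (my own sketch, not code from the spec or the original post), that 6-decimal inverse can be reproduced by inverting the 4-decimal forward matrix directly:

```python
# Invert the 4-decimal sRGB/scRGB forward matrix by cofactor expansion
# and confirm it reproduces the spec's 6-decimal inverse when rounded.
M = [[0.4124, 0.3576, 0.1805],
     [0.2126, 0.7152, 0.0722],
     [0.0193, 0.1192, 0.9505]]

def inverse3x3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    # adjugate (transposed cofactor matrix) divided by the determinant
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]

inv = inverse3x3(M)
for row in inv:
    print(' '.join(f'{v: .6f}' for v in row))
```

Rounded to six decimal places, every entry agrees with the scRGB inverse shown above, and the same inversion can be carried to any precision needed.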

&lt;p&gt;Those numbers match the ones I calculated in my original post, right up to the sixth decimal place, and of course you can calculate that inverse to higher precision if you need to, without breaking compatibility.&lt;/p&gt;&lt;p&gt;This was the only way that I could make the numbers agree with each other when researching this initially, but it was good to finally see it in black and white in the spec.&amp;nbsp; So unlike every normal color space -- where you can calculate both the forward and inverse matrices to arbitrary precision by starting from the x,y values -- with sRGB you can only get the correct values by starting from the unambiguously specified forward matrix, rounded to exactly 4 decimal places.&lt;/p&gt;&lt;p&gt;One thing I got wrong in my guesses about the content of the spec was around the definition of the XYZ whitepoint. Based on the whitepoint given in the &lt;em&gt;description&lt;/em&gt; of the spec on Wikipedia, I inferred that it was intentionally rounded in the wrong direction in the spec.&amp;nbsp; In fact, the XYZ whitepoint value isn’t given in the spec at all.&amp;nbsp; We simply calculate the whitepoint by adding the primary colorants together, which gives a total that doesn’t match the correctly rounded value of the D65 whitepoint when calculated from its x,y coordinates.&amp;nbsp; Again, this is a difference between the reference display conditions and the encoding given in the spec.&lt;/p&gt;&lt;p&gt;The result is the same; the whitepoint is intentionally not exactly D65, because it’s based on intentionally rounded colorants.  This is not a matter of them being imprecise. Rather, they are precisely defined as being very slightly different from Rec.709.  
You can't calculate them 'more precisely' by going back to the x,y coordinates because they are already precisely and unambiguously defined in the spec.&lt;/p&gt;&lt;p&gt;So that’s the final answer.&amp;nbsp; You can’t treat sRGB like any other color space, because it’s not just a color space; it’s a specific encoding of a color space.&amp;nbsp; To be truly compatible, you must match the encoding, and that’s what my profiles do.&lt;/p&gt;&lt;p&gt;BTW, the new profiles I created are now in the &lt;a href="https://github.com/saucecontrol/compact-icc-profiles"&gt;GitHub repo&lt;/a&gt;.&amp;nbsp; I added a couple of video color spaces and new versions of the Adobe color spaces.&amp;nbsp; I had initially done those as ICC V2 profiles only, but due to a software compatibility issue, I needed V4 variants of those.&amp;nbsp; The V4 profiles have 16-bit fractional gamma values as opposed to the 8-bit values supported by V2 profiles.&amp;nbsp; Unfortunately, they’re 90 bytes bigger because of other changes between the standard versions, but they may be of use to someone.&lt;/p&gt;</description>
      <pubDate>Mon, 10 May 2021 01:44:47 -0700</pubDate>
      <a10:updated>2022-10-12T19:02:10-07:00</a10:updated>
    </item>
    <item>
      <guid isPermaLink="true">https://photosauce.net/blog/post/fast-hashing-with-blake2-part-2-proper-simd-comes-to-net</guid>
      <link>https://photosauce.net/blog/post/fast-hashing-with-blake2-part-2-proper-simd-comes-to-net</link>
      <a10:author>
        <a10:name />
      </a10:author>
      <title>Fast Hashing with BLAKE2 Part 2: Proper SIMD Comes to .NET</title>
      <description>&lt;p&gt;In &lt;a href="/blog/post/fast-hashing-with-blake2-part-1-nuget-is-a-minefield"&gt;Part 1&lt;/a&gt; of this post, I evaluated the 8 NuGet packages currently available that list support for the BLAKE2 hashing algorithm(s). My original goal was to identify the best ones, to see if I could improve on them by using the new X86 SIMD Intrinsics support in .NET Core 2.1. I found that of those 8 published packages, exactly one was both &lt;a href="https://tools.ietf.org/html/rfc7693" target="_blank"&gt;RFC&lt;/a&gt;-compliant and acceptably fast. That was an implementation of BLAKE2s (blake2s-net), which is the lesser-used of the BLAKE2 variants. There were &lt;i&gt;zero&lt;/i&gt; acceptable implementations of BLAKE2b available on NuGet, although there was a good implementation (Blake2Sharp) available in source form on GitHub.&lt;p&gt;With the current state being such a mess, it would be difficult to &lt;em&gt;not&lt;/em&gt; improve on it, but I’ll have a go anyway.&lt;h3&gt;How fast is BLAKE2?&lt;/h3&gt;&lt;p&gt;To start, though, I want to take a step back and look at where things stand using those current best .NET implementations of BLAKE2. If you read up on the &lt;a href="https://blake2.net/" target="_blank"&gt;benefits of BLAKE2&lt;/a&gt;, one of its selling points is that it is simple to implement and has excellent performance in software-only implementations. In fact, it’s claimed to be faster than the much-less-secure &lt;a href="https://en.wikipedia.org/wiki/MD5" target="_blank"&gt;MD5&lt;/a&gt;, which is still commonly used for file fingerprinting because of its low calculation cost.&lt;p&gt;I figured I would check that claim out by comparing those winning C# implementations of BLAKE2 against MD5, &lt;a href="https://en.wikipedia.org/wiki/SHA-2" target="_blank"&gt;SHA-256 and SHA-512&lt;/a&gt; to see how they stack up. 
One factor that’s often overlooked is that different hashing algorithms are designed for optimal processing on different architectures. BLAKE2b is supposed to be faster than BLAKE2s on 64-bit platforms and the opposite should be true on 32-bit. Similarly, SHA512 should be faster in 64-bit while SHA256 should be faster in 32-bit. I decided to test all the algorithms head-to-head on both platforms using &lt;a href="https://benchmarkdotnet.org/" target="_blank"&gt;BenchmarkDotNet&lt;/a&gt; to see if all those assumptions were true in .NET land.&lt;p&gt;My environment (updated to .NET Core 2.1 final after Part 1):&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Xeon CPU E3-1505M v6 3.00GHz, 1 CPU, 8 logical and 4 physical cores
Frequency=2929685 Hz, Resolution=341.3336 ns, Timer=TSC
.NET Core SDK=2.1.300
  [Host]     : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT&lt;/pre&gt;
&lt;p&gt;And the results:&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;      Method | Platform |      Mean |     Error |    StdDev | Allocated |
------------ |--------- |----------:|----------:|----------:|----------:|
 Blake2Sharp |      X64 |  16.91 ms | 0.3498 ms | 0.4164 ms |     864 B |
 blake2s-net |      X64 |  22.54 ms | 0.1468 ms | 0.1301 ms |     536 B |
         MD5 |      X64 |  21.13 ms | 0.2073 ms | 0.1939 ms |       0 B |
      SHA256 |      X64 |  46.57 ms | 0.4872 ms | 0.4557 ms |       0 B |
      SHA512 |      X64 |  28.16 ms | 0.2036 ms | 0.1905 ms |     304 B |
             |          |           |           |           |           |
 Blake2Sharp |      X86 | 170.02 ms | 1.0865 ms | 1.0163 ms |     712 B |
 blake2s-net |      X86 |  37.44 ms | 0.1919 ms | 0.1602 ms |       0 B |
         MD5 |      X86 |  20.21 ms | 0.1491 ms | 0.1395 ms |       0 B |
      SHA256 |      X86 |  52.78 ms | 0.4144 ms | 0.3876 ms |       0 B |
      SHA512 |      X86 |  44.76 ms | 0.4982 ms | 0.4416 ms |       0 B |&lt;/pre&gt;
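(As an aside, the same head-to-head idea can be tried outside .NET: CPython's hashlib ships an RFC 7693 BLAKE2 implementation alongside its MD5/SHA ones. This is purely an illustrative sketch; its timings are not comparable to the BenchmarkDotNet numbers above.)

```python
# Rough throughput comparison of hashlib's built-in algorithms on a 10MiB
# buffer -- illustrative only, unrelated to the C# benchmarks in this post.
import hashlib
import time

data = b'\x5a' * (10 * 1024 * 1024)  # 10MiB of repeated bytes

results = {}
for name in ('blake2b', 'blake2s', 'md5', 'sha256', 'sha512'):
    h = hashlib.new(name)
    start = time.perf_counter()
    h.update(data)
    digest = h.digest()
    results[name] = (digest, time.perf_counter() - start)

for name, (digest, elapsed) in results.items():
    print(f'{name:8s} {len(data) / elapsed / 1e6:8.1f} MB/s')
```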
&lt;p&gt;I ran only the 10MiB data set here to keep the test results brief. The x64 results are in line with expectations. BLAKE2b (using Blake2Sharp) is, in fact, faster than anything else in 64-bit, but it performs particularly poorly in 32-bit. BLAKE2s (using blake2s-net) does admirably on both platforms, but it still trails the performance of MD5, especially on 32-bit.&lt;p&gt;One wrinkle in these tests is that the built-in hashing algorithms in .NET Core use platform-native implementations. See the &lt;a href="https://github.com/dotnet/corefx/tree/master/src/System.Security.Cryptography.Algorithms/src/Internal/Cryptography" target="_blank"&gt;HashProviderDispenser variants&lt;/a&gt; for details. I’m on Windows, so I’m getting the &lt;a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa376210(v=vs.85).aspx" target="_blank"&gt;CNG&lt;/a&gt; implementations.&lt;p&gt;I can’t find any references at the moment, but I’m fairly certain I’ve read before that Windows CNG has at least some SIMD-accelerated code in some of the algorithms. That might explain why SHA512 is faster on both platforms than SHA256.  Either way, since those native implementations are what you get in the box in .NET Core, I think you have to try to beat them if you’re considering BLAKE2 for its performance benefits.&lt;p&gt;The other wrinkle is that RyuJIT-32 in .NET Core 2.1 performs particularly poorly at generating optimal code for BLAKE2b. I raised &lt;a href="https://github.com/dotnet/coreclr/issues/18339" target="_blank"&gt;an issue&lt;/a&gt; regarding that on the coreclr repo, and it is currently under investigation. The Blake2Sharp timings with the legacy jit32 were closer to 100ms.&lt;p&gt;And keep in mind that because not a single one of the BLAKE2b implementations available on NuGet was properly optimized, they came in anywhere from 3-17000x(!) 
slower than Blake2Sharp on x64, meaning they aren’t actually competitive with SHA512 at all.&lt;p&gt;Finally, it’s important to remember that when it comes to cryptographic hashing, &lt;a href="https://www.sjoerdlangkemper.nl/2016/05/25/iterative-password-hashing/" target="_blank"&gt;performance problems can be security problems&lt;/a&gt;. It’s common when hashing passwords, for example, to use an iterative hashing function like &lt;a href="https://en.wikipedia.org/wiki/PBKDF2" target="_blank"&gt;PBKDF2&lt;/a&gt; and to base the iteration count on a balance between making things difficult for an attacker to brute-force your hashes and making the application responsive to users attempting to log on. Someone attacking your password hashes will most certainly be using optimized code, so if you’re not, you’re giving the attacker the advantage.&lt;h3&gt;Let’s Get Optimizin’&lt;/h3&gt;&lt;p&gt;As I mentioned in Part 1, the Blake2Sharp reference repo has some incomplete code related to optional BLAKE2b features, and those extra features complicate the code unnecessarily. The blake2s-net code is derived from that same base, and it has some of the same issues. For my versions, I decided to start from the &lt;a href="https://github.com/mjosaarinen/blake2_mjosref" target="_blank"&gt;RFC implementations&lt;/a&gt; and optimize from there, keeping things as simple as possible. My first goal was to be at least as fast as those benchmark reference versions with only scalar code.&lt;p&gt;I was able to make some nice improvements by simplifying the initialization code, by using a bit of unsafe code to avoid making an unnecessary copy of the data during hash updates, and by re-ordering some operations to avoid CPU pipeline stalls.&lt;p&gt;Here are the numbers from my scalar implementations (I call them Blake2Fast) compared with the fastest versions previously available.&lt;p&gt;3-byte input&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;      Method |       Hash |     Mean |     Error |    StdDev |  Gen 0 | Allocated |
------------ |----------- |---------:|----------:|----------:|-------:|----------:|
 Blake2Sharp | 44229FC0EF | 521.9 ns | 5.5792 ns | 5.2188 ns | 0.2050 |     864 B |
 Blake2bFast | 44229FC0EF | 225.8 ns | 1.3251 ns | 1.2395 ns | 0.0074 |      32 B |
             |            |          |           |           |        |           |
 blake2s-net | FE4D57BA07 | 359.6 ns | 3.0930 ns | 2.8932 ns | 0.1273 |     536 B |
 Blake2sFast | FE4D57BA07 | 180.1 ns | 0.5917 ns | 0.5245 ns | 0.0074 |      32 B |&lt;/pre&gt;

&lt;p&gt;3.19KiB input&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;      Method |       Hash |     Mean |     Error |    StdDev |  Gen 0 | Allocated |
------------ |----------- |---------:|----------:|----------:|-------:|----------:|
 Blake2Sharp | 61EB59036B | 5.537 us | 0.0286 us | 0.0267 us | 0.1984 |     864 B |
 Blake2bFast | 61EB59036B | 4.243 us | 0.0222 us | 0.0186 us |      - |      32 B |
             |            |          |           |           |        |           |
 blake2s-net | 62320CA3FC | 7.331 us | 0.0593 us | 0.0555 us | 0.1221 |     536 B |
 Blake2sFast | 62320CA3FC | 6.554 us | 0.0296 us | 0.0276 us |      - |      32 B |&lt;/pre&gt;

&lt;p&gt;10MiB input&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;      Method |       Hash |     Mean |     Error |    StdDev | Allocated |
------------ |----------- |---------:|----------:|----------:|----------:|
 Blake2Sharp | 7B6AB409B7 | 16.60 ms | 0.1319 ms | 0.1234 ms |     864 B |
 Blake2bFast | 7B6AB409B7 | 13.19 ms | 0.0533 ms | 0.0472 ms |       0 B |
             |            |          |           |           |           |
 blake2s-net | 6500962DE3 | 22.24 ms | 0.1427 ms | 0.1335 ms |     536 B |
 Blake2sFast | 6500962DE3 | 20.66 ms | 0.1719 ms | 0.1436 ms |       0 B |&lt;/pre&gt;
&lt;p&gt;As you can see, the changes give my Blake2Fast versions a nice speed advantage over the existing best implementations, using only scalar code. The lower initialization overhead makes them roughly twice as fast with the smallest input, and the other optimizations show their benefits on the larger inputs.&lt;p&gt;One other change I made was to store the hash working state in a struct rather than a class. This makes Blake2Fast allocation-free (except for the array it allocates for the hash result itself) when using it in an all-at-once call. BLAKE2 is optimized to use very little memory for its hash state, so there’s no risk to keeping it on the stack when possible.&lt;h3&gt;Bring on the Intrinsics&lt;/h3&gt;&lt;p&gt;Having made the scalar code as fast as I could get it, it was time to see what could be done with the new Intrinsics support in .NET Core 2.1. But first, a bit of background for those not familiar…&lt;p&gt;In .NET, a &lt;a href="https://en.wikipedia.org/wiki/Just-in-time_compilation" target="_blank"&gt;JIT&lt;/a&gt; Intrinsic is a method that is specially recognized by &lt;a href="https://blogs.msdn.microsoft.com/dotnet/2013/09/30/ryujit-the-next-generation-jit-compiler-for-net/" target="_blank"&gt;RyuJIT&lt;/a&gt; and has its &lt;a href="https://en.wikipedia.org/wiki/Common_Intermediate_Language" target="_blank"&gt;IL&lt;/a&gt; implementation replaced with an optimized bit of machine code during JIT compilation.  These first came into wide use in the &lt;a href="https://msdn.microsoft.com/en-us/library/dn858218(v=vs.111).aspx" target="_blank"&gt;System.Numerics.Vectors&lt;/a&gt; assembly, where Vector3, Vector4, Matrix4x4, Vector&amp;lt;T&amp;gt; and friends have Intrinsic methods that are replaced by the JIT with SIMD instructions on platforms that support them. 
System.Numerics.Vectors opened up a new world of performance in .NET code, and I made heavy use of its Intrinsics in the resizing, sharpening, and pixel format conversion code in &lt;a href="https://github.com/saucecontrol/PhotoSauce" target="_blank"&gt;MagicScaler&lt;/a&gt;. But it wasn’t without its problems.&lt;p&gt;First, not everything in System.Numerics.Vectors is JIT Intrinsic. For example, &lt;a href="https://msdn.microsoft.com/en-us/library/system.numerics.vector4.min(v=vs.111).aspx" target="_blank"&gt;Vector4.Min()&lt;/a&gt; is implemented with a single SSE instruction that operates on 4 floats at once, as is Vector4.Max(). But &lt;a href="https://msdn.microsoft.com/en-us/library/system.numerics.vector4.clamp(v=vs.111).aspx" target="_blank"&gt;Vector4.Clamp()&lt;/a&gt;, rather than using those two SSE instructions, has a complicated (and much slower) managed implementation designed to preserve compatibility with the &lt;a href="https://en.wikipedia.org/wiki/High-Level_Shading_Language" target="_blank"&gt;Direct3D HLSL&lt;/a&gt; behavior. The documentation makes no mention of the difference, so the only way to know what you’re getting is to look at the &lt;a href="https://github.com/dotnet/corefx/tree/master/src/System.Numerics.Vectors/src/System/Numerics" target="_blank"&gt;source code&lt;/a&gt; for the method or to view the generated assembly language from the JIT. Those sorts of performance traps can be very confusing.&lt;p&gt;Second, the optimized versions of the methods are only emitted by the JIT if optimizations are enabled when you build your assembly. This means that normally, you’ll get a managed (and very slow) version in Debug mode and the optimized SIMD instructions in Release mode. 
Further, there can be differences in the results between the managed and Intrinsic versions of the code, so you may get different results in Debug and Release builds.&lt;p&gt;Third, Vector&amp;lt;T&amp;gt; can be very complicated to use because its size is different in different environments. Vector&amp;lt;float&amp;gt;, for example, holds 4 floats in its managed implementation or in the Intrinsic implementation on older hardware. On newer hardware that supports the &lt;a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions" target="_blank"&gt;AVX2&lt;/a&gt; instruction set, Vector&amp;lt;float&amp;gt; holds 8 floats. That makes it difficult to design algorithms since you have to account for both possible sizes.&lt;p&gt;And finally, System.Numerics.Vectors implements only a tiny fraction of the SIMD instructions available on modern processors. Its API surface was designed with 3D game development in mind, so anything not useful for 3D graphics is almost certainly absent.&lt;h3&gt;Enter System.Runtime.Intrinsics&lt;/h3&gt;&lt;p&gt;In order to properly expose the complete set of SIMD instructions supported by modern processors, the &lt;a href="https://github.com/dotnet/coreclr" target="_blank"&gt;CoreCLR&lt;/a&gt; team, along with help from Intel and Samsung (hooray for open source!), have been working on a lower-level set of APIs for .NET, implemented as JIT Intrinsics. Unlike the abstracted classes and methods in System.Numerics.Vectors, these new Intrinsics map directly to individual SIMD instruction sets and instructions. 
These are beginning to come together, and .NET Core 2.1 has the first usable bits in it, although they are designated ‘Experimental’ at this time.&lt;p&gt;Interestingly, the new Intrinsics support wasn’t listed among the new features in the .NET Core 2.1 RC1 &lt;a href="https://blogs.msdn.microsoft.com/dotnet/2018/05/07/announcing-net-core-2-1-rc-1/" target="_blank"&gt;release announcement&lt;/a&gt;, but Microsoft did publish a &lt;a href="https://www.nuget.org/packages/System.Runtime.Intrinsics.Experimental" target="_blank"&gt;NuGet package&lt;/a&gt; with the reference assemblies, and the support is present in the JIT in both the RC1 and RTM/RTW versions of .NET Core 2.1.&lt;p&gt;Unfortunately, it appears the NuGet package was published to nuget.org &lt;a href="https://github.com/dotnet/corefx/issues/30091#issuecomment-394426436" target="_blank"&gt;by mistake&lt;/a&gt;, and now that .NET Core 2.1 has been released in its final version, the package has been de-listed. Its RTM version exists only on &lt;a href="https://dotnet.myget.org/feed/dotnet-core/package/nuget/System.Runtime.Intrinsics.Experimental/4.5.0-rtm" target="_blank"&gt;myget.org&lt;/a&gt;.&lt;p&gt;Let that serve as a warning to you; this is all experimental. The APIs that work, work, but not all of them do. And the APIs may change in .NET Core 2.2 or 3.0.&lt;p&gt;Unlike System.Numerics.Vectors, nothing in System.Runtime.Intrinsics has a managed implementation. The NuGet package contains only a reference assembly, and all the methods in that assembly will throw a PlatformNotSupportedException unless the JIT recognizes and substitutes them with the appropriate instructions. This means that the new Intrinsics can’t be used without a compatible JIT, which means they will only work in .NET Core 2.1 (and the 2.2 previews) for now.&lt;p&gt;Fortunately, the Experimental version does have nearly complete support for SSE-SSE4.1, and quite a bit of AVX is present. 
That allows for a lot of algorithms to be implemented, including some existing optimized versions of BLAKE2. Since there’s already a good &lt;a href="https://github.com/BLAKE2/BLAKE2/tree/master/sse" target="_blank"&gt;SSE4.1 implementation&lt;/a&gt; of BLAKE2 available in the reference repo, all I had to do was port the existing code over to see how well it performed on .NET Core 2.1.&lt;p&gt;I’ve published that code &lt;a href="https://github.com/saucecontrol/Blake2Fast" target="_blank"&gt;on GitHub&lt;/a&gt;, so I’ll jump straight into the benchmarks, comparing with the previous best BLAKE2 implementations and the built-in hashing algorithms. This is the same 10MiB benchmark from the beginning of this post, on both 32-bit and 64-bit versions of the .NET Core 2.1 runtime.&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;      Method | Platform |      Mean |     Error |    StdDev | Allocated |
------------ |--------- |----------:|----------:|----------:|----------:|
 Blake2Sharp |      X64 |  16.56 ms | 0.1586 ms | 0.1484 ms |     864 B |
*Blake2bFast |      X64 |  12.13 ms | 0.0870 ms | 0.0771 ms |       0 B |
 blake2s-net |      X64 |  22.26 ms | 0.1443 ms | 0.1350 ms |     536 B |
*Blake2sFast |      X64 |  16.27 ms | 0.1362 ms | 0.1274 ms |       0 B |
         MD5 |      X64 |  21.22 ms | 0.1190 ms | 0.1113 ms |       0 B |
      SHA256 |      X64 |  46.16 ms | 0.2564 ms | 0.2398 ms |       0 B |
      SHA512 |      X64 |  27.89 ms | 0.0982 ms | 0.0871 ms |     304 B |
             |          |           |           |           |           |
 Blake2Sharp |      X86 | 168.31 ms | 0.5426 ms | 0.4810 ms |     712 B |
*Blake2bFast |      X86 |  16.56 ms | 0.0879 ms | 0.0779 ms |       0 B |
 blake2s-net |      X86 |  37.46 ms | 0.2728 ms | 0.2552 ms |       0 B |
*Blake2sFast |      X86 |  16.36 ms | 0.1103 ms | 0.1032 ms |       0 B |
         MD5 |      X86 |  20.06 ms | 0.0996 ms | 0.0931 ms |       0 B |
      SHA256 |      X86 |  52.47 ms | 0.3252 ms | 0.3042 ms |       0 B |
      SHA512 |      X86 |  44.07 ms | 0.1643 ms | 0.1372 ms |       0 B |&lt;/pre&gt;&lt;p&gt;The SSE4.1 versions (marked with *) of both Blake2Fast algorithms improve on the previous best versions and are faster than all the common built-in hashing algorithms from .NET (Windows CNG).  The 32-bit runtime is where the SIMD advantage really shows up, though. Blake2sFast with SIMD is over twice as fast as blake2s-net, and Blake2bFast is 10x faster than Blake2Sharp. Both are faster than even CNG’s MD5.&lt;p&gt;So, there you have it.  Proper SIMD is coming to .NET, and you can get started experimenting with it today. System.Runtime.Intrinsics will not be officially supported by Microsoft until &lt;i&gt;at least&lt;/i&gt; .NET Core 2.2, but the more useful feedback they get now, the sooner they can be sure they’ve got it right.&lt;p&gt;Update: I’ve published my Blake2Fast implementation &lt;a href="https://www.nuget.org/packages/SauceControl.Blake2Fast" target="_blank"&gt;to NuGet&lt;/a&gt; since it’s a significant improvement over anything else previously available there. Because the Intrinsics support makes such a large performance difference and because RyuJIT-32 in .NET Core 2.1 does so poorly with BLAKE2b, I’ve included the SIMD version of the code in the .NET Core 2.1 build. Other build targets will get the optimized scalar code. I’ve tested it quite thoroughly and am confident that it’s reliable (on Windows, at least), but I must reiterate that the Intrinsics support is not officially supported, so if you pick up the .NET Core 2.1 version, you do so at your own risk.&lt;/p&gt;</description>
      <pubDate>Tue, 12 Jun 2018 18:56:00 -0700</pubDate>
      <a10:updated>2020-05-22T08:07:05-07:00</a10:updated>
    </item>
    <item>
      <guid isPermaLink="true">https://photosauce.net/blog/post/fast-hashing-with-blake2-part-1-nuget-is-a-minefield</guid>
      <link>https://photosauce.net/blog/post/fast-hashing-with-blake2-part-1-nuget-is-a-minefield</link>
      <a10:author>
        <a10:name />
      </a10:author>
      <title>Fast Hashing with BLAKE2 Part 1: NuGet is a Minefield</title>
      <description>&lt;h3&gt;&lt;img style="border: 0px currentcolor; max-width: 100%; margin-right: auto; margin-left: auto; display: block;" src="https://cdn-images-1.medium.com/max/1400/1*B0pqJM5avsJpg30DObqAwA.jpeg"&gt;&lt;/h3&gt;&lt;h3&gt;&lt;br&gt;&lt;/h3&gt;&lt;h3&gt;Free as in…&lt;/h3&gt;&lt;p&gt;Before I get in to the titular topic of this post, I’d like to discuss my motivation for writing it. Free software has become increasingly important in the business world over the last couple of decades, and the industry has adopted phrases like “free as in beer” and “free as in speech” to define what ‘free’ even means.&lt;p&gt;For those not familiar, “free as in beer” speaks to the acquisition cost of the software. In the old days, closed-source freeware and shareware were common.  They were free (of cost) to download and use, but using them was all you could do. You were not free (as in &lt;a href="https://youtu.be/jfkxtqwAEGE" target="_blank"&gt;freedom&lt;/a&gt;) to see or modify the code. In light of that, it became important to differentiate software that was truly free, in that you can use it in any way you please, modify it, or enhance it. That software is “free as in speech”.&lt;p&gt;In the Microsoft world, the .NET Framework has always been “free as in beer” – assuming you discounted the Windows license you had to pay for in order to have a place to run it. With .NET Core, .NET finally became “free as in speech”, and it has improved at an unprecedented pace as people use that freedom to give back to the project. That change has accelerated (or at the very least coincided with) the uptake of Free Open Source Software (FOSS) in the corporate world as well, especially among the ‘Microsoft shops’ that typically eschewed open source. 
And that has led to more conversations about the true cost and true value of FOSS.&lt;p&gt;When talking about overall cost, another phrase now in common use is “&lt;a href="https://blog.codinghorror.com/open-source-free-as-in-free/" target="_blank"&gt;free as in puppy&lt;/a&gt;”. That phrase is somewhat less well-defined than the others. To some, it means that adopting free software comes with some responsibility. It has to be cared for on an ongoing basis, or else it may grow unhealthy and eventually &lt;a href="https://www.hanselman.com/blog/SandcastleMicrosoftCTPOfAHelpCHMFileGeneratorOnTheTailsOfTheDeathOfNDoc.aspx" target="_blank"&gt;die&lt;/a&gt;. That’s true to some extent, but I do agree with Jeff’s take on it as well. If a piece of software requires as much maintenance as a puppy, you shouldn’t be using it, free or not.&lt;p&gt;Another way of looking at it is that the acquisition cost of the software is inconsequential compared to the cost of evaluation, training, integration, testing, and maintenance of said software within a larger system. It doesn’t matter whether you pick up a stray puppy off the street or buy a fancy $1k designer puppy from a breeder; the cost of caring for that puppy over its lifetime will dwarf the acquisition cost. Likewise, in a business environment, whether you pay $10k for a piece of software or get it for free, if you use it long enough, the acquisition cost will become largely irrelevant.&lt;p&gt;Which brings me to another phrase I saw recently: “&lt;a href="https://twitter.com/functiontelechy/status/730746780904792064?lang=en" target="_blank"&gt;free as in mattress&lt;/a&gt;”. I think many of us with a corporate development background have learned to view free software in this way. For small bits of functionality (like a simple hashing algorithm), a corporate team often has the choice to build or buy – whether at zero acquisition cost or some higher number. 
If the team is good, the cost to build can be estimated fairly accurately, as can the maintenance cost. So, like a new mattress, it has a known upfront cost and known risks. When you buy (or take for free) a piece of software, you often don’t know what you’re getting into – not unlike a used mattress. Maybe that free mattress looks nice on the outside. But when you’re dealing with a bedbug infestation a few months later, ‘free’ is a much worse deal. Many would prefer to avoid the risk altogether and buy the new mattress every time.&lt;p&gt;I’ve seen enough &lt;a href="https://blog.codinghorror.com/nobody-hates-software-more-than-software-developers/" target="_blank"&gt;bad code&lt;/a&gt; offered up in blog posts, CodeProject articles, StackOverflow answers, and SourceForge/GitHub projects to be very wary of all but the largest corporate-sponsored projects. I don’t mean to pick on the people who write that code. It takes courage to publish the code you write for the world to see (and criticize). And it takes generosity to offer up something you worked hard on for others to use, with no benefit (and likely added &lt;a href="https://www.michaelbromley.co.uk/blog/why-i-havent-fixed-your-issue-yet/" target="_blank"&gt;headaches&lt;/a&gt;) to you. But it also takes a lot of trust to bring that mattress into your house – or code into your project. And, of course, as an author of open source projects myself, I do appreciate the irony in having that attitude.&lt;h3&gt;Caveat Implementor&lt;/h3&gt;&lt;p&gt;Despite the benefits that come with the larger swing in the direction of embracing FOSS, maybe sometimes it’s good to remember the lessons we’ve learned over the years when it comes to software quality and maintenance cost.
I was reminded of that recently when evaluating implementations of the &lt;a href="https://blake2.net/" target="_blank"&gt;BLAKE2&lt;/a&gt; hashing algorithm.&lt;p&gt;I had looked at BLAKE2 a few years ago when choosing a hashing algorithm to use for the cache file naming in &lt;a href="https://github.com/saucecontrol/PhotoSauce#webrsize" target="_blank"&gt;WebRSize&lt;/a&gt;. I use a base-32 encoded 40-bit hash of the settings used to generate an image when naming its cache file. One neat thing about BLAKE2 is that the spec allows for hashes of any length between 1 and 64 bytes, and the hash length is fed into the parameter block that is mixed with the initialization vector, so a 40-bit hash is not just the first 40 bits of the full-length hash; it’s a different value altogether.&lt;p&gt;Although I wanted to use the BLAKE2 algorithm, Microsoft doesn’t supply one in .NET, and the only NuGet packages available were a random assortment from developers I don’t know or trust. It was a perfect example of a “free as in mattress” problem, if that’s how you tend to view these things. I didn’t want to take the time to evaluate the available libraries properly nor to write my own, so I decided to simply take the first 40 bits of a SHA256 hash instead, using the hash algorithm built into .NET (which uses &lt;a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa376210" target="_blank"&gt;CNG&lt;/a&gt; on Windows).&lt;p&gt;When &lt;a href="https://blogs.msdn.microsoft.com/dotnet/2018/05/07/announcing-net-core-2-1-rc-1/" target="_blank"&gt;.NET Core 2.1 RC1&lt;/a&gt; was released a couple of weeks ago, I was trying to come up with a small project I could use to try out the new &lt;a href="https://www.nuget.org/packages/System.Runtime.Intrinsics.Experimental/" target="_blank"&gt;X86 Intrinsics&lt;/a&gt; support.
The &lt;a href="https://github.com/BLAKE2/BLAKE2" target="_blank"&gt;reference BLAKE2 implementation&lt;/a&gt; includes SSE-optimized versions of the algorithms, so I thought porting those would let me test out the new functionality while getting the trustworthy BLAKE2 implementation I had wanted. And since I had to set up a reference implementation and test harness for checking correctness and for benchmarking, I decided to go all out and check all the &lt;a href="https://www.nuget.org/packages?q=blake2" target="_blank"&gt;NuGet packages&lt;/a&gt; I could find using the same tests/standards. What I found was that the “free as in mattress” view of things is as legit as ever.&lt;h3&gt;BLAKE2 in a Nutshell&lt;/h3&gt;&lt;p&gt;BLAKE2 is derived from the BLAKE algorithm, which was one of the leading candidates from the &lt;a href="https://en.wikipedia.org/wiki/NIST_hash_function_competition" target="_blank"&gt;SHA-3 competition&lt;/a&gt;. Ultimately, it wasn’t chosen (that honor went to Keccak), but it has some interesting properties that make it useful for general-purpose secure hashing. The short version is, they claim it’s more secure than SHA-2, and the modifications in BLAKE2 make it faster than MD5 when calculated in software. Basically, anywhere people use MD5 today, BLAKE2 is a definite upgrade.&lt;p&gt;BLAKE2 comes in two flavors: BLAKE2b and BLAKE2s. BLAKE2b produces a 512-bit hash using an internal state made up of 8 64-bit values and is optimized for 64-bit platforms. BLAKE2s uses 8 32-bit values to produce a 256-bit hash, so it can be faster on 32-bit platforms.
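&lt;p&gt;As it happens, Python’s built-in hashlib implements both RFC 7693 variants, which makes it handy for a quick illustration of the properties described above (this is illustrative Python, not the C# code this post is about):

```python
import hashlib

data = b"abc"

# Default digest sizes: BLAKE2b is 512-bit (64 bytes), BLAKE2s is 256-bit (32 bytes).
full_b = hashlib.blake2b(data).digest()
full_s = hashlib.blake2s(data).digest()
assert len(full_b) == 64 and len(full_s) == 32

# The requested digest length is mixed into the parameter block, so a
# 40-bit (5-byte) hash is a different value entirely -- not a prefix of
# the full-length hash.
short_b = hashlib.blake2b(data, digest_size=5).digest()
assert len(short_b) == 5 and short_b != full_b[:5]

# Keyed hashing (MAC mode) is also part of the spec.
mac = hashlib.blake2s(data, key=b"secret", digest_size=5).digest()
assert mac != hashlib.blake2s(data, digest_size=5).digest()
```

&lt;p&gt;That digest-length behavior is exactly why the 40-bit cache-naming hash mentioned earlier isn’t simply a truncation of a standard-length hash.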
In Part 2 of this post, we’ll see that use of the SSE instruction sets can make BLAKE2b perform nearly equally in 32-bit and 64-bit, but let’s not jump ahead…&lt;h3&gt;The Reference&lt;/h3&gt;&lt;p&gt;The designers of BLAKE2 published several reference implementations in a GitHub repo, so that’s a natural place to start.&lt;p&gt;Among those is a C# implementation (&lt;a href="https://github.com/BLAKE2/BLAKE2/tree/master/csharp" target="_blank"&gt;Blake2Sharp&lt;/a&gt;), which should be the perfect reference to use for my testing. The only thing that gave me pause was that the code is incomplete. Not all the BLAKE2 functionality is finished. Tree hashing mode, for example, is partially there but commented out. And there are three different versions of the core algorithm implementation, with two of those commented out – both slower than the ‘unrolled’ version that is enabled. It’s also missing the BLAKE2s variant. Bottom line: it looks like a work in progress and hasn’t been updated in years. I decided to include it in my testing but figured I should bring along some backup just to be safe…&lt;p&gt;The C reference implementation is complete, including both the BLAKE2b and BLAKE2s variants. And there are the aforementioned SSE-optimized versions. I decided to compile the scalar version into a DLL and call it via PInvoke as a baseline performance reference.&lt;p&gt;Finally, there are simpler and tidier implementations of both variants available in the &lt;a href="https://tools.ietf.org/html/rfc7693" target="_blank"&gt;RFC that describes BLAKE2&lt;/a&gt;. Although they’re written in C, it was very easy to port those over to C# to serve as another set of backup references. Those implementations are designed to be simple and correct, with no optimization. The RFC versions omit the optional features like tree mode hashing, but the implementations are less than 200 lines of code each and very easy to follow. 
My C# conversion is as true to the &lt;a href="https://github.com/mjosaarinen/blake2_mjosref" target="_blank"&gt;C reference&lt;/a&gt; as possible, including the lack of optimization.&lt;h3&gt;The Tests&lt;/h3&gt;&lt;p&gt;With those references chosen (3 for BLAKE2b and 2 for BLAKE2s), I set about methodically testing every NuGet package I could find. My requirements were simple: the implementation had to support the basic features defined in the RFC. That is, keyed and non-keyed hashing with variable digest length from 8 bits up to 256 or 512, as appropriate. I tested the qualifying implementations for speed and correctness.&lt;p&gt;Benchmarking was performed with &lt;a href="https://benchmarkdotnet.org/" target="_blank"&gt;BenchmarkDotNet&lt;/a&gt; under .NET Core 2.1-rc1 on Windows 10, using the following inputs:&lt;ol&gt;&lt;li&gt;The ASCII string ‘abc’&lt;/li&gt;&lt;li&gt;The contents of the sRGB ICC profile that ships with Windows 10 (3.19KiB)&lt;/li&gt;&lt;li&gt;10MiB of random data.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;I also tested each implementation with other data sizes and with keyed hashes, but for the sake of brevity, I’ll just include the results for the three listed above. The output was a 40-bit hash, and I included a custom column in the BenchmarkDotNet output to allow for a quick check of output correctness. Note that BLAKE2b and BLAKE2s are different algorithms and produce different outputs by design. Test code is published &lt;a href="https://github.com/saucecontrol/Blake2Fast/tree/master/tests/Blake2.Bench" target="_blank"&gt;here&lt;/a&gt;, and this is my test environment:&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Xeon CPU E3-1505M v6 3.00GHz, 1 CPU, 8 logical and 4 physical cores
Frequency=2929692 Hz, Resolution=341.3328 ns, Timer=TSC
.NET Core SDK=2.1.300-rc1-008673
  [Host]     : .NET Core 2.1.0-rc1 (CoreCLR 4.6.26426.02, CoreFX 4.6.26426.04), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0-rc1 (CoreCLR 4.6.26426.02, CoreFX 4.6.26426.04), 64bit RyuJIT&lt;/pre&gt;&lt;p&gt;Here’s what I found:&lt;h3&gt;The Good&lt;/h3&gt;&lt;p&gt;Of the 8 NuGet packages I found that list BLAKE2 support, only one had an implementation that was complete and correct according to the RFC as well as being fast enough for general-purpose hashing. The winner is &lt;a href="https://www.nuget.org/packages/blake2s-net/" target="_blank"&gt;blake2s-net&lt;/a&gt;&lt;p&gt;This implementation appears to be a straight conversion of the Blake2Sharp reference code to support the BLAKE2s algorithm, with original credit going to &lt;a href="https://github.com/SparkDustJoe/miniLockManaged/tree/master/Blake2sCSharp" target="_blank"&gt;Dustin Sparks&lt;/a&gt;.&lt;p&gt;Here are its benchmark results compared with the 2 references:&lt;p&gt;3-byte input&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;           Method |       Hash |     Mean |    Error |   StdDev |  Gen 0 | Allocated |
----------------- |----------- |---------:|---------:|---------:|-------:|----------:|
 Blake2sRefNative | FE4D57BA07 | 259.8 ns | 1.444 ns | 1.351 ns | 0.0072 |      32 B |
       Blake2sRFC | FE4D57BA07 | 794.8 ns | 4.051 ns | 3.789 ns | 0.0067 |      32 B |
      blake2s-net | FE4D57BA07 | 366.0 ns | 2.053 ns | 1.921 ns | 0.1273 |     536 B |&lt;/pre&gt;&lt;p&gt;3.19KiB input&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;           Method |       Hash |      Mean |     Error |    StdDev |  Gen 0 | Allocated |
----------------- |----------- |----------:|----------:|----------:|-------:|----------:|
 Blake2sRefNative | 62320CA3FC |  9.818 us | 0.0503 us | 0.0446 us |      - |      32 B |
       Blake2sRFC | 62320CA3FC | 39.240 us | 0.3034 us | 0.2689 us |      - |      32 B |
      blake2s-net | 62320CA3FC |  7.274 us | 0.0326 us | 0.0305 us | 0.1221 |     536 B |&lt;/pre&gt;&lt;p&gt;10MiB input&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;           Method |       Hash |      Mean |     Error |    StdDev | Allocated |
----------------- |----------- |----------:|----------:|----------:|----------:|
 Blake2sRefNative | 6500962DE3 |  30.87 ms | 0.1184 ms | 0.0989 ms |       0 B |
       Blake2sRFC | 6500962DE3 | 122.67 ms | 0.5827 ms | 0.5451 ms |       0 B |
      blake2s-net | 6500962DE3 |  22.27 ms | 0.1013 ms | 0.0898 ms |     536 B |&lt;/pre&gt;&lt;p&gt;This is exactly what you’d expect from a version that’s correctly implemented and optimized. The only knock on this package is that it is compiled with a .NET Framework target, so it can’t be used with older .NET Core or .NET Standard projects. It does work, however, with the .NET Framework library support in .NET Core 2.0 and up. And this one only implements the BLAKE2s variant, so for BLAKE2b, you’ll need to look elsewhere.&lt;p&gt;[Note that in Part 2 of this post, I’ll cover my own optimized BLAKE2s implementation which does better than this one.]&lt;p&gt;You can also see here that the RFC implementation is, as expected, very slow. It’s correct, but I wouldn’t use it in any real project. Remember that speed is one of the main reasons for choosing BLAKE2 over other hashing algorithms, so a slow implementation makes it rather pointless.&lt;h3&gt;The Bad&lt;/h3&gt;&lt;p&gt;I can’t say I was surprised to find that one of the 8 packages contained an incorrect implementation of the BLAKE2 algorithm, but I &lt;i&gt;was&lt;/i&gt; surprised to find that it was the one with the highest download count. If you search ‘BLAKE2’ on nuget.org today, the top match will likely be &lt;a href="https://www.nuget.org/packages/Konscious.Security.Cryptography.Blake2/" target="_blank"&gt;Konscious.Security.Cryptography.Blake2&lt;/a&gt;&lt;p&gt;This appears to be a from-scratch implementation of BLAKE2b based on the RFC but with a mistake that will show up shortly. Let’s jump straight into the benchmark results.
&lt;p&gt;3-byte input&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;           Method |       Hash |       Mean |    Error |   StdDev |  Gen 0 | Allocated |
----------------- |----------- |-----------:|---------:|---------:|-------:|----------:|
 Blake2bRefNative | 44229FC0EF |   330.2 ns | 2.326 ns | 2.176 ns | 0.0072 |      32 B |
       Blake2bRFC | 44229FC0EF | 1,134.0 ns | 8.745 ns | 8.180 ns | 0.0057 |      32 B |
      Blake2Sharp | 44229FC0EF |   519.0 ns | 3.886 ns | 3.635 ns | 0.2050 |     864 B |
        Konscious | 44229FC0EF | 1,524.1 ns | 9.384 ns | 8.318 ns | 0.2213 |     936 B |&lt;/pre&gt;

&lt;p&gt;3.19KiB input&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;           Method |       Hash |      Mean |     Error |    StdDev |  Gen 0 | Allocated |
----------------- |----------- |----------:|----------:|----------:|-------:|----------:|
 Blake2bRefNative | 61EB59036B |  6.143 us | 0.0276 us | 0.0244 us |      - |      32 B |
       Blake2bRFC | 61EB59036B | 26.434 us | 0.1139 us | 0.1010 us |      - |      32 B |
      Blake2Sharp | 61EB59036B |  5.549 us | 0.0295 us | 0.0276 us | 0.1984 |     864 B |
        Konscious | 61EB59036B | 20.954 us | 0.1704 us | 0.1510 us | 0.2136 |     936 B |&lt;/pre&gt;

&lt;p&gt;10MiB input&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;           Method |       Hash |     Mean |     Error |    StdDev | Allocated |
----------------- |----------- |---------:|----------:|----------:|----------:|
 Blake2bRefNative | 7B6AB409B7 | 18.94 ms | 0.1008 ms | 0.0894 ms |       0 B |
       Blake2bRFC | 7B6AB409B7 | 83.18 ms | 0.6921 ms | 0.6135 ms |       0 B |
      Blake2Sharp | 7B6AB409B7 | 16.61 ms | 0.1297 ms | 0.1214 ms |     864 B |
        Konscious | 1636541AC6 | 63.99 ms | 0.4153 ms | 0.3885 ms |     936 B |&lt;/pre&gt;
&lt;p&gt;First, I’ll point out that the Blake2Sharp reference implementation does slightly better than the native reference version on all but the tiniest input, just as the blake2s-net conversion from that same base did better than its native reference. And the RFC version, once again, is the slowest.&lt;p&gt;Check out the Konscious version, though. Not only is it 3-4x slower than the Blake2Sharp implementation, it produced a bad hash on the 10MiB input. It turns out, that implementation has a bug that affects any input that is an even multiple of the [128 byte] block size. At an even 10MiB, that last test input triggered the bug.&lt;p&gt;I have &lt;a href="https://github.com/kmaragon/Konscious.Security.Cryptography/issues/23" target="_blank"&gt;reported&lt;/a&gt; the bug to the owner of that package/project, and it may be fixed by the time you read this. But that may not be a good thing for anyone already using this library. If you generate hashes and then save them somewhere with the intention of validating things against them later, you can’t just ‘fix’ a problem in the hash implementation, because you will invalidate any hashes created with the broken version. And because the hash, by definition, reveals nothing about its input data, there’s no way to identify which hashes are correct and which are incorrect after the fact. You may be better off keeping it broken, bad as that may be.&lt;h3&gt;The Ugly&lt;/h3&gt;&lt;p&gt;Sorry, I had to do it.&lt;p&gt;Although it doesn’t have any logic bugs, there isn’t much else nice I can say about &lt;a href="https://www.nuget.org/packages/System.Data.HashFunction.Blake2/" target="_blank"&gt;System.Data.HashFunction.Blake2&lt;/a&gt;&lt;p&gt;This looks like another from-scratch implementation.&amp;nbsp; And although it produces good hash values, check out the benchmarks:
&lt;p&gt;3-byte input&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;           Method |       Hash |       Mean |     Error |    StdDev |  Gen 0 | Allocated |
----------------- |----------- |-----------:|----------:|----------:|-------:|----------:|
       Blake2bRFC | 44229FC0EF | 1,154.3 ns | 12.779 ns | 11.954 ns | 0.0057 |      32 B |
      Blake2Sharp | 44229FC0EF |   523.7 ns |  4.712 ns |  4.408 ns | 0.2050 |     864 B |
 S.D.HashFunction | 44229FC0EF | 2,364.9 ns | 27.715 ns | 25.925 ns | 0.4120 |    1744 B |&lt;/pre&gt;

&lt;p&gt;3.19KiB input&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;           Method |       Hash |      Mean |     Error |    StdDev |  Gen 0 | Allocated |
----------------- |----------- |----------:|----------:|----------:|-------:|----------:|
       Blake2bRFC | 61EB59036B | 26.745 us | 0.1249 us | 0.1168 us |      - |      32 B |
      Blake2Sharp | 61EB59036B |  5.682 us | 0.0397 us | 0.0331 us | 0.1984 |     864 B |
 S.D.HashFunction | 61EB59036B | 36.869 us | 0.1811 us | 0.1513 us | 2.1973 |    9344 B |&lt;/pre&gt;

&lt;p&gt;10MiB input&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;           Method |       Hash |      Mean |     Error |    StdDev |     Gen 0 |  Allocated |
----------------- |----------- |----------:|----------:|----------:|----------:|-----------:|
       Blake2bRFC | 7B6AB409B7 |  82.62 ms | 0.3159 ms | 0.2800 ms |         - |        0 B |
      Blake2Sharp | 7B6AB409B7 |  16.59 ms | 0.1275 ms | 0.1193 ms |         - |      864 B |
 S.D.HashFunction | 7B6AB409B7 | 113.15 ms | 0.3898 ms | 0.3646 ms | 5937.5000 | 24905120 B |&lt;/pre&gt;
&lt;p&gt;I dropped the native DLL version from this run since we’ve already shown Blake2Sharp is faster, which makes it the proper reference to use going forward.&lt;p&gt;Notice that this implementation, in addition to being much slower than even the slow RFC version, uses several times more memory than the size of the input data. A hashing function should only read the input and perform computations on it, not make multiple copies of it. I didn’t dig into the code to see what went wrong here, but this is a hidden performance trap waiting to get anyone who dares use this library.&lt;p&gt;Sadly, I’m sure some people will pick this one from NuGet either because they mistake it for a Microsoft package or simply because they like the naming that looks like the Microsoft packages. There is a &lt;a href="https://blog.nuget.org/20170417/Package-identity-and-trust.html" target="_blank"&gt;new policy&lt;/a&gt; in place on NuGet that prevents new third-party packages from using names starting with ‘System.’, but Microsoft is allowing any existing packages to stay put. Beware.&lt;p&gt;This one also has a sibling package called &lt;a href="https://www.nuget.org/packages/System.Data.HashFunction.Blake2.Net40Async" target="_blank"&gt;System.Data.HashFunction.Blake2.Net40Async&lt;/a&gt;&lt;p&gt;I wasn’t able to get that one to work in my benchmark app, although I’ll admit I didn’t try very hard. It appears to be the same basic thing as the one above but with the added trap of a ComputeHashAsync method. Hashing is a CPU-bound operation, so there’s no place for async in it.
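&lt;p&gt;The right pattern, sketched here with Python’s hashlib purely for illustration, is to feed each chunk to the hash state synchronously as it arrives; incremental updates produce the same digest as hashing the whole message at once:

```python
import hashlib

# Simulated packets, e.g. received one at a time from an async reader.
packets = [b"chunk-one", b"chunk-two", b"chunk-three"]

# Update the hash state synchronously with each packet as it arrives...
h = hashlib.blake2b(digest_size=5)
for packet in packets:
    h.update(packet)

# ...and the result matches hashing the entire message in one call.
one_shot = hashlib.blake2b(b"".join(packets), digest_size=5).digest()
assert h.digest() == one_shot
```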
Trying to run the hash itself asynchronously just adds extra thread context-switching overhead.&lt;p&gt;If you are receiving data from somewhere asynchronously, simply use a hashing implementation that allows for incremental updates (the BLAKE2 algorithms support this) and update the hash synchronously with each data packet you receive asynchronously.&lt;h3&gt;The Butfor&lt;/h3&gt;&lt;p&gt;But for one simple mistake, there would be 2 libraries in the ‘Good’ section. I like the honesty in the readme for &lt;a href="https://www.nuget.org/packages/Blake2Core/" target="_blank"&gt;Blake2Core&lt;/a&gt;&lt;p&gt;“This is quite literally a copy/paste from BLAKE2 and built into a NuGet package, available here. I needed it in my .Net Core project, and I'm sure other people as well.”&lt;p&gt;This is an exact copy of the Blake2Sharp reference code, and it would have been exactly as good as my reference copy except that the NuGet package contains a debug build of the DLL, with optimizations disabled. In many cases, there isn’t much difference in performance between Release and Debug builds of .NET code, but for something computation-heavy like hashing, it can make a huge difference. 
&lt;p&gt;3-byte input&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;      Method |       Hash |       Mean |     Error |    StdDev |  Gen 0 | Allocated |
------------ |----------- |-----------:|----------:|----------:|-------:|----------:|
  Blake2bRFC | 44229FC0EF | 1,134.2 ns |  5.146 ns |  4.814 ns | 0.0057 |      32 B |
 Blake2Sharp | 44229FC0EF |   524.6 ns |  4.869 ns |  4.316 ns | 0.2050 |     864 B |
  Blake2Core | 44229FC0EF | 1,877.0 ns | 11.314 ns | 10.583 ns | 0.2041 |     864 B |&lt;/pre&gt;

&lt;p&gt;3.19KiB input&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;      Method |       Hash |      Mean |     Error |    StdDev |  Gen 0 | Allocated |
------------ |----------- |----------:|----------:|----------:|-------:|----------:|
  Blake2bRFC | 61EB59036B | 26.367 us | 0.1776 us | 0.1661 us |      - |      32 B |
 Blake2Sharp | 61EB59036B |  5.652 us | 0.0292 us | 0.0259 us | 0.1984 |     864 B |
  Blake2Core | 61EB59036B | 26.023 us | 0.1694 us | 0.1584 us | 0.1831 |     864 B |&lt;/pre&gt;

&lt;p&gt;10MiB input&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;      Method |       Hash |     Mean |     Error |    StdDev | Allocated |
------------ |----------- |---------:|----------:|----------:|----------:|
  Blake2bRFC | 7B6AB409B7 | 83.79 ms | 0.4101 ms | 0.3636 ms |       0 B |
 Blake2Sharp | 7B6AB409B7 | 16.58 ms | 0.1105 ms | 0.1033 ms |     864 B |
  Blake2Core | 7B6AB409B7 | 78.03 ms | 0.3949 ms | 0.3694 ms |     864 B |&lt;/pre&gt;
&lt;p&gt;Without JIT optimization, this library is almost as slow as the RFC version. The only place it has an advantage is that it doesn’t do all the byte shuffling to ensure the words are in little-endian order as required by BLAKE2. The RFC code does that shuffling whether it’s needed or not. The Blake2Sharp code copies the data without shuffling if it’s already ordered correctly, and that savings shows up in the 10MiB run.&lt;p&gt;By the way, BenchmarkDotNet has a &lt;a href="https://github.com/dotnet/BenchmarkDotNet/blob/master/docs/guide/Configs/Validators.md" target="_blank"&gt;validator&lt;/a&gt; that detects this problem and actually refuses to run benchmarks unless you override it. I had to do that for this run so we could see the impact.&lt;p&gt;Ultimately, this one counts as another performance trap, so don’t use it unless it gets an update.&lt;p&gt;[Once again, I’ll detail a better BLAKE2b implementation in the second part of this post]&lt;p&gt;This library also uses a .NET Standard 1.6 build target, so it can’t be used with older versions of .NET Framework (including 4.6). There’s no reason it wouldn’t be compatible; it’s just not multi-targeted.&lt;h3&gt;The Weird&lt;/h3&gt;&lt;p&gt;I’m honestly not sure what to make of &lt;a href="https://www.nuget.org/packages/Isopoh.Cryptography.Blake2b/" target="_blank"&gt;Isopoh.Cryptography.Blake2b&lt;/a&gt;&lt;p&gt;The hashing implementation itself is taken straight from the Blake2Sharp reference. This library, however, adds a feature that uses a ‘SecureArray’ during the hashing. From what I understand, the SecureArray uses PInvoke to request that the OS lock access to memory during hashing, and then it securely zeroes that memory before returning. This is not without overhead, however, as the benchmarks show.
&lt;p&gt;3-byte input&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;      Method |       Hash |           Mean |          Error |           StdDev |     Gen 0 |     Gen 1 |     Gen 2 |   Allocated |
------------ |----------- |---------------:|---------------:|-----------------:|----------:|----------:|----------:|------------:|
  Blake2bRFC | 44229FC0EF |     1,142.3 ns |       7.136 ns |         6.326 ns |    0.0057 |         - |         - |        32 B |
 Blake2Sharp | 44229FC0EF |       534.7 ns |       4.650 ns |         4.349 ns |    0.2050 |         - |         - |       864 B |
      Isopoh | 44229FC0EF | 9,187,594.5 ns | 386,206.608 ns | 1,114,294.368 ns | 2332.5195 | 2314.4531 | 2314.4531 | 710953144 B |&lt;/pre&gt;

&lt;p&gt;3.19KiB input&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;      Method |       Hash |         Mean |       Error |        StdDev |     Gen 0 |     Gen 1 |     Gen 2 |   Allocated |
------------ |----------- |-------------:|------------:|--------------:|----------:|----------:|----------:|------------:|
  Blake2bRFC | 61EB59036B |    26.880 us |   0.1841 us |     0.1722 us |         - |         - |         - |        32 B |
 Blake2Sharp | 61EB59036B |     5.629 us |   0.0273 us |     0.0256 us |    0.1984 |         - |         - |       864 B |
      Isopoh | 61EB59036B | 8,094.502 us | 727.7956 us | 2,134.4986 us | 1724.1211 | 1710.4492 | 1710.4492 | 524302827 B |&lt;/pre&gt;

&lt;p&gt;10MiB input&lt;/p&gt;
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;      Method |       Hash |     Mean |     Error |    StdDev | Allocated |
------------ |----------- |---------:|----------:|----------:|----------:|
  Blake2bRFC | 7B6AB409B7 | 82.77 ms | 0.4741 ms | 0.4202 ms |       0 B |
 Blake2Sharp | 7B6AB409B7 | 16.63 ms | 0.1210 ms | 0.1132 ms |     864 B |
      Isopoh | 7B6AB409B7 | 16.67 ms | 0.1183 ms | 0.1106 ms |     984 B |&lt;/pre&gt;

&lt;p&gt;I can’t tell whether the ridiculous amount of memory allocated is a bug or by design. It’s very odd that it’s highest with the smallest input. And I can’t tell whether the lack of extra allocation on the 10MiB input is because it simply skips the extra processing past a certain size threshold or because the memory use is related to partially-filled blocks.&lt;p&gt;Although it would be accurate to say it’s more than 17000x slower than Blake2Sharp with small inputs, it might be more fair to say it has a high fixed overhead. Either way, it’s not suitable for general-purpose hashing. But unlike the libraries I’ve reviewed so far, this one doesn’t necessarily claim to be. I’m not sure of the value of securing the hash state memory when both the key and message data have been passed around insecurely before the library has a chance to use them, but I might be missing something.&lt;p&gt;I’d recommend you stay away from this library unless you truly need whatever specialized benefit it offers and have good explanations for the issues I pointed out above.&lt;h3&gt;The Others&lt;/h3&gt;&lt;p&gt;I have to give an honorable mention to &lt;a href="https://www.nuget.org/packages/NSec.Cryptography" target="_blank"&gt;NSec.Cryptography&lt;/a&gt;&lt;p&gt;This library is based on &lt;a href="https://download.libsodium.org/doc/" target="_blank"&gt;libsodium&lt;/a&gt;, which is a relatively mature platform-native security library. It didn’t meet my criteria in that it explicitly disallows hashes shorter than 32 bytes and is, therefore, not RFC-compliant. I couldn’t tell whether this was a limitation of libsodium or of its .NET wrapper. I also didn’t see a way to do a keyed hash, but I might have just missed it. I can say that for general-purpose hashing, if you don’t need to use a key and can use a full-length digest, this library works and is slightly faster than the best I could do with managed code.
In fact, the only thing I found that’s faster is an &lt;a href="https://github.com/sneves/blake2-avx2" target="_blank"&gt;AVX2 version&lt;/a&gt; of the BLAKE2 reference code. I’ll be doing a port of that AVX2 version once support is available (should be coming in .NET Core 2.2) so check back for that later.&lt;p&gt;And finally, there’s &lt;a href="https://www.nuget.org/packages/Multiformats.Hash/" target="_blank"&gt;Multiformats.Hash&lt;/a&gt;&lt;p&gt;This one lists BLAKE2 among its algorithms, but to quote from its readme:&lt;p&gt;“This is not a general purpose hashing library, but a library to encode/decode Multihashes which is a "container" describing what hash algorithm the digest is calculated with. The library also support calculating the digest, but that is not it's main purpose. If you're looking for a library that supports many algorithms and only want the raw digest, try BouncyCastle or the built-ins of the .net framework.”&lt;p&gt;Enough said there. It may or may not be any good at what it does, but it definitely does not do what I need it to do.&lt;h3&gt;The Conclusion&lt;/h3&gt;&lt;p&gt;Obviously, this was a very small sample size from the 115k+ packages on NuGet today and may not be representative of packages of all types. But the lesson is clear: there are no quality checks on NuGet, and download count is absolutely no indication of quality. In fact, download count tends to be self-reinforcing. People gravitate toward the “popular” packages, making it even more dangerous when one of these has a serious bug or design flaw. Not to mention, nuget.org seems to sort by popularity.&lt;p&gt;It’s dangerous to bring a library into your project without proper testing, and the presence of unit tests in a project or a lack of open issues are no guarantee that the code isn’t broken. As I like to say, “bad coders code bad tests that test their bad code badly”. 
Always test for yourself.&lt;p&gt;Tune in next time for some details on my own improved BLAKE2 implementations using the new X86 Intrinsics in .NET Core 2.1.  Until then, sleep tight, and don’t let the bedbugs bite…</description>
      <pubDate>Sat, 26 May 2018 20:38:00 -0700</pubDate>
      <a10:updated>2019-07-26T16:38:13-07:00</a10:updated>
    </item>
    <item>
      <guid isPermaLink="true">https://photosauce.net/blog/post/making-a-minimal-srgb-icc-profile-part-4-final-results</guid>
      <link>https://photosauce.net/blog/post/making-a-minimal-srgb-icc-profile-part-4-final-results</link>
      <a10:author>
        <a10:name />
      </a10:author>
      <title>Making a Minimal sRGB ICC Profile Part 4: Final Results</title>
      <description>&lt;p&gt;*Note: If you’re just here for the profiles, I have published those in a new github repo. Get them all &lt;a href="https://github.com/saucecontrol/Compact-ICC-Profiles" target="_blank"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;Thanks to some much-needed vacation time, it’s taken me a while to get to this final part of the series, but now it’s time to put everything together and get some profiles finalized. In the first three parts of this series, I examined ways to pack an ICC v2 profile as small as possible, an approach for finding an ideal point-based TRC fit with the minimum size, and how to derive the correct color primaries and whitepoint for an sRGB-compliant profile. In this final part, I will assemble some profiles using those techniques/values and test them out. I had difficulty devising real-world tests that would demonstrate the differences between profiles, but I think I’ve finally nailed down some good approximations that are fair and realistic.&lt;/p&gt;&lt;p&gt;My initial test plan was simply to re-create the worst case scenario for profile conversion. If a profile performs acceptably in the worst case, it should do even better under less extreme circumstances. For this reason, I decided to focus on conversions from sRGB to &lt;a href="https://en.wikipedia.org/wiki/ProPhoto_RGB_color_space" target="_blank"&gt;ProPhoto RGB&lt;/a&gt;. The thinking behind this is that an embedded sRGB profile will be used to convert to other colorspaces, and the colorspace that is the most different from sRGB would be the worst case. It would be possible to construct a custom colorspace that would be even more different than ProPhoto, but that wouldn’t be realistic. ProPhoto is a real colorspace that people actually use, and it has both a gamut that is much, much larger than sRGB and a response curve that is quite different (reference gamma 1.8 vs 2.2). An even more common scenario might be something like sRGB to Adobe RGB or Rec. 
2020, but again, if a profile does well with ProPhoto, the others will work even better.&lt;/p&gt;&lt;p&gt;&lt;img width="600" height="637" src="https://photosauce.net/images/gamut-comparison.svg"&gt;&lt;/p&gt;&lt;h3&gt;The Reference Image&lt;/h3&gt;&lt;p&gt;Having settled on an evaluation strategy, I needed to pick some test images. This turned out to be more difficult than I anticipated. I originally selected a few real-world images that had extremely saturated colors and a few with lots of different shades of blue and green. These are areas where ProPhoto and sRGB would have maximum differences, and that should highlight any errors. Unfortunately, I found it was impossible to compare fairly with real-world images for two main reasons:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;No real-world image covers the entire color gamut of sRGB, so an error might not show up simply because the color value that would show the error isn’t present in the image.&lt;/li&gt;&lt;li&gt;Real-world images tend to have areas of repeated pixel values. This means that if one profile causes a specific color to have an error, and if that color is over-represented in the image, it amplifies the error measured from the profile.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;For those reasons, I settled on testing with a &lt;a href="http://www.brucelindbloom.com/RGB16Million.html" target="_blank"&gt;single reference image&lt;/a&gt;. That image comes from Bruce Lindbloom’s site and is a 16.7-megapixel generated image that simply contains every color combination possible with 8-bit RGB. The image consists of 256 squares, each with a different blue value. And each of those squares consists of 256 rows and 256 columns, where the red value increases in each column and the green value increases in each row. 
I found this image makes it easy to see exactly where the errors are focused.&lt;/p&gt;&lt;h3&gt;The Reference Profile&lt;/h3&gt;&lt;p&gt;The second problem I had was establishing a reference to compare to. In testing my tone reproduction curves, I tested each candidate curve against the true sRGB inverse gamma curve. For the final profile testing, however, I wanted to test real images with real profiles using a real CMS. So I needed a real ICC profile to serve as a reference. Unfortunately, as we discovered in Part 3 of this series, I couldn’t find any profiles anywhere that are truly sRGB-compliant. Nor could I use the standard 1024-point TRC as a reference, because one thing I wanted to evaluate is whether the 182- and 212-point curves I found in Part 2 might actually be better than the 1024-point curve used in most profiles.&lt;/p&gt;&lt;p&gt;This series is focused on creating v2 ICC profiles, but v4 profiles have a newer feature that allows the TRC to be defined as a parametric curve rather than a point-based curve with linear interpolation. The parametric curve type allows the sRGB gamma function to be duplicated rather than approximated. Software support for v4 profiles is not great, so they aren’t used frequently, but a v4 profile with a parametric curve would serve as a good reference for testing my v2 profiles. That left me with a new problem, which was to find an optimal v4 profile.&lt;/p&gt;&lt;p&gt;Although the parametric curve type can duplicate the sRGB curve’s basic logic, the parameters themselves are defined in the ICC s15Fixed16Number format, meaning they have limited precision. I decided to evaluate the accuracy of a v4 sRGB curve using the same measures I used to evaluate my point-based curves in order to see how close it was to the true sRGB curve. 
Once again, I started with an example from &lt;a href="https://github.com/ellelstone/elles_icc_profiles/blob/master/profiles/sRGB-elle-V4-srgbtrc.icc?raw=true" target="_blank"&gt;Elle’s profile collection&lt;/a&gt;.&lt;/p&gt;&lt;p&gt; Here are the stats from that profile’s TRC compared with the best-performing point-based curves from Part 2.&lt;/p&gt;&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;Points | Max Error | Mean Error | RMS Error | Max DeltaL | Mean DeltaL | RMS DeltaL | Max RT Error
   182 |  0.001022 |   0.000092 |  0.000230 |   0.003107 |    0.000440 |   0.000736 | 0
   212 |  0.001650 |   0.000118 |  0.000357 |   0.002817 |    0.000449 |   0.000707 | 0
  1024 |  0.008405 |   0.000205 |  0.000996 |   0.003993 |    0.000475 |   0.000819 | 0
  4096 |  0.008405 |   0.000175 |  0.000860 |   0.003054 |    0.000472 |   0.000782 | 0
    v4 |  0.000177 |   0.000034 |  0.000051 |   0.000564 |    0.000317 |   0.000371 | 0&lt;/pre&gt;&lt;p&gt;As you can see, the v4 parametric curve results in significantly less error than even the best point-based options. Its error, however, is still surprisingly high. Let’s take a look at the parameter values from that profile and see why that is.&lt;/p&gt;&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;Param | sRGB Value     | sRGB Decimal   | Profile Hex | Profile Decimal | Diff
    g | 2.4            | 2.4            |  0x00026666 |  2.399993896484 | -6.103516e-6
    a | 1.000/1.055    | 0.947867298578 |  0x0000f2a7 |  0.947860717773 | -6.580805e-6
    b | 0.055/1.055    | 0.052132701422 |  0x00000d59 |  0.052139282227 |  6.580805e-6
    c | 1.000/12.92    | 0.077399380805 |  0x000013d0 |  0.077392578125 | -6.802680e-6
    d | 0.04045        | 0.04045        |  0x00000a5b |  0.040451049805 |  1.049805e-6&lt;/pre&gt;&lt;p&gt;Once quantized to s15Fixed16Number format, none of the numbers stored in the profile are exactly correct, and two of the parameters that have the largest impact on the output value (g and a) are both rounded down.&amp;nbsp; Rounding both values in the same direction effectively combines their error. I decided to try ‘nudging’ all the parameter values to try to find a better fit than was produced by simple rounding. It turned out, the best fit I was able to achieve was by bumping the ‘g’ value up and leaving all the rest as they were. By using a ‘g’ value of 0x00026669, or 2.400039672852, I was able to cut the error to less than half that of the rounded values.&lt;/p&gt;&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;Points | Max Error | Mean Error | RMS Error | Max DeltaL | Mean DeltaL | RMS DeltaL | Max RT Error
    v4 |  0.000177 |   0.000034 |  0.000051 |   0.000564 |    0.000317 |   0.000371 | 0
   ^v4 |  0.000088 |   0.000012 |  0.000022 |   0.000240 |    0.000124 |   0.000143 | 0&lt;/pre&gt;&lt;p&gt;While it’s not perfect, that is as close as it’s possible to get to the true sRGB inverse gamma function in an ICC profile. So with that and the primary colorant values from Part 3, I had my reference profile. I decided that while I was making a reference profile, I might as well make it as small as I could so that I would have another compact profile option for embedding. That profile is &lt;a href="https://github.com/saucecontrol/Compact-ICC-Profiles/blob/master/profiles/sRGB-v4.icc?raw=true" target="_blank"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;I also decided to create a reference v4 ICC profile for ProPhoto to use as my destination profile. That one was much simpler in that the default rounded values worked out to be the closest fit for the TRC, and the colorant values have a single, unambiguous definition.&amp;nbsp; That profile is &lt;a href="https://github.com/saucecontrol/Compact-ICC-Profiles/blob/master/profiles/ProPhoto-v4.icc?raw=true" target="_blank"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;h3&gt;The Reference CMS(s)&lt;/h3&gt;&lt;p&gt;Once again, this turned out to be more complicated than I anticipated.&amp;nbsp; One might expect that with the detail in the ICC specifications, there wouldn’t be much difference between CMS implementations. I’ve worked predominantly with the Windows Color System (WCS) by way of the Windows Imaging Component (WIC), and I always assumed it did a reasonable job at color conversions. 
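The s15Fixed16Number quantization shown in the parameter table above is easy to reproduce. This is an illustrative sketch, not the tooling used for these tests; it assumes simple round-to-nearest quantization, which matches the values stored in the profile.

```python
# Sketch: quantizing the sRGB parametric-curve parameters to the ICC
# s15Fixed16Number format (signed 16.16 fixed point) by round-to-nearest.

def to_s15f16(x: float) -> int:
    """Encode a float as a raw s15Fixed16 value (nearest representable)."""
    return round(x * 65536)

def from_s15f16(raw: int) -> float:
    """Decode a raw s15Fixed16 value back to a float."""
    return raw / 65536

# Exact sRGB parameter values from IEC 61966-2-1
params = {
    "g": 2.4,
    "a": 1.000 / 1.055,
    "b": 0.055 / 1.055,
    "c": 1.000 / 12.92,
    "d": 0.04045,
}

for name, value in params.items():
    raw = to_s15f16(value)
    quantized = from_s15f16(raw)
    print(f"{name}: 0x{raw:08x}  {quantized:.12f}  diff {quantized - value:+.6e}")
```

Decoding the nudged ‘g’ value of 0x00026669 the same way gives 2.400039672852, the value that cut the curve error in half.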
However, when looking for an easy way to test conversions using multiple profiles, I stumbled on the &lt;a href="http://www.littlecms.com/1/newutils.htm#tifficc" target="_blank"&gt;tifficc&lt;/a&gt; command-line utility from Little CMS.&lt;/p&gt;&lt;p&gt;In testing with tifficc, I found that the results were mostly in line with my expectations, but when using my candidate profiles as a target rather than a source, it appeared Little CMS was doing an upgrade or substitution of the specified profile to an internal reference version of the colorspace. That’s definitely a desirable behavior in that it ensures correct output regardless of minor profile differences, but it’s not desirable when trying to measure those minor differences. WCS, on the other hand, produced different output for each profile, all of it entirely different from Little CMS’s. And while that output was more in line with my expectations from my previous testing, it seems that it might not be as correct.&lt;/p&gt;&lt;p&gt;I had been planning for some time to replace some of my dependencies on WCS with my own color management implementation, and this provided the final push I needed to get it done. In the end, I decided to consider the output from both CMSs, so it would be easier to predict what might happen when using the profiles in different scenarios and with different software.&lt;/p&gt;&lt;h3&gt;The Reference Scenario&lt;/h3&gt;&lt;p&gt;This is where things finally get easy. The purpose of a compact profile is to be embedded in an image. Obviously, it would only be embedded in an image of a matching colorspace and would only be used as a source profile in those cases. I had already chosen ProPhoto as a destination colorspace because of its extreme difference from sRGB while still being a realistic conversion path. And having already decided to use a v4 ProPhoto profile as a reference destination, that left only one choice to make. 
I had already decided that I would test with an 8-bit reference input image because that’s the most common image type in the wild. But wide-gamut colorspaces like ProPhoto are not well-suited for use with 8-bit images. Squishing the sRGB gamut down into its corresponding place in the ProPhoto gamut at 8-bit resolution tends to cause &lt;a href="https://en.wikipedia.org/wiki/Posterization" target="_blank"&gt;posterization&lt;/a&gt;. So I decided to test 8-bit sRGB input and 16-bit ProPhoto output. I was also able to test the reverse of that transform, going from the 16-bit ProPhoto images back to 8-bit sRGB. In the interest of time, I won’t document the full details of those tests, but those tests are the ones that led to my conclusion that Little CMS does some kind of profile substitution and that WCS is probably Not Very Good. I may do another post on that at some point in the future.&lt;/p&gt;&lt;p&gt;For the Little CMS trials, I used a command line similar to the following:&lt;/p&gt;&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;tifficc -v -t1 -isrgb-v4-ref.icc -oprophoto-v4-ref.icc -e -w rgb16million.tif rgb16milpp-ref.tif&lt;/pre&gt;&lt;p&gt;For the WCS trials, I wrote a small utility that uses the WIC &lt;a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ee690201(v=vs.85).aspx" target="_blank"&gt;IWICColorTransform&lt;/a&gt; interface to perform the same conversion.&lt;/p&gt;&lt;h3&gt;The Measurements&lt;/h3&gt;&lt;p&gt;Having established a reference scenario and created a reference profile, all I had to do was run the conversion(s) in question using the reference profile as well as all the v2 candidates and then compare their output. I also figured it would be worthwhile to try some variants using more common values, like the 1024- and 4096-point TRCs and the ArgyllCMS and HP colorant values. 
That should allow a complete picture of how the candidate profiles perform as well as a good basis for understanding which parts of the profile contribute to greater differences in output.&lt;/p&gt;&lt;p&gt;Measuring the differences between the profile output mathematically is a simple task, but I wanted to be able to visualize those differences for easier comparison and so that it would be possible to see not only how much difference there was, but also &lt;em&gt;where&lt;/em&gt; the differences occurred. I considered using ΔE-CIE2000 for these comparisons, but the reality is, the results are so close visually that there isn’t much meaning to the color difference. I also found the results of the raw differences interesting because of the way the visualization shows patterns in the error.&lt;/p&gt;&lt;p&gt;I’ve referenced the &lt;a href="https://www.scootersoftware.com/features.php" target="_blank"&gt;Beyond Compare image comparison&lt;/a&gt; tool a few times before because I like the way it works and the way it shows differences. The only problem with using it for these tests is that while it does load 16-bit images, it seems to convert them to 8-bit before doing the comparison. That means I couldn’t get the kind of detail I wanted to see in the diffs. Normally, when I use that tool, I set it up with a threshold of 1, meaning its default visualization will show pixels that are equal between two images in greyscale, pixels that are off by 1/255 in blue, and pixels that are off by more than 1/255 in red. In doing some trials with 8-bit output and comparing them in Beyond Compare, I found that none of the profiles in my test suite created output that differed from the reference by more than 1 on any given pixel. That’s good news, in that it backs up the theory that none of the profiles I’m testing will produce output that is significantly different visually. 
But it would make it difficult to draw any conclusions about which profiles are better, especially when the differences get more subtle. That issue, combined with the fact that ProPhoto isn’t recommended for 8-bit images anyway, led me to create my own variant of the image comparison tool that worked at higher bit-depth.&lt;/p&gt;&lt;p&gt;The second problem was visualizing the differences. As I said, I like the way Beyond Compare does it, but when you compare 16-bit images, it’s difficult to find a threshold for color-coding the differences. I ended up with something like the original but enhanced for the extra sample resolution. Instead of coloring different pixels either solid blue or solid red depending on the threshold, I created a gradient from blue to red. I chose the threshold rather arbitrarily, setting it at 65/65535, or roughly 0.1%. That threshold worked out well in that it allowed me to create a gradient from blue to red for differences between 1 and 65, and then differences over 65 could be colored solid red. Note that the solid red doesn’t necessarily mean a difference would be distinguishable visually. And as you’ll see, differences that great were very rare in the tests anyway.&lt;/p&gt;&lt;p&gt;And finally, I added some stats to the diff images to provide a little more detail than can be seen visually. I grouped the errors into four buckets (1-17, 18-33, 34-49, 50-65) and added raw pixel counts for each error bucket, plus the count of pixels over the 65 threshold. I also calculated the Max, Mean, and Root Mean Square error for each test image versus the reference. Those stats are written into the upper-left corner of each diff image.&lt;/p&gt;&lt;h3&gt;The Results&lt;/h3&gt;&lt;p&gt;From here on out, there will be a lot of images. These are the diff images created by the tool I described above. Again, grey pixels in the image indicate that the candidate profile produced output &lt;em&gt;identical&lt;/em&gt; to the reference profile. 
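The bucketed counts and Max/Mean/RMS figures described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not my comparison tool; details such as per-sample (rather than per-pixel) counting are assumptions.

```python
# Sketch (illustrative, not the actual diff tool): bucket the per-sample
# differences between two 16-bit images and compute Max/Mean/RMS error.
import numpy as np

def diff_stats(img_a, img_b, threshold=65):
    # Widen to int64 so subtraction of unsigned samples cannot wrap
    diff = np.abs(img_a.astype(np.int64) - img_b.astype(np.int64))
    nonzero = diff[diff > 0]
    # Buckets matching the article: 1-17, 18-33, 34-49, 50-65, then over 65
    buckets = {}
    for lo, hi in [(1, 17), (18, 33), (34, 49), (50, 65)]:
        in_range = np.logical_and(nonzero >= lo, hi >= nonzero)
        buckets[f"{lo}-{hi}"] = int(in_range.sum())
    buckets[f"over {threshold}"] = int((nonzero > threshold).sum())
    stats = {
        "max": int(diff.max()),
        "mean": float(diff.mean()),
        "rms": float(np.sqrt(np.mean(diff.astype(np.float64) ** 2))),
    }
    return buckets, stats
```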
Pixels that are tinted blue represent the smallest differences, and there is a gradient from blue to purple to red (more pink, really) for increasing error levels. Finally, any pixels colored solid red were different by more than 65/65535. All the images below are thumbnails, and you can click them to get the full diff image. Be aware, though, the diff images are 16.7 megapixels in size, so don’t click them if you’re on a device that can’t handle the size (bytes or pixels). Oh, and the diff images themselves are 8-bit, even though they represent differences between 16-bit images. Since the diff is just a visualization, I wanted to keep them as small as possible. They’re already as much as 10 MiB each saved as 8-bit-per-channel PNGs.&lt;/p&gt;&lt;p&gt;For each profile, I’ll include the results from both the LCMS tifficc utility and my WCS/WIC conversion utility. The differences are interesting to see.&lt;/p&gt;&lt;h4&gt;The Colorant Factor&lt;/h4&gt;&lt;p&gt;I’ll start with the effect of different primary colorant values used in common sRGB profiles. In Part 3 of this series, I looked at the differences between the odd colorant values from the HP/Microsoft sRGB profile as well as the Rec. 709-derived colorants used by the ArgyllCMS reference sRGB profile. For these tests, I created v4 ICC profiles using the same modified parametric curve from my reference profile, so that only the colorants are different. 
Converting the reference image to ProPhoto using those as source profiles produced the following diffs, compared with output from my reference sRGB profile.&lt;/p&gt;&lt;div class="row"&gt;&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/0804a72a-a339-4e91-9573-1f63efafa2d0.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-16milpp-hp" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-16milpp-hp" src="/blog/posts/files/ab3c88d5-8d08-490a-af02-02d2c303e33c.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="col-xs-12 col-md-6"&gt;&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;LCMS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;HP/MS&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;16.5M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;v4 Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;16&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;5.4643&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;5.8626&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;

&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div class="row"&gt;&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/21fc33c0-07b7-45fc-a6c8-0396b721f952.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-16milpp-argyll" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-16milpp-argyll" src="/blog/posts/files/58d361d8-a154-4334-856d-0d6a77a5c54c.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="col-xs-12 col-md-6"&gt;&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;LCMS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;Rec. 709&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;13.6M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;v4 Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;16&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;1.7412&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;2.2036&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;And the same using WCS&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div class="row"&gt;&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/c65d0458-fef2-48e6-9a52-383969173d84.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-wcs-16milpp-hp" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-wcs-16milpp-hp" src="/blog/posts/files/95280f4e-5e60-473f-bdee-47ce13bf35bb.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;WCS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;HP/MS&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;12.8M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;v4 Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;950745&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;63&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;17799&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;6.3570&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;1955&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;8.4811&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div class="row"&gt;&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/9ac731a2-4b04-4d6e-aeeb-fd9b3453a8f5.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-wcs-16milpp-argyll" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-wcs-16milpp-argyll" src="/blog/posts/files/4efdfd2e-05b5-407d-a7e6-f595e10cc3cd.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="col-xs-12 col-md-6"&gt;&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;WCS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;Rec. 709&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;8.6M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;v4 Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;928751&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;63&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;17484&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;4.7129&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;1955&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;7.5958&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;The thing that really stands out to me is the difference in the way these profiles are handled by LCMS and WCS. Based on the splotchiness of the WCS diff images (you’ll have to view them full-size to see), my guess is that it’s using lower-precision calculations than LCMS. In both cases, though, the differences are quite small and should be below the threshold of visible difference. That’s certainly the case with the Rec. 709 colors vs the sRGB reference colors, but the unbalanced colors from the HP/Microsoft profile don’t result in as much difference in the converted result as one might expect. I think the differences here also make a good reference point for determining the significance of the differences caused by different TRC approximations.&lt;/p&gt;&lt;h4&gt;The 26-Point Curves&lt;/h4&gt;&lt;p&gt;In Part 2 of this series, I did some detailed analysis of both the TinyRGB/c2 26-point approximated TRC and the proposed ‘improved’ curve used in the sRGBz profile. That analysis predicted that the sRGBz curve would perform worse than the TinyRGB curve, and it found another alternate 26-point curve that it predicted would do better. I figured some real-world testing of those predictions would be a good start. Although I measured and tuned the curves primarily using ΔL, which is a measure of visual difference, we can see that the results are the same even when measuring absolute pixel differences after conversion.&lt;/p&gt;&lt;p&gt;Note that in testing these curves, I created new profiles that all shared the same reference sRGB colorant values to limit any differences to the curves themselves.&lt;/p&gt;&lt;p&gt;It’s difficult to see in the thumbnails, but at full size, the visualization shows a pronounced improvement in error levels for my alternate 26-point curve over either of the others. 
The sRGBz curve has both the largest mean error and the most individual pixels with high error levels.&lt;/p&gt;
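For readers who want to experiment, the kind of curve evaluation described in Part 2 can be approximated with a few lines of NumPy. This is a simplified sketch: it places the curve points uniformly, whereas the 26-point curves under test use optimized point placement, so the numbers it produces will not match the tables in this series.

```python
# Sketch: measure how well a point-based TRC, applied with linear
# interpolation as a CMS would, approximates the true sRGB decoding curve.
import numpy as np

def srgb_to_linear(v):
    # IEC 61966-2-1 sRGB decoding function (encoded value to linear light)
    linear_part = v / 12.92
    power_part = ((v + 0.055) / 1.055) ** 2.4
    return np.where(v > 0.04045, power_part, linear_part)

def trc_error(num_points, samples=65536):
    # Build a point-based curve by sampling the true function uniformly,
    # then evaluate it with linear interpolation between the points.
    xs = np.linspace(0.0, 1.0, num_points)
    ys = srgb_to_linear(xs)
    v = np.linspace(0.0, 1.0, samples)
    err = np.abs(np.interp(v, xs, ys) - srgb_to_linear(v))
    return err.max(), err.mean(), np.sqrt(np.mean(err ** 2))
```

Even this naive placement shows the expected trend: more points means less interpolation error, while optimized placement (as in the 182- and 212-point curves) buys more accuracy per point.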
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/4d816377-ad5d-4e3b-80de-5d708fcef645.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-16milpp-26z" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-16milpp-26z" src="/blog/posts/files/48ec917b-fc52-465c-b761-0e1db90ffcc2.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;LCMS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;13.6M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;sRGBz&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;2.9M&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;67&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;243867&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;13.6471&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;17624&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;15.8336&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;59&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/0996ea2d-efb6-4a76-a0cc-f954b12f1091.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-16milpp-26c2" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-16milpp-26c2" src="/blog/posts/files/d3d36a7c-8838-4e9e-a517-065f86be36bc.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;LCMS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;14.6M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;TinyRGB/c2&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;2M&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;72&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;143994&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;13.5533&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;13082&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;15.1258&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/87b262c6-6e0f-460f-a733-76b24a31aeaa.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-16milpp-26" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-16milpp-26" src="/blog/posts/files/8b36eae0-97be-4c5b-b0a5-d3a08a02cc3f.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;LCMS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;14.7M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;26-Point Alt&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;2M&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;64&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;111588&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;13.4069&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;3636&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;14.8904&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;And again with WCS&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/d54cf6dc-fffd-4bed-8093-d9eb2b8c08dd.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-wcs-16milpp-26z" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-wcs-16milpp-26z" src="/blog/posts/files/9c93ab1c-abc3-47e5-9331-9c8b82a5e971.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;WCS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;11.9M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;sRGBz&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;4.4M&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;69&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;487042&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;15.6611&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;51775&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;18.7551&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;335&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/0341788d-8dc5-4e32-80cd-ebb297cc4790.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-wcs-16milpp-26c2" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-wcs-16milpp-26c2" src="/blog/posts/files/f12c54b7-e1a4-4a59-95f6-1fe18d7201fa.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;WCS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;13.1M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;TinyRGB/c2&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;3.4M&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;62&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;269412&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;13.8746&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;13434&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;16.2264&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/6b8a175e-8ddb-4b36-88d3-fbcad4cac85a.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-wcs-16milpp-26" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-wcs-16milpp-26" src="/blog/posts/files/94609ebb-658d-4928-9be0-8434f8b8a769.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;WCS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;13.2M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;26-Point Alt&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;3.3M&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;54&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;275720&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;13.9112&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;5374&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;16.2414&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
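For reference, the statistics reported in these tables (max, mean, and RMS diff, plus the bucketed diff counts) can be sketched roughly as follows. This is an illustrative reconstruction, not the exact tool used for these measurements; in particular, whether the mean and RMS are taken over only the differing pixels is an assumption here, as are the function and variable names.

```python
# Rough sketch of the per-pixel diff statistics shown in the tables:
# worst-channel difference per pixel, bucketed into the same brackets
# used above (1-17, 18-33, 34-49, 50-65, >65). Assumes two equally
# sized 8-bit RGB images as NumPy arrays.
import numpy as np

def diff_stats(reference: np.ndarray, test: np.ndarray) -> dict:
    # per-pixel difference magnitude, taking the worst channel per pixel
    diff = np.abs(reference.astype(np.int16) - test.astype(np.int16)).max(axis=-1)
    nonzero = diff[diff > 0].astype(np.float64)
    buckets = {
        "1-17":  int(np.count_nonzero((diff >= 1)  & (diff <= 17))),
        "18-33": int(np.count_nonzero((diff >= 18) & (diff <= 33))),
        "34-49": int(np.count_nonzero((diff >= 34) & (diff <= 49))),
        "50-65": int(np.count_nonzero((diff >= 50) & (diff <= 65))),
        ">65":   int(np.count_nonzero(diff > 65)),
    }
    return {
        "max":  int(diff.max()),
        # assumption: averages computed over differing pixels only
        "mean": float(nonzero.mean()) if nonzero.size else 0.0,
        "rms":  float(np.sqrt(np.mean(nonzero ** 2))) if nonzero.size else 0.0,
        "buckets": buckets,
    }
```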
&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;Once again, the output from WCS seems to have amplified the error in the profiles, but the relative results are the same: the sRGBz curve is less accurate than TinyRGB’s, which is less accurate than my alternate 26-point curve. It’s also worth noting how much more error these curves contribute compared to the error from the alternate primary colorants. This level of error is still quite acceptable for the profiles’ primary intended use case, but we’ll look at some other options.&lt;/p&gt;&lt;h4&gt;The Alternate Compact Curves&lt;/h4&gt;&lt;p&gt;I picked out a few of the interesting compact curves my solver found in Part 2 of this series to see how they compare in terms of size/accuracy tradeoff. Here are those comparisons, again using both CMSes. First LCMS…&lt;/p&gt;&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/16f565d9-e193-4acc-8328-15f37a639251.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-16milpp-20b" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-16milpp-20b" src="/blog/posts/files/36c47958-c1c8-4591-9741-58d113e1c032.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;LCMS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;8.0M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;20-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;7.4M&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;82&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;1.1M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;23.1414&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;212433&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;25.6127&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;18122&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/7093d277-4386-4994-ac4b-0ead65e6cac4.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-16milpp-32" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-16milpp-32" src="/blog/posts/files/0879e7d6-5265-440f-a014-815776e3c56e.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;LCMS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;16.5M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;32-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;294858&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;48&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;8.7495&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;9.6480&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/8872d360-5781-465f-9e3a-0bd2afde9675.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-16milpp-42" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-16milpp-42" src="/blog/posts/files/394dadbb-712a-4f66-8950-e165b6eeef98.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;LCMS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;16.7M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;42-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;5485&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;32&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;5.1398&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;5.7177&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/5d8b0d4c-ddea-4f71-b721-be91817fe09f.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-16milpp-63" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-16milpp-63" src="/blog/posts/files/5064152f-ec21-4b71-8ae1-141dd3bf3a79.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;LCMS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;16.1M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;63-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;32&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;2.4592&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;2.8069&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;And once more using WCS&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/2f385387-6364-49ec-b2da-4b486bef5e27.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-wcs-16milpp-20b" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-wcs-16milpp-20b" src="/blog/posts/files/530cd396-10d5-4375-96b3-37673d2dc59e.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;WCS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;8.1M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;20-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;6.9M&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;82&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;1.5M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;23.0439&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;227729&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;26.0615&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;34926&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/51f05a54-3980-4574-a327-3ea8ad9cdbef.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-wcs-16milpp-32" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-wcs-16milpp-32" src="/blog/posts/files/3180cc36-e2ed-483d-a05b-d1d02a1e48e6.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;WCS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;15.6M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;32-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;1.1M&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;38&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;1162&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;9.6662&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;11.4510&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/35649d3b-f868-4b8f-a954-fea823325812.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-wcs-16milpp-42" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-wcs-16milpp-42" src="/blog/posts/files/87180cc9-5c88-4471-a153-551328e68716.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;WCS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;16.6M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;42-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;152587&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;25&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;7.3962&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;8.4588&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/085f463f-9c59-449d-a35b-f21c9d3fb89c.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-wcs-16milpp-63" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-wcs-16milpp-63" src="/blog/posts/files/a1f6b43d-f7b4-4cde-9eec-f8e48ff1a779.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;WCS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;16.7M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;63-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;187&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;23&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;6.7250&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;7.3211&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;And now a few notes on these interesting curves…&lt;/p&gt;&lt;p&gt;Although the 20-point curve diff images look like a bit of a bloodbath, allow me to point out a couple of things. First, as I mentioned before, I chose the threshold for the full red pixels rather arbitrarily. I wanted the small differences between all the profile variants to be visible even in the thumbnails here, and I set the thresholds based on my choice of a 64-value gradient for the smaller errors. Red doesn’t necessarily mean danger in this case; it just means the error is higher than the worst from the other curves. Not a lot worse, mind you, but the line has to be drawn somewhere, and I happened to draw it just under the max error of the 20-point curve. Second, you’ll note that most of the worst error is concentrated toward the upper left of the image and the upper left of each square within the image. These are the darker parts of the image, where a larger absolute pixel difference represents a smaller visual difference than it would at the mid-tones. The choice to concentrate the error in those less-visible areas was a key part of the tuning algorithm used in my curve solver. I believe the 20-point curve is perfectly adequate for some 8-bit image embedding scenarios, particularly thumbnail-sized images where file size is important. Weighing in at only 410 bytes, I believe &lt;a href="https://github.com/saucecontrol/Compact-ICC-Profiles/blob/master/profiles/sRGB-v2-nano.icc?raw=true" target="_blank"&gt;this&lt;/a&gt; is the smallest possible usable sRGB-compatible profile.&lt;/p&gt;&lt;p&gt;The other three candidate curves performed very well indeed. The 32-point curve is quite a significant improvement over the 26-point curves used in the existing compact profiles, at a cost of only 12 additional bytes in the profile. So once again, I’ll say that 26 is not a magic number in this case.
But really, if you’re looking for a magic number, wouldn’t you just skip straight to &lt;a href="https://en.wikipedia.org/wiki/42_(number)" target="_blank"&gt;42&lt;/a&gt;? The error level in the 42-point curve is quite good. It’s actually awfully close to the error caused by the bad colorant values used in the very popular HP/Microsoft sRGB profile, so it makes an excellent compromise if you’re looking to save space. The 63-point curve halved the error of the 42-point curve when using LCMS but didn’t improve as much with WCS, so while it would also be a good choice, I think 42 is the magic number for my &lt;a href="https://github.com/saucecontrol/Compact-ICC-Profiles/blob/master/profiles/sRGB-v2-micro.icc?raw=true" target="_blank"&gt;compact profile&lt;/a&gt;.&lt;/p&gt;&lt;h4&gt;The Big Curves&lt;/h4&gt;&lt;p&gt;That just leaves the larger curves to evaluate. My solver identified curves of 182 and 212 points that appeared to be a closer fit to true sRGB than the standard 1024- and 4096-point curves used in many profiles, and I wanted to see whether that held up in a real-world test. Here are the results when using all four of those.&lt;/p&gt;&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/05e332a6-081e-4591-b778-31597ef820b4.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-16milpp-182" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-16milpp-182" src="/blog/posts/files/62d50ab0-674f-4bd8-bb55-0ad606dede7b.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;LCMS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;7.6M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;182-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;16&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;0.7436&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;1.2585&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/30bfcd00-e8ec-404c-93c5-4b34ed8d2dfd.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-16milpp-212b" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-16milpp-212b" src="/blog/posts/files/8a697208-e7b4-4613-9501-a7d6f7e8e330.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;LCMS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;8.2M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;212-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;16&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;0.7736&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;1.2533&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/bdb67467-cc28-457f-b4af-c99da46fb160.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-16milpp-1024" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-16milpp-1024" src="/blog/posts/files/65fe372d-935b-46be-ad12-3a0fd37353f9.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;LCMS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;8.4M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;1024-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;16&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;0.7879&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;1.2471&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/331be5be-029f-47ec-b945-fbfb906a532b.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-16milpp-4096" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-16milpp-4096" src="/blog/posts/files/30ab3958-d972-42e8-944f-8c3284687dfb.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;LCMS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;2.8M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;4096-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;16&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;0.2164&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;0.5570&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
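For context on what an "n-point curve" is: a point-sampled TRC is just a table of transfer-function samples stored in the profile. The sketch below builds such a table by evenly sampling the sRGB transfer function and quantizing to 16 bits, which is the general shape of the data these profiles carry. Note this is only an illustration of the naive construction; the compact curves compared in this series were tuned by my solver rather than sampled evenly, so the values here won't match those profiles.

```python
# Minimal sketch: an n-point TRC table built by evenly sampling the
# sRGB decoding function (piecewise linear/power) and quantizing each
# sample to 16 bits. Point counts like 26, 42, 1024, or 4096 correspond
# to the table sizes discussed above.
def srgb_eotf(v: float) -> float:
    # encoded value v in [0, 1] -> linear light in [0, 1]
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

def trc_table(points: int) -> list[int]:
    # evenly spaced samples of the curve, quantized to unsigned 16-bit
    return [round(srgb_eotf(i / (points - 1)) * 65535) for i in range(points)]
```

A consumer of the profile interpolates between these samples, which is why fewer points means more interpolation error at the intermediate code values.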
&lt;div&gt;&lt;br&gt;&lt;/div&gt;
&lt;div&gt;And repeated one last time using WCS&lt;/div&gt;
&lt;div&gt;&lt;br&gt;&lt;/div&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/a2366532-9915-4c98-ab43-544f1273bf48.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-wcs-16milpp-182" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-wcs-16milpp-182" src="/blog/posts/files/1a6bdba8-56af-4b31-97b9-132a8b510257.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;WCS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;16.6M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;182-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;1446&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;23&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;5.7849&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;6.2883&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/759c0c08-721a-4c07-b3f1-ee70ac257d6b.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-wcs-16milpp-212b" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-wcs-16milpp-212b" src="/blog/posts/files/eec80974-dddc-4694-8156-3d7872e7f8bd.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;WCS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;16.6M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;212-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;12261&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;23&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;5.6512&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;6.1638&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/26766a5f-bcdb-4389-8875-04cb66899e7b.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-wcs-16milpp-1024" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-wcs-16milpp-1024" src="/blog/posts/files/7500ae93-ef67-4de7-973b-dda8378613cb.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;WCS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;16.6M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;1024-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;74&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;17&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;5.6634&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;6.1857&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/f35b4cfd-d984-4993-9284-111e3a236c8c.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-wcs-16milpp-4096" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-wcs-16milpp-4096" src="/blog/posts/files/61dbe852-2e65-4476-ae65-5f7724992744.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;WCS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;16.6M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;4096-Point&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;144&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;17&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;5.6006&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;6.1383&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;I must admit, these results have left me a bit puzzled. With LCMS, the 182- and 212-point curves generally outperformed the 1024-point curve, as I expected. But the 4096-point curve really blew the others away. In doing some round-trip testing early on with LCMS, I found that it was producing perfect output with the 1024- and 4096-point TRCs when it really shouldn’t have, so I suspected there may be some sort of internal substitution happening with those larger TRCs. I tested that theory out by modifying the first 20 points of the 1024-point profile to use the same large value. The output didn’t change, which lends some support to that theory. I didn’t dig into the LCMS code to see what’s happening, but I can say that the same substitution/upgrade did not occur when using my 182- or 212-point TRCs. So if you’re using LCMS and want the most accuracy (for destination profiles at least), you may be better off sticking with the standard TRCs. When used as a source profile, however, there is something special about those smaller curves. I think they’ll work nicely for embedded profiles when size is a factor.&lt;/p&gt;&lt;p&gt;The results when using WCS to convert were a bit more in line with my expectations. The 4096-point curve had more pixels with larger individual errors compared to the 1024-point curve, but it made up some of the gap in the average error by better fitting the top part of the curve. The 182- and 212-point TRCs performed admirably but didn’t offer an upgrade over the larger versions. Again, they have almost the same accuracy as the standard curves, so if size is a concern, they’re a viable option. 
I’ll go ahead and publish a &lt;a href="https://github.com/saucecontrol/Compact-ICC-Profiles/blob/master/profiles/sRGB-v2-magic.icc?raw=true" target="_blank"&gt;profile&lt;/a&gt; that uses the 212-point curve because I think it has some value, but it’s not quite the upgrade over the standard curves I thought it might be.&lt;/p&gt;&lt;h3&gt;The CMS Factor&lt;/h3&gt;&lt;p&gt;I found the differences between the results from the two CMSes interesting enough to do all the tests in both, but I wanted to show one last comparison to lend a bit of perspective to all the other comparisons/diffs in this post. Here’s what it looks like when you compare the two CMSes’ reference outputs against each other.&lt;/p&gt;&lt;div class="row"&gt;
&lt;div class="col-xs-12 col-md-6"&gt;&lt;a href="/blog/posts/files/216afa0e-7965-420b-8d21-36cc3aac8880.png" target="_blank"&gt;&lt;img width="400" height="400" title="diff-16milpp-lcms-wcs" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="diff-16milpp-lcms-wcs" src="/blog/posts/files/aa516002-dff7-42f4-b2b1-08a983df25af.png" border="0"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="col-xs-12 col-md-6"&gt;
&lt;table class="table"&gt;
&lt;tbody&gt;&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;CMS&lt;/td&gt;&lt;td&gt;LCMS/WCS&lt;/td&gt;&lt;td align="center" class="font-weight-bold" colspan="2"&gt;Diff Counts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Colors&lt;/td&gt;&lt;td&gt;sRGB Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;1-17&lt;/td&gt;&lt;td&gt;11.1M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;TRC&lt;/td&gt;&lt;td&gt;v4 Ref&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;18-33&lt;/td&gt;&lt;td&gt;2.9M&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Max Diff&lt;/td&gt;&lt;td&gt;404&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;34-49&lt;/td&gt;&lt;td&gt;1.0M&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;Mean Diff&lt;/td&gt;&lt;td&gt;22.7936&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;50-65&lt;/td&gt;&lt;td&gt;563030&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="right" class="font-weight-bold"&gt;RMS Diff&lt;/td&gt;&lt;td&gt;35.1353&lt;/td&gt;&lt;td align="right" class="font-weight-bold"&gt;&amp;gt;65&lt;/td&gt;&lt;td&gt;1.1M&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
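&lt;p&gt;If you want to run this kind of comparison yourself, the Max/Mean/RMS and bucket numbers in these tables can be produced with a few lines of numpy. This is only a sketch of the idea, not my actual test harness; in particular, it assumes the per-pixel diff is the largest channel difference, which may not match my measurements exactly.&lt;/p&gt;

```python
import numpy as np

def diff_stats(a, b):
    """Difference statistics between two 8-bit RGB images.

    A sketch of the kind of measurement behind the tables above, not the
    actual test harness; it assumes the per-pixel diff is the largest
    channel difference."""
    d = np.abs(a.astype(np.int32) - b.astype(np.int32)).max(axis=-1)
    # Bucket the nonzero diffs into the same ranges the tables use.
    counts, _ = np.histogram(d, bins=[1, 18, 34, 50, 66, np.inf])
    labels = ['1-17', '18-33', '34-49', '50-65', '>65']
    buckets = {k: int(c) for k, c in zip(labels, counts)}
    rms = float(np.sqrt(np.mean(d.astype(np.float64) ** 2)))
    return int(d.max()), float(d.mean()), rms, buckets
```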
&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;Now &lt;em&gt;that’s&lt;/em&gt; a bloodbath. And that serves to make an important point: the CMS implementation details can easily have a far greater impact on the output results than any of the profile aspects we’ve looked at. I’m not quite sure which is more accurate between LCMS and WCS, but I strongly suspect it’s the former. I’ll do some more testing on that as I do my own color conversion implementations in MagicScaler. If I find anything interesting and if I remember, I’ll come back and update this post.&lt;/p&gt;</description>
      <pubDate>Tue, 24 Apr 2018 18:32:55 -0700</pubDate>
      <a10:updated>2020-05-05T11:49:35-07:00</a10:updated>
    </item>
    <item>
      <guid isPermaLink="true">https://photosauce.net/blog/post/making-a-minimal-srgb-icc-profile-part-3-choose-your-colors-carefully</guid>
      <link>https://photosauce.net/blog/post/making-a-minimal-srgb-icc-profile-part-3-choose-your-colors-carefully</link>
      <a10:author>
        <a10:name />
      </a10:author>
      <title>Making a Minimal sRGB ICC Profile Part 3: Choose Your Colors Carefully</title>
      <description>&lt;p&gt;When I started the task of creating a minimal sRGB profile, I assumed the part that would require the least thought would be the colorant and whitepoint tags in the profile. To review, the &lt;a href="http://www.color.org/ICC_Minor_Revision_for_Web.pdf" target="_blank"&gt;ICC V2 specification&lt;/a&gt; requires 9 tags for RGB profiles. Those are: copyright (cprt) and description (desc), which I discussed in Part 1 of this series; the tone reproduction curves (rTRC, gTRC, and bTRC), which I covered thoroughly in Part 2; and finally, the colorant and whitepoint tags (rXYZ, gXYZ, bXYZ, and wtpt), which ended up with their own post, too. This is that post.&lt;p&gt;I have referenced Elle Stone’s treatise on &lt;a href="https://ninedegreesbelow.com/photography/are-your-working-space-profiles-well-behaved.html" target="_blank"&gt;well-behaved profiles&lt;/a&gt; a couple of times already, and I’ll start this post by referring there again. I’ll also refer you to her systematic examination of &lt;a href="https://ninedegreesbelow.com/photography/linux-icc-profiles.html" target="_blank"&gt;ICC profiles seen in the wild&lt;/a&gt; and finally, to her very detailed explanation of &lt;a href="https://ninedegreesbelow.com/photography/srgb-color-space-to-profile.html" target="_blank"&gt;how to create an sRGB-compatible ICC profile&lt;/a&gt; using the values from the sRGB spec.&lt;p&gt;I drew two main conclusions from reading through those articles. The first was that well-behaved and correct sRGB profiles are difficult to create and are, consequently, rare. The second was that the reference sRGB profile shipped with &lt;a href="http://www.argyllcms.com/" target="_blank"&gt;ArgyllCMS&lt;/a&gt; happens to be that most mythical of profiles. 
The unicorn profile, if you will.&lt;p&gt;And that’s what was supposed to make this step simple; I’d just steal the colorant tag values from the ArgyllCMS profile and call it a day.&lt;p&gt;Facebook’s &lt;a href="https://www.facebook.com/notes/facebook-engineering/under-the-hood-improving-facebook-photos/10150630639853920" target="_blank"&gt;TinyRGB&lt;/a&gt; profile had used the colorant tag values from the original HP/Microsoft sRGB profile, and while that remains the most commonly seen sRGB profile in the wild, Elle had convinced me it was also one of the most wrong.&lt;p&gt;In his writeup on the &lt;a href="http://pippin.gimp.org/sRGBz/" target="_blank"&gt;sRGBz&lt;/a&gt; profile, Øyvind Kolås (Pippin) mentioned generating new colorant tag values from &lt;a href="http://www.gegl.org/babl/" target="_blank"&gt;babl&lt;/a&gt; with improved accuracy. My first assumption was that it must also use those magical Argyll values. Imagine my surprise when I looked at the tag data and saw that they were a completely different set of values. And imagine my further surprise when I ran them through Elle’s xicclu test and found that they were also well-behaved.&lt;p&gt;That left me with two candidate sets of sRGB colorant values, and I simply had to know which was right. The rabbit-hole deepens…&lt;h3&gt;Better-Behaved?&lt;/h3&gt;&lt;p&gt;If there are two well-behaved sRGB-like profiles with different colorant values, they must not be as difficult to come by as I originally thought. Elle describes creating well-behaved profiles as a process of calculating the most correct values for the colorspace and then ‘nudging’ them so that their rounding errors when converted to the ICC s15Fixed16Number format balance out. I learned that this is actually a very simple process and that testing for well-behavedness is a matter of simple arithmetic. 
Let’s start with the actual colorant tag values stored in the three profiles I was examining.&lt;/p&gt;&lt;pre style="font-size: 75%;"&gt;      |     HP/Microsoft     |        sRGBz         |      ArgyllCMS     
-------------------------------------------------------------------------
      |     X      Y      Z  |     X      Y      Z  |     X      Y      Z
Red   |  6FA2   38F5   0390  |  6FA1   38F6   0391  |  6FA0   38F5   0390
Green |  6299   B785   18DA  |  6297   B787   18DA  |  6297   B787   18D9
Blue  |  24A0   0F84   B6CF  |  249E   0F83   B6C2  |  249F   0F84   B6C4
Sum   |  F6DB   FFFE   D339  |  F6D6  10000   D32D  |  F6D6  10000   D32D
&lt;/pre&gt;&lt;p&gt;I’ve kept the values in hex for now, because 1) I read them out with a hex editor and 2) it’s easier to see what’s up when looking at the integer representation of the numbers. What the table above shows is the XYZ values for the Red, Green, and Blue primaries stored in each of the three profiles I examined. I also included a row that shows the sum of the X, Y, and Z values for the three color channels. One thing should stand out immediately: the sRGBz and ArgyllCMS color values are different, but their sums are the same. What’s not obvious from the table is what those sums represent.&lt;p&gt;First, let me explain how the color values are stored in an ICC profile. The ICC spec defines the s15Fixed16Number format for storing XYZ (and other) values. In that format, 16 bits are allocated to the signed integer portion of the number and 16 bits are allocated to the fractional part of the number. The conversion between floating-point decimal and the fixed-point format is simply to multiply by 2&lt;sup&gt;16&lt;/sup&gt; and round to the nearest integer. Conversion back to floating-point decimal is done by dividing by 2&lt;sup&gt;16&lt;/sup&gt; (65536). In that format, 1.0 is represented by 0x00010000 because its integer part is 1 and it has no fractional part.&lt;p&gt;This is in contrast to the response16Number format used to store the TRC points we examined in the last post. In that format, 16 bits are used to represent the full range of 0-1, inclusive. For that number format, the divisor is 2&lt;sup&gt;16&lt;/sup&gt;-1 (65535), so that 1.0 is represented by 0xFFFF. This has come up as a point of confusion in some things I’ve read, so I thought I’d clear that up.&lt;p&gt;Now back to the values in the table…&lt;p&gt;Every developer who works with RGB colors knows that in 8-bit color, black is [0,0,0], full red is [255,0,0], etc. We also know that if you add full red, full green, and full blue together, you get [255,255,255], which is white. 
Things work the same in XYZ. Adding the three primary colors together at their full intensity (as defined within the colorspace) will give you white (also as defined in the colorspace). In most practical colorspaces, the XYZ values are normalized so that white has a Y value of 1.0. In some representations, you may see the XYZ numbers scaled up to a range of 0-100, but both the ICC and sRGB specs declare that the nominal range is 0-1.&lt;p&gt;Knowing that white should have a Y value of 1.0 and that 1.0 in s15Fixed16Number format is 0x10000, you should see an immediate problem with the values in the HP/Microsoft sRGB profile: their Y values sum to less than 1.0, meaning they’re scaled improperly.&lt;p&gt;Its X and Z values are wrong as well. The ICC V2 spec requires that all profiles have their color values adapted to the D50 whitepoint, which allows for simple translation between colorspaces. Since XYZ conversion to and from RGB is whitepoint-dependent, using a common whitepoint for all profiles makes it easy for software to implement that translation. As an aside, let me point out that the RGB values we are referring to here are linear RGB values. That is, the red, green, and blue values that are the output of the TRC that undoes their stored gamma correction. Those values are also normalized to the range 0-1 for computation.&lt;p&gt;The ICC spec is explicit about the XYZ value to use for the D50 whitepoint, giving it a value of [X=0.9642,Y=1,Z=0.8249]. When converted to s15Fixed16Number format, that value becomes [X=0xF6D6,Y=0x10000,Z=0xD32D], which is the exact value stored as the Profile Illuminant in the header of &lt;i&gt;every&lt;/i&gt; V2 ICC profile.&lt;p&gt;If you refer back to the table of values from the three profiles I examined, you will find that the sRGBz and ArgyllCMS primary colorant values sum to exactly the value of the D50 Illuminant given by the ICC. Put simply, &lt;i&gt;that’s&lt;/i&gt; what makes a profile well-behaved. 
And you can see that the HP/Microsoft profile is quite far off from that value, which is why it’s no good.&lt;p&gt;Knowing that making a profile well-behaved is simply a matter of normalizing the primary colors so that they sum at full intensity to make white, it’s easy to see how we ended up with two different sRGB-like profiles that are both well-behaved. But that still leaves the question of which is right.&lt;h3&gt;In Which I Learn They’re Both Wrong&lt;/h3&gt;&lt;p&gt;It’s been a while since I linked to Nine Degrees Below, so let me start this section with another link to Elle’s research into the proper color &lt;a href="https://ninedegreesbelow.com/photography/srgb-color-space-to-profile.html" target="_blank"&gt;values to use for sRGB&lt;/a&gt;. In that article, Elle rounds up every possible definition of every color referenced in the sRGB spec, does every bit of mathematical wrangling imaginable, and comes up with a final set of numbers that are very close to the ones that ArgyllCMS has in its reference sRGB profile. Score one for Argyll.&lt;p&gt;I decided to do a similar exercise and find my own answer to compare to the others. One problem with making things match the sRGB spec is that the spec itself isn’t published freely. If you want to read the actual spec, it will cost you 170 Swiss francs to buy it from the &lt;a href="https://webstore.iec.ch/publication/6169" target="_blank"&gt;IEC web store&lt;/a&gt;.&lt;p&gt;Fortunately, there are enough references available elsewhere that I believe we can put together an accurate picture of the spec without ponying up. I like to save my francs for Swiss chocolate and wine, thank you very much. I can enjoy those while I read the &lt;a href="https://en.wikipedia.org/wiki/SRGB" target="_blank"&gt;Wikipedia entry on sRGB&lt;/a&gt;.&lt;p&gt;At this point, it’s worth considering exactly what numbers we’re trying to calculate. 
Depending on where you look, you may see the colorant values described in one of two ways:&lt;ol&gt;&lt;li&gt;They represent the direct XYZ translation of the three primary colors (red, green, and blue) at their full intensity under the specified illuminant.&lt;/li&gt;&lt;li&gt;They make up a matrix which can be used to translate any set of [R,G,B] values to their corresponding [X,Y,Z] values.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;In reality, both are correct. But I think for our purposes, it’s important to focus on the second definition. The reason that’s important is that the colorant values stored in the ICC profile are used in exactly that way. They’re also used in another, related way. The matrix created from those XYZ colorant values can be inverted to create the matrix that translates from XYZ back to RGB. This will become very important later, so I wanted to mention it now and let it soak in a bit.&lt;p&gt;If you use a tool like the &lt;a href="http://www.color.org/profileinspector.xalter" target="_blank"&gt;Profile Inspector&lt;/a&gt; available on the ICC site, it reinforces the first definition I gave. It shows the XYZ values converted to decimal, the calculated x and y coordinates for that color, and a nice chromaticity diagram with the color plotted on it. Here’s the rXYZ tag information from the Argyll sRGB profile.&lt;p&gt;&lt;img width="511" height="403" title="profileinspector" style="display: inline; background-image: none;" alt="profileinspectorss" src="/blog/posts/files/3ce75287-21b7-4904-95b3-3b627f049045.png" border="0"&gt;&lt;p&gt;&lt;br&gt;&lt;p&gt;&lt;a href="https://www.sno.phy.queensu.ca/~phil/exiftool/" target="_blank"&gt;Exiftool&lt;/a&gt;, on the other hand, presents the data using the second definition. Here’s the relevant output from the same profile:&lt;pre style="font-size: 75%;"&gt;Red Matrix Column:   0.43604 0.22249 0.01392
Green Matrix Column: 0.38512 0.71690 0.09706
Blue Matrix Column:  0.14305 0.06061 0.71393&lt;/pre&gt;&lt;p&gt;You can see that the XYZ values are the same in each, and that makes perfect sense. If you make a matrix using the columns given by Exiftool&lt;pre style="font-size: 75%;"&gt;0.43604 0.38512 0.14305
0.22249 0.71690 0.06061
0.01392 0.09706 0.71393&lt;/pre&gt;&lt;p&gt;and then multiply the linear value for pure red [1,0,0] by that matrix, you get the left column back, giving you the same XYZ value shown for the red primary in the Profile Inspector. Again, it’s the same thing… but the matrix usage is the more important definition.&lt;p&gt;So, that leaves the question: how do we calculate that matrix, given the values in the sRGB spec?&lt;p&gt;That answer has two parts. For the first part, we must get the matrix for converting RGB to XYZ using sRGB’s native illuminant/whitepoint, which is D65. Then, because the colors in an ICC profile must be given relative to the D50 illuminant, we must adapt that matrix from D65-relative values to D50-relative values. Bruce Lindbloom has a reference on the basic theory of &lt;a href="http://www.brucelindbloom.com/Eqn_ChromAdapt.html" target="_blank"&gt;Chromatic Adaptation&lt;/a&gt; as well as some different adaptation matrices on his site. But I’ll caution you not to use his pre-calculated matrix for D65-&amp;gt;D50; it’s wrong for our purposes.&lt;p&gt;Let’s start with the matrix for calculating RGB-&amp;gt;XYZ under D65. The Wikipedia article on sRGB has the matrix printed right in it. It also points out explicitly that the values listed are the exact ones in the sRGB spec. Again, I don’t have the actual spec, but I have no reason to doubt the veracity of that statement. The sRGB spec is reportedly full of exact numbers, rounded to 4 decimal places.&lt;p&gt;Well, like Elle, I started with the assumption that more precision is better, and I didn’t like the look of those imprecise numbers from the spec. So, like Elle, I tried to calculate my own more precise matrix using the published x,y values of the primaries. That effort was a complete failure. What I mean is, while I succeeded in the calculation, if I then inverted my matrix to create the XYZ-&amp;gt;RGB matrix, it didn’t match the one in the spec to the 4 decimal places it has listed. 
It turns out, the best way to get the correct (according to the spec) inverse matrix is to use the explicitly rounded values given in the RGB-&amp;gt;XYZ matrix.&lt;pre style="font-size: 75%;"&gt;0.4124  0.3576  0.1805
0.2126  0.7152  0.0722
0.0193  0.1192  0.9505&lt;/pre&gt;&lt;p&gt;Invert that, and you get&lt;pre style="font-size: 75%;"&gt; 3.2406254773200500 -1.5372079722103200 -0.4986285986982480
-0.9689307147293190  1.8757560608852400  0.0415175238429540
 0.0557101204455106 -0.2040210505984870  1.0569959422543900&lt;/pre&gt;&lt;p&gt;Which matches the spec (as described by Wikipedia) very nicely, and with lots of decimal places, if that’s your thing. Round that back to 4 decimal places, like so, to match the spec again:&lt;pre style="font-size: 75%;"&gt; 3.2406 -1.5372 -0.4986
-0.9689  1.8758  0.0415
 0.0557 -0.2040  1.0570&lt;/pre&gt;
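&lt;p&gt;If you’d like to double-check the inversion yourself, a few lines of numpy will do it, using the matrices exactly as printed above:&lt;/p&gt;

```python
import numpy as np

# RGB->XYZ matrix for D65, exactly as printed in the sRGB spec (4 decimal places).
rgb_to_xyz = np.array([
    [0.4124, 0.3576, 0.1805],
    [0.2126, 0.7152, 0.0722],
    [0.0193, 0.1192, 0.9505],
])

# Inverting it yields the full-precision XYZ->RGB matrix...
xyz_to_rgb = np.linalg.inv(rgb_to_xyz)

# ...which rounds back to the 4-decimal XYZ->RGB matrix the spec lists.
expected = np.array([
    [ 3.2406, -1.5372, -0.4986],
    [-0.9689,  1.8758,  0.0415],
    [ 0.0557, -0.2040,  1.0570],
])
assert np.allclose(xyz_to_rgb.round(4), expected)
```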
&lt;p&gt;Invert that, and you get
&lt;pre style="font-size: 75%;"&gt;0.4124103360777000  0.3575962178119540  0.1804991017305060
0.2126157251481260  0.7151958779229800  0.0722134074030765
0.0193021307575116  0.1191881265507680  0.9504992763896300&lt;/pre&gt;&lt;p&gt;Which, again, rounds to match the spec at 4 decimal places. These, I’m confident, are the correct numbers for sRGB at D65, which just leaves the D65-&amp;gt;D50 adaptation matrix to work out.&lt;p&gt;This, again, seems like a place where more precision would pay off, but in fact, there is a value listed in the spec, rounded to 4 decimal places, that is perfect for this use. That value given for D65 is [X=0.9505,Y=1,Z=1.0890]. We also have a standard XYZ value for D50, given by the ICC spec. That value is [X=0.9642,Y=1,Z=0.8249].&lt;p&gt;As an aside, you may recall that the actual D50 value stored in the profile header is in s15Fixed16Number format and is, in hex [X=0xF6D6,Y=0x10000,Z=0xD32D]. Converted back to floating-point decimal, that value is [X=0.964202880859375,Y=1,Z=0.8249053955078125]. If you want to be extra precise, that value is also acceptable to use. It worked out that when creating the adaptation matrices, it didn’t matter which number I used. I got the same results once the quantization to s15Fixed16Number format was done for the final calculated values. For the calculations shown below, I used the rounded value published in the ICC spec.&lt;p&gt;Using the D50 XYZ value from the ICC spec, the D65 XYZ value from the sRGB spec and the Bradford cone response matrix given by Bruce Lindbloom, we get the following values for the D65-&amp;gt;D50 adaptation matrix:&lt;pre style="font-size: 75%;"&gt; 1.0478414713468100  0.0228955556744975  -0.0502009864000404
 0.0295477450604968  0.9905065286192130  -0.0170722316797199
-0.0092509594572860  0.0150723678359253   0.7517177861599870&lt;/pre&gt;
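&lt;p&gt;That calculation is easy to reproduce with numpy if you want to follow along. This sketch uses the Bradford cone response matrix from Bruce Lindbloom’s site together with the two whitepoints above; note that, by construction, the resulting matrix maps D65 exactly onto D50:&lt;/p&gt;

```python
import numpy as np

# Bradford cone response matrix, as given on Bruce Lindbloom's site.
bradford = np.array([
    [ 0.8951,  0.2664, -0.1614],
    [-0.7502,  1.7135,  0.0367],
    [ 0.0389, -0.0685,  1.0296],
])

d65 = np.array([0.9505, 1.0, 1.0890])  # whitepoint from the sRGB spec
d50 = np.array([0.9642, 1.0, 0.8249])  # whitepoint from the ICC spec

# Scale the cone responses of the source white to those of the
# destination white, then transform back out of cone space.
scale = np.diag((bradford @ d50) / (bradford @ d65))
adapt = np.linalg.inv(bradford) @ scale @ bradford

# By construction, the matrix maps the D65 whitepoint exactly onto D50.
assert np.allclose(adapt @ d65, d50)
```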
&lt;p&gt;Notice that the matrix I calculated is different than the one Bruce gives on his site:
&lt;pre style="font-size: 75%;"&gt; 1.0478112  0.0228866 -0.0501270
 0.0295424  0.9904844 -0.0170491
-0.0092345  0.0150436  0.7521316&lt;/pre&gt;&lt;p&gt;They’re quite close, but he used D65 and D50 values from a different source. It comes down to rounding differences, but remember, we’re following exact specs, so we want to round the same way they do.&lt;p&gt;And finally, if we multiply our RGB-&amp;gt;XYZ matrix by the adaptation matrix, we get the final adapted values:&lt;pre style="font-size: 75%;"&gt;0.4360285388823030  0.3850990539931360  0.1430724071245600
0.2224376839759750  0.7169415328858720  0.0606207831381531
0.0138974429946207  0.0970763744845987  0.7139261825207810&lt;/pre&gt;
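&lt;p&gt;If you want to verify the fixed-point conversion and the well-behaved sum for yourself, that’s just a few more lines of numpy:&lt;/p&gt;

```python
import numpy as np

def s15f16(x):
    # Encode a value as an ICC s15Fixed16Number: multiply by 2^16
    # and round to the nearest integer.
    return round(x * 65536)

# The final adapted RGB->XYZ matrix; each *column* is one primary's
# XYZ colorant value.
m = [
    [0.4360285388823030, 0.3850990539931360, 0.1430724071245600],
    [0.2224376839759750, 0.7169415328858720, 0.0606207831381531],
    [0.0138974429946207, 0.0970763744845987, 0.7139261825207810],
]

fixed = np.array([[s15f16(v) for v in row] for row in m])

# Red colorant (first column) in hex.
print([format(v, '04X') for v in fixed[:, 0]])  # → ['6FA0', '38F2', '038F']

# Well-behaved check: the colorants sum, component by component, to
# exactly the D50 profile illuminant [0xF6D6, 0x10000, 0xD32D].
assert list(fixed.sum(axis=1)) == [0xF6D6, 0x10000, 0xD32D]
```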
&lt;p&gt;Converted to s15Fixed16Number format, written in hex, and transposed to match the profile layout I used earlier, those numbers look like this:
&lt;pre style="font-size: 75%;"&gt;      |     X      Y      Z
Red   |  6FA0   38F2   038F
Green |  6296   B789   18DA
Blue  |  24A0   0F85   B6C4
Sum   |  F6D6  10000   D32D&lt;/pre&gt;&lt;p&gt;You’ll note that the hex values exactly total those of the D50 Profile Illuminant, so these values will create a well-behaved profile. These values, however, do not match &lt;a href="https://youtu.be/KThlYHfIVa8" target="_blank"&gt;any other profile I’ve seen&lt;/a&gt;.&lt;p&gt;I also found that the D65 whitepoint stored in most profiles doesn’t match the sRGB spec value. The XYZ values given in the spec, again, are [X=0.9505,Y=1,Z=1.0890], which in s15Fixed16Number hex are [X=F354,Y=10000,Z=116C9]. All sRGB profiles I’ve examined (if they define a D65 whitepoint) have had the following value for the ‘wtpt’ tag [X=F351,Y=10000,Z=116CC], which works out to [X=0.9504547119140625,Y=1,Z=1.08905029296875].&lt;p&gt;I’m convinced my numbers are the correct colorant and whitepoint values for sRGB as written in the actual spec. But you may be reluctant to take my word for it, especially given that there are so many other profiles out there with different values. Fortunately, I have a bit of official documentation on my side.&lt;p&gt;While trying to locate the most correct and precise definition of the D65 and colorant values available, I ran across a document entitled “How to interpret the sRGB color space (specified in IEC 61966-2-1) for ICC profiles”. I wonder what it’s about....&lt;p&gt;That document is published on the ICC website, under its &lt;a href="http://www.color.org/chardata/rgb/srgb.xalter" target="_blank"&gt;information page for sRGB&lt;/a&gt;.&amp;nbsp; For some common colorspaces, the ICC publishes spec extension documents that describe how to treat that specific colorspace in the context of a profile. That document is linked at the bottom of the page, under Hints for Profile Makers.&lt;p&gt;If you read through that document, you will find the same rules and numbers I used, extracted from the sRGB spec (which I assume the ICC has an actual copy of). 
For example, section A7 contains the exact XYZ-&amp;gt;RGB matrix I listed above. Theirs has more decimal places than the Wikipedia page but fewer than mine. You’ll also find, under section B2, the exact recommended D65-&amp;gt;D50 Bradford adaptation matrix. Theirs only matches mine to 5 decimal places, but I think mine came out better, because… hold the phone… they included the actual suggested ICC profile matrix, with many decimal places of precision. You’ll find that is also very close to mine. In fact, when converted to s15Fixed16Number format in hex as I’ve done with the others, those numbers are:&lt;pre style="font-size: 75%;"&gt;      |     X      Y      Z
Red   |  6FA0   38F2   038F
Green |  6296   B78A   18DA
Blue  |  24A0   0F85   B6C4
Sum   |  F6D6  10001   D32D&lt;/pre&gt;&lt;p&gt;You can see they are identical to mine from above with the exception that the Green Y value came out 1 higher, making the sum 1 too high. That’s within nudging distance of being well-behaved, but I believe that if their adaptation matrix had been a bit better, the nudging wouldn’t have been required; it wasn’t with mine.&lt;p&gt;The existence of that document on the ICC site raises the question: are they using those values in their reference sRGB profiles? The answer is no. No, they are not. &lt;a href="http://www.color.org/srgbprofiles.xalter#v2" target="_blank"&gt;Their profiles&lt;/a&gt;, as of today, are still using the same busted numbers from the old HP/Microsoft profile. I don’t know why.&lt;p&gt;So, that’s the mathematical explanation of how I arrived at my profile color values and a bit of evidence to support their validity. But maybe you’re still not convinced. There are, after all, two different definitions of the D65 value and two different versions of the primary color values given on the Wikipedia page for sRGB. Wouldn’t a profile created with those other numbers also comply with the spec? Well, no, actually. And I’ll do the math to show you why. But first, a bit of history.&lt;h3&gt;The Life of sRGB&lt;/h3&gt;&lt;p&gt;sRGB started its life as a derivative of the &lt;a href="https://en.wikipedia.org/wiki/Rec._709" target="_blank"&gt;Rec. 709&lt;/a&gt; HDTV standard.  The authors, who came from HP and Microsoft, took the primaries/gamut and whitepoint/color temperature from the Rec. 709 standard, modified the gamma curve to more closely match the response curve of CRT-based computer displays, and created their own draft spec.
That draft is still &lt;a href="https://www.w3.org/Graphics/Color/sRGB" target="_blank"&gt;available online&lt;/a&gt; today.&lt;p&gt;Despite the large red warning at the top of that page that explains it is obsolete, you will still find values from that draft spec living on in modern software. This is, no doubt, partly a result of the fact that the draft is freely available and the actual spec has to be purchased.&lt;p&gt;Basically, what happened was that the draft authors rounded most of the numbers they used when they published them. This, in turn, led to inaccuracy in several parts of the draft spec. When the draft was submitted to the IEC for standardization, it went through a process of refinement wherein that inaccuracy was resolved before the spec became final. In many cases, the resolution was a slight tweak of the numbers to cancel out the rounding errors or to bring things back into alignment.&lt;p&gt;One such example of this is nicely documented on the sRGB Wikipedia page, in the section entitled “Theory of the transformation”. That section describes how the original intended values for the response curve produced numbers with lots of decimal places. Those numbers were rounded in the draft spec, creating a break in the curve at the transition from its linear portion to the actual gamma curve. The numbers were then adjusted for the final spec to resolve the break. The adjustments fixed the error in the sense that the two parts of the curve were made to meet up again, but they also changed the curve segments such that although they meet, the &lt;i&gt;slope&lt;/i&gt; of the lines is no longer continuous as was originally intended.&lt;p&gt;That refinement is a bit of a recurring theme in the sRGB spec, where the &lt;i&gt;intended&lt;/i&gt; value and the actual value published are different. This happened with the definitions of the color values and whitepoint as well. 
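&lt;p&gt;That slope break is easy to verify numerically. Here’s a quick sketch (Python; variable names are mine) using the final spec’s well-known encode constants: the linear segment ends at 0.0031308 with slope 12.92, and the power segment is 1.055&amp;middot;L&lt;sup&gt;1/2.4&lt;/sup&gt; &amp;minus; 0.055. It evaluates both segments and their derivatives at the junction:

```python
# Final sRGB spec constants for the encode (gamma) function
t = 0.0031308          # end of the linear segment

v_lin = 12.92 * t                          # linear segment value at the junction
v_pow = 1.055 * t ** (1 / 2.4) - 0.055     # power segment value at the junction

s_lin = 12.92                              # slope of the linear segment
s_pow = (1.055 / 2.4) * t ** (1 / 2.4 - 1) # slope of the power segment

print(v_lin, v_pow)    # the two values agree to about 1e-6
print(s_lin, s_pow)    # the slopes do not: 12.92 vs roughly 12.7
```

&lt;p&gt;The values meet, but the derivatives differ by about 0.2, which is exactly the remaining C&lt;sup&gt;1&lt;/sup&gt; discontinuity described above.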
There is a note in the section describing the XYZ-&amp;gt;sRGB transformation that reads&lt;p&gt;“The numerical values below match those in the official sRGB specification, which corrected small rounding errors in the original publication by sRGB's creators”&lt;p&gt;Essentially, what that means is that in the final spec, the XYZ values for the D65 illuminant and the XYZ transformation matrices have been adjusted to compensate for the 4-decimal-place rounding that was used on the original draft spec numbers. If you use those rounded numbers from the draft, you’ll get incorrect results. If you use the &lt;i&gt;intended&lt;/i&gt; numbers, you’ll get results that are mathematically correct but are incorrect according to the published spec.&lt;p&gt;And that leads us back to the colorant and whitepoint tags in the ArgyllCMS reference sRGB profile. I’ll do the math that leads to those numbers so I can show you where they deviate from the standard.&lt;h3&gt;How Not to Create an sRGB ICC Profile&lt;/h3&gt;&lt;p&gt;I mentioned earlier that my quest for additional precision beyond that given in the sRGB spec led to a dead-end. I’ll go through that path again to show why. Let’s start by assuming that the XYZ values given for both the D65 illuminant and for the primary colors (by way of the RGB-&amp;gt;XYZ transformation matrix) are not good enough. That leaves us with the alternate definitions of those values, which were copied directly from the Rec. 709 standard. They are defined on the sRGB Wikipedia page as follows:&lt;pre style="font-size: 75%;"&gt;  Red     Green   Blue    White(D65)
x 0.6400  0.3000  0.1500  0.3127
y 0.3300  0.6000  0.0600  0.3290
Y 0.2126  0.7152  0.0722  1.0000&lt;/pre&gt;&lt;p&gt;Cross-referencing the &lt;a href="https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.709-6-201506-I!!PDF-E.pdf" target="_blank"&gt;Rec. 709 standard&lt;/a&gt;, which &lt;i&gt;is&lt;/i&gt; freely available, the red, green, blue and whitepoint x and y values all match, except Rec. 709 only defines the color chromaticity coordinates to 2 decimal places (those extra 0’s are filler). The Y values given do not appear in the Rec. 709 spec, and that’s because 1) they can be calculated from x and y if you know the whitepoint and 2) those values are rounded to 4 decimal places, which makes them less precise than they could be if we calculated them.&lt;p&gt;Bruce Lindbloom has tons of useful color-related math on his site, and I referred to his Chromatic Adaptation page/formulas/matrices earlier. This time I will refer to his page on &lt;a href="http://www.brucelindbloom.com/Eqn_RGB_XYZ_Matrix.html" target="_blank"&gt;generating XYZ/RGB matrices&lt;/a&gt;. &lt;p&gt;Remember, the RGB-&amp;gt;XYZ matrix and the primaries are the same thing, so if we get the matrix, we’ll have the precise XYZ values for our color primaries. The formula on that page starts by converting each xy color to unscaled XYZ, by setting its Y value to 1 and calculating X and Z from there. It then uses the whitepoint, which we know should have a Y value of 1, to compute a scaling factor (the S vector), which defines the final component colors relative to that white. To get the whitepoint’s XYZ value, we can use &lt;a href="http://www.brucelindbloom.com/Eqn_xyY_to_XYZ.html" target="_blank"&gt;this formula&lt;/a&gt;, or we can use the simplified version on the matrix calculation page since we know the Y value is 1. That gives us an XYZ value for D65 of [X=0.950455927051672,Y=1,Z=1.08905775075988]. 
Now that’s some decimal places!&lt;p&gt;Using that value to compute the RGB-&amp;gt;XYZ matrix, we get the following:&lt;pre style="font-size: 75%;"&gt;0.4123907992659590 0.3575843393838780 0.1804807884018340
0.2126390058715100 0.7151686787677560 0.0721923153607337
0.0193308187155918 0.1191947797946260 0.9505321522496610&lt;/pre&gt;
&lt;p&gt;And rounding that to 4 decimal places, we get the exact numbers listed in the sRGB spec (I’ve been informed by a Wikipedia author):
&lt;pre style="font-size: 75%;"&gt;0.4124  0.3576  0.1805
0.2126  0.7152  0.0722
0.0193  0.1192  0.9505&lt;/pre&gt;
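&lt;p&gt;That derivation can be reproduced in a few lines. Here’s a sketch (Python with NumPy; the variable names are mine) following Lindbloom’s method: build the unscaled XYZ columns from the xy chromaticities, solve for the S vector using the whitepoint, and scale the columns:

```python
import numpy as np

# Rec. 709 xy chromaticities for R, G, B, plus the D65 white point
primaries = [(0.64, 0.33), (0.30, 0.60), (0.15, 0.06)]
wx, wy = 0.3127, 0.3290

def xy_to_XYZ(x, y):
    # xyY to XYZ with Y fixed at 1
    return [x / y, 1.0, (1.0 - x - y) / y]

M = np.array([xy_to_XYZ(x, y) for x, y in primaries]).T   # unscaled columns
W = np.array(xy_to_XYZ(wx, wy))   # D65 = [0.950456..., 1, 1.089058...]

S = np.linalg.solve(M, W)         # per-primary scale factors
M_rgb_to_xyz = M * S              # scale each column by its S component
print(np.round(M_rgb_to_xyz, 4))  # rounds to the matrix from the spec
```

&lt;p&gt;Run as-is, this reproduces the full-precision matrix above, and rounding it to 4 decimal places gives the spec values.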
&lt;p&gt;Plus, we have extra precision, and we love extra precision.  Everything is awesome!
Now, let’s create an extra-precise Bradford adaptation matrix to go from our extra-precise definition of D65 to the ICC’s specified D50 value.
Here’s the adaptation matrix:
&lt;pre style="font-size: 75%;"&gt; 1.0478860032225500  0.0229187651747795 -0.0502160953117330
 0.0295817824980035  0.9904835184905490 -0.0170787077044827
-0.0092518808392088  0.0150726074870313  0.7516781336176040&lt;/pre&gt;
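&lt;p&gt;Here’s a sketch of that adaptation calculation (Python with NumPy), assuming the standard Bradford cone matrix and the ICC’s D50 PCS white of [0.9642, 1.0, 0.8249]; with those inputs it comes out very close to the matrix above:

```python
import numpy as np

# Bradford cone response matrix used for ICC-style chromatic adaptation
M_B = np.array([[ 0.8951,  0.2664, -0.1614],
                [-0.7502,  1.7135,  0.0367],
                [ 0.0389, -0.0685,  1.0296]])

# extra-precise D65 from the xy chromaticities, and the ICC D50 PCS white
d65 = np.array([0.3127 / 0.3290, 1.0, (1 - 0.3127 - 0.3290) / 0.3290])
d50 = np.array([0.9642, 1.0, 0.8249])

# ratio of destination to source cone responses, applied in cone space
rho = (M_B @ d50) / (M_B @ d65)
M_adapt = np.linalg.inv(M_B) @ np.diag(rho) @ M_B
print(M_adapt)
```

&lt;p&gt;By construction, this matrix maps the D65 white exactly onto the D50 white, which is what keeps the adapted profile well-behaved.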
&lt;p&gt;And the final D50-adapted RGB-&gt;XYZ matrix:
&lt;pre style="font-size: 75%;"&gt; 0.4360412516160510  0.3851129107981560  0.1430458375857940
 0.2224845402294770  0.7169050786084580  0.0606103811620653
 0.0139201874713754  0.0970672386971240  0.7139125738315010&lt;/pre&gt;
&lt;p&gt;Converted to profile format, it’s an exact match for Argyll’s sRGB:
&lt;pre style="font-size: 75%;"&gt;      |     X      Y      Z
Red   |  6FA0   38F5   0390
Green |  6297   B787   18D9
Blue  |  249F   0F84   B6C4
Sum   |  F6D6  10000   D32D&lt;/pre&gt;
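&lt;p&gt;The conversion to profile format is plain fixed-point quantization. Here’s a sketch (Python; the helper name is mine) applied to the red colorant column of the adapted matrix above, and to the spec’s rounded D65 whitepoint:

```python
def s15f16_hex(v):
    # s15Fixed16Number: multiply by 2**16 and round to nearest integer
    return format(int(round(v * 65536)), '04X')

# red colorant column of the D50-adapted matrix above
red = [0.4360412516160510, 0.2224845402294770, 0.0139201874713754]
print([s15f16_hex(v) for v in red])      # ['6FA0', '38F5', '0390']

# the spec's rounded D65 whitepoint gives the values quoted earlier
print([s15f16_hex(v) for v in [0.9505, 1.0, 1.0890]])   # ['F354', '10000', '116C9']
```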
&lt;p&gt;It’s well-behaved, it’s precise (or at least it &lt;em&gt;was&lt;/em&gt; until we quantized it for the profile), and we got it using numbers from our telephone-game version of the spec.  So what’s wrong with it?
Well, let’s back up a couple of steps to the unrounded, unadapted D65 RGB-&amp;gt;XYZ matrix. If we invert that to create the XYZ-&amp;gt;RGB matrix, this is what we get:
&lt;pre style="font-size: 75%;"&gt; 3.2409699419045200 -1.5373831775700900 -0.4986107602930030
-0.9692436362808800  1.8759675015077200  0.0415550574071756
 0.0556300796969936 -0.2039769588889760  1.0569715142428800&lt;/pre&gt;
&lt;p&gt;And here again is the XYZ-&amp;gt;RGB matrix from the spec – as described to me by a little birdy.
&lt;pre style="font-size: 75%;"&gt; 3.2406 -1.5372 -0.4986
-0.9689  1.8758  0.0415
 0.0557 -0.2040  1.0570&lt;/pre&gt;
&lt;p&gt;Notice that these matrices no longer agree to the 4 decimal places of precision defined in the spec.  If we go back and look at the &lt;em&gt;draft&lt;/em&gt; spec, we can see that it lists a different set of rounded numbers, which do match:
&lt;pre style="font-size: 75%;"&gt; 3.2410 -1.5374 -0.4986
-0.9692  1.8760  0.0416
 0.0556 -0.2040  1.0570&lt;/pre&gt;
&lt;p&gt;And therein lies the problem.  These rounded numbers from the draft spec don’t invert to create the correct RGB-&amp;gt;XYZ matrix.  Here’s that one:
&lt;pre style="font-size: 75%;"&gt; 0.4123808838269000  0.3575728355732480  0.1804522977447920
 0.2126198631048980  0.7151387878413210  0.0721499433963131
 0.0193434956789248  0.1192121694056360  0.9505065664127130&lt;/pre&gt;&lt;p&gt;We have a round-trip failure, caused by the lack of precision. To fix that, the spec (which I once saw a blurry photo of, I swear) was modified and the values adjusted so that at 4 decimal places of precision, each matrix inverts to the other. Defining the RGB/XYZ matrices such that they work with only 4 decimals of precision has another benefit that we didn’t get to see. The D65 XYZ values I used were carried through all calculations with full double float precision as well, so there was no opportunity for our whitepoint to throw the other colors off balance. Without that precision, it’s difficult to maintain balance, which I assume is how the HP/Microsoft sRGB profile ended up so bad.&lt;p&gt;Using the rounded XYZ values for the primaries, you’ll find that they add to exactly the rounded value given for D65.&lt;pre style="font-size: 75%;"&gt;      X       Y       Z
Red   0.4124  0.2126  0.0193
Green 0.3576  0.7152  0.1192
Blue  0.1805  0.0722  0.9505
White 0.9505  1.0000  1.0890&lt;/pre&gt;
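&lt;p&gt;That balance is easy to check; summing the columns of the rounded primaries reproduces the rounded whitepoint exactly. A quick sketch:

```python
# rounded XYZ primaries from the table above
red   = (0.4124, 0.2126, 0.0193)
green = (0.3576, 0.7152, 0.1192)
blue  = (0.1805, 0.0722, 0.9505)

# summing per column recovers the rounded D65 white exactly
white = tuple(round(sum(v), 4) for v in zip(red, green, blue))
print(white)   # (0.9505, 1.0, 1.089)
```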
&lt;p&gt;This creates automatic balance, even with a low level of precision, which was the intent. You may have noticed that this required rounding the Z value of the whitepoint in the wrong direction. The more precise calculation of the whitepoint we made above gave D65 a Z value of 1.08905775075988, which should have rounded to 1.0891. Oddly enough, the D65 Z value is listed both ways in the draft sRGB spec. But it works out that rounding it down to 1.0890 makes everything work better, so that’s what ended up in the final spec (I think I overheard a guy mutter to himself on the bus one time).&lt;p&gt;And now I’ll do one final conversion to prove these rounded numbers are the bestest: let’s convert them to xyY, and see if they match the intended Rec. 709 colors.
&lt;pre style="font-size: 75%;"&gt;      x                  y                   Y     
Red   0.640074499456775  0.329970510631693   0.2126
Green 0.3                0.6                 0.7152
Blue  0.150016622340426  0.0600066489361702  0.0722
White 0.312715907221582  0.329001480506662   1.0&lt;/pre&gt;&lt;p&gt;Sure enough, round those to the requisite 2 decimal places for colors and 4 for white, and they match the spec values exactly.&lt;p&gt;So what we learned is, if you use the xy color values inherited from Rec 709, you’ll match the original intent of the draft spec, but you won’t match the actual final spec. For that, you must use its final XYZ numbers with their intentional imprecision. That’s how I got my numbers, and I’m stickin’ to ‘em.&lt;p&gt;That leaves just one step in my journey to the perfect compact sRGB profile. Come back for the final post, where I’ll compare my final profiles with some references and see which one gives the best bang for your buck.&lt;p&gt;&lt;br&gt;&lt;p&gt;&lt;br&gt;

&lt;h3 id="update"&gt;Update: A Bit of Perspective&lt;/h3&gt;&lt;p&gt;After I published this post, Graeme Gill (the creator of ArgyllCMS) commented here, and then he and Elle Stone and I had a bit of further discussion on the &lt;a href="https://discuss.pixls.us/t/feature-request-save-as-floating-point/5696/175" target="_blank"&gt;pixls.us forum&lt;/a&gt;.&lt;p&gt;Those exchanges led me to think a little more clarification is necessary on this topic. It turns out to be quite controversial, at least among people who have spent any significant time thinking about it (we are likely few in number). And I heard from a couple more people who found the topic interesting but didn’t have the background knowledge to follow everything completely. Talking to them gave me some better ideas for explaining the disagreement, so I thought I’d get them written down.&lt;p&gt;But I want to get two things straight before I get back into the details.&lt;p&gt;First, I think a bit of perspective is in order. I said that Graeme and Elle’s reference sRGB profiles (they match in primaries and whitepoint) were wrong. There are varying degrees of wrong, and I want to make it clear that although I think their profiles are wrong according to my interpretation of the sRGB spec, I believe they are correct according to their own interpretations. When I took an alternate approach to deriving the colorant and whitepoint values, using math that I believe to be 100% correct but with inputs I don’t agree are correct, I got numbers that match the ArgyllCMS sRGB profile. That contrasts starkly with the HP/Microsoft sRGB profile, which all three of us agree is wrong in a much more significant sense. I believe the ArgyllCMS sRGB profile is better described as a Rec. 709 profile with sRGB TRCs. 
Graeme argues those are the same thing.&lt;p&gt;Second, even though the raw numbers might have looked far apart when I presented them before, they were given with an absurd number of decimal places, especially given their target use. Once they end up in an ICC profile, most of those differences are quantized away.  Only the first log&lt;sub&gt;10&lt;/sub&gt;(2&lt;sup&gt;16&lt;/sup&gt;) ≈ 4.8 decimal places are accurately preserved in s15Fixed16Number format, which amounts to 4 decimal places reliably. In the end, the level of disagreement between our interpretations of the sRGB spec has a maximum net impact on our final profile values of 3/65536, or 0.0000457764 on any given number. So while we may argue our interpretations of the spec, we’re arguing over a difference that likely won’t ever be visible in any image.
&lt;br&gt;&lt;br&gt;
&lt;h3&gt;Update 2: I have finally seen the spec!&lt;/h3&gt;
I had a section here that further explored the different interpretations possible from the partial information publicly available, but it was a bit rambly and included more conjecture.  I have removed it in favor of a &lt;a href="/blog/post/what-makes-srgb-a-special-color-space"&gt;new post&lt;/a&gt;, which explains what I learned from reading the real deal.
&lt;br&gt;&lt;br&gt;
</description>
      <pubDate>Sun, 08 Apr 2018 05:39:42 -0700</pubDate>
      <a10:updated>2018-04-26T19:26:25-07:00</a10:updated>
    </item>
    <item>
      <guid isPermaLink="true">https://photosauce.net/blog/post/making-a-minimal-srgb-icc-profile-part-2-curve-the-curves</guid>
      <link>https://photosauce.net/blog/post/making-a-minimal-srgb-icc-profile-part-2-curve-the-curves</link>
      <a10:author>
        <a10:name />
      </a10:author>
      <title>Making a Minimal sRGB ICC Profile Part 2: Curve the Curves</title>
      <description>&lt;p&gt;In Part 1 of this series, I examined &lt;a href="https://www.facebook.com/notes/facebook-engineering/under-the-hood-improving-facebook-photos/10150630639853920" target="_blank"&gt;Facebook’s TinyRGB (c2) ICC profile&lt;/a&gt;, following on from the work Øyvind Kolås (Pippin) did in creating &lt;a href="http://pippin.gimp.org/sRGBz/" target="_blank"&gt;his sRGBz profile&lt;/a&gt;. I was able to trim an extra 68 bytes off that profile (making 100 bytes total reduction off TinyRGB) by careful packing of the data, and now I turn my attention to the tone reproduction curve (TRC) tags and their shared content.&lt;p&gt;In his sRGBz post, Pippin discusses the Facebook decision to use 26 points in their tone reproduction curve. The Facebook post explains that this was done because the linear part of the sRGB curve ends about 1/25&lt;sup&gt;th&lt;/sup&gt; of the way in, making that a natural place for the second TRC point to fall. In fact, the sRGB curve is defined as having a linear segment up to a value of precisely 0.04045, which is awfully close to 1/25. That makes a sensible place to start testing, but it seems they decided that was the magic number and went full speed ahead without bothering to check others.&lt;p&gt;The tricky thing about optimizing a point-based curve approximation for an ICC profile is that the curve points have to be spaced at even intervals. If we were allowed to space them arbitrarily, we could define the linear segment precisely with two points and then use as many points as we wanted to tune the curvy part of the curve. But with even spacing, options are much more limited, and the performance of curves with different numbers of points defined can be quite unpredictable. It makes sense, then, that the Facebook team would choose 26 as a starting point.&lt;p&gt;However, Pippin failed to find any compelling evidence that 26 is disproportionately better than other surrounding numbers. 
My check of their math results in the same conclusion, but I arrived at it in a different way, which I’ll be getting to. 26 points produce a decent curve, but in that size range, more is better and fewer is not necessarily a lot worse. What’s nice about a 26-point curve is that at 2 bytes per point, plus the 12-byte header, the curve is a nice even 64 bytes. And that’s about the only special thing it has going for it.&lt;h3&gt;In Search of Magic Numbers&lt;/h3&gt;&lt;p&gt;Is Facebook’s curve the best curve you can get with 26 points? And if 26 isn’t the magic number, is there one?&lt;p&gt;I was intrigued by Pippin’s alternate proposed curves, so I set out to do some testing of my own using his as a starting point. One thing that stood out to me immediately was that he optimized the curves for minimum &lt;a href="https://en.wikipedia.org/wiki/Mean_absolute_error" target="_blank"&gt;mean absolute error&lt;/a&gt;. Generally, when testing sample fit to a curve, &lt;a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation" target="_blank"&gt;root-mean-square error&lt;/a&gt; is more meaningful, because it gives more weight to points that are further off the curve. Large individual errors are definitely undesirable in this case, so that seems a better measure. I was also interested in seeing the max error for that reason. I set up some code to interpolate the 256 values that would be found in an 8-bit JPEG’s color channels, compared them to the values calculated using the actual &lt;a href="https://en.wikipedia.org/wiki/SRGB#The_reverse_transformation" target="_blank"&gt;sRGB inverse gamma function&lt;/a&gt;, and measured the max error, MAE, and RMSE for his curves vs the TinyRGB/c2 curve.&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;Points | Max Error | Mean Error | RMS Error | Point Values
    23 |  0.000587 |   0.000148 |  0.000194 | 0,229,544,1072,1796,2744,3937,5384,7104,9104,11396,13995,16912,20157,23735,27657,31937,36573,41589,46976,52754,58916,65535
    24 |  0.000675 |   0.000136 |  0.000180 | 0,219,509,993,1655,2521,3605,4920,6476,8288,10364,12716,15353,18283,21517,25062,28924,33115,37636,42500,47710,53277,59193,65535
    25 |  0.000544 |   0.000125 |  0.000166 | 0,210,483,924,1533,2322,3315,4513,5928,7581,9468,11605,14003,16660,19597,22813,26312,30116,34214,38621,43348,48385,53766,59452,65535
   *26 |  0.000449 |   0.000119 |  0.000146 | 0,202,455,864,1423,2154,3060,4156,5454,6960,8689,10637,12821,15247,17920,20855,24042,27501,31233,35247,39549,44132,49018,54208,59695,65535
    26 |  0.000464 |   0.000115 |  0.000150 | 0,203,457,867,1426,2155,3062,4159,5457,6964,8689,10640,12824,15250,17925,20855,24045,27504,31237,35259,39548,44137,49021,54211,59696,65535
    27 |  0.000483 |   0.000106 |  0.000138 | 0,194,429,812,1327,2001,2836,3842,5035,6415,8000,9786,11785,14005,16451,19134,22051,25211,28621,32289,36215,40409,44869,49603,54621,59912,65535
    28 |  0.000408 |   0.000098 |  0.000129 | 0,186,410,763,1243,1865,2635,3567,4662,5938,7388,9034,10870,12910,15157,17614,20294,23191,26324,29681,33285,37124,41214,45555,50148,55007,60114,65535
    29 |  0.000418 |   0.000091 |  0.000122 | 0,180,390,720,1166,1743,2457,3319,4333,5509,6851,8366,10060,11938,14007,16271,18737,21406,24286,27379,30689,34222,37981,41970,46195,50657,55366,60307,65535
    42 |  0.000174 |   0.000043 |  0.000056 | 0,123,246,410,627,897,1224,1612,2064,2583,3170,3826,4558,5365,6250,7212,8258,9385,10602,11901,13289,14769,16342,18005,19765,21620,23574,25630,27778,30038,32395,34859,37431,40105,42891,45785,48794,51909,55140,58486,61945,65535
&lt;/pre&gt;
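&lt;p&gt;For context, here’s roughly how one of these point lists is evaluated when the profile is used. An ICC ‘curv’ table is sampled at evenly spaced points, with linear interpolation in between; this sketch (Python; names are mine) uses the 26-point TinyRGB values from the table above:

```python
# TinyRGB (c2) 26-point TRC from the table above, 16-bit values
points = [0, 202, 455, 864, 1423, 2154, 3060, 4156, 5454, 6960,
          8689, 10637, 12821, 15247, 17920, 20855, 24042, 27501,
          31233, 35247, 39549, 44132, 49018, 54208, 59695, 65535]

def trc_eval(c, pts):
    # evenly spaced points, linear interpolation in between
    t = c * (len(pts) - 1)
    i = min(int(t), len(pts) - 2)
    frac = t - i
    return (pts[i] + (pts[i + 1] - pts[i]) * frac) / 65535.0

# the true sRGB inverse gamma gives 0.000304 for input 1/255;
# the interpolated curve lands close to, but not exactly on, that value
print(trc_eval(1 / 255, points))
```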
&lt;p&gt;My results didn’t match his mean error numbers in the 6&lt;sup&gt;th&lt;/sup&gt; decimal place, but they’re close enough that I can tell we’re using the same basic logic. As you can see, the Facebook curve stats (marked with an asterisk) do show a larger mean error, but the max error and RMSE are lower, meaning their curve is a slightly better fit overall based on this measure. Essentially, that curve has a greater overall error, but the error is distributed more evenly with fewer large individual errors. Their max error is also lower than the curves with more/fewer points immediately surrounding, which is good, but those curves weren’t optimized to minimize max relative error, so that might not be meaningful.&lt;p&gt;But actually, these numbers still aren’t the best measure of the curves’ accuracy. Because the &lt;a href="https://en.wikipedia.org/wiki/SRGB#The_reverse_transformation" target="_blank"&gt;sRGB gamma curve&lt;/a&gt; is intentionally very much &lt;i&gt;not&lt;/i&gt; linear (except for that small bit at the start), a relatively small absolute error at the bottom end has a greater impact on image fidelity than a larger absolute error at the top of the curve. For example, the output value for an input of 1/255 should be 0.000304. An error of 0.000449 (the max error from the TinyRGB curve) on that value would be huge. At the top of the curve, where the output for 254/255 should be 0.991102, that same error would be insignificant. A more useful measure here would be the error &lt;i&gt;relative&lt;/i&gt; to the correct value, not the absolute error.&lt;p&gt;Going beyond that, it’s important to understand what the curve is used for and what an error actually means as far as image fidelity. This curve is included in an ICC profile that’s meant to be embedded in images so that they can be converted to other colorspaces.
Since we know the curve is going to have errors, it’s best to optimize the placement of the points so that the error has as little &lt;i&gt;visual&lt;/i&gt; impact as possible when the image is converted.&lt;p&gt;That conversion process goes like this:&lt;ol&gt;&lt;li&gt;Convert source RGB values to Linear RGB. This is what the curve is used for. It should approximate the inverse gamma function from the sRGB spec. That’s where the errors are introduced – you can’t precisely replicate the sRGB curve with nothing but straight lines.&lt;/li&gt;&lt;li&gt;Convert Linear RGB to XYZ. This is done using the XYZ values for the red, green, and blue primaries that are also included in the profile.&lt;/li&gt;&lt;li&gt;Convert those XYZ values to Linear RGB in the &lt;i&gt;target&lt;/i&gt; colorspace using &lt;i&gt;its&lt;/i&gt; XYZ primaries.&lt;/li&gt;&lt;li&gt;Run that Linear RGB through the inverse of the curve in the target profile to arrive at the final target RGB values.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;The simplest version of this process would be an identity transform from the sRGB-compatible colorspace to true sRGB. If everything goes right, the output values will be identical to the input.&lt;p&gt;That’s my first criterion for the curve. It must support a round-trip for every value 0-255 through the profile curve and then back through the true sRGB gamma function. If any value changes on round-trip, the curve is &lt;i&gt;not&lt;/i&gt; sRGB-compatible.&lt;h3&gt;Measuring Visual Error&lt;/h3&gt;&lt;p&gt;The round-trip test is the absolute minimum that the curve should pass, but we can actually get a pretty good idea of the curve’s visual accuracy beyond that.  Keep in mind that sRGB is a relatively compact colorspace. 
When converting to a colorspace with a &lt;a href="https://en.wikipedia.org/wiki/File:CIE1931xy_gamut_comparison.svg" target="_blank"&gt;wider gamut&lt;/a&gt;, a difference that might not result in an error in sRGB might throw a color off by quite a lot in a colorspace that is larger and more spread out.&lt;p&gt;I think Facebook was on the right track with their design. They mentioned validating the error in their curve by using the ΔE-CIE94 measure. That’s a &lt;a href="https://en.wikipedia.org/wiki/Color_difference" target="_blank"&gt;measure of color difference&lt;/a&gt; based in the L*a*b* colorspace, which is designed to be perceptually uniform. So instead of measuring numbers from the curve output and just picking the closest ones, they actually verified that the numbers they picked got close &lt;i&gt;visually&lt;/i&gt; to the reference values. L*a*b* is calculated directly from XYZ values, so it’s also a good test of the exact conversions that will happen when the profile is used for real.&lt;p&gt;I got the impression from their post that they tuned the curve first and then used the ΔE-CIE94 measures to make sure the final results were good enough. My plan was to integrate the visual measures into the tuning process itself, so that the results would not just be good enough, but rather would be the best possible for a given number of curve points.&lt;p&gt;To that end, I decided to take a similar but simpler approach. ΔE-CIE94 is complicated to calculate because it has some refinements to the original ΔE-CIE76 spec to deal with irregularities in the model that show up in certain hue ranges. Furthermore, to test the entire RGB space, I would have to do 16.7M comparisons (at 8-bit input precision) with that complicated calculation for each candidate curve. I realized I could simplify things greatly by working with the grey values 0-255. 
Since sRGB uses the same curve for all three color channels, grey is as good as any color for testing the curve.&lt;p&gt;Limiting to just the grey values allows a simpler calculation of L* since it can be directly calculated from the Y value in XYZ, and a* and b* will always be 0.&amp;nbsp; That meant I could look just at ΔL* and have a very good idea what the perceptual difference was between the reference value and the calculated value from the curve candidates.&amp;nbsp; And to make that comparison as accurate as possible, I used the ΔL* adjustments from the even-newer ΔE-CIE2000, which gives more importance to midtones, reducing the visual difference measure for very dark or very light colors.&lt;p&gt;So, to review, I ended up with three measures for evaluating and tuning the curves. In order of importance, those are:&lt;ol&gt;&lt;li&gt;The round-trip test through the sRGB gamma function&lt;/li&gt;&lt;li&gt;The ΔL* for reference vs calculated values&lt;/li&gt;&lt;li&gt;The relative error in the curve output values&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;I decided to keep the relative error from the curve output as a measure, because the closer the curve is to the correct sRGB gamma curve numerically, the more points can be interpolated relatively error-free. I’ll explain that more later, but basically, the round-trip test and ΔL* are best for determining the max error and tolerances, but the relative error is best for fitting the curve for points in-between.&lt;p&gt;With all that explanation out of the way, I’ll get back to the curves from Pippin’s sRGBz post. Here are the stats for those curves using the measures I described. Again, the TinyRGB curve is marked with an asterisk. And the left three error columns are now relative error instead of absolute.
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;Points | Max Error | Mean Error | RMS Error | Max DeltaL | Mean DeltaL | RMS DeltaL | Max RT Error
    23 |  0.039987 |   0.002466 |  0.005752 |   0.189551 |    0.017885 |   0.029098 | 1
    24 |  0.040603 |   0.002357 |  0.005641 |   0.179425 |    0.016492 |   0.027182 | 1
    25 |  0.031010 |   0.002205 |  0.005345 |   0.124504 |    0.015105 |   0.024044 | 0
   *26 |  0.034171 |   0.001978 |  0.005315 |   0.095100 |    0.014204 |   0.021270 | 0
    26 |  0.029402 |   0.002035 |  0.005077 |   0.111436 |    0.013970 |   0.022130 | 0
    27 |  0.031464 |   0.001920 |  0.004796 |   0.120256 |    0.012911 |   0.020769 | 0
    28 |  0.029564 |   0.001918 |  0.004616 |   0.104781 |    0.011936 |   0.018717 | 0
    29 |  0.028636 |   0.001729 |  0.004265 |   0.093921 |    0.011154 |   0.017872 | 0
    42 |  0.015034 |   0.000887 |  0.002183 |   0.040349 |    0.005297 |   0.008399 | 0&lt;/pre&gt;&lt;p&gt;Using these measures, we can learn much more about the real-world usefulness of the curves. First of all, you can see that Pippin’s 23- and 24-point curves, despite having fairly low mean and RMS error values, failed the round-trip test. The Max RT Error of 1 means the pixels were offset from their correct values by a max of 1, but that’s still not good enough. Next, you can see that the Max ΔL* from the TinyRGB curve is lower than all but the two largest of Pippin’s proposed curves. Looking at the columns on the left, you can see that Pippin’s 26-point curve is a better fit to the reference curve based purely on the relative error numbers, and that makes sense given that that’s how he optimized them. He looked only at the raw numbers, while the Facebook team considered the visual impact of the numbers.&lt;p&gt;So based on that, the TinyRGB curve looks pretty impressive. It passes the round-trip test and was obviously tuned for visual accuracy. But can we do better? Of course we can :)&lt;p&gt;But first, I’ll explain one more thing. What does the ΔL* value mean in real-world terms?&lt;p&gt;The Facebook TinyRGB post said that their ΔE-CIE94 testing showed that their error level was less than half of what is perceptible to humans. Under the CIE76 definition of ΔE, a value of 1 is generally considered the minimal noticeable difference between colors, and ΔE is defined as Sqrt(ΔL*&lt;sup&gt;2&lt;/sup&gt; + Δa*&lt;sup&gt;2&lt;/sup&gt; + Δb*&lt;sup&gt;2&lt;/sup&gt;). If we were to assume a target ΔE of 1, then knowing that our Δa* and Δb* values are always 0, we could say that the minimal noticeable ΔL* should be Sqrt(1/3), or 0.57735. However, the newer revisions to ΔE complicate things by adding a scaling factor to each color component, and ΔE-CIE2000 complicates things a bit more by adjusting the color difference so that midtones are more heavily weighted. 
That makes it more difficult to find a threshold value for ΔL*. I decided to do some ad-hoc testing using real grey values from the actual sRGB curve to lend context. I calculated the minimum and maximum ΔL* for all adjacent shades of grey in 8-bit sRGB. The minimum value was 0.157124, which was the difference between grey levels 0 and 1. The max was 0.397609, between grey levels 117 and 118.&lt;p&gt;Looking back at the curves that failed the round-trip test, you can see those had max ΔL* values of 0.179425 and 0.189551, so it’s easy to imagine why they would have had values change on the round-trip. To make it easier to picture the difference, though, here’s what those greys look like. First a pair of boxes at grey values 0 and 1:
&lt;div style="background: rgb(0, 0, 0); width: 256px; height: 256px; float: left;"&gt;&lt;/div&gt;&lt;div style="background: rgb(1, 1, 1); width: 256px; height: 256px; float: left;"&gt;&lt;/div&gt;
&lt;br style="clear: both;"&gt;
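&lt;p&gt;Incidentally, the adjacent-grey ΔL* calculation above is easy to approximate yourself. Here’s a quick sketch (mine, not the tooling I used for the tests) that decodes each 8-bit grey with the standard sRGB formula, converts to CIE L*, and differences adjacent levels. The exact minimum and maximum you get depend on the L* constants and any quantization in your pipeline, so it may not match my figures to the last digit:&lt;/p&gt;

```python
# Sketch (not the original tooling): L* for each 8-bit sRGB grey level and
# the Delta-L* between adjacent levels, using the standard sRGB EOTF and
# CIELAB lightness formulas.

def srgb_to_linear(c):
    # sRGB decoding: linear toe below 0.04045, 2.4-power segment above
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def lightness(y):
    # CIE L*: cube root above (6/29)^3, linear segment below
    return 116.0 * y ** (1 / 3) - 16.0 if y > (6 / 29) ** 3 else (29 / 3) ** 3 * y

lstar = [lightness(srgb_to_linear(g / 255)) for g in range(256)]
deltas = [b - a for a, b in zip(lstar, lstar[1:])]
# deltas[117] is the Delta-L* between grey levels 117 and 118 (~0.4)
```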
&lt;p&gt;&lt;br&gt;&lt;p&gt;And now a pair at 117 and 118:
&lt;div style="background: rgb(117, 117, 117); width: 256px; height: 256px; float: left;"&gt;&lt;/div&gt;&lt;div style="background: rgb(118, 118, 118); width: 256px; height: 256px; float: left;"&gt;&lt;/div&gt;&lt;br style="clear: both;"&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;On my laptop, which has an above-average-quality screen, in a dark room, I can see the line between 117 and 118 quite clearly. The line between 0 and 1, I can’t really see at all. Depending on your screen, viewing environment, and eyes, you may or may not see any difference.&lt;p&gt;Based on my sample size of one (totally statistically significant – to me, ha), the minimum noticeable difference in ΔL* seems to be somewhere between 0.16 and 0.40… Let’s call it, 0.2-ish to be safe. The max ΔL* of the TinyRGB curve is right around half that, so that checks out. We’re going to do better than that by far, but I wanted to give you an idea what that number means in the real world since it was a key measurement in my testing.&lt;p&gt;As I mentioned before, I reached the same conclusion Pippin did regarding the magic of the 26-point curve. I did it by testing curves at all sizes from 16-255 and comparing them. The curves were tuned using the same measures I detailed above. The first priority was round-trip accuracy, second was to minimize ΔL*, and third was to fit the curve by minimizing the RMS relative error. This required an iterative approach to curve optimization, where certain points were locked based on their impact to ΔL* and the others were allowed to move until the best-fitting curve was found. My solver found a few interesting ones.&lt;h3&gt;Show Me Those Curves&lt;/h3&gt;&lt;p&gt;I’ll start with the smallest useable curves I was able to create.&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;Points | Max Error | Mean Error | RMS Error | Max DeltaL | Mean DeltaL | RMS DeltaL | Max RT Error | Point Values
    19 |  0.041959 |   0.003564 |  0.007399 |   0.139496 |    0.026601 |   0.038015 | 0            | 0,279,753,1521,2622,4077,5920,8169,10853,13987,17596,21693,26300,31431,37102,43328,50128,57494,65535
    20 |  0.035090 |   0.003569 |  0.007435 |   0.128757 |    0.024572 |   0.035288 | 0            | 0,263,693,1387,2358,3664,5297,7296,9672,12449,15641,19264,23335,27867,32875,38371,44368,50882,57905,65535
    21 |  0.033688 |   0.003200 |  0.006765 |   0.115426 |    0.021854 |   0.031353 | 0            | 0,250,638,1263,2146,3309,4773,6557,8678,11152,13995,17221,20842,24872,29323,34206,39534,45316,51565,58276,65535
    22 |  0.033893 |   0.003218 |  0.007011 |   0.108881 |    0.020180 |   0.028803 | 0            | 0,237,594,1159,1959,3008,4325,5928,7832,10050,12598,15485,18727,22331,26312,30677,35438,40603,46183,52189,58613,65535
    23 |  0.034801 |   0.002913 |  0.006443 |   0.106786 |    0.018459 |   0.026617 | 0            | 0,227,554,1071,1798,2749,3940,5389,7106,9106,11401,14000,16917,20159,23738,27661,31938,36580,41589,46980,52759,58920,65535
    24 |  0.031175 |   0.003083 |  0.007205 |   0.089015 |    0.017114 |   0.024290 | 0            | 0,215,520,994,1657,2523,3607,4922,6479,8291,10369,12721,15358,18288,21521,25065,28928,33116,37639,42503,47714,53282,59201,65535
&lt;/pre&gt;&lt;p&gt;By only considering options that allowed the round-trip test to pass, I was able to create viable curves with as few as 19 points. You can see that each point added reduces ΔL*, though, so more is better at this stage.&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;Points | Max Error | Mean Error | RMS Error | Max DeltaL | Mean DeltaL | RMS DeltaL | Max RT Error | Point Values
    25 |  0.035425 |   0.002291 |  0.005597 |   0.091829 |    0.015423 |   0.022824 | 0            | 0,210,487,926,1534,2327,3317,4515,5934,7583,9472,11610,14005,16666,19600,22815,26320,30117,34218,38626,43349,48393,53765,59459,65535
   *26 |  0.034171 |   0.001978 |  0.005315 |   0.095100 |    0.014204 |   0.021270 | 0            | 0,203,457,867,1426,2155,3062,4159,5457,6964,8689,10640,12824,15250,17925,20855,24045,27504,31237,35259,39548,44137,49021,54211,59696,65535
    26 |  0.032400 |   0.002235 |  0.005239 |   0.079740 |    0.014239 |   0.020672 | 0            | 0,201,459,866,1426,2155,3062,4159,5457,6964,8689,10639,12824,15250,17925,20854,24045,27504,31237,35249,39548,44137,49022,54211,59697,65535
    27 |  0.026077 |   0.002531 |  0.005867 |   0.069934 |    0.013876 |   0.019952 | 0            | 0,191,435,819,1329,2003,2837,3846,5037,6419,8001,9787,11788,14008,16455,19134,22052,25213,28625,32291,36218,40409,44871,49607,54624,59917,65535
    28 |  0.025899 |   0.002487 |  0.006214 |   0.063570 |    0.012585 |   0.017635 | 0            | 0,183,415,765,1245,1867,2638,3568,4666,5938,7392,9036,10873,12913,15159,17618,20296,23195,26325,29685,33286,37127,41217,45557,50152,55009,60120,65535
    29 |  0.024357 |   0.002259 |  0.005617 |   0.054125 |    0.011644 |   0.016265 | 0            | 0,177,395,723,1169,1746,2460,3321,4336,5511,6853,8368,10062,11942,14011,16275,18740,21409,24288,27381,30691,34225,37984,41974,46199,50661,55367,60310,65535&lt;/pre&gt;&lt;p&gt;As more points are added, the ΔL* continues to go down. I was able to create 24- and 25-point curves with lower max ΔL* than the TinyRGB 26-point curve (marked with an asterisk again) as well as improve on nearly all the stats with a different 26-point curve of my own. But none of them is as good as the 27-, 28-, or 29-point curves, which is to say… there’s nothing special at all about 26 points.&lt;p&gt;Outside the small blip between 24 and 25 points, it wasn’t until my solver reached 32 points that it was unable to keep improving with each additional point. Beyond that size, reductions in ΔL* got more difficult to come by, and curve performance was more difficult to predict. The sizes that outperform their neighbors make interesting candidates if you’re looking to optimize size/quality ratio, like you might do if you were trying to make a compact sRGB-compatible profile. Here are stats from a few such curves:&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;Points | Max Error | Mean Error | RMS Error | Max DeltaL | Mean DeltaL | RMS DeltaL | Max RT Error | Point Values
    32 |  0.018609 |   0.001701 |  0.004036 |   0.039496 |    0.009391 |   0.013111 | 0            | 0,161,345,618,985,1453,2030,2724,3539,4481,5554,6763,8115,9611,11256,13055,15012,17130,19412,21862,24484,27280,30256,33410,36750,40276,43993,47902,52005,56309,60807,65535
    42 |  0.007896 |   0.000696 |  0.001455 |   0.025409 |    0.005082 |   0.007290 | 0            | 0,124,248,412,629,899,1225,1614,2066,2584,3170,3828,4559,5366,6250,7214,8259,9388,10602,11902,13291,14771,16342,18007,19766,21622,23575,25629,27782,30038,32397,34860,37430,40107,42892,45787,48793,51911,55141,58487,61945,65535
    56 |  0.007696 |   0.000515 |  0.001255 |   0.013777 |    0.002988 |   0.004177 | 0            | 0,92,183,284,410,566,751,966,1215,1497,1813,2167,2556,2984,3452,3958,4507,5096,5729,6406,7126,7892,8704,9562,10468,11423,12425,13478,14581,15734,16940,18197,19507,20872,22289,23762,25290,26873,28513,30210,31964,33777,35647,37577,39567,41616,43727,45899,48131,50427,52785,55205,57690,60239,62850,65535
    63 |  0.003646 |   0.000347 |  0.000720 |   0.009950 |    0.002294 |   0.003162 | 0            | 0,82,163,247,350,475,623,794,990,1212,1459,1734,2038,2370,2732,3124,3547,4002,4489,5009,5562,6150,6772,7430,8124,8853,9620,10424,11266,12146,13065,14024,15022,16061,17140,18261,19422,20627,21873,23162,24495,25871,27292,28757,30267,31821,33422,35069,36762,38501,40288,42123,44005,45935,47914,49941,52018,54144,56321,58547,60824,63151,65535
   124 |  0.005790 |   0.000191 |  0.000765 |   0.003821 |    0.000682 |   0.000986 | 0            | 0,41,82,124,165,206,250,300,355,416,482,554,632,716,806,902,1005,1114,1230,1353,1482,1619,1762,1913,2071,2236,2409,2589,2777,2972,3176,3387,3606,3834,4069,4313,4565,4825,5094,5372,5658,5953,6256,6568,6890,7220,7559,7908,8265,8632,9008,9394,9789,10194,10608,11032,11465,11909,12362,12825,13298,13781,14274,14777,15291,15815,16349,16893,17448,18014,18589,19176,19773,20381,21000,21629,22269,22921,23583,24256,24941,25636,26343,27061,27790,28530,29282,30045,30820,31607,32404,33214,34035,34868,35713,36570,37438,38318,39211,40115,41031,41960,42900,43853,44818,45795,46785,47787,48801,49828,50867,51919,52983,54060,55150,56252,57368,58495,59636,60790,61956,63136,64328,65535
   182 |  0.001022 |   0.000092 |  0.000230 |   0.003107 |    0.000440 |   0.000736 | 0            | 0,28,56,84,112,140,168,196,225,256,290,326,365,405,449,496,544,597,651,708,769,831,898,966,1038,1113,1191,1273,1356,1444,1534,1628,1726,1825,1930,2036,2147,2261,2377,2499,2623,2751,2882,3017,3156,3297,3444,3593,3746,3904,4064,4229,4397,4570,4746,4926,5110,5298,5489,5686,5885,6090,6297,6510,6726,6946,7171,7399,7632,7869,8110,8356,8606,8860,9119,9381,9649,9920,10197,10477,10762,11051,11345,11644,11946,12254,12566,12882,13204,13529,13860,14195,14534,14880,15228,15583,15941,16304,16673,17046,17424,17807,18194,18587,18984,19387,19793,20206,20623,21045,21472,21904,22341,22784,23230,23684,24140,24603,25071,25543,26022,26505,26993,27487,27985,28490,28998,29514,30033,30558,31089,31624,32166,32712,33264,33822,34384,34953,35525,36105,36689,37279,37875,38475,39083,39694,40312,40935,41563,42198,42838,43483,44135,44791,45455,46122,46796,47476,48161,48853,49549,50252,50960,51674,52395,53119,53852,54589,55332,56082,56836,57598,58364,59137,59916,60700,61492,62288,63091,63899,64714,65535
   212 |  0.001650 |   0.000118 |  0.000357 |   0.002817 |    0.000449 |   0.000707 | 0            | 0,24,48,72,96,120,144,168,192,217,243,270,300,332,365,400,437,476,517,560,605,652,701,752,805,861,918,978,1040,1104,1170,1239,1310,1383,1459,1536,1617,1700,1785,1873,1962,2055,2150,2248,2348,2450,2555,2663,2774,2886,3002,3120,3242,3365,3492,3620,3753,3887,4025,4164,4308,4453,4602,4754,4908,5065,5225,5389,5554,5723,5895,6070,6248,6429,6612,6799,6990,7182,7379,7577,7780,7986,8194,8406,8620,8838,9060,9284,9512,9742,9976,10214,10454,10698,10945,11196,11449,11706,11967,12230,12498,12768,13042,13319,13599,13884,14171,14462,14756,15054,15356,15660,15969,16280,16596,16915,17237,17563,17892,18226,18563,18903,19247,19594,19946,20300,20659,21021,21387,21756,22130,22506,22887,23271,23660,24051,24447,24846,25250,25657,26067,26482,26900,27323,27749,28178,28612,29050,29492,29937,30386,30839,31296,31758,32223,32691,33165,33641,34123,34607,35096,35588,36086,36587,37092,37600,38113,38631,39152,39677,40206,40739,41277,41819,42364,42914,43468,44027,44589,45155,45726,46301,46880,47463,48050,48642,49238,49839,50443,51051,51664,52281,52903,53528,54159,54793,55432,56075,56722,57373,58030,58690,59355,60023,60697,61375,62057,62744,63435,64130,64830,65535&lt;/pre&gt;&lt;p&gt;You can see that at 32 points, the max ΔL* is less than half that of the TinyRGB/c2 curve, which makes the increase in size well worth it. Doubling(-ish) the size to 63 points reduces error a further ~4x. Past that, it becomes increasingly expensive to make quality gains, with doubling size yielding a ~2.5x error improvement. Beyond that, it takes lots more points to improve accuracy, which peaked in the 212-point curve.&lt;p&gt;At this point, an obvious question comes up: Why even bother with 212? Why not just use a 256-point curve tag and be done with it?&lt;h3&gt;Bigger Isn’t Always Better&lt;/h3&gt;&lt;p&gt;Intuitively, one might expect that the best curve fit for an 8-bit image would have 256 points. 
Each point could contain the exact best output value for each input and no interpolation would be required. But look what happens when we compare a 256-point curve to the best performers from above.&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;Points | Max Error | Mean Error | RMS Error | Max DeltaL | Mean DeltaL | RMS DeltaL | Max RT Error
   124 |  0.005790 |   0.000191 |  0.000765 |   0.003821 |    0.000682 |   0.000986 | 0
   182 |  0.001022 |   0.000092 |  0.000230 |   0.003107 |    0.000440 |   0.000736 | 0
   212 |  0.001650 |   0.000118 |  0.000357 |   0.002817 |    0.000449 |   0.000707 | 0
   256 |  0.005447 |   0.000210 |  0.000802 |   0.004125 |    0.000646 |   0.001042 | 0&lt;/pre&gt;&lt;p&gt;Because the curve points are stored as 16-bit unsigned integer values in the ICC profile (the ICC uInt16Number type), there’s a natural limit to the output precision. That limit is 1/65535, or 0.0000152590. Remember that at the bottom end of the sRGB curve, the output values are very, very small. For example, the value for an input of 2/255 should be 0.0006070540. Quantized to 16 bits, that value becomes 40/65535, which is actually 0.0006103609. That value is higher than the correct one by 0.5447%, which is the max error shown above. And there are several values with that error – I didn’t just pick the worst one. But notice the 182- and 212-point curves have much lower max errors. The same is reflected in the ΔL*. Although it’s tiny on the 256-point curve, the others still do better. Because those have fewer points, the output values have to be interpolated between two points and can actually fall &lt;i&gt;between&lt;/i&gt; the values that would be possible to express explicitly at 16-bit precision. So, in this case, less can be more.&lt;p&gt;Carrying that further, consider the 1024-point curve used in the standard sRGB profile. Once again, I will reference Elle Stone’s site, which has a detailed &lt;a href="https://ninedegreesbelow.com/photography/linux-icc-profiles.html#tone-response-curves" target="_blank"&gt;survey of a variety of common sRGB profiles&lt;/a&gt;. She found that the majority of profiles use that same 1024-point curve. She also explains the precision issue, which she refers to as ‘hexadecimal rounding’. I call it ‘16-bit quantization’. Potato, potato.&lt;p&gt;Let’s see what happens when we use that 1024-point curve to get output for 8-bit input values. 
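&lt;p&gt;(A quick aside: the 2/255 quantization arithmetic above is easy to check with a few lines of Python – a sketch of mine, not output from any profile tooling.)&lt;/p&gt;

```python
# Verify the 16-bit quantization example: the exact linear value for sRGB
# input 2/255 vs. the nearest value representable in a 16-bit 'curv' entry.

def srgb_to_linear(c):
    # Linear (toe) segment of the sRGB curve applies below 0.04045
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

exact = srgb_to_linear(2 / 255)   # ~0.0006070540
entry = round(exact * 65535)      # nearest 16-bit curve entry: 40
stored = entry / 65535            # ~0.0006103609
rel_error = stored / exact - 1    # ~0.005447, the 0.5447% max error above
```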
And let’s see what happens if we go even bigger and use a 4096-point curve from &lt;a href="https://github.com/ellelstone/elles_icc_profiles/raw/master/profiles/sRGB-elle-V4-srgbtrc.icc" target="_blank"&gt;Elle’s custom profile collection&lt;/a&gt;.&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;Points | Max Error | Mean Error | RMS Error | Max DeltaL | Mean DeltaL | RMS DeltaL | Max RT Error
   124 |  0.005790 |   0.000191 |  0.000765 |   0.003821 |    0.000682 |   0.000986 | 0
   182 |  0.001022 |   0.000092 |  0.000230 |   0.003107 |    0.000440 |   0.000736 | 0
   212 |  0.001650 |   0.000118 |  0.000357 |   0.002817 |    0.000449 |   0.000707 | 0
   256 |  0.005447 |   0.000210 |  0.000802 |   0.004125 |    0.000646 |   0.001042 | 0
  1024 |  0.008405 |   0.000205 |  0.000996 |   0.003993 |    0.000475 |   0.000819 | 0
  4096 |  0.008405 |   0.000175 |  0.000860 |   0.003054 |    0.000472 |   0.000782 | 0&lt;/pre&gt;&lt;p&gt;You can see that the max error has actually gotten worse with the bigger curves. The reason for this is that with more points defined in the curve, their values get closer together, and the quantization/rounding error becomes more significant. If we look at the linear segment of the 1024-point curve, we can see the issue.&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;0,5,10,15,20,25,30,35,40,45,50,&lt;b&gt;55,59&lt;/b&gt;,64,69,74,79,84,89,94,99,104,109,114,119,124,129,134,139,144,149,154,159,164,169,&lt;b&gt;174,178&lt;/b&gt;,183,188,193&lt;/pre&gt;&lt;p&gt;Notice that there’s a nice even increment of 5 between each step… except for two places where it’s 4. That uneven step hints at the fact that the slope of the line allowed by the quantization to 16 bits is not quite right. The only way to make it better is to remove points so that the slope can be represented correctly. Here is the same segment from the 212-point curve, which has even steps throughout.&lt;/p&gt;&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;0,24,48,72,96,120,144,168,192&lt;/pre&gt;&lt;p&gt;The extra resolution in the 4096-point curve moves the error around a bit, so it manages a better ΔL* than the 1024-point curve, but it still trails the 212-point curve in all stats. That curve also has even more serious rounding issues that we haven’t encountered yet, because we’ve only been looking up 256 values in that curve. I’ll come back to that in a bit.&lt;/p&gt;&lt;h3&gt;A Change of Direction&lt;/h3&gt;&lt;p&gt;I must admit, I was rather surprised when I learned there were curve matches that exceeded the accuracy of the standard 1024-point curve used in so many profiles. &lt;p&gt;The initial goal I had was to find a better solution than TinyRGB/c2 for a compact sRGB-compatible profile. 
That profile is used almost exclusively to convert JPEG images to other colorspaces, so the accuracy of its output when used with 8-bit input is the most important thing. For that purpose, the 212-point curve turns out to be the most accurate, and that might make it perfect for image embedding if you don’t mind its size, which comes out to 796 bytes in a minimal profile packed using the technique I described in my last post. That’s about a quarter the size of the standard sRGB profile, with increased accuracy – a true win/win. But there’s a reasonable case to be made for a smaller profile as well, especially for thumbnail-sized images. If you have a 4KB JPEG, even 796 bytes for the profile seems heavy, and I can improve on TinyRGB significantly with just a few more curve points.&lt;p&gt;I’ll get back to the curves I picked for my compact sRGB-compatible profiles later, but the accuracy of the 182- and 212-point curves got me wondering whether they might also work better as a target profile than the standard sRGB profile does or whether they might be appropriate for higher-bit-depth images. I decided to test them again, using more input samples this time. I discovered that the tuning I had done to optimize for 8-bit input hurt the overall fit of the curves a tiny bit, so they didn’t give quite as good results with more samples. So, I ran them through my solver one more time and asked it to tune for 1024 samples instead of 256. There was a very slight drop in their 8-bit accuracy after that was done, but the curves continued to perform well. And their performance at higher resolution beat everything.&lt;h3&gt;Numbers, Numbers, Numbers&lt;/h3&gt;&lt;p&gt;With the final set of interesting curves identified, I set out to do comprehensive comparisons. 
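&lt;p&gt;For context, here’s roughly the shape of the measurement harness – a hypothetical sketch, not my actual solver code. An ICC v2 ‘curv’ table is a list of 16-bit values sampled evenly over the input range and linearly interpolated between entries, and the round-trip test pushes every code value through the table and back through the exact inverse formula:&lt;/p&gt;

```python
# Hypothetical sketch of the curve-evaluation harness described here. A
# 'curv' table is a list of 16-bit values, evenly spaced over [0,1] and
# linearly interpolated between entries, per the ICC v2 spec.

def srgb_to_linear(c):
    # Exact sRGB decoding (the reference the tables are fit against)
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(y):
    # Exact sRGB encoding (inverse of the above)
    return y * 12.92 if y <= 0.0031308 else 1.055 * y ** (1 / 2.4) - 0.055

def curve_eval(points, x):
    # Linearly interpolate a 16-bit curve table at input x in [0,1].
    pos = x * (len(points) - 1)
    i = min(int(pos), len(points) - 2)
    frac = pos - i
    return (points[i] * (1 - frac) + points[i + 1] * frac) / 65535

def round_trip_error(points, bits=8):
    # Encode each code value through the table, invert with the exact
    # formula, and report the worst distance from the starting value.
    maxcode = (1 << bits) - 1
    worst = 0
    for v in range(maxcode + 1):
        lin = curve_eval(points, v / maxcode)
        worst = max(worst, abs(round(linear_to_srgb(lin) * maxcode) - v))
    return worst
```

&lt;p&gt;Sampling the exact sRGB curve at 1024 points and running the 8-bit round trip through a harness like this should reproduce the Max RT Error of 0 shown for the standard 1024-point curve.&lt;/p&gt;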
There are lots of numbers here, so feel free to skip this section if you’re the type whose eyes glaze over when they see too many numbers.&amp;nbsp; Come back for the conclusions and the final profiles, though.&amp;nbsp; They’ll be interesting, I promise.&lt;/p&gt;&lt;p&gt;Here are the 8-bit results again for my final set of interesting curves, compared with the standard 1024- and 4096-point curves as well as the TinyRGB curve (again with the *).&amp;nbsp; I have marked the refined 182- and 212-point curves with a caret (^) for comparison with the initial 8-bit-tuned ones.&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;Points | Max Error | Mean Error | RMS Error | Max DeltaL | Mean DeltaL | RMS DeltaL | Max RT Error
    19 |  0.041959 |   0.003564 |  0.007399 |   0.139496 |    0.026601 |   0.038015 | 0
    20 |  0.035090 |   0.003569 |  0.007435 |   0.128757 |    0.024572 |   0.035288 | 0
   *26 |  0.034171 |   0.001978 |  0.005315 |   0.095100 |    0.014204 |   0.021270 | 0
    26 |  0.032400 |   0.002235 |  0.005239 |   0.079740 |    0.014239 |   0.020672 | 0
    32 |  0.018609 |   0.001701 |  0.004036 |   0.039496 |    0.009391 |   0.013111 | 0
    42 |  0.007896 |   0.000696 |  0.001455 |   0.025409 |    0.005082 |   0.007290 | 0
    56 |  0.007696 |   0.000515 |  0.001255 |   0.013777 |    0.002988 |   0.004177 | 0
    63 |  0.003646 |   0.000347 |  0.000720 |   0.009950 |    0.002294 |   0.003162 | 0
   124 |  0.005790 |   0.000191 |  0.000765 |   0.003821 |    0.000682 |   0.000986 | 0
   182 |  0.001022 |   0.000092 |  0.000230 |   0.003107 |    0.000440 |   0.000736 | 0
  ^182 |  0.001072 |   0.000102 |  0.000244 |   0.004540 |    0.000516 |   0.000885 | 0
   212 |  0.001650 |   0.000118 |  0.000357 |   0.002817 |    0.000449 |   0.000707 | 0
  ^212 |  0.001650 |   0.000119 |  0.000361 |   0.003521 |    0.000475 |   0.000743 | 0
   256 |  0.005447 |   0.000210 |  0.000802 |   0.004125 |    0.000646 |   0.001042 | 0
  1024 |  0.008405 |   0.000205 |  0.000996 |   0.003993 |    0.000475 |   0.000819 | 0
  4096 |  0.008405 |   0.000175 |  0.000860 |   0.003054 |    0.000472 |   0.000782 | 0&lt;/pre&gt;&lt;p&gt;The changes to the 212-point curve put its ΔL* right between the 1024- and 4096-point curves, so I would still consider it a no-brainer replacement for the standard 1024-point curve.&amp;nbsp; The 182-point curve fared worse in ΔL* but is still quite good, and it has the best overall fit based on RMSE.&lt;p&gt;Now look what happens when we increase to 10-bit interpolation (1024 input samples):&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;Points | Max Error | Mean Error | RMS Error | Max DeltaL | Mean DeltaL | RMS DeltaL | Max RT Error
    19 |  0.042879 |   0.003594 |  0.007416 |   0.156134 |    0.026705 |   0.038168 | 2
    20 |  0.037908 |   0.003614 |  0.007497 |   0.134980 |    0.024678 |   0.035373 | 2
   *26 |  0.034650 |   0.001994 |  0.005349 |   0.118553 |    0.014295 |   0.021400 | 2
    26 |  0.032418 |   0.002248 |  0.005280 |   0.101960 |    0.014276 |   0.020789 | 2
    32 |  0.019770 |   0.001742 |  0.004119 |   0.057270 |    0.009457 |   0.013254 | 1
    42 |  0.010831 |   0.000711 |  0.001501 |   0.034022 |    0.005139 |   0.007341 | 1
    56 |  0.007831 |   0.000521 |  0.001264 |   0.020659 |    0.002989 |   0.004236 | 0
    63 |  0.005564 |   0.000353 |  0.000767 |   0.016122 |    0.002293 |   0.003257 | 0
   124 |  0.005790 |   0.000203 |  0.000804 |   0.006320 |    0.000701 |   0.001046 | 0
   182 |  0.001697 |   0.000111 |  0.000265 |   0.006673 |    0.000587 |   0.000977 | 0
  ^182 |  0.001478 |   0.000110 |  0.000260 |   0.004644 |    0.000561 |   0.000932 | 0
   212 |  0.002159 |   0.000130 |  0.000379 |   0.004728 |    0.000527 |   0.000802 | 0
  ^212 |  0.001883 |   0.000129 |  0.000379 |   0.003708 |    0.000501 |   0.000774 | 0
   256 |  0.005447 |   0.000200 |  0.000782 |   0.005247 |    0.000608 |   0.000980 | 0
  1024 |  0.008405 |   0.000240 |  0.001044 |   0.004104 |    0.000617 |   0.000993 | 0
  4096 |  0.008996 |   0.000224 |  0.001054 |   0.003897 |    0.000506 |   0.000853 | 0&lt;/pre&gt;&lt;p&gt;The refined 212-point curve outperforms everything else. And notice that the smaller curves are starting to show round-trip errors at this sample resolution.&lt;p&gt;Next up, I’ll test them at 12 bits (4096 input samples):&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;Points | Max Error | Mean Error | RMS Error | Max DeltaL | Mean DeltaL | RMS DeltaL | Max RT Error
    19 |  0.043820 |   0.003601 |  0.007422 |   0.162734 |    0.026727 |   0.038190 | 8
    20 |  0.038794 |   0.003622 |  0.007506 |   0.141046 |    0.024697 |   0.035393 | 8
   *26 |  0.034660 |   0.001997 |  0.005351 |   0.119456 |    0.014308 |   0.021410 | 7
    26 |  0.032431 |   0.002252 |  0.005284 |   0.102827 |    0.014287 |   0.020800 | 6
    32 |  0.019606 |   0.001746 |  0.004125 |   0.056738 |    0.009444 |   0.013208 | 4
    42 |  0.010970 |   0.000712 |  0.001500 |   0.034507 |    0.005144 |   0.007346 | 2
    56 |  0.007827 |   0.000522 |  0.001265 |   0.022992 |    0.002991 |   0.004235 | 1
    63 |  0.006099 |   0.000353 |  0.000767 |   0.015857 |    0.002288 |   0.003245 | 1
   124 |  0.005790 |   0.000205 |  0.000812 |   0.006420 |    0.000701 |   0.001047 | 1
   182 |  0.002016 |   0.000112 |  0.000266 |   0.008137 |    0.000588 |   0.000983 | 0
  ^182 |  0.001482 |   0.000110 |  0.000261 |   0.005065 |    0.000561 |   0.000936 | 0
   212 |  0.002439 |   0.000130 |  0.000381 |   0.005398 |    0.000528 |   0.000802 | 0
  ^212 |  0.001904 |   0.000129 |  0.000381 |   0.003735 |    0.000502 |   0.000775 | 0
   256 |  0.005447 |   0.000202 |  0.000787 |   0.005244 |    0.000610 |   0.000981 | 0
  1024 |  0.008405 |   0.000222 |  0.001025 |   0.003972 |    0.000508 |   0.000852 | 0
  4096 |  0.192685 |   0.000376 |  0.004745 |   0.004194 |    0.000628 |   0.001014 | 0&lt;/pre&gt;&lt;p&gt;Look what’s happened with the 4096-point curve. Now that we’re using all of its points, we can see it’s got a serious flaw. Its max error has jumped way up, and its ΔL* is now worse than the 1024-point curve’s. It’s easy to see why. Have a look at its values for the linear part of the curve:
&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;0,1,&lt;b&gt;2,4&lt;/b&gt;,5,6,&lt;b&gt;7,9&lt;/b&gt;,10,11,&lt;b&gt;12,14&lt;/b&gt;,15,16,&lt;b&gt;17,19&lt;/b&gt;,20,21,22,&lt;b&gt;24,25&lt;/b&gt;,26,27,28,30,31,32,33,35,36,37,38,40,41,42,43,45,46,47,48,50,51,52,53,55,56,57,58,59,61,62,63,64,66,67,68,69,71,72,73,74,76,77,78,79,81,82,83,84,85,87,88,89,90,92,93,94,95,97,98,99,100,102,103,104,105,107,108,109,110,111,113,114,115,116,118,119,120,121,123,124,125,126,128,129,130,131,133,134,135,136,137,139,140,141,142,144,145,146,147,149,150,151,152,154,155,156,157,159,160,161,162,164,165,166,167,168,170,171,172,173,175,176,177,178,180,181,182,183,185,186,187,188,190,191,192,193,194&lt;/pre&gt;&lt;p&gt;Again, the problem is apparent. The steps are uneven, alternating between 1-1-2 and 1-1-1-2 patterns. At that resolution, the 16-bit quantization is making it impossible to get the correct slope for the linear part of the curve, which is why the max error jumped up to over 19%. The 212-point curve is still looking outstanding, by the way. And the smaller curves are showing even larger round-trip errors.&lt;p&gt;And finally, let’s see what it looks like if we interpolate all possible 16-bit samples (65536 of them) with these curves.&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;Points | Max Error | Mean Error | RMS Error | Max DeltaL | Mean DeltaL | RMS DeltaL | Max RT Error
    19 |  0.044230 |   0.003603 |  0.007423 |   0.162631 |    0.026733 |   0.038193 | 135
    20 |  0.039190 |   0.003625 |  0.007508 |   0.141160 |    0.024702 |   0.035397 | 123
   *26 |  0.034661 |   0.001997 |  0.005352 |   0.120915 |    0.014312 |   0.021413 | 114
    26 |  0.032431 |   0.002254 |  0.005286 |   0.104229 |    0.014291 |   0.020803 | 98
    32 |  0.019766 |   0.001749 |  0.004129 |   0.057259 |    0.009446 |   0.013210 | 65
    42 |  0.011194 |   0.000712 |  0.001501 |   0.035285 |    0.005145 |   0.007347 | 36
    56 |  0.007860 |   0.000522 |  0.001265 |   0.023274 |    0.002991 |   0.004235 | 24
    63 |  0.006172 |   0.000353 |  0.000768 |   0.016116 |    0.002288 |   0.003246 | 18
   124 |  0.005790 |   0.000206 |  0.000815 |   0.006580 |    0.000701 |   0.001047 | 9
   182 |  0.002045 |   0.000112 |  0.000266 |   0.008139 |    0.000588 |   0.000983 | 7
  ^182 |  0.001482 |   0.000110 |  0.000261 |   0.005133 |    0.000562 |   0.000936 | 5
   212 |  0.002560 |   0.000131 |  0.000382 |   0.005650 |    0.000528 |   0.000803 | 7
  ^212 |  0.001905 |   0.000130 |  0.000381 |   0.003738 |    0.000502 |   0.000775 | 5
   256 |  0.005447 |   0.000203 |  0.000789 |   0.005248 |    0.000611 |   0.000981 | 6
  1024 |  0.008405 |   0.000223 |  0.001028 |   0.004089 |    0.000509 |   0.000853 | 6
  4096 |  0.192685 |   0.000324 |  0.004697 |   0.004178 |    0.000497 |   0.000820 | 6&lt;/pre&gt;&lt;p&gt;At this sample resolution, none of the curves pass the round-trip test, but you can see that, once again, the refined 212-point curve shows the least visual error. This test also reinforces the validity of the ΔL* measure. The max round-trip error is predicted by and follows the ΔL*. sRGB is not quite as perceptually uniform as L*, so it’s not a 100% match, but it’s a very good predictor of what will happen as the sample resolution increases. A difference of 6/65535 (0.000092) is most certainly not going to be visible, but if you can drop that error to 5/65535 and save over 1.5KB off the ICC profile size at the same time, that’s a no-brainer.&lt;p&gt;And that just left one question to answer before I could wrap up my curve testing. What would happen if you used these curves in a target profile rather than a source profile? With a source profile, you can predict exactly which values will be looked up or interpolated from the curve, because those values are defined by the bit-depth of the image. 8 bits means exactly 256 values can be looked up, etc. That’s what we tested.&amp;nbsp; With a target profile, however, the curve is used in reverse.&amp;nbsp; Output values become input values and vice-versa.&amp;nbsp; And the input values become unpredictable. They could be any floating-point number between 0 and 1. So that left me with one test to run.&lt;p&gt;For these final numbers, I generated a set of 1 million random floating-point numbers between 0 and 1, and interpolated the output values for them.&amp;nbsp; The round-trip test becomes meaningless in this case because you can’t round-trip a random number, but the rest of the numbers can be interpreted the same as before.&lt;pre style="font-size: 75%; -ms-overflow-x: auto;"&gt;Points | Max Error | Mean Error | RMS Error | Max DeltaL | Mean DeltaL | RMS DeltaL
    19 |  0.044252 |   0.003601 |  0.007417 |   0.162732 |    0.026701 |   0.038136
    20 |  0.039191 |   0.003632 |  0.007520 |   0.141245 |    0.024726 |   0.035427
   *26 |  0.034661 |   0.002002 |  0.005362 |   0.120960 |    0.014326 |   0.021432
    26 |  0.032431 |   0.002261 |  0.005299 |   0.104273 |    0.014305 |   0.020825
    32 |  0.019764 |   0.001752 |  0.004138 |   0.057252 |    0.009441 |   0.013209
    42 |  0.011204 |   0.000714 |  0.001503 |   0.035317 |    0.005147 |   0.007351
    56 |  0.007860 |   0.000522 |  0.001263 |   0.023292 |    0.002987 |   0.004229
    63 |  0.006174 |   0.000354 |  0.000770 |   0.016113 |    0.002292 |   0.003253
   124 |  0.005790 |   0.000207 |  0.000819 |   0.006588 |    0.000700 |   0.001046
   182 |  0.002045 |   0.000112 |  0.000267 |   0.008148 |    0.000588 |   0.000984
  ^182 |  0.001482 |   0.000110 |  0.000261 |   0.005147 |    0.000561 |   0.000935
   212 |  0.002566 |   0.000131 |  0.000382 |   0.005663 |    0.000528 |   0.000803
  ^212 |  0.001905 |   0.000130 |  0.000382 |   0.003738 |    0.000503 |   0.000776
   256 |  0.005447 |   0.000204 |  0.000793 |   0.005248 |    0.000611 |   0.000981
  1024 |  0.008405 |   0.000225 |  0.001035 |   0.004100 |    0.000509 |   0.000853
  4096 |  0.192685 |   0.000331 |  0.004806 |   0.004190 |    0.000497 |   0.000821&lt;/pre&gt;&lt;p&gt;And the results are just about the same as before. So that does it… I’m convinced that my refined 212-point curve is not just the best fit for 8-bit image conversion – I believe it’s the best overall fit possible for the sRGB gamma curve within the restrictions of the v2 ICC profile format. I call it the Magic Curve, natch.&lt;p&gt;For a space-saving curve, any of those options between 32 and 63 points would be a huge improvement over Facebook’s 26-point attempt. I’ll be making a few size-conscious profile options with those and testing them out.&lt;p&gt;And the smallest usable curve is really 20 points. Although the 19-point curve was also valid according to the 8-bit round-trip test, it’s kind of pointless because an odd number of curve points means that the ‘curv’ tag has to be padded by 2 bytes to maintain alignment. You may as well include the extra point if it helps accuracy – and it does in this case. I’ll make what I believe to be the smallest possible sRGB-compatible profile (410 bytes) using that 20-point curve. Note that it is less accurate than the TinyRGB curve, but it gives up less accuracy than the 32-point curve gains. Which is to say, once again, the 26-point curve is not at all special in its size/accuracy ratio.&lt;p&gt;Check the final post in this series for details on those profiles, some real-world tests using them, and of course, download links.&amp;nbsp; In the meantime, I have some investigation to do regarding the XYZ color values used in sRGB profiles.&amp;nbsp; That topic turned out to be another tricky one.&lt;/p&gt;&lt;/description&gt;
      <pubDate>Sat, 31 Mar 2018 22:06:54 -0700</pubDate>
      <a10:updated>2018-04-23T18:40:04-07:00</a10:updated>
    </item>
    <item>
      <guid isPermaLink="true">https://photosauce.net/blog/post/making-a-minimal-srgb-icc-profile-part-1-trim-the-fat-abuse-the-spec</guid>
      <link>https://photosauce.net/blog/post/making-a-minimal-srgb-icc-profile-part-1-trim-the-fat-abuse-the-spec</link>
      <a10:author>
        <a10:name />
      </a10:author>
      <title>Making a Minimal sRGB ICC Profile Part 1: Trim the Fat; Abuse the Spec</title>
      <description>&lt;p&gt;I’ve had a long-standing to-do item in the MagicScaler codebase, which was to add a configuration option to force embedding an sRGB ICC profile in output images or to tag output images with the sRGB colorspace Exif tag. I had assumed that at some point, someone would ask for such a thing or would report an issue that turned out to be related to improper colorspace interpretation in another bit of software, which could be fixed by embedding or tagging the profile. Surprisingly, nobody ever did.&lt;p&gt;MagicScaler has always converted images to &lt;a href="https://en.wikipedia.org/wiki/SRGB" target="_blank"&gt;sRGB&lt;/a&gt; on input and saved its output as sRGB, because sRGB is the &lt;a href="https://www.w3.org/Graphics/Color/sRGB.html" target="_blank"&gt;colorspace of the Web&lt;/a&gt;, and MagicScaler’s primary intended use is Web output. Web browsers and other common software have a &lt;a href="http://cameratico.com/guides/web-browser-color-management-guide/" target="_blank"&gt;spotty history&lt;/a&gt; when it comes to color management support, and most of the ones that don’t do color management simply assume that everything is sRGB. Or they don’t even know what sRGB is and just let the OS or hardware handle colors, meaning they likely get sRGB anyway. Furthermore, most W3C specs related to colors either &lt;a href="https://www.w3.org/TR/2018/PR-css-color-3-20180315/" target="_blank"&gt;require sRGB explicitly&lt;/a&gt; or specify that in the absence of evidence to the contrary, &lt;a href="https://www.w3.org/TR/css3-iccprof#icc-color" target="_blank"&gt;all colors should be treated as sRGB&lt;/a&gt;. The general idea is, make everything sRGB, and you never have to worry about colorspaces again (on the web at least – until we all have HDR monitors and are enjoying our 12-bit JPEGs). 
For the most part, it’s true… which I assume is why nobody ever asked for anything different.&lt;p&gt;A few weeks ago, however, I received a request to add an option to MagicScaler to allow it to skip its internal sRGB working-space conversion and keep the image in its original colorspace, embedding the source ICC profile in the output image. In general, that’s a bad idea, because most of MagicScaler’s algorithms assume they’re working with sRGB (or sRGB-like) data. But the person who made the request had an interesting use case, so I decided to combine that effort with my other to-do item.&lt;h3&gt;Why embed sRGB?&lt;/h3&gt;&lt;p&gt;If the Web is all sRGB all the time, why bother with the profile? Shouldn’t an image without a profile be the same as one with the sRGB profile as far as any web software is concerned? Maybe not…&lt;p&gt;There were two main reasons I had put that item on my list in the first place. One was a scary warning I often saw when using &lt;a href="http://exif.regex.info/exif.cgi" target="_blank"&gt;Jeffrey Friedl’s online Image Metadata Viewer&lt;/a&gt;:&lt;blockquote class="blockquote"&gt;&lt;p&gt;WARNING: No color-space metadata and no embedded color profile: Windows and Mac web browsers treat colors randomly.&lt;p&gt;Images for the web are most widely viewable when in the sRGB color space and with an embedded color profile. See my Introduction to Digital-Image Color Spaces for more information.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The other was that I remembered reading a post by Ryan Mack from the Facebook Engineering team a few years ago about their &lt;a href="https://www.facebook.com/notes/facebook-engineering/under-the-hood-improving-facebook-photos/10150630639853920" target="_blank"&gt;TinyRGB (c2) sRGB-compatible ICC profile&lt;/a&gt;.&lt;p&gt;Going back to 2012, Facebook has been embedding its TinyRGB profile in every thumbnail and resized JPEG it serves.
This extra 524-byte profile has been tacked on to billions of images and likely served hundreds of billions of times. In the post, he explains that they noticed that on certain computers/devices with a display colorspace other than sRGB, some web browsers would treat images as if they were encoded in the display colorspace rather than sRGB. If the display had a wide-gamut colorspace configured, colors in images would be oversaturated/overblown.&lt;p&gt;I have personally never experienced those types of issues, but I’ve also never used a fancy profiled wide-gamut monitor, so I guess I wouldn’t have.&lt;p&gt;Anyway, web browsers have come a very long way since 2012 in terms of color management support, and I wondered whether this is still an issue at all. But I just grabbed a thumbnail of a photo recently posted to Facebook, and they’re still embedding that same TinyRGB profile 6 years later. I’d assume Facebook would be pretty happy to cut 524 bytes off every JPEG they serve if they could do so with no ill effects.&lt;p&gt;Looking into it further, I found a great &lt;a href="https://commons.wikimedia.org/wiki/User:Colin/BrowserTest" target="_blank"&gt;description of the problem broken down by OS and browser&lt;/a&gt;. The linked post indicates that this is still a problem as of its last update in July 2017.&lt;p&gt;So apparently, it’s still an issue, and I reckon I ought to do something about it. The solution recommended in each case is to assign the sRGB profile to images that don’t have a profile attached. But the standard sRGB profile attached to most images (and the one included in Windows) is just over 3KB, and that’s a lot of overhead to correct an issue that affects only a small percentage of users.&lt;p&gt;It was pretty cool, then, that the Facebook engineers were able to create a compatible profile so much smaller. I figured I’d probably want to use their tiny profile as well to keep the overhead down.
However, as I was looking into the copyright/license status of their profile to see if I’d be allowed to embed it in MagicScaler, I ran across an interesting post by Øyvind Kolås (hereafter referred to by his twitter handle, @Pippin), who claimed to have created an even tinier (487-byte) sRGB-compatible profile, which he called &lt;a href="http://pippin.gimp.org/sRGBz/" target="_blank"&gt;sRGBz&lt;/a&gt;.&lt;p&gt;Thus began my own investigation into ICC profile optimization and my own effort to make a better, smaller sRGB-compatible profile. This led me down a deep rabbit hole, where I learned a ton, and I thought I’d document what I learned here. There was so much, I’ll have to split it into multiple posts.&lt;h3&gt;Trim the Fat&lt;/h3&gt;&lt;p&gt;If you’re not familiar with how profiles work or all the many, many things that can be wrong with them, I highly recommend &lt;a href="https://ninedegreesbelow.com/photography/articles.html" target="_blank"&gt;Elle Stone’s articles on color management&lt;/a&gt; for some background. Color management is a tricky subject, and I’ve learned a ton from her site.&lt;p&gt;I’ll also be referring quite a bit to the &lt;a href="http://www.color.org/icc_minor_revision_for_web.pdf" target="_blank"&gt;specification for v2 ICC profiles&lt;/a&gt;, because ultimately, I want to abuse the spec to save those precious, precious bytes… but I want to do so in a completely compatible way.&lt;p&gt;An ICC profile consists of three main parts&lt;ol&gt;&lt;li&gt;A 128-byte header. This is fixed in size, and although it contains some empty reserved padding, there’s nothing that can be done to save space here that won’t break many/most profile readers.&lt;/li&gt;&lt;li&gt;A directory of tags (records) in the profile. Each directory entry consists of a 4-byte tag identifier, a 4-byte offset to the start of the tag data, and a 4-byte length for the tag data. 
That’s 12 bytes per tag for those keeping track, so the fewer tags the better (duh).&lt;/li&gt;&lt;li&gt;The tag data. Each tag starts with an 8-byte tag header, which consists of a 4-byte identifier and 4 bytes of reserved space. The actual tag content follows. Some tags are fixed-length, some are variable. And each tag must start on a 4-byte boundary, so there may be alignment issues that cause wasted space.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Any effort to save space will be constrained by that structure and by the tags required for each profile type. According to the spec, RGB profiles require a minimum of 9 tags: description (desc), copyright (cprt), white point (wtpt), red, green and blue primary values (rXYZ, gXYZ, bXYZ), and red, green and blue tone reproduction curves (rTRC, gTRC, bTRC).&lt;p&gt;As Pippin correctly points out in his post, the black point (bkpt) tag included in the TinyRGB profile is not explicitly required. In fact, the ICC now &lt;a href="http://www.color.org/v2profiles.xalter" target="_blank"&gt;explicitly recommends against&lt;/a&gt; using it. Plus, its data is completely redundant. In a &lt;a href="https://ninedegreesbelow.com/photography/are-your-working-space-profiles-well-behaved.html" target="_blank"&gt;well-behaved profile&lt;/a&gt;, black will be defined as X=0, Y=0, Z=0, as it is in the standard sRGB profile. In the absence of a black point tag, the &lt;a href="http://www.color.org/icc_minor_revision_for_web.pdf" target="_blank"&gt;ICC v2 spec&lt;/a&gt; clearly says it is to be assumed to be (0,0,0). So we can very safely omit that tag. That saves 12 bytes for the tag directory entry, 8 bytes for the tag header and 4 bytes each for the X, Y, and Z values, for a total of 32 bytes. Minus that tag, Facebook’s TinyRGB profile could easily have been 492 bytes instead of 524.&lt;p&gt;The other space-saving change Pippin made was to reduce the length of the profileDescriptionTag and move it to the end to eliminate its effect on tag alignment.
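&lt;p&gt;To put numbers on that bookkeeping, here’s a rough tally in Python. This is only a schematic sketch of the layout rules above -- it ignores the tag-data overlap tricks coming up later, so it won’t reproduce any real profile’s exact size -- but it does show where the 32 bytes go:

```python
# Rough per-tag cost in a v2 ICC profile: a 12-byte directory entry
# plus the tag's data, padded to a 4-byte boundary.

def align4(n):
    """Round n up to the next 4-byte boundary."""
    return (n + 3) & ~3

def profile_size(tag_data_sizes):
    """128-byte header + tag count + directory entries + aligned tag data."""
    size = 128 + 4 + 12 * len(tag_data_sizes)
    return size + sum(align4(d) for d in tag_data_sizes)

# An XYZType tag (wtpt, bkpt, rXYZ, gXYZ, bXYZ) is an 8-byte tag header
# plus three 4-byte values.
XYZ = 8 + 12

# Dropping the redundant bkpt tag saves its directory entry plus its data.
with_bkpt    = profile_size([XYZ] * 5 + [64] * 3 + [94, 12])
without_bkpt = profile_size([XYZ] * 4 + [64] * 3 + [94, 12])
print(with_bkpt - without_bkpt)  # 32
```

&lt;p&gt;Run against a TinyRGB-like tag set, the difference works out to exactly the 32 bytes tallied above.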
He claimed that by reducing the description to a single character (z) from Facebook’s 2-character name (c2), he could save one byte, plus another 4 from the alignment, making a 5-byte reduction. That didn’t add up for me: given that ICC profiles use 4-byte alignment, there’s no way for alignment to waste more than 3 bytes. Since that sounded fishy, I loaded up both the 487-byte and 491-byte versions of sRGBz in the &lt;a href="http://www.color.org/profdump.xalter" target="_blank"&gt;ICC Profile Dump Utility&lt;/a&gt; and validated them. They both reported the following: &lt;pre&gt;NonCompliant! - profileDescriptionTag - ScriptCode must contain 67 bytes.&lt;/pre&gt;&lt;p&gt;That sent me back to the spec to dig into the structure of the profileDescriptionTag. It is defined as a complex structure that contains the description in 3 different formats: 7-bit ASCII, Unicode, and ScriptCode. The ASCII description is to be treated as the canonical name of the profile and is required; the other two are optional.
In case, like me, you’ve never heard of ScriptCode, it appears to be a &lt;a href="https://developer.apple.com/documentation/coreservices/1399862-encoding_variants_for_mac_os_enc" target="_blank"&gt;thing from Mac OS&lt;/a&gt; (the old obsolete one, not OS X).&lt;p&gt;The length/structure of the tag is as follows:&lt;ol&gt;&lt;li&gt;8-byte header&lt;/li&gt;&lt;li&gt;4-byte length of the ASCII description (including null terminator)&lt;/li&gt;&lt;li&gt;ASCII data of variable length -- at least one printable character, plus the null&lt;/li&gt;&lt;li&gt;4-byte Unicode language code&lt;/li&gt;&lt;li&gt;4-byte Unicode description length&lt;/li&gt;&lt;li&gt;Unicode description of variable length -- can have length of 0&lt;/li&gt;&lt;li&gt;2-byte ScriptCode code&lt;/li&gt;&lt;li&gt;1-byte ScriptCode description length&lt;/li&gt;&lt;li&gt;67 bytes reserved for ScriptCode data&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;I couldn’t even begin to guess the reason behind a fixed-length reserved space for the ScriptCode data when the others are variable-length, but that’s what the validator was complaining about. If we assume both the Unicode and ScriptCode descriptions will be empty, the length of the description tag will be 8 + 4 + 4 + 4 + 2 + 1 + 67 = 90 bytes, plus the length of the ASCII string, plus its null terminator. That would be 92 bytes for a 1-character description or 96 for a 5-character description. Those are incorrectly listed as 91 and 95 bytes in the sRGBz-487 and sRGBz profiles, respectively, and the files are 1 byte short each. So they are, in fact, not valid.&lt;p&gt;Interestingly, if you add an extra byte to the profile without adjusting the length of the description tag, the validator doesn’t complain. It’s only because the tag is at the end of the file and there’s no padding before another aligned tag that the validator has an issue.&lt;p&gt;That prompted me to look at the TinyRGB/c2 profile to see where the math went wrong, and it turns out theirs is wrong too.
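&lt;p&gt;That arithmetic is easy to sanity-check with a few lines of Python:

```python
# Length of a v2 profileDescriptionTag with empty Unicode and ScriptCode
# descriptions: everything is fixed-size except the ASCII name + terminator.

def desc_tag_length(name):
    return (8                # tag type signature + reserved
            + 4              # ASCII description length field
            + len(name) + 1  # ASCII data plus null terminator
            + 4 + 4          # Unicode language code + (zero) length
            + 2 + 1          # ScriptCode code + (zero) length
            + 67)            # fixed reserved ScriptCode area

print(desc_tag_length("z"))      # 92 (sRGBz-487 lists 91)
print(desc_tag_length("sRGBz"))  # 96 (sRGBz lists 95)
print(desc_tag_length("c2"))     # 93
```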
They have the description length listed as 94 bytes, but it really should only be 93. They include the description tag early and pad it out to 96 bytes for alignment, which is enough to satisfy the ICC validator tool, but it looks like it might have &lt;a href="https://forums.adobe.com/thread/2000335" target="_blank"&gt;caused issues&lt;/a&gt; in certain versions of Adobe Illustrator.&lt;p&gt;In any case, they could have fit 3 more characters in the description for no extra space cost had they wished.&lt;p&gt;Anyway, after correcting the description tag lengths in the sRGBz profiles, they come out to 488 bytes for the minimal 1-character-name version and 492 for the friendly-named version, same as TinyRGB minus the black point tag.&lt;p&gt;But we can do better. Quite a bit better, actually…&lt;h3&gt;Abuse the Spec&lt;/h3&gt;&lt;p&gt;Pippin mentions in his post that he experimented with packing some tag data in the 44 bytes of reserved padding of the profile header but that it didn’t work out. So, while that’s not an option, there’s another even larger bit of padding that we &lt;i&gt;can&lt;/i&gt; put some data into: the 67 bytes reserved for the ScriptCode description. As a test, I chose to move the tone reproduction curve data, which just happens to be 64 bytes. It’s perfectly legal for tag data to overlap, and in fact, for the TRC tags, it’s expected. Well-behaved RGB profiles should have identical curves in the red, green, and blue TRC tags, and it’s common for the three directory entries to refer to a single copy of the data for all of these. This is the case in the standard HP/Microsoft sRGB profile (which would be 4K larger otherwise) and in the TinyRGB profile. If we move that tag data to overlap the ScriptCode reserved area, we can save the full 64 bytes.&lt;p&gt;As for whether that’s safe, I’ll say the following:&lt;ol&gt;&lt;li&gt;ScriptCode is a Mac OS thing, which is to say it’s not a thing anymore. 
Nobody will ever be looking at that area for ScriptCode.&lt;/li&gt;&lt;li&gt;The profileDescriptionTag has a 1-byte ScriptCode length field to indicate how many of the 67 reserved bytes contain description data. We set that to 0, so even if some software did read that section of the tag, it shouldn’t go on to read any of the data.&lt;/li&gt;&lt;li&gt;Although the spec does explicitly say that unused bytes in the ScriptCode area should be set to 0, no software I’ve encountered has had any problem with that area containing the TRC data, and all software should be fine with the TRC tag data not having its own dedicated space.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;That means we can cut the TinyRGB profile down to 428 bytes simply by removing the black point tag and relocating the TRC data. Finally, if we’re clever with the alignment, we can shave another 4 bytes off. Remember I said that the TinyRGB profile had its description tag length wrong? Well, if we correct that, we can save 1 byte, and it had 2 bytes of padding to align the tag that follows (the copyright tag in their case). Plus, we still have 3 unused bytes left over from the 67-byte ScriptCode area.&lt;p&gt;The ScriptCode area is tricky because the position of that section is dependent on the length of the ASCII description. Since we have to align the start of the description tag on a 4-byte boundary, if we were to use a minimum 1-character ASCII description, the ScriptCode data section would start at an offset of 25 from there, leaving the first 3 bytes unusable because we can’t start a new tag until offset 28. That means wasting the first 3 of those 67 as padding. That would still allow us to use the last 64 bytes to hold the TRC tag data, though, and the alignment would be correct to start the next tag immediately after.&lt;p&gt;OR… we could use three extra description characters to give a more descriptive name and have the 67 bytes start on a 4-byte boundary. 
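&lt;p&gt;A few lines of Python show how the ASCII name length moves the ScriptCode area around:

```python
# Offset of the 67-byte ScriptCode area within the description tag (which
# itself starts on a 4-byte boundary). The area begins after the 8-byte tag
# header, the 4-byte ASCII length field, the name + null, the 4+4 Unicode
# fields, and the 2+1 ScriptCode code/length fields.

def scriptcode_offset(name_len):
    return 8 + 4 + (name_len + 1) + 4 + 4 + 2 + 1

for name_len in (1, 2, 3, 4):
    off = scriptcode_offset(name_len)
    pad = -off % 4  # bytes of the area wasted before a 4-byte boundary
    print(f"{name_len}-char name: ScriptCode area at offset {off}, {pad} byte(s) of padding")
```

&lt;p&gt;Only a 4-character name (or 8, 12, …) lands the ScriptCode area exactly on a 4-byte boundary.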
I chose that option, making the description ‘c2ci’ to differentiate it from the original. That allows the 64 bytes of the TRC tag to start at the beginning of the ScriptCode block and leaves the last 3 for the start of the next tag.&lt;p&gt;Overall, the length of the description tag ends up being 95 bytes, but as far as the alignment of the following tags goes, it doesn’t matter, because they overlap. It’s as if the length is actually 28, which was the offset at which we started the curve data. That 28, plus the 64 of the TRC, allows the next tag to start at offset 92, meaning we saved 4 bytes over the 96-byte alignment that Facebook used.&lt;p&gt;There’s one last place that space could be saved if we were so inclined. Facebook used ‘FB’ for their copyright text but then had to include a byte of padding because that results in an 11-byte tag. If we moved the copyright tag to the end of the file, we wouldn’t need that padding, because there’s no need to align for another tag. That would make the final size 423 bytes. I liked the change Pippin made in his sRGBz profile, though, which was to set the copyright text to ‘CC0’ – a value that fits perfectly in a 12-byte tag. Facebook has &lt;a href="https://phabricator.wikimedia.org/T100976#1360771" target="_blank"&gt;since released their profile under the CC0 license&lt;/a&gt;, so that’s a good change to make in my alternate.&lt;p&gt;And that’s my compact profile starting point. At 424 bytes (an even 100-byte savings from the original) it can have the exact same data as TinyRGB/c2 -- minus the redundant black point tag, plus some extra description characters and corrected copyright text. &lt;a href="https://photosauce.net/blog/content/c2ci.icc" target="_blank"&gt;Here’s that file&lt;/a&gt; for reference if you want to check it out. But let me say, you won’t want to use it for anything real.
I’m going to do much better before I’m done.&lt;h3&gt;Not Just Tinier – Better&lt;/h3&gt;&lt;p&gt;So what’s wrong with the TinyRGB or its new tinier variant? A couple of things, actually…&lt;p&gt;I’ve mentioned well-behaved RGB profiles a couple of times now, and if you didn’t follow the link to &lt;a href="https://ninedegreesbelow.com/photography/are-your-working-space-profiles-well-behaved.html" target="_blank"&gt;Elle Stone’s post&lt;/a&gt; on the subject, I highly recommend you do that. Pippin mentions in his sRGBz post that he improved the matrix precision of his profile, and what that means is that his profile was created using XYZ color values that are balanced to allow for properly-neutral grey colors. TinyRGB uses the unbalanced values from the old HP/Microsoft sRGB profile. I’ll be ensuring I’ve got the most correct values possible in my profile.&lt;p&gt;And, like Pippin, I was curious about that 26-point TRC tag Facebook came up with. It turns out, that’s not all that great either.&lt;/p&gt;&lt;p&gt;I’ll have entire posts on both of those topics, because I made some fascinating (to me at least) findings in researching and testing them. Tune in next time for my post on finding the perfect curve…&lt;/p&gt;</description>
      <pubDate>Tue, 27 Mar 2018 06:02:00 -0700</pubDate>
      <a10:updated>2018-04-07T09:35:47-07:00</a10:updated>
    </item>
    <item>
      <guid isPermaLink="true">https://photosauce.net/blog/post/5-reasons-you-should-stop-using-systemdrawing-from-aspnet</guid>
      <link>https://photosauce.net/blog/post/5-reasons-you-should-stop-using-systemdrawing-from-aspnet</link>
      <a10:author>
        <a10:name />
      </a10:author>
      <title>5 Reasons You Should Stop Using System.Drawing from ASP.NET</title>
<description>&lt;div style="background-position: center 40%; width: 100%; height: 480px; margin-bottom: 2rem; position: relative; background-image: url(&amp;quot;/blog/content/kitten.jpeg&amp;quot;); background-repeat: no-repeat; background-size: cover;"&gt;&lt;div style="background: rgba(255, 255, 255, 0.85); padding: 2rem; width: 100%; text-align: center; bottom: 0px; position: absolute;"&gt;Every time you use System.Drawing from ASP.NET, something bad happens to a kitten.&lt;br&gt;I don’t know what, exactly... but rest assured, kittens hate it.&lt;/div&gt;&lt;/div&gt;&lt;p&gt;Well, they’ve gone and done it. The corefx team has finally acquiesced to the &lt;a href="https://github.com/dotnet/corefx/search?q=system.drawing&amp;amp;type=Issues" target="_blank"&gt;many requests&lt;/a&gt; that they include System.Drawing in .NET Core.&lt;p&gt;The upcoming &lt;a href="https://github.com/dotnet/corefx/pull/20593" target="_blank"&gt;System.Drawing.Common&lt;/a&gt; package will include most of the System.Drawing functionality from the full .NET Framework and is meant to be used as a compatibility option for those who wish to migrate to .NET Core but were blocked by those dependencies. From that standpoint, Microsoft is doing the right thing. Reducing friction as far as .NET Core adoption is concerned is a worthy goal.&lt;p&gt;On the other hand, System.Drawing is one of the most poorly implemented and most developer-abused areas of the .NET Framework, and many of us were hoping that the uptake of .NET Core would mean a slow death for System.Drawing. And with that death would come the opportunity to build something better.&lt;p&gt;For example, the mono team have released a .NET-compatible wrapper for the &lt;a href="https://skia.org/" target="_blank"&gt;Skia&lt;/a&gt; cross-platform graphics library from Google, called &lt;a href="https://github.com/mono/SkiaSharp" target="_blank"&gt;SkiaSharp&lt;/a&gt;.
Nuget has come a long way in supporting platform-native libraries, so installation is simple.  Skia is quite full-featured, and its performance blows System.Drawing away.&lt;p&gt;The &lt;a href="https://github.com/JimBobSquarePants/ImageSharp" target="_blank"&gt;ImageSharp&lt;/a&gt; team have also done tremendous work, replicating a good deal of the System.Drawing functionality but with a nicer API and a 100% C# implementation. This one isn’t quite ready for production use yet, but it appears to be getting close. One word of warning with this library, though, since we’re talking about server apps: As of now, its default configuration uses Parallel.For internally to speed up some of its operations, which means it will tie up more worker threads from your ASP.NET thread pool, &lt;a href="https://www.hanselman.com/blog/BackToParallelBasicsDontBlockYourThreadsMakeAsyncIOWorkForYou.aspx"&gt;ultimately reducing overall application throughput&lt;/a&gt;. Hopefully this will be addressed before release, but it only takes one line of code to change that configuration to make it server-friendly.&lt;p&gt;Anyway, if you’re drawing, graphing, or rendering text to images in a server-side app, either of these would be worth a serious look as an upgrade from System.Drawing, whether you’re moving to .NET Core or not.&lt;p&gt;For my part, I’ve built a high-performance image processing pipeline for .NET and .NET Core that delivers image quality that System.Drawing can’t match and that does it in a highly scalable architecture designed specifically for server use. It’s Windows only for now, but cross-platform is on the roadmap. 
If you use System.Drawing (or anything else) to resize images on the server, you’d do well to evaluate &lt;a href="https://github.com/saucecontrol/PhotoSauce" target="_blank"&gt;MagicScaler&lt;/a&gt; as a replacement.&lt;p&gt;But the resurrection of System.Drawing, while easing the transition for some developers, will probably kill much of the momentum these projects have gained as developers were forced to search for alternatives. Unfortunately, in the .NET ecosystem, a Microsoft library/package will almost always win out over other options, no matter how superior those alternatives might be.&lt;p&gt;This post is an attempt to make clear some of the shortcomings of System.Drawing in the hopes that developers will evaluate the alternatives even though System.Drawing remains an option.&lt;p&gt;I’ll start with the oft-quoted &lt;a href="https://msdn.microsoft.com/en-us/library/system.drawing.aspx#Remarks" target="_blank"&gt;disclaimer&lt;/a&gt; from the System.Drawing documentation. This disclaimer came up a couple of times in the &lt;a href="https://github.com/dotnet/corefx/issues/20325" target="_blank"&gt;GitHub discussion debating System.Drawing.Common&lt;/a&gt;.&lt;blockquote class="blockquote"&gt;"Classes within the System.Drawing namespace are not supported for use within a Windows or ASP.NET service. Attempting to use these classes from within one of these application types may produce unexpected problems, such as diminished service performance and run-time exceptions"&lt;/blockquote&gt;&lt;p&gt;Like many of you, I read that disclaimer a long time ago, and then I went ahead and used System.Drawing in my ASP.NET apps anyway. Why? Because I like to live dangerously. Either that, or there just weren’t any other viable options. And you know what? Nothing bad happened. I probably shouldn’t have said that, but I’ll bet plenty of you have had the same experience. 
So why not keep using System.Drawing or the libraries built around it?&lt;h3&gt;Reason #1: GDI Handles&lt;/h3&gt;&lt;p&gt;If you ever did have a problem using System.Drawing on the server, this was probably it. And if you haven’t yet, this is the one you’re most likely to see.&lt;p&gt;System.Drawing is, for the most part, a thin wrapper around the Windows GDI+ API. Most System.Drawing objects are &lt;a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms724291(v=vs.85).aspx" target="_blank"&gt;backed by a GDI handle&lt;/a&gt;, and there are a limited number of these available per process and per user session. Once that limit is reached, you’ll encounter out-of-memory exceptions and/or GDI+ ‘generic’ errors.&lt;p&gt;The problem is, .NET’s garbage collection and finalization process may delay the release of these handles for long enough that you can overrun the limit even under relatively light loads. If you forget to call Dispose() (or don’t know you should) on objects that hold one of those handles, you run a very real risk of encountering these errors in your environment. And like most resource-limit/leak bugs, it will probably get missed during testing and only bite you once you’ve gone live. Naturally, it will also occur when your app is under its heaviest load, so the max number of users will know your shame.&lt;p&gt;The &lt;a href="https://en.wikipedia.org/wiki/Graphics_Device_Interface#Limitations" target="_blank"&gt;per-process and per-session limits&lt;/a&gt; vary by OS version, and the per-process limit is configurable. But no matter the version, GDI handles are represented with a USHORT internally, so there’s a hard limit of 65,536 handles per user session, and even well-behaved apps are at risk of encountering this limit under sufficient load. When you consider the fact that more powerful servers allow us to serve more and more concurrent users from a single instance, this risk becomes more real.
And really, who wants to build software with a known hard limit to its scalability?&lt;h3&gt;Reason #2: Concurrency&lt;/h3&gt;&lt;p&gt;GDI+ has always had issues with concurrency, and although many of those were addressed with &lt;a href="https://blogs.msdn.microsoft.com/e7/2009/04/25/engineering-windows-7-graphics-performance/" target="_blank"&gt;architectural changes in Windows 7/Windows Server 2008 R2&lt;/a&gt;, you will still encounter some of them in newer versions. Most prominent is a &lt;a href="https://stackoverflow.com/questions/3719748/parallelizing-gdi-image-resizing-net" target="_blank"&gt;process-wide lock&lt;/a&gt; held by GDI+ during any DrawImage() operation. If you’re resizing images on the server using System.Drawing (or the libraries that wrap it), DrawImage() is likely at the core of that code.&lt;p&gt;What’s more, when you issue multiple concurrent DrawImage() calls, &lt;em&gt;all&lt;/em&gt; of them will block until &lt;em&gt;all&lt;/em&gt; of them complete. Even if the response time isn’t an issue for you (why not? do you hate your users?), consider that any memory resources tied up in those requests and any GDI handles held by objects related to those requests are tied up for the duration. It actually doesn’t take very much load on the server for this to cause problems.&lt;p&gt;There are, of course, workarounds for this specific issue. Some developers spawn an external process for each DrawImage() operation, for example. But really, these workarounds just add extra fragility to something you really shouldn’t be doing in the first place.&lt;h3&gt;Reason #3: Memory&lt;/h3&gt;&lt;p&gt;Consider an ASP.NET handler that generates a chart. 
It might go something like this:&lt;ol&gt;&lt;li&gt;Create a Bitmap as a canvas&lt;/li&gt;&lt;li&gt;Draw some shapes on that Bitmap using Pens and/or Brushes&lt;/li&gt;&lt;li&gt;Draw some text using one or more Fonts&lt;/li&gt;&lt;li&gt;Save the Bitmap as PNG to a MemoryStream&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Let’s say the chart is 600x400 pixels. That’s a total of 240,000 pixels, multiplied by 4 bytes per pixel for the default RGBA format, so 960,000 bytes for the Bitmap, plus some memory for the drawing objects and the save buffer. We’ll call it 1MB for that request. You’re probably not going to run into memory issues in this scenario, and if you do, you might be bumping up against that handle limit I mentioned earlier because of all those Bitmaps and Pens and Brushes and Fonts.&lt;p&gt;The real problem comes when you use System.Drawing for imaging tasks. System.Drawing is primarily a graphics library, and graphics libraries tend to be built around the idea that everything is a bitmap in memory. That’s fine if you’re thinking small. But images can be really big, and they’re getting bigger every day as high-megapixel cameras get cheaper.&lt;p&gt;If you take System.Drawing’s naive approach to imaging, you’ll end up with something like this for an image resizing handler:&lt;ol&gt;&lt;li&gt;Create a Bitmap as a canvas for the destination image.&lt;/li&gt;&lt;li&gt;Load the source image into another Bitmap.&lt;/li&gt;&lt;li&gt;DrawImage() the source onto the destination, resized/resampled.&lt;/li&gt;&lt;li&gt;Save the destination Bitmap as JPEG to a MemoryStream.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;We’ll assume the same 600x400 output as before, so we have 1MB again for the destination image and Stream. But let’s imagine someone has uploaded a 24-megapixel image from their fancy new DSLR, so we’ll need 6000x4000 pixels times 3 bytes per pixel (72MB) for the decoded RGB source Bitmap. 
And we’d use System.Drawing’s HighQualityBicubic resampling because that’s the only one that looks good, so we need to add another 6000x4000 times 4 bytes per pixel for the &lt;a href="/blog/post/image-scaling-with-gdi-part-5-push-vs-pull-and-image-validation" target="_blank"&gt;PRGBA conversion that it uses internally&lt;/a&gt;, making another 96MB. That’s 169MB(!) for a single image resizing request.&lt;p&gt;Now imagine you have more than one user doing the same thing. Now remember that those requests will block until they’re all complete. How many does it take before you run out of memory? And even if you’re not concerned about running completely out of memory, remember there are lots of ways your server memory could be put to better use than holding on to a bunch of pixels. Consider the impact of memory pressure on other parts of the app/system:&lt;ul&gt;&lt;li&gt;The ASP.NET cache may start dumping items that are expensive to re-create&lt;/li&gt;&lt;li&gt;The garbage collector will run more frequently, slowing the app down&lt;/li&gt;&lt;li&gt;The IIS kernel cache or Windows file system caches may have to remove useful items&lt;/li&gt;&lt;li&gt;The App Pool may overrun its configured memory limit and get recycled&lt;/li&gt;&lt;li&gt;Windows may have to start paging memory to disk, slowing the entire system&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;None of those are things you want, right?&lt;p&gt;A library designed specifically for imaging tasks will approach this problem in a very different way. It has no need to load either the source or destination image completely into memory. If you’re not going to draw on it, you don’t need a canvas/bitmap. 
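&lt;p&gt;Here’s the arithmetic from that example in a few lines of Python (using decimal megabytes, as in the figures above; the streaming working set assumes a hypothetical 16-line resampling window, purely for illustration):

```python
# Memory tally for the naive resize described above: a 6000x4000 source
# decoded to RGB, an extra PRGBA copy for HighQualityBicubic, and a
# 600x400 RGBA destination.

MB = 1_000_000
src_w, src_h = 6000, 4000
dst_w, dst_h = 600, 400

src_rgb   = src_w * src_h * 3   # decoded source bitmap
src_prgba = src_w * src_h * 4   # internal PRGBA conversion copy
dst_rgba  = dst_w * dst_h * 4   # destination canvas

naive = src_rgb + src_prgba + dst_rgba
print(f"naive: ~{naive / MB:.0f} MB per request")   # ~169 MB

# A scanline-streaming resizer holds only a small window of source lines
# plus one output line (the 16-line window is an assumption).
streaming = src_w * 3 * 16 + dst_w * 4
print(f"streaming: ~{streaming // 1000} KB working set")
```

&lt;p&gt;The streaming working set scales with a scanline rather than with the whole source image, which is what makes the line-by-line approach practical.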
It goes more like this:&lt;ol&gt;&lt;li&gt;Create a Stream for the output JPEG encoder&lt;/li&gt;&lt;li&gt;Load a single line from the source image and shrink it horizontally.&lt;/li&gt;&lt;li&gt;Repeat for as many lines from the source as required to create a single line of output&lt;/li&gt;&lt;li&gt;Shrink intermediate lines vertically and write a single output line to the encoder&lt;/li&gt;&lt;li&gt;Goto 2. Repeat until all lines are processed.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Using this method, the same image resizing task can be performed using around 1MB of memory total, and even larger images incur only a small incremental overhead.&lt;p&gt;I know of only one .NET library that is optimized in this way, and I’ll give you a hint: it’s not System.Drawing.&lt;h3&gt;Reason #4: CPU&lt;/h3&gt;&lt;p&gt;Another side-effect of the fact that System.Drawing is more graphics-focused than imaging-focused is that DrawImage() is quite inefficient CPU-wise. I have covered this in quite a bit of detail in a &lt;a href="/blog/post/lies-damned-lies-and-benchmarks-part-2-drawimage-why-u-so-slow" target="_blank"&gt;previous post&lt;/a&gt;, but that discussion can be summarized with the following facts:&lt;ul&gt;&lt;li&gt;System.Drawing’s HighQualityBicubic scaler works only in PRGBA pixel format. In almost all cases, this means an extra copy of the image. Not only does this use (considerably) more RAM, it also burns CPU cycles on the conversion and the processing of the extra alpha channel.&lt;/li&gt;&lt;li&gt;Even after the image is in its native format, the HighQualityBicubic scaler performs roughly 4x as many calculations as are necessary to obtain the correct resampling results.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;These facts add up to considerable wasted CPU cycles. In a pay-per-minute cloud environment, this directly contributes to higher hosting costs. And of course your response times will suffer.&lt;p&gt;And think of all the extra electricity wasted and heat generated. 
Your use of System.Drawing for imaging tasks is directly contributing to global warming. You monster.&lt;h3&gt;Reason #5: Imaging is deceptively complicated&lt;/h3&gt;&lt;p&gt;Performance aside, System.Drawing doesn’t get imaging right in many ways. Using System.Drawing means either living with incorrect output or learning all about ICC Profiles, Color Quantizers, Exif Orientation correction, and many more domain-specific topics. It’s a rabbit hole most developers have neither the time nor inclination to explore.&lt;p&gt;Libraries like ImageResizer and ImageProcessor have gained many fans by taking care of some of these details, but beware, they’re System.Drawing on the inside, and they come with all the baggage I've detailed in this post.&lt;h3&gt;Bonus Reason: You can do better&lt;/h3&gt;&lt;p&gt;If, like me, you’ve had to wear glasses at some point in your life, you probably remember what it was like the first time you put them on. I thought I could see ok, and if I squinted just right, things were pretty clear. But then I slid those glasses on, and the world became a lot more detailed than I knew it could.&lt;/p&gt;&lt;p&gt;System.Drawing is a lot like that. It does ok if you &lt;a href="/blog/post/image-scaling-with-gdi-part-3-drawimage-and-the-settings-that-affect-it" target="_blank"&gt; get the settings just right&lt;/a&gt;, but you might be surprised how much better your images could look if you used a better tool.&lt;/p&gt;&lt;p&gt;I’ll just leave this here as an example. This is the very best System.Drawing can do versus MagicScaler’s default settings. 
Maybe your app would benefit from getting glasses…&lt;/p&gt;&lt;div class="m-x-auto m-y-2 dezoom"&gt; 	&lt;figure class="swipe-container m-b-0"&gt; 		&lt;img src="/blog/content/waterfallgdi.jpeg" width="800" height="533" class="img-fluid" alt="System.Drawing"&gt; 		&lt;span class="swipe-image-label"&gt;System.Drawing&lt;/span&gt; 		&lt;div class="swipe-over-image"&gt; 			&lt;img src="/blog/content/waterfallmag.jpeg" width="800" height="533" alt="MagicScaler"&gt; 			&lt;span class="swipe-image-label"&gt;MagicScaler&lt;/span&gt; 		&lt;/div&gt; 		&lt;div class="swipe-handle"&gt;&lt;/div&gt; 	&lt;/figure&gt; 	&lt;div class="small m-b-1 text-xs-right" style="padding: 0.25rem 0;"&gt;&lt;a href="https://unsplash.com/@jakobowens1"&gt;Photo by Jakob Owens&lt;/a&gt;&lt;/div&gt; &lt;/div&gt;&lt;p&gt;&lt;p&gt;So look around, &lt;a href="https://blogs.msdn.microsoft.com/dotnet/2017/01/19/net-core-image-processing/" target="_blank"&gt;evaluate the alternatives&lt;/a&gt;, and please, for the love of kittens, stop using System.Drawing in ASP.NET.</description>
      <pubDate>Mon, 12 Jun 2017 04:41:43 -0700</pubDate>
      <a10:updated>2019-05-30T05:50:18-07:00</a10:updated>
    </item>
    <item>
      <guid isPermaLink="true">https://photosauce.net/blog/post/lies-damned-lies-and-benchmarks-part-3-varying-variables</guid>
      <link>https://photosauce.net/blog/post/lies-damned-lies-and-benchmarks-part-3-varying-variables</link>
      <a10:author>
        <a10:name />
      </a10:author>
      <title>Lies, Damned Lies, and Benchmarks Part 3: Varying Variables</title>
      <description>&lt;p&gt;This is the final part of my review of the FastScaling plugin for ImageResizer.&amp;nbsp; Over the first two parts of this series, we examined some of the performance claims made by the FastScaling documentation. To review, those claims could be grouped into three categories:&lt;/p&gt; &lt;ol&gt; &lt;li&gt;It claimed that its orthogonal processing was more efficient than DrawImage()’s ‘general distortion filter’. That was true, but other architectural deficiencies cancel out that benefit in many cases. We saw that at equivalent output quality and on a single thread, FastScaling doesn’t offer much, if any, improvement over optimized DrawImage() usage. Its native RGB processing is more efficient, but even with that advantage, it barely eked out a win in our JPEG test. With other container formats, results may vary. With other pixel formats, it does significantly worse than GDI+.&lt;/li&gt;&lt;li&gt;It claimed to break free of the single-thread limit imposed by DrawImage(), allowing it to scale up with more processors/cores. That was also true. But we saw the cost: it’s allowed to run away with your server resources. Memory is particularly hard-hit since FastScaling seems to require even more memory per image than DrawImage() does.  &lt;li&gt;It claimed performance improvements through dynamic adjustment of scaler settings and through what they call ‘averaging optimizations’. We have not yet explored these.&lt;/li&gt;&lt;/ol&gt; &lt;p&gt;Point 2 above could easily be the end of the road for this series. It’s a deal-breaker for true scalability. I certainly wouldn’t let FastScaling anywhere near any of my servers. But I’m still curious about that last point. I do some dynamic adjustments of quality settings in MagicScaler as well, and I’m interested to see how they compare.&lt;/p&gt; &lt;p&gt;I’m also curious as to how they arrived at such impressive numbers in their benchmarks. 
Nothing I’ve seen indicates FastScaling is anywhere near as fast as they say, but I’d like to see if I can get close to some of those numbers or at least explain how they got them. I came up with my own baseline for my own tests, but I might need to reset that baseline if I’m going to match theirs.&lt;/p&gt; &lt;h2&gt;Narrowing the Scope&lt;/h2&gt; &lt;p&gt;Beyond the baseline problem, there’s a problem of variables. I showed how limiting benchmarks to a single variable at a time makes them much more educational. Conversely, carefully choosing those variables can allow you to present a distorted view of reality. I’d like to see if I can determine how they arrived at theirs, and why. Right off the bat, there are several to consider, such as: &lt;/p&gt; &lt;ul&gt; &lt;li&gt;Input image size, container format and pixel format  &lt;li&gt;Output image size, container format and pixel format  &lt;li&gt;Interpolation method and parameters (this can be extremely complex and variable itself)  &lt;li&gt;Shortcuts, such as in-decoder transformations, or intermediate processing&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;JPEG input and output are clearly the most representative of a real-world web workload, so that part is a no-brainer. As for the input image size, I mentioned before that a larger input image exaggerates the performance difference in the scalers. I used a 24MP image for my initial tests, but the 16MP input used in the FastScaling benchmarks is also reasonable for those purposes. I’ll go ahead and switch to that size now. We’re also going to be doing only RGB (&lt;a href="https://en.wikipedia.org/wiki/YCbCr" target="_blank"&gt;YCbCr&lt;/a&gt;, actually) JPEGs since they’re most typical.&lt;/p&gt; &lt;p&gt;The image I chose for this round of tests comes from the USGS flickr. 
The &lt;a href="https://www.flickr.com/photos/usgeologicalsurvey/16490346262/" target="_blank"&gt;original file&lt;/a&gt; had an embedded Nikon ICC profile, which adds considerable processing time to the decode/conversion step. This would make things particularly unfair when using MagicScaler’s ability to resize within the decoder, so in order to keep the workload as similar as possible for all the scalers, I converted the image to the sRGB colorspace in Photoshop and re-saved it without an embedded profile for these benchmarks. The converted file is &lt;a href="/blog/content/turtleegglarge.jpg"&gt;here&lt;/a&gt;&lt;/p&gt; &lt;p&gt;So the first real decision we have to make is output size. It has to be something realistic for web scenarios, but beyond that, it doesn’t seem like all that important a choice. I chose an output width of 400px for my earlier tests simply because I find that size easy to manage. I can do screenshots of my test app without them being too big, and I can easily take in all of the images in a single glance so differences in visual results are easy to spot. The FastScaling benchmarks used 800px output, and I wondered whether there was a reason for that. If you saw my &lt;a href="/blog/post/image-scaling-with-gdi-part-1-introducing-the-reference-resizer"&gt;earlier benchmarks&lt;/a&gt; between ImageResizer’s GDI+ implementation and my own reference GDI+ resizer, you may remember that at larger output sizes, the sub-optimal GDI+ settings used by ImageResizer made it significantly slower. I wondered if that handicap would make FastScaling look better by comparison, so I ran a few tests using my baseline settings from Part 1 of this series. 
The idea here is to keep them on even ground and change only the output size variable for now.&lt;/p&gt; &lt;p&gt;&lt;img width="446" height="339" title="fsbaseline16mp" alt="fsbaseline16mp" src="/blog/posts/files/9412cfae-b800-4647-90a6-efefbfee6ef7.png"&gt;&lt;/p&gt; &lt;p&gt;What’s interesting here is that the two scalers in ImageResizer follow a completely different trajectory than the reference GDI+ resizer and MagicScaler. ImageResizer is clearly paying a performance penalty at larger output sizes, but that penalty is paid by both of its scalers. There doesn’t appear to be any special reason they chose the 800px output size. In fact, at that size, FastScaling is actually slower than the reference GDI+ resizer. It is noteworthy that FastScaling beats ImageResizer’s GDI+ implementation at all output sizes, but the margin is modest, at a relatively constant 40-50ms. By comparison, MagicScaler maintains a steady 120-130ms advantage over the reference GDI+ resizer.&lt;/p&gt; &lt;p&gt;With these results in mind, I don’t think it’s at all unfair to stick with my preferred 400px output width for the remaining benchmarks. FastScaling actually holds a slight edge over the reference GDI+ resizer at that size, and we’ll have an easier time comparing output quality once we start enabling some of the processing shortcuts that FastScaling and MagicScaler support. This is the new baseline I’ll be using going forward.&lt;/p&gt; &lt;p&gt;&lt;img width="840" height="611" class="dezoom" title="fsbaseline16mpjpeg" alt="fsbaseline16mpjpeg" src="/blog/posts/files/d2c75fd1-ee87-4839-9b6d-90800a130f8d.png" border="0"&gt;&lt;/p&gt; &lt;h2&gt;Speaking of Quality…&lt;/h2&gt; &lt;p&gt;Before I start sacrificing quality for speed in these comparisons, there’s one last topic I want to visit from the FastScaling documentation. 
Beyond the performance claims made in the docs, they also claim to have improved quality over DrawImage().&lt;/p&gt; &lt;blockquote class="blockquote"&gt; &lt;p&gt;Another failing of DrawImage is that it only averages pixels in the sRGB color space. sRGB is a perceptual color space, meaning that fewer numbers are assigned to bright colors; most are assigned to shades of black. When downscaling (weighted averaging), this tends to exaggerate shadows and make highlights disappear, although it is just fine when upscaling.  &lt;p class="m-b-0"&gt;FastScaling defaults to working in the srgb color space too - but only because users expect DrawImage-like behavior, not because sRGB is better. Linear light is almost always a better choice for downscaling than sRGB, and is more 'correct'.&lt;/p&gt;&lt;/blockquote&gt; &lt;p&gt;These statements about processing light values in the sRGB compressed domain are true. It’s a bit of an oversimplification, but Eric Brasseur has written an &lt;a href="http://www.ericbrasseur.org/gamma.html" target="_blank"&gt;excellent piece&lt;/a&gt; on the topic if you want more detailed info. I was interested by the statement in the second paragraph that FastScaling chooses sRGB processing as a default &lt;em&gt;only&lt;/em&gt; because that’s what people expect, especially in light of all the performance claims made. Processing in linear light is better, but it’s always more expensive, and I wonder just what kind of performance hit FastScaling takes to do it. We saw in the last test that FastScaling barely beat the reference GDI+ resizer at 400px output from a 16MP JPEG source. Let’s do that same test again but enable linear light processing in FastScaling this time. 
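To make the sRGB-versus-linear difference concrete, here is a small sketch using the standard sRGB transfer functions (the constants come from the sRGB spec; the black/white example pixels are my own illustration):

```python
def srgb_to_linear(c):
    # Decode an sRGB-encoded value (0 to 1) to linear light.
    if c > 0.04045:
        return ((c + 0.055) / 1.055) ** 2.4
    return c / 12.92

def linear_to_srgb(c):
    # Encode linear light (0 to 1) back to sRGB.
    if c > 0.0031308:
        return 1.055 * c ** (1 / 2.4) - 0.055
    return c * 12.92

black, white = 0.0, 1.0  # sRGB-encoded pixel values

# Averaging the encoded values, as DrawImage does:
naive = (black + white) / 2                 # 0.5 encoded...
light = srgb_to_linear(naive)               # ...but only ~21% of the light

# Averaging the light itself, then re-encoding:
correct = linear_to_srgb((srgb_to_linear(black) + srgb_to_linear(white)) / 2)
print(light, correct)                       # correct result encodes to ~0.735
```

A 50/50 mix of black and white should be half the light; averaging the encoded values instead produces a considerably darker pixel, which is exactly the ‘exaggerated shadows’ effect described in the quote.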
Oh, and in MagicScaler too, because of course it supports linear processing as well…  &lt;p&gt;&lt;img width="840" height="611" class="dezoom" title="fslinearjpeg" alt="fslinearjpeg" src="/blog/posts/files/412b32d9-35e4-4c8b-9aae-7f482b4e4e69.png" border="0"&gt;  &lt;p&gt;As you might have guessed, FastScaling gave up its meager lead with the added processing. It’s now over 200ms slower than the GDI+ reference, while MagicScaler is still almost 100ms faster. The difference in quality is quite subtle in this image, but it can be more pronounced in images with small high-contrast areas. Here’s a better example using an untouched 17MP &lt;a href="https://www.flickr.com/photos/usgeologicalsurvey/18891263491/" target="_blank"&gt;image of the Milky Way&lt;/a&gt;, also from the USGS flickr.  &lt;p&gt;&lt;img width="840" height="615" class="dezoom" title="fslinearjpeg2" alt="fslinearjpeg2" src="/blog/posts/files/73f6aca8-0f76-4db0-b707-8be2295ef10b.png" border="0"&gt;  &lt;p&gt;WIC looks worst (as usual) here, but both FastScaling and MagicScaler look worlds better than the best GDI+ can do with this image. And with roughly the same input image size, performance is about the same as the previous test after accounting for the increase in decoding time. FastScaling is ~200ms slower than GDI+, and MagicScaler is ~100ms faster. So while FastScaling is sometimes better &lt;em&gt;or&lt;/em&gt; faster than GDI+, it’s most certainly not both.  &lt;h2&gt;I feel the need, the need for speed&lt;/h2&gt; &lt;p&gt;Ok, with that last quality issue addressed and with a good baseline established, we can start to play with some of the options that sacrifice quality for processing speed. GDI+ is obviously going to be quite limited in this regard, as we can really only change the interpolation mode to get better performance. However, as I suggested in Part 1 of this series, the ‘averaging optimizations’ mentioned in the FastScaling docs are also possible to implement with DrawImage(). 
I call it Hybrid Scaling in MagicScaler, so I’ll use that term from now on.&lt;/p&gt; &lt;p&gt;The reason it’s possible to do such a thing with DrawImage() is that we happen to know (from my &lt;a href="/blog/post/image-scaling-with-gdi-part-4-examining-the-interpolationmode-values"&gt;earlier analysis&lt;/a&gt; of the GDI+ interpolation modes) that the default Linear interpolation mode from GDI+ adapts toward a Box (or averaging) filter at higher downscale ratios. We also saw in my &lt;a href="/blog/post/image-scaling-with-gdi-part-5-push-vs-pull-and-image-validation"&gt;earlier testing&lt;/a&gt; with GDI+ that the Linear interpolation mode doesn’t require a conversion to RGBA to do its work and doesn’t require the entire source image to be decoded into memory all at once. That makes this technique particularly interesting in GDI+, because we can reduce memory usage while at the same time increasing speed. I went ahead and implemented hybrid scaling in my reference GDI+ resizer (it took all of about 10 minutes), so we can see what GDI+ can do under the best of conditions. We’ll compare that with the best speeds FastScaling and MagicScaler can achieve. We’ve already seen that in terms of raw speed and efficiency, WIC is going to be impossible to beat, but there isn’t really anything we can do to make it faster or slower, so I’ll drop it from my benchmarks at this point. The best it did on my reference image was 54ms. We’ll keep that number in mind.&lt;/p&gt; &lt;p&gt;The FastScaling docs are light on details regarding its speed vs quality tradeoffs, but it appears they’re all driven with the down.speed setting. MagicScaler allows control of its quality tradeoffs with its HybridMode setting, which has 4 options. The first option is ‘Off’, which is what we’ve done so far. 
The other 3 modes (FavorQuality, FavorSpeed, and Turbo) allow MagicScaler to resize the source by powers of 2 using either the decoder or the WIC Fant scaler (which is somewhere between a Linear and Box filter) before finishing with its own high-quality scaler. The 3 options control how far the low-quality resizing is allowed to go.&lt;/p&gt; &lt;ul&gt; &lt;li&gt;FavorQuality allows low-quality scaling to the nearest power of 2 at least 3x the target size.  &lt;li&gt;FavorSpeed allows low-quality scaling to the nearest power of 2 at least 2x the target size.  &lt;li&gt;Turbo allows low-quality scaling to the nearest power of 2 to the target size. When resizing by a power of 2 up to 8:1, it is equivalent to the WIC scaler I’ve benchmarked so far.&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;The Hybrid mode I added to my reference GDI+ resizer follows the same rules but uses the GDI+ Linear scaler to do its low-quality phase. From this point on, I’ll have to abandon the idea that we can reach equivalent output, so we’ll be stuck with more subjective comparisons for quality. And away we go…&lt;/p&gt; &lt;p&gt;&lt;img width="840" height="611" class="dezoom" title="fsspeed016mp" alt="fsspeed016mp" src="/blog/posts/files/57ee6e05-2e1b-4952-968e-d30ec3b51e2d.png" border="0"&gt;&lt;/p&gt; &lt;p&gt;Quality looks to be pretty even at this point. The hybrid scaling version of my GDI+ resizer knocked 115ms off the normal GDI+ time, but FastScaling and MagicScaler both did better. Note that I’m moving the FastScaling down.speed setting up by 2 at a time since it has a total of 7 levels to MagicScaler’s 4. I’ve also left the down.window=4 setting in place for the FastScaling tests since I believe that setting’s default value was a bug. I’ll allow it to use its default value when we test the maximal speed of each component. And finally, note that MagicScaler is using the JPEG decoder to perform part of the scaling, so its speed is approaching that of the WIC scaler already. 
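If I’ve described those rules correctly, the intermediate size each mode permits works out like this (an illustrative Python sketch of my reading of the rules, not MagicScaler’s actual code):

```python
def hybrid_intermediate(src, target, mode):
    """Largest power-of-2 reduction the mode allows before handing off
    to the high-quality scaler; returns the intermediate dimension."""
    min_factor = {"FavorQuality": 3, "FavorSpeed": 2, "Turbo": 1}[mode]
    ratio = 1
    # Keep halving while the result stays at least min_factor times the target.
    while src // (ratio * 2) >= target * min_factor:
        ratio *= 2
    return src // ratio

# A hypothetical 6000px-wide source scaled to 400px wide:
for mode in ("FavorQuality", "FavorSpeed", "Turbo"):
    print(mode, hybrid_intermediate(6000, 400, mode))
```

For that ratio, FavorQuality and FavorSpeed both stop at a 4:1 intermediate (1500px), while Turbo goes all the way to 8:1 (750px).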
Next level up…&lt;/p&gt; &lt;p&gt;&lt;img width="840" height="611" class="dezoom" title="fsspeed216mp" alt="fsspeed216mp" src="/blog/posts/files/a910585f-4a6d-4a05-b172-c562e7c8c614.png" border="0"&gt;&lt;/p&gt; &lt;p&gt;Looks like nothing really changed here. MagicScaler’s logic used an intermediate ratio of 4:1 on both this test and the last, so the work done was the same. It appears FastScaling might have also used the same settings for both of these runs. And now the fastest settings:&lt;/p&gt; &lt;p&gt;&lt;img width="840" height="611" class="dezoom" title="fsspeed416mp" alt="fsspeed416mp" src="/blog/posts/files/3fea75ca-3b4b-4641-9e84-dc1f0543d17b.png" border="0"&gt;&lt;/p&gt; &lt;p&gt;With this setting, MagicScaler is using an 8:1 intermediate ratio, and the speed is within 2ms of the pure-WIC times we saw earlier. The image is noticeably more blurry now, but doesn’t seem to be as bad off as the FastScaling one. No matter, though, FastScaling barely beats out the hybrid version of the GDI+ resizer in single-thread performance. But that’s probably not the best FastScaling can do performance-wise. I’ll do one final test, changing its down.filter value to ‘Fastest’ and removing its down.window setting, while leaving the down.speed=4 setting. As far as I can tell from the docs, this should be its best speed.&lt;/p&gt; &lt;p&gt;&lt;img width="840" height="611" class="dezoom" title="fsspeed4fastest16mp" alt="fsspeed4fastest16mp" src="/blog/posts/files/4b4d8df5-397e-4318-bcd1-3f95b9757639.png" border="0"&gt;&lt;/p&gt; &lt;p&gt;That shaved a few milliseconds off the FastScaling number, but it’s probably within the margin of error. Its visual quality is by far the worst now.&lt;/p&gt; &lt;p&gt;You may notice I changed one other thing in this test while I was at it. Since I had already maxed out MagicScaler’s speed, in this test I enabled its automatic sharpening. You can see here that it added only 2ms to the processing time, but the results are quite striking. 
MagicScaler is showing nearly 3x the speed of FastScaling and better quality to boot. In fact, the MagicScaler result looks better than GDI+ at 5x the single-threaded performance or 25x the performance on 8 threads.&lt;/p&gt; &lt;p&gt;As for FastScaling’s numbers vs GDI+, the biggest number we’re showing here is 8.3x faster than GDI+ when running on 8 threads. That’s actually within the 4.5-9.6x end-to-end speed range quoted in the FastScaling benchmarks. The problem is, those numbers are with its lowest quality settings, which are unacceptably poor.&amp;nbsp; And it used over 400MiB of RAM during the test, which is unacceptably poor for scalability. The hybrid scaling in my GDI+ reference dropped its memory use to 13MiB from the baseline version’s 64MiB, by the way, and its single-threaded performance numbers were very close to FastScaling’s best while producing better quality.&lt;/p&gt; &lt;p&gt;&lt;img width="646" height="313" title="fsdirty2" alt="fsdirty2" src="/blog/posts/files/c88bfa6d-bbe4-409d-af9c-3ecf026f98ce.png" border="0"&gt;&lt;/p&gt; &lt;p&gt;I think I’ve proven my point. FastScaling’s performance claims are way overblown, and MagicScaler is in a completely different league.&lt;/p&gt; &lt;p&gt;Oh, and there’s one more thing:&lt;/p&gt; &lt;blockquote class="blockquote"&gt; &lt;p&gt;This plugin (FastScaling plugin) is part of the Performance Edition  &lt;p class="m-b-0"&gt;The Performance edition costs $249 per domain&lt;/p&gt;&lt;/blockquote&gt; &lt;p&gt;Ha! Did I mention MagicScaler is free?</description>
      <pubDate>Fri, 13 May 2016 08:39:00 -0700</pubDate>
      <a10:updated>2017-06-10T21:14:34-07:00</a10:updated>
    </item>
    <item>
      <guid isPermaLink="true">https://photosauce.net/blog/post/lies-damned-lies-and-benchmarks-part-2-drawimage-why-u-so-slow</guid>
      <link>https://photosauce.net/blog/post/lies-damned-lies-and-benchmarks-part-2-drawimage-why-u-so-slow</link>
      <a10:author>
        <a10:name />
      </a10:author>
      <title>Lies, Damned Lies, and Benchmarks Part 2: DrawImage(), Why U So Slow?</title>
      <description>&lt;p&gt;Most of the time I write about System.Drawing/GDI+, I’m pointing out its flaws and talking about how much better MagicScaler is for server-side image processing. It’s odd, then, that I now find myself writing a post defending it. In my last post, I quoted the documentation page from ImageResizer’s &lt;a href="http://imageresizing.net/docs/v4/plugins/fastscaling" target="_blank"&gt;FastScaling&lt;/a&gt; plugin and said I’d address a part of it I skipped over. Here it is:  &lt;blockquote class="blockquote"&gt; &lt;p class="m-b-0"&gt;Unlike DrawImage, [FastScaling] uses orthogonal/separable resampling, and requires less of the CPU cache.&lt;/p&gt;&lt;/blockquote&gt; &lt;p&gt;For those who haven’t studied up on &lt;a href="https://blogs.msdn.microsoft.com/ericlippert/2005/10/28/five-dollar-words-for-programmers-part-two-orthogonal/"&gt;Eric Lippert’s Five Dollar Words for Programmers™&lt;/a&gt; or aren’t familiar with the &lt;a href="http://entropymine.com/imageworsener/resample/"&gt;basic mechanics of image resampling&lt;/a&gt;, I’ll give some background. Orthogonal/separable in this context simply means that you can resize an image either by doing both dimensions (width/height) at the same time and calculating the final value for each output pixel all at once, or you can resize in each dimension separately. It works out that for almost all standard resampling algorithms, you can do it either way and get the exact same results. The reason this matters is that if you’re using a resampling algorithm that samples each pixel value more than once – and any good one will – it’s &lt;em&gt;much&lt;/em&gt; less expensive to do it orthogonally.  &lt;p&gt;Take, for example, the cubic resampler we tested in Part 1. Cubics usually require a sample window of 4 (remember FastScaling got that wrong by default), which means they sample a 4x4 pixel area in the source image to determine the value of a single pixel in the output image. 
On top of that, when you scale an image down, you must scale the sample area up proportionally to make sure you sample all the source pixels. Scaling the sample area up is effectively what makes a high-quality scaler high quality. Low-quality cubic scalers (&lt;a href="/blog/post/examining-iwicbitmapscaler-and-the-wicbitmapinterpolationmode-values"&gt;like the one in WIC&lt;/a&gt;) just stick with 4x4 regardless of ratio.&lt;/p&gt; &lt;p&gt;So if, as we did in Part 1, you’re scaling a 6000x4000 image down to 400x267 (1:15 ratio of the source), you need to sample a 60x60 (15:1 ratio of the sampler) pixel area from the input for each pixel in the output. That would mean, in a naïve implementation, you would have to process 400*267*60*60 (384.5 million) pixels to perform that resize. In other words, you would read and perform calculations on each of the 24 million input pixels 16 times (the 4x4 sample size). And for RGB or RGBA images, those numbers would be multiplied by 3 or 4 channels, respectively. You could easily be doing over a billion sample calculations for this seemingly-simple resize operation.&lt;/p&gt; &lt;p&gt;To do the same resize orthogonally, you would first resize to 400x4000, sampling only in the horizontal dimension, so you sample only 60 pixels for each output pixel. That’s 400*4000*60 (96 million) pixels for the first dimension. Then 400*267*60 (6.4 million) for the other dimension. That’s a grand total of 102.4 million pixels processed instead of 384.5 million, a huge savings considering they produce the same result.&lt;/p&gt; &lt;p&gt;Besides the huge reduction in work done, the other benefit of processing orthogonally is cache locality. During the first step of the resize -- where 94% of the processing is done in this example -- the pixels being processed are located in the same line(s) and are, therefore, contiguous in memory. That improves your cache hit ratio. 
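The sample-count arithmetic above is easy to verify (counts are per channel; Python used purely for illustration):

```python
src_w, src_h = 6000, 4000
dst_w, dst_h = 400, 267
window = 4 * (src_w // dst_w)  # 4-tap cubic scaled up for a 15:1 reduction = 60

# Combined 2D filter: every output pixel reads a window-by-window area.
direct = dst_w * dst_h * window * window      # 384,480,000 samples

# Orthogonal: horizontal pass to 400x4000, then vertical pass to 400x267.
horizontal = dst_w * src_h * window           # 96,000,000 samples
vertical = dst_w * dst_h * window             #  6,408,000 samples
print(direct, horizontal + vertical)
```

Note also that about 94% of the separable work is in the horizontal pass (96M of the 102.4M samples), which reads pixels that sit next to each other in memory.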
This is the reason almost all resizers will process the horizontal dimension first.&lt;/p&gt; &lt;p&gt;It would be downright foolish to do it any other way, really – unless you had a good reason to. It turns out (and I have to thank Nathanael Jones, the creator of ImageResizer/FastScaling, for pointing this out to me) that DrawImage() does have a reason to do it otherwise. Some of its &lt;a href="https://msdn.microsoft.com/en-us/library/system.drawing.graphics.drawimage(v=vs.110).aspx"&gt;many, many overloads&lt;/a&gt; allow you to pass in an arbitrary set of 3 points, which it uses to create a parallelogram. It allows you to do things like this: &lt;p&gt;&lt;img class="dezoom" title="witchskew" border="0" alt="witchskew" src="/blog/posts/files/3393d8da-7415-4235-839a-782b0d9ed3de.jpg" width="400" height="233"&gt; &lt;p&gt;Neat… I guess… &lt;p&gt;But orthogonal processing only works for rectangles, so in order to support this very fancy feature, DrawImage() has to do it the hard way. You pay that penalty every time you use it. &lt;p&gt;Given that, it should be quite easy for any scaler that doesn’t pay that penalty to beat GDI+. We saw in Part 1 that FastScaling did, but only barely. In this post, we’ll look at some cases where it doesn’t at all. That means more &lt;strong&gt;benchmarks!&lt;/strong&gt; Yay! &lt;p&gt;Before that, though, I have one more quote to review from the FastScaling page:&lt;/p&gt; &lt;blockquote class="blockquote"&gt; &lt;p&gt;&lt;a href="https://msdn.microsoft.com/en-us/library/system.drawing.graphics.drawimage%28v=vs.110%29.aspx"&gt;Graphics.DrawImage()&lt;/a&gt; holds a process-wide lock, and is a very severe bottleneck for any imaging work on the GDI+/.NET platform. This is unfortunate, as WIC and WPF do not offer any high-quality resampling filters, and DirectX is 10-20X slower than DrawImage. &lt;/p&gt; &lt;p class="m-b-0"&gt;DrawImage also implements a general distortion filter. 
This type of filter thrashes the CPU cache; it is not optimized for linear memory access. It does not parallelize well on multiple cores even when used in separate processes.&lt;/p&gt;&lt;/blockquote&gt; &lt;p&gt;There’s a lot to digest in those short paragraphs. I’ll start with the statements that are true:&lt;/p&gt; &lt;p&gt;DrawImage() does hold a process-wide lock. We’ve seen evidence of it in the benchmarks I’ve run through so far. All calls to DrawImage() are serialized, and in fact, when multiple calls are made, they will all block until they are all done. That’s why the performance numbers for my parallel test runs show almost no jitter in the timings. I’ll run more of those tests in case you missed it the first time.&lt;/p&gt; &lt;p&gt;DrawImage() does also implement a ‘general distortion filter’, sort of. That’s actually not a term with which I was familiar, so, as I was taught when I was younger, I looked it up in the dictionary. I mean, I google-bing’ed it… The most plausible definition I could find comes from &lt;a href="http://www.imagemagick.org/Usage/distorts/#resize" target="_blank"&gt;ImageMagick&lt;/a&gt;, which implements a class of resizing operations that are non-orthogonal so that they can be combined with an &lt;a href="http://www.imagemagick.org/Usage/distorts/affine/" target="_blank"&gt;affine transform&lt;/a&gt; to do things like we saw above with the parallelogram, only they have more fancy options. Again, that is bad for caching since the pixel data isn’t read sequentially as in the orthogonal case.&lt;/p&gt; &lt;p&gt;It’s also true that WIC (and WPF by extension) doesn’t have high-quality resampling filters [built-in]. Or at least it &lt;em&gt;didn’t&lt;/em&gt;. Windows 10 added a high-quality Catmull-Rom filter, as I discussed in an update to my post on WIC interpolation. That should be present in Windows Server 2016 when it’s released as well, but I haven’t yet verified that. 
In any case, it’s not available as of now, on the server, in a built-in fashion.&lt;/p&gt; &lt;p&gt;But of course WIC is pluggable by design, and it’s possible to plug in a high-quality scaler. I know, because that’s exactly what the core of MagicScaler is. I took the superior architecture of WIC and plugged in the part that was missing. The statement above dismisses WIC as a useful solution because it’s missing something, but then it suggests that plugging the same type of component into the inferior and inherently non-pluggable GDI+ architecture is a good alternative. Bah, I say.&lt;/p&gt; &lt;p&gt;As for DirectX, it is &lt;em&gt;not&lt;/em&gt; 10-20x slower than DrawImage(). DirectX is hardware accelerated, and its performance very much depends on your graphics hardware and the shader code you’re running on it. Integrating WIC with DirectX can yield &lt;em&gt;amazing&lt;/em&gt; performance with the right hardware, and in fact, many of the improvements to WIC over the last couple of Windows releases have been related to integration with DirectX for &lt;a href="https://blogs.msdn.microsoft.com/ie/2013/09/12/using-hardware-to-decode-and-load-jpg-images-up-to-45-faster-in-internet-explorer-11/" target="_blank"&gt;hardware-accelerated imaging&lt;/a&gt;. Seriously, if you thought WIC looked fast before, that’s nothing. But since the target for FastScaling (like MagicScaler) is server apps, it is reasonably fair to rule out DirectX as a valid replacement for GDI+ functions. Most servers don’t have GPUs, and the ones that do are generally very expensive. Software-only processing in DirectX &lt;em&gt;is&lt;/em&gt; relatively slow, so I can only hope the statement above was an allusion to that.&lt;/p&gt; &lt;p&gt;Those statements about WIC and DirectX seem to be justifications for staying within a GDI+ solution and simply replacing the supposedly broken DrawImage() implementation. That’s faulty logic, as GDI+’s shortcomings are not just limited to DrawImage() performance. 
We’ve already seen how much faster things can be in a different architecture (like WIC), and we’ll explore that a bit more in this post.&lt;/p&gt; &lt;h2&gt;Back to the Numbers&lt;/h2&gt; &lt;p&gt;We did see in Part 1 of this series that GDI+ came in last in our baseline benchmark. It wasn’t miles behind, but it was last. Is there anything it’s good at?&lt;/p&gt; &lt;p&gt;In order to answer that question, we’re going to need to do some more testing. As in the last post, I’ll try to minimize the number of variables in play between any two tests, so I’m going to start with the benchmark I ended with last time. But this time I’ll change &lt;a href="http://justincaseyouwerewondering.com/wp-content/uploads/2011/10/curly.jpg" target="_blank"&gt;just one thing&lt;/a&gt;. I’m going to switch the input image to an uncompressed RGB TIFF. I’ll explain why in a sec. But first the numbers:&lt;/p&gt; &lt;p&gt;&lt;img class="dezoom" title="fscubrgbtif" border="0" alt="fscubrgbtif" src="/blog/posts/files/70c4aa6f-18a1-4215-a107-84ec32697cf1.png" width="840" height="615"&gt;&lt;/p&gt; &lt;p&gt;A lot of interesting things happened here if you compare with the last set of numbers. Here’s how they compare with the last test run I did. Again, I’m sticking with the single-threaded numbers for now.&lt;/p&gt; &lt;table class="table"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td&gt;&lt;/td&gt; &lt;td&gt;JPEG Input&lt;/td&gt; &lt;td&gt;TIFF Input&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;FastScaling&lt;/td&gt; &lt;td&gt;376ms&lt;/td&gt; &lt;td&gt;380ms&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;GDI+&lt;/td&gt; &lt;td&gt;405ms&lt;/td&gt; &lt;td&gt;367ms&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;WIC&lt;/td&gt; &lt;td&gt;36ms&lt;/td&gt; &lt;td&gt;75ms&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;MagicScaler&lt;/td&gt; &lt;td&gt;228ms&lt;/td&gt; &lt;td&gt;192ms&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; &lt;p&gt;I’ll start with the simple ones first. 
GDI+ and MagicScaler both improved by about 35ms in this test. That 35ms likely represents the reduction in decoding and pixel format conversion time for the 24MP image. JPEG takes more effort to decode than the uncompressed TIFF, so you’d expect all the tests would see similar benefit from the removal of that workload.&lt;/p&gt; &lt;p&gt;The WIC resizer actually took quite a bit longer, though. There’s a simple explanation for that too. When resizing a JPEG source, WIC is able to use the decoder to do part of the scaling operation. I covered this fact in my &lt;a href="/blog/post/examining-iwicbitmapscaler-and-the-wicbitmapinterpolationmode-values"&gt;examination of the WIC scaler&lt;/a&gt; some time back. The short version is, the JPEG decoder would have transformed the image from 6000x4000 down to 750x500 (an 8:1 reduction) before even handing a single pixel to the WIC scaler. That’s why the WIC numbers were so good in the last test. It finished the whole operation in less time than the others took to just decode the source image. That’s also why its parallel numbers were unrealistically good. There was very little processing going on compared to what you’d expect. Fancy, no? In case you’re wondering, I’m able to do the same in MagicScaler, but I’ve disabled that feature for these tests to keep them fair. The WIC results for this test are still quite impressive, but notice the parallel numbers are more in line with expectations.&lt;/p&gt; &lt;p&gt;The only one I can’t fully explain is the FastScaling result. My guess is it would have gained the same 35ms advantage as the others except it squandered that advantage with excessive memory allocations. You’ll see what I mean in just a bit. Large memory allocations are slow, and that’s my best guess for why it failed to improve as much as the others.&lt;/p&gt; &lt;p&gt;And in case you missed it, the biggest news here is that GDI+ is no longer in last place. FastScaling takes over that honor. 
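&lt;/p&gt; &lt;p&gt;To make that decoder trick concrete: a libjpeg-style decoder can return the image at any N/8 of its full size nearly for free, by running a smaller IDCT on each 8x8 block. Here’s a toy sketch of the ratio selection (plain Python, not WIC’s actual API, and the function name is made up):&lt;/p&gt;

```python
import math

def jpeg_dct_prescale(src_w, src_h, target_w, target_h):
    # Hypothetical sketch: pick the smallest N/8 scaling (N = 1 through 8)
    # that still covers the target size, so the decoder does the bulk of
    # the reduction and the scaler only closes the remaining gap.
    for n in range(1, 9):
        w, h = math.ceil(src_w * n / 8), math.ceil(src_h * n / 8)
        if w >= target_w and h >= target_h:
            return w, h
    return src_w, src_h

# Assuming the resize target was 750x500 or smaller, the decoder alone
# takes the 6000x4000 source down 8:1 to 750x500 before the scaler runs.
print(jpeg_dct_prescale(6000, 4000, 750, 500))  # (750, 500)
```

&lt;p&gt;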
They were close in the last test, and now they’ve flipped. GDI+ edges it out by just under 4%. I was surprised by these results, so I ran them a few times. They’re correct.&lt;/p&gt; &lt;p&gt;The real reason I switched to a TIFF input, though, was not to point out those things. I switched to limit the number of variables between this test and the next one.&lt;/p&gt; &lt;p&gt;You see, the reality is GDI+ is just not optimized for what we would like it to be. Remember, GDI+ was not made for server-side image processing and certainly not for web apps. It was made for processing images/graphics for display on the screen or sending to a printer. Its typical operations involve lots of compositing, layering, drawing, etc. Basically the kinds of things you need for windows. So it might make sense that GDI+ would do all of its processing in an RGBA pixel format. I posited as much in a &lt;a href="/blog/post/image-scaling-with-gdi-part-5-push-vs-pull-and-image-validation"&gt;previous post&lt;/a&gt; and showed some evidence to back that up. If you want to see GDI+ at its best, you have to give it the task it was actually built to do.&lt;/p&gt; &lt;p&gt;Here are the results of the same test I did above, only this time the input image was swapped for an RGBA TIFF. Of course this image has no transparency, it’s simply a format change to illustrate performance characteristics.&lt;/p&gt; &lt;p&gt;&lt;img class="dezoom" title="fscubrgbatif" border="0" alt="fscubrgbatif" src="/blog/posts/files/15f88257-12fd-4cc4-8391-2aa56704defa.png" width="840" height="615"&gt;&lt;/p&gt; &lt;p&gt;Well, well, well… what do you know… GDI+ is &lt;em&gt;much&lt;/em&gt; faster than FastScaling here. In fact, even on the 8 thread parallel test, GDI+ only took twice as long as FastScaling, and it had seven of its threads tied behind its back.&lt;/p&gt; &lt;p&gt;Remember the difference in pixel counts for my breakdown of orthogonal vs non-orthogonal processing earlier? 
Let’s revisit those calculations with this example. Processed non-orthogonally, this resize has 384.5 million pixels sampled, multiplied by 4 channels, for a total of just over 1.5 billion sample calculations. Processed orthogonally, that becomes 102.4 million pixels * 4 channels, which is just ~410 million sample calculations. DrawImage() is doing nearly 4 times as many calculations as FastScaling and completing 34% faster anyway.&lt;/p&gt; &lt;p&gt;What’s really interesting here is that if you compare the numbers across the last two tests, you’ll find DrawImage() was roughly the same speed with RGBA as it was with RGB, whereas all three of the other scalers were significantly slower (WIC doesn’t look that much worse, but it’s doing less than 1/4 the processing of the others). In fact, GDI+ was as fast at scaling in RGBA as FastScaling was in RGB. One might infer from those numbers that DrawImage() is missing the optimized code path for RGB that all three of the other scalers have. When doing the one thing it’s good at, GDI+ isn’t actually all that bad. And FastScaling looks a lot less clever by comparison.&lt;/p&gt; &lt;p&gt;Of course, it is a real bummer that DrawImage() isn’t optimized for RGB processing, and it’s a bummer that it doesn’t process orthogonally. Most of the work we do in web image resizing only involves rectangles. And most of it is on RGB images, particularly when we’re dealing with high-res photos. Those are usually JPEGs, which don’t support transparency at all. There is a huge benefit to taking the fast path on those images, and that’s a real opportunity for performance improvements. Anything that takes advantage of that opportunity should beat GDI+ performance-wise. Again, I’m actually surprised FastScaling failed to better GDI+ in the RGB TIFF test, but the numbers say it did.&lt;/p&gt; &lt;p&gt;With all that in mind, let’s look at MagicScaler’s numbers. They’re a decent improvement over GDI+ in RGBA mode, but nothing earth-shattering. 
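&lt;/p&gt; &lt;p&gt;The orthogonal vs non-orthogonal arithmetic from a few paragraphs back is easy to sanity-check:&lt;/p&gt;

```python
channels = 4  # RGBA: one sample calculation per channel per pixel

non_orthogonal = 384.5e6 * channels  # full 2-D sampling: ~1.54 billion
orthogonal = 102.4e6 * channels      # two 1-D passes:    ~409.6 million

print(round(non_orthogonal / orthogonal, 2))  # ~3.75x as many calculations
```

&lt;p&gt;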
We beat GDI+ handily in RGB (over 1.9x as fast), but it’s a much closer race in RGBA (25%).&lt;/p&gt; &lt;p&gt;And just for fun, because I guessed in that earlier post that GDI+ actually uses a Premultiplied Alpha format for its internal processing, let’s see how we all compare with that kind of input. Here is a test run with a PARGB TIFF input.&lt;/p&gt; &lt;p&gt;&lt;img class="dezoom" title="fscubprgbatif" border="0" alt="fscubprgbatif" src="/blog/posts/files/d538e908-c2f8-4a5b-abd6-5421ee6c9691.png" width="840" height="615"&gt;&lt;/p&gt; &lt;p&gt;GDI+ gets even faster when given its native format for processing, and FastScaling gets even slower. Here GDI+ is almost 64% faster. Notice WIC also got faster with PARGB input, so we can assume its RGBA processing converts to PARGB as well. I haven’t built a native processor for PARGB in MagicScaler since this type of file is pretty rare in web resizing scenarios, but MagicScaler does still manage to edge out GDI+ even when it’s doing the thing it does best.&lt;/p&gt; &lt;p&gt;And in case you overlooked it again in the numbers, I want to revisit the comment I made about DrawImage blocking all concurrent calls until they all complete. You might expect that if I fired off 8 calls to DrawImage() on 8 threads one after another, the first one should finish in a normal amount of time and the last one should take the longest as it waits in the queue behind the other 7. We’d expect to see a huge standard deviation on those, but that’s not the case. They all returned at the same time. This behavior makes GDI+ scale even less well than you might have guessed in a server environment.&lt;/p&gt; &lt;p&gt;Speaking of which…&lt;/p&gt; &lt;h2&gt;FastScaling’s Dirty Little Secret&lt;/h2&gt; &lt;p&gt;I promised in the last post that I’d reveal this, and I hinted at it earlier. 
A picture is worth a thousand words in this case.&lt;/p&gt; &lt;p&gt;This is a Visual Studio Diagnostic Tools trace of another benchmarking run configured the same way as the last one (PARGB TIFF input), although the results are similar regardless of input format.&lt;/p&gt; &lt;p&gt;&lt;img title="fsdirty" border="0" alt="fsdirty" src="/blog/posts/files/810ecf8f-8083-4bb8-8e59-80e11156fcea.png" width="646" height="313"&gt;&lt;/p&gt; &lt;p&gt;Ok, maybe this picture requires just a few words…&lt;/p&gt; &lt;p&gt;Each test here had a breakpoint, followed by a forced Gen 2 garbage collection, followed by a 1-second sleep, followed by the 3 portions of the test (10 runs serial, 4 parallel, 8 parallel).&lt;/p&gt; &lt;p&gt;The breakpoints separate each component’s test nicely in the Events section (note the gaps in the grey bar), and I’ve labeled each one using everyone’s favorite graphics program: MS Paint. The garbage collections (orange arrows) ensure nothing is left over from one test to the next, and the sleep puts a nice break in the CPU chart before each test gets going. If you haven’t used this tool before, hopefully the graphs are self-explanatory, but I’ll call out some details. This debug session, by the way, was run on a release build.&lt;/p&gt; &lt;p&gt;The baseline memory usage at the start of this graph is 156MiB. The reason it’s that high, even though nothing has happened yet, is that I pre-loaded the source file into memory so that I could wrap it in a MemoryStream and pass it to each component. I could have passed in the file name and had each component do its own read, but this is a 92MiB TIFF file, and when we get to the parallel processing tests, disk I/O could become a bottleneck. Using a shared copy in memory removes that possibility and makes the tests more fair.
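&lt;/p&gt; &lt;p&gt;That 92MiB figure is no accident, by the way; an uncompressed 24MP PARGB bitmap works out to almost exactly that:&lt;/p&gt;

```python
width, height = 6000, 4000   # the 24MP test image
bytes_per_pixel = 4          # PARGB: one byte each for A, R, G, B

decoded_mib = width * height * bytes_per_pixel / 2**20
print(round(decoded_mib, 1))  # 91.6, i.e. the ~92MiB quoted above
```

&lt;p&gt;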
Each component still has to do the full decode and encode; we’re really just caching the source file.&lt;/p&gt; &lt;p&gt;When the GDI+ test begins, there is a step up in memory usage to 249MiB. That represents a decoded copy of the 92MiB source, plus another 1MiB of… miscellaneous. Because the source is uncompressed and already in PARGB format, the encoded and decoded images are the same size. For the entire duration of the GDI+ test run, the CPU chart is steady at ~13% (one of 8 virtual cores) and the memory usage is flat. It actually peaks at 255MiB, but that’s just noise at this level. So no matter how many parallel calls we made to DrawImage(), there was only ever one decoded copy of the image in memory and one core doing work.&lt;/p&gt; &lt;p&gt;I’ll pause here and point out that it’s really not cool that GDI+ decodes the entire image into memory at once and holds it there for the duration of processing. The fact that a 93MiB jump in memory usage looks so insignificant on this graph is a hint to just how out-of-control things got later. In isolation, I would have said that was way too much memory to dedicate to a single operation. That’s a real killer for scalability. Fortunately, this is mitigated by the fact that GDI+ will only ever do this once at a time, due to its non-reentrant nature. I don’t know if this is the actual reason for that design or if it has its roots in the fact that GDI+ was designed for screen use. Maybe it has to do with handles and device contexts and what-not. I dunno. Whatever the reason, GDI+ essentially protects you from yourself if you’re using it in a server environment. It may not scale well for speed, but at least it won’t run away with all your memory.&lt;/p&gt; &lt;p&gt;Moving on to the WIC test, you see GDI+’s in-memory decoded copy of the bitmap has been dropped, and we’re back to the baseline memory level, which has moved up to 157MiB by now, because we’re starting to fill in the test results in the UI. 
The important thing is, the memory usage remains flat throughout the test run, peaking at only 160MiB. WIC never has to load the entire source image because it processes the output by pulling only as many pixels as it needs at a time through the pipeline. The CPU usage is flat at one core for the duration of the serial runs, then we get a nice spike as the parallel tests kick off. From a server-side scalability standpoint, this segment is a beauty.&lt;/p&gt; &lt;p&gt;Then there’s the FastScaling test. If the WIC test was a scalability beauty, this one is U.G.L.Y. (it ain’t got no alibi). The lowest memory usage observed during this test was 345MiB. That’s 96MiB more than GDI+ ever used, and that’s the &lt;em&gt;minimum&lt;/em&gt;. Near the beginning of the test you can see the memory usage creep up to a high of 624MiB before the garbage collector decides it’s time to take action. As the serial runs continue, we see a cycle of rolling hills in the memory usage, with the value repeatedly climbing to 536MiB before the GC kicks in again taking it back down to 444MiB. Then the parallel tests start, and all hell breaks loose. Memory usage peaked at over 2.7GiB during the 8 thread test. But at least they broke free of GDI+’s single-thread restriction. &lt;a href="https://youtu.be/mSy5mEcmgwU" target="_blank"&gt;That’s worth it&lt;/a&gt;, right?&lt;/p&gt; &lt;p&gt;Finally, we get to the MagicScaler test, and you can see that, much like WIC (because they’re like this [holds up crossed fingers]), memory usage is almost flat through the entire test. It starts with a baseline of 159MiB and peaks at 179MiB. MagicScaler needs more buffering memory than WIC does because it’s doing a more complex resample operation, but 20MiB for 8 parallel operations on an image this size is quite reasonable, I think. Mostly, it looks like the WIC test but with higher CPU usage. 
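&lt;/p&gt; &lt;p&gt;That flat memory line, for WIC and MagicScaler alike, is the signature of a pull-based pipeline. Here’s a toy sketch of the idea in plain Python (nothing to do with WIC’s actual COM interfaces, just the shape of the data flow):&lt;/p&gt;

```python
def decode_rows(width, height):
    # Stand-in for a decoder that yields one scanline at a time rather
    # than materializing the whole bitmap up front.
    for _ in range(height):
        yield [0] * width

def shrink_rows(rows, factor):
    # Toy point-sampling scaler: keep every factor-th row and pixel.
    for y, row in enumerate(rows):
        if y % factor == 0:
            yield row[::factor]

# Only a row or two is ever alive at once, no matter how big the image is.
output = list(shrink_rows(decode_rows(6000, 4000), 8))
print(len(output[0]), len(output))  # 750 500
```

&lt;p&gt;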
Like I said, quality isn’t cheap.&lt;/p&gt; &lt;p&gt;There’s one final thing I want to address that you may have noticed in the above chart. There are a bunch of little red diamonds in the Events panel during the FastScaling run. Those are all exceptions thrown during the FastScaling test, but they’re all caught internally by the ImageResizer library. As far as I can tell, they didn’t affect the test results. The exception, in case you’re curious, was an UnauthorizedAccessException, saying “MemoryStream’s internal buffer cannot be accessed”. It appears ImageResizer was attempting to call GetBuffer() on the MemoryStream passed in to it. That MemoryStream wrapped an existing byte array, so that’s not allowed. I don’t know why ImageResizer didn’t just use the Stream it was given, but that may have been an attempt at optimization. The other components use the Stream interface, so that failure kept them on even ground.&lt;/p&gt; &lt;h2&gt;Truth from Lies&lt;/h2&gt; &lt;p&gt;I started off Part 1 of this series by saying that benchmarks can lie, because we can always make the numbers show whatever we want. But of course, benchmarks can also be a valuable tool for learning the true nature of performance. We used them to prove, at least in my mind, that DrawImage() isn’t really bad; it’s just misunderstood. Or more accurately: it’s misused. We saw that it could easily take over a system’s resources if it were allowed to, so its non-reentrant nature is probably a good thing. Removing that brake and allowing it to run wild is ill-advised, and yet that’s pretty much what FastScaling does, only worse. Most of FastScaling’s performance claims are rooted in the fact that it can and will take over your server if you let it.&lt;/p&gt; &lt;p&gt;We also saw that when GDI+ is doing what it was designed to do, it’s not terrible at it. A separate code path that optimized for rectangular RGB images would have been nice, but that wasn’t part of its design.
FastScaling obviously does have those optimizations, and on that front, we saw it does edge out DrawImage() performance-wise, sometimes. That’s a win, albeit a small one. Overall, though, it’s misplaced in an architecture that is hostile to server environments.&lt;/p&gt; &lt;p&gt;And we saw that there are significant trade-offs when it comes to performance vs quality. WIC was ridiculously fast, but the image quality with its Fant scaler isn’t good enough for most uses. Getting to the quality level of GDI+’s high-quality scaling negates most of the performance improvements FastScaling claims. There is benefit to flexibility, though, and having the ability to balance performance with quality is a good thing. There is a middle ground between WIC and GDI+, and FastScaling seeks to make its place there. MagicScaler seeks the same but takes a different approach. In my next post, we’ll start exploring that area and planting some flags.&lt;/p&gt;</description>
      <pubDate>Wed, 11 May 2016 21:33:18 -0700</pubDate>
      <a10:updated>2016-06-29T19:51:36-07:00</a10:updated>
    </item>
  </channel>
</rss>