PhotoSauce Blog


This is the final part of my review of the FastScaling plugin for ImageResizer.  Over the first two parts of this series, we examined some of the performance claims made by the FastScaling documentation. To review, those claims could be grouped into three categories:

  1. It claimed that its orthogonal processing was more efficient than DrawImage()’s ‘general distortion filter’. That was true, but other architectural deficiencies cancel out that benefit in many cases. We saw that at equivalent output quality and on a single thread, FastScaling doesn’t offer much, if any, improvement over optimized DrawImage() usage. Its native RGB processing is more efficient, but even with that advantage, it barely eked out a win in our JPEG test. With other container formats, results may vary. With other pixel formats, it does significantly worse than GDI+.
  2. It claimed to break free of the single-thread limit imposed by DrawImage(), allowing it to scale up with more processors/cores. That was also true. But we saw that the cost is that it’s allowed to run away with your server resources. Memory is particularly hard-hit, since FastScaling seems to require even more memory per image than DrawImage() does.
  3. It claimed performance improvements through dynamic adjustment of scaler settings and through what they call ‘averaging optimizations’. We have not yet explored these.

Point 2 above could easily be the end of the road for this series. It’s a deal-breaker for true scalability. I certainly wouldn’t let FastScaling anywhere near any of my servers. But I’m still curious about that last point. I do some dynamic adjustments of quality settings in MagicScaler as well, and I’m interested to see how they compare.

I’m also curious as to how they arrived at such impressive numbers in their benchmarks. Nothing I’ve seen indicates FastScaling is anywhere near as fast as they say, but I’d like to see if I can get close to some of those numbers or at least explain how they got them. I came up with my own baseline for my tests, but I might need to reset that baseline if I’m going to match theirs.

Narrowing the Scope

Beyond the baseline problem, there’s a problem of variables. I showed how limiting benchmarks to a single variable at a time makes them much more educational. On the other hand, carefully choosing those variables can allow you to present a distorted view of reality. I’d like to see if I can determine how they arrived at theirs, and why. Right off the bat, there are several to consider, such as:

  • Input image size, container format and pixel format
  • Output image size, container format and pixel format
  • Interpolation method and parameters (this can be extremely complex and variable itself)
  • Shortcuts, such as in-decoder transformations, or intermediate processing

JPEG input and output are clearly the most representative of a real-world web workload, so that part is a no-brainer. As for the input image size, I mentioned before that a larger input image exaggerates the performance difference in the scalers. I used a 24MP image for my initial tests, but the 16MP input used in the FastScaling benchmarks is also reasonable for those purposes. I’ll go ahead and switch to that size now. We’re also going to be doing only RGB (YCbCr, actually) JPEGs since they’re most typical.

The image I chose for this round of tests comes from the USGS flickr. The original file had an embedded Nikon ICC profile, which adds considerable processing time to the decode/conversion step. This would make things particularly unfair when using MagicScaler’s ability to resize within the decoder, so in order to keep the workload as similar as possible for all the scalers, I converted the image to the sRGB colorspace in Photoshop and re-saved it without an embedded profile for these benchmarks. The converted file is here.

So the first real decision we have to make is output size. It has to be something realistic for web scenarios, but beyond that, it doesn’t seem like all that important a choice. I chose an output width of 400px for my earlier tests simply because I find that size easy to manage. I can do screenshots of my test app without them being too big, and I can easily take in all of the images in a single glance so differences in visual results are easy to spot. The FastScaling benchmarks used 800px output, and I wondered whether there was a reason for that. If you saw my earlier benchmarks between ImageResizer’s GDI+ implementation and my own reference GDI+ resizer, you may remember that at larger output sizes, the sub-optimal GDI+ settings used by ImageResizer made it significantly slower. I wondered if that handicap would make FastScaling look better by comparison, so I ran a few tests using my baseline settings from Part 1 of this series. The idea here is to keep them on even ground and change only the output size variable for now.

fsbaseline16mp

What’s interesting here is that the two scalers in ImageResizer follow a completely different trajectory than the reference GDI+ resizer and MagicScaler. ImageResizer is clearly paying a performance penalty at larger output sizes, but that penalty is paid by both of its scalers. There doesn’t appear to be any special reason they chose the 800px output size. In fact, at that size, FastScaling is actually slower than the reference GDI+ resizer. It is noteworthy that FastScaling beats ImageResizer’s GDI+ implementation at all output sizes, but the margin is modest, at a relatively constant 40-50ms. By comparison, MagicScaler maintains a steady 120-130ms advantage over the reference GDI+ resizer.

With these results in mind, I don’t think it’s at all unfair to stick with my preferred 400px output width for the remaining benchmarks. FastScaling actually holds a slight edge over the reference GDI+ resizer at that size, and we’ll have an easier time comparing output quality once we start enabling some of the processing shortcuts that FastScaling and MagicScaler support. This is the new baseline I’ll be using going forward.

fsbaseline16mpjpeg

Speaking of Quality…

Before I start sacrificing quality for speed in these comparisons, there’s one last topic I want to visit from the FastScaling documentation. Beyond the performance claims made in the docs, they also claim to have improved quality over DrawImage().

Another failing of DrawImage is that it only averages pixels in the sRGB color space. sRGB is a perceptual color space, meaning that fewer numbers are assigned to bright colors; most are assigned to shades of black. When downscaling (weighted averaging), this tends to exaggerate shadows and make highlights disappear, although it is just fine when upscaling.

FastScaling defaults to working in the srgb color space too - but only because users expect DrawImage-like behavior, not because sRGB is better. Linear light is almost always a better choice for downscaling than sRGB, and is more 'correct'.

These statements about processing light values in the sRGB compressed domain are true. It’s a bit of an oversimplification, but Eric Brasseur has written an excellent piece on the topic if you want more detailed info. I was intrigued by the statement in the second paragraph that FastScaling chooses sRGB processing as a default only because that’s what people expect, especially in light of all the performance claims made. Processing in linear light is better, but it’s always more expensive, and I wonder just what kind of performance hit FastScaling takes to do it. We saw in the last test that FastScaling barely beat the reference GDI+ resizer at 400px output from a 16MP JPEG source. Let’s do that same test again but enable linear light processing in FastScaling this time. Oh, and in MagicScaler too, because of course it supports linear processing as well…
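Before looking at the numbers, here’s a quick illustration of the averaging problem described in the quote above. This is a minimal sketch using the standard sRGB transfer function, not code from any of the libraries being tested:

```csharp
using System;

// Standard sRGB <-> linear transfer functions, values normalized to 0..1.
static double SrgbToLinear(double s) =>
    s <= 0.04045 ? s / 12.92 : Math.Pow((s + 0.055) / 1.055, 2.4);

static double LinearToSrgb(double l) =>
    l <= 0.0031308 ? l * 12.92 : 1.055 * Math.Pow(l, 1.0 / 2.4) - 0.055;

// Average a pure black pixel and a pure white pixel, as a Box filter would when
// downscaling a 1px checkerboard by 2:1.
double srgbAverage = (0.0 + 1.0) / 2.0;                                             // 0.5
double linearAverage = LinearToSrgb((SrgbToLinear(0.0) + SrgbToLinear(1.0)) / 2.0); // ~0.735

// Averaging the encoded sRGB values gives a visibly darker result than averaging
// in linear light, which is why sRGB-domain downscaling exaggerates shadows.
Console.WriteLine($"sRGB-domain: {srgbAverage:F3}, linear-light: {linearAverage:F3}");
```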

fslinearjpeg

As you might have guessed, FastScaling gave up its meager lead with the added processing. It’s now over 200ms slower than the GDI+ reference, while MagicScaler is still almost 100ms faster. The difference in quality is quite subtle in this image, but it can be more pronounced in images with small high-contrast areas. Here’s a better example using an untouched 17MP image of the Milky Way, also from the USGS flickr.

fslinearjpeg2

WIC looks worst (as usual) here, but both FastScaling and MagicScaler look worlds better than the best GDI+ can do with this image. And with roughly the same input image size, performance is about the same as the previous test after accounting for the increase in decoding time. FastScaling is ~200ms slower than GDI+, and MagicScaler is ~100ms faster. So while FastScaling is sometimes better or faster than GDI+, it’s most certainly not both.

I feel the need, the need for speed

Ok, with that last quality issue addressed and with a good baseline established, we can start to play with some of the options that sacrifice quality for processing speed. GDI+ is obviously going to be quite limited in this regard, as we can really only change the interpolation mode to get better performance. However, as I suggested in Part 1 of this series, the ‘averaging optimizations’ mentioned in the FastScaling docs are also possible to implement with DrawImage(). I call it Hybrid Scaling in MagicScaler, so I’ll use that term from now on.

The reason it’s possible to do such a thing with DrawImage() is because we happen to know (from my earlier analysis of the GDI+ interpolation modes) that the default Linear interpolation mode from GDI+ adapts toward a Box (or averaging) filter at higher downscale ratios. We also saw in my earlier testing with GDI+ that the Linear interpolation mode doesn’t require a conversion to RGBA to do its work and doesn’t require the entire source image to be decoded into memory all at once. That makes this technique particularly interesting in GDI+, because we can reduce memory usage while at the same time increasing speed. I went ahead and implemented hybrid scaling in my reference GDI+ resizer (it took all of about 10 minutes), so we can see what GDI+ can do under the best of conditions. We’ll compare that with the best speeds FastScaling and MagicScaler can achieve. We’ve already seen that in terms of raw speed and efficiency, WIC is going to be impossible to beat, but there isn’t really anything we can do to make it faster or slower, so I’ll drop it from my benchmarks at this point. The best it did on my reference image was 54ms. We’ll keep that number in mind.
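To make the idea concrete, here’s roughly the shape of a hybrid GDI+ resize -- a sketch of the technique, not the actual code from my reference resizer. The rule for picking the power-of-2 pre-scale ratio is covered just below.

```csharp
using System.Drawing;
using System.Drawing.Drawing2D;

static Bitmap HybridResize(Image src, int dstWidth, int dstHeight, int preScaleRatio)
{
    int intWidth = src.Width / preScaleRatio, intHeight = src.Height / preScaleRatio;

    // Low-quality pass: GDI+'s default 'Linear' (Bilinear) mode, which adapts toward
    // a Box filter at high downscale ratios.
    using var intermediate = new Bitmap(intWidth, intHeight);
    using (var g = Graphics.FromImage(intermediate))
    {
        g.InterpolationMode = InterpolationMode.Bilinear;
        g.PixelOffsetMode = PixelOffsetMode.Half;
        g.DrawImage(src, new Rectangle(0, 0, intWidth, intHeight));
    }

    // High-quality pass to the final size.
    var result = new Bitmap(dstWidth, dstHeight);
    using (var g = Graphics.FromImage(result))
    {
        g.InterpolationMode = InterpolationMode.HighQualityBicubic;
        g.PixelOffsetMode = PixelOffsetMode.Half;
        g.DrawImage(intermediate, new Rectangle(0, 0, dstWidth, dstHeight));
    }
    return result;
}
```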

The FastScaling docs are light on details regarding its speed vs quality tradeoffs, but it appears they’re all driven by the down.speed setting. MagicScaler allows control of its quality tradeoffs with its HybridMode setting, which has 4 options. The first option is ‘Off’, which is what we’ve done so far. The other 3 modes (FavorQuality, FavorSpeed, and Turbo) allow MagicScaler to resize the source by powers of 2 using either the decoder or the WIC Fant scaler (which is somewhere between a Linear and Box filter) before finishing with its own high-quality scaler. The 3 options control how far the low-quality resizing is allowed to go; I’ve sketched the ratio logic just after the list below.

  • FavorQuality allows low-quality scaling to the nearest power of 2 at least 3x the target size.
  • FavorSpeed allows low-quality scaling to the nearest power of 2 at least 2x the target size.
  • Turbo allows low-quality scaling to the nearest power of 2 to the target size. When resizing by a power of 2 up to 8:1, it is equivalent to the WIC scaler I’ve benchmarked so far.
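For concreteness, here’s how I understand those three rules mapping to a power-of-2 intermediate ratio. The names below are my own illustration, not MagicScaler’s actual API:

```csharp
enum HybridMode { Off, FavorQuality, FavorSpeed, Turbo }

static int GetIntermediateRatio(int sourceWidth, int targetWidth, HybridMode mode)
{
    if (mode == HybridMode.Off || targetWidth >= sourceWidth)
        return 1;

    // Minimum width the low-quality pass must leave for the high-quality scaler.
    int multiplier = mode == HybridMode.FavorQuality ? 3
                   : mode == HybridMode.FavorSpeed   ? 2
                   : 1;                                // Turbo
    int minIntermediate = targetWidth * multiplier;

    int ratio = 1;
    while (sourceWidth / (ratio * 2) >= minIntermediate)
        ratio *= 2;

    return ratio;
}

// For example, a (hypothetical) 4928px-wide source scaled to 400px gives 4:1 for
// FavorQuality and FavorSpeed and 8:1 for Turbo.
```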

The Hybrid mode I added to my reference GDI+ resizer follows the same rules but uses the GDI+ Linear scaler to do its low-quality phase. From this point on, I’ll have to abandon the idea that we can reach equivalent output, so we’ll be stuck with more subjective comparisons for quality. And away we go…

fsspeed016mp

Quality looks to be pretty even at this point. The hybrid scaling version of my GDI+ resizer knocked 115ms off the normal GDI+ time, but FastScaling and MagicScaler both did better. Note that I’m moving the FastScaling down.speed setting up by 2 at a time since it has a total of 7 levels to MagicScaler’s 4. I’ve also left the down.window=4 setting in place for the FastScaling tests since I believe that setting’s default value was a bug. I’ll allow it to use its default value when we test the maximal speed of each component. And finally, note that MagicScaler is using the JPEG decoder to perform part of the scaling, so its speed is approaching that of the WIC scaler already. Next level up…

fsspeed216mp

Looks like nothing really changed here. MagicScaler’s logic used an intermediate ratio of 4:1 on both this test and the last, so the work done was the same. It appears FastScaling might have also used the same settings for both of these runs. And now the fastest settings:

fsspeed416mp

With this setting, MagicScaler is using an 8:1 intermediate ratio, and the speed is within 2ms of the pure-WIC times we saw earlier. The image is noticeably more blurry now, but doesn’t seem to be as bad off as the FastScaling one. No matter, though, FastScaling barely beats out the hybrid version of the GDI+ resizer in single-thread performance. But that’s probably not the best FastScaling can do performance-wise. I’ll do one final test, changing its down.filter value to ‘Fastest’ and removing its down.window setting, while leaving the down.speed=4 setting. As far as I can tell from the docs, this should be its best speed.

fsspeed4fastest16mp

That shaved a few milliseconds off the FastScaling number, but it’s probably within the margin of error. Its visual quality is by far the worst now.

You may notice I changed one other thing in this test while I was at it. Since I had already maxed out MagicScaler’s speed, I enabled its automatic sharpening this time. You can see here that it added only 2ms to the processing time, but the results are quite striking. MagicScaler is showing nearly 3x the speed of FastScaling and better quality to boot. In fact, the MagicScaler result looks better than GDI+ at 5x the single-threaded performance or 25x the performance on 8 threads.

As for FastScaling’s numbers vs GDI+, the biggest number we’re showing here is 8.3x faster than GDI+ when running on 8 threads. That’s actually within the 4.5-9.6x end-to-end speed range quoted in the FastScaling benchmarks. The problem is, those numbers are with its lowest quality settings, which are unacceptably poor.  And it used over 400MiB of RAM during the test, which is unacceptably poor for scalability. The hybrid scaling in my GDI+ reference dropped its memory use to 13MiB from the baseline version’s 64MiB, by the way, and its single-threaded performance numbers were very close to FastScaling’s best while producing better quality.

fsdirty2

I think I’ve proven my point. FastScaling’s performance claims are way overblown, and MagicScaler is in a completely different league.

Oh, and there’s one more thing:

This plugin (FastScaling plugin) is part of the Performance Edition

The Performance edition costs $249 per domain

Ha! Did I mention MagicScaler is free?


Most of the time I write about System.Drawing/GDI+, I’m pointing out its flaws and talking about how much better MagicScaler is for server-side image processing. It’s odd, then, that I now find myself writing a post defending it. In my last post, I quoted the documentation page from ImageResizer’s FastScaling plugin and said I’d address a part of it I skipped over. Here it is:

Unlike DrawImage, [FastScaling] uses orthogonal/separable resampling, and requires less of the CPU cache.

For those who haven’t studied up on Eric Lippert’s Five Dollar Words for Programmers™ or aren’t familiar with the basic mechanics of image resampling, I’ll give some background. Orthogonal/separable in this context simply means that you can resize an image either by doing both dimensions (width/height) at the same time, calculating the final value for each output pixel in one pass, or by resizing each dimension separately. It works out that for almost all standard resampling algorithms, you can do it either way and get the exact same results. The reason this matters is that if you’re using a resampling algorithm that samples each pixel value more than once -- and any good one will -- it’s much less expensive to do it orthogonally.

Take, for example, the cubic resampler we tested in Part 1. Cubics usually require a sample window of 4 (remember FastScaling got that wrong by default), which means they sample a 4x4 pixel area in the source image to determine the value of a single pixel in the output image. On top of that, when you scale an image down, you must scale the sample area up proportionally to make sure you sample all the source pixels. Scaling the sample area up is effectively what makes a high-quality scaler high quality. Low-quality cubic scalers (like the one in WIC) just stick with 4x4 regardless of ratio.

So if, as we did in Part 1, you’re scaling a 6000x4000 image down to 400x267 (1:15 ratio of the source), you need to sample a 60x60 (15:1 ratio of the sampler) pixel area from the input for each pixel in the output. That would mean, in a naïve implementation, you would have to process 400*267*60*60 (384.5 million) pixels to perform that resize. In other words, you would read and perform calculations on each of the 24 million input pixels 16 times (the 4x4 sample size). And for RGB or RGBA images, those numbers would be multiplied by 3 or 4 channels, respectively. You could easily be doing over a billion sample calculations for this seemingly-simple resize operation.

To do the same resize orthogonally, you would first resize to 400x4000, sampling only in the horizontal dimension, so you sample only 60 pixels for each output pixel. That’s 400*4000*60 (96 million) pixels for the first dimension. Then 400*267*60 (6.4 million) for the other dimension. That’s a grand total of 102.4 million pixels processed instead of 384.5 million, a huge savings considering they produce the same result.
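Here’s that back-of-the-envelope arithmetic spelled out, assuming a cubic filter with a base support of 4 taps per axis:

```csharp
using System;

int srcW = 6000, srcH = 4000;
int dstW = 400,  dstH = 267;
int support = 4;                                            // base cubic sample window

// The sample window scales with the downscale ratio (~15:1 here => ~60 taps per axis).
long windowX = (long)Math.Ceiling(support * (double)srcW / dstW);  // 60
long windowY = (long)Math.Ceiling(support * (double)srcH / dstH);  // 60

// Naive 2D sampling: every output pixel reads a windowX * windowY block.
long naiveSamples = (long)dstW * dstH * windowX * windowY;         // ~384.5 million

// Separable sampling: horizontal pass to 400x4000 first, then the vertical pass.
long separableSamples = (long)dstW * srcH * windowX                // ~96 million
                      + (long)dstW * dstH * windowY;               // ~6.4 million

Console.WriteLine($"{naiveSamples:N0} vs {separableSamples:N0} pixels sampled");
```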

Besides the huge reduction in work done, the other benefit of processing orthogonally is cache locality. During the first step of the resize -- where 94% of the processing is done in this example -- the pixels being processed are located in the same line(s) and are, therefore, contiguous in memory. That improves your cache hit ratio. This is the reason almost all resizers will process the horizontal dimension first.

It would be downright foolish to do it any other way, really -- unless you had a good reason to. It turns out (and I have to thank Nathanael Jones, the creator of ImageResizer/FastScaling, for pointing this out to me) that DrawImage() does have a reason to do it otherwise. Some of its many, many overloads allow you to pass in an arbitrary set of 3 points, which it uses to create a parallelogram. It allows you to do things like this:

witchskew

Neat… I guess…
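For reference, this is roughly what that overload looks like in use. The file names are hypothetical; the three points are the destinations of the source’s top-left, top-right, and bottom-left corners:

```csharp
using System.Drawing;

using var src = Image.FromFile("witch.png");   // hypothetical input
using var dst = new Bitmap(800, 600);
using var g = Graphics.FromImage(dst);

// Three destination points define a parallelogram, so the image gets skewed.
g.DrawImage(src, new[]
{
    new Point(100, 50),    // where the top-left corner lands
    new Point(700, 120),   // where the top-right corner lands
    new Point(40, 500)     // where the bottom-left corner lands
});

dst.Save("witchskew.png");
```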

But orthogonal processing only works for rectangles, so in order to support this very fancy feature, DrawImage() has to do it the hard way. You pay that penalty every time you use it.

Given that, it should be quite easy for any scaler that doesn’t pay that penalty to beat GDI+. We saw in Part 1 that FastScaling did, but only barely. In this post, we’ll look at some cases where it doesn’t at all. That means more benchmarks! Yay!

Before that, though, I have one more quote to review from the FastScaling page:

Graphics.DrawImage() holds a process-wide lock, and is a very severe bottleneck for any imaging work on the GDI+/.NET platform. This is unfortunate, as WIC and WPF do not offer any high-quality resampling filters, and DirectX is 10-20X slower than DrawImage.

DrawImage also implements a general distortion filter. This type of filter thrashes the CPU cache; it is not optimized for linear memory access. It does not parallelize well on multiple cores even when used in separate processes.

There’s a lot to digest in those short paragraphs. I’ll start with the statements that are true:

DrawImage() does hold a process-wide lock. We’ve seen evidence of it in the benchmarks I’ve run so far. All calls to DrawImage() are serialized, and in fact, when multiple calls are made, they all block until they are all done. That’s why the performance numbers for my parallel test runs show almost no jitter in the timings. I’ll show more of that later in case you missed it the first time.

DrawImage() does also implement a ‘general distortion filter’, sort of. That’s actually not a term with which I was familiar, so, as I was taught when I was younger, I looked it up in the dictionary. I mean, I google-bing’ed it… The most plausible definition I could find comes from ImageMagick, which implements a class of resizing operations that are non-orthogonal so that they can be combined with an affine transform to do things like we saw above with the parallelogram, only they have more fancy options. Again, that is bad for caching since the pixel data isn’t read sequentially as in the orthogonal case.

It’s also true that WIC (and WPF by extension) doesn’t have high-quality resampling filters [built-in]. Or at least it didn’t. Windows 10 added a high-quality Catmull-Rom filter, as I discussed in an update to my post on WIC interpolation. That should be present in Windows Server 2016 when it’s released as well, but I haven’t yet verified that. In any case, it’s not available as of now, on the server, in a built-in fashion.

But of course WIC is pluggable by design, and it’s possible to plug in a high-quality scaler. I know, because that’s exactly what the core of MagicScaler is. I took the superior architecture of WIC and plugged in the part that was missing. The statement above dismisses WIC as a useful solution because it’s missing something, but then it suggests that plugging the same type of component into the inferior and inherently non-pluggable GDI+ architecture is a good alternative. Bah, I say.

As for DirectX, it is not 10-20x slower than DrawImage(). DirectX is hardware accelerated, and its performance very much depends on your graphics hardware and the shader code you’re running on it. Integrating WIC with DirectX can yield amazing performance with the right hardware, and in fact, many of the improvements to WIC over the last couple of Windows releases have been related to integration with DirectX for hardware-accelerated imaging. Seriously, if you thought WIC looked fast before, that’s nothing. But since the target for FastScaling (like MagicScaler) is server apps, it is reasonably fair to rule out DirectX as a valid replacement for GDI+ functions. Most servers don’t have GPUs, and the ones that do are generally very expensive. Software-only processing in DirectX is relatively slow, so I can only hope the statement above was an allusion to that.

Those statements about WIC and DirectX seem to be justifications for staying within a GDI+ solution and simply replacing the supposedly broken DrawImage() implementation. That’s faulty logic, as GDI+’s shortcomings are not just limited to DrawImage() performance. We’ve already seen how much faster things can be in a different architecture (like WIC), and we’ll explore that a bit more in this post.

Back to the Numbers

We did see in Part 1 of this series that GDI+ came in last in our baseline benchmark. It wasn’t miles behind, but it was last. Is there anything it’s good at?

In order to answer that question, we’re going to need to do some more testing. As in the last post, I’ll try to minimize the number of variables in play between any two tests, so I’m going to start with the benchmark I ended with last time. But this time I’ll change just one thing. I’m going to switch the input image to an uncompressed RGB TIFF. I’ll explain why in a sec. But first the numbers:

fscubrgbtif

A lot of interesting things happened here. Here’s how the numbers compare with the last test run I did. Again, I’m sticking with the single-threaded numbers for now.

              JPEG Input   TIFF Input
FastScaling   376ms        380ms
GDI+          405ms        367ms
WIC           36ms         75ms
MagicScaler   228ms        192ms

I’ll start with the simple ones. GDI+ and MagicScaler both improved by about 35ms in this test. That 35ms likely represents the reduction in decoding and pixel format conversion time for the 24MP image. JPEG takes more effort to decode than the uncompressed TIFF, so you’d expect all the tests to see a similar benefit from the removal of that workload.

The WIC resizer actually took quite a bit longer, though. There’s a simple explanation for that too. When resizing a JPEG source, WIC is able to use the decoder to do part of the scaling operation. I covered this fact in my examination of the WIC scaler some time back. The short version is, the JPEG decoder would have transformed the image from 6000x4000 down to 750x500 (an 8:1 reduction) before even handing a single pixel to the WIC scaler. That’s why the WIC numbers were so good in the last test. It finished the whole operation in less time than the others took to just decode the source image. That’s also why its parallel numbers were unrealistically good. There was very little processing going on compared to what you’d expect. Fancy, no? In case you’re wondering, I’m able to do the same in MagicScaler, but I’ve disabled that feature for these tests to keep them fair. The WIC results for this test are still quite impressive, but notice the parallel numbers are more in line with expectations.
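As an aside, this decoder-side scaling isn’t exotic; WPF (which wraps WIC) exposes the same mechanism through DecodePixelWidth. A minimal sketch, with a hypothetical path and sizes matching the example above:

```csharp
using System;
using System.Windows.Media.Imaging;

var bmp = new BitmapImage();
bmp.BeginInit();
bmp.UriSource = new Uri(@"C:\images\source.jpg");   // hypothetical 6000px-wide JPEG
bmp.DecodePixelWidth = 750;                          // the 8:1 reduction happens in the decoder
bmp.CacheOption = BitmapCacheOption.OnLoad;
bmp.EndInit();
```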

The only one I can’t fully explain is the FastScaling result. My guess is it would have gained the same 35ms advantage as the others except it squandered that advantage with excessive memory allocations. You’ll see what I mean in just a bit. Large memory allocations are slow, and that’s my best guess for why it failed to improve as much as the others.

And in case you missed it, the biggest news here is that GDI+ is no longer in last place. FastScaling takes over that honor. They were close in the last test, and now they’ve flipped. GDI+ edges it out by just under 4%. I was surprised by these results, so I ran them a few times. They’re correct.

The real reason I switched to a TIFF input, though, was not to point out those things. I switched to limit the number of variables between this test and the next one.

You see, the reality is GDI+ is just not optimized for what we would like it to be. Remember, GDI+ was not made for server-side image processing and certainly not for web apps. It was made for processing images/graphics for display on the screen or sending to a printer. Its typical operations involve lots of compositing, layering, drawing, etc. Basically the kinds of things you need for windows. So it might make sense that GDI+ would do all of its processing in an RGBA pixel format. I posited as much in a previous post and showed some evidence to back that up. If you want to see GDI+ at its best, you have to give it the task it was actually built to do.

Here are the results of the same test I did above, only this time the input image was swapped for an RGBA TIFF. Of course this image has no transparency; it’s simply a format change to illustrate performance characteristics.

fscubrgbatif

Well, well, well… what do you know… GDI+ is much faster than FastScaling here. In fact, even on the 8 thread parallel test, GDI+ only took twice as long as FastScaling, and it had seven of its threads tied behind its back.

Remember the difference in pixel counts for my breakdown of orthogonal vs non-orthogonal processing earlier? Let’s revisit those calculations with this example. Processed non-orthogonally, this resize has 384.5 million pixels sampled, multiplied by 4 channels, for a total of just over 1.5 billion sample calculations. Processed orthogonally, that becomes 102.4 million pixels * 4 channels, which is just ~410 million sample calculations. DrawImage() is doing nearly 4 times as many calculations as FastScaling and completing 34% faster anyway.

What’s really interesting here is that if you compare the numbers across the last two tests, you’ll find DrawImage() was roughly the same speed with RGBA as it was with RGB, whereas all three of the other scalers were significantly slower (WIC doesn’t look that much worse, but it’s doing less than 1/4 the processing of the others). In fact, GDI+ was as fast at scaling in RGBA as FastScaling was in RGB. One might infer from those numbers that DrawImage() is missing the optimized code path for RGB that all three of the other scalers have. When doing the one thing it’s good at, GDI+ isn’t actually all that bad. And FastScaling looks a lot less clever by comparison.

Of course, it is a real bummer that DrawImage() isn’t optimized for RGB processing, and it’s a bummer that it doesn’t process orthogonally. Most of the work we do in web image resizing only involves rectangles. And most of it is on RGB images, particularly when we’re dealing with high-res photos. Those are usually JPEGs, which don’t support transparency at all. There is a huge benefit to taking the fast path on those images, and that’s a real opportunity for performance improvements. Anything that takes advantage of that opportunity should beat GDI+ performance-wise. Again, I’m actually surprised FastScaling failed to better GDI+ in the RGB TIFF test, but the numbers say it did.

With all that in mind, let’s look at MagicScaler’s numbers. They’re a decent improvement over GDI+ in RGBA mode, but nothing earth-shattering. We beat GDI+ handily in RGB (over 1.9x as fast), but it’s a much closer race in RGBA (25%).

And just for fun, because I guessed in that earlier post that GDI+ actually uses a Premultiplied Alpha format for its internal processing, let’s see how we all compare with that kind of input. Here is a test run with a PARGB TIFF input.

fscubprgbatif

GDI+ gets even faster when given its native format for processing, and FastScaling gets even slower. Here GDI+ is almost 64% faster. Notice WIC also got faster with PARGB input, so we can assume its RGBA processing converts to PARGB as well. I haven’t built a native processor for PARGB in MagicScaler since this type of file is pretty rare in web resizing scenarios, but MagicScaler does still manage to edge out GDI+ even when it’s doing the thing it does best.

And in case you overlooked it again in the numbers, I want to revisit the comment I made about DrawImage blocking all concurrent calls until they all complete. You might expect that if I fired off 8 calls to DrawImage() on 8 threads one after another, the first one should finish in a normal amount of time and the last one should take the longest as it waits in the queue behind the other 7. We’d expect to see a huge standard deviation on those, but that’s not the case. They all returned at the same time. This behavior makes GDI+ scale even less well than you might have guessed in a server environment.
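If you want to see this for yourself, a test along these lines (my own sketch, not the benchmark harness used in this series) makes the behavior obvious:

```csharp
using System;
using System.Diagnostics;
using System.Drawing;
using System.Drawing.Drawing2D;
using System.Threading.Tasks;

var sw = Stopwatch.StartNew();
var completions = new long[8];

Parallel.For(0, 8, i =>
{
    using var src = new Bitmap(@"C:\images\source.tif");   // hypothetical path
    using var dst = new Bitmap(400, 267);
    using var g = Graphics.FromImage(dst);
    g.InterpolationMode = InterpolationMode.HighQualityBicubic;
    g.DrawImage(src, new Rectangle(0, 0, 400, 267));
    completions[i] = sw.ElapsedMilliseconds;
});

// With DrawImage's process-wide lock, the completion times cluster together rather
// than spreading out in queue order as you'd expect from serialized work.
Console.WriteLine(string.Join("ms, ", completions) + "ms");
```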

Speaking of which…

FastScaling’s Dirty Little Secret

I promised in the last post that I’d reveal this, and I hinted at it earlier. A picture is worth a thousand words in this case.

This is a Visual Studio Diagnostic Tools trace of another benchmarking run configured the same way as the last one (PARGB TIFF input), although the results are similar regardless of input format.

fsdirty

Ok, maybe this picture requires just a few words…

Each test here had a breakpoint, followed by a forced Gen 2 garbage collection, followed by a 1-second sleep, followed by the 3 portions of the test (10 runs serial, 4 parallel, 8 parallel).

The breakpoints separate each component’s test nicely in the Events section (note the gaps in the grey bar), and I’ve labeled each one using everyone’s favorite graphics program: MS Paint. The garbage collections (orange arrows) ensure nothing is left over from one test to the next, and the sleep puts a nice break in the CPU chart before each test gets going. If you haven’t used this tool before, hopefully the graphs are self-explanatory, but I’ll call out some details. This debug session, by the way, was run on a release build.
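For reference, each segment of the test harness has roughly this shape. RunSerial and RunParallel here are hypothetical stand-ins for the actual benchmark methods:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

Debugger.Break();                          // creates the gap between tests in the Events panel
GC.Collect(2, GCCollectionMode.Forced);    // make sure nothing lingers from the previous test
GC.WaitForPendingFinalizers();
Thread.Sleep(1000);                        // visible break in the CPU chart before the test

RunSerial(iterations: 10);                 // 10 serial runs
RunParallel(threads: 4);                   // 4-way parallel
RunParallel(threads: 8);                   // 8-way parallel
```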

The baseline memory usage at the start of this graph is 156MiB. The reason it’s that high, even though nothing has happened yet, is that I pre-loaded the source file into memory so that I could wrap it in a MemoryStream and pass it to each component. I could have passed in the file name and had each component do its own read, but this is a 92MiB TIFF file, and when we get to the parallel processing tests, disk I/O could become a bottleneck. Using a shared copy in memory removes that possibility and makes the tests more fair. Each component still has to do the full decode and encode; we’re really just caching the source file.

When the GDI+ test begins, there is a step up in memory usage to 249MiB. That represents a decoded copy of the 92MiB source, plus another 1MiB of… miscellaneous. Because the source is uncompressed and already in PARGB format, the encoded and decoded images are the same size. For the entire duration of the GDI+ test run, the CPU chart is steady at ~13% (one of 8 virtual cores) and the memory usage is flat. It actually peaks at 255MiB, but that’s just noise at this level. So no matter how many parallel calls we made to DrawImage(), there was only ever one decoded copy of the image in memory and one core doing work.

I’ll pause here and point out that it’s really not cool that GDI+ decodes the entire image into memory at once and holds it there for the duration of processing. The fact that a 93MiB jump in memory usage looks so insignificant on this graph is a hint to just how out-of-control things got later. In isolation, I would have said that was way too much memory to dedicate to a single operation. That’s a real killer for scalability. Fortunately, this is mitigated by the fact that GDI+ will only ever do this once at a time, due to its non-reentrant nature. I don’t know if this is the actual reason for that design or if it has its roots in the fact that GDI+ was designed for screen use. Maybe it has to do with handles and device contexts and what-not. I dunno. Whatever the reason, GDI+ essentially protects you from yourself if you’re using it in a server environment. It may not scale well for speed, but at least it won’t run away with all your memory.

Moving on to the WIC test, you see GDI+’s in-memory decoded copy of the bitmap has been dropped, and we’re back to the baseline memory level, which has moved up to 157MiB by now, because we’re starting to fill in the test results in the UI. The important thing is, the memory usage remains flat throughout the test run, peaking at only 160MiB. WIC never has to load the entire source image because it processes the output by pulling only as many pixels as it needs at a time through the pipeline. The CPU usage is flat at one core for the duration of the serial runs, then we get a nice spike as the parallel tests kick off. From a server-side scalability standpoint, this segment is a beauty.

Then there’s the FastScaling test. If the WIC test was a scalability beauty, this one is U.G.L.Y. (it ain’t got no alibi). The lowest memory usage observed during this test was 345MiB. That’s 96MiB more than GDI+ ever used, and that’s the minimum. Near the beginning of the test you can see the memory usage creep up to a high of 624MiB before the garbage collector decides it’s time to take action. As the serial runs continue, we see a cycle of rolling hills in the memory usage, with the value repeatedly climbing to 536MiB before the GC kicks in again taking it back down to 444MiB. Then the parallel tests start, and all hell breaks loose. Memory usage peaked at over 2.7GiB during the 8 thread test. But at least they broke free of GDI+’s single-thread restriction. That’s worth it, right?

Finally, we get to the MagicScaler test, and you can see that, much like WIC (because they’re like this [holds up crossed fingers]), memory usage is almost flat through the entire test. It starts with a baseline of 159MiB and peaks at 179MiB. MagicScaler needs more buffering memory than WIC does because it’s doing a more complex resample operation, but 20MiB for 8 parallel operations on an image this size is quite reasonable, I think. Mostly, it looks like the WIC test but with higher CPU usage. Like I said, quality isn’t cheap.

There’s one final thing I want to address that you may have noticed in the above chart. There are a bunch of little red diamonds in the Events panel during the FastScaling run. Those are all exceptions thrown during the FastScaling test, but they’re all caught internally by the ImageResizer library. As far as I can tell, they didn’t affect the test results. The exception, in case you’re curious, was an UnauthorizedAccessException, saying “MemoryStream’s internal buffer cannot be accessed”. It appears ImageResizer was attempting to call GetBuffer() on the MemoryStream passed in to it. That MemoryStream wrapped an existing byte array, so that’s not allowed. I don’t know why ImageResizer didn’t just use the Stream it was given, but that may have been an attempt at optimization. The other components use the Stream interface, so that failure kept them on even ground.
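The behavior is easy to reproduce, by the way. A MemoryStream constructed over an existing byte array doesn’t expose its buffer unless you ask for that explicitly:

```csharp
using System;
using System.IO;

byte[] data = File.ReadAllBytes("source.tif");   // hypothetical file

var wrapped = new MemoryStream(data);
// wrapped.GetBuffer();                          // throws UnauthorizedAccessException

// The overload with publiclyVisible: true is the one that allows GetBuffer():
var exposed = new MemoryStream(data, 0, data.Length, writable: false, publiclyVisible: true);
byte[] buffer = exposed.GetBuffer();             // OK
```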

Truth from Lies

I started off Part 1 of this series by saying that benchmarks can lie, because we can always make the numbers show whatever we want. But of course, benchmarks can also be a valuable tool for learning the true nature of performance. We used them to prove -- at least in my mind -- that DrawImage() isn’t really bad; it’s just misunderstood. Or more accurately: it’s misused. We saw that it could easily take over a system’s resources if it were allowed to, so its non-reentrant nature is probably a good thing. Removing that brake and allowing it to run wild is ill-advised, and yet that’s pretty much what FastScaling does, only worse. Most of FastScaling’s performance claims are rooted in the fact that it can and will take over your server if you let it.

We also saw that when GDI+ is doing what it was designed to do, it’s not terrible at it. A separate code path optimized for rectangular RGB images would have been nice, but that wasn’t part of its design. FastScaling obviously does have those optimizations, and on that front, we saw it does edge out DrawImage() performance-wise, sometimes. That’s a win, albeit a small one. Overall, though, it’s misplaced in an architecture that is hostile to server environments.

And we saw that there are significant trade-offs when it comes to performance vs quality. WIC was ridiculously fast, but the image quality with its Fant scaler isn’t good enough for most uses. Getting to the quality level of GDI+’s high-quality scaling negates most of the performance improvements FastScaling claims. There is benefit to flexibility, though, and having the ability to balance performance with quality is a good thing. There is a middle ground between WIC and GDI+, and FastScaling seeks to make its place there. MagicScaler seeks the same but takes a different approach. In my next post, we’ll start exploring that area and planting some flags.