PhotoSauce Blog


Most of the time when I write about System.Drawing/GDI+, I’m pointing out its flaws and talking about how much better MagicScaler is for server-side image processing. It’s odd, then, that I now find myself writing a post defending it. In my last post, I quoted the documentation page from ImageResizer’s FastScaling plugin and said I’d address a part of it I skipped over. Here it is:

Unlike DrawImage, [FastScaling] uses orthogonal/separable resampling, and requires less of the CPU cache.

For those who haven’t studied up on Eric Lippert’s Five Dollar Words for Programmers™ or aren’t familiar with the basic mechanics of image resampling, I’ll give some background. Orthogonal/separable in this context simply means that instead of resizing both dimensions (width and height) at once, calculating the final value for each output pixel in a single pass, you can resize each dimension separately. It works out that for almost all standard resampling algorithms, you can do it either way and get the exact same results. The reason this matters is that if you’re using a resampling algorithm that samples each pixel value more than once -- and any good one will -- it’s much less expensive to do it orthogonally.

Take, for example, the cubic resampler we tested in Part 1. Cubics usually require a sample window of 4 (remember FastScaling got that wrong by default), which means they sample a 4x4 pixel area in the source image to determine the value of a single pixel in the output image. On top of that, when you scale an image down, you must scale the sample area up proportionally to make sure you sample all the source pixels. Scaling the sample area up is effectively what makes a high-quality scaler high quality. Low-quality cubic scalers (like the one in WIC) just stick with 4x4 regardless of ratio.

So if, as we did in Part 1, you’re scaling a 6000x4000 image down to 400x267 (a 15:1 reduction), you need to sample a 60x60 pixel area from the input for each pixel in the output (the 4x4 window scaled up by that same 15:1 ratio). That would mean, in a naïve implementation, you would have to process 400*267*60*60 (384.5 million) pixels to perform that resize. In other words, you would read and perform calculations on each of the 24 million input pixels 16 times (the 4x4 sample size). And for RGB or RGBA images, those numbers would be multiplied by 3 or 4 channels, respectively. You could easily be doing over a billion sample calculations for this seemingly-simple resize operation.

To do the same resize orthogonally, you would first resize to 400x4000, sampling only in the horizontal dimension, so you sample only 60 pixels for each output pixel. That’s 400*4000*60 (96 million) pixels for the first dimension. Then 400*267*60 (6.4 million) for the other dimension. That’s a grand total of 102.4 million pixels processed instead of 384.5 million, a huge savings considering they produce the same result.
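If you want to sanity-check that arithmetic, it’s easy to reproduce. This little snippet just restates the numbers from the example above; nothing in it comes from any of the libraries being tested:

    using System;

    // Back-of-the-envelope sample counts for the 6000x4000 -> 400x267 example.
    // The cubic's 4-pixel support window is scaled up by the 15:1 reduction ratio.
    int srcW = 6000, srcH = 4000;
    int dstW = 400, dstH = 267;
    int window = 4 * (srcW / dstW);                         // 4 * 15 = 60 samples per axis

    long naive      = (long)dstW * dstH * window * window;  // ~384.5 million
    long horizontal = (long)dstW * srcH * window;           // ~96 million
    long vertical   = (long)dstW * dstH * window;           // ~6.4 million
    long separable  = horizontal + vertical;                // ~102.4 million

    Console.WriteLine($"naive: {naive:N0}  separable: {separable:N0}");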

Besides the huge reduction in work done, the other benefit of processing orthogonally is cache locality. During the first step of the resize -- where 94% of the processing is done in this example -- the pixels being processed are located in the same line(s) and are, therefore, contiguous in memory. That improves your cache hit ratio. This is the reason almost all resizers will process the horizontal dimension first.

It would be downright foolish to do it any other way, really -- unless you had a good reason to. It turns out (and I have to thank Nathanael Jones, the creator of ImageResizer/FastScaling, for pointing this out to me) that DrawImage() does have a reason to do it otherwise. Some of its many, many overloads allow you to pass in an arbitrary set of 3 points, which it uses to define a destination parallelogram. That allows you to do things like this:

[Image: the reference photo skewed into a parallelogram by DrawImage()]

Neat… I guess…
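For reference, that skew comes from one of the Point[] overloads. A minimal sketch (the file name is just a placeholder); the three points are the upper-left, upper-right, and lower-left corners of the destination parallelogram:

    using System.Drawing;

    using var src = Image.FromFile(@"witch.jpg");          // placeholder file name
    using var dst = new Bitmap(src.Width, src.Height);
    using var g = Graphics.FromImage(dst);

    // Upper-left, upper-right, and lower-left corners of the destination
    // parallelogram. Because the destination need not be an axis-aligned
    // rectangle, GDI+ can't assume a separable (orthogonal) resize.
    var corners = new[]
    {
        new Point(150, 0),
        new Point(src.Width, 100),
        new Point(0, src.Height - 100)
    };
    g.DrawImage(src, corners);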

But orthogonal processing only works for rectangles, so in order to support this very fancy feature, DrawImage() has to do it the hard way. You pay that penalty every time you use it.

Given that, it should be quite easy for any scaler that doesn’t pay that penalty to beat GDI+. We saw in Part 1 that FastScaling did, but only barely. In this post, we’ll look at some cases where it doesn’t at all. That means more benchmarks! Yay!

Before that, though, I have one more quote to review from the FastScaling page:

Graphics.DrawImage() holds a process-wide lock, and is a very severe bottleneck for any imaging work on the GDI+/.NET platform. This is unfortunate, as WIC and WPF do not offer any high-quality resampling filters, and DirectX is 10-20X slower than DrawImage.

DrawImage also implements a general distortion filter. This type of filter thrashes the CPU cache; it is not optimized for linear memory access. It does not parallelize well on multiple cores even when used in separate processes.

There’s a lot to digest in those short paragraphs. I’ll start with the statements that are true:

DrawImage() does hold a process-wide lock. We’ve seen evidence of it in the benchmarks I’ve run through so far. All calls to DrawImage() are serialized, and in fact, when multiple calls are made, they will all block until they are all done. That’s why the performance numbers for my parallel test runs show almost no jitter in the timings. I’ll point that out again later in case you missed it the first time.

DrawImage() does also implement a ‘general distortion filter’, sort of. That’s actually not a term with which I was familiar, so, as I was taught when I was younger, I looked it up in the dictionary. I mean, I google-bing’ed it… The most plausible definition I could find comes from ImageMagick, which implements a class of resizing operations that are non-orthogonal so that they can be combined with an affine transform to do things like we saw above with the parallelogram, only with fancier options. Again, that is bad for caching, since the pixel data isn’t read sequentially as it is in the orthogonal case.

It’s also true that WIC (and WPF by extension) doesn’t have high-quality resampling filters [built-in]. Or at least it didn’t. Windows 10 added a high-quality Catmull-Rom filter, as I discussed in an update to my post on WIC interpolation. That should be present in Windows Server 2016 when it’s released as well, but I haven’t yet verified that. In any case, it’s not available on servers today in a built-in fashion.

But of course WIC is pluggable by design, and it’s possible to plug in a high-quality scaler. I know, because that’s exactly what the core of MagicScaler is. I took the superior architecture of WIC and plugged in the part that was missing. The statement above dismisses WIC as a useful solution because it’s missing something, but then it suggests that plugging the same type of component into the inferior and inherently non-pluggable GDI+ architecture is a good alternative. Bah, I say.

As for DirectX, it is not 10-20x slower than DrawImage(). DirectX is hardware accelerated, and its performance very much depends on your graphics hardware and the shader code you’re running on it. Integrating WIC with DirectX can yield amazing performance with the right hardware, and in fact, many of the improvements to WIC over the last couple of Windows releases have been related to integration with DirectX for hardware-accelerated imaging. Seriously, if you thought WIC looked fast before, that’s nothing. But since the target for FastScaling (like MagicScaler) is server apps, it is reasonably fair to rule out DirectX as a valid replacement for GDI+ functions. Most servers don’t have GPUs, and the ones that do are generally very expensive. Software-only processing in DirectX is relatively slow, so I can only hope the statement above was an allusion to that.

Those statements about WIC and DirectX seem to be justifications for staying within a GDI+ solution and simply replacing the supposedly broken DrawImage() implementation. That’s faulty logic, as GDI+’s shortcomings are not just limited to DrawImage() performance. We’ve already seen how much faster things can be in a different architecture (like WIC), and we’ll explore that a bit more in this post.

Back to the Numbers

We did see in Part 1 of this series that GDI+ came in last in our baseline benchmark. It wasn’t miles behind, but it was last. Is there anything it’s good at?

In order to answer that question, we’re going to need to do some more testing. As in the last post, I’ll try to minimize the number of variables in play between any two tests, so I’m going to start with the benchmark I ended with last time. But this time I’ll change just one thing. I’m going to switch the input image to an uncompressed RGB TIFF. I’ll explain why in a sec. But first the numbers:

[Image: benchmark results, RGB TIFF input]

A lot of interesting things happened here. Here’s how the numbers compare with the last test run I did. Again, I’m sticking with the single-threaded numbers for now.

              JPEG Input   TIFF Input
FastScaling   376ms        380ms
GDI+          405ms        367ms
WIC           36ms         75ms
MagicScaler   228ms        192ms

I’ll start with the simple ones. GDI+ and MagicScaler both improved by about 35ms in this test. That 35ms likely represents the reduction in decoding and pixel format conversion time for the 24MP image. JPEG takes more effort to decode than the uncompressed TIFF, so you’d expect all the tests to see a similar benefit from the removal of that workload.

The WIC resizer actually took quite a bit longer, though. There’s a simple explanation for that too. When resizing a JPEG source, WIC is able to use the decoder to do part of the scaling operation. I covered this fact in my examination of the WIC scaler some time back. The short version is, the JPEG decoder would have transformed the image from 6000x4000 down to 750x500 (an 8:1 reduction) before even handing a single pixel to the WIC scaler. That’s why the WIC numbers were so good in the last test. It finished the whole operation in less time than the others took to just decode the source image. That’s also why its parallel numbers were unrealistically good. There was very little processing going on compared to what you’d expect. Fancy, no? In case you’re wondering, I’m able to do the same in MagicScaler, but I’ve disabled that feature for these tests to keep them fair. The WIC results for this test are still quite impressive, but notice the parallel numbers are more in line with expectations.
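If you want to see that decoder-assisted scaling from managed code, the easiest illustration is WPF’s BitmapImage, which exposes it through DecodePixelWidth. This is only an illustration of the concept, not the code used in these benchmarks, and the file name is a placeholder:

    using System;
    using System.Windows.Media.Imaging;   // WPF (PresentationCore)

    var bmp = new BitmapImage();
    bmp.BeginInit();
    bmp.UriSource = new Uri("witch.jpg", UriKind.Relative);   // placeholder path
    // Asking for a smaller decode size lets the WIC JPEG decoder do a cheap
    // power-of-two reduction before any pixels reach the scaler, e.g. the
    // 6000x4000 source comes out of the decoder at 750x500 for this resize.
    bmp.DecodePixelWidth = 750;
    bmp.CacheOption = BitmapCacheOption.OnLoad;
    bmp.EndInit();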

The only one I can’t fully explain is the FastScaling result. My guess is it would have gained the same 35ms advantage as the others except it squandered that advantage with excessive memory allocations. You’ll see what I mean in just a bit. Large memory allocations are slow, and that’s my best guess for why it failed to improve as much as the others.

And in case you missed it, the biggest news here is that GDI+ is no longer in last place. FastScaling takes over that honor. They were close in the last test, and now they’ve flipped. GDI+ edges it out by just under 4%. I was surprised by these results, so I ran them a few times. They’re correct.

The real reason I switched to a TIFF input, though, was not to point out those things. I switched to limit the number of variables between this test and the next one.

You see, the reality is GDI+ is just not optimized for what we would like it to be. Remember, GDI+ was not made for server-side image processing and certainly not for web apps. It was made for processing images/graphics for display on the screen or sending to a printer. Its typical operations involve lots of compositing, layering, drawing, etc. Basically the kinds of things you need for windows. So it might make sense that GDI+ would do all of its processing in an RGBA pixel format. I posited as much in a previous post and showed some evidence to back that up. If you want to see GDI+ at its best, you have to give it the task it was actually built to do.

Here are the results of the same test I did above, only this time the input image was swapped for an RGBA TIFF. Of course this image has no transparency; it’s simply a format change to illustrate performance characteristics.

[Image: benchmark results, RGBA TIFF input]

Well, well, well… what do you know… GDI+ is much faster than FastScaling here. In fact, even on the 8 thread parallel test, GDI+ only took twice as long as FastScaling, and it had seven of its threads tied behind its back.

Remember the difference in pixel counts for my breakdown of orthogonal vs non-orthogonal processing earlier? Let’s revisit those calculations with this example. Processed non-orthogonally, this resize has 384.5 million pixels sampled, multiplied by 4 channels, for a total of just over 1.5 billion sample calculations. Processed orthogonally, that becomes 102.4 million pixels * 4 channels, which is just ~410 million sample calculations. DrawImage() is doing nearly 4 times as many calculations as FastScaling and completing 34% faster anyway.

What’s really interesting here is that if you compare the numbers across the last two tests, you’ll find DrawImage() was roughly the same speed with RGBA as it was with RGB, whereas all three of the other scalers were significantly slower (WIC doesn’t look that much worse, but it’s doing less than 1/4 the processing of the others). In fact, GDI+ was as fast at scaling in RGBA as FastScaling was in RGB. One might infer from those numbers that DrawImage() is missing the optimized code path for RGB that all three of the other scalers have. When doing the one thing it’s good at, GDI+ isn’t actually all that bad. And FastScaling looks a lot less clever by comparison.

Of course, it is a real bummer that DrawImage() isn’t optimized for RGB processing, and it’s a bummer that it doesn’t process orthogonally. Most of the work we do in web image resizing only involves rectangles. And most of it is on RGB images, particularly when we’re dealing with high-res photos. Those are usually JPEGs, which don’t support transparency at all. There is a huge benefit to taking the fast path on those images, and that’s a real opportunity for performance improvements. Anything that takes advantage of that opportunity should beat GDI+ performance-wise. Again, I’m actually surprised FastScaling failed to better GDI+ in the RGB TIFF test, but that’s what the numbers say.

With all that in mind, let’s look at MagicScaler’s numbers. They’re a decent improvement over GDI+ in RGBA mode, but nothing earth-shattering. We beat GDI+ handily in RGB (over 1.9x as fast), but it’s a much closer race in RGBA (25%).

And just for fun, because I guessed in that earlier post that GDI+ actually uses a Premultiplied Alpha format for its internal processing, let’s see how we all compare with that kind of input. Here is a test run with a PARGB TIFF input.

[Image: benchmark results, premultiplied-alpha (PARGB) TIFF input]

GDI+ gets even faster when given its native format for processing, and FastScaling gets even slower. Here GDI+ is almost 64% faster. Notice WIC also got faster with PARGB input, so we can assume its RGBA processing converts to PARGB as well. I haven’t built a native processor for PARGB in MagicScaler since this type of file is pretty rare in web resizing scenarios, but MagicScaler does still manage to edge out GDI+ even when it’s doing the thing it does best.

And in case you overlooked it again in the numbers, I want to revisit the comment I made about DrawImage blocking all concurrent calls until they all complete. You might expect that if I fired off 8 calls to DrawImage() on 8 threads one after another, the first one should finish in a normal amount of time and the last one should take the longest as it waits in the queue behind the other 7. We’d expect to see a huge standard deviation on those, but that’s not the case. They all returned at the same time. This behavior makes GDI+ scale even less well than you might have guessed in a server environment.
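If you want to see that serialization for yourself, a bare-bones sketch like this (hypothetical file name and sizes) will show all eight per-task times clustering together instead of fanning out:

    using System;
    using System.Diagnostics;
    using System.Drawing;
    using System.Drawing.Drawing2D;
    using System.Threading.Tasks;

    var tasks = new Task<long>[8];
    for (int i = 0; i < tasks.Length; i++)
    {
        tasks[i] = Task.Run(() =>
        {
            var sw = Stopwatch.StartNew();
            using var src = Image.FromFile(@"witch.tif");        // placeholder
            using var dst = new Bitmap(400, 267);
            using var g = Graphics.FromImage(dst);
            g.InterpolationMode = InterpolationMode.HighQualityBicubic;
            g.DrawImage(src, 0, 0, 400, 267);                    // serialized internally
            return sw.ElapsedMilliseconds;
        });
    }
    Task.WaitAll(tasks);

    // With DrawImage's process-wide lock, these times cluster together rather
    // than the first call finishing quickly and the last one much later.
    Console.WriteLine(string.Join(", ", Array.ConvertAll(tasks, t => t.Result)));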

Speaking of which…

FastScaling’s Dirty Little Secret

I promised in the last post that I’d reveal this, and I hinted at it earlier. A picture is worth a thousand words in this case.

This is a Visual Studio Diagnostic Tools trace of another benchmarking run configured the same way as the last one (PARGB TIFF input), although the results are similar regardless of input format.

[Image: Visual Studio Diagnostic Tools trace of the benchmark run]

Ok, maybe this picture requires just a few words…

Each test here had a breakpoint, followed by a forced Gen 2 garbage collection, followed by a 1-second sleep, followed by the 3 portions of the test (10 runs serial, 4 parallel, 8 parallel).
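In code, the skeleton of each segment looks roughly like this; RunSerial and RunParallel are hypothetical stand-ins for the real per-component test methods:

    using System;
    using System.Diagnostics;
    using System.Threading;

    // Hypothetical stand-ins for the real per-component test code.
    void RunSerial(int iterations) { /* timed resizes, one after another */ }
    void RunParallel(int degree)   { /* 'degree' resizes started concurrently */ }

    Debugger.Break();                                         // marks the segment in the Events track
    GC.Collect(2, GCCollectionMode.Forced, blocking: true);   // sweep out anything left by the last test
    Thread.Sleep(1000);                                       // visible gap in the CPU graph

    RunSerial(iterations: 10);
    RunParallel(degree: 4);
    RunParallel(degree: 8);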

The breakpoints separate each component’s test nicely in the Events section (note the gaps in the grey bar), and I’ve labeled each one using everyone’s favorite graphics program: MS Paint. The garbage collections (orange arrows) ensure nothing is left over from one test to the next, and the sleep puts a nice break in the CPU chart before each test gets going. If you haven’t used this tool before, hopefully the graphs are self-explanatory, but I’ll call out some details. This debug session, by the way, was run on a release build.

The baseline memory usage at the start of this graph is 156MiB. The reason it’s that high, even though nothing has happened yet, is that I pre-loaded the source file into memory so that I could wrap it in a MemoryStream and pass it to each component. I could have passed in the file name and had each component do its own read, but this is a 92MiB TIFF file, and when we get to the parallel processing tests, disk I/O could become a bottleneck. Using a shared copy in memory removes that possibility and makes the tests more fair. Each component still has to do the full decode and encode; we’re really just caching the source file.

When the GDI+ test begins, there is a step up in memory usage to 249MiB. That represents a decoded copy of the 92MiB source, plus another 1MiB of… miscellaneous. Because the source is uncompressed and already in PARGB format, the encoded and decoded images are the same size. For the entire duration of the GDI+ test run, the CPU chart is steady at ~13% (one of 8 virtual cores) and the memory usage is flat. It actually peaks at 255MiB, but that’s just noise at this level. So no matter how many parallel calls we made to DrawImage(), there was only ever one decoded copy of the image in memory and one core doing work.

I’ll pause here and point out that it’s really not cool that GDI+ decodes the entire image into memory at once and holds it there for the duration of processing. The fact that a 93MiB jump in memory usage looks so insignificant on this graph is a hint to just how out-of-control things got later. In isolation, I would have said that was way too much memory to dedicate to a single operation. That’s a real killer for scalability. Fortunately, this is mitigated by the fact that GDI+ will only ever do this once at a time, due to its non-reentrant nature. I don’t know if this is the actual reason for that design or if it has its roots in the fact that GDI+ was designed for screen use. Maybe it has to do with handles and device contexts and what-not. I dunno. Whatever the reason, GDI+ essentially protects you from yourself if you’re using it in a server environment. It may not scale well for speed, but at least it won’t run away with all your memory.

Moving on to the WIC test, you see GDI+’s in-memory decoded copy of the bitmap has been dropped, and we’re back to the baseline memory level, which has moved up to 157MiB by now, because we’re starting to fill in the test results in the UI. The important thing is, the memory usage remains flat throughout the test run, peaking at only 160MiB. WIC never has to load the entire source image because it processes the output by pulling only as many pixels as it needs at a time through the pipeline. The CPU usage is flat at one core for the duration of the serial runs, then we get a nice spike as the parallel tests kick off. From a server-side scalability standpoint, this segment is a beauty.

Then there’s the FastScaling test. If the WIC test was a scalability beauty, this one is U.G.L.Y. (it ain’t got no alibi). The lowest memory usage observed during this test was 345MiB. That’s 96MiB more than GDI+ ever used, and that’s the minimum. Near the beginning of the test you can see the memory usage creep up to a high of 624MiB before the garbage collector decides it’s time to take action. As the serial runs continue, we see a cycle of rolling hills in the memory usage, with the value repeatedly climbing to 536MiB before the GC kicks in again taking it back down to 444MiB. Then the parallel tests start, and all hell breaks loose. Memory usage peaked at over 2.7GiB during the 8 thread test. But at least they broke free of GDI+’s single-thread restriction. That’s worth it, right?

Finally, we get to the MagicScaler test, and you can see that, much like WIC (because they’re like this [holds up crossed fingers]), memory usage is almost flat through the entire test. It starts with a baseline of 159MiB and peaks at 179MiB. MagicScaler needs more buffering memory than WIC does because it’s doing a more complex resample operation, but 20MiB for 8 parallel operations on an image this size is quite reasonable, I think. Mostly, it looks like the WIC test but with higher CPU usage. Like I said, quality isn’t cheap.

There’s one final thing I want to address that you may have noticed in the above chart. There are a bunch of little red diamonds in the Events panel during the FastScaling run. Those are all exceptions thrown during the FastScaling test, but they’re all caught internally by the ImageResizer library. As far as I can tell, they didn’t affect the test results. The exception, in case you’re curious, was an UnauthorizedAccessException, saying “MemoryStream’s internal buffer cannot be accessed”. It appears ImageResizer was attempting to call GetBuffer() on the MemoryStream passed in to it. That MemoryStream wrapped an existing byte array, so that’s not allowed. I don’t know why ImageResizer didn’t just use the Stream it was given, but that may have been an attempt at optimization. The other components use the Stream interface, so that failure kept them on even ground.
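That behavior is easy to reproduce. A MemoryStream constructed over an existing array keeps its buffer hidden unless you explicitly opt in; a minimal sketch (placeholder file name):

    using System.IO;

    byte[] file = File.ReadAllBytes(@"witch.tif");     // placeholder for the preloaded source

    // Wrapping an existing array hides the buffer; GetBuffer() throws
    // UnauthorizedAccessException ("MemoryStream's internal buffer cannot be accessed").
    var hidden = new MemoryStream(file);
    // hidden.GetBuffer();   // would throw

    // The overload that marks the buffer publicly visible allows it.
    var visible = new MemoryStream(file, 0, file.Length, writable: false, publiclyVisible: true);
    byte[] buffer = visible.GetBuffer();                // returns the wrapped array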

Truth from Lies

I started off Part 1 of this series by saying that benchmarks can lie, because we can always make the numbers show whatever we want. But of course, benchmarks can also be a valuable tool for learning the true nature of performance. We used them to prove -- at least in my mind -- that DrawImage() isn’t really bad; it’s just misunderstood. Or more accurately: it’s misused. We saw that it could easily take over a system’s resources if it were allowed to, so its non-reentrant nature is probably a good thing. Removing that brake and allowing it to run wild is ill-advised, and yet that’s pretty much what FastScaling does, only worse. Most of FastScaling’s performance claims are rooted in the fact that it can and will take over your server if you let it.

We also saw that when GDI+ is doing what it was designed to do, it’s not terrible at it. A separate code path optimized for rectangular RGB images would have been nice, but that wasn’t part of its design. FastScaling obviously does have those optimizations, and on that front, we saw it does edge out DrawImage() performance-wise, sometimes. That’s a win, albeit a small one. Overall, though, it’s misplaced in an architecture that is hostile to server environments.

And we saw that there are significant trade-offs when it comes to performance vs quality. WIC was ridiculously fast, but the image quality with its Fant scaler isn’t good enough for most uses. Getting to the quality level of GDI+’s high-quality scaling negates most of the performance improvements FastScaling claims. There is benefit to flexibility, though, and having the ability to balance performance with quality is a good thing. There is a middle ground between WIC and GDI+, and FastScaling seeks to make its place there. MagicScaler seeks the same but takes a different approach. In my next post, we’ll start exploring that area and planting some flags.


I’ll open this post with an explanation of the title of this series for those who may not be familiar with the famous quote on which it is based. I’ll be looking at some benchmarks in this series and basically poking holes in them. Let me be clear, I don’t mean to imply that anyone is actually lying in their benchmarks. The important thing to remember about benchmarks is that they’re just numbers. And if you change enough variables, you can almost always come up with a number that serves to reinforce your position. I take care when I do benchmarks to ensure they are both realistic and fair, and I reserve the right to point out when other benchmarks are contrived or disingenuous. But in the end, I don’t expect you to take my word for it either. You should always test for yourself. I aim to give details to back up my numbers so that you can re-create them or explain how they came to be, but that doesn’t mean you shouldn’t question them.

The reason for this series is that I recently became aware of another image scaling component that promises to deliver superior quality and performance over System.Drawing/GDI+ for server-side applications (hey, that’s my line!). I had intended to do all of my benchmarking against my own reference GDI+ resizer so that I could be sure I was doing a like-for-like comparison where possible and because GDI+ was the only real competition I had. In cases where MagicScaler took advantage of a shortcut that wasn’t available in GDI+ or did some additional processing for quality that GDI+ couldn’t match, I’d point that out. But I never got around to doing those detailed benchmarks, and it’s just as well, because now I have something much more interesting to compare to.

The benchmarks in question come from the documentation for the FastScaling plugin from the ImageResizer library. I previously looked at ImageResizer when discussing optimization of DrawImage() and used it as an example of how performance can suffer if the GDI+ settings are not chosen carefully. That the FastScaling plugin is part of the same project is quite interesting because they make some truly impressive performance claims, especially given the poor performance of their GDI+ implementation. I’ll start by quoting from the page linked above.

Unlike DrawImage, [FastScaling] uses orthogonal/separable resampling, and requires less of the CPU cache. On a 4-core laptop, this can translate into a 16-30X increase in throughput when benchmarked against DrawImage for photos from your average 16MP pocket camera. Even when executed on a single-thread, this can mean a 5-12X performance advantage. On a Azure D14 instance, FastScaling has been measured to be 43x faster).

I’ll address the first sentence of that quote later. For now, I’ll focus on the meat of it: the numbers.

There are a lot of numbers crammed in there, and to be quite honest they gave me a bit of a scare at first glance, as my own benchmarking of MagicScaler had shown it to peak at around 10x the performance of GDI+ on a single thread. And the performance gains I achieved required some techniques that aren’t even available in GDI+, like DCT-domain scaling in the JPEG decoder and planar processing of the image. So this must be one fast scaler indeed. The documentation page also explains that the plugin requires the Visual C++ redistributable package, and it comes in platform-specific binaries in the Nuget packages. Surely I wouldn’t be able to match the performance of carefully-tuned C++ (and possibly assembly?) code with my C# implementation, no matter how awesome the JIT. It’s a good thing I optimized the entire pipeline, or this would be a non-starter. But there’s plenty that looks fishy in those numbers, so I wasn’t nervous for too long. And I wasn’t about to accept those benchmarks without testing them for myself.

Fortunately, I just happen to be using a 4-core laptop right now and have some nice 16MP JPEGs lying around my office, so we can put those numbers to the test. As for the Azure D14 (16 Xeon cores / 112GB RAM) instance, I’m not even going to bother with that. First of all, either the resizers scale* well with extra processors, or they don’t. We can see that as well with 4 cores as we can with 16. Secondly, I don’t know too many websites that run on that kind of hardware. That’s a beefy box for a web server, and we usually scale web servers out, not up. I’d argue that my quad-core laptop is much more representative of a typical web server than a D14 is. Plus I’m lazy and don’t want to set one up. So there.

* It gets really difficult to disambiguate between performance scalability and image scaling in the same topic. I trust my audience is astute enough to know which ‘scale’ I mean given the context, so I won’t bother.

Then there’s this:

Will you always see benefits this drastic? No. For tiny images that can fit - in their entirety - on your CPU cache (say 300x300), the gain is small - 25% on a single thread, or 0.8 x core count for a throughput increase.

If you have a large number of cores, you might see much larger increases in throughput.

Including the jpeg encoding and decoding cost (which FastScaling does not address), enabling FastScaling will translate into roughly a 4.5-9.6X increase in overall performance for 16MP images.

Ah, that’s more like it… a touch of realism. So the first set of numbers didn’t include decoding or encoding, which makes them fairly pointless. Decoding and encoding are part of the process, and they can’t rightly be excluded just to make the numbers bigger. We’re talking about components designed for server-side use, so decoding the entire image before processing is a bad practice. Keep in mind, a decoded 16MP RGB image would occupy 48MB of memory. If you’re dedicating that much memory to serving a single image request, you’re doing it wrong. Maybe we have different definitions of scalability, but that does not compute for me. The real-world numbers are the only ones that matter here, but they’re not the ones that are bolded.

There’s one more clue in that quote as to the nature of the performance numbers given. They’re very much dependent on scaling up to more than one core. That explains the D14. More cores means a bigger number. The reason for that is simple: DrawImage() is non-reentrant, so no matter how big your box, it will only ever use one core per process. Removing that limitation is an instant performance win since you can now scale up. Of course, there’s a reason why DrawImage() is implemented that way, and I’ll show you what that might be in the next post. It’s worth building something that removes that limitation, but only if you address the underlying reasons for that limitation (hint: they didn’t).

One final quote before we move on:

FastScaling can make a range of adjustments to favor speed or quality on very large images. These typically set default values for the filter type/size and for averaging optimizations.

Averaging optimization: If you're downscaling an image to 1/20th of its size, FastScaling will use an averaging filter to scale it to 1/6th, then scale the remaining 1/3.333 using a high-quality filter. Since no filters use a window larger than ~3x the scale filter, this does not measurably affect quality. You can make this optimization more agressive [sic] by increasing the speed value, or reduce it by specifying a negative value.

Ok, so that part about ‘averaging optimization’ is referring to a technique I also employ in MagicScaler. It’s a fairly common practice, and as I mentioned in a previous post, it would be possible to do this with DrawImage() as well. In fact, I might add it to my GDI+ resizer just for grins. It does measurably affect quality, as we’ll see later, but it’s a tradeoff that is acceptable sometimes.
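If you wanted to bolt a similar shortcut onto a GDI+ resizer, it might look something like this two-stage sketch. The 3x intermediate threshold is purely illustrative, not the rule FastScaling or MagicScaler actually uses, and GDI+ has no true box/averaging filter, so the first stage trades some quality for speed:

    using System;
    using System.Drawing;
    using System.Drawing.Drawing2D;

    static Bitmap HybridResize(Image src, int dstW, int dstH)
    {
        // Stage 1: quick, lower-quality reduction to roughly 3x the target size.
        int midW = Math.Max(dstW, Math.Min(src.Width, dstW * 3));
        int midH = Math.Max(dstH, Math.Min(src.Height, dstH * 3));

        using var mid = new Bitmap(midW, midH);
        using (var g = Graphics.FromImage(mid))
        {
            g.InterpolationMode = InterpolationMode.Bilinear;           // cheap first pass
            g.DrawImage(src, 0, 0, midW, midH);
        }

        // Stage 2: the expensive, high-quality filter only has a small ratio left to cover.
        var dst = new Bitmap(dstW, dstH);
        using (var g = Graphics.FromImage(dst))
        {
            g.InterpolationMode = InterpolationMode.HighQualityBicubic;
            g.DrawImage(mid, 0, 0, dstW, dstH);
        }
        return dst;
    }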

The less clear part of that quote is the ‘range of adjustments to favor speed or quality’. It certainly looks like the numbers given are the result of some of these adjustments and shortcuts, but there are no specifics. The graphs shown in the page have labels for ‘speed prioritized’ and ‘quality optimized’, but it’s not clear how much quality was sacrificed in order to bump up the performance numbers, nor does there seem to be a speed comparison with identical quality to DrawImage(). That’s not cool if you ask me. Relative performance is only meaningful if you understand the baseline, and too many variables were changed for the numbers to be comparable.

To be fair, it does appear the source code for the benchmark program is available on github, but I don’t really think it’s right to put all the big numbers and pretty graphs on the page and then expect someone to dig into your source code to figure out just how many shortcuts you took to get those numbers. The numbers should reflect real-world scenarios, and where quality differs, that should be explicitly spelled out or better yet, shown. It’s totally cool to trade quality for performance, but let’s be clear about it. Those are the rules I’ll be following in this series.

/rant.

Evening the Playing Field

With that last point about image quality in mind, I figured it would be best to start off by measuring baseline performance between GDI+ and FastScaling at the same quality level. If, as that opening quote implied, FastScaling is implemented in a cleverer way than DrawImage(), it should win any head-to-head comparison by a significant margin, without having to sacrifice quality to do it. We’ll be verifying that.

The first step in establishing that baseline will be examining the resampling algorithms offered by FastScaling so we can match them up with their equivalents in GDI+ (and MagicScaler). Ensuring that we’re using the same algorithms/parameters means we’ll know we’re doing the same work and getting the same output quality.

Looking through the docs, I see FastScaling has a long list of supported resampling filters, so I won’t list them all here. It appears they’re using standard names for some of the filters, and some appear to be made-up variants of those standards. Some, I’ve never heard of. That makes comparisons tricky, but we’ve seen before that ResampleScope can reveal much about the implementations.

I’ll point out now that ImageResizer and FastScaling are open source, so I could just dig into the source and find out how they’re actually implemented, but I won’t be doing that. There is much overlap between those projects and mine, so I prefer to remain ignorant of their implementations. For that reason, I’ll stick with the documentation available on the website.

Fortunately -- or unfortunately, depending on how many comparisons you wanted to see -- there appears to be very little overlap in the capabilities of GDI+ and FastScaling. We know from my earlier experimentation with the GDI+ InterpolationMode values that there are only two of them that are actually suitable for large-ratio downscaling, and only one of those appears to match up with the filters included in FastScaling.

From the docs:

To mimic DrawImage precisely, use Cubic

Ok.

I’ll assume they mean DrawImage() using the HighQualityBicubic interpolation mode since that’s the default in ImageResizer. We saw in my review of the GDI+ InterpolationMode values that the HighQualityBicubic is actually adaptive depending on resize ratio. I doubt they did actually mimic that ‘precisely’, but we’ll see what the graphs say.

In addition to specifying the algorithm/filter, I’ll also need to explicitly disable the averaging shortcut (and possibly other shortcuts). Those might not come into play now since ResampleScope uses small images anyway, but I’ll use this just to be safe and to be consistent with the next few tests.

&down.speed = -2..4 (default 0) -2 disables averaging optimizations.

I’m using FastScaling 4.0.5 x64, freshly installed from Nuget, and I ended up with the following options for my tests:

width={w}&height={h}&mode=stretch&format=png&fastscale=true&down.speed=-2&down.filter=cubic

The ‘mode=stretch’ is necessary for now because ResampleScope works with images resized on a single dimension. Let’s try those settings and see what we get. First I resize down by 2px.

[Image: ResampleScope graph of FastScaling’s cubic filter, 2px reduction]

This looks mostly like a Cardinal (0, 1) Cubic, which means it’s not a match for GDI+ at all at this size (it’s not until a much larger downscale ratio that GDI+ uses a standard cubic). But there’s something strange going on with the top of the curve. It’s slightly flattened, and it pokes up slightly above 1. Let’s compare that to MagicScaler.

[Image: ResampleScope graph of MagicScaler’s cubic filter, 2px reduction]

Yep, it’s not quite right. This is not a great start, but I guess we’ll call it close enough for now. Moving on to a 50% shrink…

[Image: ResampleScope graph of FastScaling’s cubic filter, 50% reduction]

Here, the shape of the curve is even more wrong, with the ends jumping up to 0 prematurely. This will lead to a fuzzier image, as it’s the negative lobes that give this filter its sharpening effect. Contrast that with the curve from MagicScaler, which remains the same regardless of resize ratio.

[Image: ResampleScope graph of MagicScaler’s cubic filter, 50% reduction]

And finally, we’ll do a shrink to 25%. At this ratio, GDI+ does use a Cardinal (0, 1) Cubic, so if it gets this one right, it’s at least a match in some cases.

[Image: ResampleScope graph of FastScaling’s cubic filter, 25% reduction]

Nope. It’s even worse. Here it is compared with the GDI+ curve. Note again that the ‘arms’ in the correct implementation go smoothly (well, as smoothly as this graph can show) up to the origin line.

[Image: FastScaling’s 25% curve compared with the GDI+ cubic]

What’s Wrong with This Picture?

Ok, so the graphs don’t match expectations, but what does this mean for actual image quality? Fear not, I have visual comparisons to go with those mathematical ones. I have a 6000x4000 reference image I’ve been using for a lot of my testing. I like it for a few reasons:

  1. At 24MP, it’s among the biggest images you’ll find in the wild, but not unrealistically big. Most of the new mid-to-high-end DSLRs have standardized around 24MP in their last generation or two.
  2. The bigness of the image exaggerates the impact of the scaler performance. This is the same reason the FastScaling benchmarks used 16MP images, I assume. I’m just kicking it up a notch for now. Bam!
  3. It’s a human face. Humans are very attuned to spotting differences in faces, so even without a mathematical comparison, it’s usually easy to spot differences in quality or sharpness.
  4. In order to test these components thoroughly, we need to be able to experiment with images in different container formats and pixel formats. I happen to have 14 variants of this image saved every which-way.
  5. Get it? Which-way? It’s a witch. Witches are magic. My thing is called MagicScaler. I’ll see myself out…

I’ve applied the settings used above and resized the original JPEG version of the 6000x4000 reference image to a 400x267 PNG in my test/benchmarking harness. The reason I output PNG at this stage instead of JPEG is that I want to compare the pixels output from each scaler, and JPEG encoding is lossy, meaning it may amplify or cancel out small differences. We’ll switch to JPEG output once we’ve normalized the settings and verified the outputs are the same. Here are the results of my first trial:

[Image: first test run, with FastScaling’s default sample window]

First, let me explain my test harness. The four images shown are the results of a single warm-up run for each test. Following the warm-up, I run each test 10 times serially and display the average and standard deviation of those 10 runs. Next I run 4 tests in parallel and show their average, standard deviation, and total elapsed time. Finally, I do the same for 8 in parallel. I’m running these on my dev laptop, which is a quad-core i7 with hyperthreading enabled. 8 threads is the theoretical max I can run simultaneously, so that’s as far as I go with my parallel testing. It should be good enough to give you an idea how each scaler… erm… scales… performance-wise. The visual results are kind of a sanity check to make sure the numbers reflect the work I thought I was doing in the test.
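For reference, the timing portion of a harness like that boils down to something like this; the resize delegate is a stand-in for whichever component is under test:

    using System;
    using System.Diagnostics;
    using System.Linq;

    static (double Avg, double StdDev) TimeSerial(Action resize, int runs = 10)
    {
        resize();                                         // warm-up run, not counted
        var times = new double[runs];
        for (int i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();
            resize();
            times[i] = sw.Elapsed.TotalMilliseconds;
        }

        double avg = times.Average();
        double stdDev = Math.Sqrt(times.Average(t => (t - avg) * (t - avg)));
        return (avg, stdDev);
    }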

Which brings me to the visual results of this run. We could talk about the performance numbers now, but I think it’s clear from the visual results that the work done was not the same for all cases, so let’s hold off on that. The WIC result looks pretty bad to me, with both aliasing and blurring. I’m running Windows Server 2012 R2 on this machine, so that’s actually the best we can do with WIC quality-wise. I’ll leave it there as a reference for just what kind of performance you can get if you’re willing to sacrifice quality for speed. We’ll focus on matching up the quality on the remaining three images, with the GDI+ one as the reference.

The GDI+ and MagicScaler results look the same to me, so I’ll check them in Beyond Compare’s picture compare tool to see how close they are. If you missed my last use of this tool, I have it set with a tolerance of 1, meaning pixels shown in grey are identical between the images, pixels in blue differ by 1 (out of 256), and pixels that differ by more than 1 are red.

[Image: Beyond Compare diff, MagicScaler vs GDI+ output]

Those are darn-near identical, and certainly close enough that you can’t see a difference with your human eyes. The only red pixels are near the edge, where GDI+’s weird ‘mirror’ treatment differs from my more normal ‘extend’ treatment of edge pixels. Considering the difference in implementations, I’ll call these a match visually. Note that for this test, I had all optimizations that would affect image quality disabled in MagicScaler. I also disabled its auto-sharpening. Again, the idea is to establish a baseline where all the components are doing the same work. Only then can we compare them fairly.

Now let’s see how the FastScaling output from that run compares to GDI+. Remember, the docs said its Cubic mimics DrawImage() ‘precisely’.

[Image: Beyond Compare diff, FastScaling vs GDI+ output]

Nope. The graphs told us it would be blurry, and I could tell just by looking at it that it was significantly more blurry (even more so than the WIC output). There’s the proof. So it really doesn’t matter how fast this was. It was wrong. As in not what I expected and not what I asked for.

At this point I was about ready to declare the FastScaling implementation ‘busted’ and give up, but I had one more look at the docs and spotted something that held promise:

&down.window=0.5..3 Default 1..3, depends on filter. This describes the support window (input pixel set) size relative to the output pixel's corresponding input area. Values of 1 will involve only corresponding pixels. Values of 2 will involve the corresponding area, plus half again on each side. Values of three will triple the number of input pixels.

This actually explains the problems we saw in the graphs. It appears the sample window is being set incorrectly initially, and it’s possibly being changed relative to the resize ratio. That’s why the ‘arms’ were cut off on those graphs. You see, the correct sample window for this cubic function is 4. That’s the width of the correct curves in the graphs above. But 4 isn’t even listed in the valid range of values for this option… I figured I’d try the max value allowed and added &down.window=3 to my config for ImageResizer. The image was still blurry, as one might expect. Next I went out on a limb and tried plugging in the correct value of 4:
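For the record, that makes the full settings string for this run the same set of options as before, with the window override appended:

width={w}&height={h}&mode=stretch&format=png&fastscale=true&down.speed=-2&down.filter=cubic&down.window=4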

[Image: test run with down.window=4]

Et voilà! Now the WIC image is the only one that looks bad. I’ll compare the FastScaling output to GDI+ again in Beyond Compare just to be sure my eyes don’t deceive me.

[Image: Beyond Compare diff, corrected FastScaling output vs GDI+]

Much better. This one also shows a few red pixels around the edges, but again, the GDI+ implementation is weird, so I’ll call this a match.

Quality Isn’t Cheap

Note that in fixing the image quality, the average serial processing time for FastScaling went up from 301ms to 384ms, a 27% increase. Initially, it was cheating by sampling fewer pixels than it should, and we wouldn’t have caught it without comparing visually. It’s a bit troubling that I had to set the window size to an undocumented value to get the correct visual result, but hopefully that’s just a bug and not an intentional cheat.

Before we look at the rest of the numbers, we need to make one more small change. Remember, we were using PNG output, and we don’t want to do that for anything other than visual comparisons. The PNG encoder is much slower than the JPEG one when processing a large or complex image, so it hides some of the scaler performance differences. Plus it’s not representative of what we’d do in the real world. Now that we know the scalers are producing the same quality results, I’ll re-run the tests with JPEG output (quality 90).
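For the GDI+/System.Drawing side, saving at JPEG quality 90 goes through EncoderParameters; roughly like this (the output path is a placeholder):

    using System.Drawing;
    using System.Drawing.Imaging;
    using System.Linq;

    static void SaveJpeg(Bitmap bmp, string path, long quality = 90L)
    {
        // Find the built-in JPEG encoder and hand it an explicit quality setting.
        ImageCodecInfo jpeg = ImageCodecInfo.GetImageEncoders()
                                            .First(c => c.MimeType == "image/jpeg");
        using var args = new EncoderParameters(1);
        args.Param[0] = new EncoderParameter(Encoder.Quality, quality);
        bmp.Save(path, jpeg, args);
    }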

[Image: benchmark results with JPEG output]

There we go. Notice the numbers didn’t change a ton, but they all went down. It basically amounted to a 7-8ms difference in the serial runs. Where that really shows up, though, is in the WIC numbers. That 8ms difference represented a 22% speed-up in the WIC resizer. Seriously, that thing is fast.

I have just a few notes on the numbers at this point:

You may have noticed I’ve only been quoting the numbers from the serial portion of the tests. There are two reasons for that.

  1. We can only get a fair comparison with GDI+ on one thread since we know it’s non-reentrant. We’re still establishing a baseline at this point.
  2. My processor has Intel’s Turbo Boost and Hyper-Threading features enabled, so the performance on multiple threads will always be lower per thread than when only one is working at a time. The serial test represents the true best time for each scaler.

The parallel results are important for scalability, but in the interest of keeping the number of variables to a minimum, the serial tests are the most controlled and easily comparable.

Now that we’ve normalized the quality, FastScaling is very close to GDI+ in single-threaded performance. Its advantage is now down to less than 8%. That’s certainly nowhere near the 5-12x improvement we were teased with in the beginning. It does perform better running on multiple threads, but we’ll see later why you probably don’t want it doing that.

This test was mostly about comparing FastScaling and GDI+ on even terms, but check out those numbers on MagicScaler. Producing the same quality output, it’s significantly faster than both of the others. And we haven’t even opened up our bag of tricks yet. We’ll do that once we start enabling some of the FastScaling shortcuts, to keep it fair.

Wrapping up, let’s go back to what I originally said about benchmarks. I set out to show that FastScaling really isn’t that much faster than GDI+, and with a few tweaked variables, I think I’ve done that. It is still faster in this case, but in the next post we’ll see some cases where it’s not. And if you want to talk about big numbers, you can grab some out of this test as well. WIC came in over 53x faster than GDI+ in this last test @ 8 threads. That’s end-to-end processing, compared against an optimized GDI+ implementation. Of course I traded quality for that big number, but so did the benchmarks from the FastScaling page, and my number is bigger. Neener, neener.

In the next post I’ll take a stab at explaining why GDI+ came in last in this benchmark, and I’ll reveal FastScaling’s dirty little secret that doesn’t show up in the timing numbers. Stay tuned…

Oh, and one more thing. I keep typing ‘FastScaler’ instead of ‘FastScaling’, and now I notice I labeled it that way in my benchmark app. Mea culpa. Just in case there was any confusion, I was using the right component. I’ll fix the label going forward, but I don’t feel like redoing any of the benchmarks to correct the images I’ve already done.