Oct 5, 2011

Capture One 6.3: A Performance Review

Shortly before moving out of California, I noticed that the latest version of Capture One (C1) uses OpenCL. Capture One 6 was announced last December, so it had been well over half a year at that point, and I was really surprised that I hadn't heard any buzz about it, especially from the NVIDIA or AMD marketing machines. You would think either would leap at the chance to sell more GPUs. Although GPUs could make RAW editing a whole lot smoother, NVIDIA has never paid serious attention to the photography market. I'm assuming someone must have crunched the numbers and determined that it was too niche a market to address. I know that there was some investigation at some point, which determined that the speedup for processing RAW was not significant enough. This was mostly because part of the RAW decode was done on the CPU, and memory transfer times between CPU and GPU nullified much of the benefit of the GPU decode. I suspected, though, that the analysis didn't really reflect real-world situations, where users would want to upload the image to the GPU once and decode the RAW multiple times as they adjusted various settings. That would be a huge improvement for photo editing apps and make the UI much smoother. It looks like Phase One has gone and done that, at least to a degree.

So it's been about a month since I got back, and I've barely touched my pictures in that time. I finally decided to sit down and give Capture One 6 a shot, since I switched over to Lightroom a couple of years ago. The software is already up to version 6.3. I used the trial version so I could check out Pro; I was curious about some of the features, especially the advanced noise reduction and local adjustments. I'm running on Windows 7 SP1 64-bit, on a Core i7 930 with 12GB of RAM and a motley assortment of hard disks. I tried both a GF106 (GeForce GTS 450 or Quadro 2000) and a GF100 (GeForce GTX 470 or Quadro 6000). I'm playing with full-resolution RAW images shot on a Canon 5D Mark II. While I was pretty happy with C1 versions 2 and 3 for dealing with images from my Rebel and Rebel XTi on an old Athlon system, the huge images from the 5D Mark II have put some annoying lag into the workflow that never went away, even after upgrading to the i7.

OpenCL Benefits
C1 allows you to turn OpenCL support on and off in the Edit > Preferences dialog. There's a setting for OpenCL, either Auto or Never (the setting doesn't appear if your system doesn't have an OpenCL driver). Other than that setting, there's no indicator that OpenCL is being used. There's a Phase One knowledgebase article on when OpenCL is used; it's pretty accurate, but not very detailed.

After poking around with the OpenCL setting enabled and disabled, I figured out that OpenCL is only used for processing the pixels onscreen when viewing a RAW file. This is pretty limited, but it does provide some noticeable benefits that impact my workflow:

1. the view updates much quicker when switching between images, particularly when zoomed in
2. the view updates much quicker when making adjustments, particularly when zoomed in

*1 Given that these were all visual changes, it was difficult to take any benchmark measurements. Situation 1 was where I noticed the most measurable differences. When switching from one image to another on CPU only, the new image would load in very blurry, then get less blurry, then crisp. There was also about a 0.5 second delay before the image changed at all, but that happened with both CPU and GPU, so I'm going to ignore it in the comparison. My rough attempts at measuring the time with a stopwatch made it look like this: new image appears very blurry -> 0.5 seconds -> less blurry -> 1 second -> crisp, so with the initial delay it would take almost 2 seconds every time I changed from one image to the next. It gets painful if you're sorting through 150 images after a shoot. The 0.5 second delay after the keypress was there regardless of processor, so I'm going to ignore it and focus on the time until the image clears, and the total time until the final image appears.

GPU behavior looked the same at first, but upon closer inspection the pipeline was different. Instead of going from "very blurry" -> "less blurry" -> "crisp", it went through "blurry" -> "crisp" -> "denoised/sharpened" stages. It took barely any time at all on either GPU to get from blurry to crisp; I'll say 0.2 seconds. Additionally, the final denoise/sharpen stage seemed to vary a bit depending on the zoom level. Initially I thought the speed depended on the GPU, but then I noticed that at zoom levels > 100% it looked like it would do some sort of denoise, while at zoom levels < 100% it would sharpen. The sharpen (at low zoom) took about 1 second on both GPUs (I'm guessing it was actually performed on the CPU). The denoise took about 0.5 seconds on the GF106, and was nearly instantaneous on the GF100.

For Zoom < 100%

Processor | Time to Clear Image (s) | Total Time for Final Image, including initial delay (s)
CPU       | 1.8                     | 2.3
GF106     | 0.2                     | 2
GF100     | 0.2                     | 2

For Zoom > 100%

Processor | Time to Clear Image (s) | Total Time for Final Image, including initial delay (s)
CPU       | 1                       | 1.5
GF106     | 0.2                     | 1
GF100     | 0.2                     | 0.3

*2 When making adjustments with the sliders, it was noticeably smoother on the GPUs than on the CPU, though even the GPU wasn't perfectly smooth. It also depended on the setting. For example, adjusting Exposure was reasonably smooth on the CPU, and wasn't too much smoother on the GPU. The Moire slider though, which was completely un-smooth on the CPU, showed a marked difference. The biggest difference, however, was that when zoomed in, the CPU version of Exposure, Curves, and Colour would use the preview-sized image while adjusting. As soon as you moved the slider, the image would get downsampled and blurry to show the adjustments, and then come back into focus. The GPU version would remain in focus the whole time. The GF100 was a bit smoother than the GF106. This isn't something you'd notice initially, since it just seems like the way it ought to be; but if you go back to the CPU version, it's horrible.

Things that don't improve with OpenCL
For all the improvements, there are a lot of things that don't improve, which really limits the effectiveness of the OpenCL implementation. As I've already mentioned, it looks like there's a CPU sharpening pass that slows down the image rendering. Other noticeable items are:

- Speed depends on whether the image is cached. The numbers above are the best case. If you switch to an image that isn't cached, it could take a few seconds to read from disk, regardless of your CPU or GPU. This actually happens a lot, so you don't really get the optimized speeds listed above unless you're going back and forth between a set of images.
- JPEGs don't go through the GPU pipeline, so if you happen to have JPEG thumbnails, sorting through them will slow you down, even though scrolling through RAWs is now faster.
- Zooming in isn't sped up; you still see a pixelated image for half a second before the proper pixels are displayed.
- Panning isn't sped up either; when you move the image over, the newly revealed area stays pixelated for half a second before the proper pixels are displayed.
- Final rendering doesn't take advantage of the GPU. You might get a slight improvement if renders are going on in the background, since the CPU is a little less busy with the UI.

Shoehorned in
As I mentioned initially, it looks like Phase One took whatever was visible onscreen and sped up the processing of that bitmap using OpenCL. This makes sense given that they have an existing codebase: they don't have to start from scratch, and they can reuse existing functions like their CPU sharpening algorithm. The problem is that their original design is based on the fact that processing many pixels is processor intensive, so they optimize by only processing the pixels onscreen. Trying to shim the GPU into that existing design is rather limiting. They're most likely sending the visible bitmap to the GPU for OpenCL processing, which is fast. Then I suspect they're copying it back down to system memory so that it fits right back into their existing pipeline and the CPU can finish whatever work it needs to do. A few items are sped up, but they're far from perfect, and in the grand scheme of things the perceived improvement is just not that spectacular given the bottlenecks.
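Here's a rough sketch of what I suspect that roundtrip looks like in OpenCL host code. To be clear, this is my guess at the pattern, not Phase One's actual code; the kernel and the setup around it are hypothetical, and only the copy-up / process / copy-down structure is the point.

```cpp
// Sketch of the copy-up / process / copy-down pattern I suspect C1 uses.
// Assumes a context, command queue, and built kernel already exist; the
// kernel itself is a hypothetical per-pixel adjustment.
#include <CL/cl.h>
#include <cstddef>

void process_visible_bitmap(cl_context ctx, cl_command_queue queue,
                            cl_kernel kernel,            // hypothetical "adjust" kernel
                            unsigned char* pixels,       // the bitmap currently onscreen
                            size_t num_bytes, size_t num_pixels)
{
    cl_int err = CL_SUCCESS;

    // 1. Copy the visible bitmap from system memory up to the GPU.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, num_bytes, NULL, &err);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, num_bytes, pixels, 0, NULL, NULL);

    // 2. Run the adjustment kernel over the onscreen pixels.
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    size_t global = num_pixels;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    // 3. Copy the result back down to system memory so the existing CPU
    //    pipeline (e.g. the sharpening pass) can finish the job.
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, num_bytes, pixels, 0, NULL, NULL);

    clReleaseMemObject(buf);
}
```

The two synchronous copies at either end are exactly the memory-transfer overhead that supposedly killed the GPU RAW-decode idea in the first place; on a screen-sized bitmap they're tolerable, but they cap how much the GPU can help.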

The ideal design to take full advantage of the GPU would be to upload the entire image to video memory, manipulate it there using the GPU, and display it directly, without copying it back down to system memory. With GPU memory typically 512MB to 1GB+ these days, it shouldn't be a problem to fit multiple 20MP images in video memory. Making copies of the images is much faster on the GPU as well, since video memory tends to be 3-10x as fast as system memory. The result would be much smoother performance in adjustments, as well as smooth zooming and panning.
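As a sketch of what I mean (again, my own assumption of how it could work, not anything Phase One has described): with OpenCL/OpenGL sharing, the decoded image stays resident in a GPU buffer, the adjustment kernel writes its output into an OpenGL texture, and that same texture is what gets drawn, so nothing is ever read back to system RAM. All the names and the kernel are placeholders.

```cpp
// Sketch of the keep-it-on-the-GPU design. Assumes a GL-sharing context,
// a queue, and a kernel taking (source buffer, destination image) exist.
#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <cstddef>

void adjust_and_display(cl_context ctx, cl_command_queue queue,
                        cl_kernel kernel,         // hypothetical adjustment kernel
                        cl_mem resident_image,    // decoded image, already in video memory
                        unsigned int gl_texture,  // GL texture the viewer draws
                        size_t width, size_t height)
{
    cl_int err = CL_SUCCESS;

    // Wrap the existing OpenGL texture as an OpenCL image - no copy is made.
    // clCreateFromGLTexture2D was the entry point in the OpenCL 1.0/1.1 era
    // this article covers; later versions renamed it clCreateFromGLTexture.
    cl_mem dst = clCreateFromGLTexture2D(ctx, CL_MEM_WRITE_ONLY,
                                         0x0DE1 /* GL_TEXTURE_2D */, 0,
                                         gl_texture, &err);

    // Hand the texture to OpenCL, run the adjustment, hand it back to GL.
    clEnqueueAcquireGLObjects(queue, 1, &dst, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &resident_image);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &dst);
    size_t global[2] = { width, height };
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);
    clEnqueueReleaseGLObjects(queue, 1, &dst, 0, NULL, NULL);
    clFinish(queue);

    clReleaseMemObject(dst);
    // ...then draw gl_texture with OpenGL as usual; the pixels never left
    // video memory.
}
```

Zooming and panning would then just mean drawing a different region of a texture that's already resident, which is essentially free.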

The main drawback, and I suspect the reason Phase One didn't go down this route, is that it requires rewriting the entire application. All the complicated image processing algorithms would need to be rewritten, and on top of that they'd need to maintain both pipelines. It's hard to argue with that: while technically far superior, this would mean twice the work, and Phase One would need twice the sales to justify it. If C1 were fully GPU accelerated, I'd probably switch to it over Lightroom. One strong argument is that an i3 laptop with a midrange GPU could outperform a much more expensive i7 system. It's quite possible that C1 could take a sizeable share of the Lightroom market if their app were that much faster and smoother.

They could also potentially support a single GPU pipeline with OpenCL (or CUDA), and fall back to CPU implementations when the user does not have a GPU.
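The fallback check itself is trivial; something like the probe below is all it takes to decide at startup which pipeline to use (a minimal sketch, with render_on_gpu/render_on_cpu as hypothetical placeholders for the two pipelines).

```cpp
// Probe for a usable OpenCL GPU at startup; if none is found, the app
// would simply run its CPU pipeline instead.
#include <CL/cl.h>

bool have_opencl_gpu(cl_device_id* out_device)
{
    cl_platform_id platform;
    cl_uint num_platforms = 0;
    if (clGetPlatformIDs(1, &platform, &num_platforms) != CL_SUCCESS || num_platforms == 0)
        return false;  // no OpenCL driver installed at all

    cl_uint num_devices = 0;
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, out_device, &num_devices) != CL_SUCCESS)
        return false;  // driver present, but no GPU device

    return num_devices > 0;
}

// Usage sketch (render_on_gpu / render_on_cpu are hypothetical):
//   cl_device_id dev;
//   if (have_opencl_gpu(&dev)) render_on_gpu(dev); else render_on_cpu();
```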

A third, weaker argument for designing C1 around GPU processing is that the CPUs on forthcoming Windows 8 tablet PCs will be rather weak. A GPU solution would be far smoother, especially for letting the user drag the image around with their fingertips. A RAW editing app on a tablet would be great, but I suspect it would HAVE to be GPU oriented. The problem is that none of the tablet GPUs are particularly programmable yet. We'd probably have to wait another generation - I suspect late 2012 - and at this point I wouldn't know whether CUDA, OpenCL, or DirectCompute would be the way to go.

Other Bottlenecks
The other major bottleneck in C1 is the disk while rendering final images. The rendering is done purely on the CPU; tests across the GPUs showed the same time, roughly 3:20 for 30 RAWs from a 5D Mark II. On a quad-core i7, the CPU cycled between 0 and 100% load, averaging maybe 50%. I suspected the bottleneck was my disk, so I tried using two disks, one as the source and the other as the destination. That made no difference. What I did notice was that C1 was writing to 8 output JPEGs at once. Most likely, the thrashing caused by this was limiting performance; the disk output was pretty slow, maybe 1-5MB/s. Phase One may be relying on their customers to purchase SSDs or use RAID arrays, but if they queued up their disk writes, they could potentially halve their rendering time on quad-core CPUs.
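By "queued up," I mean something like the sketch below: let the worker threads render JPEGs into memory in parallel, but funnel the actual file writes through a single thread so only one file streams to disk at a time. This is my own illustration of the idea, not how C1 is structured; the types and names are made up.

```cpp
// A single-writer queue: render threads push finished JPEGs, one writer
// thread drains the queue and writes files to disk sequentially.
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <vector>

struct RenderedJpeg {
    std::string path;                       // output file name (placeholder)
    std::vector<unsigned char> bytes;       // encoded JPEG data
};

class SerialWriter {
public:
    void push(RenderedJpeg job) {           // called by render threads
        std::lock_guard<std::mutex> lock(mutex_);
        jobs_.push(std::move(job));
        cv_.notify_one();
    }
    void finish() {                         // called after the last push
        { std::lock_guard<std::mutex> lock(mutex_); done_ = true; }
        cv_.notify_one();
    }
    void run() {                            // body of the single writer thread
        for (;;) {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
            if (jobs_.empty()) return;      // done and fully drained
            RenderedJpeg job = std::move(jobs_.front());
            jobs_.pop();
            lock.unlock();
            // Sequential write: one file at a time, no head thrashing.
            std::ofstream out(job.path.c_str(), std::ios::binary);
            out.write(reinterpret_cast<const char*>(job.bytes.data()),
                      static_cast<std::streamsize>(job.bytes.size()));
        }
    }
private:
    std::queue<RenderedJpeg> jobs_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool done_ = false;
};
```

With writes serialized, the disk gets long sequential bursts instead of eight interleaved streams, which is exactly the access pattern a spinning disk handles best.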

As a side note, I probably bought the wrong components for my PC. With a quad-core i7 and 12GB of RAM, I rarely saturate the CPU (only when compiling with MSVC) or the RAM (only when running multiple VMs). Very few apps seem to be optimized for a fast CPU and lots of RAM. I had thought that the extra RAM would mean fewer disk accesses, but there are still many cases (like both C1 and Lightroom) where I'm disk bound with plenty of free memory.

Other notes on C1
Capture One LE was my first RAW workflow application, and I'm probably biased towards it because of that. I never fully got used to Lightroom's model with different modes, and Lightroom's export to JPEG always felt a little weird, since the UI features rendering for print or web, and I never do either. When C1 was redesigned with the new .NET-based UI, there were a number of things that turned me off, and I eventually switched over to Lightroom. I always organized my RAW files in folders by date, and I hated the way C1 would put its working folder into every one of my folders. I loved the way the older C1 let me put all my final output files in subfolders of the image folders - now that seems to be available in the Pro version only, which I find seriously annoying.

In addition to those annoyances, the slowness of C1, the poor noise reduction, and the lack of a spot healing tool led me over to Lightroom. The OpenCL support has definitely improved the performance. The noise reduction seems much improved as well, though I haven't played with it enough to really judge it against Lightroom 3. There's a new spot removal tool that works pretty well. They've also added keystone correction, which is only in the Pro version.

Right now I'm quite happy to use C1 Pro over LR3. It's $399 though, which is pretty steep unless it's discounted. There's a handful of features I'd use in Pro over Express: the output folder management, keystone correction, and maybe RGB curves. If they could come up with an intermediate third version that matched the LR3 feature set, that'd be perfect.