- By Joel Hruska on July 21, 2016 at 8:45 am
Last week, Futuremark released Time Spy, a new DirectX 12 benchmark that takes full advantage of DX12’s features and capabilities, including asynchronous compute. While the release of a new benchmark is typically of modest interest, there’s been a great deal of confusion, uncertainty, and doubt over Time Spy’s benchmark results and what those results mean. Futuremark has since published an updated and expanded guide to how the benchmark functions and what it’s designed to do.
Much of the confusion on this topic is related to what Time Spy tests and how it implements support for asynchronous compute in DirectX 12. A graph from PC Perspective’s test results from last week will illustrate the question:
These results show performance in Time Spy at the benchmark’s default settings with asynchronous compute enabled versus disabled. AMD’s GPUs gain a significant amount of performance, with the RX 480 increasing its score by 8.5% while the R9 Nano and Fury X pick up 11.1% and 12.9% respectively. Nvidia’s Maxwell cards, in contrast, are flat.
Pascal, however, does gain some performance, with the GTX 1070 gaining 5.4% and the GTX 1080 picking up 6.8%. This stands contrary to what we’ve seen in most DX12 benchmarks to date, in which enabling async compute on Nvidia cards either led to a small performance decrease or had no impact on performance at all.
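These percentages come from a straightforward comparison of the benchmark’s score with asynchronous compute enabled versus disabled. As a quick illustration, here’s how the math works (the raw scores below are hypothetical values chosen to reproduce two of the percentages above, not PC Perspective’s actual numbers):

```python
def async_gain(score_off: float, score_on: float) -> float:
    """Percentage performance gain from enabling asynchronous compute."""
    return (score_on - score_off) / score_off * 100

# Hypothetical scores, picked only to match the article's percentages.
print(round(async_gain(4000, 4340), 1))  # RX 480-style result: 8.5
print(round(async_gain(5000, 5340), 1))  # GTX 1080-style result: 6.8
```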
Revisiting asynchronous compute
The debate over whether Time Spy is a valid benchmark, and the questions regarding its implementation of asynchronous compute, speak to a significant amount of confusion in the user community about what asynchronous compute is, how it works, and how it can or should be used in DirectX 12.
While asynchronous compute support is a component of DirectX 12, the details of how to implement that support were left to AMD, Intel, and Nvidia. AMD and Nvidia implemented this capability very differently, and with very different results. We now know that Nvidia has never implemented async compute in-driver for Maxwell v2 GPUs (GM200, GM204, and GM206), which means our attempts to characterize and examine the performance impact of running asynchronous compute workloads on Maxwell in Ashes of the Singularity didn’t measure what we thought they were measuring. The current consensus is that Nvidia is unlikely to ever enable asynchronous compute on Maxwell due to the difficulty of implementing the feature in a way that would improve performance.
While it’s true that Nvidia could have been far clearer about Maxwell’s ability to perform asynchronous compute workloads and the benefits (or lack thereof) of doing so, some in the user community have locked on to asynchronous compute as if it were the sole defining feature of the DX12 API. This is not the case.
The reason asynchronous compute has become such a prominent feature of DirectX 12 is because adopting it tends to significantly improve performance on AMD hardware. To date, AMD’s GCN has picked up substantially more performance from the shift to DirectX 12 than Nvidia has, though some of this gain reflects the relative state of driver optimization between the two companies. Nvidia has historically had more cash to spend on driver optimization and developer relations, even if some of its programs, like GameWorks, have been controversial. Asynchronous compute also improves performance on AMD hardware because it exposes functionality that previously went untapped in DirectX 11.
So where does Pascal fit into all this?
Pascal adds support for fine-grained preemption and dynamic load balancing — two critical features that Maxwell lacked. One of the limits of Maxwell’s asynchronous compute implementation was that the GPU had to schedule its compute and graphics workloads prior to execution and couldn’t shift its strategy mid-stream. This made it comparatively likely that enabling asynchronous compute on a Maxwell v2 chip would result in poor performance due to improper resource allocation.
Pascal’s dynamic load balancing allows the GPU to quickly shift the resources it dedicates to compute and graphics depending on what’s happening in-game. This feature doesn’t automatically guarantee that Pascal will benefit from asynchronous compute, but it fixes a major issue with Nvidia’s last generation. Pascal’s other new capability, fine-grained preemption, allows the GPU to quickly switch between workloads at the pixel level, rather than at Maxwell v2’s coarser draw-call boundaries. AnandTech has published an in-depth look at both topics if you’d like additional information.
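The intuition behind why a fixed partition hurts can be shown with a deliberately simplified toy model (my own illustration, not a description of Nvidia’s actual scheduler): give a GPU with a set number of execution units some graphics work and some compute work. If the split between the two is fixed before execution, whichever partition finishes first sits idle; a work-conserving dynamic scheduler keeps every unit busy until all work is done.

```python
def static_partition_time(g_work, c_work, units, g_units):
    """Completion time when the graphics/compute split is fixed up front
    (a rough stand-in for Maxwell v2's pre-execution scheduling)."""
    c_units = units - g_units
    return max(g_work / g_units, c_work / c_units)

def dynamic_time(g_work, c_work, units):
    """Completion time when idle units can be rebalanced mid-stream
    (a rough stand-in for Pascal's dynamic load balancing)."""
    return (g_work + c_work) / units

# 60 units, 100 units of graphics work, 20 of compute, split 40/20:
static = static_partition_time(100, 20, 60, 40)  # max(2.5, 1.0) = 2.5
dynamic = dynamic_time(100, 20, 60)              # 120 / 60    = 2.0
print(static, dynamic)
```

In this toy example, the misjudged static split takes 25% longer than the dynamic schedule, which is exactly the failure mode described above: a partition chosen before execution that turns out not to match the workload.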
Enthusiasts will undoubtedly be quick to point out that Pascal, despite these changes, still can’t execute an asynchronous compute workload the way AMD can — and they’re right. What gets lost is the fact that Pascal’s architecture wouldn’t benefit from executing workloads in the same fashion as AMD, because it isn’t designed to do so. The flip side to this is that workloads optimized for Pascal probably wouldn’t run all that well on AMD hardware, either. Futuremark built a benchmark that’s designed to run well on both cards without favoring any single vendor.
This approach should prove similar to what we’ll see in future titles, given that different applications can and will use asynchronous compute in distinct ways and to varying degrees. GPUs within the same family also respond differently to asynchronous compute; the RX 480 picks up 8.5% in Time Spy while the Fury X gains 12.9%. Does that mean Time Spy is biased against the RX 480 just because the Fury X gets a much larger boost from the feature? Of course not.
Futuremark’s Time Spy
The Time Spy-related questions can be broadly summarized as follows:
- Why does Nvidia’s Pascal architecture gain performance in Time Spy when it shows no performance gain from asynchronous compute in other benchmarks?
- Why doesn’t Futuremark implement optimized, vendor-specific code paths for AMD and Nvidia? Isn’t this a functional requirement of DX12?
According to Futuremark, Time Spy uses a new engine specifically architected for DirectX 12. The benchmark was designed over a period of two years of active collaboration with Intel, AMD, and Nvidia, all of whom have had source code access and have contributed best practices and technical understanding. Furthermore, all of Futuremark’s partners have signed off on releasing the benchmark in its current form.
I’d like to note that this public explanation lines up with what we’ve heard privately. Neither AMD nor Nvidia’s PR teams are known for their reticence when it comes to attacking benchmarks they perceive as flawed or unfair, and neither company has anything negative to say about Time Spy.
Futuremark goes on to say it has considered implementing vendor-specific code paths, but that its partners are invariably against the practice. It writes:
In many cases, an aggressive optimization path would also require altering the work being done, which means the test would no longer provide a common reference point. And with separate paths for each architecture, not only would the outputs not be comparable, but the paths would be obsolete with every new architecture launch.
3DMark benchmarks use a path that is heavily optimized for all hardware. This path is developed by working with all vendors to ensure that our engine runs as efficiently as possible on all available hardware. Without vendor support and participation this would not be possible, but we are lucky in having active and dedicated development partners.
Ultimately, 3DMark aims to predict the performance of games in general. To accomplish this, it needs to be able to predict games that are heavily optimized for one vendor, both vendors, and games that are fairly agnostic. 3DMark is not intended to be a measure of the absolute theoretical maximum performance of hardware.
This statement caused some controversy in the user community because a joint AMD-Nvidia presentation at GDC 2016 prominently claimed that there was no point to implementing DirectX 12 unless you planned to also implement IHV-specific code paths.
So, is this proof of skullduggery, bias, or deceit? No. In fact, from my perspective as a reviewer, it’s quite the opposite.
Vendor-optimized paths are risky
Back in 2008, when I worked for Ars Technica, I wrote a review of the VIA Nano. During the course of testing that CPU, I decided to use a VIA-provided utility to change the CPUID string that identifies the microprocessor. Most of the test scores didn’t change, but PCMark05’s memory subsystem score changed drastically.
Changing the CPUID improved Nano’s performance by 47% because a vendor-specific code path had been implemented and certain optimizations had been tied to it. Futuremark always insisted that this was an accident rather than a deliberate attempt to skew benchmark results in favor of Intel. When Futuremark announced PCMark 8, I asked the company what had happened after the PCMark05 controversy. Futuremark informed me it had overhauled its developer programs and optimization strategies to avoid vendor-specific, hand-optimized code paths because of the fallout surrounding the PCMark05 issue.
It would be hypocritical in the extreme to attack Futuremark for using Intel-specific optimizations in one test, only to turn around and attack it for not implementing AMD or NV-specific optimizations in a different test. If I have to choose between a general-case, all-around fair test that doesn’t include vendor-specific optimizations for any architecture, and a benchmark that’s been optimized to an unknown degree by multiple vendors, I’ll take the former every time — even if it means missing out on seeing the absolute best-case scenario for any given GPU.
A program like Time Spy, Fire Strike, or 3DMark 11 is designed to serve as a general, representative vehicle for measuring performance in a given series of tests. Futuremark’s customer base isn’t limited to individual gamers. It also sells site licenses to other companies that want to measure their hardware’s general performance in a standardized benchmark. 3DMark versions also tend to have longer shelf lives than game benchmarks. Most reviewers refresh their game tests on a 1-2 year cycle, while 3DMark versions typically last three or more. Writing and updating a benchmark that performs decently well on multiple architectures without being specifically optimized for any single target may prevent any one company from showcasing a specific feature. But it also provides a framework that multiple companies can rely on for qualifying their own designs.
Futuremark’s formal statement and updated technical guide contain a great deal of additional information on how asynchronous compute is executed on Maxwell, Pascal, and AMD GPUs. Again, there’s simply no evidence that this test is unfairly or unusually biased towards any vendor.
The only thing this benchmark shows is that Pascal can see a modest improvement with async compute enabled. Given the still-early state of DirectX 12, the limited number of games that utilize it, and the fact that only two engines we are aware of have been written for low-overhead APIs from the ground up (Oxide’s Nitrous engine and Time Spy itself), concluding that this benchmark is biased simply because it shows a small gain for Pascal is extremely premature.
Even 12 months after launch, DirectX 12 support in shipping titles is still limited, and our ability to characterize what DirectX 12 performance will look like across the entire industry is similarly constrained. With Pascal just launched and AMD’s Vega arriving later this year, there’s going to be ample opportunity to watch how the API evolves.