AMD’s custom RDNA 3.5 GPU is focused solely on improving mobile gaming performance


    AMD Strix Point APU chip held in a hand, with reflected light showing the various processing blocks within the chip.

Credit: AMD

At an event in Los Angeles last week, AMD went into more detail about all the changes it’s introducing with the Zen 5 CPU architecture. The chip giant also briefly explained what’s new in RDNA 3.5: a “fractional improvement” that’s “bolted on” to the current graphics processor design. In short, it’s all about optimizing rendering performance in mobile applications.

The updated design was introduced by AMD Chief Technology Officer Mark Papermaster, who began by explaining that the changes are a result of a partnership with Samsung, which licenses AMD’s graphics technology for its Exynos line of smartphone and tablet processors.

“A lot of the technologies are ideal for notebooks,” he said. “They’re ideal for giving you the same great Radeon graphics experience, but at much lower power and much higher efficiency.”

There aren’t any sweeping changes, but that’s to be expected given the architecture’s name. RDNA 3.5 exists to address some of the performance bottlenecks that AMD’s GPUs have encountered when used in low-power, low-shader-count configurations, namely the integrated Radeon GPUs in its mobile APUs, which are used in laptops and most handheld gaming PCs.

In the case of the latter, these typically run on a power budget of 15W or so and while they can draw more, it’s still significantly less power than the lowest-end discrete GPUs can draw. For example, a Radeon RX 6400 can draw up to 54W, which is 80% more power than the GPU inside the Asus ROG Ally can demand.

Combined with a small number of Compute Units (CUs), this means that certain rendering operations that would normally be no problem for a desktop GPU become a limiting factor in overall performance. The first one that Papermaster identified was the texture sampling rate.

In RDNA 3, each CU houses four texture units, each of which can sample and return one bilinearly filtered texel per clock cycle. Papermaster says that AMD has doubled that number to eight in RDNA 3.5, though you might be wondering why. Low-power integrated GPUs aren’t as fast as discrete desktop chips, and combined with the fact that they use system memory for VRAM, texturing is a fairly slow process for mobile GPUs.
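For anyone unsure what a “bilinearly filtered texel” involves, here is a minimal Python sketch of the operation. It is a plain software illustration, not a description of how AMD’s texture units are built; the function name and the convention that integer coordinates land on texel centers are assumptions made for brevity.

```python
import math

def bilinear_sample(texture, u, v):
    """Return a bilinearly filtered value from a 2D grid of floats.

    texture is a list of rows; u and v are texel-space coordinates where
    integer values land exactly on texel centers (an assumption for brevity).
    """
    height, width = len(texture), len(texture[0])
    # Clamp to the edge so the 2x2 footprint never leaves the texture.
    u = min(max(u, 0.0), width - 1.0)
    v = min(max(v, 0.0), height - 1.0)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = u - x0, v - y0
    # Four texel reads are blended into one filtered result.
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

texture = [[0.0, 1.0],
           [2.0, 3.0]]
center = bilinear_sample(texture, 0.5, 0.5)
print(center)  # 1.5, the average of all four texels
```

The point to take away is that every filtered result needs four texel reads, which is why sampling leans so heavily on the memory system.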

By doubling the number of samplers, the chip can fetch twice as many texels per clock cycle, which compensates for the lower core clocks. The lack of VRAM bandwidth is not necessarily a problem, since texture sampling introduces huge latencies anyway.
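To put rough numbers on that, peak sampling throughput is simply CUs × samplers per CU × clock speed. The sketch below assumes a 16 CU GPU running at 2.0 GHz; both figures are illustrative placeholders rather than confirmed specifications, and real-world rates will be lower once memory latency gets involved.

```python
def peak_texel_rate(compute_units, samplers_per_cu, clock_ghz):
    """Theoretical peak of bilinear texels per second, in gigatexels/s."""
    return compute_units * samplers_per_cu * clock_ghz

# Illustrative values only: neither is a confirmed spec for any chip.
CUS = 16
CLOCK_GHZ = 2.0

rdna3 = peak_texel_rate(CUS, samplers_per_cu=4, clock_ghz=CLOCK_GHZ)
rdna35 = peak_texel_rate(CUS, samplers_per_cu=8, clock_ghz=CLOCK_GHZ)

print(rdna3)           # 128.0
print(rdna35)          # 256.0
print(rdna35 / rdna3)  # 2.0
```

Put another way, a GPU with four samplers per CU would need twice the clock speed to match that peak rate, which is not an option on a tight power budget.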

[Image gallery, 4 slides: AMD presentation slides introducing the changes added to the RDNA 3.5 GPU architecture]

However, AMD’s presentation slides state that this doubling only applies to “a subset of the most common texture sampling operations”, so it may not be as clear-cut as simply doubling the number of texture units. I wonder if this is actually more a case of improvements in the way some of the vector memory image instructions are processed. If I eventually get my hands on an RDNA 3.5 GPU, I’ll hopefully be able to dig deeper into what’s actually being doubled.

Also seeing a 2x performance boost are vector-intensive operations that involve interpolating or comparing values. These require multiple data reads from the vector register files, and since their performance is affected by clock speeds, it makes sense to improve things here. I’m not entirely convinced that such routines are a significant bottleneck for integrated GPUs, but AMD apparently is.

Or it could have something to do with all the other changes implemented in RDNA 3.5 that focus on improving memory management. Since iGPUs don’t have the power budget to enjoy blazing-fast clock speeds, nor do they have the space for large amounts of cache, every cycle saved on any form of memory operation is a good thing in the mobile world.

In short, RDNA 3.5 focuses squarely on both memory and shader execution to significantly improve graphics efficiency while still delivering the same Radeon experience our customers expect.

The memory-management changes include a new instruction that detects if a one-time write has been performed and can skip it, allowing the GPU to move on to the next instruction. Writing data, especially to RAM, can be very slow, whereas a lot of vector multiplications can be performed in a few cycles.

The way primitives (groups of vertices that form a shape) are processed in batches has been refined to take greater advantage of spatial locality. The data for a primitive is naturally grouped together in cache or RAM, so if you generate a memory address to fetch some of that data, there is a good chance that the next address you need will belong to the same primitive. Managing all of this better means that fewer system memory accesses and address operations are required.
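A toy model shows why access order matters so much. The Python sketch below counts cache-line fetches when primitives are visited in memory order versus a shuffled order. It is a generic illustration of spatial locality, not AMD’s batching logic; the 64-byte line, 16-byte vertex, and four-line cache are all assumptions picked to keep the example small.

```python
import random

CACHE_LINE_BYTES = 64   # assumed line size
VERTEX_BYTES = 16       # assumed size of one vertex record
VERTS_PER_PRIM = 3      # triangles

def line_fetches(prim_order, cache_lines=4):
    """Count cache-line fetches for a visiting order, using a tiny LRU cache."""
    cache = []  # least recently used line first, most recent last
    fetches = 0
    for prim in prim_order:
        for vert in range(VERTS_PER_PRIM):
            address = (prim * VERTS_PER_PRIM + vert) * VERTEX_BYTES
            line = address // CACHE_LINE_BYTES
            if line in cache:
                cache.remove(line)   # hit: refresh its position below
            else:
                fetches += 1         # miss: fetch from memory
                if len(cache) == cache_lines:
                    cache.pop(0)     # evict the least recently used line
            cache.append(line)
    return fetches

prims = list(range(256))
in_order = line_fetches(prims)

shuffled = prims[:]
random.Random(0).shuffle(shuffled)   # fixed seed for repeatability
scattered = line_fetches(shuffled)

print(in_order)              # 192: every line is fetched exactly once
print(scattered > in_order)  # True: scattered access refetches lines
```

In the ordered case, each of the 192 cache lines holding the vertex data is fetched exactly once. Shuffle the order and the same lines get evicted and fetched again, and every one of those extra fetches costs time and power.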

RDNA 3.5 also has better memory compression algorithms, and the iGPU’s memory controller is well optimized for LPDDR5, the RAM of choice for handheld gaming PCs and increasingly laptops. Accessing system memory for graphics routines is not only slow, it is also very power inefficient compared to cache.

To summarize, it’s all about doing more for the same or less power, and to demonstrate that it has achieved that goal, AMD ran a performance comparison between an RDNA 3.5-powered Strix Point APU and an RDNA 3-powered Hawk Point one. Specifically, it was a Ryzen AI 9 HX 370 versus a Ryzen 7 8840U, both capped at 15W.


The new GPU is about 32% faster than the previous generation in the old 3DMark Time Spy benchmark, and 19% faster in the lightweight Night Raid test. Those numbers seem impressive at first glance, but the HX 370 has 33% more CUs than the 8840U. While we don’t know what clock speeds the GPUs inside those chips were running at, the fact that one has significantly more shader units than the other isn’t something we can ignore.
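A quick back-of-the-envelope check makes the point. The Python sketch below divides each claimed uplift by the CU advantage to estimate the change per CU. It assumes performance scales linearly with CU count, which is rarely true in practice (least of all at a fixed 15W), so treat the results as a rough sanity check and nothing more.

```python
def per_cu_change(score_uplift, cu_ratio):
    """Relative change in score per CU once the CU advantage is factored out."""
    return (1 + score_uplift) / cu_ratio - 1

CU_RATIO = 1.33  # the HX 370 has 33% more CUs than the 8840U

time_spy = per_cu_change(0.32, CU_RATIO)    # 32% faster overall
night_raid = per_cu_change(0.19, CU_RATIO)  # 19% faster overall

print(f"Time Spy per CU:   {time_spy:+.1%}")    # -0.8%
print(f"Night Raid per CU: {night_raid:+.1%}")  # -10.5%
```

Under that crude assumption, the Time Spy gain is almost entirely accounted for by the extra CUs, and Night Raid actually comes out behind on a per-CU basis.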

But the Strix Point chip’s scores are still worth taking as a general overview of its capabilities. I ran Time Spy and Night Raid on my ROG Ally, set to 15W, and got results of 2,915 and 19,994 respectively – 16% and 52% slower than the Ryzen AI 9 HX 370. How much of that is down to CPU cores, shader counts, and clock speeds is anyone’s guess at this point, but it certainly bodes well for gaming.

At this point, however, I’m not convinced that the RDNA 3.5 updates will have much of an impact on gaming compared to the increase in CPU cores and shader count.
