Recently I attended a presentation at the Machine Learning Developers Conference held at the Santa Clara Convention Center. The presentation, “Overcoming the Memory System Challenge in Dataflow Processing”, was given jointly by Darren Jones of Wave Computing and Drew Wingard of Sonics. The presentation was indeed fascinating as Jones first described how dataflow processors are ideal for deep learning, especially as compared to using existing CPU and GPU architectures. Jones then showed some performance numbers for machine learning training using the Wave Computing solution. And that is when I had to scratch my head. I did not hear any gasps. Everyone just accepted this phenomenal piece of engineering – just took it in stride.
The Wave Computer was providing 2.9 Peta-Ops per second, had more than 2TB of bulk and high-speed memory, plus up to 32TB of SSD storage. To keep the 256,000 processing elements (PEs) humming, it supported a dataflow bandwidth of over 4.5TB per second. First, “Peta” is a prefix for a quadrillion, 10^15, a thousand trillion, a million billion – easy words to say, but I am not sure our minds can truly comprehend how huge that number is – 1,000,000,000,000,000. When you discuss peta-flops or peta-ops, you are using the terminology of supercomputers.
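To put that rate in more familiar terms, here is a quick back-of-the-envelope sketch: how many conventional cores would it take to match 2.9 peta-ops per second? The 3 GHz, one-op-per-cycle core is my own simplifying assumption for illustration, not a figure from the presentation.

```python
# Scale check: conventional cores needed to match 2.9 peta-ops/s.
PETA = 10**15
wave_ops_per_sec = 2.9 * PETA
core_ops_per_sec = 3.0e9  # assumed: a 3 GHz core retiring 1 op per cycle

equivalent_cores = wave_ops_per_sec / core_ops_per_sec
print(f"{equivalent_cores:,.0f} such cores")  # roughly a million
```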
Wave’s Dataflow Computer for ML Training
With all this processing power, much is demanded from the on-chip network.
Try to visualize this: If every ‘op’ performed by the Wave computer added one dime to a stack of dimes, over a one year period the stack would be on the order of ten thousand light-years high – well over a thousand round trips to Alpha Centauri. Of course it also would mean the stack was growing thousands of times faster than the speed of light. So the visualization is more than a little difficult to grasp.
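For the skeptical, here is the dime-stack arithmetic as a sketch. The 1.35 mm dime thickness is my assumption (a standard US dime); the op rate comes from the figures above.

```python
# Dime-stack arithmetic for 2.9 peta-ops per second.
OPS_PER_SEC = 2.9e15          # 2.9 peta-ops per second (from the talk)
DIME_THICKNESS_M = 1.35e-3    # assumed: standard US dime, in meters
SECONDS_PER_YEAR = 3.156e7
SPEED_OF_LIGHT_M_S = 2.998e8
LIGHT_YEAR_M = 9.461e15

stack_growth_m_per_s = OPS_PER_SEC * DIME_THICKNESS_M
stack_height_ly = stack_growth_m_per_s * SECONDS_PER_YEAR / LIGHT_YEAR_M
times_c = stack_growth_m_per_s / SPEED_OF_LIGHT_M_S

print(f"{stack_height_ly:,.0f} light-years of dimes per year")  # ~13,000
print(f"growing at {times_c:,.0f}x the speed of light")         # ~13,000
```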
Now let’s look at the role of the network to feed this computing monster that gobbles data at an incredible rate. Each Wave computer uses 16 Wave Dataflow Processing Unit (DPU) SoCs, with each DPU containing 16K PEs. Wave Computing utilized a SonicsGN (SGN) network-on-chip (NoC) implementation. In this case, the AXI4 protocol was used to manage traffic between 32 128-bit AXI initiator and target dataflow processor channels and four high-speed memory ports (640-bit, 60 GB/s each), two DDR4 ports (256-bit, 15 GB/s each), and a 16-lane PCI Express port (256-bit, 30 GB/s). That’s about 300 GB/s of external data bandwidth for each DPU! How is this accomplished?
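The 300 GB/s figure follows directly from tallying the quoted ports; the grouping below is mine, the per-port numbers are from the presentation.

```python
# External bandwidth of one DPU, summed from the per-port figures.
ports_gbps = {
    "high-speed memory": 4 * 60,  # four 640-bit ports, 60 GB/s each
    "DDR4":              2 * 15,  # two 256-bit ports, 15 GB/s each
    "PCIe x16":          1 * 30,  # one 256-bit port, 30 GB/s
}
total_gbps = sum(ports_gbps.values())
print(total_gbps, "GB/s per DPU")  # 300 GB/s per DPU
```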
In order to get the required throughput, load balancing must be utilized to distribute the traffic evenly among the channels. Otherwise, some memory channels will be under-used, while others are overloaded. Normally, utilizing multiple channels can lead to throughput or reordering problems for pipelined memories. The DRAM controllers rely on reordering to achieve the needed throughput. With DRAM, you need to optimize the usage of the memory banks to hide the overhead of page misses in order to increase memory utilization. The best solution is to interleave the memory traffic across the different channels, so the natural locality in the application data streams accomplishes the load balancing without explicit intervention.

This is where SGN’s Interleaved Multichannel Technology (IMT) differentiates itself even more dramatically from its competition. SGN automagically handles the memory access reordering to preserve the protocol’s intent, scales up to 8-way interleaving, supports up to 16 virtual channels per link, and achieves up to 2 GHz speeds in 14nm technologies. Plus, interleaving in the NoC ensures that the traffic for different memories never needs to converge into a common choke point, which is a common throughput and routing congestion issue for designs that interleave inside the memory controllers. All this performance comes with optional security, power, and error management capabilities AND the ability to specify the design requirements in a complete NoC design environment supporting design capture, verification, and performance analysis.
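To see why interleaving load-balances "for free", here is a minimal sketch of the general technique – successive aligned blocks of the address space mapped round-robin onto channels. This is not Sonics’ actual IMT implementation, and the 4 KB granule is my assumption; the 8-way channel count matches the SGN figure quoted above.

```python
# Minimal multichannel address-interleaving sketch (not Sonics' IMT).
INTERLEAVE_GRANULE = 4096   # assumed: bytes per interleave granule
NUM_CHANNELS = 8            # SGN scales up to 8-way interleaving

def channel_for(addr: int) -> int:
    """Map a physical address to the memory channel that serves it."""
    return (addr // INTERLEAVE_GRANULE) % NUM_CHANNELS

# A linear stream of sixteen granule-sized blocks spreads evenly:
# each of the 8 channels receives exactly two blocks.
hits = [0] * NUM_CHANNELS
for block in range(16):
    hits[channel_for(block * INTERLEAVE_GRANULE)] += 1
print(hits)  # [2, 2, 2, 2, 2, 2, 2, 2]
```

Because the mapping is a pure function of the address, sequential streams – the common case in ML tensor traffic – land on all channels in rotation, which is the "natural locality" doing the load balancing.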
Simple 2-Channel Example (5 AXI Masters)
IMT with reorder support delivers superior load balancing.
After all of this, the network still had one additional challenge in this design. The 16K PEs in the DPU are densely packed in a large rectangular area occupying – and blocking – most of the center of the die. So the NoC is actually implemented in a narrow ring on the outside of this area. This can mean huge clock insertion delay differences at different parts of the ring and some very long routing distances. But Wave avoided timing closure issues on its very high-speed design by using SGN’s mesochronous link feature to tolerate clock skew, and by adopting Sonics’ user-directed, floorplan-based, domain-sensitive automatic repeater insertion to rapidly close timing.
So if we visualized the Wave Computer as spitting out dimes faster than the speed of light, how do we visualize what SGN is achieving in this design? Why, it is feeding the processors the necessary proportions of copper and nickel to keep them running at peak performance while catching all the dimes and reordering them to be stacked in serial number order. That is, if dimes had serial numbers, or if there were anywhere close to that number of dimes in the world. Yep, it is hard to visualize just how ridiculously powerful this design and its component parts are. I just wish people could somehow grasp how special the magic of this technology really is.