Achronix Accelerates eFPGA

7nm Speedcore Gen4 Brings Big Improvements

Perhaps when the most important problem is a nail, every solution starts to look like a hammer. With the ramping explosion in AI and machine learning, countless companies are trying to climb on the bandwagon, morphing and melding their existing technologies in an attempt to come up with a differentiated solution that will capture a meaningful share of this mind-boggling emerging opportunity. Everybody from EDA vendors to cloud data centers to GPU companies, FPGA companies, IP companies, and boutique semiconductor startups is spinning stories about how their technology is the key to unlocking the potential of AI.

Most of them will fail.

Achronix, however, looks like a pretty strong contender. This week, the company unveiled their fourth-generation Speedcore eFPGA technology, targeting 7nm CMOS. While this new IP continues the mission of allowing FPGA fabric to be part of any SoC/ASIC design, the latest version has numerous features aimed specifically at accelerating machine learning inferencing. For many with the expertise and resources to do custom chip design, Achronix has a compelling alternative to multi-chip solutions with discrete FPGAs or other custom AI accelerators.

The Speedcore Gen4 embedded FPGA (eFPGA) IP for integration into customers’ SoCs increases performance by 60% over previous generations, reduces power by 50%, and decreases die area by 65%. These are significantly better PPA jumps than one would get simply by moving to the next process node. There are major architectural improvements at the heart of Achronix’s gains. Achronix says they are focusing on bringing “programmable hardware-acceleration capabilities to a broad range of compute, networking, and storage systems for interface protocol bridging/switching, algorithmic acceleration, and packet processing applications.”

Speedcore essentially lets you build an FPGA to your exact specifications. Since modern FPGAs contain LUT fabric, multipliers/DSP blocks, and embedded memories, at the least, and often also include processor cores and other hard blocks, stand-alone FPGAs are always built as a “guess” by the FPGA company about the relative amounts of each of these resources needed for given broad classes of applications. When you select a stand-alone FPGA, it’s always some kind of compromise. You may have to take an FPGA with more multipliers than you need in order to get the amount of LUT fabric you require or the amount of high-speed IO your design needs. There is practically never a situation where the FPGA has exactly the mix of resources required for your application.
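
To make that compromise concrete, here is a minimal Python sketch. Everything in it is hypothetical: the catalog, the device names, and the resource counts are invented for illustration and do not describe any real FPGA family or any Achronix product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    luts: int
    dsp_blocks: int
    bram_kbits: int

    def covers(self, need: "Resources") -> bool:
        # A device fits only if it meets every requirement at once.
        return (self.luts >= need.luts
                and self.dsp_blocks >= need.dsp_blocks
                and self.bram_kbits >= need.bram_kbits)

    def surplus(self, need: "Resources") -> "Resources":
        # Resources paid for but left unused by the design.
        return Resources(self.luts - need.luts,
                         self.dsp_blocks - need.dsp_blocks,
                         self.bram_kbits - need.bram_kbits)


# Hypothetical stand-alone FPGA catalog (made-up numbers).
CATALOG = {
    "small": Resources(luts=50_000, dsp_blocks=200, bram_kbits=4_000),
    "medium": Resources(luts=100_000, dsp_blocks=800, bram_kbits=10_000),
    "large": Resources(luts=200_000, dsp_blocks=2_000, bram_kbits=30_000),
}


def smallest_fit(catalog, need):
    """Return (name, unused resources) for the smallest device that fits, or None."""
    for name, device in sorted(catalog.items(), key=lambda item: item[1].luts):
        if device.covers(need):
            return name, device.surplus(need)
    return None


# Hypothetical design: LUT- and memory-heavy, with few multipliers.
need = Resources(luts=90_000, dsp_blocks=100, bram_kbits=12_000)
choice = smallest_fit(CATALOG, need)
print(choice)
```

With these made-up numbers, only the largest part satisfies the memory requirement, so most of its multipliers sit idle. An eFPGA instance would instead be sized to the requirement itself.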

eFPGAs like Speedcore allow you to tailor the mix of resources exactly to your anticipated application needs. And, since you’re designing your own SoC anyway, you have the flexibility to merge the FPGA core with any number of other hard resources and IO. None of that changes with this new version of Speedcore, of course. But, with this edition, there are more (and more interesting) options for the types of blocks you can include in your implementation. The capabilities of those new blocks, plus (of course) the Moore’s Law gains from dropping to a 7nm process, make this a powerful new offering for those seeking to accelerate critical applications – particularly if those applications involve AI and machine learning.

Achronix has added what it calls Machine Learning Processor (MLP) blocks to the library of available blocks for building your eFPGA implementation. The company claims the new block delivers 300% higher system performance for artificial intelligence and machine learning (AI/ML) applications. These MLP blocks are aimed at the kind of matrix-multiply operations common in CNN inferencing. Each MLP includes a local cyclical register file that leverages temporal locality for optimal reuse of stored weights or data. The MLPs are tightly coupled with neighboring MLP blocks and larger embedded memory blocks, and they support multiple-precision fixed-point and floating-point formats, including Bfloat16, 16-bit half-precision floating point, 24-bit floating point, and block floating point (BFP). In many ML applications, reducing the precision of these calculations can yield massive gains in performance and power consumption with very little loss in accuracy. By supporting a wide range of precisions, the MLP allows you to find the optimal compromise between performance and accuracy for your application.
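
As a rough illustration of what block floating point means, the sketch below quantizes a block of values to one shared exponent plus small signed integer mantissas. This is a generic formulation of the idea, not Achronix's MLP implementation; the function names and the 8-bit mantissa width are assumptions chosen for the example.

```python
import math


def to_block_floating_point(values, mantissa_bits=8):
    """Quantize floats to (shared_exponent, signed integer mantissas).

    Assumption for this sketch: one exponent is shared by the whole block
    and each value keeps a `mantissa_bits`-wide signed integer mantissa.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so scaling by
    # 2**(e - (mantissa_bits - 1)) puts the largest value near full scale.
    _, e = math.frexp(max_abs)
    shared_exp = e - (mantissa_bits - 1)
    scale = 2.0 ** shared_exp
    limit = 2 ** (mantissa_bits - 1) - 1
    # Round to the nearest integer and clamp to the signed mantissa range.
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return shared_exp, mantissas


def from_block_floating_point(shared_exp, mantissas):
    """Reconstruct approximate floats from the shared exponent and mantissas."""
    scale = 2.0 ** shared_exp
    return [m * scale for m in mantissas]


block = [1.0, 0.5, -0.25, 0.3]
shared_exp, mantissas = to_block_floating_point(block)
restored = from_block_floating_point(shared_exp, mantissas)
errors = [abs(a - b) for a, b in zip(block, restored)]
print(shared_exp, mantissas, restored, max(errors))
```

Because the exponent is stored once per block, the multiply-accumulate work reduces to integer arithmetic on the mantissas. The cost is that small values in a block dominated by a large one lose precision.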

Other architecture changes and improvements include a new 8-to-1 mux, which allows up to 8-wide muxing with a single level of logic. Also new is an 8-bit ALU with 2x the adder density of the previous generation. The new ALU is aimed at AI/ML applications, where it is frequently used for adders, counters, and comparators. There is also a new 8-bit cascadable bus-maximum function, new high-efficiency dedicated shift registers, and a new 6-input LUT with 2 registers per LUT. Taken together, these should substantially improve throughput and architectural efficiency, provided the Achronix tool chain (and synthesis in particular) can take optimal advantage of the new and changed resources.

Achronix has also added a new independent dedicated bus-routing structure to the architecture, allowing bus-grouped routing separate from the normal bit-wise routing channels. This should minimize congestion and improve timing by providing matched-length connections for all bits in a bus. The company says these bus routes should be optimal for busses running between memories and MLPs, and they effectively create a giant, run-time-configurable switching network on-chip. Cascadable 4-to-1 bus routing provides 2x performance for busses while saving LUT resources.

Architectural improvements in the new Speedcore allow LUT-based multipliers to be implemented more efficiently, providing the ability to create a 6×6 multiplier, using only 11 LUTs, that operates at 1 GHz. Typical FPGA implementations would require 21 LUTs for the same functional implementation.
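
The arithmetic behind that comparison is simple, and a short Python sketch shows how it scales across a design. The 11- and 21-LUT figures come from the claim above; the count of 1,000 multipliers is a hypothetical design size chosen only for illustration.

```python
def lut_savings(multiplier_count, luts_typical=21, luts_gen4=11):
    """Compare LUT usage for 6x6 multipliers using the per-multiplier figures quoted above."""
    typical = multiplier_count * luts_typical
    gen4 = multiplier_count * luts_gen4
    saved = typical - gen4
    return typical, gen4, saved, saved / typical


# Hypothetical design with 1,000 LUT-based 6x6 multipliers.
typical, gen4, saved, fraction = lut_savings(1000)
print(f"{typical} LUTs vs {gen4} LUTs: {saved} saved ({fraction:.1%})")
```

Whatever the multiplier count, the reduction works out to 10 of every 21 LUTs, or a little under half.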

The new resources are organized on the chip using non-traditional column adjacency that Achronix says doubles the density of compute operations by providing cascade routing between the new MLP blocks and embedded memory blocks. This dataflow is optimized for AI/ML applications and should also result in significant power savings on those types of operations, because less power will be consumed in data transfers between compute and memory resources.

Achronix uses its Speedcore Builder tool to create custom Speedcore instances to match each user’s requirements. The user can then evaluate the suitability of the generated eFPGA block for their application, and Achronix can supply die size and power information as well. This allows design teams to have a solid understanding of the functional applicability, performance, and power consumption of their eFPGA implementations long before they commit to silicon.

Achronix says Speedcore Gen4 for TSMC 7nm CMOS is available today and will be in production in 1H 2019. The company will then back-port Speedcore Gen4 for TSMC 16nm and 12nm with availability in 2H 2019.

Speedcore Gen4 should bring impressive levels of FPGA and AI/ML acceleration capability to many applications, and it could save dramatically on system cost, power, and complexity compared with solutions that use stand-alone FPGAs. With the expected dramatic growth in the market for AI/ML acceleration, we also expect to see third parties developing commercial specialized accelerator chips based on the Achronix IP. It will be interesting to watch the evolution of this market as design teams size up the various competing alternatives for compute acceleration in this exciting new domain.
