
Re-inventing the DSP Block

Altera Changes the Game

Everything in technology changes – evolves – improves.

First, we had 8-bit processors, then 16, then 32… now many of us are tapping the keys on 64-bit devices.

Nothing stays still for very long.

Why, then, have we lived for around a decade with very little change to the garden-variety 18×18 multipliers in our hardened FPGA DSP blocks?  Except for a few minor improvements, those haven’t really progressed in years.

To paraphrase something Bill Gates apparently never really said: “Why would anybody ever need more than 18×18 bit multiplication?”

OK, wait.  There has been some evolution in DSP blocks.  We’ve gone from 18×18 multipliers to multiplier-accumulator-ALU-ish blocks with all kinds of fancy carry logic.  We’ve even gotten asymmetric multipliers that have a wider side to accommodate a few tougher problems.  Both of the major vendors have continued to improve their blocks in ways that allow more complex operations to be done without jumping out to the LUT fabric.

Altera, however, has just raised the stakes a lot – with their complete re-design of the DSP block for their upcoming 28nm Stratix-V line.

For the tried-and-true FPGA sweet spots, 18×18 multipliers were just fine, but with FPGA markets expanding into areas like medical imaging, wireless, mil/aero, and test and measurement, wider fixed- and floating-point multiplication is required to solve the real-world problems.  If your FPGA can support those operations in hard-wired logic, you can skip the LUT fabric altogether, improve your throughput and power consumption, and save the programmable fabric for other work – or, better yet, save money by buying a smaller FPGA.

The key new element in Altera’s DSP block is variable precision.  Instead of a fixed-width hardware multiplier, the company has introduced a fracturable/cascadable multiplier that can deliver a variety of bit widths very efficiently.  Rather than glaze your eyes over with the exhaustive list of every possible combination, we’ll just say that you can choose precision from 9×9 up to 54×54, including asymmetric settings, with very little wasted hardware.  Floating-point mantissa multiplication is easily accomplished as well, so the enthusiasts of the relatively narrow area of FPGA-accelerated high-performance computing (or “reconfigurable computing”) will be very excited.  (You see, OpenFPGA.org?  Somebody IS listening.)

Back in the days of Stratix II, an Altera DSP block had four independent 18×18 multipliers (four 36-bit inputs).  For Stratix III, the company doubled the block and made it splittable (four 72-bit inputs), so that we could use “Half Blocks.”  That DSP block could do eight 18×18 multiplications with their products summed, or four independent 18×18 multiplications.  Now, the DSP block has four of the new variable-precision units (four 72-bit inputs), so it can do eight 18×18 multiplications with summing, eight independent 18×18 multiplications, or high-precision operations.

The new block has two native modes – “18-bit” and “high-precision.”  In “18-bit” mode, two 18×18 products can be summed into a 64-bit accumulator (with 37-bit precision out of the adder), or two 18×18 products can be independently output with 32-bit product precision.  In “high-precision” mode, you can do 27×27 or 18×36 multiplication, each with a 64-bit accumulator.  This means you can do single-precision floating-point mantissa multiplication in one variable-precision DSP block.  The 64-bit accumulators allow for cascading without loss of precision.
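The bit widths quoted above follow from simple product-width arithmetic.  As a hedged sketch (the helper name is ours, not Altera's), here is where the 36-, 37-, and 54-bit figures come from:

```python
# Sketch of the bit-width arithmetic behind the two native modes.
# An n-bit x m-bit multiply needs at most n+m bits for its product;
# summing two such products needs one more bit.

def product_bits(n, m):
    """Worst-case bit width of an n-bit x m-bit product."""
    return n + m

# "18-bit" mode: one 18x18 product is 36 bits wide; summing two of
# them yields 37 bits -- comfortably inside the 64-bit accumulator.
assert product_bits(18, 18) == 36
assert product_bits(18, 18) + 1 == 37

# "High-precision" mode: a 27x27 product is 54 bits wide, which is
# enough to cover the 24-bit significand of an IEEE 754
# single-precision multiply with room to spare.
assert product_bits(27, 27) == 54
```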

Altera lists many common applications where this all comes in handy.  For example, FFTs require high-precision complex multiplication.  The data width increases with each stage, while the coefficient width remains the same, so we can go from 18×18 to 18×25 to 18×36.  With the new architecture, each of these can be done in a single block.  With previous-generation blocks, the number of DSP blocks required could double.
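A quick sanity check of the FFT example above, as a hedged illustration (the function name is ours; the stage widths are the article's, and the 18×36 ceiling is the high-precision mode's native limit):

```python
# In an FFT, data width grows stage by stage while the coefficient
# stays 18 bits wide.  Check that each stage's multiply still fits a
# single variable-precision block.

def fits_one_block(coeff_bits, data_bits):
    """True if the multiply fits one block, assuming the
    high-precision mode handles up to 18x36 natively."""
    return coeff_bits <= 18 and data_bits <= 36

# Stage widths from the article: 18x18 -> 18x25 -> 18x36.
stage_multiplies = [(18, 18), (18, 25), (18, 36)]
assert all(fits_one_block(c, d) for c, d in stage_multiplies)
```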

For floating-point precision, using the 64-bit cascade, a single-precision mantissa multiplication can be done with one block at 27×27, and a double-precision 54×54 multiplication can be implemented with four blocks cascaded.  Four cascaded blocks could also handle a single-precision floating-point FFT’s complex multiplication.
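Why four blocks for 54×54?  The classic schoolbook split of each 54-bit operand into 27-bit halves yields exactly four 27×27 partial products.  A hedged sketch (the function name and split are illustrative, not Altera's implementation):

```python
import random

def mul54_via_27(a, b):
    """Multiply two 54-bit unsigned values using only 27x27 products,
    mirroring the 'four blocks cascaded' arrangement."""
    mask = (1 << 27) - 1
    a_lo, a_hi = a & mask, a >> 27
    b_lo, b_hi = b & mask, b >> 27
    # Four partial products, each computable in one 27x27 multiplier.
    p0 = a_lo * b_lo
    p1 = a_lo * b_hi
    p2 = a_hi * b_lo
    p3 = a_hi * b_hi
    # Shift and sum -- the cascade/accumulate path in hardware.
    return p0 + ((p1 + p2) << 27) + (p3 << 54)

# Spot-check against Python's native multiply.
a, b = random.getrandbits(54), random.getrandbits(54)
assert mul54_via_27(a, b) == a * b
```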

The combinations and permutations go on and on, of course.  Altera looked at a number of critical popular applications in designing the new blocks, and the net effect is that you’ll use far fewer of the new blocks to accomplish the same math, and far less often be required to take your critical timing path out of the hardened world of your DSP blocks and into the LUT fabric.

Will this benefit you?  The answer is – sometimes. 

If you’re doing arithmetic operations that require more than the fixed-point precision choices available currently, you’ll certainly be able to do them with fewer DSP blocks, and with fewer excursions into the LUT fabric.  That means you’ll have more options.  If the number of DSP blocks was the reason you had to buy that bigger FPGA, you can now buy a smaller one.  (Don’t tell Altera they just engineered themselves into a smaller sale.)

If you were pushing it on Fmax because some of your arithmetic logic was bleeding over into the LUTs, you may now be able to operate the multiplier-accumulator part of your datapath closer to the datasheet frequencies.  Or, you may have a lot less work to do on timing closure when you’re finishing up your design.

If you were resource sharing DSP blocks because you were limited in the number available, now you can go with more parallelism and potentially improve your throughput and/or latency.  This would, of course, also translate to fewer memory/register resources being consumed in the course of sharing magic.

Another group certain to benefit from this architecture is those using high-level synthesis to go from algorithmic representations in C/C++ or other untimed high-level languages into FPGA hardware.  If your high-level synthesis tool has this more flexible block in its tool chest (and if it has the wherewithal to use it properly), you’ll magically get better results without even worrying about it.

As the marginal returns from each new process node continue to diminish, FPGA companies need to step up their architectural innovation to keep pace with our insatiable appetite for more performance and efficiency.  Smart advances like this new DSP block are exactly what FPGA users need and exactly what the FPGA industry needs to keep attracting new customers and new design wins in an increasingly competitive environment.
