Making the Line Move Faster

No one likes standing in line, but if you’re going to be doing any serious parallel processing, you’ll run into many queues as a way for threads or processes to send messages to each other. Usually they’re implemented in software, which adds a level of overhead to the programs, in particular when putting things on or taking them off the queue.

For instance, when putting a new item into the queue, you have to check whether there is room in the queue and update the head afterwards, and, by the time all is said and done, you’ve chewed up 10 instructions. In addition, the queue information may or may not be in the cache, and you have to ensure exclusive access where critical. It can all get slow enough that you want to limit how many things work in parallel and how much they communicate. If the program is split up too finely, then too much time is lost to the not-so-instant messaging between processes or threads.
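To make that bookkeeping concrete, here is a minimal sketch of a software ring-buffer queue. The class and field names are my own assumptions for illustration, not taken from any particular library, and the sketch leaves out the atomic operations or locks that a real multithreaded version would need for exclusive access.

```python
class SoftwareQueue:
    """Toy single-producer/single-consumer ring buffer kept entirely in memory.

    Hypothetical illustration only: no locking or atomics, so it is not
    safe to share between real threads as written.
    """

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.head = 0   # next slot to write
        self.tail = 0   # next slot to read
        self.count = 0  # number of items currently queued

    def enqueue(self, item):
        # Step 1: check whether there is room in the queue.
        if self.count == self.capacity:
            return False
        # Step 2: store the item itself.
        self.slots[self.head] = item
        # Step 3: update the head (with wrap-around) and the size.
        self.head = (self.head + 1) % self.capacity
        self.count += 1
        return True

    def dequeue(self):
        # Mirror image of enqueue: check, load, update.
        if self.count == 0:
            return None
        item = self.slots[self.tail]
        self.slots[self.tail] = None
        self.tail = (self.tail + 1) % self.capacity
        self.count -= 1
        return item
```

Every call touches the shared indices before and after the actual store or load, which is where the extra instructions and the cache traffic come from.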

To alleviate this, hardware queues have been proposed. But these disrupt the micro-architecture and require custom interconnect. Worse yet, they’re not well handled by operating systems, especially when it comes to context switches. And architects have the unenviable job of deciding how big to make them and how many to provide: you know they’ll never get that right in everyone’s eyes.

In one of those random encounters with something interesting, I ran across a recent report by Lee et al. from North Carolina State University that takes a middle road for single-producer/single-consumer queues: hardware-accelerated software queues. They set out to meet four primary criteria:

  1. The frequent enqueue and dequeue tasks have to be as efficient as possible. Hardware queues meet this; software queues don’t.
  2. The time between one entity putting something on the queue and something else taking it off should be as short as possible. Again, hardware queues can do this; software queues, not so much.
  3. They should be as easy to program as software queues (quantity, synchronization, etc.). Software queues meet this by definition; hardware queues don’t, at the very least by virtue of their limited quantity and size.
  4. They have to work without changing the OS. Because software queues work in the application memory space, they can do this; hardware queues can’t.

The abridged version of what they do can be summarized by four primary points:

  • While still implementing the queue in memory, cache a local copy of the queue head, tail, and size in a separate dedicated hardware table. This table stores some number of active queues much the way a memory cache stores some number of active memory addresses. This means that the various bookkeeping steps can be done without going to memory and without contention from anyone else.
  • Pipeline the queue operations into three steps: address generation, the actual store or load, and index updating.
  • Use dedicated hardware to calculate the address of the store or load. Again, this happens in private hardware without interference; multiple addresses can be generated in a single operation. The actual load or store can happen when the address is ready, meaning that the queue operations may happen out of order (if an early one takes longer to have its address resolved, for instance).
  • Accumulate index updates and store a bunch of them at the same time to reduce the amount of access to the cache.
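As a rough mental model of those four points, here is a toy software simulation of the producer side of one entry in such a table. It is only a sketch under my own assumptions: the names (`QueueTableEntry`, `flush_every`, `index_writes`) are hypothetical, a Python dict stands in for application memory, the consumer side is omitted, and nothing here reflects the actual hardware design in the paper.

```python
class QueueTableEntry:
    """Toy model of one queue's entry in a dedicated table (producer side only).

    All names and the flush policy are assumptions made for illustration.
    """

    def __init__(self, memory, base, capacity, flush_every=4):
        self.memory = memory            # dict standing in for application memory
        self.base = base                # address of slot 0 of the queue
        self.capacity = capacity
        self.head = 0                   # private cached copy of the head index
        self.size = 0                   # private cached copy of the queue size
        self.pending = 0                # index updates not yet written to memory
        self.flush_every = flush_every  # how many updates to accumulate
        self.index_writes = 0           # counts index write-backs to memory
        memory[("head", base)] = 0      # in-memory copy of the head index

    def enqueue(self, item):
        # Bookkeeping uses only the cached copy, not memory.
        if self.size == self.capacity:
            return False
        # Stage 1: address generation from the cached index.
        address = self.base + self.head
        # Stage 2: the actual store to the queue in memory.
        self.memory[address] = item
        # Stage 3: index update, accumulated locally.
        self.head = (self.head + 1) % self.capacity
        self.size += 1
        self.pending += 1
        if self.pending >= self.flush_every:
            self.flush()
        return True

    def flush(self):
        # Write the accumulated index updates back in a single access.
        if self.pending:
            self.memory[("head", self.base)] = self.head
            self.index_writes += 1
            self.pending = 0
```

In this model, six enqueues with `flush_every=4` cost one index write-back rather than six; a final `flush()` publishes the remaining two updates.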

Of course, as with everything, the details hide much devilry, so they have special considerations to handle misspeculation, precise interrupts, a full queue cache, fence (memory barrier) instructions, and avoiding livelock (since, in theory, the various queue indices could reside on three different memory pages that an OS may not be able to keep in memory at the same time).

With this solution, Criterion 1 is met because of the hardware acceleration, as is Criterion 2. Because the actual queues are still implemented in the application memory space, with no specific size or quantity limits, Criteria 3 and 4 are met.

For details on all of this as well as the results of their testing, you can check out the paper courtesy of James Tuck, one of the authors.
