feature article
Subscribe Now

Considerations for High-Bandwidth TCP/IP PowerPC Applications

The TCP/IP protocol suite is the de facto worldwide standard for communications over the Internet and almost all intranets. Interconnecting embedded devices is becoming standard practice even in device classes that were previously stand-alone entities. By its very definition, an embedded architecture has constrained resources, which is often at odds with rising application requirements.

Achieving wire-speed TCP/IP performance continues to be a significant engineering challenge, even for high-powered Intel™ Pentium™-class PCs. In this article, we’ll discuss the per-byte and per-packet overheads limiting TCP/IP performance and present the techniques utilized to maximize TCP/IP over Gigabit Ethernet performance in embedded processor-based applications.


Gigabit Ethernet performance is achieved by leveraging a multi-port DDR SDRAM memory controller to allocate memory bandwidth between embedded PowerPC™ processor local bus (PLB) interfaces and two data ports. Each data port is attached to a direct memory access (DMA) controller, allowing hardware peripherals high-bandwidth access to memory.

System Architecture

Memory bandwidth is an important consideration for high-performance network attached applications. Typically, external DDR memory is shared between the processor and one or more high-bandwidth peripherals such as Ethernet. A multi-ported memory controller efficiently divides the available memory bandwidth between processor interfaces and streaming peripherals, including Ethernet. The streaming peripherals are linked to memory through a point-to-point streaming interface via Direct Memory Access (DMA) controllers. The DMA controller implements a scatter-gather scheme whereby multiple buffers are converted to/from a contiguous stream on the Ethernet or other streaming peripheral. The Ethernet peripheral implements checksum offload on both the Transmit and Receive paths for optimal TCP performance. A block diagram of the system described above is shown in Figure 1.

Figure 1

TCP/IP Per-Byte Overhead

Per-byte overhead occurs when the processor touches payload data 1. The two most common operations of this type are buffer copies and TCP checksum calculation. Buffer copies represent a significant overhead for two reasons:

1. Most of the copies are unnecessary.

2. The processor is not an efficient data mover.

TCP checksum calculation is expensive, as it is calculated over each payload data byte. Embedded TCP/IP-enabled applications such as medical imaging require near wire speed TCP bandwidth to reliably transfer image data over a Gigabit Ethernet network. The data is generated from a high-resolution image source, not the processor. In this case, introducing a zero-copy software API and offloading the checksum calculation into FPGA fabric completely removes the per-byte overheads. “Zero-copy” is a term that describes a TCP software interface where no buffer copies occur. Linux and other operating systems have introduced software interfaces like sendfile() 2 that serve this purpose, and commercial standalone TCP/IP stack vendors like Treck TM offer similar zero-copy features. These software features allow the removal of buffer copies between the user application and the TCP/IP stack or operating system.

The scatter-gather and the checksum offload features of the system provide the hardware support necessary for zero-copy functionality. The scatter-gather feature is a flexibility of the DMA controller that allows software buffers to be located at any byte offset. This removes the need for the processor to copy unaligned or fragmented buffers.

Checksum offload is a feature of the Ethernet peripheral. It allows the TCP payload checksum to be calculated in FPGA fabric as Ethernet frames are transferred between main memory and the peripheral’s hardware FIFOs. These system features remove the need for costly buffer copies and processor checksum operations, leaving the processor to perform protocol operations and user functions.

TCP/IP Per-Packet Overhead

Per-packet overhead is associated with operations surrounding the transmission or reception of packets 1. Packet interrupts, hardware interfacing, and header processing are examples of per-packet overheads. Interrupt overhead represents a considerable burden on the processor and memory subsystem, especially when small packets are transferred. Interrupt coalescing is a technique used in such a system to alleviate some of this pressure by amortizing the interrupt overhead across multiple packets. The DMA engine waits until there are n frames to process before interrupting the processor, where n is a software-tunable value.

Transferring larger sized packets (jumbo frames of 9,000 bytes) has a similar effect by reducing the number of frames transmitted, and therefore the number of interrupts generated. This amortizes the per-packet overhead over a larger data payload.


An example implementation of this architecture is the Gigabit System Reference Design from Xilinx (GSRD). It is geared toward high-performance bridging between TCP/IP-based protocols and user data interfaces like high-resolution image capture or Fibre Channel. The components of GSRD contain features to address the per-byte and per-packet overheads of a TCP/IP system. For applications requiring an embedded operating system, a MontaVista™ Linux™ port is available while a commercial standalone TCP/IP stack from Treck™ is available to satisfy applications with the highest bandwidth requirements.

The GSRD can provide Transmit TCP performance up to 890Mbps using jumbo frames, and is implemented in the latest FPGA technology available today from Xilinx. The GSRD can be downloaded today from http://www.xilinx.com/gsrd/.

Click here for printable PDF
(By clicking on this link you agree to FPGA Journal’s Terms of Use for PDF files. PDF files are supplied for the private use of our readers. Republication, linking, and any other distribution of this PDF file without written permission from Techfocus Media, Inc. is strictly prohibited.)


1. “End-System Optimizations for High-Speed TCP” (www.cs.duke.edu/ari/publications/end-system.pdf)

2. “Use sendfile to optimize data transfer” (http://builder.com.com/5100-6372-1044112.html)

Leave a Reply

featured blogs
Nov 25, 2020
It constantly amazes me how there are always multiple ways of doing things. The problem is that sometimes it'€™s hard to decide which option is best....
Nov 25, 2020
[From the last episode: We looked at what it takes to generate data that can be used to train machine-learning .] We take a break from learning how IoT technology works for one of our occasional posts on how IoT technology is used. In this case, we look at trucking fleet mana...
Nov 25, 2020
It might seem simple, but database units and accuracy directly relate to the artwork generated, and it is possible to misunderstand the artwork format as it relates to the board setup. Thirty years... [[ Click on the title to access the full blog on the Cadence Community sit...
Nov 23, 2020
Readers of the Samtec blog know we are always talking about next-gen speed. Current channels rates are running at 56 Gbps PAM4. However, system designers are starting to look at 112 Gbps PAM4 data rates. Intuition would say that bleeding edge data rates like 112 Gbps PAM4 onl...

featured video

Improve SoC-Level Verification Efficiency by Up to 10X

Sponsored by Cadence Design Systems

Chip-level testbench creation, multi-IP and CPU traffic generation, performance bottleneck identification, and data and cache-coherency verification all lack automation. The effort required to complete these tasks is error prone and time consuming. Discover how the Cadence® System VIP tool suite works seamlessly with its simulation, emulation, and prototyping engines to automate chip-level verification and improve efficiency by ten times over existing manual processes.

Click here for more information about System VIP

featured paper

Overcoming PPA and Productivity Challenges of New Age ICs with Mixed Placement Innovation

Sponsored by Cadence Design Systems

With the increase in the number of on-chip storage elements, it has become extremely time consuming to come up with an optimized floorplan using manual methods, directly impacting tapeout schedules and power, performance, and area (PPA). In this white paper, learn how a breakthrough technology addresses design productivity along with design quality improvements for macro-dominated designs. Download white paper.

Click here to download the whitepaper

featured chalk talk

The Wireless Member of the DARWIN Family

Sponsored by Mouser Electronics and Maxim Integrated

MCUs continue to evolve based on increasing demands from designers. We expect our microcontrollers to do more than ever - better security, more performance, lower power consumption - and we want it all for less money, of course. In this episode of Chalk Talk, Amelia Dalton chats with Kris Ardis from Maxim Integrated about the new DARWIN line of low-power MCUs.

Click here for more information about Maxim Integrated MAX32665-MAX32668 UB Class Microcontroller