High-reliability design considerations are fast becoming an art that system engineers have to undertake very early on in the design process, often beginning with a designer’s choice of system silicon. For designers seeking flexibility and time-to-market advantages, high-reliability FPGAs are a good choice. However, not all high-reliability designs are created equal. There is a broad spectrum of high-reliability designs out there today, from space-bound satellites to anti-lock braking systems in our ground-based vehicles. While there are obvious similarities between these two types of systems, there are also some key differences, all of which become important considerations when determining your silicon-of-choice for these applications.
The Need for High Reliability
Space-flight applications represent perhaps the highest requirement for reliability for electronics components. The consequences of a reliability problem are severe. In terms of financial cost, a high-end satellite may cost the operator over $1B to develop and launch, and the cost of building and launching a replacement may be a substantial fraction of the initial cost.
Outside of replacement costs, there are other consequences of a reliability problem. For a commercial satellite, the loss of revenue to the satellite’s operator may be substantial; for Earth observation satellites, the loss of such a satellite may cause critical data about the direction and intensity of a tropical storm to be delayed; for military satellites, the loss of service can impede battlefield communications, possibly giving an advantage to the enemy; for reconnaissance satellites, the loss of service could cause critical information on an enemy’s movements to be missed, with potentially catastrophic consequences for national security. Moreover, if a critical function in a satellite is interrupted by a reliability issue, it is impossible to conduct a service call to repair the satellite. There is no alternative but to launch a replacement.
In automotive applications, reliability impacts different automotive electronic subsystems in different ways. Reliability problems with parts in system-critical functions, such as engine control and braking, and safety systems, such as air bags, may lead to expensive product recalls, product liability lawsuits or, at worst, loss of life. Reliability problems with parts in ancillary systems, such as auto-dimming mirrors, seat adjustments, object-proximity parking aids, GPS navigation and infotainment systems, result in poor product quality metrics and can severely damage brand image and reduce repeat buying and product resale values. Automotive manufacturers and parts suppliers have the difficult task of ensuring the highest standards of safety, reliability and security while at the same time keeping their products affordable.
Assuring Reliability at the Component Level
The first step a component manufacturer takes in assuring component reliability is to perform device qualification and characterization. Qualification involves testing many aspects of the packaged product against an industry-recognized standard. For space-flight applications, the traditional qualification standards used are Mil-Std 883 Class B or QML Class V. For automotive applications, each part of the automotive subsystem may have different requirements and device specifications. However, extended temperature support and extensive device characterization are universal requirements. Experiments are often extended beyond normal operating life until device failure in order to determine specification margin and to drive future product improvement activities.
For space-flight applications, the standards contain rigorous test methods, which must be applied by the manufacturer. Manufacturers may also apply their own qualification standards in addition to those mandated by Mil-Std 883 Class B or QML Class V, such as Actel’s Enhanced Antifuse Qualification testing.
Characterization of FPGAs intended for space-flight and automotive applications requires that components from multiple wafer lots (usually three at minimum) be evaluated for compliance with published datasheet specifications, including DC parameters, AC (timing) parameters, power consumption and power-up behavior. For space-flight parts, parameters are measured at both extremes of the military temperature range (-55°C and +125°C) and at multiple points in between. Automotive parts are usually characterized over a temperature range of -40°C to +125°C or +135°C.
For the high-volume, high-reliability automotive FPGA market, typical automotive part flows include sample testing at extended temperatures for all lots. In the most demanding system-critical automotive applications, 100-percent dynamic burn-in at temperature is used to remove typical semiconductor infant mortality failures. Failures that do occur are analyzed, and the knowledge gained is used for continuous improvement.
Because the volumes involved in space applications are relatively small – a typical order may be in the range of 20 to 30 units – it is practical to screen 100 percent of the manufactured units to an extremely high standard of reliability. In fact, this is required by Mil-Std 883 Class B and QML Class V. Production testing of space-flight FPGAs is a complicated matter. “Flight” units – those that are shipped for integration into spacecraft – are subjected to temperature cycling, constant acceleration and a visual inspection prior to electrical testing. In addition to testing the quality of the hermetic seal of the ceramic packaging via a fine and gross leak test, a “particle impact noise detection” test – commonly referred to as PIND – is performed. This requires the parts to be shaken in close proximity to a microphone, in order to detect the presence of loose particles in the package cavity.
After the package tests are completed, electrical testing is performed, followed by dynamic burn-in. During the burn-in, devices are mounted on specially designed burn-in boards in a temperature chamber. The devices are powered up and exercised with special test patterns during the burn-in, hence the term dynamic burn-in. Mil-Std 883 Class B allows manufacturers to perform the dynamic burn-in at either 125°C for 160 hours or 150°C for 80 hours. With the decreasing maximum junction temperatures of today’s advanced deep sub-micron fabrication processes, most manufacturers now perform dynamic burn-in using the 125°C / 160 hours option. Following dynamic burn-in, electrical parameters are measured again, and then a suite of electrical tests is performed at the extremes of the military temperature range as well as at room temperature. Strict rules govern the acceptability of a lot, so any failures during the testing are recorded. If more than a specific, small number of failures occur, the entire lot is rejected and is not shipped to customers.
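The relationship between the two burn-in options can be explored with a rough calculation. The sketch below uses the Arrhenius acceleration model, which is an assumption on our part and is not stated in the text above. The activation energy of 0.4 eV is an illustrative value only, and the function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_low_c, t_high_c, activation_energy_ev):
    """Acceleration factor of stress at t_high_c relative to t_low_c.

    Assumes a single thermally activated failure mechanism (Arrhenius model).
    """
    t_low_k = t_low_c + 273.15
    t_high_k = t_high_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_low_k - 1.0 / t_high_k
    )
    return math.exp(exponent)


# Illustrative activation energy; real values depend on the failure mechanism.
af = arrhenius_acceleration(125.0, 150.0, activation_energy_ev=0.4)

# Hours at 125 degC that would give the same stress as 80 hours at 150 degC.
equivalent_hours_at_125c = 80.0 * af
```

Under these assumed values the acceleration factor comes out close to 2, so 80 hours at 150°C is roughly comparable to 160 hours at 125°C. A different activation energy would shift that equivalence.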
An even higher level of reliability screening is available to customers of Actel space-flight FPGAs. In addition to the Mil-Std 883 Class B screening described above, Actel’s “extended flow” screening adds additional hours to the dynamic burn-in, and also adds a static burn-in, applied to 100 percent of flight units. Testing is preceded by serialization, where each device receives an individual unique serial number. Prior to burn-in, electrical parameters are measured and recorded for each device, a process referred to as “read and record.” Following the completion of burn-in, these electrical parameters are again read and recorded, and delta calculations are performed to compare the pre- and post-burn-in parameters. This data is supplied to customers in a data pack upon shipment of the flight units.
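The “read and record” delta calculation can be sketched in a few lines. The parameter names, readings and delta limits below are hypothetical and chosen only for illustration; they are not Actel’s actual test parameters or limits.

```python
def burn_in_deltas(pre, post, delta_limits):
    """Compare pre- and post-burn-in readings for one serialized device.

    Returns, for each parameter, the delta and whether it stays within
    the allowed drift.
    """
    report = {}
    for name, pre_value in pre.items():
        delta = post[name] - pre_value
        report[name] = {
            "delta": delta,
            "within_limit": abs(delta) <= delta_limits[name],
        }
    return report


# Hypothetical readings for a single device (values are made up).
pre = {"icc_standby_ma": 1.20, "tpd_ns": 4.10}
post = {"icc_standby_ma": 1.26, "tpd_ns": 4.45}
limits = {"icc_standby_ma": 0.10, "tpd_ns": 0.20}

report = burn_in_deltas(pre, post, limits)
```

In this made-up example the standby current drift stays within its limit while the propagation delay drift does not, which is the kind of shift the delta comparison is meant to expose.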
In addition to these tests, some destructive tests are applied to a sample of parts from each production lot. These tests include a destructive bond-pull, where the force required to break the bond wires connecting the die to the package is measured. Mil-Std 883 Class B specifies a minimum force required to cause breakage, and if the sample fails to reach that breaking force, then the production lot will be held for a reliability investigation.
Threats to Reliability
In space-flight applications, a major reliability threat comes from the radiation that exists in space. Natural radiation effects fall into two categories: Total Ionizing Dose (TID) and Single Event Effects (SEE). Over the five- to 15-year duration of a typical space-flight mission, the radiation dose accumulates in semiconductors, causing transistor threshold voltages to shift and transistor junctions to become leaky. A high enough TID will cause integrated circuits (ICs) to exceed the parametric limits of the datasheet. Stand-by current or propagation delay is usually the first parameter to fail. At an even higher level of TID, the IC will functionally fail. Modern space-flight FPGAs have a high resistance to TID, to the extent that they are suitable for deployment in the vast majority of satellite programs.
However, SEEs are a different story. A SEE occurs when a heavy ion strikes an IC and causes a disruptive effect. SEEs include data upsets (Single Event Upset, SEU), functional upsets (Single Event Functional Interrupt, SEFI) or latch-ups (Single Event Latch-up, SEL). These are harmful phenomena, which at minimum can interrupt a data path in the spacecraft. More serious consequences include the loss of key spacecraft control data, which could result in the loss of power control, attitude control, or telemetry links between the spacecraft and ground control. In the worst-case scenario, a SEL could cause high current consumption in an IC, resulting in destruction of the IC, and potentially the loss of a circuit board. Thankfully, SEL has been rendered unlikely by modern design and processing techniques for space-flight FPGAs, as evidenced by the radiation testing results published on today’s space-flight FPGAs.
Other SEEs are more problematic. Designers selecting space-flight FPGAs should take care to ensure that the programming technology used in their FPGAs is not vulnerable to radiation effects. For example, SRAM cells have a very high susceptibility to SEUs, and an FPGA using SRAM cells for device configuration will need special mitigation techniques in order to prevent loss of functionality (SEFI) when struck by a heavy ion. On the other hand, FPGAs using antifuse programming technology do not need any mitigation techniques, since antifuses are not upset or damaged by heavy ions at the energy levels found in space.
Similar to space applications, FPGAs used in automotive applications face reliability threats from mechanical stresses such as thermal shock and vibration. Also, like space-based applications, automotive FPGAs are exposed to ionizing radiation. In automotive, this radiation is from atmospheric neutrons and alpha particles from packaging impurities. It is impossible to screen or insulate semiconductor devices from these sources of ionizing radiation. The ionizing radiation sources will result in SEUs of the SRAM inside an FPGA. If the upset occurs in data SRAM, error detection and correction (EDAC) techniques can be used to mitigate the error. However, if the error occurs in the SRAM configuration memory used to control the ‘personality’ of an SRAM-based FPGA, a logic failure or firm error of the FPGA may result. Firm errors lead to FIT rates (failure in time rate = number of device failures in 10⁹ device hours) several orders of magnitude higher than those seen in a typical semiconductor device. FIT rates give a true picture of long-term device reliability.
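As a worked illustration of the FIT definition, the sketch below converts an observed failure count into a FIT rate. The function names and the fleet numbers are hypothetical, and the MTBF conversion assumes a constant failure rate.

```python
def fit_rate(failures, devices, hours_per_device):
    """Failures in time: device failures per 1e9 device-hours."""
    device_hours = devices * hours_per_device
    return failures / device_hours * 1e9


def mtbf_hours(fit):
    """Mean time between failures, assuming a constant failure rate."""
    return 1e9 / fit


# Hypothetical example: 2 failures among 10,000 devices run for 1,000 hours each.
fit = fit_rate(failures=2, devices=10_000, hours_per_device=1_000)
mtbf = mtbf_hours(fit)
```

With these made-up numbers the result is about 200 FIT, or roughly five million hours between failures per device.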
New Challenges to Product Reliability
Advancing process technology brings new challenges to both manufacturers and customers of high-reliability electronics. Among the challenges are increased susceptibility to radiation SEEs and greater susceptibility to electromigration and thermal runaway.
As the feature size of transistors decreases, the amount of energy required to cause an SEE also decreases. Therefore, a greater percentage of the heavy ion population will possess sufficient energy to cause SEEs. Consequently, as manufacturers advance their processing technology, they will need to build in SEE mitigation techniques to protect their circuits. Actel’s RTSX-SU and RTAX-S space-flight FPGAs already contain mitigation techniques to eliminate SEE at energy levels found in space flight.
Advanced process technologies also bring new challenges in electromigration. The thinness of the metal tracks at fine process geometries makes them more vulnerable to electromigration, where the density of the electrical current is high enough to physically move the metal atoms. This effect causes additional thinning and ultimately breaks the metal connection, with loss of functionality in the IC. Electromigration is greatly accelerated by high temperatures, such as those in automotive applications. One solution is for manufacturers to limit the maximum junction temperature of their ICs or, alternatively, to specify a reduced maximum current consumption for any FPGA design that will be exposed to high temperatures for extended periods of time.
Another phenomenon affecting maximum junction temperature and overall reliability is thermal runaway. A device experiencing thermal runaway will exhibit an increase in current consumption as temperature rises. As the current increases, the amount of power dissipated in the device will also rise, and the device will heat itself up. The increased temperature causes a further increase in current, causing a further temperature increase. This vicious cycle continues until the device destroys itself. Care must be taken in thermally demanding applications, such as automotive under-hood systems, to ensure that FPGA device power consumption is well understood over the entire operating profile of the system.
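The feedback loop described above can be illustrated with a toy model. The sketch below assumes that leakage power doubles for every fixed rise in junction temperature and that the junction sits at the ambient temperature plus a thermal resistance times the dissipated power. The doubling interval, the thermal resistance and all numeric values are illustrative assumptions, not figures for any real FPGA.

```python
def settle_junction_temp(ambient_c, theta_ja, dynamic_w, leak_w_at_25c,
                         doubling_c=20.0, limit_c=200.0,
                         max_iter=1000, tol=1e-3):
    """Iterate Tj = Ta + theta_ja * P(Tj) and report whether it settles.

    Returns (junction_temp_c, stable). stable is False when the
    temperature climbs past limit_c, i.e. the model runs away.
    """
    tj = ambient_c
    for _ in range(max_iter):
        # Assumed leakage model: doubles every `doubling_c` degrees.
        leak_w = leak_w_at_25c * 2.0 ** ((tj - 25.0) / doubling_c)
        new_tj = ambient_c + theta_ja * (dynamic_w + leak_w)
        if new_tj > limit_c:
            return new_tj, False   # temperature keeps climbing: runaway
        if abs(new_tj - tj) < tol:
            return new_tj, True    # converged to a stable operating point
        tj = new_tj
    return tj, False


# Benign case: cool ambient, low leakage, good heat path.
cool_tj, cool_stable = settle_junction_temp(25.0, 10.0, 1.0, 0.1)

# Stressed case: hot ambient, higher leakage, poorer heat path.
hot_tj, hot_stable = settle_junction_temp(105.0, 20.0, 1.0, 0.5)
```

In the benign case the iteration settles a few degrees above ambient. In the stressed case each pass raises the power and the temperature further, so no stable operating point is reached.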
The net effect of electromigration and thermal runaway is that manufacturers are specifying lower maximum junction temperatures than in previous generations of ICs. This in turn places additional constraints on designers of high-reliability systems. It is incumbent on designers to ensure that the thermal characteristics of the system do not cause the maximum junction temperature to be exceeded. Since the maximum junction temperatures of ICs are decreasing, it is necessary to take increased measures to control power dissipation and heat flow in the system. The choice of packaging technology can assist the designer here. For example, in space-flight FPGAs, only ceramic packages are offered, which works to the designer’s advantage. Ceramic packages have very low thermal resistance (usually specified as θJC, or junction-to-case thermal resistance), so they conduct heat effectively away from the die to the package pins, where it is conducted to a heat-plane in the system. However, the decreasing maximum junction temperatures place higher demands on system cooling resources.
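A first-order check of junction temperature against the datasheet maximum follows directly from θJC: the junction temperature is the case temperature plus θJC times the dissipated power. The sketch below applies that relation; the function names and the numeric values are hypothetical and should be replaced with datasheet figures for the actual device and package.

```python
def junction_temp_c(case_temp_c, theta_jc_c_per_w, power_w):
    """Junction temperature from case temperature, theta-JC and power."""
    return case_temp_c + theta_jc_c_per_w * power_w


def thermal_margin_c(case_temp_c, theta_jc_c_per_w, power_w, tj_max_c):
    """Headroom to the maximum junction temperature (negative = exceeded)."""
    return tj_max_c - junction_temp_c(case_temp_c, theta_jc_c_per_w, power_w)


# Hypothetical values: 85 degC case, theta-JC of 2 degC/W, 3 W dissipation,
# and a 125 degC maximum junction temperature.
tj = junction_temp_c(85.0, 2.0, 3.0)
margin = thermal_margin_c(85.0, 2.0, 3.0, tj_max_c=125.0)
```

A lower θJC or a lower power figure widens the margin, while a lower maximum junction temperature narrows it, which is the trade-off described above.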
What Is a High-Reliability Designer to Do?
The increasing complexity, functionality and cost of high-reliability designs have only added to the pressure on designers of such systems. Whether you are designing the latest spacecraft to explore the far reaches of space, or designing the next touring car to explore new family vacation destinations, the number of testing and design considerations could make a designer’s head spin. To move through this process more easily, designers should look for suppliers that have a proven track record of understanding the needs of high-reliability applications and of partnering with their customers to solve design challenges.