February 21, 2011
A look at the world of audio IP
Humans have always communicated by sound. It’s so basic that we take it for granted, whether in our telephones or our stereo systems. By contrast, we fawn and ooh and ah over video. High def this, low power that. Let’s see if we can get a bazillion pixels on a 1-cm x 1-cm screen. OK, no one can see that small, but hey, we can say we did it!
So you’d be forgiven for thinking that all that’s happening in audio is finding new ways to compress files in a lossy manner so that you can fit a bazillion tinny-sounding songs on a media player (instead of just a paltry billion). I mean, all kids use these days are crummy headphones – er – earbuds, and they don’t know what good sound is supposed to sound like because they’ve never heard it, right? So they won’t miss the highs and the lows. They’ll continue to throw money after crappy sound.
It’s enough to make an audiophile go… oh, I don’t know… put bubble gum on his speaker wire to create sound bubbles that will capture the frequencies that would otherwise be lost and thereby provide an unparalleled listening experience that would be wasted on mere mortals.
But, in fact, there’s a ton of stuff going on in audio. Truth be told, it’s kind of a confusing mishmash, both in terms of what is being done and how it’s being done. So let’s take a stroll through the world of sound and try to unravel a bit of what and who is where.
Different end games
First, sound shows up in a lot of places. As described by the players in the market, there are some clear segments, but others are fuzzy. If you classify by device, you end up with something like home entertainment, computers, mobile, broadcast, and automotive. The latter two are pretty distinct, but if you use a laptop to stream sound into your home audio components, is that home entertainment? Computer? What if it’s an iPod? Mobile?
In fact, it seems that, to some extent, it’s academic. Clearly, something that has to operate untethered in your hand must be small and go gentle on the battery. Less so if wall-plugged and not so space-constrained.
With these types of devices, there are actually two categories of “audio”: speech only (typically over IP, but also for cell service) and, well, everything, including voice, music, and that slightly disturbing sound Uncle Bernie always makes after Thanksgiving dinner. While the latter category is very much consumer-oriented, voice has particular importance for business use since VoIP plays an increasing role in business communication. (In fact, one conversation I had in researching this article was done across the Atlantic using a well-known free VoIP service, and the voice volume would gradually diminish to pretty much nothing, and then would rocket back immediately to full volume. Makes a cogent conversation difficult.)
The difference with speech is bandwidth. According to SPIRIT DSP’s Alex Kravchenko, voice is also getting an HD upgrade, moving from 3.7 kHz up to 8 and even 22 kHz. There is a codec from IETF in the works that will handle both voice and full audio, but, in reality, they’re still two separate codecs wrapped together. It appears that speech and full audio will continue in their separate spheres.
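To put rough numbers on that bandwidth climb, here's a back-of-envelope sketch. The sample rates and 16-bit depth are my own illustrative assumptions (the bandwidth tiers come from the article; nothing here is a vendor spec):

```python
def raw_pcm_kbps(sample_rate_hz, bits=16, channels=1):
    """Uncompressed bit rate in kbit/s, before any codec gets involved."""
    return sample_rate_hz * bits * channels / 1000

# Sample rates follow Nyquist: at least 2x the audio bandwidth.
tiers = {
    "narrowband voice (~3.7 kHz)": 8_000,    # classic telephony rate
    "wideband voice (~8 kHz)":     16_000,
    "full audio (~22 kHz)":        44_100,   # the familiar CD rate
}

for name, rate in tiers.items():
    print(f"{name}: {rate} Hz sampling, {raw_pcm_kbps(rate):.0f} kbit/s raw")
```

The raw numbers are exactly why codecs exist: nobody ships 700 kbit/s of uncompressed PCM over a cell link.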
While much of high-definition audio comes along with video (particularly under the Blu-ray label), there is also HD Radio. This provides a high-def format for digital radio broadcast over AM or FM bands. Think of it as yet another wireless channel, except that it’s the oldest wireless technology around.
For simpler audio or for voice on simpler cellphones, the main processor still gets used for running the codecs. But as expectations of sound quality increase – and there are those that believe that cellphones should be able to process sound just as well as a home system, even though you’d notice it only if you patch it to a real amp and speakers – the need for dedicated processing is growing.
So, whether voice or full audio, the name of the game is increased sound fidelity while reducing size and power consumption. The concept of having big boxes in your home for anything but the actual speakers or a turntable or a slot for DVDs and CDs is becoming dated. If a cellphone can process hi-fi sound, then there’s not much reason for big boxes. (Audiophiles, on the other hand, will continue to need large tubes and five-digit isolation platforms and such before they’ll stop cringing.)
Even for voice, the goal is for wireless and internet voice to have the same quality and reliability as good old Ma Bell used to provide. And that’s not just for a point-to-point conversation; that includes conference calling with many parties in many parts of the world. And, in some respects, voice is more demanding than full audio: you can tolerate a bit of a delay before your audio track starts playing, but delays in a conversation are unworkable, as any two people who have ended up talking on top of each other can readily attest.
Broadcast is characterized by more expensive equipment produced in far lower volumes than your typical playback devices, giving it a different set of design constraints. Here the name of the game is channels. Now, if you’re like me, you’ve seen TVs with 500 channels, most of which have, literally, nothing on them, just waiting for a pay-per-view event, and most of the rest having nothing of value on them. And so you might wonder why we would need lots more channels with nothing on them.
Well, that’s not what it’s about. It’s the channels-per-channel. Original TV had a picture along with an audio stream; eventually that went to picture plus stereo, or two audio channels. Nowadays, shows are broadcast with multiple language choices, not to mention high-definition audio requiring eight channels per audio… channel. Ok, wait, to unravel, each “TV channel” may have several audio “language channels,” each of which may have 8 “audio channels.” Got that? Obviously your receiver will use only a few channels at a time, but broadcast stations will send out many simultaneously.
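If the nesting makes your head spin, here's the same tally as a toy data structure. The language mix and per-language channel counts are hypothetical, purely to show the multiplication at work:

```python
# One broadcast "TV channel" carrying several language streams, each
# with its own audio channel count. All numbers hypothetical.
tv_channel = {
    "video": 1,
    "audio_streams": {
        "english": 8,   # 7.1 surround
        "spanish": 8,
        "french":  2,   # plain stereo
    },
}

languages = len(tv_channel["audio_streams"])
total_audio = sum(tv_channel["audio_streams"].values())
print(f"1 TV channel -> {languages} language channels -> {total_audio} audio channels")
```

Multiply that by hundreds of TV channels and the broadcast-side encoding load becomes obvious.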
Automotive comprises the infotainment systems largely intended to hypnotize your demon child into submission during long rides. As with all things automotive, they have long design cycles, which is what distinguishes this market so dramatically from the peripatetic consumer market. And what makes it less attractive to some players.
A different partition
There’s another way to slice things. Much of the technology is aimed at getting sound from here to there. There are dozens of ways of chopping sound up into transportable form for reassembly later. In fact, this is one of the things that characterizes this (and the video) market: dozens of codecs, all vying for your love. Much of this is due to good old-fashioned competition, with guys like Dolby and DTS trying to out-sound each other. To some extent, different codecs address different markets: mp3 clearly targets small hand-helds whereas 7.1 clearly has its roots in home theater. But just watch. Someone will start driving for 7.1 over earbuds. Or headbuds, implanted in seven different places on your head. Yeah, a kind of digital shower cap. Looks like hell but sounds great. With a subwoofer located… oh, never mind.
So the big challenge for the guys making these systems is to encode and decode and transcode over eighty different protocols or coding schemes. And that’s just to get things from point A to point B. What happens when it arrives and gets put back together? Will it sound as good coming out as it did going in?
That’s actually a more complicated question than it sounds. The simple view involves looking at the coding and finding less lossy ways to transmit the sound efficiently. Next would be pre-distorting the sound so that the unavoidable distortion in transmission (and in the speaker) actually restores the original fidelity. Here we also venture into a field with the slightly creepy moniker “psychoacoustics” to do things like fooling your brain into hearing bass that isn’t there based on the remaining harmonics from the original bass.
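That missing-bass trick has a tidy arithmetic core: if a small speaker can reproduce a note's harmonics but not its fundamental, the brain infers a pitch at their common divisor. A minimal sketch of that inference (a deliberate oversimplification of real psychoacoustic processing, which works on messy spectra, not clean integer frequencies):

```python
from functools import reduce
from math import gcd

def implied_fundamental(harmonics_hz):
    """The pitch the ear infers from a set of harmonics: their greatest
    common divisor. 100/150/200 Hz partials all fit a 50 Hz fundamental
    that a tiny speaker never actually reproduces."""
    return reduce(gcd, harmonics_hz)

print(implied_fundamental([100, 150, 200]))  # 50
```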
But here the attention is all on preserving the integrity of the original sound over the transmission channel. It ignores one element that most of us have ignored since long before digital music: the acoustics of the room. Which used to be addressed by adjusting the room, not the sound. Which brings us to the second slice: post-processing.
Regardless of how high the transmission fidelity is, it can still be made to sound better if the characteristics of the listening environment are considered. And we’re not just talking about how big the room is or whether there are lots of soft curtains or hard walls. We’re talking the position of furniture. Or even where you’re sitting. We’re talking algorithms that can figure out how the sound interacts with the room and compensate to improve the clarity and spatial integrity of the different audio channels, which is of particular importance for that life-like home theater experience.
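To make the idea concrete, here's a toy version of room compensation: pretend the room does nothing but add one attenuated echo, then undo it with the exact inverse filter. Real room-correction algorithms measure a full impulse response and solve a far messier inverse problem; this just shows the principle.

```python
def room(signal, gain=0.5, delay=3):
    """Toy room: adds one attenuated, delayed echo to the signal."""
    out = list(signal)
    for n in range(delay, len(signal)):
        out[n] += gain * signal[n - delay]
    return out

def compensate(signal, gain=0.5, delay=3):
    """Exact inverse of the toy room: y[n] = x[n] - gain * y[n - delay]."""
    out = list(signal)
    for n in range(delay, len(signal)):
        out[n] = signal[n] - gain * out[n - delay]
    return out

dry = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
wet = room(dry)               # echo shows up at sample 3
print(compensate(wet) == dry) # True: the dry signal is recovered
```

Note the catch that makes this a DSP problem rather than a parlor trick: the compensator has to know the room's gain and delay, which is why real systems start with a measurement microphone.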
In fact, this is one area where, according to Tensilica’s Steve Roddy, even ardent analog buffs will make some space for digital. You can’t do that kind of post-processing efficiently in the analog domain, so here the analog can be assisted by digital, both for the complex processing and for pre-distortion that keeps the analog circuits in the linear range to reduce power (and improve quality). He claims that the post-processing end of things is growing faster than the codec end.
The how, who, and how well
The technology for implementing all of this, for the most part, involves software executing on platforms. Things change so often that pretty much nothing runs on dedicated hardware. So how does one measure which solution is “best”? That’s actually a tough question. It’s definitely not the clock speed of the processor.
What does come up over and over is the number of MHz required to run a particular codec or function. Or, conversely, how many MHz are left over for doing other things. You may have a GHz processor, but if it takes the whole GHz to process voice, then another processor is needed for the other stuff. If the code executes more efficiently, then you can do more things with the same processor.
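The budgeting itself is simple arithmetic. A sketch with made-up load figures (no vendor's actual cycle counts):

```python
def headroom_mhz(clock_mhz, loads_mhz):
    """MHz left on the core after the audio functions take their slices."""
    used = sum(loads_mhz.values())
    if used > clock_mhz:
        raise ValueError("doesn't fit on one core -- add another processor")
    return clock_mhz - used

# Hypothetical cycle budgets, purely for illustration.
loads = {"codec decode": 35, "room post-processing": 120, "mixing": 15}
print(headroom_mhz(400, loads))  # 230 MHz left for everything else
```

A more efficient port shrinks the entries in that dictionary, which is exactly the leverage the vendors compete on.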
Benchmarking is also where the truth gets told. There are various real-world workloads available for doing bake-offs. The RightMark Audio Analyzer was developed to provide an independent, open suite of tests for computer audio quality. And, of course, everyone has their own set of target or nightmare scenarios they can use.
But, all in all, the overall quality of sound, combined with price and battery life, will be a function of the combination of the underlying platform, the algorithms used, and the quality of porting those algorithms to the platforms. So there’s no one easy measure of goodness. Conversely, everyone can claim leadership.
The technical and business models of the various companies in the market vary, as do the value propositions. So a quick run-down of the players can help put this all into some semblance of order. For space reasons, this summary is necessarily concise. Each of these companies could write (and some have written) pages on what they do and why they do it well.
VeriSilicon’s ZSP processor, acquired from LSI Logic some years back, pretty much shows up in any conversation about audio. VeriSilicon actually is largely a design service provider. Large companies can and do purchase IP from them, but, more typically, either the customer is too small to take on the audio design or the audio isn’t in the bull’s-eye of the customer’s core value, and so they ask VeriSilicon to do the design. In those cases, VeriSilicon provides a turn-key black-box audio subsystem that can be integrated into the customer’s SoC. They claim a 60% share of the stand-alone HD Audio (Blu-ray) market. They also claim to be the only Google-approved silicon house for the Hantro codec. They have some analog hardware IP and claim the most complete software availability.
Tensilica is another name that appears well-entrenched in the audio world, although they don’t chase automotive business. Here again, the starting point is their processor, which is configurable to a far greater degree than any of their competitors’. They claim as one of their biggest strengths the ease with which they can port new codecs, with more than 80 ready to go. As a case in point, their DTS Master Audio codec was certified 15-18 months ahead of the next competitor. In particular, they’re seeing algorithm guys starting to take into account the ability to tailor the instruction set when designing their algorithms.
CEVA claims to be the biggest overall DSP core provider, although they target a number of markets (like the femtocells we looked at recently). They’re clearly also running hard at the high end of audio, with a focus on home audio – digital TV, set-top boxes, media gateways – and smartphones. They recently announced a new member of their TeakLite family, the TL3211, built for audio. Their focus is on the DSP core, but they have a library of optimized codecs as well as their CEVA-Toolbox environment for software developers to write or port their own code.
The day I did a Google search for audio IP blocks, it seemed like Synopsys was the only game in town. The first two pages of search results had hardly anything but Synopsys on them. Maybe they paid Google for a one-day promotion or something, since I can’t repeat that now as I write. But it was pretty impressive. Synopsys claims as their strength a complete solution, from microphone to speaker. On the one hand, they have all the lower-level analog IP necessary to implement the ports, digital/analog conversion, and things like volume control, digital multiplexing, and filtering. At the top end, they have the processing and post-processing software that came via their acquisition of Virage Logic, which had itself acquired ARC.
With respect to audio, SPIRIT DSP focuses on voice. In fact, they say they’ve offered HD Voice for years. They’ve got specific engines for audio and for conferencing, the latter combining voice and video. They claim to be all about achieving better than plain-old telephone service (POTS) quality over the internet. They wrestle down the voice coding, echo issues, delays, jitter, lost packets, transmission rate problems, and synchronization between voice and video (Can you say “digital lip sync”?) that threaten to make internet phone calls less than satisfying. They claim to be powering more than 200 million channels in more than 80 countries.
Coreworks has a completely different angle. Their focus is on the broadcast side of things. The volumes of such equipment are not large enough to warrant SoCs, so they use FPGAs. And thus Coreworks is the only vendor that targets FPGAs for audio. They have put together their own architecture, consisting of a CPU plus a reconfigurable accelerator. They don’t use the Nios or Microblaze processors; apparently the frequent SDK revisions kill code written on older versions, which is painful. So they did their own processor and tools. They acknowledge that past reconfigurable FPGA efforts have failed due to poor tools, and their tools are still tough to use, so, for now, they do the work themselves on behalf of their customers. Their goal is to improve the tools so that they can then deploy them directly to users.
Imagination Technologies has a big multimedia push, and most of their literature discusses their heavy emphasis on graphics. Less touted is the audio component that accompanies the pictures. But they just announced a deal to work with iBiquity Digital to implement HD Radio on their Ensigma UCCP Series3 processors.
When it comes to comparing the success of these companies, market share is an elusive concept. While VeriSilicon feels confident in the share calculations they make based on information from their customers and the royalties they see, not all companies operate the same way. So some might measure revenue, some design starts, some chips sold. But whatever you measure, these aren’t public numbers and there are no analysts following this area, so, even if you can figure out your own numbers, it’s hard to compare them to the competition. Market success joins technology leadership as being sort of amorphous and fuzzy.
So, even though sound, speech, and music are primal human pursuits, we’ve come a long way from pounding logs and hollering across the holler. Long gone are the days of a vibrating needle creating a varying electrical signal driving a speaker cone. So the next time you watch a TV show, make a phone call, or listen to a song, think about all the stuff going on in the background.
Actually, don’t. Just relax and don’t think about it. Enjoy the sound.
Posted on February 22, 2011 at 12:27 PM
Hi Bryon,
Good job of capturing the major issues and competitors. Just want to add two points. First, one of the biggest issues we find is how well the codecs work on the audio IP. Almost anyone can do a crude port. Optimizing the port to really do well on the processor is essential. Second, the best way to tell how well the software is ported is to do a benchmark comparison, and we (Tensilica) encourage benchmarking.
Paula Jones - Tensilica
Posted on February 23, 2011 at 3:54 PM
Hi Bryon,
I second Paula on the nice job with this article. A couple of comments: regarding codec certification, note that certification can be done at various levels, from crude, non-optimized ports to fully optimized codecs on actual hardware platforms, so certification by itself is not a good indication of performance. Furthermore, it is not just the codecs but the complete use case (e.g., Blu-ray Disc) whose performance needs to be measured, and whether it fits into a single processor or forces the customer to use multiple cores, a faster and more power-hungry process, etc. The CEVA-TeakLite-III and TL3211 that you mentioned handle the toughest use cases with a single core in a low-power process, leaving ample headroom for the newly required post-processing functions.
Moshe Sheier, CEVA
Posted on March 04, 2011 at 5:12 AM
Thanks for the interesting audio IP market review! I have a short explanation regarding VoIP capabilities in comparison with POTS. Thanks to the wideband “upgrade” (“moving from 3.7 kHz up to 8 and even 22 kHz”) in voice coding technology, the sound quality of today’s VoIP services is typically better than that of POTS. Unlike VoIP, wideband is very costly for traditional PSTN and GSM systems because of the need to change the existing infrastructure. You can experience real wideband voice (or HD Voice) with, for instance, the new Viber, which can easily deliver wideband quality over IP with SPIRIT IP-MR.
Elizabeth Maleeva, SPIRIT DSP
Posted on March 07, 2011 at 8:17 AM
Hi Bryon, Great article. Benchmarks and performance are important, but just as important is finding a complete solution that addresses the whole audio subsystem. You can spend months trying to find the various components that are needed and even longer trying to make them work together. A complete package from a single supplier like Synopsys including a high performance audio processor, digital and analog interfaces, a full suite of codecs and a media streaming framework to tie it all together dramatically lowers integration risk and cost.
Henk Hamoen, Synopsys