Mark Henninger's excellent piece about whether or not high-end audio is obsolete got me thinking about a related topic that's been in the AV news a lot lately: high-resolution audio. As most AVS readers probably know, LPCM (linear pulse-code modulation) digital audio, the most common type used in commercial music recording, has two basic parameters that define its resolution—sampling frequency (aka sample rate) and bit depth.
SACD uses a different type of digital audio called DSD (Direct Stream Digital), which I'll put aside for the moment. Also, sampling frequency and bit depth don't apply to lossy compressed formats like MP3, only to the original files that were used to create them, so I won't include them here.
Before I address the question in the title, I'd like to make sure everyone is up to speed on the basics of digital audio. If you're familiar with those basics, you can skip to the next subhead.
Digital Audio Basics
In all analog-audio electrical signals, the voltage smoothly rises and falls between a minimum and maximum value in a pattern called the waveform. In most cases, the waveform is quite complex, and it determines the tone or timbre of the sound it represents.
All complex waveforms can be separated into pure tones whose waveform is a sine wave with a single frequency and a certain amplitude (the difference in voltage between the peaks and troughs of the sine waveform); this process is called Fourier analysis. If you were to combine all the pure tones at the proper levels, you would end up with the original complex waveform. (Actually, it's not as simple as this—the waveform varies over time as well, but the basic concept still applies.)
When added together, the sine waves combine to form the complex waveform.
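To make the idea concrete, here is a minimal Python sketch (standard library only) that builds a complex waveform from three sine waves and then uses a plain discrete Fourier transform to measure each component again. The frequencies, amplitudes, and 8 kHz sampling rate are arbitrary values I picked for illustration; they don't come from any particular recording.

```python
import cmath
import math

SAMPLE_RATE = 8000   # samples per second (illustrative choice)
N = 800              # number of samples analyzed, giving 10 Hz per DFT bin

# Hypothetical pure tones: (frequency in Hz, amplitude)
components = [(200, 1.0), (400, 0.5), (600, 0.25)]

def synthesize(n):
    """Sum the pure tones at sample index n to form the complex waveform."""
    t = n / SAMPLE_RATE
    return sum(a * math.sin(2 * math.pi * f * t) for f, a in components)

waveform = [synthesize(n) for n in range(N)]

def amplitude_at(signal, freq_hz):
    """Amplitude of the sine component at freq_hz, read from a single DFT bin."""
    k = round(freq_hz * len(signal) / SAMPLE_RATE)
    acc = sum(x * cmath.exp(-2j * math.pi * k * n / len(signal))
              for n, x in enumerate(signal))
    return 2 * abs(acc) / len(signal)

for f, a in components:
    print(f"{f} Hz: built with amplitude {a}, measured {amplitude_at(waveform, f):.3f}")
```

Because every tone here completes a whole number of cycles within the analyzed block, the measured amplitudes should match the ones used to build the waveform; real music is messier, which is why practical analyzers use windowing and longer blocks.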
In the process of sampling an analog-audio signal, the instantaneous voltage of the signal is measured multiple times as it rises and falls during each cycle of the waveform. The number of times this measurement is taken per second is the sampling frequency. Each measurement, or sample, is represented by a digital number that includes a certain number of bits; this is the bit depth. The higher the sampling frequency, the more measurements are taken per second, and the greater the bit depth, the more accurate each of those measurements is.
As the sampling frequency and bit depth increase, the measurements of the instantaneous level of the analog waveform become more accurate. Also, as bit depth increases, so does the dynamic range, which is represented here by depicting the 24-bit samples as taller than the 16-bit sample. The "size" of each bit in all three graphs remains the same. Note that there are far more than 16 steps in a 16-bit sample; in fact, there are 65,536 steps. These graphs are meant as conceptual illustration, not to be numerically accurate.
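For readers who like to see the process in code, here is a rough Python sketch of sampling and quantizing a sine wave. It assumes signed integer codes and rounding to the nearest code, which is a simplification of what a real ADC does; the function name and test values are my own.

```python
import math

def sample_and_quantize(freq_hz, sample_rate, bit_depth, num_samples):
    """Sample a full-scale sine wave and round each sample to an integer code."""
    max_code = 2 ** (bit_depth - 1) - 1      # e.g. 32767 for 16 bits
    codes = []
    for n in range(num_samples):
        t = n / sample_rate                   # time of the n-th measurement
        voltage = math.sin(2 * math.pi * freq_hz * t)   # normalized to -1..+1
        codes.append(round(voltage * max_code))
    return codes

# The same 1 kHz tone measured with CD-style 16-bit samples and with coarse 4-bit samples.
cd_samples = sample_and_quantize(1000, 44100, 16, 8)
coarse_samples = sample_and_quantize(1000, 44100, 4, 8)
print("16-bit codes:", cd_samples)
print(" 4-bit codes:", coarse_samples)
```

The 4-bit version can only land on a handful of values between -7 and +7, so each measurement is a much cruder approximation of the same voltage.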
The sampling frequency establishes the highest audio frequency that can be accurately represented in the digital information. According to a well-established tenet called the Nyquist Theorem, any analog signal with a frequency less than half the sampling frequency can, in principle, be sampled and reconverted back to its analog form with perfect fidelity. (Remember that most analog-audio signals include many frequencies in combination, and the Nyquist Theorem applies to each of them individually.)
If frequencies above half the sampling frequency—which is called the Nyquist frequency—are digitized, they create lower-frequency artifacts that were not in the original signal. To avoid these so-called aliasing artifacts, the analog input signal is first sent through a lowpass filter that removes any frequencies above the Nyquist frequency.
In this diagram, a waveform (red) is sampled at less than twice its frequency, resulting in a low-frequency aliasing artifact (blue).
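Here is a small Python sketch of where an unfiltered tone ends up after sampling. The folding formula is the standard one for an ideal sampler; the tone frequencies are just examples I chose.

```python
import math

SAMPLE_RATE = 44_100

def alias_frequency(signal_hz, sample_rate_hz):
    """Frequency at which a tone appears after sampling with no input filter."""
    folded = signal_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

for tone_hz in (10_000, 25_000, 30_000):
    print(f"{tone_hz} Hz sampled at {SAMPLE_RATE} Hz appears at "
          f"{alias_frequency(tone_hz, SAMPLE_RATE)} Hz")

def sampled_cosine(freq_hz, n):
    """Value of a cosine of the given frequency at sample index n."""
    return math.cos(2 * math.pi * freq_hz * n / SAMPLE_RATE)

# Once sampled, a 25 kHz tone and a 19.1 kHz tone produce the same numbers.
worst_gap = max(abs(sampled_cosine(25_000, n) - sampled_cosine(19_100, n))
                for n in range(100))
print("largest difference between the two sample streams:", worst_gap)
```

A tone below the Nyquist frequency comes through unchanged, while anything above it folds back into the audible band, and after sampling there is no way to tell the alias from a genuine tone at that lower frequency.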
When the digital signal is converted back to analog, it first looks like a series of stairsteps that roughly follows the shape of the original waveform. These stairsteps correspond to the values that were sampled when the signal was digitized. Such a stairstep waveform includes many high-frequency components, or harmonics, which are removed by sending the signal through another lowpass filter (also called a reconstruction filter). This removes all frequencies above the Nyquist frequency, returning the waveform to its original shape. Theoretically—and amazingly—no information whatsoever from the original waveform is lost in this process.
A waveform is sampled, converting it into a series of numbers. When the numbers are converted back to analog, they start as a stairstep approximation of the original waveform. A lowpass reconstruction filter restores the waveform's original shape.
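If that claim sounds too good to be true, this Python sketch tries it out. It stands in for the reconstruction filter with the textbook ideal version (a sum of sinc pulses, one per sample), which is an assumption on my part; real DACs use practical filters that only approximate it. The 8 kHz rate, 1 kHz tone, and window size are illustrative.

```python
import math

SAMPLE_RATE = 8000        # Hz (illustrative)
TONE_HZ = 1000            # well below the 4000 Hz Nyquist frequency
HALF_WINDOW = 500         # number of samples used on each side of time zero

def original(t):
    """The 'analog' waveform we pretend to digitize."""
    return math.sin(2 * math.pi * TONE_HZ * t)

def sinc(x):
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

# The stored samples: the only information the reconstruction may use.
samples = {n: original(n / SAMPLE_RATE)
           for n in range(-HALF_WINDOW, HALF_WINDOW + 1)}

def reconstruct(t):
    """Ideal lowpass reconstruction: every sample contributes one sinc pulse."""
    return sum(value * sinc(t * SAMPLE_RATE - n) for n, value in samples.items())

t_between = 0.3 / SAMPLE_RATE     # a moment that falls between two samples
print("original waveform:  ", original(t_between))
print("reconstructed value:", reconstruct(t_between))
```

The two values should agree closely; the small leftover difference comes from using a finite number of samples here, not from the sampling itself.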
As I mentioned earlier, the bit depth is the number of bits used to represent each sampled value, from the peaks to the troughs of the waveform. With a bit depth of 8 bits, 256 different values can be represented (2^8 = 256); with 16 bits, 65,536 different values can be represented (2^16 = 65,536); and with 24 bits, 16,777,216 different values can be represented (2^24 = 16,777,216). The bit depth determines the maximum dynamic range that can be represented: each bit adds roughly 6 dB to the dynamic range, so 16 bits corresponds to a dynamic range of about 96 dB and 24 bits corresponds to a dynamic range of about 144 dB.
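The arithmetic behind those figures is easy to check. This short Python sketch computes the number of levels and the corresponding dynamic range as 20 times the base-10 logarithm of that number, which is where the "roughly 6 dB per bit" rule comes from; the function names are mine.

```python
import math

def quantization_levels(bit_depth):
    """Number of distinct values a sample of this bit depth can take."""
    return 2 ** bit_depth

def dynamic_range_db(bit_depth):
    """Ratio of full scale to one quantization step, expressed in decibels."""
    return 20 * math.log10(quantization_levels(bit_depth))

for bits in (8, 16, 24):
    print(f"{bits:2d} bits: {quantization_levels(bits):>10,} levels, "
          f"{dynamic_range_db(bits):6.2f} dB")
```

Each added bit doubles the number of levels, and doubling a voltage ratio adds about 6.02 dB.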
Because there is a finite number of possible values for each sample, most samples will not correspond exactly with the instantaneous voltage of the analog waveform, so the value of the sample is the closest it can be without exceeding the voltage it represents. The difference between the actual instantaneous voltage and the sampled value is called the quantization error.
Occasionally, the voltage and the sampled value are precisely equal, in which case the quantization error is 0, but most of the time, each sample differs from the voltage by a varying amount. So quantization error is often expressed as an average over many samples. The greater the bit depth, the more accurate all sample values are and the lower the average quantization error, which is also called quantization noise or distortion. This defines the noise floor, below which no actual signal can be represented or reproduced.
In this image, the green curve is the original waveform, and the yellow curve is the waveform resulting from quantization. The red curve represents the quantization errors or quantization noise.
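To put numbers on the quantization noise, here is a Python sketch that quantizes a sine wave and compares the signal power with the error power. Note one assumption: I round each sample to the nearest level rather than to the closest level that doesn't exceed the voltage, because that is the case the common "6.02 dB per bit plus 1.76 dB" rule of thumb describes. The 997 Hz test tone is also my choice.

```python
import math

def quantization_snr_db(bit_depth, freq_hz=997, sample_rate=44100, num_samples=44100):
    """Measure the signal-to-quantization-noise ratio of a near-full-scale sine."""
    step = 2.0 / 2 ** bit_depth          # one quantization step for a -1..+1 range
    amplitude = 1.0 - step               # leave a step of headroom to avoid clipping
    signal_power = 0.0
    error_power = 0.0
    for n in range(num_samples):
        x = amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
        quantized = round(x / step) * step    # nearest available level (assumption)
        signal_power += x * x
        error_power += (x - quantized) ** 2   # squared quantization error
    return 10 * math.log10(signal_power / error_power)

for bits in (8, 12, 16):
    measured = quantization_snr_db(bits)
    rule_of_thumb = 6.02 * bits + 1.76
    print(f"{bits} bits: measured {measured:.1f} dB, rule of thumb {rule_of_thumb:.1f} dB")
```

The measured figures should track the rule of thumb closely, which shows how each extra bit pushes the noise floor down by about 6 dB.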
CDs store LPCM digital audio with a sampling frequency of 44.1 kHz and a bit depth of 16 bits, often specified with the shorthand designation 44.1/16 or 16/44.1. (Many professional digital-audio systems use a sampling frequency of 48 kHz with a bit depth of 16 bits.) Why were these values chosen? According to most research, humans can't hear frequencies above 20 kHz, so if the sampling frequency is more than twice that, all audible frequencies can be accurately represented as digital data.
The dynamic range encompassed by healthy human hearing (that is, the difference between the softest sound we can perceive, the threshold of hearing, and the loudest sound we can perceive without pain, the threshold of pain) is around 140 dB, which is more than the theoretical maximum of 96 dB represented by 16 bits. But using more bits would require more storage capacity, and one of the design goals of the CD format was the ability to store at least 60 minutes of audio, so 16 bits was deemed sufficient while allowing that goal to be achieved.
Higher-resolution LPCM audio recordings—typically 24-bit/96 kHz or even 24/192—can be distributed on Blu-ray or DVD-Audio discs or made available for downloading from websites. A bit depth of 24 bits represents a theoretical dynamic range of about 144 dB, and a sampling rate of 96 kHz can accurately represent frequencies up to 48 kHz; a sampling rate of 192 kHz can represent frequencies up to 96 kHz.
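The storage trade-off is simple arithmetic, sketched here in Python. It assumes uncompressed two-channel (stereo) LPCM with no container or error-correction overhead, so the numbers are raw audio payload only.

```python
def bytes_per_second(sample_rate_hz, bit_depth, channels=2):
    """Uncompressed LPCM data rate; two channels (stereo) is an assumption."""
    return sample_rate_hz * bit_depth * channels // 8

# Format name -> (sampling frequency in Hz, bit depth)
formats = {"16/44.1": (44_100, 16), "24/96": (96_000, 24), "24/192": (192_000, 24)}

for name, (rate, bits) in formats.items():
    per_second = bytes_per_second(rate, bits)
    per_hour = per_second * 3600
    print(f"{name}: {per_second:,} bytes/s, {per_hour / 1e6:,.0f} MB per hour")
```

An hour of CD-quality stereo comes to roughly 635 MB of raw samples, while the same hour at 24/96 needs more than three times that and 24/192 more than six times.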
However, it's important to verify that the original recordings were made at the higher resolution and not upconverted from 16/44.1, which would negate any potential improvement in the audio quality. Then there's the issue of an analog master tape being digitized at 24/96 or 24/192, the value of which is debatable, since professional analog-audio tape has a dynamic range of 60-70 dB without noise reduction.
Another form of high-resolution audio is DSD (Direct Stream Digital), the digital-audio format used on SACD discs. DSD uses a very high sampling rate of 2.8 MHz and a bit depth of only one bit, but it uses a different encoding scheme called pulse density modulation (PDM), so it's not directly comparable to LPCM. According to Wikipedia, it's approximately equivalent to 20-bit/96 kHz LPCM.
There is much more to digital audio than I have explained here, but this is enough to understand the issues of high-resolution audio and whether or not it is irrelevant.
High-Resolution Audio Goes Mainstream
Recently, digital audio with higher resolution than CD has gotten a lot of attention, especially with the news that Neil Young is moving ahead with his PonoMusic project now that its Kickstarter crowd-funding campaign has raised over $6 million. Young proposes to distribute commercial music recorded at a sampling frequency of 96 or even 192 kHz and a bit depth of 24 bits, which will be playable on a portable Pono Player built by high-end manufacturer Ayre Acoustics.
The Pono Player will include hardware from Ayre Acoustics.
Young is not the first to distribute high-res music files. AIX Records has been recording and distributing 24/96 music files on DVD-Audio and Blu-ray discs since 2000, and it launched itrax.com, the first high-resolution audio-download site, in the fall of 2007. Chesky Records sells SACDs, DVD-Audio discs, and DVD-ROM discs with 24/192 audio files that you can copy to a computer hard drive. Other sources of high-res downloads include Bowers & Wilkins, Linn Records, Naim Label, and 2L. Another well-known source is HDtracks.com, but some audiophiles suspect that some of its files are upconverted from 16/44.1; see Polk Audio's forum for a discussion of this.
The crux of the question I pose in the title of this thread is whether or not true high-resolution audio—recorded, edited, mastered, and distributed in 24/96, 24/192, or DSD—offers an audible improvement over the good ol' 16/44.1 audio found on CDs. And as you might imagine, there is much debate over this proposition.
For example, it seems clear that a bit depth of 24 bits could potentially sound better than 16 bits, since the dynamic range of human hearing is about 140 dB. But very few recordings are made without some form of dynamic compression. (I'm not talking about data compression like MP3.) In the case of most popular music, the dynamic range is severely compressed so that everything can be heard in the presence of road noise in a car or a city street while you're out for a stroll or bike ride.
In terms of frequency range, traditional research has established that humans can't hear above 20 kHz (and virtually no adults can hear anywhere near that high), so a sampling rate of 44.1 kHz should be more than enough, especially since the Nyquist Theorem states that all frequencies less than half the sampling frequency can be reconstructed with perfect accuracy. The problem here is that the anti-aliasing input filter and reconstruction output filter must have very steep slopes to allow 20 kHz to pass unattenuated while completely blocking 22.05 kHz and above. This type of "brick-wall" filter is very difficult to design and implement without introducing some audible artifacts of its own, at least in the analog domain. By using a higher sampling frequency, the slope of these filters can be much more gradual, which results in far fewer artifacts.
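To see how much breathing room a higher sampling frequency gives the filter designer, this Python sketch measures the gap between a 20 kHz passband edge and the Nyquist frequency in octaves. Treating 20 kHz as the passband edge is an assumption based on the hearing limit mentioned above.

```python
import math

def transition_band_octaves(passband_edge_hz, sample_rate_hz):
    """Width, in octaves, between the top of the audio band and the Nyquist frequency."""
    nyquist_hz = sample_rate_hz / 2
    return math.log2(nyquist_hz / passband_edge_hz)

for rate in (44_100, 96_000, 192_000):
    width = transition_band_octaves(20_000, rate)
    print(f"{rate} Hz sampling: {width:.2f} octaves available for the filter to roll off")
```

At 44.1 kHz the filter has only about a seventh of an octave to go from full level to full rejection; at 96 kHz it has well over an octave.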
Then there's the issue of whether or not ultrasonic frequencies above 20 kHz somehow affect the audible range, even though we can't hear them directly. For example, some believe that ultrasonic harmonics interact with each other, producing what are called difference or interference tones down in the audible range. So capturing and reproducing those harmonics could affect the sound we can hear, as many listeners claim they do.
On the other hand, you need some unusually capable equipment to record and reproduce frequencies above 20 kHz. Some speakers can do it—for example, Sony's new Core Series of speakers are spec'd up to 50 kHz for the SS-CS3 floorstander and SS-CS5 bookshelf, and they're not even that expensive ($480/pair for the SS-CS3, $220/pair for the SS-CS5); for more info about these speakers, see our coverage here. In fact, Sony is placing a lot of emphasis on high-resolution audio in many of its new products.
Assuming the ADC (analog-to-digital converter), DAC (digital-to-analog converter), and all digital electronics in the recording and playback chain are capable of accurately representing 24/96 or higher, what about the other analog components in the signal chain, including microphones, preamps, and power amps, along with the analog portions of the converters? If any of them can't support at least 48 kHz and 140 dB of dynamic range, the effort to record and deliver 24/96 audio—not to mention 24/192—is moot.
Argument For the Proposition
Aside from Mark Henninger's piece about whether or not high-end audio is obsolete, the article that inspired me to write this post is "24/192 Music Downloads...and why they make no sense" by Monty Montgomery on xiph.org. Among the arguments in this article is the assertion that all transducers and amplifiers exhibit some amount of distortion, which increases at the lowest and highest frequencies. In particular, reproducing ultrasonic frequencies leads to intermodulation distortion that can extend into the audible range. Thus, it's better not to encode ultrasonics to avoid any possibility of intermodulation distortion.
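The intermodulation argument can be illustrated with a toy model in Python. I feed two ultrasonic tones through a made-up nonlinearity (the input plus 10% of its square) and look for energy at the difference frequency. The tone frequencies and the distortion amount are hypothetical; real amplifiers and tweeters behave in more complicated ways.

```python
import cmath
import math

SAMPLE_RATE = 192_000
N = 192                       # 1 ms of audio, so DFT bins fall on multiples of 1 kHz
F1, F2 = 24_000, 27_000       # two ultrasonic tones (illustrative values)

def amplitude_at(signal, freq_hz):
    """Amplitude of the component at freq_hz, read from a single DFT bin."""
    k = round(freq_hz * len(signal) / SAMPLE_RATE)
    acc = sum(x * cmath.exp(-2j * math.pi * k * n / len(signal))
              for n, x in enumerate(signal))
    return 2 * abs(acc) / len(signal)

clean = [math.cos(2 * math.pi * F1 * n / SAMPLE_RATE) +
         math.cos(2 * math.pi * F2 * n / SAMPLE_RATE) for n in range(N)]

# Hypothetical nonlinearity: a small second-order distortion term.
distorted = [x + 0.1 * x * x for x in clean]

difference_hz = F2 - F1       # 3 kHz, squarely in the audible range
print("3 kHz level before distortion:", round(amplitude_at(clean, difference_hz), 6))
print("3 kHz level after distortion: ", round(amplitude_at(distorted, difference_hz), 6))
```

Neither input tone is audible, yet the distorted output contains a 3 kHz component that was never in the recording, which is the kind of artifact Montgomery is warning about.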
Montgomery also points out that, while an analog anti-aliasing filter works better if its slope is gradual as explained earlier, a digital anti-aliasing filter has no such limitation. If you sample at a high sampling frequency—say, 96 or 192 kHz—you can apply a digital lowpass filter that simply discards the ultrasonic components, and you're left with a 44.1 kHz dataset that has no aliasing artifacts.
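Here is a rough Python sketch of that idea. To keep it short, I convert 96 kHz to 48 kHz (a simple 2:1 ratio) rather than to 44.1 kHz, which would need a fractional resampler; the windowed-sinc filter, its 22 kHz cutoff, and its 101 taps are my own illustrative choices, not anything specified in the xiph.org article.

```python
import cmath
import math

IN_RATE = 96_000
OUT_RATE = 48_000             # assumption: 2:1 decimation instead of 96 -> 44.1 kHz
CUTOFF_HZ = 22_000
NUM_TAPS = 101

def lowpass_taps():
    """Windowed-sinc (Hamming) FIR lowpass, normalized to unity gain at DC."""
    fc = CUTOFF_HZ / IN_RATE
    mid = (NUM_TAPS - 1) / 2
    taps = []
    for n in range(NUM_TAPS):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (NUM_TAPS - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]

def filter_and_decimate(signal, taps, factor=2):
    """Convolve with the filter, keeping every factor-th output once the filter is full."""
    out = []
    for i in range(len(taps) - 1, len(signal), factor):
        out.append(sum(taps[j] * signal[i - j] for j in range(len(taps))))
    return out

def amplitude_at(signal, freq_hz, rate):
    """Amplitude of the component at freq_hz, read from a single DFT bin."""
    k = round(freq_hz * len(signal) / rate)
    acc = sum(x * cmath.exp(-2j * math.pi * k * n / len(signal))
              for n, x in enumerate(signal))
    return 2 * abs(acc) / len(signal)

# Test signal: an audible 1 kHz tone plus an ultrasonic 30 kHz tone.
source = [math.sin(2 * math.pi * 1_000 * n / IN_RATE) +
          math.sin(2 * math.pi * 30_000 * n / IN_RATE) for n in range(1_060)]

naive = source[100::2][:480]          # no filter: 30 kHz folds down to 18 kHz
filtered = filter_and_decimate(source, lowpass_taps())[:480]

for label, data in (("unfiltered", naive), ("filtered", filtered)):
    print(label, "1 kHz:", round(amplitude_at(data, 1_000, OUT_RATE), 3),
          "18 kHz alias:", round(amplitude_at(data, 18_000, OUT_RATE), 3))
```

Simply throwing away every other sample leaves a full-strength alias at 18 kHz, while filtering first keeps the 1 kHz tone and reduces the alias to a tiny fraction of its original level.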
Regarding the use of 24 bits instead of 16, the article argues that the threshold of hearing increases with age and hearing damage, and the threshold of pain decreases, reducing the dynamic range of human hearing as we get older. Also, a technique called dithering, which adds a bit of noise to the signal to mask quantization noise, allows amplitudes of less than one bit to be encoded and reproduced.
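The dithering claim is easy to test in a few lines of Python. This sketch quantizes a tone whose peak is only 0.4 of one quantization step, first without dither and then with triangular (TPDF) dither, which is one common choice and an assumption on my part since the article doesn't name a specific type. The random seed is fixed only so the run is repeatable.

```python
import math
import random

random.seed(0)                 # fixed seed so the run is repeatable

SAMPLE_RATE = 48_000
TONE_HZ = 1_000
AMPLITUDE_LSB = 0.4            # tone peak is smaller than half of one quantization step
N = 48_000

def tpdf_dither():
    """Triangular-PDF dither spanning +/-1 LSB (the sum of two uniform values)."""
    return random.uniform(-0.5, 0.5) + random.uniform(-0.5, 0.5)

def tone_level(signal):
    """Amplitude of the 1 kHz component, in LSBs, found by correlation."""
    s = sum(x * math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE)
            for n, x in enumerate(signal))
    c = sum(x * math.cos(2 * math.pi * TONE_HZ * n / SAMPLE_RATE)
            for n, x in enumerate(signal))
    return 2 * math.hypot(s, c) / len(signal)

tone = [AMPLITUDE_LSB * math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE)
        for n in range(N)]

undithered = [round(x) for x in tone]                  # every sample rounds to 0
dithered = [round(x + tpdf_dither()) for x in tone]    # noisy, but the tone survives

print("tone level without dither:", round(tone_level(undithered), 3), "LSB")
print("tone level with dither:   ", round(tone_level(dithered), 3), "LSB")
```

Without dither the tone vanishes completely because every sample rounds to zero; with dither the output is noisier, but the 1 kHz tone is still there at close to its original level.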
Finally, Montgomery points out that if the loudest possible undistorted sound is defined as 0 dB, the quantization-noise floor is -96 dB with a bit depth of 16 bits. But this is the RMS noise floor of the entire broadband signal, and each hair cell in the inner ear is sensitive to a narrow fraction of the total bandwidth, which means the noise floor of each hair cell is much lower than -96 dB. With the use of dither, the article claims that the practical dynamic range of a 16-bit digital audio signal is actually more like 120 dB.
The article does acknowledge that using more than 16 bits is important during recording, mixing, and mastering to avoid clipping and allow digital signal processing without raising the noise floor to objectionable levels. But once the music is ready to be distributed, there is no reason to use more than 16 bits.
After all this theory, Montgomery cites some empirical tests performed by the Boston Audio Society (BAS) in which listeners were played high-resolution DVD-Audio and SACD content and the same content downsampled to 16/44.1 on the spot (no dithering), and they were asked to identify which was which. The tests were said to be conducted using high-end equipment in noise-isolated environments with both amateur and trained professional listeners. In over 500 trials, listeners chose correctly 49.8% of the time, which is no better than random chance.
Argument Against the Proposition
One of the staunchest and longest-active advocates for high-resolution audio is Dr. Mark Waldrep, founder and chief engineer for AIX Records. Waldrep responds to part of the xiph.org article on his website, realhd-audio.com, in a post entitled "24-Bits Makes Sense!" Waldrep acknowledges that most pop/rock recordings and some classical and jazz recordings are subjected to dynamic-range compression, and that most commercial music does not exceed a dynamic range of 96 dB even without compression. But he has, in fact, recorded pieces that do exceed this dynamic range and thus benefit from 24-bit resolution.
In this graph, you can see the dynamic range of human hearing, a typical room, music, and an analog-audio signal. (Courtesy RealHD-Audio.com)
When I asked Waldrep about the xiph.org article, he said, "I agree with Monty that we do not derive any sonic benefit from sample rates higher than 96 kHz. But he's incorrect about the 24-bits claim. His statement that 16-bit CDs can deliver more than 96 dB requires some fancy dithering, which no one is actually doing in practice. CDs have the potential to achieve greater than 90 dB of dynamic range, but why not just shift to 24 bits, since the hardware and software are already there?"
Waldrep maintains that the CD, which has been around since 1982, is really hard to beat when it comes to convenience and fidelity. The format has the potential to eclipse analog tape and vinyl LPs, but only if the entire production chain is up to the task and the engineering/production team are focused on audio fidelity.
He goes on to say that moving to high-resolution PCM audio offers additional fidelity thanks to its increased specifications. In fact, 24/96 PCM provides an additional octave of frequency response and brings the dynamic range to the capability of human hearing. The fact that the ultrasonics included in high-resolution audio might be impossible to hear doesn't deter him.
"It's all about fidelity," he says. "If Wallace Roney is playing his trumpet with a Harmon mute that's outputting partials well above 20 kHz, and I have microphones and a complete signal path that can capture and reproduce those frequencies, shouldn't I include them? I'm giving back everything that was being performed. I'm not willing to arbitrarily roll off the ultrasonics because we haven't proven that humans can't hear them." This might be an intellectual argument, but there is some evidence that recording at higher than 44.1 or 48 kHz is perceptible in some way. The jury is still out on that, he says.
I asked Waldrep about the BAS study, and he dismissed it as being completely botched. According to him, "the examples that were evaluated came primarily from the major labels with a few audiophile recordings as well. The recordings were either DVD-Audio or SACD discs from the private collections of BAS members. This is where the issue of provenance becomes important." The term was first applied to the production history of audio recordings by Waldrep in 2007. "If the original sessions were recorded on analog tape, mixed to a stereo analog tape, and then mastered to yet another copy, the dynamic range would span about 10-12 bits! How were the listeners in the BAS study supposed to hear the difference between high-resolution audio and a downconverted CD version when both had the same dynamic range?
"And the same can be said for the frequency response. Even the new recordings from the Chesky label that were released on SACD had no frequencies above 20 kHz. The DSD 2.8224 MHz 1-bit format forces all of the 'in-band' noise above the upper range of human hearing in a process known as 'noise shaping.' This is the reason that DSD at higher rates have appeared—to push the noise out even further."
According to Waldrep, the BAS study was so seriously flawed that its conclusions are completely invalid. "If the listeners were attempting to discern a difference between two things that were essentially identical, of course the results would be the same as random choice! There weren't any real high-resolution audio titles among those that were auditioned."
So there you have it—high-resolution audio recordings are moving into the mainstream, thanks in large part to advances in recording and playback technology that make it relatively inexpensive to create, distribute, and reproduce. But does it offer any real, tangible benefit over CD? It's time for you to weigh in with your thoughts, opinions, and experiences. I look forward to following the discussion.
This seems to be the very test that Monty Montgomery and Mark Waldrep refer to as recounted in the OP. I scanned through this paper, and nowhere did I find a list of the SACD/DVD-A titles they used, which means we have no way to verify their provenance (how they were recorded) and whether or not they actually contained frequencies beyond 20 kHz or a dynamic range beyond 96 dB. As Waldrep is quoted as saying in the OP, the high-resolution selections used in these tests did not go beyond these limits, so of course the results were no better than random chance, because there was effectively no difference between the high-res and CD versions. What is needed is another such test in which the high-res selections are verified to contain that extra octave of frequencies and a wider dynamic range than CDs are capable of.
Now is the moment when I admit... I bought 95% of my music through iTunes and unless I'm allowed to upgrade it to high-resolution for free, it'll remain compressed 16/44. If I had to choose, I'd prefer lossless 16/44 to lossy 24/96.