Because sometimes it's what you don't hear
Double Blind Testing
Written by Mark Deneen, formerly of Juicy Music Audio and Paragon Audio. Used with permission.
The Flawed Idea Behind Any Form of AB Testing
The premise behind all forms of audio AB testing is that memory is the measuring instrument that provides the experiment's output data. You listen to one sample, then another, and you must compare your memory of the earlier sample to the current experience, repeating this serially. Human memory, in other words, serves as the measuring instrument. That is a flawed premise for the entire experiment, and one that is rarely debated.
When DBT is used in, say, pharmaceutical trials, most of the output data is provided by specific, calibrated, reliable instruments: X-ray machines, MRI, blood gas analysis, chemical assay and so on. This provides repeatability: any other researcher should be able to confirm or duplicate the results, because the instruments and measures are standards.
In audio, the instrument is human memory. Here are a few very significant factors to consider about memory:
1. Memories are constructions made in accordance with present needs, desires, influences, etc.
2. Memories are often accompanied by feelings and emotions.
3. Memory usually involves awareness of the memory.
In this model any memory has a dynamic, instantaneous value in relation to the sum total of all sensory inputs, current emotions, and state of being of the "machine" (person) storing the memory. Strike a bell at this moment, and the memory will be different from striking the bell at some other moment. Same bell, different universal circumstances, and therefore a different memory. That's roughly equivalent in reliability to a voltmeter that zeroes itself to some random voltage before each measurement.
What we know about AB testing is that when A is grossly different than B, people easily pass the test. As you shrink the difference between A and B, more and more people fail the test. When you reach the useful "resolution" of an AB memory test, most people fail. This is where the technicians trip over their feet. They assume the failure means: "people have reached the limit of differences they can hear." That's not the meaning of the results at all. The meaning is: "We have reached the limits of resolution for using human memory as the measuring instrument in an AB test."
You can measure a lot of things with a wooden yardstick; it is a very useful tool. But you can't reliably measure the thickness of a sheet of paper with it. The tool doesn't have the necessary resolution at that small dimension. For that you need a caliper or something similar.
Referring to items 1, 2 and 3 above, you simply cannot extract the intention of the subject from the act of listening. Every listening experience contains its intent, its subject and its object as an indivisible whole. This is why bias can never be eliminated: even the subject's intent is a bias.
The problem is the reliance on memory, period. I fail to see how DBT solves the "unreliable memory" problem at all. On the contrary, it is wholly based on memory for the entire test.
I think the people promoting DBT show a real misunderstanding of neuroscience, memory, and perception. Short term or long term makes no difference: memory is not an instrument for measurement and comparison at such fine resolution. Sure, if A is an apple and B is a banana, it works fine. When A is a Pioneer and B is a Mark Levinson, it's not going to work the same way.
1. We DO NOT RECORD all input! On that fact alone, the use of memory should be tossed out the window. We store only essential patterns, partial inputs, enough to make the recall. If we were storing ALL sensate input at full bandwidth, our brains would have to be the size of Texas. (Example: if you drove an hour on the freeway tonight to get home, can you recall the license number of each car that passed you? The year, make and model of each car? A description of all the occupants? Obviously not, yet you "saw" it all with your eyes. If you tried to store all the sound you heard in a few minutes at something like CD resolution, your brain would explode.)
2. A "memory" of a sound is not a recording of the sound waves! Nor is it a digital representation, nor is it necessarily even stored in a contiguous brain space. Electrical impulses from the ear system are "mixed" (yeah, like a 16-track mixer) along with visual inputs, other sense inputs, emotional content, similar patterns from previous events, and one's state of well-being. The point being, there is no discrete memory object of "just the sound" that was heard when A was played or when B was played. Proponents are pretending that the brain functions like a tape recorder. Ain't so.
3. DBT depends on this real-time analysis: a past memory must be compared to a current stream of conscious perception, and a determination of the differential must be made. Well, hang on, these are two vastly different brain processes. You might as well try to compare voltmeter readings to the direction a weathervane is pointing. It's nonsense. That's why it only works with "apple and banana" level comparisons.
There's just absolutely no science behind using DBT in audio. All the assumptions are flawed.
The Null Hypothesis or How I Batted .650 in The Audio League
ABX testing sets out with a common goal: to disprove the Null Hypothesis. In the beginning, looking at A and B, it is hypothesized that no difference exists between A and B. This is referred to as the "null hypothesis." It is the starting point of all ABX testing in audio. "A and B have no sonic difference."
The object for the Subject of the test is to disprove the null hypothesis. This is done by accepting a number of challenges to measure an unknown (X) against the known A and B, and correctly declare X to be A or B. The results are tallied, and statistical techniques determine whether the Subject "disproved" the null hypothesis or not. If he did not disprove it, he failed. Then all the subjects and all the results are tallied to determine whether the null hypothesis was disproved for the test as a whole.
Let's assume each Subject is given 20 challenges. In order to disprove the null hypothesis he must be correct on more than half of them. How many more than half is established by the confidence level desired for the test. The confidence level describes how certain you want to be that the right answers were not caused by luck or accident; the higher the confidence level, the more right answers are required. To achieve the typical 95% confidence level with 20 challenges, the Subject must be right on at least 15 of his 20 challenges (14 of 20 corresponds to only about 94% confidence).
Subjects with 13 or 14 correct identifications, then, "fail" the test. And of course, subjects with 10 correct are scored the same as anyone simply "guessing" their way through the test by flipping a coin. They are dubbed "fail" also, even though half the time they were (potentially) right. (And a lucky guesser could hit 15 right answers and pass.)
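The arithmetic behind that pass mark is easy to check. Here is a short Python sketch (assuming the usual fair-coin guessing model and a strict one-sided 5% criterion) that computes the probability of scoring k or more out of 20 by pure luck:

```python
from math import comb

def tail_prob(n: int, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, 0.5): the chance of getting
    k or more answers right out of n by flipping a coin."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

n = 20
# Smallest score whose pure-luck probability drops below 5%:
pass_mark = next(k for k in range(n + 1) if tail_prob(n, k) < 0.05)

print(pass_mark)                   # 15
print(round(tail_prob(n, 14), 3))  # 0.058 -- 14/20 just misses 95%
print(round(tail_prob(n, 13), 3))  # 0.132 -- a "failing" 13/20 happens
                                   # by luck only ~13% of the time
```

Note that a score of 13 of 20, which the test records as a flat "fail", would arise from pure guessing only about 13% of the time.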
So if you hear that 60 people took the test and they "all failed," that does not mean nobody got anything right. A subject could correctly identify X as A or B on well more than half of his challenges and still be scored a fail.
Now, this ABX testing is a statistical exercise, not an empirical one. Its conclusions are drawn through computation, not observation and direct experience. People who made an honest effort and got 13 right, only to fail, are computed in with those who guessed their way to 15 right and a PASS. There is no discrimination as to how each subject actually experienced the test. A field might consist of all guessers or all experts, and the results might be the same.
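The guesser scenario is easy to simulate. The sketch below (assuming pure coin-flip subjects and a pass mark of 15 of 20, the smallest score whose guessing probability falls below 5%) shows that a panel of 60 genuine guessers will still produce about one "passer" on average, purely by luck:

```python
import random

random.seed(1)  # for reproducible runs

def guessing_subject(n_trials: int = 20) -> int:
    """Score of one subject who answers every challenge with a coin flip."""
    return sum(random.random() < 0.5 for _ in range(n_trials))

# Simulate 2,000 panels of 60 pure guessers each and count
# how many guessers per panel reach the 15-of-20 pass mark.
panels = 2_000
passes = [sum(guessing_subject() >= 15 for _ in range(60))
          for _ in range(panels)]

print(sum(passes) / panels)  # roughly 1.2 lucky "passers" per panel
```

The expected count is 60 × 0.0207 ≈ 1.2, so a panel where nobody can hear anything will typically still report a passer or two, while a panel of honest 13-of-20 listeners reports none.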
Statistics, in order to work its own significance, has to assume that 13 right answers is the result of accident or pure guessing, and thus a failure to disprove the null hypothesis. But empirically, we could ask the Subject if he was guessing, and he might say no. In other words, his 13 right answers were direct experience, empirical observations he made, along with his 7 wrong ones, and on that basis you could say, "The Subject had 13 hits in 20 at bats and is hitting .650!" Not bad for such a difficult task.
I'll take a .650 batting average to the stock market, or to Vegas, or to the racetrack any day.
Now it could be that he was guessing, and lying, and that's what the statistical model seeks to neutralize. But to be sure, it's just a computation not direct experience.
So all of this is to clarify that the claims of ABX tests, such as "everyone failed to distinguish the Pioneer from the Mark Levinson," leave a lot of information off the table.
If you are going to gamble $200M of your company's capital on a new drug, the statistical virtues of DBT and the null hypothesis make a lot of sense. You can run many tests and invest the money when you get the highest statistical confidence level, right? That makes perfect business sense, and DBT is a valuable tool in the toolbox. But is that how you do your hobby? You might be perfectly happy to pick the right amp for you with a .650 batting average! Is that a "FAIL"?
Imagine trying to judge camera lenses the same way we do AB testing in audio. Here's an experiment to consider the problem.
Using two makes of high quality 50mm lenses, two photographs are taken of the same scene filled with a lot of detail and contrast. The two nominally identical photos are given to a subject who is asked to find the differences that exist, if any.
Now, to be clear, the lenses are physically very different in construction: number of lens elements, arrangement of elements, glass composition, and so on. They have in common only their specifications as to aperture and focal length, so that the pictures will be the same in terms of exposure, field of view and content. Passing light through a lens and refocusing it onto a plane behind it is one of the most complex feats of optical engineering, involving many compromises even in the finest lenses ever made. The two lenses are different because their designs invoke different sets of compromises.
Back to our subject. The subject will usually lay the two photos side by side and study them together as two whole, inclusive entities. Using full parallel perceptual processing, the photos are examined in detail. A casual review may reveal "no difference" for some subjects, but a more studied review, particularly by someone with expertise or training, will reveal small differences in sharpness, spherical distortion near the edges, coma, color aberrations, and so on. Those subjects who see the differences can make some judgment about which lens is better. OK, easy enough.
But now suppose the subjects weren't given both photos simultaneously to compare side by side. Suppose it worked like this: one photo is labeled "A" and the other "B". They will be viewed using a special technique that creates a serial memory presentation. The "A" photo is loaded into a "roller box" with a horizontal viewing slit measuring perhaps 1/4 inch high by 8 inches wide (the width of the photo). The photo is rolled past the slit at a rate of 1 inch per second; in 12 seconds the entire photo has rolled through the box, with the subject viewing it as it moves past. The entire photo is viewed, but never all at once.
Then the "B" photo is loaded and rolled past the slit in the same 12 seconds.
The subject is then asked whether any difference exists between "A" and "B", guessing each time which photo has just rolled past the slit. How well would even the best photography experts do on this test?
That is essentially what an AB test is like in audio. A song or musical piece is a serial stream of aural sensations, just as the photo is a stream of light sensations rolling past the slit. You can never hear the audio stream as a "whole" the way you can view a whole picture all at once. And you certainly can't hear TWO streams of audio simultaneously the way you can examine two photographs at the same time.
In the photo AB test, it is doubtful that many subjects could correctly identify "A" from "B". And the testers would exclaim, "See, there IS NO DIFFERENCE, and we scientifically proved it!" And yet a person could be handed the two lenses and, knowing anything about optics, would understand immediately that light will pass differently through these two very different constructs. There's no deep paradox here; the contrived photographic "slit test" is obviously meaningless as a method of discriminating differences between lenses.
Likewise with audio: AB/X testing is the bulwark used by all those who like to prove "no differences" in wire, cable, amplifiers, CD players and so on. They make a false assumption from the start, that the test is valid, when it obviously is not, because human memory is not a scientifically valid instrument for such a comparison. The principle of taking my memory of event "A" and comparing it for differences with my memory of event "B" is flawed from the start. Yes, gross differences can be remembered, but at this level of extremely subtle difference between "A" and "B", memory is not a useful instrument.