Originally Posted by amirm
Originally Posted by kbarnes701
In an ABX test, there are only two outcomes too - 1 that the testee can hear a difference and 2 that the testee cannot hear a difference.
As logical as that may sound, unfortunately it is not true. If we could tell what a tester had *heard*, the game would be over as we would know with 100% certainty what is audible! Alas, we can't. We don't have ESP and hence no knowledge of what the person heard. What we know is how they voted. Those are not the same things although very well could be.
A good example is the DVD Forum double blind tests to determine the high definition video codec for HD DVD (which later became that of Blu-ray). In that test, a number of codecs were being tested against the original in a side-by-side display of the original on one side and the one under test, on the other. In the mix was also the original itself as a control. The scale was something like 1 to 5. The original got a score of ~4.7 or something like it. Clearly no tester "saw" a difference between it and the original because by definition it was the same file. What then explains the difference? One of two things occurred: the tester imagined seeing a difference. Or more likely, he saw no difference but gave it a low score anyway under the assumption that all the samples were degraded and he did not want to be the only schmuck who didn't see it!
The story gets better. In one test, one of the codecs actually scored higher than the reference itself!
Human psychology and bias is at play even in double blind tests. And males can be terrible this way and corrupt test results as they tend to want to "win" any competition to find the right answer. I hope one day we can discover what you say: the ability to determine what is being heard. We have to get the person voting out of the equation. This is why if we can show something with measurement and math, it is much more preferable than relying on tests where we can't as easily as above determine incorrect voting.
The flaw with this little anecdote is that the facts seem to relate to a DBT that was not ABX, and was probably some form of ABC/HR, ABCD/HR, MUSHRA, etc.
The alleged flaw found in the cited test cannot happen in an ABX test: there is no rating scale to mark down, only a forced choice of whether X is A or B.
In an ABX test, the listener has to identify an unknown sample X as being A or B, with A (usually the original) and B (usually the encoded version) available for reference. The outcome of a test must be statistically significant. This setup ensures that the listener is not biased by his/her expectations, and that the outcome is not likely to be the result of chance. If sample X cannot be determined reliably with a low p-value in a predetermined number of trials, then the null hypothesis cannot be rejected and it cannot be proved that there is a perceptible difference between samples A and B. This usually indicates that the encoded version will actually be transparent to the listener.
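To put a number on "statistically significant", here is a minimal Python sketch of the usual one-sided binomial calculation: the chance of getting at least a given number of trials right by pure guessing. The function name `abx_p_value`, the 16-trial session and the 0.05 threshold are my own assumptions for illustration; they are not taken from any particular ABX tool.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Probability of scoring at least `correct` out of `trials` by guessing.

    Under the null hypothesis the listener cannot tell A from B, so each
    trial is a fair coin flip (chance of a correct answer = 0.5).
    """
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    # Count the outcomes with `correct` or more successes, then divide by
    # the total number of equally likely outcomes, 2 ** trials.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


ALPHA = 0.05  # assumed significance threshold, fixed before the test starts
TRIALS = 16   # assumed predetermined number of trials

for correct in (9, 12, 14):
    p = abx_p_value(correct, TRIALS)
    verdict = "reject null" if p < ALPHA else "cannot reject null"
    print(f"{correct}/{TRIALS} correct: p = {p:.4f} -> {verdict}")
```

With these assumed numbers, 12 correct out of 16 gives p = 2517/65536, roughly 0.038, which is below 0.05. Note that the only input is the count of correct identifications; nowhere does the listener supply a grade that could be shaded up or down.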
In an ABC/HR test, C is the original which is always available for reference. A and B are the original and the encoded version in randomized order. The listener must first distinguish the encoded version from the original (which is the Hidden Reference that the "HR" in ABC/HR stands for), prior to assigning a score as a subjective judgment of the quality. Different encoded versions can be compared against each other using these scores.
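The bookkeeping for a single ABC/HR trial can be sketched roughly as follows. This is only an illustration under my own assumptions: the 1.0 to 5.0 scale with 5.0 meaning "no difference from C", and the names `assign_slots`, `score_trial` and `TOP_GRADE`, are hypothetical and not taken from any real ABC/HR software.

```python
import random

TOP_GRADE = 5.0  # assumed 1.0-5.0 scale; 5.0 means "no difference from C"


def assign_slots(rng: random.Random) -> dict:
    """Randomly place the hidden reference and the coded version in A and B."""
    contents = ["reference", "coded"]
    rng.shuffle(contents)
    return {"A": contents[0], "B": contents[1]}


def score_trial(slots: dict, grades: dict) -> dict:
    """Map the listener's grades for A and B back to what each slot held."""
    by_content = {slots[slot]: grades[slot] for slot in ("A", "B")}
    return {
        "coded_grade": by_content["coded"],
        "reference_grade": by_content["reference"],
        # True when the listener marked down the hidden reference, i.e.
        # graded a sample that is identical to C as if it were impaired.
        "reference_downgraded": by_content["reference"] < TOP_GRADE,
    }


slots = assign_slots(random.Random())
# Hypothetical listener who hears nothing wrong with A and grades B lower.
result = score_trial(slots, {"A": 5.0, "B": 3.5})
print(slots, result)
```

The `reference_downgraded` flag is the case the anecdote describes: a grade below the top of the scale given to a sample that is, by construction, the original. It exists here only because the listener assigns grades, which is exactly what an ABX trial does not ask for.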
In MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor), the listener is presented with the reference (labeled as such), a certain number of test samples, a hidden version of the reference and one or more anchors. The purpose of the anchor(s) is to bring the scale closer to an "absolute scale", so that minor artifacts are not rated as having very bad quality.
The Wikipedia article seems sufficient to make the differences in purpose and execution among these three kinds of test clear to the average reader. Somehow Amir missed the boat.