Originally Posted by antoniobiz1
This is a giant strawman. The difference between you and every other listener is that sometimes you manage to hear microdifferences by doing something completely different than what Meyer and Moran did (which was having people listen to music).
I am a "people" and I listened to "music" presented by Scott/Mark.
These are the results I got:
Originally Posted by amirm
Thank you, Scott! I much appreciate the effort you have put into this project. For the first time I feel that the forum is moving forward toward a better understanding of this topic.
foo_abx 1.3.4 report
File A: C:\Users\Amir\Music\AIX AVS Test files\On_The_Street_Where_You_Live_A2.wav
File B: C:\Users\Amir\Music\AIX AVS Test files\On_The_Street_Where_You_Live_B2.wav
18:50:44 : Test started.
18:51:25 : 00/01 100.0%
18:51:38 : 01/02 75.0%
18:51:47 : 02/03 50.0%
18:51:55 : 03/04 31.3%
18:52:05 : 04/05 18.8%
18:52:21 : 05/06 10.9%
18:52:32 : 06/07 6.3%
18:52:43 : 07/08 3.5%
18:52:59 : 08/09 2.0%
18:53:10 : 09/10 1.1%
18:53:19 : 10/11 0.6%
18:53:23 : Test finished.
Total: 10/11 (0.6%)
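For anyone wondering where those percentages come from: foo_abx appears to report the one-sided binomial probability of scoring at least that many correct by pure guessing. A minimal Python sketch (the function name abx_p_value is mine, not foobar2000's):

from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """Probability of getting at least `correct` answers right
    out of `trials` by coin-flipping (p = 0.5 per trial)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

print(f"{abx_p_value(10, 11):.1%}")  # 0.6%, matching the report's final line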
Originally Posted by amirm
The third track was pretty easy. The first segment I picked was quite revealing:
foo_abx 1.3.4 report
File A: C:\Users\Amir\Music\AIX AVS Test files\Just_My_Imagination_A2.wav
File B: C:\Users\Amir\Music\AIX AVS Test files\Just_My_Imagination_B2.wav
21:01:16 : Test started.
21:02:11 : 01/01 50.0%
21:02:20 : 02/02 25.0%
21:02:28 : 03/03 12.5%
21:02:38 : 04/04 6.3%
21:02:47 : 05/05 3.1%
21:02:56 : 06/06 1.6%
21:03:06 : 07/07 0.8%
21:03:16 : 08/08 0.4%
21:03:26 : 09/09 0.2%
21:03:45 : 10/10 0.1%
21:03:54 : 11/11 0.0%
21:04:11 : 12/12 0.0%
21:04:24 : Test finished.
Total: 12/12 (0.0%)
Originally Posted by amirm
foo_abx 1.3.4 report
File A: C:\Users\Amir\Music\AIX AVS Test files\Mosaic_A2.wav
File B: C:\Users\Amir\Music\AIX AVS Test files\Mosaic_B2.wav
06:18:47 : Test started.
06:19:38 : 00/01 100.0%
06:20:15 : 00/02 100.0%
06:20:47 : 01/03 87.5%
06:21:01 : 01/04 93.8%
06:21:20 : 02/05 81.3%
06:21:32 : 03/06 65.6%
06:21:48 : 04/07 50.0%
06:22:01 : 04/08 63.7%
06:22:15 : 05/09 50.0%
06:22:24 : 05/10 62.3%
06:23:15 : 06/11 50.0% <---- difference found reliably. Note the 100% correct votes from here on.
06:23:27 : 07/12 38.7%
06:23:36 : 08/13 29.1%
06:23:49 : 09/14 21.2%
06:24:02 : 10/15 15.1%
06:24:10 : 11/16 10.5%
06:24:20 : 12/17 7.2%
06:24:27 : 13/18 4.8%
06:24:35 : 14/19 3.2%
06:24:40 : 15/20 2.1%
06:24:46 : 16/21 1.3%
06:24:56 : 17/22 0.8%
06:25:04 : 18/23 0.5%
06:25:13 : 19/24 0.3%
06:25:25 : 20/25 0.2%
06:25:32 : 21/26 0.1%
06:25:38 : 22/27 0.1%
06:25:45 : 23/28 0.0%
06:25:51 : 24/29 0.0%
06:25:58 : 25/30 0.0%
06:26:24 : Test finished.
Total: 25/30 (0.0%)
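As a back-of-the-envelope check (my arithmetic, not part of the report): after trial 11 the score climbs from 6/11 to 25/30, i.e. 19 correct answers in a row once the difference was found. The odds of such a streak by guessing:

# Probability of 19 consecutive correct answers by coin-flipping
print(f"{0.5 ** 19:.1e}")  # about 1.9e-06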
So we now have 3 out of 3 positive detections of differences in Scott's clips.
Summarizing, I managed to consistently tell all three files apart from their downsampled 44.1 kHz/16-bit versions.
You are not the graduate, and we are not kids.
That is simple enough to demonstrate. Let's have people run the above test and report their results. If everyone does as I did, then you are right. If, however, most people cannot, then we are mixing populations.
Originally Posted by antoniobiz1
This is another giant strawman. They tested real-world people in real-world conditions with real-world recordings in a real-world way. There is no dilution. You cannot select 10 people over 6 feet tall and then say that the human population is 6 feet tall on average.
The problem with their test is the phrase "real world," as if their test covered the whole world. We like, as you are all doing, to apply their results to all systems and all listeners. At the risk of stating the obvious, they did not have me in their pool of listeners, nor did they test my system. And of course, they did not have the same content we have in this test.
The above is the reality of any listening test, right? We cannot include the whole "world" as the population in our test. We cannot test all systems. We cannot test all content. Yet we want the results to apply to all of these things. How do we solve this quandary? We use the industry's best practices to get closer to these ideals than Meyer and Moran did.
The industry standard in this regard is the international recommendation ITU-R BS.1116:
RECOMMENDATION ITU-R BS.1116-1: METHODS FOR THE SUBJECTIVE ASSESSMENT OF SMALL IMPAIRMENTS IN AUDIO SYSTEMS INCLUDING MULTICHANNEL SOUND SYSTEMS
Let's review some of the recommendations:
3.2.1 Pre-screening of subjects
Pre-screening procedures include methods such as audiometric tests, selection of subjects based on their previous experience and performance in previous tests, and elimination of subjects based on a statistical analysis of pre-tests. The training procedure might be used as a tool for pre-screening.
The major argument for introducing a pre-screening technique is to increase the efficiency of the listening test. This must however be balanced against the risk of limiting the relevance of the result too much.
There was no pre-screening of listeners in Meyer and Moran; they let everyone into the test. Yes, the last sentence says we have to be careful not to make the selection too narrow. But the answer to that is not to throw out the recommendation altogether and let any and all people take the test with no pre-screening.
In my case, I am pre-screened as stated above, having participated in blind tests and done better than most people, meeting the requirement of the recommendation.
4.1 Familiarization or training phase
Prior to formal grading, subjects must be allowed to become thoroughly familiar with the test facilities, the test environment, the grading process, the grading scales and the methods of their use. Subjects should also become thoroughly familiar with the artefacts under study. For the most sensitive tests they should be exposed to all the material they will be grading later in the formal grading sessions. During familiarization or training, subjects should be preferably together in groups (say, consisting of three subjects), so that they can interact freely and discuss the artefacts they detect with each other.
There is no mention whatsoever of any of this happening in the Meyer and Moran test. No one was trained to become familiar with "the artefacts under study." Listeners were not put in groups so that they could learn from each other.
Contrast this with my testing as I reported in the other thread. Once I passed Arny's test with careful listening, others who had thought it impossible suddenly managed to do the same. I did not even teach them what to do, other than letting them know that differences are audible. Imagine how much better people could do in a room with me, where I could show them exactly what I am hearing.
6 Programme material
Only critical material is to be used in order to reveal differences among systems under test. Critical material is that which stresses the systems under test. There is no universally “suitable” programme material that can be used to assess all systems under all conditions. Accordingly, critical programme material must be sought explicitly for each system to be tested in each experiment. The search for good material is usually time-consuming; however, unless truly critical material is found for each system, experiments will fail to reveal differences among systems and will be inconclusive.
This is a fundamental flaw in Meyer and Moran. Not only did they not seek out the right material, they did not even measure whether what they picked had high-frequency content before they chopped it down to 44.1 kHz.
Critical material does not mean "let's buy DVD-As and SACDs that people say sound good." That is how you start a forum food fight, not how you search for clips with real differences that apply to listeners like us, here and now, using content such as what is presented in this thread, which has been verified to have high-frequency content. The last sentence quoted above completely invalidates the Meyer and Moran test, and hence their results are "inconclusive."
Note that my testing was with material in this thread that, while not shown to be critical, at least has high-frequency content.
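Verifying that a candidate clip actually has content above the 22.05 kHz Nyquist limit of CD is easy to do. A rough sketch of the kind of measurement I mean, assuming a 96 kHz WAV file (the file name and library choice are illustrative):

import numpy as np
from scipy.io import wavfile

rate, data = wavfile.read("candidate_96k.wav")  # hypothetical 96 kHz source
if data.ndim > 1:
    data = data.mean(axis=1)                    # fold stereo to mono

spectrum = np.abs(np.fft.rfft(data))
freqs = np.fft.rfftfreq(len(data), d=1.0 / rate)

total = np.sum(spectrum ** 2)
above = np.sum(spectrum[freqs > 22050.0] ** 2)  # energy above CD's Nyquist
print(f"Energy above 22.05 kHz: {above / total:.3%} of total")

If that ratio is effectively zero, the clip cannot serve as critical material for a hi-res vs. CD comparison, no matter how good it sounds.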
From the same section:
It must be empirically and statistically shown that any failure to find differences among systems is not due to experimental insensitivity because of poor choices of audio material, or any other weak aspects of the experiment, before a “null” finding can be accepted as valid. In the extreme case where several or all systems are found to be fully transparent, then it may be necessary to program special trials with low or medium anchors for the explicit purpose of examining subject expertise (see Appendix 1).
Another fatal problem with the Meyer and Moran test: they had no anchors. What are anchors? They are samples where we know the answer. In this scenario, you would downsample to 22 kHz instead of 44.1 kHz. If the outcome is that no one can tell the difference between a clip with content up to 11 kHz and one with content up to 48 kHz, then we know something is seriously wrong with our test protocol. No such anchor existed in the Meyer and Moran test.
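Generating such an anchor is trivial. A rough sketch, again in Python with illustrative file names: resample the hi-res master down to 22.05 kHz, which removes everything above roughly 11 kHz, then bring it back up so the anchor plays in the same format as the other clips:

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

rate, data = wavfile.read("master_96k.wav")  # hypothetical 96 kHz master

# 96000 * 147/640 = 22050: down to 22.05 kHz, then back up to 96 kHz.
# Nothing above ~11 kHz survives the round trip.
down = resample_poly(data.astype(np.float64), up=147, down=640, axis=0)
anchor = resample_poly(down, up=640, down=147, axis=0)

wavfile.write("anchor_11k_limited.wav", rate, anchor.astype(np.float32))

If listeners cannot tell this anchor from the original, the test setup, not the listeners, is the problem.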
In formal parlance, the anchor is called a "control." When my wife was working in a medical laboratory, before using a chemistry machine they would feed it sugar water. If it did not report the expected result, they knew the machine was broken. Without such a control, you cannot trust the results.
Controls can also catch human errors. What if, in creating such a test, you plug the wrong cable into the ABX box so that both A and B carry the same signal? You would get a 50-50 response, yes? That result would be meaningless, of course, because you were testing a signal against itself. With a control in place, a 50-50 score on the control would tell you something is wrong, and you would find the wiring problem.
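The statistics of such a control are easy to simulate. A toy sketch: when A and B are really the same signal, every answer is a coin flip and the score hovers near 50%; a known-audible control that also comes out near 50% is the tell-tale sign of a broken setup:

import random

def simulate_run(p_correct: float, trials: int = 30) -> int:
    """Correct answers from a listener who identifies X with
    probability p_correct on each trial."""
    return sum(random.random() < p_correct for _ in range(trials))

random.seed(1)
print("miswired, A == B: ", simulate_run(0.5), "/ 30")   # ~15, pure guessing
print("11 kHz anchor:    ", simulate_run(0.95), "/ 30")  # should be near 30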
We follow all of these conventions in formal testing in research and product development. We are not doing a Pepsi vs. Coke test, where you can drag people off the street and have them tell you which one they like. We are testing for complex and potentially very subtle differences. As such, we need to follow the proper recommendations on how to minimize errors. Only then do we know that the results can apply to the "world."
So let's not try to stop people from running these tests, or from believing the outcomes, on the strength of a highly flawed listening test. Meyer and Moran could be right, but there are enough protocol errors that we simply cannot run around shoving their paper in people's faces as the bible of high-resolution vs. CD listening tests.