Originally Posted by RobertR
You missed it again. I said "the likes of", which clearly refers to more than two people.
Nobody said I am the sharpest tool in the shed.
Originally Posted by RobertR
You keep missing what I said. I already said that such people aren't the target.
Now I am really lost in the woods. How can RH and JA not be part of your target even though you listed them by name?
Originally Posted by RobertR
So you do exhibit some enlightenment after all!
I am often misunderstood.
Originally Posted by RobertR
Citing a bad example of peer review doesn't make the concept valueless, especially as opposed to not doing it at all.
No, but explaining the reality of "peer review" versus the imagined one is. That is what I explained: I talked about how I had a team of people performing that function, and it is not what the layman thinks the process is. The M&M report just brings the point home: we somehow think there is a judging panel that makes sure the work has good results, when in reality there is none.
Originally Posted by RobertR
You left out trying the methodology themselves and verifying the results, which is of course very important in giving the experiment credence.
Once again, the peer review process does not "verify" anything. That is for the reader to judge. If the reviewers had verified the results, they would be the ones being judged right now, and they are not (not directly, anyway).
The methodology is reviewed to some extent. You can't show up and say you looked at two cables and decided the thicker one sounded better; they would throw out your paper because it would violate the "simple math" equivalent of controlled testing. The Meyer and Moran paper walked and talked like the proverbial duck: it talked about ABX testing and statistical results. That put it in the "they didn't flunk math" class, and the paper was approved. That's all. There is no representation that the methodology was completely correct, but rather that it was not completely wrong. The two are not the same thing. We can't go and skewer the review panel because these guys didn't test their samples to see if they were truly high-res, or for the other, more detailed failings. It is not the panel's job to scrutinize papers at that level. They get tons of papers. They read through each one and, if there is no obvious red flag, it goes through.
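To make the mechanics concrete, here is a toy sketch (my own illustration, not anything from the paper) of what an ABX trial reduces to: each round, X is secretly assigned to A or B and the listener must identify it. The stand-in "listener" below guesses randomly, which is exactly the null hypothesis the statistics have to rule out:

```python
import random

def run_abx_session(n_trials: int = 10) -> int:
    """Toy ABX session. Each trial, X is secretly assigned to A or B;
    the listener must say which one X actually is. The stand-in
    'listener' here guesses randomly, i.e. hears no difference at all."""
    correct = 0
    for _ in range(n_trials):
        x = random.choice(["A", "B"])       # blind assignment of X
        guess = random.choice(["A", "B"])   # pure guess: the chance baseline
        if guess == x:
            correct += 1
    return correct

# A guesser still gets about 5 of 10 right; the statistics must separate
# real hearing from this baseline. That is the "simple math" part a review
# panel checks for, and nothing more.
print(run_abx_session(), "correct out of 10")
```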
So, with all due respect, I did not omit those things. I know what is involved in the process, and it is not what you are assuming it is. No one here is arguing that M&M flunked basic math; they followed it. The problem is, they missed the next level of understanding. Let me quote the International Telecommunication Union (ITU) document that is the bible of testing for small impairments, BS.1116:

"It should be understood that the topics of experimental design, experimental execution, and statistical analysis are complex, and that only the most general guidelines can be given in a Recommendation such as this. It is recommended that professionals with expertise in experimental design and statistics should be consulted or brought in at the beginning of the planning for the listening test."
This is a 30-page document describing many aspects of proper controlled listening tests, and it still says that it is just scratching the surface and that qualified people need to be consulted before running headlong into such testing. If M&M had brought in professionals with experience in such tests, we would not be here discussing the problems with their work. I don't even see a reference to this document, which makes me believe they had not read it. Had they done so, they would have seen requirements such as these:

"3.1 Expert listeners
It is important that data from listening tests assessing small impairments in audio systems should come exclusively from subjects who have expertise in detecting these small impairments. The higher the quality reached by the systems to be tested, the more important it is to have expert listeners."
Yes, they claim to have used people who were involved in recording music and such, but no specifics are provided as to their expertise in hearing small impairments. A job title doesn't make you an expert listener for this type of test. I can easily hear compression artifacts that top creative engineers who produce music content can't. I can't do their job, and they can't do mine. Likewise, we would need people who know what to listen for when the music is resampled down: what kinds of artifacts could be there and what they would sound like, just as I know those for compressed music. It is not that I have better ears than others. It is simply a case of being trained to hear compression artifacts and of studying how the codecs work.
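To illustrate the kind of artifact a trained listener would know to listen for, here is a minimal sketch (my own example using numpy and scipy, not anything from the M&M paper) of aliasing: downsample without a proper anti-alias filter and ultrasonic content folds straight into the audible band:

```python
import numpy as np
from scipy.signal import resample_poly

fs = 96_000                                  # "high-res" source rate
t = np.arange(fs) / fs                       # one second of samples
tone = np.sin(2 * np.pi * 30_000 * t)        # 30 kHz tone: above the 24 kHz Nyquist of 48 kHz

naive = tone[::2]                            # drop to 48 kHz with NO anti-alias filter
proper = resample_poly(tone, up=1, down=2)   # drop to 48 kHz with proper filtering

def strongest_component(x, fs_out=48_000):
    """Return (frequency, magnitude) of the largest spectral peak."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / fs_out)
    return freqs[spectrum.argmax()], spectrum.max()

print("naive:   ", strongest_component(naive))    # alias folds to 48k - 30k = 18 kHz, in-band
print("filtered:", strongest_component(proper))   # 30 kHz content removed; near silence
```

With proper filtering, the 30 kHz tone simply disappears; with naive decimation, it shows up as an in-band 18 kHz component. Knowing that such products exist, and what they sound like, is what separates a trained listener from a job title.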
The ITU paper goes on to say:

"The outcome of subjective tests of sound systems with small impairments utilizing a selected group of listeners is not primarily intended for extrapolation to the general public. Normally the aim is to investigate whether a group of expert listeners, under certain conditions, are able to perceive relatively subtle degradations but also to produce a quantitative estimate of the introduced impairments."
Yet M&M went on to recruit any and all people: "With the help of about 60 members of the Boston Audio Society and many other interested parties, a series of double-blind (A/B/X) listening tests were held over a period of about a year."
Who cares if the people were part of the audio society? Anyone can join that group; membership doesn't make them qualified to hear small differences. They then picked other people with the same problem: "The subjects included men and women of widely varying ages, acuities, and levels of musical and audio experience; many were audio professionals or serious students of the art."
Again, the people selected must demonstrate skill in hearing the artifacts introduced into the audio samples. Variety is not a requirement or a merit, despite what the authors think. Here is BS.1116 again:

"3.2.1 Pre-screening of subjects
Pre-screening procedures include methods such as audiometric tests, selection of subjects based on their previous experience and performance in previous tests, and elimination of subjects based on a statistical analysis of pre-tests. The training procedure might be used as a tool for pre-screening.
The major argument for introducing a pre-screening technique is to increase the efficiency of the listening test. This must however be balanced against the risk of limiting the relevance of the result too much."
Clearly, no screening was performed per these industry guidelines. No pre-test was given before allowing a person to take the test. The document also covers the other side of screening:

"3.2.2 Post-screening of subjects
Post-screening methods can be roughly separated into at least two classes; one is based on inconsistencies compared with the mean result and another relies on the ability of the subject to make correct identifications."
The idea here is to throw out the votes from people who had no business taking the test. The only way to do that is to have a control: a test where we *know* the outcome, so that if a person misses it, we know they are not fit for this exercise. There is nothing resembling this in the test. There is this comment, however: "The “best” listener score, achieved one single time, was 8 for 10, still short of the desired 95% confidence level. There were two 7/10 results. All other trial totals were worse than 70% correct."
What if these were the only qualified people in the test, and they got as many as 7 or 8 identifications right out of 10? A very different picture emerges than "we find no difference." Personally, I don't care how the rest of the people did. I care what the few did, because by definition we are trying to find small differences that are not audible to the masses.
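For what it's worth, those scores check out under the standard one-sided binomial test against chance; this short sketch (my own arithmetic, not from either paper) also shows what a 95% screening criterion would have demanded:

```python
from math import comb

def p_at_least(k: int, n: int) -> float:
    """One-sided binomial p-value: probability of k or more correct
    out of n trials if the listener is purely guessing (p = 0.5)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# The best scores reported by Meyer and Moran:
print(f"8/10 correct: p = {p_at_least(8, 10):.4f}")  # 0.0547 -> just misses 95% confidence
print(f"7/10 correct: p = {p_at_least(7, 10):.4f}")  # 0.1719

# Minimum correct answers a screening pre-test would demand at 95% confidence:
for n in (10, 16, 25):
    k = next(k for k in range(n + 1) if p_at_least(k, n) <= 0.05)
    print(f"{n} trials: need at least {k} correct")
```

Note that 8/10 lands at p ≈ 0.055, which is exactly why the paper calls it "still short" of 95% confidence. A properly screened test would simply have run more trials with those listeners instead of averaging them into the crowd.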
And one more requirement from BS.1116:

"4.1 Familiarization or training phase
Prior to formal grading, subjects must be allowed to become thoroughly familiar with the test facilities, the test environment, the grading process, the grading scales and the methods of their use. Subjects should also become thoroughly familiar with the artefacts under study. For the most sensitive tests they should be exposed to all the material they will be grading later in the formal grading sessions. During familiarization or training, subjects should be preferably together in groups (say, consisting of three subjects), so that they can interact freely and discuss the artefacts they detect with each other."
None of this was done.
I could go on, but you get the idea as to why I put so little weight on their work. This is hobby work trying to masquerade as proper testing. I am especially disappointed to see this kind of boasting in the M&M paper: "With the printing of the characterizations in Stuart’s lead paper in this Journal, it became clear that it was well past time to settle the matter scientifically."
Settle the matter scientifically? Sorry, but no. We don't have such a low standard for the science of audio testing. The mere existence of an ABX box and a bunch of people in the test doesn't get us there. People don't wake up one morning qualified to run drug trials; even your doctor may not be qualified to do so. Yet in audio we think it must be that simple, and that all it takes is something playing music and two ears pushing a button on an ABX box. That is fine if the differences are large or if we don't intend the results to be authoritative. Such is not the case here: the differences are small, and the authors claim scientific validity sanctioned by the "peer review" stamp.
So, high-res music or not, the testing leaves a ton to be desired. That they didn't even have the right sample data just exacerbates the problems.
Should we dismiss the test? No. They likely showed that most people can't hear differences in the sampling of music they had picked when it is downsampled to 44.1 kHz. That much we can believe. Going beyond that is taking the test to places it is not qualified to go.