Originally Posted by GregLee
Your explanation of this example is incorrect, regardless of the outcome of an experiment with reverberation. I am confident that if you operated on recordings to snip away the final consonant portions of "cat" and "cab", so the consonants were entirely absent from the sound, in the absence of reverberation, listeners would be able to distinguish the two correctly.
In the absence of reverberations? That is like me saying in a car crash your seat belt can help reduce injury and you disagree by prefacing: “putting aside an accident, the seat belts can’t reduce injury!”
I will take on your test but I really find it surprising that you want to keep arguing about a point that is absent the condition all of us care about: real sounds in real rooms (with reverberations).
Experiment
I recorded the word “cat” on my laptop and then chopped off the ending consonant as instructed. This is the resulting file. Please take a listen and see if you can still tell it is cat or not: http://www.madronadigital.com/Downloads/RTSpeechTests/foo.wav
Not wanting to wait for you all to vote, I ran the test blind by my three family members: my wife and two college-going sons. I asked them what word was being played, using headphones and with no other hints whatsoever. All three said “cat” without hesitation. I also tested myself sighted and heard “cat.” I then emailed the link to someone, and his vote (with previous knowledge of this argument) was that he could not tell. So the score is 4.5 in favor of Greg’s argument, and only 0.5 in favor of mine. Oh boy….
But Greg has a serious problem on his hands, because the word I truncated was not “cat” but “cab.”
Here is the original file before truncation: http://www.madronadigital.com/Downloads/RTSpeechTests/cab.wav
Now the scores are 4.5 in my favor and Greg is left with 0.5.
Seriously, it should be abundantly clear that the ending consonant was playing an important role and taking it out seriously damaged our discrimination between the two words – precisely the reason I picked that word pair.
Here is the really fascinating part: with full knowledge of what I just explained, i.e. biasing you to vote the right way, listen to the truncated clip again. I bet if you let yourself hear it, you may still perceive the word incorrectly as “cat”!!! The ambiguity is so strong that your logical mind cannot override it all the time. I created the darn clip and even I am not immune to this: I can easily convince myself either word is being said. There is also a lesson here about the reliability of our audio perceptions and the issues with sighted testing. But we digress.
It goes without saying that you can try to chop off the word in a different place and the outcome may be different. That fact actually indicates that the test as proposed is faulty and runs against a simple rule of linguistics: the transition from vowel to consonant is smooth, and hence there is no clear break point where you can separate the consonant. For this reason, formal research in this area tends to avoid words with formant transitions.
Discussion and Further Research
Greg in his last post puts forward that since we are able to understand a speaker even if he truncates the consonants, it must therefore be the case that the rest of the word was a complete predictor of it. While the premise of what he says is true, the conclusion is overly broad and not supported. Our comprehension is better than it should be because our communication channel has a fair amount of redundancy if you include the full set of information being conveyed. If I am holding a cat and try to hand it to you while saying “take the ca” without the ending “t,” you still understand it. That is not because the “ca” led you to think I said “cat” but because of the visual cue of me offering you the cat. I could even skip the word cat and you would still understand what I am saying.
Sure, if you heard “ca”, you would know I am not saying “dog” so it is helpful that way but it is not a reliable predictor.
We have many such clues from lips moving, to context, to people’s expressions, etc. It is this collective set of hints that let us understand chopped off words. Such hints help us understand speech very well even when as much as 20% is lost/not heard! Unless all of these factors are taken out, you can’t conclude cause and effect as Greg has attempted. The proof point is simply insufficient to make this case.
Note also that when someone teaches linguistics at school, the focus is on elementary research devoid of the real-world conditions we face. For example, the material uses very clean voices, recorded well and listened to in quiet, non-reverberant spaces. Such is not the case for us. Dialog in movies, for example, is rarely without background music, noises, special effects, etc. Likewise, if you listen in multi-function spaces, noises from everyday living and adjacent spaces reduce our effective signal-to-noise ratio. Now add to it accents, not being a native speaker, getting older with less than perfect hearing, etc. and you see how what we face is a degraded speech comprehension environment. This is why it is important to design our listening spaces such that they don’t add to the problem.
Going back to our topic of interest, room reverberations are a form of “self-noise”: they raise the residual sound level in the room and therefore degrade any component of speech that plays at a similar or lower level. As I posted before, research shows that the most damaging aspect here is with regard to consonants. Vowels, in contrast, tend not to suffer nearly as much. This factor is so important that a specific metric was created for it called ALcons (Articulation Loss of Consonants), which came out of research that Peutz performed in the 1970s, as summarized in this excellent AES paper, What You Specify Is What You Get (part 1):
“With Articulation Loss of Consonants (ALcons) only the wrongly understood consonants are counted. Peutz found that for speech in rooms the vowels are much easier understood than consonants and hence the loss of consonants are the deciding factor in speech intelligibility. ALcons is expressed in %. Under perfect conditions (speech direct on headphone) a combination of a very good speaker and a very good listener will have an ALcons of 2.5%. In excellent room acoustical conditions they can have on top of that 5% ALcons or less. An extra 5% loss is still considered as good and another 5% extra loss is still considered fair and sufficient for most messages. The initial 2.5% is considered the zero correction or proficiency factor. Which target to set for a certain situation depends on the proficiency (to be expected) of the talker and the listener.”
Computing %ALcons for my theater example, which had an RT60 of 0.7, gives 9%, which is outside the recommended range of <7%. Recall that our recommendation for RT60 was to not exceed 0.5. We exceeded that in my theater and the result is a higher potential for loss of consonant comprehension.
For grins, I computed the %ALcons for Ethan’s garage and got a whopping 26%!!!
It probably makes for great pipe organ music but don’t try to have a kid’s play in there.
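For readers who want to try this kind of calculation themselves, here is a minimal sketch of one commonly quoted form of Peutz's relation. It is an assumption that the figures above were computed this way. The function name, parameter names, and the 0.21 limiting-distance constant are illustrative choices, not taken from the post. Past the limiting distance the estimate depends only on RT60, so an RT60 of 0.7 s plus the 2.5% proficiency factor from the quote works out to roughly 8.8%.

```python
import math


def alcons_percent(rt60_s, distance_m=None, volume_m3=None, proficiency_pct=0.0):
    """Rough %ALcons estimate using one commonly quoted form of Peutz's relation.

    Assumed form (not confirmed as the one used in the post):
      near field: 200 * D^2 * RT60^2 / V + a   when D <= limiting distance
      far field:  9 * RT60 + a                 otherwise
      limiting distance = 0.21 * sqrt(V / RT60)
    """
    if rt60_s <= 0:
        raise ValueError("rt60_s must be positive")
    far_field = 9.0 * rt60_s + proficiency_pct
    # Without room geometry we can only give the far-field figure.
    if distance_m is None or volume_m3 is None:
        return far_field
    limiting_distance = 0.21 * math.sqrt(volume_m3 / rt60_s)
    if distance_m <= limiting_distance:
        return 200.0 * distance_m ** 2 * rt60_s ** 2 / volume_m3 + proficiency_pct
    return far_field


# Far-field estimate for an RT60 of 0.7 s with the 2.5% proficiency factor.
theater = alcons_percent(0.7, proficiency_pct=2.5)
print(f"theater far-field estimate: {theater:.1f}%")
```

Passing a listening distance and room volume (both hypothetical inputs here) switches to the near-field branch, which is why sitting closer to the source lowers the estimate.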
Here is a very nice visualization of the effect of noise and reverberation on speech by Professor Boothroyd of SDSU, based on a model created from listening tests:
There are a number of graphs there. One is for intelligibility of “CVC” constructs, which stands for consonant-vowel-consonant words, like my example of cat and cab. The other two are for word recognition in easy and difficult sentences, which is yet another vector/metric of comprehension (remember my example of “take the cab” or “take the cat”).
The top left quadrant is a quiet room with a reasonable reverberation time of 0.5. The quadrant on the right keeps the room very quiet but ups the reverberation time to around 1.5, which is about what my living room measures. There we notice a sharp drop in comprehension of words even in easy sentences. Indeed the result is pretty similar to the bottom left quadrant, where we keep the reverberation low but raise the noise floor from 35 dB to 60 dB. Put another way, upping the RT60 time resulted in the equivalent of a whopping 25 dB loss in our signal-to-noise ratio!
Of course all hell breaks loose when we combine noise with high reverberation times in the bottom right quadrant. Note again that we have the equivalent of “noise” in our movie content in the form of other background sounds. You can try to sit closer to the source to compensate, as the graph represents, but that won’t change the noise level since our “noise” can be in the content.
Bringing the lesson home, listen to this “convolution reverb” of my cat-cab example. If you don’t know what that phrase means, convolution reverb is a process by which we profile a room (in this case an “empty living room”) using an impulse response, and then by convolving that impulse response with any audio clip, we can make the clip sound as if it were played in the room we profiled. The accuracy is not perfect and depends on how good the impulse response is, but it is close enough for this exercise. Here is the file with that reverb added to it:
You likely hear a mix of “cat” and “cab”, right? If not, listen again.
Guess what? I only said “cab” in that clip! Indeed it is the word “cab” in my last example simply concatenated four times:
Listen to how much clearer it is without the high reverberations of that empty room applied to it. Once more, with full knowledge, listen to the reverb version and I bet you can still hear “cat” in it! This strange brain we have.
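For anyone curious how the convolution reverb step works mechanically, here is a minimal sketch in plain Python. It is not the tool used to make the clips above. The impulse response is a synthetic stand-in (decaying random noise) rather than a measured room, and the function names, sample rate, and lengths are all made up for illustration. Every input sample launches a scaled copy of the room's impulse response, and the copies are summed.

```python
import random


def convolve(dry, impulse_response):
    """Direct time-domain convolution: each input sample triggers a scaled
    copy of the impulse response, and all the copies are summed."""
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for i, x in enumerate(dry):
        if x == 0.0:
            continue  # silent samples contribute nothing
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out


def synthetic_impulse_response(rt60_s, sample_rate, length_s, seed=0):
    """Hypothetical stand-in for a measured room: random noise whose level
    falls by 60 dB every rt60_s seconds, with a unit direct sound at t = 0."""
    rng = random.Random(seed)
    n = int(length_s * sample_rate)
    ir = [rng.uniform(-1.0, 1.0) * 10.0 ** (-3.0 * (k / sample_rate) / rt60_s)
          for k in range(n)]
    ir[0] = 1.0
    return ir


# A single click as the "dry" recording, played in a made-up live room.
sample_rate = 1000
dry = [1.0] + [0.0] * 99
room = synthetic_impulse_response(rt60_s=1.5, sample_rate=sample_rate, length_s=0.2)
wet = convolve(dry, room)
print(len(dry), len(room), len(wet))
```

The nested loop is fine for a toy example but far too slow for real audio; in practice an FFT-based convolution from a numerical library would be the usual choice.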
What does it mean to you
There is an uncanny resemblance between this argument and the last. As with the concept of reverberations in the room, what is being quoted from textbooks is not wrong. It simply is misapplied to the topic at hand. How do we know that? Well, we confirm our understanding by actually performing the test Greg suggested. If the results agree, then we know we understood the science well. They didn't in this case. It is also critical that we triangulate our knowledge by reading more than one text, to make sure we are not stretching what we have learned to cover a broader situation than it supports, as occurred here.
With respect to what to do in your room, nothing has changed: if you have difficulty understanding speech in movies, for example, one factor may be an overly “live” room. Measure RT60 and look at the mid-frequencies. If it is above our target range of 0.2 to 0.5 seconds, you need to put additional absorption in the room. Such absorption can be ordinary furnishings such as a carpet on the floor, or dedicated products. Location of the absorber is not critical, but if given a chance, combine it with control of other acoustic issues such as floor bounce (i.e. put carpet on the floor). Otherwise, you may wind up with too much absorption when the job is done.
Note that there are other causes of poor speech intelligibility, such as poorly recorded content, too few early reflections, poorly designed center speakers, and hearing loss. If your room is already in that target RT range, especially if it is at the lower end of that scale, then the problem is one of these.
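If you have an impulse response of your room and want a rough RT60 number to compare against the 0.2 to 0.5 target, one standard approach is Schroeder backward integration with a straight-line fit to the decay. The sketch below assumes a broadband impulse response with no noise floor and fits the -5 to -25 dB portion. A real measurement would be band-filtered to the mid-frequencies first, which is not shown. All names and the test signal are hypothetical.

```python
import math


def schroeder_decay_db(impulse_response):
    """Energy remaining after each sample, in dB relative to the total."""
    remaining = [0.0] * len(impulse_response)
    total = 0.0
    for k in range(len(impulse_response) - 1, -1, -1):
        total += impulse_response[k] ** 2
        remaining[k] = total
    if total <= 0.0:
        raise ValueError("impulse response contains no energy")
    return [10.0 * math.log10(max(r / total, 1e-30)) for r in remaining]


def estimate_rt60(impulse_response, sample_rate, start_db=-5.0, end_db=-25.0):
    """Fit a line to the decay between start_db and end_db, then
    extrapolate the slope to a full 60 dB drop."""
    decay = schroeder_decay_db(impulse_response)
    pts = [(k / sample_rate, d) for k, d in enumerate(decay)
           if end_db <= d <= start_db]
    if len(pts) < 2:
        raise ValueError("not enough decay range to fit")
    mean_t = sum(t for t, _ in pts) / len(pts)
    mean_d = sum(d for _, d in pts) / len(pts)
    slope = (sum((t - mean_t) * (d - mean_d) for t, d in pts)
             / sum((t - mean_t) ** 2 for t, _ in pts))  # dB per second
    return -60.0 / slope


def within_target(rt60_s, low=0.2, high=0.5):
    """True if RT60 falls inside the 0.2 to 0.5 second target range."""
    return low <= rt60_s <= high


# Idealized test signal: level falls 60 dB in exactly 0.5 s.
ir = [10.0 ** (-3.0 * (k / 8000) / 0.5) for k in range(16000)]
rt60 = estimate_rt60(ir, 8000)
print(f"estimated RT60: {rt60:.2f} s, within target: {within_target(round(rt60, 2))}")
```

A real room's decay is rarely this clean, so treat a single number from a fit like this as a starting point rather than a verdict.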
Edit: someone please teach me how to spell right.