OK, second question: 16/44 vs perceptual coding vs transients vs phase vs brick walls.
I'm going to make the argument that 16/44 is not quite sufficient for high-end mastering/delivery, and is not at all adequate for high-end recording.
Let's start with 16 bits of resolution. That gives you about 98 dB between a full-scale sine and the quantization floor. However, nobody actually uses that raw floor, because the quantization error is correlated with the signal -- it's distortion, and it sounds bad. Instead, some dithering is added. The mathematics tend to favor something like 3 dB of error-diffusion dithering; others use variants like SR-22 or (shudder) triangle or whatever. But with "mathematically perfect" dithering, you get about 95 dB between noise floor and peak. (Dithering is interesting, too -- for lower-frequency content, you can actually use the dithering noise as a simple class-D signal to go below the theoretical noise floor. The drawback is that you'd have to use a much lower reconstruction-filter cut-off frequency for this to give you clean signal, rather than correlated noise, so most attempts in this area actually end up sounding worse than plain dithering for some kinds of material. End side note.)
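To make that trade-off concrete, here's a minimal numpy sketch -- my own illustration, using plain TPDF ("triangle") dither because it's the easiest to write down, not because it's my favorite, and a -6 dBFS test tone picked purely for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
fs, n = 44100, 200000
q = 2.0 / 2**16                          # 16-bit step size for signals in [-1, 1)
t = np.arange(n) / fs
x = 0.5 * np.sin(2 * np.pi * 1000 * t)   # -6 dBFS 1 kHz test tone

def quantize(sig):
    return np.round(sig / q) * q

# undithered: the error is correlated with the signal (i.e. distortion)
err_plain = quantize(x) - x
# TPDF dither: two uniform +/-q/2 sources summed, added before the quantizer
dither = rng.uniform(-q / 2, q / 2, n) + rng.uniform(-q / 2, q / 2, n)
err_dith = quantize(x + dither) - x

rms = lambda e: np.sqrt(np.mean(e ** 2))
print(rms(err_plain) / q)   # ~0.289 (q/sqrt(12)), but signal-correlated
print(rms(err_dith) / q)    # ~0.5 (q/2): ~4.8 dB more noise, now decorrelated
```

The price of decorrelating the error is a few dB of extra noise -- which is exactly why the usable range ends up around 95 dB rather than the theoretical 98.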
My home system is calibrated to 0-105 dB SPL with studio-quality, low-distortion monitors. That's 10 dB more headroom than redbook audio offers. Compared to an empty signal, I can hear the soft hiss -- or worse, the 11 kHz triangle wave if it's done wrong -- of the dithering noise at 10 dB SPL. If you want to emulate sitting in the middle of a real symphony orchestra (or rock concert), you may actually have to go higher than 105 dB, but I don't recommend it if you live within two hundred yards of anyone else, and/or want to keep your hearing for long :-)
20 bits, done right all the way, would probably be sufficient -- it's about 120 dB S/N with 3 dB dithering. The reason I suggest 24 bits is that it gives all the marketroids something to shoot for, and hopefully the top 20 bits of the resolution will be nice and linear (you know, the ones your DAC can actually resolve over the capacitor leakage noise) :-)
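The arithmetic behind those numbers, for anyone who wants to check me -- 6.02N + 1.76 dB is the standard full-scale-sine SQNR formula, and the 3 dB is the dithering penalty from above:

```python
def sqnr_db(bits, dither_penalty_db=3.0):
    # full-scale sine SQNR: 6.02*N + 1.76 dB, minus the dithering penalty
    return 6.02 * bits + 1.76 - dither_penalty_db

print(sqnr_db(16))  # ~95 dB -- the redbook figure above
print(sqnr_db(20))  # ~119 dB -- the "about 120" I claimed
print(sqnr_db(24))  # ~143 dB -- far beyond what any real DAC resolves
```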
Music that really brings out these limitations includes things like quiet flute sections, or human-voice soloists, accompanied by a full orchestra, so you need the full dynamic range. And I'm talking about the full dynamic range -- not systems where I have to play slider jockey on the volume to compensate for aggressive dynamic compression.
For mastering, 16 bits is worse still. One reason is that I need some headroom when recording, so that the vocalist suddenly getting an extra burst of inspiration doesn't hard-clip my session. Another reason is that sounds get re-leveled in the mix -- some boosted, others cut -- and EQ may similarly bring out low-bit-depth flaws in sources. Thus, to produce good N-bit mastered recordings, you really want N+M bits in your input/processing chain, for some value of M greater than 2.
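A quick sketch of why the extra source bits matter once the fader moves -- the -26 dBFS source level and the +20 dB boost here are made up for illustration:

```python
import numpy as np

q16, q24 = 2.0 / 2**16, 2.0 / 2**24
t = np.arange(100000) / 48000.0
x = 0.05 * np.sin(2 * np.pi * 440 * t)    # quiet source, about -26 dBFS

x16 = np.round(x / q16) * q16             # source tracked at 16 bits
x24 = np.round(x / q24) * q24             # source tracked at 24 bits

gain = 10 ** (20 / 20)                    # +20 dB boost applied in the mix
db = lambda e: 20 * np.log10(np.sqrt(np.mean(e ** 2)))
err16 = db(gain * x16 - gain * x)         # boosted quantization noise, 16-bit source
err24 = db(gain * x24 - gain * x)         # same boost, 24-bit source
print(err16 - err24)                      # ~48 dB: the 8 extra bits you kept
```

The boost lifts the source's quantization floor right along with it; the only defense is having had more bits in the first place.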
I don't think any of this is actually controversial. Listening tests I've done myself, and ones I've seen published by others, bear it out. The math checks out, too. The main reason 16 bits is considered "enough" for the market is that most systems do not have 105 dB of low-distortion headroom, nor do most listeners sit quietly, suppressing their breathing so they can hear the softest nuances of the flutist's breath.
If your program material is the latest by Justin Bieber and the listeners are the typical buyer of said material, then 16 bits is probably overkill :-)
Now for the argument on phase. Someone said it pretty well above:
In practice you have to leave a buffer between the highest frequency signal you want reproduced well, and the brick wall. This is because when you get close to the brick wall, you start doing nasty things to the transient response of the signal.
It's actually not even just in practice -- it's in theory! The reason is that the Nyquist/brick-wall sampling theorem holds for static signals with infinitely long stimulus, and infinitely long reconstruction filters. No such thing exists in reality. When I have a transient, time aliasing of samples close to the Nyquist limit actually does affect the sound -- it "smears" transients, or even gives them essentially random variations in presence (by which I mean loudness). Even a perfectly band-limited signal is going to see transient smearing for transients with major spectral content between Nyquist/2 and Nyquist, because the sampling theorem assumes static, infinite stimuli fed to a perfect reconstruction filter. This is one reason why I argue that a 96 kHz system (as opposed to, say, 192 kHz) is fully sufficient -- Nyquist/2 is still above the hearing range. (These days, my own range stops at 15 kHz, so for me, 60 kHz sampling would be enough -- ah, to be 15 years younger, when I could hear up to 19 kHz.) ;-) Transient smearing is an effect of phase. This is why I claim that phase is important, even without smearing phase by thousands of degrees.
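You can actually watch the smearing happen with a toy experiment: take a short, near-Nyquist tone burst and slide the sampling grid under it by sub-sample offsets. The 21 kHz tone and 0.2 ms duration here are my own picks for the demo; the point is the offset-dependent level spread, not the exact numbers:

```python
import numpy as np

def captured_rms(fs, offset, f0=21000.0, dur=0.0002):
    # sample a continuous Hann-windowed tone burst on a grid shifted by `offset`
    t = np.arange(0.0, dur, 1.0 / fs) + offset
    t = t[t < dur]
    w = np.sin(np.pi * t / dur) ** 2      # Hann window, written in continuous time
    return np.sqrt(np.mean((w * np.sin(2 * np.pi * f0 * t)) ** 2))

def spread_db(fs, steps=32):
    # max-to-min captured level of the burst over sub-sample grid offsets
    r = [captured_rms(fs, k / (steps * fs)) for k in range(steps)]
    return 20 * np.log10(max(r) / min(r))

print(spread_db(44100.0))   # several dB of purely offset-dependent level change
print(spread_db(96000.0))   # far smaller once Nyquist is well above the burst
```

Same burst, same "perfect" sampler -- but at 44.1 kHz the level you capture depends on exactly when the transient arrived relative to the sample clock. That's the "random variations in presence" I'm describing.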
There's another, more practical reason, too: the recording system must band-limit your input signal *before* sampling, which means in the analog domain. Analog filters are not nearly as nicely behaved as modern digital filters. I can't convolve with a 100-millisecond sinc pulse in the analog domain; the best I can typically do is 12 dB/octave sections in cascade. (Sometimes you can do better, if you give up linearity in frequency, phase, or both.) If I want the band-limiting filter at 20 kHz, then the combined effect of the analog anti-aliasing filter, plus sampling, plus digital filtering, is a *lot* better when I record at 96 kHz than at 44.1 kHz. Reconstruction filters are hard too, but with oversampling, 44.1 kHz reconstruction is not as bad in this regard.
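To put numbers on the analog side: here's an idealized 4th-order Butterworth response -- my own example of two cascaded 12 dB/octave sections with a 20 kHz cutoff, which is already generous for an analog build:

```python
import math

def butterworth_db(f, fc=20000.0, order=4):
    # magnitude response of an idealized order-N Butterworth low-pass, in dB
    return -10 * math.log10(1 + (f / fc) ** (2 * order))

print(butterworth_db(22050.0))  # ~-5 dB at 44.1 kHz's Nyquist: aliasing galore
print(butterworth_db(48000.0))  # ~-30 dB at 96 kHz's Nyquist
print(butterworth_db(76000.0))  # ~-46 dB: the lowest frequency that can fold
                                # back below 20 kHz when sampling at 96 kHz
```

At 44.1 kHz, anything just above Nyquist sails through at nearly full level and folds straight into the audible band; at 96 kHz, the filter has a 28 kHz transition band to work with before anything can alias back below 20 kHz.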
These are all effects that you and I can hear on good-quality, non-snake-oil equipment and rooms (like you'll find in many mixing rooms and studios), and I can see them in the time domain, too -- and even understand them when I analyze the math, once I move beyond the "music behaves like a static signal" fallacy.
Regarding perceptual coders: assuming a well-constructed encoder and decoder, 192 kbit/s MP3 is not transparent to me. 320 kbit/s MP3 is very good, and I believe I could not hear a difference from redbook in most situations, although I'd be interested in constructing some "worst case" input signals and taking a blind listening test in a high-quality room to get a better evaluation. All the evaluations I've seen have been "10 on a scale of 10" or "virtually imperceptible" or similarly measured -- which is not the same as actually perceptually lossless. Nobody can throw away 90% of the information in a recording (which is what something like 128 kbps AAC does compared to redbook) and have it still sound exactly the same. And sounding the same is what high-quality recording and reproduction is about.
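The 90% figure is just arithmetic, by the way:

```python
# redbook PCM data rate vs. a 128 kbps perceptual codec
redbook_kbps = 44100 * 16 * 2 / 1000   # stereo, 16-bit, 44.1 kHz = 1411.2 kbps
discarded = 1 - 128 / redbook_kbps
print(f"{discarded:.0%}")              # about 91% of the PCM bits are gone
```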
So, the real reason I brought this up is for illustration: can the typical marketing victim tell the difference between this discussion and the discussion of "greatly increased Stranding Dynamism"? I believe: no.
In fact, I could be on crack, and actually be peddling snake oil in the above arguments -- but to convince me I'm in error, I'd like to see an argument that addresses all of the points above, and does so without just falling back on the simplifying assumptions of your typical sampling-theorem textbook. The effects I'm talking about are exactly the effects those simplifying assumptions ignore.