I have recently been investigating modern digital audio.
First, we need some background in Digital Signals. This can be mathematically quite advanced, but since I would like this post to be accessible to as wide an audience as possible, here is a link that explains what is needed (not even Calculus is required):
https://brianmcfee.net/dstbook-site/content/intro.html
An important point not emphasised in the above: if a signal has maximum frequency f, Shannon guarantees that sampling at more than 2f allows not just reconstruction but exact reconstruction - no phase shift, ringing, blurring, etc.
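As a rough illustration of what exact reconstruction means, here is a minimal sketch of the Whittaker-Shannon (sinc) interpolation formula. The function names, the 4410 Hz test tone and the number of samples kept are my own choices for illustration. Because only a finite number of samples is summed, the result is a close approximation rather than the exact value the theorem promises for an infinite sum.

```python
import math

def sinc(x):
    """Normalised sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)

def reconstruct(samples, first_index, t_over_T):
    """Sinc interpolation at time t, measured in units of the sample period T."""
    return sum(s * sinc(t_over_T - (first_index + k))
               for k, s in enumerate(samples))

fs = 44100.0   # sampling rate in Hz
f = 4410.0     # test tone, well below fs / 2
N = 1000       # samples kept each side of t = 0 (truncating the infinite sum)

samples = [math.sin(2 * math.pi * f * n / fs) for n in range(-N, N + 1)]

t_over_T = 0.37   # a point lying between two sample instants
exact = math.sin(2 * math.pi * f * t_over_T / fs)
approx = reconstruct(samples, -N, t_over_T)
error = abs(exact - approx)
print(f"exact={exact:.6f} reconstructed={approx:.6f} error={error:.2e}")
```

The value between the samples is recovered from the samples alone; keeping more samples shrinks the remaining error.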
The most fundamental building block of modern digital audio is quickly becoming outdated: CD audio sampled at 44.1 kHz, with each sample 16 bits, abbreviated as 44.1/16.
We know this has a signal-to-noise ratio (SNR) of 96 dB from the article on digital signals.
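The 96 dB figure can be checked with a couple of lines, assuming it is taken as the ratio between full scale and one quantisation step. The function name is just for illustration.

```python
import math

def quantisation_range_db(bits):
    """Ratio of full scale to one quantisation step, in dB: 20*log10(2**bits)."""
    return 20 * math.log10(2 ** bits)

cd_range_db = quantisation_range_db(16)
print(f"16 bits: {cd_range_db:.1f} dB")
```

Each extra bit adds about 6 dB to this figure.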
But enter dither:
What is dither, and is it still relevant in the Hi-Res audio age?
The actual SNR using triangular dither (TPDF) is about 112 dB.
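To see why dither helps at all, here is a small sketch of triangular dither applied before rounding. It does not reproduce the 112 dB figure. It only shows the basic effect: a level smaller than half a quantisation step vanishes without dither but survives, on average, with it. The names, the seed and the test level of 0.3 steps are assumptions made for the example.

```python
import math
import random

def quantise(x):
    """Round to the nearest integer step (1 unit = 1 LSB)."""
    return math.floor(x + 0.5)

def triangular_dither(rng):
    """Sum of two uniform values in [-0.5, 0.5): triangular over (-1, 1) LSB."""
    return (rng.random() - 0.5) + (rng.random() - 0.5)

rng = random.Random(0)   # fixed seed so the run is repeatable
level = 0.3              # constant signal smaller than half a step
trials = 20000

plain = [quantise(level) for _ in range(trials)]
dithered = [quantise(level + triangular_dither(rng)) for _ in range(trials)]

plain_mean = sum(plain) / trials
dithered_mean = sum(dithered) / trials
print("without dither:", plain_mean)
print("with dither:   ", round(dithered_mean, 3))
```

Without dither every output is zero, so the information is gone. With dither the outputs are noisy, but their average sits close to 0.3.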
By Shannon, 44.1/16 can carry frequencies up to about 22 kHz, half the sampling rate.
To determine the number of bits needed, the background noise of high-quality recordings needs to be examined. The minimum SNR under ideal conditions with the best equipment is 110 dB. 130 dB would be a very reasonable SNR to aim at. Indeed, that is close to the thermal noise limit. Going beyond that is useless.
Using 16 bits with dither gives an SNR of 112 dB. As we will see, we can increase this further, achieving well over 130 dB. Fourteen bits are found to be enough to achieve 130 dB.
A fascinating phenomenon happens when you convert the samples back to analogue. You get your original audio plus reflections (images) of it, repeated around every multiple of the sampling frequency, that go on forever. The output needs to be filtered just above 20 kHz to eliminate them. They are above audibility, so leaving them there has no direct audible consequences, but they can play havoc with amplifiers, etc., when listening to audio. Some designers don't bother removing them; their DACs are called NOS (non-oversampling) DACs. Most designers, however, like to remove them.
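Where those reflections land can be worked out directly, assuming the standard result that a tone at frequency f sampled at rate fs reappears at k*fs - f and k*fs + f for k = 1, 2, 3 and so on. The function name is made up for this sketch.

```python
def image_frequencies(tone_hz, sample_rate_hz, count):
    """First `count` image pairs (k*fs - f, k*fs + f) of a sampled tone."""
    return [(k * sample_rate_hz - tone_hz, k * sample_rate_hz + tone_hz)
            for k in range(1, count + 1)]

# Images of a 1 kHz tone on a CD.
images = image_frequencies(1000, 44100, 3)
print(images)

# The lowest image of a 20 kHz tone shows how little room the filter has.
lowest_image = image_frequencies(20000, 44100, 1)[0][0]
print(lowest_image)
```

A 20 kHz tone has its first image at 24.1 kHz, so the filter has to pass the one and remove the other, which is why it must be so steep.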
Combine this with the filter needed before sampling to limit the signal to 22 kHz, so that Shannon holds and the signal can be exactly reproduced without aliasing, and two hard-to-design steep analogue filters are required. Well, life is not perfect, and the first DACs to appear did just this.
Then engineers started to have bright ideas.
First, is there an easier way to tackle the filter issue in the CD player? While the audio is transmitted at the minimum sampling frequency of 44.1 kHz, nothing stops designers at the other end from increasing the sampling frequency, let us say, eight times to 352.8 kHz - this is called oversampling. You take one 44.1 kHz sample, then seven zero samples, and continue this way. Designing a 22 kHz digital filter that works on this upsampled data is straightforward. After it, the remaining copies sit far above 176 kHz (half the new sampling rate) instead of starting just above 22 kHz, so they are much easier to filter. Oversampling was the first idea.
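The zero-insertion step followed by a digital filter can be sketched as below. Note the assumption: the filter here is a simple triangular (linear interpolation) kernel, chosen only because it is short and easy to check by hand. It is a crude stand-in, not the steep 22 kHz filter a real player would use, and all names are hypothetical.

```python
def zero_stuff(samples, factor):
    """Keep each sample and follow it with factor - 1 zeros."""
    out = []
    for s in samples:
        out.append(s)
        out.extend([0.0] * (factor - 1))
    return out

def triangular_kernel(factor):
    """Symmetric taps 1/L ... 1 ... 1/L: a very gentle low-pass filter."""
    return [1 - abs(k) / factor for k in range(-(factor - 1), factor)]

def fir_filter(signal, kernel):
    """Centred FIR filter; output has the same length as the input."""
    half = len(kernel) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for j, h in enumerate(kernel):
            idx = n + j - half
            if 0 <= idx < len(signal):
                acc += h * signal[idx]
        out.append(acc)
    return out

original = [0.0, 8.0, 0.0]            # three samples at the original rate
stuffed = zero_stuff(original, 8)     # eight times the sampling rate
smoothed = fir_filter(stuffed, triangular_kernel(8))
print(stuffed[:17])
print(smoothed[:17])
```

After stuffing, the data is mostly zeros with the original samples poking through. After filtering, the gaps are filled in smoothly.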
This had the following important byproduct. If the audio is dithered, adding extra zero samples means that not all the samples are dithered any more: the noise is concentrated in the non-zero samples. Applying the 22 kHz filter spreads the noise evenly across all samples. For eight times oversampling, the overall noise is now eight times less. Each halving of the noise means 3 dB less noise. So, with two times oversampling, you have not 112 dB SNR but 115 dB, and eight times oversampling, being three doublings, gives 121 dB. The first DAC chips could not handle 16 bits - 14 bits was the max. But using four times oversampling and an early form of noise shaping (to be discussed later), they were made equivalent to 16-bit DACs.
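The arithmetic can be written out as follows, assuming the noise power is spread uniformly so that the in-band noise falls in proportion to the oversampling factor. The 112 dB starting point is the figure quoted above; the function name is illustrative.

```python
import math

def oversampling_gain_db(factor):
    """In-band SNR gain when the same noise is spread over `factor` times the bandwidth."""
    return 10 * math.log10(factor)

base_snr_db = 112.0   # dithered 16-bit figure used in the post

for factor in (2, 4, 8):
    total = base_snr_db + oversampling_gain_db(factor)
    print(f"{factor}x oversampling: about {total:.0f} dB")
```

Each doubling adds almost exactly 3 dB, which is where the 115 dB and 121 dB figures come from.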
As always, from those early days, things move on. Let's look at a modern DAC like the PS Audio Direct Stream (DS). I use that as an example because I own one and have investigated how it works. It is nothing special; most other high-quality DACs these days work similarly. It oversamples a whopping 1280 times, or about 56 MHz sampling. Consider what this oversampling does to SNR. Let's keep dividing 1280 by 2: 640, 320, 160, 80, 40, 20, 10, 5, 2.5, 1.25. Count the number of halvings, and we get ten doublings. This is an extra 30 dB on the 112 dB we have after dithering, giving an SNR of 142 dB, way over what is required when the thermal limit is considered. Each bit is worth about 6 dB, so fourteen bits give 130 dB SNR. If an SNR below 130 dB is acceptable, 12 or even 8 bits could be used, giving an SNR of 118 dB and about 94 dB, respectively. Considering the DS has an overall noise floor of 120 dB, 12 bits would be acceptable. An even better strategy would be to locate the noise floor of the recording and only transmit enough bits to reproduce the signal above that. FLAC does not compress noise well, so doing this will reduce FLAC file sizes considerably.
As an experiment, I took some 44.1/16 audio, changed it to 44.1/8 with dither and played it on my computer. During quiet passages, you could hear a faint hiss. But through my Direct Stream DAC, it is dead quiet even with my ear next to the speaker. As I said, 130 dB has a margin of safety on the best recordings, but even the roughly 94 dB of 8 bits is good.
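The doubling count and the resulting SNR budget can be tallied like this. The 112 dB and 3 dB per doubling figures come from the post. The 6 dB per bit step is an assumption (the usual rule of thumb) that matches the drop from 142 dB at 16 bits to 130 dB at 14 bits. The names are made up for the sketch.

```python
def count_halvings(factor):
    """Halve `factor` until it drops below 2, counting the steps."""
    steps = 0
    while factor >= 2:
        factor /= 2
        steps += 1
    return steps

DITHERED_16_BIT_SNR_DB = 112   # figure quoted in the post
DB_PER_DOUBLING = 3            # figure quoted in the post
DB_PER_BIT = 6                 # assumption: rule of thumb

doublings = count_halvings(1280)
snr_16_bit = DITHERED_16_BIT_SNR_DB + DB_PER_DOUBLING * doublings

def snr_for_bits(bits):
    """SNR after 1280x oversampling when fewer than 16 bits are kept."""
    return snr_16_bit - DB_PER_BIT * (16 - bits)

print("doublings:", doublings)
for bits in (16, 14, 12):
    print(f"{bits} bits: {snr_for_bits(bits)} dB")
```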
This leads us naturally to how modern audio is recorded. The exact implementation will vary a bit, but here goes. We feed the output of a microphone into one side of a comparator and the output of an integrator into the other side. The comparator outputs a one if the microphone voltage is greater than the integrator voltage; otherwise, a zero. This output is sampled at a very high frequency, say 56 MHz, and fed into the integrator, whose output voltage slowly rises if a one is present and falls if a zero is present. If the input voltage is positive, each sample will at first be one, and the integrator will slowly rise. Eventually, it will be greater than the input voltage, and a zero is output, so the voltage falls. Thus, we have a large number of zeroes and ones that are easy to convert to an analogue signal by simply using a low-pass filter like a capacitor or a high-quality transformer whose frequency response drops off at, say, about 70 kHz.
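Here is a minimal simulation of a one-bit converter of this general kind. To be clear about the assumption: it uses the common textbook first-order sigma-delta arrangement, in which the integrator accumulates the difference between the input and the fed-back bit, so the loop is wired slightly differently from the description above. The low-pass filter is a plain moving average, and every name and number is chosen just for the demonstration.

```python
import math

def sigma_delta(samples):
    """First-order one-bit modulator: inputs in [-1, 1] -> list of bits."""
    integrator = 0.0
    feedback = 0.0
    bits = []
    for x in samples:
        integrator += x - feedback           # accumulate input minus fed-back bit
        bit = 1 if integrator >= 0 else 0    # comparator decision
        feedback = 1.0 if bit else -1.0      # the bit as a +/-1 level
        bits.append(bit)
    return bits

def moving_average(bits, window):
    """Crude low-pass filter: map bits to +/-1 and average each block."""
    levels = [1.0 if b else -1.0 for b in bits]
    return [sum(levels[i:i + window]) / window
            for i in range(0, len(levels) - window + 1, window)]

n_samples = 12800
window = 256
tone = [0.5 * math.sin(2 * math.pi * n / n_samples) for n in range(n_samples)]

bits = sigma_delta(tone)
decoded = moving_average(bits, window)
reference = [sum(tone[i:i + window]) / window
             for i in range(0, n_samples, window)]
max_error = max(abs(d - r) for d, r in zip(decoded, reference))
print(f"largest difference after filtering: {max_error:.4f}")
```

The stream contains nothing but ones and zeroes, yet simply averaging it gives back the slow sine wave to within a few percent. A better filter and a higher rate do much better.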
To create the master from which the audio files are distributed, we digitally filter the one-bit stream to give eight times oversampled PCM (352.8 kHz), called DXD. Why DXD? Audio engineers want a format guaranteed to have a sampling frequency above twice any possible audio frequency, so Shannon implies exact reconstruction. They decided to make it much more than necessary. Nearly all recordings have no frequencies over 44 kHz that are not swamped by noise. A few recordings do have frequencies above 44 kHz that are not masked by noise. It is rare to come across a recording with frequencies above 88 kHz, and none, to my knowledge, are above 176 kHz. 24-bit resolution is used for the same reason.
After downsampling the audio to DXD, the resolution is more like 8 bits than 24 bits. This is where a trick called noise shaping comes in. It is explained here:
https://www.analog.com/en/technical-articles/behind-the-sigma-delta-adc-topology.html
It covers what I said previously about increasing resolution using TPDF, but goes further, discussing another type of dither, called noise-shaped dither. This type of dither does not increase SNR equally across all frequencies. Compared to TPDF dither, the SNR increases more at the lower frequencies but much less at higher frequencies. The one-bit audio, sampled at e.g. 56 MHz, records frequencies up to 28 MHz. This is far too high to be of any concern, so we can accept a horrid SNR at those frequencies in exchange for a much better SNR, equivalent to 24 bits, at the DXD frequencies.
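The idea of pushing the error up to high frequencies can be shown with a first-order error-feedback quantiser. This is a sketch under my own assumptions: it contains no dither at all, only the feedback of the previous rounding error, and it is not taken from the linked article. With this feedback the errors cancel over time, so the slow (low-frequency) part of the signal is preserved even though each sample is rounded coarsely.

```python
import math

def quantise(x):
    """Round to the nearest integer step."""
    return math.floor(x + 0.5)

def plain_requantise(samples):
    return [quantise(x) for x in samples]

def noise_shaped_requantise(samples):
    """Subtract the previous rounding error before rounding the next sample."""
    out = []
    prev_error = 0.0
    for x in samples:
        v = x - prev_error
        y = quantise(v)
        prev_error = y - v    # error made on this sample, fed to the next one
        out.append(y)
    return out

# A slow signal that always stays below half a step.
n = 1000
samples = [0.3 + 0.15 * math.sin(2 * math.pi * k / n) for k in range(n)]

plain = plain_requantise(samples)
shaped = noise_shaped_requantise(samples)

plain_drift = abs(sum(plain) - sum(samples))
shaped_drift = abs(sum(shaped) - sum(samples))
print(f"accumulated error, plain rounding: {plain_drift:.1f}")
print(f"accumulated error, noise shaped:   {shaped_drift:.3f}")
```

Plain rounding loses the signal entirely, so its accumulated error grows with every sample. The noise-shaped version never drifts by more than half a step in total, because its error is confined to rapid sample-to-sample wiggles that a later low-pass filter removes.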
Knowing this, we can complete the picture of how the DS DAC works. Everything is upsampled to 1280 times the CD sampling rate. Then it is downsampled by a factor of ten using the same noise-shaped process that created the 1-bit stream, and passed through a transformer, which removes the digital high frequencies, to give the audio output. The designer arranged it so that above about 70 kHz, the drop in the transformer's frequency response cancels the rise in noise from the one-bit converter and its noise shaper. The SNR is 120 dB up to very high frequencies.
That's basically how modern audio is recorded and played back. For those who want the ultimate fidelity, you can purchase the DXD master. But in most cases, everything is recovered by downsampling it to 176.4 kHz or 88.2 kHz. 44.1 kHz is becoming less popular because the 20 kHz filter removes actual recorded frequencies. How audible this is remains a matter of debate. But 88.2 kHz, for nearly all recordings, is enough to preserve all frequencies. Remember Shannon - provided the highest frequency is below half the sampling frequency, you get exact reproduction. Many DAC designers put a 50 kHz filter on the output to reduce noise because so few recordings have content above 50 kHz that is not masked by recording noise.
Also, now that you understand dithering, note that we could use dither on DXD and keep only 14, 12 or even 8 bits. Combined with the upsampling, it would likely be indistinguishable from the DXD master. There is a further trick that can be used. A program can be written that determines the maximum frequency in a DXD recording that is not masked by noise. A filter a bit above that frequency can then be applied. This will not affect exact reproduction, but since noise is often present at high frequencies, the final file is more amenable to compression using something like FLAC, which does not compress noise well. Indeed, the lower bits are mostly noise, so only transmitting 12 bits, for example, will also make a big difference. Some experiments I have done indicate a reduction to about the size of 44.1/16.
IMHO, this may eventually become the standard way audio is distributed.
Another issue is something audio engineers noticed. As the sampling rate is increased, the audio sounds better. Not only this, but the effect continues well into MHz sampling rates. We can only hear up to 20 kHz, so it can't be due to the reconstruction of higher frequencies. I won't go into the hypothesised reasons for this, except to note it is a phenomenon well-known to audio engineers. However, as suggested above, we get exact reconstruction on playback if the audio is produced correctly. We upsample to a high sampling rate to simulate high sampling frequencies, which is what the 1280 times upsampling in the PS Audio DAC does.
This is important. A system called MQA was devised to reduce time smear, one of the hypothesised reasons high sampling rates sound better. MQA caused a lot of heated debate in Hi-Fi circles. But IMHO, it is a non-issue in the system I described, because modern DACs have exact reproduction at a very high sampling rate - there is no time smear, simple as that.