Francois Charton on lossless audio compression

Correspondence from Francois Charton. To the place in my page where I link to this: ../#Francois

Hello,

Over the last month, I have been working on lossless audio coding, as a “pet project”. This is not my trade, but I am a heavy music listener, I own a lot of recordings which I sometimes save on a hard disc, and I work in a field (the processing of media audience surveys) which makes heavy use of compression algorithms.

Like many others, probably, I was surprised by the fairly low compression ratios achieved by all lossless audio compressors. This might reflect a theoretical limit, or something important may have been missed so far. Either way, it is interesting to study. So far, I have been experimenting with “classical” techniques (linear prediction plus Rice coding, basically), and have been achieving results which are close to FLAC (files 3-8% larger than FLAC at its maximal compression level, with much simpler algorithms). Now the fun part begins: either I have already hit the theoretical wall, and I will need a lot of work to get a few % of improvement, or this is the point from which much better rates can be reached.

I want to try several ideas (this is fairly sketchy, as it is not the purpose of this email; I can provide more details if you are interested):

1- Improving on Linear Predictive Coding: so far, all algorithms seem to use low-order predictors, i.e. they make use of the correlations between one sample and the N previous ones, N being typically between 1 and 10. Thinking of it in terms of filters, LPC takes advantage of the fact that very high frequencies (over 5 kHz, say) are not very important in audio files (due, IMO, to recording constraints). Now it would be interesting to take into account lower (hearing range) frequencies: this is musical sound, after all, and there have to be higher-order correlations which could be exploited.

2- Improving on variance calculations: all the statistical methods used to derive predictors select the one which minimises variance, and probably tend to overadjust to big residuals (stragglers in your terminology).

3- Rethinking stereo decorrelation: I just cannot accept that there is little to be gained from stereo decorrelation... I think the main problem comes from the fact (observed by looking at audio data) that even when the left and right channels are correlated, they tend to be out of phase. Again, there is room for improvement.

4- Thinking in terms of interpolation: there are modern and efficient ways to sparsely interpolate a series of data (using sparse splines, for instance, but this is what transform coding does, implicitly). Could they be applied to audio compression?
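To make the "low order predictor" of item 1 concrete, here is a minimal sketch of the fixed polynomial predictors of order 0 to 2, which amount to repeated differencing. It is only an illustration: the function names and the synthetic 440 Hz test tone are assumptions, not something taken from the experiments described in this letter.

```python
import math

def fixed_predictor_residuals(samples, order=2):
    """Residuals of the fixed polynomial predictor of the given order.

    order 0: e[n] = x[n]
    order 1: e[n] = x[n] - x[n-1]
    order 2: e[n] = x[n] - 2*x[n-1] + x[n-2]
    Samples before the start of the signal are taken as zero.
    """
    residuals = list(samples)
    for _ in range(order):          # each pass is one first-order difference
        previous = 0
        differenced = []
        for value in residuals:
            differenced.append(value - previous)
            previous = value
        residuals = differenced
    return residuals

def mean_abs(values):
    """Average absolute value, a rough measure of residual size."""
    return sum(abs(v) for v in values) / len(values)

# Hypothetical test signal: a 440 Hz tone at 44.1 kHz, on a 16-bit scale.
signal = [round(10000 * math.sin(2 * math.pi * 440 * n / 44100))
          for n in range(4096)]

for order in range(3):
    size = mean_abs(fixed_predictor_residuals(signal, order))
    print(f"order {order}: mean |residual| = {size:.1f}")
```

On such a smooth, low-frequency signal the residuals shrink quickly as the order grows; real music is of course much less cooperative.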

Anyway, during my experiments with classical methods, I tried to improve on Rice coding, and noticed that most simple ideas were very counterproductive... I did try your Elias codes, which seemed to degrade the overall bitrate. Thinking about it, I believe I have an explanation of why this is so. So, here goes...

Once you have derived (through modelling) the residuals you need to encode, and provided you have eliminated all correlations between them, the best bit rate you can achieve on them depends only on their distribution. More precisely, the average length per residual of the shortest representation you can derive is the entropy of the distribution: sum over x of { -f(x) Log2( f(x) ) }, where f(x) is the probability of value x appearing in the residuals. This can be approached explicitly, through arithmetic coding, at the expense of having a complex coder/decoder (arithmetic coding is hard to program when the number of possible residuals is large), and maybe some overhead. However, it is known that simpler schemes can result in lengths close to optimal, in the specific cases when the distribution has a known form. Rice, Elias, and all these schemes can be thought of as “specific implementations” of arithmetic/Huffman codes, for special distributions.
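As a small illustration of the entropy formula above, the following sketch estimates f(x) from the observed frequencies in a list of integer residuals and returns the entropy in bits per residual. The function name is an assumption, and on a short frame the empirical estimate will understate the true entropy of the source.

```python
import math
from collections import Counter

def empirical_entropy_bits(residuals):
    """Entropy, in bits per residual, of the observed value frequencies.

    Implements sum over x of -f(x) * log2(f(x)), with f(x) estimated as
    (number of occurrences of x) / (number of residuals).
    """
    counts = Counter(residuals)
    total = len(residuals)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Four equally likely values need 2 bits each; a constant needs none.
print(empirical_entropy_bits([0, 1, 2, 3]))
print(empirical_entropy_bits([7, 7, 7, 7]))
```

Multiplying the result by the number of residuals gives a lower bound on the coded size of the frame, ignoring the cost of describing the distribution itself.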

On a specific set of residuals, their ability to compress mostly depends on whether the actual distribution of the residuals fits the implicit distribution these codes were designed to operate on.

Now, Rice codes are optimal on Laplacian-distributed residuals (more on this below), i.e. if, for a suitable x and any n, the probability of a residual being between nx and (n+1)x is half that of being between (n-1)x and nx. If you apply an Elias code to such a distribution, the gain for long stragglers will be offset by the fact that short (and very short) stragglers tend to appear much more often. In a nutshell, the Elias codes would be better if the residual distribution fell more sharply than the Laplacian does (actually, if it were a gamma distribution)...

On the real-life cases I have tested, the Elias codes seem to be less efficient than the Rice code, which suggests that the actual distribution of the residuals is closer to a Laplacian than to the more peaked distribution implied by an Elias code. As a result, the cases where Elias is less efficient than Rice happen too often for the improvement on longer codes to be worthwhile.
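The comparison can be reproduced on synthetic data. The sketch below is an assumption-laden illustration, not the test actually run for this letter: it draws Laplacian residuals with scale a = 50, and compares the average length of a Rice code (sign bit, k low bits, unary quotient) with that of a plain Elias gamma code applied to |x| + 1 plus a sign bit. The Elias variants on your page may differ from plain gamma.

```python
import math
import random

def rice_length(x, k):
    """Bits for signed x: 1 sign bit + k low bits + unary quotient (q + 1 bits)."""
    return 1 + k + (abs(x) >> k) + 1

def elias_gamma_length(x):
    """Bits for signed x: 1 sign bit + Elias gamma code of |x| + 1."""
    n = abs(x) + 1                       # gamma codes start at 1, not 0
    return 1 + 2 * (n.bit_length() - 1) + 1

# Synthetic Laplacian residuals with scale a (an assumption, not real audio).
a = 50.0
rng = random.Random(0)                   # fixed seed for repeatability
residuals = []
for _ in range(20000):
    magnitude = -a * math.log(1.0 - rng.random())   # exponential magnitude
    sign = 1 if rng.random() < 0.5 else -1
    residuals.append(sign * round(magnitude))

k = int(math.log2(a))                    # Rice parameter: integer part of log2(a)
rice_avg = sum(rice_length(x, k) for x in residuals) / len(residuals)
elias_avg = sum(elias_gamma_length(x) for x in residuals) / len(residuals)
print(f"Rice (k={k}): {rice_avg:.2f} bits, Elias gamma: {elias_avg:.2f} bits")
```

On Laplacian data of this kind the Rice code should come out clearly shorter on average, because the gamma code spends two bits per doubling of magnitude on the many mid-sized residuals.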

Now, apart from empirical observations, there are good theoretical reasons why the residual distribution should be Laplacian: the Laplacian distribution corresponds to the maximal entropy distribution (the most random one) when one knows the mean of a distribution (zero in the audio case), but not its variance (i.e. the volume of the sound can vary as much as one wants). If the variance were normalised (i.e. if the model could account for volume variations), then the distribution would be a Gaussian (and both Elias and Rice codes would be less efficient). For the residuals to follow a gamma distribution, there would have to be more “large residuals” than normal, i.e. a number of “unaccounted for” stragglers, which is just what the predictor approach tries to avoid. To sum up, the better your modelling, the less efficient Elias codes will be.

This leaves us with one question: can we improve on Rice coding? In my opinion, the answer is yes, but not by much, unless a breakthrough is found in sound modelling.

If the residuals do follow a Laplace distribution, we could first calculate the difference between the entropy of the Laplace distribution and the average length of the corresponding Rice code. This can be done by calculating the integral (differential) entropy of the Laplace distribution, which is:

-Integral of f(x) Log2( f(x) ) dx, where f(x) = 1/(2a) exp( -|x|/a ) (Laplace distribution), which is equal to Log2(2ea) = 2.44 + Log2(a) (where a can be estimated as the average absolute deviation of the sample)

The expected length of the Rice code (for the same a), for parameter k, is 1 + k + 1/( 1 - exp(-2**k/a) ). Tabulating it in Excel, one finds that the best value of k is the integer part of Log2(a), and that the excess of the corresponding Rice code length over the theoretical Laplace entropy varies between 0.03 and 0.14 bits (maximum when a is a power of 2). This is because the Rice code approximates the continuous Laplace probabilities with discrete values, which is less efficient than “real entropy” coding. This could be improved by using arithmetic coding for the Laplace distribution instead of the Rice code.
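The tabulation is easy to redo outside a spreadsheet. The sketch below simply evaluates the two expressions given above, Log2(2ea) for the Laplace entropy and 1 + k + 1/(1 - exp(-2**k/a)) for the Rice code, with k taken as the integer part of Log2(a). The function names and the sample values of a are assumptions chosen for illustration.

```python
import math

def laplace_entropy_bits(a):
    """Differential entropy of a Laplace distribution of scale a: log2(2*e*a)."""
    return math.log2(2 * math.e * a)

def rice_expected_bits(a, k):
    """Expected Rice code length: sign bit + k low bits + unary quotient."""
    return 1 + k + 1 / (1 - math.exp(-(2 ** k) / a))

# Walk a across one octave, from one power of 2 to the next.
for a in (16, 20, 24, 28, 32):
    k = int(math.log2(a))               # integer part of log2(a)
    excess = rice_expected_bits(a, k) - laplace_entropy_bits(a)
    print(f"a={a:2d}  k={k}  excess={excess:.3f} bits")
```

Changing the line that sets k makes it easy to try neighbouring values of the Rice parameter for a given a.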

On a typical audio file (16 bits, with a compression ratio around 50%), this would mean an improvement of between 0.5 and 2%, small but not negligible (apparently, LA uses such a scheme, although the description they give of it is not very clear).

A second improvement could come from the observation that audio residues are usually coded on a frame basis, i.e. estimated over a small number of samples. This means that the empirical distribution of the residues will not be that close to a Laplacian (there will be a few large outliers, rather than many low-probability values). It might therefore be a good idea to exclude a number of “bad samples” from the Rice coding. I think FLAC does something like this, by having a “Rice escape code”, which tells the decoder that the next sample is coded in the clear (over 16 bits...). As the Rice escape code will be fairly long, and as the corresponding “bad samples” will be fairly rare, I am not sure this will bring a lot of improvement, but there might be some better ways to perform this hybrid coding.

A last possibility (which is what I aim at) would be to find a better model, which would further reduce the randomness in the residuals (and actually make them Gaussian). My impression is that the residuals are Laplacian (as are the original samples) because:

- sound signals have an average value of zero

- their variance (volume) varies a lot over time

Now, LPC coders seem to have a lot of difficulty modelling the volume: the louder a recording, the worse it compresses. We clearly need models which can predict volume variations... Such models exist in speech modelling: in speech signals, high volumes are caused by “voicing” (i.e. vowels), whereas unvoiced signals (consonants) tend to have low volumes. Voicing seems to result from low-frequency (100-1000 Hz), high-amplitude components in the signal, which LPC coders cannot detect. If the predictor could model such “voiced” situations in musical files (which more or less correspond to the actual notes on the score), then the variance could be reduced, resulting in better compression; the residual distribution might evolve from a Laplacian towards a Gaussian, and a better code might be used...

I hope these ramblings can be of some interest to you. Please feel free to react (or let me know it was plain silly and drop it in the appropriate electronic waste basket...).

Francois Charton