Correspondence from Francois Charton.
To the place in my page where I link to this ../#Francois .
Hello,
Over the last month, I have been working on lossless audio coding, as a “pet project”. This is not my trade, but I am a heavy music listener, own a lot of recordings which I sometimes save on a hard disc, and work in a field (the processing of media audience surveys) which makes heavy use of compression algorithms.
Like many others, probably, I was surprised by the fairly low rate of compression achieved by all lossless audio compressors. This might be a theoretical limit, or a sign that something important has been missed so far. Either way, it is interesting to study. So far, I have been experimenting with “classical” techniques (Linear Prediction plus Rice coding, basically), and have been achieving results which are close to FLAC's (3-8% larger than FLAC at its maximal compression level, with much simpler algorithms). Now is when the fun part begins: either I have already hit the theoretical wall, and I will need a lot of work to get a few % improvement, or this is where to start to get much better rates.
I want to try several ideas (this is fairly sketchy, as it is not the purpose of this email; I can provide more details if you are interested):
1- Improving on Linear Predictive Coding: so far, all algorithms seem to use low order predictors, ie they make use of the correlations between one sample and the N previous ones, N being typically between 1 and 10. Thinking of it in terms of filters, LPC takes advantage of the fact that very high frequencies (over 5 kHz, say) are not very important in audio files (due, IMO, to recording constraints). Now it would be interesting to take into account lower (hearing range) frequencies: this is musical sound, after all, and there have to be higher order correlations which could be exploited.
2- Improving on variance calculations: all the statistical methods used to derive predictors select the one which minimises variance, and probably tend to overadjust to big residuals (stragglers in your terminology).
3- Rethinking stereo decorrelation: I just cannot accept that there is little to be gained from stereo decorrelation... I think the main problem comes from the fact (observed by looking at audio data) that even when the left and right channels are correlated, they tend to be out of phase; again, there is room for improvement.
4- Thinking in terms of interpolation: there are modern and efficient ways to sparsely interpolate a series of data (using sparse splines, for instance, but this is what transform coding does, implicitly). Could they be applied to audio compression?
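
A minimal sketch of the kind of low order prediction mentioned in item 1, using the fixed polynomial predictors of orders 0 to 3 (the family used by Shorten and by FLAC's fixed-predictor mode). The test signal, its parameters and the function names are assumptions made for this illustration and do not come from the experiments described in this email.

```python
import math

# Coefficients of the fixed polynomial predictors: the prediction of x[n]
# is the sum of c * x[n-1-i] over the listed coefficients.
FIXED_PREDICTORS = {
    0: [],
    1: [1],
    2: [2, -1],
    3: [3, -3, 1],
}

def residuals(samples, order):
    """Return the prediction errors for samples[order:] under a fixed predictor."""
    coeffs = FIXED_PREDICTORS[order]
    out = []
    for n in range(order, len(samples)):
        prediction = sum(c * samples[n - 1 - i] for i, c in enumerate(coeffs))
        out.append(samples[n] - prediction)
    return out

def mean_abs(values):
    """Average absolute value, a rough proxy for the cost of coding the residuals."""
    return sum(abs(v) for v in values) / len(values)

# Hypothetical test signal: a 440 Hz sine sampled at 44.1 kHz, 16-bit range.
signal = [round(20000 * math.sin(2 * math.pi * 440 * n / 44100)) for n in range(4096)]

for order in sorted(FIXED_PREDICTORS):
    print(order, round(mean_abs(residuals(signal, order)), 1))
```

On such a smooth signal the mean absolute residual shrinks as the order grows; real music is far less predictable, which is where adaptive LPC coefficients come in.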
Anyway, during my experiments with classical methods, I tried to improve on Rice coding, and noticed that most simple ideas were very counterproductive... I did try your Elias codes, which seemed to degrade the overall bitrate. Thinking about it, I believe I have an explanation of why this is so. So, here goes...
Once you have derived (through modelling) the residuals you need to encode, and provided you have eliminated all correlations between them, the best bit rate you can achieve on them only depends on their distribution. More precisely, the average length, in bits per residual, of the shortest representation you can derive is the entropy of the distribution: sum over x of { -f(x) Log2( f(x) ) }, where f(x) is the probability of value x appearing in the residuals. This bound can be approached explicitly, through arithmetic coding, at the expense of having a complex coder/decoder (arithmetic coding is hard to program when the number of possible residuals is large), and maybe some overhead. However, it is known that simpler schemes can result in lengths close to optimal, in the specific cases when the distribution has a known form. Rice, Elias, and all these schemes can be thought of as “specific implementations” of arithmetic/Huffman codes, for special distributions.
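
The entropy formula above can be evaluated directly on a block of residuals. The following is a minimal sketch; the function name and the sample values are made up for the illustration.

```python
import math
from collections import Counter

def empirical_entropy_bits(residuals):
    """Entropy, in bits per residual, of the observed value frequencies."""
    counts = Counter(residuals)
    total = len(residuals)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical residuals: frequencies 1/2, 1/4, 1/8 and 1/8.
example = [0, 0, 0, 0, 1, 1, -1, 2]
print(empirical_entropy_bits(example))
```

Note that on a short frame the empirical entropy underestimates the real cost, since the decoder would also need to be told the frequency table.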
On a specific set of residuals, their ability to compress mostly depends on whether the actual distribution of the residuals fits the implicit distribution these codes were designed to operate on.
Now, Rice codes are optimal on Laplacian-distributed residuals, ie if there is a bucket width x (2**k for the Rice code of parameter k) such that, for any n, the probability of a residual being between nx and (n+1)x is half that of being between (n-1)x and nx (more on this below). If you apply an Elias code to such a distribution, the gain for long stragglers will be offset by the fact that short (and very short) stragglers tend to appear much more often. In a nutshell, the Elias codes would be better if the residual distribution were more sharply peaked than the Laplacian, with more large stragglers (actually, if it were a gamma distribution)...
On the real life cases I have tested, the Elias codes seem to be less efficient than the Rice code, which tends to show that the actual distribution of the residuals is closer to a Laplacian than to the more peaked distribution which would suit Elias codes. As a result, the cases where Elias is less efficient than Rice (short residuals) happen too often for the improvement on longer codes to be worthwhile.
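
This comparison can be reproduced on a synthetic case. The sketch below computes the expected code length of a Rice code and of an Elias gamma code on a discretised Laplacian. Several choices here are assumptions and not taken from the email: the Elias variant (gamma), the mapping of signed residuals to non-negative integers (a "zigzag" interleaving instead of a separate sign bit), the scale a = 100, and all function names.

```python
import math

def zigzag(r):
    """Map a signed residual to a non-negative integer (0, -1, 1, -2, ... -> 0, 1, 2, 3, ...)."""
    return 2 * r if r >= 0 else -2 * r - 1

def rice_length(u, k):
    """Bits used by a Rice code of parameter k: unary quotient, stop bit, k remainder bits."""
    return (u >> k) + 1 + k

def elias_gamma_length(u):
    """Bits used by the Elias gamma code of u + 1 (the code is defined for integers >= 1)."""
    return 2 * (u + 1).bit_length() - 1

def laplacian_pmf(a, limit):
    """Discretised Laplacian of scale a, truncated to [-limit, limit] and renormalised."""
    weights = {r: math.exp(-abs(r) / a) for r in range(-limit, limit + 1)}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}

def expected_length(pmf, length_fn):
    """Average code length, in bits per residual, under the given distribution."""
    return sum(p * length_fn(zigzag(r)) for r, p in pmf.items())

pmf = laplacian_pmf(a=100.0, limit=4000)
# Best Rice parameter found by exhaustive search over k = 0..15.
rice = min(expected_length(pmf, lambda u, k=k: rice_length(u, k)) for k in range(16))
gamma = expected_length(pmf, elias_gamma_length)
entropy = sum(-p * math.log2(p) for p in pmf.values())
print(round(entropy, 2), round(rice, 2), round(gamma, 2))
```

Under these assumptions the Rice code stays close to the entropy while the Elias gamma code is several bits per residual longer, in line with the behaviour reported above.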
Now, apart from empirical observations, there are good theoretical reasons why the residual distribution should be Laplacian: the Laplacian distribution corresponds to the maximal entropy distribution (the most random one) when one knows the mean of a distribution (zero in the audio case), but not its variance (ie the volume of the sound can vary as much as one wants). If the variance were normalised (ie the model could account for volume variations), then the distribution would be a Gaussian (and both Elias and Rice codes would be less efficient). For the residuals to follow a gamma distribution, there should be more “large residuals” than normal, ie a number of “unaccounted for” stragglers, which is just what the predictor approach tries to avoid. To sum up, the better your modelling, the less efficient Elias codes will be.
This leaves us with one question: can we improve on Rice coding? In my opinion, the answer is yes, but not much, unless a breakthrough is found in sound modelling.
If the residuals do follow a Laplace distribution, we could first calculate the difference between the entropy of the Laplace distribution and the average length of the corresponding Rice code. This can be done by calculating the integral (differential) entropy of the Laplace distribution, which is:
Integral of { -f(x) Log2( f(x) ) } dx,
where f(x) = 1/(2a) exp( -|x|/a ) (Laplace distribution), which is equal to 2.44 + Log2(a) (where a can be estimated as the average absolute deviation of the sample).
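
For reference, the constant can be derived as follows, writing H for the differential entropy in bits (a symbol introduced here for convenience) and using the fact that the mean of |x| under the Laplace density is a:

```latex
\begin{aligned}
f(x) &= \frac{1}{2a}\, e^{-|x|/a}, \\
H &= -\int_{-\infty}^{\infty} f(x)\,\log_2 f(x)\,dx
   = \log_2(2a) + \frac{\log_2 e}{a}\int_{-\infty}^{\infty} |x|\, f(x)\,dx \\
  &= \log_2(2a) + \log_2 e
   = \log_2(2e) + \log_2 a \approx 2.44 + \log_2 a .
\end{aligned}
```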
The Rice code average length (for the same a), for parameter k, is 1 + k + 1 / ( 1 - exp(-2**k/a) ) bits. Tabulating it in Excel, one can find out that the best value of k is the integral part of Log2(a), and that the excess of the corresponding Rice code length over the theoretical Laplace entropy varies between 0.03 and 0.14 bits (maximum when a is a power of 2). This is because the Rice code approximates the continuous Laplace probabilities with discrete values, which is less efficient than “real entropy” coding. This could be improved by using arithmetic coding for the Laplace distribution instead of the Rice code.
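
The spreadsheet tabulation can be reproduced in a few lines. This sketch simply evaluates the two formulas given in the text (Laplace entropy of 2.44 + Log2(a), and Rice length for k equal to the integral part of Log2(a)); the function names and the sample values of a are choices made for the example.

```python
import math

LAPLACE_CONSTANT = math.log2(2 * math.e)  # about 2.44 bits

def laplace_entropy_bits(a):
    """Differential entropy of a Laplace distribution of scale a, in bits."""
    return LAPLACE_CONSTANT + math.log2(a)

def rice_expected_bits(a, k):
    """Expected Rice code length from the text: sign bit + k bits + unary part."""
    return 1 + k + 1 / (1 - math.exp(-(2 ** k) / a))

def rice_overhead_bits(a):
    """Excess of the Rice code over the entropy, with k = integral part of log2(a)."""
    k = int(math.floor(math.log2(a)))
    return rice_expected_bits(a, k) - laplace_entropy_bits(a)

for a in (64, 80, 96, 112, 127):
    print(a, round(rice_overhead_bits(a), 3))
```

The overhead is largest when a is exactly a power of 2 and smallest roughly halfway to the next power of 2.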
On a typical audio file (16 bits, with a compression ratio around 50%, ie about 8 bits per sample), this would mean an improvement of between 0.5 and 2%, small but not negligible (apparently, LA uses such a scheme, although the description they give for it is not very clear).
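
As a rough check of these figures, assuming a 16-bit file compressed to about 8 bits per sample and an excess of 0.03 to 0.14 bits per sample for the Rice code:

```latex
\frac{0.03}{8} \approx 0.4\,\%, \qquad \frac{0.14}{8} \approx 1.75\,\%
```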
A second improvement could come from the observation that audio residues are usually coded on a frame basis, ie their distribution is estimated over a small number of samples. This means that the empirical distribution of the residues will not be that close to a Laplacian (there will be a few large outliers, rather than a lot of values with small probabilities). It might therefore be a good idea to exclude a number of “bad samples” from the Rice coding. I think FLAC does something like this, by having a “Rice escape code”, which tells the decoder that the next sample is coded in clear (over 16 bits...). As the Rice escape code will be fairly long, and as the corresponding “bad samples” will be fairly rare, I am not sure this will bring a lot of improvement, but there might be some better ways to perform this hybrid coding.
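
To get a feel for the trade-off, here is a sketch of one hypothetical per-sample escape scheme: the unary part of the Rice code is capped, and a run of escape_q ones announces that the value follows in clear. This is an illustration only and is not a description of what FLAC actually does; the frame contents, the parameter values and the function names are all made up, and sign handling is left out (the values are non-negative magnitudes).

```python
def rice_bits(u, k):
    """Plain Rice code length for a non-negative integer u."""
    return (u >> k) + 1 + k

def hybrid_bits(u, k, escape_q, raw_bits=16):
    """Rice code with a capped unary part: escape_q ones announce a raw value."""
    q = u >> k
    if q >= escape_q:
        return escape_q + raw_bits  # escape marker, then the value in clear
    return q + 1 + k

def total_bits(values, length_fn):
    """Total size of a frame, in bits, under the given length function."""
    return sum(length_fn(u) for u in values)

# Hypothetical frame: mostly small magnitudes plus two large outliers.
frame = [3, 7, 1, 12, 5, 9, 2, 6, 4, 8, 3000, 11, 5, 2, 2500, 7]
k = 3

plain = total_bits(frame, lambda u: rice_bits(u, k))
hybrid = total_bits(frame, lambda u: hybrid_bits(u, k, escape_q=20))
print(plain, hybrid)
```

The gain comes entirely from the two outliers; on a frame without any, the two totals are identical, so the benefit depends on how often such samples occur.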
A last possibility (which is what I aim at) would be to find a better model, which would further reduce the randomness in the residuals (and actually make them Gaussian). My impression is that the residuals are Laplacian (as are the original samples) because:
- sound signals have an average value of zero
- their variance (volume) varies a lot over time
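
The link between a varying volume and a Laplacian shape can be illustrated numerically. It is a standard result that a zero-mean Gaussian whose variance is itself drawn from an exponential distribution follows a Laplace distribution. The sketch below compares the kurtosis (3 for a Gaussian, 6 for a Laplacian) of samples drawn with a fixed and with a varying variance. The exponential model of the volume, the sample size, the seed and the names are assumptions made for the example.

```python
import math
import random

def kurtosis(values):
    """Fourth moment divided by the squared second moment (zero mean assumed)."""
    n = len(values)
    m2 = sum(v * v for v in values) / n
    m4 = sum(v ** 4 for v in values) / n
    return m4 / (m2 * m2)

rng = random.Random(1234)
n = 200_000

# Constant "volume": plain Gaussian samples with unit variance.
fixed = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Varying "volume": the variance of each sample is first drawn from an
# exponential distribution of mean 1, then used for the Gaussian draw.
varying = [rng.gauss(0.0, math.sqrt(rng.expovariate(1.0))) for _ in range(n)]

print(round(kurtosis(fixed), 2), round(kurtosis(varying), 2))
```

If a model could track the volume and normalise it away, the residuals would move back from the second case towards the first.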
Now, LPC coders seem to have a lot of difficulty modelling the volume: the louder a recording, the worse it compresses. We clearly need models which can predict volume variations... Such models exist in speech modelling: in speech signals, high volumes are caused by “voicing” (ie vowels), whereas unvoiced signals (consonants) tend to have low volumes. Voicing seems to result from low frequency (100-1000 Hz), high amplitude components in the signal, which LPC coders cannot detect. If the predictor could model such “voiced” situations in musical files (which more or less correspond to the actual notes on the score), then the variance could be reduced, resulting in better compression; the residual distribution might evolve from a Laplace towards a Gaussian, and a better code might be used...
I hope these ramblings can be of some interest to you. Please feel free to react (or let me know it was plain silly and drop it in the appropriate electronic waste basket...).
Francois Charton