Lossless data compression
Lossless data compression make use of data compression algorithms that allows the exact original data to be
reconstructed from the compressed data. This can be contrasted to lossy data compression, which does not
allow the exact original data to be reconstructed from the compressed data. Lossless data compression is
used in many applications. For example, it is used in the popular ZIP file format and in the Unix tool
gzip. It is also often used as a component within lossy data compression technologies.
Lossless compression is used when it is important that the original and the decompressed data be identical,
or when no assumption can be made on whether certain deviation is uncritical. Typical examples are executable
programs and source code. Some image file formats, notably PNG, use only lossless compression, while
others like TIFF and MNG may use either lossless or lossy methods. GIF uses a lossless compression method,
but most GIF implementations are incapable of representing full color, so they quantize the image (often
with dithering) to 256 or fewer colors before encoding as GIF. Color quantization is a lossy process,
but reconstructing the color image and then re-quantizing it produces no additional loss.
Lossless compression techniques
Lossless compression methods may be categorized according to the type of data they are designed to
compress. Some main types of targets for compression algorithms are text, executables, images, and sound.
Whilst, in principle, any general-purpose lossless compression algorithm (general-purpose means that
they can handle all binary input) can be used on any type of data, many are unable to achieve significant
compression on data that is not of the form that they are designed to deal with. Sound data, for instance,
cannot be compressed well with conventional text compression algorithms.
Most lossless compression programs use two different kinds of algorithms: one which generates a
statistical model for the input data, and another which maps the input data to bit strings using this
model in such a way that "probable" (e.g. frequently encountered) data will produce shorter output than
"improbable" data. Often, only the former algorithm is named, while the second is implied (through
common use, standardization etc.) or unspecified.
Statistical modelling algorithms for text (or text-like binary data such as executables) include:
- Burrows-Wheeler transform (BWT; block sorting preprocessing that makes compression more efficient)
- LZ77 (used by Deflate)
- LZW
- PPM
Encoding algorithms to produce bit sequences are:
- Huffman coding (also used by Deflate)
- Arithmetic coding
Many of these methods are implemented in open-source and proprietary tools, particularly LZW and its
variants. Some algorithms are patented in the USA and other countries and their legal usage requires
licensing by the patent holder. Because of patents on certain kinds of LZW compression, some open source
activists encouraged people to avoid using the Graphics Interchange Format (GIF) for compressing image
files in favor of Portable Network Graphics PNG, which combines the LZ77-based deflate algorithm with a
selection of domain-specific prediction filters. However, the patents on LZW have now expired.
Many of the lossless compression techniques used for text also work reasonably well for indexed images,
but there are other techniques that do not work for typical text that are useful for some images
(particularly simple bitmaps), and other techniques that take advantage of the specific characteristics
of images (such as the common phenomenon of contiguous 2-D areas of similar tones, and the fact that
colour images usually have a preponderance to a limited range of colours out of those representable in
the colour space).
As mentioned previously, lossless sound compression is a somewhat specialised area. Lossless sound
compression algorithms can take advantage of the repeating patterns shown by the wave-like nature of
the data - essentially using models to predict the "next" value and encoding the (hopefully small)
difference between the expected value and the actual data. If the difference between the predicted
and the actual data (called the "error") tends to be small, then certain difference values (like 0, +1,
-1 etc. on sample values) become very frequent, which can be exploited by encoding them in few output
bits. This is called "Delta coding".
Lossy data compression
A lossy data compression method is one where compressing data and then decompressing it retrieves data that may well be different from the original, but is "close enough" to be useful in some way. Lossy data compression is used frequently on the Internet and especially in streaming media and telephony applications. These methods are typically referred to as codecs in this context. Most lossy data compression formats suffer from generation loss: repeatedly compressing and decompressing the file will cause it to progressively lose quality. This is in contrast with lossless data compression.
Types of lossy compression
There are two basic lossy compression schemes:
- In lossy transform codecs, samples of picture or sound are taken, chopped into small segments, transformed into a new basis space, and quantized. The resulting quantized values are then entropy coded.
- In lossy predictive codecs, previous and/or subsequent decoded data is used to predict the current sound sample or image frame. The error between the predicted data and the real data, together with any extra information needed to reproduce the prediction, is then quantized and coded.
In some systems the two techniques are combined, with transform codecs being used to compress the error signals generated by the predictive stage.
Lossless vs. lossy compression
The advantage of lossy methods over lossless methods is that in some cases a lossy method can produce a much smaller compressed file than any known lossless method, while still meeting the requirements of the application.
Lossy methods are most often used for compressing sound, images or videos. The compression ratio (that is, the size of the compressed file compared to that of the uncompressed file) of lossy video codecs are nearly always far superior to those of the audio and still-image equivalents. Audio can be compressed at 10:1 with no noticeable loss of quality, video can be compressed immensely with little visible quality loss, eg 300:1. Lossily compressed still images are often compressed to 1/10th their original size, as with audio, but the quality loss is more noticeable, especially on closer inspection.
When a user acquires a lossily-compressed file, (for example, to reduce download-time) the retrieved file can be quite different from the original at the bit level while being indistinguishable to the human ear or eye for most practical purposes. Many methods focus on the idiosyncrasies of the human anatomy, taking into account, for example, that the human eye can see only certain frequencies of light. The psycho-acoustic model describes how sound can be highly compressed without degrading the perceived quality of the sound. Flaws caused by lossy compression that are noticeable to the human eye or ear are known as compression artifacts.
Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely, but nevertheless perfectly. Lossless compression is possible because most real-world data has statistical redundancy. For example, in English text, the letter 'e' is much more common than the letter 'z', and the probability that the letter 'q' will be followed by the letter 'z' is very small.
Another kind of compression, called lossy data compression, is possible if some loss of fidelity is acceptable. For example, a person viewing a picture or television video scene might not notice if some of its finest details are removed or not represented perfectly. Similarly, two clips of audio may be perceived as the same to a listener even though one is missing details found in the other. Lossy data compression algorithms introduce relatively minor differences and represent the picture, video, or audio using fewer bits.
Lossless compression schemes are reversible so that the original data can be reconstructed, while lossy schemes accept some loss of data in order to achieve higher compression. However, lossless data compression algorithms will always fail to compress some files; indeed, any compression algorithm will necessarily fail to compress any data containing no discernible patterns. Attempts to compress data that has been compressed already will therefore usually result in an expansion, as will attempts to compress encrypted data.
In practice, lossy data compression will also come to a point where compressing again does not work, although an extremely lossy algorithm, which for example always removes the last byte of a file, will always compress a file up to the point where it is empty.
©2003-2008 MaximumCompression (lossless data compression software benchmarks)