Summary of the multiple file compression benchmark tests

File type : Multiple file types (46 in total)
# of files to compress in this test : 510
Total File Size (bytes) : 316,355,757
Average File Size (bytes) : 620,305
Largest File (bytes) : 18,403,071
Smallest File (bytes) : 3,554

This test is designed to model 'real-world' performance of lossless data compressors. The test set contains a mix of different file types, chosen with the question 'What do people use archivers for the most?' in mind. The test set should contain data weighted (in both type and proportion of files in the set) by how often normal users compress such files with compression software. So, for example, there will be more .txt files than .ocx files in the set (yes, this is arbitrary). The set contains hundreds of files and has a total size of over 300 MB. The idea behind a large collection is to filter out the 'noise': a compressor might perform badly on one or two file types, but on a very large collection that will not hurt as much.

Some programs, like CCM and BZIP2, can only compress one file at a time. For these programs a single TAR file is created containing all files. The files in this TAR file are ordered alphabetically by suffix, then by name. Results of these compressors are marked with a 'Y' in the tarred column.
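The ordering described above (suffix first, then name) can be sketched in a few lines of Python. This is only an illustration, not the script actually used for the benchmark: the function names are made up, and comparing suffixes and names case-insensitively is an assumption, since the text does not say how case is handled.

```python
import os
import tarfile


def suffix_then_name_key(path):
    """Sort key: file suffix first, then the name without the suffix.

    Lowercasing both parts is an assumption; the benchmark text only says
    'alphabetically on suffix, then name'.
    """
    stem, ext = os.path.splitext(os.path.basename(path))
    return (ext.lower(), stem.lower())


def order_for_tar(paths):
    """Return the paths in the order they would be added to the TAR file."""
    return sorted(paths, key=suffix_then_name_key)


def build_tar(paths, tar_path):
    """Write all files into one uncompressed TAR file in that order."""
    with tarfile.open(tar_path, "w") as tar:
        for path in order_for_tar(paths):
            tar.add(path)
```

Grouping files with the same suffix next to each other tends to help single-stream compressors, because similar data ends up close together in the stream.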

The testset consists of the following file types :

Filetype(s)         Description                      % of total  # of files
TOC, MBX            Eudora mailboxes                      12.31          16
EXE, DLL, OCX, DRV  Executables                           10.99          35
TXT, RTF, DIC, LNG  Text files in several languages       10.21          41
BMP, TIFF           Bitmaps/TIF images                     7.88          15
LOG                 Log files                              6.34           6
HTM, PHP            HTML files                             6.13          19
DOC                 MS Word files                          6.08          30
C, CPP, PAS, DCU    Source Code                            6.00         235
MDB, CSV            Databases                              4.26           7
HLP                 Windows Help files                     4.23           7
CBF, CBG            Precompressed chess-databases          3.55           2
WAV                 Wave soundfiles                        3.45           9
XLS                 XLS Spreadsheets                       2.41          16
PDF                 Adobe Acrobat document                 1.59           6
TTF                 True Type Fonts                        1.15          15
DEF                 Virus definition files                 1.10           3
JPG, GIF            Image files                            0.53           9
CHM                 Precompressed help files               0.49           2
INI, INF            INI files                              0.42          10

Considering that this is supposed to be a 'real-world' test, I will not look for the best possible (command-line or GUI) switch combination for optimal compression, but only test a limited set of settings, as 'normal users' would. For 7-Zip, for example, this means I will use the GUI and select the Ultra compression method (which can easily be beaten with some good command-line switches); WinRAR will be tested with maximum dictionary size and solid archiving, and so on. Programs are allowed to use a maximum of 800 MB of memory and must finish the compression stage within 12 hours. The compressed size must be 50% or less of the original size for a program to be listed on MFC.
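The three listing limits above can be written down as a simple check. The sketch below is only an illustration: the function and constant names are hypothetical, and treating each limit as inclusive (exactly 800 MB or exactly 12 hours still passes) is an assumption.

```python
# Limits taken from the text above; the names are made up for this sketch.
MAX_MEMORY_MB = 800
MAX_COMPRESSION_SECONDS = 12 * 60 * 60  # 12 hours
MAX_SIZE_RATIO = 0.50                   # compressed size / original size


def is_listable(original_bytes, compressed_bytes, peak_memory_mb, compression_seconds):
    """Return True if a result stays within all three limits.

    Assumption: the limits are inclusive.
    """
    return (
        peak_memory_mb <= MAX_MEMORY_MB
        and compression_seconds <= MAX_COMPRESSION_SECONDS
        and compressed_bytes <= MAX_SIZE_RATIO * original_bytes
    )
```

With the 316,355,757-byte test set, this means an archive has to be 158,177,878 bytes or smaller to be listed.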

For my single file tests I got lots of requests to add the compression time to the tables. I did not do this, for the reasons stated in the single file summary, but I am planning to measure compression times for this multiple file test! I also decided to make this test set 'non-public', so it is harder for developers to tune their programs towards this specific test. I think this is the fairest way to get 'real-life' performance tests.

Scoring system: The program yielding the lowest compressed size is considered the best program. The most efficient (read: useful) program is determined by multiplying the compression + decompression time (in seconds) by 2 raised to a power that grows with the archive size relative to the lowest measured archive size. The lower the score, the better. The basic idea is that compressor X has the same efficiency as compressor Y if X is twice as fast as Y and the resulting archive of X is larger than that of Y by 10% of the smallest measured archive size. (Special thanks to Uwe Herklotz for getting this formula right.)

score_X = POWER(2; ((size_X / size_TOP) - 1) / 0,1) * time_X

with  score_X     efficiency score for a certain compressor X
      time_X      time elapsed by compressor X (comp + decomp time)
      size_X      archive size achieved with compressor X
      size_TOP    archive size by top archiver (smallest benchmark result)
Formula to calculate compressor efficiency based on compressed size and (de)compression time.

"0,1" (i.e. 0.1) represents 10%, and the power of 2 ensures that for each 10% worse result
(compared with the top archiver) the time is doubled, so any archiver (except the top
compressor) gets a penalty on its time. The score of the top compressor is always equal
to its time value.
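
The formula translates directly into code. The sketch below is a plain Python transcription of it and not the code used to build the result tables: the function names are hypothetical, and the three results at the end are made-up numbers, used only to show that 10% larger at half the time gives (almost exactly) the same score.

```python
def efficiency_score(size_x, size_top, time_x, step=0.1):
    """score_X = 2 ** (((size_X / size_TOP) - 1) / step) * time_X

    size_x   archive size achieved with compressor X
    size_top smallest archive size in the benchmark
    time_x   compression + decompression time of X, in seconds
    step     the "0,1" from the formula: each 10% of extra size doubles the time
    """
    return 2 ** (((size_x / size_top) - 1) / step) * time_x


def rank_by_efficiency(results):
    """Rank results given as {name: (archive_size, total_seconds)}.

    Returns (name, score) pairs, lowest (best) score first.
    """
    size_top = min(size for size, _ in results.values())
    scores = {
        name: efficiency_score(size, size_top, seconds)
        for name, (size, seconds) in results.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Made-up results, for illustration only (not benchmark data).
hypothetical = {
    "Y": (100_000_000, 100.0),  # smallest archive, so score == time
    "X": (110_000_000, 50.0),   # 10% larger, twice as fast
    "Z": (120_000_000, 40.0),   # 20% larger, time counted four times
}
for name, score in rank_by_efficiency(hypothetical):
    print(f"{name}: {score:.1f}")
```

Because the penalty is exponential, a fast but weak compressor falls behind quickly: at 20% above the smallest archive its time is already counted four times.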
