07 Jan 2015 update: extending LZ4 description (thanks to Mikael Grev for a hint!)
This article will give you an overview of the performance of several general-purpose compression algorithm implementations. As it turns out, some of them can be used even when your CPU requirements are pretty strict.
In this article we will compare:

- JDK GZIP – a slow algorithm with good compression, which could be used for long term data storage. Implemented in the JDK as java.util.zip.GZIPInputStream / GZIPOutputStream.
- JDK deflate – another algorithm available in the JDK (it is used for zip files). Unlike GZIP, you can set the compression level for this algorithm, which allows you to trade compression time for output size. Available levels are 0 (store, no compression), then 1 (fastest compression) through 9 (slowest, best compression). Implemented as java.util.zip.DeflaterOutputStream / InflaterInputStream.
- Java implementation of the LZ4 compression algorithm – the fastest algorithm in this article, with a compression ratio slightly worse than the fastest deflate. I advise you to read the Wikipedia article about this algorithm to understand its usage. It is distributed under the friendly Apache License 2.0.
- Snappy – a popular compressor developed at Google, which aims to be fast while providing a reasonably good compression ratio. I have tested this implementation. It is also distributed under the Apache License 2.0.
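The two JDK codecs need no extra dependencies and can be exercised in a few lines. Here is a minimal, self-contained sketch of a compress/decompress round trip (the class name and sample data are mine, not from the article):

```java
import java.io.*;
import java.util.zip.*;

public class JdkCodecsDemo {

    // Wrap the buffer in either GZIPOutputStream or DeflaterOutputStream and compress 'input'
    static byte[] compress(byte[] input, boolean gzip, int level) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (OutputStream os = gzip
                ? new GZIPOutputStream(bos)
                : new DeflaterOutputStream(bos, new Deflater(level))) {
            os.write(input);
        }
        return bos.toByteArray();
    }

    // Mirror-image decompression via GZIPInputStream / InflaterInputStream
    static byte[] decompress(byte[] input, boolean gzip) throws IOException {
        try (InputStream is = gzip
                ? new GZIPInputStream(new ByteArrayInputStream(input))
                : new InflaterInputStream(new ByteArrayInputStream(input))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = is.read(buf)) > 0; )
                bos.write(buf, 0, n);
            return bos.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) sb.append("some fairly repetitive sample text ");
        byte[] data = sb.toString().getBytes("UTF-8");

        byte[] df1 = compress(data, false, Deflater.BEST_SPEED);       // deflate lvl=1
        byte[] df9 = compress(data, false, Deflater.BEST_COMPRESSION); // deflate lvl=9
        System.out.println(df9.length <= df1.length);                  // higher level, no bigger output
        System.out.println(java.util.Arrays.equals(data, decompress(df1, false)));
    }
}
```

LZ4 and Snappy expose the same OutputStream-based API shape, but require their respective third-party jars.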
Compression test
I had to think a little about which file set would be useful for compression testing and at the same time would be present on most Java developers’ machines (I don’t want to ask you to download hundreds of megabytes of files just to run the tests). Finally I realised that most of you have the JDK javadoc installed locally. So I decided to build a single file out of the javadoc directory by concatenating all its files. This could easily be done with tar, but not all of us are Linux users, so I have used the following class to generate such a file:
```java
import java.io.*;
import java.nio.file.Files;

public class InputGenerator {
    private static final String JAVADOC_PATH = "your_path_to_JDK/docs";
    public static final File FILE_PATH = new File( "your_output_file_path" );

    static {
        try {
            if ( !FILE_PATH.exists() )
                makeJavadocFile();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void makeJavadocFile() throws IOException {
        try ( OutputStream os = new BufferedOutputStream( new FileOutputStream( FILE_PATH ), 65536 ) ) {
            appendDir( os, new File( JAVADOC_PATH ) );
        }
        System.out.println( "Javadoc file created" );
    }

    private static void appendDir( final OutputStream os, final File root ) throws IOException {
        for ( File f : root.listFiles() ) {
            if ( f.isDirectory() )
                appendDir( os, f );
            else
                Files.copy( f.toPath(), os );
        }
    }
}
```
The total file size on my machine is 354,509,602 bytes (338 Mb).
Testing
Initially I thought about reading the whole file into RAM and compressing it there. It turned out that you can pretty easily run out of heap space on commodity 4G machines with such an approach.
Instead I decided to rely on the OS file cache. We will use JMH as the test framework. The file will be loaded into the OS cache during the warmup phase (we will run the compression test twice during warmup). We will compress into a ByteArrayOutputStream (I know it is not the fastest solution, but it is consistent across all tests and does not spend extra time writing compressed data to disk), so you still need some free RAM to keep the output in memory.
Here is the test base class. All tests differ only in the compressing output stream implementation, so each test creates its stream in a StreamFactory implementation and reuses the base class benchmark:
```java
import java.io.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@Fork(1)
@Warmup(iterations = 2)
@Measurement(iterations = 3)
@BenchmarkMode(Mode.SingleShotTime)
public class TestParent {
    protected Path m_inputFile;

    @Setup
    public void setup() {
        m_inputFile = InputGenerator.FILE_PATH.toPath();
    }

    interface StreamFactory {
        public OutputStream getStream( final OutputStream underlyingStream ) throws IOException;
    }

    public int baseBenchmark( final StreamFactory factory ) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream( (int) m_inputFile.toFile().length() );
        try ( OutputStream os = factory.getStream( bos ) ) {
            Files.copy( m_inputFile, os );
        }
        return bos.size();
    }
}
```
All tests look similar (you can find them in the source code at the end of this article), but here is an example – JDK deflate test:
```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;

public class JdkDeflateTest extends TestParent {
    @Param({"1", "2", "3", "4", "5", "6", "7", "8", "9"})
    public int m_lvl;

    @Benchmark
    public int deflate() throws IOException {
        return baseBenchmark(new StreamFactory() {
            @Override
            public OutputStream getStream(OutputStream underlyingStream) throws IOException {
                return new DeflaterOutputStream( underlyingStream, new Deflater( m_lvl, true ), 512 );
            }
        });
    }
}
```
Test results
Output file sizes
First of all let’s see the output file sizes:
| Implementation | File size (bytes) |
| --- | --- |
| GZIP | 64,214,683 |
| Snappy (normal) | 138,250,196 |
| Snappy (framed) | 101,470,113 |
| LZ4 (fast 64K) | 98,326,531 |
| LZ4 (fast 128K) | 94,403,752 |
| LZ4 (fast double 64K) | 94,478,009 |
| LZ4 (fast 32M) | 89,758,917 |
| LZ4 (fast double 32M) | 84,337,838 |
| LZ4 (fast triple 32M) | 83,426,446 |
| LZ4 (high) | 82,085,338 |
| Deflate (lvl=1) | 78,383,316 |
| Deflate (lvl=2) | 75,280,213 |
| Deflate (lvl=3) | 73,251,533 |
| Deflate (lvl=4) | 68,110,895 |
| Deflate (lvl=5) | 65,721,750 |
| Deflate (lvl=6) | 64,214,665 |
| Deflate (lvl=7) | 64,019,601 |
| Deflate (lvl=8) | 63,874,787 |
| Deflate (lvl=9) | 63,868,222 |
As you can see, the difference between the smallest and the biggest compressed files is pretty large (from 61 to 132 Mb). There are several LZ4 options in this table – I will cover them in more detail closer to the end of this article. Now let’s see how long it took each implementation to compress the data.
Compression time
| Implementation | Compression time (ms) |
| --- | --- |
| Snappy.framedOutput | 2264.700 |
| Snappy.normalOutput | 2201.120 |
| Lz4.testFastNative64K | 1075.138 |
| Lz4.testFastNative128K | 1068.932 |
| Lz4.testFastNativeDouble64K | 1261.138 |
| Lz4.testFastNative32M | 1076.141 |
| Lz4.testFastNativeDouble32M | 1230.563 |
| Lz4.testFastNativeTriple32M | 1433.068 |
| Lz4.testHighNative64K | 6812.911 |
| deflate (lvl=1) | 4522.644 |
| deflate (lvl=2) | 4726.477 |
| deflate (lvl=3) | 5081.934 |
| deflate (lvl=4) | 6739.450 |
| deflate (lvl=5) | 7896.572 |
| deflate (lvl=6) | 9783.701 |
| deflate (lvl=7) | 10731.761 |
| deflate (lvl=8) | 14760.361 |
| deflate (lvl=9) | 14878.364 |
| GZIP | 10351.887 |
Let’s merge compression time and file size into one table in order to calculate the throughput and make some conclusions.
Throughput and efficiency
| Implementation | Time (ms) | Uncompressed file size (Mb) | Throughput (Mb/sec) | Compressed file size (Mb) |
| --- | --- | --- | --- | --- |
| Snappy.normalOutput | 2201.12 | 338 | 153.56 | 131.85 |
| Snappy.framedOutput | 2264.7 | 338 | 149.25 | 96.77 |
| Lz4.testFastNative64K | 1075.138 | 338 | 314.38 | 93.77 |
| Lz4.testFastNative128K | 1068.932 | 338 | 316.20 | 90.03 |
| Lz4.testFastNativeDouble64K | 1261.138 | 338 | 268.01 | 90.10 |
| Lz4.testFastNative32M | 1076.141 | 338 | 314.09 | 85.60 |
| Lz4.testFastNativeDouble32M | 1230.563 | 338 | 274.67 | 80.43 |
| Lz4.testFastNativeTriple32M | 1433.068 | 338 | 235.86 | 79.56 |
| Lz4.testHighNative64K | 6812.9 | 338 | 49.61 | 78.28 |
| deflate (lvl=1) | 4522.644 | 338 | 74.74 | 74.75 |
| deflate (lvl=2) | 4726.477 | 338 | 71.51 | 71.79 |
| deflate (lvl=3) | 5081.934 | 338 | 66.51 | 69.86 |
| deflate (lvl=4) | 6739.45 | 338 | 50.15 | 64.96 |
| deflate (lvl=5) | 7896.572 | 338 | 42.80 | 62.68 |
| deflate (lvl=6) | 9783.701 | 338 | 34.55 | 61.24 |
| deflate (lvl=7) | 10731.761 | 338 | 31.50 | 61.05 |
| deflate (lvl=8) | 14760.361 | 338 | 22.90 | 60.92 |
| deflate (lvl=9) | 14878.364 | 338 | 22.72 | 60.91 |
| GZIP | 10351.887 | 338 | 32.65 | 61.24 |
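The throughput and compressed-size columns follow directly from the two previous tables; here is a small sketch of the arithmetic (the class name is mine):

```java
public class ThroughputCalc {

    // Throughput given the uncompressed size in Mb and the measured time in ms
    static double throughputMbPerSec(double sizeMb, double timeMs) {
        return sizeMb / (timeMs / 1000.0);
    }

    public static void main(String[] args) {
        double uncompressedMb = 338;   // the 354,509,602 byte test file, rounded as in the table
        // Snappy.normalOutput row: 2201.12 ms, 138,250,196 compressed bytes
        System.out.printf("throughput = %.2f Mb/sec%n",
                throughputMbPerSec(uncompressedMb, 2201.12));       // ~153.56
        System.out.printf("compressed = %.2f Mb%n",
                138_250_196 / (double) (1024 * 1024));              // ~131.85
    }
}
```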
Many of these implementations are pretty slow: ~23 Mb/sec for high level deflate or even ~33 Mb/sec for GZIP is not something to be happy with on a Xeon E5-2650. At the same time the fastest deflate version runs at ~75 Mb/sec, Snappy at ~150 Mb/sec and LZ4 (fast, JNI) at a truly surprising ~320 Mb/sec (actually even faster, because this time includes reading the file from the OS cache).
This table clearly shows that two implementations are not competitive at the moment: Snappy is slower than LZ4 (fast) and produces bigger files. LZ4 (high) is in turn slower than deflate levels 1 to 4 and produces a bigger output than even deflate level=1.
As a result, I would probably choose between the LZ4 (fast) JNI implementation and deflate level=1 when I need “on the fly” compression. You may have to stick to deflate if your organization does not allow 3rd party libraries. You should also consider how many spare CPU cycles you have, as well as where the compressed data is being sent. For example, if you are writing compressed data directly to an HDD, then performance above ~100 Mb/sec would not help you (provided that your file is large enough) – the HDD speed would become the bottleneck. With the same output written to a modern SSD, even LZ4 would not be too fast. If you compress your data prior to sending it over a gigabit network, you should probably use LZ4, because 75 Mb/sec of deflate performance is considerably less than the 125 Mb/sec of network throughput (yes, I know about packet headers, but the difference would still be considerable).
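The network reasoning can be sanity-checked with a back-of-envelope model. This sketch assumes a simple sequential compress-then-send flow (a pipelined sender would do better) and uses the numbers from the table above; the class name and the model are my assumptions:

```java
public class WirePlanner {

    // Total seconds to compress the payload and then push it over the link,
    // assuming a sequential compress-then-send model (no pipelining)
    static double compressAndSend(double sizeMb, double compressMbPerSec,
                                  double compressionRatio, double linkMbPerSec) {
        return sizeMb / compressMbPerSec + sizeMb * compressionRatio / linkMbPerSec;
    }

    public static void main(String[] args) {
        double sizeMb = 338;       // the test file
        double linkMbPerSec = 125; // gigabit ethernet, ignoring packet headers

        System.out.printf("raw send:     ~%.2f s%n", sizeMb / linkMbPerSec);
        // LZ4 fast 32M: ~314 Mb/sec, 85.6 Mb output
        System.out.printf("LZ4 fast:     ~%.2f s%n",
                compressAndSend(sizeMb, 314, 85.6 / 338, linkMbPerSec));
        // deflate lvl=1: ~74.7 Mb/sec, 74.75 Mb output
        System.out.printf("deflate lvl1: ~%.2f s%n",
                compressAndSend(sizeMb, 74.7, 74.75 / 338, linkMbPerSec));
    }
}
```

Even in this pessimistic sequential model LZ4 beats sending the raw data, while deflate level=1 loses to it – which matches the recommendation above.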
LZ4 compression algorithm
LZ4 is an algorithm which encodes data in frames. Each frame contains a header and compressed data. The size of the compression buffer (the amount of data which will be compressed into one frame) is an LZ4BlockOutputStream constructor argument:

```java
public LZ4BlockOutputStream(OutputStream out, int blockSize, LZ4Compressor compressor)
```
The current implementation allows block sizes between 64 bytes and 32 Mb. Obviously, the bigger the frame you use, the higher the compression ratio will be. You should keep in mind that an identically sized buffer will be allocated for decompression (this info is stored in the frame header).
As you saw above, there is very little difference in the time required to compress the data with a 64K or a 32M buffer, which means you should try using the bigger buffer in order to obtain some extra compression.
Another interesting LZ4 property (thanks to Mikael Grev for the idea) is that it makes sense to chain two LZ4BlockOutputStream-s, because subsequent blocks may contain similarly encoded data. The performance penalty is barely noticeable, but you gain extra compression (with a 32M buffer, the output was 89M for a single pass and 84M for a double pass, at a tiny cost of ~200 ms for the 89M of data produced by the first pass). It does not make much sense to make three or more passes – the compression improvement becomes very small.
At the same time, it makes more sense to double the buffer size for a single pass rather than make two passes with smaller buffers (the exception being buffers over 16M, where chaining lets you circumvent the 32M compression buffer limitation) – as you can see, you get a nearly identical file size for a double 64K pass and a single 128K pass, while a double pass will always be slower.
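Since lz4-java may not be on your classpath, here is the chaining wiring sketched with the JDK’s own streams. Note the caveat: with deflate a second pass rarely shrinks the output (deflate output is close to incompressible), so this only demonstrates the plumbing, not the LZ4-specific gain; the class name and sample data are mine:

```java
import java.io.*;
import java.util.zip.*;

public class DoublePass {

    // Two chained compressing streams: everything written here is compressed
    // twice before it reaches the underlying buffer. With lz4-java the wiring
    // would look the same (hypothetical, requires the lz4 jar):
    //   new LZ4BlockOutputStream(new LZ4BlockOutputStream(out, 32 << 20), 32 << 20)
    static byte[] compressTwice(byte[] input) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (OutputStream os = new DeflaterOutputStream(
                new DeflaterOutputStream(bos, new Deflater(1)), new Deflater(1))) {
            os.write(input);
        }
        return bos.toByteArray();
    }

    // Unwrap in the same order: two chained decompressing streams
    static byte[] decompressTwice(byte[] compressed) throws IOException {
        try (InputStream is = new InflaterInputStream(
                new InflaterInputStream(new ByteArrayInputStream(compressed)))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = is.read(buf)) > 0; )
                bos.write(buf, 0, n);
            return bos.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) sb.append("block after block of similar data ");
        byte[] data = sb.toString().getBytes("UTF-8");
        System.out.println(java.util.Arrays.equals(data, decompressTwice(compressTwice(data))));
    }
}
```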
See also
Benchmark suite for data compression library on the JVM – a comprehensive test suite for data compressors implemented in Java or accessible via JNI. It tests time and space efficiency using several sets of data to compress. Thanks to Sam Van Oort for a link!
Summary
- If you think that data compression is painfully slow, check the LZ4 (fast) implementation, which is able to compress a text file at ~320 Mb/sec – compression at such a speed should be unnoticeable for most applications. It makes sense to increase the LZ4 compression buffer size up to its 32M limit if possible (keep in mind that you will need a similarly sized buffer for decompression). You can also try chaining two LZ4BlockOutputStream-s with a 32M buffer size to get the most out of LZ4.
- If you are restricted from using 3rd party libraries or want slightly better compression, check the JDK deflate (lvl=1) codec – it was able to compress the same file at ~75 Mb/sec.
Source code
Java compression test project source code
Use the standard JMH approach to run this project:

```
mvn clean install
java -jar target/benchmarks.jar
```
The post Performance of various general compression algorithms – some of them are unbelievably fast! appeared first on Java Performance Tuning Guide.