07 Jan 2015 update: extending LZ4 description (thanks to Mikael Grev for a hint!)
This article will give you an overview of the performance of several general-purpose compression algorithm implementations. As it turns out, some of them can be used even when your CPU requirements are pretty strict.
In this article we will compare:

- JDK GZIP – a slow algorithm with good compression, which could be used for long term data storage. Implemented in the JDK as java.util.zip.GZIPInputStream / GZIPOutputStream.
- JDK deflate – another algorithm available in the JDK (it is used for zip files). Unlike GZIP, you can set the compression level for this algorithm, which allows you to trade compression time for output size. Available levels are 0 (store, no compression), then 1 (fastest compression) through 9 (slowest, best compression). Implemented as java.util.zip.DeflaterOutputStream / InflaterInputStream.
- Java implementation of the LZ4 compression algorithm – the fastest algorithm in this article, with a compression ratio slightly worse than the fastest deflate. I advise you to read the Wikipedia article about this algorithm to understand its usage. It is distributed under the friendly Apache License 2.0.
- Snappy – a popular compressor developed at Google, which aims to be fast while providing a reasonably good compression ratio. I have tested this implementation. It is also distributed under the Apache License 2.0.
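The two JDK codecs need no extra dependencies and can be exercised in a few lines. Here is a minimal, self-contained sketch of a compress/decompress round trip (the class name and sample data are mine, not from the article):

```java
import java.io.*;
import java.util.zip.*;

public class JdkCodecsDemo {

    // Wrap the buffer in either GZIPOutputStream or DeflaterOutputStream and compress 'input'
    static byte[] compress(byte[] input, boolean gzip, int level) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (OutputStream os = gzip
                ? new GZIPOutputStream(bos)
                : new DeflaterOutputStream(bos, new Deflater(level))) {
            os.write(input);
        }
        return bos.toByteArray();
    }

    // Mirror-image decompression via GZIPInputStream / InflaterInputStream
    static byte[] decompress(byte[] input, boolean gzip) throws IOException {
        try (InputStream is = gzip
                ? new GZIPInputStream(new ByteArrayInputStream(input))
                : new InflaterInputStream(new ByteArrayInputStream(input))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = is.read(buf)) > 0; )
                bos.write(buf, 0, n);
            return bos.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) sb.append("some fairly repetitive sample text ");
        byte[] data = sb.toString().getBytes("UTF-8");

        byte[] df1 = compress(data, false, Deflater.BEST_SPEED);       // deflate lvl=1
        byte[] df9 = compress(data, false, Deflater.BEST_COMPRESSION); // deflate lvl=9
        System.out.println(df9.length <= df1.length);                  // higher level, no bigger output
        System.out.println(java.util.Arrays.equals(data, decompress(df1, false)));
    }
}
```

LZ4 and Snappy expose the same OutputStream-based API shape, but require their respective third-party jars.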
Compression test
I had to think a little about which file set would be useful for compression testing and at the same time would be present on most Java developers’ machines (I don’t want to ask you to download hundreds of megabytes of files just to run the tests). Finally I realised that most of you have the JDK javadoc installed locally. So I decided to build a single file out of the javadoc directory by concatenating all its files. This could easily be done with tar, but not all of us are Linux users, so I have used the following class to generate such a file:
```java
import java.io.*;
import java.nio.file.Files;

public class InputGenerator {
    private static final String JAVADOC_PATH = "your_path_to_JDK/docs";
    public static final File FILE_PATH = new File( "your_output_file_path" );

    static {
        try {
            if ( !FILE_PATH.exists() )
                makeJavadocFile();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void makeJavadocFile() throws IOException {
        try ( OutputStream os = new BufferedOutputStream( new FileOutputStream( FILE_PATH ), 65536 ) ) {
            appendDir( os, new File( JAVADOC_PATH ) );
        }
        System.out.println( "Javadoc file created" );
    }

    private static void appendDir( final OutputStream os, final File root ) throws IOException {
        for ( File f : root.listFiles() ) {
            if ( f.isDirectory() )
                appendDir( os, f );
            else
                Files.copy( f.toPath(), os );
        }
    }
}
```
The total file size on my machine is 354,509,602 bytes (338 Mb).
Testing
Initially I thought about reading the whole file into RAM and compressing it there. It turned out that you can pretty easily run out of heap space on commodity 4G machines with such an approach.
Instead I decided to rely on the OS file cache. We will use JMH as the test framework. The file will be loaded into the OS cache during the warmup phase (we will run the compression test twice during warmup). We will compress into a ByteArrayOutputStream (I know it is not the fastest solution, but it is consistent across all tests and does not spend extra time writing compressed data to disk), so you still need some free RAM to keep the output in memory.
Here is the test base class. All tests differ only in the compressing output stream implementation, so each test creates its stream in a StreamFactory implementation and reuses the base class benchmark:
```java
import java.io.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@Fork(1)
@Warmup(iterations = 2)
@Measurement(iterations = 3)
@BenchmarkMode(Mode.SingleShotTime)
public class TestParent {
    protected Path m_inputFile;

    @Setup
    public void setup() {
        m_inputFile = InputGenerator.FILE_PATH.toPath();
    }

    interface StreamFactory {
        public OutputStream getStream( final OutputStream underlyingStream ) throws IOException;
    }

    public int baseBenchmark( final StreamFactory factory ) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream( (int) m_inputFile.toFile().length() );
        try ( OutputStream os = factory.getStream( bos ) ) {
            Files.copy( m_inputFile, os );
        }
        return bos.size();
    }
}
```
All tests look similar (you can find them in the source code at the end of this article), but here is an example – JDK deflate test:
```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;

public class JdkDeflateTest extends TestParent {
    @Param({"1", "2", "3", "4", "5", "6", "7", "8", "9"})
    public int m_lvl;

    @Benchmark
    public int deflate() throws IOException {
        return baseBenchmark(new StreamFactory() {
            @Override
            public OutputStream getStream(OutputStream underlyingStream) throws IOException {
                return new DeflaterOutputStream( underlyingStream, new Deflater( m_lvl, true ), 512 );
            }
        });
    }
}
```
Test results
Output file sizes
First of all let’s see the output file sizes:
| Implementation | File size (bytes) |
| --- | --- |
| GZIP | 64,214,683 |
| Snappy (normal) | 138,250,196 |
| Snappy (framed) | 101,470,113 |
| LZ4 (fast 64K) | 98,326,531 |
| LZ4 (fast 128K) | 94,403,752 |
| LZ4 (fast double 64K) | 94,478,009 |
| LZ4 (fast 32M) | 89,758,917 |
| LZ4 (fast double 32M) | 84,337,838 |
| LZ4 (fast triple 32M) | 83,426,446 |
| LZ4 (high) | 82,085,338 |
| Deflate (lvl=1) | 78,383,316 |
| Deflate (lvl=2) | 75,280,213 |
| Deflate (lvl=3) | 73,251,533 |
| Deflate (lvl=4) | 68,110,895 |
| Deflate (lvl=5) | 65,721,750 |
| Deflate (lvl=6) | 64,214,665 |
| Deflate (lvl=7) | 64,019,601 |
| Deflate (lvl=8) | 63,874,787 |
| Deflate (lvl=9) | 63,868,222 |
As you can see, the difference between the smallest and the biggest compressed files is pretty large (from 61 to 132 Mb). There are several LZ4 options in this table – I will cover them in more detail closer to the end of this article. Now let’s see how long it took each implementation to compress the data.
Compression time
| Implementation | Compression time (ms) |
| --- | --- |
| Snappy.framedOutput | 2264.700 |
| Snappy.normalOutput | 2201.120 |
| Lz4.testFastNative64K | 1075.138 |
| Lz4.testFastNative128K | 1068.932 |
| Lz4.testFastNativeDouble64K | 1261.138 |
| Lz4.testFastNative32M | 1076.141 |
| Lz4.testFastNativeDouble32M | 1230.563 |
| Lz4.testFastNativeTriple32M | 1433.068 |
| Lz4.testHighNative64K | 6812.911 |
| deflate (lvl=1) | 4522.644 |
| deflate (lvl=2) | 4726.477 |
| deflate (lvl=3) | 5081.934 |
| deflate (lvl=4) | 6739.450 |
| deflate (lvl=5) | 7896.572 |
| deflate (lvl=6) | 9783.701 |
| deflate (lvl=7) | 10731.761 |
| deflate (lvl=8) | 14760.361 |
| deflate (lvl=9) | 14878.364 |
| GZIP | 10351.887 |
Let’s merge compression time and file size into one table in order to calculate the throughput and make some conclusions.
Throughput and efficiency
| Implementation | Time (ms) | Uncompressed file size (Mb) | Throughput (Mb/sec) | Compressed file size (Mb) |
| --- | --- | --- | --- | --- |
| Snappy.normalOutput | 2201.12 | 338 | 153.56 | 131.85 |
| Snappy.framedOutput | 2264.7 | 338 | 149.25 | 96.77 |
| Lz4.testFastNative64K | 1075.138 | 338 | 314.38 | 93.77 |
| Lz4.testFastNative128K | 1068.932 | 338 | 316.20 | 90.03 |
| Lz4.testFastNativeDouble64K | 1261.138 | 338 | 268.01 | 90.10 |
| Lz4.testFastNative32M | 1076.141 | 338 | 314.09 | 85.60 |
| Lz4.testFastNativeDouble32M | 1230.563 | 338 | 274.67 | 80.43 |
| Lz4.testFastNativeTriple32M | 1433.068 | 338 | 235.86 | 79.56 |
| Lz4.testHighNative64K | 6812.9 | 338 | 49.61 | 78.28 |
| deflate (lvl=1) | 4522.644 | 338 | 74.74 | 74.75 |
| deflate (lvl=2) | 4726.477 | 338 | 71.51 | 71.79 |
| deflate (lvl=3) | 5081.934 | 338 | 66.51 | 69.86 |
| deflate (lvl=4) | 6739.45 | 338 | 50.15 | 64.96 |
| deflate (lvl=5) | 7896.572 | 338 | 42.80 | 62.68 |
| deflate (lvl=6) | 9783.701 | 338 | 34.55 | 61.24 |
| deflate (lvl=7) | 10731.761 | 338 | 31.50 | 61.05 |
| deflate (lvl=8) | 14760.361 | 338 | 22.90 | 60.92 |
| deflate (lvl=9) | 14878.364 | 338 | 22.72 | 60.91 |
| GZIP | 10351.887 | 338 | 32.65 | 61.24 |
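The throughput and compressed-size columns follow directly from the two previous tables; here is a small sketch of the arithmetic (the class name is mine):

```java
public class ThroughputCalc {

    // Throughput given the uncompressed size in Mb and the measured time in ms
    static double throughputMbPerSec(double sizeMb, double timeMs) {
        return sizeMb / (timeMs / 1000.0);
    }

    public static void main(String[] args) {
        double uncompressedMb = 338;   // the 354,509,602 byte test file, rounded as in the table
        // Snappy.normalOutput row: 2201.12 ms, 138,250,196 compressed bytes
        System.out.printf("throughput = %.2f Mb/sec%n",
                throughputMbPerSec(uncompressedMb, 2201.12));       // ~153.56
        System.out.printf("compressed = %.2f Mb%n",
                138_250_196 / (double) (1024 * 1024));              // ~131.85
    }
}
```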
Many of these implementations are pretty slow: ~23 Mb/sec for high level deflate or even ~33 Mb/sec for GZIP is not something to be happy with on a Xeon E5-2650. At the same time the fastest deflate version runs at ~75 Mb/sec, Snappy at ~150 Mb/sec and LZ4 (fast, JNI) at a truly surprising ~320 Mb/sec (actually even faster, because this time includes reading the file from the OS cache).
This table clearly shows that two implementations are not competitive at the moment: Snappy is slower than LZ4 (fast) and produces bigger files. LZ4 (high) is in turn slower than deflate levels 1 to 4 and produces a bigger output than even deflate level=1.
As a result, I would probably choose between the LZ4 (fast) JNI implementation and deflate level=1 when I need “on the fly” compression. You may have to stick to deflate if your organization does not allow 3rd party libraries. You should also consider how many spare CPU cycles you have, as well as where the compressed data is being sent. For example, if you are writing compressed data directly to an HDD, then performance above ~100 Mb/sec would not help you (provided that your file is large enough) – the HDD speed would become the bottleneck. With the same output written to a modern SSD, even LZ4 would not be too fast. If you compress your data prior to sending it over a gigabit network, you should probably use LZ4, because 75 Mb/sec of deflate performance is considerably less than the 125 Mb/sec of network throughput (yes, I know about packet headers, but the difference would still be considerable).
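The network reasoning can be sanity-checked with a back-of-envelope model. This sketch assumes a simple sequential compress-then-send flow (a pipelined sender would do better) and uses the numbers from the table above; the class name and the model are my assumptions:

```java
public class WirePlanner {

    // Total seconds to compress the payload and then push it over the link,
    // assuming a sequential compress-then-send model (no pipelining)
    static double compressAndSend(double sizeMb, double compressMbPerSec,
                                  double compressionRatio, double linkMbPerSec) {
        return sizeMb / compressMbPerSec + sizeMb * compressionRatio / linkMbPerSec;
    }

    public static void main(String[] args) {
        double sizeMb = 338;       // the test file
        double linkMbPerSec = 125; // gigabit ethernet, ignoring packet headers

        System.out.printf("raw send:     ~%.2f s%n", sizeMb / linkMbPerSec);
        // LZ4 fast 32M: ~314 Mb/sec, 85.6 Mb output
        System.out.printf("LZ4 fast:     ~%.2f s%n",
                compressAndSend(sizeMb, 314, 85.6 / 338, linkMbPerSec));
        // deflate lvl=1: ~74.7 Mb/sec, 74.75 Mb output
        System.out.printf("deflate lvl1: ~%.2f s%n",
                compressAndSend(sizeMb, 74.7, 74.75 / 338, linkMbPerSec));
    }
}
```

Even in this pessimistic sequential model LZ4 beats sending the raw data, while deflate level=1 loses to it – which matches the recommendation above.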
LZ4 compression algorithm
LZ4 is an algorithm which encodes data in frames. Each frame contains a header and compressed data. The size of the compression buffer (the amount of data which will be compressed into one frame) is an LZ4BlockOutputStream constructor argument:

```java
public LZ4BlockOutputStream(OutputStream out, int blockSize, LZ4Compressor compressor)
```
The current implementation allows block sizes between 64 bytes and 32 Mb. Obviously, the bigger the frame you use, the higher the compression ratio will be. You should keep in mind that an identically sized buffer will be allocated for decompression (this info is stored in the frame header).
As you saw above, there is very little difference in the time required to compress the data with a 64K or a 32M buffer, which means you should try using the bigger buffer in order to obtain some extra compression.
Another interesting LZ4 property (thanks to Mikael Grev for the idea) is that it makes sense to chain two LZ4BlockOutputStream-s, because subsequent blocks may contain similarly encoded data. The performance penalty is barely noticeable, but you gain extra compression (with a 32M buffer, the output was 89M for a single pass and 84M for a double pass, at a tiny cost of ~200 ms for the 89M of data produced by the first pass). It does not make much sense to make three or more passes – the compression improvement becomes very small.
At the same time, it makes more sense to double the buffer size for a single pass rather than make two passes with smaller buffers (the exception being buffers over 16M, where chaining lets you circumvent the 32M compression buffer limitation) – as you can see, you get a nearly identical file size for a double 64K pass and a single 128K pass, while a double pass will always be slower.
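Since lz4-java may not be on your classpath, here is the chaining wiring sketched with the JDK’s own streams. Note the caveat: with deflate a second pass rarely shrinks the output (deflate output is close to incompressible), so this only demonstrates the plumbing, not the LZ4-specific gain; the class name and sample data are mine:

```java
import java.io.*;
import java.util.zip.*;

public class DoublePass {

    // Two chained compressing streams: everything written here is compressed
    // twice before it reaches the underlying buffer. With lz4-java the wiring
    // would look the same (hypothetical, requires the lz4 jar):
    //   new LZ4BlockOutputStream(new LZ4BlockOutputStream(out, 32 << 20), 32 << 20)
    static byte[] compressTwice(byte[] input) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (OutputStream os = new DeflaterOutputStream(
                new DeflaterOutputStream(bos, new Deflater(1)), new Deflater(1))) {
            os.write(input);
        }
        return bos.toByteArray();
    }

    // Unwrap in the same order: two chained decompressing streams
    static byte[] decompressTwice(byte[] compressed) throws IOException {
        try (InputStream is = new InflaterInputStream(
                new InflaterInputStream(new ByteArrayInputStream(compressed)))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = is.read(buf)) > 0; )
                bos.write(buf, 0, n);
            return bos.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) sb.append("block after block of similar data ");
        byte[] data = sb.toString().getBytes("UTF-8");
        System.out.println(java.util.Arrays.equals(data, decompressTwice(compressTwice(data))));
    }
}
```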
See also
Benchmark suite for data compression library on the JVM – a comprehensive test suite for data compressors implemented in Java or accessible via JNI. It tests time and space efficiency using several sets of data to compress. Thanks to Sam Van Oort for a link!
Summary
- If you think that data compression is painfully slow, check the LZ4 (fast) implementation, which is able to compress a text file at ~320 Mb/sec – compression at such a speed should be unnoticeable for most applications. It makes sense to increase the LZ4 compression buffer size up to its 32M limit if possible (keep in mind that you will need a similarly sized buffer for decompression). You can also try chaining two LZ4BlockOutputStream-s with a 32M buffer size to get the most out of LZ4.
- If you are restricted from using 3rd party libraries or want slightly better compression, check the JDK deflate (lvl=1) codec – it was able to compress the same file at ~75 Mb/sec.
Source code
Java compression test project source code
Use the standard JMH approach to run this project:

```
mvn clean install
java -jar target/benchmarks.jar
```
The post Performance of various general compression algorithms – some of them are unbelievably fast! appeared first on Java Performance Tuning Guide.