The zlib compression library is widely used to compress and decompress data. This library is utilized by popular programs like the Java Development Kit (JDK), Linux distributions, libpng, Git, and many others. Because zlib is widely adopted, the maintainer of the original version accepts only bug fixes with significant impact. This approach has resulted in the creation of several zlib forks, as there are multiple places in the code where performance improvements can be made.

zlib forks

Until recently, the Cloudflare fork of zlib (zlib-cloudflare) had the best performance for compression, and the Chromium fork (zlib-chromium) had the best performance for decompression among the four zlib forks considered:

  • zlib-madler (original zlib)
  • zlib-ng
  • zlib-cloudflare
  • zlib-chromium

As a result, no single zlib fork was fastest in both compression and decompression.

In this article, we’ll describe the work done in zlib-cloudflare to improve its decompression performance. Additionally, we compare the four zlib forks on both compression and decompression operations, using the Silesia Corpus (a collection of modern-day workloads) and a custom benchmark tool.

Motivation

The aim of improving zlib-cloudflare was to have a single version of zlib that could be used for both compression and decompression operations, regardless of the CPU architecture.

Between zlib-cloudflare and zlib-chromium, we chose zlib-cloudflare because it has an easy-to-use build system (make) compared with zlib-chromium (gn, ninja). A handcrafted build script is required to compile zlib-chromium on both Arm and x86.

Work done

We improved the performance of zlib-cloudflare by porting the decompression performance patches from zlib-chromium. With the patches merged, benchmark runs on the Silesia Corpus show that zlib-cloudflare compresses better (smaller file size) and faster than the original zlib (zlib-madler) on both Arm and x86. At compression level 6 (the default) on Arm, we see:

  • zlib-cloudflare is on average 90 percent faster than zlib-madler in compression operations.
  • zlib-cloudflare is on average 52 percent faster than zlib-madler in decompression operations.

With these changes, zlib-cloudflare is now the best performing zlib fork for compression and decompression operations on both Arm and x86 systems.

Improvements

We selected the relevant patches from zlib-chromium and sent the pull request to the zlib-cloudflare repo. The patches are:

The improvements in zlib-cloudflare come primarily from Arm NEON intrinsics, x86 SSE intrinsics, and loads and stores that move more than one byte at a time. They are applied in the Adler-32 checksum and in inflate_fast().
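The SIMD code from the ported patches is not reproduced here. As a rough illustration of what the Adler-32 checksum computes, the following is a minimal portable sketch based on the Adler-32 definition in RFC 1950. The function name adler32_scalar and the macro names are hypothetical and do not come from any of the forks. The inner loop consumes one byte per iteration and carries two running sums, which is the work the NEON and SSE versions spread across several bytes at once.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u /* largest prime smaller than 2^16 */
#define ADLER_NMAX 5552u  /* bytes that can be summed before 32-bit overflow */

/* Portable scalar Adler-32 (illustrative sketch, not the patched code).
 * `adler` is the running checksum; pass 1 for the first call. */
uint32_t adler32_scalar(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t a = adler & 0xffffu;         /* sum of bytes, plus 1 */
    uint32_t b = (adler >> 16) & 0xffffu; /* sum of the running `a` values */

    while (len > 0) {
        /* Defer the expensive modulo: reduce once per block, not per byte. */
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= n;
        while (n--) {
            a += *buf++; /* one byte per iteration: the hot spot SIMD targets */
            b += a;
        }
        a %= ADLER_BASE;
        b %= ADLER_BASE;
    }
    return (b << 16) | a;
}
```

Because the checksum is passed in and returned, a caller can feed data in pieces: checksumming "Wiki" and then "pedia" gives the same result as checksumming "Wikipedia" in one call.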

Benchmark

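We created a benchmark using an example implementation of zlib by Mark Adler (original author of zlib). It has been modified to keep running until more than 100 million bytes of data have been processed in each run (see the main loop in the appendix), and it was tested with a wide variety of workloads in the Silesia Corpus.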

We measure the throughput (MB/s) and compression ratio for levels 0 to 9 and report three numbers (compression level, throughput, compression ratio).
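The article does not show how the benchmark tool derives these numbers, so the following is only a sketch of one plausible set of definitions. It assumes 1 MB = 1,000,000 bytes, takes the compression ratio as original size divided by compressed size, and reads "N percent faster" as the relative gain in throughput. All three function names are hypothetical.

```c
#include <stddef.h>

/* Throughput in MB/s, assuming 1 MB = 1,000,000 bytes. */
double throughput_mb_per_s(size_t bytes_processed, double elapsed_seconds)
{
    if (elapsed_seconds <= 0.0)
        return 0.0; /* avoid dividing by zero on a too-short run */
    return ((double)bytes_processed / 1e6) / elapsed_seconds;
}

/* Compression ratio, assumed here to be original size / compressed size,
 * so a larger value means a smaller output. */
double compression_ratio(size_t original_bytes, size_t compressed_bytes)
{
    if (compressed_bytes == 0)
        return 0.0;
    return (double)original_bytes / (double)compressed_bytes;
}

/* "Percent faster", assumed here to mean the relative throughput gain
 * of a candidate over a baseline. */
double percent_faster(double candidate_mb_per_s, double baseline_mb_per_s)
{
    if (baseline_mb_per_s <= 0.0)
        return 0.0;
    return (candidate_mb_per_s / baseline_mb_per_s - 1.0) * 100.0;
}
```

Under these assumptions, a library that is 90 percent faster than a baseline has 1.9 times the baseline's throughput. For example, with purely hypothetical figures, 38 MB/s against a 20 MB/s baseline is 90 percent faster.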

After porting the patches to zlib-cloudflare, we ran the benchmark with zlib-cloudflare and compared the results against zlib-madler on both M6g (Arm) and M5 (Intel).

Comparing compression performance between zlib-cloudflare and zlib-madler, we see that zlib-cloudflare compresses better and faster than zlib-madler for the corresponding compression level on M6g.

Dickens (text file, higher and to the right is better; compression level is the number above the graph points):

Graph illustrating Dickens throughput to compression ratio.

The other workloads in the Silesia Corpus show similar trends.

Arm throughput (MB/s, higher is better):

  • At level 6, zlib-cloudflare is, on average, 90 percent faster in compression operations than zlib-madler.
    Bar graph illustrating zlib-cloudflare compression operations.
  • zlib-cloudflare is, on average, 52 percent faster in decompression operations than zlib-madler.
    Graph illustrating the zlib-cloudflare decompression operations.
  • On M5 (x86), we see compression and decompression performance similar to M6g (Arm).

Conclusion

With these changes, we were able to make a platform-agnostic library that is easy to use and integrate. Overall, we see zlib-cloudflare compressing better and faster when compared to zlib-madler (original zlib). We recommend trying zlib-cloudflare for compression/decompression needs on Arm and x86 platforms.

Appendix

Benchmark tool main loop:

for (i = 0; i < runs; i++) {
    int counter = 0;
    long processed = 0;
    long process = 100000000L;

    start = get_clock_time();
    while (1) {
        if (action & DEFLATE) {
            run_deflate(inflated, deflated, comp_level_start);
            processed += in_size;
        }
        if (action & INFLATE) {
            run_inflate(deflated, inflated);
            processed += out_size;
        }
        counter++;
        if (processed > process)
            break;
    }
    end = get_clock_time();

    set_metrics((end - start), in_size, out_size, processed, comp_level_start, &m);
    print_metrics(&m, counter);
}
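The loop above calls get_clock_time(), whose implementation is not shown. As a sketch of one way such a timer could be written on a POSIX system, the hypothetical helper below reads the monotonic clock with clock_gettime(). The name monotonic_seconds and the choice to return seconds as a double are assumptions, not the tool's actual interface.

```c
#define _POSIX_C_SOURCE 200809L /* expose clock_gettime() in strict -std= modes */
#include <time.h>

/* Seconds elapsed since an arbitrary fixed point, read from the POSIX
 * monotonic clock. Returns -1.0 if the clock cannot be read.
 * Hypothetical stand-in for the article's get_clock_time(). */
double monotonic_seconds(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}
```

A monotonic clock is generally preferable to wall-clock time for benchmarking, because it is not affected by adjustments to the system time while a run is in progress.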