Here’s a quick post about a topic I find myself revisiting every few months: how to compress large batches of files efficiently. As I perform lots of calculations on a computing cluster, I like to routinely back things up at around publication time: that way I can have access to the data at the point where my paper was submitted, for example, and come back to it at a later date.
Now, for UN*X users, a lot of people probably just use
tar like so:
tar -cf new_tarball.tar.gz file1 file2 file3
tar doesn’t offer the best compression nor speed. There are a few more modern approaches that promise to be better. One approach, for example, are parallel implementations: every computer has multiple cores/threads these days, why not use them? Back in 2015, this blog post showed a quick benchmark of how parallel implementations of common compression algorithms perform:
With some pretty dated hardware, you could still get >3x faster. The simple thing about this is the plug-and-play nature of it all; you can get
pbzip2 through your package manager (e.g.
sudo apt install pigz) or if you don’t have
sudo access, you could get the binaries from their websites or with
conda install -c bioconda pigz
Replace your regular
tar command with this:
tar -I pigz -cf tarball.tar.gz file1 file2 file3
Maybe not necessarily newer, but there are algorithms that promise to be faster and higher compression ratios than
tar bundled with most systems. These include the ones mentioned in the table above, as well as
lrzip or “long range zip” which is apparently ~2x faster than
bzip2 thanks to its LZMA algorithm. For even beefier compression, there’s ZPAQ which, according to the author of
lrzip, “extreme compression while taking forever”.
As far as my experience goes,
lrzip are as far as I’ve had to take it. Given how problem dependent compression efficiency actually is, for your routine compression I would simply recommend
lrzip thanks to its ease of use, speed, and compression.
Compressing lots of files
One scenario I find myself in often is taking many files and throwing them together - to do this quickly, I would do a first pass compressing the many files into a single, low compression tarball using one of the parallel algorithms. A second pass is done to further compress the file, where now the bottleneck isn’t grabbing many files, it’s simply to compress a single file.
So, something like this:
tar -cf single_tarball.tar file1 file2 ... file999999 lrztar single_tarball.tar
Should generate a
single_tarball.tar.lrz that is more compressed than the original, and because you’re using parallelization to do many smaller files, ensures that your high compression algorithm is most spending its time working on a single file rather than many files.