Poster in Workshop: Deep Generative Models and Downstream Applications
Your Dataset is a Multiset and You Should Compress it Like One
Daniel Severo · James Townsend · Ashish Khisti · Alireza Makhzani · Karen Ullrich
Abstract:
Neural Compressors (NCs) are codecs that leverage neural networks and entropy coding to achieve competitive compression performance for images, audio, and other data types. These compressors exploit parallel hardware and are particularly well suited to compressing i.i.d. batches of data. The average number of bits needed to represent each example is at least the well-known cross-entropy. However, the cross-entropy bound assumes the order of the compressed examples in a batch is preserved, which in many applications is not necessary. The number of bits used to implicitly store the order information is the logarithm of the number of unique permutations of the dataset. In this work, we present a method that reduces the bitrate of any codec by exactly the number of bits needed to store the order, at the expense of shuffling the dataset in the process. Conceptually, our method applies bits-back coding to a latent variable model with observed symbol counts (i.e., a multiset) and a latent permutation defining the ordering, and does not require retraining any models. We present experiments with both lossy off-the-shelf codecs (WebP) and lossless NCs. On Binarized MNIST, lossless NCs achieved savings of up to $7.6\%$ while adding only $10\%$ extra compute time.
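To make the quantity in the abstract concrete, the sketch below computes the ordering information of a dataset: the base-2 logarithm of the number of distinct permutations of the dataset viewed as a multiset. This is only an illustration of the bound being discussed, not the paper's released code; the function name and the 60,000-example figure are assumptions chosen for the example.

```python
import math
from collections import Counter

def order_information_bits(dataset):
    """Bits spent implicitly on ordering: log2 of the number of distinct
    permutations of the dataset viewed as a multiset, i.e.
    log2(n! / (m_1! * ... * m_k!)) for element multiplicities m_i."""
    n = len(dataset)
    counts = Counter(dataset)
    # Work in log-space via lgamma to avoid computing huge factorials.
    log_perms = math.lgamma(n + 1) - sum(math.lgamma(m + 1) for m in counts.values())
    return log_perms / math.log(2)

# Example: 60,000 pairwise-distinct examples (roughly the size of the
# Binarized MNIST training set) carry log2(60000!) bits of ordering
# information, i.e. about 14.4 bits per example of potential savings.
print(order_information_bits(range(60000)) / 60000)
```

This per-dataset quantity is exactly what the proposed method recovers: the reported bitrate reduction equals the ordering information, paid for by returning the examples in an arbitrary (shuffled) order.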