Poster in Workshop on Machine Learning and Compression
Unified Lookup Tables: Privacy-Preserving Foundation Models
Nikita Janakarajan · Irina Morales · Marvin Alberts · Andrea Giovannini · Matteo Manica · Antonio Foncubierta-Rodriguez
Abstract:
Transformers, despite their success in a variety of sequence modeling tasks, have a significant limitation: they are inherently data-greedy, which can lead to overfitting when data are scarce. In such cases, common practice is to build a Foundation Model (FM), a model trained on large amounts of publicly available data, that can then be fine-tuned for a specific task. Another known problem of FMs is training data leakage: it has been demonstrated that excerpts of the training data can be recovered from an FM through prompt engineering, which poses a high risk of exposing confidential data. In this paper we propose Unified Lookup Tables (ULTs), a data pre-processing step for building and fine-tuning FMs in a privacy-preserving manner, which simultaneously enables the reuse of a trained model on new datasets without exposing any training data. The method relies on data compression methods as efficient modality tokenizers and on a common representation vocabulary shared across all datasets. We evaluate the effect of using ULTs with a text compression mechanism when training both decoder-only and encoder-decoder language models. Results show that, across different data domains, the evaluation loss decreases as consistently with ULTs as with raw data, showing that the transformation does not negatively affect model training. Moreover, we evaluate the performance of adopting ULTs on natural language and show that the resulting model exhibits an average relative increase of $\sim$16\% on a collection of text metrics. Experiments using ULTs as a pre-processing step on chemical reaction data for the task of forward prediction show no significant performance degradation with respect to training on traditional SMILES strings. Finally, we test the privacy-preserving capabilities of the ULT pre-processing in a realistic setting with chemical reaction data, a field in which confidentiality is a key factor and a major intellectual property concern. Our experiments show that a ULT cannot be attacked without access to the entire dataset from which it is built: even if partially correct mappings of the ULT are generated by mixing datasets, an attacker cannot successfully decode the data. Code to reproduce the experiments is available at: https://link-redacted-for-anonymous-review.
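To make the idea concrete, here is a minimal, hypothetical Python sketch of a compression-based tokenizer combined with a private shared lookup table; it is not the authors' implementation, and all function names and the choice of zlib as the compressor are illustrative assumptions. It only shows how compressed bytes can be re-mapped through a secret table so that token sequences are undecodable without that table.

# Hypothetical sketch (not the paper's code): compress each record, then
# translate the compressed bytes through a private, shuffled lookup table
# that is shared across datasets and kept by the data owner.
import random
import zlib

def build_unified_lookup_table(seed: int) -> dict[int, int]:
    """Map every possible compressed byte value (0-255) to a shuffled token id."""
    rng = random.Random(seed)
    token_ids = list(range(256))
    rng.shuffle(token_ids)
    return {byte: tok for byte, tok in enumerate(token_ids)}

def encode_record(record: str, table: dict[int, int]) -> list[int]:
    """Compress a record and re-map each compressed byte via the lookup table."""
    compressed = zlib.compress(record.encode("utf-8"))
    return [table[b] for b in compressed]

def decode_record(tokens: list[int], table: dict[int, int]) -> str:
    """Invert the table and decompress; infeasible without the full table."""
    inverse = {tok: byte for byte, tok in table.items()}
    raw = bytes(inverse[t] for t in tokens)
    return zlib.decompress(raw).decode("utf-8")

if __name__ == "__main__":
    table = build_unified_lookup_table(seed=42)   # kept private by the data owner
    tokens = encode_record("CCO>>CC=O", table)    # e.g. a SMILES reaction string
    print(tokens[:10])
    print(decode_record(tokens, table))

In this sketch the model only ever sees the re-mapped token ids, so the same trained model can be reused on any dataset encoded with the same table, while an attacker holding only the token sequences lacks the mapping needed to decompress them.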