Poster in Workshop: Safe Generative AI
Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption
Leo de Castro · Antigoni Polychroniadou · Daniel Escudero
Abstract:
As large language models (LLMs) become ubiquitous, security concerns about sensitive queries grow. Because these models are complex to deploy, LLM evaluation is often outsourced to a third-party cloud, which exposes clients' queries to the external provider. These queries can contain sensitive information such as intellectual property, medical records, and proprietary data. Protecting this data while preserving the LLM's functionality is a major privacy challenge. Fully homomorphic encryption (FHE) offers a natural solution: encrypt the query and evaluate the LLM homomorphically on the cloud machine. The result remains encrypted and can be learned only by the client, who holds the secret key. Two barriers stand in the way of this solution: (1) FHE operations do not easily support LLM activation functions, and (2) FHE implementations remain too slow to evaluate an LLM in a reasonable time. In this work, we address both barriers and present a fully encrypted version of GPT-2 whose forward pass is over $150\times$ faster than the CPU baseline. This result builds on two main technical contributions. First, we present the first open-source implementation of GPU-accelerated FHE as an extension to the popular OpenFHE library, achieving a roughly $200\times$ speedup for many critical operations, including bootstrapping. Second, we present a novel and extensive experimental analysis of approximations to LLM activation functions that maintain accuracy while achieving this performance. We benchmark on the HellaSwag, LAMBADA, and ARC datasets, and our results show that the accuracy/perplexity degradation relative to "out-of-the-box" GPT-2 is minimal.
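The abstract does not spell out the activation approximations, but the standard approach for arithmetic FHE schemes such as CKKS is to replace non-polynomial activations with low-degree polynomials that can be evaluated using only encrypted additions and multiplications. The sketch below illustrates that general idea for GELU, the activation used in GPT-2. It is a minimal illustration, not the paper's method: the degree, input interval, and Chebyshev least-squares fit are assumptions chosen for demonstration.

```python
import numpy as np

def gelu(x):
    # tanh-based GELU, as used in the original GPT-2 code.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x**3)))

# Illustrative assumptions (not the paper's parameters): fit a
# degree-8 Chebyshev polynomial on [-5, 5]. An FHE scheme such as
# CKKS could then evaluate this surrogate on encrypted activations
# using additions and multiplications only.
DEGREE = 8
LO, HI = -5.0, 5.0

xs = np.linspace(LO, HI, 4001)
coeffs = np.polynomial.chebyshev.chebfit(xs, gelu(xs), DEGREE)
approx = np.polynomial.chebyshev.chebval(xs, coeffs)

print(f"degree {DEGREE} on [{LO}, {HI}]: "
      f"max abs error = {np.max(np.abs(approx - gelu(xs))):.4e}")
```

In the encrypted setting, the polynomial's degree trades multiplicative depth (and hence bootstrapping cost) against approximation accuracy, which is why the paper pairs its approximation study with end-to-end accuracy/perplexity benchmarks on HellaSwag, LAMBADA, and ARC.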