Poster in Workshop: Machine Learning in Structural Biology Workshop
FLIGHTED: Inferring Fitness Landscapes from Noisy High-Throughput Experimental Data
Vikram Sundar · Boqiang Tu · Lindsey Guan · Kevin Esvelt
Machine learning (ML) for protein design frequently requires large datasets of protein fitness generated by high-throughput experiments, and many ML models use these datasets for training, fine-tuning, and benchmarking. However, these approaches do not account for underlying experimental noise, potentially making their conclusions inaccurate. In this work, we present FLIGHTED (Fitness Landscape Inference Generated by High-Throughput Experimental Data), a Bayesian method for generating fitness landscapes with calibrated errors from noisy high-throughput experimental data. We apply FLIGHTED to datasets generated by single-step enrichment-based selection assays, such as Fluorescence-Activated Cell Sorting (FACS) and phage display, and to data from DHARMA (direct high-throughput activity recording and measurement assay), a novel high-throughput assay that ties fitness to base editing activity. Our results suggest that de-noising single-step selection data generates well-calibrated predictions and that this de-noising is sufficient to change which models perform best in benchmarking studies. Applying FLIGHTED to DHARMA provides more accurate fitness measurements with better-calibrated errors; FLIGHTED-DHARMA can be used to generate large protein fitness datasets with up to 10^6 variants. FLIGHTED can be used on any high-throughput assay and makes it easy for ML scientists to account for experimental noise when modeling protein fitness.
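To illustrate why noise in single-step enrichment data matters, the sketch below computes a log enrichment score and an approximate standard error from read counts before and after selection. It is a toy calculation and not FLIGHTED's model: it assumes independent Poisson-distributed counts and uses the standard delta-method approximation for the variance of a log ratio. The function name `log_enrichment`, the pseudocount, and the example counts are all hypothetical.

```python
import math


def log_enrichment(pre_count, post_count, pre_total, post_total, pseudocount=0.5):
    """Return (log enrichment, approximate standard error) for one variant.

    Assumption: counts are independent and Poisson-distributed, so the
    variance of log(count) is roughly 1 / count (delta method). This is a
    toy illustration of sampling noise, not the FLIGHTED model.
    """
    # A pseudocount keeps the logarithm finite when a count is zero.
    pre = pre_count + pseudocount
    post = post_count + pseudocount
    # Log ratio of post-selection frequency to pre-selection frequency.
    score = math.log(post / post_total) - math.log(pre / pre_total)
    # Sequencing totals are treated as fixed, so only the counts add variance.
    stderr = math.sqrt(1.0 / pre + 1.0 / post)
    return score, stderr


# Two hypothetical variants with the same roughly twofold enrichment,
# observed at very different sequencing depths.
low_depth = log_enrichment(5, 10, 1_000_000, 1_000_000)
high_depth = log_enrichment(500, 1000, 1_000_000, 1_000_000)

print(f"low depth:  score={low_depth[0]:.3f}, stderr={low_depth[1]:.3f}")
print(f"high depth: score={high_depth[0]:.3f}, stderr={high_depth[1]:.3f}")
```

Both variants have nearly the same enrichment score, but the low-depth variant carries a much larger uncertainty. A model trained or benchmarked on the scores alone would treat the two measurements as equally reliable, which is the kind of problem that calibrated errors are meant to address.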