Poster in Workshop: Machine Learning in Structural Biology Workshop
HiFi-NN annotates the microbial dark matter with Enzyme Commission numbers
Gavin Ayres
The accurate computational annotation of protein sequences with enzymatic function, especially those that are part of the functional and taxonomic dark matter, remains a fundamental challenge in bioinformatics. Here, we present HiFi-NN (Hierarchically-Finetuned Nearest Neighbor search), which annotates protein sequences to the 4th level of the Enzyme Commission (EC) number with greater precision and recall than all existing deep learning methods. HiFi-NN is a hierarchically-finetuned deep learning method that combines semi-supervised representation learning with a nearest neighbor classifier. Furthermore, we show that this method can correctly identify the EC number of a given sequence at sequence identities below 40%, where the current state-of-the-art annotation tool, BLASTp, cannot. We then improve the learned representations by increasing the diversity of the training set, not just in sequence space but also in terms of the environments from which the sequences were sampled. Finally, we use HiFi-NN to annotate a portion of the microbial dark matter sequences in the MGnify database.
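To make the idea of nearest neighbor annotation over a label hierarchy concrete, the following is a minimal sketch and not the HiFi-NN implementation. It assumes each protein has already been mapped to an embedding vector by some learned model; the toy 3-dimensional vectors, the names `TOY_REFERENCE` and `annotate`, the cosine similarity measure, and the majority-vote rule that backs off to a shallower EC level when neighbors disagree are all illustrative assumptions, not details taken from the abstract.

```python
import math
from collections import Counter

# Hypothetical reference set: (embedding, EC number) pairs.
# In practice the embeddings would come from a trained representation model.
TOY_REFERENCE = [
    ([1.0, 0.0, 0.0], "1.1.1.1"),
    ([0.9, 0.1, 0.0], "1.1.1.1"),
    ([0.8, 0.2, 0.1], "1.1.1.2"),
    ([0.0, 1.0, 0.0], "3.4.21.4"),
    ([0.0, 0.9, 0.2], "3.4.21.5"),
]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ec_prefix(ec, level):
    """First `level` fields of an EC number, e.g. ('3.4.21.4', 2) -> '3.4'."""
    return ".".join(ec.split(".")[:level])


def annotate(query, reference, k=3):
    """Predict the deepest EC prefix that a strict majority of neighbors share.

    Assumed rule for illustration: walk the EC hierarchy from level 1 to 4,
    and stop as soon as no single label wins more than half of the votes
    among the neighbors that agree with the prediction so far.
    """
    ranked = sorted(reference, key=lambda item: cosine(query, item[0]), reverse=True)
    neighbor_labels = [ec for _, ec in ranked[:k]]

    prediction = ""
    for level in range(1, 5):
        candidates = [ec for ec in neighbor_labels
                      if ec_prefix(ec, level - 1) == prediction]
        votes = Counter(ec_prefix(ec, level) for ec in candidates)
        best, count = votes.most_common(1)[0]
        if 2 * count <= len(candidates):
            break  # no strict majority at this level: keep the shallower label
        prediction = best
    return prediction


if __name__ == "__main__":
    print(annotate([0.95, 0.05, 0.0], TOY_REFERENCE, k=3))
    print(annotate([0.0, 0.95, 0.1], TOY_REFERENCE, k=2))
```

In this toy setting, a query whose neighbors mostly agree receives a full four-level label, while a query whose two neighbors differ only in the last field is annotated to the third level. Backing off in this way is one simple option for handling neighbor disagreement; the abstract does not say how HiFi-NN does so.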