A deep unsupervised model for protein design
The design of new functional proteins can tackle many of the problems humankind faces today but has so far proven very challenging1. Analogies between protein sequences and human languages have long been noted, and we summarize their most prominent similarities here. Given the tremendous success of Natural Language Processing (NLP) methods in recent years, their application to protein research opens a fresh perspective, shifting from the current energy-function-centered paradigm to an unsupervised learning approach based entirely on sequences. To explore this opportunity further, we have pre-trained a generative language model on the entire protein sequence space. We find that our language model, ProtGPT2, effectively speaks the