Head of Discovery Science UK, Charles River Laboratories, United Kingdom
The design and production of proteins is one of the foundational building blocks of modern drug discovery. The current Design-Make-Test-Analyse (DMTA) cycle for recombinant protein generation is empirical and iterative, and it can be extremely time- and resource-intensive. This is especially true as both industry and academic laboratories move into less well-known regions of the human proteome. Recent advances in machine learning (ML) and artificial intelligence (AI) offer an opportunity to reduce or even “break” the current DMTA cycle by applying these techniques to protein engineering. However, to even probe the feasibility of using AI/ML approaches for construct design and protein production, the protein science field needs to provide large data sets. Starting points for “big data” in protein science exist, including the SGC (Structural Genomics Consortium) effort to obtain structural information on the entire human proteome. These data sets, like the majority of other protein design, molecular biology, expression, purification and analysis data, are not normalized in any uniform way that is easily amenable to machine learning. We have established this SIG to define standards for capturing the data generated in the DMTA cycle for recombinant proteins. If adopted across academia, government and industry, such standards would maximize the opportunity for AI/ML to produce breakthroughs in protein science. Please join us in developing this SIG to help obtain the data needed to continue exploring AI/ML for molecular biology and protein engineering.