Artificial intelligence in drug discovery: combining clinical and genomic data from cancer patients to identify genetic biomarkers and potential novel targets
Head of Drug Discovery and Biostatistics, Topazium Artificial Intelligence, Spain
Abstract: Drug discovery is a costly process with a high attrition rate. The pharmaceutical industry's heavy investment in recent years has generated a wealth of information that represents a great opportunity if properly exploited. Topazium is a company fully committed to creating a new collective intelligence that enables wiser medicine and, in this context, has developed new tools to improve drug discovery performance based on a novel, circular process paradigm. That paradigm uses the information generated at each stage to feed the other stages, regardless of their relative position in the cycle. One such tool is Topazium’s Genetic Fingerprints, a machine learning framework (MLF) suited to mining large amounts of clinical and genetic information. This tool has been applied to public cancer genomics datasets to identify genetic biomarkers of worse survival as well as potential new therapeutic targets. We used data from The Cancer Genome Atlas, which includes whole-exome sequences of samples from circa 12K patients with tumors of 150 different histotypes, carrying more than 2M mutations. Data were encoded using a proprietary encoding tool to generate vectors of adequate length and sparsity, which the MLF used to identify clusters of patients based on genetic similarities. These clusters showed significant differences in their overall survival curves (p-values < 0.001). Circa 200K of those mutations were by themselves associated with survival differences; they were filtered to remove genes related to known cancer driver mutations, which enabled the identification of new genes that had not been directly linked to cancer so far. Moreover, when patients carrying those 200K mutations were removed from the analysis, the clusters still showed differences in their survival curves, pointing to polygenic effects that can be pharmacologically exploited.
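As a rough illustration of how survival differences between two patient clusters can be tested, the sketch below implements a two-group log-rank test in plain Python. The abstract does not state which test produced the reported p-values, so the choice of the log-rank test, the function name `logrank_test`, and the toy follow-up data are all assumptions made for illustration only.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times_*  : follow-up time for each patient in the cluster
    events_* : 1 if death was observed at that time, 0 if censored
    """
    samples = [(t, e, 0) for t, e in zip(times_a, events_a)]
    samples += [(t, e, 1) for t, e in zip(times_b, events_b)]
    # Only time points at which at least one death occurred contribute.
    event_times = sorted({t for t, e, _ in samples if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk = [s for s in samples if s[0] >= t]
        n = len(at_risk)                              # all patients at risk
        n_a = sum(1 for s in at_risk if s[2] == 0)    # at risk in cluster A
        deaths = [s for s in at_risk if s[0] == t and s[1]]
        d = len(deaths)                               # deaths at time t
        d_a = sum(1 for s in deaths if s[2] == 0)     # deaths in cluster A
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the death count in cluster A.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical toy data (months of follow-up), not taken from the study.
cluster_a_times, cluster_a_events = [1, 2, 3, 4, 5], [1, 1, 1, 1, 1]
cluster_b_times, cluster_b_events = [10, 11, 12, 13, 14], [1, 1, 1, 1, 1]
chi2, p_value = logrank_test(
    cluster_a_times, cluster_a_events, cluster_b_times, cluster_b_events
)
```

With more than two clusters, the same comparison would typically be run as a multi-group log-rank test or pairwise with a multiple-testing correction; which of these was used here is not stated.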
KEGG analysis of the pathways affected by those genes revealed that the pathway with the highest gene ratio was "MicroRNAs in cancer", and the genes involved in that pathway were also identified. A similar exercise was performed with clinical and genetic data from 945 breast tumors from the Broad Institute of MIT and Harvard. The resulting clusters showed differences in survival after 7 years (p-value 0.04), and the largest contributions came from genes related to the PI3K/Akt signaling pathway and to pathways involved in cell adhesion and migration; these were analyzed to identify genes whose involvement in cancer was unknown so far. Altogether, the MLF presented here appears to be a useful tool for exploiting real-world clinical and genomic data, allowing the identification of genes that can be used as genetic biomarkers of higher mortality risk and as novel cancer targets, providing the basis for new drug discovery programs.
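To make the notion of "gene ratio" in a pathway analysis concrete, the following minimal sketch computes it for one pathway together with a one-sided hypergeometric enrichment p-value. It assumes the common definition of gene ratio (input genes found in the pathway divided by the size of the input list); the abstract does not specify the exact definition or tool used, and the function names, gene identifiers, and background size below are hypothetical placeholders.

```python
from math import comb


def gene_ratio(hit_genes, pathway_genes):
    """Fraction of the input gene list that belongs to the pathway."""
    genes = set(hit_genes)
    return len(genes & set(pathway_genes)) / len(genes)


def enrichment_p_value(hit_genes, pathway_genes, background_size):
    """One-sided hypergeometric p-value P(X >= k).

    k = input genes in the pathway, n = input list size,
    K = pathway size, N = number of genes in the background.
    Assumes every pathway gene is part of the background.
    """
    genes, pathway = set(hit_genes), set(pathway_genes)
    k, n, K, N = len(genes & pathway), len(genes), len(pathway), background_size
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


# Hypothetical placeholders, not genes or pathways from the study.
hits = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
pathway = ["GENE_A", "GENE_B", "GENE_X"]

ratio = gene_ratio(hits, pathway)
p_value = enrichment_p_value(hits, pathway, background_size=10)
```

Ranking pathways by gene ratio alone favors large pathways, which is why enrichment tools usually report it alongside a p-value adjusted for multiple testing.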