Research Paper Volume 15, Issue 18 pp 9293—9309
Biomedical generative pre-trained based transformer language model for age-related disease target discovery
- 1 Insilico Medicine Hong Kong Ltd., Hong Kong Science and Technology Park, New Territories, Hong Kong, China
- 2 Insilico Medicine AI Limited, Level 6, Unit 08, Block A, IRENA HQ Building, Masdar City, Abu Dhabi, UAE
Received: June 15, 2023 Accepted: August 20, 2023 Published: September 22, 2023
https://doi.org/10.18632/aging.205055How to Cite
Copyright: © 2023 Zagirova et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract
Target discovery is crucial for the development of innovative therapeutics and diagnostics. However, current approaches often face limitations in efficiency, specificity, and scalability, necessitating the exploration of novel strategies for identifying and validating disease-relevant targets. Advances in natural language processing have provided new avenues for predicting potential therapeutic targets for various diseases. Here, we present a novel approach for predicting therapeutic targets using a large language model (LLM). We trained a domain-specific BioGPT model on a large corpus of biomedical literature consisting of grant text and developed a pipeline for generating target prediction. Our study demonstrates that pre-training of the LLM model with task-specific texts improves its performance. Applying the developed pipeline, we retrieved prospective aging and age-related disease targets and showed that these proteins are in correspondence with the database data. Moreover, we propose CCR5 and PTH as potential novel dual-purpose anti-aging and disease targets which were not previously identified as age-related but were highly ranked in our approach. Overall, our work highlights the high potential of transformer models in novel target prediction and provides a roadmap for future integration of AI approaches for addressing the intricate challenges presented in the biomedical field.
Abbreviations
AI: Artificial Intelligence; GO: Gene Ontology; HGNC: HUGO Gene Nomenclature Committee; iPSC: Induced pluripotent stem cell; LLM: Large language model; NLTK: Natural Language Toolkit.