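Artificial Intelligence Helps Shanghai Graduate Student Xiong Zhaoping Win the International Multi-Targeting Drug Design Challenge
Editor | Jiaxu (迦溆)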
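In artificial intelligence and computational science, algorithm challenges are the touchstone for testing algorithms. Participants use existing data to design and develop predictive algorithms without knowing the outcomes in advance, and the organizers then evaluate the results or invite a third party to validate the predictions experimentally. This keeps the evaluation objective and fair, and it pushes algorithms to generalize rather than merely fit data whose outcomes are already known. Google's Kaggle and Alibaba's Tianchi competitions, for example, are arenas where major companies and academic institutions test their strength against one another, and winning such a competition is a considerable honor.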
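The DREAM Challenge is the longest-running and most influential algorithm challenge in international computational biomedicine [1]. It has been held annually since 2006, for 12 editions so far, and winners are invited to give an oral presentation the following year at the RECOMB computational biology conference held by the International Society for Computational Biology (ISCB). The challenge was initiated by Professors Gustavo Stolovitzky and Andrea Califano of Columbia University in the United States. For each edition, different organizers open up their private data and design tasks on different themes for participants to model and predict. The tasks address the newest and most pressing scientific questions of the time, such as transcription factor binding site prediction, somatic mutation calling algorithms, disease module prediction, survival prediction for cancer patients, recognition of movement and posture patterns in Parkinson's disease, biomarker prediction for Alzheimer's disease, mammogram interpretation, proteomic biomarker prediction in tumors, and identification of dynamic, time-resolved blood biomarkers after acute bacterial or viral infection. All are interesting and highly challenging topics. The competition effectively crowdsources algorithms from the public, and many institutions are willing to open up their private data for it, so it strongly encourages open sharing of biomedical data. Its third-party validation also ensures reproducibility, giving algorithms the most effective validation available.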
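The results of the DREAM Challenge single-cell transcriptomics challenge and the multi-targeting drug prediction challenge [2] were announced at the 11th RECOMB/ISCB conference, just held in New York. Xiong Zhaoping, a doctoral student jointly trained by ShanghaiTech University and the Shanghai Institute of Materia Medica, Chinese Academy of Sciences, won the multi-targeting drug prediction challenge with the highest total score, 28 points. Xiong's advisors are Academician Jiang Hualiang and Professor Zheng Mingyue of the Shanghai Institute of Materia Medica, Chinese Academy of Sciences.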
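A multi-target drug is one that acts simultaneously on several targets in a disease-related network. For years, some researchers equated multi-target activity with poor selectivity and preferred to develop highly specific drugs against a single target. The human body, however, is an integrated whole, and many diseases can be controlled only by finely modulating several targets at once. In that context, a drug that works by modulating multiple targets is not less selective but more selective and more precise: it must act on certain targets and must not act on certain others. Take PP121, a protein kinase inhibitor designed against multiple targets (figure below): the compound not only targets tyrosine kinases but also blocks the negative feedback loop between mTOR and PI3K (red boxes mark the inhibitor's targets), which gives it better synergy [3]. The clinical efficacy of a number of kinase-targeted anticancer drugs has been found to be related to their polypharmacology. Most current kinase inhibitors, however, act by binding the highly conserved ATP-binding pocket of kinases, so they have low selectivity and are prone to toxic side effects. Developing new, effective, and safe kinase inhibitors therefore requires balancing polypharmacology against selectivity, which is very challenging.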
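This multi-targeting prediction challenge was organized by the Mount Sinai School of Medicine in the United States. Participants were asked to use private and public bioactivity data to find compounds that are active against several specified protein targets but inactive against several other targets (anti-targets); in other words, the compounds must show highly precise multi-target selectivity. To validate the algorithms objectively, the organizers required the proposed compounds to be purchasable compounds in the ZINC15 database. The organizers then selectively purchased compound samples for bioactivity testing, based on the novelty and generalizability of the submitted algorithms and on the novelty of the compound structures (usually meaning compounds not reported in the SciFinder database or in patent databases). Finally, participants were scored on the third-party experimental results (the detailed challenge rules are given in the appendix). Xiong Zhaoping obtained the highest score in the field on both tasks, medullary thyroid carcinoma and the tau-protein neurodegeneration model, standing out among 190 participating teams.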
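It is worth noting that the Drug Discovery and Design Center (DDDC) of the Shanghai Institute of Materia Medica, Chinese Academy of Sciences, is the birthplace of drug molecular design in China. Since Academician Ji Ruyun established the laboratory in 1978, it has trained a large number of drug design specialists, including Academicians Chen Kaixian and Jiang Hualiang, and the center has made many important breakthroughs both in methodology development for drug molecular design and in applications to new drug research and development. Since 2005, Academician Jiang Hualiang has guided students in developing methods that predict protein-protein interactions (PPI) from sequence alone and that carry out drug molecular design based only on protein sequence information, without three-dimensional structures [4-6]; these methods are widely used by peers internationally. Professor Zheng Mingyue (a former student of Academicians Chen Kaixian and Jiang Hualiang) has long worked on method development for drug molecular design and on the discovery of new targets and new drugs, with a particularly solid body of work on applying machine learning to drug design. In recent years, Jiang and Zheng have led a research team that develops precision drug design methods based on artificial intelligence (AI) and big data and applies them to new drug discovery; Xiong Zhaoping is a graduate student on that team.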
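Not all the target proteins in this challenge have solved three-dimensional structures. Drawing on the DDDC's long-accumulated foundations and on an approach recently developed by the team that adapts AI techniques for natural language and other sequence-type data, Xiong Zhaoping encoded the target proteins and modeled the graph structure of small molecules with recent graph neural network techniques, arriving at an end-to-end neural network prediction model. Unlike traditional machine learning algorithms, this AI model requires no manual feature selection, which gives it high extensibility and predictive power. The result also points to the great potential of AI and machine learning for making innovative drug research and development more efficient.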
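The two ingredients described above, a sequence-based protein encoding and a graph-based molecule encoding feeding one predictor, can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the function names, the amino-acid frequency encoding (a stand-in for a learned sequence encoder), and the mean-aggregation message-passing step (a stand-in for a trained graph neural network layer). It is not the winning model, whose architecture is not described in this article.

```python
# Illustrative sketch only: toy stand-ins for a sequence encoder and a
# graph neural network layer. No names or choices here come from the
# winning submission.
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_sequence(seq: str) -> List[float]:
    """Encode a protein as amino-acid frequencies (no 3D structure needed)."""
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1
    return [c / total for c in counts]


def message_pass(
    features: Dict[int, List[float]], bonds: List[Tuple[int, int]]
) -> Dict[int, List[float]]:
    """One message-passing round: average each atom with its bonded neighbours."""
    neighbours = {atom: [atom] for atom in features}  # include the atom itself
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    dim = len(next(iter(features.values())))
    return {
        atom: [sum(features[n][i] for n in ns) / len(ns) for i in range(dim)]
        for atom, ns in neighbours.items()
    }


def readout(features: Dict[int, List[float]]) -> List[float]:
    """Sum the atom vectors into a single molecule-level vector."""
    dim = len(next(iter(features.values())))
    return [sum(vec[i] for vec in features.values()) for i in range(dim)]


# Heavy atoms of ethanol (C-C-O) with one-hot features over [carbon, oxygen].
atoms = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
bonds = [(0, 1), (1, 2)]

mol_vec = readout(message_pass(atoms, bonds))
prot_vec = encode_sequence("MKTAYIAKQR")  # hypothetical sequence fragment
pair_input = mol_vec + prot_vec  # joint input for a downstream predictor
```

In a real end-to-end model both encoders would have trainable weights and be optimized jointly against measured activity data; the fixed averaging here only shows how neither side needs hand-picked descriptors or a solved protein structure.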
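Academician Jiang Hualiang, one of Xiong Zhaoping's advisors, said: "Artificial intelligence is becoming an important tool for drug research and development. The Shanghai Institute of Materia Medica laid out this research direction many years ago and has combined it with new technologies such as chemical biology and DNA-encoded libraries, as well as with traditional pharmacology and medicinal chemistry, and has already made considerable progress in improving the efficiency of drug research and development."
References and Resources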
1. http://dreamchallenges.org/
2. https://www.synapse.org/#!Synapse:syn8404040/wiki/478422
3. Knight, Z. A., Lin, H., & Shokat, K. M. (2010). Targeting the cancer kinome through polypharmacology. Nature Reviews Cancer, 10(2), 130.
4. Wang, F., Liu, D., Wang, H., Luo, C., Zheng, M., Liu, H., ... & Jiang, H. (2011). Computational screening for active compounds targeting protein sequences: methodology and experimental validation. Journal of Chemical Information and Modeling, 51(11), 2821-2828.
5. Shen, J., Zhang, J., Luo, X., Zhu, W., Yu, K., Chen, K., ... & Jiang, H. (2007). Predicting protein–protein interactions based only on sequences information. Proceedings of the National Academy of Sciences, 104(11), 4337-4341.
6. Zhong, F., Xing, J., Li, X., Liu, X., Fu, Z., Xiong, Z., ... & Li, F. (2018). Artificial intelligence in drug design. Science China Life Sciences, 1-14.
Appendix
Rules of the Multi-Targeting Drug DREAM Challenge
Overview
· This is a Methods Development challenge
Many DREAM challenges seek to benchmark methods on a dataset with a known truth set. Competitive performance is assessed by score on the truth set. In contrast, in this challenge, we emphasize the submission of new methods to solve a drug-design problem.
· Incentivizing methods for Rational Design of multi-targeting compounds
The objective of this challenge is to incentivize development of methods for predicting compounds that bind to multiple targets. In particular, methods that are generalizable to multiple prediction problems are sought. To achieve this, participants will be asked to predict 2 separate compounds, each having specific targets to which they should bind, and a list of anti-targets to avoid. Participants should use the same methods to produce answers for questions 1 and 2.
· Predictions will be validated
Highly-ranked submissions will be validated by an external lab for activity against targets and may be further investigated in biological systems. Participants who are concerned about protecting intellectual property are encouraged to file a patent application before submitting to the challenge.
Challenge Question 1
Recent genetic and chemical dissection of models of medullary thyroid carcinoma have identified four key targets that together strongly reduce the ability of oncogenic Ret(M918T) to transform cells: RET, SRC, EPH, and FRK; MKNK1 and others were identified as ‘anti-targets’ (Sonoshita et al, submitted). Based on this work, problem 1 was developed with relevance to cancer.
Participants are asked to predict compounds that have the following behaviors:
Assume all targets and anti-targets are wild-type, unless specified otherwise.
Challenge Question 2
Challenge problem 2 is based on a neurodegenerative model of tauopathies. Background on tauopathies is available in this review article and this recent paper on molecular pathways.
Participants are asked to predict compounds that have the following behaviors:
Assume all targets and anti-targets are wild-type, unless specified otherwise.
Additional Criteria
1. Must be purchasable: listed as “in stock” by the ZINC15 database, at a cost of less than $250 per 1 mg. Alternatively, a compound not listed in ZINC is eligible for submission if you provide a vendor from which it can be purchased.
2. Binding is defined as >= 50% binding at 10 µM
3. Non-binding is defined as < 10% binding at 10 µM
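The two thresholds above can be written as a small check. This is a minimal sketch under stated assumptions: the function names, the "indeterminate" label for the 10% to 50% range (which the rules leave undefined), and the example measurements are all hypothetical. The target names RET and SRC and the anti-target MKNK1 are taken from the Question 1 background, but the full required profile is not reproduced in this appendix.

```python
from typing import Dict, Iterable

BINDING_MIN = 50.0      # rule 2: >= 50% binding at 10 µM counts as binding
NON_BINDING_MAX = 10.0  # rule 3: < 10% binding at 10 µM counts as non-binding


def classify(percent_binding: float) -> str:
    """Label one measurement taken at 10 µM."""
    if percent_binding >= BINDING_MIN:
        return "binding"
    if percent_binding < NON_BINDING_MAX:
        return "non-binding"
    # The rules do not name this range; "indeterminate" is our own label.
    return "indeterminate"


def meets_profile(
    measured: Dict[str, float],
    targets: Iterable[str],
    anti_targets: Iterable[str],
) -> bool:
    """True if every target is bound and every anti-target is avoided."""
    hits_targets = all(classify(measured[t]) == "binding" for t in targets)
    avoids_anti = all(classify(measured[a]) == "non-binding" for a in anti_targets)
    return hits_targets and avoids_anti


# Hypothetical measurements (percent binding at 10 µM) for one compound.
compound = {"RET": 92.0, "SRC": 71.5, "MKNK1": 4.0}
ok = meets_profile(compound, targets=["RET", "SRC"], anti_targets=["MKNK1"])
```

Note the gap between the two cut-offs: a compound showing, say, 30% binding to an anti-target is neither a binder nor a clean non-binder under these definitions, so it would fail the profile check.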
Scoring
Points are awarded for methods that meet the criteria outlined above for binding specified targets and avoiding specified anti-targets. In addition, for each challenge question, points may be awarded for exhibiting the following properties. The point score will determine the initial ranking of teams, prior to validation.
* Tanimoto coefficient (ECFP6) is calculated with respect to small-molecule ligands of the target kinases, as defined by compounds in ChEMBL known to be active against these targets. Ligand activity is as defined above for binding/non-binding.
** Only applicable to question 2.
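For readers unfamiliar with the Tanimoto coefficient mentioned in the first footnote, it is the size of the intersection of two fingerprints' on-bits divided by the size of their union. The sketch below assumes fingerprints are already available as sets of on-bit indices; the bit sets shown are made up, and in practice ECFP6 fingerprints would come from a cheminformatics toolkit. The helper names are hypothetical, and how the organizers turned the coefficient into points is not stated in this appendix.

```python
from typing import Iterable, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient: |A ∩ B| / |A ∪ B| over fingerprint on-bits."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def max_similarity(candidate: Set[int], known_ligands: Iterable[Set[int]]) -> float:
    """Highest similarity between a candidate and any known active ligand."""
    return max((tanimoto(candidate, lig) for lig in known_ligands), default=0.0)


# Made-up on-bit sets standing in for ECFP6 fingerprints.
candidate = {1, 5, 9, 12}
known = [{1, 5, 9, 40}, {2, 3}]
nearest = max_similarity(candidate, known)  # 3 shared bits / 5 total = 0.6
```

A low maximum similarity to known ligands is one way to quantify structural novelty, which is among the properties the scientific board considers.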
Validation assays and final scoring
Submissions will be reviewed and ranked by the scientific board for innovation of method, generalizability of method, and novelty of compound. Board members will be blinded to the identity of submitters when conducting this evaluation.
Highly-ranked submissions will be evaluated by a contract research organization to verify activity with respect to targets and anti-targets. Up to one submission per team will be validated. The validation assay is a competition binding assay that yields a quantitative measure of target interaction and is conducted in vitro, independent of a biological system. Verification of inhibition activity may also be assessed, but this will not be considered in final challenge scoring.
The final score for a team will be a composite of the score for efficacy determined by the contract research organization's evaluation and their score from the ranking procedure.
Post-challenge assessment
Promising candidates from the challenge will undergo further evaluation in cell lines and fly models. These assessments are not a part of the scored challenge and will not affect selection of a challenge winner.
Intellectual property
Participants are encouraged to file a patent application before submission to the challenge if they are concerned about protecting intellectual property rights.