Artificial intelligence is taking over drug development

The Economist:

The most striking evidence that artificial intelligence can provide profound scientific breakthroughs came with the unveiling of a program called AlphaFold by Google DeepMind. In 2016 researchers at the company had scored a big success with AlphaGo, an AI system which, having essentially taught itself the rules of Go, went on to beat the most highly rated human players of the game, sometimes by using tactics no one had ever foreseen. This emboldened the company to build a system that would work out a far more complex set of rules: those through which the sequence of amino acids which defines a particular protein leads to the shape that sequence folds into when that protein is actually made. AlphaFold found those rules and applied them with astonishing success.

The achievement was both remarkable and useful. Remarkable because a lot of clever humans had been trying hard to create computer models of the processes which fold chains of amino acids into proteins for decades. AlphaFold bested their best efforts almost as thoroughly as the system that inspired it trounces human Go players. Useful because the shape of a protein is of immense practical importance: it determines what the protein does and what other molecules can do to it. All the basic processes of life depend on what specific proteins do. Finding molecules that do desirable things to proteins (sometimes blocking their action, sometimes encouraging it) is the aim of the vast majority of the world’s drug development programmes.

Because of the importance of proteins’ three-dimensional structure there is an entire sub-discipline largely devoted to it: structural biology. It makes use of all sorts of technology to look at proteins through nuclear-magnetic-resonance techniques or by getting them to crystallise (which can be very hard) and blasting them with x-rays. Before AlphaFold over half a century of structural biology had produced a couple of hundred thousand reliable protein structures through these means. AlphaFold and its rivals (most notably a program made by Meta) have now provided detailed predictions of the shapes of more than 600m.

As a way of leaving scientists gobsmacked it is a hard act to follow. But if AlphaFold’s products have wowed the world, the basics of how it made them are fairly typical of the sort of things deep learning and generative AI can offer biology. Trained on two different types of data (amino-acid sequences and three-dimensional descriptions of the shapes they fold into) AlphaFold found patterns that allowed it to use the first sort of data to predict the second. The predictions are not all perfect. Chris Gibson, the boss of Recursion Pharmaceuticals, an AI-intensive drug-discovery startup based in Utah, says that his company treats AlphaFold’s outputs as hypotheses to be tested and validated experimentally. Not all of them pan out. But Dr Gibson also says the model is quickly getting better.

Crystal dreams

This is what a whole range of AIs are now doing in the world of biomedicine and, specifically, drug research: making suggestions about the way the world is that scientists could not or would not come up with on their own. Trained to find patterns that extend across large bodies of disparate data, AI systems can discover relationships within those data that have implications for human biology and disease. Presented with new data they can use those patterns of implication to produce new hypotheses which can then be tested.

The ability of AI to generate new ideas provides users with insights that can help to identify drug targets and to predict the behaviour of novel compounds, sometimes never previously imagined, that might act as drugs. It is also being used to find new applications for old drugs, to predict the side effects of new drugs, and to find ways of telling those patients whom a drug might help from those it might harm.

Such computational ambitions are not new. Large-scale computing, machine learning and drug design were already coming together in the 2000s, says Vijay Pande, who was a researcher at Stanford University at the time. This was in part a response to biology’s fire hose of new findings: there are now more than a million biomedical research papers published every year.

One of the early ways in which AI was seen to help with this was through “knowledge graphs”, which allowed all that information to be read by machines and mined for insights about, say, which proteins in the blood might be used as biomarkers revealing the presence or severity of a disease. In 2020 BenevolentAI, based in London, used this method to see the potential which baricitinib, sold by Eli Lilly as a treatment for rheumatoid arthritis, had for treating covid-19.

This January, research published in Science described how AI algorithms of a different sort had accelerated efforts to find biomarkers of long covid in the blood. Statistical approaches to the discovery of such biomarkers can be challenging given the complexity of the data. AIs offer a way of cutting through this noise and advancing the discovery process in diseases both new, like long covid, and hard to diagnose, like the early stages of Alzheimer’s.

The time is right

But despite this past progress, Dr Pande, now at Andreessen Horowitz, a venture-capital firm that is big on AI, thinks that more recent advances mark a step change. Biomedical research, particularly in biotech and pharma, was steadily increasing its reliance on automation and engineering before the new foundation models came into their own; now that has happened, the two seem to reinforce each other. The new foundation models do not just provide a way to cope with big bodies of data; they demand them. The scads of reliable data highly automated labs can produce in abundance are just the sort of thing for training foundation models. And biomedical researchers need all the help they can get to understand the torrents of data they are now capable of generating.

By finding patterns humans had not thought to look for, or had no hope of finding unaided, AI offers researchers new ways to explore and understand the mysteries of life. Some talk of AIs mastering the “language of biology”, learning to make sense of what evolution has wrought directly from the data in the same way that, trained on lots of real language, they can fluently generate meaningful sentences never uttered before.

Demis Hassabis, the boss of DeepMind, points out that biology itself can be thought of as “an information processing system, albeit an extraordinarily complex and dynamic one”. In a post on Medium, Serafim Batzoglou, the chief data officer at Seer Bio, a Silicon Valley company that specialises in looking at how proteins behave, predicts the emergence of open foundation models that will integrate data spanning from genome sequences to medical histories. These, he argues, will vastly accelerate innovation and advance precision medicine.

Like many of the enthusiasts piling into AI, Dr Pande talks of an “industrial revolution…changing everything”. But his understanding of the time taken so far leads him to caution that achievements that justify that long-term enthusiasm will not come overnight: “We are in a transitory period where people can see the difference but there is still work to do.”

All the data from everywhere all at once

A lot of pharma firms have made significant investments in the development of foundation models in recent years. Alongside this has been a rise in AI-centred startups such as Recursion, Genesis Therapeutics, based in Silicon Valley, Insilico, based in Hong Kong and New York and Relay Therapeutics, in Cambridge, Massachusetts. Daphne Koller, the boss of Insitro, an AI-heavy biotech in South San Francisco, says one sign of the times is that she no longer needs to explain large language models and self-supervised learning. And Nvidia—which makes the graphics-processing units that are essential for powering foundation models—has shown a keen interest. In the past year, it has invested in or made partnership deals with at least six different AI-focused biotech firms including Schrodinger, another New York-based firm, Genesis, Recursion and Genentech, an independent subsidiary of Roche, a big Swiss pharmaceutical company.

The drug-discovery models many of the companies are working with can learn from a wide variety of biological data including gene sequences, pictures of cells and tissues, the structures of relevant proteins, biomarkers in the blood, the proteins being made in specific cells and clinical data on the course of disease and effect of treatments in patients. Once trained, the AIs can be fine tuned with labelled data to enhance their capabilities.

The use of patient data is particularly interesting. For fairly obvious reasons it is often not possible to discover the exact workings of a disease in humans through experiment. So drug development typically relies a lot on animal models, even though they can be misleading. AIs that are trained on, and better attuned to, human biology may help avoid some of the blind alleys that stymie drug development.

Insitro, for example, trains its models on pathology slides, gene sequences, MRI data and blood proteins. One of its models is able to connect changes in what cells look like under the microscope with underlying mutations in the genome and with clinical outcomes across various different diseases. The company hopes to use these and similar techniques to find ways to identify sub-groups of cancer patients that will do particularly well on specific courses of treatment.

Sometimes finding out what aspect of the data an AI is responding to is useful in and of itself. In 2019 Owkin, a Paris based “AI biotech”, published details of a deep neural network trained to predict survival in patients with malignant mesothelioma, a cancer of the tissue surrounding the lung, on the basis of tissue samples mounted on slides. It found that the cells most germane to the AI’s predictions were not the cancer cells themselves but non-cancerous cells nearby. The Owkin team brought extra cellular and molecular data into the picture and discovered a new drug target. In August last year a team of scientists from Indiana University Bloomington trained a model on data about how cancer cells respond to drugs (including genetic information) and the chemical structures of drugs, allowing it to predict how effective a drug would be in treating a specific cancer.

Many of the companies using AI need such great volumes of high-quality data that they are generating it themselves as part of their drug development programmes rather than waiting for it to be published elsewhere. One variation on this theme comes from a new computational sciences unit at Genentech which uses a “lab in the loop” approach to train its AI. The system’s predictions are tested at a large scale by means of experiments run with automated lab systems. The results of those experiments are then used to retrain the AI and enhance its accuracy. Recursion, which is using a similar strategy, says it can use automated laboratory robotics to conduct 2.2m experiments each week.

The point is to change it

As pharma firms become increasingly hungry for data, concerns about the privacy of patient data are becoming more prominent. One way of dealing with the problem, used by Owkin among others, is “federated learning”, in which the training data it needs to build an atlas of cancer cell types never leaves the hospital where the tissue samples required are stored: what the data can offer in terms of training is taken away. The data themselves remain.
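
The federated-learning idea described above can be made concrete with a toy sketch. The Python below is an illustration under stated assumptions, not Owkin's actual method: it uses a one-parameter model (y = w * x), made-up data for two hypothetical hospitals, and plain federated averaging, one common way of combining locally trained models. Every name in it (`local_update`, `federated_round`, `hospital_a`, `hospital_b`) is invented for the example.

```python
# Toy sketch of federated averaging. All names and data are hypothetical.
from typing import List, Tuple

Sample = Tuple[float, float]  # (feature x, label y)


def local_update(w: float, data: List[Sample], lr: float = 0.1, steps: int = 20) -> float:
    """Train y = w * x by gradient descent on one hospital's private data."""
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global: float, hospitals: List[List[Sample]]) -> float:
    """Each site trains locally; only the resulting weights are shared and averaged."""
    total = sum(len(data) for data in hospitals)
    updates = [(local_update(w_global, data), len(data)) for data in hospitals]
    # Weight each site's model by how many samples it trained on.
    return sum(w * n for w, n in updates) / total


def train(hospitals: List[List[Sample]], rounds: int = 10) -> float:
    """Repeat federated rounds, starting from an untrained model (w = 0)."""
    w = 0.0
    for _ in range(rounds):
        w = federated_round(w, hospitals)
    return w


# Made-up data: both sites follow y = 2x, but neither ever sees the other's samples.
hospital_a = [(1.0, 2.0), (2.0, 4.0)]
hospital_b = [(3.0, 6.0), (0.5, 1.0), (1.5, 3.0)]

w = train([hospital_a, hospital_b])
print(f"learned weight: {w:.4f}")
```

In this sketch the coordinating side only ever receives a weight and a sample count from each site, which mirrors the article's point that what the data can teach is taken away while the data themselves remain in the hospital. Real systems typically add further protections on top of this basic scheme.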

The implications of AI go beyond understanding disease and on into figuring out how to intervene. Generative AI models, such as ProteinSGM from the University of Toronto, are now powerful tools in protein design because they are not merely able to picture existing proteins but also to design new ones—with desired characteristics—that do not currently exist in nature but which are possible ways of embodying a desired function. Other systems allow chemists to design small molecules that might be useful as drugs as they interact with a target in a desired way.

At every stage the AI hypotheses need to be checked against reality. Even so, such an approach seems to speed up discovery. A recent analysis of drugs from “AI-intensive” firms carried out by BCG, a consulting group, found that of eight drugs for which information was available, five had reached clinical trials in less than the typical time for doing so. Other work suggests AI could yield time and cost savings of 25% to 50% in the preclinical stage of drug development, which can take four to seven years. Given the cost in time and money of the whole process, which can be several billions of dollars for a single drug, improvements could transform the industry’s productivity. But it will take time to know for sure. Drug pipelines are still slow; none of these promised new drugs has yet got to market.

Insilico Medicine is one of the companies hoping for that to change. It uses a range of models in its drug development process. One identifies the proteins that might be targeted to influence a disease. Another can design potential new drug compounds. Using this approach it identified a drug candidate which might be useful against pulmonary fibrosis in less than 18 months and at a cost of $3m—a fraction of the normal cost. The drug recently started Phase 2 trials.

A lot of pharma firms in China are doing deals with AI-driven companies like Insilico in the hope of seeing more of the same. Some hope that such deals might be able to boost China’s relatively slow-growing drug-development businesses. China’s contract research organisations are already feeling the benefits of AI fuelled interest in new molecules from around the world. Investment in AI-assisted drug discovery in China was more than $1.26bn in 2021.

The world has seen a number of groundbreaking new drugs and treatments in the past decade: the drugs targeting GLP-1 that are transforming the treatment of diabetes and obesity; the CAR-T therapies enlisting the immune system against cancer; the first clinical applications of genome editing. But the long haul of drug development, from discerning the biological processes that matter to identifying druggable targets to developing candidate molecules to putting them through preclinical tests and then clinical trials, remains generally slow and frustrating work. Approximately 86% of all drug candidates developed between 2000 and 2015 failed to meet their primary endpoints in clinical trials. Some argue that drug development has picked off most of biology’s low-hanging fruit, leaving diseases which are intractable and drug targets that are “undruggable”.

The next few years will demonstrate conclusively if AI is able to materially shift that picture. If it offers merely incremental improvements that could still be a real boon. If it allows biology to be deciphered in a whole new way, as the most boosterish suggest, it could make the whole process far more successful and efficient—and drug the undruggable very rapidly indeed. The analysts at BCG see signs of a fast-approaching AI-enabled wave of new drugs. Dr Pande warns that drug regulators will need to up their game to meet the challenge. It would be a good problem for the world to have. ■
