Recordings
Knowledge Graphs in Drug Discovery part 9
The webinar will last approximately 2.5 hours. If you register for the event, you will receive the recording via email when it is ready, even if you are unable to attend the live event. Please follow Biorelate on LinkedIn for more webinars and data science news. The conference takes place Wednesday, June 12th at 15.00 BST, 16.00 CEST, 10am EDT, 9am CDT, 7am PDT.
Talks will include:
Using Retrieval Augmented Generation Approaches To Gather Information for Drug Discovery
Jon Hill (Boehringer Ingelheim)
Large language models have captured the public imagination, but how can they be useful for work in drug discovery? What about the risks of introducing false information? This talk will provide an overview of different approaches to using LLMs pragmatically in early research, including the management of hallucination. These models are increasingly accessible to non-AI experts and, used with an awareness of their limitations, can improve the speed and comprehensiveness of the information you bring to your research.
Multi-modal Knowledge Graphs for Precision Oncology
Miguel Gonçalves (AstraZeneca)
In this talk, I will go over the work we have been doing with Knowledge Graphs to uncover clinical insights in specific indications by incorporating multi-modal data. I will describe our flexible KG approach and provide examples of how this is making an impact at AZ.
Ask ARCH: LLM Question Answering over Large-Scale Knowledge Graphs
Jon Stevens (AbbVie, Inc.)
Knowledge graphs provide a vehicle for grounding LLM answers in harmonized structured data, reducing hallucinations and allowing easy fact-checking. In turn, LLMs provide a natural way for end users to query knowledge graph data, without requiring a query language or deep understanding of database structure. We present our integration of AbbVie's 30-million-node R&D knowledge graph, the ARCH Graph, with GPT-based LLMs to create a scientific question-answering system. The ARCH Graph is a Neo4J graph that harmonizes and connects molecules, drugs, genes, health conditions, and other entities from a variety of data sources, allowing scientists to make connections between disparate data points. However, querying the graph can be challenging for end users without a natural language interface. The new Ask ARCH Graph provides such an interface, allowing users to ask questions in natural language (e.g., "What genetic markers are associated with acute myeloid leukemia?") and receive natural language answers ("Some genetic markers associated with acute myeloid leukemia include PICALM (ENSG00000073921), CEBPA (ENSG00000245848), ...") along with the underlying data and the Cypher query used to retrieve it. To achieve this, the system utilizes a combination of vector search, Cypher query generation and validation, and LLM-based summarization of the Cypher output. The process of accurately retrieving information from a large-scale knowledge graph is more complex and less researched than simpler RAG methods on document corpora. We discuss the evolution of our approach and evaluate its accuracy and performance. The integration of LLMs with knowledge graphs helps reduce hallucinations, improve reliability in specialized domains, enhance reasoning with context, and enable dynamic and interactive knowledge discovery.
At this virtual, free-to-attend conference for the biopharma data professional community, speakers from across biopharma research give 30-minute presentations on knowledge graphs, NLP, and other related topics of interest to the data science, bioinformatics, computational biology, and wider biopharma communities. At the end, we hold a roundtable Q&A session with all of the speakers.
Our previous conferences have brought in an average of over 150 attendees each and have featured speakers from organisations including Roche, AstraZeneca, Boehringer Ingelheim, Bayer, Novo Nordisk, and NASA, among many other leading research organisations. The most recent webinar recordings can be accessed here, and some recordings are also available on YouTube.
Transcript
So as I say, we've been running these for about three years now, and up to now we have had 28 different speakers. Today we are looking forward to three very exciting talks. All of the previous sessions have been recorded, so please feel free to contact somebody from Biorelate if you are interested in viewing any of the talks from our previous seminars. Today is part nine of our series, with three talks from three large pharma companies: AstraZeneca, Boehringer Ingelheim, and AbbVie. As part of this, I would like to introduce the first speaker, Jon Hill. Jon, if you'd like to turn your camera on. So today, Jon Hill from Boehringer Ingelheim. He is a senior principal scientist with over 20 years of experience, currently focusing on the discovery and validation of new therapeutic concepts for liver disease. Although his core role relies on transcriptomics and similar technologies, he has found unstructured data such as text to be an invaluable component. Today, Jon will be talking about using retrieval augmented generation approaches to gather information for drug discovery. On that, I will hand over to you, Jon. -Thanks very much, Marc. I'll just start sharing my screen, and I hope that it's visible for everyone. Thanks again for the kind introduction. Today I'm really going to be talking about using these LLMs to retrieve and structure information. Although this isn't part of the core knowledge graph approach, I think it serves as a valuable complement, because it provides a lot of new ways to take data and structure it so that it becomes amenable to knowledge graph usage downstream. I'll start with a little introduction to the company I work for, Boehringer Ingelheim. It's quite large but, I think, unfamiliar to some folks since it is privately held; it has been independent and family owned since 1885, a very long history, with 25 billion in sales and a very heavy reinvestment into R&D of around 25%, which I think is quite impressive across the industry, and over 50,000 employees worldwide. I myself am located at the Connecticut site, quite near New York City, but there is also research taking place in Vienna, in Biberach in Germany, and at other locations. We focus on a lot of different diseases; my core focus personally is on cardio-renal-metabolic diseases. I work a lot on MASH, the metabolically driven liver disease, as well as chronic kidney disease, but many other research areas are covered by the company. My core focus is really on finding new drug targets from a variety of data sources. Historically, I was using a lot of transcriptomic data and still rely on it as it evolves from bulk RNA-seq to single-cell to spatial, and genetic information has also been very valuable for us. But often the outputs of these analyses are lists of genes that are really just a starting point for developing a new therapeutic concept; you still have a lot of work to do to understand the background information about them. For those reasons, I've been interested in natural language processing for many years, and I think it delivers a lot of value that you don't necessarily get from the more structured data sources.
So what I'm going to show today are some of the approaches I've taken with LLMs, and particularly how to build trustworthiness into them. I hope it gives you a bit of an appetite for the potential these approaches may have for your own research, along with some practical things I've run into that make them more or less useful in particular cases. So, a language model itself: you see lots of different varieties used for lots of different things, but at its core it is a probability distribution over a sequence of words. Language is used in many different ways, and these models provide a probabilistic understanding of that. Perhaps the simplest way to view it, for folks unfamiliar with the area, is the auto-suggest feature on your phone: you start typing something and, based on what you've said in the past and how language is commonly used, it gives you an idea of what may come next. The simplest cases are these auto-predict features, but you can imagine that as a model is trained on more information and has more parameters, it goes from being good at predicting the next word in the trivial case of a conversation, where you don't really depend on it being right as long as you get some decent suggestions, to cases where a lot of knowledge is embedded in the model itself and it can be used more directly for things like question answering and information extraction. The example here: if you start off with "the cat", there are certain things that cats are described as doing in language, and you can navigate from one word to the next. If the cat is sitting, there are certain places the cat sits, and in that way you build up a sentence, a paragraph, or a large amount of text. The thing to remember is that these models are probabilistic: a cat can do a lot of things in different proportions, and there isn't necessarily one right answer from a language model, but rather a spectrum of suggestions with varying degrees of certainty. These models have evolved a lot over the years. There is a long history of people trying to structure and understand language, going back to the 1950s with handcrafted rules about syntax and grammar, then statistical methods, machine learning, and so on. For my own use, I probably started getting interested in this in the 2000s, and although there were more sophisticated approaches then, what I found myself using a lot was co-occurrence. A common case would be: you have a new gene and you want to know what it is mentioned with in the literature. Chances are pretty good that if a gene is mentioned alongside a disease, there is at least some literature and study around that, and that can be a useful starting point versus something with zero co-occurrence. Of course, the complexity is that co-occurrence doesn't give you any sense of why the two things co-occur in the literature; it could just as easily be a study that invalidated a link between a gene and a target.
So you miss a lot with co-occurrence, even though it's a useful initial filter. Around this time, and probably into the early 2000s, I also started to get some experience with tools I would describe as an extension of the basic Boolean searches into more complicated pattern matching, things like the earlier versions of Linguamatics I2E, although that software has evolved considerably over the years. With these you could provide not just one word but a dictionary of synonyms, so if a gene is referred to by five different names you can slot that into a query. You can do proximity search, where you want the disease to be mentioned within five words of, or in the same sentence as, your particular gene of interest. There were also improvements in indexing and the like that made these complicated queries retrievable from a corpus of information. The issue was the learning curve: many of the folks interacting with these systems were extensively trained and had to bring a lot of experience to translate the questions a subject matter expert might have into these complicated queries to get the relevant information. They also didn't do a great job with things like logic; you could find the things you were interested in, but it would take even more effort to extract them into a structured format. Then of course we got the more advanced approaches. BERT was when I started to follow this more closely again, but the issue I found with BERT was that there was also a huge curve to getting these things implemented. In the very beginning it was hard to get pre-trained models; then things like Hugging Face appeared where you could get a model and implement it, but you still needed quite a sophisticated infrastructure for managing your documents, for fine-tuning your models, and even running them could take quite a bit of compute, since it was all done locally. Everything really started to change with the more recent large language models, ChatGPT probably being the one that kicked it all off, and then the various evolutions from OpenAI, Google, and Microsoft. This is the point where the technology became not only more compelling but more accessible to many people. You no longer had language specialists playing with this off on their own, separate from the subject matter experts; you started to have experts engaging directly with the systems, both because the quality had improved and because the user interface allowed for a natural language conversation with the models. Now, with this accessibility and power come additional risks. The concerns I saw much earlier, with the rollout of GPT and the rest, were focused very much on security. There were some cases of data leaks, and the way I view a lot of this now is that it wasn't really a feature of the LLMs; it was more a feature of the cloud-based implementation of many of these LLMs.
So people had very real concerns about interacting with third parties whose APIs weren't vetted to their satisfaction for security practices. In the same way that you would not trust certain publicly available APIs to process your raw data, you have those concerns with LLMs too. What I think is the unique risk of an LLM compared to other API-based data sources is that the kind of data that can go into an LLM is unusually broad. If I'm interacting with a transcriptomics-based API, the risk is that my transcriptomics data may be disclosed in some form; across the whole organization that's a risk, but it is a risk with some bounds to it. With LLMs you could just as easily imagine a very senior person in the organization using one to restructure and format their meeting minutes as a scientist in drug discovery submitting their queries. So the risk unique to LLMs is how many places in the organization, and how many types of sensitive data, an LLM can at least potentially touch. I would say that for the most part this has been managed by organizations at this point: they are largely working with trusted players in the LLM space, or in some cases they have moved to more local implementations. Either way, I think there's a good path forward, even for folks who may not have addressed it completely yet. Perhaps the more persistent thing that's unique to these machine learning approaches is a subtler form of information leakage. You don't necessarily have a disclosure of your sensitive data by the entity running the API, but it is always in the interest of those providing machine learning solutions to have as much high-quality training data as possible. So it's very tempting for a company to say: we've got query submissions, we know how people are interacting with these models, and we'd love to use that to make our models better in the future. The risk is that if your sensitive data is used to train these models, there may be some carryover of that information into the model itself, which may be accessible by a third party later when they interrogate the model on their own. Again, people are quite sensitive to this now, and many policies rely on how you let the vendor interact with and reuse your data, so there is awareness of this and there are proper processes for managing it. But I can imagine, especially for smaller players in this space where the temptation is a little stronger, that as an industry person dealing with sensitive data you want to make sure this is very properly vetted and addressed. The risk I'm going to talk about a little more today is hallucination. For folks who are not as familiar with how this works: if we look at the rock on the right, many people would look at this boulder and say, oh, I see a face there, with eyes, a nose, and a mouth. That's really a function of the way humans interact with the environment; we're very keyed into facial recognition because it helps us manage our social interactions.
So it's really important for a human to see faces and interpret them quickly; the downside is that we have a little bit of carryover and sometimes see faces where they shouldn't be. These models operate in a similar way: because they are designed to produce plausible output, not necessarily truth, they often come up with convincing but unsupported answers. An example I've seen quite a lot: if you ask a model for references around something, the model knows very well what a PubMed ID should look like and what an author and year should look like, so it has a certain predisposition towards answering with something that looks like a real reference but may not actually exist. It isn't doing a second evaluation of truth on its own. And the risk, aside from introducing false information, is that people are not necessarily well positioned or trained to evaluate it. As somebody who has been trained to look at information on the internet, you have certain tools for judging whether information is false: is it referenced, is the reference a trustworthy source, is the language consistent with somebody making a scientific statement, or is something a little off? Many people can look at a spam email and say it doesn't quite sound right, so we'll disregard it. The problem is that a lot of that training and those filters don't work for LLMs, because LLMs are designed to provide you with plausible output. So you have to vet them and control the risk first, and I'll get into some strategies for that, but you also have to realize that your standard ways of detecting false information aren't necessarily as applicable to the false information you see from LLMs. A strategy that has been around for a little while to address this is called retrieval augmented generation; you'll often see it abbreviated to RAG. It's a way to manage hallucinations. The scheme, at its heart, is to build a database of information that you have vetted in some way and consider valid for the model to output, and then pass that into the prompt to be used by the model. An example prompt would look something like: given the following text, abstract one, abstract two, abstract three, answer the following question. Behind the scenes it can look a little different; sometimes the text is passed directly into the prompt, sometimes technologies like LangChain hide the fact that this is going on, but it's a very similar concept. There are a couple of advantages to this. The first is that it allows new information to be used by a model without retraining. Something I didn't mention as a limitation of LLMs before is that they are hugely costly to train, so the training data is often not that current; you may be a year or two out of date, and if you're interested in keeping up with the most current scientific information, that year can start to be really important. So this RAG-based approach lets you query the latest data sources and pass that into the model at runtime to inform its answers.
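(Editor's note: a minimal sketch of the kind of prompt assembly described above, assuming abstracts have already been retrieved and the OpenAI Python client is available; the model name, toy abstracts, and question are illustrative placeholders, not the speaker's actual setup.)

```python
# Minimal RAG-style prompt assembly: ground the answer in retrieved abstracts.
# Assumes OPENAI_API_KEY is set; model name and abstracts are illustrative only.
from openai import OpenAI

client = OpenAI()

retrieved_abstracts = [
    "Abstract 1: ... text of a PubMed abstract judged relevant ...",
    "Abstract 2: ... another relevant abstract ...",
    "Abstract 3: ... a third abstract ...",
]
question = "Is gene X associated with disease Y? Cite the abstracts you used."

context = "\n\n".join(
    f"[{i + 1}] {text}" for i, text in enumerate(retrieved_abstracts)
)
prompt = (
    "Given the following abstracts:\n\n"
    f"{context}\n\n"
    "Answer the question using only this material and cite abstracts by number. "
    "If the abstracts do not contain the answer, say so.\n\n"
    f"Question: {question}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-completion model would do here
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```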
The other nice thing about this is that the information is separate from the model. The alternative idea of training or fine-tuning a model to incorporate new or sensitive information is not only costly; keeping the data outside the model also gives you control over what information goes to the model. If you have a very sensitive database that should be accessed not by all internal employees but only by a subset, you can use very conventional access controls on that database, by user role or by group, and use those to determine what data is passed into the model. So even though the base model may be identical between two users, the kind of information each user has access to and can bring to bear on their questions will be distinct according to their roles. The way I like to view this is not entirely accurate, but it feels like a nice explanation to me: we're basically no longer using the LLM as a source of truth; we're using it as a highly sophisticated language parser. It's a parser that doesn't just understand co-occurrence or proximity; it understands linguistic structures and basic logic. I should probably be putting "understands" in air quotes every time, but that's how it works. You're not relying on the model so much for information as for the interpretation and structuring of information. The schema on the right shows one basic example of how you might do this. You have some code, here shown in Jupyter as a plain Python script. You pull data from PubMed; there are great utilities for this, the E-utilities have been around for ages and work very well. You build what's called a vector database, which I'll say more about shortly, using an embedding function, and now you have a nice database of structured information. If you come back later with questions, another script retrieves relevant information from your database, passes it into the prompt, and produces a report that is both informed by the data and able to reference the distinct documents found in that first data set. Okay, a little side note here on the importance of this vector database. Large language models don't really deal with text directly as written, start to finish. The text undergoes tokenization, which splits it into small units, and is then converted into an embedding, which is, in essence, how the model structures and views the data. What an embedding means, and where things sit in relation to each other in vector space, depends on the model, so the choice of embedding is quite important. The anecdote I can give you is that I tried a couple of different embeddings for some question-answering tasks: BioGPT a while ago, and then some of the more chat-based models. If you think of a biological question like "what is the role of the hepatocyte in the liver?", the embedding will convert that into a spot in the space.
If I ask BioGPT to embed things into that space, the question "what is the function of a hepatocyte in the liver?" will land very close to "what is the function of a Kupffer cell in the liver?". BioGPT is really good with these kinds of biological entities, but its embedding doesn't really carry the concept of question answering, where questions and their answers sit near each other; it would have to be specifically fine-tuned to do something like that. If instead I use an embedding based more on a chat-style architecture, "what is the function of the hepatocyte in the liver?" will land very close to answers in that space, for example something about processing toxins. That's very helpful if you're using this for RAG: if your task is question answering or something like it, you want to choose an embedding that permits it, so that when you retrieve material it is the material that answers your question, not just more questions. Now, the vector database: all it does is take your embeddings and store them in a way that makes it very quick to say, I have an embedding that looks like this, give me the 100 or 500 most similar things in embedding space. They are very, very fast at that task. A lot of them also support other metadata, and depending on the vector database the performance there can be variable; if you want a combination of a vector database with structured metadata on top, you may have to play around with it and optimize a little, but it is a widely supported feature. The last thing I'll note here is that any LLM can generate embeddings, so you could run a straight GPT-4 model to do your embedding. However, that gets expensive really fast, because if you're building one of these vector databases you're not processing five documents, you're processing thousands. So a lot of times there are stripped-down, lightweight models that are sufficient for embedding, even if you wouldn't use them directly for language processing or question answering. The classic one from OpenAI would be ada, and now there is a variety of higher-performance-for-the-cost models, the text-embedding large and small variants from OpenAI, which are quite fast and quite good at scaling, so when you do want to build these larger databases you're in good shape. All right, this next part is just an aside, because I thought it was a fun example: the embedding space itself can actually be really interesting for some questions, not just a convenience. This was a case where I was talking to some colleagues from the therapeutic areas and getting a sense of what assays they were running and what scientific methods they were investigating. And as anybody who has talked to end users or subject matter experts who don't care a lot about ontologies knows, they'll give you long lists that are redundant, unstructured, and variable in language usage.
And if you want to turn that into an ontology, or even just structure it in a simple way, that can be a very manual and complicated task. What you can do instead, and I almost think of this as an extension of the approaches you'd use with word2vec, is take these terms, throw them into embedding space, which is basically a complicated set of coordinates, and perform clustering on it. Here I'm showing a UMAP plot just so I have something to display in 2D, but you can assign clusters based on the higher-dimensional space. Then you can say, these things all belong to one cluster by how the language model views them. Again, the choice of embedding may matter here, but a lot of the general-purpose ones work quite well for this too. Okay, so over here we have a group of things like food intake preference and nutrient absorption, and over here cardiomyocyte contractility, adipocytes, human iPS cardiomyocytes, this kind of thing. The first step is assigning the cluster, but the cool wrinkle that makes use of the LLMs is that LLMs are also really good at assigning more human-readable labels. Rather than having an ambiguous "cluster one" where you have to look at each member and figure out what cluster one is, you can pass the membership back to an LLM and say, give me five words that describe this cluster. It won't work in every case; if your clusters are too generic you might get a cluster name like "biology concepts", and you can't do much with that. But if your clusters are distinct and well separated, you can get some very helpful names, and this can really help you structure your data with what I would say is a very modest investment in curation. Okay, this slide shows a more concrete example of the RAG approach, in this case connecting in vivo models to different diseases. Here I've done it almost as a two-part query, though you could run a one-pass version with RAG. I start off with a scientific question: what are some in vivo models of celiac disease? I'm just asking this without RAG, to see what would be suggested by the LLM, and it comes up with a list of different models; IL-15-treated mice was one, and say you get ten or twenty different examples. Again, you don't trust these: they could be hallucinations, or they could just not be well supported by the research. So then you go back to a reference database; at the time I was using a simple database of 500 PubMed abstracts on each disease, and I ask: for this animal model, IL-15-treated mice, provide evidence and references that it is associated with celiac disease. And I get back referenced statements saying it is supported, and here is the PubMed reference. I think this can give you a lot of power for validating your findings.
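(Editor's note: a rough sketch of that two-step propose-then-validate pattern, assuming the same OpenAI client style as the earlier sketch; the disease, the placeholder abstracts, and the model name are illustrative, not the speaker's actual pipeline.)

```python
# Step 1: let the model propose candidate in vivo models (treated as hypotheses).
# Step 2: check each candidate against a curated set of abstracts for the disease.
# Model name, disease, and abstract store are illustrative only.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# Step 1: unconstrained suggestion, unverified output.
candidates = ask(
    "List some in vivo models of celiac disease, one per line, names only."
).splitlines()

# Step 2: validate each candidate against vetted abstracts (e.g. a few hundred
# PubMed abstracts per disease, retrieved beforehand). Tiny placeholder list here.
celiac_abstracts = [
    "[PMID 12345678] ... abstract text ...",
    "[PMID 23456789] ... abstract text ...",
]

for model_name in candidates:
    verdict = ask(
        "Given the following abstracts:\n" + "\n".join(celiac_abstracts) +
        f"\n\nIs the animal model '{model_name}' associated with celiac disease? "
        "Answer yes/no and cite the PMIDs that support your answer; "
        "if there is no support in these abstracts, say so."
    )
    print(model_name, "->", verdict)
```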
And you can take approaches like this that combine the inventiveness of the model with validation via RAG. That's just one potential idea, but you can imagine it is extremely generalizable. A little aside here, because anybody with a text-mining background of course wonders about full text: should you use it, and is it beneficial? Historically, people have gotten enormous value from using full text rather than strictly abstracts. The issue is normally data access. Aside from PubMed Central, which is fantastic and relatively straightforward to ingest and parse, you also have all the paywalled articles behind publisher websites, and even as a company that may have a broad license to these, actually navigating their retrieval is non-trivial; often you end up with solutions that are specific to a publisher. So that's the first thing you have to get around to use full text for these approaches. The other issue you run into is the context window. This is not as much of an issue with some of the older technologies, but for RAG you are deciding which subset of articles is interesting and cramming those into the prompt to answer your question. Most of these models have reasonably large but not infinite context windows: there are GPT-4 versions that I think go up to 128K tokens now, but often it's smaller, maybe 16K for some of the routine models. So if you go from abstracts to full text there's a trade-off: maybe you get more high-quality information, but at the expense of a slimmer breadth of articles overall, unless you work in batches and synthesize at the end. What I tend to say is a reasonable rule of thumb: if you're interested in general questions, the relationship between a target and a disease, evidence for an animal model, something like this, then stick to the abstracts. Consider switching to full text when the evidence is not even going to be in the abstracts: am I interested in how long the animal models were dosed, or specific affinities for a compound, something like this? If it doesn't exist in the abstracts you need full text; you just want to use it consciously, for the specific questions where it makes sense. Longer term this is probably not going to be as much of an issue, since the context window keeps creeping upward for all these models, but in the near term it is something you have to deal with. Of course, you don't just have to use RAG for published literature. It's super easy to get material from PubMed, but you can also use your own documents, and in many cases that's going to be the highest value for your organization. Really, a lot of this just relies on text extraction, so your text can come from anywhere: PDFs, Word documents, PowerPoints. You pull that text out, put it into your vector database in a nice, structured way, give your questions to your LLM, and you get structured data out, as in the rough sketch below.
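(Editor's note: a minimal sketch of that kind of in-house document ingestion, assuming pypdf and python-pptx for extraction; the file names, chunk size, and downstream embedding step are placeholders, not the speaker's actual tooling.)

```python
# Pull raw text out of internal PDFs and PowerPoints so it can be embedded
# and queried like any other corpus. File paths and chunk size are illustrative.
from pypdf import PdfReader
from pptx import Presentation

def pdf_text(path: str) -> str:
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def pptx_text(path: str) -> str:
    prs = Presentation(path)
    chunks = []
    for slide in prs.slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                chunks.append(shape.text_frame.text)
    return "\n".join(chunks)

def chunk(text: str, size: int = 1500) -> list[str]:
    # Naive fixed-size chunking; real pipelines usually split on sections.
    return [text[i:i + size] for i in range(0, len(text), size)]

documents = []
documents += chunk(pdf_text("internal_report.pdf"))
documents += chunk(pptx_text("project_update.pptx"))
# Each chunk would then be embedded and stored in the vector database,
# exactly as with the PubMed abstracts earlier.
print(f"{len(documents)} chunks ready for embedding")
```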
The other thing I'd say that's nice about these models is that, at least in my experience, they appear less sensitive to formatting concerns than some of the previous text-based approaches. You used to have to be really strict about what case the letters were in, whether there were spaces in unusual spots, and all of that. If you've got cleaner data, that's fantastic, and I think it's worth making some minimal QC checks and cleaning things up, but I do feel these LLM-based approaches are less inherently sensitive to formatting than the more pattern-based ones. Okay, a couple of things to watch for. First, you can run up a big bill. Many models charge not just for the generated response but for the size of the prompt, so if you cram a lot of stuff into your prompt, particularly dozens of abstracts, you may wind up with a bigger bill than you anticipated, especially with some of the larger and more expensive models. The other thing I'd say is that the RAG concept works well if you're looking for a specific reference for a concept: if you're not sure there's an association between a gene and a disease, RAG is great. If the concept is really well covered by the literature, you run into the issue that RAG is only considering so many documents, unless you iterate. Say I'm covering the use of anti-TNF antibodies in rheumatoid arthritis: super well published, established clinical trials, hundreds if not thousands of articles. The model is only going to consider a subset of those with a conventional RAG approach, so the answer is not going to be a complete summary. What I would suggest, unless you want to be comprehensive and loop, summarize, and then consolidate, is to try to match the resolution of the documents to the questions. There could be cases where it's a very new area for you and you just want to see what the expert opinions are; maybe you actually use review articles rather than all abstracts. In other cases you really need to drill down to full text to get what you want. Okay, the last thing I want to leave folks with, because I'm super excited about it and I think it opens a lot of opportunities, is the use of mixed-modality LLMs. I think many people have seen examples of GPT-4 with vision and similar things and thought, oh, that's an AI vision thing, maybe it's not as applicable to how I'm viewing my documents and scientific information. But this has really come a long way. In many corporations, PowerPoint is a currency of information exchange, and speakers have been hammered with the message, which I guess I'm currently ignoring based on this slide, that slides should be visually appealing. A lot of the time the message is carried by the visuals of the slide, and the speaker's voice is the complement to it; it's not that everything is in text on the slide. So if you just pull the text out of a slide and pass it to an LLM, you're going to wind up with big gaps in your knowledge. These vision solutions can provide some path forward here; a rough sketch of what such a call might look like follows.
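(Editor's note: a minimal sketch of passing a slide or figure image to a vision-capable chat model, assuming the OpenAI image-input message format; the file name, prompt, and model are placeholders rather than the setup used in the talk.)

```python
# Send an exported slide/figure image plus an interpretation prompt to a
# vision-capable model. File name, model, and questions are illustrative.
import base64
from openai import OpenAI

client = OpenAI()

with open("publication_growth_chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

prompt = (
    "Please interpret this figure. Describe its content, give an overall "
    "interpretation, and answer: which category shows the highest growth, "
    "and what symbol is used to represent the vaccines category?"
)

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```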
It's currently expensive and slow, but even in the three months or so since I started piloting this, it has gone from "that's a nice thing to show" to "we could actually do this on a subset of our information", and give it another six months or a year and I'm sure it will scale just fine to the things you're interested in. So here's a quick example. This is a graph I pulled out of a publication on AI in drug discovery, just showing the growth in the number of publications by year. What you can see here is a graph that could represent any kind of scientific data presentation: it could be a mortality curve, it could be a schematic of how a mouse was dosed over the lifespan of a project, it could be a Gantt chart from one of your project presentations. You pass in a prompt like this: please interpret the figure, describe its content, give an interpretation, and answer specific questions like what has the highest growth rate, all this kind of stuff. Passing this into the model, it got, I would say, five out of six of the questions pretty much dead right, right down to what symbol is used to represent the vaccines category: not only could it figure out where vaccines was, it could say that the symbol is a syringe. And it could even do basic interpretation of some of the content, for example that understanding diseases has the highest growth. I do have some examples where it misinterpreted a graph, and I think this is still very much a case of keeping a human around to review things before you really trust them, but you can take data that is unstructured even by unstructured-data standards and start to convert it into things that you can really make operational and useful for your organization. Okay, so in summary: these approaches give better transparency with the use of RAG; you can supplement and supercharge some of your conventional curation approaches by using this to extract information and help group concepts; and when you complement your curation with LLMs, this doesn't have to be your only approach. Now that you have this nicely structured data that can go into your knowledge graphs or even conventional databases, you can still get a lot of utility just by putting all the text you have lying around, or that you're pulling from the literature, into a searchable, indexed format. And the last thing I'll leave you with is, of course, this multimodal capability, which I think is fantastic. I've had a number of great conversations with fellow colleagues at Boehringer and folks in IT as part of the process that made all this possible, and I'm looking forward to the discussion we'll have after the other speakers' presentations. I think we're at the point where these approaches are really becoming applicable to our projects, and we're also starting to build the appropriate level of trust in LLMs to really get the most out of them. So thanks, everyone. -Okay, well, thank you, Jon, for a great talk. Obviously LLMs and these RAG implementations are really of interest and at the forefront at the moment.
And again, it's something we're particularly interested in here at Biorelate with our Galactic data products. I'd like to quickly remind everybody of the format for today. We're going to have the next speaker come along shortly, but please do send any questions to the chat box; at the end, all the speakers will come back, together with the CEO of Biorelate, for a roundtable discussion to answer those questions. So we'll go straight on to our next speaker. Miguel, if I could ask you to turn your camera on. Our next speaker is Miguel Gonçalves. He is a senior biomedical information scientist at AstraZeneca, currently focusing on the implementation of multimodal knowledge graphs for patient stratification and biomarker discovery. He has experience in both early-stage and late-stage pharmaceutical R&D across multiple therapeutic areas. Miguel has a background in biomedical engineering and completed his PhD in biophysics at UCL, focusing on the tumour microenvironment using magnetic resonance imaging. Today, as I say, Miguel will be talking about multimodal knowledge graphs for precision oncology. -And on that I will hand over to you. -Thanks very much, Marc, for the kind introduction. It's a great pleasure to be here today presenting on behalf of AstraZeneca, and specifically the early data science team in oncology R&D. Today I would like to discuss what we have been doing in the field of multimodal knowledge graphs for precision oncology, but first I would like to briefly go over who we are and what we do. We are part of the early data science team in oncology R&D, which is divided into three main groups. The first is portfolio bioinformatics, focused on new targets, candidate drugs, and biomarkers. The second is the data science team, whose main aim is to develop new methodologies to accelerate the portfolio and harness the power of data; this is also where the knowledge graph team sits. And finally we have a computational oncology team that models the complexity of the tumour microenvironment to find druggable biomarkers. These teams are not siloed; we work in collaboration across different projects. Focusing now on drug discovery, we are primarily interested in a couple of things: first, the disease biology, to find the druggable targets; second, the pharmacokinetics and pharmacodynamics of the drugs that we develop; and finally, how these two interact, such that we find the right drug for the right patient. Knowledge graphs are now being used all along the phases of the drug discovery pipeline. They can be leveraged in many ways, and at AstraZeneca we have used them as a recommendation system for drug resistance and target selection. We have, in fact, developed our own biomedical knowledge graph, which we call the Biomedical Insights Knowledge Graph, or BIKG for short. At present it contains data from more than 60 sources, both public and internal, which gives it a size of 14 million nodes and 110 million edges. The idea is really to use the graph as a source of context and features for recommendation systems that are inherently explainable, since you can go back to the graph and map the path. It allows you to answer these types of questions: What is the disease mechanism? Which targets should I choose? What drives the efficacy of drugs? And leveraging BIKG, we were able to implement an explainable recommendation system for drug resistance mechanisms.
I won't really delve into much detail on this, as Ben has already given an excellent review of the work in an earlier talk in this series, which you can go back and view, but I wanted nevertheless to highlight this application, which was a successful one and has been published in Nature Communications. The core of my talk will thus be focused on the recent work we have been doing around patient stratification and biomarker discovery with knowledge graphs, and the versatile, flexible framework we have developed around that. So we all know that knowledge graphs are suitable for multi-omics integration, and here we have an example representation of what a multi-omics knowledge graph could look like. We connect patient nodes to their features, shown here in green, from different omics: you can see that a gene mutation, in blue, becomes its own node, and a gene expression value, in mulberry, also becomes its own node. Patients are connected to their clinical endpoints; some are alive and some are dead. You can also see that patients with different endpoints can share features, but equally have features that distinguish them. From here we can apply tools and methods such as community detection algorithms to identify patients with similar characteristics and group them accordingly, and similarly we can extract biomarker signatures associated with the endpoint. So in blue we have a node for gene A wild type, which is common in community one, representing the patients who are alive, and for the patients who died we have a gene with particular characteristics in terms of methylation and RNA-seq values. Leveraging these principles, we have been developing internally an end-to-end framework for patient stratification and biomarker discovery which we call PRESNet, standing for Patient recommendation via Stratification and Selection using networks. It is highly flexible and, as you can see here on the left, it supports multimodal integration: you can add clinical features, omics, imaging features, prior knowledge, you name it. What happens is that the tool then creates a dedicated knowledge graph based on the input data and performs its computation on that knowledge graph. It stratifies the patients into suitable communities based on the endpoint you define; this is usually overall survival, but it is not restricted to survival endpoints, it can be anything. This ultimately gives us what we really wanted to extract: patient stratification alongside the identification of composite biomarkers associated with these communities, which can look something like the table here. We have implemented this framework on large public clinical datasets such as TCGA as well as internal studies, and in fact have a couple of manuscripts in preparation. Next, I would like to show you a couple of use cases of using PRESNet for patient stratification and biomarker discovery. Both of these applications were accepted for publication earlier this year at the American Association for Cancer Research conference in San Diego, California. The first one is from the work of Jake in our team; Jake is an extremely talented data scientist and is also the person responsible for developing PRESNet. Here we used the lung adenocarcinoma cohort from the Memorial Sloan Kettering Cancer Center, which includes clinical data and treatment history.
In this case, it's almost exclusively anti-PD-1 treatment, along with a mutation panel. PRESNet was then applied using supervised community detection on overall survival, and you can see on the first plot how it was able to separate the patients into three different communities with different survival rates. But crucially, you can also associate the survival with the biomarkers that are most present in those communities of patients. You can see on the right plot that the patients who have both a mutation in the STK11 gene and low albumin, the blue curve, have worse survival, and this is tightly associated with the patients of community two, in green on the first plot, which are the ones with the worst survival. Additionally, we assessed the performance of a deep neural network when using the composite biomarker binary outputs from PRESNet against using the original raw continuous data. The table on the right shows that representing patients in terms of binary matrices capturing their composite biomarkers massively outperforms using raw univariate features for downstream classification of endpoints, and you can see that this is true both for survival status and for best overall response. So this is encouraging, in that we have not only explainable biomarkers but also high performance metrics. In the second use case we continue the theme of patient stratification and biomarker discovery. It involves using the well-known multimodal The Cancer Genome Atlas (TCGA) dataset alongside the proteomics-heavy CPTAC dataset for validation. In both datasets we focused on non-small cell lung cancer, which, as you know, includes two tumour subtypes: lung adenocarcinoma and lung squamous cell carcinoma. For TCGA we focused on the clinical, RNA-seq, and methylation data, and for CPTAC we used the proteomic data that is its main feature. For all of these modalities we filtered the datasets to a common gene panel, and as in the previous example we used the approach of supervised community detection on overall survival. So firstly, for TCGA, we were able to identify and separate three different communities from the original non-small cell lung cancer cohort, each with a different survival profile. For CPTAC we obtained a similar result, this time obtaining two communities from the original non-small cell lung cancer cohort. So now we have these communities of patients, and from here we can go on to assess how they relate to the established histology labels, adenocarcinoma or squamous cell carcinoma. Focusing initially on the two largest communities of patients that we obtained, community zero and community two, we observed that there is a clear preference for squamous patients to align with community zero and for adenocarcinoma patients to align with community two, and so we call these concordant patients for simplicity. However, there was a small number of patients in each community that did not really align with the majority: we have here 43 patients in community zero that are in fact adenocarcinoma patients, and 29 patients in community two that are squamous. We call these discordant patients, as they do not follow the general observation. So in summary, this means that concordant and discordant patients of the same subtype are assigned to different communities.
Now in CPTAC, we performed a conceptually similar analysis, and although the presence of concordant and discordant patients was not as obvious here as in TCGA, we decided nevertheless to adopt the same strategy so that we could perform the same kinds of downstream analyses and comparisons. From here we found something really quite interesting: patients of the same subtype, whether adenocarcinoma or squamous, have different overall survival depending on whether they are concordant or discordant, which is not something you would necessarily expect a priori. In TCGA this is true for the adenocarcinoma subtype, as you can see here at the top, but not for the squamous subtype. However, and crucially, in CPTAC we observe the same behaviour, this time not just for adenocarcinoma but also for squamous cell carcinoma. So insights such as these can bring us forward in understanding the patient subtypes, and we can potentially even leverage them to refine precision medicine strategies. Now, the other aspect is really to understand which biomarkers are associated with long or poor survival. In this situation, we saw that when we look at the community of patients with better survival, in the top plot, there were a number of biomarkers associated with that population, and when we evaluated them on the whole TCGA population, we saw that patients presenting all of those markers had better survival than patients who lacked at least one of them. A similar but inverse concept was employed to find the markers of poor survival, where across the whole population the patients presenting that signature had worse survival than the rest. And we could only do this because PRESNet identified those communities in the first place. What we did next was to validate these survival signatures in CPTAC, where crucially we found the very same signatures demonstrating the very same survival behaviours for long and poor survivors. If you look closely, you'll see that the gene signature associated with long survivors in the top row is the same between TCGA and CPTAC, and the same observation is true for the poor-survivor signature in the bottom row. So in this way we validated the ability of PRESNet to identify generalizable signatures, but we also found in the process that it is very important to identify the correct data for each task and to make use of the strengths of each data source. With this, I am coming to the end of my presentation, and I just want to leave you with some considerations to be aware of when using knowledge graphs. First of all, knowledge graphs are noisy, and as such it is important to prepare your data in a way that minimizes this inherent noise. This ties in with the second point, which is that biological data is not standardized, of course, and so we must make an effort in that regard. Additionally, context is very important: it is almost always, if not always, better to use a dedicated knowledge graph to answer your question, rather than a huge knowledge graph with a wealth of information that you don't need. And even though graphs are inherently explainable, that does not necessarily mean that interpretation will be easy. Last but not least: do you even need a knowledge graph solution?
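(Editor's note: as a rough illustration of the patient-feature graph and community-detection idea described earlier in this talk, not the PRESNet implementation itself, which uses a supervised, survival-aware method, here is a toy sketch in Python with networkx; the patients, features, and algorithm choice are purely illustrative.)

```python
# Toy patient/feature knowledge graph: patients link to molecular or clinical
# feature nodes, and a generic community-detection algorithm groups patients
# that share features. Data and algorithm choice are illustrative only.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
patient_features = {
    "patient_1": ["geneA_wild_type", "low_albumin"],
    "patient_2": ["geneA_wild_type", "geneB_high_expression"],
    "patient_3": ["STK11_mutant", "low_albumin"],
    "patient_4": ["STK11_mutant", "geneC_hypermethylated"],
}
for patient, features in patient_features.items():
    for feature in features:
        G.add_edge(patient, feature)

# Unsupervised modularity-based communities over the patient/feature graph.
communities = greedy_modularity_communities(G)

for i, community in enumerate(communities):
    patients = sorted(n for n in community if n.startswith("patient"))
    features = sorted(n for n in community if not n.startswith("patient"))
    print(f"community {i}: patients={patients}")
    print(f"  candidate composite biomarker: {features}")
```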
So with this, I would like to acknowledge all of the people who have been involved in the various knowledge graph efforts at AstraZeneca, and we are happy to take some questions -afterwards. Thank you very much. -Thank you, Miguel, another very good talk, and I liked your reference to Ben's previous talk in this series. On that note, I'd just like to remind people again that all of these previous talks have been recorded, so feel free to reach out to anybody from Biorelate if you'd like to access some of these other recordings. We're doing pretty well for time; we're on schedule at the moment, so thank you to the first two speakers. And just to remind you, please send over any questions; these will all be addressed at the end. So on that we will move forward. Jon Stevens, could you please turn on your webcam and we'll introduce you. Jon Stevens is part of AbbVie's R&D organisation and is a founding member of a new team dedicated to bringing generative AI solutions to the enterprise. He received his PhD in Linguistics in 2013 from the University of Pennsylvania, after which he worked as a researcher in computational linguistics and cognitive science for five years before joining AbbVie as an NLP engineer in 2018. When he is not harnessing the power of language models, Jon enjoys playing video games with his family and noodling on the banjo. So today, Jon is going to be talking to us about Ask ARCH: LLM question answering over large-scale knowledge graphs. -And on that I'll hand over to you, Jon. -Thank you so much. Can you hear me fine? -Yes. -All looks good. -Wonderful. All right, so thanks so much for having me out, and thanks so much to the organizers for organizing this. I think there's a nice arc to it: in the first talk we learned all about RAG, in the second talk we learned about the power of knowledge graphs, and in this talk I will attempt to combine the two. So this presents some work that we did late last year at AbbVie, together with some friends at ZS, exploring whether we could build essentially a RAG system over a large-scale knowledge graph that we have for our R&D data at AbbVie, which we call the ARCH. I want to basically present what it was that we did for this project, give an overview of the solution, its performance, and its anatomy, and then really focus on the lessons that we learned. This was a system we built as a functioning prototype, and then, rather than putting that prototype into production, we are now dissecting it and embedding different parts of it in different places, having learned some interesting lessons, which I hope are useful to everybody here, or at least a subset, about what LLMs over knowledge graphs are good at, what they're not good at, and what the end user experience is like. So let's get into it. What we wanted to achieve was adding what we call a source; in the parlance of the first talk we saw today, this is essentially a database for RAG. We already have at AbbVie what we call our AbbVie Intelligence Platform, and yes, we named it before Apple Intelligence came along; I guess it's not that original of an idea.
But it's basically our company-internal tool for accessing ChatGPT in a data-safe, protected way. And then we've got some utilities like analyzing and summarizing uploaded documents, some dedicated language translation models, and this thing called Ask a Source, which is very much along the lines of what we were seeing in Jon's talk earlier. An Ask PubMed is what we started with as the main knowledge source, so I can ask a scientific question and get an answer from the literature with citations to PubMed articles, using the abstracts, not the full text. So I loved that presentation; it did a really good job of laying out all the different considerations, including access restrictions, which is always a headache. What we wanted to do is add to this data from our AbbVie R&D Convergence Hub. The AbbVie R&D Convergence Hub, or ARCH for short, consists of a couple of things. Number one, it consists of a data lake that brings together a lot of different sources of data, both structured and unstructured, from the R&D space. This includes real-world data, but it also includes literature data, data about genes and targets, DrugBank, you name it; we bring it all together. And then a subset of that data is mapped using ontologies and put into a knowledge graph. That's what we call the ARCH Graph. The ARCH Graph is a Neo4j knowledge graph with 30 million nodes and a billion edges, representing things like chemical similarity, adverse effects; again, if you name it in the R&D space, it's probably in there somewhere. So if somebody has a scientific question and there's an answer lurking somewhere in that knowledge graph, how can we use a large language model to get it out? We conceived of this as adding answers from the knowledge graph to our RAG system. That was the idea. And so, probably preaching to the choir a little bit here, but we think it's really powerful to combine these two technologies. Knowledge graphs are great at handling relationship data, they're super flexible, and the querying can be heavily optimized, so running Cypher queries over a large-scale graph can be pretty fast and get you a lot of complex, multi-hop relationships between entities, which are really useful. On the other side of the coin, generative AI is, of course, the thing of the day. There are definitely limitations and risks, which I think were well presented earlier as well, but LLMs can really accelerate searching, they can accelerate data-supported reasoning, they can improve user experience by adding a natural language layer on top of structured or unstructured knowledge or data, and they provide some amount of contextual understanding. So with that in mind, I'll just present what it is that we did. Again, this was a prototype, so we didn't go all in on pretty visualizations for the graph component.
But what we were able to do, and it's maybe a bit small on the screen here, is show an example from our knowledge graph. It's just a super simple example that I made up to illustrate, because I don't know what to ask it, because I'm not a scientist: something like, what are the side effects of aspirin? You can imagine what that looks like coming from PubMed: okay, the side effects of aspirin include XYZ, ABC, and here are links to the abstracts on PubMed where those things are listed out. It looks quite different on the Ask ARCH Graph system, because you get a similar answer (the side effects can include memory impairment), but what you actually get is a Cypher query which was matched against the knowledge graph, and then you get the raw output from the knowledge graph to fact-check against. So when you see these nodes listed out, and in a future tool we will visualize these a little bit better, memory impairment, hypoglycemia, and so on, these are adverse event entity nodes in our knowledge graph that are being pulled back by a Cypher query that was generated based on the user's question. And so, to the extent that we trust the data that's in the graph, we can trust the answer, if that makes sense. We are using GPT models, GPT-3.5 and GPT-4; if we were to do this again today, we would have other models at our disposal as well. And we're trying to do this to really increase the relevance of the answers, as well as explainability and scale. And here's our solution. This is maybe getting a little bit into the weeds, but maybe fun for some of you. Here's generally how it works. The user asks a question, what's the brand name of a particular drug in this example, and anytime you see the little green OpenAI logo, that's a place where we're making a call to a large language model API. In this case we ended up landing on the GPT-4 model; again, we've now got access to Claude 3, Llama 3, and other models, and it wouldn't have to be an OpenAI model, but that's just what we happened to use for our prototype. And there are a couple of things that go on; I'll explain some of these design choices a little bit later. The first thing is that we actually took every node, every entity, in that knowledge graph, created a linguistic representation of that entity using a template, vectorized it, and put it in a vector database. The reason for that is that it improves performance to do so; I'll show those numbers in a little bit. Basically we vectorize all of that, and then, based on the user's question, we can retrieve the most relevant entities from the database. Those serve as reference entities for the large language model, so that we're not just asking the large language model, hey, generate me a Cypher query corresponding to this question based on this graph schema; we give it the schema and we give it these reference example entities. That way it can see exactly what the property names are that are associated with adverse event nodes, or, excuse me, in this case drug nodes: how "brand name" is actually called in terms of the name of the property on that node.
Things like that. So you pull that contextual information in, put it in your prompt along with the general graph schema and the user's question, and all of that goes into a prompt (you see the little prompt engineering icon) that basically says: generate me a Cypher statement. You also see another call that says entity and query conversion; also going into that prompt are any expansions of synonyms that might be relevant. So all of that goes into this large prompt to generate a Cypher statement. Now we're at step three, where the Cypher statement is going to be something like: match d, a drug node, where, and then something about the brand name and synonym matching, and then extracting the brand name property, or something like that. Once you have the Cypher statement, you just execute it, you pull in the data from the query, you get that context back from Neo4j, and then you make a third and final call to the LLM to take the raw graph data and summarize it into a human-readable response: the brand name of the drug is Mavyret, and here's the data from the knowledge graph that shows it. So it's a little bit convoluted, but there are reasons for it; that's what we ended up with in our prototype and how it actually works. Now, to talk in a little more detail about how the vectorization works: as I said, we come up with linguistic representations of the nodes, of the entities. And this was interesting, right? Because we thought, well, we might do better if we can pull in reference entities using some sort of vector search. But then how do you vectorize a node on a knowledge graph? Surely there are graph-based methods, graph embedding methods we could have played with, but we were coming at this from the standpoint of NLP and large language models, so what we knew was how to embed natural language. And the embedding models, those vectorization models that Jon talked about earlier, are tuned to natural language; they're not tuned to structured entities or nodes on a graph. So we just use simple templates to transform one into the other. We say: the drug's ID is 1337, its name is so-and-so, its brand name is Mavyret, and so on. We take all of that, create a vector representation for it, and associate it with that node, so that we can very quickly retrieve the nodes that are most relevant to an input question. We used Chroma for this as our storage; there are a lot of options out there. We aren't, by and large, using Chroma anymore for most of our RAG work (we're doing a lot more with Elasticsearch and other things), but for lighter-weight, simple, vector-based applications, Chroma is pretty good. Okay, so that's the solution.
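As a concrete illustration of the flow just described (template the entities, index them in a vector store, generate a Cypher statement grounded in the retrieved references and the schema, execute it, summarize the result), here is a minimal sketch. It is not the Ask ARCH implementation: the schema string, templates, prompts, model name, and connection details are illustrative assumptions.

```python
# Sketch of the vector-search + Cypher-generation + summarization loop described
# in the talk. Not production code: the entity template, schema, prompts, and
# connection details below are illustrative assumptions.
import chromadb
from neo4j import GraphDatabase
from openai import OpenAI

llm = OpenAI()                                    # assumes OPENAI_API_KEY is set
chroma = chromadb.Client()
entities = chroma.create_collection("kg_entities")
graph = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Toy schema string; the real graph schema would be far richer.
GRAPH_SCHEMA = "(:Drug {id, name, brand_name})-[:IS_TREATMENT_FOR]->(:Indication {name})"

def index_entities(nodes: list[dict]) -> None:
    """Turn each node into a templated sentence and store its embedding."""
    docs = [f"The drug's ID is {n['id']}, its name is {n['name']}, "
            f"its brand name is {n.get('brand_name', 'unknown')}." for n in nodes]
    entities.add(documents=docs, ids=[str(n["id"]) for n in nodes])

def ask(question: str) -> str:
    # 1. Retrieve the reference entities most similar to the question.
    refs = entities.query(query_texts=[question], n_results=5)["documents"][0]

    # 2. LLM call: generate a self-contained Cypher statement grounded in the
    #    schema and the retrieved reference entities.
    cypher = llm.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
                   f"Graph schema:\n{GRAPH_SCHEMA}\n\nReference entities:\n"
                   + "\n".join(refs)
                   + f"\n\nWrite one Cypher query that answers: {question}\n"
                     "Return only the Cypher statement."}],
    ).choices[0].message.content.strip()

    # 3. Execute the generated Cypher against the graph.
    with graph.session() as session:
        rows = [r.data() for r in session.run(cypher)]

    # 4. Final LLM call: summarize the raw graph output as a natural-language answer.
    return llm.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
                   f"Question: {question}\nGraph results: {rows}\n"
                   "Answer the question in plain English, using only these results."}],
    ).choices[0].message.content
```

The real system also folds in synonym expansion (the entity and query conversion step mentioned above), which this sketch leaves out.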
Let's talk now about the lessons that we learned and some of the design choices we made, and how they were informed by those lessons as we iterated on this and experimented with it. At a high level, the first thing we learned is that accurate and personalized responses are achieved by using a fair amount of prompt engineering. The second is that vector search and query generation approaches show the best performance when they are jointly, or sequentially, applied. And the third is that the structure of the graph matters: having a data model in your graph that optimizes for natural language responses will enhance the semantic search at the end. So let's get into it. Here is a look into our prompt engineering for this. We want to generate a Cypher statement with all of this, and what we do is use these cute prompting methods; you might take them with a grain of salt, but there's a lot of literature out there, some of it conflicting, about the best ways to prompt different models for different tasks, things like emotional prompting, which is adding "take a deep breath and work step by step, show your work", which tends to work better. Then we pull in that subgraph, the vector retrieval result, which shows exactly what the different properties of the most relevant nodes are. Then we say "what I want you to do is follow these rules", giving it some guidance on how to optimally generate a Cypher query, and finally we give the user question, what's the brand name of such-and-such, like a parameter, and have it use all of that to generate a self-contained Cypher query. So here we've got the emotional prompting, the relevant context, and very clear step-by-step instructions; that's always important for LLMs, you have to hold their hand through everything, and then you get a good result. Now let's get into why we bother with the vector stuff at all. There's the idea that you could take everything from your database, vectorize it, store it in the vector database, and do RAG that way; or you could just do database-native queries. Why combine them? We started the project thinking we would do one or the other; we wanted to evaluate one versus the other and see which did better. Here is an example; this is a subset of our evaluation set, and we didn't include everything, just for readability. You can see we grouped our questions into a number of different types: extracting single properties, extracting properties with relations, extracting multiple relations, including complex multi-hop situations. So you can ask things like "give me the names of other drugs that are also approved for the same indication as a given drug", and we want to see whether the system can generate a Cypher query that matches that and pull in the data so that the LLM can then answer it, which is a question you wouldn't normally get answered from a standard RAG system. Actually, in that case the vector search was able to do it and the database-native approach wasn't. We saw this kind of interesting, mostly non-overlapping Venn diagram, where both systems were failing about half the time, but often failing on different examples. So we thought, why don't we combine the two in the way that we did and see if they can lift each other up? And that's exactly what happened.
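Stepping back from the talk for a moment: the prompt structure described above (step-by-step framing, retrieved reference entities, explicit rules, then the user question) can be sketched roughly as follows. The wording and rule set here are illustrative assumptions, not the actual AbbVie prompt.

```python
# Illustrative assembly of a Cypher-generation prompt along the lines described
# above. The exact rules and phrasing are assumptions made for this sketch.
def build_cypher_prompt(schema: str, reference_entities: list[str], question: str) -> str:
    rules = (
        "Rules:\n"
        "1. Use only node labels, relationship types and properties present in the schema.\n"
        "2. Match names case-insensitively and consider synonym properties.\n"
        "3. Return a single, self-contained Cypher statement and nothing else.\n"
    )
    return (
        "You are an expert Neo4j engineer. Take a deep breath and work step by step.\n\n"
        f"Graph schema:\n{schema}\n\n"
        "Reference entities retrieved for this question (use their exact property names):\n"
        + "\n".join(f"- {e}" for e in reference_entities)
        + f"\n\n{rules}\nUser question: {question}\nCypher:"
    )
```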
So by pulling in the vector search results as reference examples for Cypher generation, and then running the database-native query on the output of that, this hybrid approach did much, much better. On our evaluation set we had started out at around 60 to 70% accuracy, and by the end we were getting to 80 to 90%, which is really good for a RAG system. One issue that we ran into, though, was that our evaluation set was kind of tuned to what we knew was answerable from the knowledge graph. One of the practical things you might run into if you build a system like this is that you have to ask yourself: who are the users of this system, who is it for, and are those users going to know what's in scope and what's out? As I said at the beginning, our graph is actually a subset of our data lake, and so if we just let scientists who have some vague notion of what's in ARCH loose on this system, they're going to start asking questions which might be in the data lake somewhere but don't have mapped answers in the graph, and so they're not answerable from the data. Having users understand what a RAG system is capable of answering is actually not trivial; we learned that lesson too, and that's one of the reasons why we're rethinking who the user base actually is for this. But on the evaluation set, which assumes that we know what's answerable and what's not, it's doing really, really well; I'll put it that way. The places where it still fails sometimes are the places where we have complexities in the data model, and places where the data model of the knowledge graph doesn't line up very well with our intuitions about how natural language sentences are formed, which is really interesting. So it's going to do well on something like "which diseases are treated using Humira?". That's going to do great, because it knows that there's an "is treatment for" relationship between a drug node and an indication, that the drug here is adalimumab, and that adalimumab has Humira as one of its brand names, so it's able to write a Cypher query that pulls that in and then answer the question, and we get a good response. One area where we do have some issues is something like "which AbbVie drug is being used to treat hepatitis C?", so specifically zeroing in on AbbVie drugs. What's interesting there is that it will try to pull in drugs that treat hep C that have some company property where the company that owns the drug is AbbVie. The problem is that in our graph, drug nodes don't have such a property; it's not specified there. Rather, there's a separate entity type, which is a product. There's a drug entity, which is a chemical or biologic thing, and then there's a product, which has a relationship to the drug but is a different thing, and the product has attributes like company; the drug itself doesn't. So what the Cypher generation with the LLM will try to do is match drug nodes that have a company property, and it fails because that property doesn't exist, so it just returns an empty result. So we see a case where the complexity of the data model makes a distinction between drugs and drug products, whereas in common language I think we mostly don't: we just talk about drugs being owned by a company and having certain chemical properties.
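To make the drug-versus-product mismatch concrete, here is an illustrative pair of Cypher patterns, wrapped in Python strings for consistency with the other sketches. The labels, relationship types, and property names are hypothetical stand-ins, not the actual ARCH Graph schema.

```python
# Hypothetical illustration of the data-model mismatch described above. The naive
# query assumes a `company` property on Drug nodes, which does not exist in this
# toy model, so it returns nothing; the schema-aware query hops through a separate
# Product entity, which is where the company attribute actually lives.
naive_cypher = """
MATCH (d:Drug)-[:IS_TREATMENT_FOR]->(i:Indication {name: 'Hepatitis C'})
WHERE d.company = 'AbbVie'   // no such property on Drug nodes -> empty result
RETURN d.name
"""

schema_aware_cypher = """
MATCH (p:Product)-[:CONTAINS]->(d:Drug)-[:IS_TREATMENT_FOR]->(i:Indication {name: 'Hepatitis C'})
WHERE p.company = 'AbbVie'   // the company attribute sits on the Product node
RETURN d.name, p.brand_name
"""
```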
So that mismatch between how we structure sentences of natural language and how we structure the data model of the knowledge graph is, I think, really important, and the bigger that gap, the more performance and quality issues I think we'll see. I'll just end on response times, because it comes up a lot; people are curious about the response times of these things, because we are making multiple calls to GPT-4 in this system, so how long does it take to answer a question? In some complex cases it can take quite a while; there's one case where it took a whole minute. Usually it's taking about half that, like 20 to 30 seconds, to completely formulate an answer to the question, and I'll leave it up to you whether you think that's a good turnaround time. I think for these more complex RAG systems over large-scale databases, especially if you're doing a more agent-based approach where you're doing two, three, or maybe even four iterations of calling the LLM, getting a response, calling it again, and getting another response, then depending on the models you use, it's not going to be instant; it might take a little while for the answer to be formulated, and users have to decide if that's something they want to tolerate, and if the quality level matches the convenience level there. But as one of the speakers alluded to before, the technology is always getting better, so these are concerns that I think will eventually go away as everything gets more accurate and faster. So that's our experiment. I'll just end by saying this is part of a larger enterprise at AbbVie to bring LLMs to bear on as many projects as possible where we see them being useful; we find out what works and what doesn't. We have Ask PubMed currently in production. We don't have the knowledge graph version in production yet, because we are still figuring out some of these nuances, and learning about those challenges is really informative and also really fun, actually, putting on my scientist hat and doing experiments like this. -So thank you all. -Well, thank you, Jon, again for a great talk. Certainly impressive, the amount of data you've got in your knowledge graph. I just want to remind people: please post any questions you've got for any of the speakers. We're about to start the panel session now, where we will cover those questions and have a bit of a discussion between Dan, the CEO of Biorelate, and the three panelists. So if I can ask everybody to turn on their cameras now, and I will hand over to -you, John. Um, sorry, Dan. -Thanks, Mark, and thanks to our speakers. Fantastic talks as usual, really interesting. We've got tons of audience questions, but I just want to kick off with a very generic question to get the conversation going between our panelists. It seems like, as Jon Stevens pointed out, there was a nice segue between the three talks.
I want to try and ask a very generic question to bring them together a little bit: how have knowledge graphs impacted target ID and validation in drug discovery in general? Who would like to take a stab at that question to begin with? Then I'll dig into some of the audience questions afterwards. It's a very generic question, and I'm going to dig a bit deeper afterwards. -Anyone want to go first on that one? -I can go first, Daniel. So at AstraZeneca we have been working with the knowledge graph for some time, and the BIKG product that we have has influenced a number of decisions in our drug development pipeline, as I showed today. There's an aspect of understanding drug resistance, but BIKG has been leveraged in so many other ways that, on their own, really serve as a springboard for the activities that we're doing right now. And very similarly to what Jon Stevens presented, at AstraZeneca we are embarking on a similar adventure of combining large language models with knowledge graphs for a similar purpose, so that was great to hear from you as well, Jon. So, yes, this is a very generic answer to a very generic question, but there's definitely interest, also continual interest from the senior leadership team, and we have some plans to continue this going forward for -sure. -Yeah. And Jon and John, I mean, how fundamental are knowledge graphs to your work? I guess that's another way to ask the question. Where you are doing target ID and target validation-like approaches, if you were to take away knowledge graphs and that technology from the process, and say you were just using LLMs with other approaches to try and answer these questions, what impact would that have on your work? How fundamental are knowledge graphs to the types of answers you're seeking and the questions you're trying to get answers to? So for me, I guess I would say it's more about bridging data types. There are certainly NTC, sorry, new therapeutic concept, discovery efforts that can rely on and be driven by particular data types, and I think in those cases the impact of the knowledge graphs is maybe less critical than when you're starting to either build on your historical information or combine things that are not normally combined. And I think for the kind of natural language interface over data that would be relevant for target identification, the knowledge graph, and to be clear we're not there yet, but in getting there, something like a knowledge graph is really important because of the data mapping, and just in general. I mean, the alternative to what we were trying to do would be to try to run the query over the sum total of all the data that we have in that area. So we could run over the data lake that we have, which has a lot more data, and which would solve the problem of not getting all of the answers we need because the questions just aren't in there. But the flip side is that it's really messy and enormous, not all of it's mapped, and it would involve running queries over a great many tables and then having the LLM try to make sense of mismatches in terminology, etc.
So I think it would be a lot messier to do that than to have the knowledge graph provide a form of semantics that sits between the raw data and the LLM; for those particular kinds of use cases, -I think it will be very important. -Yeah, thanks, Jon. And that actually leads into one of my other questions, which is: I saw the FAIR acronym on your slides, Miguel; findable, accessible, interoperable, reusable. There you go, I passed the test. What role do we think FAIR is playing here in using LLMs, and just how important is it? And, Jon, sorry, one of the things that struck me in your talk was that you provided a very good workflow for how you would answer specific questions, but what was lacking from that workflow was how you map those data points to recognizable IDs and to existing data sets. So if you were looking to do that at scale, it would actually be quite problematic, but on a question-by-question basis that approach works quite nicely. And Jon Stevens, I think in your talk you gave a much fuller account of why FAIR is useful and why those mappings are important, particularly when you're running database queries. So I wonder if any of you would be keen to speak to the role that FAIR plays in using knowledge graphs, particularly in conjunction with LLMs. I mean, one thing I'd start off with is that I think you're right that for that particular workflow, FAIR factored little into the processing. But I think where LLMs can have maybe a more important impact on FAIR and FAIRification is that FAIR is very costly: if you have non-FAIR data and you're attempting to implement it, there's a huge investment in curation and in mapping to ontologies and things like this. And I've found in my experience with LLMs so far that I don't necessarily trust them to do automated mapping, but I think they can significantly contribute to the tools that curators have at their disposal. So I've got some examples I've tried out of moving curation from sort of an expert curator exercise, "here, we've got this and we need to match it to something else", to more of a suggestion-based system, so the LLM can take the first pass and say, well, this sure looks like this UniProt ID, and then it becomes a matter for the curator to do more of a sign-off rather than getting as entangled in the mapping -exercise. -And I'd just like to point out that those approaches are not new. Back ten years ago, when I was doing my PhD, semi-automated curation was really the only way to use NLP; you just generally wouldn't rely fully on those kinds of results, I think. Companies like Biorelate have focused really heavily on trying to capture that data automatically using fine-tuned LLMs, having had the benefit of all the manually curated data we've generated over the years. But certainly if you're taking this on from scratch, I think you're right, that would be a very difficult challenge indeed, and actually using it to support curation is a very good idea. Jon Stevens, Miguel, anything else to add there before we move on to the next questions?
-Just to echo that: in terms of suggesting terms for ontologies, this is something where we always keep in mind what we expect LLMs to be good at, and we expect them to be good at language tasks. They know language. So anytime you're thinking, can I get suggestions for synonyms for certain terms, or other key terms and phrases or variants thereof, that's a language task, and that's exactly the kind of thing we should be doing or exploring with LLMs, in my opinion, because it should be the bread and butter of any decent language model that it can do these things. Perhaps until you get into such a specialized domain that the synonyms are not obvious, and then we could think about fine-tuning and domain-specific -models, all that stuff. -Thanks. Okay, so I'll take an audience question on this one, for Jon Hill. The question was: this is great, but most of proteomics and genomics is dark, so how does this approach bring dark matter into play in your disease analysis and target searches? How do you get an accurate picture of the disease if you -cannot? -Yeah. So I think this is a persistent question and headache, which is just the enormous bias that we get in the literature. If you look at, I don't know, your TNF or your interleukins or whatever, you're going to get many, many publications, and then it sort of trails off. I think that's going to be a limitation of just about any literature-based approach. And I think where you can use some of the LLMs, it depends a bit on your burden of proof. If you're relying on the LLMs to enable processing of the literature as-is, to surface information, then you're not going to cover the dark matter stuff; this is really going to be an expedition of discovering the well-annotated things. Where you can start to play with it a little bit more is that some of these models can be a little more flexible for things like hypothesis generation, where, for example, you may have a poorly annotated gene with well-annotated gene family members or something like this, so you can start to navigate those kinds of questions. I think the other place could be, if you are using things like LLMs to interact with or structure knowledge graphs, you can do some logical inference on knowledge graphs for these dark matter things, and that can be helpful. But I just think it's really important to manage expectations here: it's not going to generate new data for you. Well, I guess it will, but it'll be hallucinations at that point. So it's not going to generate new data; it's at most going to help surface things that are difficult to find through conventional search, or to suggest hypotheses. If you're really interested in exploring the dark matter stuff, that's where you need to take less literature-based and more genome-wide, target-agnostic approaches; that's where you go back into your medium-throughput screening kinds of approaches, or other means of target validation that don't rely on previously published -results. -Yeah, fair enough. So, Jon Stevens, I wonder, and I was pondering this, I guess: Jon Hill, in your talk, effectively all you can possibly do with those RAG-like approaches is surface what's out there already.
Right, and you're just trying to do that as well as possible; essentially it's a good search engine, that's really what it is, giving you direct answers to those kinds of questions. What it won't be able to do is surface the insights that you get on a multi-hop basis, which is what you can get from, say, a knowledge graph where you've connected all those data points up together and can then infer insights across multiple steps. For example, targets and diseases, the strength of those linkages, and how those might be linked upstream or downstream of each other: that would be a good example of the kind of insight you could get where you potentially combine a large language model alongside that type of database to answer those types of questions. And Jon Stevens, I wonder, practically speaking, how often do you use the knowledge graphs at AbbVie to try to surface those types of insights which aren't directly out there in the literature or in the datasets already? Are you looking for those types of things commonly, or, generally -speaking, is that not the norm? -Yeah, I think that is kind of the holy grail. I don't know, in practice, how often these complex multi-hop relationships are discovered and then lead to something real happening, but I know there are at least some isolated examples. And I think that everybody has access to the same literature, right? So everybody has the same low-hanging fruit. If you have a certain question about whether there's a link between a certain target and a certain safety profile in a certain population, we can all go to PubMed, we can see what's been written, we can all get the obvious stuff. So where we are always kind of grasping is at the edges, and the edges, no pun intended, are at the margins. Like, isn't there some hidden relationship out there in the data, and it just takes one person to connect the dots and see, wait a minute, maybe there's something here. And so the idea is that we would increase the rate of those epiphanies from very low to less low. That doesn't mean they're happening every day, but if you can make those distant connections between entities, it could lead to something that nobody else would have discovered before. -Yeah. -And this must be fundamental in the work that you're doing around biomarker discovery, Miguel, because effectively that is the point of having a knowledge graph: to use it to come up with insights that -nobody else has published directly. Right. -Yeah, exactly. So, coming back a little bit to the previous question with regards to data standardization: I think that's very important also from a purely knowledge graph perspective, because knowledge graphs are noisy by nature. If we standardize the data, I think we naturally make them less noisy, and so we make these multi-hop connections between the different nodes a little bit more efficient and potentially less prone to error. And so, kind of moving on to more of the strategic angles here:
could any of you comment on what you think the competitive advantages are that having a really well-structured knowledge graph brings to a data science team and to an organization? It's kind of similar to my first question, but, Jon Hill, you made the important point that investing in a well-structured data set that's FAIR, that's mapped out, that has all of this really well-organized data: once you've done that, do you think it's really worth the investment in terms of the competitive advantages it brings? Miguel and Jon, you'll be just as experienced in this first-hand, but I wonder if any of you could give any insights into what general competitive advantages that might bring. I mean, one initial comment on that: I think that for any long-standing organization, you have to figure there's a lot of intrinsic knowledge in the different expertise in the organization. Over the course of, I mean, my company's been around for over a century, right? So the things those scientists have discovered shift over the years, and I think maintaining a base of operational knowledge that's accessible to all researchers, regardless of their position in the organization, matters, because people both leave the organization and move into different roles. Having that as a persistent store of usable organizational memory, even outside the cases of "oh, we've got all this together now, we can reason over it in a sophisticated way": just establishing the data standards in a way that makes your own data reusable is an enormous benefit, and a necessary step for an organization to achieve a competitive advantage. Because the organization has done experiments with the idea that those experiments lead to competitive insights above what others are doing, and if we don't make maximum utility of those and reuse them effectively after the initial projects have died, then, whether or not we get a unique advantage from the knowledge base itself, we at least preserve the competitive advantage that we've gained from the experimental work that's been done. So I think that's the bare minimum. And then, of course, there are all these pieces on top, which is that, for the most part, our organization, and even the largest organizations, don't advance their science strictly from their internal work, but from the public work as well. And so the better you can focus your internal work to complement and fill the deficiencies in the public data sets, the better. You have to; we know nobody's doing disease research from scratch. So it's really important that you have that knowledge base as the starting point, so that you can picture: well, if I make this investment in experimental work, that's going to give me the best edge. And you can only do that if you've got a way to synthesize your internal work with the public work, and also to interrogate it in a way where you can identify those deficiencies and opportunities. -Yes. -If I may interject, I think those are excellent points, John. And by doing that, we make the data available to the people in the company who are not only dealing with the data directly, i.e.
the data scientists or bioinformaticians or translational medicine leads, but really to anyone who might want some insight from the data and then doesn't have to resort to those people to get it. And that's one point, which relates to the biological insights. But knowledge graphs are not necessarily just for biological insights, even if you are in a pharmaceutical company; we can also create competitive intelligence knowledge graphs. That's one of the things that we're doing, to really understand the competitive landscape and to make those decisions a little more streamlined. Thanks, Miguel. So let's take an audience question, for Jon Stevens this time. Somebody asked: do you think the hand-holding LLMs need will always have to be so demanding, or will this improve over time? That's such a good question. I think that, generally speaking, everything about LLMs will improve over time, but some things will improve faster and some things will improve slower, and I wonder whether the hand-holding aspect of using LLMs might actually be one of the slower things to advance. The reason is just how they work intrinsically. The way large language models, at least in their current form with the current architecture, work is that it's all next-token prediction. So I am predicting, let's say, a sequence of words from the words that came before; now I have a larger sequence, now I'm predicting the next sequence that comes after that, so I'm sort of predicting word by word by word. And there's no such thing, from the perspective of a large language model, as an internal thought. As human beings, when we speak, we also speak in our own minds and leave certain things unsaid, but we're sort of saying them to ourselves. This is what LLMs don't have. So if I'm working on a problem, I can get the question and give the answer verbally, but inside there was all this other token prediction happening, internal to my brain. And until there's some analog to that, which maybe there will be, I mean, maybe the foundations will change over time as well, not just the processing power, but until that day comes, you're probably still going to want to have the LLMs be very explicit and go step by step, because as they spell out every step, that's informing what they write for the next step, and so it's always going to boost accuracy. Yeah, and a sort of side thought to this is that the hand-holding can take place at different parts of the implementation. Some of these prompt engineering tricks may be hidden from the end user, or reused; there's a lot of setup for how the user interface interacts with the model, where the user might type a simple question, but behind the scenes there's a lot that, you know, an expert might have -implemented. -Yeah, that's a fantastic point. So you're getting into the world of agents, right?
These LLM agents are things where, from the perspective of the user, they ask a question and get a simple answer, but actually there was a whole intermediate stage of writing out all the intermediate steps to get there and iterating on that; those intermediate outputs just don't get displayed to the user. And I think that's definitely trying to mimic this idea of internal reasoning. But the problem I have with it, just from a user experience perspective, is the processing time, because you are requiring sequential steps to spell out those steps of reasoning and iterate before you actually start writing the final output to the user. So in practice, with a lot of these agent-based systems, until maybe the models just get so fast that this is all a moot point, I don't know, but for now, if you're using a system like GPT-4, you might be sitting around waiting quite a while for your answer. So if it's an interactive system, like a chat-based system, that's where agents, I think, aren't quite there yet. Now, for offline things, like automation processes that run overnight and produce summaries, there are probably already a lot of interesting applications for agents. And I guess the other thing with that is that the context specificity, to me, feels like it's going in crazy waves, where we had these super-specific systems, and then the general systems took hold. And what can sometimes seem like hand-holding, where you've got this super-generic GPT-4 model that's capable of telling you baseball statistics one day or weather patterns the next, and you're applying it to drug discovery, is basically making this stuff domain-specific again, so that the questions you're asking are constrained in a way that makes sense for the space you're operating in. So I also think some of that's going to get built back in, in a way where it feels less like hand-holding again, even if it's just these constraints being imposed behind the scenes. -Yeah, thanks, both Jons. I'm going to come back to more LLM chat in a second. I haven't got much of a segue for this one; it's a question for Miguel, and we're going over to a biological question this time. Somebody is interested in your use case one, which is albumin as a biomarker indicating response to anti-PD-1 therapy, and their question is: were you able to go one step further and use your knowledge graph to explore the biological rationale for the relationship between PD-1, albumin, and survival in oncology treatments? So, yeah, I think we're now getting into the realms of what I can and cannot disclose, but for sure, the system that we developed, PresNet, allows us to identify those relationships, and also, crucially, allows us to map back onto the graph and understand what the connections are, what the path is. I don't have the exact answer off the top of my head, but that's definitely one of the added uses of our strategy. -Okay. -Thanks, Miguel. Sorry for putting you on -the spot, I didn't realize. -Sorry.
But a general question I think it points to, which is super interesting to navigate, is this: as you get into multi-hop predictions and things like this that you may not be able to back with a single reference, what is the relationship between your predictions and the experimentalists and the experimental plan? Because this is something that I think is always a little bit hard to navigate on the data science side: when a prediction is so good that it's like, oh, that's the result, versus when the prediction is at the point of, can I convince somebody to test it; I can't prove it based on the data I've got, but it's plausible enough that somebody should invest the resource to show it. Exactly, exactly. And that's exactly what we're trying to do: we're trying to make it so appealing that we would like it to be tested. We are by no means saying this is the holy grail, the true, definitive answer, you don't need to do anything else. No. These things need to be tested properly in the wet lab, and that's why we have a very good partnership with Bioscience in Oncology R&D that does these things. But what we're doing is trying first to come up with answers that are already known, to build trust, and then, on top of that, to discover new biomarkers, new patient populations that were not thought of or for which we didn't have any insights. And then when we come up with something that's statistically significant and makes biological sense, that's when we try to test it. But presumably the real difficulty with this challenge is that every benchmark you're benchmarking against is very, very specific to some biological problem, right? So you can probably generate some form of a hit rate for most things retrospectively: if you've got some phenotypic screening data, or a data set you've done some previous experiments on, you line your predictions up against those and calculate how good they are. And then there would be a tolerance that you'll generally accept, at which point you'll start experimenting with those, knowing that your hit rate is above, you know, 50% or something like that. But, going back to my initial statement, that would be very specific to each biological challenge. Would you generally agree with that, Miguel? Jon, John, is that an accurate statement, or can you generalize those types of benchmarks to be used ubiquitously across knowledge -graphs and hypothesis generation? -So, a couple of very quick points. The first one is that it is generally, if not always, better to have a dedicated data set or knowledge graph, or a sub-knowledge-graph from a larger knowledge graph, that is relevant to your question, and this is actually an inbuilt function of BIKG at AstraZeneca. The second point is that we try to take an unbiased approach, in that, say, in the top 50 hits, it's great if we see something that is already known, like, I don't know, BRCA1 for breast cancer, but also something that is equally significant but not as prevalent in the literature. So that's how we're approaching this problem. Yeah, so you're trying to separate out the stuff that you know about already.
But presumably, if they're all adopting the same metric of prediction success, then you remove the ones that you know about, right? And then you're left with a selection of candidates which are perhaps rare; maybe you've even got them from a multi-hop-style prediction, as Jon Hill was originally suggesting you might. But I guess my point is, once you've got to those predictions, you're only ever going to trust them if you've got a very specific benchmark for that particular biological hypothesis you're testing, one you can measure directly against; you can't generalize that. That would be my thought process, and I'm wondering if you could push back -on that idea in general. -Yes. So it all comes back to: does it make biological sense, even if we didn't think about it before? That's how we can convey the message and convince people who are not data scientists, who work more closely with the biology, to -really give it a second look. -So I agree, the explainability part is super important. This is a side note, but I also feel like explainability is a weird area to navigate for some of these processes, because sometimes, and I don't know others' experience, if you end up with something that's super explainable, it feels obvious in retrospect. So you go to your colleagues and they're like, oh, of course, yeah, why that, never mind how hard it was to discover that thing. That feels like a constant peril of natural language processing and knowledge graphs. But I do tend to agree that the idea of generalization is tricky, and maybe you can approach it so that it doesn't have to be your specific concept, but it can be maybe one layer out. But I think, yeah, overall, generalization in the sense of "this kind of inference always works with this percentage" seems very challenging. Yeah, and a nod of agreement from Jon Stevens as well. Okay, I'm going to move on to a different question here then. This one's more around data quality, but I want to try and ask it in a slightly different way. We've talked about RAG approaches quite a bit today, and I'm wondering how well LLMs can discriminate good-quality data from bad-quality data once it's been served up in a RAG-like approach. So let's say you serve up a bunch of data, from a knowledge graph or just generally by pulling 50 articles together or something like that, and then you feed this to an LLM. Jon Hill, you went through all of the challenges of reducing hallucinations and everything else at the beginning, right? So the fundamental question we're asking here is: do we trust the LLM to distinguish good-quality data from bad-quality data once we've served it up in this RAG-like approach? I would say, out of the box, with the very standard approaches, no. If you're just going into PubMed and fishing out the most relevant stuff, probably not. I think that, out of the box, the one thing RAG, well, RAG plus LLMs, tends to do better than some of the previous approaches I tried is that it does a better job of summarizing conflicting information. I have had experiences with RAG where it'll take ten articles, I'm looking for an association between a gene and a disease, and it'll say, well, these are the ones that support it, and these could find no link.
That, to me, is a little bit superior to some of the previous approaches I tried with plain search, where that summarization, getting a sense of conflicting information, was not so readily surfaced. My sense is, if you want to do a better job on data quality, you can take some of the conventional tricks: these are resources I trust more than others. I'm spitballing here, but a scheme I could imagine is: these are my most trustworthy journals or resources, and I'm going to tier them when I use the RAG. So if I only have so much space for context, I'm going to pull from my high-quality resources first and from my low-quality resources second, or I'm going to specifically flag the publications; something you can do with RAG is pass in meta-information about the articles along with the text. That can get you a little bit of the way there, though you're still making a value judgment about the journals or the data sources. The second thing, which I think is another approach to your question and feels to me more challenging, is commenting on the experimental quality itself. How do I encapsulate what, say, a trained biologist would do in saying, oh, this is showing a great result, but it was based on a very -small sample set? -How do I capture that knowledge about the data quality? To me, that feels like you would have to make those assessments in advance and almost have them as a separate line of questioning. You would have to capture knowledge like: how big is the study, how long was it run, how diverse was the patient population sampled; all those value judgments that a biologist would make in weighing the data, I think you would have to make explicit somehow if you really wanted them captured. I wouldn't trust just going to the model and saying, rank these experiments by how much you think the quality is there. But I don't know; those are initial thoughts, and I don't know what the others -think. -Jon Stevens? Yeah, I mean, generally speaking, I agree; I don't have much to say about the quality of the experimental data itself. But when it comes to the quality of how the data is presented, the cleanliness of the data, the FAIRness of the data, I think one thing we talked about, and Jon, I think you alluded to this earlier, is the domain-specificity component of all this. You're in a sense having to teach a general-purpose language model a specific domain, and I think when you have a good data model, what you're effectively doing is shrinking the size of that domain, making it so there's less that needs to be taught to the language model, if that makes sense. That's kind of the way I think about it. LLMs are inherently pretty good: they understand language, they're pretty good at reacting to the fuzziness of language, understanding synonyms, and, as you said, reacting to the fact that sometimes there are contradictory claims.
But when we get into the more specialized domains, where maybe the terminology is less clear, if you've got a good data model or a good ontology that you can provide to the language model, basically to give it a domain, a kind of world to live in when it answers the question, then there are just fewer things that could confuse it, and it should lower hallucination. The other comment that came to mind here is that, outliers aside, if you're making a value judgment about data quality, I think there's a significant amount of subjectivity that still exists within the human, organizational domain. There are some biologists I can talk to who will trust a paper because they know the lab it came from; they'll say, this was a former collaborator, I know they do high-quality work. There are other people who go by journal impact factor, or who specifically don't trust certain groups. Capturing that kind of nuance is, I think, a separate discussion from how the system processes the data. But the cool thing about these LLMs, I've got to say as a side note, is that having to think about human biases in these things when you're encoding your questions, and finding a disconnect with reality, is a super interesting exercise. You see this a lot in how people approach these value judgments around the quality of a journal and how they approach their data. Implicit bias, I think, is another really cool one: many humans, all humans in some form, are biased, but they rarely have to state their biases. And all of a sudden, if you find a disconnect between what they're seeing in the literature and what they expect, and you make them articulate it, then you can start to investigate that bias and find out whether it's helpful for the way that you're processing the scientific literature or -not. -Indeed. So, Jon, I'm going to keep you talking; I've got a question for you from the audience. They're asking, and I'll just read this one out: by the way, what database are you using for -storing the PubMed data? -Oh, so for this and a lot of small-scale projects, I also use ChromaDB, which I think Jon Stevens pointed out too. To me, it just has the advantage of being super easy to set up. I think the more you're doing with centralized resources and things like that, you'll want to investigate other options, but if you're an end user, honestly, ChromaDB is super easy to set up; it's basically a little SQLite-style thing, you build it, you destroy it when you're done. Very few headaches from a developer -standpoint. -Cool. And this is a question directed at all the speakers: do you think information is sufficiently vetted at large pharma organizations as of now, or do you think more should be done to ensure the quality of the data that is being input into models? Yeah, the question doesn't quite spell out what "enough" means here, but I think the gist of it is: are we convinced that the data vetting practices at large pharma ensure the data we're using is of high enough quality? -I think that depends a little bit on the source of the data.
Cool. And this is a question directed at all the speakers: do you think information is sufficiently vetted at large pharma organizations as of now, or do you think more should be done to ensure the quality of the data going into models? The wording doesn't come through perfectly, but I think the gist of the question is: are we convinced that the data-vetting practices at large pharma ensure the data we're using is of high enough quality?

I think that depends a little bit on the source of the data. If we include internal data from our own companies, that should be the highest-quality data we have, because it's been generated by us or by someone we trust, and it's been used to inform clinical trials. The stakes are high there, so the quality of, and confidence in, those data have to be high. Then we have really well-established databases, like TCGA, which I mentioned in my talk. That's an extremely useful and, to be honest, fairly old database, but it's been normalized, it's been standardized, and it's been proven useful in many studies, hundreds of papers out there if not more. And then we have the other, more questionable data sources, the ones we have to curate a bit harder, and it's then up to us what standard we hold those to. For our knowledge graph we have over 60 data sources, and some take more work to implement than others. So I think it really depends on where you get your data from.

I tend to agree with this; the data sources matter a lot. To the point about high-quality internal data, I think something that is still challenging to navigate is not the data quality itself but the generalizability of the data. There's a lot of internal data that's generated for very specific purposes and has opportunities for reuse, but oftentimes, if it's dealt with in a generic way in one of these all-purpose knowledge stores, some of the nuance of the interpretation can be lost. It may be high-quality data, but it's high-quality data in a specific context. I think we could probably do a bit more to manage that and to figure out how we maintain that context sensitivity and appropriateness as the data is reused. We don't want to get rid of it or firewall it, because there are many other contexts where it could be useful, but sometimes you need to understand the restrictions placed on a data set by the experimental design.

We're talking about terabytes and even petabytes of data, aren't we, Jon Hill? There's an enormous amount of data that's siloed away that we don't really know how to use or generalize. Do you think there's an opportunity for a pharma company to train a native large language model on all of that data, a multimodal approach, and do you think that might eke out some of these generalizable insights we're seeking?

I think it's a great idea, and I think there are opportunities there. What I would really love to see above all is this sort of context-sensitive awareness of data restrictions. I would love it if, when I was reviewing historic studies, I could say: here's my application, what do I have to worry about for this application? It may be that a study is underpowered, but only for one group, and if I look at the other groups it's not a problem. The fairly generic metadata that may be associated with a study is good, but a lot of the time people can't review it all in detail for every application, simply because of the scale of it, as you're saying.
So I think that's something where using these different approaches to restrict, or to summarize, the most relevant characteristics of data sets, and to summarize them for the application to which they're being applied, could be enormously powerful.

Okay, let's move on and ask a different question. Let's go for a speaker-specific question; I'm going to go with one for Jon Stevens. Let me just pick one out, give me a second. Here we go. You talked, in your third lesson, about the challenge of people who don't know the graph asking the questions. Can the LLM be scaled up for wider use if it relies on knowing the graph? I think this is about the non-technical user, basically; I think that's the gist of the question.

Yeah. That particular question-answering system does rely on the knowledge graph, which is why, in effect, we are shrinking its scope: let's embed this in an application where the users already know roughly what they're querying, rather than presenting it as a system where you can ask any scientific question and get an answer. So then the question is, how would you scale it to the entire research user base? How would you make it so that almost every question was answerable? I think there it's a matter of querying different sources. What's nice, and this actually ties in to the data quality discussion we just had, is that the knowledge graph is the most vetted part of our data hub. All the data comes in, some of it company internal, some of it from vendors, some of it public, and there's a spectrum of quality there, and a spectrum of how vetted it is. When we vet things out and map them, you're getting the highest-quality data, but you're also getting a small subset of the data. So I think there's a world in which we could carefully step outward from that and encapsulate more of the data, so that we really could increase the scope of what we can provide in terms of scientific question answering. One important piece of that, though, which I don't think we currently have in place, is the idea of a conversational router. At a certain point, having one enormous source of data that you try to query all at once is going to be really noisy, and you'll actually do better if the question can be routed to a number of different sources on the back end. The question goes in, and the first step is to make a judgment about the appropriate place to send it: which source or sources are best placed to answer this question? So, a conversational router where I ask a question about something that maybe isn't well covered by our internal data, and the router says: you know what, that's not a question for our internal data, let's see if there's any public literature on it. Something like that. That's what I think we would have to do to really make a one-stop shop.
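A rough sketch of that conversational-router idea, assuming the router is just a classification step in front of several back ends, is shown below; the source names and the keyword-based classify() are illustrative stand-ins for what would in practice be an LLM routing call shown the same source descriptions.

```python
# Route a question to the back end most likely to answer it (illustrative).
SOURCES = {
    "knowledge_graph": "vetted, harmonized entities and relationships",
    "internal_documents": "study reports, SOPs, policy documents",
    "public_literature": "PubMed and other external text",
}

def classify(question: str) -> str:
    """Stand-in for an LLM routing call that picks one of SOURCES."""
    q = question.lower()
    if "gene" in q or "marker" in q or "target" in q:
        return "knowledge_graph"
    if "policy" in q or "procedure" in q:
        return "internal_documents"
    return "public_literature"

def route(question: str) -> str:
    source = classify(question)
    # Each back end would run its own retrieval (Cypher generation, vector
    # search, keyword search) and summarization; here we just report the choice.
    return f"Routing to {source} ({SOURCES[source]})"

print(route("What genetic markers are associated with acute myeloid leukemia?"))
```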
Is that where you're heading? I mean, is this the future of how we do R&D inside pharma? Are we essentially just setting up these Q&A systems, whether it's a multi-agent system or a sophisticated graph database that contains an enormous amount of knowledge with the odd API thrown in as well? Is this practically going to be the solution to answering questions? Is that where this is going, Jon?

Yeah, I think it's definitely where it's headed. There's a sense in which, if you use a lot of the commercially available systems like ChatGPT with plugins, you can activate multiple plugins and then the model itself has to decide which plugin to use. So there's this movement towards, in fact, just like the Apple announcement, was that yesterday, Apple Intelligence: the idea that we're realizing the dream of Siri by just saying, hey Siri, do this, and then Siri has to decide. Yes, there's an LLM there, but it's not an LLM that's directly executing the task; there's going to be some routing going on where the LLM is deciding which functions on the phone to call and which parameters to pass. If you take that kind of model and apply it to an R&D focus, you're looking at exactly what I said, which is: what are the sources of knowledge we can route this to? And it has to be headed that way, because that's where everything else is headed. I haven't seen a lot of this type of thing in the R&D space, and I've just set off my own Siri on my phone, I'm so sorry, I probably did that to other people too. I haven't seen it as much in R&D, but if you look at the enterprise side, things are already headed this way. AbbVie, and I'm sure the exploratory labs of lots of other companies, already have things in the works with personal assistants where, say, I need to ask a question about company policies and procedures, and depending on the question it might get routed to a different database of SOPs or policy documents or whatever. People already want to do this in the enterprise, so we just need to figure out how to transfer it over to R&D.

Yeah, I think that's a great insight. The other thing that's really nice about this model is that it lets each of the underlying systems evolve with its own data and its own expert contributors. Even though the end result may appear to be one system, it's not a monolith, and you can add new things as your data sources improve. So I think that's a great insight, Jon.

Excellent. Miguel, since yours was the only talk that didn't mention the acronym LLM: are you seeing this become an important tool in your work, particularly when you're training these more specific models around knowledge graphs to solve those challenges around patient selection? Are LLMs generally useful to you, or are they still a bit gimmicky?

Yeah. I mean, I use ChatGPT all the time.
But, jokes aside, we are in fact building a knowledge graph to work with an LLM on top, to facilitate the insights that right now only people like myself, data scientists or bioinformaticians, can derive, which is very similar to what Jon Stevens' talk was about. So I think that's very much the future, and it will hopefully free up a lot of time for people like myself to work on more interesting problems. And just a small comment on the discussion earlier: when we're talking about a non-expert user who wants to use an LLM, I think it will also be important to guide that person. If the question isn't precise enough, maybe the agent could prompt the user to clarify: okay, what sort of selection do you want, what genes do you want, what indication, that kind of thing, so that the answer is as accurate and appropriate as possible for the question.

My fear with that would be that as soon as you start asking more and more questions, almost having a conversation with an AI chat interface, you're just increasing the chances of error, because if any one of those steps along the way produces an error, the whole answer becomes erroneous, right? It's a house of cards you're building; one card topples and that's it. Would you agree with that, Jon?

I would say it depends. There are certainly cases where the clarification can be a big deal. One thing I've found is that LLMs do seem to handle synonyms better, as Jon Stevens mentioned, but some synonyms are inherently ambiguous if you ask the question without context. So those are cases where getting feedback from the LLM, along the lines of "you used this ambiguous synonym, I need to know what it means so I can do the search," may be a reasonable step in the approach. So I don't think it's always going to spiral out of control.
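A sketch of that single clarification step, assuming a simple check placed in front of the answering pipeline, might look like this; the ambiguity test and the symbol list are toy stand-ins for an LLM or dictionary lookup, not part of any system discussed here.

```python
# If the question contains an ambiguous term, ask one follow-up instead of
# answering immediately (illustrative stand-in only).
AMBIGUOUS_SYMBOLS = {"CAT", "SET", "IMPACT"}   # gene symbols that collide with common words

def find_ambiguous_term(question: str) -> str | None:
    for token in question.replace("?", "").replace(",", "").split():
        if token.upper() in AMBIGUOUS_SYMBOLS:
            return token
    return None

def handle(question: str) -> str:
    term = find_ambiguous_term(question)
    if term is not None:
        return f"Before I search: did you mean the gene symbol {term.upper()}, or something else?"
    return "(run retrieval and answer as usual)"

print(handle("What is known about cat in liver fibrosis?"))
```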
Okay. So I've got a question coming in here for all of you, an audience question for the speakers. Do you think LLMs, knowledge graphs, and RAG-based data models are influencing how scientists write reports? What changes, if any, do you predict? That's a nice question.

So I have, I guess, the immediate answer and then the science-fiction version, which I'm hoping will happen. The first is: yes, I do think people are starting to use this. I've seen some examples, as well as public examples from other companies, of using it for boilerplate reports, the same reports being filled out over and over. There are ways to expedite churning out that text; it still needs a human review, but the review winds up being pretty fast, and you can cut down your report-generation time with that. The science-fiction thing I kind of hope happens is that it will somehow change our relationship to data. Something I find kind of weird about science is that it is extremely narrative-driven. For all the data we generate, all the discrete experimental points, in the end the currency is still a published scientific paper or a high-level internal presentation. That's where the magic happens; that's when the result has been achieved. There's been some increased focus on the publishing side to say, okay, that's not complete, you need to have the data as well, but it's always article first, and the data is still a secondary consideration in support of the article. What I think would be ultimately fantastic is if we can get to a point of trusting the generation of reports and documents enough that the data can return to the position of ground truth: sufficiently described data, with the papers and presentations built on top of it. It's almost like how you can generate a figure in R or Excel or whatever makes a bar graph, but the ground truth is the data points you feed to that bar graph. I could imagine a future where, essentially, the scientific report is a generation over the statements of data that go into it.

But isn't the purpose of telling it as a story to make it communicate better to whoever you're trying to present that information to? So while it could be trustworthy in terms of being presented as ground truth, the point of presenting it as a story is so that somebody else can receive that information in the right way. Presumably you still need the narrative on top of that regardless, right?

Exactly. But what I think is that the narrative could essentially be derived: the narrative could basically be processing that happens on top of the data, rather than a careful series of human decisions. And the cool thing about something like this, if it ever happened, is that you already see examples with LLMs where you can take a scientific abstract and say, this is written for an expert in the field, now summarize it at a college-student level, now at a high-school level. So if you can get to places where the narrative is decoupled from the data itself, through some intermediate processing, it gives you an enormous amount of flexibility to say: I've got this great result, I want to present it in an academic venue, I want it in a presentation, I want to present it to my group leader. And if we can lower the bar for the amount of work involved, and don't get me wrong, when people generate graphs there's still a lot of fiddling around to get the margins just right, so I don't expect this to happen for a long time with humans entirely removed, but if at least the bare bones of going from data to a communicative narrative can be smoothed out, I think that would be fantastic progress. And the thing I love about it from the data science side is that we can get back to a place where the data is the ground truth and the evidence of the work, rather than what winds up being a little bit of cherry-picking and narrative smoothing on top.

A really visionary statement there, Jon.

Call me back in five years.
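An illustrative version of re-rendering the same underlying result at different reading levels, as described above, follows; call_llm() is a placeholder for whichever chat-completion client would actually be used, and the prompt wording is made up.

```python
# Re-render one data-grounded result at several reading levels (illustrative).
LEVELS = ["domain expert", "college student", "high school student"]

def call_llm(prompt: str) -> str:
    """Placeholder: a real implementation would call an LLM API here."""
    return f"[summary generated from a {len(prompt)}-character prompt]"

def summarize(result_description: str, level: str) -> str:
    prompt = (
        f"Summarize the following result for a {level}. "
        "Keep every claim tied to the data points provided.\n\n"
        f"{result_description}"
    )
    return call_llm(prompt)

summaries = {level: summarize("Data: response rate 42% (n=120) ...", level) for level in LEVELS}
for level, text in summaries.items():
    print(level, "->", text)
```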
I mean, I don't know, this is what's going through my mind right now. What we're working towards here is a vision where, effectively, we've got a bunch of scientists sitting in front of a chat interface, asking questions across these areas in order to generate hypotheses, which they then go off and test. And let's say we're at a point where automation and robotics are also advanced, so this stuff is actually just churning itself out. The interesting thing for me is that I haven't seen any evidence of large language models generating new knowledge yet. What's effectively happening at the moment is that we've got this knowledge space where large language models are acting like a compression algorithm, filling in the gaps between what we've already said out there in the published literature, and it comes across as new knowledge, but essentially it's just regurgitating what we already know, right? Now, the interesting thing is that when you combine that with knowledge graphs and start talking about multi-hop approaches and the like, we are in fact creating new knowledge, because we're finding hidden trends and insights between things we didn't necessarily know about. And if you then put that into a cycle of experimental testing, you are actually creating new knowledge. So at what point do you need a human in the loop in that process, if you're then perpetually able to create new knowledge? Presumably there's a limit to that too, because I can't see it fundamentally creating a new understanding of the universe. But we might at least start to prioritize the experiments we expect to work in order to produce the new knowledge that then helps us cure those diseases. Would you say that's an accurate representation of where we might be heading, or is that a bit far-fetched?

I think that can be true. But at least currently, for every accurate prediction or new piece of knowledge it churned out, it would probably churn out a lot more that would be erroneous, so we cannot just trust it like that, at least not yet. How things will evolve, the future will tell. It's an interesting vision you've got there, Daniel, but I think we're still a few years away from that being a possibility. I don't know what the others think.

Yeah. I mean, to be honest, the LLM layer on top feels to me like it's going to take a little bit longer. The idea of guided experiments by machine learning within narrow domains, I think we already have; there are examples where that seems to work out, in terms of where do you sample the space next to improve your model, or which predictions do you make that you want to validate. So I could see that coming more directly from conventional ML approaches, or from knowledge graphs; that could make sense. It's going to be super domain-constrained, though, because this grand notion of an autonomous agent creating new knowledge still runs into, frankly, budgetary constraints. You could write an LLM agent right now that would send requests to CROs to generate data for you, and you would probably run into problems with that really fast. So I think it's going to be, in this case, more like: I've got 500 data points I could generate, which ones would be the suggested ones to run? That's where I think we'll start to see the first wins, and then maybe more of a closed loop around that.
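A toy version of that "which of my 500 candidate experiments should I run next" question is sketched below: rank candidates by model uncertainty so the most informative ones are run first. The model, features, and outcomes are synthetic stand-ins, not real assay data or anyone's production pipeline.

```python
# Uncertainty-based selection of the next experiments to run (toy example).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_known = rng.normal(size=(100, 5))              # experiments already run
y_known = (X_known[:, 0] > 0).astype(int)        # their (synthetic) outcomes
X_candidates = rng.normal(size=(500, 5))         # 500 experiments we could run

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_known, y_known)
p = model.predict_proba(X_candidates)[:, 1]
uncertainty = 1.0 - np.abs(p - 0.5) * 2.0        # 1.0 where the model is least sure
next_batch = np.argsort(-uncertainty)[:10]       # the ten most informative candidates
print("Suggested experiments:", next_batch)
```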
And then I can imagine an intermediate step for an LLM would be going back to Jon's idea of a sort of LLM-based triage: I want to address these questions, what experiments should be designed, and where should those go? That I could see as not crazy far off, but the whole closed-loop thing across domains to generate data, I think we're quite a ways from that.

Yeah, thanks; I was being deliberately provocative there with my vision. What seems to be the case, though, is that we might not necessarily get to the place where the full end-to-end cycle of drug discovery is fully automated. We might do, but that would probably take decades. What's clear is that you can see, even today, the steps you can take to improve the process: reducing the error rate, improving the efficiency, pinpointing the right things to investigate and the wrong things to avoid. So we should start to see efficiency improvements, and therefore reductions in cost and improvements in the rate of discovering new medicines. That doesn't necessarily match up with what we're seeing in the economic landscape of drug discovery right now, though, and that's the interesting dichotomy, because AI is not that new, right? It has been around for a long time, whether it's large language models or other machine learning approaches; we've been employing all of this for a long time now, and we're seeing more and more statistics coming out about the cost of developing new molecular entities set against the success rates. So I wonder at what point we might reverse that trend and actually see AI start to make drug discovery a cost-effective process again. I wonder if you've got any closing thoughts on that before we end the session.

One thing to return to on that, which I think is interesting from the conversation around the value of having an integrated knowledge graph, or any of these large organizational data initiatives, is that one thing some of the AI approaches will let you do experimentally, and which is underutilized right now, is to generate data that is valuable from a knowledge-reuse perspective but not immediately useful for your ongoing projects. So, how you enrich your knowledge graph with things that fill deficiencies, this "dark matter" concept that people have come up with. That has to be supported by an immense data infrastructure, and it's a long-term vision; you don't get an immediate payoff, so it's not something that I think is being seriously investigated in the way that it may have potential. I would expect that as companies improve their data infrastructure, and also the focus they put on how they generate their future data, you would get some dividends from that, but it's going to take time.
I mean, look at all the FAIRification efforts: many companies made years of investment in that before, I think, they achieved tangible benefit, and now that there is tangible benefit, many are getting a return on that investment, I would say.

Well, I would posit that the challenge we're fighting against isn't whether AI is useful; I think if you don't do AI, it's even worse. The problem, to me, with AI, automation, using data and so on, is that we've eaten all the low-hanging fruit and the opportunity space is effectively shrinking. So we're going after harder and harder targets and harder precision-medicine strategies, which effectively have smaller markets available, and when you put one of those through the drug discovery pipeline and it fails, you only had a small profit margin to begin with. So the cost of it just keeps going up and up. That, to me, is the challenge we're fighting against. So it feels like AI has to be the solution, and without doing it you definitely are going to be worse off. Any other final closing comments before we draw this to an end?

Yeah, I think that's true also for designing new molecules, for coming up with the chemical structure of potential new drugs; I think it's very useful there. But like you said, all of the current treatments that we have are the easiest ones. Everything we discover from now on is becoming more and more complex, because we have drug resistance and we have these multi-target pathologies. There are not many cases like breast cancer, where BRCA1 is the main target and, if we get rid of it, we have a good chance of survival. So the fact that the biology is more complex now, to improve on what we already have, is somehow unfair on AI in general, because it's a harder problem. But like you said, if we don't use it, then it becomes even harder for humans, right?

Precisely. Jon Stevens, anything else to add?

Yeah, I would just add that there's a lot we didn't touch on in this session around how knowledge graphs could interact with all of this. Number one, there's the distinction that was alluded to before between traditional AI, the kind of special-purpose ML models, and generative AI; and even within generative AI, it doesn't all have to be language models. Things like generating compounds or generating protein sequences: there's generative potential in these more chemical and biological spaces. It's not the most obvious thing to try first, because it takes a little more customization and a little more work than your standard LLM. And then, to the extent that a lot of that data is stored in a graph, there's also graph ML looming over the horizon. Your LLMs use traditional sequence-based transformer models, but there's a whole landscape of graph ML techniques out there as well.
I think that will open up a lot of possibilities in drug discovery when it comes to how these things interact. Just because of what happened to develop first, we're really only starting this conversation, so I think there's a lot more to come.

And on that optimistic statement, I'm going to draw this event to a close. So, Mark, I'll pass back to you. Just before I go: thanks very much, Miguel, Jon and Jon, I really enjoyed the chat. Over to you, Mark.

All right. Well, thank you, Dan, for hosting that really interesting roundtable and Q&A today. I'd like to take this last opportunity to say, once again, thank you to the speakers, Jon, Miguel and Jon. We'll be running the next one of these in about three months' time, and we'll keep everybody informed. If you enjoyed that conversation, Jon Stevens will be joining Dan and another panelist. Myra, can you check whether my screen is being shared? I can't see it from my end. Sorry, I'm just bringing up the details. They'll be talking more in depth about LLMs in the real world. So on that final note, I'd just like to say thank you to all of our panelists, our speakers, and to everyone who attended.

Okay. Thank you.