How AstraZeneca Is Using a Netflix-like Knowledge Graph to Discover New Drugs

sid dhuri
5 min read · Jul 29, 2020


Drug discovery is a long, challenging and often futile process for pharmaceutical companies: an estimated two-thirds of all clinical trials to find new medicines ultimately fail.

Drug development is a long and expensive process. Source: Spark AI Summit 2020

Existing models for drug discovery can process only a limited number of drugs and cover only a limited part of the proteome. These approaches have led to high false-positive prediction rates, costing significant time, effort and funding.

AI has the potential to transform the way we discover and develop potential new treatments.

Building disease understanding through knowledge graphs

Knowledge graphs are a way of representing information that can capture complex relationships more easily than conventional databases. If you have used Google, you have already used a knowledge graph.

Knowledge graphs work as a library of information which can spot the connections between thousands of different sources to find you the answer you are looking for.

What are knowledge graphs?

· Knowledge in graph form!

· Captures entities, attributes, and relationships

· Nodes are entities

· Nodes can have attributes

· An edge between two nodes represents a relationship between the entities

A trivial knowledge graph
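
As a minimal sketch of the list above, a small knowledge graph can be held in two base R data frames: one for nodes (entities with attributes) and one for edges (relationships). The column names `name`, `type`, `from`, `to` and `relation` are illustrative assumptions, not a fixed schema, and the four entities are chosen only as an example.

```r
# Nodes are entities; additional columns hold their attributes.
# (Column names and example entities are illustrative assumptions.)
nodes <- data.frame(
  name = c("Atorvastatin", "Amoxicillin",
           "High cholesterol", "Bacterial infection"),
  type = c("drug", "drug", "condition", "condition"),
  stringsAsFactors = FALSE
)

# Each edge links two nodes and names the relationship between them.
edges <- data.frame(
  from     = c("Atorvastatin", "Amoxicillin"),
  to       = c("High cholesterol", "Bacterial infection"),
  relation = c("treats", "treats"),
  stringsAsFactors = FALSE
)

# Sanity check: every edge endpoint must be a known node.
stopifnot(all(c(edges$from, edges$to) %in% nodes$name))

# Read each edge as a subject-relation-object triple.
triples <- paste(edges$from, edges$relation, edges$to)
print(triples)
```

The same two data frames could be handed to `igraph::graph_from_data_frame(edges, vertices = nodes)` to obtain a graph object for plotting or analysis.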

Some of the most prevalent knowledge graphs that you might have already used:

· Google Knowledge Graph

· Google Knowledge Vault

· Microsoft Satori

· Amazon Product Graph

· Facebook Graph API

· IBM Watson

· LinkedIn Knowledge Graph

Knowledge graphs for drug discovery:

The goal of pre-clinical drug discovery is to deliver one or more clinical candidate molecules, backed with sufficient evidence of biological activity on a disease target. To do this, scientists need to search through troves of journals, clinical trial data and genomics data, and piece together whether a candidate molecule is worth progressing.

The sheer amount of scientific information and clinical data available to researchers is growing every year. AstraZeneca is organizing this data into knowledge graphs to harness the power of networks of scientific data facts and give their scientists the information they need about genes, proteins, diseases and compounds, and their relationships.

Process of building Knowledge Graph

How AstraZeneca uses internal knowledge graphs for drug discovery

Dr Eliseo Papa, AI engineering lead at AstraZeneca, describes the problem of finding the next drug target as similar to finding the next movie to watch on Netflix.

Speaking at the Spark AI Summit 2020, he said: “Sometimes, selecting the next best drug target can be compared to choosing the next best movie, just with much more serious implications if you get it wrong. Thanks to Netflix, we now know how to tackle these types of problems.”

Extracting valuable information from multiple sources and connecting these entities in a knowledge graph enables users to make informed decisions that take all the available information into account, and to discover connections they might not have found otherwise.

Big tech companies like Google, Microsoft and Amazon have used internal knowledge graphs to bring together and connect the vast amounts of data they have accumulated over time.

Connecting vast amounts of heterogeneous data enables researchers to construct a query for a specific kind of relationship between different nodes.

In such a query, the nodes are chemical compounds, biological entities, and diseases described in the literature, and the system has to use the links between them as part of the query.
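
A query of this kind can be sketched in base R as a two-step walk over an edge list: compound to protein, then protein to disease. The entity names (`compound_A`, `protein_X`, `disease_1`, ...), the relation labels `targets` and `associated_with`, and the helper `diseases_for()` are hypothetical placeholders, not AstraZeneca's data or schema.

```r
# A toy edge list; all names and relation labels are hypothetical.
edges <- data.frame(
  from     = c("compound_A", "compound_B", "protein_X", "protein_Y", "protein_X"),
  relation = c("targets", "targets",
               "associated_with", "associated_with", "associated_with"),
  to       = c("protein_X", "protein_Y", "disease_1", "disease_2", "disease_2"),
  stringsAsFactors = FALSE
)

# Step 1: compound -> protein links.
targets <- edges[edges$relation == "targets", c("from", "to")]
names(targets) <- c("compound", "protein")

# Step 2: protein -> disease links.
assoc <- edges[edges$relation == "associated_with", c("from", "to")]
names(assoc) <- c("protein", "disease")

# Joining on the shared protein yields compound -> protein -> disease paths.
paths <- merge(targets, assoc, by = "protein")
paths <- paths[order(paths$compound, paths$disease),
               c("compound", "protein", "disease")]
rownames(paths) <- NULL
print(paths)

# Hypothetical helper: diseases reachable from one compound via any protein.
diseases_for <- function(compound) {
  unique(paths$disease[paths$compound == compound])
}
print(diseases_for("compound_A"))
```

The join on the shared `protein` column is what makes the links part of the query: a compound is connected to a disease only when both relationships exist in the graph.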

By using AI and machine learning to combine information from multiple sources, researchers at AstraZeneca hope to draw faster and more informed conclusions than if they had analysed all this data manually.

Knowledge graphs allow AstraZeneca’s researchers to ask key questions about genes, diseases, drugs and safety information to help identify and prioritise drug targets. And, as the data and knowledge continue to evolve, the graphs grow organically, which means every new experiment will benefit from everything learned before.

Ultimately, AstraZeneca wants to develop personalised knowledge graphs that bring the right information to the right scientist at the right time, so that each one can play their part in advancing research to understand diseases and how they work, identify new drug targets and design better clinical trials.

Creating a sample knowledge graph of drugs from the NHS website using R

To get some information for our knowledge graph, we will scrape data about drugs from the NHS website using the rvest package.

#' Get information about drugs from the NHS website
library(magrittr) # provides the %>% pipe

#' Base URL of the NHS medicines pages (must be set before running)
nhs_url <- ''
drugs <- c('Atorvastatin', 'Azithromycin', 'Amoxicillin')

#' List to collect one data frame per drug
datalist <- list()

#' For every drug, fetch its page and extract the main text
for (i in seq_along(drugs)) {

  drug <- drugs[i]

  drug_info <- xml2::read_html(paste0(nhs_url, drug)) %>%
    rvest::html_nodes(xpath = '//*[@class="nhsuk-grid-column-two-thirds"]') %>%
    rvest::html_text() %>%
    gsub(pattern = "\t|\n|\r", replacement = "") %>%
    gsub(pattern = "\\s+", replacement = " ")

  dat <- data.frame(x = drug, y = drug_info, stringsAsFactors = FALSE)

  datalist[[i]] <- dat # add it to the list
}

#' Bind the list into a single data frame
drugs_info <- do.call(rbind, datalist)
colnames(drugs_info) <- c('Drug', 'Info')

This gives us a data frame with two columns, Drug and Info, holding each drug's name and its text from the NHS website.

Next, we create our knowledge graph from this data frame using the tidytext and igraph packages, and then plot it using the ggraph package.

#' Packages used for text mining and graph plotting
library(dplyr)
library(tidyr)
library(tidytext)
library(igraph)
library(ggraph)

#' Extract bigrams from the free text
drugs_bigrams <- drugs_info %>%
  unnest_tokens(bigram, Info, token = "ngrams", n = 2)

#' Split each bigram into its two words
drugs_bigrams_df <- drugs_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

#' Drop bigrams that contain a stop word
drugs_bigrams_filtered <- drugs_bigrams_df %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

#' New bigram counts
drug_bigram_counts <- drugs_bigrams_filtered %>%
  count(word1, word2, sort = TRUE)

#' Keep only relatively common combinations and build the graph
drug_info_graph <- drug_bigram_counts %>%
  filter(n > 1) %>%
  graph_from_data_frame()

#' Plot the graph
ggraph(drug_info_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

This creates a graph that links terms found in the unstructured text: each node is a word, and each edge connects two words that appear next to each other more than once.
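
To see how bigram counts turn into graph edges, here is a package-free sketch of the same idea in base R. The sample sentence is made up for illustration, and stop-word filtering is skipped to keep the example short; it is not a substitute for the tidytext pipeline above.

```r
# A made-up sentence standing in for scraped drug text (illustrative only).
text <- "take atorvastatin once a day take atorvastatin with water"

# Split into lowercase words.
words <- strsplit(tolower(text), "\\s+")[[1]]

# Pair each word with the word that follows it: these are the bigrams.
bigrams <- data.frame(
  word1 = head(words, -1),
  word2 = tail(words, -1),
  stringsAsFactors = FALSE
)

# Count how often each (word1, word2) pair occurs.
counts <- aggregate(n ~ word1 + word2, data = cbind(bigrams, n = 1), FUN = sum)
counts <- counts[order(-counts$n, counts$word1, counts$word2), ]
rownames(counts) <- NULL

# Each row is now a weighted edge: word1 -> word2 with weight n.
print(counts)
```

Keeping only rows with `n > 1`, as in the pipeline above, leaves the word pairs that recur in the text, and those rows become the edges of the plotted graph.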



sid dhuri

I am a data scientist by trade. I love to write about data science, marketing and economics. I founded a marketing AI, analytics and automation platform.