Deep Learning and CRISPR Off-Target Effect Prediction — A Walkthrough

12 min read · Dec 24, 2019

Unless you’ve been living under a rock for the past decade, or you dislike biology and completely avoid it (which, if you do, why? 🤔 Biology’s awesome!! 😍), you’ve definitely heard of CRISPR: the magical system that can edit our genes and solve all our problems 😎.

A Bit About CRISPR

For those of you that don’t know about CRISPR, it’s basically a method of editing your genes. What are genes? Loosely speaking, they’re stretches of DNA 🧬 in your genome (so all the DNA in your body) that code for specific proteins. These proteins might give you a certain eye 👁 colour or hair colour, or might help your cells grow and divide. Proteins essentially play a role in every single trait you have, and all the processes that are going on in every single one of your trillions of cells every single second.

The central dogma of molecular genetics. Weird two-stranded molecule, to another weird single-stranded molecule, to a weird blob, to, finally, a weird ball of strings

These traits include any and all disorders you might have. No matter where in your body, proteins play a HUGE role, and many disorders arise because certain proteins don’t work correctly. Soooo, you might guess that if we’re able to fix those proteins, we might be able to cure those disorders.

Well, think about it as a toy car 🚗 factory🏭. If you know that your cars have broken wheels, you won’t try and fix 🛠️ the defective cars as they’re made. Instead, you’ll change the blueprint or production method so that the wheels are not made broken. Just like that, instead of trying to fix every defective protein as it is made, we can edit the DNA itself to make sure the proteins are not made incorrectly. By doing this, we can effectively cure whatever problem or disorder you might be having.

That’s where CRISPR comes in. The basic structure of CRISPR technology consists of two main parts. The first is a guide RNA sequence. This is a sequence of RNA (basically DNA with a few differences) that guides the CRISPR system to the correct site on the DNA. The second is a Cas enzyme (usually Cas9) that makes a cut ✂️ at that site. Depending on the exact experiment, there might be other parts in the system, but these two components are found across all experiments.

But… despite all the hype 🎉 about it, CRISPR still isn’t ready for therapeutic use. It’s better and further along than earlier gene-editing technologies, but it’s not perfect.

First of all, in many cases, it leaves the actual “editing” to chance. It basically cuts the DNA and leaves, hoping that the cell will repair the DNA properly and not screw stuff up even more. It’s like you want to edit a certain part of your essay, so you erase the bad part and leave a monkey 🐒 to hopefully type out ⌨️ just what you need. But… this problem might have been solved by the development of a slightly different form of CRISPR known as prime editing. You can find out all about it in Eliza Aguhar’s article on the mechanism.

The second problem is how inefficient it is to design a CRISPR system. Right now, the process is very slow ⏳ and relies heavily on guesswork. One reason behind this is the difficulty of designing the guide RNA.

Guide RNA Design and Problems

As I said before, the guide RNA molecule guides the CRISPR system to the correct spot on the DNA. It’s like the GPS 🗺️ that guides you to the nearest McDonald’s 🍟. Now, you might think that the CRISPR system will follow these directions really closely.

But that’s not true. Many times, it might also target other sites that have just slight differences. It’s like if the GPS guides you not only to the McDonald’s but also to the nearby Burger King and Wendy’s. Sure, they might be similar, but they’re not exactly the same. That’s where the problem of off-target effects with CRISPR systems comes in. The systems often attach not only to the main target but also to other sites that are similar.

Now, if you visit Burger King and Wendy’s instead of McDonald’s, it might not be that big a deal (at least not for me), but when it comes to editing DNA, the code of life, these inaccuracies can be super harmful. An off-target edit might activate harmful genes, such as the proto-oncogenes that can be turned into oncogenes and cause cancer, or deactivate important genes, such as the ones that code for hemoglobin. And so, it might not come as a surprise that limiting off-target effects is one of the biggest factors geneticists 👩‍🔬 and biologists 👨‍🔬 consider when designing the guide RNA sequence for a CRISPR experiment.

The problem is, off-target sites are hard to predict since, theoretically, they can occur at any site adjacent to a PAM sequence (a 3-bp sequence that must be present next to the target sequence). Working out which mismatches the system will tolerate, and which of those will cause off-target effects, is extremely difficult with our current knowledge of the molecular interactions involved. Many statistical and machine learning models attempt to deal with this problem, but most of them are not accurate enough, and different models can give different results for the exact same sequence.

Off-target effects take place because of small mismatches between the gRNA and DNA that the CRISPR system accepts

The Alternative

How can we solve this problem?

Two researchers 🧑‍🔬(Dr. Lin Jiecong and Dr. Wong Ka-Chun) from the City University of Hong Kong suggest using deep learning.

In the study, the researchers use two types of deep learning algorithms ⚙️ known as feedforward neural networks (FNNs) and convolutional neural networks (CNNs) to improve off-target prediction.

But before we get into the specifics, why is deep learning better?

The Benefit of Deep Learning

It’s faster ⏱️
It’s better at analyzing unclear relationships 📈

Deep learning can be much, much faster than traditional machine learning models and human-designed statistical models. What might usually take humans hours or days, a trained deep learning model can achieve in minutes. When it comes to designing a CRISPR experiment for therapeutic purposes, you want to do it fast, so you can also treat the patients 🤒 as fast as possible.

But even more importantly, deep learning is highly skilled at analyzing data 📈 and finding interesting relationships and connections that humans 🧑 can easily miss. Where human brains 🧠 are not designed to find deep connections in data, deep learning algorithms are designed to perform data analysis. This means that they can find connections between variables that we never could.

The Data

To train the models for off-target prediction, we need to feed them pairs of guide RNA and DNA sequences, along with whether each pair led to an off-target effect. In this case, each sequence is 23 bases long, including the 3-bp PAM sequence, so each data point has length 23. But neural networks work on numbers, not letters. So, we must convert the 4 letters that code the sequence (A, C, T, G) into numerical 🔢 vectors, a scheme known as one-hot encoding.

A = [1, 0, 0, 0]
G = [0, 1, 0, 0]
C = [0, 0, 1, 0]
T = [0, 0, 0, 1]

Now, each sequence is a 2D matrix with 23 rows (one per base) and 4 columns (one per letter).
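To make the encoding concrete, here is a minimal Python sketch. It uses the same four codes listed above; the helper name `encode` and the 23-letter example sequence are made up for illustration and do not come from the study.

```python
# One-hot codes exactly as listed above (note the A, G, C, T order).
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "G": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def encode(seq):
    """Turn a string of bases into a list of 4-element one-hot rows."""
    return [list(ONE_HOT[base]) for base in seq.upper()]


# Hypothetical 23-letter sequence: 20 bases followed by a 3-base PAM (TGG).
example_seq = "GAGTCCGAGCAGAAGAAGAATGG"
matrix = encode(example_seq)

print(len(matrix), len(matrix[0]))  # 23 4
for base, row in zip(example_seq[:4], matrix[:4]):
    print(base, row)
# G [0, 1, 0, 0]
# A [1, 0, 0, 0]
# G [0, 1, 0, 0]
# T [0, 0, 0, 1]
```

Every row contains exactly one 1, which is what makes it a one-hot code.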

The next step is to actually communicate the mismatches that took place between the guide RNA and DNA molecules to cause the off-target effects.

To do this, the DNA code was converted into a matrix the same way. Following this, an OR operator was applied to combine and compare the two matrices and communicate the mismatches and matches.

The OR operator goes through both matrices position by position. Where the letters match, the two rows are identical, so the OR leaves the code unchanged. For example, if both strands have an A at a position, the code for that position remains [1, 0, 0, 0].

However, at positions where the letters do not match, the OR merges the rows from both sequences and places the merged row into the combined matrix. For example, a mismatch between an A ([1, 0, 0, 0]) and a G ([0, 1, 0, 0]) gives the new row [1, 1, 0, 0]. This way, the combined matrix records both the locations and the types of mismatches between the target DNA and the guide RNA.

This is how the OR operator combines both sequences to create a collated sequence that can communicate the mismatches between the complementary guide RNA and the target DNA
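The OR step can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the researchers' code: the function names and the 4-letter toy sequences are assumptions.

```python
# Same one-hot codes as in the article (A, G, C, T order).
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "G": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def encode(seq):
    """One-hot encode a string of bases."""
    return [list(ONE_HOT[base]) for base in seq.upper()]


def or_combine(guide, target):
    """Element-wise OR of the one-hot matrices of two equally long sequences."""
    if len(guide) != len(target):
        raise ValueError("sequences must have the same length")
    return [
        [g | t for g, t in zip(g_row, t_row)]
        for g_row, t_row in zip(encode(guide), encode(target))
    ]


# Toy example (hypothetical sequences): one mismatch at the second position.
combined = or_combine("GAGT", "GGGT")
for row in combined:
    print(row)
# [0, 1, 0, 0]   match (G/G): row unchanged
# [1, 1, 0, 0]   mismatch (A/G): both bits set
# [0, 1, 0, 0]   match (G/G)
# [0, 0, 0, 1]   match (T/T)
```

One thing worth noticing: because OR is symmetric, an A/G mismatch and a G/A mismatch produce the same row, so this encoding records which two letters clashed but not which strand carried which.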

The Feedforward Neural Network

The first model is a feedforward neural network (FNN), also known as a multi-layer perceptron. This is the basic form of the neural network, with an input layer, hidden dense layers, and an output layer with two neurons.

One limitation of FNNs is that they cannot take 2D data like matrices directly. To input the data we prepared above, we have to flatten the 23x4 matrix into a 1-dimensional vector of length 92 (23 × 4). This means that the input layer of the FNN must have 92 neurons to take in the vector.
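Flattening is simple enough to show directly. In this small sketch, the placeholder matrix is just 23 identical rows standing in for a real encoded sequence, and the row-by-row order is an assumption (the article does not say which order was used).

```python
def flatten(matrix):
    """Concatenate the rows of a 2D list into one flat list (row by row)."""
    return [value for row in matrix for value in row]


# Placeholder 23x4 matrix; a real one would come from the encoding step.
matrix = [[1, 0, 0, 0] for _ in range(23)]
vector = flatten(matrix)

print(len(vector))  # 92
```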

Next, the researchers varied the structure of the hidden layers to find the best model for the problem. They designed three models, each with a different number of hidden layers and neurons per layer. The first model had two hidden layers, with 250 neurons in the first and 40 in the second. The second model had three hidden layers, with 50, 20, and 10 neurons. Finally, the third model had four hidden layers, with 25, 10, 10, and 4 neurons. In each model, the hidden layer sizes multiply to 10,000 (250 × 40 = 50 × 20 × 10 = 25 × 10 × 10 × 4), although the actual number of connections between the hidden layers differs from model to model.
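As a quick sanity check on those three layouts, here is a small Python sketch that computes two things for each one: the product of the hidden layer sizes, and the number of weights between consecutive hidden layers. The dictionary labels and the helper name are made up for illustration.

```python
from math import prod

# Hidden layer sizes as described in the article.
hidden_layouts = {
    "2 hidden layers": [250, 40],
    "3 hidden layers": [50, 20, 10],
    "4 hidden layers": [25, 10, 10, 4],
}


def weights_between_hidden(sizes):
    """Number of weights connecting each hidden layer to the next one."""
    return sum(a * b for a, b in zip(sizes, sizes[1:]))


for name, sizes in hidden_layouts.items():
    print(name, prod(sizes), weights_between_hidden(sizes))
# 2 hidden layers 10000 10000
# 3 hidden layers 10000 1200
# 4 hidden layers 10000 390
```

The product is 10,000 for every layout, while the weight counts between hidden layers are 10,000, 1,200, and 390.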

The output neurons use the softmax activation function, which converts the two raw outputs into probabilities that add up to 1.
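Here is what softmax does to a pair of raw outputs, as a minimal Python sketch. The input numbers are arbitrary, and subtracting the maximum first is a standard numerical-stability trick rather than something the article mentions.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two arbitrary raw outputs, one per class.
probs = softmax([2.0, 0.5])
print([round(p, 2) for p in probs])  # [0.82, 0.18]
```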

All three models were then compiled with the Adam optimizer and trained to minimize the cross-entropy loss. In layman’s terms, the loss measures how far the predicted probabilities are from the true labels, so the model is penalized most when it is confident and wrong, and training adjusts the weights to shrink that penalty.
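To get a feel for cross-entropy, here is a tiny Python sketch for a single example with two classes. The probabilities are invented, and the sketch covers only the loss itself, not the Adam optimizer.

```python
import math


def cross_entropy(probs, true_class):
    """Loss for one example: negative log of the probability given to the true class."""
    return -math.log(probs[true_class])


# The true class is index 1 in all three cases (made-up predictions).
confident_right = cross_entropy([0.10, 0.90], 1)
unsure_right = cross_entropy([0.45, 0.55], 1)
confident_wrong = cross_entropy([0.90, 0.10], 1)

print(round(confident_right, 3))  # 0.105
print(round(unsure_right, 3))     # 0.598
print(round(confident_wrong, 3))  # 2.303
```

The first two predictions would both count as correct, yet the loss still prefers the more confident one. That is why cross-entropy is a smoother training signal than simply counting errors.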

The Convolutional Neural Network: How It Works

The second type of model is the convolutional neural network (CNN), which has a similar structure to an FNN, plus a few special properties.

CNNs are actually specially designed to analyze images and classify them into one of multiple categories. The reason they are so good at image analysis is because of the presence of a few special layers.

A CNN in action, classifying people, buses, and cars on a busy street (somewhere in Europe I think?). Pretty neat 😎 huh?

The most important of these layers is the convolutional layer. It acts much like normal layers, where it receives an input, transforms the data, and transfers the transformed data into the next layer. The entire purpose of the convolutional layer is to extract features from the input image.

It does this by applying a filter to the original image. The filter multiplies each element it currently covers by a certain number (a weight), and then adds up all the products to get a single number. This number forms a single element in the output matrix, called a feature map. The filter then slides over by a fixed number of pixels, known as its stride (one pixel here), and repeats the process until the feature map is complete.

The convolutional layer at work, using a preset filter to create a feature map with all the important information in a more concise form
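The sliding-filter idea is easy to write out by hand. Below is a minimal Python sketch with a made-up 4x4 "image" and a made-up 2x2 filter; real CNN layers learn their filter weights during training and usually apply many filters at once. (Strictly speaking, this operation is cross-correlation, which is what deep learning libraries call convolution.)

```python
def convolve2d(image, kernel, stride=1):
    """Slide a filter over a 2D list and return the resulting feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    feature_map = []
    for top in range(0, len(image) - k_rows + 1, stride):
        out_row = []
        for left in range(0, len(image[0]) - k_cols + 1, stride):
            # Multiply each covered element by its filter weight, then add up.
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


# Toy inputs (hypothetical): a checkerboard image and a diagonal filter.
image = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
kernel = [
    [1, 0],
    [0, 1],
]

feature_map = convolve2d(image, kernel, stride=1)
for row in feature_map:
    print(row)
# [2, 0, 2]
# [0, 2, 0]
# [2, 0, 2]
```

With a stride of 1, a 2x2 filter on a 4x4 input gives a 3x3 feature map. A larger stride would skip positions and shrink the output further.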

The feature map is then passed into a batch normalization (BN) layer, which normalizes the features and reduces internal covariate shift, making the training process faster⏱ and more efficient.
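At its core, batch normalization is just "subtract the mean, divide by the standard deviation". Here is a simplified Python sketch for one feature across a small batch. The input values are made up, and `gamma` and `beta` (which a real layer learns during training) are fixed at their usual starting values.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # eps keeps the division safe when the variance is close to zero.
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# One feature measured on a batch of four examples (made-up values).
normalized = batch_norm([2.0, 4.0, 6.0, 8.0])
print([round(v, 2) for v in normalized])  # [-1.34, -0.45, 0.45, 1.34]
```

A real BN layer also keeps running averages of the mean and variance so it can normalize single examples at prediction time. That part is left out here.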

The next step, in this case, is for the feature map to go through a global max-pooling layer. This layer essentially reduces the dimensionality of the feature map while retaining its most important information. In essence, the layer takes the entire feature map it receives and outputs one element with the largest value from the feature map.
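Global max-pooling is about as simple as a layer gets. The sketch below shows it next to ordinary windowed max-pooling for contrast. The input values are made up, and the non-overlapping windows in `max_pool_1d` are an assumption for illustration.

```python
def global_max_pool(feature_map):
    """Keep only the single largest value of a 2D feature map."""
    return max(value for row in feature_map for value in row)


def max_pool_1d(values, window):
    """Windowed max-pooling with non-overlapping windows (stride == window)."""
    return [
        max(values[i:i + window])
        for i in range(0, len(values) - window + 1, window)
    ]


print(global_max_pool([[1, 5, 2], [8, 3, 7]]))  # 8
print(max_pool_1d([1, 5, 2, 8, 3, 7, 4], 3))    # [5, 8]
```

The global version collapses the whole map to one number, while the windowed version keeps one number per window, so the window size controls how much detail survives.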

The next two layers are fully connected dense layers, with 100 neurons in the first layer and 23 neurons in the second. The second dense layer is followed by a dropout layer, which randomly masks portions of the output during training to keep the model from overfitting to the training dataset. The last layer, the output layer, consists of two neurons that correspond to the two classification results: off-target effect, or not.
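Dropout can be sketched in a few lines too. This is the common "inverted dropout" variant, where the surviving values are scaled up so the expected total stays the same. The 0.5 rate and the all-ones activations are placeholders, since the article does not give the rate that was used.

```python
import random


def dropout(values, rate, rng):
    """Randomly zero out a fraction of the values and rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)    # fixed seed so the sketch is repeatable
activations = [1.0] * 23  # stand-in for the output of the 23-neuron layer
dropped = dropout(activations, rate=0.5, rng=rng)

print(len(dropped))          # 23
print(sorted(set(dropped)))  # only 0.0 (masked) and 2.0 (kept and rescaled) can appear
```

Dropout is only active during training. At prediction time the layer passes everything through unchanged.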

Now, you might be wondering: “If the CNN is so good at analyzing images, how does that help us with this problem?” The thing is, the CNN isn’t good at analyzing images specifically. Instead, it’s good at analyzing matrices, or 2D data. This means that we could plug in the matrix we made before as it is into the CNN, and the model would be able to train on it, unlike the FNN.

Just like the FNNs, the researchers also carried out multiple experiments with different CNN structures, to see which structure worked the best. There were a total of 6 CNNs: the standard CNN; a CNN without the BN layer (CNN_nbn); a CNN without the dropout layer (CNN_nd); a CNN without the max-pooling layer (CNN_np); a CNN with a max-pooling window size of 3 (CNN_pwin3); and a CNN with a max-pooling window size of 7 (CNN_pwin7).

The Results

So, all of this sounds cool and all, but did it work?

Yes, it did! And it didn’t just do well, it blew all previous methods out of the water!

Performance, in this case, was measured by plotting the Receiver Operating Characteristic (ROC) curve, which shows the true positive rate against the false positive rate, and then finding the Area Under the Curve (AUC). The closer the AUC is to 1, the better the model separates off-target sites from the rest, while 0.5 is no better than random guessing. As shown in the table below, the best performance was seen with the standard CNN model and the FNN model with 3 hidden layers, with a mean AUC score of around 0.97 for both.

The performance of all the different models in the experiment, with the best performance for the standard CNN (CNN_std) and the FNN with 3 hidden layers (FNN_3layer)
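If AUC feels abstract, one handy way to think about it is this: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The Python sketch below computes it that way on six made-up predictions. It is a toy illustration, not how the study's scores were produced.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = off-target effect) and predicted probabilities.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]

auc = roc_auc(labels, scores)
print(round(auc, 3))  # 0.889
```

Here 8 of the 9 positive/negative pairs are ranked correctly, which gives 8/9 ≈ 0.889. The pairwise loop is fine for a toy example, but it is slow on large datasets, where a sorting-based method is the usual choice.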

These two models also outperformed current machine learning and statistical algorithms designed for the same purpose. They beat CFD scoring, the most accurate algorithm, and Logistic Regression, the most accurate machine learning model, by 5.8% and 3.9% respectively.

This might not seem like that much, but when it comes to the accuracy of a model designed to predict off-target effects for gene editing experiments, this could mean the difference between a correct prediction and a successful experiment, and an incorrect prediction and an unsuccessful experiment.

Next Steps

Now, this is great news. The first time deep learning was applied to off-target effect and site prediction, it performed extraordinarily well and outperformed all current methods.

But more importantly, it shows that deep learning can be used in gene editing and genomic applications, and make a huge difference. So, it’s time we follow the lead of virtually every other field and truly start researching the potential applications and effectiveness of deep learning models in genome engineering. Once we do that, we might truly be able to make genome editing safe and effective for therapeutic💉 purposes.

If you prefer an audio-video version of this article, check out this video I made.

Hey there 👋🏽! I am Akshaj, a 17-year-old innovator set out to change the world 🌎, mainly by applying exponential technologies like artificial intelligence to biology🧫 and medicine ⚕. I am super passionate about bridging the gap between healthcare 🏥 in developed and developing countries, and truly believe that this gap can be overcome with more accessibility to medical technologies, as well as cheaper💲 and more effective technologies. And so, I am working suuuuper hard to find simple solutions to complex medical problems, and impact billions of people worldwide. In short, I am working to be a unicorn🦄 person.

Thanks for reading this article📖! I hope you learnt something new, and really got a sense of how deep learning can be applied in the genome-editing field, and how it will change the future of the field completely.

If you liked what you read, you can find more on my Medium account, so consider subscribing, in order to get regular articles on topics in medicine, biology, and exponential technology.

Check out my Twitter and my LinkedIn, and subscribe to my monthly newsletter to stay fully updated with everything I’m doing, and how I’m working hard to change the world.

Also, check out my personal website to find out more about me, and what I’ve worked on!

www.akshajdarbar.com

Thank you for reading. See you next time!

Akshaj, out.