Deep Learning and CRISPR Off-Target Effect Prediction — A Walkthrough

Unless you’ve been living under a rock the past few decades, or you dislike biology and completely avoid it (which, if you do, why? 🤔 Biology’s awesome!! 😍), you’ve definitely heard of CRISPR. The magical system that can edit our genes and solve all our problems 😎.

A Bit About CRISPR

The central dogma of molecular genetics. Weird two-stranded molecule, to another weird single-stranded molecule, to a weird blob, to, finally, a weird ball of strings

Proteins determine virtually all of our traits, and these traits include any and all disorders you might have. No matter where in your body, proteins play a HUGE role. Many disorders are caused because the proteins don’t work correctly. Soooo, you might guess that if we’re able to fix those proteins, we might be able to cure those disorders.

Well, think about it as a toy car 🚗 factory🏭. If you know that your cars have broken wheels, you won’t try and fix 🛠️ the defective cars as they’re made. Instead, you’ll change the blueprint or production method so that the wheels are not made broken. Just like that, instead of trying to fix every defective protein as it is made, we can edit the DNA itself to make sure the proteins are not made incorrectly. By doing this, we can effectively cure whatever problem or disorder you might be having.

That’s where CRISPR comes in. The basic structure of CRISPR technology consists of two main parts. The first is a guide RNA sequence. This is a sequence of RNA (basically DNA with a few differences) that guides the CRISPR system to the correct site on the DNA. The second is a Cas enzyme (usually Cas9) that makes a cut ✂️ at that site. Depending on the exact experiment, there might be other parts in the system, but these two components are found across all experiments.

But… despite all the hype 🎉 about it, CRISPR still isn’t perfect, and it isn’t ready for therapeutic purposes yet. It’s better and further along than past gene-editing technologies, but it’s not perfect.

First of all, in many cases, it leaves the actual “editing” to chance. It basically cuts the DNA and leaves, hoping that the cell will repair the DNA properly and not screw stuff up even more. It’s like erasing the bad part of your essay and leaving a monkey 🐒 to hopefully type out ⌨️ just what you need. But… this problem might have been solved by the development of a slightly different form of CRISPR known as prime editing. You can find out all about it in Eliza Aguhar’s article on the mechanism.

The second problem comes with the huge inefficiency of designing the CRISPR molecule. Right now, it’s very slow ⏳ and based highly on guesswork. One reason behind this is the difficulty of designing guide RNA.

Guide RNA Design and Problems

You might expect the guide RNA to direct the CRISPR system only to its exact target sequence. But that’s not true. Many times, it also targets other sites that have just slight differences. It’s like if your GPS guides you not only to the McDonald’s but also to the nearby Burger King and Wendy’s. Sure, they might be similar, but they’re not the exact same. That’s where the problem of off-target effects with CRISPR systems comes in. The systems often attach not only to the main target but also to other, similar sites.

Now, if you visit Burger King and Wendy’s instead of McDonald’s, it might not be that big a deal (at least not for me), but when it comes to editing DNA, the code of life, these inaccuracies can be super harmful. They might activate harmful genes, such as the proto-oncogenes that can be activated into oncogenes to cause cancer, or deactivate important genes, such as the ones that code for hemoglobin, for example. And so, it might not come as a surprise that limiting off-target effects is one of the biggest factors geneticists 👩‍🔬 and biologists 👨‍🔬 consider when designing the guide-RNA sequence for a CRISPR experiment.

The problem is, off-target sites can be hard to predict, since, theoretically, they can occur at any site adjacent to a PAM sequence (a 3-bp sequence that must be present next to the target sequence). Determining which mismatches are more likely to be tolerated and cause off-target effects seems impossible based on our current knowledge of the molecular interactions in genomics and gene editing. Many statistical and machine learning models attempt to deal with this problem, but most of them are not accurate or effective enough, and different models can give different results for the exact same sequence.

Off-target effects take place because of small mismatches between the gRNA and DNA that the CRISPR system accepts

The Alternative

Two researchers 🧑‍🔬(Dr. Lin Jiecong and Dr. Wong Ka-Chun) from the City University of Hong Kong suggest using deep learning.

In the study, the researchers use two types of deep learning algorithms ⚙️ known as feedforward neural networks (FNNs) and convolutional neural networks (CNNs) to improve off-target prediction.

But before we get into the specifics, why is deep learning better?

The Benefit of Deep Learning

  1. It’s much faster ⏱

Deep learning is much, much faster than traditional machine learning models and human-designed statistical models. What might usually take humans hours or days, deep learning can achieve in minutes. When it comes to designing a CRISPR experiment for therapeutic purposes, you want to do it fast, so you can also treat the patients 🤒 as fast as possible.

  2. It’s better at analyzing unclear relationships 📈

But even more importantly, deep learning is highly skilled at analyzing data 📈 and finding interesting relationships and connections that humans 🧑 can easily miss. Where human brains 🧠 are not designed to find deep connections in data, deep learning algorithms are designed to perform exactly that kind of analysis. This means that they can find connections between variables that we never could.

The Data

Each of the 23 nucleotides in the guide RNA sequence was one-hot encoded as a 4-element array:

A = [1, 0, 0, 0]
G = [0, 1, 0, 0]
C = [0, 0, 1, 0]
T = [0, 0, 0, 1]
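As a quick sketch of this step (the `encode` helper and the example sequence are my own illustration, not code from the paper), a sequence can be one-hot encoded like so:

```python
# One-hot codes for each nucleotide, in the same A/G/C/T order shown above.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "G": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}

def encode(sequence):
    """Convert a nucleotide string into a matrix of one-hot rows."""
    return [ONE_HOT[base] for base in sequence.upper()]

guide = "GAGTCCGAGCAGAAGAAGAAGGG"  # an example 23-nt sequence
matrix = encode(guide)
print(len(matrix), len(matrix[0]))  # 23 4
```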

Now, we have a 2D matrix with 23 rows (one per nucleotide) that might look like this:

The next step is to actually communicate the mismatches that took place between the guide RNA and DNA molecules to cause the off-target effects.

To do this, the DNA code was converted into a matrix the same way. Following this, an OR operator was applied to combine and compare the two matrices and communicate the mismatches and matches.

To do so, the OR operator checks both sequences letter by letter to see which letters match and which ones don’t. For all the letters that match, the code in the matrix remains the same. For example, if both strands had a match between As, the code for that index position would remain as
[1, 0, 0, 0].

However, for indices where the letters did not match, the operator combines the arrays for both sequences at that location and places this combined array into the combined matrix. For example, if there was a mismatch between an A ([1, 0, 0, 0]) and a G ([0, 1, 0, 0]), the two arrays would be combined to form the new array [1, 1, 0, 0]. By doing so, we are able to communicate the locations and types of mismatches that occurred between the target DNA and guide RNA sequences.
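Here is a minimal sketch of that combination step (the `combine` helper is a hypothetical name of mine, not from the study):

```python
def combine(guide_matrix, dna_matrix):
    """Element-wise OR of two one-hot matrices.

    Matching positions keep their one-hot row; mismatched positions
    end up with two 1s, encoding which two bases disagreed.
    """
    return [
        [g | d for g, d in zip(g_row, d_row)]
        for g_row, d_row in zip(guide_matrix, dna_matrix)
    ]

# A matched A (both [1, 0, 0, 0]) stays [1, 0, 0, 0];
# an A/G mismatch becomes [1, 1, 0, 0].
print(combine([[1, 0, 0, 0]], [[0, 1, 0, 0]]))  # [[1, 1, 0, 0]]
```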

This is how the OR operator combines both sequences to create a collated sequence that can communicate the mismatches between the complementary guide RNA and the target DNA

The Feedforward Neural Network

The basic structure of an FNN

One limitation of FNNs is that they cannot take 2D data like matrices as input. To feed in the data we prepared above, we have to flatten the 23x4 matrix into a 1-dimensional vector of length 92. This means that the input layer of the FNN must have 92 neurons to take in the vector.
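In NumPy terms, that flattening step looks something like this (the placeholder matrix values are just for illustration):

```python
import numpy as np

# A 23x4 one-hot matrix (placeholder values here) flattened into the
# length-92 vector the FNN's input layer expects.
matrix = np.zeros((23, 4))
matrix[:, 0] = 1  # pretend every position is an 'A'

vector = matrix.flatten()  # row-major: 23 * 4 = 92 elements
print(vector.shape)  # (92,)
```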

Next, the researchers incorporated some variability into the structure of the hidden layers, to determine the best model for the problem. They designed three distinct models that each had a different number of hidden layers and neurons in those layers. The first model had only two layers, with 250 neurons in the first layer and 40 neurons in the second. The second model had three layers, with 50 neurons, 20 neurons, and 10 neurons. Finally, the third model had four hidden layers, with 25 neurons, 10 neurons, 10 neurons, and 4 neurons. Overall, each of these models had a total of 10,000 connections between the neurons in its hidden layers.

The output neurons used the softmax activation function, which converts the raw outputs into probabilities that sum to 1.

All three models were then compiled using the Adam optimizer and trained to minimize the cross-entropy loss. In layman’s terms, each model basically trained to reduce how wrong its predictions were, step by step.
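To make those two terms concrete, here is a from-scratch NumPy sketch of softmax and cross-entropy (my own illustration, not the researchers' code):

```python
import numpy as np

def softmax(logits):
    """Convert raw outputs into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # shift for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

def cross_entropy(probs, true_class):
    """Loss is the negative log-probability assigned to the true class."""
    return -np.log(probs[true_class])

probs = softmax(np.array([2.0, 0.5]))  # two output neurons
print(probs)                            # probabilities summing to 1
# A confident, correct prediction yields a lower loss than a wrong one.
print(cross_entropy(probs, 0) < cross_entropy(probs, 1))  # True
```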

The Convolutional Neural Network: How It Works

CNNs are actually specially designed to analyze images and classify them into one of multiple categories. The reason they are so good at image analysis is because of the presence of a few special layers.

A CNN in action, classifying people, buses, and cars on a busy street (somewhere in Europe I think?). Pretty neat 😎 huh?

The most important of these layers is the convolutional layer. It acts much like a normal layer: it receives an input, transforms the data, and passes the transformed data on to the next layer. Its entire purpose is to extract features from the input image.

It does this by sliding a filter over the original image. At each position, the filter multiplies each element it covers by the corresponding filter weight, then adds all the products together to get a single number. This number forms one element of the output matrix, called a feature map. The filter then slides over by a set number of pixels, known as its stride (one pixel, in this case), and continues building the feature map.
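Here is a toy NumPy sketch of that multiply-and-add sliding operation (the `convolve2d` helper, the input, and the filter values are made up for illustration):

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide a filter over the input with the given stride, producing a feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the covered patch by the filter and sum the products.
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))  # a simple summing filter
print(convolve2d(image, kernel).shape)  # (3, 3)
```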

The convolutional layer at work, using a preset filter to create a feature map with all the important information in a more concise form

The feature map is then fed into a batch normalization (BN) layer, which normalizes the features and reduces internal covariate shift, making the training process faster ⏱ and more efficient.

The next step, in this case, is for the feature map to go through a global max-pooling layer. This layer essentially reduces the dimensionality of the feature map while retaining its most important information. In essence, the layer takes the entire feature map it receives and outputs the single largest value in it.
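A global max-pool is simple enough to sketch in a few lines (again, an illustration of the idea in NumPy, not the paper's code):

```python
import numpy as np

def global_max_pool(feature_map):
    """Collapse an entire feature map to its single largest value."""
    return np.max(feature_map)

fm = np.array([[0.1, 0.7], [0.3, 0.9]])
print(global_max_pool(fm))  # 0.9

# With several feature maps (channels), global max pooling keeps one
# number per channel: the strongest activation anywhere in that map.
stack = np.stack([fm, fm * 2])   # shape (2, 2, 2): 2 channels
pooled = stack.max(axis=(1, 2))  # shape (2,)
print(pooled)  # [0.9 1.8]
```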

The next two layers are fully connected dense layers, with 100 neurons in the first layer and 23 neurons in the second. The second dense layer also has a dropout layer applied to it, which randomly masks portions of the output to keep the model from overfitting to the training dataset. The last layer, the output layer, consists of two neurons that correspond to the two classification results: off-target effect, or not.

Now, you might be wondering: “If the CNN is so good at analyzing images, how does that help us with this problem?” The thing is, the CNN isn’t good at analyzing images specifically. Instead, it’s good at analyzing matrices, or 2D data. This means that we could plug the matrix we made earlier straight into the CNN, and the model would be able to train on it, unlike the FNN.

Just as with the FNNs, the researchers carried out multiple experiments with different CNN structures, to see which structure worked the best. There were a total of 6 CNNs: the standard CNN; a CNN without the BN layer (CNN_nbn); a CNN without the dropout layer (CNN_nd); a CNN without the max-pooling layer (CNN_np); a CNN with a max-pooling window size of 3 (CNN_pwin3); and a CNN with a max-pooling window size of 7 (CNN_pwin7).

The Results

Yes, it did! And it didn’t just do well, it blew all previous methods out of the water!

Accuracy, in this case, was measured using the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate for each model, and the Area Under the Curve (AUC), which summarizes a model’s performance in a single number. As shown in the table below, the best performance came from the standard CNN model and the FNN with 3 hidden layers, both with a mean AUC score of around 0.97.
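To make the AUC idea concrete, here is a small NumPy sketch that computes it using the equivalent rank-based (Mann-Whitney) interpretation. This is my own illustration, not the study's evaluation code:

```python
import numpy as np

def auc_score(labels, scores):
    """Area under the ROC curve.

    Equivalent to the probability that a randomly chosen positive
    example is scored higher than a randomly chosen negative one.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Count positive/negative pairs where the positive outranks the
    # negative; ties count for half.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.2]   # perfect separation of the two classes
print(auc_score(labels, scores))  # 1.0
```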

The performance of all the different models in the experiment, with the best performance for the standard CNN (CNN_std) and the FNN with 3 hidden layers (FNN_3layer)

These two models also outperformed current machine learning and statistical algorithms designed for the same purpose. They beat CFD scoring, the most accurate existing algorithm, and logistic regression, the most accurate machine learning model, by 5.8% and 3.9% respectively.

This might not seem like much, but for a model designed to predict off-target effects in gene-editing experiments, it could mean the difference between a correct prediction and a successful experiment, or an incorrect prediction and a failed one.

Next Steps

But more importantly, it shows that deep learning can be used in gene editing and genomic applications, and make a huge difference. So, it’s time we follow suit with virtually every other field across all industries and truly start researching the potential applications and effectiveness of deep learning models in genome engineering. Once we do that, we might truly be able to make genome editing safe and effective for therapeutic 💉 purposes.

If you prefer an audio-video version of this article, check out this video I made.

Hey there 👋🏽! I am Akshaj, a 17-year-old innovator set out to change the world 🌎, mainly by applying exponential technologies like artificial intelligence to biology 🧫 and medicine ⚕. I am super passionate about bridging the gap between healthcare 🏥 in developed and developing countries, and truly believe that this gap can be overcome with more accessibility to medical technologies, as well as cheaper 💲 and more effective technologies. And so, I am working suuuuper hard to find simple solutions to complex medical problems, and affect billions of people worldwide. In short, I am working to be a unicorn 🦄 person.

Thanks for reading this article📖! I hope you learnt something new, and really got a sense of how deep learning can be applied in the genome-editing field, and how it will change the future of the field completely.

If you liked what you read, you can find more on my Medium account, so consider subscribing, in order to get regular articles on topics in medicine, biology, and exponential technology.

Check out my Twitter and my LinkedIn, and subscribe to my monthly newsletter to stay fully updated with everything I’m doing, and how I’m working hard to change the world.

Also, check out my personal website to find out more about me, and what I’ve worked on!

Thank you for reading. See you next time!

Akshaj, out.
