Zacharie Ménétrier
Technological Advances for Genomics and Clinics (TAGC), INSERM U1090, Marseille
Submitted 18/12/2017

Abstract

Deep learning is a machine learning domain in which deep architectures of non-linear transformations are used to build statistical models with a high level of abstraction. Such models are known to perform well on complex, intricate patterns and large datasets. Genomics is a biological field where such datasets are omnipresent, mostly owing to recent high-throughput techniques. Older, shallow models have reached the state of the art in predicting biological features, but have also exposed their limitations. It is therefore appealing to use deep learning architectures to interpret complex biological features. Deep learning tools are often described as "black boxes", which is undesirable from a biological point of view. However, recent techniques make it possible to visualize deep learning models and obtain interesting insights into the genomic data under study. Here we present a synthesis of studies that used deep learning tools, along with a non-exhaustive panel of what they achieved in their respective fields.

Keywords: Machine learning, Genomics, Deep learning, Deep neural networks

Introduction

Machine learning is a broad field of study comprising a large collection of methods. These methods share a common purpose: for a machine to learn from data without explicit programming. Many challenges, such as handwriting recognition, have been successfully overcome by earlier machine learning methods, and more recent breakthroughs have made computers capable of accurate image recognition or of driving cars experimentally. Among these recent breakthroughs, deep learning stands as the spearhead.
It has been widely applied in many domains, including biology and genomics.

Deep learning is a set of machine learning methods that aim to model data with a high level of abstraction [11] using architectures of non-linear transformations. "Deep" refers to a multilayer network whose layers are connected to one another. Deep learning methods can be split into generative, discriminative and hybrid architectures [7]. In practice, however, deep learning is mostly implemented with deep neural networks (DNNs), and most of the articles cited in this review use DNNs or one of their variants (deep convolutional neural networks for the most part). We will first show the limitations of older models, then briefly review the deep learning structures implemented in the genomic studies considered here, and explore examples of successful applications of deep learning tools. We will list their methods and the biological concepts they address, and also discuss some general statements about deep learning.

Limitations of older models

Shallow models such as SVMs and random forests have demonstrated their power, but also their limitations. Depending on the genomics application, they are now almost systematically surpassed by deep learning. Older methods are constrained by the time it takes to train them properly on large datasets; SVMs, for example, suffer from a computational cost that grows with the size of the training data [3]. Older machine learning models also demand laborious feature engineering by hand. Deep learning models, by contrast, are free from the need to specify a large number of hand-crafted features or rules [4, 6, 9] (e.g. DNNs usually require only simple preprocessing steps rather than hand-crafted features [6]). It is also important to account for the diversity of biological solutions: the same biological entities often take different shapes in different species or cell types [7, 8]. As a result, shallow computational methods lack generalization power.
This encourages model engineers to turn to deep learning. DNNs, for example, with their capacity for abstraction, make it possible to train a more general model that fits different shapes of the same concept, avoiding the need to build a separate model for each case [8].

Older machine learning models mostly rely on straightforward statistical approaches. They have enabled strong analytical performance but stumble on abstract concepts, nested configurations and multi-class problems. Yet most shared knowledge in genomics sits at a high level of abstraction, where deep learning proves useful [11].

Still, DNNs are not an all-in-one solution. Even though they surpass powerful older machine learning methods such as SVMs in many applications, they need to be carefully crafted by an experienced user [2]. More importantly, they need large amounts of data to work properly [1, 4], although this is also the main reason for their success. The recent rise of big data and parallel computing has increased interest in deep learning for genomics [2]. Training a DNN relies on large datasets, without which it would perform poorly. Biology and genomics have recently seen profound changes in the way data is produced: the advent of next-generation sequencing and high-throughput techniques, combined with more accessible repositories, has left us with an ever-increasing amount of data waiting to be fully processed [1, 2, 4]. This fact alone has made DNNs an essential tool for genomic studies.

Deep learning structures

- Deep Neural Networks

A neural network is a combination of nodes, called neurons, linked together. Neural networks are often implemented as complex combinations of sigmoid functions, an approach trusted to allow non-linear fitting and to approximate any existing function [12]. Each neuron has its own trigger value and each link its own weight.
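A single neuron of this kind can be sketched in a few lines of Python (a purely illustrative toy, not taken from any of the cited tools): it computes a weighted sum of its inputs plus a trigger value, then applies a sigmoid.

```python
import math

def sigmoid(x):
    # Classic non-linear activation: squashes any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    # A single artificial neuron: weighted sum of the incoming signals
    # (one weight per link) plus a trigger value (bias), passed through
    # a non-linear activation.
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(total)

# Toy call: two inputs with hand-picked weights and bias.
print(round(neuron([1.0, 0.0], [2.0, -1.0], -1.0), 3))  # → 0.731
```

Stacking layers of such neurons, and learning the weights and biases from data instead of picking them by hand, is what the rest of this section describes.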
It is by tweaking these parameters with a backpropagation algorithm [11] during training that neural networks become capable of approximating complex functions that would be impossible to craft by hand.

Figure 1: A trivial example of a neural network.

Figure 1 shows a simple example of a neural network. The signal is fed forward through the links into the hidden layer, which transforms its nature along the way. Each link can carry a positive or negative weight. The output neuron that activates at the end of the forward pass tells us whether the nucleotide is a purine or a pyrimidine. Such a model is of course completely trivial and would not even need a hidden layer to compute correctly, but it illustrates how connected neurons work together to transform a signal.

Neural networks are commonly split into feedforward layers. The first layer, the input layer, is made of neurons that hold the initial values filled in from the data to learn from. The final layer is composed of neurons that represent the outcome of the model. Between these two, the model holds a number of hidden layers responsible for transforming the input signal. The number of hidden layers defines the depth of the network. Naturally, higher levels of abstraction require more hidden layers, which in turn demand more time and computational power to train the model.

- Deep Convolutional Neural Networks

Deep convolutional neural networks (CNNs) are variants of DNNs. In addition to the properties above, they can extract translation-invariant features [3], and they capture local and global representations of the data at the same time. The regulatory mechanisms of genomic sequences are known to be partially driven by motifs.
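As a minimal sketch of this idea (the filter and sequences are toy examples, not taken from the cited papers), a one-dimensional convolution can scan a one-hot-encoded DNA sequence for a motif; taking the maximum over positions makes the detection translation-invariant:

```python
# One-hot encoding maps each base to a 4-dimensional indicator vector.
BASES = "ACGT"

def one_hot(seq):
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]

def conv_scan(seq, motif_filter):
    # Slide the filter along the sequence and record the score at each
    # offset (a 1-D convolution/cross-correlation with "valid" padding).
    x = one_hot(seq)
    width = len(motif_filter)
    scores = []
    for i in range(len(seq) - width + 1):
        s = sum(motif_filter[j][k] * x[i + j][k]
                for j in range(width) for k in range(4))
        scores.append(s)
    return scores

# A toy filter that scores the motif "TATA" (here simply its one-hot
# pattern). Max-pooling over positions yields the same detection score
# wherever the motif occurs in the sequence.
tata = one_hot("TATA")
print(max(conv_scan("GGTATAGG", tata)))  # → 4.0
print(max(conv_scan("TATAGGGG", tata)))  # → 4.0, translation-invariant
```

In a real CNN the filter weights are learned by backpropagation rather than written down, and many filters are scanned in parallel.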
Motifs can be viewed as the sequential equivalent of spatial patterns, which explains why CNNs are a great ally when parsing genomic sequences [3].

- Recurrent Neural Networks

Recurrent neural networks (RNNs) are a special variant of DNNs in which the network can remember previous states. This structure was created to integrate sequential data of variable length; it uses cyclic connections in the hidden units to store past information. Many genomic features are sequence-like, so using RNNs and their variants is a promising option [3]. Variants used in genomics include the bidirectional RNN, since there is no innate direction in genomic sequences [3], and the long short-term memory (LSTM) RNN, mainly used for its ability to learn what to remember and what to forget in order to yield better performance [6].

Figure 2: A basic recurrent neural network.

Figure 3: Examples of model architectures. The goal of each architecture is to predict transcription factor binding sites. The architectures differ by their middle "module": (a) convolutional, (b) recurrent, and (c) convolutional-recurrent. Maxpool is a popular downscaling operation in CNNs. ReLU is a faster alternative to sigmoid activation functions. Softmax is an activation function usually placed at the very end of deep learning structures, as it maps outputs to probability distributions.

RNNs are great for processing time series, as in language processing. Genomic sequences are known to carry such positional dependencies, which explains the success RNNs encounter in genomic models. It is not rare to see hybrids of different methods, where each layer plays a different role, capturing the best of each architecture [2, 3, 11].

Genomics predictions

Biological processes show extraordinarily complex behaviour and inextricable patterns of molecular interactions, which explains the need for a high level of abstraction in machine learning tools.
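The recurrence at the heart of the RNNs described above can be illustrated with a scalar toy example (all weights are hand-picked and purely illustrative): each step mixes the current input with the previous hidden state, so earlier positions keep influencing later ones.

```python
import math

def rnn_step(x_t, h_prev, w_in, w_rec, bias):
    # One step of a vanilla RNN: the new hidden state combines the
    # current input with the previous hidden state through a
    # non-linearity, so past information persists along the sequence.
    return math.tanh(w_in * x_t + w_rec * h_prev + bias)

# Feed a toy 1-D signal through the recurrence; the final state depends
# on the whole history, not only on the last input.
h = 0.0
for x in [1.0, 0.0, 0.0, 1.0]:
    h = rnn_step(x, h, w_in=0.8, w_rec=0.5, bias=0.0)
print(h)
```

An LSTM replaces this single update with learned gates that decide what to keep and what to forget, which is why it handles long-range dependencies better.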
DNNs are known to perform well with multimodal classes [11] and nested configurations, which, as stated before, is often the case for biological entities. It is therefore tempting to apply deep learning to such puzzling systems.

Below is a short list of genomics study fields that have recently used deep learning to shed light on some of their features. All of the following examples outperform older, shallow models in their respective fields.

- DNA sequences [1, 2]

Deep convolutional neural networks can be used to identify human genomic sites statistically related to phenotypes. These deep learning tools focus on non-coding regions, which are known to be strongly associated with human disease. The models are trained on compendia of accessible genomic sites such as the ENCODE and Roadmap Epigenomics data releases. CNNs can also be used in a hybrid fashion, combined with a recurrent neural network using a bidirectional long short-term memory (LSTM), to learn long-term dependencies in the genome; such a model could be used to predict the phenotypic outcomes of genome editing.

- Histone modifications [4]

Histone modifications and gene regulation are strongly connected. CNNs can classify gene expression from public repositories using histone modifications as input. It is also possible to extract insights from the trained models in order to visualize the patterns linking histone modifications and gene regulation.

- DNA methylation [5]

Protocols assaying DNA methylation suffer from incomplete CpG coverage, so predicting missing methylation states is important to enable genome-wide analyses. CNNs can predict methylation states in single cells across different cell types profiled with scBS-seq and scRRBS-seq.
The same trained CNN can also be used to retrieve sequence motifs that explain DNA methylation levels or methylation variability, and to estimate the effect of single-nucleotide mutations.

- microRNA [6]

Detecting microRNAs is hard, as they are usually short and difficult to distinguish from other non-coding RNAs. LSTM RNNs can learn the structural characteristics of precursor microRNAs in order to identify them. A newly developed visualization method can then be used to further understand the underlying biological mechanisms.

- Long non-coding RNA [7]

Long non-coding RNAs (lncRNAs) play an important role in cellular functions. A set of descriptors combined with a deep stacking network can be used to separate long non-coding RNAs from coding RNAs using the Gencode and RefSeq databases. To identify lncRNAs, multiple features are fused into one vector: the open reading frame, k-mer frequencies, the secondary structure and the most-like coding domain sequence.

- Enhancers [8]

Transcriptional enhancers are non-coding segments of DNA that play a central role in the spatiotemporal regulation of gene expression programs. DNNs combined with hidden Markov models can directly learn to predict enhancers from massively heterogeneous data with great accuracy and consistency across cell types and tissues.

- Protein subcellular localization [9]

Protein subcellular localization plays an important role in specific functions and biological processes in cells. A stacked auto-encoder can predict such localization without the need for hand-crafted feature descriptors.

- Protein structure [10]

The function of a protein is strongly related to its spatial structure. Given enough training data, and using only the protein sequence, a CNN can predict its secondary structure with improved accuracy (breaking the ~80% barrier).
Other protein structure properties, such as contact number, disordered regions and solvent accessibility, can be predicted by a properly trained CNN.

- RNA-binding proteins [11]

How RNA-binding proteins recognize their target RNAs is still unclear. A hybrid of a CNN and a deep belief network can predict the interaction sites and motifs of RNA-binding proteins on RNAs. The model also supports visualization, uncovering interpretable binding motifs and providing interesting biological insights.

Visualizing deep learning

Many recent applications of deep learning to genomics put emphasis on the need to visualize what the model has learned in order to discover patterns governing biological features [1, 3, 4, 5, 6, 11]. DNNs and deep learning methods in general have the reputation of being "black boxes" [3]: they create models that make accurate predictions but are opaque in their mechanisms. In biology, such approaches that lack interpretability are discouraged [6]; it is critical to understand how the network extracts features and makes its predictions. Though DNNs and their variants are very accurate predictors, the concepts they abstract are not as easily visualized as in simpler linear models [1]. It is nonetheless possible to extract information about a model by inspecting the configuration of its parameters, by tweaking the signal passing through specific layers of the network, and by analyzing its predictions on carefully chosen inputs [3].

In the following example, three deep learning visualization techniques (Lanchantin et al., 2017) are described. The DNNs in question are trained to identify transcription factor binding sites (TFBS) from DNA sequence (Figure 3), and the visualization techniques are meant to uncover why TFs bind to certain locations, using the trained model as a motif viewer [3].

- Saliency maps

A saliency map reveals which parts of the sequence are most influential for the classification.
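As a toy illustration of this idea (the scoring function below is a hypothetical stand-in for a trained model, and single-nucleotide perturbation is used here as a simple substitute for the one-step backpropagation of the actual method), the importance of each position can be measured by how much the score drops under substitutions:

```python
def toy_score(seq):
    # Stand-in for a trained model's output: counts how well the best
    # window of the sequence matches the (hypothetical) motif "TATA".
    motif = "TATA"
    best = 0
    for i in range(len(seq) - len(motif) + 1):
        best = max(best, sum(a == b for a, b in zip(seq[i:i + len(motif)], motif)))
    return best

def saliency(seq):
    # Perturbation-based importance: for each position, the largest
    # drop in score over all single-nucleotide substitutions.
    base_score = toy_score(seq)
    importance = []
    for i in range(len(seq)):
        drops = [base_score - toy_score(seq[:i] + b + seq[i + 1:])
                 for b in "ACGT" if b != seq[i]]
        importance.append(max(drops))
    return importance

# The motif positions stand out; the flanking bases do not matter.
print(saliency("GGTATAGG"))  # → [0, 0, 1, 1, 1, 1, 0, 0]
```

Gradient-based saliency computes the same kind of per-position importance analytically, in a single backward pass, instead of re-scoring every substitution.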
It assigns each nucleotide a score reflecting its influence on the outcome. Due to the deep composition of non-linear functions, it is hard to see the influence of each nucleotide on the final score directly, but a single step of backpropagation returns the derivative of the score with respect to each nucleotide. The end result ranks nucleotides from the one whose change would affect the score the most to the one whose change would affect it the least.

- Temporal output scores

It is also interesting to visualize the output score at each position of the sequence, like a time series, which makes sense as DNA is sequential. Temporal output scores are designed to work only on RNN models. By checking the RNN's prediction scores as the input sequence unrolls, it is possible to tell where in the sequence (and thus in time) the model changes its decision.

- Class optimization

The two previous visualization methods are sequence-specific: they give insights into a trained model for a given input sequence. This technique is model-specific, as it gives crucial information about the trained model itself. Through stochastic gradient descent, a sequence is found that maximizes the probability of a positive outcome of the DNN. The result can be shown as a motif in which the size of each nucleotide represents its importance for TFBS classification.

Conclusion

Here we have presented a non-exhaustive review of deep learning applications to genomics; other biological applications, such as medical imaging, were left out despite their importance and regular use. The goal of this study was to show the broad possibilities that deep learning offers when trained on genomics data, mostly sequences. Deep learning applications in genomics are now slowly replacing older models thanks to their ability to learn and extract abstract, complex patterns from large datasets.
Genomics datasets coming from newer high-throughput techniques are indeed large and full of hidden, abstract patterns. Moreover, new deep-learning-based visualization methods can bring deep insights into the abstract patterns thus found, and could offer great new ways to look at genomics data. However, most of the studies covered in this review had a hard time designing a suitable deep learning architecture. Weight initialization, for example, remains an open problem with no all-in-one solution. Deep learning structures also suffer from the lack of a clear application scheme: it is still up to anyone willing to create a deep learning architecture to design the layer connections, choose the number of hidden layers, and decide whether convolution, recurrence or any other variant or hybrid configuration is needed.

The future of deep learning seems promising. Datasets will continue to expand, with more and more details added regularly. Deep learning, with its ability to learn from data at a high level of abstraction, will probably play the role of an interface between humans and big data. Genomics studies demand a powerful tool to interpret data in a more unified and sophisticated way, but deep learning is not the perfect candidate yet, mostly because of its lack of standardization, which seems to be the price of higher levels of abstraction. Innovative methods such as neuroevolution of augmenting topologies (NEAT and HyperNEAT), which use genetic algorithms to create deep learning structures, or newly created DNNs trained to optimize machine-made DNNs, may be another step toward the automation of machines and artificial intelligence.

References

[1] Kelley, D.R., Snoek, J., Rinn, J.L., 2016. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research 26, 990–999.
[2] Quang, D., Xie, X., 2016.
DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Research 44, e107.
[3] Lanchantin, J., Singh, R., Wang, B., Qi, Y., 2017. Deep Motif Dashboard: visualizing and understanding genomic sequences using deep neural networks. Pacific Symposium on Biocomputing 2017. World Scientific, pp. 254–265.
[4] Singh, R., Lanchantin, J., Robins, G., Qi, Y., 2016. DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics 32, i639–i648.
[5] Angermueller, C., Lee, H.J., Reik, W., Stegle, O., 2017. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biology 18, 67.
[6] Park, S., Min, S., Choi, H., Yoon, S., 2016. deepMiRGene: deep neural network based precursor microRNA prediction. arXiv:1605.00017 [cs, q-bio].
[7] Fan, X.-N., Zhang, S.-W., 2015. lncRNA-MFDL: identification of human long non-coding RNAs by fusing multiple features and using deep learning. Molecular BioSystems 11, 892–897.
[8] Liu, F., Li, H., Ren, C., Bo, X., Shu, W., 2016. PEDLA: predicting enhancers with a deep learning-based algorithmic framework. Scientific Reports 6, 28517.
[9] Wei, L., Ding, Y., Su, R., Tang, J., Zou, Q., 2017. Prediction of human protein subcellular localization using deep learning. Journal of Parallel and Distributed Computing.
[10] Wang, S., Peng, J., Ma, J., Xu, J., 2016. Protein secondary structure prediction using deep convolutional neural fields. Scientific Reports 6.
[11] Pan, X., Shen, H.-B., 2017. RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach. BMC Bioinformatics 18.
[12] Agostinelli, F., Ceglia, N., Shahbaba, B., Sassone-Corsi, P., Baldi, P., 2016. What time is it? Deep learning approaches for circadian rhythms. Bioinformatics 32, i8–i17.