Analyzing DNA Sequences
In this section, we examine approaches that involve analyzing DNA sequences. DNA is a class of molecules that consist of a helical pair of polymers. The two polymers are complementary, so each strand carries the same information. Each polymer is composed of many nucleotides joined sequentially along a backbone. Because DNA contains only four standard nucleotides, the information it encodes can be viewed as a very long sequence of symbols drawn from a four-letter alphabet. Segments of these long strings are copied into shorter strings by a process known as transcription. The shorter strings are composed of a similar molecule called RNA, which uses the same kind of four-letter representation, and each RNA string is a code for a specific molecule. In many cases the RNA molecules are not themselves end products but merely encode a different type of molecule called a protein. Proteins are also polymers composed of simpler components joined in sequence, but their building blocks are amino acids instead of nucleotides. As there are 20 standard amino acids, it takes at least 3 symbols of the four-letter RNA code to uniquely identify one symbol of the 20-letter protein code: two symbols give only 4^2 = 16 combinations, whereas three give 4^3 = 64. The encoding is in fact redundant, mapping the 64 possible triples onto the 20 amino acids.
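The counting argument above, together with transcription and translation, can be illustrated with a minimal Python sketch. The function names and the small codon excerpt are illustrative assumptions; the excerpt lists only a handful of entries of the standard genetic code, and any codon outside it is reported as "?".

```python
from itertools import product

RNA_BASES = "ACGU"

def min_codon_length(alphabet_size, targets):
    """Smallest k such that alphabet_size**k >= targets."""
    k = 1
    while alphabet_size ** k < targets:
        k += 1
    return k

def transcribe(coding_strand):
    """RNA copy of a DNA coding strand: same letters, with T replaced by U."""
    return coding_strand.upper().replace("T", "U")

# Small excerpt of the standard genetic code (illustrative, not the full table).
# Several codons map to the same amino acid, which is the redundancy noted above.
CODON_EXCERPT = {
    "AUG": "Met",
    "UUU": "Phe", "UUC": "Phe",
    "GGA": "Gly", "GGC": "Gly",
    "UAA": "Stop",
}

def translate(rna):
    """Read the RNA in non-overlapping triples and look each one up."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_EXCERPT.get(codon, "?") for codon in codons]

# All possible triples over the four-letter RNA alphabet.
ALL_CODONS = ["".join(triple) for triple in product(RNA_BASES, repeat=3)]

print(min_codon_length(4, 20))   # symbols needed per amino acid
print(len(ALL_CODONS))           # number of distinct triples
print(translate(transcribe("ATGTTTGGATAA")))
```

Two symbols would distinguish only 16 items, so the loop in `min_codon_length` stops at three, and the 64 resulting triples leave room for several codons per amino acid.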
Input data | Output | Method
DNA sequence | Promoter regions | Promoter region identification [10]
DNA sequence | RNA gene | Non-coding RNA gene finder [56]
DNA sequence | Functional RNA genes | Detection of functional RNA genes using feed-forward neural networks [15]
DNA sequence | Classifying rare events in human genome | Detection of rare events in unbalanced data using neural networks [16]
DNA sequence | Clustered gene expression patterns | Analyzing correlated gene expression patterns using unsupervised neural networks [31]
DNA sequence | DNA motifs | Identifying unknown DNA motifs on DNA sequences using unsupervised neural networks [4]
DNA sequence | Classification of DNA barcoding genes | Inferring species membership via DNA barcoding with back-propagation neural networks [68]
DNA sequence | mRNA’s donor and acceptor sites | Predicting donor and acceptor location on human pre-mRNA with feed-forward neural networks [12]
AA sequence | Sequence classifications | Protein sequence classification using Bayesian neural networks [62]
AA sequence | Clustered sequences | Unsupervised Kohonen learning technique [26]
AA sequence | Coil locations | Coil prediction [30]
AA sequence | β-sheet locations | Predicting protein β-sheets using alignment, neural networks and graph algorithm [13]
AA sequence | β-turn locations | Prediction of protein β-turn structure using evolutionary information and neural networks [36]
AA sequence | Protein structural domains | Decomposition of protein structures into structural domains using profile and ANN [28]
AA sequence | Protein domain boundaries | Predicting protein domain using bidirectional recurrent neural networks [60]
AA sequence | Disulphide bonds | Disulphide bond prediction with a 2D-recurrent network [59]
AA sequence | Prediction of residue contacts | 2D-recurrent neural networks for protein contact map prediction [58]
AA sequence | Secondary structure | Predicting the secondary structure of globular proteins using MLP [52]
AA sequence | Secondary structure | Prediction of protein secondary structure using sequence profiles and neural networks [53]
AA sequence | Secondary structure | Prediction of protein secondary structure using evolutionary information and neural networks [54]
AA sequence | Secondary structure | Prediction of protein secondary structure using Position Specific Scoring Matrix (PSSM) and neural networks [34]
AA sequence | Secondary structure | Prediction of protein secondary structure using hidden neural networks [47]
AA sequence | Secondary structure | Prediction of protein secondary structure using bidirectional recurrent neural networks [7]
AA sequence | Real values of the solvent accessibility | Feed-forward neural networks for predicting the real values of solvent accessibility of amino acid [2]
AA sequence | Real values of the solvent accessibility | Approximating the real-value relative solvent accessibility (RSA) of AA residues [1]
AA sequence | Protein binding sites | Binding site prediction with neural network [37]
AA sequence | Secondary structure, solvent accessibility, backbone structural motifs, and contact density | Predicting 1D structural properties using structural alignment method (SAMD) and recursive neural networks [50]
AA sequence | Signal peptides | Detection of signal peptides in proteins [51]
AA sequence | Detection of protein stability | Prediction of protein stability changes using statistical potentials and multilayer feed-forward neural networks [20]
AA sequence | Detection of protein disorders | Predicting protein disorder for N-, C- and internal regions [46]
AA sequence | Detection of motifs | Predicting proteasome cleavage motifs using artificial neural networks [38]
AA sequence | Detection of drug resistant factor | Predicting HIV drug resistance with neural networks [21]
AA sequence | Protein superfamilies | Classifications of protein sequences based on superfamily classes [66]
Mass spectrometry data | Diagnosis of tumours | Classifying human tumour and identification of biomarkers [8]
DNA microarrays | Diagnosis of cancers | Classification and prediction of cancers using gene expression profiling and artificial neural networks [39]
DNA microarrays | Diagnosis of breast cancers | Detecting breast cancer using artificial neural networks [45]
DNA microarrays | Classification of diseases | Classification of gene expression data using ensemble neural networks [48]
Since DNA is the carrier of heritability, this is a reasonable place to start our discussion. It is relatively easy to build a neural system that processes DNA. Typically, a sliding window of fixed length is applied to the sequence, and the nucleotides that fall within the window are encoded in a one-hot fashion. That is, four input units are used to represent each nucleotide, and exactly one of these units (corresponding to one of the four different nucleotides) is active for each position. In this section, we consider four different goals in analyzing DNA: (i) identifying RNA coding regions in the DNA (arbitrary and specific fRNA), (ii) identifying promoter regions in the DNA, (iii) detecting disease carriers, and (iv) DNA barcoding.
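The sliding-window, one-hot input encoding described above can be sketched in a few lines of Python. The function names, the A, C, G, T unit order, and the choice to encode an unrecognized symbol (such as N) as all zeros are assumptions made for illustration, not details taken from the cited studies.

```python
NUCLEOTIDES = "ACGT"  # assumed ordering of the four input units

def one_hot(base):
    """Four units per nucleotide, exactly one active for A, C, G or T.

    Assumption: any other symbol (e.g. N) is encoded as all zeros.
    """
    return [1 if base == n else 0 for n in NUCLEOTIDES]

def encode_windows(sequence, window):
    """Slide a fixed-length window over the sequence, one position at a time.

    Returns one flat input vector of length 4 * window per window position.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    seq = sequence.upper()
    vectors = []
    for start in range(len(seq) - window + 1):
        vec = []
        for base in seq[start:start + window]:
            vec.extend(one_hot(base))
        vectors.append(vec)
    return vectors

inputs = encode_windows("ACGTA", window=3)
for vec in inputs:
    print(vec)
```

A sequence of length n yields n - window + 1 input vectors, and each vector contains exactly `window` active units when all symbols are standard nucleotides.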
While the central dogma of molecular biology describes how DNA is transcribed into RNA and then translated into protein sequences, most DNA does not code for proteins. Originally called “junk DNA”, these parts of the genome are beginning to be better understood. In some cases, DNA is transcribed into functional RNA (fRNA) that is never translated into a protein but instead performs a biological function directly. Such RNA can be referred to as “non-coding”, and the DNA regions that prescribe it are called “non-coding genes”. Non-coding RNA genes have been explored for their hidden and important roles in cells. Identifying them is challenging because these genes are diverse and lack consensus patterns. One avenue is to identify transcription factor binding sites: locations in the DNA where special molecules attach and begin the process of transcribing the DNA into RNA. A novel approach using fuzzy neural networks for non-coding RNA gene prediction was proposed in [56]. The hybrid approach has two advantages: the nodes and parameters of the neural network have physical meanings, and qualitative prior knowledge can be incorporated through fuzzy set theory.
Another research area related to RNA is the detection of genes encoding functional RNA (fRNA). In brief, fRNA genes are RNA genes that generate functional RNA products, such as transfer RNA (tRNA) and microRNA (miRNA), without translation into protein. For instance, tRNA is involved in translating the three-letter code in messenger RNA into the amino acids of proteins. In [15], a feed-forward neural network is employed for fRNA gene detection. Evolutionary computation is used to optimize the architecture of the neural network. In other words, the network is evolved by deleting and inserting nodes and connections and by adjusting the weights on the connections between nodes.
Another type of pattern that can be found in DNA is the promoter region. These regions provide convenient places for RNA polymerase proteins to attach to a DNA strand and begin the transcription process. In this fashion, these regions serve a regulatory role. Identifying promoter regions using artificial neural networks has also been studied in [10]. Traditional promoter prediction methods mainly search for motifs. However, recent studies in [35], [42] and [61] indicate that DNA structural features such as curvature and stress-induced duplex destabilization (SIDD) also provide valuable information. In [10], SIDD profile data obtained from E. coli is used as the training data for the neural network.
One challenge faced by bioinformaticians is the sparsity of relevant data. While there are often many long genetic sequences available, the most interesting phenomena are sometimes extremely rare. Rare events therefore lead to a variety of needle-in-a-haystack problems that have to be modelled and understood. Rare events are log-normally distributed, so methods based on statistics that assume Gaussian distributions (e.g. arithmetic means) fail. However, sample stratification is a useful technique for rare event detection in unbalanced data, especially in molecular biology. The technique gives each class in the sample equal weight in decision making. Using a neural network for sample stratification and detection of rare events was examined in [16]. The experiment was carried out on human genome DNA, and it showed significant improvement for rare event detection.
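The idea of giving each class equal weight can be sketched as follows. The exact stratification procedure of [16] is not described here, so this is only one common way to balance classes, shown under that assumption; the function names and the toy labels are hypothetical.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Per-sample weight for each class so that every class has the same
    total weight: weight(c) = N / (K * n_c), for N samples, K classes and
    n_c samples in class c."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n_c) for c, n_c in counts.items()}

def stratified_resample(samples, labels, per_class, seed=0):
    """Draw the same number of examples from every class, with replacement,
    so rare classes are as well represented as common ones."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    resampled = []
    for label, items in by_class.items():
        resampled.extend((rng.choice(items), label) for _ in range(per_class))
    return resampled

# Toy data: 98 common windows and 2 rare ones (hypothetical labels).
labels = ["common"] * 98 + ["rare"] * 2
samples = list(range(len(labels)))

weights = balanced_class_weights(labels)
print(weights)
balanced = stratified_resample(samples, labels, per_class=10)
print(Counter(label for _, label in balanced))
```

Weighting leaves the data unchanged and scales each sample's contribution to the training error, whereas resampling changes the training set itself; which of the two suits a given network is a design choice.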
A common task with regard to the voluminous data in molecular biology is the detection of unique features in DNA sequences. In [4], an unsupervised class of ANNs known as the self-organizing map (SOM) [41] was studied in order to detect new motifs (domains) in DNA sequences. It was used to detect the signal peptide coding region on a dataset of human insulin receptor genes. SOMs are useful in pattern clustering and feature detection since this class of neural networks forms internal representations that model the underlying structure of the input data. In the study, no prior knowledge, such as sequence alignment analysis, was embedded in the neural network. Yet, after the neural network training, the existence of minimal similarity patterns (MSPs) among the trained data was found by a statistical measure called “Tanimoto similarity”, which quantifies how closely the input and weight vectors match. The proposed method may potentially facilitate the identification of other DNA domains, such as functional DNA patterns, by performing further analysis on MSP clusters.
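For reference, the Tanimoto similarity between an input vector x and a weight vector w is commonly defined as x·w / (|x|^2 + |w|^2 - x·w). The sketch below computes it and uses it to pick the most similar map unit. How [4] applied the measure in detail is not stated here, so the unit-selection step, the function names, and the toy vectors are illustrative assumptions.

```python
def tanimoto(x, w):
    """Tanimoto similarity: x.w / (|x|^2 + |w|^2 - x.w).

    Equals 1.0 for identical vectors. Assumption: two all-zero vectors are
    treated as identical and also score 1.0.
    """
    if len(x) != len(w):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, w))
    denom = sum(a * a for a in x) + sum(b * b for b in w) - dot
    return dot / denom if denom else 1.0

def most_similar_unit(x, weight_vectors):
    """Index of the map unit whose weight vector is most similar to x."""
    scores = [tanimoto(x, w) for w in weight_vectors]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: a one-hot encoded dinucleotide and three hypothetical units.
x = [1, 0, 0, 0, 0, 1, 0, 0]
units = [
    [0.9, 0.1, 0.0, 0.0, 0.1, 0.8, 0.1, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
]
print(most_similar_unit(x, units))
```

For binary vectors the measure reduces to the number of shared active positions divided by the number of positions active in either vector.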
The final problem that we will discuss in this section stems from the field of taxonomy. Traditional taxonomic methods identify species by painstaking observation of morphological features, that is, the physical characteristics of an organism. While this method has served scientists since before the days of Aristotle, it can be problematic. Many organisms are so small that observing physical differences is difficult even with microscopy. In other cases, organisms have multiple life stages with very different forms that need to be individually identified, or significant differences between the sexes. Sometimes the physical form of an organism is affected by its environment (including diet, habitat, etc.). In these cases, relying on the observation of physical traits is unreliable. With the advent of genetic sequencing, another approach is possible. By directly comparing the DNA of organisms, it is possible to make species identifications [29]. Ideally, this is done by focusing on specific genetic traits that vary among species but not within species. A first approach might be to identify a specific gene with this property and then to measure differences among instances of this gene across organisms using a classical genetic distance measure (such as alignment scores). Current distance-based methods for species identification using DNA barcoding sequences are frequently criticized for treating the nearest neighbour, as determined by raw similarity scores, as the closest relative. In [68], a feed-forward neural network is employed for the classification of DNA barcoding sequences. The results indicate better performance than previous methods such as the basic local alignment search tool (BLAST) [3], which is a simple genetic distance-based method.
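To make the criticized baseline concrete, the following sketch assigns a query barcode to the species of its nearest reference sequence. It assumes pre-aligned sequences of equal length and uses the proportion of mismatched positions as the distance; it is not BLAST and not the neural classifier of [68], and the species names and sequences are hypothetical.

```python
def mismatch_distance(a, b):
    """Proportion of aligned positions at which two sequences differ.

    Assumption: the sequences are already aligned and of equal length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def nearest_neighbour_species(query, references):
    """Label the query with the species of its closest reference sequence.

    `references` is a list of (species, sequence) pairs. Only the single
    nearest raw score is used, which is exactly the behaviour that
    distance-based barcoding methods are criticized for.
    """
    species, _ = min(references, key=lambda ref: mismatch_distance(query, ref[1]))
    return species

references = [
    ("species_A", "ACGTACGT"),  # hypothetical barcode fragments
    ("species_B", "ACGTTTTT"),
]
print(nearest_neighbour_species("ACGTACGA", references))
```

Because the decision rests on one raw score, a query from an unsampled species is still assigned to whichever reference happens to be closest.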