Analyzing DNA Sequences

In this section, we examine approaches that involve analyzing DNA sequences. DNA is a class of molecules that consist of a helical pair of polymers. The polymers are complementary and encode identical information. Each polymer is composed of many nucleotides joined sequentially along a backbone. The information encoded in DNA can be viewed as a very long sequence of symbols drawn from a four-letter alphabet, since DNA contains only four standard nucleotides. Segments of these long strings are copied into shorter strings by a process known as transcription. The shorter strings are composed of a similar molecule called RNA, which uses the same kind of four-letter representation, and each such RNA string is a code for a specific molecule. In many cases the RNA molecules are not themselves end products, but merely an encoding of a different type of molecule called a protein. Proteins are also polymers composed of simpler components joined in sequence, but the building blocks of proteins are amino acids (instead of nucleotides). As there are 20 different standard amino acids, it takes at least three symbols of the four-letter RNA code to uniquely identify a single symbol of the 20-letter protein code (two symbols give only 16 combinations, three give 64). In fact, the mapping from the 64 possible triples to the 20 amino acids is redundant.
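The counting argument behind the triplet code can be checked directly. The following is a minimal Python sketch; the function name `min_code_length` is introduced here for illustration only.

```python
from itertools import product

def min_code_length(alphabet_size, n_targets):
    """Smallest word length n such that alphabet_size**n >= n_targets."""
    n = 1
    while alphabet_size ** n < n_targets:
        n += 1
    return n

RNA_BASES = "ACGU"          # the four RNA nucleotides
N_AMINO_ACIDS = 20          # standard amino acids

codon_length = min_code_length(len(RNA_BASES), N_AMINO_ACIDS)
codons = ["".join(t) for t in product(RNA_BASES, repeat=codon_length)]

print(codon_length, len(codons))  # 3 64
```

Because 64 triples map onto only 20 amino acids, several triples necessarily encode the same amino acid, which is the redundancy mentioned above.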

Input data | Output | Method
--- | --- | ---
DNA sequence | Promoter regions | Promoter region identification [10]
DNA sequence | RNA gene | Non-coding RNA gene finder [56]
DNA sequence | Functional RNA genes | Detection of functional RNA genes using feed-forward neural networks [15]
DNA sequence | Classifying rare events in human genome | Detection of rare events in unbalanced data using neural networks [16]
DNA sequence | Clustered gene expression patterns | Analyzing correlated gene expression patterns using unsupervised neural networks [31]
DNA sequence | DNA motifs | Identifying unknown DNA motifs on DNA sequences using unsupervised neural networks [4]
DNA sequence | Classification of DNA barcoding genes | Inferring species membership via DNA barcoding with back-propagation neural networks [68]
DNA sequence | mRNA donor and acceptor sites | Predicting donor and acceptor location on human pre-mRNA with feed-forward neural networks [12]
AA sequence | Sequence classifications | Protein sequence classification using Bayesian neural networks [62]
AA sequence | Clustered sequences | Unsupervised Kohonen learning technique [26]
AA sequence | Coil locations | Coil prediction [30]
AA sequence | β-sheet locations | Predicting protein β-sheets using alignment, neural networks and graph algorithm [13]
AA sequence | β-turn locations | Prediction of protein β-turn structure using evolutionary information and neural networks [36]
AA sequence | Protein structural domains | Decomposition of protein structures into structural domains using profile and ANN [28]
AA sequence | Protein domain boundaries | Predicting protein domain using bidirectional recurrent neural networks [60]
AA sequence | Disulphide bonds | Disulphide bond prediction with a 2D-recurrent network [59]
AA sequence | Prediction of residue contacts | 2D-recurrent neural networks for protein contact map prediction [58]
AA sequence | Secondary structure | Predicting the secondary structure of globular proteins using MLP [52]
AA sequence | Secondary structure | Prediction of protein secondary structure using sequence profiles and neural networks [53]
AA sequence | Secondary structure | Prediction of protein secondary structure using evolutionary information and neural networks [54]
AA sequence | Secondary structure | Prediction of protein secondary structure using Position Specific Scoring Matrix (PSSM) and neural networks [34]
AA sequence | Secondary structure | Prediction of protein secondary structure using hidden neural networks [47]
AA sequence | Secondary structure | Prediction of protein secondary structure using bidirectional recurrent neural networks [7]
AA sequence | Real values of the solvent accessibility | Feed-forward neural networks for predicting the real values of solvent accessibility of amino acids [2]
AA sequence | Real values of the solvent accessibility | Approximating the real-value relative solvent accessibility (RSA) of AA residues [1]
AA sequence | Protein binding sites | Binding site prediction with neural network [37]
AA sequence | Secondary structure, solvent accessibility, backbone structural motifs, and contact density | Predicting 1D structural properties using structural alignment method (SAMD) and recursive neural networks [50]
AA sequence | Signal peptides | Detection of signal peptides in proteins [51]
AA sequence | Detection of protein stability | Prediction of protein stability changes using statistical potentials and multilayer feed-forward neural networks [20]
AA sequence | Detection of protein disorders | Predicting protein disorder for N-, C- and internal regions [46]
AA sequence | Detection of motifs | Predicting proteasome cleavage motifs using artificial neural networks [38]
AA sequence | Detection of drug resistant factor | Predicting HIV drug resistance with neural networks [21]
AA sequence | Protein superfamilies | Classifications of protein sequences based on superfamily classes [66]
Mass spectrometry data | Diagnosis of tumours | Classifying human tumour and identification of biomarkers [8]
DNA microarrays | Diagnosis of cancers | Classification and prediction of cancers using gene expression profiling and artificial neural networks [39]
DNA microarrays | Diagnosis of breast cancers | Detecting breast cancer using artificial neural networks [45]
DNA microarrays | Classification of diseases | Classification of gene expression data using ensemble neural networks [48]

Since DNA is the carrier of heritability, this is a reasonable place to start our discussion. It is relatively easy to build a neural system that processes DNA. Typically, a sliding window of fixed length is applied to the sequence, and the nucleotides that fall within the window are encoded in a one-hot fashion. That is, four input units are used to represent each nucleotide, and exactly one of these units (corresponding to one of the four different nucleotides) is active at a time. In this section, we consider four different goals in analyzing DNA: (i) identifying RNA coding regions in the DNA (arbitrary and specific fRNA), (ii) identifying promoter regions in the DNA, (iii) detecting disease carriers, and (iv) DNA barcoding.
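The sliding-window, one-hot input encoding described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the toy sequence are assumptions, and mapping any symbol other than A, C, G or T (for example the ambiguity code N) to an all-zero vector is a choice made here, not something specified in the text.

```python
NUCLEOTIDES = "ACGT"

def one_hot(base):
    """Four input units per nucleotide; exactly one is active for A, C, G or T.

    Assumption of this sketch: any other symbol is encoded as all zeros.
    """
    return [1 if base == b else 0 for b in NUCLEOTIDES]

def encode_windows(sequence, window):
    """Yield one flat input vector of length 4 * window per window position."""
    for start in range(len(sequence) - window + 1):
        vec = []
        for base in sequence[start:start + window]:
            vec.extend(one_hot(base))
        yield vec

inputs = list(encode_windows("ACGTAC", window=3))
print(len(inputs), len(inputs[0]))  # 4 12
print(inputs[0])  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
```

Each vector would be presented to the network as one input pattern, so a window of length w needs 4w input units.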

While the central dogma of molecular biology describes how DNA is transcribed into RNA and then translated into protein sequences, most DNA does not code for proteins. Originally called “junk DNA”, these parts of the genome are beginning to be better understood. In some cases, DNA is transcribed into functional RNA (fRNA) that is never translated into a protein but instead performs a biological function directly. Such RNA can be referred to as “non-coding”, and the DNA regions that prescribe it are called “non-coding genes”. Non-coding RNA genes have been explored for their hidden and important roles in cells. Identifying them is challenging because of their diversity and the lack of consensus patterns. One avenue is to identify transcription factor binding sites: locations in the DNA where special molecules attach and begin the process of transcribing the DNA into RNA. A novel approach using fuzzy neural networks for non-coding RNA gene prediction was proposed in [56]. The hybrid approach has two advantages: it gives the nodes and parameters of the neural network physical meanings, and it provides a means of incorporating qualitative prior knowledge through fuzzy set theory.

Another research area related to RNA is the detection of genes encoding functional RNA (fRNA). In brief, fRNA genes are the set of RNA genes that generate functional RNA products, such as transfer RNA (tRNA) and microRNA (miRNA), without translation into protein. For instance, tRNA is involved in translating the three-letter code in messenger RNA into the amino acids of proteins. In [15], a feed-forward neural network is employed for fRNA gene detection. Evolutionary computation is used to optimize the architecture of the neural network. In other words, the network is evolved and optimized by deleting and inserting nodes and connections, and by adjusting the weights of the connections between nodes.
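To make the idea of evolving a network by inserting and deleting nodes and connections and adjusting weights more concrete, here is a hedged Python sketch. It is not the method of [15]: the network representation (a single hidden layer stored as a dictionary), the five mutation operators, the simple accept-if-not-worse loop, and the `fitness` callable are all assumptions chosen for brevity.

```python
import random

INPUTS = ["in0", "in1", "in2", "in3"]   # hypothetical input units
OUTPUT = "out"

def mutate(net, rng, sigma=0.1):
    """Return a mutated copy of net = {"hidden": [...], "weights": {(src, dst): w}}."""
    hidden = list(net["hidden"])
    weights = dict(net["weights"])
    op = rng.choice(["add_node", "del_node", "add_conn", "del_conn", "perturb"])
    if op == "add_node":
        node = "h%d" % (max([int(h[1:]) for h in hidden], default=-1) + 1)
        hidden.append(node)
        weights[(rng.choice(INPUTS), node)] = rng.gauss(0, 1)
        weights[(node, OUTPUT)] = rng.gauss(0, 1)
    elif op == "del_node" and hidden:
        node = rng.choice(hidden)
        hidden.remove(node)
        # Drop every connection that touches the deleted node.
        weights = {k: w for k, w in weights.items() if node not in k}
    elif op == "add_conn":
        src = rng.choice(INPUTS + hidden)
        # Keep the network feed-forward: hidden nodes only feed the output.
        dst = OUTPUT if src in hidden else rng.choice(hidden + [OUTPUT])
        weights.setdefault((src, dst), rng.gauss(0, 1))
    elif op == "del_conn" and weights:
        del weights[rng.choice(sorted(weights))]
    else:
        # "perturb", or a deletion with nothing to delete: jitter all weights.
        weights = {k: w + rng.gauss(0, sigma) for k, w in weights.items()}
    return {"hidden": hidden, "weights": weights}

def evolve(fitness, generations=50, seed=0):
    """Keep a single candidate and accept a mutant when it is not worse."""
    rng = random.Random(seed)
    best = {"hidden": [], "weights": {}}
    for _ in range(generations):
        child = mutate(best, rng)
        if fitness(child) >= fitness(best):
            best = child
    return best

# Toy fitness for demonstration only: prefer exactly two hidden nodes.
toy_fitness = lambda net: -abs(len(net["hidden"]) - 2)
print(evolve(toy_fitness)["hidden"])
```

In a real application the fitness would be the classification performance of the decoded network on held-out fRNA data, and a population of candidates would normally be evolved rather than a single one.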

Another type of pattern that can be found in DNA is the promoter region. These regions provide convenient places for RNA polymerase proteins to attach to a DNA strand and begin the transcription process. In this fashion, these regions serve a regulatory role. Identifying promoter regions using artificial neural networks has also been studied in [10]. Traditional promoter prediction methods mainly search for motifs. However, recent studies in [35], [42] and [61] indicate that DNA structural features, such as curvature and stress-induced duplex destabilization (SIDD), also provide valuable information. In [10], SIDD profile data obtained from E. coli is used as the training data for the neural network.

One challenge faced by bioinformaticians is an unusual sparsity of data. While many long genetic sequences are often available, the most interesting phenomena are sometimes extremely rare. Rare events therefore give rise to a variety of needle-in-a-haystack problems that have to be modelled and understood. Rare events are log-normally distributed, so methods based on statistics that assume Gaussian distributions (e.g. arithmetic means) fail. Sample stratification, however, is a useful technique for rare event detection in unbalanced data, especially in molecular biology. The technique gives each class in the sample data equal weight in decision making. Using a neural network for sample stratification and detection of rare events was examined in [16]. The experiment was carried out on human genome DNA, and it showed a significant improvement in rare event detection.
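One common way to give each class equal weight is to weight every sample inversely to the frequency of its class. The Python sketch below illustrates that idea under this assumption; it is not necessarily the stratification scheme used in [16], and the function name and toy labels are hypothetical.

```python
from collections import Counter

def balanced_sample_weights(labels):
    """Weight each sample so that every class contributes equal total weight."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    # A class with few samples gets a proportionally larger per-sample weight.
    return [total / (n_classes * counts[y]) for y in labels]

# Toy unbalanced data set: 98 common examples and 2 rare ones.
labels = ["common"] * 98 + ["rare"] * 2
weights = balanced_sample_weights(labels)

common_total = sum(w for w, y in zip(weights, labels) if y == "common")
rare_total = sum(w for w, y in zip(weights, labels) if y == "rare")
print(round(common_total, 6), round(rare_total, 6))  # 50.0 50.0
```

Such weights can be used to scale each sample's contribution to the training error, or as sampling probabilities when drawing balanced mini-batches.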

A common task with regard to the voluminous data in molecular biology is the detection of unique features in DNA sequences. In [4], an unsupervised class of ANNs known as the self-organizing map (SOM) [41] was studied in order to detect new motifs (domains) in DNA sequences. It was used to detect the signal peptide coding region in a dataset of human insulin receptor genes. SOMs are useful in pattern clustering and feature detection, since this class of neural networks forms internal representations that model the underlying structure of the input data. In the study, no prior knowledge, such as sequence alignment analysis, was embedded in the neural network. Yet, after training, the existence of minimal similarity patterns (MSPs) among the trained data was found using a statistical measure called “Tanimoto similarity”, which is computed between the input and weight vectors. The proposed method may facilitate the identification of other DNA domains, such as functional DNA patterns, through further analysis of MSP clusters.
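For reference, the standard Tanimoto similarity between two real-valued vectors x and w is x·w / (|x|² + |w|² − x·w), which equals 1 when the vectors are identical. The Python sketch below uses it to pick the best-matching SOM unit for an input. Whether [4] used exactly this form is an assumption, and the codebook values are made up for illustration.

```python
def tanimoto(x, w):
    """Tanimoto similarity x.w / (|x|^2 + |w|^2 - x.w) of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, w))
    denom = sum(a * a for a in x) + sum(b * b for b in w) - dot
    # The denominator is zero only when both vectors are all zeros;
    # two identical (empty) patterns are treated as fully similar here.
    return dot / denom if denom else 1.0

def best_matching_unit(x, codebook):
    """Index of the SOM unit whose weight vector is most similar to x."""
    return max(range(len(codebook)), key=lambda i: tanimoto(x, codebook[i]))

# One-hot encoding of the dinucleotide "AC" (order A, C, G, T per position).
x = [1, 0, 0, 0, 0, 1, 0, 0]
codebook = [
    [1, 0, 0, 0, 0, 0, 1, 0],            # hypothetical unit tuned to "AG"
    [0.9, 0, 0.1, 0, 0, 0.8, 0.2, 0],    # hypothetical unit close to "AC"
]
print(best_matching_unit(x, codebook))  # 1
```

During SOM training, the weights of the best-matching unit and its map neighbours would then be moved towards the input vector.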

The final problem that we will discuss in this section stems from the field of taxonomy. Traditional taxonomic methods identify species by painstaking observation of morphological features, that is, the physical characteristics of an organism. While this method has served scientists since before the days of Aristotle, it can be problematic. Many organisms are so small that observing physical differences is difficult even with microscopy. In other cases, organisms have multiple life stages with very different forms that need to be individually identified, or significant differences between sexes. Sometimes the physical form of an organism is affected by its environment (including diet, habitat, etc.). In these cases, relying on the observation of physical traits is unreliable. With the advent of genetic sequencing, another approach is possible. By directly comparing the DNA of organisms, it is possible to make species identifications [29]. Ideally, this is done by focusing on specific genetic traits that vary among species but not within species. A first approach might be to identify a specific gene with this property and then to measure differences among instances of this gene across organisms using a classical genetic distance measure (such as alignment scores). Current distance-based methods for species identification using DNA barcoding sequences are frequently criticized for treating the nearest neighbour, as determined by raw similarity scores, as the closest relative. In [68], a feed-forward neural network is employed for the classification of DNA barcoding sequences. The results indicate better performance than previous methods such as the Basic Local Alignment Search Tool (BLAST) [3], which is a simple genetic distance-based method.
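The distance-based baseline that is criticized above can be illustrated with a short Python sketch: each query barcode is assigned the species of its nearest reference sequence. This is a simplified stand-in, not BLAST and not the neural network of [68]. It assumes the sequences are already aligned to equal length, uses the fraction of mismatching sites as the distance, and the species names and barcodes are invented toy data.

```python
def p_distance(a, b):
    """Fraction of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def nearest_neighbour_species(query, reference):
    """Return (species, distance) of the reference barcode closest to the query.

    reference is a list of (species_name, aligned_sequence) pairs.
    """
    species, seq = min(reference, key=lambda item: p_distance(query, item[1]))
    return species, p_distance(query, seq)

# Made-up barcodes; real barcode regions are hundreds of bases long.
reference = [
    ("species_A", "ACGTACGTAC"),
    ("species_B", "ACGTTCGAAC"),
    ("species_C", "TCGTACGTTA"),
]
print(nearest_neighbour_species("ACGTACGTAA", reference))  # ('species_A', 0.1)
```

The weakness is visible in the sketch: the query is always assigned to some species, however large the distance, and ties or near-ties between species are resolved arbitrarily. A trained classifier can instead learn which sites are diagnostic for each species.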

 