BIOINFORMATICS AND COMPUTER MODELING
With the rise of big data (e.g., the results of ‘omic studies) comes the necessity to incorporate computation into biotechnology in meaningful and seamless ways. In particular, computational tools are increasingly being used to analyze and make predictions about the structure and function of DNA, proteins, and cellular systems.
DNA and Protein Repositories and Associated Tools
The genomic era brought massive amounts of biological information about DNA and proteins. DNA analysis primarily involves the comparison of sequences, and a number of bioinformatics tools assist in such work. The National Center for Biotechnology Information (NCBI) is an excellent source of both genetic information (GenBank) and tools for the comparison and prediction of DNA and protein sequences, such as the Basic Local Alignment Search Tool (BLAST).
GenBank® is a publicly accessible repository of DNA and protein sequences maintained by the United States’ National Institutes of Health (NIH). It is part of a greater consortium of genetic databases, including the DNA DataBank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL) (NCBI 2011). Newly acquired DNA sequences are deposited into the database by investigators on a rolling basis. Curators of the system ensure some level of consistency and uniformity within the entries. The database can be mined by keyword queries (such as gene name, function, or organism) to acquire the nucleotide or protein sequence of interest. GenBank entries provide information such as source references, coding sequence (CDS) and other feature coordinates, and translated amino acid sequences. One could use this information to design primers for gene amplification/cloning or to perform comparisons between similar genes/proteins.
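To illustrate how the feature coordinates in an entry relate to the translated amino acid sequence, the following is a minimal sketch in Python. It assumes a simple forward-strand CDS given as 1-based, inclusive start and end positions, and it uses the standard genetic code. The record sequence and coordinates are hypothetical and were made up for illustration; real entries may also describe reverse-strand or multi-segment features, which this sketch does not handle.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def extract_cds(sequence, start, end):
    """Return the subsequence for 1-based, inclusive coordinates."""
    return sequence[start - 1:end]


def translate(cds):
    """Translate a forward-strand CDS, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    usable_length = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical record: 4 flanking bases, a 12-base CDS, 4 flanking bases.
record_sequence = "GGCC" + "ATGGCTAAATAG" + "TTAA"
cds = extract_cds(record_sequence, 5, 16)
print(cds)             # ATGGCTAAATAG
print(translate(cds))  # MAK
```

The same extracted subsequence is the natural starting point for choosing primer-binding regions, since primers are designed against the ends of the region to be amplified.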
BLAST is a way to search the GenBank database using a sequence-based (rather than keyword) query. The most conventional BLAST searches are based on nucleotide (BLASTn) and protein (BLASTp) sequences.
In BLASTn and BLASTp, submitted nucleotide/protein sequences are cross-referenced against all other nucleotide/protein sequences in the database.39 The results are delivered as a list of “hits,” ranked in order of similarity to the query. Pairwise sequence alignments are provided, along with E-values and similarity and coverage percentages. An E-value estimates the number of hits with an equal or better score that would be expected to occur by chance in a database of that size: the smaller the E-value, the more significant the match. BLAST searches are such a widely used bioinformatics tool that “BLAST” has become a colloquial verb among biologists.
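The relationship between an alignment score and its E-value can be sketched with the commonly quoted approximation E = m × n × 2^(−S′), where S′ is the normalized (bit) score, m is the query length, and n is the total length of the database. The function name and the example lengths below are hypothetical. The BLAST programs themselves use effective lengths that correct for edge effects, so reported values will differ somewhat from this simplified estimate.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Simplified: uses raw lengths rather than the edge-corrected
    effective lengths that BLAST applies.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 100-residue query against a 1-billion-residue database.
query_length = 100
database_length = 10**9

for bit_score in (30, 50, 100):
    e_value = expect_value(bit_score, query_length, database_length)
    print(f"bit score {bit_score:>3}: E = {e_value:.3g}")
```

Two properties follow directly from the formula: each additional bit of score halves the E-value, and the same score becomes less significant as the database grows.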
In addition to searching the general sequence databases, BLAST can be used to compare two or more sequences to each other. To do this, one simply selects the “Align two or more sequences” box on the submission page and enters the sequences to be compared. The BLAST tool compares the submitted sequences to each other and generates an alignment in which the locations of similar and dissimilar nucleotides or amino acids are marked. This application of BLAST is particularly useful when comparing cloned sequences to ideal reference sequences or when identifying differences (e.g., SNPs) within a set of similar sequences. Freeware programs of the Clustal family40 (e.g., Clustal-W2) are specialized online bioinformatics tools for the construction of nucleotide and protein alignments, with the added feature of phylogenetic tree construction. Phylogenetic trees graphically depict the overall genetic similarity between sequences.
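As a rough illustration of what such a comparison reports, the sketch below takes two sequences that are assumed to be already aligned and of equal length, lists the positions where they differ, and computes the percent identity. It does not perform the alignment itself, which BLAST and the Clustal programs handle; the function names and the example sequences are hypothetical.

```python
def find_differences(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatch.

    Positions are 1-based. Both sequences must already be aligned
    and therefore have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    differences = []
    pairs = zip(reference.upper(), sample.upper())
    for position, (ref_base, sample_base) in enumerate(pairs, start=1):
        if ref_base != sample_base:
            differences.append((position, ref_base, sample_base))
    return differences


def percent_identity(reference, sample):
    """Percentage of aligned positions at which the two sequences match."""
    if not reference:
        raise ValueError("sequences must not be empty")
    mismatches = len(find_differences(reference, sample))
    return 100.0 * (len(reference) - mismatches) / len(reference)


# Hypothetical reference and cloned sequences differing at one position.
reference = "ATGGCTAAATAG"
clone = "ATGGCTCAATAG"

print(find_differences(reference, clone))  # [(7, 'A', 'C')]
print(f"{percent_identity(reference, clone):.1f}% identity")
```

A single mismatch in an otherwise identical clone, as here, is the kind of result that would prompt resequencing or a closer look at a possible SNP.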
A clearinghouse for online bioinformatics tools can be found at the ExPASy Bioinformatics Resources Portal,41 which is maintained by the Swiss Institute of Bioinformatics. Tools are grouped into 11 categories, ranging from proteomics to systems biology to drug discovery. As an example of what is available, links within the proteomics category lead to a multitude of tools for identifying protein domains, translating nucleotide sequences, analyzing mass spectrometry and 2D gel data, and predicting structure, post-translational modification, and protein topology.