Evidence-Based Diabetes Management

The Role of Bioinformatics in Diabetes Drug Development—and Precision Medicine

Published Online: May 20, 2014
Surabhi Dangi-Garimella, PhD
According to the 2011 National Diabetes Fact Sheet, diabetes affects nearly 26 million Americans, 95% of whom suffer from type 2 diabetes mellitus (T2DM).1 A 2014 report published by the Pharmaceutical Research and Manufacturers of America (PhRMA) documented the development of 180 medications to treat diabetes or diabetes-related conditions, a majority of which are to treat T2DM.2

The drugs being developed are intended to improve on current therapies and to reduce the health toll and healthcare costs associated with the disease. Among the drugs under development is a human peptide, a bioactive part of a gene that regenerates pancreatic islets; additionally, there are novel inhibitors of the protein dipeptidyl peptidase-4 (DPP-4) being developed, as well as a drug that targets sorbitol, a sugar alcohol determined to be responsible for diabetic neuropathy.2 These advances build on research into disease mechanisms, including gene sequencing and protein structure elucidation.

GenBank, a comprehensive, open-access database maintained by the National Center for Biotechnology Information (NCBI), plays a very important role in this process. GenBank includes nucleotide sequences for more than 280,000 species and the supporting bibliographies, with submissions from individual laboratories as well as large-scale sequencing projects. Additionally, sequences from issued patents are submitted by the US Patent and Trademark Office.3 Researchers all over the world have actively contributed to building up the resource, recognizing the vast potential of an openly shared database. The information is submitted either directly to GenBank or through its European counterpart, the European Bioinformatics Institute (EBI), or its Japanese counterpart, the DNA Data Bank of Japan (DDBJ).4

All the leading journals require researchers to submit their sequences to GenBank and cite the corresponding accession number in the published article. New sequences can be submitted directly to EBI, DDBJ, or GenBank, and the 3 databases are synchronized daily so that all the information is accessible from any of them. As a result, the latest data are available almost in real time, with minimal delay and free of cost.
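As an illustration of how a cited accession number can lead back to the underlying record, the following minimal Python sketch builds a request URL for NCBI's E-utilities "efetch" service. It only constructs the URL and makes no network call. The function name and the accession value "XX000000" are hypothetical placeholders introduced for this example, not a real record.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str) -> str:
    """Return a URL requesting a nucleotide record in GenBank flat-file text.

    Hypothetical helper for illustration; it does not contact NCBI.
    """
    params = {
        "db": "nuccore",     # nucleotide collection
        "id": accession,     # accession number cited in the article
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",   # plain text rather than XML
    }
    return f"{EFETCH_BASE}?{urlencode(params)}"


# "XX000000" is a placeholder, not a real accession number.
example_url = build_efetch_url("XX000000")
print(example_url)
```

Replacing the placeholder with an accession number taken from a published article would, under these assumptions, yield a link to the corresponding record.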

Other commonly used sequence databases include the nucleotide database of the European Molecular Biology Laboratory (EMBL; EBI is run by EMBL), the protein resources SwissProt and PROSITE, and the Human Genome Database (GDB).5 Taken together, these databases function as bioinformatics tools that help integrate biological information with computational software. The information gained can be applied to understand disease etiology (in terms of mutations in genes and proteins) and individual variability, and ultimately to aid drug development.

According to the National Institutes of Health Biomedical Information Science and Technology Initiative, bioinformatics is defined as “research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral, or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.”6

Development of GenBank

Initially called the Los Alamos Sequence Database, this resource was conceptualized in 1979 by Walter Goad, a nuclear physicist and a pioneer in bioinformatics at Los Alamos National Laboratory (LANL).7 GenBank followed in 1982 with funding from the National Institutes of Health, the National Science Foundation, and the Departments of Energy and Defense. LANL collaborated with various bioinformatics and technology companies to manage sequence data and to promote open-access communication. By 1992, management of GenBank had transitioned to the National Center for Biotechnology Information (NCBI).8

Submissions to the database include original mRNA sequences, prokaryotic and eukaryotic genes, rRNA, viral sequences, transposons, microsatellite sequences, pseudogenes, cloning vectors, noncoding RNAs, and microbial genome sequences. Following a submission (using the Web-based BankIt or Sequin programs), the GenBank staff reviews the documents for originality and then assigns an accession number to the sequence, followed by quality assurance checks (vector contamination, adequate translation of coding regions, correct taxonomy, correct bibliographic citation) and release to the public database.3,8

How Are Researchers Utilizing This Database?

BLAST (Basic Local Alignment Search Tool) software, available through NCBI, lets researchers search for similar sequences by directly entering their sequence of interest, without needing the gene name or its synonyms.4 An orphan (unknown) or de novo nucleotide sequence, which may have been cloned in a laboratory, can be put in context when a BLAST search matches it with another, better-characterized sequence in the database. Further, by adding restrictions to the BLAST search, only specific regions of the genome (such as gene-coding regions) can be examined instead of all 3 billion bases of the human genome.4 BLAST can also translate a DNA sequence to a protein, which can then be used to search a protein database.
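The translation step mentioned above can be sketched in a few lines of Python. This is an illustrative sketch, not BLAST's own code: it applies the standard genetic code to a DNA string in all 6 reading frames (3 on each strand), which is the general idea behind translated searches. The function names and the short example sequence are assumptions made for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order; "*" marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str, frame: int = 0) -> str:
    """Translate DNA to protein starting at offset `frame` (0, 1, or 2).

    Codons containing unrecognized characters are rendered as "X".
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict:
    """Translate both strands in all three frames each."""
    reverse = reverse_complement(dna)
    frames = {f"+{i + 1}": translate(dna, i) for i in range(3)}
    frames.update({f"-{i + 1}": translate(reverse, i) for i in range(3)})
    return frames


# Short made-up sequence used only to exercise the functions.
frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # MA*
```

Each of the 6 translated strings could then be compared against protein sequences, which is useful when the reading frame of a newly cloned sequence is unknown.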

BLAST, which was developed at NCBI, works only with big chunks of nucleotide sequences, and not with shorter reads, according to Santosh Mishra, PhD, director of bioinformatics and codirector of the Collaborative Genomics Center at the Vaccine and Gene Therapy Institute (VGTI) of Florida. Mishra, who worked as a postdoctoral research associate with Goad at LANL, was actively involved in developing GenBank. His work contributed to the generation of the “flat file” format, and he also worked on improving the query-response time of the search engine.

Additionally, he initiated the “feature table” in GenBank, the documentation within each record that helps GenBank, EMBL, and DDBJ exchange data on a daily basis. According to Mishra, the STAR aligner, developed at Cold Spring Harbor, works better with reference sequences, while Trinity, developed at the Broad Institute in Cambridge, Massachusetts, is useful for de novo sequences. (The Broad Institute made news last month with its work on identifying gene mutations that prevent diabetes in adults who have known risk factors, such as obesity.)
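To make the “flat file” format and the feature table more concrete, the sketch below parses a toy record laid out in the general style of a GenBank flat file: header lines, a FEATURES block with feature keys, locations, and "/qualifier" lines, an ORIGIN block holding the sequence, and a closing "//". The record, its accession "TOY0001", and the parser function are all hypothetical and written for this illustration. The parser is deliberately simplified: it splits on whitespace rather than fixed columns and ignores multi-line locations and qualifier values.

```python
# Hypothetical toy record for illustration only; not a real GenBank entry.
SAMPLE_RECORD = """\
LOCUS       TOY0001                   27 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical toy record for illustration only.
ACCESSION   TOY0001
FEATURES             Location/Qualifiers
     source          1..27
                     /organism="synthetic construct"
     CDS             1..27
                     /product="hypothetical peptide"
ORIGIN
        1 atggccattg taatgggccg ctgataa
//
"""


def parse_flat_file(text: str) -> dict:
    """Extract the accession, feature table, and sequence from one record."""
    record = {"accession": None, "features": [], "sequence": ""}
    section = None
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("FEATURES"):
            section = "features"
        elif line.startswith("ORIGIN"):
            section = "origin"
        elif section == "features" and line.strip():
            entry = line.strip()
            if entry.startswith("/"):      # qualifier of the latest feature
                name, _, value = entry[1:].partition("=")
                if record["features"]:
                    record["features"][-1]["qualifiers"][name] = value.strip('"')
            else:                          # new feature: key and location
                key, location = entry.split(None, 1)
                record["features"].append(
                    {"key": key, "location": location, "qualifiers": {}}
                )
        elif section == "origin":
            # Drop position numbers and spaces, keeping only the bases.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
    return record


parsed = parse_flat_file(SAMPLE_RECORD)
for feature in parsed["features"]:
    print(feature["key"], feature["location"], feature["qualifiers"])
```

Because each feature carries a key, a location, and named qualifiers, the same annotation can be read by any database that agrees on the table's conventions, which is what makes a shared feature table useful for data exchange.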

Advantages and Disadvantages of the GenBank Platform

The single biggest advantage of GenBank is its open-access format, which provides a centralized repository in a uniform format. The tremendous amount of data generated by laboratories (such as from microarrays and microRNA arrays) cannot be published in full within a research article. However, the data, tagged and uploaded to GenBank, can be linked to the journals’ websites, and the links can be provided in the print versions of the articles as well.4
