Gene prediction matters so much in the "omics" world because of its scientific and commercial value. The avalanche of genome data grows daily. The new challenge will be to use this vast reservoir of data to explore how DNA and proteins work with each other and the environment to create complex, dynamic living systems. Systematic studies of function on a grand scale (functional genomics) will be the focus of biological explorations in this century and beyond. These explorations will encompass studies in transcriptomics, proteomics, structural genomics, new experimental methodologies, comparative genomics, and, of course, pharmacogenomics.
Transcriptomics: the large-scale analysis of messenger RNAs (molecules that are transcribed from active genes) to determine when, where, and under what conditions genes are expressed.
Proteomics: the study of protein expression and function, which can bring researchers closer than gene-expression studies to what's actually happening in the cell.
Structural genomics: initiatives are being launched worldwide to generate the 3-D structures of one or more proteins from each protein family, thus offering clues to their function and providing biological targets for drug design.
Knockout studies: one experimental method for understanding the function of DNA sequences and the proteins they encode. Researchers inactivate genes in living organisms and monitor any changes that could reveal the function of specific genes.
Comparative genomics: analyzing DNA sequence patterns of humans and well-studied model organisms side by side has become one of the most powerful strategies for identifying human genes and interpreting their function.
Pharmacogenomics: the study of how an individual's genetic inheritance affects the body's response to drugs. It holds the promise that drugs might one day be tailor-made for individuals and adapted to each person's own genetic makeup. It combines traditional pharmaceutical sciences with annotated knowledge of genes, proteins, and single nucleotide polymorphisms.
Although more than 99% of human DNA sequences are the same across the population, variations in DNA sequence can have a major impact on how humans respond to disease; to environmental factors such as bacteria, viruses, toxins, and chemicals; and to drugs and other therapies. Methods are being developed to detect different types of variation, particularly the most common type, called single-nucleotide polymorphisms (SNPs), which occur about once every 100 to 300 bases. Scientists believe SNP maps will help them identify the multiple genes associated with such complex diseases as cancer, diabetes, vascular disease, and some forms of mental illness. These associations are difficult to establish with conventional gene-hunting methods because a single altered gene may make only a small contribution to disease risk.
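As a rough back-of-the-envelope check on the SNP density quoted above, the sketch below estimates how many SNPs a genome would carry at one SNP per 100 to 300 bases. The genome size of about 3 billion bases is an assumption that is not stated in the text, and the function name is purely illustrative.

```python
def estimated_snp_count(genome_length: int, bases_per_snp: int) -> int:
    """Approximate number of SNPs given an average spacing in bases."""
    if bases_per_snp <= 0:
        raise ValueError("bases_per_snp must be positive")
    return genome_length // bases_per_snp


# Assumption: a human genome of roughly 3 billion bases (not given in the text).
HUMAN_GENOME_BASES = 3_000_000_000

low = estimated_snp_count(HUMAN_GENOME_BASES, 300)   # sparsest spacing quoted
high = estimated_snp_count(HUMAN_GENOME_BASES, 100)  # densest spacing quoted
print(f"Estimated SNPs: {low:,} to {high:,}")
```

Under these assumptions the estimate is on the order of 10 to 30 million SNPs, which gives a sense of why dedicated SNP maps are needed.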
Once an interesting or important gene is identified, the follow-up wet-lab work becomes easier and more targeted. Automated genome sequencing has made it possible to detect "possible" genes computationally; the computational approach is not reliable every time, but it is clearly preferable to laborious laboratory procedures. Testing a gene prediction may yield true positives and false positives, as well as missing, wrong, joined, and split genes. Schematic representations of these scenarios are depicted below. Existing predictors target protein-coding regions, so non-coding regions are not detected and non-coding RNA genes are often missed. Several programs and algorithms have been developed to detect the most likely genes in a genome sequence. Programs such as GENSCAN and GeneMark are already widely accepted for gene prediction.
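The prediction outcomes listed above can be made concrete with a small sketch that compares predicted gene intervals against a reference annotation. This is only one possible way to define the categories, assuming genes are represented as inclusive (start, end) coordinates on a single strand; the function names and the example coordinates are hypothetical and do not come from any particular gene finder.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_predictions(
    reference: List[Interval], predicted: List[Interval]
) -> Dict[str, List[Interval]]:
    """Sort predictions (and unmatched reference genes) into outcome classes."""
    result: Dict[str, List[Interval]] = {
        "true_positive": [], "wrong": [], "false_positive": [],
        "joined": [], "split": [], "missing": [],
    }
    for pred in predicted:
        hits = [ref for ref in reference if overlaps(pred, ref)]
        if not hits:
            result["false_positive"].append(pred)   # no real gene here
        elif len(hits) > 1:
            result["joined"].append(pred)           # spans several real genes
        elif pred == hits[0]:
            result["true_positive"].append(pred)    # exact boundaries
        else:
            # Exactly one reference gene is hit; see whether other
            # predictions cover the same gene (a split) or not (wrong).
            siblings = [p for p in predicted if overlaps(p, hits[0])]
            key = "split" if len(siblings) > 1 else "wrong"
            result[key].append(pred)
    for ref in reference:
        if not any(overlaps(ref, p) for p in predicted):
            result["missing"].append(ref)           # real gene never predicted
    return result


# Hypothetical coordinates, for illustration only.
reference = [(100, 500), (800, 1200), (1500, 2000),
             (2500, 3000), (3200, 3600), (4000, 4400)]
predicted = [(100, 500), (820, 1200), (1500, 1700),
             (1800, 2000), (2500, 3600), (5000, 5300)]

result = classify_predictions(reference, predicted)
for outcome, intervals in result.items():
    print(f"{outcome}: {intervals}")
```

In this toy data set the first prediction is an exact match, the second has a wrong start, the third and fourth split one gene, the fifth joins two genes, the sixth is a false positive, and the reference gene at (4000, 4400) is missing. Real evaluations usually work at the exon or nucleotide level as well, which this sketch does not attempt.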
Genes are linked to diseases, and a few questions arise when analyzing expression data. (1) Which genes are induced or repressed? This can be confirmed using a t-test. (2) Which genes are co-regulated? This relates to inference of function, and it can be checked using clustering algorithms or support vector machines. (3) Which genes regulate others? This relates to the reconstruction of networks, and is best addressed through transcription factor binding site analysis as well as Bayesian networks.
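For the first question, a minimal sketch of a two-sample t-test on expression values might look like the following. It uses Welch's variant (which does not assume equal variances), chosen here for illustration since the text does not specify a variant. It computes only the t statistic; turning it into a p-value requires the t distribution, which is left to a statistics package. The gene names and expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def welch_t_statistic(control: Sequence[float], treated: Sequence[float]) -> float:
    """Welch's t statistic; positive when expression is higher in `treated`."""
    if len(control) < 2 or len(treated) < 2:
        raise ValueError("each group needs at least two replicates")
    # Standard error of the difference in means, using sample variances.
    standard_error = sqrt(
        variance(control) / len(control) + variance(treated) / len(treated)
    )
    if standard_error == 0:
        raise ValueError("no variation within either group; t is undefined")
    return (mean(treated) - mean(control)) / standard_error


# Hypothetical expression values (control, treated), for illustration only.
expression = {
    "gene_a": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "gene_b": ([5.0, 6.0, 7.0], [2.0, 3.0, 4.0]),
}

t_values = {
    gene: welch_t_statistic(control, treated)
    for gene, (control, treated) in expression.items()
}
for gene, t in t_values.items():
    direction = "higher" if t > 0 else "lower"
    print(f"{gene}: t = {t:.2f} (expression {direction} in treated samples)")
```

A large positive t suggests induction and a large negative t suggests repression, but with thousands of genes tested at once, a correction for multiple testing would also be needed before calling any gene significant.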
Once a potential gene has been predicted and discovered, the subsequent steps are: design primers and pull out the gene from the organism; clone and express the gene; purify, analyze, and crystallize the protein; search for the best ligands or drugs; and finally, new drug discovery. Identifying novel proteins and their associated biochemical roles is another fascinating challenge, and the chances of finding new proteins are enormous. Caution has to be taken when assigning a function to a newly discovered protein: in most cases the assignment is based on sequence similarity, and the annotation of the closest relative is transferred to the novel protein. Careful study of the protein's physico-chemical, biochemical and pathway information can justify the annotation. Swiss-Prot, ExPASy, Pfam (protein families database of alignments and hidden Markov models), ProDom (protein domain database) and InterPro are a few well-known examples of resources that compile information on new proteins.
Once a 3D model of a protein is deduced, the next step is to detect the possible binding sites; several programs, which use different force fields, are available for this job. Determining the structure of a protein from a raw sequence has always been a challenge. Over the years, several research groups have designed and developed software that can be used for in silico simulation and protein structure determination. The most promising way to determine the structure of a protein is molecular modeling based on comparative or homology approaches, also known as knowledge-based modeling. It relies on known crystal structures that share sequence and secondary structure similarity with the target of at least 25-30% over the entire protein. Application programs such as MODELLER and SWISS-MODEL are widely used for comparative modeling.
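The 25-30% similarity threshold can be illustrated with a short sketch that computes percent identity between a target and a candidate template. It assumes the two sequences have already been aligned (gaps written as "-") and counts identity over all columns where at least one sequence has a residue; other conventions for the denominator exist. The function names, the 30% default, and the example sequences are hypothetical and are not taken from MODELLER or SWISS-MODEL.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent of alignment columns with identical residues (gaps never match)."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" and b == "-":
            continue  # column with gaps in both sequences carries no information
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("alignment contains no residues")
    return 100.0 * matches / compared


def is_template_candidate(identity: float, threshold: float = 30.0) -> bool:
    """Rule-of-thumb filter; the 30% default is an assumed cut-off."""
    return identity >= threshold


# Hypothetical pre-aligned fragments, for illustration only.
target = "MKT-AYIAK"
template = "MKSQAYLAK"

identity = percent_identity(target, template)
print(f"Identity: {identity:.1f}% -> candidate: {is_template_candidate(identity)}")
```

In practice the alignment itself would come from a sequence search or alignment tool, and identity over a short fragment says much less than identity over the entire protein.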
After the active site is identified, the best ligand or receptor has to be identified; integrating cheminformatics is very useful here. Using an automated approach, the most suitable ligand is picked from a ligand library; the ligand is then "docked" to the novel protein under optimum conditions, and finally the stereochemistry of the protein complex is calculated. An important program for molecular docking is AutoDock (Morris, G. M., Goodsell, D. S., Halliday, R. S., Huey, R., Hart, W. E., Belew, R. K. and Olson, A. J.; Automated Docking Using a Lamarckian Genetic Algorithm and Empirical Binding Free Energy Function; J. Computational Chemistry, 19: 1639-1662, 1998).
With pharmaceutical research in view, efficient interpretation of the functions of human genes and other DNA sequences requires that strategies be developed to enable large-scale investigations across whole genomes. A priority is to generate complete sets of full-length cDNA clones and sequences for human and model-organism genes. Other functional genomics goals include studies into gene expression and the development of experimental and computational methods for understanding gene function. There are several checkpoints that have to be carefully examined in gene prediction; in particular, false positives have to be eliminated. Subsequently, the properties of the protein and its interactions have to be thoroughly understood. Once we have enough information about the novel protein, it becomes easier to propose a method to model the protein and identify the most suitable ligand.
Mr. Sulip Goswami is a research scholar at the Marine Biological Laboratory, Woods Hole, MA 02540. Mr. Amar Singh is a bioinformatics scientist at Mascon-ISC, 444 Wall Street, Princeton, NJ 08540, and Dr. Sandeep Bagga is a Certified Bioinformatics Specialist from the National Bioinformatics Institute, USA, associated with Pharmaceutical Research and Clinical Trials, PRACT Advisory Service, Alexandria, VA 22315 (USA).