Pharmabiz
 

Global bioinformatics industry makes rapid strides

Thursday, February 2, 2012, 08:00 Hrs  [IST]

The global bioinformatics industry has grown at double-digit rates in the past and is expected to follow the same pattern over the next 5 to 10 years. Presently, the US remains the largest market in the world, but India and China have the fastest growth rates. The biggest opportunity will be in the drug discovery sector.

Thanks to the rapid development of tools and software, bioinformatics reduces the overall drug development timeline by 30 per cent and the annual cost by 33 per cent. Given that the development lifecycle for a new drug or biologic spans 12 to 15 years and costs more than a billion dollars, there is a significant incentive to reduce the time needed to develop products. Major US pharmaceutical companies are expected to increase their R&D expenditures in the future, and a major portion of this spending is expected to go toward bioinformatics. Global pharmaceutical R&D expenditure in 2010 was predicted to rise to US$ 153 billion, although, given recent economic downturns, it is not known whether this trend will hold up.

Enormous advances in biological technology over the past four decades have led to a profound change in how information is processed; conceptual and technical developments in experimental and molecular biology disciplines such as genomics, transcriptomics, proteomics, metabolomics, immunomics, and countless other “omics” have resulted in a veritable sea of data with the potential to radically alter biomedicine. Yet, with this wealth of data comes a challenge, namely how to transform the data into information, the information into knowledge, and the knowledge into useful action.

IT engineering is not a core competency of medical researchers, the life sciences industry, healthcare providers, or the various governmental regulatory agencies. Yet these same groups are attempting to utilize very advanced IT systems to perform increasingly complex computational and data-management tasks, all in an effort to explore diseases at a cellular level, to develop therapies, and to utilize these therapies to treat and cure diseases affecting humans worldwide. Sophisticated, trusted, and reliable IT solutions are required to aid and accelerate this work.

Role of IT in drug discovery
The process of drug discovery involves the identification of candidates, synthesis, characterization, screening, and assays for therapeutic efficacy. Once a compound has shown its value in these tests, it will begin the process of drug development prior to clinical trials.

Despite advances in technology and understanding of biological systems, drug discovery is still a lengthy, "expensive, difficult, and inefficient process" with a low rate of new therapeutic discovery. Currently, the research and development cost of each new molecular entity (NME) is approximately US$ 1.8 billion.

Product categories of IT in pharma and biotechnology
The IT industry in pharmaceuticals and biotechnology can be divided into the following product categories: laboratory automation; content generation; data storage (databases and data warehouses); analysis software and services; and IT infrastructure.

Laboratory automation is a multi-disciplinary strategy to research, develop, optimize and capitalize on technologies in the laboratory that enable new and improved processes. Laboratory automation professionals are academic, commercial and government researchers, scientists and engineers who conduct research and develop new technologies to increase productivity, elevate experimental data quality, reduce lab process cycle times or enable experimentation that otherwise would be impossible.

The most widely known application of laboratory automation technology is laboratory robotics. More generally, the field of laboratory automation comprises many different automated laboratory instruments, devices, software algorithms and methodologies used to enable, expedite and increase the efficiency and effectiveness of scientific research in laboratories.

The application of technology in today's laboratories is required to achieve timely progress and remain competitive. Laboratories devoted to activities such as high throughput screening, combinatorial chemistry, automated clinical and analytical testing, diagnostics, large scale biorepositories, and many others, would not exist without advancements in laboratory automation.

Content generation
The basic drug discovery pipeline is well known within the pharmaceutical industry. It consists of seven basic steps: disease selection, target hypothesis, lead compound identification, lead optimization, preclinical trial testing, clinical trial testing and pharmacogenomic optimization. In actuality, each step involves a complex set of scientific interactions and each interaction has an information technology component that facilitates its execution. For example, the process of target validation requires access to data from a variety of sources with the goal of gaining a molecular-level understanding of the potential role of specific protein molecules in modulating the progress of a disease state. Targets are identified using information gleaned from databases containing information about genomic sequences, protein sequences and structure, and mRNA expression profiles present in specific disease situations. The ultimate goal is to integrate these steps into a seamless process and to provide external interfaces to data sources regardless of differences in structure, design and data definition.

Flow of information through the drug discovery pipeline: Information from physiological databases, medical records and other sources drives disease selection. Target validation involves extensive analysis of protein structure and genomic sequence information. Target and lead compound interactions are the focus of computational chemistry techniques used to select final compounds for clinical trials. Finally, medical informatics techniques use databases containing information correlating clinical outcome with specific genetic make-up of the patient, gene expression data and disease states. Results from clinical trials are fed back to enhance the next round of target selection and lead compound optimization.

An important part of target validation often involves purifying and studying a specific receptor to determine the conditions under which molecules are likely to bind most efficiently. Pharmaceutical companies often use combinatorial chemistry and high-throughput screening (HTS) to predict these target-lead interactions.

Combinatorial chemistry involves the rapid synthesis, or computer simulation, of a large number of different but structurally related molecules or materials. It is especially common in computer-aided drug design (CADD) and can be done online with web-based software such as Molinspiration.

In its modern form, combinatorial chemistry has probably had its biggest impact in the pharmaceutical industry. Researchers attempting to optimize the activity profile of a compound create a 'library' of many different but related compounds. Advances in robotics have led to an industrial approach to combinatorial synthesis, enabling companies to routinely produce over 100,000 new and unique compounds per year.

In order to handle the vast number of structural possibilities, researchers often create a 'virtual library', a computational enumeration of all possible structures of a given pharmacophore with all available reactants. Such a library can consist of thousands to millions of 'virtual' compounds. The researcher will select a subset of the 'virtual library' for actual synthesis, based upon various calculations and criteria.
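
A small illustration of how such a virtual library might be narrowed down in practice: the sketch below is a minimal example, assuming the open-source RDKit toolkit and an illustrative list of SMILES strings, that keeps only compounds satisfying Lipinski's rule of five, a common drug-likeness criterion used in this kind of selection.

    # Sketch: filter a toy 'virtual library' of SMILES strings by Lipinski's rule of five.
    # Assumes the open-source RDKit toolkit; the SMILES strings are illustrative only.
    from rdkit import Chem
    from rdkit.Chem import Descriptors, Lipinski

    virtual_library = [
        "CC(=O)Oc1ccccc1C(=O)O",     # aspirin-like scaffold
        "CN1CCC[C@H]1c1cccnc1",      # nicotine-like scaffold
        "CCCCCCCCCCCCCCCCCC(=O)O",   # long aliphatic acid, likely to fail the logP cut-off
    ]

    def passes_rule_of_five(mol):
        # True if the molecule meets all four Lipinski criteria.
        return (Descriptors.MolWt(mol) <= 500
                and Descriptors.MolLogP(mol) <= 5
                and Lipinski.NumHDonors(mol) <= 5
                and Lipinski.NumHAcceptors(mol) <= 10)

    hits = []
    for smiles in virtual_library:
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None and passes_rule_of_five(mol):
            hits.append(smiles)

    print(f"{len(hits)} of {len(virtual_library)} compounds pass the filter")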

High-throughput screening (HTS) is a method of scientific experimentation used especially in drug discovery and relevant to the fields of biology and chemistry. Using robotics, data processing and control software, liquid handling devices, and sensitive detectors, HTS allows a researcher to quickly conduct millions of chemical, genetic or pharmacological tests. Through this process one can rapidly identify active compounds, antibodies or genes that modulate a particular biomolecular pathway. The results of these experiments provide starting points for drug design and for understanding the interaction or role of a particular biochemical process in biology.
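
As a rough sketch of the downstream number-crunching this implies, the example below uses plain NumPy on made-up plate readings, with an assumed layout in which negative controls give the uninhibited signal and positive controls give full inhibition; it computes the widely used Z'-factor quality metric and flags wells whose per cent inhibition exceeds a chosen threshold.

    # Sketch: score one HTS plate with NumPy (synthetic readings, assumed control layout).
    import numpy as np

    rng = np.random.default_rng(0)
    neg_ctrl = rng.normal(1000, 50, 16)   # negative controls: high signal, no inhibition
    pos_ctrl = rng.normal(100, 20, 16)    # positive controls: low signal, full inhibition
    samples = rng.normal(900, 150, 320)   # test wells

    # Z'-factor: plate quality metric; values above roughly 0.5 indicate a robust screen.
    z_prime = 1 - 3 * (neg_ctrl.std() + pos_ctrl.std()) / abs(neg_ctrl.mean() - pos_ctrl.mean())

    # Per cent inhibition for each test well, then flag putative hits above a 50% threshold.
    inhibition = 100 * (neg_ctrl.mean() - samples) / (neg_ctrl.mean() - pos_ctrl.mean())
    hits = np.where(inhibition >= 50)[0]

    print(f"Z' = {z_prime:.2f}; {hits.size} putative hits out of {samples.size} wells")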

Early drug discovery involves several phases from target identification to preclinical development. The identification of small molecule modulators of protein function and the process of transforming these into high-content lead series are key activities in modern drug discovery. The Hit-to-Lead phase is usually the follow-up of high-throughput screening (HTS).

HTS has lately been made feasible through modern advances in robotics and high-speed computing. It still takes a highly specialized and expensive screening lab to run an HTS operation, so in many cases a small-to-moderately sized research institution will use the services of an existing HTS facility rather than set up its own.

The advent of combinatorial chemistry in conjunction with high throughput screening has meant that researchers can quickly generate large volumes of data points. The application of techniques such as mass spectrometry and X-ray crystallography for determining the structure of proteins and generation of nucleotide and SNP data from genomics research have also contributed to an explosion in the amount of data generated by researchers in pharmaceuticals and life sciences.

Additionally, during the past several years, in-silico techniques for predicting these molecular events have advanced to the point where biotech companies are beginning to skip much of the bench work involved in combinatorial chemistry and to synthesize only the most promising compounds, based on a structural understanding of the receptor and its associated ligands.

Proteomics is often considered much more complicated than genomics. This is because while an organism’s genome is constant – with exceptions such as the addition of genetic material caused by a virus, or rapid mutations, transpositions, and expansions that can occur in a tumour – the proteome differs from cell to cell.

Moreover, even a few samples in a Proteomics Core Lab can generate terabytes of downstream data; therefore, data management is a real issue, often more so than sample management.

Data storage and data warehouse
Research laboratories generate large volumes of scientific data from which to draw insights and inferences. Recent experimental techniques, such as the omics methods and computational simulation, generate terabytes of raw data in every run; a typical large pharma company produces approximately 20 terabytes of data a day. Effective analysis and annotation of basic reads or data sets can be laborious without the help of IT tools.

As a lot of genomics and proteomics research involves comparison of experimental data with established genomic and proteomics databases, there are also significant challenges for the computer software industry in enabling quick and accurate searches within these databases.
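
To give a flavour of what such a database comparison looks like in code, the sketch below uses Biopython's wrapper around NCBI's web BLAST service to compare a short nucleotide query against the public 'nt' database; the query sequence is purely illustrative, and NCBI's usage limits apply to real runs.

    # Sketch: search a query sequence against NCBI's public nucleotide database
    # via Biopython's web-BLAST wrapper (illustrative query; subject to NCBI usage limits).
    from Bio.Blast import NCBIWWW, NCBIXML

    query_seq = "AGCTTAGCTAGCTACGGAGCTTATCGATCGATCGGATCGATCGATCGTAGCTAGC"  # made-up sequence

    result_handle = NCBIWWW.qblast("blastn", "nt", query_seq)  # submit the search over the web
    blast_record = NCBIXML.read(result_handle)                 # parse the XML result

    # Report the best-scoring alignments below a loose E-value cut-off.
    for alignment in blast_record.alignments[:5]:
        hsp = alignment.hsps[0]
        if hsp.expect < 1e-5:
            print(alignment.title, hsp.expect)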

These challenges have led to the creation of a separate discipline within the life sciences called bioinformatics, and most large pharmaceutical and biotechnology companies now have bioinformatics teams. In addition, a wave of start-up companies has been formed to develop technologies and to sell information from databases.

Examples include Double Twist, Lion Biosciences, Rosetta Inpharmatics and Structural GenomiX. The company that sequenced the human genome, Celera Genomics (now part of PE Corp), is essentially a bioinformatics company.

A variety of languages and software programs have been developed for data mining including Predictive Model Markup Language (PMML), Cross-Industry Standard Process for Data Mining (CRISP-DM), KnowledgeSEEKER, GhostMiner, KEEL, Clementine, R, Viscovery and many other open-source and commercial programs.

These programs share characteristics that allow them to extract data from existing databases. Before such analysis can happen, however, multiple databases must be subjected to data cleaning (mapping data to consistent conventions, a non-trivial exercise; accurately representing missing data points; and accounting for “noise” in the system), and methods must be provided to create logical access to the various forms of data, including off-line data and metadata. This process of cleaning and ensuring access is referred to as data warehousing, and it is especially important in bioinformatics because of the variety and sheer complexity of the data.
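
The sketch below, using pandas on a toy expression table with hypothetical column names and a made-up symbol mapping, illustrates the cleaning steps just described: mapping inconsistent naming conventions to one standard, representing missing readings explicitly, and discarding obvious noise before the data are loaded into a warehouse.

    # Sketch: minimal data-cleaning pass before loading results into a warehouse.
    # The table, column names and cleaning rules are hypothetical examples.
    import numpy as np
    import pandas as pd

    raw = pd.DataFrame({
        "gene":       ["tp53", "TP-53", "BRCA1", "brca1", "EGFR"],
        "expression": ["5.1",  "n/a",   "7.30",  "",      "9999"],  # mixed text, blanks, outlier
    })

    # 1. Map gene labels to one consistent convention (upper case, no punctuation).
    raw["gene"] = raw["gene"].str.upper().str.replace("-", "", regex=False)

    # 2. Represent missing readings explicitly as NaN rather than "n/a" or blanks.
    raw["expression"] = pd.to_numeric(raw["expression"], errors="coerce")

    # 3. Treat implausible values ("noise" such as instrument error codes) as missing too.
    raw.loc[raw["expression"] > 100, "expression"] = np.nan

    clean = raw.dropna().drop_duplicates(subset="gene")
    print(clean)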

Genomic sequences are massive data sets, and the chance for inadvertent error is always present.

Once data warehousing has been accomplished, data mining generally proceeds along six tasks (a brief illustrative sketch follows the list):

  • Classification: arrangement of the data into predefined groups; common algorithms include nearest neighbour, the naive Bayes classifier and neural networks
  • Regression: seeking to identify a function that models the data with the least error
  • Clustering: similar to classification, but the groups are not predefined, so the algorithm attempts to group similar items together
  • Summarization: methods for finding a compact description of a data subset
  • Dependency modelling: methods for finding significant dependencies between variables in a model
  • Change and deviation detection: discovering the most significant changes in a data set from a pre-established norm
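
The sketch below shows two of these tasks side by side on a synthetic data set, using scikit-learn: classification into predefined groups with a nearest-neighbour model, and clustering, where the groups are not known in advance (the data and parameters are illustrative only).

    # Sketch: classification and clustering on synthetic data with scikit-learn.
    # Real screening or expression data would replace make_blobs.
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.cluster import KMeans

    X, y = make_blobs(n_samples=300, centers=3, random_state=42)

    # Classification: predefined groups, nearest-neighbour algorithm.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print("classification accuracy:", clf.score(X_test, y_test))

    # Clustering: no predefined groups, the algorithm finds them itself.
    labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
    print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
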
Analysis software and services
Nearly coincident with the advances in biological science, and in fact rapidly outpacing such advances, has been the advent of the modern computer and associated advances in information storage, retrieval, and processing made practical with microelectronics and informatics. The power of modern information technology is ideal for capturing and storing the huge volume of biological data being generated; however, the respective languages and concepts of biology and computer sciences have, until recently, been disparate enough to prevent the logical next step of combining the two disciplines into a more powerful tool. The discipline of bioinformatics has emerged to capture the information stored in living systems and help turn it into actionable technology.

Databases such as NCBI GenBank, RefSeq and SWISS-PROT, searched with tools such as BLAST, are continually updated with functional DNA, RNA and protein sequences, and most of these data banks are growing at exponential rates. Data storage, management and analysis requirements in the life sciences are outpacing current computing capabilities, even though computing power continues to increase. Computing power has expanded at roughly the same rate as the public DNA sequence collection held at the National Institutes of Health (NIH): both have been doubling every 18 to 24 months for several years. In 1971, only 2,250 transistors fitted on an integrated circuit; by 2002 there were 42 million.

Similarly, only 606 DNA sequences were housed at NIH’s GenBank in 1982. The number of sequences climbed to 22.3 million in just 20 years.
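
Records in these public repositories can also be pulled programmatically; the short sketch below uses Biopython's Entrez interface to fetch a single GenBank entry (the accession number is just an example, and NCBI asks that a real contact e-mail address be supplied).

    # Sketch: fetch one GenBank record through NCBI's Entrez utilities with Biopython.
    # The accession number is an example; NCBI requires a real contact e-mail address.
    from Bio import Entrez, SeqIO

    Entrez.email = "your.name@example.org"  # placeholder address

    handle = Entrez.efetch(db="nucleotide", id="NM_000546", rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()

    print(record.id, record.description)
    print("sequence length:", len(record.seq))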

As computers become more powerful, it is increasingly feasible to simulate various aspects of the drug discovery and development pipeline in-silico rather than undertake experiments or trials in the real world. This could lead to significant savings in both time and cost. As knowledge expands, it is becoming more possible to simulate complex interactions among targets and leads, and among all the proteins involved in complex pathways within the body. The complexity of these bioinformatics applications has attracted information technology providers to life sciences.

IBM, for instance, is in the process of building an advanced petaflop supercomputer to tackle Grand Challenge problems in areas such as protein folding. It is also undertaking other research programs in pattern discovery, protein structure and structural genomics. Computer companies have entered into alliances: Hitachi and Oracle with Myriad Genetics to sequence the human proteome, and IBM with Proteome Systems to identify and analyze proteins.

Many Indian bioinformatics companies, such as Ocimum Biosolutions, Strand Life Sciences and Polyskin Technologies, are developing the varied solutions and software required by the biopharma industry.

Major IT players such as Infosys, TCS, Cognizant and Persistent Systems are also offering a variety of solutions to biopharma companies, ranging from database management and data warehousing with customized, regulatory-compliant data mining software to ERP solutions for manufacturing, supply chain management, sales force analysis and clinical trials.

Bioinformatics is improving the R&D process in drug discovery and development. IT tools have become important for managing and screening genetic data and for modelling outcomes in drug development. New developments in bioinformatics and genetics, such as pharmacogenetics (that is, the study of the relationships between diseases, genes, proteins, and pharmaceuticals), will enable researchers to quickly identify a patient's genetic predisposition to certain diseases as well as their likely drug response. It is likely that in the near future an individual's protein, genetic or metabolic profile will be used to determine predisposition to disease and to tailor medical care.

Advancing personalized medicine relies on the discovery and validation of protein and peptide biomarkers that signal disease states. Researchers have identified many putative biomarkers, but few have been independently validated. A key challenge is to break this bottleneck and bridge the gap between early-stage discovery and next-stage, routine quantitative application of biomarker assays in the clinical research setting.

Challenges currently facing the bioinformatics industry include, at a minimum, the following: lack of interoperability and multi-platform capability; lack of standardized formats; difficulties in integrating applications; management of high-volume data; and growing competition from in-house development and publicly available tools. To overcome these obstacles, the Interoperable Informatics Infrastructure Consortium (I3C) was founded in 2001 to collectively address some of the standardization problems. I3C is an international consortium that includes life sciences and IT participants from private industry, government institutions, academia and other research organizations. It develops and promotes “global, vendor-neutral informatics solutions that improve data quality and accelerate the development of life science products.” I3C's accomplishments include standards developed to identify and access biologically significant data and a method that simplifies data retrieval from multiple databases.

Another challenge is that data at many companies is not centralized but is fragmented across different locations, which makes data integration with multiple source systems, data quality and integration with other enterprise applications crucial to the successful use of analytics tools.

Courtesy: White paper by Frost & Sullivan and FDASmart Inc.

 