Bioinformatics and Swine Genetic Improvement
Dr. Jeff Veenhuizen
DeKalb Choice Genetics
Recent years have seen explosive growth in biological data on the genetic makeup of mammals. Such growth is both immensely exciting and challenging. Among the challenges is the need to learn an entire language of new genetic descriptors that some geneticists haven't even heard of yet. The kinds of data now being collected include bibliographic records, nucleotide and protein sequences and structures (coordinates), protein families, metabolic pathways, sequence motifs or patterns, and genetic and physical maps. Terms such as EST, QTL, PDB, PIR, dbSTS, and PROSITE, to name a few, are commonly used in some circles. The purpose of this paper is to identify and discuss one of the emerging technologies, bioinformatics, and its application to swine genetic improvement. In doing so, it is unavoidable to use several of these acronyms and terms. Many of them are mentioned simply to show that they exist; detailed explanation is left for other communications.
Improving animal production through breeding selection has been practiced in the swine industry for many years. The data used in the selection process have been dominated by observable phenotypes. The explosive growth in data about genetic makeup is rapidly making it possible to rely more heavily on genetic information in breeding schemes. At the top of the list of challenges in this area is the constantly increasing quantity of nucleotide sequence produced by large sequencing projects. The contents of nucleotide databases are doubling in size approximately every 14 months. The latest release of GenBank (V.114) exceeded 1.4 billion base pairs. Not only is the volume of sequence data rapidly increasing, but the number of characterized genes from many organisms, and of known protein structures, also doubles about every two years. The staggering volume of molecular data and its cryptic and subtle patterns have far exceeded our human capacity to grasp its significance, much less analyze it. This has led to an absolute requirement for computerized databases and analysis tools. To cope with this great quantity of data, a new scientific discipline has emerged: bioinformatics (also called biocomputing or computational biology). Bioinformatics combines the tools and techniques of mathematics, computer science, and biology in order to understand the biological significance of a variety of data. It involves the development of computational tools, not only to analyze this information but also to record, store, retrieve, and display it. The field is very new; it grew out of a volume of data that would be paralyzing without such tools.
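To make the doubling figure quoted above concrete, the following minimal sketch projects database size under the assumption of steady exponential growth with a 14-month doubling time. The function name and the projection horizons are illustrative choices, not part of any database tool, and real growth need not follow this curve.

```python
def projected_size(current_bp: float, months_ahead: float,
                   doubling_months: float = 14.0) -> float:
    """Project database size, assuming steady exponential growth.

    The 14-month default is the doubling time quoted in the text;
    constant growth at that rate is an assumption of this sketch.
    """
    return current_bp * 2 ** (months_ahead / doubling_months)


# Start from the 1.4 billion base pairs quoted for GenBank V.114.
start_bp = 1.4e9
for months in (14, 28, 60):
    size = projected_size(start_bp, months)
    print(f"{months:>3} months ahead: {size / 1e9:.1f} billion bp")
```

Under this assumption the database quadruples in a little over two years, which is the practical reason manual inspection of the data is not an option.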
What does bioinformatics accomplish? First, it provides reliable and effective electronic storage mechanisms. Second, it integrates separate datasets to yield useful combinations of data by processing hundreds of millions of data points. Third, it compares and contrasts existing data, revealing even subtle genetic patterns. In addition, it provides a means of unimpeded data delivery, filtered according to need so as not to swamp the user with unnecessary and misleading data. The use of bioinformatics shortens the feedback loop in the biological discovery process: it reduces the delay between the discovery of a result and its dissemination, so results can be put to use sooner and new results follow more quickly.
In response to the explosion in sequence information resulting from a panoply of scientific efforts, valuable databases have now been created as public repositories for DNA, protein, structure, map, and other specialized data. The Protein Data Bank (PDB) is the primary repository of 3-D atomic coordinate data files for macromolecules. SWISS-PROT and the Protein Information Resource (PIR) are two of the oldest and best protein databases. The National Center for Biotechnology Information (NCBI) provides the Entrez server, through which GenBank protein sequence data, as well as data from the European Molecular Biology Laboratory (EMBL), the DNA Data Bank of Japan (DDBJ), PIR, SWISS-PROT, and the PDB, are easily accessible. The NCBI Entrez server also provides access to non-redundant nucleotide sequences from GenBank, EMBL, and DDBJ. Examples of specialized databases from NCBI include the Database of Expressed Sequence Tags (dbEST) and the Database of Sequence Tagged Sites (dbSTS). GeneMap99 contains the most comprehensive human physical mapping information.
One of the most common bioinformatics tasks is searching a sequence database with a query sequence of interest. dbEST is the fastest growing database, with over 3 million entries (3,340,558 as of 11/26/1999). If sufficient similarity is observed between the query sequence and a sequence of known function, an inference of homology is justified. In other words, two genes that display significant similarity are presumed to share a common evolutionary history. FASTA was the first widely used program for database sequence similarity searches. The Basic Local Alignment Search Tool (BLAST) is another popular program, with improved overall search speed and sensitivity. Sometimes the query sequence is not particularly similar to any single protein in the database but still shares considerable similarity with a family of proteins. Deriving a representative consensus sequence for a family of proteins requires multiple sequence alignment. PileUp (from the GCG package), MSA, and CLUSTALW are often used for multiple sequence alignment.
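The idea of scoring similarity between a query and each database sequence can be sketched with a small dynamic-programming routine. This is a plain Smith-Waterman local alignment score, not the heuristic used by FASTA or BLAST, and the scoring values (match 2, mismatch -1, gap -2) and the toy database entries are arbitrary assumptions chosen for illustration.

```python
def local_alignment_score(query: str, subject: str,
                          match: int = 2, mismatch: int = -1,
                          gap: int = -2) -> int:
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    Only the score is returned, so one previous row of the
    dynamic-programming matrix is all that needs to be kept.
    """
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Hypothetical three-entry "database", ranked by score against a query.
database = {
    "seq_a": "GGACGTGG",
    "seq_b": "TTTTTTTT",
    "seq_c": "ACGTACGT",
}
query = "TTACGTTT"
ranked = sorted(database,
                key=lambda name: local_alignment_score(query, database[name]),
                reverse=True)
print(ranked)
```

A real search would also attach a statistical significance to each score, since a raw score alone does not say whether the similarity is better than chance.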
Large-scale EST sequencing has the advantage of rapidly generating large amounts of sequence data at relatively low cost, but it also presents challenging data analysis problems: redundancy, inaccuracy, incompleteness, and sheer volume. One way to address these problems is to group sequences into a unique set of clusters. The sequences are compared with each other, and all sequences that have a statistically significant overlap are placed into a single group. An assembly stage adds two main benefits: first, it produces contiguous consensus sequences, which hide EST redundancy, and second, it should improve the length and quality of the gene reconstructions beyond what is available from any one EST (usually a single-pass read). Examples of clustering and assembly programs include BLAST- and FASTA-based scripts and specialized programs such as Phrap, CAP3, and word-based d2 clustering.
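The grouping step described above can be illustrated with a toy single-linkage clustering. In this sketch two reads are linked when they share at least a given number of k-letter words, which is a crude stand-in for a statistically significant overlap and is not the d2 measure or the method of any program named here. The word length, threshold, and example reads are assumptions.

```python
from itertools import combinations


def kmers(seq: str, k: int) -> set:
    """All distinct k-letter words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_reads(reads, k=8, min_shared=3):
    """Single-linkage clustering of reads by shared k-mer words.

    Returns a sorted list of clusters, each a list of read indices.
    """
    parent = list(range(len(reads)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    words = [kmers(r, k) for r in reads]
    for a, b in combinations(range(len(reads)), 2):
        if len(words[a] & words[b]) >= min_shared:
            parent[find(a)] = find(b)

    clusters = {}
    for i in range(len(reads)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Made-up reads: 0 overlaps 1, 1 overlaps 3, and 2 is unrelated.
reads = [
    "ACGTACGGTCAGTCAA",
    "GGTCAGTCAATTGCCA",
    "TTTTTTTTCCCCCCCC",
    "AATTGCCAGGATCCGA",
]
print(cluster_reads(reads, k=4, min_shared=3))
```

Because linkage is transitive, reads 0 and 3 end up in the same cluster even though they share no words directly. That is the behavior that lets an assembly stage extend a consensus beyond the length of any single EST.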
What is the practical significance of the analysis capabilities just described? First, it is impossible to make useful sense of this evolving genetic information without them. The human brain cannot assimilate the millions of data points available without help from sophisticated and powerful computing capacity. Any genomics program without these capabilities will be severely handicapped, if not paralyzed. Second, nature is not known to be simple. As described above, all genome maps being worked on today are essentially being pieced together, bit by bit. Redundancies, partial data, and inaccurate data are all embedded in the databases. It is reasonable to predict that the most impactful and accurate genetic discoveries will come from a composite view of genotype/phenotype relationships built from thousands of data points across many species and by many methods. For example, on November 23, 1999, GenBank celebrated the completion and deposition of one billion base pairs of human genomic sequence. It is estimated that only about 10% of the human genome specifies protein-coding sequence. Of 92,497 human UniGene clusters (as of 11/29/1999), only 10,589 contain at least one known gene. Undoubtedly, the non-coding regions serve other purposes, such as regulation of protein synthesis. Prediction programs are necessary to identify promoters, motifs, and even coding regions. For motifs and patterns, there are Pfam, PROSITE, and BLOCKS. For gene prediction, GeneID, GeneParser, GRAIL, GENSCAN, and GeneMark are some of the commonly used programs. Other specialized tools include SignalP for detection of signal peptides and their cleavage sites. One natural extension of EST sequencing is to map those ESTs to existing physical maps. RHMAPPER and rhmap3 are two of the programs used most often for building radiation hybrid maps.
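As a toy illustration of what locating a candidate coding region means, the sketch below scans the forward strand for open reading frames that run from an ATG to the next in-frame stop codon. It is far simpler than the statistical models behind programs such as GENSCAN or GeneMark and ignores introns, the reverse strand, and codon usage. The function name, the minimum-length cutoff, and the example sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3):
    """Naive forward-strand ORF scan.

    Returns (start, end, sequence) tuples with 0-based start and
    exclusive end; min_codons counts codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ATG in this frame
    return orfs


# Made-up sequence containing one short ORF in the third reading frame.
print(find_orfs("CCATGAAACCCGGGTAACC"))
```

Since most of a mammalian genome is non-coding and genes are interrupted by introns, a scan like this produces many false candidates on genomic sequence, which is why the dedicated prediction programs listed above are needed.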
Mammals all have a highly conserved genome size and presumably share most of their ~100,000 genes; comparative mapping makes it clear that even gene order is conserved in mammals far more than we would have expected. The human gene map is already dense, with over 30,000 genes mapped. Comparative analysis will enable us to take advantage of the dense human map to map pig genes. However, comparative genome analysis also presents interesting challenges to bioinformatics, since it involves integrating information from a diverse range of species, including sequences, markers, Quantitative Trait Loci (QTL), and maps. One of the best places to view the increasing volume of information is the set of Mammalian Homology links established within the Mouse Genome Database (MGD) at the Jackson Laboratory. A search for homologies between human and pig returned 87 matching items as of 11/30/1999.
DEKALB CHOICE GENETICS has recognized that genomics and bioinformatics are critical functions and resources for genetic selection and has built an infrastructure that includes network and computing capabilities and database servers. In addition to an in-house bioinformatics team with skills ranging from programming and database management to statistics and biology, partnerships have been established with academic and industry leaders in bioinformatics. The technologies, methodologies, and available information are now being applied directly to learning about the swine genome. Significant gains are being made in understanding the genetic control of valuable production traits such as feed utilization, meat quality, and litter size. This will lead to faster and more dramatic improvement in performance and health through breeding selection programs. Future challenges in bioinformatics include finding new and faster approaches to deal with the volume and complexity of data, and providing researchers with better access to analysis and computing tools in order to advance understanding of genetic information and its effect on phenotypes.