Bioinformatics and Swine Genetic Improvement
Dr. Jeff Veenhuizen
DeKalb Choice Genetics
Recent years have seen an explosive growth in biological
data around the genetic makeup of mammals. Such growth is both immensely
exciting and challenging. Among the challenges is the need to learn an entire
language of new genetic descriptors that some geneticists haven’t even heard of
yet. The kinds of data now being collected include bibliographic records,
nucleotide and protein sequences and structures (coordinates), protein
families, metabolic pathways, sequence motifs or patterns, and genetic and physical
maps. Terms such as EST, QTL, PDB, PIR, dbSTS, and PROSITE, to name a few, are
used commonly in some circles. The purpose of this paper is to identify and discuss
one of the emerging technologies, bioinformatics, and its application to swine
genetic improvement. In doing so, it is inevitable that several of these
acronyms and terms will be used. Many of them are mentioned
simply to show that they exist; detailed explanation is left to other
communications.
The practice of achieving improvements in animal
production through breeding selection has been used in the swine industry for
many years. The data used in the breeding selection process has been dominated
by observable phenotypes. The explosive growth in data about genetic makeup is
rapidly making it possible to rely more heavily on genetic information in
breeding schemes. At the top of the list of challenges in this area is the
constantly increasing quantity of nucleotide sequence produced by
large sequencing projects. The contents of nucleotide databases are doubling in
size approximately every 14 months. The latest release of GenBank (V.114)
exceeded 1.4 billion base pairs. Not only is the volume of sequence data rapidly
increasing, but the number of characterized genes from many organisms, and of
protein structures, also doubles about every two years. The staggering volume of
molecular data and its cryptic and subtle patterns have far exceeded our human
capacity to grasp its significance, much less analyze it. This has led to an
absolute requirement for computerized databases and analysis tools. To cope
with this great quantity of data, a new scientific discipline has emerged:
bioinformatics (also called biocomputing or computational biology).
Bioinformatics combines the tools and techniques of mathematics, computer
science and biology in order to understand the biological significance of a
variety of data. It involves the development of computational tools, not only
to analyze this information but to provide mechanisms to record, store,
retrieve, and display it. The field is very new; it grew out of a volume of
data so large that research would be paralyzed without these tools.
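The doubling time quoted above can be turned into a growth rate with a little arithmetic. The following is a minimal sketch that assumes growth is steady and exponential, which is a simplification; the function names are illustrative and not part of any database tool. Under that assumption, a 14-month doubling time works out to roughly a 1.8-fold increase per year.

```python
def growth_factor(months: float, doubling_months: float = 14.0) -> float:
    """Fold increase after `months`, assuming steady doubling every `doubling_months`."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months / doubling_months)


def projected_size(current_size: float, months: float,
                   doubling_months: float = 14.0) -> float:
    """Project a database size forward under the same steady-doubling assumption."""
    return current_size * growth_factor(months, doubling_months)


if __name__ == "__main__":
    # One year and five years ahead at the 14-month doubling time cited in the text.
    print(f"fold increase per year:    {growth_factor(12):.2f}")
    print(f"fold increase per 5 years: {growth_factor(60):.1f}")
```

Passing a different `doubling_months` (for example 24 for the two-year doubling of characterized genes) gives the corresponding figure for the other growth rates mentioned.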
What does Bioinformatics accomplish? First, it provides reliable and effective
electronic storage mechanisms. Second, it integrates separate datasets to yield
useful combinations of data by processing hundreds of millions of data points.
Third, it compares and contrasts all existing data, revealing even subtle
genetic patterns. In addition, it provides a means of unimpeded data delivery,
filtered according to needs so as not to swamp the user with unnecessary and
misleading data. The use of bioinformatics shortens the feedback
loop in the biological discovery process: it reduces the delay between
the discovery of a result and its dissemination, so results can be
used sooner and turn over faster.
In response to the explosion in sequence information
resulting from a panoply of scientific efforts, valuable databases now have
been created as public repositories to collect DNA, protein, structure, map,
and other specialized data. The Protein DataBank (PDB) is the primary
repository of 3-D atomic coordinate data files for macromolecules. SWISS-PROT
and the Protein Information Resource (PIR) are two of the oldest and best
protein databases. The National Center for Biotechnology Information (NCBI)
also provides the Entrez server through which GenBank protein sequence data, as
well as data from the European Molecular Biology Laboratory (EMBL), the DNA
DataBank of Japan (DDBJ), PIR, SWISS-PROT, and the PDB are easily accessible.
The NCBI Entrez server also provides access to non-redundant nucleotide
sequences from GenBank, EMBL, and DDBJ. Examples of specialized databases
include the Database of Expressed Sequence Tags (dbEST) and the Database of
Sequence Tagged Sites (dbSTS), both from NCBI. GeneMap99 contains the most
comprehensive human physical mapping information.
One of the most common bioinformatics tasks is the
search of a sequence database with a query sequence of interest. dbEST is the fastest-growing database, with
over 3 million entries (3,340,558 as of 11/26/1999). If sufficient similarity is observed between the query sequence
and a sequence of known function, an inference of homology is justified. In other
words, two genes that display significant similarity are presumed to share a common
evolutionary history. FASTA was the first widely used program for database
sequence similarity search. Basic Local Alignment Search Tool (BLAST) is
another popular program with improved overall search speed and sensitivity.
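To make the idea of scoring similarity between a query and a database sequence concrete, here is a minimal sketch of Smith-Waterman local alignment, the exact dynamic-programming method that heuristic search programs approximate for speed. It is not how FASTA or BLAST work internally, and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
from typing import Tuple


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> Tuple[int, str, str]:
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, top, bottom = smith_waterman("TTTACGTTT", "GGACGGG")
    print(score, top, bottom)
```

A raw score alone does not establish homology. Real search programs also report a statistical measure of how likely such a score is to arise by chance in a database of that size, which this sketch omits.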
Sometimes the query sequence may not be particularly similar to a single protein
in the database, but might still share considerable similarity with a family of
proteins. Deriving a representative (consensus) sequence for a family of proteins
requires multiple sequence alignment.
PileUp (from the GCG package), MSA, and
CLUSTALW are often used for multiple sequence alignment.
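Once a multiple alignment exists, a consensus can be read off column by column. The sketch below assumes the sequences have already been aligned to equal length by a program such as those named above, with "-" marking gaps; it uses a simple majority rule, which is only one of several possible consensus definitions, and the function name is illustrative.

```python
from collections import Counter
from typing import Sequence


def consensus(aligned: Sequence[str], gap: str = "-") -> str:
    """Majority-rule consensus of pre-aligned, equal-length sequences."""
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    result = []
    for column in zip(*aligned):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(residue for residue in column if residue != gap)
        if not counts:
            result.append(gap)  # column contains only gaps
            continue
        top = max(counts.values())
        # Break ties alphabetically so the output is deterministic.
        result.append(min(r for r, n in counts.items() if n == top))
    return "".join(result)


if __name__ == "__main__":
    print(consensus(["ACGT", "ACGA", "TCGT"]))
```

Profile-based methods keep the full per-column residue frequencies instead of a single letter, which retains more of the family's variability than a plain consensus string.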
Large-scale EST sequencing has the advantage of
rapidly generating large amounts of sequence data at relatively low cost, but
also presents challenging data analysis problems in dealing with data
redundancy, inaccuracy, incompleteness, and sheer data volume. One way to
address these problems is to group sequences into a unique set of clusters. The
sequences are compared with each other and all sequences that have a
statistically significant overlap are placed into a single group. An assembly
stage adds two main benefits: first, it produces contiguous and consensus
sequences which can completely hide EST redundancy and, second, it should also
improve the length and quality of the gene reconstructions beyond that
available from any one EST (usually a single-pass read). Examples of clustering and
assembly programs include BLAST- and FASTA-based scripts and specialized
programs such as Phrap, CAP3, and word-based d2 clustering.
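The grouping step described above can be sketched as single-linkage clustering: any two sequences that overlap are joined, and groups are merged transitively. This is a toy illustration and not the method used by Phrap, CAP3, or d2 clustering. It stands in an exact suffix-prefix match of at least `min_len` bases for a statistically significant overlap, ignores reverse complements and sequencing errors, and does not perform the assembly stage.

```python
from typing import List, Sequence


def overlap_len(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def cluster(seqs: Sequence[str], min_len: int = 8) -> List[List[int]]:
    """Group sequence indices so that overlapping sequences share a cluster."""
    parent = list(range(len(seqs)))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if overlap_len(seqs[i], seqs[j], min_len) or overlap_len(seqs[j], seqs[i], min_len):
                parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Made-up reads for illustration: the first three chain together, the last stands alone.
    reads = ["ACGTACGTAA", "CGTAAGGCTT", "GGCTTACCAT", "TTTTTTTTTT"]
    print(cluster(reads, min_len=5))
```

The all-against-all comparison is quadratic in the number of sequences, which is one reason production tools rely on indexing or word-based comparisons at the data volumes discussed here.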
What is the practical significance of having the
analysis capabilities just described? First, it is impossible to make useful
sense out of this evolving genetic information without them. The human brain
cannot assimilate the millions of data points available without help from
sophisticated and powerful computing capacity. Any genomics program without
these capabilities will be severely handicapped if not paralyzed. Second,
nature is not known to be simple. As described above, all genome maps being
worked on today are essentially being pieced together, bit by bit.
Redundancies, partial data, and inaccurate data are all imbedded in the
databases. It is reasonable to predict that the most impactful and accurate
genetic discoveries will come from a composite view of genotype/phenotype
relationships gathered from thousands of data points across many species and by
many methods. For example, on November 23, 1999, GenBank celebrated the
completion and deposition of one billion base pairs of human genomic sequence.
It is estimated that only 10% of the human genome specifies protein-coding sequence.
Out of 92497 human Unigene clusters (as of 11/29/1999), only 10589 contain at
least one known gene. Undoubtedly, those non-coding regions serve some other
purposes, such as regulation of protein synthesis. In order to identify those
promoters, motifs, or even coding regions, prediction programs are necessary.
For motifs and patterns, there are Pfam, PROSITE, and BLOCKS. For gene prediction,
GeneID, GeneParser, GRAIL, GENSCAN, and GeneMark are some of the commonly used
programs. Other specialized tools include SignalP for detection of signal
peptides and their cleavage sites. One natural extension of EST
sequencing is to map those ESTs to existing physical maps. RHMAPPER and rhmap3
are two of the programs used most often for building radiation hybrid maps.
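As an illustration of how motif resources of the kind mentioned above are applied, the following sketch translates a PROSITE-style pattern into a regular expression and scans a protein sequence with it. It handles only the common syntax elements (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, and < or > anchors at the start or end of an element); it is not the official PROSITE scanning software, and the function names and the test sequence are made up for the example.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern such as 'N-{P}-[ST]-{P}.' into a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split an optional repeat count, e.g. 'x(2,4)' -> core 'x', 2, 4.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        # '[ABC]' and single residues are already valid regex as written.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(pattern: str, sequence: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched text) for every match, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # N-glycosylation site pattern applied to a made-up sequence.
    print(find_motif("N-{P}-[ST]-{P}.", "MKNGSAANPSA"))
```

Short patterns like this one match by chance quite often, so a hit is a candidate for follow-up and not evidence of function on its own.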
Mammals all have a highly conserved genome size and
presumably share most of their ~100,000 genes; comparative mapping makes it
clear that even gene order is conserved in mammals far more than we would have
expected. The human genome project is essentially complete with over 30,000
genes mapped. Comparative analysis will enable us to take advantage of the
dense human map to map pig genes. However, comparative genome analysis also
presents interesting challenges to bioinformatics, since it involves integrating
information from a diverse range of species, ranging from sequences and markers
to Quantitative Trait Loci (QTL) and maps. One of the best examples to view the
increasing volume of information is the Mammalian Homology links established
within the Mouse Genome Database (MGD) at the Jackson Laboratory. A search for
homologies between human and pig returned 87 matching items as of 11/30/1999.
DEKALB CHOICE GENETICS has recognized that genomics
and bioinformatics are critical functions and resources for genetic
selection and has built an infrastructure that includes network and computing
capabilities and database servers. In addition to an in-house bioinformatics
team whose skills range from programming and database management to
statistics and biology, partnerships have been established with academic and
industry leaders in bioinformatics. The
technologies, methodologies, and available information are now being applied
directly to learning about the swine genome. Significant gains are being made in understanding
the genetic control of valuable production traits such as feed utilization, meat
quality, and litter size. This will lead to faster and more dramatic
improvement in performance and health through breeding selection programs. Future challenges in bioinformatics include
finding new and faster approaches to deal with the volume and complexity of
data, and providing researchers with better access to analysis and computing
tools in order to advance understanding of genetic information and its effect
on phenotypes.