Gene Discovery and Functional Genomics in the Pig
Tuggle, C. K. (1), R. S. Prather (2), M. B. Soares (3), T. Casavant (4), D. Pomp (5), M. F. Rothschild (1), W. Beavis (6)
(1) Department of Animal Science, Iowa State University, 2255 Kildee Hall, Ames, IA 50011
(2) Department of Animal Science, 162 ASRC, University of Missouri-Columbia, Columbia, MO 65211
(3) Department of Pediatrics, University of Iowa, 451 Eckstein Med. Research Bldg, Iowa City, IA 52242
(4) Department of Electrical and Computer Engineering, University of Iowa, Iowa City, IA 52242
(5) Department of Animal Science, University of Nebraska, Lincoln, NE 68583
(6) National Center for Genomic Resources, Santa Fe, NM
Summary
Advances in gene mapping and genomics in farm animals have been considerable over the past decade. Medium-resolution linkage and physical maps have been reported, and specific chromosomal regions and genes associated with traits of biological and economic interest have been identified. We have reached an exciting stage in gene identification, mapping and quantitative trait locus discovery in pigs, as new molecular information is accumulating rapidly. Significant progress has been made by identifying candidate gene associations and low-resolution regions containing quantitative trait loci (QTL). However, we are still disadvantaged by the lack of tools available to efficiently use much of this new information. For example, current pig maps are neither of high enough resolution nor sufficiently informative at the comparative level for positional candidate gene cloning within QTL regions. In addition, the study of biological mechanisms underlying economically important traits such as reproduction is limited by the lack of molecular resources. This is especially important, as reproduction is very difficult to improve genetically by classical breeding methods because of its relatively low heritability and the high expense of data collection. Thus, an improved understanding of porcine reproductive biology is of crucial economic importance, yet reproductive processes are poorly characterized at the molecular level. Recently, new methodologies have been brought to bear on pig molecular biology to accelerate genetic improvement in pigs. Several groups are developing molecular information in the pig, and the total GenBank sequence entries for porcine expressed genes have recently topped 100,000. Our Midwest EST Consortium has produced cDNA libraries containing the majority of genes expressed in major female reproductive tissues, and we have deposited nearly 15,000 gene sequences into public databases.
These sequences represent over 8,900 different genes, based on sequence comparison among these data. Furthermore, we have developed computer software to automatically extract sequence similarity of these pig genes with their human counterparts, as well as the mapping information of these human homologues. Within our data set, we have identified nearly 1,500 pig genes with strong similarity to mapped human genes, and we are in the process of mapping 700 of these genes to improve the human-pig comparative map. This work and the complementary work of others can now be used to more rapidly understand and identify the genes controlling reproduction, so that genetic improvement of reproduction phenotypes can accelerate.
Introduction
The long-term goal of genetic mapping in the pig is the discovery of quantitative or economic trait loci (QTL, ETL). Clearly, the detailed mapping information available from the Human Genome Project will become extremely useful for ETL positional cloning success. Identification of the genomic region in well-defined human chromosomes homologous to a pig ETL-containing region will identify numerous positional candidate genes for the trait. Development of comparative links between the pig and human/mouse maps is therefore critical to this work. Unfortunately, comparative maps are limited in the pig. Results of comparative mapping have come from two approaches. Bi-directional ZOO-FISH painting results suggest that large regions are conserved between the pig and human (Goureau et al., 1996). However, mapping of individual genes suggests that the gene order between species may be either conserved (Sun et al., 1999a) or quite divergent (Larsen et al., 1999; Sun et al., 1999b). A highly developed comparative map is therefore a necessity, as human gene order is not a good predictor of pig gene order.
Pig Reproductive Biology
The reproductive process is central to pig production efficiency. Unfortunately, potential conceptuses are lost during the first month of gestation (Perry, 1954). There are several areas in which the number and quality of offspring from a given mating could be increased. These areas (tissues involved) include: 1) increasing the ovulation rate (ovary, hypothalamus and pituitary); 2) improving fertilization and improving the quality of the conceptus (ovary, conceptus); 3) improving the responsiveness of the dam to the conceptus (uterus, placenta); and 4) understanding the onset of gene expression specific for organogenesis (embryo/fetus). Furthermore, with the likely development of cell culture technologies to clone manipulated pig embryos, it will be possible to use homologous recombination to make specific alterations. While alteration of known genes such as myostatin or MC4R might have significant effects on agricultural productivity, currently unknown genes that play a central role in the reproductive process cannot be altered unless they are first identified. Thus, basic information must be gathered to identify these genes. The evaluation of gene expression differences offers powerful new opportunities for gene discovery. Reports of gene expression changes associated with reproductive phenotypes are mostly gene specific (e.g. Kaminski et al., 1997; Li et al., 1998; Yelich et al., 1997a, b), primarily due to the paucity of resources available for large-scale and whole-genome analysis. While the advent of differential display PCR has enabled preliminary investigations of gene expression changes on a whole-tissue level (e.g. Li et al., 1996; Clouscard-Martinato, 1998), the methods are laborious and poorly suited to large-scale analysis of various time points, tissue sources and treatments.
Status of Pig Genome Maps
At present, over 2,000 markers and genes have been mapped in pigs (Ellegren et al., 1994; Archibald et al., 1995; Marklund et al., 1996; Rohrer et al., 1996), though not all have been published. The most extensive linkage map has over 1,100 loci (Rohrer et al., 1996), but consists almost entirely of anonymous markers, which cannot be linked to human genome information. The new PiGMaP genetic linkage map (Archibald et al., in preparation) will contain over 650 loci, including nearly 250 genes. The Iowa State University (ISU) and University of Nebraska-Lincoln (UNL) laboratories have supplied over 100 genes to this linkage map. The physical map has improved due to the use of somatic cell hybrids and fluorescence in situ hybridization (Yerle et al., 1996a; 1997). Both genes and markers have now been cytogenetically mapped (522 genes and 374 markers as of 14 Nov 2000, see: http://www.toulouse.inra.fr/lgc/pig/cyto/cyto.htm). Integration of the pig physical and genetic linkage maps is now easier with the development of a Radiation Hybrid (RH) panel (Yerle et al., 1996b; Hawkins et al., 1999). The main advantage of RH mapping over established technologies is the high resolution possible without the need for polymorphisms. This last point is important for mapping sequences with the low rates of polymorphism seen in the coding portion of genes critical for conserved comparative loci (O’Brien et al., 1993). Through the use of common markers, RH mapping of comparative loci will allow significant integration of pig linkage and cytogenetic maps, as well as access to the gene order information available for the human genome. The work of Hawkins and colleagues has shown that a recently developed pig RH panel is highly useful (Hawkins et al., 1999). However, the RH map is currently made up of approximately 965 microsatellites and 245 genes, limiting the use of human genome information with this RH map.
Further, there are many small linkage groups not connected by the RH panel work, so additional loci need to be mapped on the RH panel. These loci can and should be primarily composed of comparative markers, as significant sequence for such conserved loci is now available (see below).
Genetic Improvement of Pig Reproductive Traits
Selection for litter size is economically important, but experimental results are disappointing. Realized heritability is low (Ollivier and Bolet, 1982; Bolet et al., 1989; Lamberson et al., 1991) and response to selection for different components of litter size is moderate (Neal et al., 1989; Johnson et al., 1999). Detection of QTL for reproduction traits in pigs has been fairly limited to date. Several groups have identified reproduction QTL on pig chromosome 8 (SSC8), although at different chromosomal locations (Wilkie et al., 1999; Milan et al., 1998; Rohrer et al., 1999). Reproduction QTL on SSC4, 6 and 7 have been suggested (Wilkie et al., 1999; Milan et al., 1998), while the Nebraska group has reported several QTL for a variety of reproductive traits (Cassady et al., 2001). Candidate gene analysis for reproduction has also shown merit in genetic improvement. Markers in the estrogen receptor (ESR) are significantly associated with litter size (Rothschild et al., 1996; Short et al., 1997). In addition, a marker in the prolactin receptor (PRLR) locus was significantly associated with litter size (Vincent et al., 1998). On the other hand, other studies of trait association with these candidate genes have not found similar results (Linville et al., 2001), suggesting that either marker linkage disequilibrium with the true QTL or background gene effects play a role. Detection of QTL and/or significant candidate genes is required to implement marker-assisted selection to enhance genetic improvement. While some QTL are being detected, the relative lack of loci with effects on litter size and ovulation rate may be a function of the low heritability for these traits and hence the low statistical power afforded to QTL detection. Other experimental means are now necessary for identification of individual genes influencing reproduction.
One implicit goal in the effort to apply molecular genomic information to genetic improvement is the development of tools for marker-assisted selection (MAS) and marker-assisted introgression (MAI). MAS and MAI will be most effective if the causal mutation controlling improved reproductive traits is identified and used as the marker. Thus for the most efficient application of MAS, identification of the individual genes controlling traits is a prerequisite.
New Systematic Approaches to Exploring Biology and Identifying Important Genes
The Human Genome Project (HGP) has been extremely successful both in advancing information regarding the human genome and in supporting development of technologies for the study of genomes at the molecular level. These latter methods can be used to advance our knowledge in “map-poor” species such as the pig.
a) Gene discovery using large-scale expressed sequence tag (EST) analysis
The generation of Expressed Sequence Tags (ESTs) from cDNA clones randomly picked from libraries constitutes an efficient and widely recognized strategy to identify genes (Adams et al., 1991; Schuler et al., 1996). Advantages of cDNA characterization are: 1) most cDNAs are single copy and make good molecular sequence probes; 2) cDNAs span large regions and therefore can help in forming contigs; 3) cDNA sequences are often evolutionarily conserved and thus allow cross-species comparative studies, and 4) they are candidate loci for disease or quantitative trait genes.
b) Analysis of gene expression in parallel - New technologies produce a paradigm shift in biology
New technology to look, in parallel, at the expression of many genes in complex biological samples has emerged on the heels of the first complete sequencing of a eukaryote, yeast (reviewed in Lander, 1999). For the first time, using microarray technology (described below), it became possible to look at the entire pattern of changes in expression during normal cellular processes (e.g., the cell cycle), or the response of an organism to environmental changes (Schena et al., 1995, 1996; Brown and Botstein, 1999). Beyond yeast, such parallel analysis of gene expression has allowed new insights into human gene expression, such as differences between cancerous and normal cells (DeRisi et al., 1996), or the response to serum starvation by fibroblasts (Iyer et al., 1999). Thus this exploratory paradigm can be used in the absence of complete genome information, yet is extremely powerful due to the high-throughput nature of the technology, the relatively low cost, and the rapidity of the results obtained. In microarray technology, large numbers of partially sequenced cDNA clones are individually placed (printed) in defined patterns onto a solid support (usually a glass slide). Two mRNA populations to be compared are labeled, each with a different fluorescent dye, mixed, and hybridized to the microarray. After stringent washings, detection of the remaining fluorescent material at each spot is performed by a microarray reader, which can accurately estimate the level of each fluorescent dye at each spot. Thus it is possible to determine the level of each mRNA in the two samples at the same time. The data are often represented as the ratio of expression level detected between the two mRNA populations used. The DNA microarray technology will be highly valuable in studying agricultural problems. Parallel analysis of genes is designed to be very high throughput; thus costs are low relative to current methods to analyze gene expression.
Furthermore, microarray analysis is likely to be more biologically relevant than the old paradigm of reductionism as it uncovers new biological connections between genes and biochemical pathways. This may be especially important for biological events that are specific to the pig and that have not been studied in model species.
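The two-dye comparison described above is commonly summarized as a log ratio per spot. The following is a minimal sketch of that calculation, not the analysis used in any of the cited studies; the clone names, the intensity values and the `floor` parameter are hypothetical and are included only for illustration.

```python
import math

def log2_ratio(sample_a, sample_b, floor=1.0):
    """Return log2(sample_a / sample_b) for one spot.

    Intensities are floored (an assumption of this sketch) so that a
    zero or negative background-corrected value cannot break the log.
    """
    a = max(sample_a, floor)
    b = max(sample_b, floor)
    return math.log2(a / b)

# Hypothetical background-corrected intensities per spot:
# (dye used for mRNA population A, dye used for mRNA population B)
spots = {
    "clone_001": (1200.0, 300.0),  # higher in population A
    "clone_002": (450.0, 450.0),   # unchanged
    "clone_003": (80.0, 640.0),    # higher in population B
}

ratios = {clone: log2_ratio(a, b) for clone, (a, b) in spots.items()}

for clone, value in ratios.items():
    print(f"{clone}\t{value:+.2f}")
```

A log scale is used here so that a fourfold increase and a fourfold decrease are symmetric around zero, which makes the two directions of change directly comparable.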
Large-Scale EST Sequencing Projects World-Wide in Pigs
The first report of an EST project in pigs (Tuggle and Schmitz, 1994), including the use of clone arrays to identify tissue-specific expression, was published soon after the seminal EST paper in human genomics (Adams et al., 1991). The results from the first large-scale pig EST project were reported in 1996 (Winterø et al., 1996). This project in Denmark has generated 1,389 sequences to date, while a group in Italy has published initial studies (Davoli et al., 2000) in a project which has currently generated over 400 sequences from longissimus dorsi tissue cDNA libraries. This latter effort is part of GENETPIG, a European consortium whose goal is to map 700 genes on the physical map (Hatey et al., 2000). In the late 1990s and early 2000s, several groups initiated projects to generate large-scale EST data through single-pass sequencing of porcine cDNAs. The largest project is underway at USDA-MARC in Clay Center, Nebraska. The goals of this project are to sequence randomly selected clones from normalized libraries produced from a) an equal mixture of mRNA from embryos at days 11, 13, 15, 20 and 30 of gestation; and b) an equal mixture of testis, ovary, pituitary, hypothalamus, placenta and endometrium. Including 1,182 additional sequences from a separate elongating embryo library project, the group at MARC (S. Fahrenkrug, T. Smith and co-workers) has deposited over 67,000 ESTs as of November 2001. The next largest deposit of EST sequence has been made by the Midwest EST Consortium, which will be discussed in detail below. In other unpublished projects, A. Rink and co-workers have deposited over 6,000 ESTs from a variety of immune tissues and cell types; A. Caetano, D. Pomp and co-workers have deposited over 5,500 sequences from an ovarian follicle cDNA library; and N. Hamasima and co-workers have deposited nearly 2,300 sequences obtained from pig backfat libraries (see Table 1).
Overall, if one looks at GenBank mRNA entries that list Sus scrofa as the source organism, there are 100,606 entries as of November 13, 2001. If one excludes EST entries, the remaining 1,512 are made up of entries for named genes. The laboratories listed in Table 1 account for all but about 1,200 EST and gene mRNA sequences. These latter 1,200 sequences are the result of a large number of small-scale efforts (< 100-200 entries) around the world.
Table 1. Porcine Expressed Sequence Tag (EST) and mRNA Entries in GenBank Top 100,000
Results
Progress in Midwest Consortium EST Project
Over the past two years, we have developed 21 cDNA libraries derived from porcine anterior pituitary, conceptus, fetus, hypothalamus, ovary, and placenta collected at various stages of gestation or estrus (Table 2). These libraries are, for the most part, highly complex and useful for sequencing. Furthermore, each of these individual libraries is specifically tagged through the use of specific primer sequences during library creation. Thus clones derived from a particular tissue or stage of development can be recognized as such even when libraries are mixed during further manipulations such as normalization and/or subtraction (Bonaldo et al., 1996). This “tissue-tagging” ability may be of importance, as the largest current EST data set, produced at USDA-ARS-MARC, cannot provide this type of tissue-of-origin data. As of November 1, 2001, we have generated a total of 14,486 sequences from the 3’ end of randomly selected cDNA clones from the libraries described in Table 2; the project goal is 20,000. Nearly all of these sequences (96%) have been submitted to GenBank dbEST. These sequences represent 8,859 different genes (clusters) based on clustering analysis of 14,105 sequences, for a gene discovery rate of 63%. Further, the average size of the clusters is very small. Over 89% of clusters have 1 or 2 members, indicating that the complexity of the available clone population in these libraries is very high.
Table 2. Library Information and Current Sequence totals
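The gene discovery rate quoted in the text is the number of clusters divided by the number of clustered sequences. The short sketch below reproduces that arithmetic from the two totals given above, and shows one way the share of small clusters could be computed. The function names and the toy EST-to-cluster assignment are hypothetical; the real cluster membership is not given in this paper.

```python
from collections import Counter

def gene_discovery_rate(n_clusters, n_sequences):
    """Fraction of clustered sequences that represent a distinct gene (cluster)."""
    return n_clusters / n_sequences

def small_cluster_fraction(cluster_of_est, max_size=2):
    """Fraction of clusters that contain at most `max_size` ESTs."""
    sizes = Counter(cluster_of_est.values())  # cluster id -> number of member ESTs
    small = sum(1 for size in sizes.values() if size <= max_size)
    return small / len(sizes)

# Totals taken from the text: 8,859 clusters from 14,105 clustered sequences.
rate = gene_discovery_rate(8859, 14105)
print(f"gene discovery rate: {rate:.0%}")

# Toy assignment (hypothetical): c1 has 2 ESTs, c2 has 1, c3 has 3.
toy = {"est1": "c1", "est2": "c1", "est3": "c2",
       "est4": "c3", "est5": "c3", "est6": "c3"}
print(f"clusters with 1 or 2 members: {small_cluster_fraction(toy):.0%}")
```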
To check the utility of these sequence data with respect to available sequences, we performed a BLAST analysis of our sequences against the entire GenBank data sets for both pig and human entries. Of the current EST data set, 50% are novel relative to the pig EST database, as 4,405 out of the 8,859 clusters had a BLAST score less than 50; and 36% are novel relative to the human EST database. Thus many of these new sequences likely represent novel genes. A caveat should be addressed at this point, however. Pig EST sequences submitted by others are primarily 5’ end sequences, while our sequences are produced from the 3’ end. Thus, two sequences may actually represent the same gene, even though no sequence similarity exists between the two EST entries. We are also interested in using these sequences for efficient RH mapping of loci relevant to reproductive traits. RH mapping requires the design of PCR primers for placement of these genes on the RH map. Selecting pig sequences homologous to genes already on the human RH maps requires sufficient sequence similarity to the human genome, and we have determined that clear matches to human genes do exist within our current data set. Over 14% of these ESTs have matches (BLAST score >200) to human genes/ESTs. Thus this EST data set will be useful for extensive comparative mapping between pig and human. To select the genes most appropriate for mapping, we have developed a set of computer scripts that automatically perform batch BLAST analysis against human sequences for all our sequences and collect information on the human matches obtained. This information includes the quality of the match, the name of the human match locus, the cytogenetic and RH map location for this gene, a quality measure for human localization, as well as any porcine mapping information if available.
This information is then used to develop lists of porcine ESTs that have a strong match to a human locus for which there is unambiguous, consistent mapping information. Using this software, we have identified 1,486 ESTs with significant sequence similarity to human genes with RH and cytogenetic locations that agree (Table 3). Further, using available genome information for each human chromosome (Hsap), we have estimated the coverage of these 1,486 hits across the human genome, in an attempt to determine the coverage of the existing pig EST database hits to the human genome. On average, we have an EST BLAST hit every 9.4 centiRay (cR), which is a measure of distance between loci roughly analogous to centiMorgan units used in linkage mapping. There are several chromosomes with relatively dense coverage (on Hsap 1, 14 and 19, there are about 4-5 cR between hits), and several chromosomes with relatively light coverage (Hsap 18 and 21 have about 26 cR between hits on average). When coverage is estimated by looking at the number of BLAST hits per gene on the RH map, the coverage density seems more even. On average, there is a hit every 21 genes mapped, and the range is 15 to 30 (Table 3). As our goal is to map about 50% of these hits, or 700 ESTs, we expect that our final coverage should be close to one comparative link (mapped gene in the pig) every 40-50 human genes. Sequence data for these selected loci are now being used to design primers for PCR-based RH mapping. Currently, we have mapped approximately 70 genes to the RH map; our current pipeline from gene selection -> PCR primer selection -> gene mapping has a throughput of approximately 50 genes per month. Thus we expect to complete our goal of mapping 700 genes by the end of 2002. Finally, an important part of the genome project is dispersal of information to interested scientists and the lay public.
Summary information tables and a searchable EST database have been established at the ISU website: http://pigest.genome.iastate.edu/.
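The selection step and the two estimates described above (spacing between hits and time to reach the mapping goal) can be sketched as follows. This is an illustration under stated assumptions, not the consortium's scripts: the `HumanHit` record, its field names, the example hits and the 940 cR map length are all hypothetical, and "locations that agree" is simplified here to agreement at the chromosome level. Only the score threshold (>200), the goal of 700 genes, the 70 genes already mapped and the rate of 50 genes per month come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class HumanHit:
    """One pig EST and its best human match (hypothetical record layout)."""
    pig_est: str
    human_locus: str
    blast_score: float
    cyto_chromosome: Optional[str]  # human cytogenetic assignment, if known
    rh_chromosome: Optional[str]    # human RH map assignment, if known

def select_mapping_candidates(hits: List[HumanHit], min_score: float = 200.0) -> List[HumanHit]:
    """Keep strong matches whose cytogenetic and RH assignments agree."""
    return [
        h for h in hits
        if h.blast_score > min_score
        and h.cyto_chromosome is not None
        and h.cyto_chromosome == h.rh_chromosome
    ]

def mean_spacing_cr(map_length_cr: float, n_hits: int) -> float:
    """Average distance in centiRay between hits along a map of given length."""
    return map_length_cr / n_hits

def months_to_goal(goal: int, mapped: int, per_month: float) -> float:
    """Months needed to reach the mapping goal at a constant throughput."""
    return (goal - mapped) / per_month

# Hypothetical hits, for illustration only.
hits = [
    HumanHit("pig_est_A", "LOCUS_1", 412.0, "1", "1"),    # strong, consistent
    HumanHit("pig_est_B", "LOCUS_2", 350.0, "14", "19"),  # locations disagree
    HumanHit("pig_est_C", "LOCUS_3", 95.0, "7", "7"),     # match too weak
    HumanHit("pig_est_D", "LOCUS_4", 280.0, None, "3"),   # no cytogenetic data
]
candidates = select_mapping_candidates(hits)
print([h.pig_est for h in candidates])

# Hypothetical chromosome: 100 hits spread over 940 cR.
print(f"{mean_spacing_cr(940.0, 100):.1f} cR between hits")

# Figures from the text: 700-gene goal, about 70 mapped, about 50 per month.
print(f"{months_to_goal(700, 70, 50):.1f} months remaining")
```

At the stated throughput, the remaining 630 genes take a little over a year, which is consistent with the expectation of finishing by the end of 2002.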
Conclusions
This work and the complementary work of others have now produced a worldwide total of over 100,000 EST sequences from a large number of cell types and tissues. While many tissues remain unexplored, this data set can now be used to more rapidly understand and identify the genes controlling recalcitrant traits such as reproduction, so that genetic improvement can accelerate. Paramount in this ongoing effort are comparative mapping, to use all available genome information from related species such as human and mouse, and the development of tools for high-throughput transcriptional profiling, so that gene functions and genetic networks can be defined and intelligently explored. Additional sequencing will also be useful; a French collaboration recently announced a new effort to sequence 100,000 ESTs from a normalized, multi-tissue library (F. Hatey, personal communication).
Table 3. Coverage of Human Genome by BLAST hits to Pig ESTs