Generating High Quality Phenotypic Data


Sam Buttram, Ph.D.

DEKALB Choice Genetics



Genomics is no doubt the future of the genetic business.  Marker-assisted genetic evaluation will soon replace conventional BLUP as the preferred method of genetic evaluation and the quest for QTL is foremost in the minds of geneticists that would participate in this revolution.  There is much talk about technologies like genotyping, transcript profiling, DNA sequencing, gene mapping, and bioinformatics.  These technologies allow laboratories to generate data in factory-like proportions.  Unfortunately the value of this plethora of information is dependent not just upon the volume of data, but also upon its quality.  Basically, genomics is about discovering genetic markers and eventually genes that control traits of economic importance.  This requires both DNA and phenotypic data from live animals in order to make the necessary associations.  Therefore success with genomics depends largely on the ability to access accurate and complete phenotypic data and tissue from sufficiently sized populations of animals with the proper genetic structure. 

Quantitative geneticists are driven by data.  They spend much time modeling, analyzing, and interpreting data.  But seldom do they spend time concerning themselves with the quality of the data or the way in which it is collected.  Assumptions are made that the data are accurate, the pedigrees are correct, and measurement processes were consistent across all animals.  Besides, the data were collected months ago and there is no way to know otherwise.  In reality, pedigree errors run around 5-10% for a typical genetic nucleus herd that generates data by hand and manually enters it into a database.  Some are willing to accept this level of accuracy because the data are used exclusively for quantitative genetic evaluation, and BLUP is fairly robust to pedigree errors.  Times are changing, however, and marker-assisted genetic evaluation may not be as forgiving.

Obviously good data quality is important regardless of its intended use.  But how good is good enough?  This is an individual question and may depend upon how the data are to be used as well as the cost.  Certainly there is additional cost associated with higher levels of data quality and the resources required to move to the next level of accuracy can be prohibitive.  On the other hand, when data points become more valuable, as in the case of genomics research, higher levels of accuracy may be necessary and justified.  The desired level of data quality is an important question that must be answered by each individual organization.  DEKALB Choice Genetics (DCG) has made a decision to move to a new level of data quality to accommodate current genomics research and to prepare for selection using marker-assisted genetic evaluation.  The purpose of this paper is to share some of the experiences and learnings regarding the practical aspects of phenotypic data collection that have resulted from this endeavor.


Data Quality

Data quality is affected by a variety of things including the genetic structure of the population from which the data are generated, the number of animals sampled, the relevance of the traits measured, and the completeness or accuracy of the information.  This paper focuses on the latter of these, with primary attention given to how to produce phenotypic data that are complete and accurate.

Completeness is relatively straightforward and easy to monitor, but not always given the attention it deserves.  Basically, each phenotypic data point should be accompanied by a record of when, where, how, and by whom the data were collected.  These characteristics are essential for analyzing and interpreting the data as well as following up on problems.  There is a tendency to take shortcuts and skip some of this information by making the assumption that things will not change.  The fact is that things will change, and for the data to have maximum value now and in the future, it is important to have complete information.

Accuracy refers to how closely the data collected match the true values being generated by the animals.  It is not as easily measured as completeness, because we don’t always know the true value.  Measures of repeatability can be useful in evaluating data accuracy in cases where the true value is not known.  For instance, DCG periodically runs the same group of animals through the testing process twice on the same day.  Standard errors and correlations between the two measurements are then used to evaluate the process for taking weights and ultrasound measurements.  Where true values are known, error rate can be used to monitor accuracy.
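The duplicate-measurement check described above can be sketched in a few lines of Python.  This is only an illustration, not the procedure DCG uses: the function name, the backfat readings, and the choice of the duplicate-based standard error, sqrt(sum(d^2) / (2n)), are all assumptions made for the example.

```python
import math

def repeatability_summary(first, second):
    """Compare two same-day measurements taken on the same animals."""
    n = len(first)
    if n != len(second) or n < 2:
        raise ValueError("need paired measurements on at least two animals")

    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    dev_1 = [x - mean_1 for x in first]
    dev_2 = [y - mean_2 for y in second]

    # Pearson correlation between the first and second pass.
    cross = sum(a * b for a, b in zip(dev_1, dev_2))
    r = cross / math.sqrt(sum(a * a for a in dev_1) * sum(b * b for b in dev_2))

    # Standard error of a single measurement, estimated from the
    # differences between duplicates: sqrt(sum(d^2) / (2n)).
    diffs = [x - y for x, y in zip(first, second)]
    measurement_se = math.sqrt(sum(d * d for d in diffs) / (2 * n))

    return {
        "correlation": r,
        "measurement_se": measurement_se,
        "mean_difference": sum(diffs) / n,
    }

# Hypothetical backfat readings (mm) from two passes on the same five pigs.
pass_1 = [12.0, 15.5, 11.0, 18.0, 14.0]
pass_2 = [12.5, 15.0, 11.5, 17.5, 14.5]
summary = repeatability_summary(pass_1, pass_2)
```

A high correlation together with a small measurement standard error suggests the process is repeatable; a mean difference far from zero would point to drift between the two passes.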


Factors Affecting Data Quality



If building a data collection system can be compared to constructing a building, then people must be considered the foundation.  No matter how foolproof or sophisticated a data collection system is, it is doomed to fail without good people.  DEKALB Choice Genetics has been successful at adding several good people to the team in Kansas over the past several months to keep up with the increased data collection and quality assurance activities.  Some of the key things to look for are people who are detail-oriented, who have a healthy respect for processes and doing things consistently, but who also are willing to continually evaluate the processes and suggest changes for improvement.  They also need to be motivated by quality and the feeling of a job well done.  People who are looking only for the quickest and easiest way to complete the task do not make good data collectors.  Data collection is not rocket science and it does not gain the attention of the media.  It can become a mundane and difficult task, so it is important that people see how their work contributes to the overall success of the project or organization.  If people can be identified who meet these qualifications, are compensated fairly, and are given sufficient training and resources to do their job, the foundation of a successful data collection program is created.



A process is a series of tasks directed toward the completion of a particular result.  All data collection activities can be broken down into one or more processes, depending upon the length and complexity of the activity.  Processes are the basic building blocks of a good data collection system.  They provide the structure necessary to make sure that data are collected in a manner that is meaningful and consistent with project objectives and will result in data that are useful.  It is essential that processes be documented in order to ensure consistency and eliminate ambiguity.  DEKALB Choice Genetics uses a Standard Operating Procedure (SOP) to document each of the processes it uses.  The SOP is a detailed step-by-step written description of a process.  In addition to the details of the process, it includes a section on the purpose of the process, background information, and safety precautions.  An important feature of the SOP is that the people who actually perform the procedure develop it.  Of course those who have responsibility for oversight of the project must approve it, but it is not something that is handed down to the data collectors from their supervisor.  The SOP is made accessible to everyone involved with the project and people are encouraged to bring forth suggestions for changes if they feel there is a better procedure for doing something.  If indeed it is agreed that the procedure should be changed, a revised SOP is drawn up and approved to make the change official.



A training program is the mortar that holds everything together in a good data collection system.  Training is crucial both for new employees and for people taking on a new role, to ensure that data are consistent across time and people.  The most important training for the new employee is training on the procedures themselves.  The Standard Operating Procedure (SOP) offers the perfect tool for training.  SOPs are used extensively in the training program at DCG.  In fact we have an SOP that describes how to train employees on an SOP.  In addition, suppliers and/or consultants are good sources for training.  Continuous learning is also an important part of training.  This may involve workshops, seminars, or other topical training geared at increasing the level of understanding as a basis for future process improvement.



Technology for improving phenotypic data collection processes comes in many forms.  It may be as simple as a better technique or piece of equipment, or as complex as an information management system.  Whatever the form, technologies are necessary tools for building a good data collection system.  The following are some of the tools that DCG employs in its phenotypic data collection program.

Data verification.  Data verification refers to checking original data against the final data set, usually from an electronic file.  An example would be checking original hand-written forms against a printout of the final data set.  Another example is the PIGID verification process that DCG uses.  This involves checking the tag numbers (PIGID) of all pigs against an electronic file that was generated when the pigs were tagged at birth.  The PIGID is the most critical piece of data in the identification process.  Verification is the best way to ensure that the original data are entered completely and accurately.  It is also an expensive and time-consuming process and should be used only in limited cases.  One case where data verification may be warranted is at critical control points in the collection process, as described in the case of PIGID verification.

Another case is when large volumes of data are hand written and manually entered into a computer.  It is impossible to record and enter data by hand without making mistakes.  However, data verification can help identify and eliminate most of the mistakes, making manual data recording and entry a viable method of phenotypic data collection.
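As a rough illustration of verification at a critical control point, the sketch below compares tag numbers read during a later process against the list created at tagging.  It is a minimal example under stated assumptions: the function name, the tag numbers, and the three error classes are hypothetical and do not describe DCG's actual PIGID verification software.

```python
def verify_pigids(birth_file_ids, scanned_ids):
    """Check tag numbers read in the barn against the file created at tagging."""
    birth = set(birth_file_ids)
    seen = set()
    unknown = []      # read in the barn but never tagged at birth
    duplicates = []   # read more than once
    for pigid in scanned_ids:
        if pigid in seen:
            duplicates.append(pigid)
        elif pigid not in birth:
            unknown.append(pigid)
        seen.add(pigid)
    # Tagged at birth but not read this time.
    missing = sorted(birth - seen)
    return {"unknown": unknown, "duplicates": duplicates, "missing": missing}

# Hypothetical tag numbers; "1O04" contains a letter O typed in place of a zero.
birth_file = ["1001", "1002", "1003", "1004"]
scanned = ["1001", "1003", "1003", "1O04"]
result = verify_pigids(birth_file, scanned)
```

Each of the three lists calls for a different follow-up: unknown numbers usually mean a misread or mistyped tag, duplicates mean an animal was processed twice or a tag was misread as another, and missing numbers mean an animal was skipped or has left the group.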

Data screening.  Data screening refers to checking the data against pre-defined criteria that indicate whether or not the data are valid.  Examples would be checking that dates or locations are valid, or that phenotypic measurements fall within the normal range for that particular trait.  We have dozens of such screening checks that are performed each time data are added to the database.  For instance, we use a series of data screening checks to ensure that pedigrees are kept intact.  This works by checking to make sure that any animal to be added to the population database at birth has a litter record with known sire and dam.  For a sire and dam to create a litter record, they must each have been added to the population database at birth under the same restrictions.  And so a cyclical screening process is created.  Data screening is commonly done electronically as data are added to the final data set, so there is essentially no cost in screening the data.  Instead the time-consuming part is resolving the errors that are found.
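The pedigree screen described above might look something like the following sketch.  The record layout, the field names, and the birth weight range are assumptions chosen for illustration; they are not DCG's actual criteria.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class BirthRecord:
    pigid: str
    litter_id: str
    birth_date: date
    birth_weight_kg: float

# Hypothetical plausible range for birth weight; real limits would be set per trait.
BIRTH_WEIGHT_RANGE_KG = (0.4, 3.0)

def screen_birth_record(record, litters, population, today):
    """Return the reasons a record fails screening; an empty list means it passes."""
    problems = []
    if record.birth_date > today:
        problems.append("birth date is in the future")
    low, high = BIRTH_WEIGHT_RANGE_KG
    if not low <= record.birth_weight_kg <= high:
        problems.append("birth weight outside expected range")
    litter = litters.get(record.litter_id)
    if litter is None:
        problems.append("no litter record")
    else:
        # Both parents must be known and must already be in the population database.
        for role in ("sire", "dam"):
            parent = litter.get(role)
            if parent is None:
                problems.append(f"litter has unknown {role}")
            elif parent not in population:
                problems.append(f"{role} is not in the population database")
    return problems

# Hypothetical data: two parents already in the database and two litter records.
population = {"S1", "D1"}
litters = {
    "L1": {"sire": "S1", "dam": "D1"},
    "L2": {"sire": "S1", "dam": None},
}
today = date(2001, 3, 2)
good = BirthRecord("P100", "L1", date(2001, 3, 1), 1.5)
bad = BirthRecord("P101", "L2", date(2001, 3, 1), 1.4)

good_problems = screen_birth_record(good, litters, population, today)
if not good_problems:
    # Only screened animals enter the database, so they can later become parents.
    population.add(good.pigid)
bad_problems = screen_birth_record(bad, litters, population, today)
```

Because an animal enters the population only after passing the screen, and only animals in the population can appear as parents on a litter record, the check is self-reinforcing across generations.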

Electronic data capture.  Electronic data capture is a valuable tool for collecting data that can be read directly into a computer or into a device that can be downloaded into a computer.  This virtually eliminates errors caused by manual recording or entering of data.  It can also improve efficiency of data collection processes that require reading large volumes of information.  A good example of electronic data capture is the animal testing process that DCG uses to weigh and ultrasound pigs.  A laptop computer with an AUSKEY spreadsheet is used to store the information as it is collected.  First, bar coded ear tags are scanned into the spreadsheet to identify each pig.  Then weight is captured electronically as it becomes stabilized.  Finally real-time ultrasound images are captured and backfat thickness and loin muscle measurements are calculated and added to the spreadsheet.  All this takes place without any manual reading, recording, or entering of data.  This not only improves accuracy, but makes the process more efficient as well.

Another benefit is that electronic data capture can be used to record some types of data that would not otherwise be possible.  Feed intake is a good example of this.  Without the use of electronic data capture, feed intake is either measured on a pen basis or on pigs in individual stalls.  Electronic feed intake recording equipment (FIRE) allows feed intake to be measured on individuals in a normal pen setting.  In addition to FIRE feeders, DCG uses several other types of equipment for electronic data capture.  These include both hand-held and laptop computers, bar code tags and scanners, radio frequency tags and readers, electronic scales, real-time ultrasound machines, pH meters, and colorimeters.

Data management systems.  Data management systems include the hardware, software and networking capabilities necessary to move, check and store information after it has been captured.  They provide for the flow of data from the point of collection to the final database.  Most data management systems allow for some screening before the data enter the final database and ensure that it is well organized and that scientists and managers have ready access to the information.  This is one of the most complex tools we use for data collection, but is no less essential.  We are in the process of creating a large “data warehouse” in St. Louis for storing our phenotypic data and are establishing data flows that will move information from the farm to the data warehouse accurately and efficiently.  One of the first examples of this is the data from the FIRE feeders.  These feeders generate approximately 7,000 data records per day that are sent to the data warehouse on a daily basis.  There each record is screened according to a series of 18 different algorithms and a report is fed back to the scientists and the FIRE feeder manager daily so that necessary adjustments can be made to the feeders.  We recently installed a wireless communication system for the GN herds in Kansas to overcome the limitations of the phone lines.  This allows data to flow between the farms and St. Louis as rapidly as the local office.
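The daily screen-and-report loop for feeder data can be outlined as below.  The paper does not describe the 18 screening algorithms, so the three checks shown here are purely hypothetical stand-ins, as are the record fields and feeder names.

```python
from collections import Counter

# Hypothetical checks; each returns True when the visit record looks valid.
CHECKS = {
    "feed_non_negative": lambda r: r["feed_g"] >= 0,
    "duration_positive": lambda r: r["seconds"] > 0,
    "pig_identified": lambda r: bool(r["pigid"]),
}

def daily_report(records):
    """Split one day's records into clean records and a failure count per feeder and check."""
    failures = Counter()
    clean = []
    for record in records:
        failed = [name for name, check in CHECKS.items() if not check(record)]
        if failed:
            failures.update((record["feeder"], name) for name in failed)
        else:
            clean.append(record)
    return clean, failures

# Hypothetical feeder visits for one day.
visits = [
    {"feeder": "F1", "pigid": "P100", "feed_g": 350.0, "seconds": 420},
    {"feeder": "F1", "pigid": "", "feed_g": 120.0, "seconds": 90},
    {"feeder": "F2", "pigid": "P101", "feed_g": -15.0, "seconds": 0},
]
clean, failures = daily_report(visits)
```

Counting failures by feeder and by check, rather than only rejecting bad records, is what makes the report useful to the feeder manager: a feeder that repeatedly fails the same check probably needs adjustment.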

Similar data management systems are in place at the genotyping laboratories.  There, a Laboratory Information Management System (LIMS) is used to manage the identification, tracking, and storage of incoming tissue samples.  This same system also ensures that the resulting genotypic information is associated with the correct sample when it is moved to the genotypic database.  On the front end, LIMS is integrated into the phenotypic data collection process to make the original association between the sample and the animal from which it came.  As a part of the processing of pigs at birth, both the PIGID and the LIMS tube are scanned with a bar code scanner, the association is made, and the data uploaded to the LIMS database.  The tissue samples are shipped overnight to St. Louis and when the lab receives them, the LIMS tubes are scanned again to verify that they have been received.  We are currently gearing up to process about 2,000 samples per week using this system.
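The association and receipt steps can be pictured with a small in-memory model.  This is not the LIMS interface, which the paper does not describe; the class, its methods, and the identifiers are hypothetical and serve only to show the bookkeeping involved.

```python
class SampleTracker:
    """Toy model of linking tissue tubes to pigs and confirming receipt at the lab."""

    def __init__(self):
        self._pig_by_tube = {}
        self._received = set()

    def associate(self, pigid, tube_id):
        """Record the pig and tube scanned together at birth processing."""
        if tube_id in self._pig_by_tube:
            raise ValueError(f"tube {tube_id} is already linked to a pig")
        self._pig_by_tube[tube_id] = pigid

    def receive(self, tube_id):
        """Record a tube scanned on arrival and return the pig it came from."""
        if tube_id not in self._pig_by_tube:
            raise KeyError(f"tube {tube_id} was never associated with a pig")
        self._received.add(tube_id)
        return self._pig_by_tube[tube_id]

    def outstanding(self):
        """Tubes that were filled on the farm but have not been scanned at the lab."""
        return sorted(set(self._pig_by_tube) - self._received)

# Hypothetical identifiers.
tracker = SampleTracker()
tracker.associate("P100", "T-0001")
tracker.associate("P101", "T-0002")
pig = tracker.receive("T-0001")
waiting = tracker.outstanding()
```

Refusing to link one tube to two pigs, and refusing to receive a tube that was never linked, keeps the sample-to-animal association from being silently overwritten at either end.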


Continuous Improvement

Once the phenotypic data collection system is built, it is already time to begin remodeling.  To remain viable, the system must evolve and continue to improve by adopting new technologies and better processes.  This is accomplished by encouraging people to identify problems and suggest solutions; by learning from our mistakes; and by generating feedback.  For instance, we have a weekly meeting of scientists and FIRE feeder technicians designed to discuss problems with FIRE data collection and possible causes and solutions.

This works with suppliers as well.  We work with our suppliers to help them improve their product so that it will in turn improve ours.  An excellent example of this is our relationship with our tag supplier.  We have been using bar coded tags for a number of years, but the reliability and consistency of the bar codes have been questionable at times.  This caused us much difficulty in trying to read them with a scanner after they had been in the pig for six months or longer.  On some days, most or all of them had to be read visually and entered manually into the computer.  Then one of our IT people experienced with bar codes suggested some changes, which the supplier made for us, and now we are reading 98.1% of the tags with the scanner at six months of age.

Finally, the only way to know if data quality is improving is to monitor the results.  This can be done in a variety of ways.  We keep track of error rates at certain key points in the data collection processes.  We use repeatability to quantify measurement errors for some phenotypic traits, and we monitor pedigree errors through DNA testing - all in an effort to maintain data quality at the highest possible level.


High quality phenotypic data are a key driver of both genomics and quantitative genetics research and must not be ignored or taken for granted.  The level of data quality desired is an individual decision that must take into account how the data will be used and how much one is willing to invest.  Genomics and the move to marker-assisted genetic evaluation, however, are rapidly making high quality data a must.  Some improvements can be realized simply through people, training, and documented processes.  Other improvements may require investment in advancing data collection technology and an ongoing continuous improvement program.  DEKALB Choice Genetics has chosen to put these pieces in place and to build a data collection system that will generate high quality phenotypic data in order to take full advantage of whatever technology lies ahead.

2001 NSIF Proceedings