Generating High Quality Phenotypic Data
Sam Buttram, Ph.D.
DEKALB Choice Genetics
Introduction
Genomics is no doubt the
future of the genetic business.
Marker-assisted genetic evaluation will soon replace conventional BLUP
as the preferred method of genetic evaluation and the quest for QTL is foremost
in the minds of geneticists who would participate in this revolution. There is much talk about technologies like
genotyping, transcript profiling, DNA sequencing, gene mapping, and
bioinformatics. These technologies
allow laboratories to generate data in factory-like proportions. Unfortunately the value of this plethora of
information is dependent not just upon the volume of data, but also upon its
quality. Basically, genomics is about
discovering genetic markers and eventually genes that control traits of
economic importance. This requires both
DNA and phenotypic data from live animals in order to make the necessary
associations. Therefore success with
genomics depends largely on the ability to access accurate and complete
phenotypic data and tissue from sufficiently sized populations of animals with
the proper genetic structure.
Quantitative geneticists are
driven by data. They spend much time
modeling, analyzing, and interpreting data.
But seldom do they spend time concerning themselves with the quality of
the data or the way in which it is collected.
Assumptions are made that the data are accurate, the pedigrees are
correct, and measurement processes were consistent across all animals. Besides, the data were collected months ago
and there is no way to know any different.
In reality, pedigree errors run around 5-10% for a typical genetic
nucleus herd that generates data by hand and manually enters it into a
database. Some are willing to accept this
level of accuracy because the data are used exclusively for quantitative
genetic evaluation; and BLUP is fairly robust when it comes to compensating for
pedigree errors. Times are changing and
marker-assisted genetic evaluation may not be as forgiving.
Obviously good data quality
is important regardless of its intended use.
But how good is good enough?
This is an individual question and may depend upon how the data are to be
used as well as the cost. Certainly
there is additional cost associated with higher levels of data quality and the
resources required to move to the next level of accuracy can be
prohibitive. On the other hand, when
data points become more valuable, as in the case of genomics research, higher
levels of accuracy may be necessary and justified. The desired level of data quality is an important question that
must be answered by each individual organization. DEKALB Choice Genetics (DCG) has made a decision to move to a new
level of data quality to accommodate current genomics research and to prepare
for selection using marker-assisted genetic evaluation. The purpose of this paper is to share some
of the experiences and learnings regarding the practical aspects of phenotypic
data collection that have resulted from this endeavor.
Data Quality
Data quality is affected by a variety of things
including the genetic structure of the population from which the data are
generated, the number of animals sampled, the relevance of the traits measured,
and the completeness or accuracy of the information. This paper focuses on the latter of these, with primary attention
given to how to produce phenotypic data that are complete and accurate.
Completeness is relatively
straightforward and easy to monitor, but is not always given the attention it deserves. Basically, each phenotypic data point should
be accompanied by a record of when, where, how, and by whom it was collected. These characteristics are essential for
analyzing and interpreting the data as well as following up on problems. There is a tendency to take shortcuts and
skip some of this information by making the assumption that things will not
change. The fact is that things will
change, and for the data to have maximum value now and in the future, it is
important to have complete information.
Accuracy refers to how closely the data collected
matches the true values being generated by the animals. It is not as easily measured as
completeness, because we don’t always know the true value. Measures of repeatability can be useful in
evaluating data accuracy in cases where the true value is not known. For instance, at DCG, periodically the same
group of animals is run through the testing process twice on the same day and
standard errors and correlations between the two measurements are used to evaluate
the process for taking weights and ultrasound measurements. Where true values are known, error rate can
be used to monitor accuracy.
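The duplicate-run check described above can be sketched in a few lines of Python. This is a minimal illustration, not DCG's actual procedure: the function name, the choice of the Pearson correlation, and the paired-difference estimate of the standard error of a single measurement (the square root of the sum of squared differences divided by 2n) are all assumptions made for the example.

```python
from math import sqrt
from statistics import mean


def repeatability_summary(first, second):
    """Summarize agreement between two same-day measurements on the same animals.

    `first` and `second` are equal-length lists of values (for example,
    weights) taken on the same animals in the same order.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need paired measurements on at least two animals")

    n = len(first)
    mean_1, mean_2 = mean(first), mean(second)

    # Pearson correlation between the first and second pass.
    sxy = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    sxx = sum((a - mean_1) ** 2 for a in first)
    syy = sum((b - mean_2) ** 2 for b in second)
    if sxx == 0 or syy == 0:
        raise ValueError("measurements show no variation; correlation undefined")
    correlation = sxy / sqrt(sxx * syy)

    # Assumed estimate of the standard error of a single measurement,
    # based on the paired differences: sqrt(sum(d^2) / (2n)).
    diffs = [a - b for a, b in zip(first, second)]
    measurement_se = sqrt(sum(d * d for d in diffs) / (2 * n))

    return {"n": n, "correlation": correlation, "measurement_se": measurement_se}


if __name__ == "__main__":
    run_1 = [100.0, 110.0, 120.0, 130.0]  # hypothetical weights, first pass
    run_2 = [101.0, 109.0, 121.0, 129.0]  # same pigs, second pass
    print(repeatability_summary(run_1, run_2))
```

A falling correlation or a rising standard error from one duplicate run to the next would suggest that the measurement process, rather than the animals, has changed.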
Factors Affecting Data Quality
People
If building a data collection system can be compared
to constructing a building, then people must be considered the foundation. No matter how foolproof or sophisticated
a data collection system is, it is doomed to fail without good people. DEKALB Choice Genetics has been successful
at adding several good people to the team in Kansas over the past several
months to keep up with the increased data collection and quality assurance
activities. Some of the key things to
look for are people who are detail-oriented, who have a healthy respect for
processes and doing things consistently, but who also are willing to
continually evaluate the processes and suggest changes for improvement. They also need to be motivated by quality
and the feeling of a job well done.
People who are looking only for the quickest and easiest way to complete
the task do not make good data collectors.
Data collection is not rocket science and it does not gain the attention
of the media. It can become a mundane
and difficult task, so it is important that people see how their work
contributes to the overall success of the project or organization. If people can be identified that meet these
qualifications, are compensated fairly, and given sufficient training and
resources to do their job, the foundation of a successful data collection
program is created.
Processes
A process is a series of tasks directed toward the
completion of a particular result. All
data collection activities can be broken down into one or more processes,
depending upon the length and complexity of the activity. Processes are the basic building blocks of a
good data collection system. They
provide the structure necessary to make sure that data are collected in a
manner that is meaningful and consistent with project objectives and will
result in data that are useful. It is
essential that processes be documented in order to ensure consistency and
eliminate ambiguity. DEKALB Choice
Genetics uses a Standard Operating Procedure (SOP) to document each of the
processes it uses. The SOP is a
detailed step-by-step written description of a process. It includes a section on the purpose of the
process, gives background information, and safety precautions in addition to
the details of the process. An
important feature of the SOP is that the people who actually perform the
procedure develop it. Of course those
who have responsibility for oversight of the project must approve it, but it is
not something that is handed down to the data collectors from their
supervisor. The SOP is made accessible
to everyone involved with the project and people are encouraged to bring forth
suggestions for changes if they feel there is a better procedure for doing
something. If indeed it is agreed that
the procedure should be changed, a revised SOP is drawn up and approved to make
the change official.
Training
A training program is the
mortar that holds everything together in a good data collection system. Training is crucial for either new employees
or for people taking on a new role to ensure that data are consistent across
time and people. The most important
training for the new employee is training on the procedures themselves. Standard Operating Procedures (SOPs) offer
the perfect tool for training. SOPs are
used extensively in the training program at DCG. In fact we have an SOP that describes how to train employees on
an SOP. In addition, suppliers and/or
consultants are good sources for training.
Continuous learning is also an important part of training. This may involve workshops, seminars, or
other topical training geared at increasing the level of understanding as a
basis for future process improvement.
Technology
Technology for improving
phenotypic data collection processes comes in many forms. It may be as simple as a better technique or
piece of equipment, or as complex as an information management system. Whatever the form, technologies are
necessary tools for building a good data collection system. The following are some of the tools that DCG
employs in its phenotypic data collection program.
Data verification. Data verification refers to checking
original data against the final data set, usually from an electronic file. An example would be checking original
hand-written forms against a printout of the final data set. Another example is the PIGID verification
process that DCG uses. This involves
checking the tag numbers (PIGID) of all pigs against an electronic file that
was generated when the pigs were tagged at birth. The PIGID is the most critical piece of data in the
identification process. Verification is
the best way to ensure that the original data are entered completely and
accurately. It is also an expensive and
time-consuming process and should be used only in limited cases. One case where data verification may be
warranted is at critical control points in the collection process, as described
in the case of PIGID verification.
Another case is when large
volumes of data are hand written and manually entered into a computer. It is impossible to record and enter data by
hand without making mistakes. However,
data verification can help identify and eliminate most of the mistakes, making
manual data recording and entry a viable method of phenotypic data collection.
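As a rough illustration of the PIGID verification described above, the following Python sketch compares tag numbers read from the pigs with the identifiers recorded at tagging. The function name, the report keys, and the use of plain lists of strings are assumptions for the example; the source does not describe how DCG implements this check.

```python
def verify_pigids(scanned_ids, birth_file_ids):
    """Compare tag numbers read from pigs with the IDs recorded at tagging.

    Returns three sorted lists:
      unknown    - tags read from pigs that are not in the birth file
      missing    - tags in the birth file that were never read
      duplicates - tags that were read more than once
    """
    expected = set(birth_file_ids)
    seen = set()
    duplicates = set()
    for pigid in scanned_ids:
        if pigid in seen:
            duplicates.add(pigid)
        seen.add(pigid)

    return {
        "unknown": sorted(seen - expected),
        "missing": sorted(expected - seen),
        "duplicates": sorted(duplicates),
    }


if __name__ == "__main__":
    # Hypothetical tag numbers, for illustration only.
    report = verify_pigids(["A1", "A2", "A2", "B9"], ["A1", "A2", "A3"])
    print(report)
```

Any non-empty list in the report marks a tag that someone needs to follow up on before the data move further along the process.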
Data screening. Data screening refers to checking the data
against pre-defined criteria that indicate whether or not the data are
valid. Examples would be checking that
dates or locations are valid, or that phenotypic measurements fall within the
normal range for that particular trait.
We have dozens of such screening checks that are performed each time
data are added to the database. For
instance, we use a series of data screening checks to ensure that pedigrees are
kept intact. This works by checking to
make sure that any animal to be added to the population database at birth has a
litter record with known sire and dam.
For a sire and dam to create a litter record, they must each have been
added to the population database at birth under the same restrictions. And so a cyclical screening process is
created. Data screening is commonly
done electronically as data are added to the final data set, so there is
essentially no cost in screening the data.
Instead the time-consuming part is resolving the errors that are found.
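The two kinds of screening described above, range checks and the pedigree check at birth, might look something like the following Python sketch. Everything named here is hypothetical: the field names, the trait limits in `TRAIT_RANGES`, and the dictionary-based "database" are placeholders chosen for illustration, not DCG's actual criteria or schema.

```python
from datetime import date

# Hypothetical plausible ranges; real limits would be set per trait and age.
TRAIT_RANGES = {"weight_kg": (0.5, 200.0), "backfat_mm": (3.0, 50.0)}


def screen_record(record, today):
    """Return a list of problems found in one phenotypic record."""
    errors = []
    if record["test_date"] > today:
        errors.append("test_date is in the future")
    for trait, (low, high) in TRAIT_RANGES.items():
        value = record.get(trait)
        if value is not None and not (low <= value <= high):
            errors.append(f"{trait}={value} outside {low}-{high}")
    return errors


def screen_birth(animal, litters, population):
    """Check that a newborn has a litter record with a known sire and dam.

    `litters` maps litter IDs to {"sire": ..., "dam": ...}; `population` is
    the set of animal IDs already accepted into the database. Because the
    parents had to pass this same check at their own birth, the pedigree
    stays intact from one generation to the next.
    """
    litter = litters.get(animal.get("litter_id"))
    if litter is None:
        return ["no litter record"]
    errors = []
    for role in ("sire", "dam"):
        parent = litter.get(role)
        if parent is None:
            errors.append(f"{role} unknown")
        elif parent not in population:
            errors.append(f"{role} {parent} not in population database")
    return errors


if __name__ == "__main__":
    record = {"test_date": date(2001, 5, 1), "weight_kg": 110.0, "backfat_mm": 75.0}
    print(screen_record(record, today=date(2001, 6, 1)))

    litters = {"L1": {"sire": "S1", "dam": "D1"}}
    print(screen_birth({"pigid": "P1", "litter_id": "L1"}, litters, {"S1", "D1"}))
```

An animal would be added to the population only when both functions return an empty list; anything else goes to a person for resolution, which is where the real cost of screening lies.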
Electronic data capture. Electronic data capture is a valuable tool
for collecting data that can be read directly into a computer or into a device
that can be downloaded into a computer.
This virtually eliminates errors caused by manual recording or entering
of data. It can also improve efficiency
of data collection processes that require reading large volumes of
information. A good example of
electronic data capture is the animal testing process that DCG uses to weigh
and ultrasound pigs. A laptop computer
with an AUSKEY spreadsheet is used to store the information as it is
collected. First, bar coded ear tags
are scanned into the spreadsheet to identify each pig. Then weight is captured electronically as it
becomes stabilized. Finally real-time
ultrasound images are captured and backfat thickness and loin muscle
measurements are calculated and added to the spreadsheet. All this takes place without any manual
reading, recording, or entering of data.
This not only improves accuracy, but makes the process more efficient as
well.
Another benefit is that electronic data capture can
be used to record some types of data that would not otherwise be possible. Feed intake is a good example of this. Without the use of electronic data capture,
feed intake is either measured on a pen basis or on pigs in individual
stalls. Electronic feed intake
recording equipment (FIRE) allows feed intake to be measured on individuals in
a normal pen setting. In addition to
FIRE feeders, DCG uses several other types of equipment for electronic data
capture. These include both hand-held
and laptop computers, bar code tags and scanners, radio frequency tags and
readers, electronic scales, real-time ultrasound machines, pH meters, and
colorimeters.
Data management systems. Data management systems
include the hardware, software and networking capabilities necessary to move, check
and store information after it has been captured. They provide for the flow of data from the point of collection to
the final database. Most data
management systems allow for some screening before the data enter the final
database and ensure that it is well organized and that scientists and managers
have ready access to the information.
This is one of the most complex tools we use for data collection, but is
no less essential. We are in the
process of creating a large “data warehouse” in St. Louis for storing our
phenotypic data and are establishing data flows that will move information from
the farm to the data warehouse accurately and efficiently. One of the first examples of this is the
data from the FIRE feeders. These
feeders generate approximately 7,000 data records per day that are sent to the
data warehouse on a daily basis. There
each record is screened according to a series of 18 different algorithms and a
report is fed back to the scientists and the FIRE feeder manager daily so that
necessary adjustments can be made to the feeders. We recently installed a wireless communication system for the GN
herds in Kansas to overcome the limitations of the phone lines. This allows data to flow between the farms
and St. Louis as rapidly as the local office.
Similar data management systems are in place at the
genotyping laboratories. There, a
Laboratory Information Management System (LIMS) is used to manage the
identification, tracking, and storage of incoming tissue samples. This same system also ensures that the
resulting genotypic information is associated with the correct sample when it
is moved to the genotypic database. On
the front end, LIMS is integrated into the phenotypic data collection process to
make the original association between the sample and the animal from which it
came. As a part of the processing of
pigs at birth, both the PIGID and the LIMS tube are scanned with a bar code
scanner, the association is made, and the data uploaded to the LIMS
database. The tissue samples are shipped
overnight to St. Louis and when the lab receives them, the LIMS tubes are
scanned again to verify that they have been received. We are currently gearing up to process about 2,000 samples per
week using this system.
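The sample-tracking steps described above (scan the PIGID and the tube at birth, record the association, then scan the tube again on arrival at the lab) can be pictured with a small Python sketch. The `SampleTracker` class and its methods are hypothetical and exist only to illustrate the flow; they do not represent the LIMS software itself.

```python
class SampleTracker:
    """Toy model of the tube-to-animal association and receipt check."""

    def __init__(self):
        self.tube_to_pig = {}   # tube bar code -> PIGID, recorded at birth
        self.received = set()   # tubes scanned on arrival at the lab

    def associate(self, pigid, tube_id):
        """Record the scan of a pig tag and a tube at processing."""
        if tube_id in self.tube_to_pig:
            owner = self.tube_to_pig[tube_id]
            raise ValueError(f"tube {tube_id} already assigned to pig {owner}")
        self.tube_to_pig[tube_id] = pigid

    def receive(self, tube_id):
        """Record the scan of a tube at the lab and return its PIGID."""
        if tube_id not in self.tube_to_pig:
            raise KeyError(f"tube {tube_id} has no associated pig")
        self.received.add(tube_id)
        return self.tube_to_pig[tube_id]

    def outstanding(self):
        """Tubes associated on the farm but not yet scanned at the lab."""
        return sorted(set(self.tube_to_pig) - self.received)


if __name__ == "__main__":
    tracker = SampleTracker()
    tracker.associate("P1", "T1")   # hypothetical tag and tube codes
    tracker.associate("P2", "T2")
    print(tracker.receive("T1"))
    print(tracker.outstanding())
```

Refusing a second association for the same tube, and refusing receipt of a tube with no association, are the two points at which a sample could otherwise end up linked to the wrong animal.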
Continuous Improvement
Once the phenotypic
data collection system is built, it is already time to begin remodeling. To remain viable, the system must evolve and
continue to improve by adopting new technologies and better processes. This is accomplished by encouraging people
to identify problems and suggest solutions; by learning from our mistakes; and
by generating feedback. For instance,
we have a weekly meeting of scientists and FIRE feeder technicians designed to
discuss problems with FIRE data collection and possible causes and solutions.
This works
with suppliers as well. We work with
our suppliers to help them improve their product so that it will in turn
improve ours. An excellent example of
this is our relationship with our tag supplier. We have been using bar coded tags for a number of years, but the
reliability and consistency of the bar codes has been questionable at
times. This caused us much difficulty
in trying to read them with a scanner after they had been in the pig for six months
or longer. On some days, most or all of
them had to be read visually and entered manually into the computer. Then one of our IT people experienced with
bar codes suggested some changes, which the supplier made for us, and now we are
reading 98.1% with the scanner at six months of age.
Finally, the
only way to know if data quality is improving is to monitor the results. This can be done in a variety of ways. We keep track of error rates at certain key
points in the data collection processes.
We use repeatability to quantify measurement errors for some phenotypic
traits, and we monitor pedigree errors through DNA testing - all in an effort
to maintain data quality at the highest possible level.
Summary
High quality
phenotypic data is a key driver of both genomics and quantitative genetics
research that must not be ignored or taken for granted. The level of data quality desired is an
individual decision that must take into account how the data will be used and
how much one is willing to invest.
Genomics and the move to marker-assisted genetic evaluation, however,
are rapidly making high quality data a must.
Some improvements can be realized simply through people, training, and
documented processes. Other
improvements may require investment in advancing data collection technology and
an ongoing continuous improvement program.
DEKALB Choice Genetics has chosen to put these pieces in place and to
build a data collection system that will generate high quality phenotypic data
in order to take full advantage of whatever technology lies ahead.