
Estimating population haplotype frequencies from pooled DNA samples using PHASE algorithm

Published online by Cambridge University Press:  06 January 2009

MATTI PIRINEN*
Affiliation:
Department of Mathematics and Statistics, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki, Finland
SANGITA KULATHINAL
Affiliation:
Department of Mathematics and Statistics, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki, Finland; Indic Society for Education and Development, Nashik, India
DARIO GASBARRA
Affiliation:
Department of Mathematics and Statistics, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki, Finland
MIKKO J. SILLANPÄÄ
Affiliation:
Department of Mathematics and Statistics, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki, Finland
*Corresponding author. Tel: (358) 9-191-51419. Fax: (358) 9-191-51400. e-mail: matti.pirinen@helsinki.fi

Summary

Recent studies show that the PHASE algorithm is a state-of-the-art method for population-based haplotyping from individually genotyped data. We present a modified version of PHASE for estimating population haplotype frequencies from pooled DNA data. The algorithm is compared with (i) maximum likelihood estimation under the multinomial model and (ii) a deterministic greedy algorithm, on both simulated and real (HapMap) data sets. Our results suggest that the PHASE algorithm is also a method of choice for pooled DNA data. The main reason for the improvement over the other approaches is assumed to be the same as with individually genotyped data: the biologically motivated model of PHASE takes into account the correlated genealogical histories of the haplotypes by modelling mutations and recombinations. We also consider the efficiency of DNA pooling and the influence of pool size on the accuracy of the estimates. Our results are in line with earlier findings in that the pool size should be relatively small, only 2–5 individuals in our examples, to provide reliable estimates of population haplotype frequencies.
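To make the notion of pooled DNA data concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes biallelic markers coded 0/1, two haplotypes per individual, and that each pool reports only the per-marker count of allele 1, observed without error. The function names `pool_allele_counts` and `make_pools` are hypothetical.

```python
from typing import List, Sequence

# Assumption: a haplotype is a sequence of 0/1 alleles, one per marker.
Haplotype = Sequence[int]


def pool_allele_counts(haplotypes: Sequence[Haplotype]) -> List[int]:
    """Return, for each marker, the number of copies of allele 1 in one pool."""
    if not haplotypes:
        raise ValueError("a pool needs at least one haplotype")
    n_markers = len(haplotypes[0])
    if any(len(h) != n_markers for h in haplotypes):
        raise ValueError("all haplotypes must cover the same markers")
    return [sum(h[m] for h in haplotypes) for m in range(n_markers)]


def make_pools(haplotypes: Sequence[Haplotype],
               individuals_per_pool: int) -> List[List[int]]:
    """Split haplotypes into consecutive pools and reduce each to allele counts.

    Assumption: each individual contributes two haplotypes, so one pool
    holds 2 * individuals_per_pool haplotypes.
    """
    per_pool = 2 * individuals_per_pool
    if per_pool <= 0 or len(haplotypes) % per_pool != 0:
        raise ValueError("haplotypes do not divide evenly into pools")
    return [
        pool_allele_counts(haplotypes[i:i + per_pool])
        for i in range(0, len(haplotypes), per_pool)
    ]


# Four individuals (eight haplotypes) over three markers, two individuals per pool.
example_haplotypes = [
    (0, 0, 1), (1, 0, 1), (0, 1, 0), (0, 0, 1),
    (1, 1, 0), (1, 0, 1), (0, 0, 0), (1, 1, 1),
]
example_pools = make_pools(example_haplotypes, individuals_per_pool=2)
```

Under these assumptions the haplotype identities are lost and only the counts remain, which is why many haplotype configurations become compatible with the same pool as its size grows.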

Information

Type: Paper
Copyright © 2009 Cambridge University Press

Fig. 1. Relative errors of frequency estimates of nine most common haplotypes in the simulated example.

Table 1. Frequency estimates on the simulated data set. The best estimate for each haplotype is shown in boldface

Fig. 2. Relation between multinomial likelihood and the total variation distance to the true haplotype distribution in the simulated example.
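The total variation distance used in these comparisons can be sketched as follows. This uses the standard definition, half the sum of absolute differences between two probability distributions; that the paper uses the same normalisation is an assumption, and the function name and the dictionary representation of haplotype frequencies are illustrative only.

```python
from typing import Dict


def total_variation_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Half the L1 distance between two haplotype frequency distributions.

    Haplotypes absent from one distribution are treated as having frequency 0.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(h, 0.0) - q.get(h, 0.0)) for h in support)


# Hypothetical true and estimated frequencies over three-marker haplotypes.
true_freqs = {"000": 0.5, "011": 0.5}
estimated_freqs = {"000": 0.25, "011": 0.5, "110": 0.25}
distance = total_variation_distance(true_freqs, estimated_freqs)
```

With this normalisation the distance lies between 0 (identical distributions) and 1 (distributions with disjoint support).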

Table 2. Distances to the true distribution on the simulated data set

Table 3. Marker spacing (in base pairs) in HapMap data sets

Fig. 3. Total variation distances from the estimated haplotype distributions on E100 to the true ones. The horizontal axis contains ten different genomic regions and for each the analyses are carried out for five different pool sizes (2, 3, 4, 5 and 10). For each region and each pool size, the upper panel shows the results of 100 separate runs. For the lower panel, a single run with the highest PAC-B likelihood has been chosen to represent the final estimate. The horizontal lines in the lower panel depict the results given by PHASE when run on single-individual pools. Pooling schemes are given in the form ‘Number of pools×Number of individuals per pool’.

Fig. 4. Total variation distances from the estimated haplotype distributions on E5k to the true ones. The horizontal axis contains ten different genomic regions and for each the analyses are carried out for five different pool sizes (2, 3, 4, 5 and 10). For each region and each pool size, the upper panel shows the results of 100 separate runs. For the lower panel, a single run with the highest PAC-B likelihood has been chosen to represent the final estimate. The horizontal lines in the lower panel depict the results given by PHASE when run on single-individual pools. Pooling schemes are given in the form ‘Number of pools×Number of individuals per pool’.

Fig. 5. Relation between PAC-B likelihood and total variation distance between true and estimated distributions. The results are shown for ten regions of E100 data sets for two pooling schemes: five individuals per pool (two upper lines) and ten individuals per pool (two lower lines). Each picture describes the results of 100 separate runs.

Fig. 6. Total variation distances between the estimated haplotype distributions and the true ones. On the horizontal axis, there are ten different genomic regions and for each there are four different pooling schemes. For each region and each pooling scheme, 20 different pool contents were analysed and their results lie within the vertical intervals. Points represent medians of 20 analyses. The upper panel concerns the data set E100 and the lower panel the data set E5k. Pooling schemes are given in the form ‘Number of pools×Number of individuals per pool’.

Fig. 7. Total variation distances between the estimated haplotype distributions and the true ones on the data set E25k. On the horizontal axis, there are ten different genomic regions and for each combination of region, method and pooling scheme, 20 different pool contents were analysed. The results lie within the vertical intervals and points represent medians of 20 analyses. Pooling schemes are given in the form ‘Number of pools×Number of individuals per pool’.