Sampling from a population

In studying the properties of a biological system we usually are unable to measure every part of the system. We can, however, collect one or more representative samples of information from the total system (or population). We usually are not interested in our sample itself. Instead, we are interested in what it tells us about the total system. Figure B1 of Principles of Life, Second Edition (page 944) illustrates the relationship between a sample and a population.

To see how we use properties of samples to draw inferences about populations, consider a population of a fish species, the freshwater bream (Abramis brama), living in Lake Laengelmavesi in Finland. Suppose we want to know how big the fish in the lake are. We cannot capture and characterize every individual of this species, but we can make inferences about the entire lake population from samples of fish. For instance, we can capture n fish, measure the weight of each fish, and calculate the mean $\bar{x}$ and standard deviation $s$ of the weights of this sample of fish, using the following formulas (also see Figure B6):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

where $x_i$ is the weight of the $i$th fish in the sample.

These descriptive sample statistics estimate the true mean and true standard deviation in the entire lake population. But how accurate are these estimates?

The standard error of a sample statistic gives us a measure of how close it is likely to be to the true population value. For example, the standard error of the sample mean, $SE_{\bar{x}}$, is an estimate of how far we can expect a sample mean to be from the true population mean. It is a function of the spread of values within the sample (the standard deviation, $s$) and the sample size, n (also see Figure B6):

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$

The standard error can be used to calculate the confidence interval: an interval around the sample statistic that has a specified probability of including the true population value. For example, for many types of continuous data, the 95% confidence interval of the sample mean (95% C.I.) is calculated as the sample mean plus and minus 1.96 times the standard error of the mean:

95% C.I. = $\bar{x} \pm 1.96 \times SE_{\bar{x}}$.
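
The calculations described so far (mean, standard deviation, standard error of the mean, and 95% C.I.) can be sketched in a few lines of Python. This is only an illustration: the function name `summarize` and the five fish weights are made up for the example and do not come from the lake data.

```python
import math
import statistics

def summarize(weights):
    """Return the mean, sample SD, SE of the mean, and 95% C.I. bounds."""
    n = len(weights)
    mean = statistics.mean(weights)
    sd = statistics.stdev(weights)       # sample SD, divides by n - 1
    se = sd / math.sqrt(n)               # standard error of the mean
    half_width = 1.96 * se               # distance from the mean to each bound
    return mean, sd, se, (mean - half_width, mean + half_width)

# Hypothetical weights (g) of five fish; invented numbers for illustration only.
weights = [520.0, 610.0, 700.0, 450.0, 720.0]
mean, sd, se, ci = summarize(weights)
print(f"mean = {mean:.1f} g, s = {sd:.1f} g, SE = {se:.1f} g")
print(f"95% C.I. = {ci[0]:.1f} g to {ci[1]:.1f} g")
```

Because the standard error divides $s$ by $\sqrt{n}$, the same spread of weights gives a narrower interval when more fish are measured.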

The 95% C.I. has the following interpretation: if you were to take replicate samples of fish, each of size n, and if you then calculated the 95% C.I. for each of these samples, you would expect that this interval would include the true mean of the population in 95% of all the samples. Thus the 95% C.I. of even a single sample has a good probability (0.95) of including the population mean.

Exploring the properties of samples

To explore the properties of samples, imagine that a class of 100 Finnish students wants to know the average weight of Abramis brama fish in Lake Laengelmavesi. Let us assume that they are sampling from an infinitely large population of fish with a mean weight of exactly 600 grams (g) and a standard deviation of exactly 200 g. Each student collects and weighs a sample of n fish, then throws the fish back. Each student then calculates the mean, standard deviation, standard error, and 95% C.I. of his or her sample.

Go to the Simulation tab. Begin by setting the number of fish each student samples to 20 and click on the "Sample" button. The program selects 20 fish at random from the population and calculates the mean, standard deviation, standard error of the mean, and 95% C.I. of the mean for this first student's sample. The program then repeats the sampling and calculations for 99 more students. The program graphs these results with a dot for each sample mean and a vertical bar to indicate the 95% C.I. The vertical bar is black for samples whose 95% C.I. includes the true population mean (indicated by a blue horizontal line) and red for samples whose 95% C.I. does not include the true population mean.
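
For readers who want to reproduce this procedure outside the web page, the following is a minimal sketch of what such a simulation might do. It is not the page's actual program. It assumes that fish weights are normally distributed (the text does not specify a distribution), and the function names and the seed are arbitrary choices.

```python
import math
import random
import statistics

POP_MEAN = 600.0   # true population mean in g, from the text
POP_SD = 200.0     # true population standard deviation in g, from the text

def student_sample(n, rng):
    """Draw n fish weights and return (mean, sd, se, ci_low, ci_high)."""
    # Assumption: weights follow a normal distribution.
    weights = [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]
    mean = statistics.mean(weights)
    sd = statistics.stdev(weights)
    se = sd / math.sqrt(n)
    return mean, sd, se, mean - 1.96 * se, mean + 1.96 * se

def run_class(n_fish=20, n_students=100, seed=None):
    """Simulate one class and return one result tuple per student."""
    rng = random.Random(seed)
    return [student_sample(n_fish, rng) for _ in range(n_students)]

results = run_class(n_fish=20, n_students=100, seed=1)

# A sample is "red" when its 95% C.I. does not include the true mean.
red = sum(1 for r in results if not (r[3] <= POP_MEAN <= r[4]))
above = sum(1 for r in results if r[0] > POP_MEAN)
print(f"samples in red: {red} of {len(results)}")
print(f"sample means above the true mean: {above}; below: {len(results) - above}")
```

Changing the seed gives a different class of 100 students, much like clicking "Reset" and "Sample" again.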

To see the students' calculations in table format, click on the Table tab. Once again, the table uses red to indicate samples whose 95% C.I. does not include the true population mean.

In the Table tab, examine the 100 sample means and standard deviations. Are they all exactly 600 g and 200 g? Why or why not? Pick an example and verify that the student correctly calculated the standard error of the mean and the 95% confidence interval.

In either the Table or Simulation tab, examine the sample means. How many do you expect to fall below versus above the true mean, and what is the actual pattern? Now examine the number of red samples. How many samples do you expect to be shown in red? How many are actually shown in red? If these aren't the same, why not?

Now repeat this exercise to obtain another set of 100 samples of 20 fish each, by clicking "Reset" and "Sample" in the Simulation tab. In what ways is this set of samples different from, or similar to, the first set? Are the same number of samples shown in red? How many of them fall above versus below the true mean?

What do you think would happen if you accumulated many more sets of 100 samples of 20 fish each? What percentage of samples in this cumulative set would eventually turn out to be shown in red, and in what proportion of the red samples, and of all samples, would the sample mean fall above versus below the true mean? If you aren't certain about the answer, add up the number of red samples in 1, 2, …, 10 successive sets of 100 samples, and divide by the total number of samples accumulated so far. Does the proportion of red samples converge on 5%?
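
The running tally suggested above can also be sketched in code. As before, this is an illustrative sketch rather than the page's own program: it assumes normally distributed weights, and the helper name `ci_misses_true_mean` and the seed are arbitrary.

```python
import math
import random
import statistics

POP_MEAN, POP_SD = 600.0, 200.0   # g, from the text

def ci_misses_true_mean(n, rng):
    """True if a sample's 95% C.I. excludes the true mean (a 'red' sample)."""
    # Assumption: weights follow a normal distribution.
    weights = [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]
    mean = statistics.mean(weights)
    se = statistics.stdev(weights) / math.sqrt(n)
    return abs(mean - POP_MEAN) > 1.96 * se

rng = random.Random(2)   # arbitrary seed, so the run is repeatable
total_red = 0
proportions = []
for set_number in range(1, 11):
    # One set = 100 students, each sampling 20 fish.
    total_red += sum(ci_misses_true_mean(20, rng) for _ in range(100))
    proportions.append(total_red / (100 * set_number))
    print(f"after {set_number:2d} sets: {total_red} red of {100 * set_number} "
          f"({proportions[-1]:.1%})")
```

With samples as small as 20 fish, the long-run proportion in a sketch like this may settle slightly above 5%, because the 1.96 multiplier does not account for the uncertainty in each sample's own standard deviation.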

Exploring the effect of sample size

In general, bigger samples yield more accurate estimates of the entire population. To see this property of sample size, let's see what happens if we increase the number of fish in each student's sample (other than making the students complain about overwork!).

Go to the Simulation tab. Click the "Reset" button to clear the graph. Increase the number of fish per sample to, say, 100. What has happened to the number of sample means that are above or below the true population mean? Has the standard deviation of each sample changed noticeably? How about the standard error of the mean, or the 95% C.I.? How about the number of samples in red?

To see sample size effects even more clearly, repeat with even larger sample sizes (n = 500, for example).
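
The same comparison can be sketched numerically. The code below averages each statistic over a class of 100 students for n = 20, 100, and 500. It assumes normally distributed weights, and `average_summary` is a name invented for this example.

```python
import math
import random
import statistics

POP_MEAN, POP_SD = 600.0, 200.0   # g, from the text

def average_summary(n_fish, n_students=100, seed=0):
    """Return the class-average sample SD and class-average SE of the mean."""
    rng = random.Random(seed)
    sds, ses = [], []
    for _ in range(n_students):
        # Assumption: weights follow a normal distribution.
        weights = [rng.gauss(POP_MEAN, POP_SD) for _ in range(n_fish)]
        sd = statistics.stdev(weights)
        sds.append(sd)
        ses.append(sd / math.sqrt(n_fish))
    return statistics.mean(sds), statistics.mean(ses)

for n_fish in (20, 100, 500):
    mean_sd, mean_se = average_summary(n_fish)
    print(f"n = {n_fish:3d}: average s = {mean_sd:6.1f} g, "
          f"average SE = {mean_se:5.1f} g, "
          f"average 95% C.I. half-width = {1.96 * mean_se:5.1f} g")
```

In a run like this, the sample standard deviation stays near the population value at every sample size, while the standard error and the width of the 95% C.I. shrink roughly in proportion to $1/\sqrt{n}$.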

[Simulation controls: number of fish per sample (set by the user); population mean = 600 g; population standard deviation = 200 g]

[Results table columns: Sample #, Mean, Standard Deviation, Standard Error, 95% Confidence Interval]

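Note: Samples in red are those whose 95% confidence interval does not include the true population mean. In other words, the sample mean differs from the true population mean by more than 1.96 times the standard error of the mean.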

Properties of samples—such as the sample means—are estimates of the properties of the population from which the samples were drawn. In some cases, just by chance, the sample statistics provide very accurate estimates of the true population values, whereas in other cases they do not. In real examples we will not know the true properties of the underlying population, and there is always some chance that the estimates from a sample we take will lead us to draw an incorrect biological conclusion—for example, that the fish in one lake are larger than those in another when actually they are not larger (a Type I statistical error) or conversely that they are the same size when actually they are not (a Type II error).

We can reduce the chance that we make one of these errors by taking larger samples.
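
To illustrate how sample size affects one of these errors, the sketch below estimates how often two lakes whose fish truly differ in mean weight would be judged "the same" (a Type II error). Everything specific here is an assumption made for the example: the second lake's mean of 650 g, normally distributed weights, and the simple decision rule of calling the lakes different when the difference in sample means exceeds 1.96 times its standard error. It is not a method prescribed by the text.

```python
import math
import random
import statistics

def miss_rate(n, mean_a=600.0, mean_b=650.0, sd=200.0, trials=200, seed=3):
    """Fraction of trials in which two truly different lakes look the same."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        # Hypothetical lakes A and B; normal weights are an assumption.
        a = [rng.gauss(mean_a, sd) for _ in range(n)]
        b = [rng.gauss(mean_b, sd) for _ in range(n)]
        diff = statistics.mean(b) - statistics.mean(a)
        # Standard error of the difference between two independent means.
        se_diff = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
        if abs(diff) <= 1.96 * se_diff:
            misses += 1          # failed to detect a real difference
    return misses / trials

for n in (20, 100, 500):
    print(f"n = {n:3d}: estimated Type II error rate = {miss_rate(n):.2f}")
```

Under these assumptions, the estimated rate of missed differences falls as the number of fish per sample grows.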

Textbook Reference: Appendix B Step 5: Inferential Statistics