# Revisiting Sample Bias

## Checking Your Data—Confidence Intervals and Confidence Levels

When we take a sample, how do we know whether it is a good representation of the population? There are ways that we can check this. I do want to caution you that even the best researchers will always find at least a little difference between the sample and the population, as some difference is inevitable. That difference is called sampling error, and it is simply part of research life. As I have discussed throughout this chapter, there are ways to increase one's chances of selecting a representative sample, thereby reducing sampling error. One is the use of random selection. Another is selecting a large enough sample size. I will get to appropriate sample size in a little bit. Before I do that, I want to discuss how to compare your sample to the population from which it was selected.

The first way to compare your sample to the population is the easiest, and that is to just look at the sample statistics and population parameters (Maxfield & Babbie, 2012). A sample statistic provides a summary description of a variable in the sample, whereas the population parameter gives the same information about a variable in the population. Whenever population parameters are known, we can get an idea of how representative our sample is by comparing the numbers to our sample statistics. If I took a random sample of residents in a particular state, I could then look at my sample statistics for resident demographics, such as race and sex, and compare those numbers to the state’s published census results.
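The comparison described above can be sketched in a few lines of code. This is a minimal illustration with made-up numbers, not real census figures: the demographic categories and all percentages are assumptions for the sake of the example.

```python
# Hypothetical sketch: comparing sample statistics to known population
# parameters. All numbers here are invented for illustration only.
census = {"female": 0.51, "white": 0.72, "black": 0.13}   # population parameters
sample = {"female": 0.48, "white": 0.75, "black": 0.12}   # sample statistics

for group, parameter in census.items():
    difference = sample[group] - parameter
    print(f"{group}: sample {sample[group]:.0%}, census {parameter:.0%}, "
          f"difference {difference:+.0%}")
```

Small differences like these are ordinary sampling error; large, systematic gaps would suggest the sample is not representative.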

When we do not have the population statistics available, we can use some very basic statistical techniques to determine confidence intervals and confidence levels. A confidence interval is a range of values in which the true population parameter is likely to be. For example, let's take a sample of jails across the country and ask them to indicate the percent capacity at which they were operating on June 30 that year. We get a range of responses on our surveys and calculate the mean facility capacity of our sample to be 84 percent. We can figure out the size of the confidence interval in which the population capacity mean lies by determining our confidence level, or the probability that our population value falls within a particular confidence interval. So we have our sample mean of 84 percent capacity for the jails. We now need a few other statistics, including the standard deviation, which is a measure of how far, on average, the individual values in our sample fall from the sample mean. We also need to know the size of the sample we selected. Once we have those figures, we can calculate a confidence interval. We can then adjust the size of the confidence interval, depending on how confident we want to be that we have captured the true population value. The drawback here is that the higher our confidence level is, the less precise our confidence interval will be.
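The calculation described above can be sketched with Python's standard library. The sample mean of 84 percent comes from the text's hypothetical jail survey; the standard deviation of 12 and sample size of 100 are assumptions added purely to make the example runnable.

```python
from statistics import NormalDist

# Sketch of a confidence interval for a sample mean. The mean (84 percent
# capacity) is from the hypothetical jail example; sd and n are assumed.
mean = 84.0       # sample mean (percent capacity)
sd = 12.0         # sample standard deviation (assumption for illustration)
n = 100           # sample size (assumption for illustration)

confidence = 0.95
z = NormalDist().inv_cdf((1 + confidence) / 2)   # critical z value, about 1.96
margin = z * sd / n ** 0.5                       # margin of error
print(f"{confidence:.0%} CI: {mean - margin:.1f} to {mean + margin:.1f} percent")
```

Raising `confidence` toward 0.99 increases `z`, which widens the interval, which is exactly the confidence-versus-precision trade-off described above.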

Let me share an example I often use with my students to illustrate the trade-off between the confidence level and the size of the confidence interval. I'll ask someone what time of day they were born. Let's say the student answers 2:15 p.m. Then, I ask if they are really sure, and the student says, "Well, sometime between 1:30 and 2:45." Next, I tell the student that they will get an automatic F in the class if I find out that they are wrong, and they respond, "Sometime between 1 p.m. and 3 p.m." I do this with my students all the time, and the higher I make the stakes for being wrong, the less precise they get because they want to be confident that their time interval captures the actual time that they were born. So we could use our available sample mean, standard deviation, and sample size to calculate the amount of sampling error. That would give us an idea of how closely our sample resembles the population. The smaller the calculated sampling error, the better our sample matches the population.

Going back to my hypothetical jails example, my mean occupancy rate was 84 percent. I could then use this information and the other statistics that I already mentioned and calculate a confidence interval. If I wanted to be just 50 percent sure that I knew what the actual population mean was, I might use the sample mean, sample size, and standard deviation to calculate a confidence interval that would probably be pretty small, given that we are sacrificing confidence for precision here. In this hypothetical example, we come up with a mean of 84 percent and an interval of ±3 percent, meaning that I am 50 percent sure that the nationwide jail occupancy rate is somewhere between 81 percent and 87 percent. People rarely want to be only 50 percent sure, though. We are much more likely to want to be 95 percent or 99 percent confident in our results. If I were to calculate the confidence interval with a 95 percent confidence level, that would produce a larger confidence interval, perhaps 84 percent ±7 percent, or anywhere from 77 percent to 91 percent occupancy rate.
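The widening of the interval at higher confidence levels can be demonstrated directly. In this sketch, the standard error is reverse-engineered (an assumption, not a figure from the text) so that the 50 percent interval comes out near the chapter's illustrative ±3 points; the 95 and 99 percent intervals will therefore not exactly match the chapter's rough "perhaps ±7" figure, but the pattern of widening is the point.

```python
from statistics import NormalDist

# Sketch of the confidence-versus-precision trade-off from the jail example.
# The standard error (about 4.45) is assumed so that the 50% interval lands
# near the chapter's illustrative +/- 3 points.
mean = 84.0   # sample mean occupancy (percent)
se = 4.45     # assumed standard error of the mean

for confidence in (0.50, 0.95, 0.99):
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z * se
    print(f"{confidence:.0%} confident: {mean:.0f} ± {margin:.1f} points")
```

Each step up in confidence raises the critical z value, and the interval grows accordingly.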

I began this chapter discussing political polls. Have you ever noticed that poll results are always accompanied by a margin of error, usually 1 to 5 percent? That is because the poll is based on a sample, not the whole voting population, and the margin of error is the confidence interval. If there is an upcoming election and Politician A has 51 percent support while Politician B has 48 percent, why is it that the reporters say that they are in a "statistical dead heat" when one is 3 percent ahead in the polls? That is because there is likely at least a 3 percent margin of error associated with this sample, so the true population level of support for Politician A is actually anywhere from 48 percent to 54 percent, and for Politician B, the confidence interval is 45 percent to 51 percent. So, based on this sample, we cannot completely rule out the possibility that Politician B is actually ahead in the polls.
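The "statistical dead heat" logic above can be sketched in a few lines. The poll numbers and the ±3-point margin of error are the chapter's hypothetical figures.

```python
# Sketch of the "statistical dead heat" example: with a +/- 3-point margin
# of error, the two candidates' intervals overlap. Numbers are the
# chapter's hypothetical poll results.
margin = 3.0  # margin of error reported with the poll, in points

for name, support in (("Politician A", 51.0), ("Politician B", 48.0)):
    low, high = support - margin, support + margin
    print(f"{name}: {support:.0f}% support, true value likely "
          f"between {low:.0f}% and {high:.0f}%")

# A's interval (48-54) overlaps B's (45-51), so the poll cannot rule out
# the possibility that B is actually ahead.
```

Because the bottom of A's interval (48) falls below the top of B's interval (51), the poll alone cannot tell us who is really leading.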