Chapter 3 Samples and populations
3.1 Introduction
So far we have discussed statistical description, which is concerned with summarizing features of a sample of observed data. From now on, most of the attention will be on statistical inference. As noted in Section 1.2.3, the purpose of inference is to draw conclusions about the characteristics of some larger population based on what is observed in a sample. In this chapter we will first give more careful definitions of the concepts of populations and samples, and of the connections between them. In Section 3.5 we then consider the idea of a population distribution, which is the target of statistical inference. The discussion of statistical inference will continue in Chapters 4–7 where we gradually introduce the basic elements of inference in the contexts of different types of analyses.
3.2 Finite populations
In many cases the population of interest is a particular group of real people or other units. Consider, for example, the European Social Survey (ESS) which we used in Chapter 2 (see early in Section 2.2).10 The ESS is a cross-national survey carried out biennially in around 30 European countries. It is an academically-driven social survey which is designed to measure a wide range attitudes, beliefs and behaviour patterns among the European population, especially for purposes for cross-national comparisons.
The target population of ESS is explicitly stated as being “all persons aged 15 and over resident within private households, regardless of their nationality, citizenship, language or legal status” in each of the participating countries. This is, once “private household” has been defined carefully, and notwithstanding the inevitable ambiguity in that the precise number and composition of households are constantly changing, a well-defined, existing group. It is also a large group: in the UK, for example, there are around 50 million such people. Nevertheless, we have no conceptual difficulty with imagining this collection of individuals. We will call any such population a finite population.
The main problem with studying a large finite population is that it is usually not feasible to collect data on all of its members. A census is a study where some variables are in fact measured for the entire population. The best-known example is the Census of Population, which at least aims to be a complete evaluation of all persons living in a country on a particular date with respect to basic demographic data. Similarly, we have the Census of Production, Census of Distribution etc. For most research, however, a census is not feasible. Even when one is attempted, it is rarely truly comprehensive. For example, all population censuses which involve collecting the data from the people themselves end up missing a substantial (and non-random) proportion of the population. For most purposes a well-executed sample of the kind described below is actually preferable to an unsuccessful census.
3.3 Samples from finite populations
When a census is not possible, information on the population is obtained by observing only a subset of units from it, i.e. a sample. This is meant to be representative of the population, so that we can generalise findings from the sample to the population. To be representative in a sense appropriate for statistical inference, a sample from a finite population must be a probability sample, obtained using
- probability sampling: a sampling method where every unit in the population has a known, non-zero probability of being selected to the sample.
Probability sampling requires first a sampling frame, essentially one or more lists of units or collections of units which make it possible to select and contact members of the sample. For example, the first stage of sampling for many UK surveys uses the Postcode Address File, a list of postal addresses in the country. A sampling design is then created in such a way that it assigns a sampling probability for each unit, and the sample is drawn so that each unit’s probability of being selected into the sample is given by their sampling probability. The selection of the specific set of units actually included in the sample thus involves randomness, usually implemented with the help of random number generators on computers.
The simplest form of probability sampling is
- simple random sampling, where every unit in the population has the same probability of selection.
This requirement of equal selection probabilities is by no means essential. Other probability sampling methods which relax it include
stratified sampling, where the selection probabilities are set separately for different groups (strata) in the population, for example separately for men and women, different ethnic groups or people living in different regions.
cluster sampling, where the units of interest are not sampled individually but in groups (clusters). For example, a school survey might involve sampling entire classes and then interviewing every pupil in each selected class.
multistage sampling, which employs a sequence of steps, often with a combination of stratification, clustering and simple random sampling. For example, many social surveys use a multistage area sampling design which begins with one or more stages of sampling areas, then households (addresses) within selected small areas, and finally individuals within selected households.
These more complex sampling methods are in fact used for most large-scale social surveys to improve their accuracy and/or cost-efficiency compared to simple random sampling. For example, the UK component of the European Social Survey uses a design of three stages: (1) a stratified sample of postcode sectors, stratified by region, level of deprivation, percentage of privately rented households, and percentage of pensioners; (2) simple random sample of addresses within the selected sectors; and (3) simple random sample of one adult from each selected address.
Some analyses of such data require the use of survey weights to adjust for the fact that some units were more likely than others to end up in the sample. The questions of how and when the weights should be used are, however, beyond the scope of this course. Here we will omit the weights even in examples where they might normally be used.11
Not all sampling methods satisfy the requirements of probability sampling. Such techniques of non-probability sampling include
purposive sampling, where the investigator uses his or her own “expert” judgement to select units considered to be representative of the population. It is very difficult to do this well, and very easy to introduce conscious or unconscious biases into the selection. In general, it is better to leave the task to the random processes of probability sampling.
haphazard or convenience sampling, as when a researcher simply uses the first \(n\) passers-by who happen to be available and willing to answer questions. One version of this is volunteer sampling, familiar from call-in “polls” carried out by morning television shows and newspapers on various topics of current interest. All we learn from such exercises are the opinions of those readers or viewers who felt strongly enough about the issue to send in their response, but these tell us essentially nothing about the average attitudes of the general population.
quota sampling, where interviewers are required to select a certain number (quota) of respondents in each of a set of categories (defined, for example, by sex, age group and income group). The selection of specific respondents within each group is left to the interviewer, and is usually done using some (unstated) form of purposive or convenience sampling. Quota sampling is quite common, especially in market research, and can sometimes give reasonable results. However, it is easy to introduce biases in the selection stage, and almost impossible to know whether the resulting sample is a representative one.
A famous example of the dangers of non-probability sampling is the survey by the Literary Digest magazine to predict the results of the 1936 U.S. presidential election. The magazine sent out about 10 million questionnaires on post cards to potential respondents, and based its conclusions on those that were returned. This introduced biases in at least two ways. First, the list of those who were sent the questionnaire was based on registers such as the subscribers to the magazine, and of people with telephones, cars and various club memberships. In 1936 these were mainly wealthier people who were more likely to be Republican voters, and the typically poorer people not on the source lists had no chance of being included. Second, only about 25% of the questionnaires were actually returned, effectively rendering the sample into a volunteer sample. The magazine predicted that the Republican candidate Alf Landon would receive 57% of the vote, when in fact his Democratic opponent F. D. Roosevelt gained an overwhelming victory with 62% of the vote. The outcome of the election was predicted correctly by a much smaller probability sample collected by George Gallup.
A more recent example is the “GM Nation” public consultation exercise on attitudes to genetically modified (GM) agricultural products, carried out in the U.K. in 2002–3.12 This involved various activities, including national, regional and local events where interested members of the public were invited to take part in discussions on GM foods. At all such events the participants also completed a questionnaire, which was also available on the GM Nation website. In all, around 37000 people completed the questionnaire, and around 90% of those expressed opposition to GM foods. While the authors of the final report of the consultation drew some attention to the unrepresentative nature of this sample, this fact had certainly been lost by the time the results were reported in the national newspapers as “5 to 1 against GM crops in biggest ever public survey”. At the same time, probability samples suggested that the British public is actually about evenly split between supporters and opponents of GM foods.
3.4 Conceptual and infinite populations
Even a cursory inspection of academic journals in the social sciences will reveal that a finite population of the kind discussed above is not always clearly defined, nor is there often any reference to probability sampling. Instead, the study designs may for example resemble the following two examples:
Example: A psychological experiment
Fifty-nine undegraduate students from a large U.S. university took part
in a psychological experiment, either as part of a class project or
for extra credit on a psychology course.13 The participants were randomly
assigned to listen to one of two songs, one with clearly violent lyrics
and one with no violent content. One of the variables of interest was a
measure (from a 35-item attitude scale) of state hostility
(i.e. temporary hostile feelings), obtained after the participants had
listened to a song, and the researchers were interested in comparing
levels of hostility between the two groups.
Example: Voting in a congressional election
A political-science article considered the U.S. congressional
election which took place between June 1862 and November 1863,
i.e. during a crucial period in the American Civil War.14 The units of
analysis were the districts in the House of Representatives. One part of
the analysis examined whether the likelihood of the candidate of the
Republican Party (the party of the sitting president Abraham Lincoln)
being elected from a district was associated with such explanatory
variables as whether the Republican was the incumbent, a measure of the
quality of the other main candidate, number of military casualties for
the district, and the timing of the election in the district (especially
in relation to the Union armies’ changing fortunes over the period).
There is no reference here to the kinds of finite populations and probability samples discussed Sections 3.2 and 3.3. In the experiment, the participants were a convenience sample of respondents easily available to the researcher, while in the election study the units represent (nearly) all the districts in a single (and historically unique) election. Yet both articles contain plenty of statistical inference, so the language and concepts of samples and populations are clearly being used. How is this to be justified?
In the example of the psychological experiment the subjects will clearly not be representative of a general (non-student) population in many respects, e.g. in age and education level. However, it is not really such characteristics that the study is concerned with, nor is the population of interest really a population of people. Instead, the implicit “population” being considered is that of possible values of level of hostility after a person has listened to one of the songs in the experiment. In this extended framework, these possible values include not just the levels of hostitility possibly obtained for different people, but also those that a single person might have after listening to the song at different times or in different moods etc. The generalisation from the observed data in the experiment is to this hypothetical population of possible reactions.
In the political science example the population is also a hypothetical one, namely those election results that could have been obtained if something had happened differently, i.e. if different people turned up to vote, if some voters had made different decisions, and so on (or if we considered a different election in the same conditions, although that is less realistic in this example, since other elections have not taken place in the middle of a civil war). In other words, votes that actually took place are treated as a sample from the population of votes that could conceivably have taken place.
In both cases the “population” is in some sense a hypothetical or conceptual one, a population of possible realisations of events, and the data actually observed are a sample from that population. Sometimes it is useful to apply similar thinking even to samples from ostensibly quite finite populations. Any such population, say the residents of a country, is exactly fixed at one moment only, and was and will be slightly different at any other time, or would be even now if any one of a myriad of small events had happened slightly differently in the past. We could thus view the finite population itself at a single moment as a sample from a conceptual population of possible realisations. This is known in survey literature as a superpopulation. The data actually observed are then also a sample from the superpopulation. With this extension, it is possible to regard almost any set of data as a sample from some conceptual superpopulation.
The highly hypothetical notion of a conceptual population of possible events is clearly going to be less easy both to justify and to understand than the concept of a large but finite population of real subjects defined in Section 3.2. If you find the whole idea distracting, you can focus in your mind on the more understandable latter case, at least if you are willing to believe that the idea of a conceptual population is also meaningful. Its main justification is that much of the time it works, in the sense that useful decision rules and methods of analysis are obtained based on the idea. Most of the motivation and ideas of statistical inference are essentially the same for both kinds of populations.
Even when the idea of a conceptual population is invoked, questions of representativeness of and generalisability to real, finite populations will still need to be kept in mind in most applications. For example, the assumption behind the psychological experiment described above is that the findings about how hearing a violent song affects levels of hostility are generalisable to some larger population, beyond the 59 participants in the experiment and beyond the body of students in a particular university. This may well be the case at least to some extent, but it is still open to questioning. For this reason findings from studies like this only become really convincing when they are replicated in comparable experiments among different kinds of participants.
Because the kinds of populations discussed in this section are hypothetical, there is no sense of them having a particular fixed number of members. Instead, they are considered to be infinite in size. This also implies (although it may not be obvious why) that we can essentially always treat samples from such populations as if they were obtained using simple random sampling.
3.5 Population distributions
We will introduce the idea of a population distribution first for finite populations, before extending it to infinite ones. The discussion in this section focuses on categorical variables, because the concepts are easiest to explain in that context; generalisations to continuous variables are discussed in Chapter 7.
Suppose that we have drawn a sample of \(n\) units from a finite population and determined the values of some variables for them. The units that are not in the sample also possess values of the variables, even though these are not observed. We can thus easily imagine how any of the methods which were in Chapter 2 used to describe a sample could also be applied in the same way to the whole population, if only we knew all the values in it. In particular, we can, paralleling the sample distribution of a variable, define the population distribution as the set of values of the variable which appear in the population, together with the frequencies of each value.
For illustration, consider again the example introduced early in Section 2.2. The two variables there are a person’s sex and his or her attitude toward income redistribution. We have observed them for a sample \(n=2344\) people drawn from the population of all UK residents aged 15 or over. The sample distributions are summarised by Table 2.3.
Sex |
Agree strongly | Agree |
Neither agree nor disagree | Disagree |
Disagree strongly | Total |
---|---|---|---|---|---|---|
Male | 3.84 | 10.08 | 4.56 | 4.32 | 1.20 | 24.00 |
(16.00) | (42.00) | (19.00) | (18.00) | (5.00) | (100) | |
[7.68] | [20.16] | [9.12] | [8.64] | [2.40] | [48.00] | |
Female | 4.16 | 13.00 | 4.68 | 3.38 | 0.78 | 26.00 |
(16.00) | (50.00) | (18.00) | (13.00) | (3.00) | (100) | |
[8.32] | [26.00] | [9.36] | [6.76] | [1.56] | [52.00] | |
Total | 8.00 | 23.08 | 9.24 | 7.70 | 1.98 | 50 |
(16.00) | (46.16) | (18.48) | (15.40) | (3.96) | (100) |
Imagine now that the full population consisted of 50 million people, and that the values of the two variables for them were as shown in Table 3.1. The frequencies in this table desribe the population distribution of the variables in this hypothetical population, with the joint distribution of sex and attitude shown by the internal cells of the table and the marginal distributions by its margins. So there are for example 3.84 million men and 4.16 million women in the population who strongly agree with the attitude statement, and 1.98 million people overall who strongly disagree with it.
Rather than the frequencies, it is more helpful to discuss population distributions in terms of proportions. Table 3.1 shows two sets of them, the overall proportions in square brackets out of the total population size, and the two rows of conditional proportions of attitude given sex (in parentheses). Either of these can be used to introduce the ideas of population distributions, but we focus on the conditional proportions because they will be more convenient for the discussion in later chapters. In this population we observe, for example, that the conditional proportion of “Strongly disagree” given that a person is a woman is 0.03, i.e. 3% of women strongly disagree with the statement, while among men the corresponding conditional proportion is 0.05.
Instead of “proportions”, when we discuss population distributions we will usually talk of “probabilities”. The two terms are equivalent when the population is finite and the variables are categorical, as in Table 3.1, but the language of probabilities is more appropriate in other cases. We can then say that Table 3.1 shows two sets of conditional probabilities in the population, which define two conditional probability distributions for attitude given sex.
The notion of a probability distribution creates a conceptual connection between population distributions and sampling from them. This is that the probabilities of the population distribution can also be thought of as sampling probabilities in (simple random) sampling from the population. For example, here the conditional probability of “Strongly disagree” among men is 0.05, while the probability of “Strongly agree” is 0.16. The sampling interpretation of this is that if we sample a man at random from the population, the probability is 0.05 that he strongly disagrees and 0.16 that he strongly agrees with the attitude statement.
The view of population distributions as probability distributions works also in other cases than the kind that is illustrated by Table 3.1. First, it applies also for continuous variables, where proportions of individual values are less useful (this is discussed further in Chapter 7). Second, it is also appropriate when the population is regarded as an infinite superpopulation, in which case the idea of population frequencies is not meaningful. With this device we have thus reached a formulation of a population distribution which is flexible enough to cover all the situations where we will need it.
3.6 Need for statistical inference
We have now introduced the first key concepts that are involved in statistical inference:
The population, which may regarded as finite or infinite. Distributions of variables in the population are the population distributions, which are formulated as probability distributions of the possible values of the variables.
Random samples from the population, and sample distributions of variables in the sample.
Substantive research questions are most often questions about population distributions. This raises the fundamental challenge of inference: what we are interested in — the population — is not fully observed, while what we do observe — the sample — is not of main interest for itself. The sample is, however, what information we do have to draw on for conclusions about the population. Here a second challenge arises: because of random variation in the sampling, sample distributions will not be identical to population distributions, so inference will not be as simple as concluding that whatever is true of the sample is also true of the population. Something cleverer is needed to weigh the evidence in the sample, and that something is statistical inference.
The next three chapters are mostly about statistical inference. Each of them discusses a particular type of analysis and inferential and decriptive statistical methods for it. These methods are some of the most commonly used in basic statistical analyses of empirical data. In addition, we will also use them as contexts in which to introduce the general concepts of statistical inference. This will be done gradually, with each chapter both building on previous concepts and introducing new ones, as follows:
Chapter 4: Associations in two-way contingency tables (significance testing, sampling distributions of statistics).
Chapter 5: Single proportions and comparisons of proportions (probability distributions, parameters, point estimation, confidence intervals).
Chapter 7: Means of continuous variables (probability distributions of continuous variables, and inference for such variables).
European Social Survey (2012). ESS Round 5 (2010/2011) Technical Report. London: Centre for Comparative Social Surveys, City University London. See http://www.europeansocialsurvey.org for more on the ESS.↩
For more on survey weights and the design and analysis of surveys in general, please see MY456 (Survey Methodology) in the Lent Term.↩
For more information, see Gaskell, G. (2004). “Science policy and society: the British debate over GM agriculture”, Current Opinion in Biotechnology 15, 241–245.↩
Experiment 1 in Anderson, C. A., Carnagey, N. L., and Eubanks, J. (2003). “Exposure to violent media: the effects of songs with violent lyrics on aggressive thoughts and feelings”. Journal of Personality and Social Psychology 84, 960–971.↩
Carson, J. L. et al. (2001). “The impact of national tides and district-level effects on electoral outcomes: the U.S. congressional elections of 1862–63”. American J. of Political Science 45, 887–898.↩