Appendix

Computer classes

Some general instructions on computing and the SPSS package are given first below. It makes most sense to read these together with the instructions for individual computer classes.

General instructions

Using the networked computers at LSE

To access IT facilities at LSE you need an IT account with its Username and Password. Please see https://info.lse.ac.uk/staff/divisions/imt/help/guides-faqs/accounts for instructions on how to activate your account. In case of any problems, please ask for assistance at the IT help desk (Library 1st floor).
Various introductory documents can be accessed through the IMT services web pages at https://info.lse.ac.uk/current-students/imt.
Logging in to use Windows: When you arrive at a networked computer, wait for Windows to start up (if the machine is not already on). Press CTRL + ALT + Delete and the Enter Network Password screen will appear. Type in your username and your password and press Enter or click on the OK button. This will log you on to the computer.

Data downloading

The instructions for each class will give the name of a file or files which will be used for that exercise. In order to do the class, you will need to download the file to your H: space (i.e. your personal file storage space on the LSE network, shown as disk drive H: on a networked computer once you have logged on). You can download all the data files for the course, as well as other course-related material, from the web-based Moodle system. See instructions in the beginning of this book for how to register for MY451 on Moodle.

Introduction to SPSS

General information and documentation

SPSS (formerly Statistical Package for the Social Sciences) is a widely used general-purpose statistical software package. It will be used for all the computer classes on this course. The current version on the LSE network is SPSS 21. This section gives some general information on the structure and use of SPSS. The discussion is brief and not meant to be comprehensive. The instructions given here and in the descriptions of individual computer classes below will be sufficient for the purposes of this course. If, however, you wish to find out more about SPSS, more information and examples can be found in the SPSS help files and tutorials found under the Help menu of the program, and in introductory guide books such as

Field, A. (2013). Discovering Statistics using IBM SPSS Statistics (4th ed). Sage.
Kinnear, P. R. and Gray, C. D. (2012). SPSS 19 Made Simple. Psychology Press.
Pallant, J. (2013). SPSS Survival Manual (5th ed). Open University Press.

These are given here purely as examples (there are many others) and not as recommendations. We have not reviewed any of these books in detail and so cannot make any comparisons between them.

Starting SPSS

To start SPSS, double-click on the SPSS icon on the Windows desktop. Alternatively, click on the Start button at the bottom left corner, and select All Programs, then Specialist and teaching software, Statistics, SPSS, and finally SPSS 21 (or some obvious variant of these, in case the exact wording on your desktop is slightly different).

An initial screen for opening data files appears. Click on Cancel to get rid of this and to enter the data editor (which will be discussed further below).

Exiting from SPSS

Select Exit from the File menu or click on the X at the upper right corner of the SPSS data editor window. You may then be prompted to save the information in the open windows; in particular, you should save the contents of the data editor in a file (see below) if you have made any changes to it.

SPSS windows

There are several different types of windows in SPSS. The two most important are

Data editor: A data set is displayed in the Data Editor window. Several of these can be open at a time. The data editor which you have selected (clicked on) most recently defines the active data set, and the procedures you request from the menus are applied to this data set until you select a different active data set. The data editor window has two parts, accessed by clicking on the two tabs at the bottom of the window:
- Data view, which shows the data matrix in the spreadsheet-like form discussed in Section 1.2.1, with units in the rows and variables in the columns.
- Variable view, which shows information about the variables.
Working with the data editor will be practised in the first computer class. The contents of the data editor, i.e. the data matrix and associated information, can be saved in an SPSS data file. Such files have names with the extension .sav.
Output viewer: Output from statistical analyses carried out on the data will appear here. The output can be printed directly from the viewer or copied and pasted to other programs. The contents of the viewer can also be saved in a file, with a name with the extension .spv (since version 17; in previous versions of SPSS the extension was .spo).

There are also other windows, for example for editing SPSS graphs. They will be discussed in the instructions to individual computer classes where necessary.

Menus

SPSS has a menu-based interface, which can be used to access most of its features for statistical analysis, manipulation of data, loading, saving and printing files, and so on.

The procedures for statistical analysis are found under the Analyze menu, which provides further drop-down menus for choosing different methods.
- Similarly, procedures for various statistical graphics are found under Graphs. We will be using procedures found under Graphs / Legacy Dialogs. Here “legacy” means that these are the graphics menus which were also included in previous versions of SPSS. The current version also contains a second, new set of menus for the same graphs, under Graphs / Chart Builder. We do not regard these as an improvement in usability, so we will continue to use the old menus. You are welcome to explore the capabilities of the “Chart Builder” on your own.
Eventually the menu choices lead to a dialog box with various boxes and buttons for specifying details of the required analysis. Most dialog boxes contain buttons which open new dialog boxes for further options. The details of these choices for the methods covered on this course are described in the instructions to individual computer classes.
Almost all of the dialog boxes have options which are not needed for our classes and never mentioned in the instructions. Some of these simply modify the output, others request variants of the statistical methods which will not be used in these classes. All such options have default values which can be left untouched here. You are, however, welcome to experiment with these additional choices to see what they do. Further information on them can be accessed through the Help button in each dialog box.

Notational conventions for the instructions

Because analyses in SPSS are carried out by making choices from the menus, the instructions for the computer classes need to describe these choices somehow. To reduce the length and tedium of the instructions, we will throughout present them in a particular format explained below. Because this information is rather abstract if read in isolation, it is best to go through it while carrying out specific instructions for the first few computer classes.

The appropriate menu choices for obtaining the dialog box for the required analysis are first given in bold, for example as follows:

Analyze/Descriptive statistics/Frequencies

This is short for “Click on the menu item Analyze at the top of the window; from the drop-down menu, select Descriptive statistics and then click on Frequencies.” This particular choice opens a dialog box for constructing various descriptive statistics and graphs (as discussed in Chapter 2).

Unless otherwise mentioned, subsequent instructions then refer to choices in the most recently opened dialog box, without repeating the full path to it.
For all of the statistical analyses, we need first to specify which variables the analyses should be applied to. This is done by entering the names of those variables in appropriate boxes in the dialog boxes. For example, the dialog box opened above has a box labelled Variable(s) for this purpose. The dialog box also includes a separate box containing a list of all the variables in the data set. The required variables are selected from this list and moved to the choice boxes (and back again, when choices are changed) by clicking on an arrow button between the boxes. For example, suppose that a data set contains a grouped age variable called AGEGROUP, for which we want to construct a frequency table. The class instructions may then state in words “Place AGEGROUP in the Variable(s) box”, or sometimes just

Variable(s)/AGEGROUP

both of which are short for “In the dialog box opened above, click on the name AGEGROUP in the list of variables, and then click on the arrow button to move the name into the Variable(s) box”. Sometimes we may also use a generic instruction of the form

Variable(s)/<Variables>

where <Variables> indicates that this is where we would put the name of any variables for which we want to obtain a frequency table. Note that here and in many other procedures, it is possible to select several variables at once. For the Frequencies procedure used as an example here, this simply means that a separate frequency table is constructed for each selected variable.
Other choices in a dialog box determine details of the analysis and its output. In most cases the selection is made from a fixed list of possibilities provided by SPSS, by clicking on the appropriate box or button. In the instructions, the choice is indicated by listing a path to it, for example as

Charts/Chart Type/Bar charts

in the above example (this requests the so-called bar chart). The items on such a list are labels for various items in the dialog boxes. For example, here Charts is a button which opens a new subsidiary dialog box, Chart Type is the title of a list of options in this new dialog box, and Bar charts is the choice we want to select. In other words, the above instruction is short for “In the dialog box opened above, click on the button Charts to open a new dialog box. Under Chart type, select Bar charts by clicking on a button next to it.”
Some choices need to be made by typing in some information rather than selecting from a list of options. Specific instructions for this will be given when needed.
After choices are made in subsidiary dialog boxes, we return to the main dialog box by clicking on Continue. Once all the required choices have been made, the analysis is executed by clicking on OK in the main dialog box. This should be reasonably obvious, so we will omit explicit instructions to do so.

A useful feature of SPSS is the dialog recall button, which is typically sixth from the left in the top row of buttons in the Output viewer window; the button shows a rectangle with a green arrow pointing down from it. Clicking on this gives a menu of recently used procedures, and choosing one of these brings up the relevant dialog box, with the previously used choices selected. This is useful when you want to rerun a procedure, e.g. to try different choices for its options. It is usually quicker to reopen a dialog box using the dialog recall button than through the menus.

SPSS session options

Various options for controlling the format of SPSS output and other features can be found under Edit/Options. For example, an often useful choice is General/Variable Lists/Display names. This instructs SPSS to display the names of variables in the variable lists of all procedures, instead of the (typically much longer) descriptive labels of the variables. In large data sets this may make it easier to find the right variables from the list. This may be further helped by selecting General/Variable Lists/Alphabetical, which causes the names to be listed in an alphabetical order rather than the order in which the variables are included in the data set.

Printing from SPSS

All the computers in the public rooms are connected to one of the laser printers. When you print a document or a part of it, you need to have credit on your printing account. See https://info.lse.ac.uk/staff/divisions/imt/help/guides-faqs/campus-facilities for further information.

You can print your results from the Output Viewer either by selecting File/Print or by clicking on Print on the toolbar (the button with a little picture of a printer). Please note that SPSS output is often quite long, so this may result in much more printout than you really want.
Alternatively, in the Output Viewer, select the objects to be printed, select Edit / Copy, open a Word or Excel document and Paste. You can make any changes or corrections in this document before printing it. This method gives you more control over what gets printed than printing directly from SPSS.
At the printer terminal, type in your username and password. The files sent for printing are then listed. Select the appropriate file number and follow the instructions given by the computer.

SPSS control language

Early versions of SPSS had no menu-based interface. Instead, commands were executed by specifying them in the SPSS control language. This language is still there, underlying the menus, and each choice of commands and options from the menus can also be specified in the control language. We will not use this approach on this course, so you can ignore this section if you wish. However, there are some very good reasons why you might want to learn about the control language if you need to work with SPSS for, say, analyses for your thesis or dissertation:

Because the control language commands can be saved in a file, they preserve a record of how an analysis was done. This may be important for checking that there were no errors, and for rerunning the analyses later if needed.
For repetitive analyses, modifying and rerunning commands in the control language is quicker and less tedious than using the menus repeatedly.
Some advanced SPSS procedures are not included in the menus, and can only be accessed through the control language.

The main cost of using the control language is learning its syntax. This is initially much harder than using the menus, but becomes easier with experience. The easiest way to begin learning the syntax is to request SPSS to print out the commands corresponding to choices made from the menus. Two easy ways of doing so are

Selecting the session option (i.e. under Edit/Options) Viewer/Display commands in the log. This causes the commands corresponding to the menu choices to be displayed in the output window.
Clicking on the Paste button in a dialog box (instead of OK) after selecting an analysis. This opens a Syntax window where the corresponding commands are now displayed. The commands in a syntax window can be edited and executed, and also saved in a file (with the extension .sps) for future use.

WEEK 2 class: Descriptive statistics for categorical data, and entering data

Data set

The data file ESS5_sample.sav will be used today. It contains a simplified sample of data from UK respondents in the 2010 European Social Survey (Round 5). The questions in the survey that you see here were designed by Dr Jonathan Jackson and his team as part of a module investigating public trust in the criminal justice system. Further information about the study can be found at
https://www.europeansocialsurvey.org/docs/findings/ESS5_toplines_issue_1_trust_in_justice.pdf ⁵⁹

The main purpose of today’s class is to introduce you to the layout of SPSS and to show you how to produce some basic tables and graphs for categorical variables. Additionally, we provide instructions on how to enter data into a new SPSS data file, using the Data Editor. This exercise is not strictly needed for the course, but we include it for two purposes. Firstly, students often find this a helpful way of learning how the software works. Secondly, this exercise may be a useful introduction for students who go on to collect or collate data for their own empirical research.

Classwork

Part 1: The layout of an SPSS data file

Opening an SPSS data file: this is done from File/Open/Data, selecting the required file from whichever folder it is saved in, in the usual Windows way. Do this to open ESS5_sample.sav.
Information in the Variable View window. The data file is now displayed in the Data Editor. Its Data View window shows the data as a spreadsheet (i.e. a data matrix). We will first consider the information in the Variable View window, accessed by clicking on the Variable View tab at the bottom left corner of the window. The columns of this window show various pieces of information about the variables. Take a little while familiarising yourself with them. The most important of the columns in Variable View are
- Name of the variable in the SPSS data file. The names in this column (also shown as the column headings in Data View) will be used to refer to specific variables in all of the instructions for these computer classes.
- Type of the variable. Here most of the variables are Numeric, i.e. numbers, and a few are String, which means text. Clicking on the entry for a variable in this column and then on the button (with three dots on it) revealed by this shows a list of other possibilities.
- Width and Decimals control the total number of digits and the number of decimal places displayed in Data View. Clicking on an entry in these columns reveals buttons which can be used to increase or decrease these values. Here all but two of the numeric variables are coded as whole numbers, so Decimals has been set to 0 for them.
- Label is used to enter a longer description of the variable. Double-clicking on an entry allows you to edit the text.
- Values shows labels for individual values of a variable. This is mostly relevant for categorical variables, such as most of the ones in these data. Such variables are coded in the data set as numbers, and the Values entry maintains a record of the meanings of the categories the numbers correspond to. You can see examples of this by clicking on some of the entries in the Values column and then on the resulting button. The value labels can also be displayed for each observation in Data View by selecting View/Value Labels in that window.
- Missing specifies missing data codes, i.e. values which are not actual measurements but indicators that an observation should be treated as missing. There may be several such codes. For example, variables in these data often have separate missing data codes for cases where a respondent was never asked a question (“Not applicable”, often abbreviated NAP), replied “Don’t know” (DK) or otherwise failed to provide an answer (“Refusal” or “No answer”; NA); the explanations of these values are found in the Values column. An alternative to using missing data codes (so-called User missing values) is to enter no value (a System missing value) for an observation in the data matrix. This is displayed as a full stop (.) in Data View. There are no such values in these data.
- Measure indicates the measurement level of a variable, as Nominal, Ordinal or Scale (meaning interval). This is mostly for the user’s information, as SPSS makes little use of this specification.
Any changes made to the data file are preserved by saving it again from File/Save (or by clicking on the Save File button of the toolbar, which is the one with the picture of a diskette). You will also be prompted to do so when exiting SPSS or when trying to open a new data file. Today you should not save any changes you may have made to ESS5_sample.sav, so click No if prompted to do so below.
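The role of the missing data codes described under Missing above can be illustrated outside SPSS with a minimal Python sketch. The coded answers and the codes 7, 8 and 9 used here are hypothetical and chosen only for illustration; they are not taken from ESS5_sample.sav.

```python
# Hypothetical coded answers to one survey question:
# 1-3 are substantive answers; 7, 8 and 9 are ASSUMED missing data codes.
responses = [1, 2, 2, 7, 3, 1, 9, 2, 8, 1]
MISSING_CODES = {7, 8, 9}  # assumption, for illustration only

# Keep only the observations that are actual measurements.
valid = [r for r in responses if r not in MISSING_CODES]
n_missing = len(responses) - len(valid)

print(f"Valid observations:   {len(valid)}")
print(f"Missing observations: {n_missing}")
```

Declaring such codes as missing has the same effect in SPSS: the coded observations remain in the data matrix but are left out of the calculations.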

Part 2: Descriptive statistics for categorical variables

Most of the statistics required for this class are found in SPSS under Analyze/Descriptive Statistics/Frequencies as follows:

Names of the variables for which the statistics are requested are placed in the Variable(s) box. To make it easy to find variables in the list box on the left, you may find it convenient to change the way the variables are displayed in the list; see under “SPSS Session Options” for instructions.
Tables of frequencies: select Display frequency tables
Bar charts: Charts/Chart Type/Bar charts. Note that under Chart Values you can choose whether the vertical axis shows frequencies or percentages.
Pie charts: Charts/Chart Type/Pie charts
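To see what a table of frequencies contains, the following minimal Python sketch builds one by hand: the count, percentage and cumulative percentage for each category. The responses and value labels are hypothetical (they are not the ESS data), and the function name frequency_table is ours, not part of SPSS.

```python
from collections import Counter

# Hypothetical value labels and coded responses (not the ESS data).
labels = {1: "A good job", 2: "Neither", 3: "A bad job"}
responses = [1, 1, 2, 3, 1, 2, 1, 3, 2, 1]


def frequency_table(values):
    """Return rows of (code, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    for code in sorted(counts):
        percent = 100 * counts[code] / total
        cumulative += percent
        rows.append((code, counts[code], percent, cumulative))
    return rows


table = frequency_table(responses)
for code, freq, pct, cum in table:
    print(f"{labels[code]:<12}{freq:>5}{pct:>8.1f}{cum:>8.1f}")
```

A bar chart of the same variable simply draws one bar per row of this table, with height equal to the frequency or the percentage.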

In addition, we will construct some two-way tables or cross-tabulations, by selecting Analyze/Descriptive Statistics/Crosstabs. In the dialog box that opens, request a contingency table between two variables by entering

The name of the row variable into the Row(s) box, and
The name of the column variable into the Column(s) box.
Cells/Percentages for percentages within the table: Row gives percentages within each row (i.e. frequencies divided by row totals), Column percentages within columns, and Total percentages out of the total sample size.
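The row percentages described above can be sketched in a few lines of Python. The counts below are hypothetical (two age groups by three opinion categories) and the function name row_percentages is ours; the sketch only illustrates the arithmetic of dividing each frequency by its row total.

```python
# Hypothetical counts: rows are age groups, columns are opinion categories.
counts = {
    "Up to 29": [10, 30, 10],
    "30-49": [20, 20, 10],
}


def row_percentages(table):
    """Divide each cell by its row total and express it as a percentage."""
    result = {}
    for row_label, row in table.items():
        row_total = sum(row)
        result[row_label] = [100 * cell / row_total for cell in row]
    return result


row_pct = row_percentages(counts)
for row_label, percentages in row_pct.items():
    print(row_label, [f"{p:.1f}" for p in percentages])
```

Each row of percentages sums to 100, which is what makes the rows directly comparable even when the groups differ in size.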

The labels in the SPSS output should be self-explanatory. Note that in this and all subsequent classes, the output may also include some entries corresponding to methods and statistics not discussed on this course. They can be ignored here.

The first variable in the data set, GOODJOB, asks respondents whether they generally feel that the police are doing a good job in their country. There are three response categories for this item: “a good job”, “neither a good job nor a bad job”, or “a bad job”. Obtain a frequency table and bar chart to investigate the distribution of responses to this question.

Check that you understand how to interpret the output you obtain. In particular, make sure that you understand the information displayed in each of the columns in the main table, and that you can see the connection between the information in the table and the information represented in the bar chart.
The last variable in the set, AGE_GRP, records in which of the following age groups each respondent falls: up to 29 years of age, 30-49, or 50+ years. Let us consider the association between age group and opinions of the police. Obtain a two-way contingency table of GOODJOB by AGE_GRP. To make interpretation easier, request percentages within each of the age groups. If you use AGE_GRP as the row variable, then include row percentages in the output.

Interpret the resulting table. Are opinions of the police distributed differently among the three different age groups? Does there appear to be an association between age group and attitude?
If you have time after completing the data entry exercise (below), you may wish to return to this data set and explore frequencies and contingency tables for some of the other variables in the set.

Part 3: Entering data directly into Data Editor

This procedure may be useful to know if the data you are analysing are not in any electronic form at the beginning of the analysis, for example if you start with a pile of filled-in questionnaires from a survey. For practice, we will enter the following small, artificial data set:

Sex: Man; Age: 45; Weight: 11 st 3 lbs
Sex: Woman; Age: 68; Weight: 8 st 2 lbs
Sex: Woman; Age: 17; Weight: 9 st 6 lbs
Sex: Man; Age: 28; Weight: 13 st 8 lbs
Sex: Woman; Age: 16; Weight: 7 st 8 lbs

Select File/New/Data to clear the Data Editor. Go to Variable View and enter names for the variables into the first four rows of the Name column, for example sex, age, wstones and wpounds.
Switch to Data View and type the data above into the appropriate columns, one unit (respondent) per row. Note that the person’s weight is entered into two columns, one for stones and one for pounds. Enter sex using numerical codes, e.g. 1 for women and 2 for men.
Save the file as a new SPSS data file (File/Save as), giving it a name of your choice. You should also resave the file (from File/Save or by clicking the File Save button) after each of the changes made below.
Practise modifying the information in Variable View by adding the following information for the sex variable:
- Enter the label Sex of the respondent into the Label column.
- Click on the Values cell and then on the resulting button to open a dialog box for entering value labels. Enter Value: 1; Value Label: Woman; Add. Repeat for men, and click OK when finished.
Transforming variables: It is often necessary to derive new variables from existing ones. We will practise the two most common examples of this:
1. Creating a grouped variable: Suppose, for example, that we want to define a grouped age variable with three categories: less than 25 years, 25–54 and 55 or over. This is done as follows:
  - Select Transform/Recode into Different Variables. This opens a dialog box which is used to define the rule for how values of the existing variable are to be grouped into categories of the new one.
  - Move the name of the age variable to the Input Variable -> Output Variable box.
  - Under Output Variable, enter the Name of the new variable, for example agegroup, and click Change.
  - Click on Old and New Values. Enter Old Value/Range: Lowest through 24 and New Value/Value: 1, and click Add.
  - Repeat for the other two categories, selecting Range: 25 through 54 and Range: 55 through highest for Old value, and 2 and 3 respectively for New value.
  - You should now see the correct grouping instructions in the Old -> New box. Click Continue and OK to create the new variable.
  - Check the new variable in Data View. At this stage you should normally enter in Variable View the value labels of the age groups.
2. Calculations on variables: Some new variables are obtained through mathematical calculations on existing ones. For example, suppose we want to include weight in kilograms as well as stones and pounds. Using the information that one stone is equal to 6.35 kg and one pound is about 0.45 kg, the transformation is carried out as follows:
  - Select Transform/Compute Variable. This opens a dialog box which is used to define the rule for calculating the values of the new variable.
  - Enter Target variable: weightkg (for example; this is the name of the new variable) and Numeric Expression: 6.35 * wstones + 0.45 * wpounds; for the latter, you can either type in the formula or use the variable list and calculator buttons in a fairly obvious way.
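The two transformations above can be written out as a minimal Python sketch, applied to the small artificial data set entered in this exercise. This is only an illustration of the rules being applied, not how SPSS implements them; the names people and age_group are ours, and sex is coded 1 for women and 2 for men as suggested above.

```python
# The artificial data set from this exercise, one dictionary per respondent.
people = [
    {"sex": 2, "age": 45, "wstones": 11, "wpounds": 3},
    {"sex": 1, "age": 68, "wstones": 8, "wpounds": 2},
    {"sex": 1, "age": 17, "wstones": 9, "wpounds": 6},
    {"sex": 2, "age": 28, "wstones": 13, "wpounds": 8},
    {"sex": 1, "age": 16, "wstones": 7, "wpounds": 8},
]


def age_group(age):
    """Recode age: 1 = lowest through 24, 2 = 25 through 54, 3 = 55 through highest."""
    if age <= 24:
        return 1
    if age <= 54:
        return 2
    return 3


for person in people:
    # Creating a grouped variable (Recode into Different Variables).
    person["agegroup"] = age_group(person["age"])
    # Calculation on variables (Compute Variable), rounded to two decimals.
    person["weightkg"] = round(6.35 * person["wstones"] + 0.45 * person["wpounds"], 2)

for person in people:
    print(person["age"], person["agegroup"], person["weightkg"])
```

Comparing the printed values with the new columns in Data View is a quick way to check that the recoding rule and the formula were entered correctly.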

WEEK 2 HOMEWORK

The homework exercise for this week is to complete the multiple choice quiz which you can find in the Moodle resource for MY451. Answers to the questions are also included there, including feedback on why the incorrect answers are incorrect. The first part of the quiz asks for answers to the class exercise, and the second part asks you to identify the level of measurement of some different variables.

WEEK 3 class: Descriptive statistics for continuous variables

Data set: The data file used today is london-borough-profiles.sav. It contains a selection of data on the 33 London boroughs obtained from the London Datastore, which publishes a range of statistical data about the city, collated by the Greater London Authority’s GLA Intelligence Unit.⁶⁰

Descriptive statistics in SPSS

This week you will produce and examine descriptive statistics for a number of individual variables. As for last week, almost all of the statistics required for this class can be obtained in SPSS under Analyze/Descriptive Statistics/Frequencies. Note that you will probably not find the tables of frequencies very useful, because continuous variables can take so many different values. So for this class, uncheck the Display frequency tables option in the dialog box.

Measures of central tendency: Mean, Median and Mode under Statistics / Central Tendency
Measures of variation: Range, Std. deviation and Variance under Statistics/Dispersion. For the Interquartile range, select Statistics/ Percentile values/Quartiles and calculate by hand the difference between the third and first quartiles given (as Percentiles 75 and 25 respectively) in the output.
Histograms: Charts/Chart Type/Histograms
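The measures listed above, including the interquartile range calculated by hand from the quartiles, can be sketched in Python as follows. The values are hypothetical and are not taken from the borough data. Conventions for calculating quartiles differ between software packages, so the quartiles from this sketch should not be assumed to match SPSS output exactly for every data set.

```python
import statistics

# Hypothetical values of a continuous variable (not from the borough data).
values = [12, 15, 17, 20, 22, 25, 31, 40]

# Measures of central tendency.
mean = statistics.mean(values)
median = statistics.median(values)

# Measures of variation.
value_range = max(values) - min(values)
std_dev = statistics.stdev(values)      # sample standard deviation
variance = statistics.variance(values)  # sample variance

# Quartiles (the 25th, 50th and 75th percentiles), then the
# interquartile range as the third quartile minus the first.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print(f"Mean {mean}, median {median}, range {value_range}")
print(f"Std. deviation {std_dev:.2f}, variance {variance:.2f}")
print(f"Quartiles {q1}, {q2}, {q3}; interquartile range {iqr}")
```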

Two charts needed today are not found under the Frequencies procedure:

Stem and leaf plots, which are obtained from Analyze/Descriptive Statistics/Explore by entering variable(s) under Dependent list and selecting Display/Plots and Plots/Descriptive/Stem-and-leaf. You can place more than one variable under the Dependent list in order to compare variables.
Box plots are also automatically generated through this dialog box, regardless of whether you want to see them! So this is the simplest way to produce them.

Most of these statistics and charts can be obtained in other ways as well, for example from Analyze/ Descriptive Statistics/Descriptives or Graphs/Legacy Dialogs/Histogram, or Graphs/Legacy Dialogs/Boxplot, but we will not use these alternatives today. Feel free to investigate them in your own time if you wish.

Classwork

The variable YOUTH_DEPRIVATION records for each borough the percentage of children who live in out-of-work families. This is an indicator of deprivation, with higher values indicating a worse situation for each borough. Investigate the distribution of this variable across London boroughs by obtaining its mean, median, minimum and maximum, quartiles and standard deviation, and a histogram. Obtain also a stem and leaf plot and a box plot. Note that double-clicking on a histogram (or any other SPSS graph) opens it in a new window, where the graph can be further edited by changing titles, colours etc. The graph can also be exported from SPSS into other software. Check that you understand how to find the measures of central tendency and dispersion from the output. Does the distribution of YOUTH_DEPRIVATION appear to be symmetric or skewed?
Consider now the variable CRIME, which records the number of reported crimes per 1000 inhabitants, over the years 2011-12. Obtain some summary descriptive statistics, a histogram and a box plot for this variable. Is the distribution of the variable symmetric or skewed to the left or right? CRIME is one of many variables in this data set which have outliers, i.e. boroughs with unusually large or small values of the variable. Normally statistical analysis focuses on the whole data rather than individual observations, but the identities of individual outliers are often also of interest. The outliers can be seen most easily in the box plots, where SPSS labels them with their case numbers, so that you can identify them easily in the data set. For example, 1 would indicate the 1st case in the data set. If you switch to Data View, you can see that this 1st case is the City of London. Which borough is the outlier for CRIME?
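The idea of identifying outliers by their case numbers can be sketched in Python. The rule used here, flagging values more than 1.5 times the interquartile range beyond the quartiles, is a common convention that we assume for illustration; the exact rule and quartile definition used by SPSS box plots may differ. The values are hypothetical and are not the CRIME data.

```python
import statistics

# Hypothetical values, listed in the order of the cases in the data set.
values = [60, 62, 65, 250, 70, 72, 75, 78, 80]

q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# ASSUMED rule: a value is an outlier if it lies more than
# 1.5 * IQR below the first quartile or above the third quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Case numbers start from 1, as in the SPSS Data View.
outliers = [
    (case_number, value)
    for case_number, value in enumerate(values, start=1)
    if value < lower_fence or value > upper_fence
]

print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Outliers (case number, value): {outliers}")
```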

HOMEWORK

For the questions below, select the relevant SPSS output to include in your homework and write brief answers to the specific questions. Remember SPSS produces some outputs that you do not need. Feel free to transcribe tables or modify charts if you wish to improve their presentation.

The variable VOTING records voter turnout in a borough, specifically the percentage of eligible voters who voted in the local elections in 2010. Obtain descriptive statistics, a histogram and a box plot for this variable. What is the range of the variable, and what is its inter-quartile range? Are there any outliers? Is the distribution of voter turnout symmetrical or skewed? How can you tell?
In the data set, employment rates are given overall, but also separately for males and females. The employment rate is the percentage of the working age population who are in employment. Compare and contrast male and female employment rates across the boroughs, using the variables MALE_EMPLOYMENT and FEMALE_EMPLOYMENT. Comment on the differences and/or similarities in their descriptive statistics: minimum and maximum, mean, median and standard deviation. Obtain histograms for these two variables. Are the distributions of male employment and female employment symmetrical or skewed?

WEEK 4 class: Two-way contingency tables

Data set: The data file used today is GSS2010.sav. It contains a selection of variables on attitudes and demographic characteristics for 2044 respondents in the 2010 U.S. General Social Survey (GSS).⁶¹ The full data set contains 790 variables. For convenience the version you are analysing today contains just a selection of those items.

Analysing two-way contingency tables in SPSS

All of the analyses needed for this week’s class are found under Analyze/Descriptive Statistics/Crosstabs. We will be obtaining contingency tables between two variables, as in Week 2 class, with the following commands:

The name of the row variable into the Row(s) box, and
The name of the column variable into the Column(s) box.
Cells/Percentages for percentages within the table: Row gives percentages within each row (i.e. frequencies divided by row totals), Column percentages within columns, and Total percentages out of the total sample size.

The only additional output we will need today is obtained by selecting

Statistics/Chi-square for the $\chi^{2}$ test of independence
(If you are interested in the $\gamma$ measure of association for ordinal variables, outlined in the coursepack, you may obtain it using Statistics/Ordinal/Gamma. In the output the $\gamma$ statistic is shown in the “Symmetric measures” table in the “Value” column for “Gamma”. We will not use this measure today, but feel free to ask if you are interested in it.)
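
For readers who want to see what Crosstabs computes, the row percentages and the Pearson $\chi^{2}$ test of independence can be sketched in Python as follows. This is a minimal sketch, not the SPSS implementation: the counts are made up, the function names are hypothetical, and the $P$-value is obtained from a standard recurrence for the $\chi^{2}$ distribution rather than from a statistical library.

```python
import math


def row_percentages(table):
    """Percentages within each row (frequencies divided by row totals)."""
    return [[100.0 * cell / sum(row) for cell in row] for row in table]


def chi2_survival(x, df):
    """P(chi-squared variable with df degrees of freedom >= x).

    Uses Q(a + 1, t) = Q(a, t) + t**a * exp(-t) / Gamma(a + 1),
    starting from Q(1/2, t) = erfc(sqrt(t)) or Q(1, t) = exp(-t).
    """
    t = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-t)
    else:
        a, q = 0.5, math.erfc(math.sqrt(t))
    while a < df / 2.0:
        q += t ** a * math.exp(-t) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi2_independence(table):
    """Pearson chi-squared statistic, degrees of freedom and P-value."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, chi2_survival(chi2, df)


# Made-up counts for a 2 x 2 table, NOT taken from the GSS data.
toy_table = [[10, 20], [20, 10]]
chi2, df, p_value = chi2_independence(toy_table)
print(row_percentages(toy_table))
print(chi2, df, p_value)
```

The statistic returned here corresponds to the “Pearson Chi-Square” row of the SPSS output, without any continuity correction.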

Classwork

Suppose we want to use the GSS data to investigate whether in the U.S. population sex and age are associated with attitudes towards women’s roles. The respondent’s sex is included in the data as the variable SEX, and age as AGEGROUP in three groups: 18-34, 35-54, and 55 or over. The three attitude variables we consider are

FEFAM: Level of agreement with the following statement: “It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family”. Available response options are Strongly agree, Agree, Disagree, and Strongly disagree.
FEPOL: Level of agreement with the following statement: “Most men are better suited emotionally for politics than are most women”. Available response options are: Agree and Disagree.
FEPRES: Response to the following statement: “If your party nominated a woman for President, would you vote for her if she were qualified for the job?” Available response options are Yes and No.

Consider first the association between sex and attitude towards male and female work roles, by constructing a contingency table between SEX and FEFAM. To make interpretation of the results easier, include also appropriate percentages. Here it makes most sense to treat sex as an explanatory variable for attitude, so we want to examine percentages of attitudes within categories of male and female. If you use SEX as the row variable, this means including the Row percentages in the output. Request also the $\chi^{2}$-test statistic. In SPSS output, results for the $\chi^{2}$ test are given below the two-way table itself in a table labelled “Chi-Square Tests”, in the row “Pearson Chi-Square”. The test statistic itself is given under “Value” and its $P$-value under “Asymp. Sig. (2-sided)”. By considering the $\chi^{2}$ test statistic and its $P$-value, do you think there is enough evidence to conclude that males and females differ in their views on male and female work roles? If there is, how would you describe the association?
Consider now the association between age and attitude towards male and female work roles, by constructing a table between AGEGROUP and FEFAM. Interpret the results, and compare them to your findings in Exercise 1.
Examine differences between men and women in their views about women’s suitability for politics, using a table between SEX and FEPOL. Interpret the results. (Note: ignore the last two columns of the $\chi^{2}$ test output, labelled “Exact Sig. (2-sided)” and “Exact Sig. (1-sided)”, and use the result under “Asymp. Sig. (2-sided)” as in the other tables.)

HOMEWORK

What is the null hypothesis for the $\chi^{2}$ test that you carried out in analysis 2 in the class, for the table of AGEGROUP by FEFAM?
State the $\chi^{2}$ test statistic, degrees of freedom and $P$-value for this table, and interpret these results.
Interpret the table of percentages to describe the nature of the association between AGEGROUP and FEFAM.
Consider now the association between age and attitude towards voting for a female President, by constructing a table between AGEGROUP and FEPRES. In the population, do people in different age groups differ in their willingness to vote for a female President? Interpret the results of the $\chi^{2}$ test and illustrate your answer with one or two percentages from the two-way table.

WEEK 5 class: Inference for two population means

Data set: The data file used today is ESS5_GBFR.sav. It contains data for a selection of variables from the 2010 European Social Survey for respondents in Great Britain and France.⁶² Only a few of the variables are used in the exercises; the rest are included in the data set as examples of the kinds of information obtained from this survey.

Two-sample inference for means in SPSS

$t$-tests and confidence intervals for two independent samples for inference on the difference of the population means: Analyze/Compare Means/Independent-Samples T Test. The variable of interest $Y$ is placed under Test Variable(s) and the explanatory variable $X$ under Grouping Variable. The values of $X$ identifying the two groups being compared are defined under Define Groups.
Box plots for descriptive purposes are obtained from Analyze/Descriptive Statistics/Explore. Here we want to draw side-by-side box plots for values of a response variable $Y$, one plot for each distinct value of an explanatory variable $X$. The name of $Y$ is placed under Dependent List and that of $X$ under Factor List. Box plots are obtained by selecting Plots/Boxplots/Factor levels together.
Tests and confidence intervals for single means (c.f. Section 7.4) are not considered today. These are obtained from Analyze/Compare Means/One-Sample T Test. They can also be used to carry out inference for comparisons of means between two dependent samples (c.f. Section 7.5).
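
The calculation behind the equal-variances version of the two-sample $t$-test can be sketched in Python as below. This is an illustration only: the two samples are made up, the function name is hypothetical, and the confidence interval uses the standard normal multiplier 1.96 as a large-sample stand-in for the critical value of the $t$ distribution that SPSS uses, so the interval will be slightly too narrow for small samples.

```python
import math
import statistics


def pooled_t_test(group1, group2, multiplier=1.96):
    """Two-sample t statistic assuming equal population variances.

    Returns (difference of means, standard error, t, degrees of freedom,
    approximate 95% confidence interval), with the difference taken as
    group 1 minus group 2.
    """
    n1, n2 = len(group1), len(group2)
    diff = statistics.mean(group1) - statistics.mean(group2)
    # Pooled estimate of the common variance.
    pooled_var = ((n1 - 1) * statistics.variance(group1)
                  + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = diff / se
    df = n1 + n2 - 2
    ci = (diff - multiplier * se, diff + multiplier * se)
    return diff, se, t, df, ci


# Made-up toy samples, NOT taken from the European Social Survey data.
sample_a = [1, 2, 3, 4, 5]
sample_b = [3, 4, 5, 6, 7]
diff, se, t, df, ci = pooled_t_test(sample_a, sample_b)
print(diff, se, t, df, ci)
```

In SPSS output the corresponding results are those on the row for equal variances assumed.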

Classwork

Consider the survey data in the file ESS5_GBFR.sav. We will examine two variables, and carry out statistical inference to compare their means among the survey populations of adults in Great Britain and France.⁶³

The variable WKHTOT shows the number of hours per week the respondent normally works in his or her main job. Obtain box plots and descriptive statistics for this variable separately for each country (identified by the variable CNTRY). Compare measures of central tendency and variation for WKHTOT between the two countries. What do you observe?
Obtain a $t$-test and confidence interval for the difference of weekly working hours between Britain and France (specify the values of the country variable as Define Groups/Group 1: GB and Group 2: FR as coded in the data). Details of SPSS output for this are explained in Chapter 7; you can use the results under the assumption of equal population variances. What do you conclude? Is there a statistically significant difference in the average values of WKHTOT between the two countries? What does the confidence interval suggest about the size of the difference?
The variable STFJBOT asks those in paid work, “How satisfied are you with the balance between the time you spend on your paid work and the time you spend on other aspects of your life?”. Respondents are asked to rate their level of satisfaction on a scale from 0-10, where 0 means “Extremely dissatisfied” and 10 means “Extremely satisfied”. Repeat exercises 1 and 2 for this variable, and compare also histograms of STFJBOT for each country. What do you observe?

HOMEWORK

Write up your answers to the second class exercise, answering these specific questions:
1. What are the observed sample means for WKHTOT for French and British respondents?
2. Is there a statistically significant difference in the average values of WKHTOT between the two countries? State the value of the test statistic and its corresponding $P$-value. You may assume equal population variances for this test.
3. Interpret the 95% confidence interval for the difference.
The variable WKHSCH asks respondents, “How many hours a week, if any, would you choose to work, bearing in mind that your earnings would go up or down according to how many hours you work?”. Is there a statistically significant difference between ideal (rather than actual) work hours for French and British respondents? Carry out a t-test and report and interpret the results.
The variable STFMJOB asks respondents, “How satisfied are you in your main job?”. Respondents are asked to rate their level of satisfaction on a scale from 0-10, where 0 means “Extremely dissatisfied” and 10 means “Extremely satisfied”. Is there a statistically significant difference, at the 5% level of significance, between mean levels of job satisfaction for French and British respondents? Answer this question by using the 95% confidence interval for the difference in means (you need the full t-test output to obtain the confidence interval, but you need not report the results of the t-test itself for this question).

WEEK 7 class: Inference for population proportions

Data sets: Files BES2010post_lastdebate.sav and BES2010pre_lastdebate.sav.

Inference on proportions in SPSS

SPSS menus do not provide procedures for calculating the tests and confidence intervals for proportions discussed in Chapter 5. This is not a serious limitation, as the calculations are quite simple.
It is probably easiest to use a pocket calculator for the calculations, and this is the approach we recommend for this class. The only part of the analysis it cannot do is calculating the precise $P$-value for the tests, but even this can be avoided by using critical values from a statistical table such as the one at the end of this Coursepack to determine approximate $P$-values (or by using an online $P$-value calculator — see “About Week 4 class” on the Moodle page for suggested links).
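
If you have access to Python, a precise two-sided $P$-value for a $z$-test statistic can also be obtained with the standard library, as in the minimal sketch below. The function name is hypothetical and this is offered only as an alternative to tables or online calculators.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability that a standard normal value is at most -|z| or at least |z|."""
    return 2 * NormalDist().cdf(-abs(z))


print(round(two_sided_p_value(1.96), 3))  # → 0.05
```

The example reproduces the familiar fact that the critical value 1.96 corresponds to a two-sided $P$-value of 0.05.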

Classwork

The survey data set BES2010post_lastdebate.sav contains part of the information collected by the British Election Study, an ongoing research programme designed to understand voter choices in the UK.⁶⁴

In the run-up to the UK General Election on 6 May 2010, opinion polls reported quite dramatic changes in popularity of the Liberal Democrat party. Key to their increasing popularity was the performance of their party leader, Nick Clegg, in a series of three televised debates between the leaders of the three main political parties (the other participants were Gordon Brown for Labour and David Cameron for the Conservative party). The debates were broadcast between 15 and 29 April 2010.

The data in BES2010post_lastdebate.sav contain information on respondents’ voting intentions, obtained after the debates had ended (i.e. between 30 April and 6 May).

VOTE_LIBDEM is a dichotomous variable indicating whether a respondent intended to vote for the Liberal Democrats (value 1) or some other party (0) in the 2010 General Election. The value of this variable is by definition missing for those who had not decided which way they would vote or who did not intend to vote at all, so they are automatically excluded from the analysis. The parameter of interest $\pi$ is now the population proportion of those who say they would vote Liberal Democrat. We will compare it to 0.23, the proportion of the vote the party actually received in 2010. The analysis is thus one-sample inference on a population proportion, and the relevant formulas are (11) for the test statistic and (15) for the confidence interval, which can be found in Sections 5.5.2 and 5.6.2 respectively.
- Begin by creating a frequency table of VOTE_LIBDEM. This should show that the sample estimate of $\pi$ is 0.260, out of $3226$ non-missing responses. Thus $n=3226$ and $\hat{\pi}=0.260$ in the notation of Chapter 5.
- For the one-sample significance test, the value of $\pi$ under the null hypothesis is $\pi_{0}=0.230$. Using the specific formula of the test statistic in Section 5.5.2, the value of the test statistic $z$ is thus given by the calculation \[z = \frac{0.260-0.230}{\sqrt{0.230\times (1-0.230)/3226}}\] Calculate this using a calculator. The result should be $z=4.049$.
- The (two-sided) $P$-value for this is the probability that a value from the standard normal distribution is at most $-4.049$ or at least 4.049. Evaluate this approximately by comparing the value of $z$ to critical values from the standard normal distribution (c.f. Table 5.2) as explained in Section 5.5.3. Here, for example, $z$ is larger than 1.96, so the two-sided $P$-value must be smaller than 0.05. Convince yourself that you understand this statement.
- Calculate a 95% confidence interval for the population proportion of prospective Liberal Democrat voters, using equation (15) at the end of Section 5.6.2.
What do you conclude about the proportions of prospective and actual Liberal Democrat voters? Why might the two differ from each other?
The variable TVDEBATE indicates whether the respondent reports having watched any of the three televised debates (1 for Yes, at least one watched, 0 otherwise - this includes “no” and “don’t know” responses). We will compare the proportion of people intending to vote Liberal Democrat amongst those who watched some or all of the debates with those who did not, using the two-sample methods of analysis discussed in Section 5.7. The test statistic for the hypothesis of equal population proportions is thus the two-sample $z$-test statistic for proportions (see the middle of Section 5.7), and a confidence interval for the difference of the proportions is given by (25) in Section 5.7.
- Begin by calculating the relevant sample proportions. The easiest way to do this is by creating a two-way contingency table between TVDEBATE and VOTE_LIBDEM as you did in the Week 2 and 4 classes. The results required for the analysis considered here are all shown in the resulting table. Convince yourself that these show that, in the notation of Section 5.7,
  - $n_{1}=930$ and $\hat{\pi}_{1}=0.218\; (=203/930)$,
  - $n_{2}=2296$ and $\hat{\pi}_{2}=0.277\; (=636/2296)$,
  where 1 denotes respondents who did not watch any of the debates and 2 those who watched at least some. The pooled estimated proportion $\hat{\pi}$ (formula 21 in Section 5.7) used in the test statistic (23) is here $\hat{\pi}=0.260$, shown on the “Total” row.
- Calculate the test statistic, its $P$-value and a 95% confidence interval for the difference in population proportions, using the relevant formulas. For example, the test statistic is here given by \[z= \frac{0.277-0.218}{\sqrt{0.260\times (1-0.260)\times (1/2296+1/930)}}.\]
What do you conclude? Is there evidence that those who watched at least some of the leaders’ debates were more likely to declare an intention to vote Liberal Democrat? If there is, how big is the difference in proportions of prospective Liberal Democrat voters between the debate-watchers and debate-non-watchers?
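
The calculator steps in the two exercises above can be sketched in Python as follows, which may be useful for checking your arithmetic. The function names are hypothetical. The confidence interval function assumes that the interval for the difference is the usual large-sample one, with the standard error calculated from the two separate sample proportions; check this against formula (25) before relying on it.

```python
import math
from statistics import NormalDist


def one_sample_z(p_hat, p0, n):
    """z statistic for H0: population proportion equals p0."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


def two_sample_z(x1, n1, x2, n2):
    """z statistic for H0: equal population proportions (group 2 minus group 1).

    x1 and x2 are the numbers of 'successes' in samples of sizes n1 and n2.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se0 = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se0


def difference_ci(x1, n1, x2, n2, level=0.95):
    """Large-sample confidence interval for the difference p2 - p1 (assumed form)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    q = NormalDist().inv_cdf(0.5 + level / 2)
    return (p2 - p1) - q * se, (p2 - p1) + q * se


def two_sided_p_value(z):
    """Two-sided P-value from the standard normal distribution."""
    return 2 * NormalDist().cdf(-abs(z))


# Exercise 1: numbers as given in the text.
z1 = one_sample_z(0.260, 0.230, 3226)
print(round(z1, 3))  # → 4.049

# Exercise 2: counts as given in the text (203 of 930 and 636 of 2296).
z2 = two_sample_z(203, 930, 636, 2296)
print(z2, two_sided_p_value(z2), difference_ci(203, 930, 636, 2296))
```

Because the code works from the counts rather than from proportions rounded to three decimals, its results for the second exercise may differ slightly in the last digits from those of a hand calculation.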

HOMEWORK

Write up your answers to the second class exercise. In particular, answer the following specific questions:

What proportion of respondents say that they did watch at least some of the leaders’ debates? And what proportion did not? Of those who watched at least some of the leaders’ debates, what proportion said they intended to vote Liberal Democrat? And what proportion of those who did not watch any of the leaders’ debates said they intended to vote Liberal Democrat?
Calculate the test statistic and find its corresponding approximate $P$-value for the difference in population proportions of prospective Liberal Democrat voters among those who did and did not watch the leaders’ debates. Show your working. State the conclusion from the test.
Calculate a 95% confidence interval around this difference. State its lower and upper limits.
Write a brief substantive interpretation of your results.

The data set BES2010pre_lastdebate.sav contains responses to the same question - whether respondents intended to vote Liberal Democrat or not - but asked before the last of the party leaders’ debates. Repeat the analysis you carried out for the first class exercise, but using this data set. In other words carry out a one-sample analysis, of the kind done in exercise 1 above, to compare the proportion of respondents who said they intended to vote Liberal Democrat with the proportion who actually did. Answer the following questions:

State the null hypothesis for the test.
Calculate the test statistic and find its corresponding approximate $P$-value. Show your workings.
Give a brief interpretation of the results. Do they differ from the other data set? Can you think of any reasons for this? (This last question invites some speculation - do not worry if you don’t have any ideas! But see the sample answer if you are interested in our speculation.)

WEEK 8 class: Correlation and simple linear regression 1

Data set: File decathlon2012.sav.

Scatterplots, correlation and simple linear regression in SPSS

A scatterplot is obtained from Graphs/Legacy Dialogs/“Scatter/Dot”/Simple Scatter/Define. The variables for the $X$-axis and $Y$-axis are placed in the X Axis and Y Axis boxes respectively. Double-clicking on the plot in the Output window opens it in a Chart Editor, where various additions to the graph can be made. A fitted straight line is added from Elements/Fit Line at Total. A least squares fitted line is the default under this option, so it is drawn immediately and you can just click Close. Closing the Chart Editor commits the changes to the Output window.
A correlation matrix is obtained from Analyze/Correlate/Bivariate, when Correlation Coefficients/Pearson is selected (which is the default, so you should not need to change it). The variables included in the correlation matrix are placed into the Variables box. The output also includes a test for the hypothesis that the population correlation is 0, but we will ignore it.
Linear regression models are obtained from Analyze/Regression/Linear. The response variable is placed under Dependent and the explanatory variable under Independent(s). The dialog box has many options for various additional choices. Today you can leave all of them at their default values, except that you should select Statistics/Regression Coefficients/Confidence intervals to include also 95% confidence intervals for the regression coefficients in the output.
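
To see what these procedures calculate, the sample correlation and the least squares line for one explanatory variable can be sketched in Python as below. This is only an illustration: the data are made up, the function names are hypothetical, and the sketch does not reproduce the standard errors, tests and confidence intervals that SPSS adds to its output.

```python
import math


def simple_linear_regression(x, y):
    """Return (intercept, slope, correlation) for the least squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    correlation = sxy / math.sqrt(sxx * syy)
    return intercept, slope, correlation


def predict(intercept, slope, x_value):
    """Fitted value of the response at a given value of the explanatory variable."""
    return intercept + slope * x_value


# Made-up toy data, NOT taken from the decathlon data set.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
intercept, slope, r = simple_linear_regression(x, y)
print(intercept, slope, r, r ** 2)
print(predict(intercept, slope, 6))
```

The last line shows how a fitted value is obtained: the estimated intercept plus the estimated slope times the chosen value of the explanatory variable. With one explanatory variable, the squared correlation is the same as the R-squared statistic of the fitted model.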

Classwork

Decathlon is a sport where the participants complete ten different athletics events over two days. Their results in each are then translated into points, and the winner is the competitor with the highest points total for the ten events. The file decathlon2012.sav contains the results of the decathlon competition at the 2012 Olympics in London for the 26 athletes who finished the competition.⁶⁵ The results for each event are given both in their original units (variables with names beginning with “mark_”) and in decathlon points (names beginning with “points_”). The ten events are identified by the variable labels in Variable View. The variable points_total gives the final points total for each competitor.

Create a scatterplot between the result ($X$-axis) and points ($Y$-axis) for one event, the 100-metre sprint (variables MARK_100M and POINTS_100M), and add a fitted line. This simply provides information on the calculation used to transform the result into points. Clearly a linear calculation is used for this, at least over the range of results in these data. Notice the downward slope of the line: the faster the result, the higher the number of points. From now on, for simplicity we will consider only the points variables for each event.
Obtain the correlation matrix for all pairs of variables among the ten individual points scores and the total score. Consider first correlations between the individual events only. Which correlations tend to be high (say over 0.5), which ones close to zero and which ones even negative? Can you think of any reasons for this? Draw scatterplots and fitted lines for a few pairs of variables with different sizes of correlations (here the variables are treated symmetrically, so it does not matter which one is placed on the $X$-axis). Can these associations be reasonably described as linear?
Consider now the correlations between the ten event scores and the final score POINTS_TOTAL. Which of them is highest, and which one lowest? Examine the scatterplot and fitted line between points for 100 metres (POINTS_100M) and the total score (POINTS_TOTAL). Fit a linear regression model for these variables, with POINTS_100M as the explanatory variable. Interpret the results. Does there appear to be an association between the points for 100 metres and the total score? What is the nature of the association?

Suppose you were told that a competitor received 800 points (a time of about 11.3 seconds) for 100 metres, the first event of the decathlon. Based on the fitted model, what final points score would you predict for him? You can calculate this fitted value with a pocket calculator. What would be the predicted value if the 100-metre score was 950 points (about 10.6 s) instead?

HOMEWORK

Briefly discuss the correlation matrix produced in the class. Pick out a few examples for illustration - which correlations are highest, and which ones lowest, and which ones negative? You may comment on correlations between individual events, as well as on correlations between the final score and individual events.
Obtain the scatterplot and linear regression model for the total score given points for the long jump, one of the field events (POINTS_LONGJUMP). Is the score for long jump strongly or weakly associated with the final score? Interpret the slope coefficient. Suppose you were told that a competitor received 900 points (a jump of about 7.4 metres) for the long jump. Based on the fitted model, what final points score would you predict for him?
Obtain the scatterplot and linear regression model for the total score given points for throwing the discus, another of the field events (POINTS_DISCUS). Interpret the slope coefficient. Is the score for discus strongly or weakly associated with the final score?

WEEK 9 class: Simple linear regression and 3-way tables

Data set: File GSS2010.sav. This contains a selection of variables on attitudes and demographic characteristics for 2044 respondents in the 2010 U.S. General Social Survey (GSS).⁶⁶ Only a few of the variables are used in the exercises.

Classwork - linear regression

Here we will focus on the variables EDUC, PAEDUC, MAEDUC and SPEDUC. These show the number of years of education completed by, respectively, the survey respondent him/herself, and the respondent’s father, mother and spouse.

Obtain basic descriptive statistics for the variables. Here they can be compared directly, because the meaning of the variable is similar in each case. We can even draw side-by-side box plots for the variables (rather than for values of a single variable at different levels of another, as before). These can be obtained from Analyze/Descriptive Statistics/Explore by placing all the variables under Dependent List and selecting Plots/Boxplots/Dependents together. You should then also select Options/Missing Values/Exclude cases pairwise to include all non-missing values for each variable (here SPEDUC has for obvious reasons more missing values than the others).
Obtain the correlation matrix of the four variables. Which correlations are highest, and which ones lowest?
Draw a scatterplot with fitted line for EDUC given PAEDUC. Fit a linear regression model between these variables, regressing EDUC (response variable) on PAEDUC (explanatory variable). Interpret the results. Is there a statistically significant linear association between a person’s years of schooling and those of his/her father? Interpret the estimated regression coefficient, $t$-statistic and $P$-value, and 95 per cent confidence interval.
Based on the fitted model, what is the predicted number of years of education for a respondent whose father completed 12 years of education?

HOMEWORK: Simple linear regression and three-way tables

The homework exercise uses the same data set for two different types of analysis.

Linear regression

Draw a scatterplot with fitted line for EDUC given MAEDUC. Fit a linear regression model between these variables, regressing EDUC (response variable) on MAEDUC (explanatory variable).

Interpret the results: Is there a statistically significant linear association between a person’s years of schooling and those of his/her mother? Interpret the estimated regression coefficient, $t$-statistic and $P$-value, and 95 per cent confidence interval.
Based on the fitted model, what is the predicted number of years of education for a respondent whose mother completed 10 years of education?
Interpret the R-squared statistic for the model.

Analysing multiway contingency tables in SPSS

Three-way contingency tables are again obtained from Analyze/Descriptive Statistics/Crosstabs. The only change from Week 4 class is that the conditioning variable is now placed in the Layer 1 of 1 box. This produces a series of partial two-way tables between the row and column variables specified in the Row(s) and Column(s) boxes, one for each category of the Layer variable. Percentages and $\chi^{2}$ test are similarly calculated separately for each partial table. For this example we elaborate on the first two exercises from Week 4 class. To remind you, the categorical variables we are analysing are these:

The respondent’s sex, recorded as the variable SEX.
The respondent’s age, recorded as the variable AGEGROUP in three groups: 18-34, 35-54 and 55 or over.
FEFAM: Level of agreement with the following statement: “It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family”, with response options Strongly agree, Agree, Disagree, and Strongly disagree.

First remind yourself of the associations between SEX and FEFAM and between AGEGROUP and FEFAM. Obtain the two-way contingency table between FEFAM and SEX, including appropriate percentages and $\chi^{2}$ test of independence. Repeat the procedure for FEFAM by AGEGROUP. What do you learn about the associations between attitude and sex, and between attitude and age?
Sociologists would suggest that the relationship between sex and attitude towards male and female work roles might be different for different age groups. In other words, age might modify the association between sex and attitude. Investigate this possible interaction between the three variables. Create a three-way table where FEFAM is the column variable, SEX the row variable and AGEGROUP the layer (conditioning) variable. Study the SPSS output, and make sure you understand how this shows three partial tables of FEFAM vs. SEX, one for each possible value of AGEGROUP. Examine and interpret the associations in the three partial tables. State the results of the $\chi^{2}$ test for each partial table, and illustrate your interpretations with some appropriate percentages. Finally, summarise your findings: are there differences in the nature, strength or significance of the association between sex and attitude, depending on the age group? Comment on how this interpretation differs from the initial two-way table of FEFAM and SEX.
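
The idea of partial tables can be illustrated with a short Python sketch, which counts the combinations of the row and column variables separately within each category of the layer variable and then converts each partial table to row percentages. The records are made up and the function names are hypothetical; the sketch only mirrors the layout of the Crosstabs output, not its tests.

```python
from collections import Counter, defaultdict


def partial_tables(records):
    """Count (row, column) combinations separately within each layer category."""
    tables = defaultdict(Counter)
    for layer, row, column in records:
        tables[layer][(row, column)] += 1
    return tables


def row_percentages(table):
    """Row percentages for one partial table stored as {(row, column): count}."""
    row_totals = Counter()
    for (row, _), count in table.items():
        row_totals[row] += count
    return {cell: 100.0 * count / row_totals[cell[0]]
            for cell, count in table.items()}


# Made-up records of (layer, row, column), NOT taken from the GSS data.
toy_records = [
    ("18-34", "Male", "Agree"), ("18-34", "Male", "Disagree"),
    ("18-34", "Female", "Disagree"), ("18-34", "Female", "Disagree"),
    ("55+", "Male", "Agree"), ("55+", "Male", "Agree"),
    ("55+", "Female", "Agree"), ("55+", "Female", "Disagree"),
]
tables = partial_tables(toy_records)
for layer, table in tables.items():
    print(layer, row_percentages(table))
```

Each printed line corresponds to one partial table, so the association between the row and column variables can be compared across the layers.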

WEEK 10 class: Multiple linear regression

Data set: File humandevelopment2011.sav.

Multiple linear regression in SPSS

Multiple linear regression is obtained from Analyze/Regression/Linear, by placing all of the required explanatory variables in the Independent(s) box. No other changes from last week are required.
To include categorical explanatory variables, the necessary dummy variables have to be created first. The ones for today’s class are already included in the data set. If you need to create dummy variables for your own analyses in the future, it is usually easiest to do so from Transform/Compute Variable. Some of the buttons on the keypad shown in that dialog box are logical operators for defining conditions for which the outcome is either 1 (True) or 0 (False), as required by a dummy variable. For example, the categorical variable INCOME_GROUP in today’s data set has the value 3 if the country is in the high income group. The dummy variable HIGH_INCOME was created from this by entering Target Variable: HIGH_INCOME and Numeric Expression: INCOME_GROUP=3. This means that the new variable HIGH_INCOME will have the value 1 for countries for which INCOME_GROUP is equal to 3, and will be 0 otherwise. Other logical operators may also be used: for example, urban_pop$<$50 would produce 1 if the variable URBAN_POP was less than 50 and 0 otherwise.
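
The logic of such a logical expression can be sketched in Python as follows. The values are made up, the function name is hypothetical, and the treatment of missing values (a missing input gives a missing result) is an assumption about how you would normally want a dummy variable to behave rather than a description of SPSS internals.

```python
def make_dummy(values, condition):
    """Return 1 where condition(value) is true, 0 where false, None where missing."""
    return [None if v is None else int(condition(v)) for v in values]


# Made-up values, NOT taken from humandevelopment2011.sav.
income_group = [1, 3, 2, 3, None]
urban_pop = [35.0, 82.5, 50.0, 49.9, 61.2]

high_income = make_dummy(income_group, lambda g: g == 3)
mostly_rural = make_dummy(urban_pop, lambda u: u < 50)
print(high_income)
print(mostly_rural)
```

Note that a country with exactly 50 per cent urban population receives the value 0 in the second example, because the condition is strictly less than 50.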

Classwork

The file humandevelopment2011.sav contains data on a number of indicators of what might broadly be called development, for 194 countries in 2011. These were collated from two international data agency sources.⁶⁷ The response variable considered today is SCHOOL_YEARS, which records for each country the mean number of years of schooling taken by the adult population. We treat it here as a general indicator of the educational situation in a country, which is an important aspect of development. We will consider the following explanatory variables for it:

URBAN_POP: the degree of urbanisation of the country, specifically the percentage of the country’s population living in urban areas.
GOVERNANCE, a continuous variable constructed from expert opinion surveys to reflect the perceived effectiveness of government in delivering services.
INFANT_MORTALITY, number of infants dying before 1 year old, per 1,000 live births — a “proxy” indicator representing the health of the population
INCOME_GROUP, classified as low, middle or high income economies. This is also provided in the form of three dummy variables: LOW_INCOME, MIDDLE_INCOME and HIGH_INCOME.

Obtain some descriptive statistics for the continuous variables, to gain an impression of their ranges. A quick way of doing this is via Analyze/Descriptive Statistics/Frequencies, unchecking the “Display frequency tables” and requesting minimum and maximum values.
Investigate the idea that increased urbanisation is linked to greater availability of schooling for people. Obtain a scatterplot and a simple linear regression model for SCHOOL_YEARS given URBAN_POP. What do you observe in the scatterplot? Interpret the regression output.
Now consider the possibility that schooling may also be explained by the effectiveness of governments in providing public services (such as education). Fit a multiple linear regression model for SCHOOL_YEARS given both URBAN_POP and GOVERNANCE. Compare the estimated coefficient of URBAN_POP for this model with the coefficient of the same variable in the model in Question 2. What do you conclude? Does the association between schooling and urbanisation change when we control for government effectiveness? If so, in what way? Interpret the estimated coefficient of GOVERNANCE in the fitted model, the results of its $t$-test and its 95% confidence interval.
Next consider the possible explanatory value of the income level of a country for understanding variation in schooling years. Include income by entering two of the three dummy variables for income group. For the most convenient interpretation, we suggest that you leave “low income” as the reference group, and enter the dummies for MIDDLE_INCOME and HIGH_INCOME in the model. Interpret the values of the estimated regression coefficients for the two income dummy variables. In addition, for each one state the null hypothesis for its $t$-test, and interpret the result of the test and 95% confidence intervals.
Using this model, what level of schooling would you predict for a country with 70% urban population, a score of 1.5 on governance, and a high income economy?
Using this model, what level of schooling would you predict for a country with 30% urban population, a score of -0.2 on governance, and a low income economy?
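The arithmetic behind the last two questions can be sketched in a few lines of Python. This is only an illustration of how a fitted value is built up from the intercept, the two continuous variables and the two income dummies. The coefficient values and the function name predict_school_years are made-up placeholders, not estimates from the class data set; substitute the values from your own SPSS Coefficients table.

```python
# Hypothetical placeholder coefficients, NOT estimates from the class data.
# Replace them with the values from your own SPSS Coefficients table.
coefficients = {
    "Constant": 2.0,
    "URBAN_POP": 0.05,
    "GOVERNANCE": 1.0,
    "MIDDLE_INCOME": 1.5,
    "HIGH_INCOME": 3.0,
}


def predict_school_years(coefs, urban_pop, governance, income_group):
    """Fitted value from the model with low income as the reference group."""
    if income_group not in ("low", "middle", "high"):
        raise ValueError("income_group must be 'low', 'middle' or 'high'")
    # Dummy variables: both are 0 for the reference (low income) group.
    middle_income = 1 if income_group == "middle" else 0
    high_income = 1 if income_group == "high" else 0
    return (
        coefs["Constant"]
        + coefs["URBAN_POP"] * urban_pop
        + coefs["GOVERNANCE"] * governance
        + coefs["MIDDLE_INCOME"] * middle_income
        + coefs["HIGH_INCOME"] * high_income
    )


# The two scenarios from the questions above, using the placeholder values.
print(predict_school_years(coefficients, 70, 1.5, "high"))
print(predict_school_years(coefficients, 30, -0.2, "low"))
```

Note that for a low income country both dummy variables are zero, so the income coefficients drop out of the prediction entirely.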

HOMEWORK

Write up your answers to the last three questions in the class exercise.
Finally, consider one more possible explanatory variable: INFANT_MORTALITY. Add this variable to the multiple linear regression model fitted above. Is it statistically significant, at the 1% level of significance? Interpret the value of its estimated coefficient, and its 95% confidence interval. Take care to make sense of the sign (positive or negative) of the coefficient.
Has the inclusion of INFANT_MORTALITY modified the interpretation of any of the other explanatory variables in the model? Are they all statistically significant, at the 5% level of significance? Briefly outline the similarities and differences between the results for this final model and the model fitted in the class exercise.

WEEK 11 class: Review and Multiple linear regression

Data set: File ESS5GB_trust.sav.

This class is for you to revisit any topics of your choosing. Make the most of the opportunity to ask your class teachers any questions you have about any of the course material, and to practise any of the analyses you have learned during the course.

As an optional exercise, the data file ESS5GB_trust.sav is provided. This contains a selection of variables from the survey of British respondents that forms the 2010 wave of the European Social Survey.⁶⁸

We suggest that you use the data to practise multiple linear regression modelling on one or more of the variables capturing people’s levels of trust in institutions. For these questions, respondents were asked the following: “Using this card, please tell me on a score of 0-10 how much you personally trust each of the institutions I read out. 0 means you do not trust an institution at all, and 10 means you have complete trust.” The institutions (and their variable names) are:

trstprl: Trust in country’s parliament
trstlgl: Trust in the legal system
trstplc: Trust in the police
trstplt: Trust in politicians
trstprt: Trust in political parties
trstep: Trust in the European Parliament
trstun: Trust in the United Nations

After you choose a response variable that interests you, you will need to select some potential explanatory variables to test. The data set contains a number of variables. Some are socio-demographic, such as age and gender. Some are attitudinal or behavioural, such as amount of time spent reading newspapers. You will need to make a judgement about the levels of measurement of the variables, and how to enter them into the model. Use the “Values” column in the SPSS Variable View to check how each variable is coded. Note: we suggest that it is not too much of a compromise to treat the variables on television, radio and newspaper consumption as continuous, interval level variables. Note also: we have provided dummy variables for the categorical variables in the data set.

HOMEWORK

As this is the last week of the course, there is no homework. You can find further information on this and the other class exercises and homeworks in the model answers, which will be posted on the Moodle site.

Statistical tables

Explanation of the “Table of standard normal tail probabilities” below:

The table shows, for values of $Z$ between 0 and 3.5, the probability that a value from the standard normal distribution is larger than $Z$ (i.e. the “right-hand” tail probabilities).
- For example, the probability of values larger than 0.50 is 0.3085.
For negative values of $Z$, the probability of values smaller than $Z$ (the “left-hand” tail probability) is equal to the right-hand tail probability for the corresponding positive value of $Z$.
- For example, the probability of values smaller than $-0.50$ is also 0.3085.
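If you want to check a table entry, or need a value of $Z$ that is not tabulated, the tail probabilities can be computed directly. The following is a minimal sketch using only the Python standard library; the function names upper_tail and lower_tail are our own, chosen for illustration.

```python
import math


def upper_tail(z):
    """P(Z > z) for a standard normal variable Z (right-hand tail)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def lower_tail(z):
    """P(Z < z), which equals P(Z > -z) by symmetry (left-hand tail)."""
    return upper_tail(-z)


print(round(upper_tail(0.50), 4))   # 0.3085, as in the table
print(round(lower_tail(-0.50), 4))  # 0.3085, by symmetry
```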

Table of standard normal tail probabilities

$z$	Prob.	$z$	Prob.	$z$	Prob.	$z$	Prob.	$z$	Prob.	$z$	Prob.
0.00	0.5000	0.50	0.3085	1.00	0.1587	1.50	0.0668	2.00	0.0228	2.50	0.0062
0.01	0.4960	0.51	0.3050	1.01	0.1562	1.51	0.0655	2.01	0.0222	2.52	0.0059
0.02	0.4920	0.52	0.3015	1.02	0.1539	1.52	0.0643	2.02	0.0217	2.54	0.0055
0.03	0.4880	0.53	0.2981	1.03	0.1515	1.53	0.0630	2.03	0.0212	2.56	0.0052
0.04	0.4840	0.54	0.2946	1.04	0.1492	1.54	0.0618	2.04	0.0207	2.58	0.0049
0.05	0.4801	0.55	0.2912	1.05	0.1469	1.55	0.0606	2.05	0.0202	2.60	0.0047
0.06	0.4761	0.56	0.2877	1.06	0.1446	1.56	0.0594	2.06	0.0197	2.62	0.0044
0.07	0.4721	0.57	0.2843	1.07	0.1423	1.57	0.0582	2.07	0.0192	2.64	0.0041
0.08	0.4681	0.58	0.2810	1.08	0.1401	1.58	0.0571	2.08	0.0188	2.66	0.0039
0.09	0.4641	0.59	0.2776	1.09	0.1379	1.59	0.0559	2.09	0.0183	2.68	0.0037
0.10	0.4602	0.60	0.2743	1.10	0.1357	1.60	0.0548	2.10	0.0179	2.70	0.0035
0.11	0.4562	0.61	0.2709	1.11	0.1335	1.61	0.0537	2.11	0.0174	2.72	0.0033
0.12	0.4522	0.62	0.2676	1.12	0.1314	1.62	0.0526	2.12	0.0170	2.74	0.0031
0.13	0.4483	0.63	0.2643	1.13	0.1292	1.63	0.0516	2.13	0.0166	2.76	0.0029
0.14	0.4443	0.64	0.2611	1.14	0.1271	1.64	0.0505	2.14	0.0162	2.78	0.0027
0.15	0.4404	0.65	0.2578	1.15	0.1251	1.65	0.0495	2.15	0.0158	2.80	0.0026
0.16	0.4364	0.66	0.2546	1.16	0.1230	1.66	0.0485	2.16	0.0154	2.82	0.0024
0.17	0.4325	0.67	0.2514	1.17	0.1210	1.67	0.0475	2.17	0.0150	2.84	0.0023
0.18	0.4286	0.68	0.2483	1.18	0.1190	1.68	0.0465	2.18	0.0146	2.86	0.0021
0.19	0.4247	0.69	0.2451	1.19	0.1170	1.69	0.0455	2.19	0.0143	2.88	0.0020
0.20	0.4207	0.70	0.2420	1.20	0.1151	1.70	0.0446	2.20	0.0139	2.90	0.0019
0.21	0.4168	0.71	0.2389	1.21	0.1131	1.71	0.0436	2.21	0.0136	2.92	0.0018
0.22	0.4129	0.72	0.2358	1.22	0.1112	1.72	0.0427	2.22	0.0132	2.94	0.0016
0.23	0.4090	0.73	0.2327	1.23	0.1093	1.73	0.0418	2.23	0.0129	2.96	0.0015
0.24	0.4052	0.74	0.2296	1.24	0.1075	1.74	0.0409	2.24	0.0125	2.98	0.0014
0.25	0.4013	0.75	0.2266	1.25	0.1056	1.75	0.0401	2.25	0.0122	3.00	0.0013
0.26	0.3974	0.76	0.2236	1.26	0.1038	1.76	0.0392	2.26	0.0119	3.02	0.0013
0.27	0.3936	0.77	0.2206	1.27	0.1020	1.77	0.0384	2.27	0.0116	3.04	0.0012
0.28	0.3897	0.78	0.2177	1.28	0.1003	1.78	0.0375	2.28	0.0113	3.06	0.0011
0.29	0.3859	0.79	0.2148	1.29	0.0985	1.79	0.0367	2.29	0.0110	3.08	0.0010
0.30	0.3821	0.80	0.2119	1.30	0.0968	1.80	0.0359	2.30	0.0107	3.10	0.0010
0.31	0.3783	0.81	0.2090	1.31	0.0951	1.81	0.0351	2.31	0.0104	3.12	0.0009
0.32	0.3745	0.82	0.2061	1.32	0.0934	1.82	0.0344	2.32	0.0102	3.14	0.0008
0.33	0.3707	0.83	0.2033	1.33	0.0918	1.83	0.0336	2.33	0.0099	3.16	0.0008
0.34	0.3669	0.84	0.2005	1.34	0.0901	1.84	0.0329	2.34	0.0096	3.18	0.0007
0.35	0.3632	0.85	0.1977	1.35	0.0885	1.85	0.0322	2.35	0.0094	3.20	0.0007
0.36	0.3594	0.86	0.1949	1.36	0.0869	1.86	0.0314	2.36	0.0091	3.22	0.0006
0.37	0.3557	0.87	0.1922	1.37	0.0853	1.87	0.0307	2.37	0.0089	3.24	0.0006
0.38	0.3520	0.88	0.1894	1.38	0.0838	1.88	0.0301	2.38	0.0087	3.26	0.0006
0.39	0.3483	0.89	0.1867	1.39	0.0823	1.89	0.0294	2.39	0.0084	3.28	0.0005
0.40	0.3446	0.90	0.1841	1.40	0.0808	1.90	0.0287	2.40	0.0082	3.30	0.0005
0.41	0.3409	0.91	0.1814	1.41	0.0793	1.91	0.0281	2.41	0.0080	3.32	0.0005
0.42	0.3372	0.92	0.1788	1.42	0.0778	1.92	0.0274	2.42	0.0078	3.34	0.0004
0.43	0.3336	0.93	0.1762	1.43	0.0764	1.93	0.0268	2.43	0.0075	3.36	0.0004
0.44	0.3300	0.94	0.1736	1.44	0.0749	1.94	0.0262	2.44	0.0073	3.38	0.0004
0.45	0.3264	0.95	0.1711	1.45	0.0735	1.95	0.0256	2.45	0.0071	3.40	0.0003
0.46	0.3228	0.96	0.1685	1.46	0.0721	1.96	0.0250	2.46	0.0069	3.42	0.0003
0.47	0.3192	0.97	0.1660	1.47	0.0708	1.97	0.0244	2.47	0.0068	3.44	0.0003
0.48	0.3156	0.98	0.1635	1.48	0.0694	1.98	0.0239	2.48	0.0066	3.46	0.0003
0.49	0.3121	0.99	0.1611	1.49	0.0681	1.99	0.0233	2.49	0.0064	3.48	0.0003

Table of critical values for t-distributions

df	0.100	0.050	0.025	0.010	0.005	0.001	0.0005
1	3.078	6.314	12.706	31.821	63.657	318.309	636.619
2	1.886	2.920	4.303	6.965	9.925	22.327	31.599
3	1.638	2.353	3.182	4.541	5.841	10.215	12.924
4	1.533	2.132	2.776	3.747	4.604	7.173	8.610
5	1.476	2.015	2.571	3.365	4.032	5.893	6.869
6	1.440	1.943	2.447	3.143	3.707	5.208	5.959
7	1.415	1.895	2.365	2.998	3.499	4.785	5.408
8	1.397	1.860	2.306	2.896	3.355	4.501	5.041
9	1.383	1.833	2.262	2.821	3.250	4.297	4.781
10	1.372	1.812	2.228	2.764	3.169	4.144	4.587
11	1.363	1.796	2.201	2.718	3.106	4.025	4.437
12	1.356	1.782	2.179	2.681	3.055	3.930	4.318
13	1.350	1.771	2.160	2.650	3.012	3.852	4.221
14	1.345	1.761	2.145	2.624	2.977	3.787	4.140
15	1.341	1.753	2.131	2.602	2.947	3.733	4.073
16	1.337	1.746	2.120	2.583	2.921	3.686	4.015
17	1.333	1.740	2.110	2.567	2.898	3.646	3.965
18	1.330	1.734	2.101	2.552	2.878	3.610	3.922
19	1.328	1.729	2.093	2.539	2.861	3.579	3.883
20	1.325	1.725	2.086	2.528	2.845	3.552	3.850
21	1.323	1.721	2.080	2.518	2.831	3.527	3.819
22	1.321	1.717	2.074	2.508	2.819	3.505	3.792
23	1.319	1.714	2.069	2.500	2.807	3.485	3.768
24	1.318	1.711	2.064	2.492	2.797	3.467	3.745
25	1.316	1.708	2.060	2.485	2.787	3.450	3.725
26	1.315	1.706	2.056	2.479	2.779	3.435	3.707
27	1.314	1.703	2.052	2.473	2.771	3.421	3.690
28	1.313	1.701	2.048	2.467	2.763	3.408	3.674
29	1.311	1.699	2.045	2.462	2.756	3.396	3.659
30	1.310	1.697	2.042	2.457	2.750	3.385	3.646
40	1.303	1.684	2.021	2.423	2.704	3.307	3.551
60	1.296	1.671	2.000	2.390	2.660	3.232	3.460
120	1.289	1.658	1.980	2.358	2.617	3.160	3.373
$\infty$	1.282	1.645	1.960	2.326	2.576	3.090	3.291

Explanation: For example, the value 3.078 in the top left corner indicates that for a $t$-distribution with 1 degree of freedom the probability of values greater than 3.078 is 0.100. The last row shows critical values for the standard normal distribution.
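Two rows of this table can be reproduced with the Python standard library alone, which may help to clarify what the entries mean. The sketch below relies on two known facts: the last row contains standard normal critical values, and a $t$-distribution with 1 degree of freedom is the same as the Cauchy distribution, whose quantiles have a simple closed form. The variable names are our own. Other rows would need a statistics package.

```python
import math
from statistics import NormalDist

# Right-hand tail probabilities in the column headings of the table.
tail_probs = [0.100, 0.050, 0.025, 0.010, 0.005, 0.001, 0.0005]

# Last row (df = infinity): standard normal critical values.
normal_row = [round(NormalDist().inv_cdf(1 - p), 3) for p in tail_probs]

# First row (df = 1): the t-distribution is the Cauchy distribution,
# so the critical value is tan(pi * (0.5 - p)).
df1_row = [round(math.tan(math.pi * (0.5 - p)), 3) for p in tail_probs]

print(normal_row)
print(df1_row)
```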

Table of critical values for chi-square distributions

df	0.100	0.050	0.010	0.001
1	2.71	3.84	6.63	10.828
2	4.61	5.99	9.21	13.816
3	6.25	7.81	11.34	16.266
4	7.78	9.49	13.28	18.467
5	9.24	11.07	15.09	20.515
6	10.64	12.59	16.81	22.458
7	12.02	14.07	18.48	24.322
8	13.36	15.51	20.09	26.124
9	14.68	16.92	21.67	27.877
10	15.99	18.31	23.21	29.588
11	17.28	19.68	24.72	31.264
12	18.55	21.03	26.22	32.909
13	19.81	22.36	27.69	34.528
14	21.06	23.68	29.14	36.123
15	22.31	25.00	30.58	37.697
16	23.54	26.30	32.00	39.252
17	24.77	27.59	33.41	40.790
18	25.99	28.87	34.81	42.312
19	27.20	30.14	36.19	43.820
20	28.41	31.41	37.57	45.315
25	34.38	37.65	44.31	52.620
30	40.26	43.77	50.89	59.703
40	51.81	55.76	63.69	73.402
50	63.17	67.50	76.15	86.661
60	74.40	79.08	88.38	99.607
70	85.53	90.53	100.43	112.317
80	96.58	101.88	112.33	124.839
90	107.57	113.15	124.12	137.208
100	118.50	124.34	135.81	149.449

Explanation: For example, the value 2.71 in the top left corner indicates that for a $\chi^{2}$ distribution with 1 degree of freedom the probability of values greater than 2.71 is 0.100.
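The first two rows of this table can also be checked with the Python standard library, as a sketch of where the numbers come from. It uses two known special cases: a $\chi^{2}$ variable with 1 degree of freedom is the square of a standard normal variable, and with 2 degrees of freedom the right-hand tail probability of a value $c$ is $e^{-c/2}$. The variable names are our own. Other rows would need a statistics package.

```python
import math
from statistics import NormalDist

# Right-hand tail probabilities in the column headings of the table.
tail_probs = [0.100, 0.050, 0.010, 0.001]

# df = 1: chi-square is the square of a standard normal variable, so the
# critical value is the square of the normal quantile at 1 - p/2.
df1_row = [NormalDist().inv_cdf(1 - p / 2) ** 2 for p in tail_probs]

# df = 2: the tail probability of c is exp(-c / 2), so c = -2 * ln(p).
df2_row = [-2 * math.log(p) for p in tail_probs]

print([round(c, 3) for c in df1_row])
print([round(c, 3) for c in df2_row])
```

The 0.050 entry for 1 degree of freedom, 3.84, is simply the square of the familiar normal critical value 1.96.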

ESS Round 5: European Social Survey Round 5 Data (2010). Data file edition 2.0. Norwegian Social Science Data Services, Norway - Data Archive and distributor of ESS data. The full data can be obtained from http://ess.nsd.uib.no/ess/round5/.
The data were obtained from https://data.london.gov.uk/dataset/london-borough-profiles. If you download the “Profiles in Excel” workbook, you will find that one of the pages contains a map of the boroughs, and a tool for visualising the data on that map. A regular map of the boroughs can be found, for example, at https://en.wikipedia.org/wiki/London_boroughs.
The data can be obtained from http://www3.norc.org/gss+website/, which gives further information on the survey, including the full text of the questionnaires.
ESS Round 5: European Social Survey Round 5 Data (2010). Data file edition 2.0. Norwegian Social Science Data Services, Norway - Data Archive and distributor of ESS data. The full data can be obtained from http://ess.nsd.uib.no/ess/round5/.
Strictly speaking, the analysis should incorporate sampling weights (variable DWEIGHT) to adjust for different sampling probabilities for different types of respondents. Here the weights are ignored. Using them would not change the main conclusions for these variables.
The data can be obtained from http://bes2009-10.org/, which gives further information on the survey, including the full text of the questionnaires. The data analysed in this class and homework are from the BES Campaign Internet Panel Survey, which has been divided into two data sets corresponding to two time periods leading up to the General Election.
Official results obtained from http://www.olympic.org/london-2012-summer-olympics.
The data can be obtained from http://www..norc.org/GSS+Website/, which gives further information on the survey, including the full text of the questionnaires.
United Nations Development Programme International Human Development Indicators, http://hdr.undp.org/en/data/; World Bank Worldwide Governance Indicators, http://info.worldbank.org/governance/wgi/pdf/wgidataset.xlsx; World Bank World Development Indicators, http://data.worldbank.org/indicator/SP.DYN.IMRT.IN.
ESS Round 5: European Social Survey Round 5 Data (2010). Data file edition 2.0. Norwegian Social Science Data Services, Norway - Data Archive and distributor of ESS data. The full data can be obtained from http://ess.nsd.uib.no/ess/round5/.