# Analysing quantitative data

## Introduction

This chapter focuses on quantitative data analysis. It gives you an overview of the ways in which you can analyse the data you have spent so much time and energy collecting. Using real-life data and examples, it provides a guide to the statistical techniques most commonly used in undergraduate dissertations. By setting out the essential steps and procedures for analysing your data, the chapter seeks to develop your skills in handling and making sense of your data, whether collected through questionnaires, surveys, structured interviews, observations, existing secondary data or any other method. All examples and case studies in this chapter are based on real-life data, and all statistical analyses were carried out using IBM-SPSS Statistics Version 24. The emphasis is on developing your knowledge and understanding of when to use different analytical techniques, the practical skills and procedures required to analyse your data, and how to interpret and draw inferences from your results.

By the end of the chapter, you will have a better understanding of how to:

• Prepare for quantitative data analysis by organising and coding your data;
• Define different types of variables and enter data into the IBM-SPSS statistical programme;
• Explore the distribution of your data and use descriptive statistics to summarise it;
• Use IBM-SPSS to construct simple graphs to present your data visually;
• Explore statistical relationships between variables through univariate and bivariate analyses;
• Draw inferences and conclusions from quantitative statistical analysis of your data.

## Variables

Not all numbers are the same. An understanding of the differences between variables is important when thinking about analysing your data and choosing which calculations to carry out. There are three main types of variable:

1. Nominal. This is when numbers are used like names. In a questionnaire, for example, certain questions might be coded with numbers to represent different categories. In a question on country of birth, Afghanistan might be coded as 1, Albania as 2, Algeria as 3, etc. The numbers 1, 2 and 3 have no numeric value and have been chosen arbitrarily. We could easily have chosen to code Albania as 9 and Algeria as 11. These categories cannot be rank ordered, and it would be meaningless to carry out certain statistical tests (such as calculating the mean) on nominal data. Other examples of nominal data include ethnicity, eye colour and housing tenure.
2. Ordinal. For these variables, the numbers represent categories again, but this time they can be rank ordered, for example through the use of Likert-type scales. Levels of satisfaction can be numbered from 1 to 5, where 1 = extremely satisfied, 2 = satisfied, 3 = neither satisfied nor dissatisfied, 4 = not satisfied, 5 = not satisfied at all. With these kinds of data it is possible to describe people's level of satisfaction, e.g. '67 per cent of respondents were very satisfied with the service.' It is important to remember, however, that the distances between the categories might not be equal. The researcher cannot judge whether someone who gives a 5 for satisfaction is five times less satisfied than someone who gives a 1. This means that, as with nominal data, certain calculations, such as the mean and standard deviation, cannot be carried out on ordinal data. Other examples of ordinal data include age categories (21-30, 31-40, etc.) or frequency of doing something (never, rarely, often, frequently).
3. Interval/Ratio. Here, the differences between the numbers are equal across the range. If someone is 21 and someone else is 18, the difference is three years. These three years are equal to the three-year difference between someone who is 35 and someone else who is 32. The distinction between interval and ratio data is that the zero in interval data is arbitrary. For example, on a thermometer, the zero for the Fahrenheit and Celsius scales is different. In social science research, most variables will have a fixed zero, so they are ratio variables. It is possible to carry out more complex calculations and statistical tests on interval and ratio data. Examples of ratio data include age, income, height and weight.

Think about your data and decide which types of variables you are working with before starting your data analysis. The volume of numbers from which you need to create order and meaning can be intimidating at the start of data analysis. You need to find ways to summarise the data so that you can more easily see what the data is telling you. As you describe and summarise your data, you will be making it more readable, comprehensible and clear. Here we will look at how you can describe one variable and then compare two variables. Univariate analysis looks at one variable and describes tendencies, patterns and trends, whereas bivariate analysis looks at relationships between two variables. Multivariate analysis looks at more than two variables simultaneously.

## Data exploration and descriptive statistics

Quantitative data analysis is generally divided into two categories — descriptive and inferential statistics. Descriptive statistics enable you to explore and understand your data before carrying out any further or more detailed statistical analysis that may be required. For some dissertations, it could well be that all that is required is descriptive statistics, without any further complicated statistical analysis. There are several ways you can explore and make a judgement on the nature of your data in terms of its distribution, underlying patterns and structures. One of these may involve manual calculation of simple statistics that measure averages, such as the mean, the weighted mean, the median, the mode and percentiles. Other manual calculations may involve statistics that measure variability, such as the range, the standard deviation and the variance. Each of these descriptive statistics is explained below, using examples of real-life data.

## Averages and measures of central tendency

You can explore the distribution of quantitative data using simple statistics that measure characteristics such as:

• Average value
• Variability
• Skewness
• Kurtosis.

### Average

An average is the one value that best represents an entire group of scores or values in your dataset. Statistics based on averages tend to measure the central tendency of the data. There are three forms of average:

• The mean
• The median
• The mode.

Each of these averages will produce a different type of information about the nature of your data and its distribution.

### The mean

The mean (also known as the arithmetic mean) is the most common type of average computed in social science undergraduate dissertations. It is the sum of all the values in your data set, divided by the number of values or cases in that group. This is mathematically expressed as:

$$\bar{X} = \frac{\sum X}{n}$$

where:

$\bar{X}$ (X bar) is the mean value of the group of scores, or simply the mean;

$\sum$ is the summation sign, denoted by the Greek letter sigma;

$X$ is each individual score in the group or dataset;

$n$ is the size of the sample, or the number of cases in your dataset. In some publications, the mean is denoted by the letter M. Technically, the arithmetic mean is defined as the point at which the sum of the deviations from the mean is equal to zero.

For example, a researcher interviewed a group of ten people whose ages (in years) are recorded as 18, 18, 37, 40, 47, 54, 62, 70, 74, and 80. The mean age in this sample is 50, and the sum of the deviations of each score from the mean is zero (i.e. adding up: -32, -32, -13, -10, -3, 4, 12, 20, 24, 30).
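Although this chapter's worked examples use IBM-SPSS, the arithmetic can be checked in a few lines of Python. This is an illustrative sketch using only the standard library, with the ten ages from the example above:

```python
from statistics import mean

# Ages of the ten interviewees from the example above
ages = [18, 18, 37, 40, 47, 54, 62, 70, 74, 80]

# The arithmetic mean: sum of all values divided by the number of cases
x_bar = sum(ages) / len(ages)
print(x_bar)            # 50.0
print(mean(ages))       # 50 (the standard library gives the same answer)

# The deviations from the mean always sum to zero
print(sum(x - x_bar for x in ages))   # 0.0
```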

### Weighted mean

The weighted mean is used in situations where values occur more than once. For example, a Sociology lecturer interested in calculating the weighted mean score on her Society and Change module over a period of three years recorded the data in Table 9.1.

The weighted mean is obtained by multiplying each score (module grade) by the frequency of its occurrence, summing all the products, and then dividing by the total number of occurrences. The weighted mean grade in this case is 61.24% (6,124 divided by 100).
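The same calculation can be sketched in Python (illustrative only, using the grades and frequencies from Table 9.1):

```python
# Module grades and their frequencies from Table 9.1 (grade: no. of students)
grades = {47: 5, 50: 12, 54: 11, 60: 20, 63: 29, 70: 11,
          71: 8, 80: 2, 84: 1, 86: 1}

# Weighted mean: sum of (grade x frequency) divided by total frequency
total = sum(g * f for g, f in grades.items())
n = sum(grades.values())
print(total, n, total / n)   # 6124 100 61.24
```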

Table 9.1 Weighted mean of students' grades enrolled on Society and Change module 2017-2019

| Module grade (%) | Frequency (no. of students with corresponding grade) | Grade x frequency |
|---|---|---|
| 47 | 5 | 235 |
| 50 | 12 | 600 |
| 54 | 11 | 594 |
| 60 | 20 | 1,200 |
| 63 | 29 | 1,827 |
| 70 | 11 | 770 |
| 71 | 8 | 568 |
| 80 | 2 | 160 |
| 84 | 1 | 84 |
| 86 | 1 | 86 |
| Total | 100 | 6,124 |

The weighted mean grade is 61.24%.

### The median

The median is defined as the midpoint in a set of values or scores. It divides your data set into two equal halves, such that one half (50%) of the values fall above the median point and the other half fall below it. Although there is no standard formula for computing the median, it can be determined by:

• Listing all the values in order, either from highest to lowest or lowest to highest
• Finding the middle-most value
• Averaging the two middle values, if the number of values is even.

While the mean measures the middle point of a set of values, the median is the middle point of a set of cases. In the earlier example of ten people aged 18, 18, 37, 40, 47, 54, 62, 70, 74 and 80, the median falls between 47 and 54, i.e. (47 + 54)/2 = 50.5.
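The median steps above can be sketched in Python (an illustrative check, again using the ten ages from the example):

```python
from statistics import median

ages = [18, 18, 37, 40, 47, 54, 62, 70, 74, 80]

# Even number of cases: sort, then average the two middle values
ordered = sorted(ages)
mid = len(ordered) // 2
by_hand = (ordered[mid - 1] + ordered[mid]) / 2
print(by_hand)        # 50.5
print(median(ages))   # 50.5
```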

### Percentile

A percentile is a statistical measure of distribution that shows the value below which a given percentage of observations in a set of data fall. It is used to define the percentage of cases equal to and below a certain point in a distribution or set of scores. If a score is at the 75th percentile, it means that the score is at or above 75% of the other scores in your data set or sample. The median is the 50th percentile, i.e. the point below which 50% of your data sample fall. The 25th percentile is known as the first quartile (Q1), the middle number between the smallest value and the median of your data set. The 50th percentile is the second quartile (Q2), which is also the median of the data. The 75th percentile is the third quartile (Q3), the middle value between the median and the highest value of your data sample.
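Quartiles can be computed with the standard library as a quick sketch. Note that `statistics.quantiles` uses the 'exclusive' interpolation method by default, so other packages (including SPSS) may give slightly different cut points for small samples:

```python
from statistics import quantiles, median

ages = [18, 18, 37, 40, 47, 54, 62, 70, 74, 80]

# Cut the data into four parts: returns [Q1, Q2, Q3]
q1, q2, q3 = quantiles(ages, n=4)
print(q1, q2, q3)

# The 50th percentile (second quartile) is the median
print(q2 == median(ages))   # True
```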

### The mode

The mode is the value that occurs most frequently. To compute the mode, you need to:

• List all the values in your data set or distribution (list each value only once)
• Tally the number of times that each value occurs
• Note the value that occurs most often.

If every value in a distribution occurs the same number of times, there is no modal value. If two values tie for the highest frequency, the distribution is bimodal; if more than two values do, the distribution is multi-modal. The mode in the interviewed group of ten people aged 18, 18, 37, 40, 47, 54, 62, 70, 74 and 80 is 18.
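The tally-and-pick procedure above maps directly onto the standard library (an illustrative sketch):

```python
from collections import Counter
from statistics import multimode

ages = [18, 18, 37, 40, 47, 54, 62, 70, 74, 80]

# Tally the number of times each value occurs, then pick the most frequent
tally = Counter(ages)
print(tally.most_common(1))   # [(18, 2)]

# multimode returns every value tied for the highest count,
# so it also handles bimodal and multi-modal distributions
print(multimode(ages))        # [18]
```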

## How do I know which measure of central tendency to use?

The type of averages or central tendency measures you need for your dissertation will depend on the type of data you've collected. For categorical or nominal data such as hair colour, income bracket, voting preference, racial group, etc., the mode is a more practical measure of central tendency to use. For interval/ratio data such as income levels, age, test score, height, weight, etc., the median and mean are best used. Generally, the mean is a more precise measure than the median, and the median is a more precise measure than the mode.

## Measures of variability (spread or dispersion)

Measures of variability reflect how scores differ from one another. If, for example, a researcher was interested in the variations in life expectancy at birth across different countries around the world, they could randomly select six countries each from sub-Saharan Africa, Asia and Europe, as recorded in Table 9.2.

For the purpose of illustration, the life expectancy figures in Table 9.2 show different degrees of variability within each of the three groups of countries. While the data show relatively little variation for the selected countries in Europe, the countries in sub-Saharan Africa show greater variation, and the selected Asian countries show no variability in their life expectancy data at all. Technically, variability is a measure of how much each score in a group of scores differs from the mean. Therefore, both averages and measures of variability are used to describe the characteristics of a distribution or data set.

Table 9.2 Life expectancy at birth, total (years), for selected countries in sub-Saharan Africa, Asia and Europe (2017)

| Sub-Saharan Africa | | Asia | | Europe | |
|---|---|---|---|---|---|
| Cameroon | 59 | Mongolia | 69 | Belgium | 81 |
| Eritrea | 66 | India | 69 | Bulgaria | 75 |
| Liberia | 63 | Timor-Leste | 69 | France | 83 |
| Mali | 58 | Cambodia | 69 | Ireland | 82 |
| Nigeria | 54 | Indonesia | 69 | Italy | 83 |
| Sierra Leone | 52 | Philippines | 69 | United Kingdom | 81 |

Source: The World Bank, 2018. Accessible at: https://data.worldbank.org/indicator/SP.DYN.LE00.IN

The three most commonly used measures of variability are:

• The range
• The standard deviation
• The variance.

### The range

The range gives an idea of how far apart scores are from one another. It is computed by subtracting the lowest score in a distribution from the highest score:

$$r = h - l$$

where $r$ is the range, $h$ is the highest score and $l$ is the lowest score in the data set.

There are two kinds of range:

• the exclusive range, the highest score minus the lowest score, computed as $r = h - l$;
• the inclusive range, the highest score minus the lowest score plus 1, computed as $r = h - l + 1$.

The range gives only a general estimate of how widely scores are spread. It should not be used to reach any conclusions regarding how individual scores differ from each other.
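Both kinds of range can be sketched in a couple of lines of Python (illustrative, using the ten ages from the earlier examples):

```python
scores = [18, 18, 37, 40, 47, 54, 62, 70, 74, 80]

h, l = max(scores), min(scores)

exclusive_range = h - l        # highest minus lowest
inclusive_range = h - l + 1    # highest minus lowest, plus 1

print(exclusive_range, inclusive_range)   # 62 63
```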

### The standard deviation (SD)

The standard deviation represents the average amount of variability in a set of scores. In technical terms, the SD is the average distance of scores from the mean. The larger the standard deviation, the larger the average distance of each data point from the mean of the distribution. The SD is computed using the formula:

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

where $s$ is the standard deviation, $\sum$ is sigma (the summation sign), $X$ is each individual score, $\bar{X}$ is the mean of all the scores and $n$ is the sample size of your data.

To compute the SD manually for your data set, you need to:

• Find the difference between each individual score and the mean $(X - \bar{X})$
• Square each difference and sum them all together
• Divide the sum by the sample size minus 1
• Take the square root of the result.
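The four manual steps above can be sketched in Python and checked against the standard library (illustrative only):

```python
from math import sqrt
from statistics import stdev

scores = [18, 18, 37, 40, 47, 54, 62, 70, 74, 80]

# Step 1: deviations from the mean; step 2: square and sum;
# step 3: divide by (n - 1); step 4: take the square root
x_bar = sum(scores) / len(scores)
squared = sum((x - x_bar) ** 2 for x in scores)
s = sqrt(squared / (len(scores) - 1))

print(round(s, 2))
print(round(stdev(scores), 2))   # the same value via the standard library
```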

### The mean deviation

The mean deviation (also called the mean absolute deviation) is the sum of the absolute values of the deviations from the mean, divided by the number of data points. Absolute values are used because the sum of the signed deviations from the mean is always equal to zero.

### The variance

The variance is the standard deviation squared. It can be computed using the formula:

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
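The relationship between the two statistics is easy to verify in Python (illustrative sketch):

```python
from statistics import stdev, variance

scores = [18, 18, 37, 40, 47, 54, 62, 70, 74, 80]

# The variance is simply the standard deviation squared
print(round(variance(scores), 2))      # 484.67
print(round(stdev(scores) ** 2, 2))    # 484.67
```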

### Distribution curves and skewness

Skewness and kurtosis are statistical terms used to describe the shape of a distribution. Most statistical analysis assumes a normal distribution of data, with a symmetric bell-shaped pattern as shown in Figure 9.1. With a normal distribution, the data tend to cluster around the mean with no bias to the left or right; for example, if you were to take a class of 50 people and measure their heights, under a normal distribution most people's heights would be clustered around the mean.

Figure 9.1 Normal distribution curve

However, your data may not conform to the normal distribution assumption, and it may be necessary to establish the degree to which the data deviates from a normal distribution. Skewness is a measure of the lack of symmetry, or the lopsidedness, of a distribution, which occurs when one 'tail' of the distribution is longer than the other.

Figure 9-2 shows two forms of distribution with varying degrees of skewness. While the normal distribution curve in Figure 9-1 has equal lengths of tails and no skewness, curve A in Figure 9-2 has a longer right tail than left. This suggests a smaller number of occurrences at the high end of the distribution. This kind of distribution is referred to as positively skewed. Conversely, the distribution B in Figure 9-2 has a shorter right tail than left. This means a larger number of occurrences at the high end of the distribution. Therefore, curve B denotes a negatively skewed distribution.

The location of the mean value in relation to the median value will indicate the direction of skewness. Generally, if the mean is greater than the median, the distribution is positively skewed. Conversely, if the median is greater than the mean, the distribution is negatively skewed.

Figure 9.2 Distribution curves with positive and negative skew

In mathematical terms, a simple indication of skewness can be computed by subtracting the value of the median from the mean. For example, if the mean value of a distribution is 95 and the median is 86, the skewness value is 9 (95 - 86), so the distribution is positively skewed. Similarly, if the mean of a distribution is 67 and the median is 74, the skewness value is -7 (67 - 74), suggesting that the distribution is negatively skewed.

To compare the skewness of one distribution with another in absolute terms, Pearson's skewness coefficient is often used:

$$SK = \frac{3(\bar{X} - M)}{s}$$

where $SK$ is Pearson's measure of skewness, $\bar{X}$ is the mean value, $M$ is the median and $s$ is the standard deviation.

For example, if the mean value of a distribution is 100, the median is 105 and the standard deviation is 10, its skewness will be 3(100 - 105)/10 = -1.5, so the distribution is negatively skewed. In the same vein, if a distribution has a mean of 120, a median of 116 and a standard deviation of 10, its skewness will be 3(120 - 116)/10 = 1.2, so the distribution is positively skewed.
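This is Pearson's second (median) skewness coefficient, and it can be sketched as a small Python function (illustrative only; the two worked examples are hypothetical summary figures, not data from this chapter):

```python
def pearson_skewness(mean, median, sd):
    """Pearson's second skewness coefficient: 3(mean - median) / sd."""
    return 3 * (mean - median) / sd

# Mean below the median: negative skew
print(pearson_skewness(100, 105, 10))   # -1.5

# Mean above the median: positive skew
print(pearson_skewness(120, 116, 10))   # 1.2
```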

### Kurtosis

Kurtosis relates to how flat or peaked a distribution appears. Figure 9.3 shows three distribution curves with different kurtosis. The term platykurtic refers to a distribution curve that is relatively flat compared to a normal, bell-shaped distribution. A normal bell-shaped distribution is described as mesokurtic, while the term leptokurtic refers to a distribution that is relatively peaked compared to a normal, bell-shaped distribution.

Generally, data sets that are platykurtic are relatively more dispersed than those that are not. Distributions that are leptokurtic are less variable or dispersed relative to others. While skewness and kurtosis are used mostly as descriptive terms, there are mathematical indicators or measures that can be computed to indicate how skewed or kurtotic your data distribution is.
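One such mathematical indicator can be sketched in Python. Note there are several definitions in use; this sketch computes the population fourth-moment ("excess") kurtosis, which differs slightly from the sample-adjusted statistic that SPSS reports, and the two small data sets are invented for illustration:

```python
from statistics import mean, pstdev

def excess_kurtosis(data):
    """Population excess kurtosis: the fourth standardised moment minus 3.
    Positive values suggest a leptokurtic (peaked) distribution,
    negative values a platykurtic (flat) one."""
    m, s = mean(data), pstdev(data)
    return sum((x - m) ** 4 for x in data) / (len(data) * s ** 4) - 3

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]      # roughly uniform: platykurtic
peaked = [5, 5, 5, 5, 5, 5, 5, 1, 9]    # clustered at the mean: leptokurtic

print(excess_kurtosis(flat))    # negative (platykurtic)
print(excess_kurtosis(peaked))  # positive (leptokurtic)
```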

## Using software to analyse your data

Do not despair if you have read through this chapter so far and wondered how on earth you would be able to do all of the mathematical calculations and produce the complicated curves, tables and graphs you need for your dissertation — there is software to help you!

Figure 9.3 Distribution curves and kurtosis

The most commonly used statistical computer program designed originally for social scientists is IBM-SPSS. It is relatively easy to use, and there are many good books that will introduce you to the program and provide step-by-step guides on how to use it to analyse your data. While it is beyond the scope of this book to provide you with the training and skills to use IBM-SPSS, we have, where necessary, offered some tips to get you started. IBM-SPSS is a powerful piece of software with functionality well beyond what you will need for your research. Your institution might well have a licence for IBM-SPSS or other similar general-purpose statistical software such as Stata. There are many online guides, textbooks and chapters in data analysis books that may also help you learn IBM-SPSS. It is worth taking a look at these texts and working through some of the examples before you start to analyse your own data.

There are other statistical packages available to you, some of which are freely accessible, for example OpenStat, as well as Excel in the Microsoft Office package. The statistics functionality of these programs is good, and freely available add-ons can provide more advanced features. Using IBM-SPSS or any of the other statistical programs will enable you to carry out different calculations, and you can also produce graphs, plots and tables to present your data visually. For the rest of this chapter, we provide a number of case studies to illustrate how to use IBM-SPSS to analyse your data.

The computer package only works with what you input, so it is important that you understand the principles and techniques for using IBM-SPSS, or indeed any other statistical package, to achieve your data analysis objectives. That is why we have concentrated on what the different techniques show rather than on how to carry them out in different packages. It is really important, therefore, that you understand the tests you are asking the IBM-SPSS software to perform, how to interpret the results and how to present your own findings. Read carefully each of the following case studies and examples of how IBM-SPSS was used to answer specific research questions.

## Preparing data for computer analysis

The first stage in using computers to analyse your data is getting the data into a format that a computer can read. This usually involves creating a spreadsheet into which you input all your data. This initial data organisation can be done using Microsoft Excel or any specialised software that allows data entry and storage. If your data collection instrument is a questionnaire, you may need to design a coding scheme so that all your questionnaire data can be entered in a format the computer is able to read and process. This initial questionnaire data management and organisation is referred to as coding.

### Coding and coding schemes

Coding is the transformation of the information contained in your questionnaires into a numeric or alpha-numeric format that a computer can understand and use in statistical analysis. It is the method of assigning numerical values or symbols to the various answer categories in a questionnaire. For each response option to a question, a letter code (a, b, c) or, preferably, a numerical code (1, 2, 3) is usually assigned. Coding is an important stage in data processing, so care should be taken when assigning codes to your questionnaire information to make your analysis meaningful.

For example, in a questionnaire survey of undergraduate students’ educational experience and course choice in a UK University, the coding scheme in Table 9-3 was used to record and enter some of the respondents’ answers to the survey questions.

The essence of a coding scheme is to facilitate computer data entry and to ensure data is correctly entered. The coding scheme allows questionnaire information to be entered in a consistent format that can be read and analysed by computer programmes such as IBM-SPSS in the form of a spreadsheet.
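A coding scheme is easy to express programmatically. The sketch below applies a fragment of the Table 9.3 scheme to one hypothetical response (the response itself is invented for illustration):

```python
# A fragment of the coding scheme in Table 9.3, as Python dictionaries
coding_scheme = {
    "Gender": {"Male": 1, "Female": 2},
    "Study mode": {"Full time": 1, "Part time": 2, "Distance learning": 3,
                   "E-learning": 4, "Other": 5},
    "Course design": {"Very important": 1, "Important": 2, "Not important": 3},
}

# One questionnaire response, coded for entry into a spreadsheet
response = {"Gender": "Female", "Study mode": "Part time",
            "Course design": "Very important"}
coded = {q: coding_scheme[q][a] for q, a in response.items()}
print(coded)   # {'Gender': 2, 'Study mode': 2, 'Course design': 1}
```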

## Introducing IBM-SPSS, defining variables and inputting data

To make the most of IBM-SPSS, you first need to know how the program works and the various interfaces used to analyse your data. It is worth consulting books that introduce the program in full and help you learn how to use the IBM-SPSS software to define your variables, input your data and save your data file. To get you started, here are some useful tips:

• Log on to IBM-SPSS.
• A dialogue box will appear, offering either to open an existing SPSS data file or to create a new data file.
• To create a new SPSS data file, you need to define each of your variables in the variable window.

Table 9.3 Sample coding scheme used in a survey of students' educational experience and course choice in a UK university

| Survey question | Variable name used to define question | Possible response options to question | Codes used to define options |
|---|---|---|---|
| Gender | Gender | Male | 1 |
| | | Female | 2 |
| Your age category | Age | 18-21 | 1 |
| | | 21-25 | 2 |
| | | 25-35 | 3 |
| | | 35-45 | 4 |
| | | 45-55 | 5 |
| | | 55-60 | 6 |
| | | 60-65 | 7 |
| | | 65 plus | 8 |
| Studentship status | Status | Undergraduate | 1 |
| | | Postgraduate | 2 |
| | | Other | 3 |
| Mode of study | Study mode | Full time | 1 |
| | | Part time | 2 |
| | | Distance learning | 3 |
| | | E-learning | 4 |
| | | Other | 5 |
| Current stage/year of study | Study stage | Year 1 | 1 |
| | | Year 2 | 2 |
| | | Year 3 | 3 |
| | | Year 4 | 4 |
| | | Other | 5 |
| How important is the course design as a factor in choosing your programme? | Course design | Very important | 1 |
| | | Important | 2 |
| | | Not important | 3 |
| How important is the University reputation and standing in the league table in choosing your course? | Uni reputation | Very important | 1 |
| | | Important | 2 |
| | | Not important | 3 |
| How important to you is job prospect in choosing your course? | Job prospect | Very important | 1 |
| | | Important | 2 |
| | | Not important | 3 |

Source: extract from a student experience survey (Jegede, 2018)

• At the bottom left-hand corner, you can swap between the variable view window and the data view window.
• In the first row of the variable view window, define your first variable by specifying the variable name (a variable name cannot be more than eight characters).
• Choose the variable type (numeric for numbers or string for texts or letters).
• Define the width of your variable. The variable width must be equal to or greater than the largest number of digits in the data set for that variable, including the decimal point. For example, entering 267.84 requires a variable width of 6, while 7.9 requires a width of 3.
• Define the number of decimal places in your data set. The programme default is two decimal places, but this can be changed as required. If there is no decimal place, the value of zero should be entered.
• Label your variable if required. (You have the option to label your variable with a longer name containing more information.)
• To define other variables, repeat the above procedure on row two, row three, etc.
• Once you have defined and entered all the essential information for each of your variables, you can click on the data view tab in the bottom left-hand corner to start entering your data.
• Remember to save your data.
• If a variable name is not fully displayed, you can increase the width of the field by holding down the left button on the variable name cell and dragging it to the right.
• The main menu options are located at the top of the screen, where you will select all the SPSS commands needed for your analysis.

All the tips provided in this book are based on IBM-SPSS Statistics Version 24. For a practical guide to computing descriptive statistics using real-life data, read Case Study 9.1. It illustrates how IBM-SPSS can be used to generate descriptive statistics to summarise your data.

Case Study 9.1 Computing descriptive statistics using crime data derived from the Crime Survey for England and Wales and Police Recorded Crime Data

Problem definition

A criminologist interested in analysing the volume of violent crime dealt with by the police from the year ending March 2003 to the year ending March 2015 in England and Wales extracted the data in Table 9.4 from the Crime Survey for England and Wales.

The objective is to use IBM-SPSS descriptive statistics to analyse and summarise the data.

See Table 9.4

Tips for IBM-SPSS procedure for computing descriptive statistics:

• Log on and enter your data into IBM-SPSS.
• Define your variables, e.g. crime figures (numeric); crime record period (string); crime category (string), e.g. violence with injury = 1; violence without injury = 2; stalking and harassment = 3.
• Data > split file > compare groups, with groups based on crime category.
• Analyse.
• Descriptive statistics; crime figures.
• Options: check the mean, sum, std deviation, minimum, maximum, range, kurtosis and skewness boxes.
• Extract results from the IBM-SPSS output window.

The results of the SPSS descriptive statistics computed using this data are summarised in Table 9.5. The researcher also used the same data to construct a histogram and boxplots for each of the crime categories, as shown in Figure 9.4 and Figure 9.5.

Tips for IBM-SPSS procedure for boxplots:

• Log on and enter your data into IBM-SPSS.
• Define your variables, e.g. crime figures (numeric); crime record period (string); crime category (string), e.g. violence with injury = 1; violence without injury = 2; stalking and harassment = 3.
• Graph > legacy dialogues > boxplot > simple.
• Select summaries for groups of cases.
• Define.
• A new dialogue box opens up.
• Move the crime figures variable into the variable box.
• Move the crime category variable into the category axis.
• Move the crime record period variable into the label cases by box.
• Click OK.
• Copy your boxplots and note the size of each boxplot, the location of the median line, the length of the whiskers and any outliers.

Table 9.4 Volume of violent crime dealt with by the police from year ending March 2003 to year ending March 2015

| Period | Violence with injury | Violence without injury | Stalking and harassment | Total violence |
|---|---|---|---|---|
| April 02-March 03 | 371,774 | 302,450 | 33,002 | 708,742 |
| April 03-March 04 | 457,223 | 300,090 | 40,522 | 799,247 |
| April 04-March 05 | 514,638 | 277,569 | 52,117 | 845,673 |
| April 05-March 06 | 543,044 | 237,218 | 57,192 | 838,674 |
| April 06-March 07 | 505,848 | 249,632 | 58,150 | 814,865 |
| April 07-March 08 | 451,806 | 241,226 | 54,531 | 748,779 |
| April 08-March 09 | 420,184 | 236,943 | 50,758 | 709,008 |
| April 09-March 10 | 400,703 | 241,818 | 55,329 | 699,011 |
| April 10-March 11 | 367,847 | 243,426 | 53,144 | 665,486 |
| April 11-March 12 | 337,709 | 238,276 | 49,766 | 626,720 |
| April 12-March 13 | 311,740 | 232,466 | 56,032 | 601,141 |
| April 13-March 14 | 322,362 | 248,616 | 62,656 | 634,625 |
| April 14-March 15 | 373,509 | 317,166 | 86,368 | 778,172 |

Data extracted from the Crime Survey for England and Wales and Police Recorded Crime Data (Police recorded crime, Home Office, licensed under the Open Government Licence). Office for National Statistics. Accessed 4 June 2019 from: www.ons.gov.uk/peoplepopulationandcommunity/crimeandjustice/datasets/crimeinenglandandwalesbulletintables

The results of the SPSS descriptive statistics used in Case Study 9.1 are summarised in Table 9.5. The same data were used to construct the histogram for each of the crime categories shown in Figure 9.4.
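Several of the figures in Table 9.5 can be reproduced outside SPSS as a cross-check. This illustrative Python sketch uses the "violence with injury" column of Table 9.4:

```python
from statistics import mean, stdev

# "Violence with injury" column of Table 9.4 (13 yearly figures)
violence_with_injury = [
    371774, 457223, 514638, 543044, 505848, 451806, 420184,
    400703, 367847, 337709, 311740, 322362, 373509,
]

print(sum(violence_with_injury))                              # 5378387 (sum)
print(round(mean(violence_with_injury), 2))                   # 413722.08 (mean)
print(max(violence_with_injury) - min(violence_with_injury))  # 231304 (range)
print(round(stdev(violence_with_injury), 2))                  # standard deviation
```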

## Graphs and graphical display of data

Visual aids are an important part of data exploration and can help you make sense of your data. Graphs generated through IBM-SPSS or Excel can help you summarise your data and highlight key areas on which to focus your attention. Different types of graphs can be used to make your data more visually appealing and accessible.

The most common types of graphs used in dissertations are:

• Histogram
• Bar graph
• Line graph
• Pie chart
• Box plots.

Table 9.5 Summary of descriptive statistics for violent crime, England and Wales, from year ending March 2003 to year ending March 2015

| | Violence with injury | Violence without injury | Stalking and harassment | Total |
|---|---|---|---|---|
| Sample size (n) | 13 | 13 | 13 | 13 |
| Minimum | 311,740 | 232,466 | 33,002 | 33,002 |
| Maximum | 543,044 | 317,166 | 86,368 | 543,044 |
| Mean (arithmetic) | 413,722.08 | 258,992.00 | 54,582.08 | 242,432.05 |
| Standard deviation | 75,804.35 | 29,507.28 | 12,271.53 | 156,014.40 |
| Sum | 5,378,387 | 3,366,896 | 709,567 | 9,454,850 |
| Range | 231,304 | 84,700 | 53,366 | 510,042 |
| Skewness | 0.36 | 1.11 | 1.05 | 1.17 |
| Kurtosis | -1.08 | -0.40 | 3.87 | -1.08 |

Source: Extracts from IBM-SPSS Descriptive Statistics Output.

Frequency distributions

One way of presenting your data is through frequency distributions. This will show the number of people and the percentage for each category in your variable. Frequency distributions can be used for all types of variable (nominal, ordinal, interval/ratio) mentioned above. The way you present your frequency distribution will depend on the variables you are describing. You can present your data in tables or graphs (pie charts, bar charts, histograms). A good rule of thumb is to use a table unless a graph can put across the message more clearly.
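As a quick illustration of a frequency distribution, counts and percentages for a nominal variable can be tallied in a few lines of Python; the responses below are hypothetical, not real survey data.

```python
from collections import Counter

# Hypothetical responses to a nominal question (e.g. housing tenure)
responses = ["owner", "renter", "renter", "owner", "social", "renter", "owner", "owner"]

freq = Counter(responses)
total = len(responses)

# Print each category with its count and percentage, most frequent first
for category, count in freq.most_common():
    print(f"{category:<8} n={count}  {100 * count / total:.1f}%")
```

The same table of counts and percentages is what a pie chart or bar chart would display visually.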

Boxplot

The box plot shows all of the following:

• • The smallest observation (the bottom horizontal line)
• • The bottom 25% (the section between the lowest observation and the grey box)
• • The interquartile range (the grey box)
• • The median (thick black line inside the box)
• • The top 25% (section above the grey box)
• • The highest observation (upper horizontal line).
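The quantities behind a box plot can be computed directly. Here is a minimal Python sketch using the standard `statistics` module, with made-up values; note how the common 1.5 × IQR rule of thumb flags the extreme value as an outlier, just as SPSS marks outliers with circles and stars.

```python
import statistics

# Hypothetical sample (illustrative values, not the crime data)
data = [33, 35, 36, 38, 40, 41, 43, 45, 47, 52, 86]  # note the high value 86

# The three cut points dividing the sorted data into quarters
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the grey box on the plot

# Rule of thumb: points beyond 1.5 * IQR from the box are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(min(data), q1, median, q3, max(data), outliers)
```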

Figure 9.4 Histogram of violent crime in England and Wales 2003-2015


The box plot shows whether your data has a symmetrical or skewed distribution. In this example, the boxplots show that violent crime data in England and Wales are skewed, suggesting there is more spread for all categories of crime in the upper 25%. The box plot also indicates where there might be outliers. Outliers are cases which are very different from the rest of the cases. They are shown here by circles and stars. In our example, the stalking and harassment boxplot shows an outlier.

Exploring your data through descriptive statistics and graphical presentation as shown in this chapter can enable you to make a judgement on the nature of your data in terms of its distribution, underlying patterns and structures. Therefore, you may consider descriptive statistics as the first step in quantitative analysis. As an exploratory tool, descriptive statistics can uncover hidden patterns in your data and help you decide on further analysis that may be needed.

Figure 9.5 Boxplots of violent crime in England and Wales 2003-2015

Univariate, bivariate and multivariate analysis

As mentioned earlier, it is important to make a distinction between different forms of analysis in relation to the number of variables involved. The example in the first case study is a case of univariate analysis, where we deal with one variable — volume of crime in England and Wales. Although we are interested in different types of crime in the study, we are not attempting to compare the crime variable with any other variable. In a bivariate analysis, the focus is on examining the relationship between two variables: the explanatory and the outcome variable. The explanatory variable is the variable which is thought to be the variable of influence (it is also known as the independent, input or predictor variable). The outcome variable (also known as the dependent variable) is the one that we believe will be affected by the explanatory variable.

In a multivariate analysis, you can analyse more than two variables simultaneously. The techniques used to do this kind of analysis are quite advanced. While we have provided a general guide to multivariate analysis in this chapter, you may need to consult books that deal specifically with quantitative analysis (see the recommended reading list).

## Beyond descriptive analysis to inferential statistics

Using descriptive analysis, you will have described the data that you have collected and identified relationships between variables. The next stage of analysis is to test the extent to which the results from your sample can be generalised to the population from which the sample was drawn. These tests are called hypothesis tests or tests for statistical significance. The results from these tests tell you how confident you can be that the relationships observed in the sample are representative of the population.

You would use different hypothesis tests depending on the types of variables you are analysing. For example, if you were dealing with categorical explanatory and outcome variables, a chi-square test of association would be appropriate, with Phi or Cramer's V measuring the strength of any association found. See Appendix 2 for a list of statistical tests and brief notes on what they are designed to measure.

All of these tests carry with them certain assumptions about your data. For example, they might require that your data be normally distributed, independent or continuous. In terms of hypothesis testing, however, there is one assumption that is extremely important: the data need to have been drawn from a random sample. Hypothesis tests are carried out when you want to know whether something you have found is a quirk of your data set or a genuine feature of the population.

A test carried out on a non-random sample cannot speak with confidence about generalising to the population. If your sample is not random, you would be advised to spend your time carrying out a thorough descriptive analysis and trying to interpret what is happening in the sample that you have collected. If you collect your own data for your undergraduate dissertation, you are unlikely to have a truly random sample large enough to analyse with inferential statistics. For this reason, we will not go too deep into inferential statistics in this book. If you are working with a random sample (if you are analysing data that has been collected by someone else as part of a much bigger survey, for example), you may look at some of the books in the list at the end of this chapter that will introduce you to some of the more sophisticated statistical techniques. You may also need to discuss analysis options with your dissertation supervisor.

## Analysing relationships – inferential statistics and hypothesis testing

Inferential statistics involves testing hypotheses about the type of relationship that exists between variables. Part of this may involve measuring the degree of correlation.

Measuring correlation

Correlation is a statistical method for uncovering the nature and strength of relationships, if any, that exist between two or more variables. Not only will this technique tell you the kind of relationship that exists, but it also enables you to evaluate, through hypothesis testing, the statistical validity of your result based on your sample data. You can carry out calculations to assess the degree or extent to which your two variables are related. The degree of association or correlation is determined by calculating a 'correlation coefficient', usually a value between -1 and +1.

Relationships between two events, or variables X and Y, could be described in two ways:

• • We could have an association between two variables where there is some kind of influence of one variable on the other, i.e. how X influences Y and vice versa;
• • We could have a case of a causal relationship where one variable X causes change to occur in the other variable Y.

A causal link exists if changes in event X triggers an action or reaction in event Y. This is often referred to as a ‘cause and effect’ relationship. The cause is often referred to as the independent variable; the variable that is affected is known as the dependent variable.

The correlation between two events or variables X and Y can be described as:

• • None (no correlation) — where changes in X have no effect on Y and vice versa;
• • Positive (positive correlation) — where an increase in one variable results in an increase in the other variable, or a decrease in one variable results in a decrease in the other variable;
• • Negative (negative correlation) — where an increase in one variable generates a decrease in the other.

Here are two common correlation coefficients:

• Pearson's r. This measures the relationship between two continuous variables. The value ranges from -1 (a perfect negative relationship) through 0 (no relationship) to +1 (a perfect positive relationship). In order to conduct a Pearson's r test, your data need to meet certain assumptions:
• (a) The two variables need to have a normal distribution (i.e. the histogram would look like a bell-shaped curve); and
• (b) When the variables are plotted against each other in a scatterplot, there needs to be a linear relationship between them.
• Spearman's rho. This test is similar to Pearson's r, but your data do not need to meet the same assumptions. In this test, variables are ranked. A coefficient of +1 shows a perfect positive relationship. It is possible to use Spearman's rho with both continuous and categorical (ordinal) data.
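To see what these coefficients actually compute, here is a minimal Python sketch (illustrative data only) that builds Pearson's r from the covariance divided by the product of standard deviations, and Spearman's rho as Pearson's r applied to the ranks of the data.

```python
import statistics

def pearson_r(x, y):
    """Pearson's r: covariance divided by the product of the spreads of x and y."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """Rank values from 1 upwards; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1       # average of the tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of the data."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
print(round(pearson_r(x, y), 3), round(spearman_rho(x, y), 3))  # 0.853 0.821
```

Statistical packages such as IBM-SPSS perform exactly these calculations (plus the associated significance tests) behind the menu options described later in this chapter.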

Case study 9.2 Analysing relationships using socioeconomic deprivation and crime data for English towns and cities 2015

Problem definition

A researcher interested in housing and socio-economic deprivation in English towns and cities obtained the data shown in Appendix 2. The objective of the study is to establish any connection between the degree of deprivation and crime in selected towns/cities using appropriate statistical analysis.

(Data Source: Office for National Statistics licensed under the Open Government Licence.)

In order to test for statistical validity of any connection between deprivation and crime, the following hypotheses were posed:

The Null Hypothesis H₀:

There is no statistically significant relationship between level of deprivation and crime rates in English towns and cities.

The Alternative Hypothesis H₁:

There is a statistically significant relationship between level of deprivation and crime rates in English towns and cities.

Given that the data set is ranked, it is appropriate to use the Spearman's rho correlation technique to test whether or not there is a relationship between deprivation and crime, and whether any such relationship is statistically significant at the 95% level of confidence.

Tips for IBM-SPSS procedure for Spearman's rho correlation analysis:

• • Log on and enter your data into IBM-SPSS.
• • Define your variables, e.g. index of multiple deprivation rank (IMD) and crime rank figures.
• • Analyse menu.
• • Correlation statistics.
• • Correlate > bivariate.
• • Move the two variables into the variable list box, e.g. IMD Rank and Crime Rank.
• • Select the appropriate correlation method e.g. Spearman.
• • Select the required test of significance, e.g. 2-tailed.
• • Check flag significant correlations box.
• • Click OK.

Tips for IBM-SPSS procedure for scatter plots and fitting regression line

• • From graphs menu, select scatter.
• • Choose simple scatterplot and click on define button.
• • Move the Crime Rank variable into the Y-axis box.
• • Move the IMD variable into the X-axis box.
• • Click on OK.
• • Double click the graph to open the chart editor.
• • Under chart menu, select options.
• • Check fit line box.
• • Check Display R-Square in legend.
• • Check include constant in equation.
• • Click continue.
• • Click OK.

Results:

The result shows that there is a strong connection between level of deprivation and level of crime in English towns and cities (see Table 9.6 and Figure 9.6).

Table 9.6 Correlation matrix of index of multiple deprivation (IMD) and level of crime in English towns and cities

| Spearman's rho | | Index of multiple deprivation | Crime |
|---|---|---|---|
| Index of multiple deprivation | Correlation Coefficient | 1.000 | .687** |
| | Sig. (2-tailed) | . | .000 |
| | N | 109 | 109 |
| Crime | Correlation Coefficient | .687** | 1.000 |
| | Sig. (2-tailed) | .000 | . |
| | N | 109 | 109 |

** . Correlation is significant at the 0.01 level (2-tailed).

Source: IBM-SPSS Spearman Correlation, derived from Socio-economic deprivation in English towns and cities - 2015, Office for National Statistics.

Hypothesis testing offers the opportunity to establish whether there is strong enough evidence in the sample of data that you collected to infer, or make a judgement on, whether a certain condition is true or false for the entire population to which your data relate. This is based on the nature and strength of the connection between two variables (bivariate analysis). Inferential statistics require an understanding of key statistical concepts such as hypothesis formulation, correlation, cross-tabulation, bivariate analysis, confidence intervals and statistical significance.

Figure 9.6 shows the scatterplots of index of multiple deprivation and crime based on the data in our case study.

In this example, the relationship between the two variables is positive, since an increase in the value of one variable is accompanied by an increase in the other. The figure suggests a positive connection between crime rank and deprivation rank in English towns and cities.

Crime and deprivation - regression line

The line that runs through the middle of the scatter points in Figure 9.6 is known as the regression line.

Figure 9.6 Scatterplots of index of multiple deprivation and crime in English towns and cities

A relationship between two quantitative variables that can be represented by a straight line is called a linear relationship. One of the objectives of simple linear regression analysis is to help us determine the best line through the data points in a scatterplot. One common way of determining this line is through the method of least squares; the regression line is therefore also known as the least-squares line. The underlying principle of the least-squares method is that the line of best fit through the scatter is the one for which the sum of the squares of the deviations from the points to the line is at a minimum. While a correlation statistic shows how closely the data points are distributed around a straight line, regression analysis involves calculating and fitting a line of best fit through the middle of the data points.
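The least-squares calculation can be sketched directly in a few lines of Python. The data here are made up for illustration, not the case-study figures; the slope formula is the covariance-style numerator divided by the spread of x, and the intercept follows from the fact that the fitted line passes through the point of means.

```python
import statistics

def least_squares(x, y):
    """Fit y = a + b*x by minimising the sum of squared vertical deviations."""
    mx, my = statistics.mean(x), statistics.mean(y)
    # Slope: sum of products of deviations over sum of squared x-deviations
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    # Intercept: the fitted line passes through the point (mean x, mean y)
    a = my - b * mx
    return a, b

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
a, b = least_squares(x, y)
print(round(a, 3), round(b, 3))   # intercept and slope of the fitted line
```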

While the scatterplot provides a visual picture of the nature of the relationship between the two variables, Table 9.6 gives a much more detailed statistic, with a correlation coefficient of 0.687 and a significance level, p, of 0.000. This means there is a strong positive correlation between crime and level of deprivation in our case study and that the correlation is statistically significant.

In interpreting the result in relation to the stated hypotheses, we need to consider not only the correlation coefficient but also the level of significance, p. Since p < 0.01, the Null Hypothesis H₀ in our case study can be rejected, which means the Alternative Hypothesis H₁ is supported. Therefore, we can conclude that there is a statistically significant relationship between deprivation and crime in English towns and cities, and we can make that conclusion at more than the 99% level of confidence.

Crime and deprivation - linear regression model

In Figure 9.6, we can see the mathematical model or equation that defines the relationship between crime and deprivation. A linear relationship is usually represented by a linear equation.

The simple linear regression model is stated as:

Y = a + bx + e

Where Y is the dependent (or predicted) variable (in our example, crime);

a is the intercept (the point at which the regression line touches the Y-axis)

x is the independent (or predictor) variable (IMD)

b is the slope or gradient of the regression line

e is the error term (that is the stochastic disturbance or the residuals).

Therefore, the mathematical model or equation that defines the relationship between crime and deprivation in English towns and cities takes this form, with the intercept and slope estimated from the fitted line shown in Figure 9.6.

The R-Square value of 0.471 suggests that 47.1% of the variations in crime in English towns and cities can be explained by the level of multiple deprivation in those cities and towns.
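A quick check links the two statistics in this case study. Because both variables are already ranks, Pearson's r computed on them equals Spearman's rho, and for a simple linear regression R-Square is the square of the correlation coefficient. Assuming the figure's R-Square comes from the line fitted to the ranked data:

```python
rho = 0.687            # Spearman's rho reported in Table 9.6
print(round(rho ** 2, 3))   # 0.472
```

The squared coefficient, 0.472, matches the reported R-Square of 0.471 up to rounding.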

Multivariate analysis and advanced analytical techniques

In Case study 9.2, we used only one explanatory variable, the index of multiple deprivation (IMD), to explain the dependent variable (crime). You can use more than one independent variable to explain variations in crime. For example, the technique of multiple linear regression will enable you to use two or more explanatory variables in your analysis. You can learn more about multiple linear regression and other advanced statistical techniques from relevant textbooks.

To help you along, we have included a number of variables in Appendix 2 that could help you with this. For example, you may wish to explore the relationship between crime and income deprivation, employment deprivation, health deprivation, education, skills and training deprivation, barriers to housing and services, and living environment deprivation. Using advanced statistical techniques, you can determine the extent to which each of the factors listed above contributes to crime levels in English towns and cities, as well as the joint contribution of all the factors to the problem of crime.

Besides multiple linear regression, there are a number of other advanced statistical and analytical techniques used in the social sciences, such as modelling, hierarchical structure analysis, multinomial logistic techniques and factor analysis, that are beyond the level expected of undergraduate study. If you are thinking of doing postgraduate studies, it may well be that you are interested in developing your analytical skills further in these areas.

## Key messages

• • You need to understand the data that you are working with so that any calculations that you perform are valid.
• • Descriptive analysis allows you to examine the variables you have in your data set and to establish relationships between them.
• • There is no point conducting tests to establish the generalisability of your findings if you have a small sample, collected through convenience sampling.
• • If you are carrying out analysis of an existing data set, techniques of inferential statistics might well be appropriate.
• • Investigate computer packages that will help you with your analysis — investment of time in learning the software will pay dividends in terms of time spent analysing the data.

## Key questions

• • Does your university support a data analysis package and have you identified a package that can help you?
• • Do you know which variables you are working with?
• • Have you described your data using the appropriate techniques?
• • Have you used the right checks to establish relationships between your variables?
• • Does your sample comply with the assumptions necessary to carry out tests of statistical significance?
• • Have you allowed enough time to interpret your statistics?