Variables and Distributions Flashcards by Aukse Kanopiene

What do we call the different aspects of the individual in the study that are measured?

These are called variables.

What are the two main types of variables?

Categorical and continuous variables.

What are categorical variables?

With categorical variables, the response can be categorised into a number of distinct groups or categories. For example, a variable on smoking status may have the categories ‘current smoker’, ‘ex-smoker’ and ‘never-smoker’.

When there are only two possible responses, e.g. yes or no, diseased or not diseased, the variable is called binary.

When there are three or more categories and there is some logical ‘order’ to the categories (e.g. never drinker, occasional drinker, regular drinker), the variable is called an ordinal categorical variable. If there is no ordering (e.g. ethnic group, sex), it is called a nominal categorical variable.

What is a binary categorical variable?

A binary categorical variable is a categorical variable for which there are only two possible responses. For example yes or no.

What is an ordinal categorical variable?

An ordinal categorical variable is a categorical variable for which there are three or more categories and there is some logical order to the categories (e.g. never drinker, occasional drinker, regular drinker).

What is a nominal categorical variable?

A nominal categorical variable is a categorical variable for which there are three or more categories but no ordering (e.g. ethnic group, sex).

What is a continuous variable?

For continuous variables, the responses are numerical and may take any value on a well-defined continuous scale, e.g. height, weight, blood pressure, pack-years of smoking. Note that it is possible to recode continuous data into a categorical variable: for example, weight may be recoded as ordered categorical data (underweight, optimal weight, pre-obese or obese) or as a binary variable (underweight or not). However, this will result in a loss of information and may restrict the statistical tests which can be carried out on the new variable, so make sure it is beneficial before you do it.
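A minimal sketch of recoding a continuous measurement into ordered and binary categories. The card gives no cut-offs, so this sketch assumes body mass index (kg/m²) with the commonly quoted adult thresholds of 18.5, 25 and 30; the function names are hypothetical.

```python
def bmi_category(bmi):
    """Recode a continuous BMI value into an ordered category.

    The thresholds (18.5, 25, 30) are an assumption for illustration;
    they are not given in the flashcards.
    """
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "optimal weight"
    if bmi < 30:
        return "pre-obese"
    return "obese"


def is_underweight(bmi):
    """Recode the same measurement into a binary variable."""
    return bmi < 18.5


bmis = [17.2, 22.4, 27.9, 31.5]  # made-up values
categories = [bmi_category(b) for b in bmis]
print(categories)
```

Note that 22.4 and 24.9 would both become "optimal weight": the difference between them is the information that recoding throws away.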

How should one go about summarising categorical variables?

To summarise categorical data (including binary data), we simply count up the number of observations in each category; these counts are called frequencies. We usually express these as proportions or percentages of the total number of individuals.

For example we may study 200 people and see that 100 (0.5 or 50%) are never smokers, 60 (0.3 or 30%) are ex smokers and 40 (0.2 or 20%) are current smokers.
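The counting described above can be sketched in a few lines. The data below are made up to match the 200-person example, and `summarise_categories` is a hypothetical helper name.

```python
from collections import Counter

# Made-up data matching the example: 200 people by smoking status.
statuses = ["never smoker"] * 100 + ["ex-smoker"] * 60 + ["current smoker"] * 40


def summarise_categories(values):
    """Return {category: (frequency, proportion, percentage)}."""
    counts = Counter(values)
    total = len(values)
    return {
        category: (count, count / total, 100 * count / total)
        for category, count in counts.items()
    }


summary = summarise_categories(statuses)
for category, (freq, prop, pct) in summary.items():
    print(f"{category}: {freq} ({prop:.1f} or {pct:.0f}%)")
```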

We can then present these numbers either in table format or graphically. However you choose to present the data, remember it should be clear and immediately evident to the reader what they are looking at. Use clear headings for columns and rows in tables, and informative legends for charts or figures. Categorical data are plotted either using a bar chart or a pie chart. A bar chart consists of a bar for each category where the lengths of the bars are proportional to the frequencies. Note the bars do not touch as the data are not continuous but fall into distinct categories. In a pie chart the area of each segment is proportional to the frequency in that category.

How should one go about summarising continuous variables?

For continuous variables, we can summarise the data graphically using a histogram or box plots, and we can also compute summary measures of data location (mean, median etc.) and spread (range, standard deviation etc.).

To produce a histogram, we first need to group the data into ranges, and then count the number of observations in each group. These counts are called the frequency distribution. Identifying the lowest and highest values helps you decide how the data should be grouped. Too few groups will mean detail is lost, but too many groups will result in hardly any observations in each group. Having formed the frequency distribution, we can plot the number in each range to get a histogram. In a histogram the bars touch each other (unlike a bar chart) to indicate that the data are continuous. Note that the area of each bar is proportional to the number of people in that range. Avoid 3D graphs, as the added depth distorts the areas being compared.
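As a rough sketch of forming the frequency distribution, the snippet below groups values into equal-width ranges and counts them. The heights, the bin width of 10 and the function name are all assumptions for illustration, and ranges with no observations are simply omitted.

```python
def frequency_distribution(values, bin_width, start=None):
    """Count observations in consecutive ranges [lower, lower + bin_width)."""
    if start is None:
        start = min(values)
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)  # which range the value falls in
        lower = start + index * bin_width      # lower edge of that range
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


heights_cm = [102, 105, 107, 110, 111, 113, 114, 118, 121, 127]  # made-up data
dist = frequency_distribution(heights_cm, bin_width=10, start=100)

# Text histogram: one '#' per observation in each range.
for lower, count in dist.items():
    print(f"{lower}-{lower + 10}: {'#' * count}")
```

Trying a smaller or larger `bin_width` on the same data shows the trade-off between losing detail and having hardly any observations per group.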

A wide range of human characteristics follow a Normal distribution. Characteristics of the Normal distribution are:

it is symmetrical and bell shaped
the tails extend to plus and minus infinity; the curve gets very close to the x-axis but never quite reaches it
95% of the data lie within 2 (or more precisely 1.96) standard deviations of the mean
when plotted with the fraction of observations on the y-axis, the area under the curve is equal to 1

Not all continuous variables follow a Normal distribution. Often data are skewed, either to the right (positively skewed, longer tail on the right), or to the left (negatively skewed, longer tail on the left).

Another way of summarising a continuous variable is to compute summary measures, namely a measure of where the data are located and a measure of the spread of the data. In other words, for data on children's heights: in general, how tall were the children, and how spread out were the measurements?

The three measures of data location are:

the mean: the sum of all the values divided by the total number of values
the mode: the most common value
the median: the middle value (50th centile) when the values are placed in ascending numerical order, i.e. the median is the 1/2(n+1)th value. If there is an even number of readings, the mean of the middle two values is taken as the median.
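The three measures can be checked with Python's standard `statistics` module. The readings below are made up; with six values the median position is 1/2(6+1) = 3.5, so the median is the mean of the 3rd and 4th values.

```python
import statistics

readings = [4, 7, 7, 9, 10, 11]  # made-up data, already in ascending order

mean = statistics.mean(readings)      # sum of values / number of values = 48 / 6
median = statistics.median(readings)  # mean of the middle two values, 7 and 9
mode = statistics.mode(readings)      # most common value

print(mean, median, mode)
```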

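Describe how you would go about making a histogram for continuous data.

Group the data into ranges, then count the number of observations in each group; these counts are called the frequency distribution. Identifying the lowest and highest values helps you decide how the data should be grouped. Then plot the number in each range, with the bars touching each other to indicate that the data are continuous.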

What are the characteristics of a normal distribution?

A wide range of human characteristics follow a Normal distribution. Characteristics of the Normal distribution are:

it is symmetrical and bell shaped
the tails extend to plus and minus infinity; the curve gets very close to the x-axis but never quite reaches it
95% of the data lie within 2 (or more precisely 1.96) standard deviations of the mean
when plotted with the fraction of observations on the y-axis, the area under the curve is equal to 1

How would we describe continuous data if the plotted histogram of said data has a longer tail on:

A). The right
B). The left

A). Positively skewed data

B). Negatively skewed data

What are the three measures of data location?

The three measures of data location are:

the mean: the sum of all the values divided by the total number of values
the mode: the most common value
the median: the middle value (50th centile) when the values are placed in ascending numerical order, i.e. the median is the 1/2(n+1)th value. If there is an even number of readings, the mean of the middle two values is taken as the median.

What can be used to measure the spread or variation of observations in continuous data?

The spread or variation in the observations can be measured using the following:

Range - the lowest to highest values
Interquartile range - the 25th to 75th centile. The 25th centile is the value below which a quarter (25%) of the observations fall when put in order, and the 75th centile is the value below which three quarters of the observations fall. In the same way as the median is calculated, the quartiles can be identified as follows:

Lower quartile (QL) = 1/4(n+1)th value
Upper quartile (QU) = 3/4(n+1)th value

Interquartile range = QU - QL
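A sketch of the (n+1) rule above. When the position is not a whole number, this sketch interpolates linearly between the two neighbouring values, which is an assumption (the cards do not say how to handle fractional positions). The helper names are hypothetical, and at least three values are assumed so that both positions fall inside the data.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional, position in sorted data.

    Fractional positions are resolved by linear interpolation between
    the neighbouring values (an assumption, see the lead-in).
    """
    lower_index = int(position)             # 1-based index at or below position
    fraction = position - lower_index
    upper_index = min(lower_index + 1, len(sorted_values))
    lower = sorted_values[lower_index - 1]
    upper = sorted_values[upper_index - 1]
    return lower + fraction * (upper - lower)


def quartiles(values):
    """Return (QL, QU, IQR) using the 1/4(n+1) and 3/4(n+1) positions."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least 3 values")
    ql = value_at_position(ordered, (n + 1) / 4)
    qu = value_at_position(ordered, 3 * (n + 1) / 4)
    return ql, qu, qu - ql


data = [2, 4, 5, 7, 8, 9, 11, 12, 14, 15, 18]  # made-up, n = 11
ql, qu, iqr = quartiles(data)  # positions 3 and 9
print(ql, qu, iqr)
```

With n = 11 the positions are whole numbers (3rd and 9th values), so no interpolation is needed for this data set.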

Standard deviation - the measure of spread used in conjunction with the mean. The standard deviation is derived from the difference between each individual reading and the mean of all the readings.

To work out the standard deviation:

1) First work out the difference between each measurement and the overall mean - these are called the deviations.
2) Square the deviations. This makes all the results positive, so that negative and positive deviations do not cancel out.
3) Add up all the squared deviations and divide by n-1, where n is the number of measurements. This produces what is known as the variance.
4) Take the square root of the variance to give the standard deviation.
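The four steps translate directly into code. This is a minimal sketch with made-up readings; the function name is hypothetical.

```python
import math


def sample_standard_deviation(values):
    n = len(values)
    mean = sum(values) / n

    # Step 1: deviations of each measurement from the overall mean.
    deviations = [v - mean for v in values]

    # Step 2: square the deviations so they are all non-negative.
    squared = [d ** 2 for d in deviations]

    # Step 3: sum the squared deviations and divide by n - 1 (the variance).
    variance = sum(squared) / (n - 1)

    # Step 4: the square root of the variance is the standard deviation.
    return math.sqrt(variance)


readings = [1, 5, 3, 5, 1]  # made-up data with mean 3
sd = sample_standard_deviation(readings)
print(sd)
```

For these readings the squared deviations sum to 16, so the variance is 16 / 4 = 4 and the standard deviation is 2.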

When the data are normally distributed the mean and standard deviation are the most appropriate measures of location and spread. The standard deviation tells us about the spread of data because if the data are normally distributed:

1) Approximately 68% (roughly two thirds) of the readings fall within 1 standard deviation either side of the mean.
2) Approximately 95% of the readings lie within 2 standard deviations of the mean.

If the data are skewed, we usually present the median and IQR as they are less influenced by very high or very low values.

What are the most appropriate measures of location and spread of continuous data if the data are skewed?

If the data are skewed, we usually present the median and IQR as they are less influenced by very high or very low values.

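What are the most appropriate measures of location and spread if the data are normally distributed?

When the data are normally distributed, the mean and standard deviation are the most appropriate measures of location and spread. If the data are normally distributed:

1) Approximately 68% of the readings fall within 1 standard deviation either side of the mean.
2) Approximately 95% of the readings lie within 2 standard deviations of the mean.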

What is the interquartile range and how is it calculated?

Interquartile range - the 25th to 75th centile. The 25th centile is the value below which a quarter (25%) of the observations fall when put in order, and the 75th centile is the value below which three quarters of the observations fall. In the same way as the median is calculated, the quartiles can be identified as follows:

Lower quartile (QL) = 1/4(n+1)th value
Upper quartile (QU) = 3/4(n+1)th value

Interquartile range = QU - QL

What is standard deviation?

Standard deviation - the measure of spread used in conjunction with the mean. The standard deviation is derived from the difference between each individual reading and the mean of all the readings.

How do you calculate the standard deviation?

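To work out the standard deviation:

1) First work out the difference between each measurement and the overall mean - these are called the deviations.
2) Square the deviations. This makes all the results positive.
3) Add up all the squared deviations and divide by n-1, where n is the number of measurements. This produces what is known as the variance.
4) Take the square root of the variance to give the standard deviation.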

What characteristic of the median and interquartile range makes them much more appropriate for describing skewed data than the mean and standard deviation?

The median and interquartile range are often described as being resistant to outliers. This property makes them much more appropriate for describing skewed data, as they better reflect where the majority of the observations lie within the distribution. This is not the case for the mean and standard deviation.

For symmetrical distributions, the ______ and _______ of the data will be the same.

For symmetrical distributions, the mean and median of the data will be the same.

For positively skewed distributions the ______ will be greater than the _______.

For positively skewed data the mean will be greater than the median.

For negatively skewed data the _______ will be greater than the _______.

For negatively skewed data the median will be greater than the mean.
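A small sketch showing how the mean is pulled towards the longer tail while the median stays put. The three data sets are made up for illustration.

```python
import statistics

symmetric = [1, 2, 3, 4, 5]
positively_skewed = [1, 2, 3, 4, 20]     # one very high value: long right tail
negatively_skewed = [1, 17, 18, 19, 20]  # one very low value: long left tail

for name, data in [
    ("symmetric", symmetric),
    ("positively skewed", positively_skewed),
    ("negatively skewed", negatively_skewed),
]:
    print(name, "mean:", statistics.mean(data), "median:", statistics.median(data))
```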

What is one advantage of a stem and leaf plot over a histogram?

One advantage of the stem and leaf plot over the histogram is that the stem and leaf plot displays not only the frequency for each interval, but also all of the individual values within that interval.

What is a box and whisker plot especially useful for?

A box and whisker plot does not show a distribution in as much detail as a stem and leaf or histogram plot does, but is especially useful for indicating whether a distribution is skewed and whether there are potential unusual observations or outliers in the data set.

What represents the median value on a box and whisker?

The median value is identified by the black line.

What does the length of the box represent on a box and whisker plot?

The length of the box on a box and whisker plot represents the interquartile range (IQR, the central 50% of observations).

What do the whiskers represent on a box and whisker plot?

The whiskers extend to the minimum and maximum values that lie within 1.5 times the length of the box from its edges. Any value outside this range is considered to be an outlier and is plotted as a circle; the number next to the circle indicates the row number of that case.
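A sketch of the numbers behind a box and whisker plot, assuming the 1.5 × box length rule described above. Quartiles are taken from the standard library's `statistics.quantiles` with `method="exclusive"`; plotting software may compute quartiles slightly differently, and the data and function name here are made up.

```python
import statistics


def box_plot_summary(values, whisker_factor=1.5):
    """Return the median, box edges, whisker ends and outliers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    iqr = q3 - q1  # length of the box

    # Values further than whisker_factor * IQR from the box are outliers.
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]

    return {
        "median": median,
        "box": (q1, q3),
        "whiskers": (inside[0], inside[-1]),  # most extreme non-outlying values
        "outliers": outliers,
    }


data = [2, 4, 5, 7, 8, 9, 11, 12, 14, 15, 40]  # made-up data with one high value
summary = box_plot_summary(data)
print(summary)
```

Here the box runs from 5 to 14, so anything above 14 + 1.5 × 9 = 27.5 is flagged, which picks out the value 40.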

Describe a stem and leaf plot.

In a stem and leaf plot, each observed value is divided into two components: leading digits (stem) and trailing digits (leaf). The "leaf" is usually the last digit of the number, and the other digits to the left of the "leaf" form the "stem". A stem and leaf plot resembles a histogram turned sideways: the stem values correspond to the intervals of a histogram, and the number of leaves on each stem corresponds to the frequency for that interval. One advantage of the stem and leaf plot over the histogram is that the stem and leaf plot displays not only the frequency for each interval, but also all of the individual values within the interval. To get back to the original values, multiply the value in the plot by the stem width.
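A minimal sketch of building a stem and leaf plot, assuming non-negative whole numbers and a stem width of 10 (so the leaf is the last digit). The ages and the function name are made up.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted leaves (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # stem width of 10 is an assumption
        plot[stem].append(leaf)
    return dict(plot)


ages = [23, 25, 31, 34, 34, 38, 42, 47, 51]  # made-up data
plot = stem_and_leaf(ages)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")

# Recover the original values: stem * stem width + leaf.
recovered = [stem * 10 + leaf for stem, leaves in plot.items() for leaf in leaves]
```

The number of leaves on each row gives the frequency for that interval, while the leaves themselves preserve the individual values.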

If you have a mean of 60 for a sample and a standard deviation of 11 what two values should 95% of values fall between if the data is normally distributed?

95% of values in this sample should fall between 38 and 82 (mean minus 2SD, mean plus 2SD).
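The arithmetic in this card can be written out as a short sketch; the function name is hypothetical. Using the more precise multiplier of 1.96 narrows the range slightly.

```python
def normal_range_95(mean, sd, multiplier=2):
    """Range expected to contain about 95% of normally distributed values."""
    return mean - multiplier * sd, mean + multiplier * sd


low, high = normal_range_95(60, 11)      # 60 - 22 and 60 + 22
precise = normal_range_95(60, 11, 1.96)  # 60 - 21.56 and 60 + 21.56
print(low, high, precise)
```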

What is meant by the term 95% confidence interval?

The 95% confidence interval is the range within which we are 95% confident that the true (population) mean lies.

What would be appropriate summary statistics to use for data following a normal distribution?

Mean and standard deviation.

What would be appropriate summary statistics for data not following a normal distribution?

Median and interquartile range.

What do the values for skewness indicate?

The value for skewness should be zero for a normal distribution. Positive values indicate a pile-up of scores on the left of the distribution (i.e. with a long right tail); negative values indicate a pile-up of scores on the right of the distribution (i.e. with a long left tail).

Positive skewness = long right tail
Negative skewness = long left tail

What do kurtosis values indicate?

The (excess) kurtosis value should be zero for a normal distribution. Positive scores indicate a pointy distribution with heavy tails (more observations in the tails than a normal distribution), and negative scores indicate a flat distribution with light tails (fewer observations in the tails).

If a distribution is pointy with many observations in the tails, what does that indicate about the kurtosis score?

It will be positive. A pointy, heavy-tailed distribution has a positive kurtosis score.

What does a negative kurtosis score indicate about a distribution?

A negative kurtosis score is indicative of a flat distribution with few observations in the tails.

What does a kurtosis score of 0 indicate about a distribution?

A kurtosis score of 0 is consistent with a normal distribution.

What do we call the number of individuals in a study?

The sample size.

For a stem and leaf plot how do we get back to the original values?

We multiply the value in the plot by the stem width.

What is one advantage of a stem and leaf plot over a histogram?

The stem and leaf plot displays not only the frequency for each interval, but also all of the individual values within that interval.

What is a box and whisker plot especially useful for?

A box and whisker plot is especially useful for indicating whether a distribution is skewed and whether there are potential unusual observations or outliers in the data set.

What does the black line on a box and whisker plot represent?

The median

What does the length of the box on a box and whisker plot indicate?

The interquartile range

What would a skewness value less than -1 or greater than 1 tell us about the distribution?

If skewness is less than -1 or greater than 1 then the distribution is highly skewed.

If skewness is between -1 and -0.5 or between 0.5 and 1, what does this tell us about the distribution?

It is moderately skewed

If skewness is between -0.5 and 0.5 what does this tell us about the distribution?

It is approximately symmetric
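The three skewness bands from these cards can be sketched as a small classifier. The cards do not say how skewness is computed, so `moment_skewness` below uses the simple moment coefficient (third central moment divided by the second central moment to the power 1.5) as an assumption; statistical packages often apply a sample-size adjustment and will give slightly different values. How values exactly on a boundary (such as 0.5 or 1) are classified is also an assumption.

```python
def moment_skewness(values):
    """Moment coefficient of skewness (an assumed definition, see lead-in)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


def classify_skewness(skewness):
    """Apply the rule-of-thumb bands from the cards."""
    size = abs(skewness)
    if size > 1:
        return "highly skewed"
    if size > 0.5:
        return "moderately skewed"
    return "approximately symmetric"


long_right_tail = [1, 2, 3, 4, 10]  # made-up data
g1 = moment_skewness(long_right_tail)
print(round(g1, 2), classify_skewness(g1))
```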
