LECTURE WEEK 1 Flashcards
What is statistics?
“statistics is a body of principles and methods concerned with extracting useful information from a set of data to help people make informed policy and business decisions”
What are the three V’s of big data?
The three v’s of big data are volume (quantity), velocity (speed) and variety (types).
What are the different types of statistics?
Descriptive statistics deal with methods of organising, summarising and presenting data in a convenient and informative way, one form of which uses graphical techniques.
Inferential statistics is also a set of methods, but it is used to draw conclusions or inferences about the larger set that the data came from, ie about characteristics of populations, based on sample statistics calculated from sample data.
Basic terminology
A population is the group of all items (data) of interest, and is often very large or infinite.
A sample is a set of items (data) drawn from the population of interest. Samples can be large, but they are smaller than the population.
A parameter is a descriptive measure of a population (mean, highest value, lowest value etc).
A statistic is a descriptive measure of a sample (mean, highest value, lowest value etc).
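The parameter/statistic distinction can be sketched in Python. This is a minimal illustration only: the population of 1,000 student marks, the seed and the sample size of 50 are made-up assumptions, not course data.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical population: marks (0-100) for 1,000 students.
population = [random.randint(0, 100) for _ in range(1000)]

# Parameter: a descriptive measure of the whole population.
population_mean = statistics.mean(population)

# Sample: 50 students drawn from the population without replacement.
sample = random.sample(population, 50)

# Statistic: the same measure, computed from the sample only.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two printed values will usually be close but not equal. The gap between them is the sampling error.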
What are advantages and disadvantages of a census?
Advantages of a Census
• Provides a true measure of the population (no sampling error)
• Benchmark data may be used by other studies as a frame
• Detailed information about small sub groups within the population is more likely to be available
Disadvantages of a census
• Difficult to identify, locate and survey all units
• Higher cost of running compared to sample surveys
• The population of interest may change while the census is being run; a sample survey takes less time and so can give more accurate results
Terminology II
A variable is some characteristic of a population or sample, and is usually denoted by a capital letter such as X, Y or Z (eg student marks).
The values of a variable are the possible observations of that variable, eg student marks (0-100).
Data are the observed values of a variable (eg student marks 47, 56, 74, 96).
What are different types of data?
Nominal values are arbitrary numbers that represent categories. Only calculations based on the frequencies of occurrence, such as proportions, are valid. Data may not be treated as ordinal or numerical.
• The values of nominal data are categories, eg single = 1, married = 2, divorced = 3, widowed = 4
• These data are categorical in nature, so arithmetic operations don’t make any sense (eg does widowed/2 = married?)
• Nominal data is also called qualitative or categorical
Ordinal values must represent the ranked order of the data, and calculations based on an ordering process are valid. Data may be treated as nominal but not numerical.
• Ordinal data appear to be categorical in nature, but their values are ordered, eg poor = 1, fair = 2, good = 3
• It isn’t meaningful to do arithmetic on these data, but order comparisons such as poor < good can be used
• The order matters but the number associated to each category is meaningless
• Is also called ranked
Numerical values are real numbers and all calculations are valid. Data may be treated as ordinal or nominal.
• Values of numerical data are real numbers eg height and weight
• Arithmetic operations can be performed on numerical data eg 2 x height or 6 x price
• Also called quantitative or interval
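The calculations that are valid for each data type can be sketched in Python. This is a minimal illustration: the observations below are made up, and the choice of the median as the order-based measure is an assumption for the example.

```python
import statistics
from collections import Counter

# Nominal: codes are arbitrary labels, so only counts and proportions make sense.
marital_status = ["single", "married", "married", "divorced", "single", "married"]
counts = Counter(marital_status)
proportions = {status: n / len(marital_status) for status, n in counts.items()}

# Ordinal: the codes carry order, so order-based measures such as the median are valid.
rating_codes = {"poor": 1, "fair": 2, "good": 3}
ratings = ["good", "poor", "fair", "good", "fair"]
coded = sorted(rating_codes[r] for r in ratings)
median_code = statistics.median_low(coded)  # always returns an observed code
median_rating = {code: label for label, code in rating_codes.items()}[median_code]

# Numerical: values are real numbers, so all arithmetic is valid.
heights_cm = [162.0, 175.5, 180.2, 168.3]
mean_height = statistics.mean(heights_cm)

print(proportions)
print(median_rating)
print(round(mean_height, 1))
```

Taking the mean of the rating codes would run without error, but the result would depend on the arbitrary numbers chosen for each category, which is why it is not a valid calculation for ordinal data.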
Observational and experimental data
When no data is available, a study is needed to generate data.
• Observational study is when measurements representing a variable of interest are recorded and observed, without controlling factors that might influence their values.
• Experimental study is when measurements representing a variable of interest are observed and recorded while controlling factors that might influence their values.
Data series types
Cross-sectional data are variables measured on different subjects at one point in time.
Time series data are a variable measured at regular intervals over time.
Longitudinal (panel) data are variables measured on the same subjects at multiple points in time.
What is a survey?
A survey solicits information from survey participants. The response rate (the proportion of selected participants who responded to the survey) is a key survey parameter.
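The response rate is a simple proportion. A minimal Python sketch, where the function name and the figures (130 responses from 400 selected participants) are made-up assumptions:

```python
def response_rate(responded, selected):
    """Proportion of selected participants who responded to the survey."""
    if selected <= 0:
        raise ValueError("selected must be positive")
    return responded / selected

# Hypothetical survey: 400 people selected, 130 responded.
print(f"{response_rate(130, 400):.1%}")  # prints 32.5%
```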
What are the different sampling plans and their characteristics?
Simple random sampling
• A simple random sample is a sample selected in such a way that every possible sample of the same size is equally likely to be chosen, eg drawing names from a hat containing 200 names
Stratified random sampling
• A stratified random sample is obtained by dividing the population into mutually exclusive sets (or strata), and then drawing simple random samples from each stratum eg population being divided by occupation or by age categories
Cluster sampling
• A cluster sample is a simple random sample of groups or clusters of elements (whereas a simple random sample consists of individual objects), eg to get a cluster sample of residents of Adelaide, select a number of streets using a random sampling method and include all residents in those streets
• This procedure is useful when it is difficult and costly to develop a complete list of the population members (making it difficult to develop a random sampling procedure)
• Cluster sampling may increase sampling error, because of probable similarities among cluster members
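The three sampling plans above can be sketched in Python using only the standard library. This is a rough illustration under stated assumptions: the sampling frame of 200 residents, the street and age-group fields, and the function names are all hypothetical.

```python
import random

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical sampling frame: 200 residents, each with a street and an age group.
streets = [f"street_{i}" for i in range(20)]
population = [
    {"id": i, "street": streets[i % 20], "age_group": "under_40" if i % 4 else "40_plus"}
    for i in range(200)
]

def simple_random_sample(units, n):
    """Every possible sample of size n is equally likely."""
    return random.sample(units, n)

def stratified_sample(units, key, n_per_stratum):
    """Split units into mutually exclusive strata, then draw a simple random sample from each."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, n_per_stratum))
    return sample

def cluster_sample(units, key, n_clusters):
    """Draw a simple random sample of clusters, then keep every unit in the chosen clusters."""
    clusters = sorted({unit[key] for unit in units})
    chosen = set(random.sample(clusters, n_clusters))
    return [unit for unit in units if unit[key] in chosen]

srs = simple_random_sample(population, 20)
strat = stratified_sample(population, "age_group", 10)
clus = cluster_sample(population, "street", 2)
print(len(srs), len(strat), len(clus))
```

Note that the cluster sample only needs a list of streets, not a list of every resident, which is why it is cheaper when a full population list is hard to build.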
What is sample size and error?
There are two types of errors.
Sampling error is the difference between a sample statistic and the population parameter that arises because only part of the population is observed. Increasing the sample size reduces the sampling error.
Non-sampling errors arise from errors in data acquisition, non-response and selection bias.
Increasing the sample size will not reduce the non-sampling error.
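The effect of sample size on sampling error can be checked with a small simulation. This is a minimal sketch: the population of 10,000 marks, the seed, the sample sizes and the number of repeats are all made-up assumptions.

```python
import random
import statistics

random.seed(7)  # fixed seed so the illustration is repeatable

# Hypothetical population of 10,000 marks; its mean is the parameter.
population = [random.randint(0, 100) for _ in range(10_000)]
population_mean = statistics.mean(population)

def average_sampling_error(n, repeats=200):
    """Average absolute gap between the sample mean and the population mean."""
    gaps = []
    for _ in range(repeats):
        sample = random.sample(population, n)
        gaps.append(abs(statistics.mean(sample) - population_mean))
    return statistics.mean(gaps)

for n in (10, 100, 1000):
    print(n, round(average_sampling_error(n), 2))
```

The printed gap should shrink as n grows. The simulation says nothing about non-sampling error: a biased selection method or a faulty measurement would stay biased no matter how large the sample.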