MODULE 1 - DESCRIPTIVE STATISTICS Flashcards
The study of statistics is often broken into what two main categories?
- descriptive statistics
- inferential statistics
inferential statistics (3)
- Frequently, it is impossible to contact every person in large populations, so a smaller group is used, called a sample.
- A researcher can draw conclusions about the larger population using the sample data.
- Focuses on using information from the sample to make conclusions about the population from which the sample was drawn.
descriptive statistics (4)
- focuses on summarizing survey data about a sample drawn from a population.
- Summary statistics include measures of central tendency such as mean, median, and mode; and dispersion such as range and standard deviation.
- Descriptive statistics cannot make conclusions based on the data.
- Rather, descriptive statistics is a way to present data in a meaningful way.
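As a rough illustration of the summary statistics named above, the following Python sketch computes them for a small sample. The `ages` list is made-up data for illustration only, and the standard deviation shown is the sample standard deviation.

```python
import statistics

# Hypothetical sample (not from the flashcards): ages of seven survey respondents.
ages = [23, 25, 25, 31, 34, 40, 46]

# Measures of central tendency
mean_age = statistics.mean(ages)      # arithmetic average
median_age = statistics.median(ages)  # middle value of the sorted data
mode_age = statistics.mode(ages)      # most frequently occurring value

# Measures of dispersion
age_range = max(ages) - min(ages)     # distance between the extremes
std_dev = statistics.stdev(ages)      # sample standard deviation

print("mean:", mean_age)
print("median:", median_age)
print("mode:", mode_age)
print("range:", age_range)
print("standard deviation:", round(std_dev, 2))
```

These numbers only summarize the seven values in the sample; by themselves they say nothing about a larger population.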
What is data?
is information, especially facts or numbers, usually collected or computed for purposes of analysis.
Common sources of data (3)
- Social networks
- Traditional Business Systems
- Internet of Things
Data analytics
is the field of analyzing data to gain insight, draw conclusions, or make decisions.
Big data
refers to very large data sets that cannot be processed by traditional methods, and is characterized by high volume, rapid velocity of collection, and variety in type and quality.
3 Types of data analytics
- Descriptive
- Predictive
- Prescriptive
Descriptive data analytics
seeks to describe data, providing insight and knowledge.
Predictive data analytics
seeks to make predictions from data
Prescriptive data analytics
seeks to make decisions (prescriptions) based on data
Data is typically represented using what?
variables
variable
is an item that can have different (“varying”) values
Variables are often considered as being of two possible types:
- quantitative variable
- categorical variable
quantitative variable
can take on a numeric value (quantitative data) that can be measured and ordered
categorical variable (qualitative variable)
can take on the value (usually a label) of one of several categories
reason for distinguishing variable types (3)
- Each type is handled differently in data analytics
- A categorical variable typically involves counting the instances of each category, often then depicted with a bar chart or pie chart.
- But a quantitative variable is commonly plotted versus another quantitative variable, often depicted with a scatter plot or line chart
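The difference in handling can be sketched in Python: a categorical variable is reduced to counts per category (the input to a bar or pie chart), while two quantitative variables are paired into points (the input to a scatter plot). All data values below are hypothetical.

```python
from collections import Counter

# Hypothetical categorical variable: each respondent's favorite fruit.
fruit_choices = ["apple", "orange", "apple", "grape", "apple", "orange"]

# Counting the instances of each category gives the bar heights or pie slices.
counts = Counter(fruit_choices)
for category, count in counts.most_common():
    print(f"{category}: {count}")

# Hypothetical quantitative variables measured on the same four people.
heights_m = [1.50, 1.62, 1.75, 1.80]
weights_kg = [50.0, 58.5, 70.2, 77.0]

# Pairing one quantitative variable with another gives scatter plot points.
points = list(zip(heights_m, weights_kg))
print(points)
```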
Two types of categorical variables are often distinguished
- Nominal
- Ordinal
Nominal variable
has no ordering, existing in name only, like apples, oranges, and grapes. (“Nominal” means “in name only”).
Ordinal Variable
has an ordering, like disagree, neutral, and agree.
Two types of quantitative variables are often distinguished
- continuous variable
- discrete variable
continuous variable
can take any of infinitely many values along a continuum within a range, typically real numbers. Continuous variables usually represent measurements, like height (in meters) or temperature (in degrees).
discrete variable (3)
- are finite within a range, typically integers.
- Discrete variables usually represent countable items, like people in a family or cars in a city.
- Generally, if “number of” can be added to the beginning, the variable is discrete, like “number of people in a family”, but not “number of height”.
Data visualization
is the display of data in a format, such as a table or chart, that seeks to achieve a goal of conveying particular information to a viewer
Considerations for data visualization
- Cardinality
- The kind of data being presented and the information to be conveyed.
Cardinality (2)
- is the number of unique elements in a dataset.
- Scatter plots, line charts, and histograms work very well for high-cardinality data.
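Since cardinality is defined as the number of unique elements, it can be computed by collapsing duplicates and counting what remains. A minimal Python sketch, with a hypothetical helper name `cardinality` and made-up data:

```python
def cardinality(values):
    """Return the number of unique elements in a collection of values."""
    return len(set(values))

# Hypothetical survey responses: few distinct values, so low cardinality.
ratings = ["agree", "neutral", "agree", "disagree", "agree"]

# Hypothetical temperature readings: every value distinct, so high cardinality
# relative to the size of the dataset.
temperatures = [20.1, 20.4, 21.0, 21.3, 22.8]

print("ratings:", cardinality(ratings))
print("temperatures:", cardinality(temperatures))
```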
Pie charts
are a good choice for low-cardinality data, and for showing the relative frequency with which unrelated categories occur.
scatter plot
can be used to identify trends.
A bar chart
is a good choice for displaying frequency or counts in low-cardinality data.
spreadsheet application
is a common computer application for organizing data like text or numbers, for using formulas to calculate a mathematical quantity using existing data as inputs, and for creating charts to visualize data.
A spreadsheet consists of? (2)
- A spreadsheet consists of cells organized into columns and rows. The column headings are letters and the row headings are numbers, but headings are not counted as cells.
- A user can enter data, like words or numbers, into each cell. The spreadsheet is a convenient way to create a table of data.
spreadsheet function
is a predefined formula that supports common tasks such as computing the average, minimum, or maximum of a group of cells.
function syntax
defines how the function is used, and specifies the function’s name and accepted arguments
Function’s arguments (3)
- are surrounded by parentheses and specify the data that the function operates on.
- Arguments may be numbers, cells, a range of cells, or a combination thereof.
- The [ ] arguments are optional.
To call a function in a spreadsheet
an equals sign (=) is followed by the function’s name and then, in parentheses, the arguments separated by commas.
range operator (:)
- defines a reference to a group of cells.
- Ex: =SUM(A1:A4, B10) calculates the sum of cells A1, A2, A3, A4, and B10.
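To make the range operator concrete, this Python sketch mimics how =SUM(A1:A4, B10) resolves its arguments. It is not how any particular spreadsheet application is implemented. The helpers `expand_range` and `sheet_sum`, the `sheet` dictionary, and the cell values are all hypothetical, and only single-letter columns are handled.

```python
import re

def expand_range(ref):
    """Expand 'A1:A4' into ['A1', 'A2', 'A3', 'A4']; a single cell expands to itself."""
    if ":" not in ref:
        return [ref]
    start, end = ref.split(":")
    # Simplifying assumption: one column letter followed by a row number.
    col_start, row_start = re.fullmatch(r"([A-Z])(\d+)", start).groups()
    col_end, row_end = re.fullmatch(r"([A-Z])(\d+)", end).groups()
    cells = []
    for col in range(ord(col_start), ord(col_end) + 1):
        for row in range(int(row_start), int(row_end) + 1):
            cells.append(f"{chr(col)}{row}")
    return cells

def sheet_sum(sheet, *refs):
    """Sum every cell named by the arguments; empty cells count as 0 in this sketch."""
    return sum(sheet.get(cell, 0) for ref in refs for cell in expand_range(ref))

# Hypothetical cell contents.
sheet = {"A1": 1, "A2": 2, "A3": 3, "A4": 4, "B10": 10}

total = sheet_sum(sheet, "A1:A4", "B10")
print(expand_range("A1:A4"))
print(total)
```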
The two primary methods of inferential statistics
confidence intervals, and hypothesis testing
Confidence Intervals
specify the range within which a parameter falls with a given probability
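As one possible illustration, the sketch below estimates a confidence interval for a population mean from a sample, using a normal approximation. The function name `mean_confidence_interval` and the sample values are hypothetical. For small samples a t-distribution is commonly used instead, so treat this as a simplified example and not the only method.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Return (low, high) bounds for the population mean, normal approximation."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error: sample standard deviation divided by sqrt(n).
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Critical value of the standard normal for the chosen confidence level.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Hypothetical sample of ten measurements.
sample = [4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0, 5.4, 4.6]

low, high = mean_confidence_interval(sample)
print(f"95% interval for the mean: ({low:.2f}, {high:.2f})")
```

Raising the `confidence` argument widens the interval, and a larger sample narrows it because the standard error shrinks.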
hypothesis testing
allows differences between population parameters to be compared.
Surveys
are conducted to allow statisticians to make generalizations about a population.
population
is any collection of objects, people, or things about which statistical inferences are made.
parameter of a population
is a numerical characteristic of a population, such as mean, median, or standard deviation.