Data Literacy Flashcards
Statistics helps us test…
…the likelihood of an event happening by random chance versus systematically.
Part of practicing good data literacy means asking…
Who is included in the data?
Who is left out?
Who made the data?
Part of an analyst’s job is to…
…provide context and clarifications to make sure that audiences are not only reading the correct numbers, but understanding what they mean.
A “causal link” means…
…demonstrating that one event directly causes another, not merely that the two are correlated.
Ethical issues regarding data collection may be divided into the following categories:
Consent: Individuals must be informed and give their consent for information to be collected.
Ownership: Anyone collecting data must be aware that individuals have ownership over their information.
Intention: Individuals must be informed about what information will be taken, how it will be stored, and how it will be used.
Privacy: Information about individuals must be kept secure. This is especially important for any and all personally identifiable information.
The difference between measuring and categorizing is so important that the data itself is termed differently:
Variables that are measured are Numerical variables
Variables that are categorized are Categorical variables.
Numerical variables
Numerical variables are a combination of the measurement and the unit. Without the unit, a numerical variable is just a number.
Categorical variables
Categorical variables describe characteristics with words or relative values.
Nominal variables
A purely nominal variable is one that simply allows you to assign categories but you cannot clearly order the categories. If the variable has a clear ordering, then that variable would be an ordinal variable, as described below.
Dichotomous variables
Dichotomous variables are nominal variables which have only two categories or levels. For example, if we were looking at gender, we would most probably categorize somebody as either “male” or “female”. This is an example of a dichotomous variable (and also a nominal variable).
Ordinal variables
An ordinal variable is similar to a nominal variable. The difference between the two is that an ordinal variable has a clear ordering of its categories.
Likert scale
A Likert scale is a rating scale used to measure opinions, attitudes, or behaviors. It consists of a statement or a question, followed by a series of five or seven answer statements. Respondents choose the option that best corresponds with how they feel about the statement or question.
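The ordinal ideas above can be sketched in Python. The five-point scale, the `LIKERT` mapping, and the responses are all made up for illustration:

```python
import statistics

# Hypothetical five-point Likert scale: the categories have a clear
# ordering, which makes this an ordinal (not purely nominal) variable.
LIKERT = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly agree": 5,
}

responses = ["Agree", "Neutral", "Strongly agree", "Disagree"]

# Map each categorical answer to its rank; the ranks carry the ordering
# that a purely nominal variable (e.g. eye color) would not have.
ranks = [LIKERT[r] for r in responses]
print(sorted(responses, key=LIKERT.get))

# The median rank is a sensible summary for ordinal data; the mean is
# questionable, because the gaps between Likert levels need not be equal.
print(statistics.median(ranks))
```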
Missing Completely at Random
This refers to data that is missing for reasons completely unrelated to any features or attributes; it is missing purely by random chance. No feature, whether observed and measured or not, affects whether the data is missing.
Missing at Random
This refers to when the data is missing at random due to the other variables in the dataset. In MAR, the data being missing is related to other features we collected.
Structurally Missing
This refers to missing data that can be explained and is not due to randomness. With this type of missing data, there is an inherent reason or structure that justifies the missing data.
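The three missingness mechanisms can be simulated in a short Python sketch. The survey fields, probabilities, and values are all hypothetical:

```python
import random

random.seed(0)

# Hypothetical survey rows: respondent age and number of children.
rows = [{"age": random.randint(18, 80),
         "children": random.randint(0, 3)} for _ in range(1000)]

for row in rows:
    # MCAR: income is dropped by a pure coin flip, unrelated to anything.
    row["income_mcar"] = None if random.random() < 0.1 else 50_000

    # MAR: older respondents skip the income question more often, so the
    # missingness depends on another feature we collected (age).
    p_missing = 0.3 if row["age"] > 60 else 0.05
    row["income_mar"] = None if random.random() < p_missing else 50_000

    # Structurally missing: "age of youngest child" cannot exist for
    # respondents with no children; the gap has an inherent explanation.
    row["youngest_child_age"] = (None if row["children"] == 0
                                 else random.randint(0, 17))

print(sum(r["income_mcar"] is None for r in rows))  # roughly 100 of 1000
```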
There are a lot of ways a dataset can have low accuracy, but they all come down to one question: are these measurements (or categorizations) correct? Answering it requires a critical evaluation of your specific dataset, but there are a few ways to think about it.
First, thinking about the data against expectations and common sense is crucial for spotting issues with accuracy. You can do this by inspecting the distribution and outliers to get clues about what the data looks like.
Second, critically considering how error could have crept in during the data collection process will help you group and evaluate the data to uncover systematic inconsistencies.
Finally, identifying ways that duplicate values could have been created goes a long way towards ensuring that reality is only represented once in your data. A useful technique is to distinguish between data that was collected by humans and data that was generated programmatically, and to use that distinction to segment the data.
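A minimal sketch of the first and third checks, using a hypothetical column of height measurements in which 999 is a likely data-entry error:

```python
import statistics
from collections import Counter

# Hypothetical heights in cm; 999 is the kind of value that comparing the
# data against common sense, the distribution, and outliers would flag.
heights = [171, 168, 175, 180, 169, 172, 999, 174, 171, 171]

mean = statistics.mean(heights)
sd = statistics.stdev(heights)

# First check: flag values more than 2 standard deviations from the mean.
outliers = [h for h in heights if abs(h - mean) > 2 * sd]
print(outliers)

# Third check: count repeated values so reality is represented only once.
# (Exact repeats of 171 may be legitimate here; flag them for review
# rather than deleting them automatically.)
dupes = {v: n for v, n in Counter(heights).items() if n > 1}
print(dupes)
```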
validity of our dataset
It’s not just typos, mistakes, missing data, poor measurement, and duplicated observations that make a dataset low quality. We also have to make sure that our data actually measures what we think it is measuring. This is the validity of our dataset.
Validity is a special kind of quality measure because it’s not just about the dataset, it’s about the relationship between the dataset and its purpose. A dataset can be valid for one question and invalid for another.
convenience sample
Convenience sampling is a non-probability sampling method where units are selected for inclusion in the sample because they are the easiest for the researcher to access. This can be due to geographical proximity, availability at a given time, or willingness to participate in the research.
bias
Statistical bias is a term used to describe statistics that do not provide an accurate representation of the population. A common cause is a sample that does not accurately represent the population it was drawn from.
population
In statistics, a population is a set of similar items or events which is of interest for some question or experiment. A statistical population can be a group of existing objects or a hypothetical and potentially infinite group of objects conceived as a generalization from experience.
sample
A sample is a subset of observations drawn from a population, ideally at random. Because we often do not know the nature of the population distribution, we use statistics computed from the sample to estimate properties of the population.
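A small simulation, with a made-up population of commute times, shows why a random sample tends to estimate the population mean well while a convenience sample can be biased:

```python
import random
import statistics

random.seed(42)

# Hypothetical population: 10,000 commute times in minutes. Suburban
# residents (the last 3,000 entries) have much longer commutes.
population = ([random.gauss(25, 5) for _ in range(7000)]
              + [random.gauss(55, 5) for _ in range(3000)])
true_mean = statistics.mean(population)

# A random sample gives every unit an equal chance of inclusion...
random_sample = random.sample(population, 500)

# ...while a convenience sample (surveying whoever is easiest to reach,
# here the first 500 city residents) systematically misses suburbanites.
convenience_sample = population[:500]

print(round(true_mean, 1))
print(round(statistics.mean(random_sample), 1))       # close to true_mean
print(round(statistics.mean(convenience_sample), 1))  # biased low
```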
normal distribution
The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, meaning data near the mean occur more frequently than data far from the mean. In graphical form, the normal distribution appears as a “bell curve”.
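The bell curve's familiar "68-95-99.7" property can be checked empirically with simulated draws:

```python
import random

random.seed(1)

# Draw 100,000 values from a normal distribution with mean 0 and
# standard deviation 1, then check how much data falls near the mean.
draws = [random.gauss(0, 1) for _ in range(100_000)]

within_1sd = sum(abs(x) <= 1 for x in draws) / len(draws)
within_2sd = sum(abs(x) <= 2 for x in draws) / len(draws)

print(round(within_1sd, 2))  # about 0.68
print(round(within_2sd, 2))  # about 0.95
```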
mean
In mathematics and statistics, the arithmetic mean, arithmetic average, or just the mean or average is the sum of a collection of numbers divided by the count of numbers in the collection. The collection is often a set of results from an experiment, an observational study, or a survey.
standard deviation
A standard deviation (or σ) is a measure of how dispersed the data is in relation to the mean. Low, or small, standard deviation indicates data are clustered tightly around the mean, and high, or large, standard deviation indicates data are more spread out.
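Both definitions can be checked with Python's statistics module, using two made-up datasets that share a mean but differ in spread:

```python
import statistics

# Two hypothetical datasets with the same mean but different spread.
tight = [48, 49, 50, 51, 52]
spread = [10, 30, 50, 70, 90]

# The arithmetic mean: sum of the values divided by their count.
assert statistics.mean(tight) == statistics.mean(spread) == 50

# Low standard deviation: values cluster tightly around the mean.
print(statistics.pstdev(tight))   # small

# High standard deviation: values are far from the mean.
print(statistics.pstdev(spread))  # large
```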