Data Literacy Flashcards

1
Q

Statistics helps us test…

A

…whether an event is likely to have happened by random chance or to reflect a systematic effect.

2
Q

Part of practicing good data literacy means asking…

A

Who participated in the data?
Who is left out?
Who made the data?

3
Q

Part of an analyst’s job is to…

A

…provide context and clarifications to make sure that audiences are not only reading the correct numbers, but understanding what they mean.

4
Q

A “causal link” means…

A

…proving that one event causes another.

5
Q

Ethical issues regarding data collection may be divided into the following categories:

A

Consent: Individuals must be informed and give their consent for information to be collected.

Ownership: Anyone collecting data must be aware that individuals have ownership over their information.

Intention: Individuals must be informed about what information will be taken, how it will be stored, and how it will be used.

Privacy: Information about individuals must be kept secure. This is especially important for any and all personally identifiable information.

6
Q

The difference between measuring and categorizing is so important that the resulting data are named differently:

A

Variables that are measured are numerical variables.
Variables that are categorized are categorical variables.

7
Q

Numerical variables

A

Numerical variables are a combination of the measurement and the unit. Without the unit, a numerical variable is just a number.

8
Q

Categorical variables

A

Categorical variables describe characteristics with words or relative values.

9
Q

Nominal variables

A

A purely nominal variable is one that simply lets you assign categories; you cannot clearly order the categories. If the variable has a clear ordering, it is an ordinal variable, as described below.

10
Q

Dichotomous variables

A

Dichotomous variables are nominal variables which have only two categories or levels. For example, if we were looking at gender, we would most probably categorize somebody as either “male” or “female”. This is an example of a dichotomous variable (and also a nominal variable).

11
Q

Ordinal variables

A

An ordinal variable is similar to a nominal variable. The difference between the two is that an ordinal variable's categories have a clear ordering.
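
A minimal sketch of the idea in Python with pandas (hypothetical data, not from the card): an ordinal variable can be stored as an ordered categorical so the ordering is explicit.

```python
import pandas as pd

# Hypothetical ordinal variable: T-shirt sizes with a clear ordering.
sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],  # the ordering is part of the type
    ordered=True,
)
print(sizes.min(), sizes.max())  # ordering makes min/max meaningful
```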

12
Q

Likert scale

A

A Likert scale is a rating scale used to measure opinions, attitudes, or behaviors. It consists of a statement or a question, followed by a series of five or seven answer statements. Respondents choose the option that best corresponds with how they feel about the statement or question.
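
As a rough illustration with Python's standard library (the responses are invented): because Likert responses are ordinal, the median is a more defensible summary than the mean.

```python
from collections import Counter
from statistics import median

# Hypothetical answers to one five-point Likert item
# (1 = strongly disagree ... 5 = strongly agree).
responses = [5, 4, 4, 3, 5, 2, 4, 1, 3, 4]

print(Counter(responses))  # frequency of each answer option
print(median(responses))   # ordinal data, so report the median
```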

13
Q

Missing Completely at Random

A

This refers to data whose missingness is completely unrelated to any feature or attribute; the data is simply missing by random chance. No feature, whether observed/measured or not, affects whether a value is missing.

14
Q

Missing at Random

A

This refers to data whose missingness is related to the other variables in the dataset. In MAR, whether a value is missing depends on other features we collected.

15
Q

Structurally Missing

A

This refers to missing data that can be explained and is not due to randomness. With this type of missing data, there is an inherent reason or structure that justifies the missing data.
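
To make the three missingness mechanisms concrete, here is a small simulation sketch in Python with NumPy; every variable and probability below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
age = rng.integers(18, 80, n).astype(float)
income = rng.normal(50_000, 15_000, n)

# MCAR: every income value has the same 10% chance of being missing.
income_mcar = np.where(rng.random(n) < 0.10, np.nan, income)

# MAR: missingness depends on an *observed* feature (age); here, older
# respondents are more likely to skip the income question.
income_mar = np.where(rng.random(n) < (age - 18) / 200, np.nan, income)

# Structurally missing: the value is undefined for some rows by design,
# e.g. a spouse's income does not exist for unmarried respondents.
married = rng.random(n) < 0.5
spouse_income = np.where(married, rng.normal(50_000, 15_000, n), np.nan)
```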

16
Q

There are a lot of ways a dataset can have low accuracy, but it all comes down to one question: “Are these measurements (or categorizations) correct?” Identifying the issues requires a critical evaluation of your specific dataset, but there are a few ways to think about it.

A

First, thinking about the data against expectations and common sense is crucial for spotting issues with accuracy. You can do this by inspecting the distribution and outliers to get clues about what the data looks like.

Second, critically considering how error could have crept in during the data collection process will help you group and evaluate the data to uncover systematic inconsistencies.

Finally, identifying ways that duplicate values could have been created goes a long way towards ensuring that each real-world fact is represented only once in your data. A useful technique is to distinguish between data that was human-collected and data that was programmatically generated, and to use that distinction to segment the data.
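
A minimal pandas sketch of those first steps (table and values are hypothetical): inspect summary statistics for values that defy common sense, then surface candidate duplicates.

```python
import pandas as pd

# Hypothetical sensor readings; names and values are invented.
df = pd.DataFrame({
    "sensor_id": [1, 1, 2, 3, 3],
    "temp_c":    [21.4, 21.4, 19.8, 240.0, 240.0],  # 240 defies expectations
})

print(df.describe())                  # distribution and outlier clues
print(df[df.duplicated(keep=False)])  # rows that may represent reality twice
```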

17
Q

validity of our dataset

A

It’s not just typos, mistakes, missing data, poor measurement, and duplicated observations that make a dataset low quality. We also have to make sure that our data actually measures what we think it is measuring. This is the validity of our dataset.

Validity is a special kind of quality measure because it’s not just about the dataset, it’s about the relationship between the dataset and its purpose. A dataset can be valid for one question and invalid for another.

18
Q

convenience sample

A

Convenience sampling is a non-probability sampling method where units are selected for inclusion in the sample because they are the easiest for the researcher to access. This can be due to geographical proximity, availability at a given time, or willingness to participate in the research.

19
Q

bias

A

Statistical bias is a term used to describe statistics that don’t provide an accurate representation of the population. Some data is flawed because the sample of people it surveys doesn’t accurately represent the population.

20
Q

population

A

In statistics, a population is a set of similar items or events which is of interest for some question or experiment. A statistical population can be a group of existing objects or a hypothetical and potentially infinite group of objects conceived as a generalization from experience.

21
Q

sample

A

A sample is a set of observations drawn, ideally at random, from a population distribution. Often, we do not know the nature of the population distribution, so we cannot use standard formulas to generate estimates of one statistic or another; instead, we work from the sample.

22
Q

normal distribution

A

The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean: data near the mean are more frequent in occurrence than data far from the mean. In graphical form, the normal distribution appears as a “bell curve”.
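
A quick empirical check with NumPy (simulated data, purely illustrative): for a normal distribution, roughly 68% of values fall within one standard deviation of the mean and about 95% within two.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # simulated bell curve

# Data near the mean are more frequent than data far from it.
print(round(np.mean(np.abs(x) < 1), 3))  # ~0.683 within 1 standard deviation
print(round(np.mean(np.abs(x) < 2), 3))  # ~0.954 within 2 standard deviations
```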

23
Q

mean

A

In mathematics and statistics, the arithmetic mean, arithmetic average, or just the mean or average is the sum of a collection of numbers divided by the count of numbers in the collection. The collection is often a set of results from an experiment, an observational study, or a survey.

24
Q

standard deviation

A

A standard deviation (σ) is a measure of how dispersed the data is in relation to the mean. A low standard deviation indicates data clustered tightly around the mean; a high standard deviation indicates data that are more spread out.
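
For example, with Python's standard library (made-up numbers), two collections can share a mean while differing sharply in dispersion:

```python
from statistics import mean, stdev

tight = [9, 10, 10, 11]   # clustered around the mean
spread = [1, 5, 15, 19]   # same mean, far more dispersed

print(mean(tight), round(stdev(tight), 2))    # 10 0.82
print(mean(spread), round(stdev(spread), 2))  # 10 8.41
```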

25
Q

skewed distribution

A

A skewed distribution is neither symmetric nor normal because the data values trail off more sharply on one side than on the other. In business, you often find skewness in datasets that represent sizes using positive numbers (e.g., sales or assets).

26
Q

median

A

The median is the value lying at the midpoint of a frequency distribution of observed values or quantities, such that there is an equal probability of an observation falling above or below it.

27
Q

interquartile range (IQR)

A

In descriptive statistics, the interquartile range is a measure of statistical dispersion, which is the spread of the data. The IQR may also be called the midspread, middle 50%, fourth spread, or H‑spread. It is defined as the difference between the 75th and 25th percentiles of the data.

28
Q

outliers

A

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal.
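
One common convention, by no means the only one, flags points more than 1.5 × IQR beyond the quartiles. A NumPy sketch with invented data:

```python
import numpy as np

data = np.array([3, 4, 5, 5, 6, 6, 7, 8, 42])  # 42 lies an abnormal distance away

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1  # the interquartile range defined above

low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < low) | (data > high)])  # -> [42]
```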

29
Q

aggregated data

A

Aggregate data is high-level data which is acquired by combining individual-level data. For instance, the output of an industry is an aggregate of the firms’ individual outputs within that industry. Aggregate data are applied in statistics, data warehouses, and in economics.
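
The industry example, sketched in pandas (firms and outputs are hypothetical):

```python
import pandas as pd

# Individual-level data: one row per firm.
firms = pd.DataFrame({
    "industry": ["steel", "steel", "textiles", "textiles", "textiles"],
    "output":   [120, 80, 30, 45, 25],
})

# Aggregate data: each industry's output is the sum of its firms' outputs.
print(firms.groupby("industry")["output"].sum())
```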

30
Q

scatter plot

A

A graph in which the values of two variables are plotted along two axes, with the pattern of the resulting points revealing any correlation present.

31
Q

correlation coefficient

A

A number between −1 and +1 that represents the degree of linear dependence between two variables or sets of data.

32
Q

correlation coefficient tells us two things:

A

Direction: A positive coefficient means that higher values in one variable are associated with higher values in the other. A negative coefficient means higher values in one variable are associated with lower values of the other.

Strength: The farther the coefficient is from 0, the stronger the relationship and the more the points in a scatter plot look like a line.
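
Both properties can be read off NumPy's corrcoef; a small sketch with invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
noisy_up = 2 * x + np.array([0.1, -0.2, 0.0, 0.2, -0.1])  # strong positive
exact_down = -2 * x                                       # perfect negative

print(round(np.corrcoef(x, noisy_up)[0, 1], 3))    # close to +1
print(round(np.corrcoef(x, exact_down)[0, 1], 3))  # exactly -1.0
```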

33
Q

univariate charts

A

Univariate graphs plot the distribution of data from a single variable. The variable can be categorical (e.g., race, sex, political affiliation) or quantitative (e.g., age, weight, income).

34
Q

bivariate charts

A

A bivariate plot graphs the relationship between two variables that have been measured on a single sample of subjects. Such a plot permits you to see at a glance the degree and pattern of relation between the two variables.

35
Q

bivariate/multivariate map

A

A bivariate map or multivariate map is a type of thematic map that displays two or more variables on a single map by combining different sets of symbols. Each of the variables is represented using a standard thematic map technique, such as choropleth, cartogram, or proportional symbols.

36
Q

information redundancy

A

In engineering and systems theory, redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing its reliability, usually in the form of a backup or fail-safe. In data visualization, information redundancy means encoding the same information in more than one way (for example, color plus a direct label), which makes a chart more robust and more accessible.

37
Q

checklist for baseline accessibility

A

Colorblind-friendly color palettes
Large enough font size
Readable, web-accessible font type
Alt text on data visualization images online

38
Q

accessibility factors (3)

A

Readability: keep the reading level to a high school level whenever possible

Prior knowledge: define unfamiliar terms and avoid unnecessary jargon

Information overload: introduce new information with intentional pacing and organization

39
Q

break

A

A scale break is an area across an axis shown in place of a section of the axis's range. It appears across the original axis as a ragged, wavy, or straight line, depending on the desired appearance.

40
Q

linear scale

A

A linear scale is much like a number line. The key to this type of scale is that the value between two consecutive points on the line does not change no matter how high or low you are on it. For instance, on the number line, the distance between the numbers 0 and 1 is 1 unit.

41
Q

logarithmic scale

A

A logarithmic scale displays values as powers of a base, commonly 10. For example, 10 has a logarithm of 1 because 10 raised to the power of 1 is 10; 100 has a logarithm of 2 because 10 raised to the power of 2 is 100; and so on.
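
A tiny check with Python's math module: equal steps on a log scale correspond to equal ratios in the data, not equal differences.

```python
import math

for value in [10, 100, 1_000, 1_000_000]:
    # each x10 in the data is one step on a base-10 logarithmic scale
    print(value, math.log10(value))  # logs: 1.0, 2.0, 3.0, 6.0
```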

42
Q

Sequential scales

A

Sequential scales are colors in a sequence – often, this is the same hue with more and more white added to or taken away from the color. Sequential scales are used to show a variable increasing or decreasing in intensity or amount, like income, depth, or percent of population that owns a chinchilla.

43
Q

Divergent scales

A

Divergent scales are anchored by colors from opposite sides of the color wheel, a.k.a. complementary colors. A divergent scale is used to visualize data where the middle is a baseline, and either side represents a contrasting change. For example, divergent scales do a good job of showing a positive/negative swing in voting or polling, temperatures above and below freezing, or gains and losses over time.

44
Q

Categorical scales

A

Categorical scales use a variety of colors to differentiate categories without assigning a rank or order to them. In other words, “purple” doesn’t necessarily mean more than “green” – the two are just different colors. Categorical scales are for categorical data, like types of vegetables in a supermarket, or different treatments tested in a controlled study, or organizational blocks on a calendar.
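
In Matplotlib terms (one hypothetical mapping; many built-in colormaps would do), the three color-scale types just described correspond to different colormap families:

```python
import matplotlib as mpl

sequential = mpl.colormaps["Blues"]   # one hue, varying intensity
divergent = mpl.colormaps["RdBu"]     # complementary hues around a midpoint
categorical = mpl.colormaps["tab10"]  # distinct hues, no implied order

print(sequential.N, divergent.N, categorical.N)  # number of color levels
```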

45
Q

five types of data analysis

A

Descriptive analysis
Exploratory analysis
Inferential analysis
Causal analysis
Predictive analysis

46
Q

Descriptive analysis

A

Descriptive analysis lets us describe, summarize, and visualize data so that patterns can emerge. Sometimes we’ll only do a descriptive analysis, but most of the time a descriptive analysis is the first step in our analysis process.

Descriptive analyses include measures of central tendency (e.g., mean, median, mode) and spread (e.g., range, quartiles, variance, standard deviation, distribution), which are referred to as descriptives or summary statistics.
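
In pandas, most of these summary statistics come from a single call; a sketch with invented data:

```python
import pandas as pd

# Hypothetical dataset of survey respondents.
df = pd.DataFrame({
    "age": [23, 35, 31, 46, 29],
    "income": [38_000, 52_000, 47_000, 61_000, 44_000],
})

print(df.describe())       # count, mean, std, min, quartiles, max per column
print(df["age"].median())  # 31.0, the middle value
```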

47
Q

Exploratory analysis

A

Exploratory analysis is the next step after descriptive analysis. With exploratory analysis, we look for relationships between variables in our dataset.

While our exploratory analyses might uncover some fascinating patterns, we should keep in mind that exploratory analyses cannot tell us why something happened: correlation is not the same as causation.

48
Q

Inferential Analysis

A

A/B tests are a popular business tool that data scientists use to optimize websites and other online platforms. A/B tests are a type of inferential analysis. Inferential analysis lets us test a hypothesis on a sample of a population and then extend our conclusions to the whole population.

49
Q

Inferential Analysis rules

A

This is a powerful thing to be able to do! But since it’s so powerful, there are some rules about how to do it:

Sample size must be big enough compared to the total population size (10% is a good rule-of-thumb).
Our sample must be randomly selected and representative of the total population.
We can only test one hypothesis at a time.
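
A minimal sketch of the idea with SciPy (the A/B data are invented; a real test would also verify its assumptions and required sample size):

```python
from scipy import stats

# Hypothetical A/B test: 1 = visitor clicked, 0 = did not.
group_a = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]  # current design
group_b = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]  # new design

# One hypothesis at a time, on randomly assigned samples.
result = stats.ttest_ind(group_a, group_b)
print(result.pvalue)  # a small p-value is evidence the designs differ
```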

50
Q

Causal Analysis

A

We know that correlation does not mean causation. This is an important limitation in data analysis. We should be cautious to believe any studies or headlines claiming that one thing caused another without knowing their research methods. However, we often really want to know why something happened. In these cases, we turn to causal analysis. Causal analysis generally relies on carefully designed experiments, but we can sometimes also do causal analysis with observational data.

51
Q

Experiments that support causal analysis (3):

A

Only change one variable at a time

Carefully control all other variables

Are repeated multiple times with the same results

52
Q

Causal inference with observational data requires (3):

A

Advanced techniques to identify a causal effect
Meeting very strict conditions
Appropriate statistical tests

53
Q

Predictive Analysis

A

Predictive analysis uses data and supervised machine learning techniques to identify the likelihood of future outcomes.

Some popular supervised machine learning techniques include regression models, support vector machines, and deep learning convolutional neural networks. The actual algorithm used with each of these techniques is different, but each requires training data. That is, we have to provide a set of already-classified data that the algorithm can “learn” from. Once the algorithm has learned from the features of the training data, it can make predictions about new data.
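
A toy supervised-learning sketch with scikit-learn (hypothetical training data): the model “learns” from labeled examples, then predicts labels for new data.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: hours studied -> passed (1) or failed (0).
X_train = [[1], [2], [3], [4], [5], [6]]
y_train = [0, 0, 0, 1, 1, 1]

model = LogisticRegression().fit(X_train, y_train)  # learn from the labels
print(model.predict([[2.5], [5.5]]))                # classify unseen inputs
```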

54
Q

linear regression

A

In statistics, linear regression is a statistical model which estimates the linear relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression.
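
Simple linear regression fits in one NumPy call; the points below are invented.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # one explanatory variable
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # scalar response

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares fit of y = a*x + b
print(round(slope, 2), round(intercept, 2))  # ~1.95 and ~0.15
```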

55
Q

support vector machines

A

In machine learning, support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression.

56
Q

deep learning convolutional neural networks

A

In deep learning, a convolutional neural network (CNN/ConvNet) is a class of deep neural networks most commonly applied to analyzing visual imagery. Where we often picture a neural network layer as a general matrix multiplication, a ConvNet instead uses a special technique called convolution. In mathematics, convolution is an operation on two functions that produces a third function expressing how the shape of one is modified by the other.
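
A one-dimensional convolution in NumPy shows the operation itself; a CNN learns its kernel values rather than using a fixed kernel like this one.

```python
import numpy as np

signal = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
kernel = np.array([1.0, 0.0, -1.0])  # a fixed, edge-detecting kernel

# Slide the kernel across the signal; each output mixes a local neighborhood.
print(np.convolve(signal, kernel, mode="valid"))  # [ 2.  2.  0. -2. -2.]
```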

57
Q

Bias in interpreting results and drawing conclusions (3)

A

Confirmation bias

Overgeneralization bias

Reporting bias

58
Q

Confirmation bias

A

Confirmation bias is our tendency to seek out information that supports our views. Confirmation bias influences data analysis when we consciously or unconsciously interpret results in a way that supports our original hypothesis. To limit confirmation bias, clearly state hypotheses and goals before starting an analysis, and then honestly evaluate how they influenced our interpretation and reporting of results.

59
Q

Overgeneralization bias

A

Overgeneralization bias is inappropriately extending observations made with one dataset to other datasets, leading to overinterpreting results and unjustified extrapolation. To limit overgeneralization bias, be thoughtful when interpreting data, only extend results beyond the dataset used to generate them when it is justified, and only extend results to the proper population.

60
Q

Reporting bias

A

Reporting bias is the human tendency to only report or share results that affirm our beliefs or hypotheses, also known as “positive” results. Editors, publishers, and readers are also subject to reporting bias as positive results are published, read, and cited more often. To limit reporting bias, report negative results and cite others who do, too.