Intro Flashcards

1
Q

Definition / equation of churn rate

A

Churn rate = cancellations / total subscribers, where total subscribers = current + new subscribers
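
Worked example of the ratio above. This is a sketch with made-up numbers; `churn_rate` is a hypothetical helper name, not something from the course material.

```python
def churn_rate(cancellations, current_subscribers, new_subscribers):
    """Cancellations divided by total subscribers (current + new)."""
    total = current_subscribers + new_subscribers
    if total == 0:
        raise ValueError("total subscribers must be greater than zero")
    return cancellations / total


# Illustrative figures only: 50 cancellations, 900 current + 100 new subscribers.
rate = churn_rate(50, 900, 100)
print(f"Churn rate: {rate:.1%}")
```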

2
Q

What are the two types of organized observations

A

Methodology and shape

3
Q

What is the most common shape for data

A

Table or spreadsheet

4
Q

Variables

A

The things we measure (columns of a table)

5
Q

Observations / entity / instance

A

Rows - Individual instances of the things we are measuring

6
Q

Numerical variables

A

Both the measurement and its unit of measurement (without a unit, a numerical variable is just a number)

7
Q

What are the two ways of getting a number

A

Counting (discrete) or measuring (continuous)

8
Q

Whole numbers are what type of variable

A

Discrete variable

9
Q

Partial values are what type of variable

A

Continuous variable

10
Q

Categorical variables

A

Characteristics described with words or relative values rather than measured quantities

11
Q

Nominal variable

A

A categorical variable whose values are names or labels with no inherent order

12
Q

Dichotomous variable

A

A categorical variable that is binary (yes / no, true / false, on / off, 1 / 0)

13
Q

Ordinal variable

A

A categorical variable whose values have a meaningful order, often a subjective rating (e.g., a ranking from 1 to 5)

14
Q

3 common messy data problems

A
  1. Typos
  2. Missing data
  3. Inconsistent coding (e.g., "three" instead of 3, or "N/A" instead of 0)
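
A minimal sketch of cleaning up inconsistent coding. The raw values, the `clean` helper, and the lookup tables are all made up for illustration. Treating "N/A" as missing rather than as 0 is an assumption that depends on the dataset.

```python
# Hypothetical raw answers to "how many pets do you have?"; values are made up.
raw = ["3", "three", "N/A", "", "2", " 3 "]

WORD_TO_NUMBER = {"zero": 0, "one": 1, "two": 2, "three": 3}
# Assumption: these markers mean "missing", not zero.
MISSING_MARKERS = {"", "n/a", "na", "none"}


def clean(value):
    """Return an int, or None when the value is missing or unrecognized."""
    text = value.strip().lower()
    if text in MISSING_MARKERS:
        return None
    if text in WORD_TO_NUMBER:
        return WORD_TO_NUMBER[text]
    try:
        return int(text)
    except ValueError:
        return None  # probably a typo; worth flagging for manual review


cleaned = [clean(v) for v in raw]
print(cleaned)
```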
15
Q

Missing completely at random

Vs

Missing at random

Vs

Structurally missing

A

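Missing completely at random: the data was simply not entered, or not entered properly, with no pattern to the gaps

Missing at random: we can predict whether a value is missing based on the value of another variable

Structurally missing: we don’t expect there to be a value to begin with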

16
Q

Accuracy

A

A measure of how well records reflect reality

17
Q

Validity

A

The data actually measures what we think it is measuring

18
Q

Various ways that a dataset can be low quality

A

Typos
Mistakes
Missing data
Poor measurement
Duplicate observations

19
Q

What are the two types of categorical variables

A

Ordinal (ordered)

Nominal (unordered)

20
Q

A distribution is

A

A function that shows all possible values of a variable and how frequently each value occurs
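
A small sketch of an empirical distribution: count how often each value occurs. The ratings are made up for illustration.

```python
from collections import Counter

# Made-up ratings on a 1-5 scale.
ratings = [3, 4, 4, 5, 2, 4, 3, 5, 4, 1]

frequency = Counter(ratings)

# Print a rough text histogram: one '#' per occurrence of each value.
for value in sorted(frequency):
    print(value, "#" * frequency[value])
```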

21
Q

Interquartile range

A

The range between the first and third quartiles of the dataset (Q3 - Q1)

The first quartile marks the point 25% of the way through the sorted data

The third quartile marks the point 75% of the way through the sorted data

Sorted data means all values arranged from smallest to largest
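
A quick sketch using Python's standard `statistics.quantiles`. The data is made up, and the "inclusive" method is one of several quartile conventions, so other tools may give slightly different cut points.

```python
import statistics

# Made-up values, already sorted from smallest to largest.
data = [1, 3, 5, 7, 9, 11, 13, 15, 17]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```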

22
Q

Bimodal distribution

A

A distribution with two peaks (modes)

23
Q

The act of aggregating data

A

Summarizing a numeric variable across each value of a categorical variable
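
A minimal sketch of aggregation: the mean of a numeric variable for each value of a categorical variable. The rows and column meanings are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows: (plan, monthly_spend). "plan" is categorical, "monthly_spend" is numeric.
rows = [
    ("basic", 10.0),
    ("pro", 30.0),
    ("basic", 14.0),
    ("pro", 26.0),
    ("free", 0.0),
]

# Collect the numeric values under each category.
groups = defaultdict(list)
for plan, spend in rows:
    groups[plan].append(spend)

# Summarize each group with its mean.
mean_spend = {plan: mean(values) for plan, values in groups.items()}
print(mean_spend)
```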

24
Q

Correlation coefficient

A

Direction: the sign, - or +

Strength: the absolute value, from 0 (no linear relationship) to 1 (perfect linear relationship)
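
A sketch that computes a correlation coefficient by hand. It assumes the Pearson coefficient is the one meant here; `pearson_r` is a hypothetical helper and the data is made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: sign gives direction, absolute value gives strength."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


hours = [1, 2, 3, 4, 5]
r_pos = pearson_r(hours, [52, 55, 61, 64, 70])  # rises with hours: positive
r_neg = pearson_r(hours, [10, 8, 6, 4, 2])      # falls with hours: negative
print(round(r_pos, 3), round(r_neg, 3))
```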

25
Q

Chart for representing change over time

A

Line graph

26
Q

Chart for comparing a part to the whole

A

Pie chart

27
Q

Chart for showing the spread of data points in one variable

A

Histogram

28
Q

A chart for comparison of two variables to understand a trend

A

Scatterplot, with or without a trendline

29
Q

Univariate charts

A

Help us visualize variation in only one variable. Often that means measuring “how much”; a common choice for counts is a bar chart

30
Q

Univariate chart type examples

A

Bar chart
Histogram
Density curve
Box plot
Univariate map

31
Q

Bi / Multivariate charts definition

A

Charts that show the relationship between two or more variables

32
Q

Multivariate chart examples

A

Scatterplot
Line chart
Bivariate map

33
Q

Is information redundancy bad?

A

Not necessarily; redundancy can help add clarity or emphasis

34
Q

Linear scale

A

The numbers on the axis count up by a consistent interval

35
Q

Logarithmic scale

A

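The numbers on the axis go up by a consistent factor (e.g., 1, 10, 100) rather than a consistent interval; common for showing exponential growth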
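
A short sketch of why a logarithmic scale suits exponential growth: the raw steps keep getting bigger, but the steps between logarithms stay constant. The counts are made up (doubling each day).

```python
import math

# Made-up counts that double every day (exponential growth).
counts = [100 * 2**day for day in range(6)]

# On a linear scale the gap between consecutive values keeps growing...
linear_steps = [b - a for a, b in zip(counts, counts[1:])]

# ...but on a log scale every gap is the same size (log10 of the growth factor).
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(counts, counts[1:])]

print(linear_steps)
print([round(step, 3) for step in log_steps])
```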

36
Q

Three common color scales

A

Sequential
Diverging
Categorical

37
Q

Descriptive analysis

A

We describe, summarize, and visualize data so that patterns can emerge

Most of the time this is the first step in the analysis process

38
Q

Descriptives / summary statistics

A

Central tendency: mean, median, mode

Spread: range, quartiles, variance, standard deviation, distribution
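
A sketch of these summary statistics using Python's standard `statistics` module. The data is made up; `variance` and `stdev` here are the sample versions (dividing by n - 1).

```python
import statistics

# Made-up data for illustration.
data = [2, 3, 3, 5, 7, 10]

summary = {
    # Central tendency
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    # Spread
    "range": max(data) - min(data),
    "variance": statistics.variance(data),  # sample variance (n - 1)
    "std_dev": statistics.stdev(data),      # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```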

39
Q

Exploratory analysis

A

Typically the next step after descriptive analysis

We look for relationships between variables in our dataset

40
Q

Clustering analysis

A

Can use Principal Component Analysis (PCA), which compresses the variables into principal components that can be plotted against each other

The groups seen in the plot are then checked with k-means clustering to confirm that the points really do cluster together
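
A minimal k-means sketch in plain Python. It skips the PCA step: the points are made up and stand in for observations already reduced to two components. `k_means` and the choice of starting centroids are illustrative assumptions, not a library API.

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up 2D points standing in for two principal components.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Assumption: start from two of the points as initial centroids.
centroids, clusters = k_means(points, [points[0], points[3]])
print(centroids)
```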

41
Q

Inferential analysis

A

We test a hypothesis on a sample of a population and then extend our conclusions to the whole population

Typically uses an A/B test

The sample must be randomly selected. A common rule of thumb (the 10% condition) is that a sample drawn without replacement should be no more than 10% of the population
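
A sketch of how an A/B test result might be checked. It assumes a two-proportion z-test, which is one common choice and not necessarily the one the course uses; the function name and conversion counts are made up.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("test is undefined when the pooled rate is 0 or 1")
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up sample: variant A converts 200 of 2000 visitors, variant B 250 of 2000.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference seen in the sample is unlikely to be chance alone, which is what lets us extend the conclusion to the population.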

42
Q

Causal analysis

A

Carefully designed experiments, usually with the following:

  • only change one variable at a time
  • carefully control all other variables
  • repeat multiple times with the same results
43
Q

Good experimental design

A

Replication

Randomization

Control
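
A sketch of the randomization step: shuffle the units, then split them into treatment and control groups. `assign_groups` is a hypothetical helper, and the fixed seed is only there to make the example repeatable.

```python
import random


def assign_groups(units, seed=None):
    """Randomly split units into treatment and control groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Made-up participant IDs 0-19.
groups = assign_groups(range(20), seed=0)
print(groups["treatment"])
print(groups["control"])
```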

44
Q

Causal inference with observational data

A

Requires:

  • advanced techniques to identify a causal effect
  • meeting very strict conditions
  • appropriate statistical tests
45
Q

Predictive analysis

A

Uses supervised machine learning techniques

It will only be as good as the training data used to train the algorithm
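
A minimal supervised-learning sketch: fit a straight line to labeled training data, then predict an unseen value. Simple least-squares regression is used as an illustrative stand-in for the broader family of techniques; the data and function names are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept on the training data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("cannot fit a line when all x values are the same")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


# Made-up training data: inputs with known (labeled) outputs.
train_x = [1, 2, 3, 4]
train_y = [3, 5, 7, 9]

model = fit_line(train_x, train_y)
prediction = predict(model, 5)  # an input the model has not seen
print(prediction)
```

If the training labels were noisy or unrepresentative, the fitted line and every prediction from it would inherit those flaws.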