Statistical Thinking for Data Science and Analytics Flashcards

Question

See Image for Question

Answer 1

See image for answer

Answer 2

Center of variation. The center of variation is the value around which the observed values are distributed.

Answer 3

The first is the mean, which is the numerical average of the observed values. The second is the median, which is the midpoint of the ordered values.
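
As a small illustration of how the two measures can differ, the sketch below uses a made-up list of values (the numbers are hypothetical, not from the course) in which one large value pulls the mean upward while the median stays near the bulk of the data.

```python
from statistics import mean, median

# Hypothetical observed values; the last one is a deliberate outlier.
values = [30, 32, 35, 38, 40, 300]

avg = mean(values)    # numerical average: sum of the values divided by their count
mid = median(values)  # midpoint of the sorted values (average of the two middle ones here)

print(f"mean = {avg:.2f}, median = {mid}")
```

The gap between the two numbers is a quick hint that the data are skewed or contain outliers.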

Answer 4

* Summary statistics alone don't necessarily give insight into the shape of the distribution (e.g. normal or skewed).
* A visualisation such as a box plot is not only easy to read and understand, but can also show outliers in the data.
* A data plot is an image, and images make sense to the human brain more quickly than raw numbers.
* Sometimes plotting the data gives additional insight into the data itself.
* Visuals are also more compelling to people and help communicate what the data is saying.

Answer 5

Association means that certain values of one variable are observed more frequently with certain values of another variable.

Answer 6

A correlation of 0 means there is no linear association, but this does not mean there is no association. To get the whole picture, look at the scatter plot. The points could follow a 'U' shape, which is an association that the correlation does not capture.
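
The point can be checked numerically. The sketch below is a minimal example with made-up data: `pearson_r` is a helper written for this illustration, and the points lie exactly on the curve y = x², so y is completely determined by x even though the linear correlation is zero.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)

# Hypothetical data: a perfect 'U' shape, symmetric around x = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys_u = [x ** 2 for x in xs]

# For comparison, a perfectly linear relationship.
ys_line = [2 * x + 1 for x in xs]

r_u = pearson_r(xs, ys_u)
r_line = pearson_r(xs, ys_line)
print(f"U-shape r = {r_u:.3f}, line r = {r_line:.3f}")
```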

Answer 7

* Randomized Experiments
* A/B Testing
* Control Groups
* Double Blinded Studies
* Causal Inference from Observational Data

Answer 8

To derive knowledge from sample to population, we need to have a representative sample.

Answer 9

* Misleading Outcomes
* Biased Results
* Difficult to analyze results
* Wastage of time and money

Answer 10

1. Unpredictability
2. Trends

Answer 11

Probability is the proportion of a certain occurrence in the long run. Only over a large number of occurrences can you use probability to accurately describe the proportion of a given random outcome.
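
A simulation can make the "long run" idea concrete. The sketch below is an illustration only: `running_proportion` is a hypothetical helper, and the fair coin, the seed, and the checkpoints are all assumptions chosen for the example. It records the proportion of heads after increasing numbers of tosses.

```python
import random

def running_proportion(n_tosses, p_heads=0.5, seed=0):
    """Simulate coin tosses and record the proportion of heads at checkpoints."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    checkpoints = (10, 100, 1_000, 10_000, 100_000)
    heads = 0
    proportions = {}
    for i in range(1, n_tosses + 1):
        if rng.random() < p_heads:
            heads += 1
        if i in checkpoints:
            proportions[i] = heads / i
    return proportions

props = running_proportion(100_000)
for n, prop in props.items():
    print(f"after {n:>6} tosses: proportion of heads = {prop:.4f}")
```

With few tosses the proportion can be far from 0.5; it tends to settle near 0.5 only as the number of tosses grows.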

Answer 12

**Specific Addition Rule: Mutually Exclusive Events**
Only valid when the events are mutually exclusive.
P(A or B) = P(A) + P(B)

**General Addition Rule: Non-Mutually Exclusive Events**
P(A or B) = P(A) + P(B) - P(A and B)

**Specific Multiplication Rule: Independent Events**
P(A and B) = P(A) \* P(B)

**General Multiplication Rule: Conditional Probability**
P(A and B) = P(A) \* P(B|A), or equivalently P(B|A) = P(A and B) / P(A)
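
These rules can be verified by counting outcomes. The sketch below assumes a single roll of a fair six-sided die; the events A, B, and C are hypothetical choices made for this example, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

OUTCOMES = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed for illustration)

def prob(event):
    """Probability of an event (a set of outcomes) under equally likely outcomes."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

A = {2, 4, 6}  # roll is even
B = {5, 6}     # roll is greater than 4
C = {1}        # roll is 1 (mutually exclusive with A)

# Specific addition rule: A and C cannot both happen.
specific_addition_holds = prob(A | C) == prob(A) + prob(C)

# General addition rule: A and B overlap on the outcome 6.
general_addition_holds = prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Conditional probability and the general multiplication rule.
p_b_given_a = prob(A & B) / prob(A)
general_multiplication_holds = prob(A & B) == prob(A) * p_b_given_a

# Specific multiplication rule: holds here because A and B happen to be independent.
specific_multiplication_holds = prob(A & B) == prob(A) * prob(B)

print(p_b_given_a, specific_addition_holds, general_addition_holds)
```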

Answer 13

* by simulation
* by experiment
* by mathematical models

Answer 14

* Sampling size
* Data generation process
* Population distribution for the variable of interest

Answer 15

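A confidence interval is a range of values, computed from sample data, that is intended to contain a population parameter. The confidence level describes how often intervals built this way would contain the true parameter over repeated samples; it is not the probability that the parameter falls within any one computed interval. Any confidence level can be used, with 95% and 99% being the most common.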

Answer 16

In statistics, the coverage probability of a confidence interval is the proportion of the time that the interval contains the true value of interest.
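
Coverage can be estimated by simulation. The sketch below is one possible setup, not a prescribed method: it assumes a normal population with a known mean, builds an approximate 95% interval for the mean from each sample using the normal critical value 1.96 (a t critical value would be slightly wider), and counts how often the interval contains the true mean. The function name and all parameter values are hypothetical.

```python
import random
from statistics import mean, stdev

def estimate_coverage(true_mean=10.0, true_sd=2.0, n=50, reps=2000, z=1.96, seed=0):
    """Fraction of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        center = mean(sample)
        half_width = z * stdev(sample) / n ** 0.5  # z times the standard error
        if center - half_width <= true_mean <= center + half_width:
            hits += 1
    return hits / reps

estimated_coverage = estimate_coverage()
print(f"estimated coverage: {estimated_coverage:.3f}")
```

The estimate should land close to the nominal 95%, with some simulation noise and a small shortfall from using 1.96 instead of a t value.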

Answer 17

A hypothesis is a statement regarding a value of interest. It may be true or it may be incorrect, but it is a statement for which you collect evidence in order to support or reject it.

Answer 18

Because randomized controlled trials may not always be possible

Answer 19

Systematic errors such as bias and confounding.

Answer 20

A real drug-outcome pair for which there is no causal relationship.

Answer 21

1. We do not have full control or knowledge of the sampling process.
2. We do not know whether there are unmeasured confounding factors.
3. We do not know whether there are any systematic measurement errors in the observed data.

Answer 22

1. Biased estimates will lead to misrepresented statistical significance.
2. Unmeasured confounding will lead to spurious association findings.
3. Systematic measurement errors will contribute to poor reproducibility of findings.

Answer 23

It denotes conditional probability and means the probability of 'B' given that 'A' has occurred.

Answer 24

0.155 ## Footnote EXPLANATION Given this biased coin, P(HH) = 0.49, P(HT)=0.21, P(TH)=0.21, P(TT)=0.09. P(both tosses are the same) = P(HH or TT) = 0.49+0.09 = 0.58. P(TT | both tosses are the same) = P(TT)/P(both tosses are the same) = 0.09/0.58.
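
The arithmetic in this explanation can be reproduced by enumerating the four outcomes. The sketch below assumes P(H) = 0.7 for a single toss, which is the value implied by P(HH) = 0.49, and uses exact fractions.

```python
from fractions import Fraction
from itertools import product

# Single-toss probabilities; P(H) = 0.7 is inferred from P(HH) = 0.49.
p_single = {"H": Fraction(7, 10), "T": Fraction(3, 10)}

# Joint probabilities for two independent tosses: HH, HT, TH, TT.
joint = {a + b: p_single[a] * p_single[b] for a, b in product("HT", repeat=2)}

p_same = joint["HH"] + joint["TT"]       # P(both tosses are the same)
p_tt_given_same = joint["TT"] / p_same   # P(TT | both tosses are the same)

print(p_tt_given_same, round(float(p_tt_given_same), 3))
```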

Answer 25

0.265 ## Footnote EXPLANATION P(TT | fair coin) = 0.25 P(TT | biased coin) = 0.09 P(fair coin) = 0.5 P(biased coin) = 0.5 P(biased coin | TT) = 0.5\*0.09 /(0.5\*0.09+0.5\*0.25) = 0.265.
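
This is an application of Bayes' rule, and the same numbers can be checked in a few lines. The sketch below uses only the probabilities stated in the explanation; the dictionary names are chosen for this illustration.

```python
from fractions import Fraction

# Prior probability of having picked each coin.
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}

# Likelihood of observing two tails with each coin.
likelihood_tt = {"fair": Fraction(1, 4), "biased": Fraction(9, 100)}

# Total probability of TT, summed over both coins.
p_tt = sum(prior[coin] * likelihood_tt[coin] for coin in prior)

# Posterior: P(coin | TT) = P(coin) * P(TT | coin) / P(TT).
posterior = {coin: prior[coin] * likelihood_tt[coin] / p_tt for coin in prior}

print(posterior["biased"], round(float(posterior["biased"]), 3))
```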

Answer 26

The Chi square test applies to categorical data. This nonparametric test determines whether the observed counts for the categories differ from the expected counts. Look up the p-value for the Chi square statistic obtained in a statistical table in order to determine if the test reaches significance. Before using the table, calculate the degrees of freedom for the problem. For two independent variables, the degrees of freedom are the number of levels of the first variable minus one, times the number of levels of the second variable minus one. Hence, df = (r - 1) (s - 1), where r is the number of levels in the first variable and s is the number of levels in the second variable.
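
The statistic and the degrees of freedom can be computed directly from a table of counts. The sketch below is a minimal version for a test of independence, with no continuity correction; the `chi_square` helper and the 2x2 table of counts are hypothetical and exist only for this example.

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)  # df = (r - 1)(s - 1)
    return stat, df

# Hypothetical observed counts: rows are levels of one variable, columns of the other.
observed_counts = [[20, 30],
                   [30, 20]]

stat, df = chi_square(observed_counts)
print(f"chi-square = {stat:.2f}, df = {df}")
```

For df = 1, the tabled critical value at the 0.05 level is 3.841, so a statistic above that value would be judged significant at that level.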

Answer 27

False. ## Footnote EXPLANATION Only categorical variables.

Answer 28

False. ## Footnote EXPLANATION ...is divided by the marginal distribution.

Answer 29

TRUE. ## Footnote EXPLANATION It only means that the association pattern between X and Y shown in the data is unlikely to be due to chance.

Answer 30

FALSE. ## Footnote EXPLANATION The distribution of Y does not change with the value of X when Y is independent of X.

Answer 31

It's important to make a visualization of your data so that you can easily spot errors or misinformation in it.

Answer 32

FALSE. ## Footnote EXPLANATION It depends on the scale of the X variable.

Answer 33

FALSE. ## Footnote EXPLANATION Due to sampling variability and randomness in Y that is not related to X.

Answer 34

1. Descriptive 2. Predictive 3. Prescriptive

Answer 35

A document-term matrix describes the counts of words in each document. Each row is a document, and each column is a term.
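
A tiny example shows the layout. The sketch below builds a document-term matrix from two made-up documents using whitespace tokenization, which is a simplifying assumption; real pipelines usually tokenize more carefully.

```python
from collections import Counter

# Hypothetical documents, one string per document.
docs = ["the cat sat", "the dog sat on the cat"]

# Split each document into words and count them.
word_counts = [Counter(doc.split()) for doc in docs]

# The vocabulary (one column per term), in alphabetical order.
vocab = sorted({word for counts in word_counts for word in counts})

# One row per document; each entry is how often that term occurs in the document.
matrix = [[counts[term] for term in vocab] for counts in word_counts]

print(vocab)
for row in matrix:
    print(row)
```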

Answer 36

Predictive Analytics

Answer 37

Prescriptive Analytics

Answer 38

Descriptive Analytics

Answer 39

Descriptive Analytics and Predictive Analytics

Answer 40

* stopping
* stemming

**Stop words** are simple words that tend not to carry much information: usually conjunctions (and, but, or), prepositions (in, on, to), and other words such as articles (a, an, the). Secondly, words are **stemmed**, or trimmed to their roots, so that you can gather similar terms without having to worry about verb conjugations or noun declensions.
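
Both steps can be sketched in a few lines. The example below is a toy illustration only: the stop-word list is a short hypothetical one, and `naive_stem` is a crude suffix stripper written for this example, not a real stemming algorithm such as Porter's.

```python
# A short, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "but", "or", "in", "on", "to"}

# Suffixes removed by the toy stemmer, tried in this order.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of the root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, then stem the remaining words."""
    tokens = text.lower().split()
    kept = [token for token in tokens if token not in STOP_WORDS]
    return [naive_stem(token) for token in kept]

result = preprocess("The dogs walked and jumped in the parks")
print(result)
```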

Answer 41

Exploratory data analysis refers to the display of data (or, more generally, of any numerical information) in a way that allows us to discover patterns that we did not expect to see.

Answer 42

Visualization, more generally, refers to the techniques that we use to see data or to see the numerical information.

Answer 43

1. Displaying the data clearly and concisely
2. Interpreting the data
3. Modeling the data

Answer 44

COMPARISONS

Answer 45

control group

Answer 46

You need to choose the chart that communicates the data most effectively and that tells the story you are trying to show your reader most easily and intuitively, so that it doesn't take too much time before they can see the point you are trying to make with your data.

Answer 47

A good visualization summarizes information and organizes it in a way that enables the reader to focus on the points that are relevant to the key message being conveyed.

Answer 48

Dashboards are a way of organizing data such that we can see multiple charts and how they are all linked together. Typically, the dashboard is constructed of several charts in a panel. Those charts come from one data set, but that data set can change over time. The key thing about these dashboards is that the multiple visualizations are all organized together, such that they build up a story that one chart by itself is not sufficient to tell.

Answer 49

* Communicating results * Exploratory data analysis

Answer 50

The most important information on one screen observable at a glance

Answer 51

A probability is called calibrated if it's empirically correct on average.
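
One simple way to check this is to group forecasts by their stated probability and compare each group with the observed frequency of the event. The sketch below uses made-up forecasts and outcomes; `calibration_table` is a hypothetical helper for this illustration.

```python
from collections import defaultdict

# Hypothetical (predicted probability, outcome) pairs; outcome is 1 if the event happened.
forecasts = [
    (0.2, 0), (0.2, 0), (0.2, 1), (0.2, 0), (0.2, 0),
    (0.8, 1), (0.8, 1), (0.8, 0), (0.8, 1), (0.8, 1),
]

def calibration_table(pairs):
    """Map each predicted probability to the observed frequency of the event."""
    outcomes_by_prediction = defaultdict(list)
    for predicted, outcome in pairs:
        outcomes_by_prediction[predicted].append(outcome)
    return {
        predicted: sum(outcomes) / len(outcomes)
        for predicted, outcomes in outcomes_by_prediction.items()
    }

table = calibration_table(forecasts)
for predicted, observed in sorted(table.items()):
    print(f"predicted {predicted:.1f} -> observed {observed:.1f}")
```

In this made-up data the observed frequencies match the predictions, so the forecasts would count as calibrated; with real data the match is only approximate and depends on how many forecasts fall in each group.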

Answer 52

To understand customers' behavior and to use that information to predict future outcomes.

Answer 53

The combination of information from different sources.