statistics notes 2020 march 30 Flashcards

1
Q

Data types

A

Categorical and numerical

2
Q

types of Categorical data

A

Nominal, Ordinal

Nominal:

Named data which can be separated into discrete categories which do not overlap.

Ordinal:

the variables have natural, ordered categories, and the distances between the categories are not known.

3
Q

types of numerical data

A

Discrete, continuous

4
Q

Ordinal data

A

a categorical, statistical data type

the variables have natural, ordered categories, and the distances between the categories are not known.

data which is placed into order or scale (no standardised value for the difference)

(easy to remember because ordinal sounds like order).

e.g.: rating happiness on a scale of 1-10. (no standardised value for the difference from one score to the next)

5
Q

Nominal Data mytutor.co.uk

A

Named data which can be separated into discrete categories which do not overlap.

(e.g., gender: male and female; eye colour and hair colour)

An easy way to remember this type of data is that nominal sounds like named,

nominal = named.

6
Q

Ordinal Data

mytutor.co.uk

A

Ordinal data:

placed into some kind of order or scale. (ordinal sounds like order).

e.g.:

rating happiness on a scale of 1-10. (In scale data there is no standardised value for the difference from one score to the next)

positions in a race (1st, 2nd, 3rd etc). (the runners are placed in order of who completed the race in the fastest time to the slowest time, but no standardised difference in time between the scores).

Interval data:

comes in the form of a numerical value where the difference between points is standardised and meaningful.

7
Q

Interval Data

mytutor.co.uk

A

Interval data:

comes in the form of a numerical value where the difference between points is standardised and meaningful.

e.g.: temperature; the difference in temperature between 10 and 20 degrees is the same as the difference in temperature between 20 and 30 degrees.

can be negative

(ratio data can NOT)

8
Q

Ratio Data

mytutor.co.uk

A

Ratio data:

much like interval data – numerical values where the difference between points is standardised and meaningful.

it must have a true zero >> not possible to have negative values in ratio data.

e.g.: height, be that in centimetres, metres, inches or feet. It is not possible to have a negative height.

(compare this to temperature: it is possible for the temperature to be -10 degrees, but nothing can be -10 inches tall)

9
Q

inferential statistics

A

Population: an entire group of items, such as people, animals, transactions, or purchases >> Descriptive statistics applied if all values in the dataset are known.

>> when it is not possible or feasible to analyse the entire population >>

Sample: a selected subset, called a sample, is extracted from the population.

The selection of the sample data from the population is random >> Inferential statistics applied >> develop models to extrapolate from the sample data to draw inferences about the entire population (while accounting for the influence of randomness)

10
Q

Quantitative analysis can be split into two major branches of statistics:

A

Descriptive statistics (if all values in the dataset are known)

Inferential statistics (extrapolates from the sample data to draw inferences about the entire population)

11
Q

inferential

A

drawing conclusions from evidence; deductive (Hungarian: következtetési)

12
Q

Descriptive statistical analysis

A

As a critical distinction from inferential statistics, descriptive statistical analysis applies to scenarios where all values in the dataset are known.

13
Q

Confidence, confidence level

A

Confidence is a measure to express how closely the sample results match the true value of the population.

Confidence level: 0% - 100%

95%: if we repeat the experiment numerous times (under the same conditions), the results will match that of the full population in 95% of all possible cases.

14
Q

Hypothesis Testing

A

Hypothesis test:

evaluate two mutually exclusive statements to determine which statement is correct given the data presented.

incomplete dataset >> hypothesis testing is applied in inferential statistics to determine if there’s reasonable evidence from the sample data to infer that a particular condition holds true for the population.

15
Q

null hypothesis

A

A hypothesis that the researcher attempts or wishes to “nullify.”

For example, most of the world once believed all swans were white, and that black swans did not exist in nature. The null hypothesis was that all swans are white.

The term “null” does not mean “invalid,” nor is it associated with the value zero.

16
Q

In hypothesis testing, the null hypothesis (H0)

A

In hypothesis testing, the null hypothesis (H0) is assumed to be the commonly accepted fact, while simultaneously remaining open to contrary arguments.

If there is substantial evidence to the contrary >> the null hypothesis is disproved or rejected >> the alternative hypothesis is accepted to explain a given phenomenon.

17
Q

The alternative hypothesis

A

The alternative hypothesis is expressed as Ha or H1.

Covers all possible outcomes excluding the null hypothesis.

18
Q

What is the relationship between the null hypothesis and alternative hypothesis?

A

null hypothesis and alternative hypothesis are mutually exclusive,

which means no result should satisfy both hypotheses.

19
Q

a hypothesis statement must be

A

a hypothesis statement must be clear and simple. Hypotheses are also most effective when based on existing knowledge, intuition, or prior research.

Hypothesis statements are seldom chosen at random. a good hypothesis statement should be testable through an experiment, controlled test or observation.

(Designing an effective hypothesis test that reliably assesses your assumptions is complicated and even when implemented correctly can lead to unintended consequences.)

20
Q

A clear hypothesis

A

A clear hypothesis tests only one relationship and avoids conjunctions such as “and,” “nor” and “or.”

A good hypothesis should include an “if” and “then” statement

(such as: If [I study statistics] then [my employment opportunities increase])

21
Q

The good hypothesis sentence structure

A

The first half of this sentence structure generally contains the independent variable (the “if” part of the hypothesis, e.g., “if I study statistics”);

the second half contains the dependent variable (what you’re attempting to predict, e.g., “employment opportunities”).

22
Q

A dependent variable represents

A

A dependent variable represents what you’re attempting to predict,

i.e., the second half of the hypothesis sentence.

23
Q

The independent variable is

A

The independent variable (in the first half of the sentence) is the variable that supposedly impacts the outcome of the dependent variable (which forms the second half of the hypothesis sentence).

24
Q

double-blind

A

a study design in which neither the participants nor the experimental team knows who is allocated to the experimental group and who to the control group.

25
probability
probability expresses the **likelihood** of something **happening**, in **percentage** or **decimal** **form**; typically written as a number with a decimal value, called a **floating**-**point** **number**.
26
odds
odds define the **likelihood** of an **event** **occurring** **with** **respect** **to** the **number** of **occasions** it does **not** **occur**. For instance, the odds of selecting an **ace** of **spades** from a standard deck of **52** cards is **1 against 51**. On 51 occasions a card other than the ace of spades will be selected from the deck.
27
correlation
Correlation is often computed during the **exploratory** **stage** of **analysis** to understand **general** **relationships** **between** **variables**. Correlation **describes** the **tendency** of **change** in **one** **variable** to **reflect** a change **in** **another** **variable**.
28
confounding variable
the observed correlation could be caused by a third and **previously** **unconsidered** variable, aka **lurking** variable or **confounding** variable. It’s important to consider variables that fall outside your hypothesis test as you prepare your research and before publishing your results.
29
to confuse, to perplex (Hungarian: zavarba hoz)
confound
30
the curse of dimensionality
**Confusing** **correlation** and **causation** arises when you analyze **too** **many** **variables** while looking for a match. (In statistics, dimensions can also be referred to as variables; if we are analyzing **three** **variables**, the **results** **fall** into a **three**-**dimensional** **space**.) You can find instances of the “curse” or phenomenon using **Google** **Correlate** (www.google.com/trends/correlate). The curse of dimensionality tends to **affect** **machine** **learning** and **data** **mining** **analysis** more than traditional hypothesis testing due to the **high** **number** of **variables** **under** **consideration**. e.g.: it turns out that the **Bang** **energy** **drink** came onto the market at a similar time as **Alibaba** **Cloud’s** international product offering and then grew at a similar pace in terms of Google search volume. (Vocabulary: curse, Hungarian: átok)
31
Data
A **term** for **any** **value** that describes the **characteristics** and **attributes** of **an** **item** that can be **moved**, **processed**, and **analyzed**. The item could be a transaction, a person, an event, a result, a change in the weather, and infinite other possibilities. **Data** can **contain** **various** **sorts** of **information**, and through statistical analysis, these **recorded** **values** can be better **understood** and **used** to **support** or **debunk** a research **hypothesis**.
32
 Population
The **parent** **group** **from** **which** the experiment’s **data** is **collected**, e.g., all registered users of an online shopping platform or all investors of cryptocurrency.
33
Sample
A **subset** of a **population** **collected** **for** the **purpose** of an **experiment**, e.g., 10% of all registered users of an online shopping platform or 5% of all investors of cryptocurrency.  A sample is often used in statistical experiments for **practical** **reasons**, as it might be **impossible** or prohibitively **expensive** to directly **analyze** the **full** **population**.
34
Variable
A **characteristic** of an **item** **from** the **population** that **varies** **in** **quantity** or **quality** **from** **another** **item**, e.g., the Category of a product sold on Amazon. A **variable** that varies in regards to quantity and takes on **numeric** **values** is known as a **quantitative** **variable**, e.g., the Price of a product. A **variable** that varies in **quality**/**class** is called a **qualitative** **variable**, e.g., the Product Name of an item sold on Amazon. This process is often referred to as **classification**, as it involves **assigning** a **class** **to** a **variable**.
35
Variable types (what is the term for the process to establish types?)
**quantitative** **variable** (varies in regards to quantity and **takes** on **numeric** **values**), **qualitative** **variable** (varies in quality/class), **classification**
36
Discrete Variable
A **variable** that can only **accept** a **finite** **number** of **values**, e.g., **customers** purchasing a product on **Amazon.com** can **rate** the product as 1, 2, 3, 4, or 5 stars. In other words, the product has five distinct rating possibilities, and the reviewer cannot submit their own rating value of 2.5 or 0.0009. Helpful tip: **qualitative** **variables** are **discrete**, e.g. **name** or **category** of a product.
37
Continuous Variable
A **variable** that can assume an **infinite** **number** of **values**, e.g., depending on supply and demand, gold can be converted into unlimited possible values expressed in U.S. dollars. A **continuous** **variable** **can** also **assume** **values** **arbitrarily** **close** together. e.g.: **price** and reviews (**number** **of** **reviews** on a product) are continuous variables
38
Categorical Variables
A **variable** whose **possible** **values** consist of a **discrete** **set** of **categories** (such as **gender** or political allegiance), **rather** **than** **numbers** quantifying values on a continuous scale.
39
Ordinal Variables
(a subcategory of **categorical** **variables**), ordinal variables **categorize** **values** in a **logical** and **meaningful** **sequence**. Ordinal variables contain an **intrinsic** **ordering** or **sequence** such as {small; medium; large} or {dissatisfied; neutral; satisfied; very satisfied}. The **distance** of separation between ordinal variables does **not** **need** to be **consistent** or **quantified**. (For example, the measurable gap in performance **between** a **gold** and **silver** **medalist** in athletics need not mirror the difference in performance between a silver and bronze medalist.) In contrast, **standard** **categorical** **variables** (i.e., **gender** or film genre) have no such intrinsic ordering.
40
Independent and Dependent Variables
An **independent** **variable** (expressed as **X**) is the variable that supposedly **impacts** the **dependent** **variable** (expressed as **y**). For example, the **supply** of **oil** (independent variable) impacts the **cost** of **fuel** (dependent variable). As the **dependent** **variable** is “**dependent**” **on** the **independent** **variable**, it is generally the **independent** **variable** that is **tested** in experiments. **As** the **value** of the **independent** variable **changes**, the **effect** **on** the **dependent** variable is **observed** and **recorded**.  In analyzing Amazon products, we could examine Category, Reviews and 2-Day Delivery as the independent variables and observe how changes in those variables affect the dependent variable of Price. Equally, we could select the Reviews variable as the dependent variable and examine Price, 2-Day Delivery and Category as the independent variables and observe how these variables influence the number of customer reviews.
41
What determines whether a variable is “independent” or “dependent”?
The labels of “independent” and “dependent” are hence **determined** **by** **experiment** **design** **rather** than **inherent** **composition** (one variable could be a dependent variable in one study and an independent variable in another)
42
two events are considered independent if ...
In **probability**, **two** **events** are considered **independent** if the **occurrence** of **one** **event** does **not** **influence** the **outcome** of **another** **event** (the outcome of one event, such as flipping a coin, doesn’t predict the outcome of another. If you flip a coin twice, the outcome of the first flip has no bearing on the outcome of the second flip)
43
P(E|F)
The **probability** of **E** **given** **F**: the **probability** of **one** **event** (**E**) **given** the **occurrence** of **another**, **conditional** event (**F**), **expressed** as **P(E|F)**.
44
two events are said to be independent if ..
Conversely, **two** **events** are said to be **independent** if **P(E|F)** = **P(E)**. This equation holds that the **probability** of **E** is the **same** **irrespective** of **F** **being** **present**. This expression can also be tweaked to **compare** **two** **sets** of **results** where the **conditional** event (**F**) is **absent** **from** the **second** **trial**.
45
Bayes' theorem in nutshell
The **premise** of this **theory** is **to** **find** the **probability** of an **event**, **based** **on** **prior** **knowledge** of **conditions** potentially **related** **to** the **event**. Bayes' theorem "is to the theory of probability what the **Pythagorean** **theorem** is **to** **geometry**.”  For instance, if **reading** **books** is **related** to a person’s **income** **level**, then, using Bayes’ theory, we can assess the **probability** **that** a **person** **enjoys** reading **books** **based** **on** prior knowledge of their **income** **level**. In the case of the **2012** **U.S. election**, **Nate** **Silver** drew from **voter** **polls** as prior knowledge to refine his **predictions** of **which** **candidate** would **win** in each state. Using this method, he **was** **able** to successfully **predict** the outcome of the presidential **election vote** in **all** **50** states.
46
Triboluminescence
“**Triboluminescence** is the **light** **emitted** when **crystals** are **crushed**… When you take a **lump** of **sugar** and crush it with a pair of **pliers** in the dark, you can see a **bluish** **flash**. Some other crystals do that too.” (Vocabulary: lump, Hungarian: csomó; pliers, Hungarian: fogó)
47
Bayes' theorem formula
**P(A|B) = P(A) \* P(B|A) / P(B)**

**P(A|B)** is the **probability** of **A** **given** that **B** happens (**conditional** probability). **P(A)** is the **probability of A** **without** any **regard** to whether event **B** has **occurred** (**marginal** probability). **P(B|A)** is the **probability** of **B** **given** that **A happens** (**conditional** probability). **P(B)** is the **probability** of **B** **without** any **regard** to **whether** event **A** has **occurred** (marginal probability). Bayes’ theorem can be written in multiple formats, including the use of **∩** (**intersection**) instead of P(B|A). [https://www.dropbox.com/s/p8io6kx4d4d0vne/Bayes%20theorem%20formula.png?dl=0](https://www.dropbox.com/s/p8io6kx4d4d0vne/Bayes%20theorem%20formula.png?dl=0)
48
conditional probability (and what is opposite?)
Both **P(A|B)** and **P(B|A)** are the **conditional** **probability** of observing **one** **event** **given** the **occurrence** **of** **the** **other**. Both **P(A)** and **P(B)** are **marginal** **probabilities**, which is the **probability** of a **variable** **without** **reference** to the values of **other** variables.
49
Let’s imagine a particular **drug test** is **99%** **accurate** at detecting a subject as a drug user. Suppose now that **5%** of the **population** has **consumed** a banned drug. How can Bayes’ theorem be applied to **determine** the **probability** that an **individual**, who has been **selected** at **random** from the population is a **drug user** if they **test** **positive**?
We need to designate the A and B events. **P(A)**: probability of being a **real** **drug** **user** >> 0.05 (implies probability of a non-user: 1 − 0.05 = 0.95). **P(B)**: probability of a **positive** **test** **result** (two elements: correctly identified real users + falsely identified non-users). **P(A|B)**: this is the question: the probability that an individual with a **positive** **test** is a **real** **drug** **user** (different from 0.99 because the test also produces false-positive results for non-users; the test does not catch all positives either, but that is not important now). **P(B|A)**: probability of a **positive** **test** result given that the individual is a drug user >> 0.99. **P(B)**: the probability of a **positive** **test** **result** = **0.059**, from: 1. correctly identified real users: 0.05 \* 0.99 = 0.0495; 2. falsely identified non-users: (1 − 0.05) \* 0.01 = 0.95 \* 0.01 = 0.0095; **0.059** = 0.0495 + 0.0095 (from 1. + 2.). **P(A|B) = P(A) \* P(B|A) / P(B)** >> 0.05 \* 0.99 / 0.059 = 0.8389. P(user|positive test) = P(user) \* P(positive test|user) / P(positive test). [Bayes theorem example 1](https://www.dropbox.com/s/yqmlvu2tca7ljo1/Bayes%20theorem%20example%201%20.png?dl=0)
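A minimal Python sketch of this calculation (variable names such as `p_user` and `sensitivity` are chosen here for illustration; the numbers mirror the card above):

```python
# Bayes' theorem for the drug-test example:
# P(user | positive) = P(user) * P(positive | user) / P(positive)

p_user = 0.05               # P(A): prior probability of being a drug user
sensitivity = 0.99          # P(B|A): probability of a positive test for a user
false_positive_rate = 0.01  # probability of a positive test for a non-user

# P(B): total probability of a positive test (true positives + false positives)
p_positive = p_user * sensitivity + (1 - p_user) * false_positive_rate

p_user_given_positive = p_user * sensitivity / p_positive
print(round(p_positive, 3))             # 0.059
print(round(p_user_given_positive, 4))  # 0.839 (about 83.9%)
```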
50
What is the **implication** of the **false** **positive** **test** results? How to deal with it?
Using Bayes’ theorem, we’re able to determine that (in the current example) there’s an 83.9% probability that an individual with a positive test result is an actual drug user. The reason this **prediction** is **lower** for the general population **than** the **successful** **detection** **rate** **of** actual **drug** **users** or P (positive test | user), **which** was **99%**, is due to the **occurrence** of **false**-**positive** **results**.
51
Bayes’ theorem weakness
important to acknowledge that Bayes’ theorem can be a **weak** **predictor** **in** **the** **case** **of** **poor** **data** regarding prior knowledge and this **should** be **taken** **into** **consideration**.
52
Binomial Probability
used for **interpreting** **scenarios** **with** **two** possible **outcomes**. (**Pregnancy** and **drug** **tests** both produce binomial outcomes in the form of **negative** and **positive** **results**, and so too **flipping** a two-sided **coin**.) The **probability** of **success** in a binomial experiment is expressed as **p**, and the **number** of **trials** is referred to as **n**.
53
drawing aggregated conclusions from multiple binomial experiments such as flipping consecutive heads using a fair coin?
you would need to **calculate** the **likelihood** of **multiple** **independent** **events** **happening**, which is the product (**multiplication**) of **their** **individual** **probabilities**
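A short sketch of this product rule, assuming a fair coin with p = 0.5:

```python
# Probability of multiple independent events = product of individual probabilities
p_heads = 0.5                 # fair coin
p_three_heads = p_heads ** 3  # three consecutive heads
print(p_three_heads)          # 0.125, i.e. a 12.5% likelihood
```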
54
Permutations
A tool to **assess** the **likelihood** of an **outcome**. While **not** a **direct** **metric** of **probability**, permutations can be **calculated** to **understand** the **total** number of **possible** **outcomes**, which can be **used** for **defining** **odds**. The **full** **number** of **permutations** refers to the **maximum** **number** of **possible** **outcomes** **from** **arranging** **multiple** **items**.
55
find the full number of seating combinations for a table of three
we can apply the function **three**-**factorial**, which entails **multiplying** the **total** **number** of **items** by **each** discrete **value** **below** that number, i.e., 3 x 2 x 1 = 6.
56
Four-factorial is
Four-factorial is 4 x 3 x 2 x 1 = 24
57
you want to know the **full** **number** of **combinations** for **randomly** **picking** a **box** **trifecta**, which is a scenario where you **select** **three** **horses** to **fill** the **first** **three** **finishers** in **any order**.
An example of using **permutations** is horse betting: we’re **calculating** the **total** **number** of **permutations** and also a **subset** of **desired** **possibilities** (recording a **1st** place, a **2nd** place, and a **3rd** **place** **finish**). With a 20-horse field, the **total** number of **combinations** of where each horse can finish is calculated as **twenty**-**factorial**. We next **need** to **divide** **twenty**-**factorial** **by** **seventeen**-**factorial** to ascertain **all** **possible** **combinations** of a **top**-**three** placing. **Twenty**-**factorial** / **seventeen**-**factorial** = 20 \* 19 \* 18 = **6,840**. Thus, there are 6,840 possible combinations among a 20-horse field that will offer you a box trifecta.
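A quick check of these factorial calculations with Python’s standard library (`math.perm` requires Python 3.8+):

```python
import math

# Seating arrangements for a table of three: 3! = 6
print(math.factorial(3))                         # 6

# Box trifecta in a 20-horse field: 20! / 17! = 20 * 19 * 18
print(math.factorial(20) // math.factorial(17))  # 6840
print(math.perm(20, 3))                          # 6840 (same result)
```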
58
CENTRAL TENDENCY
the **central** **point** of a **given** **dataset**, aka central tendency measures. the three primary measures of central tendency are the **mean**, **mode**, and **median**.
59
The Mean
The **arithmetic** **mean** (the **sum** **divided** by the **sample** **number**), the **midpoint** **of** a **dataset**, is the **average** **of** a **set** of **values** and the **easiest** **central** **tendency** **measure** to understand: the sum of all numeric values divided by the number of observations.
60
trimmed mean
the **mean** can be **highly** **sensitive** **to** **outliers**. (statisticians sometimes use the trimmed mean, which is the mean obtained after **removing** **extreme** **values** at **both** the **high** and **low** **band** of the dataset, such as **removing** the **bottom** and **top** **2%** of **salary** **earners** in a national income survey).
61
The Median
the **median** **pinpoints** the **data** **point(s)** **located** in the **middle** **of** the **dataset** to suggest a **viable** **midpoint**. The median, therefore, occurs at the position in which exactly **half** **of** **the data values** are **above** and **half** **are** **below when arranged in ascending** or **descending** **order**. The solution for an **even** **number** **of data points** is to **calculate** the **average** of the **two** **middle** **points**
62
The Median or mean is better?
The **mean** and **median** **sometimes** produce **similar** **results**, but, in general, the **median** is a **better** measure of **central** **tendency** than the mean **for** **data** that is **asymmetrical** as it is **less** **susceptible** to **outliers** and **anomalies**. The **median** is a **more** **reliable** **metric** for **skewed** (**asymmetric**) **data**
63
The Mode
A statistical technique to **measure** **central** **tendency**. The mode is the **data** **point** in the dataset that **occurs** **most** **frequently**.
64
discrete categorical values
a **variable** that can **only** **accept** a **finite number** of **values**
65
ordinal values
the **categorization** of **values** in a **clear** **sequence** (such as a 1 to 5-**star** **rating** system on **Amazon**)
66
Why The Mode is advantageous?
**easy** to **locate** in **datasets** with a low number of discrete **categorical** **values** (a variable that can only accept a finite number of values) or **ordinal** **values** (the categorization of values in a clear sequence)
67
Why can the Mode be disadvantageous?
The **effectiveness** of the mode can be **arbitrary** and **depends** heavily **on** the **composition** of **the** **data**. The mode, **for** **instance**, can be a **poor** **predictor** for **datasets** that do not have a **single** **high** **number** of **common** **discrete** **outcomes** (e.g., **all** star **values** occur with about the **same** frequency).
68
Weighted Mean
A statistical measure of central tendency that factors in the **weight** of **each** **data** point **to** **analyze** the **mean**. Used when you want to **emphasize** **a** **particular** **segment** of **data** **without** **disregarding** the **rest** of the dataset. e.g.: students’ grades, with the **final** **exam** accounting for **70%** **of** the **total** **grade**.
69
What is a suitable measure of central tendency?
**depends** on the **composition** of the **data**. The **mode**: **easy** **to** **locate** in datasets **with** a **low** **number** of **discrete** **values** or **ordinal** **values**, The **mean** and **median**: suitable for datasets that contain **continuous** **variables**. The **weighted** **mean**: used when you want to **emphasize** a **particular** **segment** of **data** **without** **disregarding the rest** of the dataset.
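A small sketch of these measures with Python’s `statistics` module; the dataset and grade weights below are made up for illustration:

```python
import statistics

data = [1, 2, 2, 3, 4, 5, 5, 5, 90]  # 90 is an outlier

print(statistics.mean(data))    # 13.0 (dragged upward by the outlier)
print(statistics.median(data))  # 4 (less susceptible to the outlier)
print(statistics.mode(data))    # 5 (the most frequent value)

# Weighted mean, e.g. a final exam worth 70% of the total grade
grades, weights = [80, 90], [0.3, 0.7]
print(sum(g * w for g, w in zip(grades, weights)) / sum(weights))  # 87.0
```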
70
MEASURES OF SPREAD
Describes **how** **data** **varies**. The **composition** of **two** **datasets** **can be** very **different** **despite** each dataset having the **same** **mean**. The critical point of difference is the **range** of the **datasets**, which is a simple **measurement** **of** **data** **variance**.
71
range of the datasets
As the **difference** **between** the **highest value** (**maximum**) and the **lowest** value (**minimum**), the range is **calculated** by **subtracting** the **minimum** from the **maximum**. **knowing** the **range** **for** the **dataset** can be **useful** **for** data screening and **identifying errors**. An **extreme** minimum or maximum **value**, for example, might indicate a **data** **entry** **error**, such as the inclusion of a measurement in **meters** in the same column as other measurements expressed in **kilometers**.
72
Standard Deviation
**describes** the **extent** to which **individual** **observations** **differ** **from** the **mean**. the standard deviation is a **measure** **of the spread** or **dispersion** **among** **data points** just **as** **important** **as** **central** **tendency** measures for **understanding** the **underlying shape of the data**.
73
How Standard deviation measures variability ?
Standard deviation **measures** **variability** by **calculating** the **average** **squared** **distance** of **all** **data** **observations** **from** the **mean** of the dataset, and then taking its square root.
74
Standard Deviation what low/high SD values mean?
the **lower** the **standard** **deviation**, the **less** **variation** **in** the **data** When **SD** is a **lower** **number** (**relative** **to** the **mean** of the dataset) \>\> it indicates that most of the **data** **values** are **clustered** closely **together**, whereas a **higher** **value** **indicates** a **higher** **level** of **variation** and **spread**. a low or high standard deviation value **depends** **on** the **dataset** (depends on the mean, on the range and even on the variability of the values in the dataset ) [SD -1.png](https://www.dropbox.com/s/98k3bh0xafh7pl7/SD%20-1.png?dl=0)
75
How to Calculate Standard Deviation ?
[SD-2](https://www.dropbox.com/s/aigdxxosvbs5xlz/SD-2.png?dl=0)
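The linked image isn’t reproduced here, but a minimal sketch of the calculation using Python’s `statistics` module (with an illustrative dataset):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population standard deviation: average squared distance from the mean,
# divided by n, then square-rooted.
print(statistics.pstdev(data))  # 2.0

# Sample standard deviation: same, but divided by n - 1 (Bessel's correction).
print(statistics.stdev(data))   # ~2.138
```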
76
histogram
A visual technique for **interpreting data variance** by **plotting** the **dataset’s distribution of values**.
77
what is standard normal distribution?
A **normal** **distribution** with a **mean of 0** and a **standard deviation of 1**
78
What histogram shape does a normal distribution produce?
Data distributed symmetrically >> a bell curve: the symmetrical bell curve of a standard normal model. [bell curve -1.png](https://www.dropbox.com/s/dr7c4sak2u0fv8g/bell%20curve%20-1.png?dl=0)
79
Normal distribution can be transformed to a standard normal distribution by ..
converting the original values to **standardized** **scores**
80
normal distribution features:
- The **highest** **point** of the dataset occurs at the **mean** (**x̄**).
- The **curve** is **symmetrical** around an imaginary **line** that lies **at** **the** **mean**.
- **At** its **outermost ends,** the **curves** **approach** but **never** quite **touch** or **cross** the **horizontal** **axis**.
- The **locations** at which the curves transition **from** **upward** **to** **downward** cupping (known as **inflection** **points**) occur **one standard deviation above** and **below** the **mean**.

[bell curve -1.png](https://www.dropbox.com/s/dr7c4sak2u0fv8g/bell%20curve%20-1.png?dl=0)
81
How do variables behave in the real world?

The **symmetrical** shape of **normal** **distribution** is often a **reasonable** description (body **height**, **IQ** tests; **variable** **values** **generally** **gravitate** **towards** a **symmetrical** **shape** **around** the **mean** as **more** **cases** are **added**).
82
Empirical Rule
The rule describing how, in a normal distribution, values cluster within one, two, and three standard deviations of the mean (see the next card for the percentages).
83
How the **Empirical Rule** describes normal distribution ?
Approximately **68% of values** fall **within** **one standard** **deviation** of the **mean**. Approximately **95% of values** fall **within two standard deviations** of the **mean**. Approximately **99.7%** **of values** fall within **three standard deviations** of the mean. Aka the **68 95 99.7 Rule** or the **Three Sigma Rule**
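A quick verification of these percentages with the standard library’s `statistics.NormalDist` (Python 3.8+):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # standard normal distribution

for k in (1, 2, 3):
    within = std_normal.cdf(k) - std_normal.cdf(-k)  # area within k SDs
    print(k, round(within, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```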
84
What the French mathematician Abraham de Moivre discovered?
Following an **empirical experiment** flipping a two-sided coin, de Moivre discovered that **an increase in events** (coin **flips**) gradually **leads** **to** a **symmetrical curve** of **binomial distribution**.
85
What is Binomial distribution?
It **describes** a **statistical** **scenario** in which only **one** of **two** **mutually exclusive outcome**s of a trial is possible, i.e., a head or a tail, true or false.
86
Total possible outcomes for the number of heads when flipping four standard coins (flipping experiment with 4 coins)
The **histogram** has **five possible outcomes** (0 to 4 heads), and the probability of most individual outcomes is now lower. The **more data** >> the **histogram** contorts into a **symmetrical** **bell**-**shape**. As **more data** is **collected** >> **more observations** settle **in** **the middle** of the **bell curve**, and a **smaller** **proportion** of observations land **on the left and right tails** of the curve. The histogram eventually produces approximately **68%** of values **within one standard deviation of the mean**. Using the histogram, we can pinpoint the probability of a given outcome, such as **two heads (37.5%)**, and whether that **outcome** is **common** or **uncommon** **compared** **to other results**, a potentially **useful** piece of **information** **for** gamblers and other **prediction** scenarios. It's also interesting to note that the **mean**, **median**, and **mode** all occur at the **same** **point** **on** **the curve**, as this location is both the **symmetrical center** and the **most common point**. However, **not all frequency curves produce a normal distribution**. [symm bell shape in binom distrib.png](https://www.dropbox.com/s/i0zinc64xo14x4x/symm%20bell%20shape%20in%20binom%20distrib.png?dl=0)
87
MEASURES OF POSITION
**On** a **normal curve** there’s a **decreasing** **likelihood** of **replicating a result** the **further** that observed data point is **from** the **mean**. We can also assess whether that data point is approximately **one** (**68**%), **two** (**95**%) or **three** **standard** **deviations** (**99.7**%) **from** the **mean**. This, however, **doesn’t** **tell** us the exact **probability** of **replicating** the **result**, which is what **we** **want** to **identify**.
88
How to identify the probability of replicating a result?
Depending on the size of the dataset: the **Z-Score** (for larger samples) or the **T-Score** (for smaller samples).
89
Z-Score
**finds** the **distance** **from** the sample’s **mean** **to** an individual **data** **point** expressed **in units** of **stand**ard **deviation**. [z-score.png](https://www.dropbox.com/s/lzhc7ppn87lyb6p/z-score.png?dl=0)
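In place of the linked image, a minimal sketch of the formula (the example numbers are made up):

```python
def z_score(x, mean, sd):
    """Distance of a data point from the mean, in units of standard deviation."""
    return (x - mean) / sd

# e.g., a data point of 31 in a sample with mean 22 and standard deviation 5.7
print(round(z_score(31, 22, 5.7), 2))  # 1.58
```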
90
A Z-Score of 2.96 means ...
the **data point** is **located** **2.96 stand**ard **dev**iations **from** the **mean** in the **pos**itive **direction**. This data point could also be considered an **anomaly** as it is **close to three deviations** from the mean and **different** from other data points.
91
A Z-Score of -0.42 means ...
the **data point** is positioned **0.42 stand**ard **dev**iations from the **mean** in the **negative** **direct**ion, (this data point is **lower** **than** the **mean**)
92
anomaly
**If** the **Z-Score** **falls** **three** positive or negative **deviations** **from** the **mean** (in the case of a normal distribution) >> anomaly >> data points that lie an **abnormal distance** from other data points >> **a rare event** that is **abnormal** and perhaps **should not have occurred**. In the case of a normal distribution, if the Z-Score falls three positive or negative deviations from the mean of the dataset, it **falls beyond 99.7%** of the **other** **data points** on the normal distribution curve. Anomalies are sometimes viewed as a **negative exception**, such as **fraudulent behavior** or an **environmental crisis**. They **help** to **identify** **data** **entry** **errors** and are commonly used in **fraud** **detection** to **identify** **illegal** **activities**.
93
Outliers no unified agreement on how to define outliers, but:
**Data points** that **diverge** from **primary data patterns**; they record **unusual scores** on at least one variable and are **more plentiful than anomalies**.
94
Z-Score applies to..
to a **normally distributed sample** with a **known stand**ard **dev**iation of the population.
95
When to use T-Score?
Sometimes the **data** is**n’t** **normally** **distributed**, or the **standard** **deviation** of the population is **unknown** or **not** **reliable** << which could be **due** to **insufficient** **sampling** (**small** **sample** **size**).
96
What is the problem with small datasets?
The standard deviation of small datasets is susceptible to change as more observations are included
97
T-Score: who discovered it, when, and what else is it called?
The **English** statistician W. S. **Gosset**, working in Dublin, published it in the **early** **20th** century **under** the pen **name** "**Student**" >> hence it is sometimes called "**Student's T-distribution**."
98
What distributions do the Z-Score and T-Score use?
Z-distribution / T-distribution (Student's T-distribution)
99
What is the primary function of the Z-Score and T-Score?
They share the same primary function (measuring position within a distribution); they’re used with different sizes of sample data.
100
What is Z-distribution?
standard normal distribution
101
What Z-Score measures?
the **deviation** of an individual **data** **point** **from** the **mean** for **datasets** with **30** or more **observations** based on **Z-distribut**ion (**stand**ard **norm**al **distr**ibution). [Z and T distribution graph.png](https://www.dropbox.com/s/jloya1tooes6w3q/Z%20and%20T%20distribution%20graph.png?dl=0)
102
T-distribution features
the T-distribution is **not** **one** fixed bell **curve** rather its distribution curve **changes** (**multiple shapes**) **in** **accordance** with the **size** of the **sample**. - if the **sample size is small,** (e.g. 10): \>\> the **curve** is relatively **flat** with a **high proportion** of data points in the curve’s **tails**. - as the **sample size increas**es \>\> the **distrib**ution **curve** **approaches** the **stand**ard **norm**al **curve** (**Z-distribution**) with **more** data **points** **closer** to the **mean** at the **center** of the curve. [Z and T distribution graph.png](https://www.dropbox.com/s/jloya1tooes6w3q/Z%20and%20T%20distribution%20graph.png?dl=0)
103
A standard normal curve is defined by...
By the **68 95 99.7 rule**, which **sets** approximate **confidence levels for one, two**, and **three** **standard** **deviations** **from** a **mean** of **0**. Based on this rule, **95%** of **data points** will **fall** within **1.96** **standard** **deviations** of the **mean**.
104
if the sample’s mean = 100 and we randomly select an observation from the sample (in case of standard normal curve)..
the **probability** of that **data point falli**ng **within 1.96** **stand**ard **dev**iations of 100 is 0.95 or **95%**. **To** **find** the **exact variation** of that data point **from** the **mean** we can use the **Z-Score**
105
In the case of smaller datasets we need to.. what is the problem?
they don’t follow a normal curve—we instead need to use the **T-Score**.
106
T-Score
The formula is **similar** to that of the **Z-Score**, **except** the **standard** **deviation** is **divided** by the **square** **root** of the **sample size**. Also, the **standard** **deviation** is that **of the sample** in question, which **may** or **may not reflect** that of the **population** (when more observations are added to the dataset). [T-score.png](https://www.dropbox.com/s/lv0m2p4tlp9qsqz/T-score.png?dl=0)
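A minimal sketch of the formula as described on this card (the example values are made up):

```python
import math

def t_score(sample_mean, population_mean, sample_sd, n):
    """Like the Z-Score, but the sample SD is divided by sqrt(n)."""
    return (sample_mean - population_mean) / (sample_sd / math.sqrt(n))

# e.g., a sample of n=10 with mean 22 and SD 5, tested against mu = 20
print(round(t_score(22, 20, 5, 10), 3))  # 1.265
```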
107
You’ll want to use the t score formula when ..
when you don’t know the population standard deviation and you have a small sample (under 30).
108
T-score formula
[T-score formula.png](https://www.dropbox.com/s/ngqgby38h2vewsj/T-score%20formula.png?dl=0)
109
When to use T-score formula ?
You’ll want to use the t score formula when you don’t know the population standard deviation and you have a small sample (under 30).
110
What is the T Score in essence?
A t score is **one form** of a **standardized** **test** statistic (the other you’ll come across in elementary statistics is the z-score). The **t score formula** **enables** you to **take an individual score** and **transform** it into a **standardized** **form** \> one which **helps** you **to** **compare** scores.
111
Z-score tells you:
z score tells you how many standard deviations from the mean your score is
112
very good website \>\> work out here
https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/z-score/
113
Z score = 0: what is the meaning?
Your observation is right in the middle of the distribution (at the mean).
114
Z score = 1: what is the meaning?
Your observation is 1 SD away from the mean (above if +1, below if -1).
115
Z-score summary
[Z-score summary.png](https://www.dropbox.com/s/ozx6gzy94t9oato/Z-score%20summary.png?dl=0)
116
The Law of Large Numbers
If we take a sample of (**n**) **observations** of our **random** **variable** and **average** the observations (**mean**), as n grows the average will **approach** the **expected** **value** **E**(**x**) of the random variable.
117
What is a typical sample size that would allow for usage of the central limit theorem?
In **practice**, "**n = 30**" is usually what distinguishes a "large" sample from a "small" one. In other words, if your sample has a size of at least 30 you can say it is **approximately** **Normal** (and, hence, **use** the **Normal** **distribution**). If, on the other hand, your sample has a size **less** **than** **30**, it's best to use the **t-distribution instead**.
118
Do we average large number of samples when applying Central limit theorem?
We are **not** **averaging** a large number of samples; **rather**, we are **obtaining** the **averages** **from** **many** **repeated** **samples**. The **distribution** of the **sample** **averages** is the **Normal** **distribution** we obtained. It **does** **not** **represent** the **original** **distribution** **well**. But it's **not** **supposed** to do so! This Normal distribution is the **distribution** **of** **the** **sample** **mean**. Its use is to let us talk about the **probability** **of** the **sample** **mean** **being** **in** a **given** **interval**, to **better** **understand** the **population** **mean**, and so forth.
119
How can we use the Central Limit Theorem?
We can **get** **info** **about** **a** **population** **without** **taking** a **large** **number** of **samples**, by getting the **averages** **from many repeated** smaller **samples** >> their **distribution** will be **normal** (**around** the **mean**) >> this **normal** **distribution** **is** the **distribution** **of** **the** **sample** **mean** >> the **population** **mean** can be estimated >> we can **determine** the **probability** of the **sample** **mean** **being** in a **given interval** (and maybe more that I still don't get).
120
Central Limit Theorem
**If** we **take** the **means** of **samples** (**n**) and **plot** the **frequencies** **of** **their** **means** >> we get a **normal** **distribution**! As the **sample** **size** (**n**) **increases** -> approaches **infinity** -> the result approaches a **normal** **distribution**. (**Calculate** the **mean** of a **few random samples** (e.g., **n=4**) from the whole population > this gives a value (the **sample mean**) > **repeat** **several times** with the **same sample size** (4-4-4 samples) > **plot** **their** **means** on a **frequency** **distribution** > if you do it many times > the **distribution** of the **sample** **means** will **follow** a **normal** **distribution**.) If the **sample** **size** is **low** (e.g., **n=4**) >> the **curve** will be **wide** and **flat**; as the **sample size** **increases** (e.g., n >> 4) > the **curve** will be **higher** and **tighter** **around** the **mean**. [Central Limit Theorem .png](https://www.dropbox.com/s/9mdioirrjozjubv/Central%20Limit%20Theorem%20.png?dl=0)
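A simulation sketch of this idea, using a deliberately non-normal (exponential) population; all the numbers here are illustrative:

```python
import random
import statistics

random.seed(42)

def sample_mean(n):
    # Mean of n draws from a skewed, non-normal population (exponential, E(x) = 1)
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

means_n4 = [sample_mean(4) for _ in range(10_000)]
means_n50 = [sample_mean(50) for _ in range(10_000)]

# Both distributions of sample means center on the population mean (1.0),
# but the larger sample size yields a tighter, taller curve.
print(round(statistics.mean(means_n4), 2), round(statistics.stdev(means_n4), 2))
print(round(statistics.mean(means_n50), 2), round(statistics.stdev(means_n50), 2))
```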
121
what's the difference between an average and mean?
The word '**average**' is a bit more **ambiguous**. **Average** **can** legitimately **mean** almost **any** **measure** of **central tendency**: **mean**, **median**, **mode**, **typical value**, etc. However, even "**mean**" admits some **ambiguity**, as there are **different** **types** of means. The one you are probably **most** **familiar** with it the **arithmetic** **mean**, although there is also a **geometric** **mean** and a **harmonic** **mean**.
122
Skew and Kurtosis of the Normal Distribution
[Skew and Kurtosis of the Normal Distribution .png](https://www.dropbox.com/s/as1spl78k8p0b5g/Skew%20and%20Kurtosis%20of%20the%20%20Normal%20Distribution%20.png?dl=0)
123
opposite of a fractional number
integer
124
The Standard Error of the Mean
The Standard Error of the Mean = the Standard Deviation of the Mean = the 'standard deviation' of the 'sampling distribution' of the 'sample mean' -> all the same. [the Standard Error of the Mean.png](https://www.dropbox.com/s/59jhiv2r53zsofe/the%20Standard%20Error%20of%20the%20Mean.png?dl=0)
125
what are 'mu' (μ) and 'x bar' (x̄)?
The **whole** **population** can be characterized by a **mean** **μ** (mu), but it is impossible to measure (everybody), so we take several samples from the whole population and calculate the **sample** **mean**s **x̄** (x with a bar over it). According to the **Central Limit Theorem**, the **means** of the **taken** **samples** will follow a **Normal** **distribution** **even** **if** the **distribution** is **not** **normal** **in** the **population**.
126
what is sigma squared?
population variance
127
what is sigma ?
population SD
128
what is 's' squared?
sample variance
129
what is 's' ?
sample SD (the **square** **root** of the sample **variance**, computed with **n−1** in the denominator), but **square** **rooting** is **non**-**linear** >> it introduces **slight** **errors**; **still** the **best** **we** **have**. [sample standard deviation.png](https://www.dropbox.com/s/3ttv5e460dpzf71/sample%20standard%20deviation.png?dl=0)
130
sample standard deviation
sample SD (the **square** **rooted** sample **variance**, computed with **n−1**), but **square** **rooting** is **non**-**linear** >> it introduces **slight** **errors**; still the best we have. [sample standard deviation.png](https://www.dropbox.com/s/3ttv5e460dpzf71/sample%20standard%20deviation.png?dl=0)
131
Variance
The **squared** **standard** **deviation**; the **square root** of the **variance** gives -> the **standard** **deviation**. Variance: the **differences** between sample **values** and the **mean** are **squared** -> **summed** **up** -> **divided** by the sample number (**n**, in the case of population variance) or (**n−1**, sample variance). **Population** **variance**: **σ²** (sigma squared); **sample** **variance**: **s²**. [Variance.png](https://www.dropbox.com/s/fcakp656hqcolpg/Variance.png?dl=0)
132
difference between one-tailed test and 2 tailed test
**one-tailed test** considers **one** **direction** of results (**left** or **right**) **from** the **null** **hypoth**esis, whereas a **two-tailed test** considers **both** **directions** (**left** and **right**). the **objective** of the **hypothesis** **test** is not to **challenge** the null hypothesis in one particular direction but to **consider** **both** **directions** **as** **evidence** **of** an **altern**ative **hypoth**esis. there are **two rejection zones**, known as the **critical** **areas**. **Results** that **fall** **within** either of the two **critical** **areas** **trigger** **rejection** **of** the **null hypoth**esis and thereby **validate** the **alternati**ve **hypoth**esis. [1 tailed test-1.png](https://www.dropbox.com/s/hrcrkeb2ndsg4d4/1%20tailed%20test-1.png?dl=0) [2 tailed test-1.png](https://www.dropbox.com/s/hlhp9cfd6gevx8e/Variance%20summary.png?dl=0)
133
Type I Error in hypothesis testing
the **rejection** of a null hypothesis (**H0**) that was **true** and **should** **not** **have** **been** **reject**ed. This means that although the **data** appears to **support** that **a relationship** is responsible, the **covariance** of the **variables** is **occurring** entirely **by** **chance**. (this does **not** **prove** that a **relation**ship does**n’t** **exist**, merely that it’s **not** the most **likely** **cause**) **covariance**: a measurement of **how related** the **variance** is **between** **two** **variables** This is commonly referred to as a **false-positive**.
134
Type II Error in hypothesis testing
**accepting** a **null** hypothesis (**H0**) **that** **should’ve** **been** **rejected** because the **covariance** of **variables** was probably **not** **due** to **chance**. This is also known as a **false-negative**. **covariance**: a measurement of how related the variance is between two variables
135
pregnancy test example for type I type II errors
We **need** to **establish** an **H0** that can be **challenged** **experimentally**. We can **test** **for** **pregnancy** -> if the test shows pregnancy -> we **can** **reject** **H0** stating that the **woman** is **not** **pregnant** >> so the **null** hypothesis (**H0**): the **woman** is **not** **pregnant**. **H0** is **rejected** **if** the woman is **pregnant** (H0 is false), and **H0** is **accepted** **if** the woman is **not** **pregnant** (**H0** is **true**). The **test** may **not** be **100%** accurate >> mistakes may occur. If **H0** is **rejected** (**false-positive** test) but the woman is not actually pregnant (H0 is true), this leads to a **Type I Error**. If **H0** is **accepted** (the **test** **fails** to **show** **pregnancy**, a **false** **negative**) but the woman is **pregnant** (**H0 is false**) -> this leads to a **Type II Error** (**we** do **not** **reject** **H0** when we should **accept** **H1**).
136
example for hypothesis testing (my take, not sure)
We change something -> is it causing an effect or not? Let's detect events to see. H0: no effect; H1: does have an effect -> we can reject H0 if we detect events which would be highly unlikely by chance (e.g., three SD away from the random distribution mean). (This is my idea, but we'll see.)
137
What is Covariance?
a **measure** **of** the **variance** **between** **two** **variables**. covariance is a **measure** **of** the **relationship** **between** two **random** **variables**. a **measurement** of **how** related the **variance** is **between** two variables The metric evaluates how much – to **what** **extent** – the **variables** **change** **together**. However, the metric does **not** **assess** the **dependency** between variables. [Covariance summed](https://www.dropbox.com/s/rum9ewu7sc6wrff/Covariance%20summed.png?dl=0)
138
covariance is measured..
covariance is **measured** **in units**. The units are computed by **multiplying** the **units** **of** the two **variables**. The variance can take any **positive** or **negative** **values**. The values are interpreted as follows: **Positive** **covariance**: Indicates that **two** **variables** tend to **move** in the **same** **direction**. **Negative** **covariance**: Reveals that two **variables** tend to **move** in **inverse** **directions**. [Covariance summed](https://www.dropbox.com/s/rum9ewu7sc6wrff/Covariance%20summed.png?dl=0)
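A hand-rolled sketch of population covariance to make the sign interpretation concrete (the data is made up):

```python
def covariance(xs, ys):
    """Population covariance: average product of deviations from each mean."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]     # moves in the same direction as x
print(covariance(x, y))  # 4.0 -> positive covariance
```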
139
covariance concept is used..
**In finance**, the concept is primarily used in **portfolio** **theory**. One of its most common applications in portfolio theory is the **diversification** **method**, using the **covariance** **between** **assets** **in** a **portfolio**. By **choosing** **assets** that do **not** **exhibit** a high **positive** **covariance** with each other, the **unsystematic** **risk** can be **partially** **eliminated** [Covariance summed](https://www.dropbox.com/s/rum9ewu7sc6wrff/Covariance%20summed.png?dl=0)
140
the covariance between two random variables X and Y can be calculated using the following formula (for population):
Cov(X, Y) = Σ (Xᵢ − μX)(Yᵢ − μY) / n, where μX and μY are the population means of X and Y. [Covariance summed](https://www.dropbox.com/s/rum9ewu7sc6wrff/Covariance%20summed.png?dl=0)
141
Covariance measures what? what are the limitations of covariance?
Covariance measures the **total** **variation** of **two** **random** **variables** **from** their **expected** **values**. Using covariance, we can **only** **gauge** the **direction** of the **relationship** (whether the variables tend to move in tandem or show an inverse relationship) it does **not** **indicate** the **strength** of the relationship, **nor** the **dependency** between the variables. [Covariance summed](https://www.dropbox.com/s/rum9ewu7sc6wrff/Covariance%20summed.png?dl=0)
142
Correlation measures
**Correlation** measures the **strength** of the **relationship** **between** **variables**. Correlation is the **scaled** **measure** of **covariance**. It is **dimensionless**. In other words, the **correlation** **coefficient** is always a **pure** **value** and **not** measured in **any** **units**. **correlation**: **covariance** **divided** by **stand**ard **dev**iation of **both** X and Y variables [Covariance summed](https://www.dropbox.com/s/rum9ewu7sc6wrff/Covariance%20summed.png?dl=0)
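A sketch of correlation as scaled covariance using the standard library (`statistics.covariance` and `statistics.correlation` require Python 3.10+):

```python
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Correlation = covariance divided by the SDs of both variables (dimensionless)
cov = statistics.covariance(x, y)                          # sample covariance
corr = cov / (statistics.stdev(x) * statistics.stdev(y))
print(corr)                                                # 1.0

print(statistics.correlation(x, y))  # 1.0 (built-in shortcut, same result)
```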
143
investing Example of Covariance
John is an **investor**. **His** **portfolio** primarily **tracks** the **performance** of the **S&P 500** and John **wants** to **add** the **stock** of ABC Corp. Before adding the stock to his portfolio, he wants to **assess** the **directional** **relationship** between **the** **stock** and the **S&P 500**. John **does** **not want to increase the unsystematic risk** of his **portfolio**. Thus, he is **not** **interested** in **owning** **securities** in the portfolio that tend to **move** in the **same** **direction**. John can **calculate** **the covariance between** the **stock** of ABC Corp. **and** **S&P 500** by following the steps below: [https://corporatefinanceinstitute.com/resources/knowledge/finance/covariance/](https://corporatefinanceinstitute.com/resources/knowledge/finance/covariance/)
144
Why is Statistical Significance important?
Given that the **sample data** **cannot** **be** truly **reliable** and **representative** **of** the **full population**, there is the possibility of a **sampling error** or **random chance affecting** the **experiment’s** **results**. **not all samples random**ly extracted from the population are preordained to **reproduce** the **same** result. It’s natural for **some samples** to contain a **higher number of outliers** and **anomalies** than other samples, and **naturally**, **results** can **vary**. If we continued to extract random samples, we would likely see a **range of results** and the **mean** of **each random sample** is **unlikely** to be **equal** to the true mean of the **full population.**
145
statistical significance : what is the role?
**outlines** a **threshold** for **rejecting** the **null** **hyp**othesis. Statistical significance is often referred to as the **p-value** (**probability value**) and is expressed **between** **0** and **1**.
146
what is the meaning of p-value of 0.05?
A p-value of 0.05 expresses a **5% probability** that a **result** at least this extreme would occur by random chance if we took **another** **sample** (assuming the null hypothesis is true).
147
How do we use the p-value in hypothesis testing?
the **p-value** is **compared** to a **pre-fixed value** (the **alpha**). If the **p-value returns** as equal or **less** than **alpha**, then the **result** is **stat**istically **significant** and **we** **can** **reject** the **null** **hyp**othesis. If the **p-value** is **greater** than **alpha**, the result is **not** **stat**istically **significant** and we **cannot** **reject** the **null** hypothesis. **Alpha** sets a **fixed threshold** for **how** **extreme** the **results** **must** **be** before **rejecting** the **null** hypothesis. (alpha should be **defined** **before** the **experiment** and not after the results have been obtained)
148
How does alpha work for two-tailed tests?
For **two-tailed tests**, the **alpha** is **divided** by **two**. Thus, if the **alpha** is **0.05** (5%), then the **critical areas** of the curve each **represent** **0.025** (2.5%). Hypothesis **tests** usually adopt an alpha of **between** 0.01 (**1%**) and 0.1 (**10%**), there is **no** **predefined** or **optimal** **alpha** for **all** **hyp**othesis **tests**.
149
Why is there a tendency to set alpha to a low value such as 0.01?
**Alpha** is **equal** to the **probability** of a **Type I Error** (**incorrect** **rejection** of **H0** due to a **false** **positive**), which occurs when the **result** **falls** into the alpha% **critical** (rejection) **zone**(s). When the result is in the critical zone (defined by alpha) -> **H0** is **rejected** -> hence the **tendency** to **minimize** the **critical** **zone** by choosing a **smaller** **alpha**. With a smaller critical area >> **less** **chance** of **incorrectly** **rejecting** **H0**, but! this **increases** the **risk** of a **Type II Error** (**incorrectly** **accepting** the **null** **hypothesis**), because the **critical** **zone** becomes so **tiny** that hardly any **value** can **fall** **into** it anymore -> we **cannot** **reject** **H0** -> **incorrect** **acceptance** of H0 >> an inherent trade-off in hypothesis testing >> most industries have found that 0.05 (5%) is the ideal alpha for hypothesis testing.
150
What is alpha equal to?
alpha is **equal** to the **probability** of a **Type I Error** (**incorrect** **rejection** of the **null** **hyp**othesis) (**false** **pos**itive result)
151
Confidence in essence
Confidence is a **statistical** **measure** of **how** **confident** we can be that the **sample** **result** of the **experiment** holds **true** **for** the **full** **pop**ulation.
152
Confidence is calculated as
**Confidence** is calculated as (**1 – α**). If the **alpha** is **0.05** \>\> the **confidence** level of the experiment is 0.95 (**95%**). 1.0 – α = confidence level; 1.0 – 0.05 = **0.95**
153
Confidence relation to alpha
Confidence is calculated as (1 – α). If the alpha is 0.05 \>\> the confidence level of the experiment is 0.95 (95%). 1.0 – α = confidence level; 1.0 – 0.05 = 0.95
154
What alpha of 0.05 tells and what not?
alpha = 0.05 --\> **reject** the **null** **hyp**othesis when the **results** fall in a **5%** **zone**, but this **doesn’t** **tell** us **where** to **place** the **null hyp**othesis **rejection** **zone**(**s**). \>\> we need to **define** the **critical** areas set **by** **alpha**. [two-tail test with two confidence intervals and two critical areas .png](https://www.dropbox.com/s/q58y83jtepakq55/two-tail%20test%20with%20two%20confidence%20intervals%20and%20two%20critical%20areas%20.png?dl=0)
155
Why do we need to define the critical areas set by alpha?
to locate the null hypothesis rejection zone(s)
156
How to define the critical areas set by alpha?
Confidence intervals define the confidence bounds of the curve. **Two-tailed test**: **two** **confidence** **intervals** define **two** critical **areas** **outside** the **up**per and **lower** **conf**idence **limits**; **One-tailed test**: a **single** **confidence** **interval** defines a **critical** **area** on the left- or right-hand side. [two-tail test with two confidence intervals and two critical areas .png](https://www.dropbox.com/s/q58y83jtepakq55/two-tail%20test%20with%20two%20confidence%20intervals%20and%20two%20critical%20areas%20.png?dl=0)
157
Confidence intervals define..
Confidence intervals define the confidence bounds of the curve
158
types of hypothesis test
left one-tailed, right one-tailed, two-tailed
159
For a normal distribution with sufficient sample data (n\>30), what is the formula for a two-tailed test?
Z: Z-distribution critical value (found using a Z-distribution table) [formula for a two-tailed test.png](https://www.dropbox.com/s/avtuyvcgfhvk0r4/formula%20for%20a%20two-tailed%20test.png?dl=0)
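The linked image isn’t reproduced here; as a reconstruction (assuming the standard large-sample Z-interval this card describes), the two-tailed confidence interval around the sample mean is:

$$\bar{x} \pm Z_{\alpha/2}\,\frac{s}{\sqrt{n}}$$

where x̄ is the sample mean, s the standard deviation, n the sample size, and Z_{α/2} the Z-distribution critical value.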
160
Z-Statistic is used to find..
The Z-Statistic is used to **find** the **distance** between the **null hypothesis** and the sample **mean**.
161
How do you utilize Z-Statistic in hypothesis testing?
In hypothesis testing, the **experiment’s** **Z-Statistic** is **compared** with the **expected** **statistic** (**critical value**) for a given **confidence** **level**. **Z-Statistic** is used to find the **distance** **between** the **null** **hyp**othesis and the **sample** **mean**.
162
Example: teenage gaming habits in Europe. Data given: **n=100** (100 teens); **mean** gaming time: **22 hrs**; **Stand. Dev.** = **5.7** (calculated); **alpha** of **0.05**. How do you find the confidence intervals for 95%? Using a two-tailed test, what can you find out?
**95%** **certain** that our **sample** **data** will **fall** somewhere **between** 20.8828 and 23.1172 hours. [Example teenage gaming habits in Europe](https://www.dropbox.com/s/q8v9vze53wt78h6/Example%20teenage%20gaming%20habits%20in%20Europe%20%20.png?dl=0)
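A quick check of these numbers (a sketch; z = 1.96 is the two-tailed 95% critical value):

```python
import math

n, mean, sd, z = 100, 22.0, 5.7, 1.96  # given data; z for 95% two-tailed

margin = z * sd / math.sqrt(n)       # 1.96 * 5.7 / 10 = 1.1172
print(mean - margin, mean + margin)  # 20.8828 and 23.1172 hours
```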
163
Example: teenage gaming habits in Europe; now a **low** **sample** **size** (10). Data given: **n=10** (10 teens); **mean** gaming time: **22 hrs**; Stand. Dev. = **5** (calculated); **alpha** of **0.05**. How do you find the confidence intervals for 95%? Using a two-tailed test, what can you find out?
The confidence intervals can be found using the T-distribution (the sample is small, so the T-distribution replaces the Z-distribution). [T-distribution Confidence Intervals Xsample.png](https://www.dropbox.com/s/ja6tu4z4iuzfeaf/T-distribution%20Confidence%20Intervals%20%20Xsample.png?dl=0)
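A sketch of the same calculation with the T-distribution (assuming scipy is available; the bounds in the linked image should match up to rounding):

```python
import math
from scipy import stats

n, mean, sd = 10, 22.0, 5.0
alpha = 0.05

# t critical value for a two-tailed 95% interval, df = n - 1 = 9 (~2.262)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
margin = t_crit * sd / math.sqrt(n)
print(mean - margin, mean + margin)  # roughly 18.42 to 25.58 hours
```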
164
the overall objective of hypothesis testing is
to **prove** that the **outcome** of the **sample data** is **representative** of the **full population** and **not** **occurring** **by** **chance** due **to** **random**ness in the **sample** **data**.
165
Hypothesis testing four steps:
1: **Identify** the **null hyp**othesis (what you believe to be the **status quo** and **wish** to **nullify**) and the **type of test** (i.e. **one-tailed** or **two**-tailed).
2: **State** your experiment’s **alpha** (statistical significance and the **probability** of a **Type I Error**) and **set** the **confidence** **interval**(**s**).
3: **Collect** **sample** **data** and conduct a **hypothesis** **test**.
4: **Compare** the test **result** **to** the **critical** **value** (expected result) and **decide** whether to **support** or **reject** the **null** **hyp**othesis.
166
What Z-Score measures?
the **distance** between a **data** **point** and the sample’s **mean**
167
What Z-Score measures in hypothesis testing?
in hypothesis testing, we use the Z-Statistic to find the **distance** between a **sample** **mean** and the **null hypothesis**.
168
How is the Z-Statistic expressed? What is its meaning?
Expressed **numerically**: the **higher** the **statistic**, the **higher** the **discrepancy** **between** the **sample** **data** and the **null** **hypothesis**. A Z-Statistic **close to 0** means the **sample mean** **matches** the **null hyp**othesis—**confirming** the null hypothesis. The statistic is pegged to a **p-value**, which is the probability of that result **occurring** **by** **chance**.
169
Z-Statistic of close to 0 means
Z-Statistic of close to 0 means the **sample** **mean** **matches** the **null** **hypothesis**—**confirming** the null hypothesis
170
rögzítve van (Hungarian for “it is fixed/pegged”)
pegged to
171
What does p\<0.05 indicate?
A low p-value, **such** **as** **0.05**, indicates that the **sample** **mean** is **unlikely** to have **occurred** **by** **chance**. A p-value of **0.05** is sufficient to **reject** the **null** **hypothesis**.
172
How to find the p-value for a Z-statistic?
To find the p-value for a Z-statistic, we need to refer to a Z-distribution table [Z-distribution table .png](https://www.dropbox.com/s/y1kgn1iqn9eqd3u/Z-distribution%20table.png?dl=0) [z Critical Value.png](https://www.dropbox.com/s/hbomwlbofnufrsy/z%20Critical%20Value.png?dl=0)
173
What a two-Sample Z-Test compares?
A two-sample Z-Test **compares** the **difference** between the **means** of **two** **independent** **samples** with a known **stand**ard **dev**iation. (we assume: the data is **norm**ally **distr**ibuted and a **min**imum of **30** observations)
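As a hedged reconstruction of the usual two-sample Z formula (the linked formula image may use slightly different notation):

$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$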
174
What is a high enough Z value (Z-Statistic value)?
It **depends** on the **level** of **conf**idence (determined by **alpha**) and the **type** of the **test** (**one**-tailed or **two**-**tailed**) \>\> the critical Z-value can be found **in tables**; the table entry corresponds to the chosen **level** of **confidence**, e.g. in a Two-Sample Z-Test.
175
What do you calculate with a Two-Sample Z-Test?
A Z value (Z-Statistic value). It helps to **evaluate** the **null** **hyp**othesis (e.g. a **diff**erence **between** two **sets** of **values**, i.e. **two** **samples**). We need to calculate the **SD** of the two samples \> it shows to **what** **extent** they **vary** \> it helps to see **whether** the **difference** **between** the two **groups** is **due** **to** **variation** or is **real**.
If **Z** is **close** to **0** \>\> the **sample** **mean** **matches** the **null** **hyp**othesis \>\> it **confirms** the **null hyp**othesis (the **two** **samples** are **equal**; the **difference** found between their means is **due** **to** **chance**, coming from variation).
If **Z** is **high** **enough** \>\> **reject** **H0**, i.e. **reject** **that** **µ1 = µ2** (mu1 = mu2) \>\> **accept** **H1** (the **means** of the samples are **indeed** **different**).
What is a **high** **enough** Z value (Z-Statistic value)? \>\> It **depends** on the **level** of **confidence** (**alpha**) and the **type** of the **test** (**one**-tailed or **two tailed**) \>\> the critical Z-value can be found in **tables**; the table **shows** the **level** of **confidence**. These critical Z values are also used in **confidence** **interval** **calculations**: once a confidence level is determined (by alpha), we look up the corresponding critical Z-value in the tables \>\> **this** **sets** the **limit** **where** **H0** can be **rejected**. [Two-Sample Z-Test formula.png](https://www.dropbox.com/s/x931wztvk4oolwg/Two-Sample%20Z-Test%20formula.png?dl=0) [z Critical Value.png](https://www.dropbox.com/s/hbomwlbofnufrsy/z%20Critical%20Value.png?dl=0)
176
z Critical Value
[z Critical Value.png](https://www.dropbox.com/s/hbomwlbofnufrsy/z%20Critical%20Value.png?dl=0)
177
One-Sample Z-Test example: Company A claims their new phone battery outperforms the former 20-hour battery life. A sample of 30 users gives a mean battery life of 21 hours, SD = 3. Is 21 \> 20 significant, given SD = 3 and n = 30?
[One-Sample Z-Test](https://www.dropbox.com/s/p2j3ad0krfv3gal/One-Sample%20Z-Test.png?dl=0)
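The linked worked example isn’t reproduced here; a sketch of the arithmetic it presumably performs:

```python
import math

n, sample_mean, sd, mu0 = 30, 21.0, 3.0, 20.0  # H0: mu = 20 hours

z = (sample_mean - mu0) / (sd / math.sqrt(n))
print(z)  # ~1.83, above the one-tailed 95% critical value of 1.645
# -> reject H0; the new battery does appear to outperform 20 hours
```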
178
Two-Sample Z-Test practical: Company A claims their phone battery outperforms Company B's. 60 users in total: mean battery life (Company A, sample of 30 users) \>\> 21 hours, SD = 3; mean battery life (Company B, sample of 30 users) \>\> 19 hours, SD = 2. Is that claim right?
[Two-Sample Z-Test.png](https://www.dropbox.com/s/99bve87s6esv3w7/Two-Sample%20Z-Test.png?dl=0)
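A sketch of the two-sample arithmetic on the Company A / Company B figures above:

```python
import math

n1, mean1, sd1 = 30, 21.0, 3.0  # Company A sample
n2, mean2, sd2 = 30, 19.0, 2.0  # Company B sample

z = (mean1 - mean2) / math.sqrt(sd1**2 / n1 + sd2**2 / n2)
print(z)  # ~3.04, beyond the two-tailed 95% critical value of 1.96
# -> reject H0 (mu1 = mu2); the difference is unlikely to be chance
```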
179
One-Sample Z-Test in essence
One sample only (sample size: 30, conventionally the minimum). Calculate the SD, assume normal distribution, calculate the mean \>\> is it different from a given value? Not comparing two samples; only one sample's mean is compared to a value.
180
One-Sample Z-Test
One sample only (sample size: 30, conventionally the minimum). Calculate the SD, assume normal distribution, calculate the mean \>\> is it different from a given value? (Not comparing two samples; only one sample's mean is compared to a value.) [One-Sample Z-Test](https://www.dropbox.com/s/lsk8mrtph890ryp/One-Sample%20T-Test%20formula.png?dl=0) [formula](https://www.dropbox.com/s/lsk8mrtph890ryp/One-Sample%20T-Test%20formula.png?dl=0)
181
One-Sample Z-Test formula
[One-Sample Z-Test formula.png](https://www.dropbox.com/s/t7g4zox0w4hqiwq/One-Sample%20Z-Test%20formula.png?dl=0)
182
What do you do if you need to compare two mean values coming from two different samples? (n = 30 minimum, normal distribution, with calculated SD)
[Two-Sample Z-Test.png](https://www.dropbox.com/s/99bve87s6esv3w7/Two-Sample%20Z-Test.png?dl=0)
183
T-Test in essence
Similar to the Z-Test, a T-Test analyzes the **distance** **between** a **sample mean** and the **null** **hyp**othesis, but is **based on the T-distribution** (used with a **smaller** **sample** **size**) and **uses** the **stand**ard **dev**iation of the **sample** **rather** **than** of the population.
184
The main categories of T-Tests:
- An **independent** **samples** T-Test (**two-sample T-Tes**t) for **comparing** **means** from **two** different **groups**, such as two different companies or two different athletes. This is the **most** **commonly** used type of T-Test.
- A **dependent** **sample** T-Test (**paired T-test**) for **comparing** **means** from the **same** **group** at two **different** **intervals**, i.e. measuring a company’s performance in 2017 against 2018.
- A **one-sample T-Test** for **testing** the **sample** **mean** of a single group **against** **a** known or hypothesized **mean**.
185
What is T-Statistic?
The **output** of a **T-Test**, called the **T-Statistic**, **quantifies** the **difference** **between** the **sample** **mean** and the **null hyp**othesis. As the **T-Statistic increases** in the **+/-** direction, the **gap** between the **sample** **data** and the **null hyp**othesis **expands**. To interpret it, we refer to a **T-distribution table**.
186
If we have a one-tailed test with an alpha of 0.05 and sample size of 10 (df 9), what can we expect?
we can expect **95% of samples** to **fall** within **1.83 stand**ard **dev**iations of the **null hyp**othesis. [T-distribution table.png](https://www.dropbox.com/s/m38owcfuf9kfjln/T-distribution%20table.png?dl=0)
187
Sample (n=10) \>\> Mean, SD calculated \>\> we carry out T-Test: If our **sample** **mean** returns a **T-Statistic** **greater** than the **critical score** of **1.83**, what can we conclude?
we can conclude the **results** of the **sample** are **stat**istically **significant** and **unlikely** to have occurred **by** chance—allowing us to **reject** the **null hyp**othesis (H0: mu = a certain **value**). So the **mean** **is** **different** from that value; the **difference** we **found** is **not** due to **chance**, **but genuine**. [T-distribution table.png](https://www.dropbox.com/s/m38owcfuf9kfjln/T-distribution%20table.png?dl=0)
188
What is the T-Statistic critical score (for 95% confidence)?
For a **one-tail test** (df = 9): the **T-Statistic** must be **greater** than the critical score of **1.83 for 95%** confidence (**alpha** = **0.05**).
For a **two-tail test** (df = 9): the **T-Statistic** critical score is **2.26** for **95%** confidence (**alpha** = **0.05/2** = **0.025** in each tail). The **two** **critical** **areas** would **each** account for **2.5%** of the distribution, based on **95%** **confidence**, with **confidence** **intervals** of **-2.262** and **+2.262** **from** the **null** **hyp**othesis. [T Table](https://www.dropbox.com/s/m38owcfuf9kfjln/T-distribution%20table.png?dl=0)
189
Independent Samples T-Test in essence
An independent samples T-Test **compares** **means** from **two** **different** **groups**. [Independent Samples T-Test formula.png](https://www.dropbox.com/s/48mecjisbglgbbn/Independent%20Samples%20T-Test%20formula.png?dl=0)
190
What is Pooled standard deviation used for?
Part of the larger **Independent** **Samples** **T-Test calculation**: the two samples' standard deviations are pooled into a single estimate. [https://www.dropbox.com/s/48mecjisbglgbbn/Independent%20Samples%20T-Test%20formula.png?dl=0](https://www.dropbox.com/s/48mecjisbglgbbn/Independent%20Samples%20T-Test%20formula.png?dl=0)
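A sketch of the standard pooled standard deviation formula (assuming the linked image shows the usual textbook form):

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$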
191
Independent Samples T-Test Xmpl: compare **customer spending** between the **desktop** version of a website **and** the **mobile** site. **25 desktop** customers spent an average of **$70** with a SD of **$15**. Of the **mobile** users, **20** customers spent **$74** on average with a SD of **$25**. We test the difference of the two sample means using a two-tail test with an alpha of 0.05 (95% confidence).
[Independent Samples T-Test.png](https://www.dropbox.com/s/di2ew8m8diopt7t/Independent%20Samples%20T-Test.png?dl=0)
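A sketch of this test in scipy (assuming the pooled-variance version, which matches the pooled-SD formula above; `scipy.stats.ttest_ind_from_stats` takes summary statistics directly):

```python
from scipy import stats

# Desktop: n=25, mean=$70, SD=$15; Mobile: n=20, mean=$74, SD=$25
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=70, std1=15, nobs1=25,
    mean2=74, std2=25, nobs2=20,
    equal_var=True,  # pooled-variance (classic independent samples T-Test)
)
print(t_stat, p_value)  # t ~ -0.67, p ~ 0.51 > 0.05 -> cannot reject H0
```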
192
What to do if we want to: compare customer spending between the desktop version of their website and the mobile site. 25 desktop customers spent an average of $70 with a SD of $15. mobile users, 20 customers spent $74 on average with a SD of $25.
Independent Samples T-Test [Independent Samples T-Test.png](https://www.dropbox.com/s/di2ew8m8diopt7t/Independent%20Samples%20T-Test.png?dl=0)
193
Dependent Sample T-Test in essence
A dependent sample T-Test is used for comparing means from the same group at two different intervals. [Dependent Samples T-Test formula.png](https://www.dropbox.com/s/4fv0gbpm6rwfalt/Dependent%20Samples%20T-Test%20%20formula.png?dl=0)
194
What to use if we want to compare means from the same group at two different intervals (at two different timepoints, but the same participants)?
Dependent Samples T-Test [Dependent Samples T-Test.png](https://www.dropbox.com/s/owe9nkj0ozc4yqy/Dependent%20Samples%20T-Test.png?dl=0)
195
Dependent Sample T-Test what for?
if we want to compare means from the same group at two different intervals (at two different timepoints, but the same participants) [Dependent Samples T-Test.png](https://www.dropbox.com/s/owe9nkj0ozc4yqy/Dependent%20Samples%20T-Test.png?dl=0)
196
One-Sample T-Test in essence
A one-sample T-Test is used for **testing** the **sample** **mean** of a **single** **group** **against** a **known** or **hypothesized** **mean**. [One-Sample T-Test formula.png](https://www.dropbox.com/s/lsk8mrtph890ryp/One-Sample%20T-Test%20formula.png?dl=0)
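For reference, the standard one-sample T-Test statistic (a reconstruction; the linked formula image may differ in notation):

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where μ₀ is the known or hypothesized mean.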
197
When is a Z-Test used for hypothesis testing? What is it based on?
A Z-Test is used for datasets with **30** or **more** **obs**ervations (**norm**al **distr**ibution) with a known **stand**ard **dev**iation of the population, and is **calculated** **based** on the **Z-distrib**ution.
198
When is a T-Test used for hypothesis testing?
A T-Test is used in scenarios when you have a **small** **sample** **size** or you **don’t** **know** the **standard** **deviation** **of** the **population** and you **instead** **use** the **standard** **deviation** **of** the **sample** and **T-distribution**.
199
What to do if you want to compare a small-sized sample (group) and you do not know the SD of the whole population (only your small sample's)?
A **T-Test** is used in scenarios when you have a small sample size or you don’t know the standard deviation of the population; you **instead** **use** the **standard dev**iation **of** the **sample** and the **T-distrib**ution. You can **test** whether the **sample** **mean** is the **same** **as** some value (it will be a **hyp**othesis: **H null**: they are the **same**; **H1**: they are **different**). You can **test** **H0** with a **T-test** \>\> you **get** a **T-Statistic** value \>\> **look up** the **critical** **value** in the T-distribution table \>\> **compare** them \>\> **accept**/**reject** the **null** **hyp**othesis.
200
What T-Test is used for ?
A **small** **sample** **size**, or you **don’t** **know** the **standard** **dev**iation of the **population** \>\> **instead** **use** the **stand**ard **dev**iation of the **sample** and the **T-distrib**ution. You **can test** whether the **sample** **mean** **is** the **same** as some value (it will be a **hypo**thesis: **H null**: they are the **same**; **H1**: they are **different**). You can **test** **H0** with a **T-test** \>\> you get a **T-Statistic** **value** \>\> **look up** the critical value in the **T-distribution table** \>\> **compare** them \>\> **accept**/**reject** the null hypothesis.
201
What technique is used to compare an experimental group and a control group (placebo)?
**hypoth**esis **testing** for comparing **two** **proportions** from the same population, expressed in percentage form, i.e. 40% of males vs 60% of females. We need to conduct a '**two-proportion Z-Test**'. [https://www.dropbox.com/s/3ml84x5fhon19gj/Two-proportion%20Z-Test.png?dl=0](https://www.dropbox.com/s/3ml84x5fhon19gj/Two-proportion%20Z-Test.png?dl=0)
202
two-proportion Z-Test'
Hypothesis testing for comparing two proportions from the same population, expressed in percentage form, i.e. 40% of males vs 60% of females. We conduct a 'two-proportion Z-Test' to compare an experimental group and a control group (placebo). [https://www.dropbox.com/s/3ml84x5fhon19gj/Two-proportion%20Z-Test.png?dl=0](https://www.dropbox.com/s/3ml84x5fhon19gj/Two-proportion%20Z-Test.png?dl=0)
203
Two-proportion Z-Test practical
Two-proportion Z-Test practical [Two-proportion Z-Test practical.png](https://www.dropbox.com/s/fwic0qbpxt9lvgy/Two-proportion%20Z-Test%20practical.png?dl=0) We consider a new energy drink formula that proposes to improve students’ test scores. Max test score: 1600 (the average score is 1050–1060). Evaluation: whether the students’ results exceed 1060 points. A sample of 2,000 students is split evenly into an exp. group (energy drink) and a ctrl group (placebo). Results: Ctrl Group = 500 exceeded /1000; Exp Group = 620 exceeded /1000. 620 looks like more than 500 \> is it a real difference?
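A sketch of the two-proportion arithmetic on these numbers (pooled proportion under H0):

```python
import math

x1, n1 = 620, 1000  # experimental group: students exceeding 1,060
x2, n2 = 500, 1000  # control group

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)  # pooled proportion under H0: p1 = p2

z = (p1 - p2) / math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
print(z)  # ~5.41, far beyond 1.96 -> reject H0; the proportions differ
```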
204
In a Two-proportion Z-Test we get a Z-Statistic value: how do we evaluate it?
**Critical** **areas** of **2.5%** on **each side** of the **two-tailed** (**n**ormal **d**istribution) **curve**, at a **distance** of **1.96** **s**tandard **d**eviations. If the **Z-Statistic** **falls within** **1.96 stand**ard **dev**iations of the **mean** (**within** the **95% area**) \>\> we can conclude that the **proportions** of the 'experimental test' and 'control test' **results** were **equal** (the exp. group and the ctrl group are not different). If the **Z-Statistic** **falls** **out**side of the **95% area** \>\> **reject** the **null** **hyp**othesis (the **proportions** are **not** the **same**) \>\> so they are **different** (**H1** is **true**). [Normal distribution curve with marked critical areas.png](https://www.dropbox.com/s/xhqaetkk85ett8c/Normal%20distribution%20curve%20with%20marked%20critical%20areas.png?dl=0)
205
We consider a new energy drink formula proposes to improve students’ test scores. max test score: 1600 (the average score is 1050 - 1060). Evaluation: whether the students’ results exceed 1,060 points. sample of 2,000 students split evenly into an exp. group (energy drink) and a ctrl group (placebo). Results: Ctrl Group = 500 surpassed /1000 Exp Group = 620 surpassed/1000; looks more than 500 \> real difference? How to evaluate the difference?
Two-proportion Z-Test [Two-proportion Z-Test practical.png](https://www.dropbox.com/s/fwic0qbpxt9lvgy/Two-proportion%20Z-Test%20practical.png?dl=0)
206
What is the null hypothesis when comparing exp. group with a ctrl group?
The **two-proportion Z-Test** is based on the following hypotheses:
**H0**: **p1 = p2** (the proportions are the **same**, with the **difference** equal to **0**)
**H1: p1 ≠ p2** (the two **proportions** are **not** the **same**)
We **detect** a **difference** between the two groups \>\> is it a **real** difference (**or** just due to **chance**)? To find out \>\> **H0**: we state that **they** are the **same** (this is the hypothesis **we** **want** to **nullify** / **reject**) \>\> we can reject it **if** the **Z-test** **value** falls into an **area** of the distribution where there is less than a **5%** **chance** it would fall **by** **chance**, **considering** the **variation** in the **sample** **group**.
We **anchor** the **null** hypothesis with the **statement** that **we** **wish** to **nullify**: the **two** **proportions** of results are **identical**, and it just so happened that the **results** of the **experimental** **group** differed from the control group **due** to a **random** **sampling** **error**. \<-- If rejected, H1 is true: they are not equal.
In general: H0 is the known, the status quo, what we want to challenge (equal, not equal, less, more); H1 is the opposite, covering everything else. [Two-proportion Z-Test practical.png](https://www.dropbox.com/s/fwic0qbpxt9lvgy/Two-proportion%20Z-Test%20practical.png?dl=0)
207
What is the meaning if we define confidence level = 95% ? H0: p1 = p2 (The proportions are the same with the difference equal to 0) H1: p1 ≠ p2 (The two proportions are not the same)
**H0**: **p1 = p2** (the proportions are the same, with the difference equal to 0); **H1: p1 ≠ p2** (the two proportions are **not** the **same**). We **test** **it**: if the observed difference would occur **less** than **5%** of the time **by chance** (i.e. there is **more** **than** **95%** probability that it is not by chance) \>\> we reject H0.
Put another way: the **formula** actually **examines** the **difference** between **the** two **sample** **proportions**: H0: p1 - p2 = 0; Ha: p1 - p2 ≠ 0. We test it: the two proportions are not the same \>\> the difference of the proportions is not zero; if the probability that it is zero by chance is less than 5% \>\> there is 95% (or more) probability that it is not by chance \>\> so the difference is genuinely true.
We’ll **reject** the **null** **hyp**othesis **if** **there’s** a **less** than **5% chance** that the observed result arose **by** **chance** under H0.
We **anchor** the **null** **hyp**othesis with the **statement** that we wish to **nullify** (e.g. exp group vs placebo group test: the two proportions of results are identical, and it just so happened that the results of the experimental group differed from those of the control group due to a random sampling error). [Normal distribution curve with marked critical areas.png](https://www.dropbox.com/s/xhqaetkk85ett8c/Normal%20distribution%20curve%20with%20marked%20critical%20areas.png?dl=0)
208
regression analysis essence
A technique in inferential statistics used to test **how** **well** a **variable** **predicts** **another** **variable**. The term “regression” is derived from Latin, meaning “going back”.
209
What is the the objective of regression analysis ?
The objective of regression analysis is to **find** a **line** that **best fits** the **data points** on the **scatterplot**, in order to make **predictions**. In **linear regression**, the **line** is **straight** and **cannot** **curve** or **pivot**. **Nonlinear regression**, meanwhile, allows the line to curve and bend to fit the data.
210
trendline
trendline: **A straight line** **cannot** possibly **intercept** **all** **data** **points** on the scatterplot \> **linear regr**ession can be thought of as a **trendline visualizing** the **underlying** **trend** of the **dataset**. **hyperplane**: if you drew a perpendicular **line** **from** the **regression** **line** **to** each **data** **point** on the scatterplot \>\> the **aggregate** **distance** of all points would equate to the smallest possible total distance to the hyperplane (the regression line, generalized to higher dimensions).
211
hyperplane
Draw a perpendicular **line** **from** the **regression line** **to** **each** **data** **point** on the scatterplot \>\> the **aggregate** **distance** of all points equates to the **smallest** **possible** total **distance** **to** the **hyperplane** (the regression line, generalized to higher dimensions).
212
coefficient
**slope**, aka **coefficient** in statistics. The term “**coefficient**” is generally **used** **over** “**slope**” in **cases** where there are **multiple** **variables** in the equation (**multiple** **linear** **regression**) and the **line’s slope** is **not** **explained** **by** any **single** **variable**.
213
slope
The **slope** of a regression line (b) represents the **rate** **of** **change** **in y** as **x** **changes**. Because **y** is **dependent** **on** **x** \> the **slope** **describes** the **predicted** values of **y** given x. The **slope** of a **regression** **line** is **used** with a **t-statistic** to **test** the **significance** of a **linear** **relationship** **between** **x** and **y**. The **slope** can be **found** by **ref**erencing the **hyperplane** (the regression line on the scatterplot): as **one** **variable** **increases**, the **other** variable **increases** **by** the **average** value **denoted** **by** the **hyperplane**. The **slope** is **useful** in **forming** **predictions**.
214
How do you calculate slope? ## Footnote (I did not get this)
With the **ordinary least squares method** (**one** of the **most** **common** linear regressions), the slope **b** is found as the **covariance** of **x** **and** **y**, **divided** **by** the **variance** (**sum of squares**) of **x**. The **slope** must be **calculated** **before** the **y-intercept** when using a linear regression, as the **intercept** is **calculated** **using** the **slope**. [slope calculation formula.png](https://www.dropbox.com/s/nhz5gtx5pxykhn7/slope%20calculation%20formula.png?dl=0)
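As a reconstruction of the formula described above (the standard ordinary least squares estimates):

$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}$$

The numerator is (proportional to) the covariance of x and y, the denominator the variance (sum of squares) of x, and the intercept a is computed from the slope b afterwards.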
215
How is the slope useful? example..
We can use the slope in forming **predictions**, e.g. to predict a **child's height** **based** on his **parents'** midheight: at the intersection of the parents’ midheight (x) of 72 inches with the regression line, the son’s expected height (y) is approximately 71 inches. [Predicted height of a child whose parents’ midheight.png](https://www.dropbox.com/s/3xh5nnuad02hj2s/Predicted%20height%20of%20a%20child%20whose%20parents%E2%80%99%20midheight.png?dl=0)
216
Regression analysis is useful for..
**Regression** **analysis** (the name comes from the phenomenon of **regression** **towards** the **mean**) is a useful **method** for **estimating** **relationships** **among** **variables** and **testing** if they're somehow **related**. Linear regression is **not** a **fail-proof** method of making **predictions**, but the **trendline** does offer a **primary** **reference** **point** for making **estimates** about the **future**.
217
linear regression summary bbas
The **regression model** (and a **scatter** **chart**) is an excellent tool to **depict** the **relationship** **between** **two** **variables**. It provides a **visual representation** **and** a **math**ematical **model** that **relates** the two **variables**; it describes the **relation** between **x;y** in a **scatter** **plot**: **y = mx + b** (m: **slope**; b: **intercept**). It **calculates** **m** and **b** in **such** a **way** that it **minimizes** the **distance** (error) of the **points** **from** the **regression line** on the plot (**more** **accu**rately: it **reduces** the **sum** **of** the **squared** **errors** \>\> hence the “**least** **squares** **regression**” name). [linear regression summary bbas.png](https://www.dropbox.com/s/4jom45lj6kfbz57/linear%20regression%20summary%20bbas.png?dl=0)
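A minimal least-squares sketch (the x/y values are made up; `numpy.polyfit` with degree 1 performs exactly this minimization):

```python
import numpy as np

# Toy data (made-up): x and y with a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Degree-1 polyfit picks m and b to minimize the sum of squared errors
m, b = np.polyfit(x, y, 1)
print(m, b)       # slope ~1.96, intercept ~0.14
print(m * 6 + b)  # prediction for x = 6 using y = mx + b (~11.9)
```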
218
Linear regression Xmple
[Linear regression Xmple.png](https://www.dropbox.com/s/q9u7ovrp5wdfy9s/Linear%20regression%20Xmple.png?dl=0)
219
What is R-squared for?
If we apply linear regression analysis to large datasets with a higher degree of scattering, or to three-dimensional and four-dimensional data, it is hard to validate the trendline only by looking at it \> a mathematical solution to this problem is to apply R-squared (the coefficient of determination).
220
R-squared
(the **coefficient** of **determination**) R-squared is **a** **test** to see what **level** **of impact** the **independent** **variable** has **on data variance**. R-squared is **a** **number** **between** **0-1** (produces a **percentage** value):
**0%**: the **linear** **regression** model **accounts** for **none** of the data **variability** **in relation to the mean** (of the dataset) \>\> the **regression** **line** is a **poor** **fit** (for the given dataset).
**100%**: the **linear** **regression** model **expresses** **all** the **data** **variability** **in relation to the mean** (of the dataset) \>\> the **regression** **line** is a **perfect** **fit**.
It is a **mathematical** solution to validate the (calculated) relationship in the regression model; it **defines** the **percentage** of **variance** in the **linear model** explained in **relation** **to** the **indep**endent **var**iable.
221
How R-squared is calculated?
R² is a ratio \>\> a division must be calculated: **SSR/SST**.
R-squared is calculated as the **sum of squares regression** (SSR) **divided** by the **sum of squares total** (SST) \>\> SSR/SST.
**SSR**: calculated **from** the **regression** **analysis**'s theoretical values for the dependent variable (y'), where **y'** is based on the **y'=mx+b** formula. For **each** **datapoint**, take the difference between the **theoretical** **y'** and the **mean y̅** of the measured values, **square** it, then **sum** over all points: **SSR = Σ(y' - y̅)²**.
**SST**: calculated **from** the actual **measured** **values** of **y** and the **mean** of the actual **y** values. For **each** **datapoint**, take the difference between the **actual y** value and the **mean y̅**, **square** it, then **sum** over all points: **SST = Σ(y - y̅)²**.
[R-squared calculation.png](https://www.dropbox.com/s/rw9fdjzsqksrv97/R-squared%20calculation.png?dl=0)
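A sketch of the SSR/SST division on the toy data from the least-squares example above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, b = np.polyfit(x, y, 1)
y_hat = m * x + b                      # theoretical y' from the fitted line

ssr = np.sum((y_hat - y.mean()) ** 2)  # sum of squares regression
sst = np.sum((y - y.mean()) ** 2)      # sum of squares total
print(ssr / sst)                       # R-squared, ~0.998 -> very good fit
```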
222
Pearson Correlation in essence
A common **measure** **of** **association** **between** **two** **variables**. Describes the **strength** or **absence** of a **relationship** **between** **two** **variables**. **Slightly** **different** from **linear** **regr**ession analysis, which **expresses** the **average** **math**ematical **relationship** **between** two or more **variables** with the intention of **visually** **plotting** the relationship on a **scatterplot**. Pearson correlation is a statistical measure of the **co**-**relationship** **between** two **variables** **without** any **designation** of **independent** and **dependent** **qualities**.
223
Interpretations of Pearson correlation coefficients
**Pearson** **cor**relation (**r**) is expressed as a **number** (coefficient) **between** **-1** and **1**:
**-1** denotes the existence of a **strong** **negative** correlation;
**0** equates to **no** correlation;
**+1** a **strong** **positive** correlation.
A correlation coefficient of **-1** means that **for every positive** **increase** in **one variable**, there is a **negative** **decrease** **of a fixed proportion** in the other **variable** (airplane fuel, which decreases in line with distance flown).
A correlation coefficient of **1** signifies an **equivalent** **positive** **increase** in **one** **variable** **based** on a **positive** **increase** in **another** **variable** (food **calories** of a particular **food** that go up with its **serving** **size**).
A correlation coefficient of **zero** notes that for **every** **increase** in **one** **variable**, there is **neither** a **positive** nor a **negative** **change** (the two **variables** **aren’t** **related**).
[Interpretations of Pearson correlation coefficients.png](https://www.dropbox.com/s/qzzl4wvpqp6hkox/Interpretations%20of%20Pearson%20correlation%20coefficients.png?dl=0)
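A sketch echoing the serving-size and fuel examples above (made-up numbers; `scipy.stats.pearsonr` returns r and a p-value):

```python
from scipy import stats

servings = [1, 2, 3, 4, 5]            # serving size
calories = [110, 220, 330, 440, 550]  # calories scale with serving size

r, _ = stats.pearsonr(servings, calories)
print(r)  # ~1.0: perfect positive correlation

fuel = [100, 80, 60, 40, 20]          # fuel remaining falls with distance
r, _ = stats.pearsonr(servings, fuel)
print(r)  # ~-1.0: perfect negative correlation
```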
224
Pearson correlation coefficients xmpl
Describes the **strength** or **absence** of a **relationship** **between** two **variables** [Pearson correlation coefficients xmpl.png](https://www.dropbox.com/s/udj0z4ubg6u3zzw/Pearson%20correlation%20coefficients%20xmpl.png?dl=0)
225
Clustering analysis in essence
Clustering analysis aims to **group** **similar** **objects** (**data** **points**) into **clusters** **based** on the **chosen** **variables**. This method **partitions** **data** **into assigned segments** or **subsets**, where **objects** **in** one **cluster** **resemble** one another and are **dissimilar** **to** **objects** contained in the **other** **cluster**(s). Objects can be interval, ordinal, continuous or categorical variables. (A **mixture** of **different** **variable** types can lead to **complications** with the analysis, because the **measures** of **distance** **between objects** can **vary** depending on the variable types contained in the data.)
226
Regression and clustering
[Regression and clustering shown on a scatterplot.png](https://www.dropbox.com/s/14b3it184op0agu/Regression%20and%20clustering%20shown%20on%20a%20scatterplot.png?dl=0)
227
clustering analysis is used in
It **developed** originally in **anthropology** (**1930**s) and **later** in **personality** **psych**ology (**1943**). Today it is used in **data mining**, **inf**ormation **retrieval**, **mach**ine **learn**ing, **text** **mining**, **web** **anal**ysis, **marketing**, **medical** **diagn**osis, and many more. Specific use cases include **analyzing** **symptoms**, identifying clusters of **similar** **genes**, **segment**ing **communities** in **ecology**, and **identifying** **objects** in **images**. It is not one fixed technique but rather a **family** **of** **methods** (it includes **hierarchical** clustering analysis and **non**-**hierarchical** **clustering**).
228
Hierarchical Clustering Analysis
(HCA) is a technique to **build** a **hierarchy** of **clusters**. An example: **divisive** **hierarchical clustering**, which is a **top**-**down** method where **all** **objects** **start** **as** a **single cluster** and are **split** into **pairs** of clusters **until** **each** object represents an **individual** **cluster**. [Hierarchical Clustering Analysis.png](https://www.dropbox.com/s/kwfnjb6ong5fe6i/Hierarchical%20Clustering%20Analysis.png?dl=0)
229
Agglomerative hierarchical clustering
A **bottom-up** **method** of **classific**ation (the more **popular** approach). Carried out in reverse: **each** **object** **starts** as a **standalone** cluster, and a **hierarchy** is **created** by **merging pairs** of clusters to form **progressively larger** clusters. Three steps:
1. **Objects** **start** as their **own** **separate** **cluster**, which results in a **maximum** **number** of clusters.
2. The number of clusters is **reduced** **by** **combining** the **two nearest** (**most** **similar**) clusters. (Methods differentiate by their interpretation of the “**shortest distance**”.)
3. This process is **repeated** **until** **all** objects are grouped inside **one** **single** **cluster**.
\>\> **Hierarchical clusters** **resemb**le a **series** of **nested** clusters **organized** **within** a **hierarchical** **tree**.
230
What is the difference between "agglomerate clustering" and " divisive clustering"?
**Agglomerative** **cluster**ing **starts** with a **broad** **base** and a **max**imum **number** of **clusters**. The number of clusters **falls** **at subsequent rounds** **until** there’s **one** **single** cluster **at** the **top** **of** the **tree**. In the case of **divisive clustering**, the **tree** is **upside** **down**: at the **bottom** of the tree is **one** **single** **cluster** that contains **multiple** **loosely** **related** **clust**ers, and these clusters are **sequentially** **split** **into** **smaller** clusters **until** the **max**imum number of clusters is reached. **Hierarchical** **clust**ering \>\> uses a **dendrogram** **chart** to **visualize** the **arrangement** of clusters. (Dendrograms demonstrate **taxonomic** **relationships** and are commonly used in **biology** to map **clusters** **of** **genes** or other samples.) (Greek dendron - “tree.”) [Nearest neighbor and a hierarchical dendrogram.png](https://www.dropbox.com/s/jwbqsos3gsbwmn9/Nearest%20neighbor%20and%20a%20hierarchical%20dendrogram.png?dl=0)
231
Agglomerative Clustering Techniques
Various methods (they **differ** in both the **technique** used to find the “**shortest** **distance**” **between** **clusters** and in the **shape** of the **clusters** they produce); see the sketch below:
- Nearest Neighbor
- Furthest Neighbor
- Average, aka UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
- Centroid Method
- Ward’s Method
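A sketch of these linkage rules in scipy (toy 2-D points, made up; each `method` string selects one of the techniques listed above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy 2-D data (made-up): two loose groups of points
X = np.array([[1.0, 1.0], [1.5, 1.0], [1.0, 1.5],
              [5.0, 5.0], [5.5, 5.0], [5.0, 5.5]])

# method='single'   -> nearest neighbor
# method='complete' -> furthest neighbor
# method='average'  -> UPGMA
# method='centroid' -> centroid method
# method='ward'     -> Ward's method
Z = linkage(X, method='ward')

dendrogram(Z)  # visualizes the nested cluster hierarchy
plt.show()
```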
232
Nearest neighbor
**Creates** **clusters** **based** **on** the **distance** between the two closest neighbors: find the shortest distance between two objects \>\> combine them into one cluster \>\> repeat \>\> the next shortest distance between two objects is found (which either expands the size of the first cluster or forms a new cluster between two objects).
233
Furthest Neighbor Method
**Produce**s **clust**ers by **measuring** the **distance** **between** the **most** **distant** pair of objects: the distance between two clusters is taken as the distance between their **furthest-apart** object pair. At each stage of hierarchical clustering, the **two closest** clusters (by this furthest-pair distance) are **merged** into a single cluster. **Sensitive** to **outliers**.
234
Average aka UPGMA
(**Unweigh**ted **P**air **G**roup Method with **A**rithmetic Mean) Merges objects by calculating the **distance** **between** two **clusters** as the **average** **distance** between **all** **objects** in **each** **cluster**, and **joining** the **closest** **cluster** **pair**. **Initially** it is **no** **different** from **nearest neighbors**, because the first clusters to be linked contain only one object each. **Once** **a cluster** includes **two or more objects** \> the **average** **distance** **between objects** **within** the **cluster** can be **measured**, which has an **impact** on **classification**.
235
Centroid Method
**Utilizes** the **object** **in** the **center** of each cluster (**centroid**) **to** **determine** the **distance** **between** **two clusters**. At **each** **step**, the two clusters whose **centroids** are measured to be **closest** together are **merged**.
236
Ward’s Method
Draws on the **sum of squares error** (**SSE**) between two **clusters** over all variables **to determ**ine the **distance** **between** **clusters**. **All possible** cluster **pairs** are considered \>\> for each, the **sum** of the **squared** **distance** across all clusters is **calculated**. At each round it merges the two **clusters** that **best minimize** the increase in **SSE** \>\> the pair of clusters that yields the lowest total sum of squares is selected and conjoined. **Produces** **clusters** relatively **equal** in **size** (which **may** **not** always be **effective**). **Can** be **sensitive** to **outliers**. **One** of the **most pop**ular **agglomerative** clustering methods in use today.
237
Measures of Distance why important?
Measurement method matters \>\> a **different** **method** \>\> a **different** **distance** \>\> can lead to different **classification** results \>\> impact on **cluster** composition [Measures of Distance.png](https://www.dropbox.com/s/9kw14c6cwqqxctb/Measures%20of%20Distance.png?dl=0)
238
Distance measurement methods
- **Euclidean distance** (standard across most industries, including machine learning and psychology)
- **Squared Euclidean** distance
- **Manhattan** **distance** (**reduces** the influence of **outliers**; **resembles** **walking** a **city** **block**)
- **Maximum distance**
- **Mahalanobis** distance (internal cluster distances tend to be emphasized; distances between clusters are less significant)
[Manhattan distance versus Euclidean distance.png](https://www.dropbox.com/s/9kw14c6cwqqxctb/Measures%20of%20Distance.png?dl=0)
239
Euclidean distance formula
[Euclidean distance formula.png](https://www.dropbox.com/s/wqelzvq883qtpak/Euclidean%20distance%20formula.png?dl=0)
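The linked image isn’t shown here; the standard Euclidean distance between points p and q in n dimensions is:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$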
240
Nearest Neighbor Exercise
[Nearest Neighbor Exercise.png](https://www.dropbox.com/s/xbeph2r9iv4nusr/Nearest%20Neighbor%20Exercise.png?dl=0)
241
Non-Hierarchical Clustering methods
(**Partitional clustering**) Different from hierarchical clustering and **common**ly used in **business** **analytics**. It **divides** **n** number of **objects** into **m** number of **clusters** (rather than nesting clusters inside larger clusters). **Each** **object** can **only** be assigned to **one cluster**, and **each cluster** is **discrete** (unlike hierarchical clustering) \>\> **no overlap** between **clusters** and **no case** of nesting a cluster **inside** **another**. \>\> Usually **faster** and requires **less storage** space **than** **hierarchical** methods \>\> (typically used in business scenarios). **Helps** to **select** the **optimal** **number** of **clusters** to perform **classification** (**rather** **than** mapping the hierarchy of relationships within a dataset using a **dendrogram** chart). [Non-Hierarchical Clustering methods.png](https://www.dropbox.com/s/4zbn5wlyq9bna48/Non-Hierarchical%20Clustering%20methods.png?dl=0)
242
Example of k-means clustering
[Example of k-means clustering.png](https://www.dropbox.com/s/5f0yiep8ajvm6tg/Example%20of%20k-means%20clustering.png?dl=0)
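A sketch of k-means in scikit-learn (toy points, made up; k must be chosen up front):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data (made-up): three loose groups of points
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 7], [8, 8], [9, 7],
              [0, 9], [1, 9], [0, 8]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # centroid of each cluster
```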
243
k-means clustering in a nutshell and downsides
Attempts to **split** data into **k number of clusters**. Downsides:
- It is **not** **always** **able** to reliably **identify** a **final** **comb**ination of **clusters** (you may need to **switch** **tactics** and utilize **another** **algorithm** to formulate your **classific**ation **model**).
- Measuring multiple distances between data points in a **three**- or **four-dimen**sional **space** (with **more** than **two** **variabl**es) is much more **complicated** and **time**-**consuming** to **compute**.
- Its **success** **depends** largely on the **quality** of **data**, and there’s **no mechanism** to **differentiate** between **relevant** and **irrelevant** **variables**; you must ensure the variables you selected are relevant, especially if chosen from a large pool of variables.
244
What are Measures of Spread?
(**Measures** of **dispersion**) **How** **wide** the **set** of **data** is. The most common basic measures are:
**The range** (including the **interquartile** range and the **interdecile** range): how much is in **between** the **lowest** value (**start**) and the **highest** value (**end**). (The **interquartile** **range** tells you the range of the **middle** **fifty** **percent** of a set of data.)
**The standard deviation**: the **square** **root** of **variance**; a measure of **how** **spread** out the data is **around** the center of the distribution (the **mean**). It gives you an idea of **where**, **percentage-wise**, **a** **certain** **value** **falls**. E.g. if you score **one SD above the mean** on a test (normally distributed, bell-shaped) \>\> your score is higher than about **84%** of test takers.
**The variance**: a very simple statistic that gives an **extremely** **rough** idea of **how spread** out a **data set** is. **As a measure** of spread, it’s actually pretty **weak**: a large variance **doesn’t** **tell** you **much** about the spread of data — other than that it’s big! The most important **reason** the variance **exists** \>\> **to** **find** the **SD** (**SD squared** \>\> **variance**).
**Quartiles**: divide your **data set into** **quarters** according to where the numbers fall on the number line. **Not** very **useful** on their **own** \>\> used to find **more** **useful** values like the **interquartile range**.
245
how to insert unicode character symbols?
x with overline [x̅]: Type the x then go to **Insert** \> **Symbol** In the **Character** **Viewer** select **Unicode** from the left list [You may have to click the **✲** to **Customize** the List] Select **Combining** **Diacritical** **Marks** in the top middle pane **Locate** & double-click the **Overline** [**U-0305**] in the lower middle pane [how to insert unicode character symbols.png](https://www.dropbox.com/s/1ado0j6lwu8qhfa/how%20to%20insert%20unicode%20character%20symbols.png?dl=0)
246
Variance summary
[Variance summary.png](https://www.dropbox.com/s/hlhp9cfd6gevx8e/Variance%20summary.png?dl=0)
247
population mean character
mu (μ)
248
sample mean character
x bar (x̅, x overline)
249
population variance character
sigma squared (σ²)
250
sample variance character
s squared (s²)
251
frequency distribution
a table dividing the data into groups (classes); shows how many data values occur in each group
252
Summary of clustering types
[Summary of clustering types.png](https://www.dropbox.com/s/1gle4k6syn9951p/Summary%20of%20clustering%20types.png?dl=0)
253
**Not** everyone who **has** the **symptoms** has cancer: **1/10,000** healthy individuals worldwide have the **same** **symptoms** but do not have cancer. The cancer **incidence** **rate** is **1/100,000**. **What** is the **probability** that a **patient** has **cancer**, given that they **have** the **symptoms**?
We need to designate the **A** and **B** events:
**P(A)**: probability of a real cancer case \>\> 1/100,000 = 0.00001 (implying probability of non-cancer: 1 - 0.00001 = 0.99999)
**P(B)**: probability of having the symptoms — includes the ones having cancer with symptoms and the ones with no cancer but with symptoms \>\> all true positives and the false positives
**P(A/B)**: this is the question — the probability of **real** **cancer** given the symptoms (different from 100%, because there is a probability that the symptoms are false positives coming from non-cancer cases)
**P(B/A)**: probability of symptoms given cancer \>\> 1
**P(B)**: the probability of symptoms (two components: actual cancer cases + falsely symptomatic people): 1/100,000 + 1/10,000
1. true positives: 1/100,000 = 0.00001
2. false positives: 1/10,000 = 0.0001
\>\> P(B) ≈ **0.00011** (from **1. + 2.**)
**P(A/B) = P(A) \* P(B/A) / P(B)** \>\> 0.00001 \* 1 / 0.00011 = 0.0909 = 9.1% [Bayes theorem example 2](https://www.dropbox.com/s/scgyznsz6qr0g6n/Bayes%20theorem%20example%202.png?dl=0)
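A sketch of the same Bayes calculation in code (numbers straight from the card):

```python
p_cancer = 1 / 100_000          # P(A): incidence rate
p_symp_if_cancer = 1.0          # P(B|A): every cancer case has symptoms
p_symp_if_healthy = 1 / 10_000  # healthy people with the same symptoms

# P(B): total probability of symptoms (true positives + false positives)
p_symp = p_cancer * p_symp_if_cancer + (1 - p_cancer) * p_symp_if_healthy

# Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)
print(p_cancer * p_symp_if_cancer / p_symp)  # ~0.0909, i.e. about 9.1%
```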
254
The entire output of a **factory** is produced on **three** **machines** (A, B, C). The three machines account for **20%**, **30%** and **50%** of the **factory** **output**. The **fraction** of **defective** **items** produced is **5%** for the first machine, **3%** for the second machine, and **1%** for the third machine. If an **item** is **chosen** at **random** **from** the **total** **output** and is **found** to be **defective**, **what** is the **probability** that it was **produced** **by** the **third** **machine** (C)?
Question reformulated: what is the **proportion** of defective items produced **by** **machine** **C** **among** **all** defective items?
**All** defective items: 0.05\*0.2 + 0.03\*0.3 + 0.01\*0.5 = **0.024** (2.4%)
Defective **items** by **machine** **C**: 0.01 \* 0.5 = 0.005 \>\> **0.5%**
Defective **items** by machine **C** **among** **all** defective items: 0.5% / 2.4% = 5/24 [Bayes theorem example 3.png](https://www.dropbox.com/s/74surffvi95qjig/Bayes%20theorem%20example%203.png?dl=0)
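As a worked restatement of the calculation above (Bayes' theorem with the law of total probability in the denominator):

$$P(C \mid D) = \frac{P(D \mid C)\,P(C)}{P(D)} = \frac{0.01 \times 0.5}{0.05 \times 0.2 + 0.03 \times 0.3 + 0.01 \times 0.5} = \frac{0.005}{0.024} = \frac{5}{24} \approx 0.208$$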
255
main problem with mean how to overcome?
The **mean** can be **highly** **sensitive** **to** **outliers**. (Statisticians sometimes use the trimmed mean, which is the mean obtained after **removing** **extreme** **values** at **both** the **high** and **low** **end** of the dataset, such as **removing** the **bottom** and **top** **2%** of **salary** **earners** in a national income survey.)
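A sketch with made-up salary data (trimming 10% from each end rather than 2%, since the toy list is small; `scipy.stats.trim_mean` does the trimming):

```python
from scipy import stats

# Made-up salaries with one extreme outlier
salaries = [28_000, 31_000, 34_000, 36_000, 39_000,
            41_000, 44_000, 47_000, 52_000, 900_000]

print(sum(salaries) / len(salaries))   # plain mean: 125,200, dragged up
print(stats.trim_mean(salaries, 0.1))  # trimmed mean: 40,500
```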
256
how do you label population variance?
sigma squared (σ²)
257
how do you label population standard deviation? sample SD?
population SD: sigma (σ); sample SD: **s**
258
Variance summary
[Variance summary.png](https://www.dropbox.com/s/hlhp9cfd6gevx8e/Variance%20summary.png?dl=0)