Statistics notes, 30 March 2020 - Flashcards
Data types
Categorical and numerical
types of Categorical data
Nominal, Ordinal
Nominal:
Named data which can be separated into discrete categories which do not overlap.
Ordinal:
the variables have natural, ordered categories, and the distances between the categories are not known.
types of numerical data
Discrete, continuous
Ordinal data
a categorical, statistical data type
the variables have natural, ordered categories, and the distances between the categories are not known.
data which is placed into order or scale (no standardised value for the difference)
(easy to remember because ordinal sounds like order).
e.g.: rating happiness on a scale of 1-10. (no standardised value for the difference from one score to the next)
Nominal Data
mytutor.co.uk
Named data which can be separated into discrete categories which do not overlap.
(e.g., gender: male and female; eye colour; hair colour)
An easy way to remember this type of data is that nominal sounds like named,
nominal = named.
Ordinal Data
mytutor.co.uk
Ordinal data:
placed into some kind of order or scale. (ordinal sounds like order).
e.g.:
rating happiness on a scale of 1-10. (In scale data there is no standardised value for the difference from one score to the next)
positions in a race (1st, 2nd, 3rd etc). (the runners are placed in order of who completed the race in the fastest time to the slowest time, but no standardised difference in time between the scores).
Interval Data
mytutor.co.uk
Interval data:
comes in the form of a numerical value where the difference between points is standardised and meaningful.
e.g.: temperature, the difference in temperature between 10-20 degrees is the same as the difference in temperature between 20-30 degrees.
can be negative
(ratio data can NOT)
Ratio Data
mytutor.co.uk
Ratio data:
much like interval data – numerical values where the difference between points is standardised and meaningful.
it must have a true zero >> not possible to have negative values in ratio data.
e.g.: height, be that in centimetres, metres, inches or feet. It is not possible to have a negative height.
(compare this to temperature: it is possible for the temperature to be -10 degrees, but nothing can be -10 inches tall)
inferential statistics
Population: an entire group of items, such as people, animals, transactions, or purchases >> Descriptive statistics applied if all values in the dataset are known.
>> often not possible or feasible to analyse the entire population >>
Sample: a selected subset, called a sample, is extracted from the population.
The selection of the sample data from the population is random >> Inferential statistics applied >> develop models to extrapolate from the sample data to draw inferences about the entire population (while accounting for the influence of randomness)
Quantitative analysis can be split into two major branches of statistics:
Descriptive statistics (if all values in the dataset are known)
Inferential statistics (extrapolates from the sample data to draw inferences about the entire population)
inferential
drawing inferences from evidence; deductive
Descriptive statistical analysis
As a critical distinction from inferential statistics, descriptive statistical analysis applies to scenarios where all values in the dataset are known.
Confidence, confidence level
Confidence is a measure to express how closely the sample results match the true value of the population.
Confidence level: 0% - 100%
95%: if we repeat the experiment numerous times (under the same conditions), the results will match that of the full population in 95% of all possible cases.
Hypothesis Testing
Hypothesis test:
evaluate two mutually exclusive statements to determine which statement is correct given the data presented.
incomplete dataset >> hypothesis testing is applied in inferential statistics to determine if there’s reasonable evidence from the sample data to infer that a particular condition holds true of the population.
null hypothesis
A hypothesis that the researcher attempts or wishes to “nullify.”
most of the world believed swans were white, and black swans didn’t exist inside the confines of mother nature. The null hypothesis was that all swans are white.
The term “null” does not mean “invalid” or associated with the value zero.
In hypothesis testing, the null hypothesis (H0)
In hypothesis testing, the null hypothesis (H0) is assumed to be the commonly accepted fact but that is simultaneously open to contrary arguments.
If substantial evidence to the contrary >> the null hypothesis is disproved or rejected >> the alternative hypothesis is accepted to explain a given phenomenon.
The alternative hypothesis
The alternative hypothesis is expressed as Ha or H1.
Covers all possible outcomes excluding the null hypothesis.
What is the relationship between the null hypothesis and alternative hypothesis?
null hypothesis and alternative hypothesis are mutually exclusive,
which means no result should satisfy both hypotheses.
a hypothesis statement must be
a hypothesis statement must be clear and simple. Hypotheses are also most effective when based on existing knowledge, intuition, or prior research.
Hypothesis statements are seldom chosen at random. A good hypothesis statement should be testable through an experiment, controlled test or observation.
(Designing an effective hypothesis test that reliably assesses your assumptions is complicated and even when implemented correctly can lead to unintended consequences.)
A clear hypothesis
A clear hypothesis tests only one relationship and avoids conjunctions such as “and,” “nor” and “or.”
A good hypothesis should include an “if” and “then” statement
(such as: If [I study statistics] then [my employment opportunities increase])
The good hypothesis sentence structure
The first half of this sentence structure generally contains an independent variable (this is the hypothesis) (i.e., if I study statistics);
the second half contains a dependent variable (what you’re attempting to predict) (i.e., employment opportunities).
A dependent variable represents
A dependent variable represents what you’re attempting to predict,
the 2nd half of the hypothesis sentence
The independent variable is
The independent variable (in the first half of the sentence) is the variable that supposedly impacts the outcome of the dependent variable (which is the 2nd half of the hypothesis sentence).
double-blind
where neither the participants nor the experimental team are aware of who is allocated to the experimental group and the control group respectively.
probability
probability expresses the likelihood of something happening, in percentage or decimal form; typically given as a number with a decimal value called a floating-point number.
odds
odds define the likelihood of an event occurring with respect to the number of occasions it does not occur.
For instance, the odds of selecting the ace of spades from a standard deck of 52 cards are 1 against 51. On 51 occasions a card other than the ace of spades will be selected from the deck.
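A minimal Python sketch (names are illustrative) relating this probability to the odds:

```python
# Relating the ace-of-spades probability (1/52) to its odds (1 against 51).
p = 1 / 52                   # probability of drawing the ace of spades
odds_against = (1 - p) / p   # occasions it does not occur per occasion it does
print(f"p = {p:.4f}, odds = 1 against {odds_against:.0f}")   # 1 against 51
```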
correlation
Correlation is often computed during the exploratory stage of analysis to understand general relationships between variables.
Correlation describes the tendency of change in one variable to reflect a change in another variable.
confounding variable
the observed correlation could be caused by a third and previously unconsidered variable,
aka lurking variable or confounding variable.
It’s important to consider variables that fall outside your hypothesis test as you prepare your research and before publishing your results.
to confuse, to perplex
confound
the curse of dimensionality
confusing correlation and causation arises when you analyze too many variables while looking for a match.
(In statistics, dimensions can also be referred to as variables. If we are analyzing three variables, the results fall into a three-dimensional space.)
You can find instances of the “curse” or phenomenon using Google Correlate (www.google.com/trends/correlate)
the curse of dimensionality tends to affect machine learning and data mining analysis more than traditional hypothesis testing due to the high number of variables under consideration. e.g:
It turns out that the Bang energy drink, for example, came onto the market at a similar time as Alibaba Cloud’s international product offering and then grew at a similar pace in terms of Google search volume.
curse
Data
A term for any value that describes the characteristics and attributes of an item that can be moved, processed, and analyzed.
The item could be a transaction, a person, an event, a result, a change in the weather, and infinite other possibilities.
Data can contain various sorts of information, and through statistical analysis, these recorded values can be better understood and used to support or debunk a research hypothesis.
Population
The parent group from which the experiment’s data is collected,
e.g., all registered users of an online shopping platform or all investors of cryptocurrency.
Sample
A subset of a population collected for the purpose of an experiment,
e.g., 10% of all registered users of an online shopping platform or 5% of all investors of cryptocurrency.
A sample is often used in statistical experiments for practical reasons, as it might be impossible or prohibitively expensive to directly analyze the full population.
Variable
A characteristic of an item from the population that varies in quantity or quality from another item,
e.g., the Category of a product sold on Amazon.
A variable that varies in regards to quantity and takes on numeric values is known as a quantitative variable,
e.g., the Price of a product.
A variable that varies in quality/class is called a qualitative variable,
e.g., the Product Name of an item sold on Amazon.
This process is often referred to as classification, as it involves assigning a class to a variable.
Variable types (what is the term for the process to establish types?)
quantitative variable (varies in regards to quantity and takes on numeric values),
qualitative variable (varies in quality/class),
classification
Discrete Variable
A variable that can only accept a finite number of values,
e.g., customers purchasing a product on Amazon.com can rate the product as 1, 2, 3, 4, or 5 stars. In other words, the product has five distinct rating possibilities, and the reviewer cannot submit their own rating value of 2.5 or 0.0009.
Helpful tip: qualitative variables are discrete,
e.g. name or category of a product.
Continuous Variable
A variable that can assume an infinite number of values,
e.g., depending on supply and demand, gold can be converted into unlimited possible values expressed in U.S. dollars.
A continuous variable can also assume values arbitrarily close together.
e.g.: price and reviews (number of reviews on a product) are continuous variables
Categorical Variables
A variable whose possible values consist of a discrete set of categories (such as gender or political allegiance),
rather than numbers quantifying values on a continuous scale.
Ordinal Variables
(a subcategory of categorical variables),
ordinal variables categorize values in a logical and meaningful sequence.
ordinal variables contain an intrinsic ordering or sequence such as {small; medium; large} or {dissatisfied; neutral; satisfied; very satisfied}.
The distance of separation between ordinal variables does not need to be consistent or quantified. (For example, the measurable gap in performance between a gold and silver medalist in athletics need not mirror the difference in performance between a silver and bronze medalist.)
(unlike standard categorical variables, e.g. gender or film genre, which have no intrinsic order)
Independent and Dependent Variables
An independent variable (expressed as X) is the variable that supposedly impacts the dependent variable (expressed as y).
For example, the supply of oil (independent variable) impacts the cost of fuel (dependent variable).
As the dependent variable is “dependent” on the independent variable, it is generally the independent variable that is tested in experiments. As the value of the independent variable changes, the effect on the dependent variable is observed and recorded.
In analyzing Amazon products, we could examine Category, Reviews and 2-Day Delivery as the independent variables and observe how changes in those variables affect the dependent variable of Price. Equally, we could select the Reviews variable as the dependent variable and examine Price, 2-Day Delivery and Category as the independent variables and observe how these variables influence the number of customer reviews.
What determines whether a variable is “independent” or “dependent”?
The labels of “independent” and “dependent” are hence determined by experiment design rather than inherent composition
(one variable could be a dependent variable in one study and an independent variable in another)
two events are considered independent if …
In probability,
two events are considered independent if the occurrence of one event does not influence the outcome of another event
(the outcome of one event, such as flipping a coin, doesn’t predict the outcome of another. If you flip a coin twice, the outcome of the first flip has no bearing on the outcome of the second flip)
P(E|F)
the probability of E given F
The probability of one event (E) given the occurrence of another conditional event (F) is expressed as P(E|F),
two events are said to be independent if ..
Conversely, two events are said to be independent if
P(E|F) = P(E).
This equation holds that the probability of E is the same irrespective of F being present.
This expression can also be tweaked to compare two sets of results where the conditional event (F) is absent from the second trial.
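A small simulation sketch, assuming two fair coin flips, showing that P(E|F) ≈ P(E) for independent events:

```python
import random

# F = "first flip is heads", E = "second flip is heads".
# For independent events the two estimates should roughly match.
random.seed(42)
flips = [(random.random() < 0.5, random.random() < 0.5) for _ in range(100_000)]

p_e = sum(second for _, second in flips) / len(flips)
e_when_f = [second for first, second in flips if first]
p_e_given_f = sum(e_when_f) / len(e_when_f)

print(f"P(E)   ~ {p_e:.3f}")          # ~0.5
print(f"P(E|F) ~ {p_e_given_f:.3f}")  # ~0.5 >> P(E|F) = P(E)
```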
Bayes’ theorem in nutshell
The premise of this theory is to find the probability of an event, based on prior knowledge of conditions potentially related to the event.
Bayes’ theorem “is to the theory of probability what the Pythagorean theorem is to geometry.”
For instance, if reading books is related to a person’s income level, then, using Bayes’ theory, we can assess the probability that a person enjoys reading books based on prior knowledge of their income level.
In the case of the 2012 U.S. election, Nate Silver drew from voter polls as prior knowledge to refine his predictions of which candidate would win in each state. Using this method, he was able to successfully predict the outcome of the presidential election vote in all 50 states.
Triboluminescence
Triboluminescence is the light emitted when crystals are crushed.
“When you take a lump of sugar and crush it with a pair of pliers in the dark, you can see a bluish flash. Some other crystals do that too.”
lump - a small compact mass
pliers - a hand tool for gripping
Bayes’ theorem formula
P(A|B) = P(A) * P(B|A) / P(B)
P(A|B) is the probability of A given that B happens (conditional probability)
P(A) is the probability of A (without any regard to whether event B has occurred (marginal probability)
P(B|A) is the probability of B given that A happens (conditional probability)
P(B) is the probability of B without any regard to whether event A has occurred (marginal probability)
Bayes’ theorem can be written in multiple formats, including the use of P(A ∩ B) (intersection) in place of P(A) * P(B|A).
https://www.dropbox.com/s/p8io6kx4d4d0vne/Bayes%20theorem%20formula.png?dl=0
conditional probability (and what is its opposite?)
Both P(A|B) and P(B|A)
are the conditional probability of observing one event given the occurrence of the other.
Both P(A) and P(B)
are marginal probabilities, which is the probability of a variable without reference to the values of other variables.
Let’s imagine a particular drug test is 99% accurate at detecting a subject as a drug user.
Suppose now that 5% of the population has consumed a banned drug.
How can Bayes’ theorem be applied to determine the probability that an individual, who has been selected at random from the population is a drug user if they test positive?
we need to designate A and B events:
P(A): real drug user probability and
P(B): probability of identifying someone as positive (even if in reality they are not >> all real positives from users plus the false positives from non-users)
P(A|B): this is the question; the probability that an individual with a positive test result is a real drug user
(different from 0.99 because there is a probability that the test shows a false positive result for non-users;
the test does not catch all positives either, but that is not important now)
P(A): probability of a real “drug user” >> 0.05 (implies probability of a non-user: 1 - 0.05 = 0.95)
P(B|A): probability of a positive test >> 0.99 (result given that the individual is a drug user)
P(B): the probability of a positive test result (two elements: actually identified real users + falsely positively identified non-users): 0.059
- actually identified real users: 0.05 * 0.99 = 0.0495
- falsely positively identified non-users: (1 - 0.05) * 0.01 = 0.95 * 0.01 = 0.0095
0.059 = 0.0495 + 0.0095 (sum of the two elements above)
P(A|B) = P(A) * P(B|A) / P(B) >> 0.05 * 0.99 / 0.059 = 0.839
P(user|positive test) = P(user) * P(positive test|user)/P(positive test)
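A short Python sketch reproducing the drug-test numbers above:

```python
# Bayes' theorem applied to the drug-test example.
p_user = 0.05               # P(A): prior probability of being a drug user
p_pos_given_user = 0.99     # P(B|A): probability of a positive test for a user
p_pos_given_nonuser = 0.01  # false-positive rate for non-users

# P(B): total probability of a positive test = true positives + false positives
p_pos = p_user * p_pos_given_user + (1 - p_user) * p_pos_given_nonuser

# P(A|B) = P(A) * P(B|A) / P(B)
p_user_given_pos = p_user * p_pos_given_user / p_pos
print(f"P(B) = {p_pos:.3f}")                         # 0.059
print(f"P(user|positive) = {p_user_given_pos:.3f}")  # ~0.839
```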
What is the implication of the false positive test results? How to deal with it?
Using Bayes’ theorem, we’re able to determine that (in the current example) there’s an 83.9% probability that an individual with a positive test result is an actual drug user.
The reason this prediction is lower for the general population than the successful detection rate of actual drug users or P (positive test | user), which was 99%,
is due to the occurrence of false-positive results.
Bayes’ theorem weakness
important to acknowledge that Bayes’ theorem can be a weak predictor in the case of poor data regarding prior knowledge and this should be taken into consideration.
Binomial Probability
used for interpreting scenarios with two possible outcomes.
(Pregnancy and drug tests both produce binomial outcomes in the form of negative and positive results, as does flipping a two-sided coin.)
The probability of success in a binomial experiment is expressed as p, and the number of trials is referred to as n.
drawing aggregated conclusions from multiple binomial experiments such as flipping consecutive heads using a fair coin?
you would need to calculate the likelihood of multiple independent events happening,
which is the product (multiplication) of their individual probabilities
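A minimal sketch of the product rule for consecutive heads with a fair coin:

```python
# The probability of n consecutive heads is the product of the
# individual probabilities (0.5 per independent flip).
p = 0.5
for n in (1, 2, 3, 4):
    print(f"P({n} heads in a row) = {p ** n}")  # 0.5, 0.25, 0.125, 0.0625
```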
Permutations
tool to assess the likelihood of an outcome.
although not a direct metric of probability,
permutations can be calculated to understand the total number of possible outcomes, which can be used for defining odds.
calculate the full number of permutations, which refers to the maximum number of possible outcomes from arranging multiple items
find the full number of seating combinations for a table of three
we can apply the function three-factorial,
which entails multiplying the total number of items by each discrete value below that number,
i.e., 3 x 2 x 1 = 6.
Four-factorial is
Four-factorial is
4 x 3 x 2 x 1 = 24
you want to know the full number of combinations for randomly picking a box trifecta,
which is a scenario where you select three horses to fill the first three finishers in any order.
A common scenario for using permutations is horse betting;
we’re calculating the total number of permutations
and also a subset of desired possibilities (recording a 1st place, recording a 2nd place, and recording a 3rd place finish).
The total number of combinations of where each of the 20 horses can finish is calculated as twenty-factorial.
We next need to divide twenty-factorial by
seventeen-factorial to ascertain all possible combinations of a top three placing.
Twenty-factorial / Seventeen-factorial = 6,840
Thus, there are 6,840 possible combinations among a 20-horse field that will offer you a box trifecta.
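A quick check of the arithmetic in Python (math.perm needs Python 3.8+):

```python
import math

# Box-trifecta count for a 20-horse field:
# 20! / 17! = 20 x 19 x 18 ordered top-three finishes.
print(math.factorial(20) // math.factorial(17))  # 6840
print(math.perm(20, 3))                          # 6840 (same partial permutation)
```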
CENTRAL TENDENCY
the central point of a given dataset,
aka central tendency measures.
the three primary measures of central tendency are the mean, mode, and median.
The Mean
Arithmetic mean (sum divided by the number of observations):
the midpoint of a dataset; the average of a set of values and the easiest central tendency measure to understand.
sum of all numeric values / the number of observations
trimmed mean
the mean can be highly sensitive to outliers.
(statisticians sometimes use the trimmed mean, which is the mean obtained after removing extreme values at both the high and low band of the dataset,
such as removing the bottom and top 2% of salary earners in a national income survey).
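A sketch contrasting the mean with a trimmed mean, assuming SciPy is available and using made-up salary figures:

```python
import statistics
from scipy import stats  # assumes SciPy is installed

# A small salary sample (in thousands) containing one extreme outlier.
salaries = [28, 31, 33, 35, 38, 40, 44, 47, 52, 900]
print(statistics.mean(salaries))       # 124.8, dragged upward by the outlier
print(stats.trim_mean(salaries, 0.1))  # 40.0, after trimming 10% off each end
```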
The Median
the median pinpoints the data point(s) located in the middle of the dataset to suggest a viable midpoint.
The median, therefore, occurs at the position in which exactly half of the data values are above and half are below when arranged in ascending or descending order.
The solution for an even number of data points is to calculate the average of the two middle points
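A minimal sketch of the odd/even median behaviour:

```python
import statistics

# With an even number of data points, the median is the average
# of the two middle values.
print(statistics.median([1, 3, 5]))     # 3
print(statistics.median([1, 3, 5, 7]))  # 4.0 (average of 3 and 5)
```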
Is the median or the mean better?
The mean and median sometimes produce similar results, but, in general,
the median is a better measure of central tendency than the mean for data that is asymmetrical as it is less susceptible to outliers and anomalies.
The median is a more reliable metric for skewed (asymmetric) data
The Mode
statistical technique to measure central tendency
The mode is the data point in the dataset that occurs most frequently.
discrete categorical values
a variable that can only accept a finite number of values
ordinal values
the categorization of values in a clear sequence
(such as a 1 to 5-star rating system on Amazon)
Why is the mode advantageous?
easy to locate in datasets with a low number of discrete
categorical values (a variable that can only accept a finite number of values) or
ordinal values (the categorization of values in a clear sequence)
Why can the mode be disadvantageous?
The effectiveness of the mode can be arbitrary and depends heavily on the composition of the data.
The mode, for instance, can be a poor predictor for datasets that do not have a single most common discrete outcome (e.g., when all star ratings occur at about the same frequency).
Weighted Mean
a statistical measure of central tendency that factors in the weight of each data point when computing the mean.
used when you want to emphasize a particular segment of data without disregarding the rest of the dataset.
e.g.: students’ grades, with the final exam accounting for 70% of the total grade.
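A minimal sketch of the weighted mean, assuming hypothetical scores and a 30/70 coursework/final split:

```python
# Each score is multiplied by its weight before averaging.
scores = [82, 91]        # coursework score, final exam score
weights = [0.30, 0.70]   # final exam counts for 70% of the grade
weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
print(round(weighted_mean, 1))   # 88.3
```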
What is a suitable measure of central tendency?
depends on the composition of the data.
The mode: easy to locate in datasets with a low number of discrete values or ordinal values,
The mean and median: suitable for datasets that contain continuous variables.
The weighted mean: used when you want to emphasize a particular segment of data without disregarding the rest of the dataset.
MEASURES OF SPREAD
describes how data varies
The composition of two datasets can be very different despite the fact that each dataset has the same mean.
The critical point of difference is the range of the datasets, which is a simple measurement of data variance.
range of the datasets
As the difference between the highest value (maximum) and the lowest value (minimum),
the range is calculated by subtracting the minimum from the maximum.
knowing the range for the dataset can be useful for data screening and identifying errors.
An extreme minimum or maximum value, for example, might indicate a data entry error, such as the inclusion of a measurement in meters in the same column as other measurements expressed in kilometers.
Standard Deviation
describes the extent to which individual observations differ from the mean.
the standard deviation is a measure of the spread or dispersion among data points, just as important as central tendency measures for understanding the underlying shape of the data.
How does standard deviation measure variability?
Standard deviation measures variability
by calculating the square root of the average squared distance of all data observations from the mean of the dataset.
What do low/high standard deviation values mean?
the lower the standard deviation, the less variation in the data
When SD is a lower number (relative to the mean of the dataset) >> it indicates that most of the data values are clustered closely together,
whereas a higher value indicates a higher level of variation and spread.
what counts as a low or high standard deviation value depends on the dataset (on the mean, on the range, and even on the variability of the values in the dataset)
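A minimal sketch computing the standard deviation of a made-up sample:

```python
import statistics

# SD = square root of the average squared distance from the mean.
data = [4, 8, 6, 5, 3, 7]
print(statistics.fmean(data))   # 5.5 (the mean)
print(statistics.pstdev(data))  # ~1.71 (population SD, divides by n)
print(statistics.stdev(data))   # ~1.87 (sample SD, divides by n - 1)
```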
histogram
a visual technique for interpreting data variance: plot the dataset’s distribution of values
what is standard normal distribution?
A normal distribution with a
mean of 0 and a
standard deviation of 1
What histogram shape does a normal distribution produce?
data is distributed symmetrically >> a bell curve
A symmetrical bell curve of a standard normal model
Normal distribution can be transformed to a standard normal distribution by ..
converting the original values to standardized scores
normal distribution features:
- the highest point of the dataset occurs at the mean (x̄).
- the curve is symmetrical around an imaginary line that lies at the mean.
- at its outermost ends, the curves approach but never quite touch or cross the horizontal axis.
- the locations at which the curve transitions from upward to downward cupping (known as inflection points) occur one standard deviation above and below the mean.
how do variables diverge in the real world?
The symmetrical shape of the normal distribution is often a reasonable description.
(body height, IQ tests, variable values generally gravitate towards a symmetrical shape around the mean as more cases are added)
Empirical Rule
variables in the real world often spread in the symmetrical shape of a normal distribution; the Empirical Rule quantifies that spread
How does the Empirical Rule describe a normal distribution?
Approximately 68% of values fall within one standard deviation of the mean.
Approximately 95% of values fall within two standard deviations of the mean.
Approximately 99.7% of values fall within three standard deviations of the mean.
Aka the 68-95-99.7 Rule or the Three Sigma Rule
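A quick check of the rule against the standard normal CDF, assuming SciPy is available:

```python
from scipy import stats  # assumes SciPy is installed

# Share of a normal distribution within k standard deviations of the mean.
for k in (1, 2, 3):
    share = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} SD: {share:.4f}")  # 0.6827, 0.9545, 0.9973
```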
What did the French mathematician Abraham de Moivre discover?
Following an empirical experiment flipping a two-sided coin, de Moivre discovered that
an increase in events (coin flips) gradually leads to a symmetrical curve of binomial distribution.
What is Binomial distribution?
It describes a statistical scenario where only one of two mutually exclusive outcomes of a trial is possible (i.e., a head or a tail, true or false).
Total possible outcomes for the number of heads when flipping four standard coins
In the flipping experiment with 4 coins:
the histogram has five possible outcomes
the probability of most outcomes is now lower.
the more data >> the histogram contorts into a symmetrical bell-shape.
As more data is collected >> more observations settle in the middle of the bell curve, a smaller proportion of observations land on the left and right tails of the curve.
The histogram eventually produces approximately 68% of values within one standard deviation of the mean.
Using the histogram, we can pinpoint the probability of a given outcome such as two heads (37.5%) and whether that outcome is common or uncommon compared to other results—a potentially useful piece of information for gamblers and other prediction scenarios.
It’s also interesting to note that the mean, median, and mode all occur at the same point on the curve as this location is both the symmetrical center and the most common point. However, not all frequency curves produce a normal distribution.
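A minimal sketch reproducing the five-outcome probabilities:

```python
from math import comb

# P(k heads in 4 fair flips) = C(4, k) / 2^4.
for k in range(5):
    print(f"P({k} heads) = {comb(4, k) / 16:.4f}")
# P(2 heads) = 6/16 = 0.3750, the 37.5% outcome mentioned above
```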
MEASURES OF POSITION
on a normal curve there’s a decreasing likelihood of replicating a result the further that observed data point is from the mean.
We can also assess whether that data point is approximately
one (68%), two (95%) or three standard deviations (99.7%) from the mean.
This, however, doesn’t tell us the probability of replicating the result.
we want to identify the probability of replicating a result.
How to identify the probability of replicating a result?
Depending on the size of the dataset: Z-Score
Z-Score
finds the distance from the sample’s mean to an individual data point expressed in units of standard deviation.
A Z-score of 2.96 means ..
the data point is located 2.96 standard deviations from the mean in the positive direction.
This data point could also be considered an anomaly as it is close to three deviations from the mean and different from other data points.
A Z-score of -0.42 means ..
the data point is positioned 0.42 standard deviations from the mean in the negative direction,
(this data point is lower than the mean)
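A minimal sketch computing a Z-score for a made-up sample:

```python
import statistics

# Z-score: an observation's distance from the sample mean,
# expressed in units of standard deviation.
data = [12, 15, 14, 10, 18, 13, 16, 11, 14, 17]
mean = statistics.fmean(data)   # 14.0
sd = statistics.stdev(data)     # ~2.58
print((18 - mean) / sd)         # ~1.55: above the mean (positive direction)
```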
anomaly
if the Z-score falls three or more deviations from the mean, positive or negative (in the case of a normal distribution) >> anomaly
>> data points that lie an abnormal distance from other data points. >> a rare event that is abnormal and perhaps should not have occurred.
in the case of a normal distribution, if the Z-score falls three positive or negative deviations from the mean of the dataset, it falls beyond 99.7% of the other data points on the normal distribution curve.
sometimes viewed as a negative exception, such as fraudulent behavior or an environmental crisis.
help to identify data entry errors and are commonly used in fraud detection to identify illegal activities.
Outliers
no unified agreement on how to define outliers, but:
data points that diverge from primary data patterns; they record unusual scores on at least one variable and are more plentiful than anomalies.
Z-Score applies to..
a normally distributed sample
with a known standard deviation of the population.
When to use T-Score?
sometimes the mean isn’t normally distributed or the
standard deviation of the population is unknown or not reliable,
<< which could be due to insufficient sampling (small sample size)
What is the problem with small datasets?
The standard deviation of small datasets is susceptible to change as more observations are included
T-Score who, when discovered, how else called?
English statistician W. S. Gosset (who worked at the Guinness brewery in Dublin), in the early 20th century, published under the pen name “Student” >>
sometimes called “Student’s T-distribution.”
What distributions do the Z-score / T-score use?
Z-distribution / T-distribution (Student’s T-distribution)
What is the Z-score’s and T-score’s primary function?
They share the same primary function (measuring distance from the mean within a distribution), but they’re used with different sizes of sample data.
What is Z-distribution?
standard normal distribution
What does the Z-score measure?
the deviation of an individual data point from the mean for datasets with 30 or more observations
based on Z-distribution (standard normal distribution).
T-distribution features
the T-distribution is not one fixed bell curve; rather, its distribution curve changes (taking multiple shapes) in accordance with the size of the sample.
- if the sample size is small, (e.g. 10): >> the curve is relatively flat with a high proportion of data points in the curve’s tails.
- as the sample size increases >> the distribution curve approaches the standard normal curve (Z-distribution) with more data points closer to the mean at the center of the curve.
A standard normal curve is defined by…
by the 68-95-99.7 rule,
which sets approximate confidence levels for one, two, and three standard deviations from a mean of 0.
Based on this rule, 95% of data points will fall within 1.96 standard deviations of the mean.
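A quick comparison of Z- and T-distribution cut-offs, assuming SciPy is available:

```python
from scipy import stats  # assumes SciPy is installed

# Cut-offs for a central 95%: the normal curve gives 1.96; small-sample
# T-curves demand wider cut-offs that approach 1.96 as the sample grows.
print(stats.norm.ppf(0.975))   # ~1.96
for df in (9, 29, 299):        # degrees of freedom ~ sample size - 1
    print(df, round(stats.t.ppf(0.975, df), 3))  # 2.262, 2.045, 1.968
```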