statistics notes 2020 march 30 Flashcards
Data types
Categorical and numerical
types of Categorical data
Nominal, Ordinal
Nominal:
Named data which can be separated into discrete categories which do not overlap.
Ordinal:
the variables have natural, ordered categories and the distances between the categories is not known.
types of numerical data
Discrete, continuous
Ordinal data
a categorical, statistical data type
the variables have natural, ordered categories and the distances between the categories is not known.
data which is placed into order or scale (no standardised value for the difference)
(easy to remember because ordinal sounds like order).
e.g.: rating happiness on a scale of 1-10. (no standardised value for the difference from one score to the next)
Nominal Data
mytutor.co.uk
Named data which can be
separated into discrete categories which do not overlap.
(e.g. gender; male and female) (eye colour and hair colour)
An easy way to remember this type of data is that nominal sounds like named,
nominal = named.
Ordinal Data
Ordinal data:
placed into some kind of order or scale. (ordinal sounds like order).
e.g.:
rating happiness on a scale of 1-10. (In scale data there is no standardised value for the difference from one score to the next)
positions in a race (1st, 2nd, 3rd etc). (the runners are placed in order of who completed the race in the fastest time to the slowest time, but no standardised difference in time between the scores).
Interval Data
Interval data:
comes in the form of a numerical value where the difference between points is standardised and meaningful.
e.g.: temperature, the difference in temperature between 10-20 degrees is the same as the difference in temperature between 20-30 degrees.
can be negative
(ratio data can NOT)
Ratio Data
Ratio data:
much like interval data – numerical values where the difference between points is standardised and meaningful.
it must have a true zero >> not possible to have negative values in ratio data.
e.g.: height, be that centimetres, metres, inches or feet. It is not possible to have a negative height.
(compare this to temperature: it is possible for the temperature to be -10 degrees, but nothing can be -10 inches tall)
inferential statistics
Population: an entire group of items, such as people, animals, transactions, or purchases >> Descriptive statistics applied if all values in the dataset are known.
>> often not possible or feasible to analyse the entire population >>
Sample: a selected subset, called a sample, is extracted from the population.
The selection of the sample data from the population is random >> Inferential statistics applied >> develop models to extrapolate from the sample data to draw inferences about the entire population (while accounting for the influence of randomness)
Quantitative analysis can be split into two major branches of statistics:
Descriptive statistics (if all values in the dataset are known)
Inferential statistics (extrapolates from the sample data to draw inferences about the entire population)
inferential
drawing conclusions from evidence; deductive
Descriptive statistical analysis
As a critical distinction from inferential statistics, descriptive statistical analysis applies to scenarios where all values in the dataset are known.
Confidence, confidence level
Confidence is a measure to express how closely the sample results match the true value of the population.
Confidence level: 0% - 100%
95%: if we repeat the experiment numerous times (under the same conditions), the results will match that of the full population in 95% of all possible cases.
Hypothesis Testing
Hypothesis test:
evaluate two mutually exclusive statements to determine which statement is correct given the data presented.
incomplete dataset >> hypothesis testing is applied in inferential statistics to determine if there’s reasonable evidence from the sample data to infer that a particular condition holds true of the population.
null hypothesis
A hypothesis that the researcher attempts or wishes to “nullify.”
most of the world believed swans were white, and that black swans didn’t exist in nature. The null hypothesis was that all swans are white.
The term “null” does not mean “invalid” or associated with the value zero.
In hypothesis testing, the null hypothesis (H0)
In hypothesis testing, the null hypothesis (H0) is assumed to be the commonly accepted fact, but one that is simultaneously open to contrary arguments.
If there is substantial evidence to the contrary >> the null hypothesis is disproved or rejected >> the alternative hypothesis is accepted to explain the given phenomenon.
The alternative hypothesis
The alternative hypothesis is expressed as Ha or H1.
Covers all possible outcomes excluding the null hypothesis.
What is the relationship between the null hypothesis and alternative hypothesis?
null hypothesis and alternative hypothesis are mutually exclusive,
which means no result should satisfy both hypotheses.
a hypothesis statement must be
a hypothesis statement must be clear and simple. Hypotheses are also most effective when based on existing knowledge, intuition, or prior research.
Hypothesis statements are seldom chosen at random. a good hypothesis statement should be testable through an experiment, controlled test or observation.
(Designing an effective hypothesis test that reliably assesses your assumptions is complicated and even when implemented correctly can lead to unintended consequences.)
A clear hypothesis
A clear hypothesis tests only one relationship and avoids conjunctions such as “and,” “nor” and “or.”
A good hypothesis should include an “if” and “then” statement
(such as: If [I study statistics] then [my employment opportunities increase])
The good hypothesis sentence structure
The first half of this sentence structure generally contains an independent variable (this is the hypothesis) (i.e., if I study statistics);
second half: a dependent variable (what you’re attempting to predict) (i.e., employment opportunities).
A dependent variable represents
A dependent variable represents what you’re attempting to predict,
2nd half of the hypothesis sentence
The independent variable is
The independent variable (in the first half of the sentence) is the variable that supposedly impacts the outcome of the dependent variable (which is in the 2nd half of the hypothesis sentence)
double-blind
where both the participants and the experimental team aren’t aware of who is allocated to the experimental group and the control group respectively.
probability
probability expresses the likelihood of something happening, in percentage or decimal form; typically written as a number with a decimal value called a floating-point number.
odds
odds define the likelihood of an event occurring with respect to the number of occasions it does not occur.
For instance, the odds of selecting an ace of spades from a standard deck of 52 cards is 1 against 51. On 51 occasions a card other than the ace of spades will be selected from the deck.
correlation
Correlation is often computed during the exploratory stage of analysis to understand general relationships between variables.
Correlation describes the tendency of change in one variable to reflect a change in another variable.
confounding variable
the observed correlation could be caused by a third and previously unconsidered variable,
aka lurking variable or confounding variable.
It’s important to consider variables that fall outside your hypothesis test as you prepare your research and before publishing your results.
to confuse, to mix up
confound
the curse of dimensionality
the risk of confusing correlation and causation arises when you analyze too many variables while looking for a match.
(In statistics, dimensions can also be referred to as variables.
If we are analyzing three variables, the results fall into a three-dimensional space.)
You can find instances of the “curse” or phenomenon using Google Correlate (www.google.com/trends/correlate)
the curse of dimensionality tends to affect machine learning and data mining analysis more than traditional hypothesis testing due to the high number of variables under consideration. e.g:
It turns out that the Bang energy drink, for example, came onto the market at a similar time as Alibaba Cloud’s international product offering and then grew at a similar pace in terms of Google search volume.
curse
Data
A term for any value that describes the characteristics and attributes of an item that can be moved, processed, and analyzed.
The item could be a transaction, a person, an event, a result, a change in the weather, and infinite other possibilities.
Data can contain various sorts of information, and through statistical analysis, these recorded values can be better understood and used to support or debunk a research hypothesis.
Population
The parent group from which the experiment’s data is collected,
e.g., all registered users of an online shopping platform or all investors of cryptocurrency.
Sample
A subset of a population collected for the purpose of an experiment,
e.g., 10% of all registered users of an online shopping platform or 5% of all investors of cryptocurrency.
A sample is often used in statistical experiments for practical reasons, as it might be impossible or prohibitively expensive to directly analyze the full population.
Variable
A characteristic of an item from the population that varies in quantity or quality from another item,
e.g., the Category of a product sold on Amazon.
A variable that varies in regards to quantity and takes on numeric values is known as a quantitative variable,
e.g., the Price of a product.
A variable that varies in quality/class is called a qualitative variable,
e.g., the Product Name of an item sold on Amazon.
Assigning a class to a qualitative variable is often referred to as classification.
Variable types (what is the term for the process to establish types?)
quantitative variable (varies in regards to quantity and takes on numeric values),
qualitative variable (varies in quality/class),
classification
Discrete Variable
A variable that can only accept a finite number of values,
e.g., customers purchasing a product on Amazon.com can rate the product as 1, 2, 3, 4, or 5 stars. In other words, the product has five distinct rating possibilities, and the reviewer cannot submit their own rating value of 2.5 or 0.0009.
Helpful tip: qualitative variables are discrete,
e.g. name or category of a product.
Continuous Variable
A variable that can assume an infinite number of values,
e.g., depending on supply and demand, gold can be converted into unlimited possible values expressed in U.S. dollars.
A continuous variable can also assume values arbitrarily close together.
e.g.: price and reviews (number of reviews on a product) are continuous variables
Categorical Variables
A variable whose possible values consist of a discrete set of categories
(such as gender or political allegiance),
rather than numbers quantifying values on a continuous scale.
Ordinal Variables
(a subcategory of categorical variables),
ordinal variables categorize values in a logical and meaningful sequence.
ordinal variables contain an intrinsic ordering or sequence such as {small; medium; large} or {dissatisfied; neutral; satisfied; very satisfied}.
The distance of separation between ordinal variables does not need to be consistent or quantified. (For example, the measurable gap in performance between a gold and silver medalist in athletics need not mirror the difference in performance between a silver and bronze medalist.)
standard categorical variables, e.g. gender or film genre, have no such intrinsic ordering.
Independent and Dependent Variables
An independent variable (expressed as X) is the variable that supposedly impacts the dependent variable (expressed as y).
For example, the supply of oil (independent variable) impacts the cost of fuel (dependent variable).
As the dependent variable is “dependent” on the independent variable, it is generally the independent variable that is tested in experiments. As the value of the independent variable changes, the effect on the dependent variable is observed and recorded.
In analyzing Amazon products, we could examine Category, Reviews and 2-Day Delivery as the independent variables and observe how changes in those variables affect the dependent variable of Price. Equally, we could select the Reviews variable as the dependent variable and examine Price, 2-Day Delivery and Category as the independent variables and observe how these variables influence the number of customer reviews.
What determines whether a variable is “independent” or “dependent”?
The labels of “independent” and “dependent” are hence determined by experiment design rather than inherent composition
(one variable could be a dependent variable in one study and an independent variable in another)
two events are considered independent if …
In probability,
two events are considered independent if the occurrence of one event does not influence the outcome of another event
(the outcome of one event, such as flipping a coin, doesn’t predict the outcome of another. If you flip a coin twice, the outcome of the first flip has no bearing on the outcome of the second flip)
P(E|F)
the probability of E given F
The probability of one event (E) given the occurrence of another conditional event (F) is expressed as P(E|F),
two events are said to be independent if ..
Conversely, two events are said to be independent if
P(E|F) = P(E).
This equation holds that the probability of E is the same irrespective of F being present.
This expression can also be tweaked to compare two sets of results where the conditional event (F) is absent from the second trial.
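A minimal Python sketch (illustrative only) of the independence check above, enumerating two fair coin flips where E = second flip is heads and F = first flip is heads:

```python
from itertools import product

# Enumerate the sample space of two fair coin flips.
space = list(product("HT", repeat=2))  # [('H','H'), ('H','T'), ('T','H'), ('T','T')]

E = [s for s in space if s[1] == "H"]                        # E: second flip is heads
F = [s for s in space if s[0] == "H"]                        # F: first flip is heads
E_and_F = [s for s in space if s[0] == "H" and s[1] == "H"]  # both events occur

p_E = len(E) / len(space)            # P(E)   = 2/4 = 0.5
p_E_given_F = len(E_and_F) / len(F)  # P(E|F) = 1/2 = 0.5
print(p_E, p_E_given_F)  # 0.5 0.5 -> P(E|F) = P(E): the events are independent
```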
Bayes’ theorem in nutshell
The premise of this theory is to find the probability of an event, based on prior knowledge of conditions potentially related to the event.
Bayes’ theorem “is to the theory of probability what the Pythagorean theorem is to geometry.”
For instance, if reading books is related to a person’s income level, then, using Bayes’ theorem, we can assess the probability that a person enjoys reading books based on prior knowledge of their income level.
In the case of the 2012 U.S. election, Nate Silver drew from voter polls as prior knowledge to refine his predictions of which candidate would win in each state. Using this method, he was able to successfully predict the outcome of the presidential election vote in all 50 states.
Triboluminescence
Triboluminescence is the light emitted when crystals are crushed.
“When you take a lump of sugar and crush it with a pair of pliers in the dark, you can see a bluish flash. Some other crystals do that too.”
lump - a small compact mass
pliers - a hand tool for gripping
Bayes’ theorem formula
P(A|B) = P(A) * P(B|A) / P(B)
P(A|B) is the probability of A given that B happens (conditional probability)
P(A) is the probability of A (without any regard to whether event B has occurred (marginal probability)
P(B|A) is the probability of B given that A happens (conditional probability)
P(B) is the probability of B without any regard to whether event A has occurred (marginal probability)
Bayes’ theorem can be written in multiple formats, including the use of P(A∩B) (intersection) in place of P(A) * P(B|A).
https://www.dropbox.com/s/p8io6kx4d4d0vne/Bayes%20theorem%20formula.png?dl=0

conditional probability (and what is opposite?)
Both P(A|B) and P(B|A)
are the conditional probability of observing one event given the occurrence of the other.
Both P(A) and P(B)
are marginal probabilities, which is the probability of a variable without reference to the values of other variables.
Let’s imagine a particular drug test is 99% accurate at detecting a subject as a drug user.
Suppose now that 5% of the population has consumed a banned drug.
How can Bayes’ theorem be applied to determine the probability that an individual, who has been selected at random from the population is a drug user if they test positive?
we need to designate events A and B:
P(A): the probability of being a real drug user, and
P(B): the probability of someone testing positive (even if in reality they are not a user >> all real positives from users plus the false positives from non-users)
P(A|B): this is the question; the probability that someone is a real drug user given that they tested positive
(different from 0.99 because there is a probability that the test shows a false positive result for non-users)
(the test does not catch all users either, but that is not important now)
P(A): probability of a real “drug user” >> 0.05 (implies probability of a non-user: 1 - 0.05 = 0.95)
P(B|A): probability of a positive test result given that the individual is a drug user >> 0.99
P(B): the probability of a positive test result (two elements: correctly identified real users + falsely identified non-users): 0.059
- correctly identified real users: 0.05 * 0.99 = 0.0495
- falsely identified non-users: (1 - 0.05) * 0.01 = 0.95 * 0.01 = 0.0095
0.059 = 0.0495 + 0.0095
P(A|B) = P(A) * P(B|A) / P(B) >> 0.05 * 0.99 / 0.059 = 0.839
P(user|positive test) = P(user) * P(positive test|user)/P(positive test)
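The worked example above can be reproduced in a few lines of Python; the 1% false-positive rate is the assumption implied by the “99% accurate” test:

```python
p_user = 0.05               # P(A): prior probability of being a drug user
p_pos_given_user = 0.99     # P(B|A): test detects a real user
p_pos_given_nonuser = 0.01  # false-positive rate (assumption from the card)

# P(B): total probability of a positive test (true positives + false positives)
p_pos = p_user * p_pos_given_user + (1 - p_user) * p_pos_given_nonuser

# Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)
p_user_given_pos = p_user * p_pos_given_user / p_pos

print(round(p_pos, 4))             # 0.059
print(round(p_user_given_pos, 3))  # 0.839
```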

What is the implication of the false positive test results? How to deal with it?
Using Bayes’ theorem, we’re able to determine that (in the current example) there’s an 83.9% probability that an individual with a positive test result is an actual drug user.
The reason this probability is lower for the general population than the test’s successful detection rate for actual drug users, P(positive test | user) = 99%,
is the occurrence of false-positive results.
Bayes’ theorem weakness
important to acknowledge that Bayes’ theorem can be a weak predictor in the case of poor data regarding prior knowledge and this should be taken into consideration.
Binomial Probability
used for interpreting scenarios with two possible outcomes.
(Pregnancy and drug tests both produce binomial outcomes in the form of negative and positive results, and so too flipping a two-sided coin.)
The probability of success in a binomial experiment is expressed as p, and the number of trials is referred to as n.
drawing aggregated conclusions from multiple binomial experiments such as flipping consecutive heads using a fair coin?
you would need to calculate the likelihood of multiple independent events happening,
which is the product (multiplication) of their individual probabilities
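A quick sketch of the multiplication rule above, for the chance of flipping three consecutive heads with a fair coin:

```python
# Probability of three consecutive heads: independent events multiply.
p = 0.5  # probability of heads on a single fair-coin flip
n = 3    # number of consecutive flips

p_all_heads = p ** n  # 0.5 * 0.5 * 0.5
print(p_all_heads)    # 0.125
```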
Permutations
tool to assess the likelihood of an outcome.
not a direct metric of probability,
permutations can be calculated to understand the total number of possible outcomes, which can be used for defining odds.
calculate the full number of permutations, which refers to the maximum number of possible outcomes from arranging multiple items
find the full number of seating combinations for a table of three
we can apply the function three-factorial,
which entails multiplying the total number of items by each discrete value below that number,
i.e., 3 x 2 x 1 = 6.
Four-factorial is
Four-factorial is
4 x 3 x 2 x 1 = 24
you want to know the full number of combinations for randomly picking a box trifecta,
which is a scenario where you select three horses to fill the first three finishers in any order.
using permutations is for horse betting;
we’re calculating the total number of permutations
and also a
subset of desired possibilities (recording a 1st place, recording a 2nd place, and recording a 3rd place finish).
The total number of ordered arrangements of all 20 finishing positions is twenty-factorial.
We next need to divide twenty-factorial by
seventeen-factorial to ascertain all possible ordered top-three placings.
Twenty-factorial / Seventeen-factorial = 20 x 19 x 18 = 6,840
Thus, there are 6,840 possible ordered top-three finishes in a 20-horse field that will offer you a box trifecta.
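Both factorial calculations above can be checked with Python’s standard library:

```python
import math

# Seating arrangements for a table of three: 3! = 3 x 2 x 1
print(math.factorial(3))  # 6

# Box trifecta in a 20-horse field: ordered top-three finishes,
# 20! / 17! = 20 x 19 x 18
trifecta = math.factorial(20) // math.factorial(17)
print(trifecta)  # 6840
```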
CENTRAL TENDENCY
the central point of a given dataset,
aka central tendency measures.
the three primary measures of central tendency are the mean, mode, and median.
The Mean
Arithmetic mean (sum divided by the number of observations):
the midpoint of a dataset,
the average of a set of values and the easiest central tendency measure to understand.
sum of all numeric values / number of observations
trimmed mean
the mean can be highly sensitive to outliers.
(statisticians sometimes use the trimmed mean, which is the mean obtained after removing extreme values at both the high and low band of the dataset,
such as removing the bottom and top 2% of salary earners in a national income survey).
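A small sketch of the idea: one extreme salary drags the plain mean upward, while a trimmed mean (the helper below is illustrative; scipy.stats.trim_mean does the same job) stays near the bulk of the data. The salary figures are invented:

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values at each extreme."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # how many values to drop on each side
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

salaries = [28, 30, 31, 32, 33, 34, 35, 36, 38, 900]  # one extreme outlier
print(sum(salaries) / len(salaries))  # plain mean: 119.7, pulled up by the outlier
print(trimmed_mean(salaries, 0.10))   # drop top/bottom 10%: 33.625
```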
The Median
the median pinpoints the data point(s) located in the middle of the dataset to suggest a viable midpoint.
The median, therefore, occurs at the position in which exactly half of the data values are above and half are below when arranged in ascending or descending order.
The solution for an even number of data points is to calculate the average of the two middle points
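The even-count case can be seen with Python’s statistics module, which averages the two middle points automatically:

```python
import statistics

odd_data = [3, 5, 9]       # odd count: the middle value is the median
even_data = [3, 5, 7, 9]   # even count: average the two middle points

print(statistics.median(odd_data))   # 5
print(statistics.median(even_data))  # 6.0, i.e. (5 + 7) / 2
```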
The Median or mean is better?
The mean and median sometimes produce similar results, but, in general,
the median is a better measure of central tendency than the mean for data that is asymmetrical as it is less susceptible to outliers and anomalies.
The median is a more reliable metric for skewed (asymmetric) data
The Mode
statistical technique to measure central tendency
The mode is the data point in the dataset that occurs most frequently.
discrete categorical values
a variable that can only accept a finite number of values
ordinal values
the categorization of values in a clear sequence
(such as a 1 to 5-star rating system on Amazon)
Why is the mode advantageous?
easy to locate in datasets with a low number of discrete
categorical values (a variable that can only accept a finite number of values) or
ordinal values (the categorization of values in a clear sequence)
Why can the mode be disadvantageous?
The effectiveness of the mode can be arbitrary and depends heavily on the composition of the data.
The mode, for instance, can be a poor predictor for datasets that do not have a single high number of common discrete outcomes (all star values have about the same %)
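For discrete values such as star ratings, the mode is one library call (the ratings below are invented):

```python
import statistics

ratings = [5, 4, 5, 3, 5, 2, 4]  # star ratings: a small set of discrete values
print(statistics.mode(ratings))  # 5 -> the most frequently occurring rating
```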
Weighted Mean
a statistical measure of central tendency that factors in the
weight of each data point when calculating the mean.
used when you want to emphasize a particular segment of data without disregarding the rest of the dataset.
e.g.: students’ grades, the final exam accounting for 70% of the total grade.
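The grade example can be sketched as follows; the weights match the card (final exam 70%), while the two scores are assumed for illustration:

```python
scores = [80, 90]    # [coursework, final exam] -- assumed values
weights = [30, 70]   # percentage weights: final exam carries 70% of the grade

# Weighted mean: each score multiplied by its weight, divided by total weight.
weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
print(weighted_mean)  # 87.0
```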
What is a suitable measure of central tendency?
depends on the composition of the data.
The mode: easy to locate in datasets with a low number of discrete values or ordinal values,
The mean and median: suitable for datasets that contain continuous variables.
The weighted mean: used when you want to emphasize a particular segment of data without disregarding the rest of the dataset.
MEASURES OF SPREAD
describes how data varies
The composition of two datasets can be very different despite each dataset having the same mean.
The critical point of difference is the range of the datasets, which is a simple measurement of data variance.
range of the datasets
As the difference between the highest value (maximum) and the lowest value (minimum),
the range is calculated by subtracting the minimum from the maximum.
knowing the range for the dataset can be useful for data screening and identifying errors.
An extreme minimum or maximum value, for example, might indicate a data entry error, such as the inclusion of a measurement in meters in the same column as other measurements expressed in kilometers.
Standard Deviation
describes the extent to which individual observations differ from the mean.
the standard deviation is a measure of the spread or dispersion among data points, just as important as central tendency measures for understanding the underlying shape of the data.
How does standard deviation measure variability?
Standard deviation measures variability
by calculating the square root of the average squared distance of all data observations from the mean of the dataset.
Standard deviation: what do low/high SD values mean?
the lower the standard deviation, the less variation in the data
When SD is a lower number (relative to the mean of the dataset) >> it indicates that most of the data values are clustered closely together,
whereas a higher value indicates a higher level of variation and spread.
whether a standard deviation value is considered low or high depends on the dataset (on its mean, its range, and even the variability of its values)
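A worked sketch with a small example dataset, using the population standard deviation from the statistics module:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # small example dataset

# Population standard deviation: the square root of the average squared
# distance of each observation from the mean.
mean = statistics.fmean(data)
print(mean)                     # 5.0
print(statistics.pstdev(data))  # 2.0 -> most values cluster near the mean
```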

histogram
a visual technique for interpreting data variance is to plot the distribution of the dataset’s values as a histogram
what is standard normal distribution?
A normal distribution with a
mean of 0 and a
standard deviation of 1
What histogram shape does a normal distribution produce?
data is distributed symmetrically >> a bell curve
A symmetrical bell curve of a standard normal model

Normal distribution can be transformed to a standard normal distribution by ..
converting the original values to standardized scores
normal distribution features:
- the highest point of the dataset occurs at the mean (x̄).
- the curve is symmetrical around an imaginary line that lies at the mean.
- at its outermost ends, the curves approach but never quite touch or cross the horizontal axis.
- the locations at which the curve changes cupping direction (known as inflection points) occur one standard deviation above and below the mean.
how variables diverge in the real world?
The symmetrical shape of the normal distribution is often a reasonable description.
(body height, IQ tests, variable values generally gravitate towards a symmetrical shape around the mean as more cases are added)
Empirical Rule
real-world variables often approximate the symmetrical shape of the normal distribution
How the Empirical Rule describes normal distribution ?
Approximately 68% of values fall within one standard deviation of the mean.
Approximately 95% of values fall within two standard deviations of the mean.
Approximately 99.7% of values fall within three standard deviations of the mean.
Aka the 68 95 99.7 Rule or the Three Sigma Rule
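The three figures can be verified against the standard normal distribution using Python’s statistics.NormalDist (Python 3.8+):

```python
from statistics import NormalDist

# Share of a normal distribution within 1, 2 and 3 standard deviations
# of the mean -- the 68-95-99.7 rule.
std_normal = NormalDist(mu=0, sigma=1)
for k in (1, 2, 3):
    share = std_normal.cdf(k) - std_normal.cdf(-k)
    print(k, round(share, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```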
What did the French mathematician Abraham de Moivre discover?
Following an empirical experiment flipping a two-sided coin, de Moivre discovered that
an increase in events (coin flips) gradually leads to a symmetrical curve of binomial distribution.
What is Binomial distribution?
It describes a statistical scenario where only one of two mutually exclusive outcomes of a trial is possible
(i.e., a head or a tail, true or false).
Total possible outcomes for the number of heads when flipping four standard coins
Flipping experiment with 4 coins:
the histogram has five possible outcomes (0, 1, 2, 3 or 4 heads)

the probability of most outcomes is now lower.
the more data >> the histogram contorts into a symmetrical bell-shape.
As more data is collected >> more observations settle in the middle of the bell curve, a smaller proportion of observations land on the left and right tails of the curve.
The histogram eventually produces approximately 68% of values within one standard deviation of the mean.
Using the histogram, we can pinpoint the probability of a given outcome such as two heads (37.5%) and whether that outcome is common or uncommon compared to other results—a potentially useful piece of information for gamblers and other prediction scenarios.
It’s also interesting to note that the mean, median, and mode all occur at the same point on the curve as this location is both the symmetrical center and the most common point. However, not all frequency curves produce a normal distribution.
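The five-outcome histogram for four coins, including the 37.5% figure for two heads, can be computed directly:

```python
import math

# Probability of each number of heads when flipping four fair coins:
# P(k heads) = C(4, k) / 2**4 -- five possible outcomes.
n = 4
for k in range(n + 1):
    p = math.comb(n, k) / 2 ** n
    print(k, p)
# 0 0.0625
# 1 0.25
# 2 0.375   <- two heads: 37.5%, the most common outcome
# 3 0.25
# 4 0.0625
```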
MEASURES OF POSITION
on a normal curve there’s a decreasing likelihood of replicating a result the further that observed data point is from the mean.
We can also assess whether that data point is approximately
one (68%), two (95%) or three standard deviations (99.7%) from the mean.
This, however, doesn’t tell us the probability of replicating the result.
we want to identify the probability of replicating a result.
How to identify the probability of replicating a result?
Depending on the size of the dataset: Z-Score (or T-Score for small samples)
Z-Score
finds the distance from the sample’s mean to an individual data point expressed in units of standard deviation.

Z-Score is 2.96, means ..
the data point is located 2.96 standard deviations from the mean in the positive direction.
This data point could also be considered an anomaly as it is close to three deviations from the mean and different from other data points.
Z-Score is -0.42, means ..
the data point is positioned 0.42 standard deviations from the mean in the negative direction,
(this data point is lower than the mean)
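Both cards above can be reproduced with a tiny helper; the mean of 100 and standard deviation of 15 are assumed values chosen so the Z-Scores come out as 2.96 and -0.42:

```python
def z_score(x, mean, std_dev):
    """Distance of a data point from the mean, in units of standard deviation."""
    return (x - mean) / std_dev

# Assumed sample: mean 100, standard deviation 15.
print(round(z_score(144.4, 100, 15), 2))  # 2.96 -> nearly three SDs above the mean
print(round(z_score(93.7, 100, 15), 2))   # -0.42 -> below the mean
```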
anomaly
if the data point falls more than three standard deviations above or below the mean (in the case of a normal distribution) >> anomaly
>> data points that lie an abnormal distance from other data points. >> a rare event that is abnormal and perhaps should not have occurred.
in the case of a normal distribution if the Z-Score falls three positive or negative deviations from the mean of the dataset, it falls beyond 99.7% of the other data points on a normal distribution curve.
sometimes viewed as a negative exception, such as fraudulent behavior or an environmental crisis.
help to identify data entry errors and are commonly used in fraud detection to identify illegal activities.
Outliers
no unified agreement on how to define outliers, but:
data points that diverge from the primary data patterns are labelled outliers because they record unusual scores on at least one variable; they are more plentiful than anomalies.
Z-Score applies to..
to a normally distributed sample
with a known standard deviation of the population.
When to use T-Score?
sometimes the mean isn’t normally distributed or the
standard deviation of the population is unknown or not reliable,
<< which could be due to insufficient sampling (small sample size)
What is the problem with small datasets?
The standard deviation of small datasets is susceptible to change as more observations are included
T-Score: who discovered it, when, and what else is it called?
English statistician W. S. Gosset (working at the Guinness brewery in Dublin), early 20th century; published under the pen name “Student” >>
sometimes called “Student’s T-distribution.”
What distributions do the Z-Score / T-Score use?
Z-distribution / T-distribution (Student’s T-distribution)
What is the primary function of the Z-Score and T-Score?
They share the same primary function (measuring distance from the mean in a distribution), but they’re used with different sizes of sample data.
What is Z-distribution?
standard normal distribution
What does the Z-Score measure?
the deviation of an individual data point from the mean for datasets with 30 or more observations
based on Z-distribution (standard normal distribution).
Z and T distribution graph.png

T-distribution features
the T-distribution is not one fixed bell curve; rather, its distribution curve changes shape (multiple shapes) in accordance with the size of the sample.
- if the sample size is small, (e.g. 10): >> the curve is relatively flat with a high proportion of data points in the curve’s tails.
- as the sample size increases >> the distribution curve approaches the standard normal curve (Z-distribution) with more data points closer to the mean at the center of the curve.
Z and T distribution graph.png

A standard normal curve is defined by…
by the 68 95 99.7 rule,
which sets approximate confidence levels for one, two, and three standard deviations from a mean of 0.
Based on this rule, 95% of data points will fall within 1.96 standard deviations of the mean
if the sample’s mean = 100 and we randomly select an observation from the sample (in case of standard normal curve)..
the probability of that data point falling within 1.96 standard deviations of 100 is 0.95 or 95%.
To find the exact variation of that data point from the mean we can use the Z-Score
In the case of smaller datasets, what is the problem?
they don’t reliably follow a normal curve >> we instead need to use the T-Score.
T-Score
The formula is similar to that of the Z-Score,
except the sample standard deviation is divided by the square root of the sample size.
Also, the standard deviation is that of the sample in question, which may or may not reflect that of the population (and may change when more observations are added to the dataset).

You’ll want to use the t score formula when ..
when you don’t know the population standard deviation and you have a small sample (under 30).
T-score formula
When to use T-score formula ?
You’ll want to use the t score formula when you don’t know the population standard deviation and you have a small sample (under 30).
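The same sketch for the T-Score, using the sample SD divided by √n (all the numbers below are hypothetical):

```python
import math

# T-score: like the Z-score, but the sample SD is divided by sqrt(n).
def t_score(sample_mean, mu, s, n):
    return (sample_mean - mu) / (s / math.sqrt(n))

# Hypothetical values: sample mean 22, hypothesized mean 20,
# sample SD 5, sample size 10 (< 30, so the T-distribution applies).
t = t_score(22, mu=20, s=5, n=10)
print(round(t, 3))  # 1.265
```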
What is the T Score in essence?
A t score is one form of a standardized test statistic
(the other you’ll come across in elementary statistics is the z-score).
The t score formula enables you to take an individual score and transform it into a standardized form > one which helps you to compare scores.
Z-score tells you:
z score tells you how many standard deviations from the mean your score is
very good website >> work out here
https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/z-score/
Z score = 0: what is the meaning?
Your observation is right in the middle of the distribution (in the mean)
Z score = 1: what is the meaning?
Your observation is 1 SD away from the mean (above if +1, below if -1)
Z-score summary
The Law of Large Numbers
if we take a sample of n observations of our random variable and average them (the sample mean) –
it will approach the expected value E(X) of the random variable as n grows.
What is a typical sample size that would allow for usage of the central limit theorem?
In practice, “n = 30” is usually what distinguishes a “large” sample from a “small” one.
In other words, if your sample has a size of at least 30 you can say it is approximately Normal (and, hence, use the Normal distribution).
If, on the other hand, your sample has a size less than 30, it’s best to use the t-distribution instead.
Do we average large number of samples when applying Central limit theorem?
We are not averaging a large number of samples, rather, we are obtaining the averages from many repeated samples.
The distribution of the sample averages is the Normal distribution we obtained.
It does not represent the original distribution well. But it’s not supposed to do so!
This Normal distribution is the distribution of the sample mean. Its use is to let us talk about the probability of the sample mean being in a given interval, better understanding the population mean,
and so forth.
How can we use the Central Limit Theorem?
We can get info about a population
not taking large number of samples, but
getting the averages from many repeated smaller samples
>> their distribution will be normal (around the mean)
>> this normal distribution is the distribution of the sample mean.
>> population mean can be determined
>> can determine the probability of the sample mean being in a given interval
(and maybe more that I still don’t get)
Central Limit Theorem
if we take the means of samples (of size n) and plot the frequencies of those means,
>> we get a normal distribution! as the sample size (n) increases –> approaches infinity –> the distribution of the means approaches normal
(calculate the mean of a few random samples (e.g. n=4) from the whole population > this gives a value (the sample mean) > repeat several times with the same sample size (4-4-4 samples) > plot the means on a frequency distribution > if you do it many times > the distribution of the sample means will follow a normal distribution)
if the sample size is low (e.g. n=4) >> the curve will be wide and flat
as the sample size increases (e.g. n >>> 4) > the curve will be higher and tighter around the mean
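This card can be simulated directly. The sketch below draws repeated samples from a deliberately non-normal (uniform) population and shows the spread of the sample means shrinking as n grows:

```python
import random
import statistics

# CLT sketch: draw many samples from a non-normal (uniform) population
# and watch the spread of the sample means shrink as the sample size grows.
random.seed(42)  # reproducible

def sample_means(n, repeats=5000):
    """List of `repeats` sample means, each the average of n draws."""
    return [statistics.mean(random.random() for _ in range(n))
            for _ in range(repeats)]

spread_small = statistics.stdev(sample_means(4))    # n = 4: wide, flat curve
spread_large = statistics.stdev(sample_means(100))  # n = 100: tall, tight curve
print(spread_small > spread_large)  # True: larger n -> tighter around the mean
```

The theoretical spread is σ/√n, so moving from n = 4 to n = 100 should shrink it roughly fivefold.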

what’s the difference between an average and mean?
The word ‘average’ is a bit more ambiguous.
Average can legitimately mean almost any measure of central tendency: mean, median, mode, typical value, etc.
However, even “mean” admits some ambiguity, as there are different types of means.
The one you are probably most familiar with is the arithmetic mean, although there is
also a geometric mean and a harmonic mean.
Skew and Kurtosis of the Normal Distribution
the opposite of a fractional number
integer
The Standard Error of the Mean
the Standard Error of the Mean
the Standard Deviation of the Mean
the ‘standard deviation’ of the ‘sampling distribution’ of the ‘sample mean’
–> all the same
the Standard Error of the Mean.png
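A minimal sketch of SEM = s/√n (the three-value sample below is made up so the sample SD comes out to exactly 6):

```python
import math
import statistics

# Standard Error of the Mean = sample SD / sqrt(n).
sample = [-6, 0, 6]           # hypothetical sample; its sample SD is exactly 6
s = statistics.stdev(sample)  # sample SD (n - 1 in the denominator)
sem = s / math.sqrt(len(sample))
print(round(sem, 3))  # 3.464  (= 6 / sqrt(3))
```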

what are μ (mu) and x̄ (x-bar)?
the whole population can be characterized by a mean μ (mu),
but it is impossible to measure (everybody) so we take
several samples from the whole population and calculate the sample means x̄ (x-bar)
according to the Central Limit Theorem the means of the taken samples will follow Normal distribution
even if the distribution is not normal in the population
what is sigma squared?
population variance
what is sigma ?
population SD
what is ‘s’ squared?
sample variance
what is ‘s’ ?
sample SD (square rooted sample variance)
but square rooting is non-linear >> taking the square root of the (n-1)-corrected variance >> introduces a slight bias, still the best we have

sample standard deviation
sample SD (square rooted sample variance)
but square rooting is non-linear >> taking the square root of the (n-1)-corrected variance >> introduces a slight bias, still the best we have

Variance
squared standard deviation
square root of variance gives –> standard deviation
population variance / sample variance:
the differences of the values and the mean, squared –>
summed up –> divided by the number of values (n in case of population variance; n-1 for sample variance)
population variance: σ² (sigma squared)
sample variance: s²
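Python’s statistics module exposes both versions; the dataset below is a made-up example:

```python
import statistics

# Population vs sample variance on a made-up dataset.
data = [2, 4, 6, 8]  # mean = 5, squared deviations sum to 20

pop_var = statistics.pvariance(data)  # divides by n     -> 20 / 4 = 5
samp_var = statistics.variance(data)  # divides by n - 1 -> 20 / 3 ≈ 6.667
samp_sd = statistics.stdev(data)      # square root of the sample variance

print(pop_var, samp_var, samp_sd)
```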

difference between one-tailed test and 2 tailed test
one-tailed test considers one direction of results (left or right) from the null hypothesis,
whereas a two-tailed test considers both directions (left and right).
in a two-tailed test the objective is not to challenge the null hypothesis in one particular direction but to consider both directions as evidence of an alternative hypothesis.
there are two rejection zones, known as the critical areas.
Results that fall within either of the two critical areas trigger rejection of the null hypothesis and thereby validate the alternative hypothesis.

Type I Error in hypothesis testing
the rejection of a null hypothesis (H0) that was true and should not have been rejected.
This means that although the data appears to support that a relationship is responsible,
the covariance of the variables is occurring entirely by chance. (this does not prove that a relationship doesn’t exist, merely that it’s not the most likely cause)
covariance: a measurement of how related the variance is between two variables
This is commonly referred to as a false-positive.
Type II Error in hypothesis testing
accepting a null hypothesis (H0) that should’ve been rejected because
the covariance of variables was probably not due to chance.
This is also known as a false-negative.
covariance: a measurement of how related the variance is between two variables
pregnancy test example for
type I
type II errors
we need to establish a H0 what can be challenged experimentally
we can do test for pregnancy -> if the test shows pregnancy -> we can reject H0 stating that the woman is not pregnant –>>
the null hypothesis (H0): the woman is not pregnant.
H0 rejected if the woman is pregnant –> H0 is false and
H0 accepted if the woman is not pregnant (H0 is true).
the test may not be 100% accurate >> mistakes may occur.
If H0 is rejected (false-positive test) but the woman is not actually pregnant (H0 is true), this leads to a Type I Error.
If H0 is accepted (the test fails to show pregnancy, false negative) and the woman is pregnant (H0 is false) –> this leads to a Type II Error
(we do not reject H0 > accept H1)
example for hypothesis testing my take (not sure)
we change something –> is it causing an effect or not? let’s detect events to see
H0: no effect
H1: does have an effect
–> if we can detect events which would be highly unlikely by chance (e.g. three SD away from the random distribution mean)
(this is my idea, but we’ll see)
What is Covariance?
a measure of the variance between two variables.
covariance is a measure of the relationship between two random variables.
a measurement of how related the variance is between two variables
The metric evaluates how much – to what extent – the variables change together.
However, the metric does not assess the dependency between variables.
covariance is measured..
covariance is measured in units.
The units are computed by multiplying the units of the two variables. The covariance can take any positive or negative value.
The values are interpreted as follows:
Positive covariance: Indicates that two variables tend to move in the same direction.
Negative covariance: Reveals that two variables tend to move in inverse directions.
covariance concept is used..
In finance, the concept is primarily used in portfolio theory.
One of its most common applications in portfolio theory is the diversification method,
using the covariance between assets in a portfolio.
By choosing assets that do not exhibit a high positive covariance with each other,
the unsystematic risk can be partially eliminated
the covariance between two random variables X and Y can be calculated using the following formula (for population):
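The population covariance (the average of the products of paired deviations) can be sketched like this, with hypothetical paired data:

```python
# Population covariance: average of the products of paired deviations.
x = [1, 2, 3]  # hypothetical paired data
y = [2, 4, 6]

def pop_covariance(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((a - mean_x) * (b - mean_y)
               for a, b in zip(xs, ys)) / len(xs)

cov = pop_covariance(x, y)
print(cov)  # positive: x and y tend to move in the same direction
```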
Covariance measures what?
what are the limitations of covariance?
Covariance measures the total variation of two random variables
from their expected values.
Using covariance, we can only gauge the direction of the relationship (whether the variables tend to move in tandem or show an inverse relationship)
it does not indicate the strength of the relationship,
nor the dependency between the variables.
Correlation measures
Correlation measures the strength of the relationship between variables.
Correlation is the scaled measure of covariance.
It is dimensionless.
In other words, the correlation coefficient is always a pure value and not measured in any units.
correlation:
covariance divided by standard deviation of both X and Y variables
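A sketch of that scaling, on made-up data where y is exactly 2x (so the correlation should come out as 1):

```python
import math

# Correlation = covariance / (SD_x * SD_y): dimensionless, between -1 and 1.
x = [1, 2, 3]  # made-up data where y = 2x exactly
y = [2, 4, 6]

def pop_sd(values):
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def correlation(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((a - mean_x) * (b - mean_y)
              for a, b in zip(xs, ys)) / len(xs)
    return cov / (pop_sd(xs) * pop_sd(ys))

r = correlation(x, y)
print(round(r, 6))  # 1.0
```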
investing Example of Covariance
John is an investor. His portfolio primarily tracks the performance of the S&P 500 and John wants to add the stock of ABC Corp. Before adding the stock to his portfolio, he wants to assess the directional relationship between the stock and the S&P 500.
John does not want to increase the unsystematic risk of his portfolio.
Thus, he is not interested in owning securities in the portfolio that tend to move in the same direction.
John can calculate the covariance between the stock of ABC Corp. and S&P 500 by following the steps below:
https://corporatefinanceinstitute.com/resources/knowledge/finance/covariance/
Why Statistical Significance important?
Given that the sample data cannot be truly reliable and representative of the full population, there is the possibility of a sampling error or random chance affecting the experiment’s results.
not all samples randomly extracted from the population are preordained to reproduce the same result. It’s natural for some samples to contain a higher number of outliers and anomalies than other samples, and naturally, results can vary.
If we continued to extract random samples, we would likely see a range of results and the mean of each random sample is unlikely to be equal to the true mean of the full population.
statistical significance : what is the role?
outlines a threshold for rejecting the null hypothesis.
Statistical significance is often referred to as the p-value (probability value) and is expressed between 0 and 1.
what is the meaning of p-value of 0.05?
A p-value of 0.05 expresses a 5% probability of seeing a result at least this extreme if the null hypothesis is true.
how we use the p-value in hypothesis testing?
the p-value is compared to a pre-fixed value (the alpha).
If the p-value returns as
equal or less than alpha, then the result is statistically significant and we can reject the null hypothesis.
If the p-value is greater than alpha, the result is not statistically significant and we cannot reject the null hypothesis.
Alpha sets a fixed threshold for how extreme the results must be before rejecting the null hypothesis.
(alpha should be defined before the experiment and not after the results have been obtained)
How is alpha for two-tailed tests?
For two-tailed tests, the alpha is divided by two.
Thus, if the alpha is 0.05 (5%), then the critical areas of the curve each represent 0.025 (2.5%).
Hypothesis tests usually adopt an alpha of between 0.01 (1%) and 0.1 (10%), there is no predefined or optimal alpha for all hypothesis tests.
Why is there a tendency to set alpha to a low value such as 0.01?
alpha is equal to the probability of a Type I Error (incorrect rejection of the H0 due to false positive)
when the result falls into the critical (rejection) zone(s) defined by alpha -> H0 is rejected –> hence the tendency to minimize the critical zone by choosing a smaller alpha
a smaller critical area >> less chance of incorrectly rejecting H0 (Type I Error)
but!
increases the risk of a Type II Error (incorrectly accepting the null hypothesis) because
the critical zone will be so tiny that almost no value can fall into it –> cannot reject H0 –> incorrect acceptance of H0
>> inherent trade-off in hypothesis testing >> most industries have found that 0.05 (5%) is the ideal alpha for hypothesis testing
What is alpha equal to?
alpha is equal to the probability of a Type I Error
(incorrect rejection of the null hypothesis) (false positive result)
Confidence in essence
Confidence is
a statistical measure of prediction confidence regarding whether
the sample result of the experiment holds for the full population
Confidence is calculated as
Confidence is calculated as (1 – α).
if the alpha is 0.05 >> confidence level of the experiment is 0.95 (95%).
1.0 – α = confidence level; 1.0 – 0.05 = 0.95
Confidence relation to alpha
Confidence is calculated as (1 – α).
if the alpha is 0.05 >> the confidence level of the experiment is 0.95 (95%).
1.0 – α = confidence level; 1.0 – 0.05 = 0.95
What alpha of 0.05 tells and
what not?
alpha = 0.05
–> reject the null hypothesis when the results are in a 5% zone, but
this doesn’t tell us where to plant the null hypothesis rejection zone(s). >> we need to define the critical areas set by alpha.
two-tail test with two confidence intervals and two critical areas .png

For what do we need to define the critical areas set by alpha?
for the null hypothesis rejection zone(s)
How to define the critical areas set by alpha?
Confidence intervals define the confidence bounds of the curve
Two-tailed test:
two confidence intervals define two critical areas outside the upper and lower confidence limits;
One-tailed test:
a single confidence interval defines the left/right-hand side critical area.
two-tail test with two confidence intervals and two critical areas .png
Confidence intervals define..
Confidence intervals define the confidence bounds of the curve
types of hypothesis test
left one-tailed, right one-tailed, two-tailed
Normal distribution, sufficient sample data (n > 30): what formula for a two-tailed test?
Z: Z-distribution critical value (found using a Z-distribution table)
formula for a two-tailed test.png

Z-Statistic is used to find..
The Z-Statistic is used
to find the distance between the null hypothesis and the sample mean.
How do you utilize Z-Statistic in hypothesis testing?
In hypothesis testing, the experiment’s Z-Statistic is compared with the expected statistic (critical value) for a given confidence level.
Z-Statistic is used to find the distance between the null hypothesis and the sample mean.
Example teenage gaming habits in Europe; data given: n=100 (100 teens) mean (of gaming time): 22 hrs
Stand. Dev.= 5.7 (calculated) alpha of 0.05
how to find the confidence intervals for 95%?
Using a two-tailed test what can you find out?
95% certain that our sample data will fall somewhere between 20.8828 and 23.1172 hours.
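The interval can be reproduced from the card’s numbers (n = 100, mean 22, SD 5.7, critical Z 1.96):

```python
import math

# 95% confidence interval with the card's numbers.
n, sample_mean, sd, z_crit = 100, 22, 5.7, 1.96

margin = z_crit * sd / math.sqrt(n)  # 1.96 * 0.57 = 1.1172
lower = sample_mean - margin
upper = sample_mean + margin
print(round(lower, 4), round(upper, 4))  # 20.8828 23.1172
```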
Example teenage gaming habits in Europe

Example teenage gaming habits in Europe;
data given: now low sample size (10) n=10 (10 teens)
mean (of gaming time): 22 hrs Stand. Dev.= 5 (calculated) alpha of 0.05
How to find the confidence intervals for 95%?
Using a two-tailed test what can you find out?
the overall objective of hypothesis testing is
to prove that the outcome of the sample data is representative of the full population and not occurring by chance caused by randomness in the sample data.
Hypothesis testing four steps:
1: Identify the null hypothesis
(what you believe to be the status quo and wish to nullify)
and the type of test (i.e. one-tailed or two-tailed).
2: State your experiment’s alpha
(statistical significance and the probability of a Type I Error) and set the confidence interval(s).
3: Collect sample data and conduct a hypothesis test.
4: Compare the test result to the critical value
(expected result) and decide if you should support or reject the null hypothesis.
What Z-Score measures?
the distance between a data point and the sample’s mean
What Z-Score measures in hypothesis testing?
in hypothesis testing,
we use the Z-Statistic to find the distance between a sample mean and the null hypothesis.
How Z-Statistic is expressed?
what is the meaning?
numerically
the higher the statistic, the higher the discrepancy between the sample data and the null hypothesis.
A Z-Statistic close to 0 means the sample mean matches the null hypothesis, confirming it. The statistic is pegged to a p-value, which is the probability of that result occurring by chance.
hypothesis testing
Z-Statistic of close to 0 means
Z-Statistic of close to 0 means the sample mean matches the null hypothesis—confirming the null hypothesis
it is fixed / anchored
pegged to
What p<0.05 indicates?
A low p-value, such as 0.05, indicates that the sample mean is unlikely to have occurred by chance.
a p-value of 0.05 or below is sufficient to reject the null hypothesis (at alpha = 0.05)
How to find the p-value for a Z-statistic?
To find the p-value for a Z-statistic,
we need to refer to a Z-distribution table

What a two-Sample Z-Test compares?
A two-sample Z-Test compares the difference between the means of two independent samples with a known standard deviation.
(we assume: the data is normally distributed and a minimum of 30 observations)
what is high enough Z value
(Z-Statistic value)?
what is high enough Z value (Z-Statistic value)? >>
depends on the level of confidence (determined by alpha)
and the type of the test (one tailed or two tailed) >>
the critical Z-value can be found in Z-distribution tables >>
the table gives the critical value for each confidence level
e.g. in a Two-Sample Z-Test
What do you calculate with a Two-Sample Z-Test?
a Z value (Z-Statistic value)
it helps to evaluate the null hypothesis (e.g. a difference between two sets of values (two samples): we need to calculate the SD of the two samples > it shows to what extent they vary > it helps to see if the difference between the two groups is due to variation or real)
if Z is close to 0 >> the sample mean matches the null hypothesis >> confirms the null hypothesis (so the two samples are equal; the difference found between their means is due to chance, coming from variation)
if Z is high enough >> reject H0 so reject that µ1 = µ2 (mu1 = mu2) >> accept H1 (the means of samples are indeed different)
what is a high enough Z value (Z-Statistic value)? >> depends on the level of confidence (alpha) and the type of the test (one-tailed or two-tailed) >> the critical Z-value can be found in tables. These critical Z-values are also used in confidence interval calculations: once a confidence level is determined (by alpha), we look up the corresponding critical Z-value in a table >> this sets the limit beyond which H0 can be rejected

z Critical Value
One-Sample Z-Test example:
Company A claims their new phone battery outperforms
the former 20-hour battery life.
30 users
mean battery life (sample of 30 users) >> 21 hours,
SD= 3
is 21 > 20, given SD = 3 and n = 30?
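A sketch of the test using the card’s numbers; treating “outperforms” as a directional (one-tailed) claim is my reading, not stated on the card:

```python
import math

# One-sample Z-Test with the card's numbers.
n, sample_mean, mu0, sd = 30, 21, 20, 3

z = (sample_mean - mu0) / (sd / math.sqrt(n))
print(round(z, 3))  # 1.826

# "Outperforms" is a directional claim, so a right one-tailed test fits:
# critical Z at alpha = 0.05 (one-tailed) is 1.645 (from a Z-table),
# and 1.826 > 1.645 -> reject H0 (the old 20-hour mean).
```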
Two-Sample Z-Test practical:
Company A claims their phone battery outperforms Company B. 60 users total. mean battery life (Company A) (sample of 30 users) >> 21 hours, SD = 3
mean battery life (Company B) (sample of 30 users) >> 19 hours, SD = 2
is that claim right?
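The same card worked through as code, using the card’s numbers:

```python
import math

# Two-Sample Z-Test with the card's numbers.
mean_a, sd_a, n_a = 21, 3, 30   # Company A
mean_b, sd_b, n_b = 19, 2, 30   # Company B

z = (mean_a - mean_b) / math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
print(round(z, 2))  # 3.04

# 3.04 > 1.96 (two-tailed critical value, alpha = 0.05)
# -> reject H0 (mu1 = mu2): Company A's battery really does last longer.
```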
One-Sample Z-Test in essence
one-sample only (sample size: 30) (I guess it is the min) calculate SD
assume norm. distribution
calculate mean >> is it different from a value?
not comparing two samples, only one sample’s mean compared to a value
One-Sample Z-Test
one-sample only (sample size: 30) (I guess it is the min), calculate SD, assume norm. distribution, calculate mean >> is it different from a value? (not comparing two samples, only one sample’s mean compared to a value)

One-Sample Z-Test formula
What do you do if you need to compare two mean values coming from two different samples?
(n=30 min and normal distribut with calculated SD)
T-Test in essence
Similar to the Z-Test,
a T-Test analyzes the distance between a sample mean and the null hypothesis but is based on T-distribution (using a smaller sample size) and
uses the standard deviation of the sample rather than of the population.
The main categories of T-Tests:
- An independent samples T-Test (two-sample T-Test) for comparing means from two different groups,
such as two different companies or two different athletes.
This is the most commonly used type of T-Test.
- A dependent sample T-Test (paired T-test) for comparing means from the same group at two different intervals,
e.g. measuring a company’s performance in 2017 against 2018.
- A one-sample T-Test for testing the sample mean of a single group against a known or hypothesized mean.
What is T-Statistic?
The output of a T-Test, called the T-Statistic,
quantifies the difference between the sample mean and the null hypothesis.
As the T-Statistic increases in the +/- direction, the gap between the sample data and null hypothesis expands.
(to find critical values, we refer to a T-distribution table)
If we have a one-tailed test with an alpha of 0.05 and sample size of 10 (df 9), what can we expect?
we can expect 95% of samples to fall within 1.83 standard deviations of the null hypothesis.

Sample (n=10) >> Mean, SD calculated >> we carry out T-Test:
If our sample mean returns a T-Statistic greater than the critical score of 1.83, what can we conclude?
we can conclude the results of the sample are statistically significant and unlikely to have occurred by chance—allowing us to reject the null hypothesis.
H0: μ = (a certain) value (if rejected: the mean is different from that value; the difference we found is not due to chance but genuine)

What is the T-Statistic critical score (for 95% confidence)?
for a one-tail test: T-Statistic must be greater than the critical score of 1.83 for 95% confidence (alpha=0.05)
for a two-tail test: T-Statistic critical score: 2.26 for 95% confidence (alpha = 0.05 split into two tails of 0.025 each); the two critical areas each account for 2.5% of the distribution, based on 95% confidence with confidence intervals of -2.262 and +2.262 from the null hypothesis.
Independent Samples T-Test in essence
An independent samples T-Test compares means from two different groups.
Independent Samples T-Test formula.png

What is Pooled standard deviation used for?
part of a greater calculation for Independent Samples T-Test calculation
https://www.dropbox.com/s/48mecjisbglgbbn/Independent%20Samples%20T-Test%20formula.png?dl=0
Independent Samples T-Test Xmpl
compare customer spending between the
desktop version of their website and the mobile site.
25 desktop customers spent an average of $70 with a SD of $15.
mobile users: 20 customers spent $74 on average with a SD of $25.
We test the difference between the two sample means using a two-tail test with an alpha of 0.05 (95% confidence).
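Working through the card’s numbers with the pooled-SD formula; the critical value of roughly ±2.02 for df = 43 is taken from a standard T-table, not from the card:

```python
import math

# Independent samples T-Test with the card's numbers.
n1, mean1, s1 = 25, 70, 15   # desktop customers
n2, mean2, s2 = 20, 74, 25   # mobile customers

# Pooled SD: each sample's variance weighted by its degrees of freedom.
pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
t = (mean1 - mean2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
print(round(pooled_sd, 2), round(t, 3))  # 20.04 -0.665

# df = 43; the two-tailed 5% critical value is roughly +/-2.02 (T-table),
# and |t| = 0.665 is well inside it -> we cannot reject H0.
```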
What to do if we want to: compare customer spending between the desktop version of their website and the mobile site. 25 desktop customers spent an average of $70 with a SD of $15. mobile users, 20 customers spent $74 on average with a SD of $25.
Independent Samples T-Test
Dependent Sample T-Test in essence
A dependent sample T-Test is used for comparing means from the same group at two different intervals.
Dependent Samples T-Test formula.png

What to use if we want to compare means from the same group at two different intervals (at two different timepoints, but same players)
Dependent Sample T-Test what for?
if we want to compare means from the same group at two different intervals (at two different timepoints, but same players)
One-Sample T-Test in essence
A one-sample T-Test is used for testing the sample mean of a single group against a known or hypothesized mean.
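A sketch with a small made-up sample (n = 10) and a hypothetical H0 mean of 4.9; the 2.262 two-tailed critical value for df = 9 comes from a T-table:

```python
import math
import statistics

# One-sample T-Test on a small made-up sample (n = 10).
# H0: the true mean equals 4.9.
sample = [5.1, 4.9, 5.0, 5.2, 4.8, 5.3, 5.1, 4.7, 5.0, 4.9]
mu0 = 4.9

n = len(sample)
mean = statistics.mean(sample)   # ~5.0
s = statistics.stdev(sample)     # sample SD, with n - 1 in the denominator
t = (mean - mu0) / (s / math.sqrt(n))
print(round(t, 3))  # 1.732

# df = 9; the two-tailed critical value for 95% confidence is 2.262,
# so |t| = 1.732 < 2.262 -> we cannot reject H0.
```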

When Z-Test is used for hypothesis testing?
what is it based on?
A Z-Test, is
used for datasets with 30 or more observations (normal distribution) with a known standard deviation of the population and is calculated based on Z-distribution.
When T-Test is used for hypothesis testing?
A T-Test is used in scenarios when you have a small sample size or you don’t know the standard deviation of the population
and you instead use the standard deviation of the sample and T-distribution.
What to do, if you want to compare small sample sized sample (group) and you do not know the SD of the whole population (only of your small sized sample’s)?
T-Test is used in scenarios when you have a small sample size or you don’t know the standard deviation of the population and you instead use the standard deviation of the sample and T-distribution.
You can test if the sample mean is the same as some value (it will be a hypothesis)
(H null: they are the same, H1: they are different)
you can test H0 with T-test >> you get T-Statistics value >> lookup the critical value in the T-distribution table >> compare them >> accept/reject the null hypothesis
What T-Test is used for ?
small sample size or you don’t know the standard deviation of the population instead use the standard deviation of the sample and T-distribution
You can test if the sample mean is the same as some value (it will be a hypothesis) (H null: they are the same, H1: they are different). You can test H0 with a T-Test >> you get a T-Statistic value >> look up the critical value in the T-distribution table >> compare them >> accept/reject the null hypothesis
What technique is used to compare experimental group and a control group (placebo)?
hypothesis testing for comparing two proportions from the same population, expressed in percentage form,
i.e. 40% of males vs 60% of females.
we need to conduct a ‘two-proportion Z-Test’
https://www.dropbox.com/s/3ml84x5fhon19gj/Two-proportion%20Z-Test.png?dl=0
two-proportion Z-Test’
hypothesis testing for comparing two proportions from the same population, expressed in percentage form,
i.e. 40% of males vs 60% of females.
we need to conduct a ‘two-proportion Z-Test’ to compare experimental group and a control group (placebo)
https://www.dropbox.com/s/3ml84x5fhon19gj/Two-proportion%20Z-Test.png?dl=0

Two-proportion Z-Test practical
Two-proportion Z-Test practical
Two-proportion Z-Test practical.png
We consider a new energy drink formula that proposes to improve students’ test scores.
max test score: 1600 (the average score is 1050 - 1060). Evaluation: whether the students’ results exceed 1060 points.
sample of 2,000 students split evenly into an exp. group (energy drink) and a ctrl group (placebo). Results:
Ctrl Group = 500 exceeded /1000
Exp Group = 620 exceeded /1000; looks like more than 500 > real difference?

in Two-proportion Z-Test we get Z-Statistic value: how do we evaluate it?
Critical areas of 2.5% on each side of the two-tailed (normal distribution) curve from a distance of 1.96 standard deviations.
If the Z-Statistic falls within 1.96 standard deviations of the mean (within the 95% area) >>
we can conclude that the proportions of the ‘experimental test’ and ‘control test’ results were equal (the exp. group and the ctrl group are not different)
If the Z-Statistic falls out of the 95% area >> reject null hypothesis (the proportions are not the same) >> so they are different (H1 is true)
Normal distribution curve with marked critical areas.png
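Worked through with the card’s numbers (500/1000 vs 620/1000):

```python
import math

# Two-proportion Z-Test with the card's numbers.
x_ctrl, n_ctrl = 500, 1000   # control (placebo) group: exceeded 1060 points
x_exp, n_exp = 620, 1000     # experimental (energy drink) group

p_ctrl = x_ctrl / n_ctrl
p_exp = x_exp / n_exp
p_pool = (x_ctrl + x_exp) / (n_ctrl + n_exp)   # pooled proportion = 0.56

se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_exp))
z = (p_exp - p_ctrl) / se
print(round(z, 2))  # 5.41

# 5.41 is far beyond the 1.96 two-tailed critical value (alpha = 0.05)
# -> reject H0 (p1 = p2): the difference looks genuine, not sampling error.
```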

We consider a new energy drink formula that proposes to improve students’ test scores. max test score: 1600 (the average score is 1050–1060). Evaluation: whether the students’ results exceed 1,060 points. sample of 2,000 students split evenly into an exp. group (energy drink) and a ctrl group (placebo). Results: Ctrl Group = 500 surpassed /1000; Exp Group = 620 surpassed /1000; looks like more than 500 > real difference? How to evaluate the difference?
What is the null hypothesis when comparing exp. group with a ctrl group?
two-proportion Z-Test based on the following hypotheses:
H0: p1 = p2 (The proportions are the same with the difference equal to 0)
H1: p1 ≠ p2 (The two proportions are not the same)
we detect a difference between the two groups >> is it a real difference (or just due to chance)?
we want to find out >> H0: we state that they are the same (this is the hypothesis we want to nullify/reject) >> we can reject it if the Z-test value falls into an area of the distribution where there is less than a 5% chance it would land by chance, considering the variation in the sample groups
we anchor the null hypothesis with the statement that we wish to nullify:
(the two proportions of results are identical and it just so happened that the results of the experimental group differed from those of the control group due to a random sampling error) <– if rejected, H1 is true: they are not equal
in general:
H0: the known, the status quo, what we want to challenge
H0: (equal, not equal, less, more)
H1: the opposite, engulfing everything else
Two-proportion Z-Test practical.png

What is the meaning if we define confidence level = 95% ?
H0: p1 = p2 (The proportions are the same with the difference equal to 0)
H1: p1 ≠ p2 (The two proportions are not the same)
H0: p1 = p2 (The proportions are the same with the difference equal to 0)
H1: p1 ≠ p2 we test it (the two proportions are not the same) << if the observed difference would occur by chance less than 5% of the time under H0 -> we reject H0 at the 95% confidence level
putting it another way: the formula actually examines the difference between the two sample proportions
H0: p1-p2=0
Ha: p1-p2≠0 we test it (the two proportions are not the same -> the difference of the proportions is not zero; if the probability of seeing such a difference by chance, when it is really zero, is less than 5%, then with at least 95% confidence the difference is genuine)
we’ll reject the null hypothesis if there’s a less than 5% chance of the observed result occurring when the null hypothesis is true.
we anchor the null hypothesis with the statement that we wish to nullify:
(e.g.: exp group–placebo group test: the two proportions of results are identical and it just so happened that the results of the experimental group differed from those of the control group due to a random sampling error)
Normal distribution curve with marked critical areas.png

regression analysis essence
a technique in inferential statistics used to test how well one variable predicts another.
the term “regression” is derived from Latin, meaning “going back”
What is the the objective of regression analysis ?
The objective of regression analysis is to find a line that best fits the data points on the scatterplot to make predictions.
Linear regression, the line is straight and cannot curve or pivot.
Nonlinear regression, meanwhile, allows the line to curve and bend to fit the data.
trendline
trendline
A straight line cannot possibly intercept all data points on the scatterplot > linear regression can be thought of as a trendline visualizing the underlying trend of the dataset.
hyperplane:
if we draw a vertical line from the regression line to each data point on the scatterplot >> the aggregate of those (squared) distances is the smallest possible for the least-squares line (in more than two dimensions the line generalizes to a hyperplane).
coefficient
slope aka. coefficient in statistics.
the term “coefficient” is generally used over “slope” in cases where there are multiple variables in the equation (multiple linear regression) and the line’s slope is not explained by any single variable.
slope
The slope of a regression line (b) represents the rate of change in y as x changes.
Because y is dependent on x > the slope describes the predicted values of y given x.
The slope of a regression line is used with a t-statistic to test the significance of a linear relationship between x and y.
The slope can be found by referencing the hyperplane;
(scatterplots in statistics) as one variable increases, the other variable increases by the average value denoted by the hyperplane.
The slope is useful in forming predictions.
How do you calculate slope?
(I did not get this)
With the ordinary least squares method
(one of the most common linear regression methods), the slope is found by calculating
b as the covariance of x and y,
divided by the variance (sum of squares) of x: b = cov(x, y) / var(x).
The slope must be calculated before the y-intercept when using a linear regression, as
the intercept is calculated using the slope.
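The covariance-over-variance recipe above can be sketched in a few lines of pure Python; the data points below are invented for illustration:

```python
def ols_fit(xs, ys):
    """Ordinary least squares: slope b = cov(x, y) / var(x), intercept from b."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var_x = sum((x - x_mean) ** 2 for x in xs)  # sum of squares of x
    b = cov_xy / var_x                 # slope must come first...
    a = y_mean - b * x_mean            # ...because the intercept uses it
    return b, a

# Invented example: the points lie exactly on y = 2x + 1
slope, intercept = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
```

The order of operations mirrors the card: the slope comes out of the division, and only then can the intercept be computed from the means.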

How is the slope useful? example..
We can use the slope in forming predictions,
e.g. to predict a child's height based on his parents' midheight:
reading the regression line at a parents' midheight (x) of 72 inches gives the son's expected height (y)
>> the y value is approximately 71 inches.
Predicted height of a child whose parents’ midheight.png

Regression analysis is useful for..
Regression analysis
(named after the phenomenon of regression towards the mean) is a useful method for estimating relationships among variables and testing whether they are related.
Linear regression is not a fail-proof method of making predictions, but
the trendline does offer a primary reference point for making estimates about the future.
linear regression summary bbas
The regression model (and a scatter chart)
an excellent tool to depict the relationship between two variables: it provides a visual representation and a mathematical model that relates the two variables.
describes the relation between x;y in a scatter plot
y = mx + b
(m: slope; b: intercept)
calculates m and b in such a way as to minimize the distance (error) of the points from the regression line on the plot
(more accurately: to minimize the sum of the squared errors >> hence the name "least squares regression")
Linear regression Xmple
What is R-squared for?
If we apply linear regression analysis to large datasets with a high degree of scattering, or to three- and four-dimensional data, it is hard to validate the trendline just by looking at it > a mathematical solution to this problem is to apply R-squared (the coefficient of determination)
R-squared
(the coefficient of determination)
R-squared is a test to see what level of impact the independent variable has on data variance.
R-squared is a number between 0 and 1 (often expressed as a percentage):
0%: the linear regression model accounts for none of the data variability in relation to the mean (of the dataset) >> the regression line is a poor fit (for the given dataset)
100%: the linear regression model expresses all of the data variability in relation to the mean (of the dataset) >> the regression line is a perfect fit. It is a mathematical solution to validate the (calculated) relationship in the regression model:
it defines the percentage of variance in the data that the linear model explains.
How R-squared is calculated?
R² is a ratio ->
-> a division needs to be calculated: SSR/SST
R-squared is calculated as
the sum of square regression (SSR) divided by
the sum of squares total (SST) -> SSR/SST
SSR: calculated from the theoretical values of the dependent variable (y′) given by the regression analysis; y′ is based on the formula y′ = mx + b
it is the total sum of
[the difference, for each datapoint, between the theoretical value (y′) and the mean of the actual values (y̅)] -> squared -> summed up
SSR = Σ(y′ - y̅)²
((y′ - y̅)² is calculated for each datapoint, then the squared values are summed up to get SSR)
SST: calculated from the actual measured values of y and the mean of the actual y values
it is the total sum of
[the difference, for each datapoint, between the actual value (y) and the mean (y̅)] -> squared -> summed up
SST = Σ(y - y̅)²
((y - y̅)² is calculated for each datapoint, then the squared values are summed up to get SST)
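The SSR/SST division above can be sketched directly; a pure-Python sketch, where the data and the fitted line y = 2x + 1 are invented for illustration:

```python
def r_squared(xs, ys, m, b):
    """R² = SSR / SST, using the sums of squares defined above."""
    y_mean = sum(ys) / len(ys)
    y_theory = [m * x + b for x in xs]                 # y' = mx + b
    ssr = sum((yt - y_mean) ** 2 for yt in y_theory)   # sum of squares regression
    sst = sum((y - y_mean) ** 2 for y in ys)           # sum of squares total
    return ssr / sst

# Every point lies exactly on the fitted line y = 2x + 1, so the fit is perfect
r2 = r_squared([1, 2, 3, 4], [3, 5, 7, 9], 2, 1)  # R² = 1.0
```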

Pearson Correlation in essence
A common measure of association between two variables.
Describes the strength or absence of a relationship between two variables.
Slightly different from linear regression analysis, which expresses the average mathematical relationship between two or more variables with the intention of visually plotting the relationship on a scatterplot.
Pearson correlation is a statistical measure of the co-relationship between two variables without any designation to independent and dependent qualities.
Interpretations of Pearson correlation coefficients
Pearson correlation (r) is expressed as a number (coefficient) between -1 and 1.
-1 denotes a perfect negative correlation,
0 equates to no correlation, and
+1 a perfect positive correlation.
a correlation coefficient of -1 means that for every increase in one variable, there is a decrease of a fixed proportion in the other variable
(airplane fuel which decreases in line with distance flown)
a correlation coefficient of 1 signifies an equivalent positive increase in one variable based on a positive increase in another variable
(food calories of a particular food that goes up with its serving size)
a correlation coefficient of zero notes that for every increase in one variable, there is neither a positive nor a negative change in the other (the two variables aren't related)
Interpretations of Pearson correlation coefficients.png
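A minimal sketch of computing r in pure Python; the example sequences are invented to show the +1 and -1 extremes described above:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - x_mean) ** 2 for x in xs))
    sy = math.sqrt(sum((y - y_mean) ** 2 for y in ys))
    return cov / (sx * sy)

# calories rise in step with serving size -> r = +1
r_pos = pearson_r([1, 2, 3], [100, 200, 300])
# fuel left falls in step with distance flown -> r = -1
r_neg = pearson_r([1, 2, 3], [30, 20, 10])
```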

Pearson correlation coefficients xmpl
Describes the strength or absence of a relationship between two variables
Pearson correlation coefficients xmpl.png

Clustering analysis in essence
clustering analysis aims
to group similar objects (data points) into clusters based on the chosen variables.
This method partitions data into assigned segments or subsets, where objects in one cluster resemble one another and are dissimilar to objects contained in the other cluster(s).
Objects can be interval, ordinal, continuous or categorical variables.
(a mixture of different variable types can lead to complications with the analysis because the measures of distance between objects can vary depending on the variable types contained in the data)
Regression and clustering
clustering analysis is used in
anthropology, where it was originally developed,
later (in the 1930s) in psychology,
then in personality psychology (1943),
today: in data mining, information retrieval, machine learning, text mining, web analysis, marketing, medical diagnosis, and many more
Specific use cases include analyzing symptoms, identifying clusters of similar genes, segmenting communities in ecology, and identifying objects in images.
not one fixed technique but rather a family of methods (including hierarchical clustering analysis and non-hierarchical clustering)
Hierarchical Clustering Analysis
(HCA) is a technique
to build a hierarchy of clusters.
An example: divisive hierarchical clustering, which is a top-down method where all objects start as a single cluster and are split into pairs of clusters until each object represents an individual cluster.
Hierarchical Clustering Analysis.png

Agglomerative hierarchical clustering
a bottom-up method of classification (more popular approach)
Carried out in reverse: each object starts as a standalone cluster, and a hierarchy is created by merging pairs of clusters to form progressively larger clusters.
three steps:
- Objects start as their own separate cluster, which results in a maximum number of clusters.
- The number of clusters is reduced by combining the two nearest (most similar) clusters. (methods differ in how they interpret the "shortest distance")
- This process is repeated until all objects are grouped inside one single cluster.
>> hierarchical clusters resemble a series of nested clusters organized within a hierarchical tree.
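The three steps above can be sketched as nearest-neighbor (single-linkage) agglomerative clustering; a toy sketch on invented one-dimensional points:

```python
def agglomerative(points, target_clusters=1):
    """Bottom-up clustering: start with singleton clusters, repeatedly merge
    the two nearest clusters until the target number is reached."""
    clusters = [[p] for p in points]        # step 1: maximum number of clusters
    while len(clusters) > target_clusters:  # step 3: repeat until done
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: shortest distance between any two members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # step 2: merge the two nearest
        del clusters[j]
    return clusters

# Two obvious groups on the number line
result = agglomerative([1, 2, 10, 11], target_clusters=2)
```

Recording the order of the merges (and the distance at each merge) is exactly the information a dendrogram visualizes.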
What is the difference between "agglomerative clustering" and "divisive clustering"?
Agglomerative clustering starts with a broad base and a maximum number of clusters.
The number of clusters falls at subsequent rounds until there’s one single cluster at the top of the tree.
In the case of divisive clustering, the tree is upside down. At the bottom of the tree is one single cluster that contains multiple loosely related clusters. These clusters are sequentially split into smaller clusters until the maximum number of clusters is reached.
Hierarchical clustering >> dendrogram chart to visualize the arrangement of clusters (dendrograms demonstrate taxonomic relationships and are commonly used in biology to map clusters of genes or other samples)
(Greek dendron - "tree")
Nearest neighbor and a hierarchical dendrogram.png

Agglomerative Clustering Techniques
Various methods
(differ in both the technique -to find the “shortest distance” between clusters- and in the shape of the clusters they produce)
Nearest Neighbor
The furthest neighbor
Average aka UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
Centroid Method
Ward’s Method
Nearest neighbor
creates clusters based on the distance between the two closest neighbors.
you find the shortest distance between two objects
>> combine them into one cluster >> repeated
>> the next shortest distance between two objects is found
(either expands the size of the first cluster or forms a new cluster between two objects)
Furthest Neighbor Method
Produces clusters by measuring the distance between the most distant pair of objects. The distance between each possible object pair is computed
>> and the distance between two clusters is taken as the distance between their furthest-apart members.
At each stage of hierarchical clustering, the two clusters closest by this measure are merged into a single cluster.
Sensitive to outliers.
Average aka UPGMA
(Unweighted Pair Group Method with Arithmetic Mean)
Merges objects by calculating the distance between two clusters and measuring the average distance between all objects in each cluster and joining the closest cluster pair.
Initially, no different from nearest neighbor, because the first cluster to be linked contains only one object. Once a cluster includes two or more objects > the average distance between objects within the cluster can be measured, which has an impact on classification.
Centroid Method
Utilizes the object in the center of each cluster (centroid) to determine the distance between two clusters.
At each step, the two clusters whose centroids are measured to be closest together are merged.
Ward’s Method
Draws on the sum of squares error (SSE) between two clusters over all variables to determine the distance between clusters.
All possible cluster pairs are considered >> the sum of the squared distances across all clusters is calculated. Each round merges the two separate clusters that best minimize SSE >> the pair of clusters that yields the smallest increase in the sum of squares is selected and conjoined.
Produces clusters relatively equal in size (may not always be effective).
Can be sensitive to outliers.
One of the most popular agglomerative clustering methods in use today.
Measures of Distance why important?
Measurement method >>
different method >>
different distance >>
lead to different classification results >>
impact on cluster composition

Distance measurement methods
Euclidean distance
(standard across most industries, including machine learning and psychology)
Squared Euclidean distance
Manhattan distance (reduces the influence of outliers and resembles walking a city block)
Maximum distance, and
Mahalanobis distance (internal cluster distances tend to be emphasized, while distances between clusters are less significant)
Manhattan distance versus Euclidean distance.png
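The distance measures above (apart from Mahalanobis, which needs the covariance matrix) can be sketched as small functions, assuming points are given as coordinate tuples:

```python
import math

def euclidean(p, q):
    """Straight-line distance (the standard choice in most industries)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def squared_euclidean(p, q):
    """Euclidean without the square root; amplifies large differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def manhattan(p, q):
    """City-block distance; reduces the influence of outliers."""
    return sum(abs(a - b) for a, b in zip(p, q))

def maximum_distance(p, q):
    """The largest single coordinate difference (Chebyshev distance)."""
    return max(abs(a - b) for a, b in zip(p, q))

# Walking from (0, 0) to (3, 4): straight line is 5 units, city blocks are 7
d_euclid = euclidean((0, 0), (3, 4))  # 5.0
d_block = manhattan((0, 0), (3, 4))   # 7
```

Running all four on the same pair of points shows concretely why the choice of measure can change which clusters count as "nearest".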

Euclidean distance formula
Nearest Neighbor Exercise
Non-Hierarchical Clustering methods
(Partitional clustering) differs from hierarchical clustering and is commonly used in business analytics.
Divides n objects into m clusters (rather than nesting clusters inside larger clusters).
Each object can only be assigned to one cluster and each cluster is discrete (unlike hierarchical clustering) >> no overlap between clusters and
no case of nesting a cluster inside another. >>
usually faster and require less storage space than hierarchical methods >>
(typically used in business scenarios)
Helps to select the optimal number of clusters to perform classification (rather than mapping the hierarchy of relationships within a dataset using a dendrogram chart)
Non-Hierarchical Clustering methods.png

Example of k-means clustering
k-means clustering in a nutshell and downsides
attempts to split data into k number of clusters
not always able to reliably identify a final combination of clusters
(need to switch tactics and utilize another algorithm to formulate your classification model)
measuring multiple distances between data points in a three- or four-dimensional space (with more than two variables) is much more complicated and time-consuming to compute
its success depends largely on the quality of the data, and
there's no mechanism to differentiate between relevant and irrelevant variables;
you must ensure the variables you select are relevant, especially if chosen from a large pool of variables
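A toy sketch of k-means on one-dimensional data (the points, k, and iteration count are invented; real implementations add convergence checks and smarter centroid initialization):

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Plain k-means: pick k random centroids, assign each point to its
    nearest centroid, recompute centroids as cluster means, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # recompute each centroid as the mean of its cluster (keep old if empty)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = k_means([1.0, 2.0, 10.0, 11.0], k=2)
```

On this well-separated toy data the algorithm settles on centroids 1.5 and 10.5; on noisier, higher-dimensional data it may land in different final combinations depending on initialization, which is the downside the card describes.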
What are Measures of Spread?
(measures of dispersion)
how wide the set of data is
The most common basic measures are:
The range
(including the interquartile range and the interdecile range)
(how much lies between the lowest value (start) and the highest value (end))
(the interquartile range tells you the range of the middle fifty percent of a set of data)
The standard deviation
square root of variance
a measure of how spread out data is around center of the distribution (the mean).
gives you an idea of where, percentage-wise, a certain value falls.
e.g. you score one SD above the mean on a test (normally distributed, bell-shaped) >> your score is higher than about 84% of test takers (putting you in the top 16%)
The variance
a very simple statistic that gives an extremely rough idea of how spread out a data set is. As a measure of spread, it's actually pretty weak: a large variance doesn't tell you much about the spread of the data, other than that it's big!
The most important reason the variance exists >> to find the SD
SD squared >> variance
Quartiles
divide your data set into quarters according to where the numbers fall on the number line.
not very useful on its own >> used to find more useful values like the interquartile range
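The basic measures above can be sketched together in pure Python; the dataset is invented, and the quartile function uses one of several common linear-interpolation conventions:

```python
import math

def spread_stats(data):
    """Range, sample variance, sample standard deviation, interquartile range."""
    xs = sorted(data)
    n = len(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance s²
    sd = math.sqrt(variance)  # SD is the square root of the variance

    def quartile(q):
        # linear-interpolation quartile (one of several conventions in use)
        idx = q * (n - 1)
        lo, frac = int(idx), idx - int(idx)
        return xs[lo] + frac * (xs[lo + 1] - xs[lo]) if frac else xs[lo]

    iqr = quartile(0.75) - quartile(0.25)  # middle 50% of the data
    return xs[-1] - xs[0], variance, sd, iqr

data_range, var, sd, iqr = spread_stats([2, 4, 4, 4, 5, 5, 7, 9])
```

Note the two-way relationship from the cards: the SD is the square root of the variance, and squaring the SD gives the variance back.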
how to insert unicode character symbols?
x with overline [x̅]:
Type the x then go to Insert >
Symbol
In the Character Viewer select Unicode from the left list
[You may have to click the ✲ to Customize the List]
Select Combining Diacritical Marks in the top middle pane
Locate & double-click the Overline [U+0305] in the lower middle pane
how to insert unicode character symbols.png

Variance summary
population mean character
mu
sample mean character
x bar (x overline)
population variance character
sigma squared
sample variance character
s squared
frequency distribution
a table dividing the data into groups (classes) that shows how many data values occur in each group
Summary of clustering types
Not everyone who has the symptoms has cancer (only 1 out of 10,000) >>
1/10,000 healthy individuals worldwide have the same symptoms but do not have cancer
What is the probability that a patient has cancer, given that they have the symptoms?
the incidence rate of cancer is 1/100,000
we need to designate A and B events:
P(A): real cancer case
P(B): probability of having symptoms (includes those having cancer with symptoms, plus those with no cancer but with symptoms >> all real positives plus the false positives)
P(A|B): this is the question; the probability of real cancer given the symptoms
(different from 100%, because there is a probability that the symptoms are false positives coming from non-cancer cases)
P(A): probability of real cancer >> 1/100,000 (implying a probability of non-cancer of 1 - 0.00001 = 0.99999)
P(B|A): probability of symptoms given cancer >> 1
P(B): the probability of symptoms (two components: actual cancer cases + false-positive symptomatic people): 1/100,000 + 1/10,000
- actual cancer cases with symptoms: 1/100,000 = 0.00001
- false-positive symptomatic non-cancer cases: 1/10,000 = 0.0001
(summing 1. + 2.: P(B) = 0.00011)
P(A|B) = P(A) * P(B|A) / P(B) >> 0.00001 * 1 / 0.00011 = 0.0909 = 9.1%
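The calculation above, as a short script:

```python
def bayes(p_a, p_b_given_a, p_b):
    """Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)."""
    return p_a * p_b_given_a / p_b

p_cancer = 1 / 100_000                 # P(A): incidence of real cancer
p_symptoms_given_cancer = 1.0          # P(B|A): every cancer case shows symptoms
p_symptoms = 1 / 100_000 + 1 / 10_000  # P(B): real positives + false positives

p_cancer_given_symptoms = bayes(p_cancer, p_symptoms_given_cancer, p_symptoms)
# 0.00001 / 0.00011 ≈ 0.0909, i.e. about 9.1%
```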

The entire output of a factory is produced on three machines (A, B, C). The three machines account for
20%, 30%, and 50% of the factory output. The fraction of defective items produced is
5% for the first machine; 3% for the second machine; and 1% for the third machine.
If an item is chosen at random from the total output and is found to be defective, what is the probability that it was produced by the third machine (C)?
question reformulated:
what is the proportion of the false item produced by machine C among all false items?
all false items: 2.4%
0.05*0.2 + 0.03*0.3 + 0.01*0.5 = 0.024
false items by C machine:
0.01 * 0.5 = 0.005 >> 0.5%
false items by C machine
among all false items:
0.5% / 2.4% = 5/24 ≈ 20.8%
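The same calculation as a short script:

```python
# Share of factory output and defect rate per machine
shares = {"A": 0.20, "B": 0.30, "C": 0.50}
defect_rates = {"A": 0.05, "B": 0.03, "C": 0.01}

# Total probability that a randomly chosen item is defective (all false items)
p_defective = sum(shares[m] * defect_rates[m] for m in shares)  # 0.024

# Bayes: P(C | defective) = P(defective | C) * P(C) / P(defective)
p_c_given_defective = defect_rates["C"] * shares["C"] / p_defective  # 5/24
```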

main problem with mean
how to overcome?
the mean can be highly sensitive to outliers.
(statisticians sometimes use the trimmed mean, which is the mean obtained after removing extreme values at both the high and low band of the dataset,
such as removing the bottom and top 2% of salary earners in a national income survey).
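A minimal sketch of a trimmed mean; the salary figures are invented, and the 10% trim here drops one value at each end:

```python
def trimmed_mean(data, trim_fraction):
    """Mean after removing the bottom and top trim_fraction of values."""
    xs = sorted(data)
    k = int(len(xs) * trim_fraction)  # number of values dropped at each end
    trimmed = xs[k:len(xs) - k] if k else xs
    return sum(trimmed) / len(trimmed)

# Invented salaries: one extreme outlier drags the plain mean far upward
salaries = [30_000] * 9 + [10_000_000]
plain_mean = sum(salaries) / len(salaries)                # 1,027,000
robust_mean = trimmed_mean(salaries, trim_fraction=0.10)  # back to 30,000
```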
how do you label population variance?
sigma squared
how do you label population standard deviation?
sample SD?
population SD: sigma
sample SD: s