Statistics notes, 30 March 2020 - Flashcards
Data types
Categorical and numerical
types of Categorical data
Nominal, Ordinal
Nominal:
Named data which can be separated into discrete categories which do not overlap.
Ordinal:
the variables have natural, ordered categories, and the distances between the categories are not known.
types of numerical data
Discrete, continuous
Ordinal data
a categorical, statistical data type
the variables have natural, ordered categories, and the distances between the categories are not known.
data which is placed into order or scale (no standardised value for the difference)
(easy to remember because ordinal sounds like order).
e.g.: rating happiness on a scale of 1-10. (no standardised value for the difference from one score to the next)
Nominal Data
mytutor.co.uk
Named data which can be separated into discrete categories which do not overlap.
(e.g., gender: male and female; eye colour; hair colour)
An easy way to remember this type of data is that nominal sounds like named,
nominal = named.
Ordinal Data
mytutor.co.uk
Ordinal data:
placed into some kind of order or scale. (ordinal sounds like order).
e.g.:
rating happiness on a scale of 1-10. (In scale data there is no standardised value for the difference from one score to the next)
positions in a race (1st, 2nd, 3rd etc). (the runners are placed in order of who completed the race in the fastest time to the slowest time, but no standardised difference in time between the scores).
Interval Data
mytutor.co.uk
Interval data:
comes in the form of a numerical value where the difference between points is standardised and meaningful.
e.g.: temperature, the difference in temperature between 10-20 degrees is the same as the difference in temperature between 20-30 degrees.
can be negative
(ratio data can NOT)
Ratio Data
mytutor.co.uk
Ratio data:
much like interval data – numerical values where the difference between points is standardised and meaningful.
it must have a true zero >> not possible to have negative values in ratio data.
e.g.: height, be that in centimetres, metres, inches or feet. It is not possible to have a negative height.
(compare this to temperature: it is possible for the temperature to be -10 degrees, but nothing can be -10 inches tall)
inferential statistics
Population: an entire group of items, such as people, animals, transactions, or purchases >> Descriptive statistics applied if all values in the dataset are known.
>> often not possible or feasible to analyse the entire population >>
Sample: a selected subset, called a sample, is extracted from the population.
The selection of the sample data from the population is random >> Inferential statistics applied >> develop models to extrapolate from the sample data to draw inferences about the entire population (while accounting for the influence of randomness)
Quantitative analysis can be split into two major branches of statistics:
Descriptive statistics (if all values in the dataset are known)
Inferential statistics (extrapolates from the sample data to draw inferences about the entire population)
inferential
drawing inferences from evidence; deductive
Descriptive statistical analysis
As a critical distinction from inferential statistics, descriptive statistical analysis applies to scenarios where all values in the dataset are known.
Confidence, confidence level
Confidence is a measure to express how closely the sample results match the true value of the population.
Confidence level: 0% - 100%
95%: if we repeat the experiment numerous times (under the same conditions), the results will match that of the full population in 95% of all possible cases.
Hypothesis Testing
Hypothesis test:
evaluate two mutually exclusive statements to determine which statement is correct given the data presented.
incomplete dataset >> hypothesis testing is applied in inferential statistics to determine if there’s reasonable evidence from the sample data to infer that a particular condition holds true of the population.
null hypothesis
A hypothesis that the researcher attempts or wishes to “nullify.”
most of the world believed swans were white, and black swans didn’t exist inside the confines of mother nature. The null hypothesis was that all swans are white.
The term “null” does not mean “invalid” or associated with the value zero.
In hypothesis testing, the null hypothesis (H0)
In hypothesis testing, the null hypothesis (H0) is assumed to be the commonly accepted fact but that is simultaneously open to contrary arguments.
If substantial evidence to the contrary >> the null hypothesis is disproved or rejected >> the alternative hypothesis is accepted to explain a given phenomenon.
The alternative hypothesis
The alternative hypothesis is expressed as Ha or H1.
Covers all possible outcomes excluding the null hypothesis.
What is the relationship between the null hypothesis and alternative hypothesis?
null hypothesis and alternative hypothesis are mutually exclusive,
which means no result should satisfy both hypotheses.
a hypothesis statement must be
a hypothesis statement must be clear and simple. Hypotheses are also most effective when based on existing knowledge, intuition, or prior research.
Hypothesis statements are seldom chosen at random. A good hypothesis statement should be testable through an experiment, controlled test or observation.
(Designing an effective hypothesis test that reliably assesses your assumptions is complicated and even when implemented correctly can lead to unintended consequences.)
A clear hypothesis
A clear hypothesis tests only one relationship and avoids conjunctions such as “and,” “nor” and “or.”
A good hypothesis should include an “if” and “then” statement
(such as: If [I study statistics] then [my employment opportunities increase])
The good hypothesis sentence structure
The first half of this sentence structure generally contains an independent variable (this is the hypothesis) (i.e., if I study statistics);
the second half contains a dependent variable (what you’re attempting to predict) (i.e., employment opportunities).
A dependent variable represents
A dependent variable represents what you’re attempting to predict,
the 2nd half of the hypothesis sentence
The independent variable is
The independent variable (in the first half of the sentence) is the variable that supposedly impacts the outcome of the dependent variable (which is the 2nd half of the hypothesis sentence).
double-blind
where neither the participants nor the experimental team are aware of who is allocated to the experimental group and the control group respectively.
probability
probability expresses the likelihood of something happening, in percentage or decimal form; typically given as a number with a decimal value called a floating-point number.
odds
odds define the likelihood of an event occurring with respect to the number of occasions it does not occur.
For instance, the odds of selecting the ace of spades from a standard deck of 52 cards are 1 against 51. On 51 occasions a card other than the ace of spades will be selected from the deck.
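A minimal Python sketch (names are illustrative) relating this probability to the odds:

```python
# Relating the ace-of-spades probability (1/52) to its odds (1 against 51).
p = 1 / 52                   # probability of drawing the ace of spades
odds_against = (1 - p) / p   # occasions it does not occur per occasion it does
print(f"p = {p:.4f}, odds = 1 against {odds_against:.0f}")   # 1 against 51
```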
correlation
Correlation is often computed during the exploratory stage of analysis to understand general relationships between variables.
Correlation describes the tendency of change in one variable to reflect a change in another variable.
confounding variable
the observed correlation could be caused by a third and previously unconsidered variable,
aka lurking variable or confounding variable.
It’s important to consider variables that fall outside your hypothesis test as you prepare your research and before publishing your results.
to confuse, to perplex
confound
the curse of dimensionality
confusing correlation and causation arises when you analyze too many variables while looking for a match.
(In statistics, dimensions can also be referred to as variables. If we are analyzing three variables, the results fall into a three-dimensional space.)
You can find instances of the “curse” or phenomenon using Google Correlate (www.google.com/trends/correlate)
the curse of dimensionality tends to affect machine learning and data mining analysis more than traditional hypothesis testing due to the high number of variables under consideration. e.g:
It turns out that the Bang energy drink, for example, came onto the market at a similar time as Alibaba Cloud’s international product offering and then grew at a similar pace in terms of Google search volume.
curse
Data
A term for any value that describes the characteristics and attributes of an item that can be moved, processed, and analyzed.
The item could be a transaction, a person, an event, a result, a change in the weather, and infinite other possibilities.
Data can contain various sorts of information, and through statistical analysis, these recorded values can be better understood and used to support or debunk a research hypothesis.
Population
The parent group from which the experiment’s data is collected,
e.g., all registered users of an online shopping platform or all investors of cryptocurrency.
Sample
A subset of a population collected for the purpose of an experiment,
e.g., 10% of all registered users of an online shopping platform or 5% of all investors of cryptocurrency.
A sample is often used in statistical experiments for practical reasons, as it might be impossible or prohibitively expensive to directly analyze the full population.
Variable
A characteristic of an item from the population that varies in quantity or quality from another item,
e.g., the Category of a product sold on Amazon.
A variable that varies in regards to quantity and takes on numeric values is known as a quantitative variable,
e.g., the Price of a product.
A variable that varies in quality/class is called a qualitative variable,
e.g., the Product Name of an item sold on Amazon.
This process is often referred to as classification, as it involves assigning a class to a variable.
Variable types (what is the term for the process to establish types?)
quantitative variable (varies in regards to quantity and takes on numeric values),
qualitative variable (varies in quality/class),
classification
Discrete Variable
A variable that can only accept a finite number of values,
e.g., customers purchasing a product on Amazon.com can rate the product as 1, 2, 3, 4, or 5 stars. In other words, the product has five distinct rating possibilities, and the reviewer cannot submit their own rating value of 2.5 or 0.0009.
Helpful tip: qualitative variables are discrete,
e.g. name or category of a product.
Continuous Variable
A variable that can assume an infinite number of values,
e.g., depending on supply and demand, gold can be converted into unlimited possible values expressed in U.S. dollars.
A continuous variable can also assume values arbitrarily close together.
e.g.: price and reviews (number of reviews on a product) are continuous variables
Categorical Variables
A variable whose possible values consist of a discrete set of categories (such as gender or political allegiance),
rather than numbers quantifying values on a continuous scale.
Ordinal Variables
(a subcategory of categorical variables),
ordinal variables categorize values in a logical and meaningful sequence.
ordinal variables contain an intrinsic ordering or sequence such as {small; medium; large} or {dissatisfied; neutral; satisfied; very satisfied}.
The distance of separation between ordinal variables does not need to be consistent or quantified. (For example, the measurable gap in performance between a gold and silver medalist in athletics need not mirror the difference in performance between a silver and bronze medalist.)
(unlike standard categorical variables, e.g. gender or film genre, which have no intrinsic order)
Independent and Dependent Variables
An independent variable (expressed as X) is the variable that supposedly impacts the dependent variable (expressed as y).
For example, the supply of oil (independent variable) impacts the cost of fuel (dependent variable).
As the dependent variable is “dependent” on the independent variable, it is generally the independent variable that is tested in experiments. As the value of the independent variable changes, the effect on the dependent variable is observed and recorded.
In analyzing Amazon products, we could examine Category, Reviews and 2-Day Delivery as the independent variables and observe how changes in those variables affect the dependent variable of Price. Equally, we could select the Reviews variable as the dependent variable and examine Price, 2-Day Delivery and Category as the independent variables and observe how these variables influence the number of customer reviews.
What determines whether a variable is “independent” or “dependent”?
The labels of “independent” and “dependent” are hence determined by experiment design rather than inherent composition
(one variable could be a dependent variable in one study and an independent variable in another)
two events are considered independent if …
In probability,
two events are considered independent if the occurrence of one event does not influence the outcome of another event
(the outcome of one event, such as flipping a coin, doesn’t predict the outcome of another. If you flip a coin twice, the outcome of the first flip has no bearing on the outcome of the second flip)
P(E|F)
the probability of E given F
The probability of one event (E) given the occurrence of another conditional event (F) is expressed as P(E|F),
two events are said to be independent if ..
Conversely, two events are said to be independent if
P(E|F) = P(E).
This equation holds that the probability of E is the same irrespective of F being present.
This expression can also be tweaked to compare two sets of results where the conditional event (F) is absent from the second trial.
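A small simulation sketch, assuming two fair coin flips, showing that P(E|F) ≈ P(E) for independent events:

```python
import random

# F = "first flip is heads", E = "second flip is heads".
# For independent events the two estimates should roughly match.
random.seed(42)
flips = [(random.random() < 0.5, random.random() < 0.5) for _ in range(100_000)]

p_e = sum(second for _, second in flips) / len(flips)
e_when_f = [second for first, second in flips if first]
p_e_given_f = sum(e_when_f) / len(e_when_f)

print(f"P(E)   ~ {p_e:.3f}")          # ~0.5
print(f"P(E|F) ~ {p_e_given_f:.3f}")  # ~0.5 >> P(E|F) = P(E)
```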
Bayes’ theorem in nutshell
The premise of this theory is to find the probability of an event, based on prior knowledge of conditions potentially related to the event.
Bayes’ theorem “is to the theory of probability what the Pythagorean theorem is to geometry.”
For instance, if reading books is related to a person’s income level, then, using Bayes’ theory, we can assess the probability that a person enjoys reading books based on prior knowledge of their income level.
In the case of the 2012 U.S. election, Nate Silver drew from voter polls as prior knowledge to refine his predictions of which candidate would win in each state. Using this method, he was able to successfully predict the outcome of the presidential election vote in all 50 states.
Triboluminescence
Triboluminescence is the light emitted when crystals are crushed.
“When you take a lump of sugar and crush it with a pair of pliers in the dark, you can see a bluish flash. Some other crystals do that too.”
lump - a small compact mass
pliers - a hand tool for gripping
Bayes’ theorem formula
P(A|B) = P(A) * P(B|A) / P(B)
P(A|B) is the probability of A given that B happens (conditional probability)
P(A) is the probability of A (without any regard to whether event B has occurred (marginal probability)
P(B|A) is the probability of B given that A happens (conditional probability)
P(B) is the probability of B without any regard to whether event A has occurred (marginal probability)
Bayes’ theorem can be written in multiple formats, including the use of P(A ∩ B) (intersection) in place of P(A) * P(B|A).
https://www.dropbox.com/s/p8io6kx4d4d0vne/Bayes%20theorem%20formula.png?dl=0
conditional probability (and what is its opposite?)
Both P(A|B) and P(B|A)
are the conditional probability of observing one event given the occurrence of the other.
Both P(A) and P(B)
are marginal probabilities, which is the probability of a variable without reference to the values of other variables.
Let’s imagine a particular drug test is 99% accurate at detecting a subject as a drug user.
Suppose now that 5% of the population has consumed a banned drug.
How can Bayes’ theorem be applied to determine the probability that an individual, who has been selected at random from the population is a drug user if they test positive?
we need to designate A and B events:
P(A): real drug user probability and
P(B): probability of identifying someone as positive (even if in reality they are not >> all real positives from users plus the false positives from non-users)
P(A|B): this is the question; the probability that an individual with a positive test result is a real drug user
(different from 0.99 because there is a probability that the test shows a false positive result for non-users;
the test does not catch all positives either, but that is not important now)
P(A): probability of a real “drug user” >> 0.05 (implies probability of a non-user: 1 - 0.05 = 0.95)
P(B|A): probability of a positive test >> 0.99 (result given that the individual is a drug user)
P(B): the probability of a positive test result (two elements: actually identified real users + falsely positively identified non-users): 0.059
- actually identified real users: 0.05 * 0.99 = 0.0495
- falsely positively identified non-users: (1 - 0.05) * 0.01 = 0.95 * 0.01 = 0.0095
0.059 = 0.0495 + 0.0095 (sum of the two elements above)
P(A|B) = P(A) * P(B|A) / P(B) >> 0.05 * 0.99 / 0.059 = 0.839
P(user|positive test) = P(user) * P(positive test|user)/P(positive test)
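A short Python sketch reproducing the drug-test numbers above:

```python
# Bayes' theorem applied to the drug-test example.
p_user = 0.05               # P(A): prior probability of being a drug user
p_pos_given_user = 0.99     # P(B|A): probability of a positive test for a user
p_pos_given_nonuser = 0.01  # false-positive rate for non-users

# P(B): total probability of a positive test = true positives + false positives
p_pos = p_user * p_pos_given_user + (1 - p_user) * p_pos_given_nonuser

# P(A|B) = P(A) * P(B|A) / P(B)
p_user_given_pos = p_user * p_pos_given_user / p_pos
print(f"P(B) = {p_pos:.3f}")                         # 0.059
print(f"P(user|positive) = {p_user_given_pos:.3f}")  # ~0.839
```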
What is the implication of the false positive test results? How to deal with it?
Using Bayes’ theorem, we’re able to determine that (in the current example) there’s an 83.9% probability that an individual with a positive test result is an actual drug user.
The reason this prediction is lower for the general population than the successful detection rate of actual drug users or P (positive test | user), which was 99%,
is due to the occurrence of false-positive results.
Bayes’ theorem weakness
important to acknowledge that Bayes’ theorem can be a weak predictor in the case of poor data regarding prior knowledge and this should be taken into consideration.
Binomial Probability
used for interpreting scenarios with two possible outcomes.
(Pregnancy and drug tests both produce binomial outcomes in the form of negative and positive results, as does flipping a two-sided coin.)
The probability of success in a binomial experiment is expressed as p, and the number of trials is referred to as n.
drawing aggregated conclusions from multiple binomial experiments such as flipping consecutive heads using a fair coin?
you would need to calculate the likelihood of multiple independent events happening,
which is the product (multiplication) of their individual probabilities
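A minimal sketch of the product rule for consecutive heads with a fair coin:

```python
# The probability of n consecutive heads is the product of the
# individual probabilities (0.5 per independent flip).
p = 0.5
for n in (1, 2, 3, 4):
    print(f"P({n} heads in a row) = {p ** n}")  # 0.5, 0.25, 0.125, 0.0625
```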
Permutations
tool to assess the likelihood of an outcome.
although not a direct metric of probability,
permutations can be calculated to understand the total number of possible outcomes, which can be used for defining odds.
calculate the full number of permutations, which refers to the maximum number of possible outcomes from arranging multiple items
find the full number of seating combinations for a table of three
we can apply the function three-factorial,
which entails multiplying the total number of items by each discrete value below that number,
i.e., 3 x 2 x 1 = 6.
Four-factorial is
Four-factorial is
4 x 3 x 2 x 1 = 24
you want to know the full number of combinations for randomly picking a box trifecta,
which is a scenario where you select three horses to fill the first three finishers in any order.
A common scenario for using permutations is horse betting;
we’re calculating the total number of permutations
and also a subset of desired possibilities (recording a 1st place, recording a 2nd place, and recording a 3rd place finish).
The total number of combinations of where each of the 20 horses can finish is calculated as twenty-factorial.
We next need to divide twenty-factorial by
seventeen-factorial to ascertain all possible combinations of a top three placing.
Twenty-factorial / Seventeen-factorial = 6,840
Thus, there are 6,840 possible combinations among a 20-horse field that will offer you a box trifecta.
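A quick check of the arithmetic in Python (math.perm needs Python 3.8+):

```python
import math

# Box-trifecta count for a 20-horse field:
# 20! / 17! = 20 x 19 x 18 ordered top-three finishes.
print(math.factorial(20) // math.factorial(17))  # 6840
print(math.perm(20, 3))                          # 6840 (same partial permutation)
```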
CENTRAL TENDENCY
the central point of a given dataset,
aka central tendency measures.
the three primary measures of central tendency are the mean, mode, and median.
The Mean
Arithmetic mean (sum divided by the number of observations):
the midpoint of a dataset; the average of a set of values and the easiest central tendency measure to understand.
sum of all numeric values / the number of observations
trimmed mean
the mean can be highly sensitive to outliers.
(statisticians sometimes use the trimmed mean, which is the mean obtained after removing extreme values at both the high and low band of the dataset,
such as removing the bottom and top 2% of salary earners in a national income survey).
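A sketch contrasting the mean with a trimmed mean, assuming SciPy is available and using made-up salary figures:

```python
import statistics
from scipy import stats  # assumes SciPy is installed

# A small salary sample (in thousands) containing one extreme outlier.
salaries = [28, 31, 33, 35, 38, 40, 44, 47, 52, 900]
print(statistics.mean(salaries))       # 124.8, dragged upward by the outlier
print(stats.trim_mean(salaries, 0.1))  # 40.0, after trimming 10% off each end
```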
The Median
the median pinpoints the data point(s) located in the middle of the dataset to suggest a viable midpoint.
The median, therefore, occurs at the position in which exactly half of the data values are above and half are below when arranged in ascending or descending order.
The solution for an even number of data points is to calculate the average of the two middle points
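A minimal sketch of the odd/even median behaviour:

```python
import statistics

# With an even number of data points, the median is the average
# of the two middle values.
print(statistics.median([1, 3, 5]))     # 3
print(statistics.median([1, 3, 5, 7]))  # 4.0 (average of 3 and 5)
```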
Is the median or the mean better?
The mean and median sometimes produce similar results, but, in general,
the median is a better measure of central tendency than the mean for data that is asymmetrical as it is less susceptible to outliers and anomalies.
The median is a more reliable metric for skewed (asymmetric) data
The Mode
statistical technique to measure central tendency
The mode is the data point in the dataset that occurs most frequently.
discrete categorical values
a variable that can only accept a finite number of values
ordinal values
the categorization of values in a clear sequence
(such as a 1 to 5-star rating system on Amazon)
Why is the mode advantageous?
easy to locate in datasets with a low number of discrete
categorical values (a variable that can only accept a finite number of values) or
ordinal values (the categorization of values in a clear sequence)
Why can the mode be disadvantageous?
The effectiveness of the mode can be arbitrary and depends heavily on the composition of the data.
The mode, for instance, can be a poor predictor for datasets that do not have a single most common discrete outcome (e.g., when all star ratings occur at about the same frequency).
Weighted Mean
a statistical measure of central tendency that factors in the weight of each data point when computing the mean.
used when you want to emphasize a particular segment of data without disregarding the rest of the dataset.
e.g.: students’ grades, with the final exam accounting for 70% of the total grade.
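A minimal sketch of the weighted mean, assuming hypothetical scores and a 30/70 coursework/final split:

```python
# Each score is multiplied by its weight before averaging.
scores = [82, 91]        # coursework score, final exam score
weights = [0.30, 0.70]   # final exam counts for 70% of the grade
weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
print(round(weighted_mean, 1))   # 88.3
```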
What is a suitable measure of central tendency?
depends on the composition of the data.
The mode: easy to locate in datasets with a low number of discrete values or ordinal values,
The mean and median: suitable for datasets that contain continuous variables.
The weighted mean: used when you want to emphasize a particular segment of data without disregarding the rest of the dataset.
MEASURES OF SPREAD
describes how data varies
The composition of two datasets can be very different despite the fact that each dataset has the same mean.
The critical point of difference is the range of the datasets, which is a simple measurement of data variance.
range of the datasets
As the difference between the highest value (maximum) and the lowest value (minimum),
the range is calculated by subtracting the minimum from the maximum.
knowing the range for the dataset can be useful for data screening and identifying errors.
An extreme minimum or maximum value, for example, might indicate a data entry error, such as the inclusion of a measurement in meters in the same column as other measurements expressed in kilometers.
Standard Deviation
describes the extent to which individual observations differ from the mean.
the standard deviation is a measure of the spread or dispersion among data points, just as important as central tendency measures for understanding the underlying shape of the data.
How does standard deviation measure variability?
Standard deviation measures variability
by calculating the square root of the average squared distance of all data observations from the mean of the dataset.
What do low/high standard deviation values mean?
the lower the standard deviation, the less variation in the data
When SD is a lower number (relative to the mean of the dataset) >> it indicates that most of the data values are clustered closely together,
whereas a higher value indicates a higher level of variation and spread.
what counts as a low or high standard deviation value depends on the dataset (on the mean, on the range, and even on the variability of the values in the dataset)
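A minimal sketch computing the standard deviation of a made-up sample:

```python
import statistics

# SD = square root of the average squared distance from the mean.
data = [4, 8, 6, 5, 3, 7]
print(statistics.fmean(data))   # 5.5 (the mean)
print(statistics.pstdev(data))  # ~1.71 (population SD, divides by n)
print(statistics.stdev(data))   # ~1.87 (sample SD, divides by n - 1)
```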
histogram
a visual technique for interpreting data variance: plot the dataset’s distribution of values
what is standard normal distribution?
A normal distribution with a
mean of 0 and a
standard deviation of 1
What histogram shape does a normal distribution produce?
data is distributed symmetrically >> a bell curve
A symmetrical bell curve of a standard normal model
Normal distribution can be transformed to a standard normal distribution by ..
converting the original values to standardized scores
normal distribution features:
- the highest point of the dataset occurs at the mean (x̄).
- the curve is symmetrical around an imaginary line that lies at the mean.
- at its outermost ends, the curves approach but never quite touch or cross the horizontal axis.
- the locations at which the curve transitions from upward to downward cupping (known as inflection points) occur one standard deviation above and below the mean.
how do variables diverge in the real world?
The symmetrical shape of the normal distribution is often a reasonable description.
(body height, IQ tests, variable values generally gravitate towards a symmetrical shape around the mean as more cases are added)
Empirical Rule
variables in the real world often spread in the symmetrical shape of a normal distribution; the Empirical Rule quantifies that spread
How does the Empirical Rule describe a normal distribution?
Approximately 68% of values fall within one standard deviation of the mean.
Approximately 95% of values fall within two standard deviations of the mean.
Approximately 99.7% of values fall within three standard deviations of the mean.
Aka the 68-95-99.7 Rule or the Three Sigma Rule
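A quick check of the rule against the standard normal CDF, assuming SciPy is available:

```python
from scipy import stats  # assumes SciPy is installed

# Share of a normal distribution within k standard deviations of the mean.
for k in (1, 2, 3):
    share = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} SD: {share:.4f}")  # 0.6827, 0.9545, 0.9973
```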
What did the French mathematician Abraham de Moivre discover?
Following an empirical experiment flipping a two-sided coin, de Moivre discovered that
an increase in events (coin flips) gradually leads to a symmetrical curve of binomial distribution.
What is Binomial distribution?
It describes a statistical scenario where only one of two mutually exclusive outcomes of a trial is possible (i.e., a head or a tail, true or false).
Total possible outcomes for the number of heads when flipping four standard coins
In the flipping experiment with 4 coins:
the histogram has five possible outcomes
the probability of most outcomes is now lower.
the more data >> the histogram contorts into a symmetrical bell-shape.
As more data is collected >> more observations settle in the middle of the bell curve, a smaller proportion of observations land on the left and right tails of the curve.
The histogram eventually produces approximately 68% of values within one standard deviation of the mean.
Using the histogram, we can pinpoint the probability of a given outcome such as two heads (37.5%) and whether that outcome is common or uncommon compared to other results—a potentially useful piece of information for gamblers and other prediction scenarios.
It’s also interesting to note that the mean, median, and mode all occur at the same point on the curve as this location is both the symmetrical center and the most common point. However, not all frequency curves produce a normal distribution.
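A minimal sketch reproducing the five-outcome probabilities:

```python
from math import comb

# P(k heads in 4 fair flips) = C(4, k) / 2^4.
for k in range(5):
    print(f"P({k} heads) = {comb(4, k) / 16:.4f}")
# P(2 heads) = 6/16 = 0.3750, the 37.5% outcome mentioned above
```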
MEASURES OF POSITION
on a normal curve there’s a decreasing likelihood of replicating a result the further that observed data point is from the mean.
We can also assess whether that data point is approximately
one (68%), two (95%) or three standard deviations (99.7%) from the mean.
This, however, doesn’t tell us the probability of replicating the result.
we want to identify the probability of replicating a result.
How to identify the probability of replicating a result?
Depending on the size of the dataset: Z-Score
Z-Score
finds the distance from the sample’s mean to an individual data point expressed in units of standard deviation.
A Z-score of 2.96 means ..
the data point is located 2.96 standard deviations from the mean in the positive direction.
This data point could also be considered an anomaly as it is close to three deviations from the mean and different from other data points.
A Z-score of -0.42 means ..
the data point is positioned 0.42 standard deviations from the mean in the negative direction,
(this data point is lower than the mean)
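A minimal sketch computing a Z-score for a made-up sample:

```python
import statistics

# Z-score: an observation's distance from the sample mean,
# expressed in units of standard deviation.
data = [12, 15, 14, 10, 18, 13, 16, 11, 14, 17]
mean = statistics.fmean(data)   # 14.0
sd = statistics.stdev(data)     # ~2.58
print((18 - mean) / sd)         # ~1.55: above the mean (positive direction)
```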
anomaly
if the Z-score falls three or more deviations from the mean, positive or negative (in the case of a normal distribution) >> anomaly
>> data points that lie an abnormal distance from other data points. >> a rare event that is abnormal and perhaps should not have occurred.
in the case of a normal distribution, if the Z-score falls three positive or negative deviations from the mean of the dataset, it falls beyond 99.7% of the other data points on the normal distribution curve.
sometimes viewed as a negative exception, such as fraudulent behavior or an environmental crisis.
help to identify data entry errors and are commonly used in fraud detection to identify illegal activities.
Outliers
no unified agreement on how to define outliers, but:
data points that diverge from primary data patterns; they record unusual scores on at least one variable and are more plentiful than anomalies.
Z-Score applies to..
a normally distributed sample
with a known standard deviation of the population.
When to use T-Score?
sometimes the mean isn’t normally distributed or the
standard deviation of the population is unknown or not reliable,
<< which could be due to insufficient sampling (small sample size)
What is the problem with small datasets?
The standard deviation of small datasets is susceptible to change as more observations are included
T-Score who, when discovered, how else called?
English statistician W. S. Gosset (who worked at the Guinness brewery in Dublin), in the early 20th century, published under the pen name “Student” >>
sometimes called “Student’s T-distribution.”
What distributions do the Z-score / T-score use?
Z-distribution / T-distribution (Student’s T-distribution)
What is the Z-score’s and T-score’s primary function?
They share the same primary function (measuring distance from the mean within a distribution), but they’re used with different sizes of sample data.
What is Z-distribution?
standard normal distribution
What does the Z-score measure?
the deviation of an individual data point from the mean for datasets with 30 or more observations
based on Z-distribution (standard normal distribution).
T-distribution features
the T-distribution is not one fixed bell curve; rather, its distribution curve changes (taking multiple shapes) in accordance with the size of the sample.
- if the sample size is small, (e.g. 10): >> the curve is relatively flat with a high proportion of data points in the curve’s tails.
- as the sample size increases >> the distribution curve approaches the standard normal curve (Z-distribution) with more data points closer to the mean at the center of the curve.
A standard normal curve is defined by…
by the 68-95-99.7 rule,
which sets approximate confidence levels for one, two, and three standard deviations from a mean of 0.
Based on this rule, 95% of data points will fall within 1.96 standard deviations of the mean.
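A quick comparison of Z- and T-distribution cut-offs, assuming SciPy is available:

```python
from scipy import stats  # assumes SciPy is installed

# Cut-offs for a central 95%: the normal curve gives 1.96; small-sample
# T-curves demand wider cut-offs that approach 1.96 as the sample grows.
print(stats.norm.ppf(0.975))   # ~1.96
for df in (9, 29, 299):        # degrees of freedom ~ sample size - 1
    print(df, round(stats.t.ppf(0.975, df), 3))  # 2.262, 2.045, 1.968
```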