General Flashcards
Sample percentile
In a sample ranked from least to greatest, the sample 100p percentile is the data value such that
- ) at least 100p percent of the data are less than or equal to it and
- ) at least 100(1-p) percent of the data are greater than or equal to it.
A statistic
This is a numerical value that is derived from data.
Bivariate Data Analysis
This is the investigation of the relationship between an independent variable (IV) and a dependent variable (DV)
Box Plots
This is a plot that shows the extreme values, the first quartile, median, and third quartile.
Central Tendency Measures
This is described by the mean, median, and mode of the dataset. The mean is influenced by extreme values, while the median is largely unaffected by them
Chebyshev's inequality
This gives the minimum percentage of a dataset that lies within x̄ ± ks, where s = sample standard deviation and k = some number ≥ 1:
% min = 100(1 - 1/k^2)
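A minimal Python sketch of the bound, compared against the observed percentage in a small sample. The function names and the data values are made up for illustration.

```python
import statistics

def chebyshev_min_percent(k):
    """Minimum percent of data within k sample standard deviations of the mean."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 100 * (1 - 1 / k**2)

def percent_within(data, k):
    """Observed percent of data inside [mean - k*s, mean + k*s]."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    inside = [x for x in data if abs(x - mean) <= k * s]
    return 100 * len(inside) / len(data)

data = [2, 4, 4, 5, 7, 9, 10, 12, 15, 40]  # illustrative values with one outlier
print(chebyshev_min_percent(2))  # 75.0
print(percent_within(data, 2))   # never below the bound above
```

The bound holds for any dataset, so it is usually conservative: the observed percentage is often well above it.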
Class Boundaries
These are the minimum and maximum values of the class intervals. We use the left-end inclusion rule, which says that a value equal to the left boundary is included in the bin and a value equal to the right boundary is not.
Class intervals
These are the bins for grouping observations in a reasonable way.
Closed Data
This is data expressed as proportions of a fixed total, so that no value can exceed some maximum (for example, percentages cannot exceed 100).
Examples include any cumulative data.
Correlation Coefficient
r = Σ(xi - x̄)(yi - ȳ) / [(n-1) sx sy] = Σ(xi - x̄)(yi - ȳ) / sqrt[Σ(xi - x̄)^2 · Σ(yi - ȳ)^2]
For a paired dataset with pairs (xi, yi) and respective means x̄ and ȳ, this statistic indicates how closely the pairs follow a linear relationship y = mx + b
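The second form of the formula (sums of deviations only) can be sketched in Python as follows. The function name and the sample pairs are illustrative assumptions.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r for paired data (xs[i], ys[i])."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4]
print(correlation(xs, [3, 5, 7, 9]))  # points on a rising line: r is 1
print(correlation(xs, [9, 7, 5, 3]))  # points on a falling line: r is -1
```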
Cumulative Frequency
This shows the running total of the frequencies across the bins.
Plots of cumulative frequency are also called ogives
Directional Data
This is data expressed in angles and can indicate how a vector is directed in space.
Frequency Table
This is a table that displays the number of occurrences vs. a characteristic of the sample being investigated with relatively small and discrete values.
Gini Coefficient
The Gini coefficient (G) is twice the area between the line of perfect equality, L(p) = p, and the Lorenz curve. That area has a maximum value of .5 and a minimum value of 0, so G ranges from 0 to 1
G = 1 - 2B where B = area under the Lorenz curve, L(p)
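A rough Python sketch of G = 1 - 2B, with B approximated by the trapezoid rule. The function name and the Lorenz curve points are made up for illustration.

```python
def gini_from_lorenz(p, L):
    """Approximate G = 1 - 2B, where B is the area under the Lorenz curve.

    p: cumulative population shares, increasing from 0 to 1
    L: cumulative income shares at those points, from 0 to 1
    """
    B = 0.0
    for i in range(1, len(p)):
        # trapezoid between consecutive points on the curve
        B += (p[i] - p[i - 1]) * (L[i] + L[i - 1]) / 2
    return 1 - 2 * B

p = [0, 0.2, 0.4, 0.6, 0.8, 1.0]
L = [0, 0.05, 0.15, 0.30, 0.55, 1.0]  # illustrative income shares
print(gini_from_lorenz(p, L))
print(gini_from_lorenz([0, 0.5, 1], [0, 0.5, 1]))  # perfect equality gives G = 0
```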
Histograms
These are bar graphs of class-interval frequencies in which adjacent bars touch, with no spaces between them
Image Processing
This is an increasingly important form of analysis that involves converting images from signals to visuals, enhancing the signal-to-noise ratio, extracting features, and understanding patterns.
Inferential Statistics
This is the practice of using statistics to make inferences about an experiment or population
Interval Data
These are data that are separated by even intervals but have no true zero, so values can be less than zero (temperature)
Lorenz Curve
This is a cumulative curve, L(p), showing the income distribution: the share of total income held by the lowest-earning fraction p of the population
Mean
x̄ = Σx/n = Σ(v·f)/n
where v = bin value and f = frequency
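A short Python sketch showing that the frequency-table form gives the same mean as summing the raw values. The function name and the table are illustrative.

```python
def mean_from_frequencies(values, freqs):
    """Mean computed from bin values v and their frequencies f: sum(v*f) / n."""
    n = sum(freqs)
    return sum(v * f for v, f in zip(values, freqs)) / n

values = [1, 2, 3]
freqs = [2, 5, 3]  # the value 1 occurs twice, 2 five times, 3 three times
raw = [1, 1, 2, 2, 2, 2, 2, 3, 3, 3]  # the same data written out in full

print(mean_from_frequencies(values, freqs))
print(sum(raw) / len(raw))
```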
Mean influence by multiplication/addition
For some linear function y = ax + b,
ȳ = a·x̄ + b, so the mean is affected by both multiplication and addition in a linear way
Median
This is the middle value of a sample when data is arranged from least to greatest
If n is odd then the median is the value in position (n+1)/2
If n is even then the median is the average of the values in positions n/2 and (n/2)+1
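The two cases can be sketched in Python as follows. The function name is illustrative, and positions are converted from the 1-indexed positions used above to Python's 0-indexed lists.

```python
def median(data):
    """Median of a sample using the odd/even position rules."""
    ordered = sorted(data)  # arrange from least to greatest
    n = len(ordered)
    if n % 2 == 1:
        # odd n: value in position (n+1)/2, shifted by 1 for 0-indexing
        return ordered[(n + 1) // 2 - 1]
    # even n: average of the values in positions n/2 and n/2 + 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```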
Mode
This is the observed value that occurs most often within a dataset. If more than one value occurs the same greatest number of times, then each of them is a modal value
Nominal Data
This is data that is non-numerical in character (fossils, minerals, rocks…)
It is occasionally converted into binary (0=not present, 1 = present)
Normal Data Set
This is a data set where mean = median = mode (approximately) and where about 68% of the data lies within x̄ ± s
About 95% is within x̄ ± 2s
About 99.7% is within x̄ ± 3s
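These percentages can be checked against the theoretical normal curve with Python's standard library. This is a sketch for the ideal distribution; a real sample will only match approximately. The function name is illustrative.

```python
from statistics import NormalDist

def percent_within_k_sd(k):
    """Percent of a normal distribution within k standard deviations of its mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return 100 * (z.cdf(k) - z.cdf(-k))

for k in (1, 2, 3):
    print(k, percent_within_k_sd(k))
```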
Ordinal Data
This is ranked data that can be numerical, but the intervals separating the values are not equal (Ex: Mohs scale of hardness). Values also cannot be negative
Paired Data Sets
These are data sets that are trying to understand how one variable influences a different variable
Population
This is the total collection of elements that we want to investigate. It is usually too large for each of its elements to be investigated individually.
Probability Models
These are models that help us understand the validity of our conclusions by assigning probabilities to finding our results. They act as the basis of statistical inference, and if an inference cannot be checked using a probability model then we cannot conclude the inference is legitimate.
r meaning
If the slope relating y and x is <0 then r <0 and vice versa. The absolute value of r indicates the strength of the linear relationship
If r is for (x, y) where w = a + bx and z = c + dy, and b and d have the same sign, then
r(x,y) = r(w,z)
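A small Python check of this property, including what happens when the two multipliers have opposite signs (r changes sign). The helper name, data, and constants are illustrative.

```python
import math

def r(xs, ys):
    """Sample correlation coefficient for paired data."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 4, 7, 9]
y = [2, 3, 3, 8, 9]
w = [2 + 3 * xi for xi in x]          # w = a + bx with b > 0
z = [1 + 4 * yi for yi in y]          # z = c + dy with d > 0
w_flipped = [2 - 3 * xi for xi in x]  # b < 0 while d > 0

print(r(x, y), r(w, z))   # equal up to rounding
print(r(w_flipped, z))    # same magnitude, opposite sign
```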
Ratio data
This is data that has a true zero, cannot be negative, and is on a scale where each interval is spaced evenly. Examples include weight or length
Relative frequency
This is f/n where f is the number of occurrences of a given phenomenon and n is the total number of observations investigated.
The sum of f/n over all categories = 1
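A brief Python sketch of relative frequencies for nominal data. The function name and the rock-type observations are made up for illustration.

```python
from collections import Counter

def relative_frequencies(observations):
    """Map each observed value to f/n, its share of all observations."""
    counts = Counter(observations)  # f for each distinct value
    n = len(observations)
    return {value: f / n for value, f in counts.items()}

samples = ["basalt", "granite", "basalt", "shale", "basalt", "granite"]
rel = relative_frequencies(samples)
print(rel)
print(sum(rel.values()))  # the relative frequencies add up to 1
```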
Sample 100p percentile
The data value at or below which 100*p% of the data lie. The count includes that data value itself
p = the proportion written as a decimal (between 0 and 1)
Sample Variance
s^2 = Σ(xi - x̄)^2/(n-1)
This is essentially the average of the squared differences between each data point and the mean, with n-1 used in place of n
The differences are squared so that positive and negative deviations do not cancel each other out
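The formula can be sketched directly in Python and compared with the standard library's `statistics.variance`, which also uses the n-1 denominator. The function name and data are illustrative.

```python
import statistics

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32
print(sample_variance(data))      # 32 / 7
print(statistics.variance(data))  # library value for comparison
```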
Samples
These are subgroups of populations which ideally represent the population
Sampling Strategies (3 kinds for geology)
Regular sampling is where sampling occurs in evenly distributed plots
Uniform sampling occurs by taking random samples within a defined area
Clustered sampling takes samples from an outcrop or other area of limited exposure
Scatter diagrams
These are x vs. y plots where y is the dependent variable (DV)
Signal Processing
This includes all techniques for manipulating a signal to minimize the noise. They are most often used in combination with time series analyses to make sense of things like geophysical data
Spatial Analyses
This is a suite of techniques used to understand how observations relate to one another in two- or three-dimensional space.
Spatial Data
This is data that is collected in two- or three-dimensional space and represents the occurrence of something in space
Ex: Spatial distribution of a tracer in water
Spread influence
Central tendencies are influenced by addition/subtraction of a constant, but spread is not. Multiplying by a constant changes the variance by the constant squared and the standard deviation by the absolute value of the constant
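A quick Python check of these rules for y = ax + b, using the standard library. The data and the constants a and b are arbitrary illustrative choices.

```python
import statistics

x = [1, 2, 3, 4, 5]
a, b = -3, 10
y = [a * xi + b for xi in x]  # multiply by a, then add b

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
var_x, var_y = statistics.variance(x), statistics.variance(y)
sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)

print(mean_y, a * mean_x + b)  # the mean follows the full transformation
print(var_y, a**2 * var_x)     # the variance scales by a squared; b has no effect
print(sd_y, abs(a) * sd_x)     # the standard deviation scales by |a|
```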
Statistics
This is the art of learning from data and includes collecting data, describing data, and drawing inferences from data
Stem and leaf Plots
These are plots for small to medium datasets of data that have two parts which can be separated to make a stem and a leaf
Time series Analysis
This is the study of data sequences as a function of time. It can also include periodic oscillatory data
Univariate Data Analysis
This is the analysis of a single variable. It is reserved for data that is independent, meaning that each outcome under analysis does not influence the others
Ways to Display a frequency table
These can be displayed using line graphs, bar graphs, or frequency polygons
Ways to show relative frequency
This can be shown on a table, line graph, bar chart, relative frequency polygon, or a pie chart.
What is a common need for descriptive statistics?
We need to be able to display data in a way that makes it interpretable and meaningful. It should be intuitive.
When are normal distributions most common?
These are most common in very large datasets or the conglomeration of datasets
Finding a sample percentile
- ) Arrange the data from least to greatest. n = number of data points
- ) Find n*p. This locates the position, counting from the smallest value, of the data value that satisfies the 100p percentile criteria.
- ) Apply one of two cases:
IF np is not an integer then ROUND UP to the next integer; the value in that position is the 100p percentile
IF np is an integer then the 100p percentile is the average of the np'th and (np+1)'th smallest values. This is like the median, where if n is even you use the average of the n/2 and n/2+1 values
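The steps above can be sketched in Python as follows. The function name and data are illustrative, p is assumed to lie strictly between 0 and 1, and the integer test uses a tolerance because n*p is computed in floating point.

```python
import math

def sample_percentile(data, p):
    """Sample 100p percentile using the round-up / average rule."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    ordered = sorted(data)  # least to greatest
    n = len(ordered)
    np_ = n * p
    k = round(np_)
    if math.isclose(np_, k):
        # np is an integer: average the k'th and (k+1)'th smallest values
        return (ordered[k - 1] + ordered[k]) / 2
    # np is not an integer: round up and take the value in that position
    return ordered[math.ceil(np_) - 1]

data = [3, 1, 4, 1, 5, 9, 2, 6]  # sorted: 1 1 2 3 4 5 6 9
print(sample_percentile(data, 0.5))  # np = 4, average of 4th and 5th values: 3.5
print(sample_percentile(data, 0.3))  # np = 2.4, round up to the 3rd value: 2
```

Calling it with p = 0.25, 0.5, and 0.75 gives the three quartiles.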
Quartiles
These are the 25th, 50th, and 75th percentiles. You find which values of the data set meet these criteria by finding np where p = .25, .5, .75; if np is a whole number use the average of the np'th and (np+1)'th values, otherwise round up.
Mean vs. Median when to use
Generally mean gives a better understanding of the dataset in terms of describing the data. The median should be used when probabilities are involved and/or the value is being used to understand the order of a group.
Ex: Housing. The mean income would be best for determining what the average person in an area can spend on a home, but if we want to design housing that 50% of the population could afford (P(Affordable)=.5) then the median is more useful.
Sample Space
S = sample space = all possible outcomes to some event or occurrence
Subset
This is a specific part of the sample space S consisting of one or more outcomes. Such a subset is what defines an event.
Intersection
This is the ∩ symbol, which is used interchangeably with “and”
To say that two events, E and F, both occur we could write EF, E∩F, or E and F
This means E and F must occur together. If E and F are mutually exclusive then EF = ϕ = null event
Null event
ϕ = null event = the scenario where the described situation cannot occur. It contains no outcomes, so there is no way for the events to occur as described.
Ex: if E and F are mutually exclusive then EF = ϕ because there are no parts of E and F that overlap
Union
E U F = E or F, which means that any outcome within event E, event F, or both is included.
Complement
If we have an event, E, within the sample space, S, then E^c = complement of E and includes every outcome in S that is not in E
Set Containment
If the occurrence of E means that F must also have occurred, then E is contained within F, which is written E ⊂ F (a sideways U symbol)
Commutative law for union/intersection
E U F = F U E
Event E or F = Event F or E
EF=FE
Event E and F = Events F and E
Associative Law for Union and Intersection
(E U F) U G = E U (F U G)
(E or F) or G = E or (F or G)
(EF)G = E(FG)
(E and F) and G = E and (F and G)
Distributive law for intersection and union
(E U F)G = EG U FG
Events (E or F) and G = Events (E and G) or (F and G)
EF U G = (E U G)(F U G)
Events (E and F) or G = Events (E or G) and (F or G)
De Morgan's Laws
(E U F)^c = E^c F^c
Neither E nor F occurs = E does not occur and F does not occur
(EF)^c = E^c U F^c
E and F do not both occur = E does not occur or F does not occur
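The commutative, associative, and distributive laws and De Morgan's laws can all be checked on a small example with Python sets, where `|` is union and `&` is intersection. The sample space (one roll of a die) and the events E, F, G are illustrative choices.

```python
S = set(range(1, 7))  # sample space: one roll of a six-sided die
E = {2, 4, 6}         # roll is even
F = {4, 5, 6}         # roll is at least 4
G = {1, 2, 3, 4}      # roll is at most 4

def complement(A):
    """Everything in the sample space S that is not in A."""
    return S - A

laws = {
    "commutative": (E | F == F | E) and (E & F == F & E),
    "associative": ((E | F) | G == E | (F | G)) and ((E & F) & G == E & (F & G)),
    "distributive": ((E | F) & G == (E & G) | (F & G))
                    and ((E & F) | G == (E | G) & (F | G)),
    "de_morgan": (complement(E | F) == complement(E) & complement(F))
                 and (complement(E & F) == complement(E) | complement(F)),
}
print(laws)
```

A check on one example does not prove the laws, but it is a quick way to catch a misremembered identity.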
Three axioms of Probability
1: 0 ≤ P(E) ≤ 1
2: P(S) = 1
3: P(E1 U E2 U … U En) = Σ P(Ei) = P(E1) + P(E2) + … + P(En)
This assumes that all of the events are mutually exclusive, i.e. EiEj = ϕ whenever i ≠ j
Sample spaces with equally likely outcomes
This refers to a sample space where each outcome has an equal probability of occurring, aka there is no weight to a particular outcome
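In this scenario each single outcome has probability 1/N = p, where N is the number of outcomes in S, and an event E containing k outcomes has P(E) = k/N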
Theory of counting
This says that if we have a set of first events that can occur in our sample space and each of these events leads to a series of potential secondary outcomes, then the total number of outcomes is the product of the number of first events and the number of secondary outcomes for each
In other terms, if we have “r” experiments and experiment i has ni possible outcomes, then n1*n2*…*nr = total number of outcomes (this is n^r when every experiment has the same “n” outcomes)
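A small Python sketch that enumerates outcomes with `itertools.product` to confirm the product rule. The coin-and-die setup is an illustrative assumption.

```python
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# Two experiments: flip a coin, then roll a die
outcomes = list(product(coin, die))
print(len(outcomes), len(coin) * len(die))  # 12 12

# Three experiments that each have the same 6 outcomes: roll a die three times
three_rolls = list(product(die, repeat=3))
print(len(three_rolls), len(die) ** 3)  # 216 216
```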
Permutations
This is a specific ordered arrangement of a set of objects. The total number of permutations of a set of n distinct things is equal to n!
number of unique groups in a set
If we have a set of size n and we want to know how many unique groups (combinations) of size r can be made from its elements, ignoring order, this is
= n!/[(n-r)!r!]
This says that from a set of size n we can choose “r” elements in this many unique ways
Combinations notation
The number of unique groups of size r that can be chosen from n elements
(n r), read “n choose r”, = n!/[(n-r)!r!] where r ≤ n
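A Python sketch comparing the factorial formulas with direct enumeration. `math.comb` is part of the standard library (Python 3.8 and later); the function name `n_choose_r` and the values n = 5, r = 2 are illustrative.

```python
import math
from itertools import combinations, permutations

def n_choose_r(n, r):
    """Number of unique groups of size r from n elements: n! / [(n-r)! r!]."""
    return math.factorial(n) // (math.factorial(n - r) * math.factorial(r))

n, r = 5, 2
items = list(range(n))

print(math.factorial(n), len(list(permutations(items))))  # 120 120
print(n_choose_r(n, r), math.comb(n, r))                  # 10 10
print(len(list(combinations(items, r))))                  # 10
```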
When is conditional probability particularly useful?
It is used when there is limited information within a problem (you are attempting to derive the probability of an event based on other events)
or
It is the easiest way to find the probability of a cause or input to an event with new information (backwards reasoning)
P(E|F) = ?
P(E|F) = P(EF)/P(F), provided P(F) > 0
Probability of E occurring given F has occurred = the probability E and F occur divided by the probability F occurs
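A worked sketch in Python, assuming a sample space of two fair dice with equally likely outcomes. The events and helper names are illustrative, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))  # all 36 outcomes of two dice

def prob(event):
    """P(event) = (outcomes in the event) / (outcomes in S), equally likely."""
    return Fraction(sum(1 for outcome in S if event(outcome)), len(S))

def E(outcome):
    return outcome[0] + outcome[1] == 8  # the dice sum to 8

def F(outcome):
    return outcome[0] == 3               # the first die shows 3

def E_and_F(outcome):
    return E(outcome) and F(outcome)

p_E_given_F = prob(E_and_F) / prob(F)  # P(E|F) = P(EF)/P(F)
print(prob(E), p_E_given_F)            # 5/36 and 1/6
```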
P(E|F^c) = ? (expand)
P(E|F^c) = P(EF^c)/P(F^c)
This says that the probability of E occurring given that F does NOT occur equals the probability of E occurring and F not occurring divided by the probability F does not occur.
P(E) Expansion using complements
P(E) = P(EF) + P(EF^c)
Probability of E = Prob of E and F + Probability of E and not F
P(E) = ? as a weighted average
P(E) = P(E|F)P(F) + P(E|F^c)(1-P(F))
P(E) = the weighted average of the probability of E given that F has occurred and the probability of E given that F has not occurred, with weights P(F) and 1-P(F)
Here E is treated as a possible consequence of F.
How to use the weighted probability of E?
If tasked with finding P(F|E), where E is the second event (the one that has been observed),
then P(F|E) = P(FE)/P(E)
where P(FE) = P(F)P(E|F) and P(E) = P(F)P(E|F) + P(F^c)P(E|F^c)
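A Python sketch of this backwards reasoning. The function names are illustrative, and the three input probabilities are made-up numbers, not values from these notes.

```python
def total_probability(p_f, p_e_given_f, p_e_given_not_f):
    """P(E) as a weighted average over F occurring and F not occurring."""
    return p_e_given_f * p_f + p_e_given_not_f * (1 - p_f)

def prob_f_given_e(p_f, p_e_given_f, p_e_given_not_f):
    """P(F|E) = P(F)P(E|F) / P(E)."""
    p_e = total_probability(p_f, p_e_given_f, p_e_given_not_f)
    return p_f * p_e_given_f / p_e

# Hypothetical inputs: P(F) = 0.01, P(E|F) = 0.95, P(E|F^c) = 0.05
p_e = total_probability(0.01, 0.95, 0.05)
p_f_e = prob_f_given_e(0.01, 0.95, 0.05)
print(p_e)    # about 0.059
print(p_f_e)  # about 0.161
```

With these inputs, observing E raises the probability of F from 0.01 to roughly 0.16, even though E is far more likely when F has occurred than when it has not.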
Independent events
If P(E|F) = P(E) then E and F are independent: the probability of E does not depend on whether F occurs. Equivalently, P(EF) = P(E)P(F)
and for three independent events E, F, and G, P(EFG) = P(E)P(F)P(G)
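A Python sketch that tests independence by comparing P(EF) with P(E)P(F) on two fair dice. The events and helper names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def prob(event):
    """Probability of an event given as a true/false test on an outcome."""
    return Fraction(sum(1 for o in S if event(o)), len(S))

def independent(A, B):
    """True when P(AB) = P(A)P(B)."""
    return prob(lambda o: A(o) and B(o)) == prob(A) * prob(B)

def first_even(o):
    return o[0] % 2 == 0

def second_is_six(o):
    return o[1] == 6

def sum_is_eight(o):
    return o[0] + o[1] == 8

print(independent(first_even, second_is_six))  # True: the dice do not affect each other
print(independent(sum_is_eight, first_even))   # False: the sum depends on the first die
```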