General Flashcards
Sample percentile
In a sample ranked from least to greatest, the sample 100p percentile is the data value such that
- at least 100p percent of the data are less than or equal to it, and
- at least 100(1-p) percent of the data are greater than or equal to it.
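A minimal sketch of the definition above. The function name `sample_percentile` is hypothetical, and the tie rule (averaging the two candidate values when n*p is a whole number) is an assumption taken from a common textbook convention; other conventions exist.

```python
import math

def sample_percentile(data, p):
    """Return the sample 100p percentile of data, for 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    xs = sorted(data)
    n = len(xs)
    if n == 0:
        raise ValueError("data must not be empty")
    k = n * p
    if abs(k - round(k)) < 1e-9:
        # n*p is a whole number: two values satisfy the definition,
        # so average them (assumed tie rule).
        k = int(round(k))
        return (xs[k - 1] + xs[k]) / 2
    # Otherwise take the value at position ceil(n*p), counting from 1.
    return xs[math.ceil(k) - 1]

sample = [3, 1, 4, 1, 5, 9, 2, 6]
print(sample_percentile(sample, 0.5))
print(sample_percentile(sample, 0.9))
```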
A statistic
This is a numerical value that is derived from data.
Bivariate Data Analysis
This is the investigation of an independent variable (IV) and a dependent variable (DV) and the relationship between them.
Box Plots
This is a plot that shows the extreme values (minimum and maximum), the first quartile, the median, and the third quartile.
Central Tendency Measures
This is described by the mean, median, and mode of the dataset. The mean is influenced by extreme values, whereas the median is not.
Chebyshev's inequality
For a dataset with sample mean x̄ and sample standard deviation s, and for any number k ≥ 1, this gives the minimum percentage of the data that lies between x̄ - ks and x̄ + ks:
% min = 100(1 - 1/k^2)
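As a rough illustration, the bound can be compared with the percentage actually observed in a sample. The names `chebyshev_min_percent` and `percent_within` and the sample values are hypothetical.

```python
import statistics

def chebyshev_min_percent(k):
    """Minimum percentage of data within k sample standard deviations of the mean."""
    if k < 1:
        raise ValueError("the bound is only informative for k >= 1")
    return 100 * (1 - 1 / k**2)

def percent_within(data, k):
    """Observed percentage of data strictly within k standard deviations of the mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    inside = sum(1 for x in data if abs(x - mean) < k * s)
    return 100 * inside / len(data)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up sample with one outlier
print(chebyshev_min_percent(2), percent_within(data, 2))
```

The observed percentage is usually well above the bound, because the inequality makes no assumption about the shape of the distribution.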
Class Boundaries
These are the minimum and maximum values of the class intervals. We use the left-end inclusion rule: a value equal to the left boundary of a bin is included in that bin, and a value equal to the right boundary is not (it falls in the next bin).
Class intervals
These are the bins for grouping observations in a reasonable way.
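A small sketch of binning with the left-end inclusion rule, assuming equal-width class intervals. The function name `bin_counts` and its parameters are hypothetical.

```python
def bin_counts(data, start, width, n_bins):
    """Count observations per class interval [left, right), left end included."""
    counts = [0] * n_bins
    for x in data:
        idx = int((x - start) // width)  # bin i covers [start + i*width, start + (i+1)*width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
        # values outside the overall range are ignored in this sketch
    return counts

# 10 sits on a class boundary, so it is counted in the second bin, not the first.
print(bin_counts([0, 5, 10, 10, 19, 20], start=0, width=10, n_bins=2))
```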
Closed Data
This is data expressed as a fixed ratio (for example, percentages), where the maximum cannot exceed some value.
Examples include any cumulative data.
Correlation Coefficient
r = Σ(xi - x̄)(yi - ȳ) / [(n-1) sx sy] = Σ(xi - x̄)(yi - ȳ) / [Σ(xi - x̄)^2 Σ(yi - ȳ)^2]^0.5
For a paired dataset with pairs (xi, yi), means x̄ and ȳ, and sample standard deviations sx and sy, this statistic indicates how closely the pairs follow a linear relationship y = mx + b.
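A minimal sketch of the second form of the formula (sum of cross products over the square root of the product of the two sums of squares). The function name `correlation` and the sample pairs are hypothetical.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r for paired data (xs[i], ys[i])."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    # Undefined (division by zero) if either variable is constant.
    return sxy / math.sqrt(sxx * syy)

print(correlation([1, 2, 3, 4], [3, 5, 7, 9]))  # points lie exactly on y = 2x + 1
```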
Cumulative Frequency
This shows, for each bin, the running total of the frequencies of that bin and all bins before it.
Plots of cumulative frequency are also called ogives.
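The running total can be sketched in one line with the standard library. The function name `cumulative_frequencies` and the bin counts are hypothetical.

```python
from itertools import accumulate

def cumulative_frequencies(freqs):
    """Running total of bin frequencies, in bin order."""
    return list(accumulate(freqs))

bin_freqs = [2, 5, 3]  # made-up counts for three consecutive bins
print(cumulative_frequencies(bin_freqs))
```

The last entry always equals the sample size n.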
Directional Data
This is data expressed in angles and can indicate how a vector is directed in space.
Frequency Table
This is a table that lists how many times each value of a characteristic occurs in the sample. It is used when the characteristic takes a relatively small number of discrete values.
Gini Coefficient
The Gini coefficient (G) is twice the area between the line of equality, L(p) = p, and the Lorenz curve. That area has a maximum value of 0.5 and a minimum value of 0, so G ranges from 0 to 1.
G = 1 - 2B, where B = area under the Lorenz curve, L(p)
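A sketch of G = 1 - 2B for a finite list of incomes, assuming the Lorenz curve is built from the sorted cumulative income shares and B is approximated with the trapezoid rule. The names `lorenz_points` and `gini` and the income lists are hypothetical.

```python
def lorenz_points(incomes):
    """Points (p, L(p)) of the Lorenz curve: cumulative income share of the poorest fraction p."""
    xs = sorted(incomes)
    total = sum(xs)
    n = len(xs)
    points = [(0.0, 0.0)]
    running = 0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / n, running / total))
    return points

def gini(incomes):
    """Gini coefficient G = 1 - 2B, with B the trapezoid-rule area under the Lorenz curve."""
    pts = lorenz_points(incomes)
    area = sum((p1 - p0) * (l0 + l1) / 2 for (p0, l0), (p1, l1) in zip(pts, pts[1:]))
    return 1 - 2 * area

print(gini([5, 5, 5, 5]))   # everyone has the same income
print(gini([0, 0, 0, 10]))  # one person holds all the income
```

For a finite sample of n incomes the largest value this sketch can return is (n-1)/n, which approaches 1 as n grows.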
Histograms
These are bar charts of the frequency in each class interval, drawn with no spaces between adjacent bars.
Image Processing
This is an increasingly important form of analysis that involves converting images from signals to visuals, enhancing the signal-to-noise ratio, extracting features, and understanding patterns.
Inferential Statistics
This is the practice of using statistics to make inferences about an experiment or population.
Interval Data
These are data that are separated by equal intervals, and whose values can be less than zero (for example, temperature).
Lorenz Curve
This is a cumulative curve showing the income distribution: L(p) is the share of total income held by the lowest-earning 100p percent of the population.
Mean
x̄ = Σx/n = Σ(v*f)/n
where v = bin value, f = frequency of that bin, and n = Σf
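A short sketch of the frequency-weighted form of the mean, checked against the mean of the equivalent raw data. The function name `grouped_mean` and the values are hypothetical.

```python
import statistics

def grouped_mean(values, freqs):
    """Mean from a frequency table: sum of v*f divided by n, where n is the sum of f."""
    n = sum(freqs)
    return sum(v * f for v, f in zip(values, freqs)) / n

values = [1, 2, 3]  # bin values v
freqs = [2, 1, 1]   # frequencies f
raw = [1, 1, 2, 3]  # the same sample written out in full

print(grouped_mean(values, freqs), statistics.mean(raw))
```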
Mean influence by multiplication/addition
For a linear function y = ax + b:
ȳ = a x̄ + b, so the mean is scaled by the multiplier a and shifted by the added constant b.
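A quick numerical check of this rule, using a Celsius to Fahrenheit conversion as an assumed example (a = 1.8, b = 32). The temperature values are made up.

```python
import statistics

celsius = [10.0, 20.0, 30.0]
a, b = 1.8, 32.0  # y = a*x + b converts Celsius to Fahrenheit

fahrenheit = [a * c + b for c in celsius]

mean_of_transformed = statistics.mean(fahrenheit)
transformed_mean = a * statistics.mean(celsius) + b

# The two agree up to floating-point rounding.
print(mean_of_transformed, transformed_mean)
```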
Median
This is the middle value of a sample when the data are arranged from least to greatest.
If n is odd, the median is the value at position (n+1)/2.
If n is even, the median is the average of the values at positions n/2 and (n/2)+1.
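The odd and even cases can be sketched as follows. The function name `median` is hypothetical; positions in the definition count from 1, while Python indexes from 0.

```python
def median(data):
    """Median of a sample, following the odd/even rule."""
    xs = sorted(data)
    n = len(xs)
    if n == 0:
        raise ValueError("data must not be empty")
    mid = n // 2
    if n % 2 == 1:
        # Position (n+1)/2 counting from 1 is index n//2 counting from 0.
        return xs[mid]
    # Positions n/2 and (n/2)+1 counting from 1 are indices mid-1 and mid.
    return (xs[mid - 1] + xs[mid]) / 2

print(median([5, 1, 3]))
print(median([4, 1, 3, 2]))
```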
Mode
This is the observed value that occurs most often within a dataset. If more than one value occurs with the same highest frequency, then all of them are modal values.
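A minimal sketch that returns every modal value, so ties are reported rather than hidden. The function name `modal_values` and the sample data are hypothetical.

```python
from collections import Counter

def modal_values(data):
    """Return all values that share the highest frequency, in sorted order."""
    counts = Counter(data)
    top = max(counts.values())  # raises ValueError on empty data
    return sorted(value for value, count in counts.items() if count == top)

print(modal_values([1, 2, 2, 3, 3]))
print(modal_values(["quartz", "calcite", "quartz"]))
```

Because it only counts occurrences, the same function works for nominal data such as mineral names.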
Nominal Data
This is data that is non-numerical in character (fossils, minerals, rocks…)
It is occasionally converted into binary (0=not present, 1 = present)
Normal Data Set
This is a data set where mean = median = mode (at least approximately) and where approximately 68% of the data lies within x̄ ± s
approximately 95% is within x̄ ± 2s
approximately 99.7% is within x̄ ± 3s
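These percentages come from the theoretical normal distribution, for which the fraction within k standard deviations of the mean is erf(k/√2). A short sketch of that calculation follows; the function name `normal_fraction_within` is hypothetical, and a real sample will only match these figures approximately.

```python
import math

def normal_fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} s: {100 * normal_fraction_within(k):.2f}%")
```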
Ordinal Data
This is ranked data that can be numerical, but the intervals separating the data are not equal (for example, the Mohs scale of hardness). Values also cannot be negative.
Paired Data Sets
These are data sets in which each observation consists of a pair of values, collected to understand how one variable influences the other.
Population
This is the total collection of elements that we want to investigate. It is usually too large for every element to be investigated individually.
Probability Models
These are models that help us understand the validity of our conclusions by assigning probabilities to our results. They act as the basis of statistical inference: if an inference cannot be checked using a probability model, then we cannot conclude that the inference is legitimate.
r meaning
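If the slope relating y and x is < 0 then r < 0, and if the slope is > 0 then r > 0. The absolute value of r indicates the strength of the linear relationship, with |r| = 1 for a perfectly linear one.
If r is computed for (x, y), and w = a + bx and z = c + dy where b and d have the same sign, then
r(x,y) = r(w,z)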
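A small numerical check of this invariance, assuming made-up data and constants. The helper `corr` is a hypothetical, self-contained implementation of r. The last case shows that when the two multipliers have opposite signs, the magnitude of r is unchanged but its sign flips.

```python
import math

def corr(xs, ys):
    """Sample correlation coefficient r for paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

ws = [2 + 3 * x for x in xs]             # w = a + b*x with b > 0
zs = [-1 + 0.5 * y for y in ys]          # z = c + d*y with d > 0
zs_flipped = [-1 - 0.5 * y for y in ys]  # same transform with d < 0

r_xy = corr(xs, ys)
r_wz = corr(ws, zs)
r_flipped = corr(ws, zs_flipped)
print(r_xy, r_wz, r_flipped)
```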