Chapter 1: Intro to statistics Flashcards
What are cases, variables, and labels?
Cases are the objects described by the data. A variable is a characteristic of a case. Labels are special variables that distinguish between cases, e.g. the name of a song.
What are the types of variables?
Categorical: places each case into one of several categories or groups. Continuous/quantitative: takes numerical values for which arithmetic operations can be performed.
Distribution of a variable tells us WHAT VALUES the variable takes & HOW OFTEN it takes each value.
Categorical: bar graph and pie chart (pie needs the % that make up the whole, so all categories must be included; less flexible)
Continuous: histograms, stemplots, timeplots
Stemplot: the stem is all but the final (rightmost) digit, the leaf is the final digit; write the leaves of each stem in ascending order
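A minimal sketch of the stemplot rule above, assuming non-negative whole-number observations. The function name `stemplot` and the `scores` data are made up for illustration.

```python
from collections import defaultdict

def stemplot(values):
    """Build stemplot rows: stem = all but the last digit, leaf = last digit."""
    rows = defaultdict(list)
    for v in sorted(values):          # sorting puts leaves in ascending order
        stem, leaf = divmod(v, 10)    # split off the rightmost digit
        rows[stem].append(leaf)
    lines = []
    # Include every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

scores = [12, 15, 21, 23, 23, 30, 34, 47]  # made-up observations
for line in stemplot(scores):
    print(line)
# 1 | 25
# 2 | 133
# 3 | 04
# 4 | 7
```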
Bar graph vs histogram
Bar graph gives the no. of cases in each category of a categorical variable. There can be space between bars.
Histogram gives the DISTRIBUTION of counts for a single continuous variable whose range is subdivided into numerical classes. There cannot be space between bars unless a class has frequency 0.
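The difference can be sketched by counting cases both ways. The data, the helper `class_counts`, and the class width of 1.0 are assumptions chosen for illustration only.

```python
from collections import Counter

# Bar graph: one count per category of a categorical variable.
genres = ["rock", "pop", "rock", "jazz", "pop", "rock"]  # made-up cases
bar_counts = Counter(genres)
print(bar_counts["rock"], bar_counts["pop"], bar_counts["jazz"])  # 3 2 1

def class_counts(values, start, width, n_classes):
    """Histogram: count values falling in each class [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_classes:
            counts[i] += 1
    return counts

durations = [2.1, 2.8, 3.0, 3.4, 3.9, 5.2]  # made-up song lengths in minutes
print(class_counts(durations, start=2.0, width=1.0, n_classes=4))  # [2, 3, 0, 1]
```

The class from 4 to 5 minutes has frequency 0, which is the only reason a gap would appear in the histogram.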
Overall pattern of distribution
Shape (symmetric or right/left skewed, unimodal), center, spread (max-min values), and outliers
Right-skewed
Right side OF THE GRAPH (containing the half of the observations with larger values) is longer than its left side
mean vs median
Both give the center
Mean is the average, sensitive to outliers
Median is the mid-pt of the sorted obs (half the values are larger, the other half smaller), resistant to outliers
odd no. of obs: middle no.
even no. of obs: average of the 2 middle nos.
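A quick check of the sensitive-vs-resistant claim, using Python's standard `statistics` module. The observations are made up.

```python
from statistics import mean, median

data = [2, 3, 4, 5, 6]            # made-up observations
with_outlier = data[:-1] + [60]   # same data, largest value replaced by an outlier

print(mean(data), median(data))                  # 4 4
print(mean(with_outlier), median(with_outlier))  # 14.8 4

# Even number of observations: the median is the average of the 2 middle nos.
print(median([1, 2, 3, 4]))                      # 2.5
```

The outlier pulls the mean from 4 to 14.8 while the median stays at 4.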
Spread
Can be given by:
1) Quartiles: Q1,Q2,Q3
Q1 is the median of obs to the left of the median
Q3 is the median of obs to the right of the median
Q2=median
IQR=Q3-Q1
2) Five Number summary: Min Q1 M Q3 Max
Can be indicated by box plot. The whiskers extending from box give min and max values (THAT ARE NOT OUTLIERS).
3) 1.5IQR rule: value below Q1-1.5IQR is an outlier
value above Q3+1.5IQR is an outlier
4) Standard deviation (s):
obs-mean: deviation
average of the squares of all deviations: variance (s^2)
square root of variance is SD
But the average divides by n-1 (degrees of freedom), not n
USED only when MEAN is measure of center
SENSITIVE to outliers like the mean, even more so bcuz the deviations are squared
s=0 when all values same i.e. no spread
SD preferred over Variance as UNITS same as OBS
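The four measures of spread can be sketched together. This assumes the median-of-halves rule for quartiles described above (software packages often use slightly different quartile rules) and at least 2 observations. The function names and data are made up for illustration.

```python
from math import sqrt
from statistics import mean, median, stdev

def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max) using the median-of-halves rule."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # obs to the left of the median
    upper = xs[half + (n % 2):]   # obs to the right (skip the middle obs when n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def outliers(values):
    """Flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

def sample_sd(values):
    """s = square root of (sum of squared deviations / (n - 1))."""
    m = mean(values)
    variance = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return sqrt(variance)

data = [1, 3, 4, 5, 6, 7, 9, 30]      # made-up observations
print(five_number_summary(data))      # (1, 3.5, 5.5, 8.0, 30)
print(outliers(data))                 # [30]
print(abs(sample_sd(data) - stdev(data)) < 1e-9)  # True: matches statistics.stdev
print(sample_sd([5, 5, 5]))           # 0.0, all values the same so no spread
```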
Measure of center and spread
Use mean and SD only for reasonably symmetric distri without outliers, as both are sensitive to outliers
For skewed distributions use median and quartiles
Linear transformation of variables
x_new = a + bx; x: obs
1) b multiplies the center (mean, median) as well as the spread (SD, IQR); the spread is multiplied by |b|
2) “a” adds to the center (mean, median) but NOT to the spread (SD, IQR), i.e. the spread remains the same
BUT OVERALL SHAPE of the distribution does not change by a linear transformation (for b > 0)
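A small numerical check of these rules. The values a = 32 and b = 1.8 (Celsius to Fahrenheit) and the observations are assumptions picked for illustration.

```python
from math import isclose
from statistics import mean, median, stdev

a, b = 32, 1.8                      # assumed example: Celsius to Fahrenheit
x = [10, 20, 30, 40]                # made-up observations
x_new = [a + b * xi for xi in x]    # x_new = a + b*x

# Center: multiplied by b, then shifted by a.
print(isclose(mean(x_new), a + b * mean(x)))      # True
print(isclose(median(x_new), a + b * median(x)))  # True

# Spread: multiplied by |b| only; adding a has no effect.
print(isclose(stdev(x_new), abs(b) * stdev(x)))   # True
```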
Density Curve
- Smooth curve drawn over a histogram, scaled so AUC=1 (areas are proportions); always on or above the horizontal axis
- Outliers not described
- Median: divides the area under the curve in half
- Mean: Balance-pt
- Actual obs: x-bar and s; density curve: mu and sigma
Normal distribution and Normal density curve
- Distribution that has shape: symmetric, unimodal and bell-shaped
- Its density curve is the NORMAL density curve
- N(mu, sigma)
- Mean=median and it is the measure of center as it is a symmetric distribution
- If mu changed then the curve moves to right or left but spread stays same
- Spread determined by sigma
68-95-99.7 Rule
In a normal distribution, approx. 68% of obs fall within 1 sigma of mu, 95% within 2 sigma and 99.7% within 3 sigma of mu
CHECKOUT SUMS
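The rule can be checked numerically with `NormalDist` from Python's standard `statistics` module. The helper name `prob_within` and the example N(100, 15) are assumptions for illustration.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """Proportion of a N(mu, sigma) distribution within k sigma of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, f"{prob_within(k):.4f}")
# 1 0.6827
# 2 0.9545
# 3 0.9973

# The proportions do not depend on mu or sigma:
print(f"{prob_within(2, mu=100, sigma=15):.4f}")  # 0.9545
```

The exact values are 68.27%, 95.45% and 99.73%, which the rule rounds to 68-95-99.7.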
Standard Normal Distribution
- Variable x has a “z-score/standardized value” z = (x - mu)/sigma
- If x is N(mu, sigma), then z follows the standard normal distri
- Always N(0,1)
- CHECKOUT SUMS
- Centered at 0, with nearly all values within 3 SD on either side, i.e. between -3 and 3 (acc to 68-95-99.7 rule)
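A short sketch of standardizing, assuming a made-up variable that is N(100, 15). It shows that the area to the left of x under N(mu, sigma) equals the area to the left of its z-score under N(0, 1).

```python
from math import isclose
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardized value: how many SDs x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

mu, sigma = 100, 15   # assumed example distribution N(100, 15)
x = 130
z = z_score(x, mu, sigma)
print(z)              # 2.0, i.e. x is 2 SD above the mean

# Same proportion whether computed on the original or the standardized scale.
print(isclose(NormalDist(mu, sigma).cdf(x), NormalDist(0, 1).cdf(z)))  # True
```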
Standard Normal Table
Gives AUC to the left of each z value
Covers z from -3.4 to 3.4
SOLVE SUMS
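For practice, table lookups can be imitated with `NormalDist` from the standard `statistics` module, whose `cdf` gives the area to the left of z. The helper names below are made up for illustration.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)  # standard normal distribution

def area_left(z):
    """What the table gives: AUC to the left of z."""
    return Z.cdf(z)

def area_right(z):
    """Total area is 1, so the right tail is 1 minus the table value."""
    return 1 - Z.cdf(z)

def area_between(z1, z2):
    """Area between two z values: difference of two table values."""
    return Z.cdf(z2) - Z.cdf(z1)

print(f"{area_left(1.96):.4f}")        # 0.9750
print(f"{area_right(1.0):.4f}")        # 0.1587
print(f"{area_between(-1, 1):.4f}")    # 0.6827

# Reverse lookup: the z value with a given area to its left.
print(f"{Z.inv_cdf(0.975):.2f}")       # 1.96
```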