CSCI 343 Quiz 1 Flashcards
continuous data types
float, double
discrete data types
int
categorical data types
specific set, enum (ex: red, orange, yellow; classified, unclassified, part-time)
binary data types
boolean, 0/1, T/F, logical
ordinal data types
categorical with order (ex: rating 1, 2, 3, 4, 5; fresh, soph, jr, sr)
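Quick sketch of how the five data types above might look in Python. The variable names and values are made up for illustration; Enum and IntEnum are one possible way to model categorical and ordinal data, not the only one.

```python
from enum import Enum, IntEnum

# continuous: any real value in a range (hypothetical example value)
height_cm: float = 172.5

# discrete: countable whole numbers (hypothetical example value)
num_courses: int = 4

# binary: exactly two possible values
is_enrolled: bool = True


class Color(Enum):
    """Categorical: a fixed set of values with no meaningful order."""
    RED = "red"
    ORANGE = "orange"
    YELLOW = "yellow"


class ClassYear(IntEnum):
    """Ordinal: categorical, but the order carries meaning."""
    FRESHMAN = 1
    SOPHOMORE = 2
    JUNIOR = 3
    SENIOR = 4


# Ordinal values can be compared; plain Enum members cannot be ordered with <.
sophomore_before_junior = ClassYear.SOPHOMORE < ClassYear.JUNIOR
```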
mean (aka average)
sum / count
trimmed mean
drop a fixed number of the highest and lowest values, then calculate the mean of the remaining values (ex: Olympic judging scores)
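Minimal sketch of a trimmed mean. The parameter k (how many values to drop from each end) and the judge scores are assumptions for illustration.

```python
def trimmed_mean(values, k=1):
    """Mean after dropping the k lowest and k highest values.

    k is an assumed parameter: the number of values trimmed from each end.
    """
    if len(values) <= 2 * k:
        raise ValueError("not enough values left after trimming")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# made-up judge scores; the 6.0 is an unusually low score
scores = [9.5, 9.0, 8.5, 9.0, 6.0]
plain_mean = sum(scores) / len(scores)   # 8.4, pulled down by the 6.0
trimmed = trimmed_mean(scores, k=1)      # mean of 8.5, 9.0, 9.0 (about 8.83)
```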
weighted mean (aka weighted average)
sum of all values times corresponding weights divided by sum of weights (ex: calculating grades)
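Sketch of the weighted mean formula, using a grade calculation. The grade categories and weights are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# hypothetical course: homework 20%, midterm 30%, final 50%
grades = [90, 80, 70]
weights = [0.2, 0.3, 0.5]
course_grade = weighted_mean(grades, weights)  # (18 + 24 + 35) / 1.0, about 77
```

The weights do not need to sum to 1; dividing by the sum of the weights normalizes them.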
median
middle number (when sorted) if there is an odd # of elements; if there is an even # of elements, the median is the average of the two middle elements
(median/mean) is better for skewed data sets b/c ?
median b/c it is not pulled toward outliers the way the mean is
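Sketch of the median rule (odd vs. even count) and of why it holds up on skewed data. The salary numbers are made up.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    if not values:
        raise ValueError("median of empty data is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# made-up salaries (in thousands) with one extreme outlier
salaries = [30, 35, 40, 45, 1000]
mean_salary = sum(salaries) / len(salaries)  # 230.0, dragged up by the outlier
median_salary = median(salaries)             # 40, unaffected by how large the outlier is
```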
deviations
differences between the observed values and the mean
variance
sum of the squared deviations from the mean, divided by n-1
standard deviation
square root of the variance
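Sketch of variance and standard deviation using the n-1 (sample) divisor from the card above. The data set is made up.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    deviations = [x - mean for x in values]  # observed value minus the mean
    return sum(d * d for d in deviations) / (n - 1)


def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data, mean is 5
variance = sample_variance(data)  # 32 / 7, about 4.57
std_dev = sample_std_dev(data)    # about 2.14
```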
range
max - min
percentile
the pth percentile is a value (not necessarily in the data set) such that at least p% of the data items are this value or less and at least (100-p)% of the data items are this value or more
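Sketch of one way to compute a percentile: the nearest-rank method. This is an assumed convention chosen for illustration; it always returns a value from the data set and satisfies the definition above, but other methods interpolate between data items and can give different answers.

```python
import math


def percentile_nearest_rank(values, p):
    """pth percentile by the nearest-rank method (one of several conventions)."""
    if not values:
        raise ValueError("percentile of empty data is undefined")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # smallest rank such that at least p% of the items are at or below it
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


data = [15, 20, 35, 40, 50]             # made-up data
p25 = percentile_nearest_rank(data, 25)   # 20
p50 = percentile_nearest_rank(data, 50)   # 35
p100 = percentile_nearest_rank(data, 100) # 50
```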
quartiles are
the values that split the sorted data into four 25% pieces (Q1, Q2 = median, Q3)
mode
the data item that occurs most often
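Sketch of finding the mode with collections.Counter. Returning every tied value is a design choice assumed here; the data is made up.

```python
from collections import Counter


def modes(values):
    """All values that occur the most often (more than one if there is a tie)."""
    if not values:
        raise ValueError("mode of empty data is undefined")
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


single = modes([1, 2, 2, 3])                 # [2]
tied = modes([1, 2, 2, 3, 3])                # [2, 3]
colors = modes(["red", "yellow", "red"])     # ["red"]; works for categorical data too
```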
box plots show
the four quarters of the data, split at Q1, the median, and Q3; the box is the interquartile range (Q1 to Q3)
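Sketch of the numbers a box plot is drawn from, using statistics.quantiles from the standard library (Python 3.8+). The data is made up, and the default "exclusive" method is just one convention; other tools may compute slightly different quartiles.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data

# n=4 returns the three cut points that split the data into quarters
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1  # interquartile range: the length of the box

five_number_summary = {
    "min": min(data),
    "Q1": q1,
    "median": q2,
    "Q3": q3,
    "max": max(data),
}
```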
machine learning
learning from experience/history
supervised learning
prediction, regression (ex: professor tells you answers); uses labels
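Toy sketch of supervised learning: a 1-nearest-neighbor predictor that copies the label of the closest training example. The technique, the single numeric feature, and the training data are all assumptions chosen for illustration; the point is only that the training data comes with labels.

```python
def predict_1nn(train, query):
    """Predict the label of `query` from the closest labeled training example.

    train: list of (feature, label) pairs; labels are the "answers" given in advance.
    """
    if not train:
        raise ValueError("need at least one labeled example")
    nearest = min(train, key=lambda pair: abs(pair[0] - query))
    return nearest[1]


# made-up labeled data: (feature value, label)
train = [(1.0, "short"), (2.0, "short"), (8.0, "tall"), (9.0, "tall")]

label_a = predict_1nn(train, 1.4)  # "short": closest example is 1.0
label_b = predict_1nn(train, 7.5)  # "tall": closest example is 8.0
```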
the vast majority of data science work is in
supervised learning
unsupervised learning
clustering; does not use labels
reinforcement learning
robots, AI; no answer, but a reward (like closeness to victory, hot/cold)
basic formulation of machine learning
- assume complete, correct data
- correct label, unique label
- prediction (or regression)
- non-mixed types of attributes