CSCI 343 Quiz 1 Flashcards

1
Q

continuous data types

A

float, double

2
Q

discrete data types

A

int

3
Q

categorical data types

A

values from a specific set; enum (ex: red, orange, yellow; classified, unclassified, part-time)
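
Since the card mentions enums, here is a minimal Python sketch of a categorical type (the Color class and its members are hypothetical examples, not from the course):

```python
from enum import Enum

# Hypothetical categorical type: a fixed set of allowed values, no inherent order.
class Color(Enum):
    RED = "red"
    ORANGE = "orange"
    YELLOW = "yellow"

print(Color.RED)    # Color.RED
print(list(Color))  # all three allowed categories
```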

4
Q

binary data types

A

boolean, 0/1, T/F, logical

5
Q

ordinal data types

A

categorical with order (ex: rating 1, 2, 3, 4, 5; fresh, soph, jr, sr)

6
Q

mean (aka average)

A

sum / count

7
Q

trimmed mean

A

drop a few of the highest and lowest values and calculate the mean of the remaining values (ex: Olympic judging)
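
A minimal Python sketch, assuming we trim one value from each end (the function name and data are illustrative):

```python
def trimmed_mean(data, trim=1):
    """Mean after dropping `trim` values from each end of the sorted data."""
    xs = sorted(data)[trim:len(data) - trim]
    return sum(xs) / len(xs)

scores = [9.0, 9.2, 9.3, 9.4, 5.0]  # one judge scored very low
print(sum(scores) / len(scores))    # plain mean: 8.38, dragged down by the outlier
print(trimmed_mean(scores))         # 9.1666..., outliers dropped
```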

8
Q

weighted mean (aka weighted average)

A

sum of each value times its corresponding weight, divided by the sum of the weights (ex: calculating grades)
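
A small Python sketch of the formula, with made-up grades and weights:

```python
# Weighted mean: sum(value * weight) / sum(weights).
grades  = [90, 80, 70]   # homework, midterm, final (illustrative values)
weights = [0.2, 0.3, 0.5]

weighted = sum(g * w for g, w in zip(grades, weights)) / sum(weights)
print(weighted)  # 77.0
```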

9
Q

median

A

middle number (when sorted), if there is an odd number of elements;
if there is an even number of elements, the median is the average of the two middle elements
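
A minimal Python sketch covering both cases:

```python
def median(data):
    xs = sorted(data)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:                      # odd count: single middle element
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2  # even count: average the two middle

print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```
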
10
Q

(median/mean) is better for skewed data sets b/c ?

A

median, b/c it is not pulled by outliers the way the mean is

11
Q

deviations

A

difference between the observed values and the mean

12
Q

variance

A

sum of the squared deviations from the mean, divided by n-1
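
A small Python sketch of the sample variance (n-1 in the denominator) and, per the next card, the standard deviation as its square root; the data values are made up:

```python
import math

def sample_variance(data):
    m = sum(data) / len(data)              # mean
    sq_dev = [(x - m) ** 2 for x in data]  # squared deviations from the mean
    return sum(sq_dev) / (len(data) - 1)   # divide by n - 1

data = [2, 4, 4, 4, 5, 5, 7, 9]
var = sample_variance(data)
print(var)             # 4.5714...
print(math.sqrt(var))  # standard deviation, about 2.138
```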

13
Q

standard deviation

A

square root of the variance

14
Q

range

A

max - min

15
Q

percentile

A

the pth percentile is a value (not necessarily in the data set) such that at least p% of the data items are at or below it and at least (100-p)% of the data items are at or above it
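
A sketch of one common textbook rule that matches this definition (position = p/100 × n; round up, or average the two straddling items when the position lands exactly on a boundary). This is only one of several conventions:

```python
import math

def percentile(data, p):
    xs = sorted(data)
    n = len(xs)
    pos = p / 100 * n
    if pos == int(pos):                 # exactly on a boundary:
        i = int(pos)
        if i == 0 or i == n:            # guard p = 0 or p = 100
            return xs[0] if i == 0 else xs[-1]
        return (xs[i - 1] + xs[i]) / 2  # average the two straddling items
    return xs[math.ceil(pos) - 1]       # otherwise round up

print(percentile([15, 20, 25, 30, 35], 50))  # 25 (the median)
print(percentile([10, 20, 30, 40], 50))      # 25.0, not in the data set
```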

16
Q

quartiles are

A

the values that split the sorted data into four 25% pieces (Q1 = 25th percentile, Q2 = median, Q3 = 75th percentile)

17
Q

mode

A

the data item that occurs the most
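
A one-line Python sketch (ties broken arbitrarily):

```python
data = [1, 2, 2, 3, 2, 4]
print(max(set(data), key=data.count))  # 2, the most frequent item
```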

18
Q

box plots show

A

the quartiles and the median; the box spans the interquartile range (IQR), from Q1 to Q3

19
Q

machine learning

A

learning from experience/history

20
Q

supervised learning

A

prediction, regression (ex: professor tells you answers); uses labels

21
Q

the vast majority of data science work is in

A

supervised learning

22
Q

unsupervised learning

A

clustering; does not use labels

23
Q

reinforcement learning

A

robots, AI; no given answer, but a reward signal (ex: closeness to victory, hot/cold)

24
Q

basic formulation of machine learning

A

  • assume complete, correct data
  • correct label, unique label
  • prediction (or regression)
  • non-mixed types of attributes

25
Q

goal of machine learning is to

A

generalize from some samples so we can predict the label on unseen examples

26
Q

process of machine learning

A

build a model, train the model, and test the model built

27
Q

when we train a model, we

A

put in samples with labels, run a learning algorithm, and then create a classifier/predictor

28
Q

when we test a model, we

A

put in new examples to the classifier/predictor and get the predicted label for each new example (then compare predictions to actual labels in a confusion matrix)

29
Q

accuracy

A

number correct / total predictions

30
Q

confusion matrix

A

a table comparing predicted labels to actual labels; you want the counts along the diagonal (the correct predictions) to be large. It also helps you evaluate your model.
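
A toy Python sketch with made-up labels, printing the matrix and the accuracy from the previous card:

```python
from collections import Counter

# Made-up predicted vs. actual labels for illustration.
actual    = ["cat", "cat", "dog", "dog", "dog"]
predicted = ["cat", "dog", "dog", "dog", "cat"]

counts = Counter(zip(actual, predicted))
labels = sorted(set(actual) | set(predicted))
for a in labels:  # rows = actual, columns = predicted
    print(a, [counts[(a, p)] for p in labels])
# cat [1, 1]
# dog [1, 2]   <- the diagonal (1 and 2) holds the correct predictions

# accuracy = number correct / total predictions
print(sum(a == p for a, p in zip(actual, predicted)) / len(actual))  # 0.6
```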

31
Q

k-fold cross validation

A

for i = 1 to k:
    build a model on all data except subset i
    predict on subset i
average the results
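
The loop above, fleshed out as a runnable Python sketch; the toy model (always predict the most common training label) and the helper names are illustrative, not from the course:

```python
def fit(train_labels):
    # Toy "model" for illustration: always predict the most common training label.
    return max(set(train_labels), key=train_labels.count)

def k_fold_cv(labels, k):
    n = len(labels)
    scores = []
    for i in range(k):
        test = set(range(i * n // k, (i + 1) * n // k))  # subset i
        train = [labels[j] for j in range(n) if j not in test]
        model = fit(train)                               # build model on all data except subset i
        correct = sum(labels[j] == model for j in test)  # predict on subset i
        scores.append(correct / len(test))
    return sum(scores) / k                               # average the results

print(k_fold_cv(["a"] * 7 + ["b"] * 3, k=5))  # 0.7
```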

32
Q

LOOCV

A

leave-one-out cross validation (k = n)

33
Q

noise

A

data that is not correct

34
Q

impute

A

fill in a missing value with an estimate (e.g., the mean) or with a placeholder that marks it as missing
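
A minimal mean-imputation sketch in Python, using None to mark missing entries:

```python
# Fill missing values (None) with the mean of the known values.
values = [3.0, None, 4.0, None, 5.0]
known = [v for v in values if v is not None]
estimate = sum(known) / len(known)  # 4.0
print([v if v is not None else estimate for v in values])
# [3.0, 4.0, 4.0, 4.0, 5.0]
```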

35
Q

curse of dimensionality

A

too many attributes (dimensions) for each example

36
Q

how to help with the curse of dimensionality

A

select a helpful subset of attributes through dimension reduction or feature extraction

37
Q

overfitting

A

doing well on the training data but poorly on the test data (the model works too hard on the training data and is not general enough)

38
Q

must do better than baseline

A

it is tempting to just predict the most common label; a useful model must improve upon that baseline (ex: flipping a coin you'd be right 50% of the time, but a model that is right 53% of the time is an improvement)

39
Q

interpretability

A

you should be able to understand and explain the model, not just say "the algorithm told me so, but I don't know why"