CSCI 343 Quiz 1 Flashcards

1
Q

continuous data types

A

float, double

2
Q

discrete data types

A

int

3
Q

categorical data types

A

specific set, enum (ex: red, orange, yellow; classified, unclassified, part-time)

4
Q

binary data types

A

boolean, 0/1, T/F, logical

5
Q

ordinal data types

A

categorical with order (ex: rating 1, 2, 3, 4, 5; fresh, soph, jr, sr)

6
Q

mean (aka average)

A

sum / count

7
Q

trimmed mean

A

drop a few of the highest values and a few of the lowest values, then calculate the mean of the remaining values (ex: Olympic judging)
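
A minimal sketch of the plain and trimmed mean, using only the Python standard library; trimming one value from each end is an assumption here, conventions vary:

    from statistics import mean

    def trimmed_mean(values, trim=1):
        # drop `trim` values from each end of the sorted data, then average
        ordered = sorted(values)
        return mean(ordered[trim:len(ordered) - trim])

    scores = [9.5, 9.7, 9.8, 9.9, 4.0, 9.6]   # one judge scored way low
    print(mean(scores))                        # 8.75, pulled down by the 4.0
    print(trimmed_mean(scores))                # 9.65, drops the 4.0 and the 9.9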

8
Q

weighted mean (aka weighted average)

A

sum of each value times its corresponding weight, divided by the sum of the weights (ex: calculating course grades)
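
A minimal sketch of the weighted mean formula; the grade values and category weights are made up for illustration:

    def weighted_mean(values, weights):
        # sum of value * weight, divided by sum of weights
        return sum(v * w for v, w in zip(values, weights)) / sum(weights)

    grades  = [85, 92, 78]       # hypothetical homework, quiz, final scores
    weights = [0.4, 0.3, 0.3]    # hypothetical category weights
    print(weighted_mean(grades, weights))   # 85*0.4 + 92*0.3 + 78*0.3 = 85.0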

9
Q

median

A
middle value (when sorted), if there is an odd # of elements;
if there is an even # of elements, the median is the average of the two middle elements
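
A minimal sketch of this odd/even rule (the standard library's statistics.median does the same thing):

    def median(values):
        ordered = sorted(values)
        n, mid = len(values), len(values) // 2
        if n % 2 == 1:
            return ordered[mid]                        # odd: the middle element
        return (ordered[mid - 1] + ordered[mid]) / 2   # even: average the two middle

    print(median([3, 1, 2]))      # 2
    print(median([4, 1, 3, 2]))   # 2.5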
10
Q

(median/mean) is better for skewed data sets b/c ?

A

median, b/c it is not pulled toward outliers the way the mean is

11
Q

deviations

A

difference between the observed values and the mean

12
Q

variance

A

sum of the squared deviations from the mean, divided by n-1

13
Q

standard deviation

A

square root of the variance
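
A minimal sketch of the sample variance and standard deviation from the last two cards; statistics.variance and statistics.stdev use the same n-1 convention:

    import math
    from statistics import variance, stdev

    def sample_variance(values):
        m = sum(values) / len(values)                   # the mean
        squared_devs = [(x - m) ** 2 for x in values]   # squared deviations from the mean
        return sum(squared_devs) / (len(values) - 1)    # divided by n - 1

    data = [2, 4, 4, 4, 5, 5, 7, 9]
    print(sample_variance(data), variance(data))          # both 4.571...
    print(math.sqrt(sample_variance(data)), stdev(data))  # std dev = sqrt(variance)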

14
Q

range

A

max - min

15
Q

percentile

A

the pth percentile is a value (not necessarily in the data set) such that at least p% of the data items are less than or equal to it and at least (100-p)% of the data items are greater than or equal to it
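
A minimal sketch of this definition using the nearest-rank method, one of several common percentile conventions (numpy.percentile, for example, interpolates by default):

    import math

    def percentile(values, p):
        ordered = sorted(values)
        # smallest rank such that at least p% of the items are <= the value there
        rank = max(math.ceil(p / 100 * len(ordered)), 1)
        return ordered[rank - 1]

    data = [15, 20, 35, 40, 50]
    print(percentile(data, 40))   # 20: at least 40% are <= 20, at least 60% are >= 20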

16
Q

quartiles are

A

the values that split the sorted data into four 25% pieces (Q1 = 25th percentile, Q2 = median, Q3 = 75th percentile)

17
Q

mode

A

the data item that occurs the most

18
Q

box plots show

A

the quartiles and the median; the box spans the interquartile range (Q1 to Q3)

19
Q

machine learning

A

learning from experience/history

20
Q

supervised learning

A

prediction, regression (ex: professor tells you answers); uses labels

21
Q

the vast majority of data science work is in

A

supervised learning

22
Q

unsupervised learning

A

clustering; does not use labels

23
Q

reinforcement learning

A

robots, AI; no answer is given, but a reward (like closeness to victory, or hot/cold)

24
Q

basic formulation of machine learning

A
  • assume complete, correct data
  • correct label, unique label
  • prediction (or regression)
  • non-mixed types of attributes
25
Q

goal of machine learning is to

A

generalize from some samples so we can predict the label on unseen examples

26
Q

process of machine learning

A

build a model, train our model, and test the model built

27
Q

when we train a model, we

A

put in samples with labels, run a learning algorithm, and then create a classifier/predictor

28
Q

when we test a model, we

A

put in new examples to the classifier/predictor and retrieve the prediction of the label for the new example (then compare in confusion matrix)

29
Q

accuracy

A

# of correct predictions / total predictions

30
Q

confusion matrix

A

compares predicted labels to actual labels; you want the numbers along the diagonal (the correct predictions) to be big; also helps you evaluate your model
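
A minimal sketch tying the last two cards together: a confusion matrix built from predicted vs. actual labels, with accuracy read off the diagonal; the labels are made up for illustration:

    from collections import Counter

    actual    = ["cat", "cat", "dog", "dog", "dog", "cat"]
    predicted = ["cat", "dog", "dog", "dog", "cat", "cat"]

    counts = Counter(zip(actual, predicted))   # counts[(actual, predicted)] = # of examples
    for pair, n in sorted(counts.items()):
        print(pair, n)

    correct = sum(n for (a, p), n in counts.items() if a == p)   # the diagonal
    print("accuracy:", correct / len(actual))                    # 4 / 6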

31
Q

k-fold cross validation

A

for i = 1 to k:
    build model on all data except subset i
    predict on subset i
average the k results
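
A minimal runnable sketch of this loop with no libraries; the "model" is a trivial majority-label predictor standing in for a real learning algorithm:

    from collections import Counter

    def k_fold_accuracy(labels, k=3):
        n = len(labels)
        folds = [list(range(i, n, k)) for i in range(k)]   # split indices into k subsets
        results = []
        for i in range(k):
            test_idx = set(folds[i])
            train = [labels[j] for j in range(n) if j not in test_idx]
            majority = Counter(train).most_common(1)[0][0]   # "build model" on the rest
            test = [labels[j] for j in folds[i]]             # predict on subset i
            results.append(sum(y == majority for y in test) / len(test))
        return sum(results) / k                              # average the k results

    labels = [1, 1, 0, 1, 0, 1, 1, 0, 1]
    print(k_fold_accuracy(labels, k=3))   # ~0.667 on this toy data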

32
Q

LOOCV

A

leave-one-out cross validation (k-fold with k = n, the number of examples)

33
Q

noise

A

data that is not correct

34
Q

impute

A

fill in a missing value with an estimate, or with a sentinel value that marks it as blank

35
Q

curse of dimensionality

A

too many attributes (dimensions) for each example

36
Q

how to help with the curse of dimensionality

A

select a helpful subset of the attributes through dimension reduction or feature extraction
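
A minimal sketch of feature extraction via PCA, assuming scikit-learn is available; the data is random, just to show the shape change:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(100, 50)                      # 100 examples, 50 attributes each
    reduced = PCA(n_components=2).fit_transform(X)   # extract 2 derived features
    print(X.shape, "->", reduced.shape)              # (100, 50) -> (100, 2)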

37
Q

overfitting

A

doing well on the training data but poorly on the test data (the model worked too hard to fit the training data and is not general enough)

38
Q

must do better than baseline

A

it is tempting to just predict the most common label; a model must improve upon that baseline to be useful (ex: flipping a coin you'd be right 50% of the time, but if you can increase that to 53%, that's better)
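
A minimal sketch of the majority-class baseline; the labels are hypothetical:

    from collections import Counter

    labels = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]           # hypothetical test labels
    label, count = Counter(labels).most_common(1)[0]
    print("always predict", label, "->", count / len(labels))   # baseline accuracy 0.7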

39
Q

interpretability

A

you should be able to understand and explain the model, not say "the algorithm told me so, but I don't know why"