CSCI 343 Quiz 1 Flashcards

1
Q

continuous data types

A

float, double

2
Q

discrete data types

A

int

3
Q

categorical data types

A

values from a specific set; enum (ex: red, orange, yellow; classified, unclassified, part-time)
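
Since the card mentions enums, here is a minimal Python sketch of a categorical type (the Color class and its members are hypothetical examples, not from the course):

```python
from enum import Enum

# Hypothetical categorical type: a fixed set of allowed values, no inherent order.
class Color(Enum):
    RED = "red"
    ORANGE = "orange"
    YELLOW = "yellow"

print(Color.RED)    # Color.RED
print(list(Color))  # all three allowed categories
```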

4
Q

binary data types

A

boolean, 0/1, T/F, logical

5
Q

ordinal data types

A

categorical with order (ex: rating 1, 2, 3, 4, 5; fresh, soph, jr, sr)

6
Q

mean (aka average)

A

sum / count

7
Q

trimmed mean

A

drop a few of the highest and lowest values and calculate the mean of the remaining values (ex: Olympic judging)
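
A minimal Python sketch, assuming we trim one value from each end (the function name and data are illustrative):

```python
def trimmed_mean(data, trim=1):
    """Mean after dropping `trim` values from each end of the sorted data."""
    xs = sorted(data)[trim:len(data) - trim]
    return sum(xs) / len(xs)

scores = [9.0, 9.2, 9.3, 9.4, 5.0]  # one judge scored very low
print(sum(scores) / len(scores))    # plain mean: 8.38, dragged down by the outlier
print(trimmed_mean(scores))         # 9.1666..., outliers dropped
```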

8
Q

weighted mean (aka weighted average)

A

sum of each value times its corresponding weight, divided by the sum of the weights (ex: calculating grades)
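
A small Python sketch of the formula, with made-up grades and weights:

```python
# Weighted mean: sum(value * weight) / sum(weights).
grades  = [90, 80, 70]   # homework, midterm, final (illustrative values)
weights = [0.2, 0.3, 0.5]

weighted = sum(g * w for g, w in zip(grades, weights)) / sum(weights)
print(weighted)  # 77.0
```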

9
Q

median

A

middle number (when sorted), if there is an odd number of elements;
if there is an even number of elements, the median is the average of the two middle elements
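
A minimal Python sketch covering both cases:

```python
def median(data):
    xs = sorted(data)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:                      # odd count: single middle element
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2  # even count: average the two middle

print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```
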
10
Q

(median/mean) is better for skewed data sets b/c ?

A

median, b/c it is not pulled by outliers the way the mean is

11
Q

deviations

A

difference between the observed values and the mean

12
Q

variance

A

sum of the squared deviations from the mean, divided by n-1
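
A small Python sketch of the sample variance (n-1 in the denominator) and, per the next card, the standard deviation as its square root; the data values are made up:

```python
import math

def sample_variance(data):
    m = sum(data) / len(data)              # mean
    sq_dev = [(x - m) ** 2 for x in data]  # squared deviations from the mean
    return sum(sq_dev) / (len(data) - 1)   # divide by n - 1

data = [2, 4, 4, 4, 5, 5, 7, 9]
var = sample_variance(data)
print(var)             # 4.5714...
print(math.sqrt(var))  # standard deviation, about 2.138
```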

13
Q

standard deviation

A

square root of the variance

14
Q

range

A

max - min

15
Q

percentile

A

the pth percentile is a value (not necessarily in the data set) such that at least p% of the data items are at or below it and at least (100-p)% of the data items are at or above it
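
A sketch of one common textbook rule that matches this definition (position = p/100 × n; round up, or average the two straddling items when the position lands exactly on a boundary). This is only one of several conventions:

```python
import math

def percentile(data, p):
    xs = sorted(data)
    n = len(xs)
    pos = p / 100 * n
    if pos == int(pos):                 # exactly on a boundary:
        i = int(pos)
        if i == 0 or i == n:            # guard p = 0 or p = 100
            return xs[0] if i == 0 else xs[-1]
        return (xs[i - 1] + xs[i]) / 2  # average the two straddling items
    return xs[math.ceil(pos) - 1]       # otherwise round up

print(percentile([15, 20, 25, 30, 35], 50))  # 25 (the median)
print(percentile([10, 20, 30, 40], 50))      # 25.0, not in the data set
```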

16
Q

quartiles are

A

the values that split the sorted data into four 25% pieces (Q1 = 25th percentile, Q2 = median, Q3 = 75th percentile)

17
Q

mode

A

the data item that occurs the most
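
A one-line Python sketch (ties broken arbitrarily):

```python
data = [1, 2, 2, 3, 2, 4]
print(max(set(data), key=data.count))  # 2, the most frequent item
```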

18
Q

box plots show

A

the quartiles and the median; the box spans the interquartile range (IQR), from Q1 to Q3

19
Q

machine learning

A

learning from experience/history

20
Q

supervised learning

A

prediction, regression (ex: professor tells you answers); uses labels

21
Q

the vast majority of data science work is in

A

supervised learning

22
Q

unsupervised learning

A

clustering; does not use labels

23
Q

reinforcement learning

A

robots, AI; no given answer, but a reward signal (ex: closeness to victory, hot/cold)

24
Q

basic formulation of machine learning

A

  • assume complete, correct data
  • correct label, unique label
  • prediction (or regression)
  • non-mixed types of attributes

25
Q

goal of machine learning is to

A

generalize from some samples so we can predict the label on unseen examples

26
Q

process of machine learning

A

build a model, train the model, and test the model built

27
Q

when we train a model, we

A

put in samples with labels, run a learning algorithm, and then create a classifier/predictor

28
Q

when we test a model, we

A

put in new examples to the classifier/predictor and get the predicted label for each new example (then compare predictions to actual labels in a confusion matrix)

29
Q

accuracy

A

number correct / total predictions

30
Q

confusion matrix

A

a table comparing predicted labels to actual labels; you want the counts along the diagonal (the correct predictions) to be large. It also helps you evaluate your model.
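
A toy Python sketch with made-up labels, printing the matrix and the accuracy from the previous card:

```python
from collections import Counter

# Made-up predicted vs. actual labels for illustration.
actual    = ["cat", "cat", "dog", "dog", "dog"]
predicted = ["cat", "dog", "dog", "dog", "cat"]

counts = Counter(zip(actual, predicted))
labels = sorted(set(actual) | set(predicted))
for a in labels:  # rows = actual, columns = predicted
    print(a, [counts[(a, p)] for p in labels])
# cat [1, 1]
# dog [1, 2]   <- the diagonal (1 and 2) holds the correct predictions

# accuracy = number correct / total predictions
print(sum(a == p for a, p in zip(actual, predicted)) / len(actual))  # 0.6
```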

31
Q

k-fold cross validation

A

for i = 1 to k:
    build a model on all data except subset i
    predict on subset i
average the results
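
The loop above, fleshed out as a runnable Python sketch; the toy model (always predict the most common training label) and the helper names are illustrative, not from the course:

```python
def fit(train_labels):
    # Toy "model" for illustration: always predict the most common training label.
    return max(set(train_labels), key=train_labels.count)

def k_fold_cv(labels, k):
    n = len(labels)
    scores = []
    for i in range(k):
        test = set(range(i * n // k, (i + 1) * n // k))  # subset i
        train = [labels[j] for j in range(n) if j not in test]
        model = fit(train)                               # build model on all data except subset i
        correct = sum(labels[j] == model for j in test)  # predict on subset i
        scores.append(correct / len(test))
    return sum(scores) / k                               # average the results

print(k_fold_cv(["a"] * 7 + ["b"] * 3, k=5))  # 0.7
```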

32
Q

LOOCV

A

leave-one-out cross validation (k = n)

33
Q

noise

A

data that is not correct

34
Q

impute

A

fill in a missing value with an estimate (e.g., the mean) or with a placeholder that marks it as missing
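
A minimal mean-imputation sketch in Python, using None to mark missing entries:

```python
# Fill missing values (None) with the mean of the known values.
values = [3.0, None, 4.0, None, 5.0]
known = [v for v in values if v is not None]
estimate = sum(known) / len(known)  # 4.0
print([v if v is not None else estimate for v in values])
# [3.0, 4.0, 4.0, 4.0, 5.0]
```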

35
Q

curse of dimensionality

A

too many attributes (dimensions) for each example

36
Q

how to help with the curse of dimensionality

A

select a helpful subset of attributes through dimension reduction or feature extraction

37
Q

overfitting

A

doing well on the training data but poorly on the test data (the model works too hard on the training data and is not general enough)

38
Q

must do better than baseline

A

it is tempting to just predict the most common label; a useful model must improve upon that baseline (ex: flipping a coin you'd be right 50% of the time, but a model that is right 53% of the time is an improvement)

39
Q

interpretability

A

you should be able to understand and explain the model, not just say "the algorithm told me so, but I don't know why"