Midterm Flashcards

1
Q

what can data do (4)

A

 Describe the current state of an organization or process
 Detect anomalous events
 Diagnose the causes of events and behaviors
 Predict future events

2
Q

describe the 4 steps in ds workflow

A

data collection and storage

data preparation

exploration and visualization

experimentation and prediction

3
Q

what are the 3 applications of data science

A

traditional machine learning

internet of things

deep learning

4
Q

what do we need for machine learning

A

a well defined question

a set of example data

a new set of data to use our algorithm on

5
Q

what is deep learning

A

many neurons work together

requires much more training data

used in complex problems: image classifications, language learning/understanding

6
Q

what is supervised machine learning

A

predictions from data with labels and features

7
Q

what is churn prediction

A

trying to predict whether the customer will likely terminate their subscription with a certain service in the future

8
Q

what is clustering and what are 3 use cases

A

divide data into groups (clusters) of similar observations, without using labels

use cases:
customer segmentation
image segmentation
anomaly detection

9
Q

how do you slice a list in python

A

list[start:end], where start is inclusive and end is exclusive; both are optional (omit start to begin at index 0, omit end to run to the end of the list)
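
A minimal sketch of the slicing rule, assuming a small example list (the name fam is borrowed from a later card and the values are made up):

```python
# Hypothetical example list of heights.
fam = [1.73, 1.68, 1.71, 1.89]

first_two = fam[0:2]    # indices 0 and 1; index 2 is excluded
from_second = fam[1:]   # end omitted: runs to the end of the list
up_to_third = fam[:3]   # start omitted: begins at index 0

print(first_two)    # [1.73, 1.68]
print(from_second)  # [1.68, 1.71, 1.89]
print(up_to_third)  # [1.73, 1.68, 1.71]
```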

10
Q

how do you delete an element in a list

A

del list[index] (del is a statement, so the parentheses in del(list[index]) are optional)

11
Q

does python work by reference or assignment

A

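reference: assignment binds a name to an object, so y = x makes y refer to the same list as x instead of copying it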

12
Q

how can you make a copy of a list instead of referencing the original

A

y = x[:]
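
A small sketch contrasting a plain assignment with a slice copy; the list contents and the name alias are made up for illustration. Note that x[:] makes a shallow copy, so nested lists inside x would still be shared.

```python
x = ["a", "b", "c"]

alias = x   # both names refer to the same list object
y = x[:]    # slicing the whole list builds a new list

alias[0] = "z"   # visible through x as well
y[1] = "y"       # affects only the copy

print(x)  # ['z', 'b', 'c']
print(y)  # ['a', 'y', 'c']
```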

13
Q

what are the 3 parameters of np.random.normal()

A

distribution mean
distribution standard deviation
number of samples
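
As a rough illustration, the three parameters map to the keyword names loc, scale and size in NumPy; the mean, standard deviation and sample count below are assumed values, not part of the card.

```python
import numpy as np

np.random.seed(5)  # fixed seed so repeated runs draw the same numbers

# loc = distribution mean, scale = standard deviation, size = number of samples
heights = np.random.normal(loc=1.75, scale=0.20, size=5000)

print(heights.shape)             # (5000,)
print(round(heights.mean(), 2))  # close to 1.75
print(round(heights.std(), 2))   # close to 0.20
```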

14
Q

how to check if "x" is a key in dictionary y

A

"x" in y

15
Q

what is pandas

A

high level data manipulation tool built on numpy

16
Q

suppose brics is a dataframe. what is the difference between brics["country"] and brics[["country"]]

A

the first returns a Series: the values of the country column together with the index

the second returns a DataFrame with a single column, country
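
A short sketch of the difference, assuming a small stand-in for the brics DataFrame (the rows and index labels are illustrative):

```python
import pandas as pd

# Hypothetical stand-in for the brics DataFrame used on these cards.
brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
     "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

as_series = brics["country"]    # single brackets: one column as a Series
as_frame = brics[["country"]]   # double brackets: a one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (5, 1)
```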

17
Q

what is the type of brics[1:4] considering brics is a dataframe

A

dataframe

18
Q

what is the difference between df.loc[row_label, col_label] and df.iloc[row_int, col_int]

A

loc selects by label (index and column names), while iloc selects by integer position
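
To make the distinction concrete, here is a minimal sketch that reads the same cell both ways; the DataFrame contents are assumed for illustration.

```python
import pandas as pd

# Hypothetical data with string row labels.
brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India"],
     "capital": ["Brasilia", "Moscow", "New Delhi"]},
    index=["BR", "RU", "IN"],
)

by_label = brics.loc["RU", "capital"]  # row label, column label
by_position = brics.iloc[1, 1]         # row position, column position

print(by_label)     # Moscow
print(by_position)  # Moscow
```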

19
Q

how to use logical operators with numpy

A

np.logical_and()
np.logical_or()
np.logical_not()
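
A brief sketch of the three functions applied element-wise to boolean arrays; the heights array and thresholds are made-up values.

```python
import numpy as np

heights = np.array([1.60, 1.75, 1.82, 1.95])  # hypothetical data

between = np.logical_and(heights > 1.70, heights < 1.90)
outside = np.logical_or(heights < 1.70, heights > 1.90)
not_tall = np.logical_not(heights > 1.80)

print(between.tolist())           # [False, True, True, False]
print(heights[between].tolist())  # [1.75, 1.82]
```

The resulting boolean array can be used directly as a mask, as the last line shows.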

20
Q

create a for loop that loops through a list and prints the index and its value

A

for index, height in enumerate(fam):
    print(index, height)

21
Q

loop over the contents of a dictionary

A

for key, value in worlds.items():
    print(key, value)

22
Q

how to loop through a dataframe printing index and row content

A

for index, row in brics.iterrows():
    print(index)
    print(row)  # row is a pandas Series in this case

23
Q

what does the following do

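brics["country"].apply(len)

A

returns a new Series containing the length of the country value in each row; to add it as a column, assign the result, e.g. brics["name_length"] = brics["country"].apply(len) (name_length is an arbitrary column name)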
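
A small sketch of what the call produces. Note that apply returns a new Series and leaves the DataFrame alone; storing the result is a separate step. The data and the column name name_length are assumptions for illustration.

```python
import pandas as pd

brics = pd.DataFrame({"country": ["Brazil", "Russia", "India", "China"]})

lengths = brics["country"].apply(len)  # new Series; brics is unchanged
print(list(brics.columns))             # ['country']

brics["name_length"] = lengths         # the assignment is what adds the column
print(brics["name_length"].tolist())   # [6, 6, 5, 5]
```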

24
Q

sort a dataframe by multiple values in ascending and descending order

A

df.sort_values(["col1", "col2"], ascending=[True, False])
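
A minimal sketch of the mixed-direction sort on made-up data: rows are ordered by col1 ascending, and ties within col1 are ordered by col2 descending.

```python
import pandas as pd

# Hypothetical data with repeated values in col1.
df = pd.DataFrame({
    "col1": ["b", "a", "a", "b"],
    "col2": [1, 2, 3, 4],
})

ordered = df.sort_values(["col1", "col2"], ascending=[True, False])

print(ordered["col1"].tolist())  # ['a', 'a', 'b', 'b']
print(ordered["col2"].tolist())  # [3, 2, 4, 1]
```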

25
Q

how to subset a dataframe to match 2 conditions

A

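h[cond1 & cond2] (rows where both conditions hold)
h[cond1 | cond2] (rows where at least one condition holds)

wrap each comparison in parentheses when writing it inline, e.g. h[(h["a"] > 1) & (h["b"] < 5)] for hypothetical columns a and b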

26
Q

how to return all rows where the value in column “state” in a dataframe is one of 3 predetermined values

A

h1 = h[h['state'].isin(['north', 'virginia', 'arizona'])]
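
The filtering on this card and the previous one can be sketched together. The DataFrame below is made up; only the name h and the state column come from the cards.

```python
import pandas as pd

# Hypothetical data; the count column is an assumption.
h = pd.DataFrame({
    "state": ["virginia", "arizona", "texas", "virginia"],
    "count": [10, 25, 40, 5],
})

# Each comparison needs its own parentheses: & and | bind tighter than == or >
both = h[(h["state"] == "virginia") & (h["count"] > 8)]
either = h[(h["state"] == "texas") | (h["count"] < 8)]

# isin keeps rows whose value appears in the given list
h1 = h[h["state"].isin(["virginia", "arizona"])]

print(len(both), len(either), len(h1))  # 1 2 3
```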

27
Q

how to plot a dataframe

A

df.plot(x=xcol, y=ycol, kind=kind, title=title)
plt.show()

in case of a histogram

df[col].hist()

28
Q

how to find how many nulls in every column in a dataframe

A

df.isna().sum() (can be plotted using .plot(kind='bar'))

29
Q

how to replace null values with a default value

A

df.fillna(0)
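
A short sketch that counts the nulls per column and then fills them, using a made-up DataFrame. Note that fillna returns a new DataFrame rather than changing df in place.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
})

null_counts = df.isna().sum()  # one count per column
filled = df.fillna(0)          # returns a new DataFrame; df is unchanged

print(null_counts.tolist())  # [1, 2]
print(filled["b"].tolist())  # [0.0, 0.0, 6.0]
```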

30
Q

what is statistics

A

the practice and study of collecting and analyzing data to derive a fact or a summary

31
Q

what are the types of statistics

A

descriptive statistics: describe and summarize data

inferential statistics: use a sample of data to make inferences about a larger population

32
Q

what are the types of data

A

numeric (quantitative):
continuous (measured)
discrete (counted)

categorical (qualitative):
nominal (unordered)
ordinal (ordered), e.g. strongly disagree -> strongly agree

33
Q

what is the difference between mean and median and mode

A

mean (average): sum of the values / number of samples

median: the middle value, with 50% of the data below it and 50% above it

mode: the most frequent value in the data
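
A quick worked example using Python's standard statistics module (an assumption; the cards elsewhere use NumPy). The values are made up so that one large value pulls the mean away from the median.

```python
import statistics

values = [2.0, 3.0, 3.0, 4.0, 13.0]  # hypothetical data with one large value

print(statistics.mean(values))    # 5.0
print(statistics.median(values))  # 3.0
print(statistics.mode(values))    # 3.0
```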

34
Q

what is a left skewed histogram and right skewed

A

a left skewed histogram has its long tail on the left: most values are concentrated at the high end and the mean is usually pulled below the median. a right skewed histogram is the opposite: the tail is on the right and most values are concentrated at the low end

35
Q

generate a one line code that groups a dataframe by country, and measures the mean and median of consumption

A

df.groupby('country')['consumption'].agg([np.mean, np.median])

36
Q

what are the two methods to calculate standard deviation

A

np.sqrt(np.var(df['col'], ddof=1))
np.std(df['col'], ddof=1)

note that ddof = 0 (the numpy default) divides by n and is used when the data is the entire population
ddof = 1 divides by n - 1 and is used when the data is a sample drawn from a larger population
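
A small numeric check of the two ddof settings on made-up data. With ddof=0 NumPy divides the sum of squared deviations by n (population formula); with ddof=1 it divides by n - 1 (sample formula).

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical values

population_std = np.std(data, ddof=0)          # divide by n
sample_std = np.std(data, ddof=1)              # divide by n - 1
via_variance = np.sqrt(np.var(data, ddof=1))   # same value as sample_std

print(population_std)             # 2.0
print(sample_std > population_std)  # True
```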

37
Q

what are quantiles (percentiles)

A

split the data into some number of equal parts
np.quantile(df[col], 0.5)

38
Q

what is IQR

A

interquartile range: another measure of spread, it’s the distance between the 25th and 75th percentile

39
Q

what is an outlier

A

a data point that is substantially different from the others

a data point is an outlier if:

data < Q1 - 1.5 x IQR
or
data > Q3 + 1.5 x IQR
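
The rule above can be sketched with NumPy as follows; the data values are made up, with one obviously extreme point.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0])

q1, q3 = np.quantile(data, [0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers.tolist())  # [50.0]
```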

40
Q

what is a plot that visualizes outliers

A

boxplot

41
Q

how to check if arrays A and B are equal?

A

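np.allclose(A, B) checks that the arrays are equal within a small floating-point tolerance; np.array_equal(A, B) checks for exact equality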
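
A brief sketch of why a tolerance-based comparison is often preferred for floats; the arrays are made up, and np.array_equal is shown only for contrast.

```python
import numpy as np

A = np.array([0.1 + 0.2, 1.0])
B = np.array([0.3, 1.0])

print(np.array_equal(A, B))  # False: 0.1 + 0.2 is not exactly 0.3
print(np.allclose(A, B))     # True: equal within the default tolerance
```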

42
Q

how to select a random entry from a dataframe

A

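df.sample(n) returns n random rows
or
df.sample(n, replace=True) to sample with replacement, so the same row can be picked more than once (the default, replace=False, picks each row at most once)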
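
A minimal sketch of sampling with and without replacement. The data is made up, and random_state is set only to make the run repeatable. Note that sampling returns a new DataFrame and never removes rows from df.

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30]})  # hypothetical data

# Default replace=False: each row can be picked at most once
without = df.sample(3, random_state=0)

# replace=True: a row can be picked more than once, so n may exceed len(df)
with_repeats = df.sample(5, replace=True, random_state=0)

print(sorted(without["value"].tolist()))  # [10, 20, 30]
print(len(with_repeats))                  # 5
print(len(df))                            # 3
```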

43
Q

what does np.random.seed(5) do

A

sets the seed (starting state) of the pseudorandom number generator so that we get the same random numbers on every run of the code
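
A quick check of that behavior: resetting the same seed reproduces the same draws. The use of np.random.rand here is just one convenient way to draw numbers.

```python
import numpy as np

np.random.seed(5)
first = np.random.rand(3)

np.random.seed(5)  # reset to the same seed
second = np.random.rand(3)

print(np.array_equal(first, second))  # True
```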

44
Q

how to generate a random number using scipy.stats

A

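uniform.rvs(loc=start, scale=end - start, size=arraySize), which draws from the uniform distribution on [start, end]; note that the second argument is the width of the interval, not its upper bound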

45
Q

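how to get a probability from the cumulative distribution function of a continuous distribution

A

uniform.cdf(x, loc=start, scale=end - start) returns P(X <= x) for a uniform distribution on [start, end]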
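
A short sketch of both scipy.stats.uniform calls from this card and the previous one. In SciPy, loc is the lower bound and scale is the width of the interval, so the distribution covers [loc, loc + scale]. The range 0 to 30 is an assumed example.

```python
from scipy.stats import uniform

start, end = 0, 30  # hypothetical range, e.g. a wait time in minutes

# Ten random draws from the uniform distribution on [start, end]
samples = uniform.rvs(loc=start, scale=end - start, size=10, random_state=5)

# P(X <= 12) for a uniform distribution on [0, 30]
p = uniform.cdf(12, loc=start, scale=end - start)
print(p)  # 0.4
```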

46
Q

when does the binomial distribution fail to apply

A

when the trials are not independent, the binomial distribution does not apply

47
Q

what is the inverse of norm.cdf(value, mean, std)

A

norm.ppf(percent, mean, std) (the third argument of both functions is the standard deviation, not the variance)
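
A small round trip showing that ppf undoes cdf. The mean and standard deviation are assumed values; in SciPy the third positional argument is scale, which is the standard deviation.

```python
from scipy.stats import norm

mean, std = 161, 7  # hypothetical mean and standard deviation

p = norm.cdf(154, mean, std)  # P(X <= 154), one standard deviation below the mean
x = norm.ppf(p, mean, std)    # ppf inverts cdf and recovers the value

print(round(float(p), 4))  # 0.1587
print(round(float(x), 6))  # 154.0
```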

48
Q

what is the difference between pmf and cdf

A

pmf: probability of exactly x (discrete distributions)

cdf: probability of a value up to and including x

49
Q

what is correlation

A

a number in [-1, 1] that measures the strength and direction of the linear relationship between x and y; the closer it is to 0, the weaker the relationship

df[x].corr(df[y])

could be used for linear regression

50
Q

how to visualize the linear regression model

A

import seaborn as sns
import matplotlib.pyplot as plt

sns.lmplot(x=x, y=y, data=data, ci=ci)
plt.show()

51
Q

what are pivot tables

A

tables that summarize an original table by grouping and aggregating its values

df.pivot_table(values=, index=, aggfunc=[np.mean, np.median])

aggfunc is optional; when omitted, the mean is used

can also add columns=col to split the aggregated values into one column per distinct value of col (for example a boolean column)
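
A minimal sketch of a pivot table with and without the columns argument. The data and the column names country, kind and sales are made up; a string column is used for columns= here, but a boolean column works the same way.

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "kind": ["holiday", "regular", "holiday", "regular"],
    "sales": [10.0, 20.0, 30.0, 50.0],
})

# aggfunc omitted: the mean of sales per country
by_country = df.pivot_table(values="sales", index="country")

# columns= adds one output column per distinct value of kind
by_both = df.pivot_table(values="sales", index="country", columns="kind")

print(by_country["sales"].tolist())  # [15.0, 40.0]
print(by_both.loc["B", "holiday"])   # 30.0
```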

52
Q

what are the requirements of supervised learning

A

no missing values

data in numeric format

data stored in pandas dataframe or numpy array

53
Q

what is k nearest neighbors

A

predict the label of any data point by looking at the k closest labeled data points and getting them to vote on what label the unlabeled observation should have
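
A toy sketch of the voting idea using scikit-learn's KNeighborsClassifier (an assumption; the card does not name a library). The one-feature data is made up so that the two classes are clearly separated.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one feature, two well-separated classes
X = [[1.0], [1.2], [0.8], [5.0], [5.3], [4.9]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)  # the 3 closest labeled points vote
knn.fit(X, y)

predictions = knn.predict([[1.1], [5.1]])
print(predictions.tolist())  # [0, 1]
```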

54
Q

what happens when our selected k in kNN is too high

A

high k causes underfitting

low k causes overfitting

55
Q

what happens if a and b are too high in linear regression

A

overfitting

when alpha in the ridge is too high, we get underfitting

56
Q

when is lasso regression used

A

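it is used to measure feature importance: it shrinks the coefficients of less important features to zero, so the features with non-zero coefficients are the ones selected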

57
Q

is accuracy always a good measure? what can replace it

A

no, it is not a good measure on imbalanced (uneven) classes. we can use a confusion matrix instead, along with the metrics derived from it such as precision and recall

58
Q

what are hyper parameters

A

parameters that we specify before fitting a model like alpha and n_neighbors

59
Q

how do you achieve hyper parameter tuning

A

1- Try lots of different hyperparameter values
2- Fit all of them separately
3- See how well they perform
4- Choose the best performing values
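
The four steps above can be sketched with scikit-learn's GridSearchCV, which is one common way to do this; the card itself does not name a tool. The synthetic dataset and the grid of n_neighbors values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for a real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Steps 1-3: try several values, fit each with 5-fold cross-validation, score them
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

# Step 4: keep the best value; the test set is used only once, at the end
print(search.best_params_)
print(search.score(X_test, y_test))
```

Scoring each candidate on cross-validation folds of the training data keeps the held-out test set out of the selection process.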

60
Q

why do we use cross-validation when fitting different hyperparameters

A

to avoid overfitting the hyperparameters to the test set

61
Q

when do we use standardization and when do we use normalization. how do we do them

A

we use standardization when the data follow a gaussian (normal) distribution (commonly with linear and logistic regression or neural networks). it rescales each feature to mean = 0 and std = 1: (x - mean) / std

we use normalization when we know the data does not follow a gaussian (normal) distribution. it rescales each feature to a fixed range, usually [0, 1]: (x - min) / (max - min)

both are linear rescalings, so they change the scale of the data but not the shape of its distribution
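
As for how to do them, one possible sketch uses scikit-learn's StandardScaler and MinMaxScaler (an assumption; the card does not name a tool). The single-feature data is made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # hypothetical feature

standardized = StandardScaler().fit_transform(X)  # (x - mean) / std
normalized = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min)

print(normalized.ravel().tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

After standardization the feature has mean 0 and standard deviation 1; after min-max normalization it lies in the range [0, 1].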

62
Q

what is L1 and L2 linear regression

A

L1 (lasso) and L2 (ridge) regularization

LogisticRegression(solver='liblinear', penalty='l1') (penalty is 'l2' by default)

63
Q

when is SVM used

A

widely used for classification problems but can be employed in regression problems

64
Q

what is a kernel trick

A

svm uses this trick to implicitly map non-separable datasets to a higher-dimensional space where they become linearly separable.

65
Q

what are the KPI’s to evaluate models

A

size of the dataset: fewer features = simpler model and faster training time

interpretability: easier to explain => important for stakeholders. like linear and logistic regression

flexibility: improve accuracy by making fewer assumptions like the KNN

metrics: RMSE, R-squared, accuracy, precision, recall

66
Q

what is inertia

A

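how spread out the samples of each k-means cluster are: the sum of squared distances from every sample to its cluster center (lower is better). it is used to measure the quality of a k-means clustering when we don't have prelabeled clusters

model.inertia_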
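
A tiny sketch of reading the inertia from a fitted scikit-learn KMeans model. The four points are made up so the result is easy to check by hand: each point lies 0.5 from its cluster center, so the sum of squared distances is 4 x 0.25 = 1.0.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two tight groups
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(points)

# Sum of squared distances from each point to its cluster center
print(model.inertia_)  # approximately 1.0
```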
