Analysis Basics Flashcards

Question 1

Q

What do you use to visualize the distribution or spread of a variable?

Answer

A

Histogram
Box plot

Question 2

Q

What do you do to understand the distribution?

Answer

A

Examine the “measures of central tendency”. This refers to describing the “middle” of the data by getting the mean, median, and mode.
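
A minimal sketch of the three measures, assuming Python's standard statistics module is acceptable here; the grades list is made-up sample data, not from the cards.

```python
import statistics

# Hypothetical sample of student grades (made-up values for illustration)
grades = [50, 50, 47, 97, 49, 3, 53, 42, 26, 74]

mean = statistics.mean(grades)      # sum of the values divided by their count
median = statistics.median(grades)  # middle value of the sorted sample
mode = statistics.mode(grades)      # most frequently occurring value

print(mean, median, mode)
```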

Question 3

Q

A simple average based on adding together all of the values in the sample set and then dividing the total by the number of samples.

Answer

A

Mean

Question 4

Q

The value in the middle of the range of all of the sample values.

Answer

A

Median

Question 5

Q

The most commonly occurring value in the sample set.

Answer

A

Mode

Question 6

Q

This refers to a tie for the most common value.

Answer

A

Bimodal or Multimodal
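
As a small illustration of a tie, statistics.multimode from the Python standard library returns every value that shares the highest count. The values list is invented for this sketch.

```python
import statistics

# Made-up sample in which 2 and 5 both occur three times
values = [1, 2, 2, 2, 3, 5, 5, 5, 8]

# multimode returns all values tied for most common, in first-seen order
modes = statistics.multimode(values)
print(modes)
```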

Question 7

Q

It refers to the data that we have on hand.

Answer

A

Sample

Question 8

Q

It refers to all the data that we can collect.

Answer

A

Population

Question 9

Q

Which function is used to estimate the distribution of a variable for the full population?

Answer

A

Probability Density Function (PDF)

Question 10

Q

What type of distribution has the mean and mode at the center and symmetric tails?

Answer

A

Normal Distribution

Question 11

Q

What type of distribution has the “bell shape” characteristic?

Answer

A

Normal Distribution
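
A quick sketch of the bell shape using statistics.NormalDist from the Python standard library. The mean of 50 and standard deviation of 10 are arbitrary values chosen for illustration.

```python
from statistics import NormalDist

# Made-up mean and standard deviation
dist = NormalDist(mu=50, sigma=10)

# Probability density at the center and at equal distances on either side
center = dist.pdf(50)
left = dist.pdf(40)
right = dist.pdf(60)

# The density peaks at the mean and is mirrored around it
print(center, left, right)
```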

Question 12

Q

It refers to a tendency to select certain types of values more frequently than others, in a way that misrepresents the underlying population, or ‘real world’.

Answer

A

Bias

Question 13

Q

What are the things to remember when examining real-world data?

Answer

A

Check for missing values and badly recorded data
Consider removal of obvious outliers
Consider what real-world factors might affect your analysis and consider if your dataset size is large enough to handle this
Check for biased raw data and consider your options to fix this, if found

Question 14

Q

It is a value that lies significantly outside the range of the rest of the distribution.

Answer

A

Outlier

Question 15

Q

Which type of distribution has the mass of the data on the left side, creating a long tail to the right because the values at the extreme high end pull the mean to the right?

Answer

A

Right skewed
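
To see the effect on the mean, here is a small sketch with invented salary values: most are low, and two very high values form the right tail.

```python
import statistics

# Made-up salaries: most values are low, a few are very high
salaries = [30, 32, 35, 36, 38, 40, 41, 45, 120, 250]

mean = statistics.mean(salaries)
median = statistics.median(salaries)

# In a right-skewed sample the long right tail pulls the mean above the median
print(mean, median, mean > median)
```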

Question 16

Q

How do you measure variability (variance) in the data?

Answer

A

Range
Variance
Standard Deviation

Question 17

Q

This refers to the difference between the maximum and minimum. There’s no built-in function for this, but it’s easy to calculate using the min and max functions.

Answer

A

Range

Question 18

Q

This refers to the average of the squared differences from the mean. You can use the built-in var function to find this.

Answer

A

Variance

Question 19

Q

This refers to the square root of the variance. You can use the built-in std function to find this.

Answer

A

Standard Deviation
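
A minimal sketch of range, variance, and standard deviation using only the Python standard library and made-up data. It uses the population formulas (divide by n), which match the "average of the squared differences" wording. Note that pandas var and std divide by n - 1 by default, so they return slightly larger values on the same data.

```python
import statistics

# Made-up sample values
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)     # range: maximum minus minimum
variance = statistics.pvariance(data)  # mean of squared differences from the mean
std_dev = statistics.pstdev(data)      # square root of that variance

print(data_range, variance, std_dev)
```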

Question 20

Q

It is a built-in method of the DataFrame object that returns the main descriptive statistics for all numeric columns.

Answer

A

df.describe()
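
A short sketch of the call, assuming pandas is installed. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data frame with two numeric columns
df = pd.DataFrame({
    "StudyHours": [10.0, 11.5, 9.0, 16.0, 9.25],
    "Grade": [50, 50, 47, 97, 49],
})

# One row per statistic: count, mean, std, min, quartiles, and max
summary = df.describe()
print(summary)
```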

Question 21

Q

When comparing numeric variables, how do you deal with numeric data in different scales?

Answer

A

Normalize the data

Question 22

Q

It is a technique that distributes the values proportionally on a scale of 0 to 1.

Answer

A

MinMax scaling
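
The idea can be sketched in plain Python as below. This is an illustrative helper (min_max_scale is a made-up name), not the scikit-learn MinMaxScaler class, which applies the same formula per column.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to scale, so return zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up values on two very different scales
grades = [47, 50, 74, 97]
study_hours = [9.0, 10.0, 13.5, 16.0]

scaled_grades = min_max_scale(grades)
scaled_hours = min_max_scale(study_hours)
print(scaled_grades)
print(scaled_hours)
```

After scaling, both variables lie between 0 and 1 and can be compared on the same axis.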

Question 23

Q

This indicates the strength of the relationship between variables.

Answer

A

Correlation

Values above 0 indicate a positive correlation (high values of one variable tend to coincide with high values of the other), while values below 0 indicate a negative correlation (high values of one variable tend to coincide with low values of the other).
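
A minimal sketch of the Pearson correlation coefficient in plain Python. The helper name pearson_r and the sample lists are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]
grades = [52, 58, 61, 70, 74]   # rises with hours: positive correlation
errors = [9, 7, 6, 4, 1]        # falls with hours: negative correlation

r_pos = pearson_r(hours, grades)
r_neg = pearson_r(hours, errors)
print(r_pos, r_neg)
```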

Question 24

Q

What do you use to visualize the correlation between two numeric variables?

Answer

A

Scatter plot

Question 25

Q

It is added to a scatter plot that shows the general trend in the data.

Answer

A

Regression line (line of best fit)

Question 26

Q

What is the slope-intercept form of a linear equation?

Answer

A

y = mx + b

Where:

y and x are the coordinate variables
m is the slope of the line
b is the y-intercept (where the line crosses the y-axis)

Question 27

Q

It is the line that gives us the lowest value for the sum of the squared errors.

Answer

A

Least Squares Regression
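
A sketch of the closed-form least squares fit for one feature, in plain Python with made-up data. The helper names are hypothetical. In practice a library routine such as scipy.stats.linregress returns the same slope and intercept, among other values.

```python
def least_squares_line(x, y):
    """Return slope m and intercept b that minimize the sum of squared errors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = sum((a - mean_x) ** 2 for a in x)
    m = numerator / denominator
    b = mean_y - m * mean_x
    return m, b

def sum_squared_errors(x, y, m, b):
    """Sum of squared differences between actual y and the line y = mx + b."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))

hours = [1, 2, 3, 4, 5]
grades = [52, 58, 61, 70, 74]

m, b = least_squares_line(hours, grades)
best_sse = sum_squared_errors(hours, grades, m, b)
print(m, b, best_sse)
```

Any other slope or intercept gives a larger sum of squared errors on the same data.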

Question 28

Q

This returns (among other things) the coefficients you need for the slope equation: slope (m) and intercept (b) based on a given pair of variable samples you want to compare.

Answer

A

linregress method

Question 29

Q

It is the process of taking a set of sample data that includes one or more features (in this case, the number of hours studied) and a known label value (in this case, the grade achieved) and using that data to derive a function that calculates predicted label values for any given set of features.

Answer

A

Machine Learning

Question 30

Q

It works by establishing a relationship between variables in the data that represents characteristics known as the features of the thing being observed and the variable that we’re trying to predict known as the label.

Answer

A

Regression

Question 31

Q

It is an example of a supervised machine learning technique in which you train a model to predict a numeric label based on an item’s features.

Answer

A

Regression

Question 32

Q

It is the difference between a predicted label value and the actual label value as a measure of error.

Answer

A

Residuals

Question 33

Q

What are the kinds of Linear Regression algorithms?

Answer

A

Least Squares
Lasso
Ridge

Question 34

Q

What are the kinds of Regression algorithms?

Answer

A

Linear Regression
Tree-based
Ensemble

Question 35

Q

These are algorithms that build a decision tree to reach a prediction.

Answer

A

Tree-based Regression Algorithm

Question 36

Q

These are algorithms that combine the outputs of multiple base algorithms to improve generalizability.

Answer

A

Ensemble Algorithm

Question 37

Q

This algorithm works by combining multiple base estimators to produce an optimal model, either by applying an aggregate function to a collection of base models (sometimes referred to as bagging) or by building a sequence of models that build on one another to improve predictive performance (referred to as boosting).

Answer

A

Ensemble Algorithm

Question 38

Q

This algorithm works by using a tree-based approach in which the features in the dataset are examined in a series of evaluations, each of which results in a branch in a decision tree based on the feature value. At the end of each series of branches are leaf nodes with the predicted label value based on the feature values.

Answer

A

Decision Tree Algorithm

Question 39

Q

This optimization method works by applying an aggregate function to a collection of base models.

Answer

A

Bagging

Question 40

Q

This optimization method works by building a sequence of models that build on one another to improve predictive performance.

Answer

A

Boosting

Question 41

Q

This boosting algorithm is similar to a Random Forest algorithm which builds multiple trees; but instead of building them all independently and taking the average result, each tree is built on the outputs of the previous one in an attempt to incrementally reduce the loss (error) in the model.

Answer

A

Gradient Boosting
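
A toy sketch of the boosting idea for squared error, in plain Python: each new one-split tree (a "stump") is fitted to the residuals left by the trees before it. This is a simplified illustration with made-up data, an assumed learning rate of 0.5, and hypothetical helper names, not the scikit-learn implementation.

```python
def fit_stump(x, y):
    """Fit a one-split regression tree (a "stump") on a single feature."""
    best = None
    for t in sorted(set(x))[:-1]:  # candidate thresholds at distinct x values
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - l_mean) ** 2 for v in left)
               + sum((v - r_mean) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, t, l_mean, r_mean)
    _, t, l_mean, r_mean = best
    return lambda v: l_mean if v <= t else r_mean

def mse(y, pred):
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 7, 9, 12, 20, 22, 25, 30]

learning_rate = 0.5                  # assumed value for this sketch
pred = [sum(y) / len(y)] * len(y)    # start from the mean of the labels
history = [mse(y, pred)]
for _ in range(5):
    residuals = [yi - pi for yi, pi in zip(y, pred)]  # what is still unexplained
    stump = fit_stump(x, residuals)                   # next tree fits the residuals
    pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    history.append(mse(y, pred))

print(history)  # training error after each boosting round
```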

Question 42

Q

This bagging algorithm applies an averaging function to multiple Decision Tree models for a better overall model.

Answer

A

Random Forest
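
A toy sketch of bagging in plain Python: several one-split trees are each trained on a bootstrap resample of made-up data, and their predictions are averaged. This is a simplification with hypothetical helper names; a full Random Forest uses deeper trees and typically also randomizes which features each split considers.

```python
import random

def fit_stump(x, y):
    """Fit a one-split regression tree (a "stump") on a single feature."""
    overall = sum(y) / len(y)
    best = (float("inf"), None, overall, overall)
    for t in sorted(set(x))[:-1]:  # candidate thresholds at distinct x values
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - l_mean) ** 2 for v in left)
               + sum((v - r_mean) ** 2 for v in right))
        if sse < best[0]:
            best = (sse, t, l_mean, r_mean)
    _, t, l_mean, r_mean = best
    if t is None:  # resample had a single distinct x value: predict the mean
        return lambda v: overall
    return lambda v: l_mean if v <= t else r_mean

random.seed(0)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 7, 9, 12, 20, 22, 25, 30]

models = []
for _ in range(25):
    idx = [random.randrange(len(x)) for _ in x]  # bootstrap: sample with replacement
    models.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))

def bagged_predict(v):
    """Average the predictions of all base models."""
    return sum(m(v) for m in models) / len(models)

print(bagged_predict(2), bagged_predict(7))
```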

Question 43

Q

This refers to changes you make to your data before it’s passed to the model.

Answer

A

Preprocessing

Question 44

Q

This refers to the values that you specify to affect the behavior of a training algorithm.

Answer

A

Hyperparameters

Question 45

Q

This algorithm is an ensemble that combines multiple decision trees to create an overall predictive model.

Answer

A

Gradient Boosting Regressor

Question 46

Q

Provide preprocessing transformations to get your data ready for modeling.

Answer

A

Scaling numeric features
Encoding categorical variables

Question 47

Q

Provide techniques to encode categorical variables.

Answer

A

Ordinal encoding
One-hot encoding
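
Both techniques can be sketched in plain Python as below. The sizes list and its category order are made up. scikit-learn provides OrdinalEncoder and OneHotEncoder classes for real pipelines.

```python
# Made-up categorical feature
sizes = ["small", "large", "medium", "small"]

# Ordinal encoding: map each category to an integer that reflects its order
order = {"small": 0, "medium": 1, "large": 2}
ordinal = [order[s] for s in sizes]

# One-hot encoding: one 0/1 column per category (columns sorted alphabetically)
categories = sorted(set(sizes))
one_hot = [[1 if s == c else 0 for c in categories] for s in sizes]

print(ordinal)
print(categories)
print(one_hot)
```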

Question 48

Q

How do you save a model?

Answer

A

joblib.dump(model, filename)

Question 49

Q

How do you load a model?

Answer

A

model = joblib.load(filename)

Question 50

Q

How do you use a model to generate predictions?

Answer

A

model.predict

Question 51

Q

You have created a model object using the scikit-learn LinearRegression class. What should you do to train the model?

Answer

A

Call the fit() method of the model object, specifying the training feature and label arrays
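
The train, save, load, and predict steps from the cards above can be sketched together as follows, assuming scikit-learn, joblib, and NumPy are installed. The data, the file name model.pkl, and the temporary directory are made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: one feature (hours studied) and a numeric label (grade)
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([52.0, 58.0, 61.0, 70.0, 74.0])

model = LinearRegression()
model.fit(X_train, y_train)  # train on the feature and label arrays

# Save the trained model to a file, then load it back
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(model, path)
loaded = joblib.load(path)

# Predict the label for a new, unseen feature value
prediction = loaded.predict(np.array([[6.0]]))
print(prediction)
```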

Question 52

Q

It is a measure of how much of the variance the model can explain.

Answer

A

R squared metric

Question 53

Q

It works by establishing a relationship between variables in the data that represent characteristics—known as the features—of the thing being observed, and the variable we’re trying to predict—known as the label.

Answer

A

Regression

Question 54

Q

This metric for measuring loss in a regression squares the individual residuals, sums the squares, and calculates the mean. Squaring the residuals has the effect of basing the calculation on absolute values (ignoring whether the difference is negative or positive) and giving more weight to larger differences.

Answer

A

Mean Squared Error or MSE

Question 55

Q

This metric for measuring loss in a regression is calculated by getting the square root of the MSE. This is to express the loss in the same unit of measurement as the predicted label value itself.

Answer

A

Root Mean Squared Error or RMSE
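
A worked sketch of MSE and RMSE in plain Python. The actual and predicted values are made up.

```python
import math

actual = [50.0, 60.0, 70.0, 80.0]
predicted = [52.0, 57.0, 71.0, 86.0]

residuals = [a - p for a, p in zip(actual, predicted)]  # actual minus predicted
mse = sum(r ** 2 for r in residuals) / len(residuals)   # mean of squared residuals
rmse = math.sqrt(mse)                                   # same units as the label

print(mse, rmse)
```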

Question 56

Q

This metric for measuring loss in a regression is also known as the coefficient of determination.

Answer

A

R squared

Question 57

Q

This metric for measuring loss in a regression is the square of the correlation between x and y. This produces a value between 0 and 1 that measures the amount of variance that can be explained by the model. Generally, the closer this value is to 1, the better the model predicts.

Answer

A

R squared
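
A sketch of the general formula, R squared = 1 - (residual sum of squares / total sum of squares), in plain Python with made-up values. For a simple least squares line this equals the squared correlation described above. For other models, or on held-out data, the value can fall below 0.

```python
actual = [50.0, 60.0, 70.0, 80.0]
predicted = [52.0, 57.0, 71.0, 86.0]

mean_actual = sum(actual) / len(actual)

# Variation the model fails to explain
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
# Total variation of the labels around their mean
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r_squared = 1 - ss_res / ss_tot
print(r_squared)
```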

Question 58

Q

It is a container that holds related resources for an Azure solution

Answer

A

Resource Group

Question 59

Q

What are the 4 kinds of compute resource?

Answer

A

Compute instances
Compute clusters
Inference clusters
Attached compute

Question 60

Q

These are development workstations that data scientists can use to work with data and models.

Answer

A

Compute instances

Question 61

Q

These are scalable clusters of virtual machines for on-demand processing of experiment code.

Answer

A

Compute clusters

Question 62

Q

These are deployment targets for predictive services that use your trained models.

Answer

A

Inference clusters

Question 63

Q

These are links to Azure compute resources, such as Virtual Machines or Azure Databricks clusters.

Answer

A

Attached compute

Question 64

Q

It is the variance between predicted and true values that cannot be explained by the model.

Answer

A

Residuals

Answer 55

A

Regression

Answer 56

A

Microsoft Azure Machine Learning

Answer 57

A

Mean Absolute Error (MAE)

Answer 58

A

Root Mean Squared Error (RMSE)

Answer 59

A

Relative Squared Error (RSE)

Answer 60

A

Relative Absolute Error (RAE)

Answer 61

A

Coefficient of Determination