Google Advanced Data Analytics Flashcards

1
Q

A way to compare two versions of something to find out which version performs better

A

A/B testing

2
Q

(Refer to observed values)

A

Absolute values

3
Q

Refers to the proportion of data points that were correctly categorized

A

Accuracy

4
Q

A Tableau tool to help an audience interact with a visualization or dashboard by allowing control of selection

A

Action

5
Q

Refers to allowing team members, bosses, and other collaborative stakeholders to share their own points of view before offering responses

A

Active listening

6
Q

(Refer to adaptive boosting)

A

AdaBoost

7
Q

A boosting methodology where each consecutive base learner assigns greater weight to the observations incorrectly predicted by the preceding learner

A

Adaptive boosting

8
Q

The concept that if the events A and B are mutually exclusive, then the probability of A or B happening is the sum of the probabilities of A and B

A

Addition rule (for mutually exclusive events):
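
Worked example (a minimal sketch, not part of the original card; the die-roll events and variable names are illustrative assumptions). Rolling a 1 and rolling a 2 on one fair die cannot happen together, so their probabilities simply add:

```python
from fractions import Fraction

# Two mutually exclusive events on a single roll of a fair six-sided die.
p_roll_one = Fraction(1, 6)
p_roll_two = Fraction(1, 6)

# Addition rule for mutually exclusive events: P(A or B) = P(A) + P(B)
p_one_or_two = p_roll_one + p_roll_two

print(p_one_or_two)  # 1/3
```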

9
Q

A variation of R² that accounts for having multiple independent variables present in a linear regression model

A

Adjusted R²
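
Worked example (a minimal sketch; the helper name `adjusted_r2` and the sample numbers are assumptions for illustration). It uses the common formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables:

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Penalize R^2 for the number of independent variables in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


plain_r2 = 0.80
adj_two_predictors = adjusted_r2(plain_r2, n_obs=50, n_predictors=2)
adj_ten_predictors = adjusted_r2(plain_r2, n_obs=50, n_predictors=10)

# With the same R^2, the model with more predictors gets the lower adjusted value.
print(round(adj_two_predictors, 4), round(adj_ten_predictors, 4))
```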

10
Q

The metric used to calculate the distance between points/clusters

A

Affinity

11
Q

A pandas groupby method that allows the user to apply multiple calculations to groups of data

A

agg():
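
Worked example (a minimal sketch, assuming pandas is installed; the `sales` table and its column names are made up for illustration). Passing a list to `agg()` applies several calculations to each group at once:

```python
import pandas as pd

# Hypothetical sales data used only for this illustration.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "revenue": [100, 150, 200, 60],
})

# One row per region, one column per requested calculation.
summary = sales.groupby("region")["revenue"].agg(["sum", "mean"])
print(summary)
```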

12
Q

A clustering methodology that works by first assigning every point to its own cluster, then progressively combining clusters based on intercluster distance

A

Agglomerative clustering

13
Q

Data from a significant number of users from which personal information has been removed

A

Aggregate information

14
Q

A set of instructions for solving a problem or accomplishing a task

A

Algorithm

15
Q

A process that allows the user to assign an alternate name—or alias—to something

A

Aliasing

16
Q

A group of statistical techniques that test the difference of means between three or more groups

A

Analysis of Variance (ANOVA)

17
Q

A data professional who supervises analytical strategy for an organization, often managing multiple groups

A

Analytics Team Manager

18
Q

Stage of the PACE workflow where the necessary data is acquired from primary and secondary sources and then cleaned, reorganized, and analyzed

A

Analyze stage

19
Q

A statistical technique that tests the difference of means between three or more groups while controlling for the effects of covariates, or variables that are not of direct interest to the test

A

ANCOVA (Analysis of Covariance)

20
Q

A method that adds an element to the end of a list

A

append():

21
Q

An ordered collection of items of a single data type

A

Array:

22
Q

A function for converting input to an array

A

array()

23
Q

Information given to a function in its parentheses

A

Argument

24
Q

Refers to computer systems able to perform tasks that normally require human intelligence

A

Artificial intelligence (AI)

25
Q

The process of storing a value in a variable

A

Assignment

26
Q

A value associated with an object or class which is referenced by name using dot notation

A

Attribute

27
Q

The distance between each cluster’s centroid and other clusters’ centroids

A

Average

28
Q

A stepwise variable selection process that begins with the full model, with all possible independent variables, and removes the independent variable that adds the least explanatory power to the model

A

Backward elimination

29
Q

A technique used by certain kinds of models that use ensembles of base learners to make predictions; refers to the combination of bootstrapping and aggregating

A

Bagging

30
Q

Each individual model that comprises an ensemble

A

Base learner

31
Q

(Refer to Bayes’ theorem)

A

Bayes’ rule

32
Q

An equation that can be used to calculate the probability of an outcome or class, given the values of predictor variables

A

Bayes’ theorem
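
Worked example (a minimal sketch; the spam-filter scenario and every probability in it are invented for illustration). It applies P(A|B) = P(B|A) · P(A) / P(B) to estimate the chance that a message is spam given that it contains a certain word:

```python
# Assumed inputs (hypothetical numbers, not from the original card).
p_spam = 0.20                 # prior: share of all messages that are spam
p_word_given_spam = 0.50      # the word appears in half of spam messages
p_word_given_not_spam = 0.05  # the word appears in 5% of other messages

# Total probability of seeing the word in any message.
p_word = p_word_given_spam * p_spam + p_word_given_not_spam * (1 - p_spam)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(round(p_spam_given_word, 3))  # 0.714
```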

33
Q

(Refer to Bayesian statistics)

A

Bayesian inference

34
Q

A powerful method for analyzing and interpreting data in modern data analytics; also referred to as Bayesian inference

A

Bayesian statistics

35
Q

The line that fits the data best by minimizing some loss function or error

A

Best fit line

36
Q

In data structuring, refers to organizing data results in groupings, categories, or variables that are misrepresentative of the whole dataset

A

Bias

37
Q

Balance between two model qualities, bias and variance, to minimize overall error for unobserved data

A

Bias-variance trade-off

38
Q

A segment of data that groups values into categories

A

Bin

39
Q

Grouping continuous values into a smaller number of categories, or intervals

A

Binning

40
Q

A discrete distribution that models the probability of events with only two possible outcomes: success or failure

A

Binomial distribution

41
Q

A technique that models the probability of an observation falling into one of two categories, based on one or more independent variables

A

Binomial logistic regression

42
Q

An assumption stating that there should be a linear relationship between each X variable and the logit of the probability that Y equals one

A

Binomial logistic regression linearity assumption

43
Q

Any model whose predictions cannot be precisely explained

A

Black-box model

44
Q

A data type that has only two possible values, usually true or false

A

Boolean

45
Q

A data type that has only two possible values, usually true or false

A

Boolean data

46
Q

A filtering technique that overlays a Boolean grid onto a dataframe in order to select only the values in the dataframe that align with the True values of the grid

A

Boolean masking
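
Worked example (a minimal sketch, assuming pandas is installed; the dataframe and the threshold of 80 are made up for illustration). The comparison produces the Boolean "grid", and indexing with it keeps only the rows that line up with True:

```python
import pandas as pd

# Hypothetical scores used only for this illustration.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [55, 82, 91]})

# The mask is a Boolean Series aligned with the rows of df.
mask = df["score"] > 80

# Rows where the mask is True are kept; the rest are filtered out.
passed = df[mask]
print(passed)
```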

47
Q

A technique that builds an ensemble of weak learners sequentially, with each consecutive learner trying to correct the errors of the one that preceded it

A

Boosting

48
Q

Refers to sampling with replacement

A

Bootstrapping
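
Worked example (a minimal sketch using only the standard library; the sample values, the seed, and the choice of 1,000 resamples are arbitrary assumptions). Each resample draws as many values as the original sample, with replacement, so some values repeat and others are left out:

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = [12, 15, 9, 20, 14, 17]

# rng.choices samples WITH replacement; k matches the original sample size.
boot_means = [
    mean(rng.choices(sample, k=len(sample)))
    for _ in range(1000)
]

# The spread of these means approximates the sampling variability of the mean.
print(min(boot_means), max(boot_means))
```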

49
Q

A data visualization that depicts the locality, spread, and skew of groups of values within quartiles

A

Box plot

50
Q

The ability of a program to alter its execution sequence

A

Branching

51
Q

A keyword that lets a user escape a loop without triggering any else statement that follows it in the loop

A

break
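
Worked example (a minimal sketch; the function name `first_negative` is a hypothetical helper). In Python, a loop's `else` block runs only when the loop finishes without hitting `break`:

```python
def first_negative(values):
    """Return the first negative number in values, or None if there is none."""
    for value in values:
        if value < 0:
            found = value
            break  # leaves the loop, so the else block below is skipped
    else:
        found = None  # runs only when the loop ended without break
    return found


print(first_negative([3, -1, -5]))  # -1
print(first_negative([1, 2]))       # None
```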

52
Q

(Refer to Business Intelligence Engineer)

A

Business Intelligence Analyst:

53
Q

A data professional who uses their knowledge of business trends and databases to organize information and make it accessible; also referred to as a Business Intelligence Analyst

A

Business Intelligence Engineer

54
Q

Data that is divided into a limited number of qualitative groups

A

Categorical data:

55
Q

Variables that contain a finite number of groups or categories

A

Categorical variables:

56
Q

Describes a cause-and-effect relationship where one variable directly causes the other to change in a particular way

A

Causation:

57
Q

The modular code input and output fields into which Jupyter Notebooks are partitioned

A

Cells:

58
Q

The idea that the sampling distribution of the mean approaches a normal distribution as the sample size increases

A

Central Limit Theorem:

59
Q

The center of a cluster determined by the mathematical mean of all the points in that cluster

A

Centroid:

60
Q

A hypothesis test that determines whether an observed categorical variable follows an expected distribution

A

Chi-squared (χ²) Goodness of Fit Test:

61
Q

A hypothesis test that determines whether or not two categorical variables are associated with each other

A

Chi-squared (χ²) Test for Independence:

62
Q

An executive-level data professional who is responsible for the consistency, accuracy, relevancy, interpretability, and reliability of the data a team provides

A

Chief Data Officer:

63
Q

A node that is pointed to from another node

A

Child node:

64
Q

When a dataset has a target (outcome) variable that contains more instances of one outcome than another

A

Class imbalance:

65
Q

An object’s data type that bundles data and functionality together

A

Class:

66
Q

A type of probability based on formal reasoning about events with equally likely outcomes

A

Classical probability:

67
Q

The process of removing errors that might distort your data or make it less useful; one of the six practices of EDA

A

Cleaning:

68
Q

A probability sampling method that divides a population into clusters, randomly selects certain clusters, and includes all members from the chosen clusters in the sample

A

Cluster random sample:

69
Q

A technique used by recommendation systems to make comparisons based on who else liked the content

A

Collaborative filtering:

70
Q

A group of abnormal points, following similar patterns and isolated from the rest of the population

A

Collective outliers:

71
Q

An operator that compares two values and produces Boolean values (True/False)

A

Comparator:

72
Q

In statistics, refers to an event not occurring

A

Complement of an event:

73
Q

The maximum pairwise distance between clusters

A

Complete:

74
Q

A concept stating that the probability that event A does not occur is one minus the probability of A

A

Complement rule:

75
Q

Refers to defining attributes and methods at the instance level to have a more differentiated relationship between objects in the same class

A

Composition:

76
Q

The process of giving instructions to a computer to perform an action or set of actions

A

Computer programming:

77
Q

A pandas function that combines data either by adding it horizontally as new columns for existing rows or vertically as new rows for existing columns

A

concat():
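
Worked example (a minimal sketch, assuming pandas is installed; the small tables and their column names are invented for illustration). `axis=0` stacks new rows under existing columns, while `axis=1` places new columns beside existing rows:

```python
import pandas as pd

# Hypothetical quarterly tables with identical columns.
q1 = pd.DataFrame({"id": [1, 2], "sales": [10, 20]})
q2 = pd.DataFrame({"id": [3, 4], "sales": [30, 40]})

# Vertical: add q2's rows below q1's rows and rebuild the row index.
stacked = pd.concat([q1, q2], axis=0, ignore_index=True)

# Horizontal: add a new column to q1's existing rows (matched on the index).
region = pd.DataFrame({"region": ["east", "west"]})
widened = pd.concat([q1, region], axis=1)

print(stacked)
print(widened)
```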

78
Q

To link or join together

A

Concatenate:

79
Q

Refers to building longer strings out of smaller strings

A

Concatenation:

80
Q

Refers to the probability of an event occurring given that another event has already occurred

A

Conditional probability:

80
Q

A section of code that directs the execution of programs

A

Conditional statement:

81
Q

The area surrounding a line that describes the uncertainty around the predicted outcome at every value of X

A

Confidence band:

82
Q

A range of values that describes the uncertainty surrounding an estimate

A

Confidence interval:

83
Q

A measure that expresses the uncertainty of the estimation process

A

Confidence level:

84
Q

A graphical representation of how accurate a classifier is at predicting the labels for a categorical variable

A

Confusion matrix:

85
Q

Stage of the PACE workflow where data models and machine learning algorithms are built, interpreted, and revised to uncover relationships within the data and help unlock insights from those relationships

A

Construct stage:

86
Q

A special method to add values to an instance in object creation

A

Constructor:

87
Q

A technique used by recommendation systems to make comparisons based on attributes of content

A

Content-based filtering:

88
Q

Data points that are normal under certain conditions but become anomalies under most other conditions

A

Contextual outliers:

89
Q

A variable that takes all the possible values in some range of numbers

A

Continuous random variable:

90
Q

A mathematical concept indicating that a measure or dimension has an infinite and uncountable number of outcomes

A

Continuous:

91
Q

Variables that can take on an infinite and uncountable set of values

A

Continuous variables:

92
Q

A non-probability sampling method that involves choosing members of a population that are easy to contact or reach

A

Convenience sample:

93
Q

Measures the way two variables tend to change together

A

Correlation:

94
Q

A process that uses different portions of the data to test and train a model on different iterations

A

Cross-validation:

95
Q

A plaintext file that uses commas to separate distinct values from one another; stands for “comma-separated values”

A

CSV file:

96
Q

The business term that describes how many and at what rate customers stop using a product or service, or stop doing business with a company

A

Customer churn:

97
Q

The process of formatting data and removing unwanted material

A

Data cleaning:

97
Q

The process of protecting people’s private or sensitive data by eliminating PII

A

Data anonymization:

98
Q

A data professional who makes data accessible, ensures data ecosystems offer reliable results, and manages infrastructure for data across enterprises

A

Data engineer:

99
Q

Well-founded standards of right and wrong that dictate how data is collected, shared, and used

A

Data ethics:

100
Q

A process for ensuring the formal management of a company’s data assets

A

Data governance:

101
Q

Any individual who works with data and/or has data skills

A

Data professional:

102
Q

The discipline of making data useful

A

Data science:

103
Q

A data professional who works closely with analytics to provide meaningful insights that help improve current business operations

A

Data scientist:

104
Q

The location where data originates

A

Data source:

105
Q

The practices of an organization that ensure that data is accessible, usable, and safe

A

Data stewardship:

106
Q

An attribute that describes a piece of data based on its values, its programming language, or the operations it can perform

A

Data type:

107
Q

A collection of data values or objects that contain different data types

A

Data structure:

108
Q

A two-dimensional, labeled data structure with rows and columns

A

DataFrame:

109
Q

A graph, chart, diagram, or dashboard that is created as a representation of information

A

Data visualization:

110
Q

A file type used to store data, often in tables, indexes, or fields

A

Database (DB) file:

111
Q

A two-dimensional data-structure organized into rows and columns

A

Dataframe:

112
Q

A clustering methodology that searches data space for continuous regions of high density; stands for “density-based spatial clustering of applications with noise”

A

DBSCAN:

113
Q

Troubleshooting, or searching for errors in a script or program

A

Debugging:

114
Q

A node of the tree where decisions are made

A

Decision node:

115
Q

A flowchart-like structure that uses branching paths to predict the outcomes of events, or the probability of certain outcomes

A

Decision tree:

116
Q

The elimination or removal of matching data values in a dataset

A

Deduplication:

117
Q

A keyword that defines a function at the start of the function block

A

def:

118
Q

The concept that two events are dependent if one event changes the probability of the other event

A

Dependent events:

119
Q

The variable a given model estimates

A

Dependent variable (Y):

120
Q

A type of statistics that summarizes the main features of a dataset

A

Descriptive statistics:

121
Q

A function that returns the statistical summary of a dataframe or series, including mean, standard deviation, and minimum and maximum column values

A

describe()

122
Q

A function used to create a dictionary

A

dict():

123
Q

A data structure that consists of a collection of key-value pairs

A

Dictionary:

124
Q

A function that finds the elements present in one set but not the other

A

difference():
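
Worked example (a minimal sketch; the two sets of IDs are invented for illustration). The result depends on which set the method is called on:

```python
# Hypothetical ID sets used only for this illustration.
train_ids = {1, 2, 3, 4}
test_ids = {3, 4, 5}

# Elements in train_ids that are not in test_ids.
only_in_train = train_ids.difference(test_ids)

# Reversing the call gives the elements unique to test_ids.
only_in_test = test_ids.difference(train_ids)

print(only_in_train, only_in_test)  # {1, 2} {5}
```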

125
Q

Qualitative data values used to categorize and group data to reveal details about it

A

Dimensions:

126
Q

The process data professionals use to familiarize themselves with the data so they can start conceptualizing how to use it; one of the six practices of EDA

A

Discovering:

127
Q

Features with a countable number of values between any two values

A

Discrete features:

128
Q

A variable that has a countable number of possible values

A

Discrete random variable:

129
Q

A mathematical concept indicating that a measure or dimension has a finite and countable number of outcomes

A

Discrete:

130
Q

A hyperparameter in agglomerative clustering models that determines the distance above which clusters will not be merged

A

distance_threshold:

131
Q

An in-depth guide that is written by the developers who created a package that features very specific information on various functions and features

A

Documentation:

132
Q

How to access the methods and attributes that belong to an instance of a class

A

Dot notation:

133
Q

A group of text that explains what a method or function does; also referred to as a “docstring”

A

Documentation string:

134
Q

The process of removing some observations from the majority class, making it so they make up a smaller percentage of the dataset than before

A

Downsampling:

135
Q

Variables with values of 0 or 1 that indicate the presence or absence of something

A

Dummy variables:
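
Worked example (a minimal sketch, assuming pandas is installed; the `color` series is invented, and `pd.get_dummies` is one common way to build dummy variables, not necessarily the one the course uses). Each category becomes its own 0/1 column:

```python
import pandas as pd

# Hypothetical categorical data used only for this illustration.
colors = pd.Series(["red", "blue", "red"], name="color")

# One column per category; 1 marks presence, 0 marks absence.
dummies = pd.get_dummies(colors, dtype=int)
print(dummies)
```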

136
Q

A NumPy attribute used to check the data type of the contents of an array

A

dtype:

137
Q

Variables that can point to objects of any data type

A

Dynamic typing:

138
Q

A value the user inputs or the output of a program, an operation, or a function

A

Dynamic value:

139
Q

A branch of economics that uses statistics to analyze economic problems

A

Econometrics:

140
Q

A way of distributing computational tasks over a bunch of nearby processors (i.e., computers) that is good for speed and resiliency and does not depend on a single source of computational power

A

Edge computing:

141
Q

A reserved keyword that executes subsequent conditions when the previous conditions are not true

A

elif:

142
Q

A reserved keyword that executes when preceding conditions evaluate as False

A

else:

143
Q

A type of probability based on experimental or historical data

A

Empirical probability:

144
Q

A concept stating that the values on a normal curve are distributed in a regular pattern, based on their distance from the mean

A

Empirical rule:
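
Worked example (a minimal sketch using the standard library's `statistics.NormalDist`; the variable names are illustrative). It computes the share of a normal distribution that lies within 1, 2, and 3 standard deviations of the mean, which is roughly 68%, 95%, and 99.7%:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of values within k standard deviations of the mean.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in within.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```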

145
Q

Refers to building multiple models and aggregating their predictions

A

Ensemble learning:

146
Q

In DBSCAN clustering models, a hyperparameter that determines the radius of a search area from any given point

A

eps (Epsilon):

147
Q

(Refer to ensemble learning)

A

Ensembling:

148
Q

A built-in function that iterates through a sequence and tracks each element and its place in the index

A

enumerate()
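
Worked example (a minimal sketch; the list of stage names is used only as sample data). Python's built-in `enumerate()` pairs each element with a running index, and the optional `start` argument sets where counting begins:

```python
stages = ["plan", "analyze", "construct", "execute"]

# Each iteration yields (index, element); start=1 makes the count 1-based.
numbered = [(position, stage) for position, stage in enumerate(stages, start=1)]

for position, stage in numbered:
    print(position, stage)
```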

149
Q

In a regression model, the natural noise assumed to be in a model

A

Errors:

150
Q

A character that changes the typical behavior of the characters that follow it

A

Escape character:

151
Q

Stage of the PACE workflow where a data professional will present findings with internal and external stakeholders, answer questions, consider different viewpoints, and make recommendations

A

Execute stage:

152
Q

(Refer to independent variable)

A

Explanatory variable:

153
Q

The process of converting a data type of an object to a required data type

A

Explicit conversion:

154
Q

The process of investigating, organizing, and analyzing datasets and summarizing their main characteristics, often by employing data wrangling and visualization methods; the six main practices of EDA are: discovering, structuring, cleaning, joining, validating, and presenting

A

Exploratory data analysis (EDA):

155
Q

A combination of numbers, symbols, or other variables that produce a result when evaluated

A

Expression:

156
Q

Quantifies the amount of variance that is left unexplained by a reduced model but is explained by the full model

A

Extra Sum of Squares F-test:

157
Q

The process of retrieving data out of data sources for further data processing

A

Extracting:

158
Q

A model’s ability to predict new values that fall outside of the range of values in the training data

A

Extrapolation:

159
Q

The harmonic mean of precision and recall

A

F1-Score:
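
Worked example (a minimal sketch; the helper name `f1_score` and the counts are assumptions for illustration, and the function does not guard against zero denominators). It builds precision and recall from confusion-matrix counts and then takes their harmonic mean:

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
score = f1_score(true_pos=8, false_pos=2, false_neg=4)
print(round(score, 3))  # 0.727
```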

160
Q

A test result that indicates something is present when it really is not

A

False positive:

161
Q

The process of using practical, statistical, and data science knowledge to select, transform, or extract characteristics, properties, and attributes from raw data

A

Feature engineering:

162
Q

A type of feature engineering that involves taking multiple features to create a new one that would improve the accuracy of the algorithm

A

Feature extraction:

163
Q

A type of feature engineering that involves selecting the features in the data that contribute the most to predicting the response variable

A

Feature selection:

164
Q

A type of feature engineering that involves modifying existing features in a way that improves accuracy when training the model

A

Feature transformation:

165
Q

The process of selecting a smaller part of a dataset based on specified values and using it for viewing or analysis

A

Filtering:

166
Q

Data that was gathered from inside your own organization

A

First-party data:

167
Q

A data type that represents numbers that contain decimals

A

Float:

168
Q

A piece of code that iterates over a sequence of values

A

For loop:

169
Q

A string method that formats and inserts specific substrings into designated places within a larger string

A

format():
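
Worked example (a minimal sketch; the template text and values are invented for illustration). Named placeholders in braces mark where `str.format()` inserts each value, and a format spec such as `:.1f` controls how it is rendered:

```python
template = "{name} scored {score:.1f}%"

# Values are matched to placeholders by keyword; score is rounded to 1 decimal.
message = template.format(name="Model A", score=91.256)

print(message)  # Model A scored 91.3%
```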

170
Q

A stepwise variable selection process that begins with the null model (with zero independent variables) and considers all possible variables to add; incorporates the independent variable that contributes the most explanatory power to the model

A

Forward selection:

171
Q

A body of reusable code for performing specific processes or tasks

A

Function:

172
Q

A function that returns an object (iterator) which can be iterated over (one value at a time)

A

Generator
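
Worked example (a minimal sketch; `countdown` is a hypothetical function written for this illustration). A function that uses `yield` returns an iterator that produces one value each time it is asked for the next:

```python
def countdown(n):
    """Yield n, n-1, ..., 1, producing one value per request."""
    while n > 0:
        yield n
        n -= 1


gen = countdown(3)
first = next(gen)  # pulls a single value from the iterator
rest = list(gen)   # consumes whatever values remain

print(first, rest)  # 3 [2, 1]
```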

173
Q

Values that are completely different from the overall data group and have no association with any other outliers

A

Global outliers:

174
Q

A variable that can be accessed from anywhere in a program or script

A

Global variable:

175
Q

Model ensembles that use gradient boosting

A

Gradient boosting machines (GBMs):

176
Q

A boosting methodology where each base learner in the sequence is built to predict the residual errors of the model that preceded it

A

Gradient boosting:

177
Q

A tool to confirm that a model achieves its intended purpose by systematically checking every combination of hyperparameters to identify which set produces the best results, based on the selected metric

A

GridSearch:

178
Q

A pandas DataFrame method that groups rows of the dataframe together based on their values at one or more columns, which allows further analysis of the groups

A

groupby():

179
Q

The process of aggregating individual observations of a variable into groups

A

Grouping:

180
Q

An event where programmers and data professionals come together and work on a project

A

Hackathon:

181
Q

A function that returns a preview of the column names and the first few rows of a dataset

A

head():

182
Q

A type of data visualization that depicts the magnitude of an instance or set of values based on two colors

A

Heatmap:

183
Q

A Python help function used to display the documentation of modules, functions, classes, keywords, and more

A

help():

184
Q

A data visualization that depicts an approximate representation of the distribution of values in a dataset

A

Histogram:

185
Q

A random sample of observed data that is not used to fit the model

A

Hold-out sample:

186
Q

An assumption of simple linear regression stating that the variation of the residuals (errors) is constant or similar across the model

A

Homoscedasticity assumption:

187
Q

Parameters that can be set by the modeler before the model is trained

A

Hyperparameters:

188
Q

Refers to changing parameters that directly affect how the model trains, before the learning process begins

A

Hyperparameter tuning:

189
Q

A theory or an explanation, based on evidence, that is not yet proven true

A

Hypothesis:

190
Q

A statistical procedure that uses sample data to evaluate an assumption about a population parameter

A

Hypothesis testing:

191
Q

A reserved keyword that sets up a condition in Python

A

if:

192
Q

A type of notation in pandas that indicates when the user wants to select by integer-location-based position

A

iloc[]:

193
Q

The concept that a data structure or element’s values can never be altered or updated

A

Immutability:

194
Q

A data type in which the values can never be altered or updated

A

Immutable data type:

195
Q

The process Python uses to automatically convert one data type to another without user involvement

A

Implicit conversion:
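
A short sketch of what this looks like in practice; the variable names are illustrative only. Python widens numeric types on its own, but it does not implicitly convert between strings and numbers.

```python
total = 3 + 2.5      # int + float: the int is converted, result is a float
ratio = 7 / 2        # true division returns a float even for two ints
count = True + 1     # bool is a subclass of int, so True behaves like 1

# Mixing str and int is NOT converted implicitly; it raises TypeError.
try:
    "3" + 1
except TypeError:
    mixed_failed = True
else:
    mixed_failed = False

print(total, ratio, count, mixed_failed)  # 5.5 3.5 2 True
```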

196
Q

A statement that uses the import keyword to load an external library, package, module, or function into the computing environment

A

Import statement:

197
Q

The concept that two events are independent if the occurrence of one event does not change the probability of the other event

A

Independent events:

198
Q

An assumption of simple linear regression stating that each observation in the dataset is independent

A

Independent observation assumption:

199
Q

The variable whose trends are associated with the dependent variable

A

Independent variable (X):

200
Q

A string method that outputs the index number of a character in a string

A

index():

201
Q

A way to refer to the individual items within an iterable by their relative position

A

Indexing:

202
Q

The sum of the squared distances between each observation and its nearest centroid

A

Inertia:
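
The definition above can be sketched in plain Python. The points and centroids below are invented for the example, and the helper name inertia is hypothetical (it is not a library function).

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for point in points:
        squared_distances = [
            sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids
        ]
        total += min(squared_distances)  # nearest centroid only
    return total

# Two tight pairs of points, one centroid in the middle of each pair.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

result = inertia(points, centroids)
print(result)  # 4.0 (each point is 1 unit from its nearest centroid)
```

Lower inertia indicates that observations sit closer to their assigned centroids.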

203
Q

A type of statistics that uses sample data to draw conclusions about a larger population

A

Inferential statistics:

204
Q

Gives the total number of entries, along with the data types—called Dtypes in pandas—of the individual entries

A

info():

205
Q

Refers to letting a programmer build relationships between concepts and group them together to reduce code duplication

A

Inheritance:

206
Q

A way of combining data such that only the keys that are in both dataframes get included in the merge

A

Inner join:

207
Q

The practice of thoroughly analyzing and double-checking to make sure data is complete, error-free, and high-quality

A

Input validation:

208
Q

Information entered into a program

A

Input:

209
Q

A Python function that can be used to ask a question in a message and store the answer in a variable

A

input():

210
Q

A function that takes an index as the first parameter and an element as the second parameter, then inserts the element into a list at the given index

A

insert():

211
Q

A variable that is declared in a class outside of other methods or blocks

A

Instance variable:

212
Q

Refers to creating a copy of the class that inherits all class variables and methods

A

Instantiation:

213
Q

A standard integer data type, representing numbers somewhere between negative nine quintillion and positive nine quintillion

A

Int64:

214
Q

A data type used to represent whole numbers without fractions

A

Integer:

215
Q

A piece of software that has an interface to write, run, and test a piece of code

A

Integrated Development Environment (IDE):

216
Q

Represents how the relationship between two independent variables is associated with changes in the mean of the dependent variable

A

Interaction term:

217
Q

The y value of the point on the regression line where it intersects with the y-axis

A

Intercept (constant 𝐵0):

218
Q

Traits that focus on communicating and building relationships

A

Interpersonal skills:

219
Q

The distance between the first quartile (Q1) and the third quartile (Q3)

A

Interquartile range:
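
A small worked example using the standard library. The data are invented, and the choice of the "inclusive" quantile method is an assumption; other quantile conventions can give slightly different quartiles on the same data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
print(q1, q3, iqr)  # 3.0 7.0 4.0
```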

220
Q

A function that finds the elements that two sets have in common

A

intersection():

221
Q

A sample statistic plus or minus the margin of error

A

Interval:

222
Q

A calculation that uses a range of values to estimate a population parameter

A

Interval estimate:

223
Q

A rule that checks objects and classes for ancestry

A

Is:

224
Q

A dictionary method to retrieve both the dictionary’s keys and values

A

items():

225
Q

An object that’s looped, or iterated, over

A

Iterable:

226
Q

The repeated execution of a set of statements, where one iteration is the single execution of a block of code

A

Iteration:

227
Q

The process of augmenting data by adding values from other datasets; one of the six practices of EDA

A

Joining:

228
Q

A data storage file that is saved in JavaScript Object Notation (JSON) format

A

JSON file:

229
Q

An open-source web application for creating and sharing documents containing live code, mathematical formulas, visualizations, and text

A

Jupyter Notebook:

230
Q

An unsupervised partitioning algorithm used to organize unlabeled data into groups, or clusters

A

K-means:

231
Q

An underlying core program, like Python

A

Kernel:

232
Q

The shared points of reference between different dataframes

A

Keys:

233
Q

A dictionary method to retrieve only the dictionary’s keys

A

keys():

234
Q

A special word in a programming language that is reserved for a specific purpose and that can only be used for that purpose

A

Keyword:

235
Q

Data transformation technique where each category is assigned a unique number instead of a qualitative value

A

Label encoding:

236
Q

The nodes where a final prediction is made

A

Leaf node:

237
Q

In XGBoost, a hyperparameter that specifies how much weight is given to each consecutive tree’s prediction in the final ensemble

A

learning_rate:

238
Q

A way of combining data such that all of the keys in the left dataframe are included, even if they aren’t in the right dataframe

A

Left join:

239
Q

A function used to measure the length of strings

A

len():

240
Q

A reusable collection of code; also referred to as a “package”

A

Library:

241
Q

The probability of observing the actual data, given some set of beta parameters

A

Likelihood:

242
Q

A collection of an infinite number of points extending in two opposite directions

A

Line:

243
Q

A technique that estimates the linear relationship between a continuous dependent variable and one or more independent variables

A

Linear regression:

244
Q

An assumption of simple linear regression stating that each predictor variable (Xi) is linearly related to the outcome variable (Y)

A

Linearity assumption:

245
Q

A nonlinear function that connects or links the dependent variable to the independent variables mathematically

A

Link function:

246
Q

The method used to determine which points/clusters to merge

A

Linkage:

247
Q

A data structure that helps store and manipulate an ordered collection of items

A

List:

248
Q

Formulaic creation of a new list based on the values in an existing list

A

List comprehension:
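
As an illustration of the idea (the temperature values are made up), a list comprehension builds the new list in a single expression and can optionally filter items.

```python
temps_c = [0, 12.5, 30]

# Transform every item: Celsius to Fahrenheit.
temps_f = [c * 9 / 5 + 32 for c in temps_c]

# Keep only the items that satisfy a condition.
warm = [c for c in temps_c if c > 10]

print(temps_f)  # [32.0, 54.5, 86.0]
print(warm)     # [12.5, 30]
```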

249
Q

The percentage of the population in a given age group that can read and write

A

Literacy rate:

250
Q

Notation that is used to select pandas rows and columns by name

A

loc[]:

251
Q

(Refer to logit)

A

Log-Odds function:

252
Q

An operator that connects multiple statements together and performs complex comparisons

A

Logical operator:

253
Q

A technique that models a categorical dependent variable (Y) based on one or more independent variables (X)

A

Logistic regression:

254
Q

The logarithm of the odds of a given probability

A

Logit:
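
Written out, the definition is logit(p) = ln(p / (1 - p)). The helper below is a minimal sketch with a hypothetical function name, assuming p lies strictly between 0 and 1.

```python
import math

def logit(p):
    """Return the log-odds of a probability p, where 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1 - p))

even_odds = logit(0.5)   # odds of 1 to 1, so the log-odds are 0
likely = logit(0.8)      # odds of 4 to 1, so the log-odds are ln(4)

print(even_odds, round(likely, 3))  # 0.0 1.386
```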

255
Q

A block of code used to carry out iterations

A

Loop:

256
Q

A function that measures the distance between the observed values and the model’s estimated values

A

Loss function:

257
Q

When constructing an interval, the calculation of the sample mean minus the margin of error

A

Lower limit:

258
Q

The use and development of algorithms and statistical models to teach computer systems to analyze and discover patterns in data

A

Machine learning:

259
Q

The average of the absolute difference between the predicted and actual values

A

MAE (Mean Absolute Error):
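
A plain-Python sketch of the calculation; the actual and predicted values are invented, and the function name is hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired observations."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]

mae = mean_absolute_error(actual, predicted)
print(mae)  # 1.0, since the absolute errors are 1, 0, and 2
```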

260
Q

Commands that are built into IPython to simplify common tasks

A

Magic commands:

261
Q

(Refer to magic commands)

A

Magics:

262
Q

An extension of ANCOVA and MANOVA that compares how two or more continuous outcome variables vary according to categorical independent variables, while controlling for covariates

A

MANCOVA (Multivariate Analysis of Covariance):

263
Q

An extension of ANOVA that compares how two or more continuous outcome variables vary according to categorical independent variables

A

MANOVA (Multivariate Analysis of Variance):

264
Q

The maximum expected difference between a population parameter and a sample estimate

A

Margin of error:

265
Q

A markup language that lets the user write formatted text in a coding environment or plain-text editor

A

Markdown:

266
Q

A library for creating static, animated, and interactive visualizations in Python

A

matplotlib:

267
Q

In tree-based models, a hyperparameter that controls how deep each base learner tree will grow

A

max_depth:

268
Q

In decision tree and random forest models, a hyperparameter that specifies the number of features that each tree randomly selects during training; called “colsample_bytree” in XGBoost

A

max_features:

269
Q

A technique for estimating the beta parameters that maximizes the likelihood of the model producing the observed data

A

Maximum Likelihood Estimation (MLE):

270
Q

The average value in a dataset

A

Mean:

271
Q

A value that represents the center of a dataset

A

Measure of central tendency:

272
Q

A value that represents the spread of a dataset, or the amount of variation in data points

A

Measure of dispersion:

273
Q

A method by which the position of a value in relation to other values in a dataset is determined

A

Measure of position:

274
Q

Numeric values that can be aggregated or placed in calculations

A

Measures:

275
Q

The middle value in a dataset

A

Median:

276
Q

Someone who shares knowledge, skills, and experience to help another grow both professionally and personally

A

Mentor:

277
Q

A pandas function that joins two dataframes together; it only combines data by extending along axis one horizontally

A

merge():

278
Q

A method to combine two (or more) different dataframes along a specified starting column(s)

A

Merging:

279
Q

A function that belongs to a class and typically performs an action or operation

A

Method:

280
Q

In XGBoost models, a hyperparameter indicating that a tree will not split a node if it results in any child node with less weight than this value; called “min_samples_leaf” in decision tree and random forest models

A

min_child_weight:

281
Q

Methods and criteria used to evaluate data

A

Metrics:

282
Q

In decision tree and random forest models, a hyperparameter that defines the minimum number of samples for a leaf node; called “min_child_weight” in XGBoost

A

min_samples_leaf:

282
Q

In DBSCAN clustering models, a hyperparameter that specifies the number of samples in an ε-neighborhood for a point to be considered a core point (including itself)

A

min_samples:

283
Q

In decision tree and random forest models, a hyperparameter that defines the minimum number of samples that a node must have to split into more nodes

A

min_samples_split:

284
Q

A data value that is not stored for a variable in the observation of interest

A

Missing data:

285
Q

The most frequently occurring value in a dataset

A

Mode:

286
Q

Statements about the data that must be true in order to justify the use of a particular modeling technique

A

Model assumptions:

287
Q

The process of determining which model should be the final product and put into production

A

Model selection:

288
Q

The set of processes and activities intended to verify that models are performing as expected

A

Model validation:

289
Q

The ability to write code in separate components that work together and that can be reused for other programs

A

Modularity:

290
Q

A simple Python file containing a collection of functions and global variables

A

Module:

291
Q

An operator that returns the remainder when one number is divided by another

A

Modulo:

292
Q

A technique that estimates the relationship between one continuous dependent variable and two or more independent variables

A

Multiple linear regression:

292
Q

The average of the squared difference between the predicted and actual values

A

MSE (Mean Squared Error):

293
Q

(Refer to multiple linear regression)

A

Multiple regression:

294
Q

The concept that if the events A and B are independent, then the probability of both A and B happening is the probability of A multiplied by the probability of B

A

Multiplication rule (for independent events):
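
A quick check of the rule with a coin flip and a die roll, which are assumed here to be fair and independent. The product of the two probabilities is compared against a direct count over the joint sample space.

```python
from fractions import Fraction
from itertools import product

p_heads = Fraction(1, 2)   # fair coin lands heads
p_six = Fraction(1, 6)     # fair die shows a six

# Multiplication rule: P(A and B) = P(A) * P(B) for independent events.
p_both = p_heads * p_six

# Cross-check by enumerating all 12 equally likely (coin, die) outcomes.
outcomes = list(product(["H", "T"], range(1, 7)))
favorable = [o for o in outcomes if o == ("H", 6)]
p_enumerated = Fraction(len(favorable), len(outcomes))

print(p_both, p_enumerated)  # 1/12 1/12
```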

295
Q

The ability to change the internal state of a data structure

A

Mutability:

295
Q

The concept that two events are mutually exclusive if they cannot occur at the same time

A

Mutually exclusive:

296
Q

In K-means and agglomerative clustering models, a hyperparameter that specifies the number of clusters in the final model

A

n_clusters:

297
Q

The core data object of NumPy; also referred to as “ndarray”

A

N-dimensional array:

298
Q

In random forest and XGBoost models, a hyperparameter that specifies the number of trees your model will build in its ensemble

A

n_estimators:

299
Q

A supervised classification technique that is based on Bayes’s Theorem with an assumption of independence among predictors

A

Naive Bayes:

300
Q

Consistent guidelines that describe the content, creation date, and version of a file in its name

A

Naming conventions:

301
Q

Rules built into the syntax of a programming language

A

Naming restrictions:

302
Q

How null values are represented in pandas; stands for “not a number”

A

NaN:

303
Q

A NumPy attribute used to check the number of dimensions of an array

A

ndim:

304
Q

An inverse relationship between two variables, where when one variable increases, the other variable tends to decrease, and vice versa

A

Negative correlation:

305
Q

A loop inside of another loop

A

Nested loop:

306
Q

An assumption of multiple linear regression stating that no two independent variables (Xi and Xj) can be highly correlated with each other

A

No multicollinearity assumption:

306
Q

A group organized for purposes other than generating profit; often aims to further a social cause or provide a benefit to the public

A

Nonprofit:

307
Q

A special data type in Python used to indicate that things are empty or that they return nothing

A

None:

307
Q

The total number of data entries for a data column that are not blank

A

Non-null count:

307
Q

A sampling method that is based on convenience or the personal preferences of the researcher, rather than random selection

A

Non-probability sampling:

307
Q

A programming system that is based around objects which can contain both data and code that manipulates that data

A

Object-oriented programming:

307
Q

Refers to when certain groups of people are less likely to provide responses

A

Nonresponse bias:

307
Q

An assumption of simple linear regression stating that the residuals are normally distributed

A

Normality assumption:

307
Q

A type of probability based on statistics, experiments, and mathematical measurements

A

Objective probability:

307
Q

A continuous probability distribution that is symmetrical on both sides of the mean and bell-shaped

A

Normal distribution:

307
Q

An essential library that contains multidimensional array and matrix data structures and functions to manipulate them

A

NumPy:

307
Q

A component category, usually associated with its respective class

A

Object type:

307
Q

A collection of data that consists of variables and methods or functions

A

Object:

308
Q

The existing sample of data, where each data point in the sample is represented by an observed value of the dependent variable and an observed value of the independent variable

A

Observed values:

309
Q

A data transformation technique that turns one categorical variable into several binary variables

A

One hot encoding:
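
A dependency-free sketch of the transformation; the color values and the color_ column prefix are made up for the example. Each category becomes its own 0/1 column.

```python
colors = ["red", "green", "red", "blue"]

# One binary column per distinct category, in a stable (sorted) order.
categories = sorted(set(colors))

encoded = [
    {f"color_{category}": int(value == category) for category in categories}
    for value in colors
]

for row in encoded:
    print(row)
# First row: {'color_blue': 0, 'color_green': 0, 'color_red': 1}
```

Exactly one column is set to 1 in each row, which is where the name comes from.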

310
Q

(Refer to dependent variable)

A

Outcome variable (Y):

310
Q

A type of statistical testing that compares the means of one continuous dependent variable based on three or more groups of one categorical variable

A

One-Way ANOVA:

310
Q

Data that is available to the public and free to use, with guidance on how to navigate the datasets and acknowledge the source

A

Open data:

311
Q

A common way to calculate linear regression coefficients

A

Ordinary least squares estimation (OLS):

312
Q

Observations that are an abnormal distance from other values or an overall pattern in a data population

A

Outliers:

312
Q

A way of combining data such that all of the keys from both dataframes get included in the merge

A

Outer join:

313
Q

A message stating what to do next

A

Output:

314
Q

When a model fits the observed or training data too specifically and is unable to generate suitable estimates for the general population

A

Overfitting:

315
Q

The probability of observing results at least as extreme as those observed when the null hypothesis is true

A

P-value:

316
Q

A workflow data professionals can use to remain focused on the end goal of any given dataset; stands for plan, analyze, construct, and execute

A

PACE:

317
Q

A fundamental unit of shareable code that others have developed for a specific purpose

A

Package:

318
Q

A powerful library built on top of NumPy that’s used to manipulate and analyze tabular data

A

pandas:

319
Q

A characteristic of a population

A

Parameter:

320
Q

The value below which a percentage of data falls

A

Percentile:

321
Q

Information that permits the identity of an individual to be inferred by either direct or indirect means

A

Personally identifiable information (PII):

322
Q

Stage of the PACE workflow where the scope of a project is defined and the informational needs of the organization are identified

A

Plan stage:

323
Q

A calculation that uses a single value to estimate a population parameter

A

Point estimate:

324
Q

A probability distribution that models the probability that a certain number of events will occur during a specific time period

A

Poisson distribution:

325
Q

A method that extracts an element from a list by removing it at a given index

A

pop():
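
A brief illustration with a made-up list. pop() both removes the element and returns it; with no argument it takes the last element.

```python
letters = ["a", "b", "c", "d"]

removed = letters.pop(1)   # removes and returns the item at index 1
last = letters.pop()       # no index: removes and returns the final item

print(removed, last, letters)  # b d ['a', 'c']
```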

326
Q

The phenomenon of more popular items being recommended too frequently

A

Popularity bias:

326
Q

Every possible element that a data professional is interested in measuring

A

Population:

327
Q

The percentage of individuals or elements in a population that share a certain characteristic

A

Population proportion:

328
Q

A relationship between two variables that tend to increase or decrease together.

A

Positive correlation:

329
Q

An ANOVA test that performs a pairwise comparison between all available groups while controlling for the error rate

A

Post hoc test:

330
Q

The probability of an event occurring after taking into consideration new information

A

Posterior probability:

331
Q

The proportion of positive predictions that were correct to all positive predictions

A

Precision:

332
Q

The estimated Y values for each X calculated by a model

A

Predicted values:

333
Q

(Refer to independent variable)

A

Predictor variable:

334
Q

The process of making a cleaned dataset available to others for analysis or further modeling; one of the six practices of EDA

A

Presenting:

335
Q

Refers to the probability of an event before new data is collected

A

Prior probability:

336
Q

The branch of mathematics that deals with measuring and quantifying uncertainty

A

Probability:

337
Q

A function that describes the likelihood of the possible outcomes of a random event

A

Probability distribution:

338
Q

A sampling method that uses random selection to generate a sample

A

Probability sampling:

339
Q

A series of instructions written so that a computer can perform a certain task, independent of any other application

A

Program:

340
Q

The words and symbols used to write instructions for computers to follow

A

Programming languages:

341
Q

A method of non-probability sampling that involves researchers selecting participants based on the purpose of their study

A

Purposive sample:

341
Q

Measures the proportion of variation in the dependent variable, Y, explained by the independent variable(s), X

A

R² (The Coefficient of Determination):

341
Q

A general-purpose programming language

A

Python:

341
Q

A value that divides a dataset into four equal parts

A

Quartile:

342
Q

A visual that helps to define roles and responsibilities for individuals or teams to ensure work gets done efficiently; lists who is responsible, accountable, consulted, and informed for project tasks

A

RACI chart:

342
Q

A process whose outcome cannot be predicted with certainty

A

Random experiment:

343
Q

A starting point for generating random numbers

A

Random seed:

343
Q

An ensemble of decision trees trained on bootstrapped data with randomly selected features

A

Random forest:

344
Q

A Python function that returns a sequence of numbers starting from zero, increments by 1 by default, and stops before the given number

A

range():
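
A short example of the behavior described on this card, plus the optional start and step arguments that range() also accepts.

```python
# One argument: start at 0, step by 1, stop before 5.
first_five = list(range(5))

# Three arguments: start at 2, stop before 10, step by 3.
stepped = list(range(2, 10, 3))

print(first_five)  # [0, 1, 2, 3, 4]
print(stepped)     # [2, 5, 8]
```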

344
Q

A variable that represents the values for the possible outcomes of a random event

A

Random variable:

345
Q

The difference between the largest and smallest value in a dataset

A

Range:

346
Q

Unsupervised learning techniques that use unlabeled data to offer relevant suggestions to users

A

Recommendation systems:

346
Q

The proportion of actual positives that were identified correctly to all actual positives

A

Recall:
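
Recall and precision (defined earlier in this deck) can both be computed from counts of true positives, false positives, and false negatives. The labels below are invented for illustration, with 1 as the positive class.

```python
actual    = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # correct positive predictions / all positive predictions
recall = tp / (tp + fn)     # correctly identified positives / all actual positives

print(tp, fp, fn)                    # 2 1 2
print(round(precision, 3), recall)   # 0.667 0.5
```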

347
Q

The process of restructuring code while maintaining its original functionality

A

Refactoring:

348
Q

A group of statistical techniques that use existing data to estimate the relationships between a single dependent variable and one or more independent variables

A

Regression analysis:

349
Q

The estimated betas in a regression model

A

Regression coefficient:

350
Q

A set of regression techniques that shrinks regression coefficient estimates towards zero, adding in bias, to reduce variance

A

Regularization:

350
Q

(Refer to regression analysis)

A

Regression models:

351
Q

A method that removes an element from a list

A

remove():

352
Q

A sample that accurately reflects the characteristics of a population

A

Representative sample:

353
Q

A NumPy method used to change the shape of an array

A

reshape():

354
Q

The difference between observed or actual values and the predicted values of the regression line

A

Residual:

355
Q

A reserved keyword in Python that makes a function produce new results which are saved for later use

A

return:

355
Q

The capability to define code once and use it many times without having to rewrite it

A

Reusability:

356
Q

(Refer to dependent variable)

A

Response variable:

357
Q

A way of combining data such that all the keys in the right dataframe are included—even if they aren’t in the left dataframe

A

Right join:

358
Q

The first node of the tree, where the first decision is made

A

Root node:

359
Q

A segment of a population that is representative of the entire population

A

Sample:

360
Q

The set of all possible values for a random variable

A

Sample space:

361
Q

The number of individuals or items chosen for a study or experiment

A

Sample size:

362
Q

The process of selecting a subset of data from a population

A

Sampling:

363
Q

A probability distribution of a sample statistic

A

Sampling distribution:

363
Q

Refers to when a sample is not representative of the population as a whole

A

Sampling bias:

364
Q

A list of all the items in a target population

A

Sampling frame:

365
Q

Refers to how much an estimate varies between samples

A

Sampling variability:

366
Q

Refers to when a population element can be selected more than one time

A

Sampling with replacement:

367
Q

Refers to when a population element can be selected only one time

A

Sampling without replacement:

368
Q

A series of scatterplots that show the relationships between pairs of variables

A

Scatterplot matrix:

369
Q

A collection of commands in a file designed to be executed like a program

A

Script:

370
Q

A visualization library based on matplotlib that provides a simpler interface for working with common plots and graphs

A

Seaborn:

371
Q

Data that was gathered outside your organization directly from the original source

A

Second-party data:

372
Q

A parameter passed to a method or attributes used to instantiate an object

A

Self:

373
Q

Code written in a way that is readable and makes its purpose clear

A

Self-documenting code:

374
Q

A one-dimensional labeled array capable of holding any data type

A

Series:

374
Q

Refers to the variables and objects that give meaning to Python code

A

Semantics:

374
Q

A positionally-ordered collection of items

A

Sequence:

374
Q

A function that takes an iterable as an argument and returns a new set object

A

set():

374
Q

A data structure in Python that contains only unordered, unique elements; a Tableau term for a custom field of data created from a larger dataset based on custom conditions

A

Set:

374
Q

A NumPy attribute used to check the shape of an array

A

shape:

374
Q

The mean of the silhouette coefficients of all the observations in a model

A

Silhouette score:

375
Q

A technique that estimates the linear relationship between one independent variable, X, and one continuous dependent variable, Y

A

Simple linear regression:

375
Q

(Refer to learning_rate)

A

Shrinkage:

376
Q

A probability sampling method in which every member of a population is selected randomly and has an equal chance of being chosen

A

Simple random sample:

377
Q

The comparison of different models’ silhouette scores

A

Silhouette analysis:

379
Q

The minimum pairwise distance between clusters

A

Single:

380
Q

A method for breaking information down into smaller parts to facilitate efficient examination and analysis from different viewpoints

A

Slicing:

381
Q

The amount that y increases or decreases per one-unit increase of x

A

Slope:

382
Q

A method of non-probability sampling that involves researchers recruiting initial participants to be in a study and then asking them to recruit other people to participate in the study

A

Snowball sample:

383
Q

The process of arranging data into a meaningful order for analysis

A

Sorting:

384
Q

A statistic that calculates the typical distance of a data point from the mean of a dataset

A

Standard deviation:

385
Q

The standard deviation of a sample statistic

A

Standard error:

386
Q

The sample standard deviation divided by the square root of the sample size

A

Standard error of the mean:
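
In symbols this is SE = s / sqrt(n), where s is the sample standard deviation and n is the sample size. The sample values below are made up for the sketch.

```python
import math
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]

s = statistics.stdev(sample)   # sample standard deviation (n - 1 denominator)
n = len(sample)

sem = s / math.sqrt(n)
print(round(sem, 4))  # 0.7559
```

Larger samples shrink the standard error because n appears in the denominator.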

387
Q

The square root of the sample proportion times one minus the sample proportion divided by the sample size

A

Standard error of the proportion:

388
Q

The process of putting different variables on the same scale

A

Standardization:

389
Q

A characteristic of a sample

A

Statistic:

390
Q

The study of the collection, analysis, and interpretation of data

A

Statistics:

390
Q

The claim that the results of a test or experiment are not explainable by chance alone

A

Statistical significance:

391
Q

A probability sampling method that divides a population into groups and randomly selects some members from each group to be in the sample

A

Stratified random sample:

391
Q

A Tableau term for a group of dashboards or worksheets assembled into a presentation

A

Story:

392
Q

The portion of a string that can contain more than one character; also referred to as a substring

A

String slice:

393
Q

A sequence of characters and punctuation that contains textual information

A

String:

393
Q

A machine learning model that is used to make predictions about unseen events

A

Supervised model:

393
Q

A programming string used in code in which characters exist as the value themselves, rather than as variables

A

String literal:

394
Q

The sum of the squared difference between each observed value and its associated predicted value

A

Sum of squared residuals (SSR):

394
Q

A type of probability based on personal feelings, experience, or judgment

A

Subjective probability:

394
Q

A category of machine learning that uses labeled datasets to train algorithms to classify or predict outcomes

A

Supervised machine learning:

395
Q

The process of taking raw data and organizing or transforming it to be more easily visualized, explained, or modeled; one of the six practices of EDA

A

Structuring:

396
Q

A method that summarizes data using a single number

A

Summary statistics:

397
Q

A function that finds elements from both sets that are mutually not present in the other

A

symmetric_difference():
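
A small comparison of this method with the union() and intersection() set methods that also appear in this deck; the two sets are arbitrary examples.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

only_in_one = a.symmetric_difference(b)  # in a or b, but not in both
in_both = a.intersection(b)              # shared by a and b
in_either = a.union(b)                   # everything from a and b

print(only_in_one)  # {1, 2, 5}
print(in_both)      # {3, 4}
print(in_either)    # {1, 2, 3, 4, 5}
```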

397
Q

The structure of code words, symbols, placement, and punctuation

A

Syntax:

398
Q

A probability sampling method that puts every member of a population into an ordered sequence, chooses a random starting point in the sequence, and selects members for the sample at regular intervals

A

Systematic random sample:

399
Q

A business intelligence and analytics platform that helps people visualize, understand, and make decisions with data

A

Tableau:

400
Q

Data that is in the form of a table, with rows and columns

A

Tabular data:

401
Q

The complete set of elements that someone is interested in knowing more about

A

Target population:

402
Q

Data gathered outside your organization and aggregated

A

Third-party data:

402
Q

A NumPy method to convert arrays into lists

A

tolist():

403
Q

A type of statistical testing that compares the means of one continuous dependent variable based on three or more groups of two categorical variables

A

Two-Way ANOVA:

403
Q

A type of supervised machine learning that performs classification and regression tasks

A

Tree-based learning:

404
Q

An immutable sequence that can contain elements of any data type

A

Tuple:

405
Q

A function used to identify the data type of an object

A

type():

406
Q

A function that transforms input into tuples

A

tuple():
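
A short Python example with a made-up list, showing `tuple()` converting its input, `type()` reporting the result's data type, and the immutability of the resulting tuple.

```python
numbers = [1, 2, 3]

as_tuple = tuple(numbers)  # transform the list into a tuple
print(as_tuple)            # (1, 2, 3)
print(type(as_tuple))      # <class 'tuple'>

# Tuples are immutable, so item assignment raises TypeError
try:
    as_tuple[0] = 99
except TypeError as error:
    print("cannot modify a tuple:", error)
```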

407
Q

Refers to when some members of a population are inadequately represented in a sample

A

Undercoverage bias:

408
Q

A machine learning model that is used to discover the natural structure of the data, finding relationships within unlabeled data

A

Unsupervised model:

408
Q

A function that finds all the elements from both sets

A

union():
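
A short Python example with two made-up sets; an element that appears in both sets shows up only once in the result.

```python
a = {1, 2, 3}
b = {3, 4, 5}

combined = a.union(b)

print(sorted(combined))  # [1, 2, 3, 4, 5]
```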

409
Q

When constructing an interval, the calculation of the sample mean plus the margin of error

A

Upper limit:
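
A worked sketch in Python, assuming made-up sample values and a z-value of 1.96 (a common choice for a 95% confidence level); the margin-of-error formula used here, z times the standard error, is one standard approach and is stated as an assumption.

```python
import math

sample_mean = 50.0
sample_std = 10.0
n = 100
z = 1.96  # assumed critical value for a 95% confidence level

standard_error = sample_std / math.sqrt(n)
margin_of_error = z * standard_error

lower_limit = sample_mean - margin_of_error
upper_limit = sample_mean + margin_of_error  # sample mean plus margin of error

print(round(lower_limit, 2), round(upper_limit, 2))
```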

409
Q

A named container which stores values in a reserved location in the computer’s memory

A

Variable:

410
Q

The process of taking observations from the minority class and either adding copies of those observations to the dataset or generating new observations to add to the dataset

A

Upsampling:
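
A minimal Python sketch of the "adding copies" variant, assuming a hypothetical helper named `upsample_minority` and a tiny made-up dataset of (id, label) rows. It duplicates randomly chosen minority rows until both classes are the same size; generating new synthetic observations is not shown.

```python
import random

def upsample_minority(rows, label_of, minority_label, seed=0):
    """Add random copies of minority-class rows until the classes balance."""
    rng = random.Random(seed)
    minority = [r for r in rows if label_of(r) == minority_label]
    majority = [r for r in rows if label_of(r) != minority_label]
    extra_needed = max(len(majority) - len(minority), 0)
    copies = [rng.choice(minority) for _ in range(extra_needed)]
    return rows + copies

# Hypothetical dataset: four rows of class 0, one row of class 1
rows = [("a", 0), ("b", 0), ("c", 0), ("d", 0), ("e", 1)]
balanced = upsample_minority(rows, label_of=lambda r: r[1], minority_label=1)
print(balanced)
```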

411
Q

The process of verifying that the data is consistent and high quality; one of the six practices of EDA

A

Validating:

412
Q

A dictionary method to retrieve only the dictionary’s values

A

values():
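
A short Python example with a made-up dictionary; wrapping the result in `list()` makes it easy to print.

```python
prices = {"apple": 0.5, "bread": 2.25}

only_values = list(prices.values())

print(only_values)  # [0.5, 2.25]
```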

413
Q

The process of determining which variables or features to include in a given model

A

Variable selection:

414
Q

Quantifies how correlated each independent variable is with all of the other independent variables

A

Variance inflation factors (VIF):

414
Q

In machine learning, refers to model flexibility and complexity, so the model learns from existing data; in statistics, the average of the squared differences of each data point from the mean

A

Variance:
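
A worked Python example of the statistical sense (average squared difference from the mean), using made-up data. Dividing by the number of data points gives the population variance; a sample variance would divide by one less, which is not shown here.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)  # 5.0
variance = sum((x - mean) ** 2 for x in data) / len(data)

print(variance)  # 4.0

# The standard library's population variance agrees
print(statistics.pvariance(data) == variance)  # True
```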

415
Q

A process that enables operations to be performed on multiple components of a data object at the same time

A

Vectorization:

416
Q

A method of non-probability sampling that consists of members of a population who volunteer to participate in a study

A

Voluntary response sample:

417
Q

Merges two clusters whose merging will result in the lowest inertia

A

Ward:

418
Q

A model that performs slightly better than randomly guessing

A

Weak learner:

419
Q

A loop that instructs the computer to continuously execute the code based on the value of a condition

A

While loop:
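
A minimal Python example: the loop body keeps running for as long as the condition `count < 3` is true.

```python
count = 0
printed = []

while count < 3:
    print(count)           # prints 0, then 1, then 2
    printed.append(count)
    count += 1             # without this update the loop would never end
```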

420
Q

An optimized GBM package

A

XGBoost (extreme gradient boosting):

421
Q

Occurs when the dataset has no occurrences of a class label and some value of a predictor variable together

A

Zero Frequency problem:

422
Q

A measure of how many standard deviations below or above the population mean a data point is

A

Z-score:
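
A worked Python sketch, assuming a hypothetical helper named `z_score` and made-up values for the population mean and standard deviation.

```python
def z_score(value, population_mean, population_std):
    """Number of standard deviations `value` lies from the population mean."""
    return (value - population_mean) / population_std

print(z_score(85, 70, 10))  # 1.5  (above the mean)
print(z_score(60, 70, 10))  # -1.0 (below the mean)
```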