Google Advanced Data Analytics Flashcards
A way to compare two versions of something to find out which version performs better
A/B testing
(Refer to observed values)
Absolute values
Refers to the proportion of data points that were correctly categorized
Accuracy
A Tableau tool to help an audience interact with a visualization or dashboard by allowing control of selection
Action
Refers to allowing team members, bosses, and other collaborative stakeholders to share their own points of view before offering responses
Active listening
(Refer to adaptive boosting)
AdaBoost
A boosting methodology where each consecutive base learner assigns greater weight to the observations incorrectly predicted by the preceding learner
Adaptive boosting
The concept that if the events A and B are mutually exclusive, then the probability of A or B happening is the sum of the probabilities of A and B
Addition rule (for mutually exclusive events):
A variation of R² that accounts for having multiple independent variables present in a linear regression model
Adjusted R²
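The adjustment follows a standard formula; a quick sketch in Python (variable and function names are illustrative):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    # r2: the model's R-squared; n: number of observations;
    # p: number of independent variables
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Adding variables always raises R-squared, but adjusted R-squared
# penalizes each additional independent variable
print(adjusted_r2(0.90, n=100, p=5))
```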
The metric used to calculate the distance between points/clusters
Affinity
A pandas groupby method that allows the user to apply multiple calculations to groups of data
agg():
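A minimal sketch of groupby() plus agg(), using a small invented dataset (assuming pandas is available):

```python
import pandas as pd

# Invented example data
df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "score": [10, 20, 30, 50],
})

# agg() applies multiple calculations to each group at once
summary = df.groupby("team")["score"].agg(["mean", "max"])
print(summary)
```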
A clustering methodology that works by first assigning every point to its own cluster, then progressively combining clusters based on intercluster distance
Agglomerative clustering
Data from a significant number of users that has eliminated personal information
Aggregate information
A set of instructions for solving a problem or accomplishing a task
Algorithm
A process that allows the user to assign an alternate name—or alias—to something
Aliasing
A group of statistical techniques that test the difference of means between three or more groups
Analysis of Variance (ANOVA)
A data professional who supervises analytical strategy for an organization, often managing multiple groups
Analytics Team Manager
Stage of the PACE workflow where the necessary data is acquired from primary and secondary sources and then cleaned, reorganized, and analyzed
Analyze stage
A statistical technique that tests the difference of means between three or more groups while controlling for the effects of covariates, or variable(s) irrelevant to the test
ANCOVA (Analysis of Covariance)
A method that adds an element to the end of a list
append():
An ordered collection of items of a single data type
Array:
A function for converting input to an array
array():
Information given to a function in its parentheses
Argument
Refers to computer systems able to perform tasks that normally require human intelligence
Artificial intelligence (AI)
The process of storing a value in a variable
Assignment
A value associated with an object or class which is referenced by name using dot notation
Attribute
The distance between each cluster’s centroid and other clusters’ centroids
Average
A stepwise variable selection process that begins with the full model, with all possible independent variables, and removes the independent variable that adds the least explanatory power to the model
Backward elimination
A technique used by certain kinds of models that use ensembles of base learners to make predictions; refers to the combination of bootstrapping and aggregating
Bagging
Each individual model that comprises an ensemble
Base learner
(Refer to Bayes’ theorem)
Bayes’ rule
An equation that can be used to calculate the probability of an outcome or class, given the values of predictor variables
Bayes’ theorem
(Refer to Bayesian statistics)
Bayesian inference
A powerful method for analyzing and interpreting data in modern data analytics; also referred to as Bayesian inference
Bayesian statistics
The line that fits the data best by minimizing some loss function or error
Best fit line
In data structuring, refers to organizing data results in groupings, categories, or variables that are misrepresentative of the whole dataset
Bias
Balance between two model qualities, bias and variance, to minimize overall error for unobserved data
Bias-variance trade-off
A segment of data that groups values into categories
Bin
Grouping continuous values into a smaller number of categories, or intervals
Binning
A discrete distribution that models the probability of events with only two possible outcomes: success or failure
Binomial distribution
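As a quick illustration (assuming SciPy is available), the probability of exactly 3 successes in 10 fair coin flips:

```python
from scipy import stats

# P(exactly 3 successes in 10 trials with success probability 0.5)
p3 = stats.binom.pmf(k=3, n=10, p=0.5)
print(p3)  # C(10, 3) / 2**10 = 120/1024
```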
A technique that models the probability of an observation falling into one of two categories, based on one or more independent variables
Binomial logistic regression
An assumption stating that there should be a linear relationship between each X variable and the logit of the probability that Y equals one
Binomial logistic regression linearity assumption
Any model whose predictions cannot be precisely explained
Black-box model
A data type that has only two possible values, usually true or false
Boolean
A data type that has only two possible values, usually true or false
Boolean data
A filtering technique that overlays a Boolean grid onto a dataframe in order to select only the values in the dataframe that align with the True values of the grid
Boolean masking
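A minimal sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Lima", "Kyiv"], "temp": [4, 22, 9]})

# The comparison produces a Boolean grid (a Series of True/False values)...
mask = df["temp"] > 5
# ...and overlaying it on the dataframe keeps only the rows aligned with True
warm = df[mask]
print(warm)
```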
A technique that builds an ensemble of weak learners sequentially, with each consecutive learner trying to correct the errors of the one that preceded it
Boosting
Refers to sampling with replacement
Bootstrapping
A data visualization that depicts the locality, spread, and skew of groups of values within quartiles
Box plot
The ability of a program to alter its execution sequence
Branching
A keyword that lets a user escape a loop without triggering any ELSE statement that follows it in the loop
break
(Refer to Business Intelligence Engineer)
Business Intelligence Analyst:
A data professional who uses their knowledge of business trends and databases to organize information and make it accessible; also referred to as a Business Intelligence Analyst
Business Intelligence Engineer
Data that is divided into a limited number of qualitative groups
Categorical data:
Variables that contain a finite number of groups or categories
Categorical variables:
Describes a cause-and-effect relationship where one variable directly causes the other to change in a particular way
Causation:
The modular code input and output fields into which Jupyter Notebooks are partitioned
Cells:
The idea that the sampling distribution of the mean approaches a normal distribution as the sample size increases
Central Limit Theorem:
The center of a cluster determined by the mathematical mean of all the points in that cluster
Centroid:
A hypothesis test that determines whether an observed categorical variable follows an expected distribution
Chi-squared (χ²) Goodness of Fit Test:
A hypothesis test that determines whether or not two categorical variables are associated with each other
Chi-squared (χ²) Test for Independence:
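A sketch using SciPy's implementation on an invented 2×2 contingency table:

```python
from scipy import stats

# Rows: group 1 / group 2; columns: outcome yes / no (invented counts)
table = [[30, 10],
         [20, 40]]

chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
```

A small p-value here suggests the two categorical variables are associated.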
An executive-level data professional who is responsible for the consistency, accuracy, relevancy, interpretability, and reliability of the data a team provides
Chief Data Officer:
A node that is pointed to from another node
Child node:
When a dataset’s target (outcome) variable contains more instances of one class than another
Class imbalance:
An object’s data type that bundles data and functionality together
Class:
A type of probability based on formal reasoning about events with equally likely outcomes
Classical probability:
The process of removing errors that might distort your data or make it less useful; one of the six practices of EDA
Cleaning:
A probability sampling method that divides a population into clusters, randomly selects certain clusters, and includes all members from the chosen clusters in the sample
Cluster random sample:
A technique used by recommendation systems to make comparisons based on who else liked the content
Collaborative filtering:
A group of abnormal points, following similar patterns and isolated from the rest of the population
Collective outliers:
An operator that compares two values and produces Boolean values (True/False)
Comparator:
In statistics, refers to an event not occurring
Complement of an event:
The maximum pairwise distance between clusters
Complete:
A concept stating that the probability that event A does not occur is one minus the probability of A
Complement rule:
Refers to defining attributes and methods at the instance level to have a more differentiated relationship between objects in the same class
Composition:
The process of giving instructions to a computer to perform an action or set of actions
Computer programming:
A pandas function that combines data either by adding it horizontally as new columns for existing rows or vertically as new rows for existing columns
concat():
To link or join together
Concatenate:
Refers to building longer strings out of smaller strings
Concatenation:
Refers to the probability of an event occurring given that another event has already occurred
Conditional probability:
A section of code that directs the execution of programs
Conditional statement:
The area surrounding a line that describes the uncertainty around the predicted outcome at every value of X
Confidence band:
A range of values that describes the uncertainty surrounding an estimate
Confidence interval:
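For example, a 95% t-interval for a sample mean (invented sample, assuming SciPy is available):

```python
import numpy as np
from scipy import stats

sample = np.array([48, 50, 52, 49, 51, 50, 47, 53])
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

# t-interval, since the population standard deviation is unknown
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(low, high)
```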
A measure that expresses the uncertainty of the estimation process
Confidence level:
A graphical representation of how accurate a classifier is at predicting the labels for a categorical variable
Confusion matrix:
Stage of the PACE workflow where data models and machine learning algorithms are built, interpreted, and revised to uncover relationships within the data and help unlock insights from those relationships
Construct stage:
A special method to add values to an instance in object creation
Constructor:
A technique used by recommendation systems to make comparisons based on attributes of content
Content-based filtering:
Data points that are normal under certain conditions but become anomalies under most other conditions
Contextual outliers:
A variable that takes all the possible values in some range of numbers
Continuous random variable:
A mathematical concept indicating that a measure or dimension has an infinite and uncountable number of outcomes
Continuous:
Variables that can take on an infinite and uncountable set of values
Continuous variables:
A non-probability sampling method that involves choosing members of a population that are easy to contact or reach
Convenience sample:
Measures the way two variables tend to change together
Correlation:
A process that uses different portions of the data to test and train a model on different iterations
Cross-validation:
A plaintext file that uses commas to separate distinct values from one another; stands for “comma-separated values”
CSV file:
The business term that describes how many and at what rate customers stop using a product or service, or stop doing business with a company
Customer churn:
The process of formatting data and removing unwanted material
Data cleaning:
The process of protecting people’s private or sensitive data by eliminating PII
Data anonymization:
A data professional who makes data accessible, ensures data ecosystems offer reliable results, and manages infrastructure for data across enterprises
Data engineer:
Well-founded standards of right and wrong that dictate how data is collected, shared, and used
Data ethics:
A process for ensuring the formal management of a company’s data assets
Data governance:
Any individual who works with data and/or has data skills
Data professional:
The discipline of making data useful
Data science:
A data professional who works closely with analytics to provide meaningful insights that help improve current business operations
Data scientist:
The location where data originates
Data source:
The practices of an organization that ensure that data is accessible, usable, and safe
Data stewardship:
An attribute that describes a piece of data based on its values, its programming language, or the operations it can perform
Data type:
A collection of data values or objects that contain different data types
Data structure:
A two-dimensional, labeled data structure with rows and columns
DataFrame:
A graph, chart, diagram, or dashboard that is created as a representation of information
Data visualization:
A file type used to store data, often in tables, indexes, or fields
Database (DB) file:
A two-dimensional data structure organized into rows and columns
Dataframe:
A clustering methodology that searches data space for continuous regions of high density; stands for “density-based spatial clustering of applications with noise”
DBSCAN:
Troubleshooting, or searching for errors in a script or program
Debugging:
A node of the tree where decisions are made
Decision node:
A flowchart-like structure that uses branching paths to predict the outcomes of events, or the probability of certain outcomes
Decision tree:
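A minimal sketch with a one-feature toy dataset (assuming scikit-learn is available):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: the label is 1 whenever the feature exceeds 5
X = [[1], [2], [3], [7], [8], [9]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(X, y)
print(tree.predict([[4], [6]]))
```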
The elimination or removal of matching data values in a dataset
Deduplication:
A keyword that defines a function at the start of the function block
def:
The concept that two events are dependent if one event changes the probability of the other event
Dependent events:
The variable a given model estimates
Dependent variable (Y):
A type of statistics that summarizes the main features of a dataset
Descriptive statistics:
A function that returns the statistical summary of a dataframe or series, including mean, standard deviation, and minimum and maximum column values
describe():
A function used to create a dictionary
dict():
A data structure that consists of a collection of key-value pairs
Dictionary:
A function that finds the elements present in one set but not the other
difference():
Qualitative data values used to categorize and group data to reveal details about it
Dimensions:
The process data professionals use to familiarize themselves with the data so they can start conceptualizing how to use it; one of the six practices of EDA
Discovering:
Features with a countable number of values between any two values
Discrete features:
A variable that has a countable number of possible values
Discrete random variable:
A mathematical concept indicating that a measure or dimension has a finite and countable number of outcomes
Discrete:
A hyperparameter in agglomerative clustering models that determines the distance above which clusters will not be merged
distance_threshold:
An in-depth guide that is written by the developers who created a package that features very specific information on various functions and features
Documentation:
How to access the methods and attributes that belong to an instance of a class
Dot notation:
A group of text that explains what a method or function does; also referred to as a “docstring”
Documentation string:
The process of removing some observations from the majority class, making it so they make up a smaller percentage of the dataset than before
Downsampling:
Variables with values of 0 or 1 that indicate the presence or absence of something
Dummy variables:
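In pandas, dummy variables are commonly created with get_dummies() (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
# One 0/1 column per category
dummies = pd.get_dummies(df["color"], dtype=int)
print(dummies)
```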
A NumPy attribute used to check the data type of the contents of an array
dtype:
Variables that can point to objects of any data type
Dynamic typing:
A value the user inputs or the output of a program, an operation, or a function
Dynamic value:
A branch of economics that uses statistics to analyze economic problems
Econometrics:
A way of distributing computational tasks across nearby processors (i.e., computers) that improves speed and resiliency and does not depend on a single source of computational power
Edge computing:
A reserved keyword that executes subsequent conditions when the previous conditions are not true
elif:
A reserved keyword that executes when preceding conditions evaluate as False
else:
A type of probability based on experimental or historical data
Empirical probability:
A concept stating that the values on a normal curve are distributed in a regular pattern, based on their distance from the mean
Empirical rule:
Refers to building multiple models and aggregating their predictions
Ensemble learning:
In DBSCAN clustering models, a hyperparameter that determines the radius of a search area from any given point
eps (Epsilon):
(Refer to ensemble learning)
Ensembling:
A built-in function that iterates through a sequence and tracks each element and its place in the index
enumerate():
In a regression model, the natural noise assumed to be in a model
Errors:
A character that changes the typical behavior of the characters that follow it
Escape character:
Stage of the PACE workflow where a data professional will present findings with internal and external stakeholders, answer questions, consider different viewpoints, and make recommendations
Execute stage:
(Refer to independent variable)
Explanatory variable:
The process of converting a data type of an object to a required data type
Explicit conversion:
The process of investigating, organizing, and analyzing datasets and summarizing their main characteristics, often by employing data wrangling and visualization methods; the six main practices of EDA are: discovering, structuring, cleaning, joining, validating, and presenting
Exploratory data analysis (EDA):
A combination of numbers, symbols, or other variables that produces a result when evaluated
Expression:
Quantifies the amount of variance that is left unexplained by a reduced model but is explained by the full model
Extra Sum of Squares F-test:
The process of retrieving data out of data sources for further data processing
Extracting:
A model’s ability to predict new values that fall outside of the range of values in the training data
Extrapolation:
The harmonic mean of precision and recall
F1-Score:
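A quick check of that relationship with scikit-learn (invented labels):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

p = precision_score(y_true, y_pred)  # 2/3
r = recall_score(y_true, y_pred)     # 2/3
f1 = f1_score(y_true, y_pred)        # harmonic mean: 2pr / (p + r)
print(p, r, f1)
```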
A test result that indicates something is present when it really is not
False positive:
The process of using practical, statistical, and data science knowledge to select, transform, or extract characteristics, properties, and attributes from raw data
Feature engineering:
A type of feature engineering that involves taking multiple features to create a new one that would improve the accuracy of the algorithm
Feature extraction:
A type of feature engineering that involves selecting the features in the data that contribute the most to predicting the response variable
Feature selection:
A type of feature engineering that involves modifying existing features in a way that improves accuracy when training the model
Feature transformation:
The process of selecting a smaller part of a dataset based on specified values and using it for viewing or analysis
Filtering:
Data that was gathered from inside your own organization
First-party data:
A data type that represents numbers that contain decimals
Float:
A piece of code that iterates over a sequence of values
For loop:
A string method that formats and inserts specific substrings into designated places within a larger string
format():
A stepwise variable selection process that begins with the null model—with zero independent variables—and considers all possible variables to add; incorporates the independent variable that contributes the most explanatory power to the model
Forward selection:
A body of reusable code for performing specific processes or tasks
Function:
A function that returns an object (iterator) which can be iterated over (one value at a time)
Generator:
Values that are completely different from the overall data group and have no association with any other outliers
Global outliers:
A variable that can be accessed from anywhere in a program or script
Global variable:
Model ensembles that use gradient boosting
Gradient boosting machines (GBMs):
A boosting methodology where each base learner in the sequence is built to predict the residual errors of the model that preceded it
Gradient boosting:
A tool to confirm that a model achieves its intended purpose by systematically checking every combination of hyperparameters to identify which set produces the best results, based on the selected metric
GridSearch:
A pandas DataFrame method that groups rows of the dataframe together based on their values at one or more columns, which allows further analysis of the groups
groupby():
The process of aggregating individual observations of a variable into groups
Grouping:
An event where programmers and data professionals come together and work on a project
Hackathon:
A function that returns a preview of the column names and the first few rows of a dataset
head():
A type of data visualization that depicts the magnitude of an instance or set of values based on two colors
Heatmap:
A Python help function used to display the documentation of modules, functions, classes, keywords, and more
help():
A data visualization that depicts an approximate representation of the distribution of values in a dataset
Histogram:
A random sample of observed data that is not used to fit the model
Hold-out sample:
An assumption of simple linear regression stating that the variation of the residuals (errors) is constant or similar across the model
Homoscedasticity assumption:
Parameters that can be set by the modeler before the model is trained
Hyperparameters:
Refers to changing parameters that directly affect how the model trains, before the learning process begins
Hyperparameter tuning:
A theory or an explanation, based on evidence, that is not yet proven true
Hypothesis:
A statistical procedure that uses sample data to evaluate an assumption about a population parameter
Hypothesis testing:
A reserved keyword that sets up a condition in Python
if:
A type of notation in pandas that indicates when the user wants to select by integer-location-based position
iloc[]:
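A minimal sketch contrasting position-based selection with the labeled index:

```python
import pandas as pd

df = pd.DataFrame({"score": [10, 20, 30]}, index=["a", "b", "c"])

# iloc selects by integer position, regardless of the index labels
first_row = df.iloc[0]
print(first_row["score"])
```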
The concept that a data structure or element’s values can never be altered or updated
Immutability:
A data type in which the values can never be altered or updated
Immutable data type:
The process Python uses to automatically convert one data type to another without user involvement
Implicit conversion:
A statement that uses the import keyword to load an external library, package, module, or function into the computing environment
Import statement:
The concept that two events are independent if the occurrence of one event does not change the probability of the other event
Independent events:
An assumption of simple linear regression stating that each observation in the dataset is independent
Independent observation assumption:
The variable whose trends are associated with the dependent variable
Independent variable (X):
A string method that outputs the index number of a character in a string
index():
A way to refer to the individual items within an iterable by their relative position
Indexing:
The sum of the squared distances between each observation and its nearest centroid
Inertia:
A type of statistics that uses sample data to draw conclusions about a larger population
Inferential statistics:
Gives the total number of entries, along with the data types—called Dtypes in pandas—of the individual entries
info():
Refers to letting a programmer build relationships between concepts and group them together to reduce code duplication
Inheritance:
A way of combining data such that only the keys that are in both dataframes get included in the merge
Inner join:
The practice of thoroughly analyzing and double-checking to make sure data is complete, error-free, and high-quality
Input validation:
Information entered into a program
Input:
A Python function that can be used to ask a question in a message and store the answer in a variable
input():
A function that takes an index as the first parameter and an element as the second parameter, then inserts the element into a list at the given index
insert():
A variable that is declared in a class outside of other methods or blocks
Instance variable:
Refers to creating a copy of the class that inherits all class variables and methods
Instantiation:
A standard integer data type, representing numbers somewhere between negative nine quintillion and positive nine quintillion
int64:
A data type used to represent whole numbers without fractions
Integer:
A piece of software that has an interface to write, run, and test a piece of code
Integrated Development Environment (IDE):
Represents how the relationship between two independent variables is associated with changes in the mean of the dependent variable
Interaction term:
The y value of the point on the regression line where it intersects with the y-axis
Intercept (constant 𝐵0):
Traits that focus on communicating and building relationships
Interpersonal skills:
The distance between the first quartile (Q1) and the third quartile (Q3)
Interquartile range:
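Computed directly with NumPy (invented values):

```python
import numpy as np

values = np.array([1, 3, 5, 7, 9, 11, 13, 15])
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1  # spread of the middle 50% of the data
print(q1, q3, iqr)
```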
A function that finds the elements that two sets have in common
intersection():
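A minimal sketch of both set methods (intersection() here, difference() defined earlier):

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

common = a.intersection(b)  # elements in both sets
only_a = a.difference(b)    # elements in a but not in b
print(common, only_a)
```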
A sample statistic plus or minus the margin of error
Interval:
A calculation that uses a range of values to estimate a population parameter
Interval estimate:
A rule that checks objects and classes for ancestry
Is:
A dictionary method to retrieve both the dictionary’s keys and values
items():
An object that’s looped, or iterated, over
Iterable:
The repeated execution of a set of statements, where one iteration is the single execution of a block of code
Iteration:
The process of augmenting data by adding values from other datasets; one of the six practices of EDA
Joining:
A data storage file that is saved in a JavaScript format
JSON file:
An open-source web application for creating and sharing documents containing live code, mathematical formulas, visualizations, and text
Jupyter Notebook:
An unsupervised partitioning algorithm used to organize unlabeled data into groups, or clusters
K-means:
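A minimal sketch with two obvious one-dimensional blobs (assuming scikit-learn is available):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two clearly separated groups on a number line
X = np.array([[1], [2], [3], [10], [11], [12]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)
```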
An underlying core program, like Python
Kernel:
The shared points of reference between different dataframes
Keys:
A dictionary method to retrieve only the dictionary’s keys
keys():
A special word in a programming language that is reserved for a specific purpose and that can only be used for that purpose
Keyword:
Data transformation technique where each category is assigned a unique number instead of a qualitative value
Label encoding:
The nodes where a final prediction is made
Leaf node:
In XGBoost, a hyperparameter that specifies how much weight is given to each consecutive tree’s prediction in the final ensemble
learning_rate:
A way of combining data such that all of the keys in the left dataframe are included, even if they aren’t in the right dataframe
Left join:
A function used to measure the length of strings
len():
A reusable collection of code; also referred to as a “package”
Library:
The probability of observing the actual data, given some set of beta parameters
Likelihood:
A collection of an infinite number of points extending in two opposite directions
Line:
A technique that estimates the linear relationship between a continuous dependent variable and one or more independent variables
Linear regression:
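A minimal sketch on exact toy data (assuming scikit-learn), recovering y = 2x + 1:

```python
from sklearn.linear_model import LinearRegression

X = [[0], [1], [2], [3]]
y = [1, 3, 5, 7]  # exactly y = 2x + 1

model = LinearRegression()
model.fit(X, y)
print(model.coef_[0], model.intercept_)  # slope and intercept
```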
An assumption of simple linear regression stating that each predictor variable (Xi) is linearly related to the outcome variable (Y)
Linearity assumption:
A nonlinear function that connects or links the dependent variable to the independent variables mathematically
Link function:
The method used to determine which points/clusters to merge
Linkage:
A data structure that helps store and manipulate an ordered collection of items
List:
Formulaic creation of a new list based on the values in an existing list
List comprehension:
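For example:

```python
prices = [10, 20, 30]
# New list built formulaically from the values in an existing list
with_tax = [round(p * 1.1, 2) for p in prices]
print(with_tax)
```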
The percentage of the population in a given age group that can read and write
Literacy rate:
Notation that is used to select pandas rows and columns by name
loc[]:
(Refer to logit)
Log-Odds function:
An operator that connects multiple statements together and performs complex comparisons
Logical operator:
A technique that models a categorical dependent variable (Y) based on one or more independent variables (X)
Logistic regression:
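A minimal binary-classification sketch (assuming scikit-learn is available):

```python
from sklearn.linear_model import LogisticRegression

# Toy data: the label is 1 for larger feature values
X = [[1], [2], [3], [8], [9], [10]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[2], [9]]))
print(clf.predict_proba([[9]])[0, 1])  # modeled probability of class 1
```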
The logarithm of the odds of a given probability
Logit:
A block of code used to carry out iterations
Loop:
A function that measures the distance between the observed values and the model’s estimated values
Loss function:
When constructing an interval, the calculation of the sample means minus the margin of error
Lower limit:
The use and development of algorithms and statistical models to teach computer systems to analyze and discover patterns in data
Machine learning:
The average of the absolute difference between the predicted and actual values
MAE (Mean Absolute Error):
Commands that are built into IPython to simplify common tasks
Magic commands:
(Refer to magic commands)
Magics:
An extension of ANCOVA and MANOVA that compares how two or more continuous outcome variables vary according to categorical independent variables, while controlling for covariates
MANCOVA (Multivariate Analysis of Covariance):
An extension of ANOVA that compares how two or more continuous outcome variables vary according to categorical independent variables
MANOVA (Multivariate Analysis of Variance):
The maximum expected difference between a population parameter and a sample estimate
Margin of error:
A markup language that lets the user write formatted text in a coding environment or plain-text editor
Markdown:
A library for creating static, animated, and interactive visualizations in Python
matplotlib:
In tree-based models, a hyperparameter that controls how deep each base learner tree will grow
max_depth:
In decision tree and random forest models, a hyperparameter that specifies the number of features that each tree randomly selects during training; called “colsample_bytree” in XGBoost
max_features:
A technique for estimating the beta parameters that maximizes the likelihood of the model producing the observed data
Maximum Likelihood Estimation (MLE):
The average value in a dataset
Mean:
A value that represents the center of a dataset
Measure of central tendency:
A value that represents the spread of a dataset, or the amount of variation in data points
Measure of dispersion:
A method by which the position of a value in relation to other values in a dataset is determined
Measure of position:
Numeric values that can be aggregated or placed in calculations
Measures:
The middle value in a dataset
Median:
Someone who shares knowledge, skills, and experience to help another grow both professionally and personally
Mentor:
A pandas function that joins two dataframes together; it combines data only by extending along axis 1 (horizontally)
merge():
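A minimal sketch contrasting an inner join with a left join (invented dataframes):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cal"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [88, 92, 75]})

inner = left.merge(right, on="id", how="inner")     # only ids in both: 2, 3
left_join = left.merge(right, on="id", how="left")  # all ids from left
print(inner)
print(left_join)
```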
A method to combine two (or more) different dataframes along a specified starting column(s)
Merging:
A function that belongs to a class and typically performs an action or operation
Method:
In XGBoost models, a hyperparameter indicating that a tree will not split a node if it results in any child node with less weight than this value; called “min_samples_leaf” in decision tree and random forest models
min_child_weight:
Methods and criteria used to evaluate data
Metrics:
In decision tree and random forest models, a hyperparameter that defines the minimum number of samples for a leaf node; called “min_child_weight” in XGBoost
min_samples_leaf:
In DBSCAN clustering models, a hyperparameter that specifies the number of samples in an ε-neighborhood for a point to be considered a core point (including itself)
min_samples:
In decision tree and random forest models, a hyperparameter that defines the minimum number of samples that a node must have to split into more nodes
min_samples_split:
A data value that is not stored for a variable in the observation of interest
Missing data:
The most frequently occurring value in a dataset
Mode:
Statements about the data that must be true in order to justify the use of a particular modeling technique
Model assumptions:
The process of determining which model should be the final product and put into production
Model selection:
The set of processes and activities intended to verify that models are performing as expected
Model validation:
The ability to write code in separate components that work together and that can be reused for other programs
Modularity:
A simple Python file containing a collection of functions and global variables
Module:
An operator that returns the remainder when one number is divided by another
Modulo:
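A quick illustration of the modulo operator:

```python
# The modulo operator (%) returns the remainder when one number
# is divided by another
remainder = 10 % 3    # 10 = 3 * 3 + 1, so the remainder is 1

# A common use: testing whether a number is even
is_even = 8 % 2 == 0  # True, since 8 divides evenly by 2
```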
A technique that estimates the relationship between one continuous dependent variable and two or more independent variables
Multiple linear regression:
The average of the squared difference between the predicted and actual values
MSE (Mean Squared Error):
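MSE can be computed directly from its definition with NumPy; the values below are hypothetical:

```python
import numpy as np

# Hypothetical actual and predicted values
actual = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.5, 5.0, 8.0])

# The average of the squared differences between predicted
# and actual values
mse = np.mean((predicted - actual) ** 2)
```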
(Refer to multiple linear regression)
Multiple regression:
The concept that if the events A and B are independent, then the probability of both A and B happening is the probability of A multiplied by the probability of B
Multiplication rule (for independent events):
The ability to change the internal state of a data structure
Mutability:
The concept that two events are mutually exclusive if they cannot occur at the same time
Mutually exclusive:
In K-means and agglomerative clustering models, a hyperparameter that specifies the number of clusters in the final model
n_clusters:
The core data object of NumPy; also referred to as “ndarray”
N-dimensional array:
In random forest and XGBoost models, a hyperparameter that specifies the number of trees your model will build in its ensemble
n_estimators:
A supervised classification technique that is based on Bayes’s Theorem with an assumption of independence among predictors
Naive Bayes:
Consistent guidelines that describe the content, creation date, and version of a file in its name
Naming conventions:
Rules built into the syntax of a programming language
Naming restrictions:
How null values are represented in pandas; stands for “not a number”
NaN:
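A short sketch of flagging and removing NaN values in pandas (hypothetical data):

```python
import pandas as pd
import numpy as np

# A series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

# isna() flags NaN entries; dropna() removes them
missing_mask = s.isna()
cleaned = s.dropna()
```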
A NumPy attribute used to check the number of dimensions of an array
ndim:
An inverse relationship between two variables, where when one variable increases, the other variable tends to decrease, and vice versa
Negative correlation:
A loop inside of another loop
Nested loop:
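A minimal nested loop: the inner loop runs to completion on every pass of the outer loop.

```python
# Collect every (i, j) pair produced by a loop inside another loop
pairs = []
for i in range(2):        # outer loop: i = 0, 1
    for j in range(3):    # inner loop: j = 0, 1, 2
        pairs.append((i, j))
```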
An assumption of multiple linear regression stating that no two independent variables (Xi and Xj) can be highly correlated with each other
No multicollinearity assumption:
A group organized for purposes other than generating profit; often aims to further a social cause or provide a benefit to the public
Nonprofit:
A special data type in Python used to indicate that things are empty or that they return nothing
None:
The total number of data entries for a data column that are not blank
Non-null count:
A sampling method that is based on convenience or the personal preferences of the researcher, rather than random selection
Non-probability sampling:
A programming system that is based around objects which can contain both data and code that manipulates that data
Object-oriented programming:
Refers to when certain groups of people are less likely to provide responses
Nonresponse bias:
An assumption of simple linear regression stating that the residuals are normally distributed
Normality assumption:
A type of probability based on statistics, experiments, and mathematical measurements
Objective probability:
A continuous probability distribution that is symmetrical on both sides of the mean and bell-shaped
Normal distribution:
An essential library that contains multidimensional array and matrix data structures and functions to manipulate them
NumPy:
A component category, usually associated with its respective class
Object type:
A collection of data that consists of variables and methods or functions
Object:
The existing sample of data, where each data point in the sample is represented by an observed value of the dependent variable and an observed value of the independent variable
Observed values:
A data transformation technique that turns one categorical variable into several binary variables
One hot encoding:
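One common way to do this in pandas is get_dummies(); the column name and values here are hypothetical:

```python
import pandas as pd

# One categorical variable with two levels
df = pd.DataFrame({"color": ["red", "blue", "red"]})

# get_dummies() replaces the categorical column with
# binary indicator columns, one per level
encoded = pd.get_dummies(df, columns=["color"])
```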
(Refer to dependent variable)
Outcome variable (Y):
A type of statistical testing that compares the means of one continuous dependent variable based on three or more groups of one categorical variable
One-Way ANOVA:
Data that is available to the public and free to use, with guidance on how to navigate the datasets and acknowledge the source
Open data:
A common way to calculate linear regression coefficients
Ordinary least squares estimation (OLS):
Observations that are an abnormal distance from other values or an overall pattern in a data population
Outliers:
A way of combining data such that all of the keys from both dataframes get included in the merge
Outer join:
The information or result a program produces after executing code
Output:
When a model fits the observed or training data too specifically and is unable to generate suitable estimates for the general population
Overfitting:
The probability of observing results as extreme as or more extreme than those observed when the null hypothesis is true
P-value:
A workflow data professionals can use to remain focused on the end goal of any given dataset; stands for plan, analyze, construct, and execute
PACE:
A fundamental unit of shareable code that others have developed for a specific purpose
Package:
A powerful library built on top of NumPy that’s used to manipulate and analyze tabular data
pandas:
A characteristic of a population
Parameter:
The value below which a percentage of data falls
Percentile:
Information that permits the identity of an individual to be inferred by either direct or indirect means
Personally identifiable information (PII):
Stage of the PACE workflow where the scope of a project is defined and the informational needs of the organization are identified
Plan stage:
A calculation that uses a single value to estimate a population parameter
Point estimate:
A probability distribution that models the probability that a certain number of events will occur during a specific time period
Poisson distribution:
A method that extracts an element from a list by removing it at a given index
pop():
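A brief sketch of pop() on a list:

```python
# pop() removes the element at the given index and returns it
letters = ["a", "b", "c"]
removed = letters.pop(1)   # removes and returns "b"

# With no index, pop() removes the last element
last = letters.pop()       # removes and returns "c"
```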
The phenomenon of more popular items being recommended too frequently
Popularity bias:
Every possible element that a data professional is interested in measuring
Population:
The percentage of individuals or elements in a population that share a certain characteristic
Population proportion:
A relationship between two variables that tend to increase or decrease together
Positive correlation:
An ANOVA test that performs a pairwise comparison between all available groups while controlling for the error rate
Post hoc test:
The probability of an event occurring after taking into consideration new information
Posterior probability:
The proportion of positive predictions that were correct, out of all positive predictions
Precision:
The estimated Y values for each X calculated by a model
Predicted values:
(Refer to independent variable)
Predictor variable:
The process of making a cleaned dataset available to others for analysis or further modeling; one of the six practices of EDA
Presenting:
Refers to the probability of an event before new data is collected
Prior probability:
The branch of mathematics that deals with measuring and quantifying uncertainty
Probability:
A function that describes the likelihood of the possible outcomes of a random event
Probability distribution:
A sampling method that uses random selection to generate a sample
Probability sampling:
A series of instructions written so that a computer can perform a certain task, independent of any other application
Program:
The words and symbols used to write instructions for computers to follow
Programming languages:
A method of non-probability sampling that involves researchers selecting participants based on the purpose of their study
Purposive sample:
Measures the proportion of variation in the dependent variable, Y, explained by the independent variable(s), X
R² (The Coefficient of Determination):
A general-purpose programming language
Python:
A value that divides a dataset into four equal parts
Quartile:
A visual that helps to define roles and responsibilities for individuals or teams to ensure work gets done efficiently; lists who is responsible, accountable, consulted, and informed for project tasks
RACI chart:
A process whose outcome cannot be predicted with certainty
Random experiment:
A starting point for generating random numbers
Random seed:
An ensemble of decision trees trained on bootstrapped data with randomly selected features
Random forest:
A Python function that returns a sequence of numbers that starts from zero by default, increments by 1 by default, and stops before the given number
range():
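A quick illustration of range() with its default and optional arguments:

```python
# range() starts at 0 by default, increments by 1,
# and stops before the given number
nums = list(range(5))          # [0, 1, 2, 3, 4]

# Optional start, stop, and step arguments
evens = list(range(2, 10, 2))  # [2, 4, 6, 8]
```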
A variable that represents the values for the possible outcomes of a random event
Random variable:
The difference between the largest and smallest value in a dataset
Range:
Unsupervised learning techniques that use unlabeled data to offer relevant suggestions to users
Recommendation systems:
The proportion of actual positives that were identified correctly, out of all actual positives
Recall:
The process of restructuring code while maintaining its original functionality
Refactoring:
A group of statistical techniques that use existing data to estimate the relationships between a single dependent variable and one or more independent variables
Regression analysis:
The estimated betas in a regression model
Regression coefficient:
A set of regression techniques that shrinks regression coefficient estimates towards zero, adding in bias, to reduce variance
Regularization:
(Refer to regression analysis)
Regression models:
A method that removes an element from a list
remove():
A sample that accurately reflects the characteristics of a population
Representative sample:
A NumPy method used to change the shape of an array
reshape():
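A minimal sketch of reshape(), which also illustrates the related shape and ndim attributes:

```python
import numpy as np

# A one-dimensional array of six elements
arr = np.arange(6)

# reshape() changes the shape without changing the data:
# here, into 2 rows and 3 columns
matrix = arr.reshape(2, 3)
```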
The difference between observed or actual values and the predicted values of the regression line
Residual:
A reserved keyword in Python that exits a function and sends its result back to the caller so it can be saved for later use
return:
The capability to define code once and use it many times without having to rewrite it
Reusability:
(Refer to dependent variable)
Response variable:
A way of combining data such that all the keys in the right dataframe are included—even if they aren’t in the left dataframe
Right join:
The first node of the tree, where the first decision is made
Root node:
A segment of a population that is representative of the entire population
Sample:
The set of all possible values for a random variable
Sample space:
The number of individuals or items chosen for a study or experiment
Sample size:
The process of selecting a subset of data from a population
Sampling:
A probability distribution of a sample statistic
Sampling distribution:
Refers to when a sample is not representative of the population as a whole
Sampling bias:
A list of all the items in a target population
Sampling frame:
Refers to how much an estimate varies between samples
Sampling variability:
Refers to when a population element can be selected more than one time
Sampling with replacement:
Refers to when a population element can be selected only one time
Sampling without replacement:
A series of scatterplots that show the relationships between pairs of variables
Scatterplot matrix:
A collection of commands in a file designed to be executed like a program
Script:
A visualization library based on matplotlib that provides a simpler interface for working with common plots and graphs
Seaborn:
Data that was gathered outside your organization directly from the original source
Second-party data:
A parameter in a method definition that refers to the specific instance of a class, giving the method access to that instance's attributes
Self:
Code written in a way that is readable and makes its purpose clear
Self-documenting code:
A one-dimensional labeled array capable of holding any data type
Series:
Refers to the variables and objects that give meaning to Python code
Semantics:
A positionally-ordered collection of items
Sequence:
A function that takes an iterable as an argument and returns a new set object
set():
A data structure in Python that contains only unordered, unique (non-duplicate) elements; a Tableau term for a custom field of data created from a larger dataset based on custom conditions
Set:
A NumPy attribute used to check the shape of an array
shape:
The mean of the silhouette coefficients of all the observations in a model
Silhouette score:
A technique that estimates the linear relationship between one independent variable, X, and one continuous dependent variable, Y
Simple linear regression:
(Refer to learning_rate)
Shrinkage:
A probability sampling method in which every member of a population is selected randomly and has an equal chance of being chosen
Simple random sample:
The comparison of different models’ silhouette scores
Silhouette analysis:
The minimum pairwise distance between clusters
Single:
A method for breaking information down into smaller parts to facilitate efficient examination and analysis from different viewpoints
Slicing:
The amount that y increases or decreases per one-unit increase of x
Slope:
A method of non-probability sampling that involves researchers recruiting initial participants to be in a study and then asking them to recruit other people to participate in the study
Snowball sample:
The process of arranging data into a meaningful order for analysis
Sorting:
A statistic that calculates the typical distance of a data point from the mean of a dataset
Standard deviation:
The standard deviation of a sample statistic
Standard error:
The sample standard deviation divided by the square root of the sample size
Standard error of the mean:
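A sketch of the calculation with NumPy; the sample is hypothetical, and ddof=1 gives the sample standard deviation:

```python
import numpy as np

# Hypothetical sample
sample = np.array([2.0, 4.0, 6.0, 8.0])

# Sample standard deviation (ddof=1) divided by the
# square root of the sample size
sem = np.std(sample, ddof=1) / np.sqrt(len(sample))
```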
The square root of the sample proportion times one minus the sample proportion divided by the sample size
Standard error of the proportion:
The process of putting different variables on the same scale
Standardization:
A characteristic of a sample
Statistic:
The study of the collection, analysis, and interpretation of data
Statistics:
The claim that the results of a test or experiment are not explainable by chance alone
Statistical significance:
A probability sampling method that divides a population into groups and randomly selects some members from each group to be in the sample
Stratified random sample:
A Tableau term for a group of dashboards or worksheets assembled into a presentation
Story:
The portion of a string that can contain more than one character; also referred to as a substring
String slice:
A sequence of characters and punctuation that contains textual information
String:
A machine learning model that is used to make predictions about unseen events
Supervised model:
A string written directly in code, in which the characters exist as the value themselves rather than as a variable
String literal:
The sum of the squared difference between each observed value and its associated predicted value
Sum of squared residuals (SSR):
A type of probability based on personal feelings, experience, or judgment
Subjective probability:
A category of machine learning that uses labeled datasets to train algorithms to classify or predict outcomes
Supervised machine learning:
The process of taking raw data and organizing or transforming it to be more easily visualized, explained, or modeled; one of the six practices of EDA
Structuring:
A method that summarizes data using a single number
Summary statistics:
A function that finds elements from both sets that are mutually not present in the other
symmetric_difference():
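A short example of symmetric_difference(), shown next to union() for contrast:

```python
a = {1, 2, 3}
b = {3, 4, 5}

# Elements found in exactly one of the two sets
diff = a.symmetric_difference(b)

# For comparison, union() returns all elements from both sets
combined = a.union(b)
```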
The structure of code words, symbols, placement, and punctuation
Syntax:
A probability sampling method that puts every member of a population into an ordered sequence, chooses a random starting point in the sequence, and selects members for the sample at regular intervals
Systematic random sample:
A business intelligence and analytics platform that helps people visualize, understand, and make decisions with data
Tableau:
Data that is in the form of a table, with rows and columns
Tabular data:
The complete set of elements that someone is interested in knowing more about
Target population:
Data gathered outside your organization and aggregated
Third-party data:
A NumPy method to convert arrays into lists
tolist():
A type of statistical testing that compares the means of one continuous dependent variable based on three or more groups of two categorical variables
Two-Way ANOVA:
A type of supervised machine learning that performs classification and regression tasks
Tree-based learning:
An immutable sequence that can contain elements of any data type
Tuple:
A function used to identify the type of data in a list
type():
A function that converts an iterable into a tuple
tuple():
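A quick illustration of tuple() on two kinds of iterables:

```python
# tuple() converts an iterable into an immutable tuple
nums = tuple([1, 2, 3])   # from a list
chars = tuple("abc")      # from a string
```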
Refers to when some members of a population are inadequately represented in a sample
Undercoverage bias:
A machine learning model that is used to discover the natural structure of the data, finding relationships within unlabeled data
Unsupervised model:
A function that finds all the elements from both sets
union():
When constructing an interval, the calculation of the sample mean plus the margin of error
Upper limit:
A named container which stores values in a reserved location in the computer’s memory
Variable:
The process of taking observations from the minority class and either adding copies of those observations to the dataset or generating new observations to add to the dataset
Upsampling:
The process of verifying that the data is consistent and high quality; one of the six practices of EDA
Validating:
A dictionary method to retrieve only the dictionary’s values
values():
The process of determining which variables or features to include in a given model
Variable selection:
Quantifies how correlated each independent variable is with all of the other independent variables
Variance inflation factors (VIF):
Refers to model flexibility and complexity, so the model learns from existing data; the average of the squared difference of each data point from the mean
Variance:
A process that enables operations to be performed on multiple components of a data object at the same time
Vectorization:
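A minimal sketch of vectorization with NumPy; the prices and discount are hypothetical:

```python
import numpy as np

# Hypothetical prices
prices = np.array([10.0, 20.0, 30.0])

# One vectorized expression applies the operation to every
# element at once, with no explicit Python loop
discounted = prices * 0.9
```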
A method of non-probability sampling that consists of members of a population who volunteer to participate in a study
Voluntary response sample:
Merges two clusters whose merging will result in the lowest inertia
Ward:
A model that performs slightly better than randomly guessing
Weak learner:
A loop that instructs the computer to continuously execute the code based on the value of a condition
While loop:
An optimized GBM package
XGBoost (extreme gradient boosting):
Occurs when the dataset has no occurrences of a class label and some value of a predictor variable together
Zero Frequency problem:
A measure of how many standard deviations below or above the population mean a data point is
Z-score:
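A sketch of computing z-scores with NumPy; the data is hypothetical, and data.std() gives the population standard deviation:

```python
import numpy as np

# Hypothetical population data
data = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# How many standard deviations each point lies from the mean
mean = data.mean()
std = data.std()          # population standard deviation
z_scores = (data - mean) / std
```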