Google Advanced Data Analytics Flashcards
A way to compare two versions of something to find out which version performs better
A/B testing
(Refer to observed values)
Absolute values
Refers to the proportion of data points that were correctly categorized
Accuracy
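A minimal sketch of the accuracy calculation, using made-up labels; the function name `accuracy` and the sample lists are assumptions for illustration only.

```python
def accuracy(y_true, y_pred):
    """Return the proportion of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_true = [1, 0, 1, 1, 0]  # hypothetical true labels
y_pred = [1, 0, 0, 1, 0]  # hypothetical model predictions
acc = accuracy(y_true, y_pred)  # 4 of the 5 predictions are correct
```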
A Tableau tool to help an audience interact with a visualization or dashboard by allowing control of selection
Action
Refers to allowing team members, bosses, and other collaborative stakeholders to share their own points of view before offering responses
Active listening
(Refer to adaptive boosting)
AdaBoost
A boosting methodology where each consecutive base learner assigns greater weight to the observations incorrectly predicted by the preceding learner
Adaptive boosting
The concept that if the events A and B are mutually exclusive, then the probability of A or B happening is the sum of the probabilities of A and B
Addition rule (for mutually exclusive events):
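As a small worked example, assume one roll of a fair six-sided die: rolling a 1 and rolling a 2 cannot both happen, so their probabilities add.

```python
from fractions import Fraction

# Assumed example: one roll of a fair six-sided die.
p_roll_1 = Fraction(1, 6)
p_roll_2 = Fraction(1, 6)

# The two events are mutually exclusive, so P(1 or 2) = P(1) + P(2).
p_1_or_2 = p_roll_1 + p_roll_2
```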
A variation of R² that accounts for having multiple independent variables present in a linear regression model
Adjusted R²
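The commonly used formula is adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables. A sketch with hypothetical inputs (the function and argument names are illustrative):

```python
def adjusted_r2(r2, n_samples, n_features):
    """Penalize R-squared for the number of independent variables in the model."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more observations than independent variables + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# Hypothetical model: R-squared of 0.80 from 50 observations and 4 variables.
adj = adjusted_r2(0.80, n_samples=50, n_features=4)
```

The adjusted value is never larger than the unadjusted R², and the gap grows as more variables are added.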
The metric used to calculate the distance between points/clusters
Affinity
A pandas groupby method that allows the user to apply multiple calculations to groups of data
agg():
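A short sketch of `agg()` on grouped data, assuming pandas is installed; the `region` and `sales` columns are invented for the example.

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "sales": [10, 20, 5, 15],
})

# Apply two calculations (mean and max) to each region's sales in one call.
summary = df.groupby("region")["sales"].agg(["mean", "max"])
```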
A clustering methodology that works by first assigning every point to its own cluster, then progressively combining clusters based on intercluster distance
Agglomerative clustering
Data from a significant number of users from which personal information has been removed
Aggregate information
A set of instructions for solving a problem or accomplishing a task
Algorithm
A process that allows the user to assign an alternate name—or alias—to something
Aliasing
A group of statistical techniques that test the difference of means between three or more groups
Analysis of Variance (ANOVA)
A data professional who supervises analytical strategy for an organization, often managing multiple groups
Analytics Team Manager
Stage of the PACE workflow where the necessary data is acquired from primary and secondary sources and then cleaned, reorganized, and analyzed
Analyze stage
A statistical technique that tests the difference of means between three or more groups while controlling for the effects of covariates, or variable(s) irrelevant to the test
ANCOVA (Analysis of Covariance)
A method that adds an element to the end of a list
append():
An ordered collection of items of a single data type
Array:
A function for converting input to an array
array():
Information given to a function in its parentheses
Argument
Refers to computer systems able to perform tasks that normally require human intelligence
Artificial intelligence (AI)
The process of storing a value in a variable
Assignment
A value associated with an object or class which is referenced by name using dot notation
Attribute
The distance between each cluster’s centroid and other clusters’ centroids
Average
A stepwise variable selection process that begins with the full model, with all possible independent variables, and removes the independent variable that adds the least explanatory power to the model
Backward elimination
A technique used by certain kinds of models that use ensembles of base learners to make predictions; refers to the combination of bootstrapping and aggregating
Bagging
Each individual model that comprises an ensemble
Base learner
(Refer to Bayes’ theorem)
Bayes’ rule
An equation that can be used to calculate the probability of an outcome or class, given the values of predictor variables
Bayes’ theorem
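In its usual form, Bayes' theorem is P(A|B) = P(B|A) P(A) / P(B). The sketch below applies it to a two-class case; the function name and all probabilities are invented for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(class | evidence) for a two-class problem, via Bayes' theorem."""
    # Total probability of seeing the evidence, across both classes.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% of cases are positive, the signal appears in 90%
# of positive cases and in 5% of negative cases.
p = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
```

With these assumed inputs the posterior is about 15%, far below the 90% likelihood, because the positive class is rare.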
(Refer to Bayesian statistics)
Bayesian inference
A powerful method for analyzing and interpreting data in modern data analytics; also referred to as Bayesian inference
Bayesian statistics
The line that fits the data best by minimizing some loss function or error
Best fit line
In data structuring, refers to organizing data results in groupings, categories, or variables that are misrepresentative of the whole dataset
Bias
Balance between two model qualities, bias and variance, to minimize overall error for unobserved data
Bias-variance trade-off
A segment of data that groups values into categories
Bin
Grouping continuous values into a smaller number of categories, or intervals
Binning
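A minimal sketch of binning a continuous value (age) into three intervals; the cut points and labels are arbitrary choices made for this example.

```python
def bin_age(age):
    """Map a continuous age to one of three labeled intervals."""
    if age < 18:
        return "under 18"
    elif age < 65:
        return "18-64"
    else:
        return "65+"


ages = [5, 17, 18, 40, 65, 80]  # hypothetical values
binned = [bin_age(a) for a in ages]
```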
A discrete distribution that models the probability of events with only two possible outcomes: success or failure
Binomial distribution
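The probability of exactly k successes in n independent trials with success probability p is C(n, k) p^k (1 - p)^(n - k). A sketch using only the standard library, with a coin-flip example chosen for illustration:

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Assumed example: exactly 3 heads in 10 flips of a fair coin.
p_three_heads = binomial_pmf(3, 10, 0.5)

# The probabilities over all possible counts of successes sum to 1.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
```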
A technique that models the probability of an observation falling into one of two categories, based on one or more independent variables
Binomial logistic regression
An assumption stating that there should be a linear relationship between each X variable and the logit of the probability that Y equals one
Binomial logistic regression linearity assumption
Any model whose predictions cannot be precisely explained
Black-box model
A data type that has only two possible values, usually true or false
Boolean
A data type that has only two possible values, usually true or false
Boolean data
A filtering technique that overlays a Boolean grid onto a dataframe in order to select only the values in the dataframe that align with the True values of the grid
Boolean masking
A technique that builds an ensemble of weak learners sequentially, with each consecutive learner trying to correct the errors of the one that preceded it
Boosting
Refers to sampling with replacement
Bootstrapping
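A sketch of drawing one bootstrap sample with the standard library; the data values and the seed are arbitrary.

```python
import random

data = [2, 4, 4, 7, 9]   # hypothetical observed sample
rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Draw len(data) values with replacement; the same value can be picked twice.
bootstrap_sample = rng.choices(data, k=len(data))
bootstrap_mean = sum(bootstrap_sample) / len(bootstrap_sample)
```

Repeating the draw many times and collecting the means gives an approximate sampling distribution of the mean.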
A data visualization that depicts the locality, spread, and skew of groups of values within quartiles
Box plot
The ability of a program to alter its execution sequence
Branching
A keyword that lets a user escape a loop without triggering any else statement that follows it in the loop
break
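A small illustration of how `break` interacts with a loop's `else` clause; the helper name is hypothetical.

```python
def find_first_negative(values):
    """Return the first negative value, or None if there is none."""
    for v in values:
        if v < 0:
            result = v
            break  # leaves the loop; the else block below is skipped
    else:
        # Runs only when the loop finishes without hitting break.
        result = None
    return result


found = find_first_negative([3, -1, 4])
missing = find_first_negative([1, 2])
```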
(Refer to Business Intelligence Engineer)
Business Intelligence Analyst:
A data professional who uses their knowledge of business trends and databases to organize information and make it accessible; also referred to as a Business Intelligence Analyst
Business Intelligence Engineer
Data that is divided into a limited number of qualitative groups
Categorical data:
Variables that contain a finite number of groups or categories
Categorical variables:
Describes a cause-and-effect relationship where one variable directly causes the other to change in a particular way
Causation:
The modular code input and output fields into which Jupyter Notebooks are partitioned
Cells:
The idea that the sampling distribution of the mean approaches a normal distribution as the sample size increases
Central Limit Theorem:
The center of a cluster determined by the mathematical mean of all the points in that cluster
Centroid:
A hypothesis test that determines whether an observed categorical variable follows an expected distribution
Chi-squared (χ²) Goodness of Fit Test:
A hypothesis test that determines whether or not two categorical variables are associated with each other
Chi-squared (χ²) Test for Independence:
An executive-level data professional who is responsible for the consistency, accuracy, relevancy, interpretability, and reliability of the data a team provides
Chief Data Officer:
A node that is pointed to from another node
Child node:
When the response variable in a dataset contains more instances of one outcome than another
Class imbalance:
An object’s data type that bundles data and functionality together
Class:
A type of probability based on formal reasoning about events with equally likely outcomes
Classical probability:
The process of removing errors that might distort your data or make it less useful; one of the six practices of EDA
Cleaning:
A probability sampling method that divides a population into clusters, randomly selects certain clusters, and includes all members from the chosen clusters in the sample
Cluster random sample:
A technique used by recommendation systems to make comparisons based on who else liked the content
Collaborative filtering:
A group of abnormal points, following similar patterns and isolated from the rest of the population
Collective outliers:
An operator that compares two values and produces Boolean values (True/False)
Comparator:
In statistics, refers to an event not occurring
Complement of an event:
The maximum pairwise distance between clusters
Complete:
A concept stating that the probability that event A does not occur is one minus the probability of A
Complement rule:
Refers to defining attributes and methods at the instance level to have a more differentiated relationship between objects in the same class
Composition:
The process of giving instructions to a computer to perform an action or set of actions
Computer programming:
A pandas function that combines data either by adding it horizontally as new columns for existing rows or vertically as new rows for existing columns
concat():
To link or join together
Concatenate:
Refers to building longer strings out of smaller strings
Concatenation:
Refers to the probability of an event occurring given that another event has already occurred
Conditional probability:
A section of code that directs the execution of programs
Conditional statement:
The area surrounding a line that describes the uncertainty around the predicted outcome at every value of X
Confidence band:
A range of values that describes the uncertainty surrounding an estimate
Confidence interval:
A measure that expresses the uncertainty of the estimation process
Confidence level:
A graphical representation of how accurate a classifier is at predicting the labels for a categorical variable
Confusion matrix:
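A sketch that tallies the four cells of a binary confusion matrix (true/false positives and negatives) from made-up labels; the function name and the `positive` argument are illustrative.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, and TN for a binary classifier."""
    counts = Counter()
    for t, p in zip(y_true, y_pred):
        if p == positive:
            counts["TP" if t == positive else "FP"] += 1
        else:
            counts["FN" if t == positive else "TN"] += 1
    return counts


y_true = [1, 0, 1, 1, 0, 0]  # hypothetical true labels
y_pred = [1, 0, 0, 1, 1, 0]  # hypothetical predictions
cm = confusion_counts(y_true, y_pred)
```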
Stage of the PACE workflow where data models and machine learning algorithms are built, interpreted, and revised to uncover relationships within the data and help unlock insights from those relationships
Construct stage:
A special method to add values to an instance in object creation
Constructor:
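In Python the constructor is the `__init__` method. A minimal sketch with a hypothetical `Customer` class:

```python
class Customer:
    def __init__(self, name, plan="basic"):
        # Runs when an instance is created and stores the values on it.
        self.name = name
        self.plan = plan


customer = Customer("Ada", plan="premium")
default_customer = Customer("Bo")  # falls back to the default plan
```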
A technique used by recommendation systems to make comparisons based on attributes of content
Content-based filtering:
Data points that are normal under certain conditions but become anomalies under most other conditions
Contextual outliers:
A variable that takes all the possible values in some range of numbers
Continuous random variable:
A mathematical concept indicating that a measure or dimension has an infinite and uncountable number of outcomes
Continuous:
Variables that can take on an infinite and uncountable set of values
Continuous variables:
A non-probability sampling method that involves choosing members of a population that are easy to contact or reach
Convenience sample:
Measures the way two variables tend to change together
Correlation:
A process that uses different portions of the data to test and train a model on different iterations
Cross-validation:
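One common form is k-fold cross-validation, where each fold serves once as the test set. The sketch below only generates the index splits; the helper name and the interleaved fold assignment are assumptions, and model fitting is left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Assign every k-th index to the same fold.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=6, k=3))
```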
A plaintext file that uses commas to separate distinct values from one another; stands for “comma-separated values”
CSV file:
The business term that describes how many and at what rate customers stop using a product or service, or stop doing business with a company
Customer churn:
The process of formatting data and removing unwanted material
Data cleaning:
The process of protecting people’s private or sensitive data by eliminating PII
Data anonymization:
A data professional who makes data accessible, ensures data ecosystems offer reliable results, and manages infrastructure for data across enterprises
Data engineer:
Well-founded standards of right and wrong that dictate how data is collected, shared, and used
Data ethics:
A process for ensuring the formal management of a company’s data assets
Data governance:
Any individual who works with data and/or has data skills
Data professional:
The discipline of making data useful
Data science:
A data professional who works closely with analytics to provide meaningful insights that help improve current business operations
Data scientist:
The location where data originates
Data source:
The practices of an organization that ensure that data is accessible, usable, and safe
Data stewardship:
An attribute that describes a piece of data based on its values, its programming language, or the operations it can perform
Data type:
A collection of data values or objects that contain different data types
Data structure:
A two-dimensional, labeled data structure with rows and columns
DataFrame:
A graph, chart, diagram, or dashboard that is created as a representation of information
Data visualization:
A file type used to store data, often in tables, indexes, or fields
Database (DB) file:
A two-dimensional data-structure organized into rows and columns
Dataframe:
A clustering methodology that searches data space for continuous regions of high density; stands for “density-based spatial clustering of applications with noise”
DBSCAN:
Troubleshooting, or searching for errors in a script or program
Debugging:
A node of the tree where decisions are made
Decision node:
A flowchart-like structure that uses branching paths to predict the outcomes of events, or the probability of certain outcomes
Decision tree:
The elimination or removal of matching data values in a dataset
Deduplication:
A keyword that defines a function at the start of the function block
def:
The concept that two events are dependent if one event changes the probability of the other event
Dependent events:
The variable a given model estimates
Dependent variable (Y):
A type of statistics that summarizes the main features of a dataset
Descriptive statistics:
A function that returns the statistical summary of a dataframe or series, including mean, standard deviation, and minimum and maximum column values
describe():
A function used to create a dictionary
dict():
A data structure that consists of a collection of key-value pairs
Dictionary:
A function that finds the elements present in one set but not the other
difference():
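A quick illustration with two made-up sets of IDs. The result depends on which set the method is called on.

```python
train_ids = {1, 2, 3, 4}  # hypothetical IDs
test_ids = {3, 4, 5}

only_train = train_ids.difference(test_ids)  # in train_ids but not test_ids
only_test = test_ids.difference(train_ids)   # in test_ids but not train_ids
```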
Qualitative data values used to categorize and group data to reveal details about it
Dimensions:
The process data professionals use to familiarize themselves with the data so they can start conceptualizing how to use it; one of the six practices of EDA
Discovering:
Features with a countable number of values between any two values
Discrete features:
A variable that has a countable number of possible values
Discrete random variable:
A mathematical concept indicating that a measure or dimension has a finite and countable number of outcomes
Discrete:
A hyperparameter in agglomerative clustering models that determines the distance above which clusters will not be merged
distance_threshold:
An in-depth guide that is written by the developers who created a package that features very specific information on various functions and features
Documentation:
How to access the methods and attributes that belong to an instance of a class
Dot notation:
A group of text that explains what a method or function does; also referred to as a “docstring”
Documentation string:
The process of removing some observations from the majority class, making it so they make up a smaller percentage of the dataset than before
Downsampling:
Variables with values of 0 or 1 that indicate the presence or absence of something
Dummy variables:
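A plain-Python sketch of turning one categorical column into 0/1 dummy variables; the `color` values and the `color_` prefix are invented for the example.

```python
colors = ["red", "blue", "red", "green"]  # hypothetical categorical column
categories = sorted(set(colors))          # one dummy variable per category

# Each row gets a 1 for its own category and 0 for every other category.
dummies = [
    {f"color_{c}": int(value == c) for c in categories}
    for value in colors
]
```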
A NumPy attribute used to check the data type of the contents of an array
dtype:
Variables that can point to objects of any data type
Dynamic typing:
A value the user inputs or the output of a program, an operation, or a function
Dynamic value:
A branch of economics that uses statistics to analyze economic problems
Econometrics:
A way of distributing computational tasks over a bunch of nearby processors (i.e., computers) that is good for speed and resiliency and does not depend on a single source of computational power
Edge computing:
A reserved keyword that executes subsequent conditions when the previous conditions are not true
elif:
A reserved keyword that executes when preceding conditions evaluate as False
else:
A type of probability based on experimental or historical data
Empirical probability:
A concept stating that the values on a normal curve are distributed in a regular pattern, based on their distance from the mean
Empirical rule:
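The rule is usually stated as roughly 68%, 95%, and 99.7% of values falling within one, two, and three standard deviations of the mean. A sketch that checks those shares against the standard normal distribution:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


shares = {k: share_within(k) for k in (1, 2, 3)}
```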
Refers to building multiple models and aggregating their predictions
Ensemble learning:
In DBSCAN clustering models, a hyperparameter that determines the radius of a search area from any given point
eps (Epsilon):
(Refer to ensemble learning)
Ensembling:
A built-in function that iterates through a sequence and tracks each element and its place in the index
enumerate():
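A short example using the four PACE stages as the sequence; the optional `start` argument changes the first index.

```python
stages = ["Plan", "Analyze", "Construct", "Execute"]

# Each item is paired with its position, starting at 0 by default.
indexed = list(enumerate(stages))

# start=1 begins the count at 1 instead.
numbered = [f"{i}. {stage}" for i, stage in enumerate(stages, start=1)]
```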
In a regression model, the natural noise assumed to be in a model
Errors:
A character that changes the typical behavior of the characters that follow it
Escape character:
Stage of the PACE workflow where a data professional will present findings with internal and external stakeholders, answer questions, consider different viewpoints, and make recommendations
Execute stage:
(Refer to independent variable)
Explanatory variable:
The process of converting a data type of an object to a required data type
Explicit conversion:
The process of investigating, organizing, and analyzing datasets and summarizing their main characteristics, often by employing data wrangling and visualization methods; the six main practices of EDA are: discovering, structuring, cleaning, joining, validating, and presenting
Exploratory data analysis (EDA):
A combination of numbers, symbols, or other variables that produces a result when evaluated
Expression:
Quantifies the amount of variance left unexplained by a reduced model that is explained by the full model
Extra Sum of Squares F-test:
The process of retrieving data out of data sources for further data processing
Extracting:
A model’s ability to predict new values that fall outside of the range of values in the training data
Extrapolation:
The harmonic mean of precision and recall
F1-Score:
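The harmonic mean works out to F1 = 2PR / (P + R). A sketch with hypothetical precision and recall values; returning 0.0 when both are zero is a convention assumed here, not something the definition specifies.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention to avoid dividing by zero
    return 2 * precision * recall / (precision + recall)


score = f1_score(precision=0.5, recall=1.0)
```

The harmonic mean sits closer to the smaller of the two inputs, so a model cannot score well on F1 by maximizing only one of them.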
A test result that indicates something is present when it really is not
False positive:
The process of using practical, statistical, and data science knowledge to select, transform, or extract characteristics, properties, and attributes from raw data
Feature engineering:
A type of feature engineering that involves taking multiple features to create a new one that would improve the accuracy of the algorithm
Feature extraction:
A type of feature engineering that involves selecting the features in the data that contribute the most to predicting the response variable
Feature selection:
A type of feature engineering that involves modifying existing features in a way that improves accuracy when training the model
Feature transformation:
The process of selecting a smaller part of a dataset based on specified values and using it for viewing or analysis
Filtering:
Data that was gathered from inside your own organization
First-party data:
A data type that represents numbers that contain decimals
Float:
A piece of code that iterates over a sequence of values
For loop:
A string method that formats and inserts specific substrings into designated places within a larger string
format():
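Two small examples, one with named placeholders and one with positional placeholders; the strings are invented.

```python
template = "{name} has {rows} rows and {cols} columns"
message = template.format(name="sales", rows=3, cols=2)

# Empty braces are filled in order from the positional arguments.
positional = "{} of {}".format(1, 10)
```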
A stepwise variable selection process that begins with the null model (zero independent variables) and considers all possible variables to add; at each step it incorporates the independent variable that contributes the most explanatory power to the model
Forward selection:
A body of reusable code for performing specific processes or tasks
Function:
A function that returns an object (iterator) which can be iterated over (one value at a time)
Generator():
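In Python, a generator function uses `yield` to hand back one value at a time. A minimal sketch with a hypothetical `squares` function:

```python
def squares(limit):
    """Yield the squares of 0, 1, ..., limit - 1, one value at a time."""
    for n in range(limit):
        yield n * n


gen = squares(4)
first = next(gen)      # computes only the first value
remaining = list(gen)  # consumes the values that are left
```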
Values that are completely different from the overall data group and have no association with any other outliers
Global outliers:
A variable that can be accessed from anywhere in a program or script
Global variable:
Model ensembles that use gradient boosting
Gradient boosting machines (GBMs):
A boosting methodology where each base learner in the sequence is built to predict the residual errors of the model that preceded it
Gradient boosting:
A tool to confirm that a model achieves its intended purpose by systematically checking every combination of hyperparameters to identify which set produces the best results, based on the selected metric
GridSearch:
A pandas DataFrame method that groups rows of the dataframe together based on their values at one or more columns, which allows further analysis of the groups
groupby():
The process of aggregating individual observations of a variable into groups
Grouping:
An event where programmers and data professionals come together and work on a project
Hackathon:
A function that returns a preview of the column names and the first few rows of a dataset
head():
A type of data visualization that depicts the magnitude of an instance or set of values based on two colors
Heatmap:
A Python help function used to display the documentation of modules, functions, classes, keywords, and more
help():
A data visualization that depicts an approximate representation of the distribution of values in a dataset
Histogram:
A random sample of observed data that is not used to fit the model
Hold-out sample:
An assumption of simple linear regression stating that the variation of the residuals (errors) is constant or similar across the model
Homoscedasticity assumption: