Machine Learning Algorithms Summative 1 (M1, M2(PT1)) Flashcards

1
Q

Movie ratings and military rank are examples of:
Group of answer choices

Discrete data

Ordinal data

Continuous data

Nominal data

A

Ordinal data

2
Q

Choose all the most popular Python Libraries that are used in data science.
Group of answer choices

NUMPY

ANACONDA

SCIPY

JUPYTER

PANDAS

SQL

A

NUMPY

SCIPY

JUPYTER

PANDAS

ANACONDA

3
Q

Which processes are involved in data preparation?
Group of answer choices

Not in the options

All the given options

Data Cleaning, Feature Engineering

Splitting of dataset

Data collection, Data Cleaning

A

All the given options

4
Q

Continuous data is:
Group of answer choices

Qualitative

Quantitative

A

Quantitative

5
Q

Temperature range is an example of:
Group of answer choices

Discrete data

Continuous data

A

Continuous data

6
Q

Sorting out missing data is a data cleansing technique.
Group of answer choices

True

False

A

True

7
Q

Based on the ML application table scenario, when rule complexity is simple and problem scale is large, ML application is:
Group of answer choices

ML Algorithms

Simple Problem

Manual Rules

Rule-based Algorithms

A

Rule-based Algorithms

8
Q

Machine Learning is a field of study concerned with giving computers the ability to ________ without being explicitly programmed.

A

LEARN

9
Q

Nominal data is:
Group of answer choices

Quantitative

Qualitative

A

Qualitative

10
Q

Which is not true about Machine Learning?
Group of answer choices

Their maintenance is much lower than a human’s and costs a lot less in the long run.

Enable computers to operate autonomously with explicit programming.

Machines driven by algorithms designed by humans are able to learn latent rules and inherent patterns and to fulfill tasks desired by humans.

Automation by machine learning can mitigate risks caused by fatigue or inattention.

A

Enable computers to operate autonomously with explicit programming.

11
Q

Reducing noise in data is a feature engineering technique.
Group of answer choices

True

False

A

False

12
Q

Rule-based algorithms: Condition

Machine Learning: _________.

A

MODEL

13
Q

ML is a research field at the intersection of _________, artificial intelligence, and computer science.

A

STATISTICS

14
Q

Data reduction is a data cleansing technique.
Group of answer choices

True

False

A

False

15
Q

In EDA, this process identifies unusual data points. __________

A

OUTLIER DETECTION

16
Q

Dataset is divided into _______ set and test set.

A

TRAINING

17
Q

These concepts help to understand how well a model performs: Overfitting, Underfitting, _________.

A

GENERALIZATION

18
Q

Logistic Regression is an example of a regression algorithm.

A

False

19
Q

This refers to the error resulting from sensitivity to the noise in the training data.
Group of answer choices

Not in the options

Overfitting

Underfitting

Generalization

A

Not in the options (the term described is variance)

20
Q

In supervised learning, market trend analysis is an example of:
Group of answer choices

Classification

Correlation

Prediction

Regression

A

Regression

21
Q

When the model fits too closely to the training dataset.
Group of answer choices

Overfitting

Underfitting

Generalization

A

Overfitting (Canvas marks Generalization as correct, but the description is really overfitting)

22
Q

The _____ refers to the error from having wrong / too simple assumptions in the learning algorithm.

A

BIAS

23
Q

Classification algorithms address classification problems where the output variable is categorical.
Group of answer choices

True

False

A

True

24
Q

There is a regression variant of the k-nearest neighbors algorithm.
Group of answer choices

True

False

A

True

25
Q

In k-NN, High Model Complexity is:
Group of answer choices

Overfitting

Underfitting

A

Overfitting

26
Q

The ‘k’ in k-Nearest neighbors refers to the new closest data point.
Group of answer choices

True

False

A

False

27
Q

K-nearest neighbors make a prediction for a new data point by finding the data that match from the training dataset.
Group of answer choices

True

False

A

False

28
Q

In k-NN, High Model Complexity is underfitting.
Group of answer choices

True

False

A

False

29
Q

In k-NN, Euclidean distance is used by default as the distance measure.
Group of answer choices

True

False

A

True

30
Q

In k-NN, Low Model Complexity is:
Group of answer choices

Overfitting

Underfitting

A

Underfitting

31
Q

Linear models make a prediction using a linear function of the input features.
Group of answer choices

True

False

A

True

32
Q

Linear Regression is also known as Ordinary Least Squares.
Group of answer choices

True

False

A

TRUE
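
For a single input feature, the least squares fit has a closed form, which makes the idea easy to check by hand. The sketch below is only an illustration under that one-feature assumption; the function name `fit_ols` and the sample points are made up for the example and are not from the course material.

```python
def fit_ols(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    # Intercept (the "offset"): makes the line pass through the mean point.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_ols(xs, ys)
prediction = slope * 5.0 + intercept  # prediction for a new input x = 5
```

The prediction is a linear function of the input: the learned slope times the feature, plus the learned intercept.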

33
Q

The ________ is the sum of the squared differences between the predictions and the true values, divided by the number of samples.
Group of answer choices

Mean error

Median error

Total R

Mean Squared Error

Not in the options

A

Mean Squared Error
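
As a quick check of the definition, the mean squared error can be computed by hand: square each difference between prediction and true value, add them up, and divide by the number of samples. The helper name and the numbers below are illustrative assumptions only.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between predictions and true values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical targets and predictions.
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]

# Squared errors are 1, 0 and 4, so the mean is 5 / 3.
mse = mean_squared_error(y_true, y_pred)
```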

34
Q

The ‘offset’ parameter is also called slope.
Group of answer choices

True

False

A

False (the offset is the intercept; the slope is the weight or coefficient)

35
Q

Lasso uses L1 Regularization.
Group of answer choices

True

False

A

True
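
To see what separates the two kinds of regularization, it can help to compute the penalty terms directly. This is a minimal sketch, assuming the usual textbook form in which L1 sums the absolute values of the coefficients (Lasso) and L2 sums their squares (Ridge); the function names and the weights are hypothetical.

```python
def l1_penalty(weights, alpha):
    """L1 penalty (used by Lasso): alpha times the sum of absolute coefficients."""
    return alpha * sum(abs(w) for w in weights)


def l2_penalty(weights, alpha):
    """L2 penalty (used by Ridge): alpha times the sum of squared coefficients."""
    return alpha * sum(w * w for w in weights)


# Hypothetical coefficients; one of them is exactly zero.
weights = [0.5, -2.0, 0.0]

l1 = l1_penalty(weights, alpha=1.0)  # 0.5 + 2.0 + 0.0
l2 = l2_penalty(weights, alpha=1.0)  # 0.25 + 4.0 + 0.0
```

A coefficient that is exactly zero adds nothing to either penalty. L1 regularization tends to push some coefficients all the way to zero, which is why a Lasso model can be easier to interpret.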

36
Q

In Ridge regression, if α (alpha) is smaller, the penalty becomes larger.
Group of answer choices

True

False

A

False
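
The effect of alpha is easiest to see in the simplest possible case. The sketch below assumes one feature and no intercept, where minimizing the squared error plus alpha times the squared weight gives w = Σxy / (Σx² + α). The function name and data are made up for illustration.

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form ridge solution for one feature without an intercept."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # A larger alpha makes the denominator larger and shrinks the weight.
    return sum_xy / (sum_xx + alpha)


# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

slopes = {alpha: ridge_slope(xs, ys, alpha) for alpha in (0.0, 1.0, 10.0)}
```

With alpha equal to 0 the result is the plain least squares slope; as alpha grows, the penalty grows and the coefficient is pulled toward zero.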

37
Q

Dichotomous classes means Yes or No.
Group of answer choices

True

False

A

True

38
Q

Its primary objective is to map the input variable with the output variable.
Group of answer choices

Unsupervised Learning

Classification

Correlation

Supervised Learning

A

Supervised Learning

39
Q

In k-NN, when you choose a small value of k (e.g., k=1), the model becomes more complex.
Group of answer choices

True

False

A

True
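
One way to see why k=1 is the most complex setting: with a single neighbor, every training point is its own nearest neighbor, so the model reproduces the training labels exactly, noise included. The toy one-dimensional dataset and the helper names below are assumptions for illustration.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to query (1-D inputs)."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def training_accuracy(train, k):
    """Fraction of training points whose own label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in train)
    return hits / len(train)


# Hypothetical data: mostly "a" on the left and "b" on the right, with some noise.
train = [(1.0, "a"), (2.0, "a"), (3.0, "b"), (4.0, "a"),
         (5.0, "b"), (6.0, "b"), (7.0, "b")]

acc_k1 = training_accuracy(train, k=1)            # memorizes every point
acc_all = training_accuracy(train, k=len(train))  # always predicts the majority class
```

At the other extreme, using every training point as a neighbor gives the simplest model: it ignores the input and always predicts the majority class.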

40
Q

Ridge is generally preferred over Lasso, but if you want a model that is easy to analyze and understand then use Lasso.
Group of answer choices

True

False

A

True

41
Q

When comparing training set and test set scores, we find that we predict very accurately on the training set, but the R² on the test set is much worse. This is a sign of:
Group of answer choices

Underfitting

Overfitting

A

Overfitting
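
The gap between training and test scores can be reproduced with a model that simply memorizes its training data. This sketch assumes R² is defined as 1 minus the residual sum of squares over the total sum of squares, and uses a one-nearest-neighbor regressor as the memorizing model; all names and numbers are hypothetical.

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def one_nn_predict(train_x, train_y, x):
    """Return the target of the single closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


# Hypothetical noisy training data around y = x, and noise-free test data.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.1, 0.9, 2.3, 2.8, 4.2]
test_x = [0.4, 1.6, 2.4, 3.6]
test_y = [0.4, 1.6, 2.4, 3.6]

r2_train = r2_score(train_y, [one_nn_predict(train_x, train_y, x) for x in train_x])
r2_test = r2_score(test_y, [one_nn_predict(train_x, train_y, x) for x in test_x])
```

The training score is perfect because each training point is its own nearest neighbor, while the test score is lower because the memorized noise does not carry over to new points.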

42
Q

Ridge regression is a linear regression model that controls complexity to avoid overfitting.
Group of answer choices

True

False

A

True

43
Q

The two phases of supervised ML process: Training, ________.

A

VALIDATION / TESTING? / PREDICTING?

44
Q

is about extracting knowledge from data

A

Machine Learning

45
Q

It is a research field at the intersection of statistics, artificial intelligence, and computer science and is also known as predictive analytics or statistical learning

A

Machine Learning

46
Q

A field of study concerned with giving computers the ability to learn without being explicitly programmed

A

Machine Learning

47
Q

is a discipline of artificial intelligence (AI) that provides machines with the ability to automatically learn from data and past experiences while identifying patterns to make predictions with minimal human intervention

A

Machine Learning (ML)

48
Q

Machine Learning (ML) is a discipline of _____ that provides machines with the ability to automatically learn from data and past experiences while identifying patterns to make predictions with minimal human intervention

A

artificial intelligence (AI)

49
Q

is a study of learning algorithms. A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E

A

Machine learning (including deep learning)

50
Q

Collection, preparation, and analysis of data

A

Data Science

51
Q

Leverages AI/ML, research, industry expertise, and statistics to make business decisions

A

Data Science

52
Q

Technology for machines to understand/interpret, learn, and make ‘intelligent’ decisions. Includes Machine Learning among many other fields

A

Artificial Intelligence

53
Q

Algorithms that help machines improve through supervised, unsupervised, and reinforcement learning

A

Machine Learning

54
Q

A subset of AI and a tool of Data Science

A

Machine Learning

55
Q

Explicit programming is used to solve problems
Rules can be manually specified

A

Rule-based algorithms

56
Q

Samples are used for training
The decision-making rules are complex or difficult to describe
Rules are automatically learned by machines

A

Machine Learning

57
Q

Small Scale Simple Rule Complexity

A

Simple Problems

58
Q

Large Scale Simple Rule Complexity

A

Rule-based algorithms

59
Q

Small Scale Complex Rule Complexity

A

Manual Rules

60
Q

Large Scale Complex Rule Complexity

A

Machine Learning Algorithms

61
Q

enable computers to operate autonomously without explicit programming. ML applications are fed with new data, and they can independently learn, grow, develop, and adapt

A

Machine learning methods

62
Q

adaptively improves with an increase in the number of available samples during the ‘learning’ process

A

performance of ML algorithms

63
Q

______ can work 24/7 and don’t get tired, need breaks, call in sick, or go on strike

A

Computers and robots

64
Q

Machines driven by algorithms designed by humans are able to learn ______ and ______ and to fulfill tasks desired by humans

A

latent rules, inherent patterns

65
Q

______ are better suited than humans for tasks that are routine, repetitive, or tedious

A

Learning machines

66
Q

______ can mitigate risks caused by fatigue or inattention

A

automation by machine learning

67
Q

Types of Machine Learning

A

Supervised Machine Learning
Unsupervised Machine Learning
Semi-Supervised Learning
Reinforcement Learning

68
Q

a collection of data used in machine learning tasks. Each data record is called a sample

A

Dataset

69
Q

Events or attributes that reflect the performance or nature of a sample in a particular aspect are called ______

A

features

70
Q

dataset used in the training process, where each sample is referred to as a training sample.

A

Training set

71
Q

The process of creating a model from data is called _____

A

learning (training).

72
Q

_____ refers to the process of using the model obtained after learning for prediction.

A

Testing

73
Q

The dataset used for testing is called a _____, and each sample is called a _____

A

test set, test sample

74
Q

(1) Project Setup

A

Understand the business goals
Choose the solution to your problem.

75
Q

Speak with your stakeholders and deeply understand the business goal behind the model being proposed. A deep understanding of your business goals will help you scope the necessary technical solution, data sources to be collected, how to evaluate model performance, and more

A

Understand the business goals

76
Q

Once you have a deep understanding of your problem - focus on which category of models drives the highest impact.

A

Choose the solution to your problem.

77
Q

(2) Data Preparation

A

Data Collection
Data Cleaning
Feature Engineering
Split the data

78
Q

Collect all the data you need for your models, whether from your own organization, public, or paid sources

A

Data Collection

79
Q

Turn the messy raw data into clean, tidy data ready for analysis.

A

Data Cleaning

80
Q

Manipulate the datasets to create variables (features) that improve your model’s prediction accuracy. Create the same features in both the training set and the testing set

A

Feature Engineering

81
Q

Randomly divide the records in the dataset into a training set and a testing set. For a more reliable assessment of model performance, generate multiple training and testing sets using cross-validation

A

Split the data
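
Both ideas on this card, a random train/test split and repeated splits for cross-validation, can be sketched with the standard library alone. The function names, the 25% test fraction, and the fixed seed are assumptions chosen for the example; a real project would more likely use a library routine.

```python
import random


def split_train_test(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and cut it into (train, test)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_splits(records, k):
    """Yield k (train, test) pairs; each record lands in a test fold exactly once."""
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test


records = list(range(20))  # stand-in for 20 data records

train, test = split_train_test(records)
folds = list(k_fold_splits(records, k=5))
```

Cross-validation trains and evaluates the model once per fold and then averages the scores, which gives a steadier estimate than a single split. This sketch does not shuffle before forming the folds.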

82
Q

(3) Modeling

A

Hyperparameter tuning
Train your models
Make predictions
Assess model performance

83
Q

For each model, use ______ techniques to improve model performance.

A

Hyperparameter tuning

84
Q

Fit each model to the training set

A

Train your models

85
Q

Make predictions on the testing set

A

Make predictions

86
Q

For each model, calculate performance metrics on the testing set such as accuracy, recall, and precision

A

Assess model performance

87
Q

(4) Deployment

A

Deploy the model
Monitor model performance
Improve your model

88
Q

Embed the model you choose in dashboards, applications, or wherever you need it

A

Deploy the model

89
Q

Regularly test the performance of your model as your data changes to avoid model drift

A

Monitor model performance

90
Q

Continuously iterate and improve your model post-deployment. Replace your model with an updated version to improve performance

A

Improve your model

91
Q

Phase 1: Learning

A

Preprocessing
Learning
Testing

92
Q

Preprocessing:

A

Clean Data
Format Data

93
Q

Learning:

A

Supervised
Unsupervised
Reinforcement

94
Q

Testing:

A

Measure Performance
Test Algorithm

95
Q

Phase 2: Prediction

A

New Data + Trained Model = Prediction -> Predicted Data

96
Q

Machine Learning Languages

A

Python
R
C++

97
Q

Big Data Tools

A

MemSQL
Apache Spark

98
Q

General Machine Learning Frameworks

A

NumPy
Scikit-learn
NLTK

99
Q

Data Analysis & Visualization Tools

A

Pandas
Matplotlib
Jupyter Notebook
Weka
Tableau

100
Q

Machine Learning Frameworks for Neural Network Modeling

A

PyTorch
Keras
Caffe2
TensorFlow & TensorBoard

101
Q

Top Programming Languages for ML

A

Python
R
Java
Julia
Scala
C++
JavaScript
Lisp
Haskell
Go

102
Q

Why Python?

A

Easy-to-Read Syntax
Extensive Libraries and Frameworks
Strong Community Support
Flexibility
Compatibility with Other Languages
Scalability and Performance

103
Q

Most popular ______ that are used in data analysis, data science, machine learning (ML), artificial intelligence (AI), natural language processing (NLP), deep learning, and by data scientists:

A

Python libraries

104
Q

Top 10 Python Libraries

A

Pandas
Matplotlib
TensorFlow
SciPy
Scrapy
NumPy
Seaborn
Keras
PyTorch
SQLModel

105
Q

A very popular tool and the most prominent Python library for ML

A

Scikit-learn

106
Q

is one of the fundamental packages for scientific computing

A

NumPy

107
Q

is a collection of functions for scientific computing

A

SciPy

108
Q

is the primary scientific plotting library

A

Matplotlib

109
Q

is a library for data wrangling and analysis

A

Pandas

110
Q

A Python distribution made for large-scale data processing, predictive analysis, and scientific computing

A

Anaconda

111
Q

is an interactive environment for running code in the browser

A

Jupyter Notebook

112
Q

Applications of Machine Learning

A

Manufacturing
Healthcare
E-commerce
Automobile
Insurance
Transportation

113
Q

credit scoring, algorithmic trading

A

Computational finance

114
Q

facial recognition, motion tracking, object detection

A

Computer vision

115
Q

DNA sequencing, brain tumor detection, drug discovery

A

Computational biology

116
Q

predictive maintenance

A

Automotive, aerospace, and manufacturing

117
Q

voice recognition

A

Natural language processing

118
Q

contains missing values or data that lacks attributes

A

Incompleteness

119
Q

contains incorrect records or exceptions.

A

Noise

120
Q

contains inconsistent records

A

Inconsistency

121
Q

Without good data, there is no

A

good model

122
Q

is an observation that seems to be distant from other observations or, more specifically, one observation that follows a different logic or generative process than the other observations

A

Outlier

123
Q

is the practice of cleaning, altering, and reorganizing raw data prior to processing and analysis, which is also known as data preparation

A

Preprocessing

124
Q

Preprocessing - is the practice of cleaning, altering, and reorganizing raw data prior to processing and analysis, which is also known as ______

A

data preparation

125
Q

It is an important step before processing to _____

A

prepare data for analysis and modeling by cleaning and transforming it

126
Q

Key steps in Data Preprocessing

A

Data Profiling
Data Cleansing
Data Reduction
Data Transformation
Data Enrichment
Data Validation

127
Q

Data Preprocessing Techniques

A

Data Cleansing
Feature Engineering

128
Q

Identify and sort out missing data
Reduce noisy data
Identify and remove duplicates

A

Data Cleansing
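
Two of the cleansing steps listed here, removing duplicates and sorting out missing data, can be sketched on a small list of records. This is only an illustration: the record layout, the helper names, and the choice to fill missing values with the mean are assumptions, and noise reduction is not covered.

```python
from statistics import mean

# Hypothetical raw records: one missing age and one exact duplicate.
raw = [
    {"id": 1, "age": 25},
    {"id": 2, "age": None},
    {"id": 1, "age": 25},
    {"id": 3, "age": 35},
]


def drop_duplicates(rows):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_missing(rows, field):
    """Replace missing values in one field with the mean of the known values."""
    known = [r[field] for r in rows if r[field] is not None]
    fill = mean(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


clean = fill_missing(drop_duplicates(raw), "age")
```

Filling with the mean is just one option; dropping incomplete records or using the median are common alternatives, depending on the data.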

129
Q

Involves techniques used by data scientists to organize the data in ways that make it more efficient to train data models and run inferences against them

A

Feature Engineering

130
Q

Feature scaling or normalization
Data reduction
Discretization
Feature encoding

A

Feature Engineering
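
Three of the techniques on this card can be shown in a few lines each. The sketch below assumes min-max scaling for feature scaling, fixed bin edges for discretization, and one-hot vectors for feature encoding; the function names and sample values are hypothetical.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, edges):
    """Map each value to a bin index: the number of edges it reaches or exceeds."""
    return [sum(v >= e for e in edges) for v in values]


def one_hot(labels):
    """Encode each category as a 0/1 vector, with categories in sorted order."""
    categories = sorted(set(labels))
    return [[int(label == c) for c in categories] for label in labels]


ages = [20, 30, 40, 60]

scaled = min_max_scale(ages)
bins = discretize(ages, edges=[25, 50])    # bins: under 25, 25 to 49, 50 and over
encoded = one_hot(["red", "blue", "red"])  # column order: blue, red
```

Whatever transformation is chosen, the same one (with the same minimum, maximum, edges, and categories) should be applied to both the training set and the testing set.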

131
Q

To understand the main characteristics of the data, discover patterns, spot anomalies, test a hypothesis, or check assumptions

A

Exploratory Data Analysis (EDA)

132
Q

Data Visualization Methods

A

Visualization
Summary Statistics
Outlier Detection
Correlation Analysis

133
Q

Creating plots and charts to visualize data distributions and relationships

A

Visualization

134
Q

Calculating measures like mean, median, variance, and standard deviation.

A

Summary Statistics
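
The measures named on this card are all available in Python's standard `statistics` module. The sample values are made up, and the population versions of variance and standard deviation are used here as an assumption; the sample versions (`variance`, `stdev`) divide by n - 1 instead.

```python
from statistics import mean, median, pstdev, pvariance

# Hypothetical measurements.
data = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": mean(data),          # sum of values divided by their count
    "median": median(data),      # middle value (average of the two middle ones here)
    "variance": pvariance(data), # average squared deviation from the mean
    "std_dev": pstdev(data),     # square root of the variance
}
```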

135
Q

Identifying unusual data points

A

Outlier Detection
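
One common way to flag unusual points is the interquartile range (IQR) rule, which the course material may or may not be using. The sketch below assumes that rule with the usual factor of 1.5; the function name and data are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, factor=1.5):
    """Return values lying more than factor * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first, second and third quartile
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one value far from the rest.
data = [10, 12, 11, 13, 12, 11, 95]

outliers = iqr_outliers(data)
```

Because the quartiles barely move when a single extreme value is added, this rule is less affected by the outlier itself than a rule based on the mean and standard deviation.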

136
Q

Examining relationships between variables

A

Correlation Analysis

137
Q

Testing initial assumptions about the data

A

Hypothesis Testing

138
Q

are useful for visualizing the “count” of values in the data set

A

Bar plots and Histograms

139
Q

Machine Learning Model Deployment

A

Training
Validation
Deployment
Monitoring

140
Q

refers to the process of taking a trained ML model and making it available for use in real-world applications

A

Machine Learning Model Deployment

141
Q

Before deployment, models need to be thoroughly trained and evaluated. This involves data preprocessing, feature engineering, and rigorous testing to ensure the model is robust and ready for real-world scenarios

A

Training

142
Q

ML models should be able to handle increased loads and continue to deliver results efficiently. Ensuring the infrastructure can handle the model’s computational requirements is vital, requiring validation and effective testing for scalability before deploying models

A

Validation

143
Q

Model deployment is the most crucial process of integrating the ML model into its production environment.

A

Deployment

144
Q

Deployment process entails:

A

Defining how to extract or process the data in real time
Determining the storage required for these processes
Collection and predictions of model and data patterns
Setting up APIs, tools, and other software environments to support and improve predictions
Configuring the hardware (cloud or on-prem environments) to help support the ML model
Creating a pipeline for continuous training and parameter tuning

145
Q

This process is the most challenging, involving several moving pieces, tools, data scientists, and ML engineers to collaborate and strategize

A

Deployment

146
Q

Once deployed, models need to be continuously _____

A

monitored.

147
Q

Real world data can evolve, and models may drift in their performance.

A

Monitoring

148
Q

Implementing ______ systems helps to detect deviations and make necessary adjustments in a timely manner

A

monitoring

149
Q

Best Practices for Successful ML Model Deployment

A

Choosing the Right Infrastructure
Effective Versioning and Tracking
Robust Testing and Validation
Implementing Monitoring and Alerting

150
Q

covers the ethical and moral obligations of collecting, sharing, and using data, focused on ensuring that data is used fairly, for good

A

Data Ethics

151
Q

5 Principles of Data Ethics

A

Ownership
Transparency
Privacy
Intention
Outcomes

152
Q

the first principle of data ethics is that an individual has ownership over their personal information. Just as it’s considered stealing to take an item that doesn’t belong to you, it’s unlawful and unethical to collect someone’s personal data without their consent

A

Ownership

153
Q

In addition to owning their personal information, data subjects have a right to know how you plan to collect, store, and use it. When gathering data, exercise ______

A

transparency

154
Q

Another ethical responsibility that comes with handling data is ensuring data subjects’ _____. Even if a customer gives your company consent to collect, store, and analyze their personally identifiable information (PII), that does not mean they want it publicly available

A

privacy

155
Q

Before collecting data, ask yourself why you need it, what you’ll gain from it, and what changes you’ll be able to make after analysis. If your intention is to hurt others, profit from your subjects’ weaknesses, or any other malicious goal, it’s not ethical to collect their data

A

Intention

156
Q

even when intentions are good, the outcome of data analysis can cause inadvertent harm to individuals or groups of people.

A

Outcomes

157
Q

the outcome of data analysis can cause inadvertent harm to individuals or groups of people. This is called a ______

A

disparate impact

158
Q

Data Privacy Regulation (New Rules of Data)

A

Rule 1: Trust over Transactions
Rule 2: Insight over Identity
Rule 3: Flows over silos

159
Q

This first rule is all about consent. Until now, companies have been gathering as much data as possible on their current and prospective customers’ preferences, habits, and identities, transaction by transaction - often without customers understanding what is happening

A

Rule 1: Trust over Transactions

160
Q

Firms need to re-think not only how they acquire data from their customers but from each other as well. Currently, companies routinely transfer large amounts of personally identifiable information (PII) through a complex web of data agreements, compromising both privacy and security

A

Rule 2: Insight over Identity

161
Q

New organizing principle for internal data teams. Once all your customer data has meaningful consent and you are acquiring insight without transferring data, CIOs and CDOs no longer need to work in silos, with one trying to keep data locked up while the other is trying to break it out. Instead, CIOs and CDOs can work together to facilitate the flow of insights

A

Rule 3: Flows over silos

162
Q

Data Subject Rights

A

Right to be Informed
Right to Damages
Right to Access
Right to Erasure or Blocking
Right to File a Complaint
Right to Object
Right to Rectify
Right to Data Portability

163
Q

is a set of principles and processes for data collection, management, and use. The goal is to ensure that data is accurate, consistent, and available for use, while protecting data privacy and security

A

Data Governance

164
Q

is a set of policies, procedures, and standards that implements data governance for an organization.

A

Data Governance Framework

165
Q

The Pillars of Data Governance

A

Ownership & Accountability
Data Quality
Data Protection & Safety
Data use & Availability
Data Management

166
Q

10 Questions to Answer before using AI in Public Sector Algorithmic Decision Making

A

Objective
Use
Impacts
Assumptions
Data
Inputs
Mitigation
Ethics
Oversight
Evaluation

167
Q

why is the algorithm needed and what outcomes is it intended to enable

A

Objective

168
Q

In what processes and circumstances is the algorithm appropriate to be used?

A

Use

169
Q

what impacts - good and bad - could the use of the algorithm have on people?

A

Impacts

170
Q

what assumptions is the algorithm based on and what are their limitations and potential biases?

A

Assumptions

171
Q

what datasets is/was the algorithm trained on and what are their limitations and potential biases?

A

Data

172
Q

what new data does the algorithm use when making decisions?

A

Inputs

173
Q

what actions have been taken to mitigate the negative impacts that could result from the algorithm’s limitations and potential biases?

A

Mitigation

174
Q

what assessment has been made of the ethics of using this algorithm?

A

Ethics

175
Q

what human judgement is needed before acting on the algorithm’s output and who is responsible for ensuring its proper use?

A

Oversight

176
Q

how, and by what criteria, will the effectiveness of the algorithm be assessed, and by whom?

A

Evaluation

177
Q

Each example in the dataset is a pair consisting of an input object (such as a _____) and a desired output value (____).

A

feature vector, label

178
Q

The primary objective of the supervised learning technique is to ______

A

map the input variable with the output variable

179
Q

Supervised machine learning is further classified into two broad categories:

A

Regression
Classification

180
Q

Regression: target is a _____ variable

A

continuous

181
Q

Regression Examples

A

Forecasting future stock price
Forecasting energy resources
Weather prediction
Market trend analysis
Predicting the environmental impact of pollutants

182
Q

Classification: target is a ____ variable

A

categorical

183
Q

Classification Examples

A

Classifying objects in images
Classifying chest X-rays images into COVID positive/negative
Handwritten digits recognition
Filter Emails into spam or not
Activity recognition for wearable devices

184
Q

Refer to algorithms that address classification problems where the output variable is categorical; for example, yes or no, true or false, male or female.

A

Classification

185
Q

Predicts one of the possible class labels

A

Classification

186
Q

classification of two classes (yes/no, negative/positive, 0/1)

A

Binary Classification

187
Q

classification of three or more classes

A

Multiclass Classification

188
Q

Classification algorithms include:

A

Random Forest Algorithm
Decision Tree Algorithm
Logistic Regression Algorithm
Support Vector Machine Algorithm

189
Q

_____ algorithms handle _____ problems where input and output variables have a linear relationship

A

Regression

190
Q

Regression algorithms include:

A

Simple Linear Regression Algorithm, Multivariate Regression Algorithm, Decision Tree Algorithm, and Lasso Regression

191
Q

As with any ML process, supervised ML has two phases: the usual ____ and _____, followed by _____

A

training
validation
prediction

192
Q

The larger the variety of data points your dataset contains, the more complex a model you can use without ____

A

overfitting

193
Q

how well a model performs:

A

Generalization
Overfitting
Underfitting

194
Q

If a model is able to make accurate predictions on unseen data, we say it is able to _____ from the training set to the test set

A

generalize

195
Q

Occurs when a model learns the training data too well, including its noise and outliers

A

Overfitting

196
Q

occurs when you fit a model too closely to the particularities of the training set and obtain a model that works well on the training set but is not able to generalize to new data

A

Overfitting

197
Q

performs exceptionally well on training data but poorly on new, unseen data because it has essentially memorized the training data rather than learning the underlying patterns

A

overfitted model

198
Q

If your model is too simple, you might not be able to capture all the aspects and variability in the data, and your model will do badly even on the training set. Choosing too simple a model is called _____

A

underfitting

199
Q

performs poorly on both training and new data because it hasn’t learned enough from the training data

A

underfitted model

200
Q

The more complex we allow our model to be, the better we will be able to predict on the training data

A

Model Complexity Curve

201
Q

error from having wrong / too simple assumptions in the learning algorithm

A

Bias

202
Q

error resulting from sensitivity to the noise / fluctuations in the training data

A

Variance

203
Q

Low Bias and Low Variance = ?

A

Good Model

204
Q

The _____ algorithm is arguably the simplest machine learning algorithm.

A

k-Nearest Neighbors

205
Q

Building the model consists only of storing the training dataset.

A

k-Nearest Neighbors

206
Q

To make a prediction for a new data point, the algorithm finds the closest data points in the training dataset - its _______

A

“nearest neighbors”

207
Q

in its simplest version, the k-NN algorithm only considers exactly one nearest neighbor, which is the closest training data point to the point we want to make a prediction for

A

k-Neighbors classification

208
Q

Instead of considering only the closest neighbor, we can also consider an _______. This is where the name of the k-nearest neighbors algorithm comes from

A

arbitrary number, k, of neighbors

209
Q

There is also a regression variant of the _____

A

k-nearest neighbors algorithm.

210
Q

The k-nearest neighbors algorithm for regression is implemented in the KNeighborsRegressor class in scikit-learn. It’s used similarly to KNeighborsClassifier:

A

k-NN Estimator
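
A minimal sketch of the usage described in this card, assuming scikit-learn is installed. The toy dataset and the variable names are made up for illustration and are not from the course material.

```python
from sklearn.neighbors import KNeighborsRegressor

# Toy one-feature dataset (hypothetical, for illustration only).
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 1.0, 2.0, 3.0]

# Same fit/predict pattern as KNeighborsClassifier.
# "Building the model" just stores the training data.
reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X_train, y_train)

# With the default uniform weights, the prediction is the mean target
# of the 2 nearest neighbors (here the points at x=1 and x=2).
prediction = reg.predict([[1.4]])[0]
print(prediction)
```

Calling reg.score(X_test, y_test) on held-out data would return the R^2 score covered in the next cards.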

211
Q

_______, also known as the coefficient of determination, is a measure of goodness of a prediction for a regression model, and usually yields a score between 0 and 1 (it can be negative for very poor models).

A

The R-squared score (R^2)

212
Q

A value of 1 corresponds to the perfect prediction, and a value of 0 corresponds to a constant model that just predicts the mean of the training set responses, y_train:

A

The R-squared score (R^2)

213
Q

The regression model’s score() function returns the coefficient of determination R^2.

A

Estimation of the Regression Model

214
Q

Perfect prediction: target value == prediction -> numerator == 0

A

R^2 = 1

215
Q

Predicting the mean of the target values: numerator == denominator

A

R^2 = 0

216
Q

Predicting worse than the average can result in

A

negative numbers
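
The three R^2 cases above can be checked by hand. This sketch assumes the usual definition R^2 = 1 - SS_res / SS_tot, where the numerator SS_res is the sum of squared prediction errors and the denominator SS_tot is the sum of squared deviations from the mean. The function name r2_score and the sample values are illustrative choices.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Numerator: squared errors of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Denominator: squared errors of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0]
perfect = r2_score(y, [1.0, 2.0, 3.0])    # numerator is 0
mean_only = r2_score(y, [2.0, 2.0, 2.0])  # numerator equals denominator
worse = r2_score(y, [3.0, 2.0, 1.0])      # numerator exceeds denominator
print(perfect, mean_only, worse)  # 1.0 0.0 -3.0
```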

217
Q

Two important parameters to the KNeighbors classifier:

A

The number of neighbors
how you measure distance between data points
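
Both parameters show up directly in a from-scratch version of the classifier. This is only a sketch using Euclidean distance and a majority vote. The function knn_predict and the toy points are hypothetical and are not scikit-learn's implementation.

```python
import math
from collections import Counter

def knn_predict(train, query, k=1):
    """Majority vote among the k training points closest to `query`."""
    # Parameter 2: the distance measure (Euclidean here, via math.dist).
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Parameter 1: the number of neighbors, k.
    neighbors = by_distance[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-D training set of (point, label) pairs.
train = [((1, 1), "a"), ((1, 2), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]

label_k1 = knn_predict(train, (2, 2), k=1)   # closest point is (1, 2)
label_k3 = knn_predict(train, (2, 2), k=3)   # two "a" votes, one "b" vote
label_far = knn_predict(train, (5, 4), k=3)  # all three neighbors are "b"
print(label_k1, label_k3, label_far)  # a a b
```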

218
Q

By default, _____ is used as the distance measure between data points

A

Euclidean distance

219
Q

Strengths/Advantages of KNN

A

Easy to understand
Works well without any special adjustments
Suitable as a first model to try

220
Q

Weaknesses/Disadvantages of KNN

A

If the number of features or samples is large, the prediction is slow and data preprocessing is important.
Does not work well with sparse datasets

221
Q

Generate a formula to create a best-fit line to predict unknown values

A

Linear models

222
Q

make a prediction using a linear function of the input features

A

Linear models

223
Q

They are called _____ because they assume there is a ___ relationship between the outcome variable and each of its predictors

A

linear

224
Q

several real-life scenarios follow linear relations between dependent and independent variables.

A

Application of Linear Models

225
Q

Application of Linear Models Example

A

The relationship between the boiling point of water and change in altitude
The relationship between spending on advertising and the revenue of an organization
The relationship between the amount of fertilizer used and crop yields
Performance of athletes and their training regimen

226
Q

Types of Linear Models

A

Linear Regression
Logistic Regression

227
Q

The algorithm is used for solving regression problems

A

Linear Regression

228
Q

Final output of the model is numeric value (numerical predictions).

A

Linear Regression

229
Q

The algorithm maps a linear relationship between the input features(X) and the output (y)

A

Linear Regression

230
Q

Linear model for classification problems

A

Logistic Regression

231
Q

It generates a probability between 0 and 1. This happens by fitting a logistic function, also known as the sigmoid function.

A

Logistic Regression

232
Q

Logistic Regression generates a probability between 0 and 1. This happens by fitting a logistic function, also known as the _____. The function first maps the linear model's output to a value between 0 and 1. After that, a predefined threshold turns that probability into a class label

A

sigmoid function
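
The two steps in this card (squash the linear output, then apply a threshold) can be sketched in a few lines. The weights w and b below are hypothetical, already-fitted values chosen for illustration, and 0.5 is assumed as the threshold.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    # Step 1: linear output w*x + b, squashed to a probability.
    return sigmoid(w * x + b)

def predict_class(x, w, b, threshold=0.5):
    # Step 2: the threshold converts the probability into a class label.
    return 1 if predict_proba(x, w, b) >= threshold else 0

w, b = 2.0, -1.0  # hypothetical fitted parameters

print(sigmoid(0.0))              # 0.5
print(predict_class(0.0, w, b))  # 0, since sigmoid(-1) is below 0.5
print(predict_class(1.0, w, b))  # 1, since sigmoid(1) is above 0.5
```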

233
Q

is the simplest and most classic linear method for regression

A

Linear Regression (aka Ordinary Least Squares)

234
Q

Linear regression finds the parameters w and b that minimize the _____ between predictions and the true regression targets, y, on the training set.

A

mean squared error

235
Q

The ______ is the sum of the squared differences between the predictions and the true values, divided by the number of samples.

A

mean squared error
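
As a rough illustration for the single-feature case, the w and b that minimize this error have a closed form: w = S_xy / S_xx and b = mean(y) - w * mean(x). The helper names and the toy data below are assumptions made for this sketch.

```python
def mse(y_true, y_pred):
    """Sum of squared differences divided by the number of samples."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_simple_ols(x, y):
    """Least-squares slope w and intercept b for one input feature."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    w = s_xy / s_xx
    b = mean_y - w * mean_x
    return w, b

# Toy data lying exactly on the line y = 2x + 1.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]

w, b = fit_simple_ols(x, y)
train_mse = mse(y, [w * xi + b for xi in x])
print(w, b, train_mse)  # 2.0 1.0 0.0
```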

236
Q

The “slope” parameters (w), also called _______, are stored in the coef_ attribute,

A

weights or coefficients

237
Q

the offset or ______ is stored in the intercept_ attribute:

A

intercept (b)
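
A short sketch of where these two attributes appear, assuming scikit-learn is installed. The toy data (points on the line y = 2x + 1) is made up for illustration.

```python
from sklearn.linear_model import LinearRegression

# One input feature, targets on the line y = 2x + 1 (hypothetical data).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

lr = LinearRegression().fit(X, y)

# coef_ is an array with one slope (weight) per input feature.
print(lr.coef_)
# intercept_ is a single float: the offset b.
print(lr.intercept_)
```

The trailing underscore marks attributes that scikit-learn derives from the training data during fit.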

238
Q

a model that allows us to control complexity. One of the most commonly used alternatives to standard linear regression is ____

A

ridge regression

239
Q

is also a linear model for regression, so the formula it uses to make predictions is the same one used for OLS

A

Ridge Regression

240
Q

Each feature should have as little effect on the outcome as possible (which translates to having a small slope), while still predicting well. This constraint is an example of what is called ______

A

regularization.

241
Q

Regularization means explicitly restricting a model to avoid _____

A

overfitting.

242
Q

The particular kind of Regularization used by ridge regression is known as

A

L2 regularization

243
Q

Ridge regression is implemented in the Ridge class of scikit-learn's ___ module.

A

linear_model

244
Q

a higher alpha means a more restricted model, so we expect the entries of coef_ to have smaller magnitude for a high value of alpha than for a low value of alpha

A

Ridge Coef

245
Q

a higher alpha means ______, so we expect the entries of coef_ to have smaller magnitude for a high value of alpha than for a low value of alpha

A

a more restricted model
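
The shrinking effect of alpha is easy to see with one feature. Assuming the ridge objective sum((y - w*x - b)^2) + alpha * w^2 with an unpenalized intercept, the slope has the closed form w = S_xy / (S_xx + alpha). The function fit_ridge_1d and the data are a from-scratch sketch and not scikit-learn's Ridge.

```python
def fit_ridge_1d(x, y, alpha):
    """Ridge slope and intercept for a single feature (L2 penalty on w)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    # alpha in the denominator pulls the slope toward zero.
    w = s_xy / (s_xx + alpha)
    b = mean_y - w * mean_x
    return w, b

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]

w_ols, _ = fit_ridge_1d(x, y, alpha=0.0)    # no penalty: plain OLS slope
w_low, _ = fit_ridge_1d(x, y, alpha=0.1)    # mild restriction
w_high, _ = fit_ridge_1d(x, y, alpha=10.0)  # strong restriction

# The slope shrinks as alpha grows.
print(w_ols, w_low, w_high)
```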

246
Q

plots that show model performance as a function of dataset size are called _____

A

learning curves

247
Q

An alternative to Ridge for regularizing linear regression is _____

A

Lasso

248
Q

As with ridge regression, using the lasso also restricts coefficients to be close to zero, but in a slightly different way, called _____

A

L1 regularization

249
Q

The consequence of L1 is that when using lasso, some coefficients are exactly zero. This means some features are ______ by the model

A

entirely ignored
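
A small sketch of this effect, assuming scikit-learn is installed. The data is constructed for illustration so that the target depends strongly on the first feature and only weakly on the second, and alpha=0.5 is an arbitrary choice.

```python
from sklearn.linear_model import Lasso

# Hypothetical data: y = 3 * x0 + 0.1 * x1
X = [[1.0, 1.0], [2.0, -1.0], [3.0, -1.0], [4.0, 1.0]]
y = [3.1, 5.9, 8.9, 12.1]

lasso = Lasso(alpha=0.5).fit(X, y)

# L1 regularization shrinks the strong coefficient and sets the weak
# one to exactly zero, so the second feature is ignored by the model.
print(lasso.coef_)
n_used = int((lasso.coef_ != 0).sum())
print("features used:", n_used)
```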

250
Q

A ____ allowed us to fit a more complex model, which worked better on both the training and test data.

A

lower alpha

251
Q

If only some of the many features are expected to be important, use ____

A

Lasso

252
Q

When you want a model that is easy to analyze and understand, ___

A

Lasso