Midterms - MLA Flashcards
Movie ratings, Military rank are samples of:
Group of answer choices
Discrete data
Ordinal data
Continuous data
Nominal data
Ordinal data
Choose all the most popular Python Libraries that are used in data science.
Group of answer choices
NUMPY
ANACONDA
SCIPY
JUPYTER
PANDAS
SQL
NUMPY
SCIPY
JUPYTER
PANDAS
ANACONDA
Which processes are involved in data preparation?
Group of answer choices
Not in the options
All the given options
Data Cleaning, Feature Engineering
Splitting of dataset
Data collection, Data Cleaning
All the given options
A continuous data is:
Group of answer choices
Qualitative
Quantitative
Quantitative
Temperature range is a sample of:
Group of answer choices
Discrete data
Continuous data
Continuous data
Sorting out missing data is a data cleansing technique.
Group of answer choices
True
False
True
Based on the ML application table scenario, when rule complexity is simple and problem scale is large, ML application is:
Group of answer choices
ML Algorithms
Simple Problem
Manual Rules
Rule-based Algorithms
Rule-based Algorithms
Machine Learning is a field of study concerned with giving computers the ability to ________ without being explicitly programmed.
LEARN
A nominal data is:
Group of answer choices
Quantitative
Qualitative
Qualitative
Which is not true about Machine Learning?
Group of answer choices
Their maintenance is much lower than a human’s and costs a lot less in the long run.
Enable computers to operate autonomously with explicit programming.
Machines driven by algorithms designed by humans are able to learn latent rules and inherent patterns and to fulfill tasks desired by humans.
Automation by machine learning can mitigate risks caused by fatigue or inattention.
Enable computers to operate autonomously with explicit programming.
Reducing noise in data is a feature engineering technique.
Group of answer choices
True
False
False
Rule-based algorithms: Condition
Machine Learning: _________.
MODEL
ML is a research field at the intersection of _________, artificial intelligence, and computer science.
STATISTICS
Data reduction is a data cleansing technique.
Group of answer choices
True
False
False
In EDA, this process identifies unusual data points. __________
OUTLIER DETECTION
Dataset is divided into _______ set and test set.
TRAINING
These concepts help to understand how well a model performs: Overfitting, Underfitting, _________.
GENERALIZATION
Logistic Regression is an example of a regression algorithm.
False
This refers to the error resulting from sensitivity to the noise in the training data.
Group of answer choices
Not in the options
Overfitting
Underfitting
Generalization
Not in the options
In supervised learning, market trend analysis is an example of:
Group of answer choices
Classification
Correlation
Prediction
Regression
Regression
When the model fits too closely to the training dataset.
Group of answer choices
Overfitting
Underfitting
Generalization
Overfitting (Canvas marks Generalization as correct, but the right answer is overfitting)
The _____ refers to the error from having wrong / too simple assumptions in the learning algorithm.
BIAS
Classification algorithms address classification problems where the output variable is categorical.
Group of answer choices
True
False
True
There is a regression variant of the k-nearest neighbors algorithm.
Group of answer choices
True
False
True
In k-NN, High Model Complexity is:
Group of answer choices
Overfitting
Underfitting
Overfitting
The ‘k’ in k-Nearest neighbors refers to the new closest data point.
Group of answer choices
True
False
False
K-nearest neighbors make a prediction for a new data point by finding the data that match from the training dataset.
Group of answer choices
True
False
False
In k-NN, High Model Complexity is underfitting.
Group of answer choices
True
False
False
In k-NN, Euclidean distance (by default) is used to choose the right distance measure.
Group of answer choices
True
False
True
In k-NN, Low Model Complexity is:
Group of answer choices
Overfitting
Underfitting
Underfitting
Linear models make a prediction using a linear function of the input features.
Group of answer choices
True
False
True
Linear Regression is also known as Ordinary Least Squares.
Group of answer choices
True
False
TRUE
The ________ is the sum of the squared differences between the predictions and the true values.
Group of answer choices
Mean error
Median error
Total R
Mean Squared Error
Not in the options
Mean Squared Error
The ‘offset’ parameter is also called slope.
Group of answer choices
True
False
False
Lasso uses L1 Regularization.
Group of answer choices
True
False
True
In Ridge regression, if α (alpha) is smaller, the penalty becomes larger.
Group of answer choices
True
False
False
Dichotomous classes means Yes or No.
Group of answer choices
True
False
True
Its primary objective is to map the input variable with the output variable.
Group of answer choices
Unsupervised Learning
Classification
Correlation
Supervised Learning
Supervised Learning
In k-NN, when you choose a small value of k (e.g., k=1), the model becomes more complex.
Group of answer choices
True
False
True
Ridge is generally preferred over Lasso, but if you want a model that is easy to analyze and understand then use Lasso.
Group of answer choices
True
False
True
When comparing training set and test set scores, we find that we predict very accurately on the training set, but the R^2 on the test set is much worse. This is a sign of:
Group of answer choices
Underfitting
Overfitting
Overfitting
Ridge regression is a linear regression model that controls complexity to avoid overfitting.
Group of answer choices
True
False
True
The two phases of supervised ML process: Training, ________.
VALIDATION / TESTING? / PREDICTING?
is about extracting knowledge from data
Machine Learning
It is a research field at the intersection of statistics, artificial intelligence, and computer science and is also known as predictive analytics or statistical learning
Machine Learning
A field of study concerned with giving computers the ability to learn without being explicitly programmed
Machine Learning
is a discipline of artificial intelligence (AI) that provides machines with the ability to automatically learn from data and past experiences while identifying patterns to make predictions with minimal human intervention
Machine Learning (ML)
Machine Learning (ML) is a discipline of _____ that provides machines with the ability to automatically learn from data and past experiences while identifying patterns to make predictions with minimal human intervention
artificial intelligence (AI)
is a study of learning algorithms. A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E
Machine learning (including deep learning)
Collection, preparation, and analysis of data
Data Science
Leverages AI/ML, research, industry expertise, and statistics to make business decisions
Data Science
Technology for machines to understand/interpret, learn, and make ‘intelligent’ decisions. Includes Machine Learning among many other fields
Artificial Intelligence
Algorithms that help machines improve through supervised, unsupervised, and reinforcement learning
Machine Learning
Subset of AI and Data Science tool
Machine Learning
Explicit programming is used to solve problems
Rules can be manually specified
Rule-based algorithms
Samples are used for training
The decision-making rules are complex or difficult to describe
Rules are automatically learned by machines
Machine Learning
Small Scale Simple Rule Complexity
Simple Problems
Large Scale Simple Rule Complexity
Rule-based algorithms
Small Scale Complex Rule Complexity
Manual Rules
Large Scale Complex Rule Complexity
Machine Learning Algorithms
enable computers to operate autonomously without explicit programming. ML applications are fed with new data, and they can independently learn, grow, develop, and adapt
Machine learning methods
adaptively improves with an increase in the number of available samples during the ‘learning’ process
performance of ML algorithms
______ can work 24/7 and don’t get tired, need breaks, call in sick, or go on strike
Computers and robots
Machines driven by algorithms designed by humans are able to learn ______ and inherent patterns and to fulfill tasks desired by humans
latent rules, inherent patterns
______ are better suited than humans for tasks that are routine, repetitive, or tedious
Learning machines
______ can mitigate risks caused by fatigue or inattention
automation by machine learning
Types of Machine Learning
Supervised Machine Learning
Unsupervised Machine Learning
Semi-Supervised Learning
Reinforcement Learning
a collection of data used in machine learning tasks. Each data record is called a sample
Dataset
Events or attributes that reflect the performance or nature of a sample in a particular aspect are called ______
features
dataset used in the training process, where each sample is referred to as a training sample.
Training set
The process of creating a model from data is called _____
learning (training).
Testing refers to the process of using the model obtained after learning for prediction.
Test set
The dataset used is called a _____, and each sample is called a _____
test set, test sample
(1) Project Setup
Understand the business goals
Choose the solution to your problem.
Speak with your stakeholders and deeply understand the business goal behind the model being proposed. A deep understanding of your business goals will help you scope the necessary technical solution, data sources to be collected, how to evaluate model performance, and more
Understand the business goals
Once you have a deep understanding of your problem - focus on which category of models drives the highest impact.
Choose the solution to your problem.
(2) Data Preparation
Data Collection
Data Cleaning
Feature Engineering
Split the data
Collect all the data you need for your models, whether from your own organization, public, or paid sources
Data Collection
Turn the messy raw data into clean, tidy data ready for analysis.
Data Cleaning
Manipulate the datasets to create variables (features) that improve your model’s prediction accuracy. Create the same features in both the training set and the testing set
Feature Engineering
Randomly divide the records in the dataset into a training set and a testing set. For a more reliable assessment of model performance, generate multiple training and testing sets using cross-validation
Split the data
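A minimal sketch of the split described above, using only the Python standard library. The helper names `train_test_split` and `k_fold_splits`, the 25% test fraction, and the toy `data` list are assumptions made for illustration; they are not taken from the course material.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Randomly divide records into a training list and a testing list."""
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_splits(records, k=5):
    """Yield (train, test) pairs; every record lands in exactly one test fold."""
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

data = list(range(20))                      # stand-in for 20 dataset records
train, test = train_test_split(data)
splits = list(k_fold_splits(data, k=5))     # cross-validation: 5 train/test pairs
print(len(train), len(test), len(splits))
```

Cross-validation gives a more reliable assessment because every record is used for testing once, instead of relying on a single random split.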
(3) Modeling
Hyperparameter tuning
Train your models
Make predictions
Assess model performance
For each model, use ______ techniques to improve model performance.
Hyperparameter tuning
Fit each model to the training set
Train your models
Make predictions on the testing set
Make predictions
For each model, calculate performance metrics on the testing set such as accuracy, recall, and precision
Assess model performance
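The metrics named above can be computed by hand, which makes the definitions easier to remember. This is a sketch for binary labels; the function name `classification_metrics` and the example labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    return accuracy, precision, recall

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # labels in the testing set
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # predictions made on the testing set
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```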
(4) Deployment
Deploy the model
Monitor model performance
Improve your model
Embed the model you choose in dashboards, applications, or wherever you need it
Deploy the model
Regularly test the performance of your model as your data changes to avoid model drift
Monitor model performance
Continuously iterate and improve your model post-deployment. Replace your model with an updated version to improve performance
Improve your model
Phase 1: Learning
Preprocessing
Learning
Testing
Preprocessing:
Clean Data
Format Data
Learning:
Supervised
Unsupervised
Reinforcement
Testing:
Measure Performance
Test Algorithm
Phase 2: Prediction
New Data + Trained Model = Prediction -> Predicted Data
Machine Learning Languages
Python, R, C++
Big Data Tools
MemSQL
Apache Spark
General Machine Learning Frameworks
NumPy
Scikit-learn
NLTK
Data Analysis & Visualization Tools
Pandas
Matplotlib
Jupyter Notebook
Weka
Tableau
Machine Learning Frameworks for Neural Network Modeling
PyTorch
Keras
Caffe2
TensorFlow & TensorBoard
Top Programming Languages for ML
Python
R
Java
Julia
Scala
C++
JavaScript
Lisp
Haskell
Go
Why Python?
Easy-to-Read Syntax
Extensive Libraries and Frameworks
Strong Community Support
Flexibility
Compatibility with Other Languages
Scalability and Performance
Most popular ______ that are used in data analysis, data science, machine learning (ML), artificial intelligence (AI), natural language processing (NLP), deep learning, and by data scientists:
Python libraries
Top 10 Python Libraries
Pandas
Matplotlib
TensorFlow
SciPy
Scrapy
NumPy
Seaborn
Keras
PyTorch
SQLModel
A very popular tool and the most prominent Python library for ML
Scikit-learn
is one of the fundamental packages for scientific computing
NumPy
is a collection of functions for scientific computing
SciPy
is the primary scientific plotting library
Matplotlib
is a library for data wrangling and analysis
Pandas
A Python distribution made for large-scale data processing, predictive analysis, and scientific computing
Anaconda
is an interactive environment for running code in the browser
Jupyter Notebook
Applications of Machine Learning
Manufacturing
Healthcare
E-commerce
Automobile
Insurance
Transportation
credit scoring, algorithmic trading
Computational finance
facial recognition, motion tracking, object detection
Computer vision
DNA sequencing, brain tumor detection, drug discovery
Computational biology
predictive maintenance
Automotive, aerospace, and manufacturing
voice recognition
Natural language processing
contains missing values or the data that lacks attributes
Incompleteness
contains incorrect records or exceptions.
Noise
contains inconsistent records
Inconsistency
Without good data, there is no
good model
is an observation that seems to be distant from other observations or, more specifically, one observation that follows a different logic or generative process than the other observations
Outlier
is the practice of cleaning, altering, and reorganizing raw data prior to processing and analysis, which is also known as data preparation
Preprocessing
Preprocessing - is the practice of cleaning, altering, and reorganizing raw data prior to processing and analysis, which is also known as ______
data preparation
It is an important step before processing to _____
prepare data for analysis and modeling by cleaning and transforming it
Key steps in Data Preprocessing
Data Profiling
Data Cleansing
Data Reduction
Data Transformation
Data Enrichment
Data Validation
Data Preprocessing Techniques
Data Cleansing
Feature Engineering
Identify and sort out missing data
Reduce noisy data
Identify and remove duplicates
Data Cleansing
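The three cleansing steps above can be sketched on a tiny table of records. The field names, the sample rows, the 0 to 120 plausibility range for age, and mean imputation are all assumptions chosen for illustration; real projects pick these rules per dataset.

```python
from statistics import mean

raw = [
    {"id": 1, "age": 34, "city": "Manila"},
    {"id": 2, "age": None, "city": "Cebu"},    # missing value
    {"id": 2, "age": None, "city": "Cebu"},    # exact duplicate of the row above
    {"id": 3, "age": 29, "city": "Davao"},
    {"id": 4, "age": 420, "city": "Manila"},   # implausible (noisy) value
]

# 1. Identify and remove duplicates, keeping the first occurrence.
seen, cleaned = set(), []
for row in raw:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        cleaned.append(dict(row))              # copy so raw stays untouched

# 2. Reduce noisy data: treat ages outside the assumed range as missing.
for row in cleaned:
    if row["age"] is not None and not (0 <= row["age"] <= 120):
        row["age"] = None

# 3. Sort out missing data: fill gaps with the mean of the known ages.
fill_value = mean(r["age"] for r in cleaned if r["age"] is not None)
for row in cleaned:
    if row["age"] is None:
        row["age"] = fill_value

print(cleaned)
```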
Involves techniques used by data scientists to organize the data in ways that make it more efficient to train data models and run inferences against them
Feature Engineering
Feature scaling or normalization
Data reduction
Discretization
Feature encoding
Feature Engineering
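Three of the feature engineering techniques listed above (scaling, discretization, and encoding) in a standard-library sketch. The helper names, the bin edges, and the sample values are hypothetical; data reduction is not shown.

```python
def min_max_scale(values):
    """Feature scaling: map values linearly onto [0, 1] (assumes they are not all equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, edges):
    """Discretization: return the index of the bin each value falls into."""
    return [sum(v >= e for e in edges) for v in values]

def one_hot(labels):
    """Feature encoding: one 0/1 column per category, in sorted category order."""
    categories = sorted(set(labels))
    return categories, [[int(label == c) for c in categories] for label in labels]

ages = [18, 25, 40, 62]
scaled = min_max_scale(ages)
bins = discretize(ages, edges=[20, 40, 60])       # bins: <20, 20-39, 40-59, >=60
categories, encoded = one_hot(["red", "green", "red", "blue"])
print(scaled, bins, categories, encoded)
```

Whatever transformation is fitted on the training set (for example the min and max used for scaling) should be reused unchanged on the testing set, so both sets end up with the same features.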
Used to understand the main characteristics of the data, discover patterns, spot anomalies, test a hypothesis, or check assumptions
Exploratory Data Analysis (EDA)
Data Visualization Methods
Visualization
Summary Statistics
Outlier Detection
Correlation Analysis
Creating plots and charts to visualize data distributions and relationships
Visualization
Calculating measures like mean, median, variance, and standard deviation.
Summary Statistics
Identifying unusual data points
Outlier Detection
Examining relationships between variables
Correlation Analysis
Testing initial assumptions about the data
Hypothesis Testing
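A small sketch of three of the EDA steps above: summary statistics, outlier detection, and correlation analysis. The 1.5 * IQR rule is one common rule of thumb and is an assumption here, as are the helper names and the `hours` / `scores` data.

```python
from statistics import mean, median, quantiles, stdev, variance

def iqr_outliers(values):
    """Flag points lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    norm_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (norm_x * norm_y)

hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 73, 79, 140]   # the last score looks unusual

summary = {
    "mean": mean(scores),
    "median": median(scores),
    "variance": variance(scores),
    "std": stdev(scores),
}
outliers = iqr_outliers(scores)
corr = pearson(hours, scores)
print(summary, outliers, corr)
```

The gap between the mean and the median in `summary` is already a hint that an extreme value is pulling the mean upward.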
are useful for visualizing the “count” of values in the data set
Bar plots and Histograms
Machine Learning Model Deployment
Training
Validation
Deployment
Monitoring
refers to the process of taking a trained ML model and making it available for use in real-world applications
Machine Learning Model Deployment
Before deployment, models need to be thoroughly trained and evaluated. This involves data preprocessing, feature engineering, and rigorous testing to ensure the model is robust and ready for real-world scenarios
Training
ML models should be able to handle increased loads and continue to deliver results efficiently. Ensuring the infrastructure can handle the model’s computational requirements is vital, requiring validation and effective testing for scalability before deploying models
Validation
Model deployment is the most crucial process of integrating the ML model into its production environment.
Deployment
Deployment process entails:
Defining how to extract or process the data in real time
Determine the storage required for these processes
Collection and predictions of model and data patterns
Setting up APIs, tools, and other software environments to support and improve predictions
Configuring the hardware (cloud or on-prem environments) to help support the ML model
Creating a pipeline for continuous training and parameter tuning
This process is the most challenging, involving several moving pieces, tools, data scientists, and ML engineers to collaborate and strategize
Deployment
Once deployed, models need to be continuously _____
monitored.
Real world data can evolve, and models may drift in their performance.
Monitoring
Implementing ______ systems helps to detect deviations and make necessary adjustments in a timely manner
monitoring
Best Practices for Successful ML Model Deployment
Choosing the Right Infrastructure
Effective Versioning and Tracking
Robust Testing and Validation
Implementing Monitoring and Alerting
covers the ethical and moral obligations of collecting, sharing, and using data, focused on ensuring that data is used fairly, for good
Data Ethics
5 Principles of Data Ethics
Ownership
Transparency
Privacy
Intention
Outcomes
the first principle of data ethics is that an individual has ownership over their personal information. Just as it’s considered stealing to take an item that doesn’t belong to you, it’s unlawful and unethical to collect someone’s personal data without their consent
Ownership
In addition to owning their personal information, data subjects have a right to know how you plan to collect, store, and use it. When gathering data, exercise ______
transparency
Another ethical responsibility that comes with handling data is ensuring data subjects’ _____. Even if a customer gives your company consent to collect, store, and analyze their personally identifiable information (PII), that doesn’t mean they want it publicly available
privacy
Before collecting data, ask yourself why you need it, what you’ll gain from it, and what changes you’ll be able to make after analysis. If your intention is to hurt others, profit from your subjects’ weaknesses, or any other malicious goal, it’s not ethical to collect their data
Intention
even when intentions are good, the outcome of data analysis can cause inadvertent harm to individuals or groups of people.
Outcomes
the outcome of data analysis can cause inadvertent harm to individuals or groups of people. This is called a ______
disparate impact
Data Privacy Regulation (New Rules of Data)
Rule 1: Trust over Transactions
Rule 2: Insight over Identity
Rule 3: Flows over silos
This first rule is all about consent. Until now, companies have been gathering as much data as possible on their current and prospective customers’ preferences, habits, and identities, transaction by transaction - often without customers understanding what is happening
Rule 1: Trust over Transactions
Firms need to re-think not only how they acquire data from their customers but from each other as well. Currently, companies routinely transfer large amounts of personally identifiable information (PII) through a complex web of data agreements, compromising both privacy and security
Rule 2: Insight over Identity
New organizing principle for internal data teams. Once all your customer data has meaningful consent and you are acquiring insight without transferring data, CIOs and CDOs no longer need to work in silos, with one trying to keep data locked up while the other is trying to break it out. Instead, CIOs and CDOs can work together to facilitate the flow of insights
Rule 3: Flows over silos
Data Subject Rights
Right to be Informed
Right to Damages
Right to Access
Right to Erasure or Blocking
Right to File a Complaint
Right to Object
Right to Rectify
Right to Data Portability
is a set of principles and processes for data collection, management, and use. The goal is to ensure that data is accurate, consistent, and available for use, while protecting data privacy and security
Data Governance
is a set of policies, procedures, and standards that implements data governance for an organization.
Data Governance Framework
The Pillars of Data Governance
Ownership & Accountability
Data Quality
Data Protection & Safety
Data use & Availability
Data Management
10 Questions to Answer before using AI in Public Sector Algorithmic Decision Making
Objective
Use
Impacts
Assumptions
Data
Inputs
Mitigation
Ethics
Oversight
Evaluation
why is the algorithm needed and what outcomes is it intended to enable
Objective
In what processes and circumstances is the algorithm appropriate to be used?
Use
what impacts - good and bad - could the use of the algorithm have on people?
Impacts
what assumptions is the algorithm based on and what are their limitations and potential biases?
Assumptions
what datasets is/was the algorithm trained on and what are their limitations and potential biases?
Data
what new data does the algorithm use when making decisions?
Inputs
what actions have been taken to mitigate the negative impacts that could result from the algorithm’s limitations and potential biases?
Mitigation
what assessment has been made of the ethics of using this algorithm?
Ethics
what human judgement is needed before acting on the algorithm’s output and who is responsible for ensuring its proper use?
Oversight
how, and by what criteria, will the effectiveness of the algorithm be assessed, and by whom?
Evaluation
Each example in the dataset is a pair consisting of an input object (such as a _____) and a desired output value (____).
feature vector, label
The primary objective of the supervised learning technique is to ______
map the input variable with the output variable
Supervised machine learning is further classified into two broad categories:
Regression
Classification
Regression: target is a _____ variable
continuous
Regression Examples
Forecasting future stock price
Forecasting energy resources
Weather prediction
Market trend analysis
Predicting the environmental impact of pollutants
Classification: target is a ____ variable
categorical
Classification Examples
Classifying objects in images
Classifying chest X-ray images into COVID positive/negative
Handwritten digit recognition
Filtering emails into spam or not spam
Activity recognition for wearable devices
Refer to algorithms that address classification problems where the output variable is categorical; for example, yes or no, true or false, male or female.
Classification
Predicts one of the possible class labels
Classification
classification of two classes (yes/no, negative/positive, 0/1)
Binary Classification
classification of three or more classes
Multiple Classification
Classification algorithms include:
Random Forest Algorithm
Decision Tree Algorithm
Logistic Regression Algorithm
Support Vector Machine Algorithm
_____ algorithms handle _____ problems where input and output variables have a linear relationship
Regression, regression
Regression algorithms include:
Simple Linear Regression Algorithm, Multivariate Regression Algorithm, Decision Tree Algorithm, and Lasso Regression
As with any ML process, supervised ML has two phases: the usual ____ and _____, followed by _____
training
validation
prediction
The larger the variety of data points your dataset contains, the more complex a model you can use without ____
overfitting
how well a model performs:
Generalization
Overfitting
Underfitting
If a model is able to make accurate predictions on unseen data, we say it is able to _____ from the training set to the test set
generalize
Occurs when a model learns the training data too well, including its noise and outliers
Overfitting
occurs when you fit a model too closely to the particularities of the training set and obtain a model that works well on the training set but is not able to generalize to new data
Overfitting
performs exceptionally well on training data but poorly on new, unseen data because it has essentially memorized the training data rather than learning the underlying patterns
overfitted model
If your model is too simple then you might not be able to capture all the aspects and variability in the data, and your model will do badly even on the training set. Choosing too simple a model is called underfitting
underfitting
performs poorly on both training and new data because it hasn’t learned enough from the training data
underfitted
The more complex we allow our model to be, the better we will be able to predict on the training data
Model Complexity Curve
error from having wrong / too simple assumptions in the learning algorithm
Bias
error resulting from sensitivity to the noise / fluctuations in the training data
Variance
Low Bias and Low Variance = ?
Good Model
the k-NN algorithm is arguably the simplest machine learning algorithm.
k-Nearest Neighbors
Building the model consists only of storing the training dataset.
k-Nearest Neighbors
To make a prediction for a new data point, the algorithm finds the closest data points in the training dataset - its _______
“nearest neighbors”
in its simplest version, the k-NN algorithm only considers exactly one nearest neighbor, which is the closest training data point to the point we want to make a prediction for
k-Neighbors classification
Instead of considering only the closest neighbor, we can also consider an _______. This is where the name of the k-nearest neighbors algorithm comes from
arbitrary number, k, of neighbors
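A from-scratch sketch of k-NN classification as described in the cards above: store the training data, find the k closest points, and take a majority vote. The function name `knn_predict` and the toy points are made up for illustration, and Euclidean distance is assumed.

```python
from collections import Counter
from math import dist   # Euclidean distance between two points

def knn_predict(X_train, y_train, x_new, k=1):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # "Building the model" is nothing more than keeping X_train and y_train.
    by_distance = sorted(zip(X_train, y_train), key=lambda pair: dist(pair[0], x_new))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 5.0), (7.0, 7.0), (6.5, 6.0)]
y_train = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(X_train, y_train, (2.0, 2.0), k=3))
print(knn_predict(X_train, y_train, (6.0, 6.0), k=3))

# With k=1 every training point is its own nearest neighbor, so the training
# set is reproduced perfectly: the most complex setting of the model.
train_hits = [knn_predict(X_train, y_train, x, k=1) == y for x, y in zip(X_train, y_train)]
```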
There is also a regression variant of the _____
k-nearest neighbors algorithm.
The k-nearest neighbors algorithm for regression is implemented in the KNeighborsRegressor class in scikit-learn. It’s used similarly to KNeighborsClassifier:
k-NN Estimator
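To show what the regression variant does without depending on scikit-learn, here is a hand-written sketch: the prediction is the average target of the k nearest training points. The name `knn_regress` and the data are hypothetical; this illustrates the idea and is not the library's implementation.

```python
from math import dist
from statistics import mean

def knn_regress(X_train, y_train, x_new, k=3):
    """Predict a number for x_new as the mean target of its k nearest neighbors."""
    by_distance = sorted(zip(X_train, y_train), key=lambda pair: dist(pair[0], x_new))
    return mean(target for _, target in by_distance[:k])

X_train = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
y_train = [1.1, 1.9, 3.2, 3.9, 5.1]

one_neighbor = knn_regress(X_train, y_train, (2.9,), k=1)     # copies the closest target
three_neighbors = knn_regress(X_train, y_train, (2.9,), k=3)  # averages three targets
print(one_neighbor, three_neighbors)
```

A larger k averages over more neighbors, which gives a smoother and less complex model.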
_______, also known as the coefficient of determination, is a measure of goodness of a prediction for a regression model, and usually yields a score between 0 and 1.
The R-squared score (R^2)
A value of 1 corresponds to the perfect prediction, and a value of 0 corresponds to a constant model that just predicts the mean of the training set responses, y_train:
The R-squared score (R^2)
The regression model’s score() function returns the coefficient of determination, R^2.
Estimation of the Regression Model
Perfect prediction: target value == prediction -> numerator == 0 (with R^2 = 1 - numerator / denominator)
R^2 = 1
Predicting the average of the target values: numerator == denominator
R^2 = 0
Predicting worse than the average can result in
negative numbers
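The three cases above can be checked numerically. This sketch writes R^2 as 1 minus the ratio of the squared prediction error (numerator) to the squared deviation from the mean (denominator); the function name `r2_score` and the toy targets are chosen here for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # numerator
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # denominator
    return 1 - ss_res / ss_tot

y = [3.0, 5.0, 7.0, 9.0]

perfect = r2_score(y, y)                        # numerator is 0
mean_only = r2_score(y, [6.0, 6.0, 6.0, 6.0])   # always predict the mean, 6.0
reversed_order = r2_score(y, [9.0, 7.0, 5.0, 3.0])  # worse than predicting the mean
print(perfect, mean_only, reversed_order)   # 1.0 0.0 -3.0
```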
Two important parameters to the KNeighbors classifier:
The number of neighbors
how you measure distance between data points
By default, _____ is used as the distance measure
Euclidean distance
Strengths/Advantages of KNN
Easy to understand
Works well without any special adjustments
Suitable as a first model to try
Weaknesses/Disadvantages of KNN
If the number of features or samples is large, the prediction is slow and data preprocessing is important.
Does not work well with sparse datasets
Generate a formula to create a best-fit line to predict unknown values
Linear models
make a prediction using a linear function of the input features
Linear models
They are called _____ because they assume there is a ___ relationship between the outcome variable and each of its predictors
linear, linear
several real-life scenarios follow linear relations between dependent and independent variables.
Application of Linear Models
Application of Linear Models Example
The relationship between the boiling point of water and change in altitude
The relationship between spending on advertising and the revenue of an organization
The relationship between the amount of fertilizer used and crop yields
Performance of athletes and their training regimen
Types of Linear Models
Linear Regression
Logistic Regression
The algorithm is used for solving regression problems
Linear Regression
Final output of the model is numeric value (numerical predictions).
Linear Regression
The algorithm maps a linear relationship between the input features(X) and the output (y)
Linear Regression
Linear model for classification problems
Logistic Regression
It generates a probability between 0 and 1. This happens by fitting a logistic function, also known as the sigmoid function.
Logistic Regression
Logistic Regression generates a probability between 0 and 1. This happens by fitting a logistic function, also known as the _____. The function first transforms the linear regression output into a value between 0 and 1. After that, a predefined threshold is applied to that probability to determine the output class
sigmoid function
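The two steps described above (squash the linear output with the sigmoid, then apply a threshold) look like this in code. The weights `w`, the offset `b`, and the 0.5 threshold are made-up values for illustration, not fitted parameters.

```python
from math import exp

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_class(features, w, b, threshold=0.5):
    """Return (probability, class) for one sample."""
    z = sum(wi * xi for wi, xi in zip(w, features)) + b   # linear model output
    probability = sigmoid(z)
    return probability, int(probability >= threshold)

w, b = [1.5, -2.0], 0.25                       # hypothetical, already-learned parameters
p_high, class_high = predict_class([2.0, 0.5], w, b)   # z = 2.25
p_low, class_low = predict_class([0.0, 1.0], w, b)     # z = -1.75
print(p_high, class_high, p_low, class_low)
```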
is the simplest and most classic linear method for regression
Linear Regression (aka Ordinary Least Squares)
Linear regression finds the parameters w and b that minimize the _____ between predictions and the true regression targets, y, on the training set.
mean squared error
The ______ is the sum of the squared differences between the predictions and the true values, divided by the number of samples.
mean squared error
The “slope” parameters (w), also called _______, are stored in the coef_ attribute,
weights or coefficients
the offset or ______ is stored in the intercept_ attribute:
intercept (b)
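For a single input feature, the w and b that minimize the mean squared error have a closed form, sketched below. The names `fit_simple_ols` and `mean_squared_error` and the toy data are illustrative; in scikit-learn the same two quantities are what the cards above call coef_ and intercept_.

```python
def fit_simple_ols(xs, ys):
    """Least-squares slope w and intercept b for one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return w, b

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # generated from y = 2x + 1

w, b = fit_simple_ols(xs, ys)
predictions = [w * x + b for x in xs]
mse = mean_squared_error(ys, predictions)
print(w, b, mse)
```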
a model that allows us to control complexity. One of the most commonly used alternatives to standard linear regression is ____
ridge regression
is also a linear model for regression, so the formula it uses to make predictions is the same one used for OLS
Ridge Regression
Each feature should have as little effect on the outcome as possible (which translates to having a small slope), while still predicting well. This constraint is an example of what is called ______
regularization.
Regularization means explicitly restricting a model to avoid _____
overfitting.
The particular kind of Regularization used by ridge regression is known as
L2 regularization
Ridge regression is implemented in the ___ class.
linear_model.Ridge
a higher alpha means a more restricted model, so we expect the entries of coef_ to have smaller magnitude for a high value of alpha than for a low value of alpha
Ridge Coef
a higher alpha means ______, so we expect the entries of coef_ to have smaller magnitude for a high value of alpha than for a low value of alpha
a more restricted model
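The effect of alpha can be seen in the simplest possible case: one feature, an unpenalized intercept, and the objective "sum of squared errors plus alpha times w squared". Under those assumptions the slope has the closed form used below; `ridge_slope` and the data are hypothetical, and this is a sketch of the idea rather than scikit-learn's solver.

```python
def ridge_slope(xs, ys, alpha):
    """Slope minimizing sum((y - w*x - b)**2) + alpha * w**2 for one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / (sxx + alpha)      # alpha = 0 gives back ordinary least squares

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slopes = {alpha: ridge_slope(xs, ys, alpha) for alpha in (0.0, 1.0, 10.0, 100.0)}
for alpha, slope in slopes.items():
    print(alpha, round(slope, 3))
```

The slope shrinks toward zero as alpha grows but never reaches zero exactly, which is the L2 behavior described in the cards.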
plots that show model performance as a function of dataset size are called _____
learning curves
An alternative to Ridge for regularizing linear regression is _____
Lasso
As with ridge regression, using the lasso also restricts coefficients to be close to zero, but in a slightly different way, called _____
L1 regularization
The consequence of L1 is that when using lasso, some coefficients are exactly zero. This means some features are ______ by the model
entirely ignored
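Why L1 produces exact zeros can be seen in the one-feature case. Assuming the objective "(1 / (2n)) * sum of squared errors plus alpha times |w|" with an unpenalized intercept, the solution is a soft-thresholding step. The helper names and data below are illustrative assumptions.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within alpha of zero become exactly 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_slope(xs, ys, alpha):
    """Slope minimizing (1/(2n)) * sum((y - w*x - b)**2) + alpha * abs(w)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    rho = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    z = sum((x - mx) ** 2 for x in xs) / n
    return soft_threshold(rho, alpha) / z

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

for alpha in (0.0, 1.0, 2.5, 5.0):
    print(alpha, lasso_slope(xs, ys, alpha))
```

Once alpha is large enough, the coefficient is exactly 0.0 and the feature drops out of the model, unlike the gradual shrinking seen with an L2 penalty.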
A ____ allowed us to fit a more complex model, which worked better on both the training and test data.
lower alpha
If only a few of the many features are expected to be important, use ____
Lasso
When you want a model that is easy to analyze and understand, use ___
Lasso
The most common linear classification algorithms are:
Logistic Regression
Linear Support Vector Machines
LinearSVC =
linear support vector classifier
Despite its name, ____ is a classification algorithm and not a regression algorithm, and it should not be confused with LinearRegression
LogisticRegression
For LogisticRegression and LinearSVC, the trade-off parameter that determines the strength of the regularization is called ____ and higher values of __ correspond to ______
C, C, less regularization.
In other words, when you use a ____ for the parameter C, LogisticRegression and LinearSVC try to fit the training set as well as possible, while with ____ of the parameter C, the models put more emphasis on finding a _____ that is close to zero
high value, low values, coefficient vector (w)
Using ____ of C will cause the algorithms to try to adjust to the “majority” of data points
low values
using a _____ of C stresses the importance that each individual data point be classified correctly
higher value
are a family of classifiers that are quite similar to the linear models
Naive Bayes classifiers
Training is faster than for linear classifiers
Naive Bayes Classifier
Generalization performance is slightly worse than that of linear classifiers
Naive Bayes Classifier
The reason that ______ are so efficient is that they learn parameters by looking at each feature individually and collect simple per-class statistics from each feature
naive Bayes models
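A from-scratch sketch of the "per-class statistics from each feature" idea, in the Gaussian flavor: training only stores a prior, a mean, and a variance per class and feature. The function names, the tiny dataset, and the 1e-9 term added to each variance to avoid division by zero are assumptions for illustration; this is not scikit-learn's GaussianNB code.

```python
from collections import defaultdict
from math import log, pi
from statistics import mean, pvariance

def fit_gaussian_nb(X, y):
    """Collect the prior and the per-feature mean and variance of every class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        columns = list(zip(*rows))               # look at each feature individually
        model[label] = {
            "prior": len(rows) / len(X),
            "means": [mean(col) for col in columns],
            "vars": [pvariance(col) + 1e-9 for col in columns],
        }
    return model

def predict_gaussian_nb(model, row):
    """Pick the class with the highest log prior plus summed Gaussian log-likelihoods."""
    def log_posterior(stats):
        total = log(stats["prior"])
        for x, m, v in zip(row, stats["means"], stats["vars"]):
            total += -0.5 * log(2 * pi * v) - (x - m) ** 2 / (2 * v)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))

X = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0), (3.0, 4.1), (3.2, 3.9), (2.9, 4.0)]
y = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, (1.1, 2.0)), predict_gaussian_nb(model, (3.1, 4.0)))
```

Training is a single pass that computes a few averages, which is why these models are so fast to fit.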
3 Kinds of Naive Bayes Classifier in Scikit-learn:
GaussianNB
BernoulliNB
MultinomialNB
_____ -> continuous data, what NB
GaussianNB
Binary data, text data, what NB
BernoulliNB
integer count data, text data, what NB
MultinomialNB
Control model complexity with alpha parameter
Naive Bayes Classifier
Smooths the statistics by adding alpha virtual data points that have positive values for all features
Naive Bayes Classifier
Large alpha decreases the complexity of the model, but performance is relatively robust to the setting of alpha
Naive Bayes Classifier
_____ is mostly used on very high-dimensional data
GaussianNB
_____ and MultinomialNB are widely used for sparse count data such as text.
BernoulliNB
BernoulliNB and _____ are widely used for sparse count data such as text.
MultinomialNB
______ and _____ are widely used for sparse count data such as text.
BernoulliNB
MultinomialNB
Training and testing are fast and easy to understand and process
Naive Bayes Classifier
Works well with sparse high-dimensional datasets and is not parameter sensitive
Naive Bayes Classifier
Naive Bayes Classifier Strengths, Weaknesses, and Parameters
Control model complexity with alpha parameter
Smooths the statistics by adding alpha virtual data points that have positive values for all features
Large alpha decreases the complexity of the model, but performance is relatively robust to the setting of alpha
GaussianNB is mostly used on very high-dimensional data
BernoulliNB and MultinomialNB are widely used for sparse count data such as text.
Training and testing are fast and easy to understand and process
Works well with sparse high-dimensional datasets and is not parameter sensitive