CS35 Flashcards
Series of tasks, activities, or operations to achieve a goal or an outcome
Process
Combination of hardware and software to facilitate or automate processes
Technology
Discrete measurement, fact, or observation representing a real-world process
Data
the mathematical discipline that studies the methods of collecting, analyzing, and interpreting data.
Statistics
specific collection of items of interest
Population
subset or subcollection of the population
Sample
two scopes of data
Sample & Population
Logic is built based on business rules
Traditional Rule-Based AI
Logic is built by modelling and training data
Machine Learning
Input and sometimes output data are provided to a machine, which builds its own logic based on mathematical rules
Machine Learning
Machine learning algorithms in which the training data includes both input and output
Supervised Machine Learning
Inputs are called
feature values
outputs are called
label values
the label predicted by the model is a numeric value
Regression
the model predicts whether a record is an instance of a specific class or category
Binary Classification
the model predicts whether a record is an instance of one of multiple classes or categories
Multiclass Classification
Training data consists only of input without any known output
Unsupervised Machine Learning
the model identifies similarities between observations based on their features and groups them into discrete clusters
Clustering
A model that groups existing customers into clusters based on age, location, gender, social media usage, and purchasing behavior.
Clustering
A model that classifies whether a social media post is positive, negative, or neutral.
Multiclass Classification
A model that predicts whether a customer will cancel their subscription.
Binary Classification
A model that predicts the price of an apartment based on the size, number of rooms, barangay, and construction date of the building.
Regression
Used to train the model; the data from which the algorithm learns patterns
Training Data
Used to evaluate the model
Test Data
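The training/test split above can be sketched with scikit-learn's `train_test_split`; the dataset here is a made-up toy example, not from the course.

```python
from sklearn.model_selection import train_test_split

# Toy dataset: 10 single-feature rows and their binary labels.
X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Hold out 30% of the records as test data; the rest is training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(len(X_train), len(X_test))  # 7 training rows, 3 test rows
```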
Proportion of predictions that the model got right
Accuracy
Proportion of predicted positive cases where the true label is actually positive
Precision
Proportion of positive cases that the model identified correctly
Recall
Overall metric combining Recall and Precision
F1 Score
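The four evaluation metrics above can be computed with scikit-learn; the true/predicted labels below are illustrative values chosen so each metric is easy to check by hand.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true and predicted labels for a binary classifier.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: proportion of all predictions the model got right.
acc = accuracy_score(y_true, y_pred)    # 6/8 = 0.75
# Precision: of the predicted positives, how many were truly positive.
prec = precision_score(y_true, y_pred)  # 3/4 = 0.75
# Recall: of the actual positives, how many the model identified.
rec = recall_score(y_true, y_pred)      # 3/4 = 0.75
# F1: harmonic mean of precision and recall.
f1 = f1_score(y_true, y_pred)           # 0.75
print(acc, prec, rec, f1)
```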
a lazy learning algorithm that predicts the class of a data point based on the majority class of its k nearest neighbors
k-NN classifier
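A minimal k-NN sketch with scikit-learn's `KNeighborsClassifier`; the two well-separated 1-D groups are invented for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated groups of 1-D points.
X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# k-NN stores the training data ("lazy learning") and classifies a new
# point by majority vote among its k nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[2.5], [10.5]]))  # near group 0, near group 1
```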
predicts the probability that a given data point belongs to a particular class; uses the logistic function
Logistic Regression
an S-shaped curve used in logistic regression; maps any real number into the range (0, 1)
logistic function
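The logistic (sigmoid) function is short enough to write out directly; the sample inputs just demonstrate the S-shape.

```python
import numpy as np

def logistic(z):
    """The logistic (sigmoid) function: an S-shaped curve mapping any
    real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(logistic(0))   # 0.5 — the curve's midpoint
print(logistic(5))   # close to 1
print(logistic(-5))  # close to 0
```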
occurs when one class is significantly more frequent than the other
Class Imbalance
reducing the number of instances in the majority class by removing samples until the classes are balanced.
Undersampling
increasing the number of instances in the minority class by duplicating samples or generating new synthetic examples.
Oversampling
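Random oversampling can be sketched with scikit-learn's `resample` utility (SMOTE itself lives in the separate imbalanced-learn package); the imbalanced toy data is invented for illustration.

```python
from sklearn.utils import resample

# Imbalanced toy data: 6 majority-class rows, 2 minority-class rows.
majority = [[x, 0] for x in range(6)]
minority = [[x, 1] for x in range(2)]

# Random oversampling: duplicate minority samples (with replacement)
# until the minority class matches the majority class in size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = majority + minority_upsampled

print(len(minority_upsampled))  # 6 — classes are now balanced
```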
Generates synthetic samples for the minority class by interpolating between existing samples
SMOTE (Synthetic Minority Oversampling Technique)
Cons of Oversampling
Oversampling can cause overfitting, especially with random oversampling.
Cons of Undersampling
Important information from the majority class may be lost, potentially underfitting the model.
a measure of the relationship between two variables. If one variable increases when the other one also increases, the correlation is positive.
Correlation
means that changes in one variable cause another variable to change. It means one variable directly influences the other.
Causation
Measures the average magnitude of the errors in a set of predictions without considering their direction
Mean Absolute Error
Measures the average squared difference between actual and predicted values. Larger errors are penalized more.
Mean Squared Error
A popular metric because it has the same units as the target variable, making it easier to interpret
Root Mean Squared Error
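The three regression error metrics above can be computed directly with NumPy; the actual/predicted values are illustrative numbers chosen to be easy to verify by hand.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred        # [0.5, 0.0, -2.0, -1.0]

mae = np.mean(np.abs(errors))   # average magnitude, ignoring direction
mse = np.mean(errors ** 2)      # squaring penalizes larger errors more
rmse = np.sqrt(mse)             # back in the units of the target variable

print(mae, mse, rmse)
```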
standardizes features by making sure that each feature has a mean of 0 and a standard deviation of 1.
StandardScaler
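A minimal `StandardScaler` sketch on one invented feature column, confirming the scaled column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature column with mean 20 and a visible spread.
X = np.array([[10.0], [20.0], [30.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, the column has mean 0 and standard deviation 1.
print(X_scaled.ravel())
print(X_scaled.mean(), X_scaled.std())
```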
a model used for regression tasks, where the goal is to predict a continuous target variable based on input features. It works by splitting the data into different regions based on feature values, making predictions by averaging the target values in each region.
DecisionTreeRegressor
an ensemble model that averages the predictions from multiple different regression models to make a final prediction.
VotingRegressor
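The two regressors above can be combined in a short scikit-learn sketch; the data (y roughly 2x) and the choice of `LinearRegression` as the second model are illustrative assumptions.

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import VotingRegressor

# Toy regression data: y is roughly 2 * x.
X = [[1], [2], [3], [4], [5]]
y = [2.0, 4.1, 5.9, 8.2, 10.0]

# A decision tree splits the feature space into regions and predicts
# the average target value within each region.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)

# VotingRegressor averages the predictions of several regressors.
ensemble = VotingRegressor([("tree", tree), ("linear", LinearRegression())])
ensemble.fit(X, y)

print(ensemble.predict([[3]]))  # averaged prediction near y = 6
```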
this modeling technique trains multiple binary classifiers, each focusing on one class versus all others.
One Vs. Rest Classifier
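A one-vs-rest sketch using scikit-learn's `OneVsRestClassifier` wrapped around logistic regression; the three 1-D classes are invented for illustration.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Three classes of well-separated 1-D points.
X = [[1], [2], [10], [11], [20], [21]]
y = [0, 0, 1, 1, 2, 2]

# One-vs-rest trains one binary classifier per class (class k vs. all
# other classes) and predicts the class with the highest score.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, y)

print(len(ovr.estimators_))  # 3 — one binary classifier per class
print(ovr.predict([[1], [21]]))
```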
? is used for classification problems while ? is used for regression. The approach of both techniques is similar.
Random Forest Classifier, Decision Tree Regressor
a natural language processing approach used to determine whether the emotional tone of a piece of text is positive, negative, or neutral.
Sentiment Analysis
an automated technological process that converts an image of text into a machine-readable format. It is traditionally known as text recognition.
Optical Character Recognition
Layers of a Convolutional Neural Network
Convolutional Layer
Pooling Layer
Flatten Layer
Fully Connected (Dense) Layer
compares the performance of two versions to see which one performs better with users or viewers.
A/B Testing
the process of creating, sharing, and utilizing knowledge and information within an organization.
Knowledge Management
knowledge that can be easily codified into formats such as text, diagrams, or figures
Explicit knowledge
knowledge that is not formally documented but can be inferred from explicit knowledge and transferred into practical skills
Implicit knowledge
personal and often difficult to articulate, consisting of insights, experiences, and “know-how.”
Tacit knowledge
facilitates the knowledge management of an organization by capturing and organizing knowledge.
Knowledge Management Software
In-House or Captive Operations Pros and Cons
Pros:
- Intellectual Property Protection
- Ultimate Control
- Long-term Cost Savings
- Internal Expertise
Cons:
- High Initial Investment
- Operational Complexity
- Inflexibility
Outsourcing Pros and Cons
Pros:
- Flexibility
- Access to Varied Expertise
- Risk Mitigation
Cons:
- Quality Control
- Coordination Effort
- Costs can balloon if not managed well
How Cloud Computing and Big Data enable Machine Learning
- Cloud Computing
- provides the necessary infrastructure and computational power to process large datasets efficiently
- Big Data
- supplies the enormous and complex datasets that are crucial for training ML models
delivers resources over the internet, making it possible for any organization or user to access systems and services.
Public Cloud
the exact opposite of the public cloud deployment model: a one-to-one environment dedicated to a single customer or organization
Private Cloud
combines both private and public cloud models. With a hybrid solution, an organization may host applications in a safe environment while taking advantage of the cost savings of the public cloud.
Hybrid Cloud
a distributed system that is created by integrating the services of different clouds to address the specific needs of a community, industry, or business
Federated Cloud or Community Cloud
delivers on-demand infrastructure resources, such as compute, storage, networking, and virtualization
Infrastructure as a Service / IAAS
delivers and manages hardware and software resources for developing, testing, delivering, and managing cloud applications
Platform as a Service / PAAS
provides a full application stack as a service that customers can access and use.
Software as a Service / SAAS
Big Data characteristics
- Volume
- Sheer quantity of the data
- Velocity
- Speed in which the data is gathered
- Variety
- Type, nature, and source of data
- Veracity
- Data quality, pertaining to accuracy and reliability
- Value
- Data has actionable insights and patterns
means connecting devices with an on/off switch to the internet, enabling them to collect and share data.
Internet of Things
How Big Data and IoT revolutionized modern-day machine learning:
- Accuracy
- Larger datasets enable machine learning algorithms to identify more intricate patterns and relationships
- Reduced Overfitting
- With more data, models are less likely to overfit.
- Discovering Hidden Patterns
- Big data enables the discovery of subtle correlations and trends that might be missed in smaller datasets
- Deep Learning
- Deep learning models such as neural networks require massive amounts of data to learn complex representations
- Natural Language Processing
- NLP models, such as those used for language translation and sentiment analysis, benefit from large datasets of text and speech data.
the practice of protecting digital information from unauthorized access, corruption, or theft.
Data Security
A regulation of the European Union that establishes rules for the protection of personal data. It requires organizations to protect the privacy of EU residents and provides them with greater control over their personal data.
General Data Protection Regulation (GDPR)
the process of removing or altering personal information from data so that individuals cannot be easily identified.
De-identification
unsupervised learning task where the model groups similar data points together based on their features or attributes
Clustering
Applications of Clustering
- Customer Segmentation
- Image Segmentation
- Anomaly Detection
a widely used clustering algorithm that partitions a dataset into K clusters based on the similarity of data points. It is used in data mining and image processing applications.
K-Means Clustering
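A minimal K-Means sketch with scikit-learn; the two obvious 2-D groups are invented so the cluster assignment is easy to check.

```python
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points.
X = [[1, 1], [1, 2], [2, 1], [10, 10], [10, 11], [11, 10]]

# Partition the data into K=2 clusters; each point is assigned to the
# cluster with the nearest centroid.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)           # the two groups of three points get distinct labels
print(kmeans.inertia_)  # sum of squared distances to the nearest centroid
```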
builds a nested hierarchy of clusters; in its common agglomerative form, each data point starts as its own cluster and the two closest clusters are merged at each iteration until all data points belong to a single cluster.
Hierarchical Clustering
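The agglomerative form can be sketched with scikit-learn's `AgglomerativeClustering`; the four toy points form two nearby pairs.

```python
from sklearn.cluster import AgglomerativeClustering

# Two nearby pairs of 2-D points.
X = [[1, 1], [1, 2], [10, 10], [10, 11]]

# Agglomerative hierarchical clustering: start with every point as its
# own cluster and repeatedly merge the two closest clusters, stopping
# here once two clusters remain.
agg = AgglomerativeClustering(n_clusters=2)
labels = agg.fit_predict(X)

print(labels)  # each nearby pair ends up in the same cluster
```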
a widely used dimensionality reduction technique in machine learning and feature extraction.
Principal Component Analysis (PCA)
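A PCA sketch reducing invented 2-D data, which varies almost entirely along one diagonal direction, down to a single component.

```python
import numpy as np
from sklearn.decomposition import PCA

# 2-D data that varies almost entirely along one direction.
X = np.array([[1.0, 1.1], [2.0, 2.0], [3.0, 2.9], [4.0, 4.1], [5.0, 5.0]])

# Reduce to one principal component: the direction of maximum variance.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (5, 1) — one feature remains
print(pca.explained_variance_ratio_)  # close to 1.0 for this data
```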
a measure of how well the data points are clustered around the centroids; the sum of squared distances from each point to its assigned centroid (lower is better)
Inertia
measures how well each data point is assigned to its cluster by comparing its similarity to points in its own cluster (cohesion) versus points in the nearest other cluster (separation)
Silhouette Score
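The silhouette score can be computed with scikit-learn on a clustering result; the compact, well-separated toy groups should score close to 1.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two compact, well-separated groups of 2-D points.
X = [[1, 1], [1, 2], [2, 1], [10, 10], [10, 11], [11, 10]]

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette score ranges from -1 to 1; values near 1 mean points sit
# well inside their own cluster (cohesion) and far from the nearest
# other cluster (separation).
score = silhouette_score(X, labels)
print(score)
```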