CS35 Flashcards
Series of tasks, activities, or operations to achieve a goal or an outcome
Process
Combination of hardware and software to facilitate or automate processes
Technology
Discrete measurement, fact, or observation representing a real-world process
Data
the mathematical discipline that studies the methods of collecting, analyzing, and interpreting data.
Statistics
specific collection of items of interest
Population
subset or subcollection of the population
Sample
two scopes of data
Sample & Population
Logic is built based on business rules
Traditional Rule-Based AI
Logic is built by modelling and training data
Machine Learning
Input and sometimes output data are provided to a machine, which builds its own logic based on mathematical rules
Machine Learning
Machine learning algorithms in which the training data includes both input and output
Supervised Machine Learning
Inputs are called
feature values
outputs are called
label values
the label predicted by the model is a numeric value
Regression
the model predicts whether a record is an instance of a specific class or category
Binary Classification
the model predicts whether a record is an instance of one of multiple classes or categories
Multiclass Classification
Training data consists only of input without any known output
Unsupervised Machine Learning
the model identifies similarities between observations based on their features and groups them into discrete clusters
Clustering
A model that groups existing customers into clusters based on age, location, gender, social media usage, and purchasing behavior.
Clustering
A model that classifies whether a social media post is positive, negative, or neutral.
Multiclass Classification
A model that predicts whether a customer will cancel their subscription.
Binary Classification
A model that predicts the price of an apartment based on the size, number of rooms, barangay, and construction date of the building.
Regression
Used to train the model; the data from which the algorithm learns patterns
Training Data
Used to evaluate the model
Test Data
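The training/test split above can be sketched with scikit-learn's `train_test_split`; the dataset here is a made-up toy example, not from the course.

```python
from sklearn.model_selection import train_test_split

# Toy dataset: 10 single-feature rows and their binary labels.
X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Hold out 30% of the records as test data; the rest is training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(len(X_train), len(X_test))  # 7 training rows, 3 test rows
```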
Proportion of predictions that the model got right
Accuracy
Proportion of predicted positive cases where the true label is actually positive
Precision
Proportion of positive cases that the model identified correctly
Recall
Overall metric combining Recall and Precision
F1 Score
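The four evaluation metrics above can be computed with scikit-learn; the true/predicted labels below are illustrative values chosen so each metric is easy to check by hand.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true and predicted labels for a binary classifier.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: proportion of all predictions the model got right.
acc = accuracy_score(y_true, y_pred)    # 6/8 = 0.75
# Precision: of the predicted positives, how many were truly positive.
prec = precision_score(y_true, y_pred)  # 3/4 = 0.75
# Recall: of the actual positives, how many the model identified.
rec = recall_score(y_true, y_pred)      # 3/4 = 0.75
# F1: harmonic mean of precision and recall.
f1 = f1_score(y_true, y_pred)           # 0.75
print(acc, prec, rec, f1)
```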
a lazy learning algorithm that predicts the class of a data point based on the majority class of its k nearest neighbors
k-NN classifier
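A minimal k-NN sketch with scikit-learn's `KNeighborsClassifier`; the two well-separated 1-D groups are invented for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated groups of 1-D points.
X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# k-NN stores the training data ("lazy learning") and classifies a new
# point by majority vote among its k nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[2.5], [10.5]]))  # near group 0, near group 1
```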
predicts the probability that a given data point belongs to a particular class; uses the logistic function
Logistic Regression
an S-shaped curve used in logistic regression; maps any real number into the range (0, 1)
logistic function
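The logistic (sigmoid) function is short enough to write out directly; the sample inputs just demonstrate the S-shape.

```python
import numpy as np

def logistic(z):
    """The logistic (sigmoid) function: an S-shaped curve mapping any
    real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(logistic(0))   # 0.5 — the curve's midpoint
print(logistic(5))   # close to 1
print(logistic(-5))  # close to 0
```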
occurs when one class is significantly more frequent than the other
Class Imbalance
reducing the number of instances in the majority class by removing samples until the classes are balanced.
Undersampling
increasing the number of instances in the minority class by duplicating samples or generating new synthetic examples.
Oversampling
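Random oversampling can be sketched with scikit-learn's `resample` utility (SMOTE itself lives in the separate imbalanced-learn package); the imbalanced toy data is invented for illustration.

```python
from sklearn.utils import resample

# Imbalanced toy data: 6 majority-class rows, 2 minority-class rows.
majority = [[x, 0] for x in range(6)]
minority = [[x, 1] for x in range(2)]

# Random oversampling: duplicate minority samples (with replacement)
# until the minority class matches the majority class in size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = majority + minority_upsampled

print(len(minority_upsampled))  # 6 — classes are now balanced
```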
Generates synthetic samples for the minority class by interpolating between existing samples
SMOTE (Synthetic Minority Oversampling Technique)
Cons of Oversampling
Oversampling can cause overfitting, especially with random oversampling.
Cons of Undersampling
Important information from the majority class may be lost, potentially underfitting the model.
a measure of the relationship between two variables. If one variable increases when the other one also increases, the correlation is positive.
Correlation
means that changes in one variable cause another variable to change. It means one variable directly influences the other.
Causation
Measures the average magnitude of the errors in a set of predictions without considering their direction
Mean Absolute Error
Measures the average squared difference between actual and predicted values. Larger errors are penalized more.
Mean Squared Error
A popular metric because it has the same units as the target variable, making it easier to interpret
Root Mean Squared Error
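The three regression error metrics above can be computed directly with NumPy; the actual/predicted values are illustrative numbers chosen to be easy to verify by hand.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred        # [0.5, 0.0, -2.0, -1.0]

mae = np.mean(np.abs(errors))   # average magnitude, ignoring direction
mse = np.mean(errors ** 2)      # squaring penalizes larger errors more
rmse = np.sqrt(mse)             # back in the units of the target variable

print(mae, mse, rmse)
```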
standardizes features by making sure that each feature has a mean of 0 and a standard deviation of 1.
StandardScaler
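A minimal `StandardScaler` sketch on one invented feature column, confirming the scaled column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature column with mean 20 and a visible spread.
X = np.array([[10.0], [20.0], [30.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, the column has mean 0 and standard deviation 1.
print(X_scaled.ravel())
print(X_scaled.mean(), X_scaled.std())
```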
a model used for regression tasks, where the goal is to predict a continuous target variable based on input features. It works by splitting the data into different regions based on feature values, making predictions by averaging the target values in each region.
DecisionTreeRegressor
an ensemble model that averages the predictions from multiple different regression models to make a final prediction.
VotingRegressor
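The two regressors above can be combined in a short scikit-learn sketch; the data (y roughly 2x) and the choice of `LinearRegression` as the second model are illustrative assumptions.

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import VotingRegressor

# Toy regression data: y is roughly 2 * x.
X = [[1], [2], [3], [4], [5]]
y = [2.0, 4.1, 5.9, 8.2, 10.0]

# A decision tree splits the feature space into regions and predicts
# the average target value within each region.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)

# VotingRegressor averages the predictions of several regressors.
ensemble = VotingRegressor([("tree", tree), ("linear", LinearRegression())])
ensemble.fit(X, y)

print(ensemble.predict([[3]]))  # averaged prediction near y = 6
```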
this modeling technique trains multiple binary classifiers, each focusing on one class versus all others.
One Vs. Rest Classifier
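A one-vs-rest sketch using scikit-learn's `OneVsRestClassifier` wrapped around logistic regression; the three 1-D classes are invented for illustration.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Three classes of well-separated 1-D points.
X = [[1], [2], [10], [11], [20], [21]]
y = [0, 0, 1, 1, 2, 2]

# One-vs-rest trains one binary classifier per class (class k vs. all
# other classes) and predicts the class with the highest score.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, y)

print(len(ovr.estimators_))  # 3 — one binary classifier per class
print(ovr.predict([[1], [21]]))
```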
? is used for classification problems while ? is used for regression. The approach of both techniques is similar.
Random Forest Classifier, Decision Tree Regressor
a natural language processing approach used to determine whether the emotional tone of a piece of text is positive, negative, or neutral.
Sentiment Analysis
an automated technological process that converts an image of text into a machine-readable format. It is traditionally known as text recognition.
Optical Character Recognition
Layers of a Convolutional Neural Network
Convolutional Layer
Pooling Layer
Flatten Layer
Fully Connected (Dense) Layer
compares the performance of two versions to see which one performs better with users or viewers.
A/B Testing
the process of creating, sharing, and utilizing knowledge and information within an organization.
Knowledge Management
knowledge that can be easily codified into formats such as text, diagrams, or figures
Explicit knowledge
knowledge that is not formally documented but can be inferred from explicit knowledge and transferred into practical skills
Implicit knowledge
personal and often difficult to articulate, consisting of insights, experiences, and “know-how.”
Tacit knowledge
facilitates the knowledge management of an organization by capturing and organizing knowledge.
Knowledge Management Software
In-House or Captive Operations Pros and Cons
Pros:
- Intellectual Property Protection
- Ultimate Control
- Long-term Cost Savings
- Internal Expertise
Cons:
- High Initial Investment
- Operational Complexity
- Inflexibility
Outsourcing Pros and Cons
Pros:
- Flexibility
- Access to Varied Expertise
- Risk Mitigation
Cons:
- Quality Control
- Coordination Effort
- Costs can balloon if not managed well
How Cloud Computing and Big Data enable Machine Learning
- Cloud Computing
- provides the necessary infrastructure and computational power to process large datasets efficiently
- Big Data
- supplies the enormous and complex datasets that are crucial for training ML models
delivers resources over the internet, making it possible for any organization or user to access systems and services.
Public Cloud
the exact opposite of the public cloud deployment model: a one-to-one environment dedicated to a single customer or organization
Private Cloud
combines both private and public cloud models. With a hybrid solution, an organization may host applications in a safe environment while taking advantage of the cost savings of the public cloud.
Hybrid Cloud
a distributed system that is created by integrating the services of different clouds to address the specific needs of a community, industry, or business
Federated Cloud or Community Cloud
delivers on-demand infrastructure resources, such as compute, storage, networking, and virtualization
Infrastructure as a Service / IAAS
delivers and manages hardware and software resources for developing, testing, delivering, and managing cloud applications
Platform as a Service / PAAS
provides a full application stack as a service that customers can access and use.
Software as a Service / SAAS
Big Data characteristics
- Volume
- Sheer quantity of the data
- Velocity
- Speed in which the data is gathered
- Variety
- Type, nature, and source of data
- Veracity
- Data quality, pertaining to accuracy and reliability
- Value
- Data has actionable insights and patterns
means connecting devices with an on/off switch to the internet, enabling them to collect and share data.
Internet of Things
How Big Data and IoT revolutionized modern-day machine learning:
- Accuracy
- Larger datasets enable machine learning algorithms to identify more intricate patterns and relationships
- Reduced Overfitting
- With more data, models are less likely to overfit.
- Discovering Hidden Patterns
- Big data enables the discovery of subtle correlations and trends that might be missed in smaller datasets
- Deep Learning
- Deep learning models such as neural networks require massive amounts of data to learn complex representations
- Natural Language Processing
- NLP models, such as those used for language translation and sentiment analysis, benefit from large datasets of text and speech data.
the practice of protecting digital information from unauthorized access, corruption, or theft.
Data Security
A regulation of the European Union that establishes rules for the protection of personal data. It requires organizations to protect the privacy of EU residents and provides them with greater control over their personal data.
General Data Protection Regulation (GDPR)
the process of removing or altering personal information from data so that individuals cannot be easily identified.
De-identification
unsupervised learning task where the model groups similar data points together based on their features or attributes
Clustering
Applications of Clustering
- Customer Segmentation
- Image Segmentation
- Anomaly Detection
a widely used clustering algorithm that partitions a dataset into K clusters based on the similarity of data points. It is used in data mining and image processing applications.
K-Means Clustering
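A minimal K-Means sketch with scikit-learn; the two obvious 2-D groups are invented so the cluster assignment is easy to check.

```python
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points.
X = [[1, 1], [1, 2], [2, 1], [10, 10], [10, 11], [11, 10]]

# Partition the data into K=2 clusters; each point is assigned to the
# cluster with the nearest centroid.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)           # the two groups of three points get distinct labels
print(kmeans.inertia_)  # sum of squared distances to the nearest centroid
```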
builds a nested hierarchy of clusters; in its common agglomerative form, each data point starts as its own cluster and the two closest clusters are merged at each iteration until all data points belong to a single cluster.
Hierarchical Clustering
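The agglomerative form can be sketched with scikit-learn's `AgglomerativeClustering`; the four toy points form two nearby pairs.

```python
from sklearn.cluster import AgglomerativeClustering

# Two nearby pairs of 2-D points.
X = [[1, 1], [1, 2], [10, 10], [10, 11]]

# Agglomerative hierarchical clustering: start with every point as its
# own cluster and repeatedly merge the two closest clusters, stopping
# here once two clusters remain.
agg = AgglomerativeClustering(n_clusters=2)
labels = agg.fit_predict(X)

print(labels)  # each nearby pair ends up in the same cluster
```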
a widely used dimensionality reduction technique in machine learning and feature extraction.
Principal Component Analysis (PCA)
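A PCA sketch reducing invented 2-D data, which varies almost entirely along one diagonal direction, down to a single component.

```python
import numpy as np
from sklearn.decomposition import PCA

# 2-D data that varies almost entirely along one direction.
X = np.array([[1.0, 1.1], [2.0, 2.0], [3.0, 2.9], [4.0, 4.1], [5.0, 5.0]])

# Reduce to one principal component: the direction of maximum variance.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (5, 1) — one feature remains
print(pca.explained_variance_ratio_)  # close to 1.0 for this data
```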
a measure of how well the data points are clustered around the centroids; the sum of squared distances from each point to its assigned centroid (lower is better)
Inertia
measures how well each data point is assigned to its cluster by comparing its similarity to points in its own cluster (cohesion) versus points in the nearest other cluster (separation)
Silhouette Score
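The silhouette score can be computed with scikit-learn on a clustering result; the compact, well-separated toy groups should score close to 1.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two compact, well-separated groups of 2-D points.
X = [[1, 1], [1, 2], [2, 1], [10, 10], [10, 11], [11, 10]]

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette score ranges from -1 to 1; values near 1 mean points sit
# well inside their own cluster (cohesion) and far from the nearest
# other cluster (separation).
score = silhouette_score(X, labels)
print(score)
```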