Data Science Terminology - assorted topics Flashcards
Machine Learning Operations (MLOps)
A practice for collaboration and communication between data scientists and operations professionals to help manage production machine learning (ML) lifecycles. It seeks to provide a disciplined approach to manage and scale ML models, drawing on principles and practices from DevOps.
Machine Learning Model
A Machine Learning (ML) model is a mathematical or computational representation of real-world processes or patterns based on data. It’s built by training an algorithm on a set of data. There are various types of machine learning models, each suited to different tasks. Once a machine learning model is trained, it can be used to make predictions or decisions without being explicitly programmed to do so. For example, an ML model trained on email data might be able to predict whether a new email is spam or not based on its content. ML models are not perfect and their accuracy heavily depends on the quality and quantity of the data they are trained on, as well as the suitability of the algorithm used for the task at hand.
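A minimal sketch of the spam example above, assuming scikit-learn is available; the tiny "dataset" is invented purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win a free prize now", "meeting agenda for monday",
          "claim your free reward", "lunch with the team tomorrow"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (made-up labels)

# Train an algorithm (naive Bayes over word counts) on labeled data to get a model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# The trained model can now predict on new, unseen email text.
print(model.predict(["free prize waiting for you"]))  # likely [1]
```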
Data Science
Data science involves a blend of various tools, algorithms, and machine learning principles to extract patterns from raw data. It operates on the idea of using scientific methods, processes, and systems to gain insights from both structured and unstructured data.
Goal of ML Ops
The goal of MLOps is to create a streamlined process for managing and deploying ML models at scale, improving the efficiency, reproducibility, and reliability of ML systems. It provides a conceptual framework to bridge the gap between development and operations in the ML lifecycle.
Types of ML Models
Supervised, unsupervised, and reinforcement learning models.
Supervised Learning Models
These are trained on labeled data, i.e., data that includes both the input and the desired output. They are used for tasks like regression (predicting a continuous output) and classification (predicting a categorical output). Examples include linear regression, decision trees, support vector machines, and neural networks.
Unsupervised Learning Models
These models learn from unlabeled data, finding structure and relationships within the data itself. They are used for tasks like clustering (grouping similar inputs) and dimensionality reduction (simplifying input by removing redundant features). Examples include k-means clustering and principal component analysis (PCA).
Reinforcement Learning Models
These models learn by interacting with their environment, receiving rewards or penalties based on the actions they take. They are used for tasks where the model needs to make a series of decisions that lead to a final goal, like game playing or robot navigation.
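A minimal tabular Q-learning sketch on an invented 5-state "corridor": the agent starts at state 0 and is rewarded only for reaching state 4. The environment, reward, and hyperparameters are all made up for illustration.

```python
import random

n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.2

def step(state, action):
    nxt = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == n_states - 1 else 0.0), nxt == n_states - 1

def greedy(row):                      # best-known action, ties broken at random
    best = max(row)
    return random.choice([a for a, v in enumerate(row) if v == best])

for _ in range(300):                  # episodes
    s, done = 0, False
    for _ in range(100):              # cap episode length
        a = random.randrange(n_actions) if random.random() < epsilon else greedy(Q[s])
        s2, r, done = step(s, a)
        # Q-learning update: nudge Q(s, a) toward reward + discounted best future value.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
        if done:
            break

print(Q)  # "move right" should end up with the higher value in every state
```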
Unsupervised Learning
Unsupervised learning is a type of machine learning where an algorithm learns from unlabeled data. This means the algorithm is not given the correct output during training. Instead, it must discover patterns, relationships, or structure in the input data on its own.
Clustering
Clustering is used to group similar data points together based on their characteristics. The algorithm determines the similarities between data points and clusters them accordingly. K-means and hierarchical clustering are popular examples of clustering algorithms.
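A small k-means sketch, assuming scikit-learn; the blobs are synthetic data generated just for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit k-means with k=3 and read off the cluster assignment of each point.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])        # cluster index per data point
print(km.cluster_centers_)    # the 3 learned cluster centers
```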
Dimensionality Reduction
Dimensionality reduction is used to reduce the number of input features while retaining the essential information. This is often used to make the data more manageable, to remove redundant or irrelevant features, or for visualization purposes. Principal Component Analysis (PCA) is a popular dimensionality reduction technique.
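A short PCA sketch, assuming scikit-learn; it projects the 4-feature iris dataset down to 2 components for easier handling or visualization.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 samples x 4 features
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)               # 150 samples x 2 components

print(X_2d.shape)
print(pca.explained_variance_ratio_)      # share of variance each component retains
```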
Supervised Learning
Supervised learning is a type of machine learning where an algorithm learns a model from labeled training data. This means the algorithm is given input data along with the corresponding correct output. It uses this information to learn the relationship between the input and the output, which can then be used to predict the output for new, unseen input data.
Linear Regression
Linear regression is used to predict a continuous target variable based on one or more input features. The model assumes a linear relationship between the input and the output.
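A minimal linear-regression sketch, assuming scikit-learn; the data are made-up points lying roughly on y = 2x + 1 with a little noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.5, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)      # should be close to 2 and 1
print(model.predict([[5.0]]))             # prediction for a new input
```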
Decision Trees
Decision trees are used for both classification (predicting a categorical output) and regression (predicting a continuous output). They split the data into different branches based on feature values, allowing for more complex relationships between the input and the output.
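A brief decision-tree sketch, assuming scikit-learn and its bundled iris dataset; the same API pattern works for regression via DecisionTreeRegressor.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each split in the tree branches on a feature value.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(tree.score(X_test, y_test))         # classification accuracy on held-out data
```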
Neural Networks
Neural networks are complex models inspired by the human brain, capable of learning nonlinear relationships between the input and the output. They consist of layers of interconnected nodes or “neurons”, each of which applies a simple computation to the data. Deep learning, a subfield of machine learning, involves neural networks with many layers.
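A compact neural-network sketch using scikit-learn's MLPClassifier (a small multilayer perceptron) on the bundled digits dataset; the layer sizes are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers of interconnected "neurons" learn a nonlinear input-output mapping.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))
```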
Descriptive Statistics
These are basic metrics that summarize and describe the main features of a dataset. They include measures such as mean, median, mode, range, variance, standard deviation, and percentiles.
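A quick look at these measures with NumPy and the standard library; the values are invented.

```python
import numpy as np
from statistics import mode

data = np.array([12, 15, 15, 18, 21, 24, 30, 45])

print(np.mean(data))                 # mean
print(np.median(data))               # median
print(mode(data.tolist()))           # mode
print(np.ptp(data))                  # range (max - min)
print(np.var(data), np.std(data))    # variance and standard deviation
print(np.percentile(data, [25, 75])) # 25th and 75th percentiles
```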
Data Visualization
This involves using graphical representations of data to understand trends, patterns, and outliers in the data. Common tools include bar graphs, histograms, scatter plots, box plots, and heat maps.
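A matplotlib sketch of two of these plot types; the data are synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(values, bins=30)            # histogram: distribution of a single variable
ax1.set_title("Histogram")
ax2.scatter(x, y, s=10)              # scatter plot: relationship between two variables
ax2.set_title("Scatter plot")
plt.tight_layout()
plt.show()
```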
Data Cleaning
This involves dealing with missing values, removing duplicates, correcting errors, and handling outliers in the data. It’s a crucial step to ensure reliable results.
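A pandas sketch of these cleaning steps on an invented, deliberately messy table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, 32, np.nan, 32, 200],        # missing value and an implausible outlier
    "city": ["NY", "LA", "LA", "LA", "NY"],
})

df = df.drop_duplicates()                            # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())     # fill missing values
df = df[df["age"].between(0, 120)]                   # drop implausible outliers
print(df)
```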
Data Transformation
This involves converting data from one format or structure into another, such as normalizing numerical data, binning continuous variables, or encoding categorical variables.
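A sketch of two common transformations with pandas and scikit-learn: scaling a numerical column and one-hot encoding a categorical one. The columns are invented.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"income": [30_000, 55_000, 90_000],
                   "segment": ["basic", "premium", "basic"]})

# Normalize the numerical column to mean 0 and standard deviation 1.
df["income_scaled"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# One-hot encode the categorical column.
df = pd.get_dummies(df, columns=["segment"])
print(df)
```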
Feature Engineering
This involves creating new features from existing ones to improve the performance of machine learning models. Techniques might include polynomial features, interaction terms, or creating domain-specific features.
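A sketch of creating new features from existing ones, assuming scikit-learn and pandas; the columns and the "area" feature are invented examples.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"length": [2.0, 3.0, 5.0], "width": [1.0, 4.0, 2.0]})

# Domain-specific feature: area, derived from two raw features.
df["area"] = df["length"] * df["width"]

# Polynomial and interaction terms generated automatically.
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(df[["length", "width"]])
print(poly.get_feature_names_out())   # names of the generated features
print(expanded)
```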
Hypothesis Testing
This is a statistical method for making decisions from experimental data. It involves formulating a null hypothesis and an alternative hypothesis, then using a test statistic (and its p-value) to decide whether to reject or fail to reject the null hypothesis.
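A two-sample t-test sketch with SciPy; the two groups are synthetic, and the null hypothesis is that they share the same mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=53, scale=5, size=40)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)
# At a significance level of 0.05, reject the null hypothesis if p_value < 0.05.
```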
Regression Analysis
This technique is used for predicting a continuous outcome variable based on one or more input variables.
Classification
This technique is used to predict a categorical outcome variable based on one or more input variables.
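A short classification sketch using logistic regression on the breast-cancer dataset bundled with scikit-learn; the target is a binary category.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(clf.score(X_test, y_test))      # fraction of correct class predictions
```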
Clustering
This unsupervised learning method groups data points together based on the similarity of their features.