Final Terms & Concepts List Flashcards
Covers the bulk of the terms and concepts taught in the class, based on the readings.
Data Science
An interdisciplinary field using scientific methods, processes, algorithms, and systems to extract knowledge or insights from data in various forms.
Big Data
The collection and storage of very large volumes of data, which data science then refines to produce insights.
Qualitative Data
A descriptive piece of information. Example: ‘What a nice day it is’.
Quantitative Data
Numerical information. Examples: ‘1’, ‘3.65’. Can be further divided into discrete and continuous data.
Discrete Data
Quantitative data that can take only specific, countable values. Example: ‘Number of months in a year’.
Continuous Data
Quantitative data that can take any value within an interval. Example: ‘The amount of oxygen in the atmosphere’.
Artificial Intelligence (AI)
Creating machines that perform tasks which would require intelligence if performed by humans.
Machine Learning (ML)
A field of AI where systems learn from data without explicit programming.
Supervised Learning
A type of machine learning where an algorithm learns from a labeled dataset.
Unsupervised Learning
A type of machine learning where an algorithm learns from an unlabeled dataset.
Regression Algorithms
Algorithms that predict the numerical value of a variable based on historical data.
Classification Algorithms
Algorithms that predict which class a data point belongs to.
Anomaly Detection Algorithms
Algorithms used to find outliers or anomalies in data.
Cloud Computing
The on-demand delivery of computing resources via the internet with pay-as-you-go pricing.
Public Cloud
Services provided over a public network and available to anyone.
Private Cloud
Infrastructure dedicated to a single organization, located on-premises or off-premises.
Hybrid Cloud
Combines public and private cloud environments, allowing data sharing between them.
Community Cloud
Infrastructure and services shared among a specific community or group of organizations.
Infrastructure as a Service (IaaS)
Provides virtualized computing resources over the internet.
Platform as a Service (PaaS)
Offers a platform for developing, testing, and deploying applications.
Software as a Service (SaaS)
Delivers software applications over the internet on a subscription basis.
Total Cost of Ownership (TCO)
A financial estimate that helps identify the direct and indirect costs of a system.
AWS Cloud Adoption Framework (CAF)
Provides guidance and best practices to build a comprehensive approach to cloud computing.
Industry 4.0
The trend toward greater automation and data exchange in manufacturing technologies.
Cognitive Biases
Systematic patterns of deviation from norm or rationality in judgment.
Data Visualization
The graphical representation of information and data.
Agile Process
Focuses on delivering the highest business value in the shortest time through iterative development.
Scrum
An agile framework for managing and controlling complex projects.
Sprint
A short, time-boxed period (typically 2-4 weeks) during which a Scrum Team works to complete a set amount of work.
Product Owner
Responsible for maximizing the value of the product and managing the Product Backlog.
ScrumMaster
A facilitator for the Scrum Team, ensuring they follow Scrum practices and removing impediments.
Scrum Team
A self-organizing, cross-functional group responsible for delivering the product increment.
Product Backlog
An ordered list of everything that might be needed in the product.
Sprint Backlog
The set of Product Backlog items selected for a Sprint, plus a plan for delivering the Sprint Goal.
Burndown Chart
A visual representation of the progress of work remaining in a Sprint or Project.
AI Governance
Policies and frameworks ensuring AI development aligns with ethical and legal guidelines.
AWS Billing & Cost Management
AWS tools for tracking and optimizing service costs.
AWS Organizations
A management service for grouping AWS accounts and applying governance policies.
AWS Pricing Calculator
A tool to estimate AWS costs before deployment.
Agile Manifesto
A set of values prioritizing individuals and interactions, working software, customer collaboration, and responding to change.
Artificial Neural Networks (ANNs)
Computing systems inspired by biological neural networks, used in deep learning.
Association Rule Learning
Identifying relationships or patterns in large datasets, commonly used in market basket analysis.
Bias-Variance Tradeoff
The balance between error from overly simple models (high bias, underfitting) and error from overly complex models (high variance, overfitting).
Blockchain in AI
The integration of blockchain for secure, verifiable AI applications.
Chart Types
Various data visualization charts like bar charts, scatter plots, line graphs, pie charts, etc.
Clustering
A technique that groups data points based on similarity, commonly used in unsupervised learning.
Confirmation Bias
The tendency to search for, interpret, or recall information that confirms pre-existing beliefs.
Contrast Principle in Visualization
Using contrast (before/after, with/without) to improve clarity in data visualizations.
Cyber-Physical Systems (CPS)
Systems integrating computation with physical processes, essential in Industry 4.0.
Daily Scrum (Standup Meeting)
A short daily meeting where a Scrum team discusses progress and obstacles.
Decision Trees
A tree-like structure used for classification and regression tasks in ML.
Declarative Visualization
Presenting data insights clearly and effectively to inform decision-making.
Deep Learning
A subset of ML using neural networks with multiple layers to learn representations of data.
Edge Computing
Processing data near the source rather than relying on centralized cloud servers.
Ethical AI
Ensuring AI systems are fair, accountable, and do not reinforce biases.
Explainable AI (XAI)
AI models designed to provide transparency and interpretability in decision-making.
Exploratory Visualization
Used to discover insights in data rather than just presenting information.
Fog Computing
A decentralized computing model that extends cloud services closer to IoT devices.
Frequency Illusion
The tendency to notice something more often after first becoming aware of it, creating the impression that it occurs more frequently.
Generative AI
AI models that create new content, such as images, text, and audio (e.g., ChatGPT, DALL·E).
Groupthink
A group dynamic in which the desire for consensus discourages independent thinking and dissent.
Halo Effect
The tendency to let an overall impression of someone influence how we judge their specific traits.
Hyperparameter Tuning
The process of optimizing the settings that govern an algorithm's training (e.g., learning rate, tree depth) to improve performance.
Instance-Based Learning
Algorithms that make predictions based on stored instances rather than general models.
K-Means Clustering
A clustering algorithm that partitions data into K clusters by assigning each point to its nearest centroid.
Mist Computing
A decentralized computing model where data processing occurs directly on devices (e.g., sensors).
Model Drift
The degradation of an AI model’s performance over time due to changing data patterns.
Neuromarketing & Visualization
How the brain processes visual information for decision-making in marketing and design.
Optimism Bias
The belief that we are less likely to experience negative events compared to others.
Pattern Recognition Bias
The tendency to see patterns where none exist.
Predictive Maintenance
AI-driven maintenance strategies that anticipate and prevent equipment failures.
Reinforcement Learning
A type of ML where an agent learns by interacting with an environment and receiving rewards.
Reptilian Brain & Visualization
The role of fast, instinctive visual processing in human cognition.
Robotic Process Automation (RPA)
Software that automates repetitive tasks previously done by humans.
Supply Chain Management (SCM) & AI
The use of AI to optimize supply chains, logistics, and demand forecasting.
Scaling Scrum (Scrum of Scrums)
A method for coordinating multiple Scrum teams in larger organizations.
Serverless Computing
A cloud model where developers deploy code without managing infrastructure.
Smart Factories
Manufacturing environments that leverage AI, IoT, and automation for efficiency.
Sprint Planning
A Scrum meeting where the team selects work to complete in an upcoming sprint.
Sprint Retrospective
A Scrum meeting to reflect on a completed sprint and improve future work.
Sprint Review
A Scrum meeting where the team presents what they accomplished during the sprint.
What is RapidMiner and what is it used for?
RapidMiner is a data science platform used for machine learning, data preprocessing, and predictive analytics. It allows users to design workflows using a visual, drag-and-drop interface without extensive programming knowledge.
Explanation:
* Used for data cleaning, model training, validation, and deployment.
* Supports both supervised (classification, regression) and unsupervised learning (clustering, anomaly detection).
What are the main components of a RapidMiner process?
Operators, Connections, Repositories, Parameters Panel, Results View
* Operators → The building blocks that perform tasks (e.g., “Read CSV,” “Normalize Data,” “Apply Model”).
* Connections → The arrows linking operators that define workflow execution order.
* Repositories → Where datasets, models, and results are stored.
* Parameters Panel → Where you set options for each operator.
* Results View → Displays output after execution (tables, charts, performance metrics).
What is the difference between “Operators” and “Processes” in RapidMiner?
- Operators are individual tasks (e.g., “Normalize”, “Decision Tree”).
- Processes are full workflows consisting of multiple operators linked together.
How do you import and load data into RapidMiner?
- Using “Read CSV” or “Read Excel” operators for files.
- Connecting to databases using the “Read Database” operator, or loading stored datasets from the repository with the “Retrieve” operator.
- Manually entering data via the “Create ExampleSet” operator.
What are common data preprocessing steps in RapidMiner?
- Handling Missing Values → Use “Replace Missing Values” to fill with mean, median, or mode.
- Feature Selection → Remove irrelevant attributes using “Select Attributes.”
- Normalization & Standardization → Use “Normalize” to scale numerical data.
- Handling Categorical Variables → Use “Nominal to Numerical” for model compatibility.
- Splitting Data → Use “Split Data” to create training and test sets.
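RapidMiner chains these steps as visual operators; purely as a rough illustration, here is a minimal Python sketch of the same pipeline (pandas and scikit-learn assumed as stand-ins, with a hypothetical churn.csv file and customer_id column):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Load data (hypothetical file, stands in for "Read CSV")
df = pd.read_csv("churn.csv")

# "Replace Missing Values": fill numeric gaps with the column mean
df = df.fillna(df.mean(numeric_only=True))

# "Select Attributes": drop an irrelevant column (hypothetical name)
df = df.drop(columns=["customer_id"])

# "Nominal to Numerical": one-hot encode categorical variables
df = pd.get_dummies(df)

# "Normalize": min-max scale every feature into [0, 1]
scaled = MinMaxScaler().fit_transform(df)

# "Split Data": 70/30 train/test split
train, test = train_test_split(scaled, train_size=0.7, random_state=42)
```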
What is the difference between Normalization and Standardization in RapidMiner?
- Normalization (Min-Max Scaling) → Rescales data between 0 and 1.
- Standardization (Z-Score Scaling) → Centers data around mean 0 with unit variance.
Explanation:
* Normalization is best for bounded data (e.g., images with pixel values 0-255).
* Standardization is useful for normally distributed data.
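A small numeric illustration of the two rescalings (NumPy assumed, not part of the course tooling):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Normalization (min-max): rescale into the [0, 1] interval
x_norm = (x - x.min()) / (x.max() - x.min())  # [0.0, 0.333..., 0.666..., 1.0]

# Standardization (z-score): mean 0, unit variance
x_std = (x - x.mean()) / x.std()              # mean(x) = 5, std(x) ≈ 2.236

print(x_norm, x_std)
```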
How do you handle missing values in RapidMiner?
- Remove missing values if they are few and not critical.
- Replace missing values using mean, median, mode, or predictive models.
- Interpolate missing values using nearest neighbor or regression.
Why is this important: Missing data can bias models, so it must be handled appropriately.
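A sketch of the three strategies using pandas as an assumed analogue to RapidMiner's “Replace Missing Values”:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50, 60, np.nan, 55]})

dropped = df.dropna()                             # remove rows with missing values
filled = df.fillna(df.median(numeric_only=True))  # replace with column median
interpolated = df.interpolate()                   # estimate from neighboring values
```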
What does the “Generate Attributes” operator do?
It creates new features based on existing data using mathematical or logical expressions.
Example: if([Age] > 30, “Senior”, “Junior”)
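The same derived attribute, sketched with pandas/NumPy as an assumed equivalent:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [25, 42, 31, 58]})

# Equivalent of Generate Attributes: if([Age] > 30, "Senior", "Junior")
df["Seniority"] = np.where(df["Age"] > 30, "Senior", "Junior")
```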
What is the purpose of splitting data into training and testing sets?
- Training Set → Used to train the model.
- Testing Set → Used to evaluate performance on unseen data.
Explanation: Prevents overfitting by ensuring the model generalizes well to new data.
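A minimal scikit-learn sketch of the split (the iris dataset is just a placeholder):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data for testing; stratify keeps class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```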
What does the “Cross-Validation” operator do?
It divides the data into K subsets, trains on K-1 of them, and tests on the remaining one, repeating K times so that each subset serves as the test set once.
Explanation:
* Prevents overfitting.
* More robust evaluation than a single train/test split.
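A minimal K-fold cross-validation sketch in scikit-learn (assumed analogue; K = 10 here):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold CV: train on 9 folds, test on the 10th, repeat 10 times
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=10)
print(scores.mean(), scores.std())
```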
What is the difference between Supervised and Unsupervised Learning?
- Supervised Learning → Uses labeled data (e.g., classification, regression).
- Unsupervised Learning → Finds patterns in unlabeled data (e.g., clustering, anomaly detection).
What is K-Means Clustering and when should you use it?
K-Means is an unsupervised algorithm that groups data into K clusters based on similarity.
Explanation: Used for customer segmentation, anomaly detection, and exploratory data analysis.
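A minimal K-Means sketch in scikit-learn (assumed; toy 2-D data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two visually separated groups
X = np.array([[1, 1], [1.5, 2], [1, 1.5],
              [8, 8], [9, 8.5], [8.5, 9]])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # the two learned centroids
```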
What is a Decision Tree and how do you interpret its output?
A Decision Tree splits data into branches based on feature conditions to make predictions.
* Root Node: Initial split based on the most informative feature.
* Leaf Node: Final classification/prediction.
Explanation: Simple, interpretable models useful for classification and regression.
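A sketch of training and inspecting a small tree in scikit-learn (assumed), showing the root-to-leaf structure described above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Print the learned splits: the root split comes first; leaves carry the class
print(export_text(tree, feature_names=load_iris().feature_names))
```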
What does a Confusion Matrix show in classification?
It shows the number of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
* Accuracy = (TP + TN) / Total Predictions
* Precision = TP / (TP + FP)
* Recall = TP / (TP + FN)
Explanation: Helps evaluate classification performance beyond just accuracy.
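The three formulas, checked on a tiny hand-made example (scikit-learn assumed):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, tn, fn)                  # 3, 1, 3, 1
print(accuracy_score(y_true, y_pred))  # (TP + TN) / total = 6/8 = 0.75
print(precision_score(y_true, y_pred)) # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))    # TP / (TP + FN) = 3/4
```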
BIG IDEA: What is Overfitting and how do you prevent it?
Overfitting → Model learns noise instead of patterns, leading to poor performance on new data.
Prevention:
* Pruning Decision Trees (limit depth).
* Regularization (L1/L2 penalties).
* Using simpler models or ensemble learning.
Explanation: Overfit models perform well on training data but fail on unseen data.
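One way to see the effect of pruning, sketched with scikit-learn (assumed): compare the cross-validated scores of an unconstrained tree against a depth-limited one.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# An unconstrained tree can memorize noise; pruning (max_depth) constrains it
for depth in [None, 3]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    print(depth, cross_val_score(tree, X, y, cv=5).mean())
```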
What is the ROC Curve and how do you interpret it?
- ROC Curve plots True Positive Rate vs. False Positive Rate.
- AUC (Area Under the Curve):
  - AUC = 1.0 → Perfect model.
  - AUC = 0.5 → Random guessing.
Explanation: Higher AUC means better classification performance.
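A minimal AUC computation in scikit-learn (assumed; four hand-made predictions):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]  # the model's predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(roc_auc_score(y_true, y_scores))  # 0.75 here; 1.0 = perfect, 0.5 = random
```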
What is the difference between Bagging and Boosting?
- Bagging (e.g., Random Forest) → Reduces variance by training multiple models independently and averaging results.
- Boosting (e.g., XGBoost) → Reduces bias by training models sequentially, improving weak learners.
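A side-by-side sketch in scikit-learn (assumed; GradientBoostingClassifier stands in for XGBoost, which is a separate library):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: many independent trees on bootstrap samples, results averaged
bagger = RandomForestClassifier(n_estimators=100, random_state=42)

# Boosting: trees built sequentially, each correcting the previous one's errors
booster = GradientBoostingClassifier(n_estimators=100, random_state=42)

for model in (bagger, booster):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```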