Final Terms & Concepts List Flashcards

Covers the bulk of the terms and concepts taught in the class, based on the readings.

1
Q

Data Science

A

An interdisciplinary field using scientific methods, processes, algorithms, and systems to extract knowledge or insights from data in various forms.

2
Q

Big Data

A

Large, complex volumes of data that are extracted and stored, and that data science then refines to produce insights.

3
Q

Qualitative Data

A

A descriptive, non-numerical piece of information. Example: ‘What a nice day it is’.

4
Q

Quantitative Data

A

Numerical information. Example: ‘1’, ‘3.65’. Can be further divided into discrete and continuous data.

5
Q

Discrete Data

A

Quantitative data that can take only specific, countable values. Example: ‘Number of months in a year’.

6
Q

Continuous Data

A

Quantitative data that can be any value in an interval. Example: ‘The amount of oxygen in the atmosphere’.

7
Q

Artificial Intelligence (AI)

A

Creating machines with capabilities that would require intelligence if they were performed by humans.

8
Q

Machine Learning (ML)

A

A field of AI in which systems learn from data without being explicitly programmed.

9
Q

Supervised Learning

A

A type of machine learning where an algorithm learns from a labeled dataset.

10
Q

Unsupervised Learning

A

A type of machine learning where an algorithm learns from an unlabeled dataset.

11
Q

Regression Algorithms

A

Algorithms that predict numerical values of a variable based on historic data.

12
Q

Classification Algorithms

A

Algorithms that predict which class data belongs to.

13
Q

Anomaly Detection Algorithms

A

Algorithms used to find outliers or anomalies in data.

14
Q

Cloud Computing

A

The on-demand delivery of computing resources via the internet with pay-as-you-go pricing.

15
Q

Public Cloud

A

Services provided over a public network and available to anyone.

16
Q

Private Cloud

A

Infrastructure dedicated to a single organization, located on-premises or off-premises.

17
Q

Hybrid Cloud

A

Combines public and private cloud environments, allowing data sharing between them.

18
Q

Community Cloud

A

Infrastructure and services shared among a specific community or group of organizations.

19
Q

Infrastructure as a Service (IaaS)

A

Provides virtualized computing resources over the internet.

20
Q

Platform as a Service (PaaS)

A

Offers a platform for developing, testing, and deploying applications.

21
Q

Software as a Service (SaaS)

A

Delivers software applications over the internet on a subscription basis.

22
Q

Total Cost of Ownership (TCO)

A

A financial estimate that helps identify the direct and indirect costs of a system.

23
Q

AWS Cloud Adoption Framework (CAF)

A

Provides guidance and best practices to build a comprehensive approach to cloud computing.

24
Q

Industry 4.0

A

The trend toward increasing automation and data exchange in manufacturing technologies.

25
Q

Cognitive Biases

A

Systematic patterns of deviation from norm or rationality in judgment.

26
Q

Data Visualization

A

The graphical representation of information and data.

27
Q

Agile Process

A

Focuses on delivering the highest business value in the shortest time through iterative development.

28
Q

Scrum

A

An agile framework for managing and controlling complex projects.

29
Q

Sprint

A

A short, time-boxed period (typically 2-4 weeks) during which a Scrum Team works to complete a set amount of work.

30
Q

Product Owner

A

Responsible for maximizing the value of the product and managing the Product Backlog.

31
Q

ScrumMaster

A

A facilitator for the Scrum Team, ensuring they follow Scrum practices and removing impediments.

32
Q

Scrum Team

A

A self-organizing, cross-functional group responsible for delivering the product increment.

33
Q

Product Backlog

A

An ordered list of everything that might be needed in the product.

34
Q

Sprint Backlog

A

The set of Product Backlog items selected for a Sprint, plus a plan for delivering the Sprint Goal.

35
Q

Burndown Chart

A

A visual representation of the progress of work remaining in a Sprint or Project.

72
Q

AI Governance

A

Policies and frameworks ensuring AI development aligns with ethical and legal guidelines.

73
Q

AWS Billing & Cost Management

A

AWS tools for tracking and optimizing service costs.

74
Q

AWS Organizations

A

A management service for grouping AWS accounts and applying governance policies.

75
Q

AWS Pricing Calculator

A

A tool to estimate AWS costs before deployment.

76
Q

Agile Manifesto

A

A set of values prioritizing individuals, working software, customer collaboration, and responsiveness to change.

77
Q

Artificial Neural Networks (ANNs)

A

Computing systems inspired by biological neural networks, used in deep learning.

78
Q

Association Rule Learning

A

Identifying relationships or patterns in large datasets, commonly used in market basket analysis.

79
Q

Bias-Variance Tradeoff

A

The balance between a model that is too simple (high bias, underfitting) and one that is too complex (high variance, overfitting).

80
Q

Blockchain in AI

A

The integration of blockchain for secure, verifiable AI applications.

81
Q

Chart Types

A

Various data visualization charts like bar charts, scatter plots, line graphs, pie charts, etc.

82
Q

Clustering

A

A technique that groups data points based on similarity, commonly used in unsupervised learning.

83
Q

Confirmation Bias

A

The tendency to search for, interpret, or recall information that confirms pre-existing beliefs.

84
Q

Contrast Principle in Visualization

A

Using contrast (before/after, with/without) to improve clarity in data visualizations.

85
Q

Cyber-Physical Systems (CPS)

A

Systems integrating computation with physical processes, essential in Industry 4.0.

86
Q

Daily Scrum (Standup Meeting)

A

A short daily meeting where a Scrum team discusses progress and obstacles.

87
Q

Decision Trees

A

A tree-like structure used for classification and regression tasks in ML.

88
Q

Declarative Visualization

A

Presenting data insights clearly and effectively to inform decision-making.

89
Q

Deep Learning

A

A subset of ML using neural networks with multiple layers to learn representations of data.

90
Q

Edge Computing

A

Processing data near the source rather than relying on centralized cloud servers.

91
Q

Ethical AI

A

Ensuring AI systems are fair, accountable, and do not reinforce biases.

92
Q

Explainable AI (XAI)

A

AI models designed to provide transparency and interpretability in decision-making.

93
Q

Exploratory Visualization

A

Used to discover insights in data rather than just presenting information.

94
Q

Fog Computing

A

A decentralized computing model that extends cloud services closer to IoT devices.

95
Q

Frequency Illusion

A

The illusion that something appears more frequently after you first notice it.

96
Q

Generative AI

A

AI models that create new content, such as images, text, and audio (e.g., ChatGPT, DALL·E).

97
Q

Groupthink

A

The practice of making decisions as a group in a way that discourages independent thinking.

98
Q

Halo Effect

A

The tendency to let an overall impression of someone influence how we judge their specific traits.

99
Q

Hyperparameter Tuning

A

The process of optimizing a learning algorithm's hyperparameters (settings chosen before training rather than learned from the data) to improve performance.

100
Q

Instance-Based Learning

A

Algorithms that make predictions based on stored instances rather than general models.

101
Q

K-Means Clustering

A

A clustering algorithm that partitions data into K clusters by assigning each point to the nearest cluster centroid.

102
Q

Mist Computing

A

A decentralized computing model where data processing occurs directly on devices (e.g., sensors).

103
Q

Model Drift

A

The degradation of an AI model’s performance over time due to changing data patterns.

104
Q

Neuromarketing & Visualization

A

How the brain processes visual information for decision-making in marketing and design.

105
Q

Optimism Bias

A

The belief that we are less likely to experience negative events compared to others.

106
Q

Pattern Recognition Bias

A

The tendency to see patterns where none exist.

107
Q

Predictive Maintenance

A

AI-driven maintenance strategies that anticipate and prevent equipment failures.

108
Q

Reinforcement Learning

A

A type of ML where an agent learns by interacting with an environment and receiving rewards.

109
Q

Reptilian Brain & Visualization

A

The role of fast, instinctive visual processing in human cognition.

110
Q

Robotic Process Automation (RPA)

A

Software that automates repetitive tasks previously done by humans.

111
Q

SCM & AI

A

The use of AI to optimize supply chains, logistics, and demand forecasting.

112
Q

Scaling Scrum (Scrum of Scrums)

A

A method for coordinating multiple Scrum teams in larger organizations.

113
Q

Serverless Computing

A

A cloud model where developers deploy code without managing infrastructure.

114
Q

Smart Factories

A

Manufacturing environments that leverage AI, IoT, and automation for efficiency.

115
Q

Sprint Planning

A

A Scrum meeting where the team selects work to complete in an upcoming sprint.

116
Q

Sprint Retrospective

A

A Scrum meeting to reflect on a completed sprint and improve future work.

117
Q

Sprint Review

A

A Scrum meeting where the team presents what they accomplished during the sprint.

118
Q

What is RapidMiner and what is it used for?

A

RapidMiner is a data science platform used for machine learning, data preprocessing, and predictive analytics. It allows users to design workflows using a visual, drag-and-drop interface without extensive programming knowledge.

Explanation:
* Used for data cleaning, model training, validation, and deployment.
* Supports both supervised (classification, regression) and unsupervised learning (clustering, anomaly detection).

119
Q

What are the main components of a RapidMiner process?

A

Operators, Connections, Repositories, Parameters Panel, Results View

* Operators → The building blocks that perform tasks (e.g., “Read CSV,” “Normalize,” “Apply Model”).
* Connections → The lines linking operators that define data flow and workflow execution order.
* Repositories → Where datasets, models, and results are stored.
* Parameters Panel → Where you set options for each operator.
* Results View → Displays output after execution (tables, charts, performance metrics).

120
Q

What is the difference between “Operators” and “Processes” in RapidMiner?

A

  • Operators are individual tasks (e.g., “Normalize”, “Decision Tree”).
  • Processes are full workflows consisting of multiple operators linked together.

121
Q

How do you import and load data into RapidMiner?

A

  • Using “Read CSV” or “Read Excel” operators for files.
  • Connecting to databases using the “Read Database” operator.
  • Loading data already stored in a repository using the “Retrieve” operator.
  • Manually entering data via the “Create ExampleSet” operator.

122
Q

What are common data preprocessing steps in RapidMiner?

A

  1. Handling Missing Values → Use “Replace Missing Values” to fill with mean, median, or mode.
  2. Feature Selection → Remove irrelevant attributes using “Select Attributes.”
  3. Normalization & Standardization → Use “Normalize” to scale numerical data.
  4. Handling Categorical Variables → Use “Nominal to Numerical” for model compatibility.
  5. Splitting Data → Use “Split Data” to create training and test sets.

123
Q

What is the difference between Normalization and Standardization in RapidMiner?

A

  • Normalization (Min-Max Scaling) → Rescales data between 0 and 1.
  • Standardization (Z-Score Scaling) → Centers data around mean 0 with unit variance.

Explanation:
* Normalization is best for bounded data (e.g., images with pixel values 0-255).
* Standardization is useful for normally distributed data.
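
The two scalings can be sketched in plain Python. This is a minimal illustration outside RapidMiner, not the “Normalize” operator itself; the function names and the sample `ages` list are assumptions made for the example.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score_standardize(values):
    """Shift to mean 0 and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

ages = [20, 30, 40, 50, 60]  # hypothetical sample column
normalized = min_max_normalize(ages)      # values now lie in [0, 1]
standardized = z_score_standardize(ages)  # values now have mean 0, std 1
```

Min-max scaling depends only on the extremes, so a single outlier compresses every other value; z-scores are less sensitive to that, which is one reason to prefer them on roughly bell-shaped data.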

124
Q

How do you handle missing values in RapidMiner?

A

  • Remove examples with missing values if they are few and not critical.
  • Replace missing values using the mean, median, mode, or a predictive model.
  • Impute missing values using nearest-neighbor or regression methods.

Why this is important: Missing data can bias models, so it must be handled appropriately.
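
A rough Python sketch of the first two strategies (dropping and filling). It is an illustration only, not RapidMiner's operators; `drop_missing`, `fill_missing`, and the `incomes` data are hypothetical, and missing entries are assumed to be represented as `None`.

```python
from statistics import mean, median

def drop_missing(rows):
    """Keep only rows (dicts) that contain no None values."""
    return [row for row in rows if all(v is not None for v in row.values())]

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

incomes = [40, None, 60, 100, None]  # hypothetical column with gaps
filled_mean = fill_missing(incomes)
filled_median = fill_missing(incomes, "median")
```

The median is often the safer fill value for skewed columns, since a few large values pull the mean upward.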

125
Q

What does the “Generate Attributes” operator do?

A

It creates new features based on existing data using mathematical or logical expressions.
Example: if([Age] > 30, "Senior", "Junior")
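
For comparison, the same derived attribute can be sketched in Python. This only mirrors the logic of the expression above; the function name, the `AgeGroup` column name, and the sample rows are assumptions for illustration.

```python
def generate_age_group(rows):
    """Return copies of the rows with a derived 'AgeGroup' column added."""
    return [
        {**row, "AgeGroup": "Senior" if row["Age"] > 30 else "Junior"}
        for row in rows
    ]

people = [{"Name": "Ana", "Age": 42}, {"Name": "Ben", "Age": 25}]  # hypothetical data
with_groups = generate_age_group(people)
```

The original rows are left untouched, which matches the idea of generating a new attribute rather than overwriting an existing one.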

126
Q

What is the purpose of splitting data into training and testing sets?

A

  • Training Set → Used to train the model.
  • Testing Set → Used to evaluate performance on unseen data.

Explanation: Evaluating on held-out data exposes overfitting and shows whether the model generalizes to new data.
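
A minimal sketch of a shuffled train/test split in plain Python. The 70/30 ratio, the seed, and the name `split_data` are illustrative assumptions, not RapidMiner defaults.

```python
import random

def split_data(rows, test_ratio=0.3, seed=42):
    """Shuffle a copy of the rows, then cut it into (train, test)."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

examples = list(range(10))  # stand-in for ten data rows
train, test = split_data(examples)
```

Every row ends up in exactly one of the two sets, so the test set really is unseen during training.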

127
Q

What does the “Cross-Validation” operator do?

A

It divides data into K subsets, trains on K-1 parts, and tests on the remaining part, repeating K times.

Explanation:
* Helps detect overfitting, since every example is used for testing exactly once.
* Gives a more robust evaluation than a single train/test split.
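
The fold logic can be sketched as an index generator. This is an assumption-laden illustration (no shuffling or stratification, hypothetical function name), not how the RapidMiner operator is implemented.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))  # 5 folds over 10 examples
```

In practice a model would be trained and scored once per fold, and the K scores averaged to give the final performance estimate.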

128
Q

What is the difference between Supervised and Unsupervised Learning?

A

  • Supervised Learning → Uses labeled data (e.g., classification, regression).
  • Unsupervised Learning → Finds patterns in unlabeled data (e.g., clustering, anomaly detection).

129
Q

What is K-Means Clustering and when should you use it?

A

K-Means is an unsupervised algorithm that groups data into K clusters based on similarity.

Explanation: Used for customer segmentation, anomaly detection, and exploratory data analysis.
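
A minimal sketch of the assign-then-update loop behind K-Means, restricted to one-dimensional data to keep it short. The starting centroids are passed in explicitly (real implementations usually pick them randomly), and the function name and `spend` values are hypothetical.

```python
def k_means_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

spend = [1, 2, 3, 10, 11, 12]  # hypothetical customer spend values
centers, groups = k_means_1d(spend, centroids=[1, 12])
```

K must be chosen up front, and the result can depend on the starting centroids, so it is common to try several values of K and several initializations.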

130
Q

What is a Decision Tree and how do you interpret its output?

A

A Decision Tree splits data into branches based on feature conditions to make predictions.
* Root Node: Initial split based on the most informative feature.
* Leaf Node: Final classification/prediction.

Explanation: Simple, interpretable models useful for classification and regression.
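
Reading a tree amounts to following one path from the root to a leaf. The sketch below uses a hand-written, hypothetical tree (the features, thresholds, and labels are invented for illustration, not learned from data) to show how a prediction and its explanation fall out of that walk.

```python
# A hypothetical, hand-written tree: internal nodes test one feature against a
# threshold; leaves carry the final prediction.
tree = {
    "feature": "income", "threshold": 50,       # root node
    "left": {"leaf": "reject"},                 # income <= 50
    "right": {                                  # income > 50
        "feature": "age", "threshold": 30,
        "left": {"leaf": "review"},             # age <= 30
        "right": {"leaf": "approve"},           # age > 30
    },
}

def predict(node, example):
    """Walk from the root to a leaf, returning the leaf label and the path taken."""
    path = []
    while "leaf" not in node:
        feature, threshold = node["feature"], node["threshold"]
        go_right = example[feature] > threshold
        path.append(f"{feature} {'>' if go_right else '<='} {threshold}")
        node = node["right"] if go_right else node["left"]
    return node["leaf"], path

label, path = predict(tree, {"income": 80, "age": 45})
```

The recorded path is the human-readable rule behind the prediction, which is what makes decision trees easy to interpret.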

131
Q

What does a Confusion Matrix show in classification?

A

It shows the number of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
* Accuracy = (TP + TN) / Total Predictions
* Precision = TP / (TP + FP)
* Recall = TP / (TP + FN)

Explanation: Helps evaluate classification performance beyond just accuracy.
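
The three formulas above can be checked with a short Python sketch for the binary case. The function names and the sample label lists are assumptions for illustration; undefined ratios (zero denominators) are returned as 0.0 by choice.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN, FN for a binary classification."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, fp, tn, fn

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, and recall from the four counts (0.0 when undefined)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

actual = [1, 1, 1, 0, 0, 0, 0, 1]     # hypothetical true labels
predicted = [1, 0, 1, 0, 1, 0, 0, 1]  # hypothetical model output
counts = confusion_counts(actual, predicted)
accuracy, precision, recall = classification_metrics(*counts)
```

On imbalanced data, accuracy alone can look high while recall on the rare class is poor, which is why the matrix is read as a whole.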

132
Q

BIG IDEA: What is Overfitting and how do you prevent it?

A

Overfitting → Model learns noise instead of patterns, leading to poor performance on new data.
Prevention:
* Pruning Decision Trees (limit depth).
* Regularization (L1/L2 penalties).
* Using simpler models or ensemble learning.

Explanation: Overfit models perform well on training data but fail on unseen data.

133
Q

What is the ROC Curve and how do you interpret it?

A

  • ROC Curve → Plots True Positive Rate vs. False Positive Rate across classification thresholds.
  • AUC (Area Under Curve):
      • AUC = 1.0 → Perfect model.
      • AUC = 0.5 → Random guessing.

Explanation: Higher AUC means better classification performance.
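
One way to build intuition for AUC: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way rather than by integrating the curve; the function name and the sample labels and scores are assumptions for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

labels = [1, 1, 0, 1, 0, 0]               # hypothetical true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]   # hypothetical predicted probabilities
auc = roc_auc(labels, scores)
```

The pairwise loop is quadratic in the number of examples, so it suits small illustrations; sorting-based methods are used on larger datasets.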

134
Q

What is the difference between Bagging and Boosting?

A

  • Bagging (e.g., Random Forest) → Reduces variance by training multiple models independently on bootstrap samples and averaging (or voting on) their predictions.
  • Boosting (e.g., XGBoost) → Reduces bias by training weak learners sequentially, each one focusing on the errors of the models before it.