CSCI316 Exam card revision - csv Flashcards
Implement from scratch a Python function to compute the Gini index of a list. This function takes a list of categorical values as input and returns the Gini index as output. Write down the Python code. (4 marks)
def gini_index(values):
    # Calculate the total number of values
    total_values = len(values)
    # Calculate the count of each unique value in the list
    value_counts = {}
    for value in values:
        if value in value_counts:
            value_counts[value] += 1
        else:
            value_counts[value] = 1
    # Gini index = 1 minus the sum of squared class probabilities
    gini = 1.0
    for count in value_counts.values():
        probability = count / total_values
        gini -= probability ** 2
    return gini
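For instance, calling the function on a couple of small hypothetical lists (not part of the original answer) shows the expected behaviour: a 50/50 split gives 0.5, while a pure list gives 0.
print(gini_index(["A", "A", "B", "B"]))  # 0.5
print(gini_index(["A", "A", "A", "A"]))  # 0.0 (pure list)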
Given a list named X which contains words (as strings), implement a Python function to compute the word(s) with the highest frequency in X. Write down the Python code. (5 marks)
from collections import Counter

def most_frequent_words(word_list):
    # Count the frequency of each word in the list
    word_counts = Counter(word_list)
    # Find the maximum frequency in the counts
    max_frequency = max(word_counts.values())
    # Find the word(s) with the maximum frequency
    most_frequent = [word for word, count in word_counts.items() if count == max_frequency]
    return most_frequent
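A quick usage example with a hypothetical word list (assumed here only for illustration); note that ties are returned together.
X = ["data", "mining", "data", "spark", "spark"]
print(most_frequent_words(X))  # ['data', 'spark'] -- both appear twice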
(2.1) Explain why pre-processing is important in big data. (3 marks)
Pre-processing is essential in big data to clean, transform, and prepare data for meaningful analysis and modeling. It enhances data quality, reduces noise, improves model performance, and ensures that the insights drawn from big data are accurate and actionable.
Explain the advantages and disadvantages of data aggregation. (3 marks)
Advantages of data aggregation: it simplifies the data, reduces storage requirements, speeds up query performance, enhances visualization, and can improve privacy and security.
Disadvantages of data aggregation: loss of detail, sampling bias, information loss, limited analysis, difficulty in drilling down, and complex aggregation rules.
Explain undersampling and oversampling, and when you will apply them. (3 marks)
Undersampling:
Explanation: Undersampling involves reducing the number of samples in the majority class(es) to create a more balanced dataset. This is typically done by randomly selecting a subset of samples from the majority class to match the number of samples in the minority class.
Oversampling:
Explanation: Oversampling involves increasing the number of samples in the minority class by generating synthetic samples or duplicating existing ones. This is done to balance the class distribution in the dataset.
When to Apply Undersampling and Oversampling:
Undersampling is typically applied in scenarios where you have a large dataset, and reducing the size of the majority class does not result in a significant loss of information. It can be suitable for situations where computation resources are limited, and a smaller dataset is more manageable.
Oversampling is applied when you have a limited amount of data in the minority class, and simply discarding samples from the majority class would lead to a substantial loss of information. It helps balance the class distribution and allows the model to learn from the minority class more effectively. Oversampling is particularly useful when the overall dataset is small and every sample carries valuable information.
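A minimal sketch of both techniques using simple random resampling with scikit-learn (the array sizes are hypothetical); libraries such as imbalanced-learn offer more sophisticated options (e.g., SMOTE).
import numpy as np
from sklearn.utils import resample

X_majority = np.random.rand(900, 3)   # hypothetical majority-class samples
X_minority = np.random.rand(100, 3)   # hypothetical minority-class samples

# Undersampling: shrink the majority class down to the minority-class size
X_major_down = resample(X_majority, replace=False, n_samples=len(X_minority), random_state=42)

# Oversampling: duplicate minority samples (with replacement) up to the majority-class size
X_minor_up = resample(X_minority, replace=True, n_samples=len(X_majority), random_state=42)

print(X_major_down.shape, X_minor_up.shape)   # (100, 3) (900, 3)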
(2.1) Explain why we cannot reuse the training data for testing in data mining. (2 marks)
We cannot reuse the training data for testing in data mining because it would lead to biased and unreliable model evaluation. The fundamental reason for this is that using the same data for both training and testing introduces a form of “data leakage” or “information leakage” that can artificially inflate the performance metrics of the model
Reusing the training data for testing leads to overfitting, a lack of generalization, and biased evaluation.
Explain the concept of feature selection and feature generation, and in what situation to use each method. (3 marks)
Feature Selection:
• Definition: Feature selection is the process of choosing a subset of the most relevant and informative features (variables or attributes) from the original set of features. It involves eliminating redundant, irrelevant, or noisy features while retaining those that contribute the most to the model’s performance.
• When to Use: When the data are high-dimensional, contain redundant or irrelevant features, or suffer from the curse of dimensionality.
Feature Generation:
• Definition: Feature generation (also called feature construction) creates new features from the existing ones, for example by combining, transforming, or extracting attributes.
• When to Use: When the original features do not adequately capture the patterns the model needs, or when domain knowledge suggests that derived representations of the raw data will be more informative.
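A rough sketch of both ideas on synthetic data (scikit-learn assumed): SelectKBest keeps the most informative existing features (selection), while PolynomialFeatures builds new interaction features from the originals (generation).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Feature selection: keep the 3 features most related to the target
X_selected = SelectKBest(score_func=f_classif, k=3).fit_transform(X, y)

# Feature generation: construct pairwise interaction terms from the original features
X_generated = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(X)

print(X.shape, X_selected.shape, X_generated.shape)   # (200, 10) (200, 3) (200, 55)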
Explain why we need to convert strings to numerical values in data mining. Describe a concrete example to demonstrate the advantage(s) of one-hot encoding compared with the direct conversion of strings to numerical values. (4 marks)
Converting strings to numerical values is crucial in data mining for several reasons:
1. Algorithm Compatibility: Many machine learning algorithms require numerical input data. Algorithms like linear regression, decision trees, and support vector machines operate on mathematical equations, necessitating numerical input.
2. Distance Metrics: Clustering and similarity-based algorithms rely on distance metrics, which require numerical data to calculate distances accurately.
3. Statistical Analysis: Numerical data is vital for statistical analyses and hypothesis testing in data mining.
4. Efficiency: Numerical data often leads to faster model training and inference compared to string or categorical data.
A concrete example highlighting the advantage of one-hot encoding over direct conversion: Suppose we have a dataset with a categorical feature “Fruit” containing values: “Apple,” “Banana,” and “Orange.” Converting these to numerical values (e.g., 1, 2, 3) implies an ordinal relationship, which might be misleading. One-hot encoding creates binary columns (“Is_Apple,” “Is_Banana,” “Is_Orange”) with 0s and 1s, preserving independence among categories. This ensures that the model treats each fruit equally, preventing unintended hierarchy or bias.
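A minimal pandas sketch of the two encodings (the "Fruit" column mirrors the example above; pandas is assumed):
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Orange", "Apple"]})

# Direct conversion: implies an ordering (Apple < Banana < Orange) that does not exist
df["Fruit_code"] = df["Fruit"].map({"Apple": 1, "Banana": 2, "Orange": 3})

# One-hot encoding: one independent binary column per category (Is_Apple, Is_Banana, Is_Orange)
one_hot = pd.get_dummies(df["Fruit"], prefix="Is")
print(one_hot)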
(2.1) Explain the advantages of stratified sampling over standard random sampling. (3 marks)
Stratified sampling is advantageous over standard random sampling for three key reasons:
1. Representative Samples: Stratified sampling ensures that every subgroup, or stratum, within the population is adequately represented in the sample. This guarantees a more accurate reflection of the entire population’s characteristics.
2. Bias Reduction: It reduces the risk of sampling bias. In standard random sampling, there’s a chance of disproportionately selecting samples from one subgroup. Stratified sampling systematically selects from each stratum, minimizing such bias.
3. Precision: When there’s significant variability between subgroups, stratified sampling provides more precise estimates. It minimizes variation within the sample, resulting in more accurate insights into each stratum’s characteristics.
In summary, stratified sampling enhances representativeness, reduces bias, and increases precision compared to standard random sampling, making it a valuable sampling method.
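A brief scikit-learn sketch of the difference (synthetic data with a hypothetical 95/5 class split):
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 4)
y = np.array([0] * 950 + [1] * 50)   # 5% minority class

# Standard random split: the minority proportion in the test set can drift
_, _, _, y_test_rand = train_test_split(X, y, test_size=0.2, random_state=1)

# Stratified split: each class keeps its 95/5 proportion in both train and test sets
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

print(y_test_rand.mean(), y_test_strat.mean())   # the stratified split stays at ~0.05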
(2.2) Describe three common ways of handling missing values. (4 marks)
Three common ways to handle missing values are:
1. Deletion or Removal: In this method, rows or columns with missing data are entirely removed from the dataset. It’s simple but can lead to a loss of valuable information and reduced sample size.
2. Imputation: Imputation involves filling in missing values with estimated or calculated values. Common techniques include mean, median, or mode imputation, where the missing values are replaced with the mean, median, or mode of the observed data in the same column. Another approach is regression imputation, where a regression model is used to predict missing values based on other variables.
3. Advanced Imputation: More advanced techniques include using machine learning algorithms to predict missing values based on the relationships between variables. Methods like K-nearest neighbors (KNN) imputation, decision tree imputation, or matrix factorization imputation can be employed to handle missing data more effectively.
Each of these methods has its advantages and limitations, and the choice depends on the nature of the data and the specific problem at hand.
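A brief sketch of the three approaches on a hypothetical DataFrame (pandas and scikit-learn assumed):
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 35], "income": [50000, 62000, np.nan, 58000]})

dropped = df.dropna()                                      # 1. Deletion: remove rows with missing values
mean_filled = df.fillna(df.mean())                         # 2. Imputation: replace with the column mean
knn_filled = KNNImputer(n_neighbors=2).fit_transform(df)   # 3. Advanced: KNN-based imputation
print(dropped, mean_filled, knn_filled, sep="\n")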
Assume that you are given a set of records as shown in the following table, where the last column contains the target variable. Present the procedure of using Gain Ratio to identify which attribute should be split. You need to show all steps of your calculation in detail.
(6 marks)
Case | Lecturer experience | Programming Subject? | Student satisfaction
1    | Strong              | No                   | Low
2    | Weak                | No                   | Low
3    | Weak                | Yes                  | Low
4    | Weak                | Yes                  | Low
5    | Strong              | No                   | High
6    | Strong              | No                   | High
7    | Strong              | Yes                  | High
8    | Weak                | Yes                  | High
To calculate the Gain Ratio for each attribute, follow these steps:
Step 1: Calculate the Entropy of the Target Variable (Student Satisfaction):
• Calculate the proportion of each class (Low and High satisfaction) in the target variable.
• Calculate the entropy using the formula:
  Entropy(S) = -p(Low) * log2(p(Low)) - p(High) * log2(p(High))
Step 2: Calculate the Information Gain for Each Attribute:
• Calculate the entropy of the target variable for each unique value of an attribute (Lecturer Experience, Programming Subject).
• Calculate the weighted average entropy using the formula:
  Information Gain(Attribute) = Entropy(S) - Σ [(|Sv| / |S|) * Entropy(Sv)]
Step 3: Calculate the Split Information for Each Attribute:
• Calculate the proportion of each unique value of the attribute.
• Calculate the split information using the formula:
  Split Information(Attribute) = - Σ (|Sv| / |S|) * log2(|Sv| / |S|)
Step 4: Calculate the Gain Ratio for Each Attribute:
• Calculate the Gain Ratio using the formula:
  Gain Ratio(Attribute) = Information Gain(Attribute) / Split Information(Attribute)
Step 5: Compare Gain Ratios:
• Compare the Gain Ratios for each attribute.
• The attribute with the highest Gain Ratio is the best attribute to split on, as it provides the most information gain while considering its potential for overfitting (split information).
In summary, calculate the Gain Ratio for each attribute, and the one with the highest Gain Ratio is chosen as the best attribute to split on. This attribute will be the root of the decision tree in a decision tree classifier.
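As an illustration only (not part of the original answer), the sketch below applies these steps to the table above in Python. For this data it gives Gain Ratio ≈ 0.19 for Lecturer experience and 0 for Programming Subject, so Lecturer experience should be split.
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def gain_ratio(rows, attr_index, target_index):
    target = [r[target_index] for r in rows]
    overall = entropy(target)
    info, split_info = 0.0, 0.0
    for v in set(r[attr_index] for r in rows):
        subset = [r[target_index] for r in rows if r[attr_index] == v]
        weight = len(subset) / len(rows)
        info += weight * entropy(subset)           # weighted average entropy
        split_info -= weight * log2(weight)        # split information
    return (overall - info) / split_info

# (Lecturer experience, Programming Subject?, Student satisfaction), from the table
data = [("Strong", "No", "Low"), ("Weak", "No", "Low"), ("Weak", "Yes", "Low"),
        ("Weak", "Yes", "Low"), ("Strong", "No", "High"), ("Strong", "No", "High"),
        ("Strong", "Yes", "High"), ("Weak", "Yes", "High")]

print(gain_ratio(data, 0, 2))  # Lecturer experience: ~0.19
print(gain_ratio(data, 1, 2))  # Programming Subject?: 0.0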
Why can an ensemble classifier (such as a Random Forest) enhance the performance of individual classifiers?
Ensemble classifiers, like Random Forests, enhance performance by combining multiple individual classifiers:
1. Reducing Variance: Ensemble methods reduce the risk of overfitting by averaging or combining the predictions of multiple weak learners. This helps to smooth out noisy data and improves generalization.
2. Improved Robustness: By combining diverse models, ensembles become more robust to outliers and errors in individual models. They are less likely to be misled by a single incorrect prediction.
3. Better Generalization: Ensembles capture different patterns in the data. Combining these patterns results in a more accurate and stable prediction, especially when the data is complex or contains hidden relationships.
4. Reduced Bias: Ensembles can reduce bias by incorporating a variety of modeling techniques. This means that they are more likely to capture the true underlying patterns in the data.
In summary, ensemble classifiers combine the strength of multiple models, reducing variance, improving robustness, enhancing generalization, and reducing bias, which collectively enhance their performance compared to individual classifiers.
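A small illustrative sketch (synthetic data, scikit-learn assumed) comparing a single decision tree with a Random Forest on the same data; on noisy data like this the forest typically scores higher on the held-out folds.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("Single tree accuracy:", cross_val_score(tree, X, y, cv=5).mean())
print("Random Forest accuracy:", cross_val_score(forest, X, y, cv=5).mean())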
(4.1) Use an example to illustrate the conditional independence assumption, and explain why it is important to the Naïve Bayes classifier. (3 marks)
Example: Let’s consider a spam email classification scenario where we have two features: “contains the word ‘free’” (F) and “contains the word ‘discount’” (D), and we want to classify an email as either spam (S) or not spam (NS).
Explanation:
• According to the conditional independence assumption, the probability of an email containing both “free” and “discount” given that it is spam equals the product of the individual conditional probabilities: P(F, D | S) = P(F | S) * P(D | S).
Importance to Naïve Bayes Classifier:
It simplifies computation, makes high-dimensional data tractable, and decouples the features so that each conditional probability can be estimated independently.
(4.1) In Naïve Bayesian classifiers, the numerical underflow and the zero count are two important issues. Explain these two issues and describe at least one common technique to overcome each issue. (4 marks)
1. Numerical Underflow: This issue occurs when multiplying many probabilities together, which can result in extremely small values that may lead to numerical precision errors. To overcome this, we can work in the log space by taking the logarithm of probabilities and summing them instead of multiplying.
2. Zero Count: When a feature in the test data has a value that was not seen in the training data, it results in a zero probability estimate. Laplace smoothing (add-one smoothing) is a common technique to address this. It involves adding a small constant to all counts to ensure that no probability becomes zero, allowing the model to make reasonable predictions even for unseen data.
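A minimal sketch of both fixes (the counts below are hypothetical, not from the original answer): Laplace (add-one) smoothing for zero counts, and log-space summation to avoid underflow.
import math

# Hypothetical training counts of a word per class
word_count_in_spam = 0        # word never seen in spam -> zero-count problem
total_words_in_spam = 1000
vocabulary_size = 5000

# Laplace smoothing: add 1 to every count so no probability is exactly zero
p_word_given_spam = (word_count_in_spam + 1) / (total_words_in_spam + vocabulary_size)

# Log space: sum log-probabilities instead of multiplying many small probabilities
feature_probs = [0.01, 0.002, 0.03, 0.005]        # hypothetical P(feature_i | spam)
log_score = math.log(0.4) + sum(math.log(p) for p in feature_probs)  # log prior + log likelihoods
print(p_word_given_spam, log_score)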
Explain in which situations sensitivity and specificity are more important than accuracy as performance metrics of a classifier.
1. Class Imbalance: One class significantly outnumbers the other. Sensitivity is crucial for detecting the minority class correctly.
2. Costly Errors: Different types of classification errors have varying consequences. Sensitivity and specificity help balance the trade-off based on error costs.
3. Security and Anomaly Detection: Detecting attacks or anomalies accurately is critical. False negatives can lead to security breaches.
4. Legal or Ethical Implications: Minimizing false negatives is a top priority in situations where the consequences of missing a positive case are severe.
In these scenarios, sensitivity and specificity provide a more nuanced evaluation of classifier performance than accuracy.
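A small numeric sketch of the three metrics using hypothetical confusion-matrix counts for an imbalanced problem: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP).
TP, FN, TN, FP = 40, 10, 900, 50   # hypothetical counts

sensitivity = TP / (TP + FN)       # proportion of actual positives that are caught
specificity = TN / (TN + FP)       # proportion of actual negatives correctly rejected
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(sensitivity, specificity, accuracy)   # accuracy looks high (0.94) even though sensitivity is only 0.8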
(5.1) Use an example to explain how the MapReduce model can process the outer join operation. (3 marks)
In MapReduce, an outer join operation can be achieved by emitting key-value pairs from two datasets, one for employees and another for departments, during the mapping phase. Keys are assigned based on a common attribute like department ID. In the reducing phase, records with the same key are grouped together, allowing reducers to combine employee and department information when there’s a match and handle cases with missing matches. This process enables efficient outer join operations on large datasets in a distributed manner.
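A conceptual sketch of the map and reduce functions in Python (the record layouts are hypothetical and no specific MapReduce framework API is assumed): mappers tag each record with its source table, and the reducer joins the tagged records per department ID, padding with None when one side has no match.
def map_employee(emp):          # emp = (emp_id, name, dept_id)
    yield emp[2], ("EMP", emp)

def map_department(dept):       # dept = (dept_id, dept_name)
    yield dept[0], ("DEPT", dept)

def reduce_outer_join(dept_id, tagged_records):
    employees = [r for tag, r in tagged_records if tag == "EMP"]
    departments = [r for tag, r in tagged_records if tag == "DEPT"]
    # Full outer join: emit all pairs, padding with None where there is no match
    for emp in employees or [None]:
        for dept in departments or [None]:
            yield dept_id, (emp, dept)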
Why is Apache Spark suitable for large-scale machine learning? Use an example to support your answer. (3 marks)
Apache Spark is suitable for large-scale machine learning due to its in-memory processing capabilities and distributed computing framework. It efficiently handles large datasets and iterative algorithms commonly used in machine learning.
For example, consider a large-scale recommendation system. Spark’s ability to cache data in memory allows it to store user-product interactions, facilitating rapid model updates. Its distributed nature handles parallel processing of recommendations for multiple users, making it ideal for large-scale scenarios.
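A hedged PySpark sketch of such a recommender using the ALS algorithm from Spark MLlib (the ratings file and column names are hypothetical):
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recommender").getOrCreate()
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)  # userId, itemId, rating

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating", rank=10, maxIter=5)
model = als.fit(ratings)                   # training is distributed across the cluster
recs = model.recommendForAllUsers(10)      # top-10 recommendations per user, computed in parallel
recs.show(5)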
(5.2) Why is Apache Spark more suitable for data-parallel computation than for model-parallel computation? Also use an example to support your answer. (4 marks)
Apache Spark is more suitable for data-parallel computation than for model-parallel computation because its strengths lie in processing large volumes of data in parallel across a cluster of machines. It excels at parallelizing data transformations and actions.
For example, in a large-scale data analysis task like log processing, Apache Spark can efficiently distribute and process log files across nodes. It can perform operations like filtering, mapping, and aggregations on this data in parallel, making it well-suited for data-parallel tasks.
In contrast, model-parallel computation involves distributing and training parts of a machine learning model across different nodes. Spark is not as inherently designed for this type of computation, and it may require additional custom implementations and coordination. This makes it less suitable for model-parallel tasks where the focus is on distributing and training a complex model across multiple machines.
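A minimal PySpark sketch of the data-parallel log-processing example above (the log directory and line format are hypothetical): the same filter and aggregation run on every partition in parallel.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-analysis").getOrCreate()
logs = spark.read.text("hdfs:///logs/")                      # each line is one log record

errors_per_host = (logs
    .filter(F.col("value").contains("ERROR"))                # data-parallel filter
    .withColumn("host", F.split(F.col("value"), " ")[0])     # assumes the host is the first token
    .groupBy("host").count())                                # parallel aggregation
errors_per_host.show()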
Why is a classical Perceptron (i.e., a single layer of linear threshold units) not preferable to use? (2 marks)
A classical Perceptron (single layer of linear threshold units) is not preferable because it:
1. Handles Only Linearly Separable Data: Can only solve problems where data is linearly separable, limiting its use in complex, real-world tasks.
2. Lacks Capacity for Deep Learning: Lacks hidden layers for hierarchical feature learning, unlike modern neural networks capable of more sophisticated tasks.
As a result, it’s not suitable for most real-world machine learning problems.
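An illustrative sketch (scikit-learn assumed) of the linear-separability limitation: a single-layer Perceptron cannot learn XOR because no single linear boundary separates its classes.
import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                      # XOR labels: not linearly separable

clf = Perceptron(max_iter=1000, tol=1e-3).fit(X, y)
print(clf.score(X, y))   # stays below 1.0: the Perceptron cannot fit XOR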
Why is the softmax function suitable for multi-class classification but not regression? Also write some sample code to illustrate the difference between implementing an ANN classifier and regressor in Keras. (6 marks)
The softmax function is suitable for multiple classification but not regression because:
• Output Distribution: Softmax transforms the output into a probability distribution, ensuring that the sum of outputs is 1. This is suitable for classifying data into multiple classes, where each class represents a mutually exclusive category.
• Categorical Output: In classification, we want to determine the class or category to which an input belongs. Softmax assigns a probability to each class, and the class with the highest probability is the predicted class.
# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense

# Classification model: softmax output, categorical cross-entropy loss
X_classification = np.random.rand(100, 5)  # Example input features
y_classification = np.random.randint(0, 3, size=(100,))  # Example classification labels (3 classes)
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(X_classification, y_classification, test_size=0.2, random_state=42)
y_train_cls_one_hot = np.eye(3)[y_train_cls]  # One-hot encode the 3 classes
y_test_cls_one_hot = np.eye(3)[y_test_cls]
model_cls = Sequential()
model_cls.add(Dense(10, input_dim=5, activation='relu'))
model_cls.add(Dense(3, activation='softmax'))  # Output layer with softmax for classification
model_cls.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model_cls.fit(X_train_cls, y_train_cls_one_hot, epochs=10, batch_size=16, validation_data=(X_test_cls, y_test_cls_one_hot))

# Regression model: linear output, mean squared error loss
X_regression = np.random.rand(100, 5)  # Example input features
y_regression = np.random.rand(100, 1)  # Example regression targets
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_regression, y_regression, test_size=0.2, random_state=42)
model_reg = Sequential()
model_reg.add(Dense(10, input_dim=5, activation='relu'))
model_reg.add(Dense(1, activation='linear'))  # Output layer with linear activation for regression
model_reg.compile(loss='mean_squared_error', optimizer='adam', metrics=['mse'])
model_reg.fit(X_train_reg, y_train_reg, epochs=10, batch_size=16, validation_data=(X_test_reg, y_test_reg))
In this example, we create two separate models for classification and regression using Keras. The key differences lie in the output layers: softmax activation for classification and linear activation for regression. Additionally, the loss functions used are different: ‘categorical_crossentropy’ for classification and ‘mean_squared_error’ for regression.
Summarise the difference between pre-processing and data mining. (4 marks)
Pre-processing:
Cleans and organizes raw data for analysis.
Involves tasks like normalization and handling missing values.
Data Mining:
Discovers patterns and insights in large datasets.
Uses techniques like clustering and classification.