CSCI316 Exam card revision - csv Flashcards

1
Q

Implement from scratch a Python function to compute the Gini index of a list. This function takes a list of categorical values as input and returns the Gini index as output. Write down the Python code. (4 marks)

A

def gini_index(values):
    # Calculate the total number of values
    total_values = len(values)
    # Calculate the count of each unique value in the list
    value_counts = {}
    for value in values:
        if value in value_counts:
            value_counts[value] += 1
        else:
            value_counts[value] = 1
    # Calculate the Gini index
    gini = 1.0
    for count in value_counts.values():
        probability = count / total_values
        gini -= probability ** 2
    return gini
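
A quick sanity check of the function above (hypothetical input lists):

print(gini_index(['a', 'a', 'b', 'b']))  # 0.5 (two equally frequent classes)
print(gini_index(['a', 'a', 'a', 'a']))  # 0.0 (a pure list)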

2
Q

Given a list named X which contains words (as strings), implement a Python function to compute the word(s) with the highest frequency in X. Write down the Python code. (5 marks)

A

from collections import Counter

def most_frequent_words(word_list):
    # Count the frequency of each word in the list
    word_counts = Counter(word_list)
    # Find the maximum frequency in the counts
    max_frequency = max(word_counts.values())
    # Find the word(s) with the maximum frequency
    most_frequent = [word for word, count in word_counts.items() if count == max_frequency]
    return most_frequent

3
Q

(2.1) Explain why pre-processing is important in big data. (3 marks)

A

Pre-processing is essential in big data to clean, transform, and prepare data for meaningful analysis and modeling. It helps enhance data quality, reduce noise, improve model performance, and ensure that insights drawn from big data are accurate and actionable.

4
Q

Explain the advantages and disadvantages of data aggregation. (3 marks)

A

Advantages of data aggregation: it simplifies the data, reduces storage requirements, speeds up query performance, enhances visualization, and can improve privacy and security.

Disadvantages of data aggregation: loss of detail, sampling bias, information loss, limited analysis, difficulty in drilling down, and complex aggregation rules.

5
Q

Explain undersampling and oversampling, and when you will apply them. (3 marks)

A

Undersampling:
Explanation: Undersampling involves reducing the number of samples in the majority class(es) to create a more balanced dataset. This is typically done by randomly selecting a subset of samples from the majority class to match the number of samples in the minority class.

Oversampling:
Explanation: Oversampling involves increasing the number of samples in the minority class by generating synthetic samples or duplicating existing ones. This is done to balance the class distribution in the dataset.

When to Apply Undersampling and Oversampling:
Undersampling is typically applied in scenarios where you have a large dataset, and reducing the size of the majority class does not result in a significant loss of information. It can be suitable for situations where computation resources are limited, and a smaller dataset is more manageable.
Oversampling is applied when you have a limited amount of data in the minority class, and simply discarding samples from the majority class would lead to a substantial loss of information. It helps balance the class distribution and allows the model to learn from the minority class more effectively. Oversampling is particularly useful when the overall dataset is small and every sample carries valuable information.

6
Q

(2.1) Explain why we cannot reuse the training data for testing in data mining. (2 marks)

A

We cannot reuse the training data for testing in data mining because it would lead to biased and unreliable model evaluation. The fundamental reason for this is that using the same data for both training and testing introduces a form of “data leakage” or “information leakage” that can artificially inflate the performance metrics of the model

Reusing the training data for testing causes overfitting, a lack of generalization, and biased evaluation.

7
Q

Explain the concept of feature selection and feature generation, and in what situation to use each method. (3 marks)

A

Feature Selection:

Definition: Feature selection is the process of choosing a subset of the most relevant and informative features (variables or attributes) from the original set of features. It involves eliminating redundant, irrelevant, or noisy features while retaining those that contribute the most to the model's performance.

When to Use: When the data is high-dimensional, contains redundant or irrelevant features, or suffers from the curse of dimensionality.

Feature Generation:

Definition: Feature generation (also called feature construction) is the process of creating new features from the existing ones, for example by combining, transforming, or aggregating attributes, so that the model can capture patterns that the raw features do not express directly.

When to Use: When the raw features are not informative enough on their own, or when domain knowledge suggests useful derived quantities (e.g., ratios, interaction terms, date components).

8
Q

Explain why we need to convert strings to numerical values in data mining. Describe a concrete example to demonstrate the advantage(s) of one-hot encoding compared with the direct conversion of strings to numerical values. (4 marks)

A

Converting strings to numerical values is crucial in data mining for several reasons:
1. Algorithm Compatibility: Many machine learning algorithms require numerical input data. Algorithms like linear regression, decision trees, and support vector machines operate on mathematical equations, necessitating numerical input.
2. Distance Metrics: Clustering and similarity-based algorithms rely on distance metrics, which require numerical data to calculate distances accurately.
3. Statistical Analysis: Numerical data is vital for statistical analyses and hypothesis testing in data mining.
4. Efficiency: Numerical data often leads to faster model training and inference compared to string or categorical data.

A concrete example highlighting the advantage of one-hot encoding over direct conversion: Suppose we have a dataset with a categorical feature “Fruit” containing values: “Apple,” “Banana,” and “Orange.” Converting these to numerical values (e.g., 1, 2, 3) implies an ordinal relationship, which might be misleading. One-hot encoding creates binary columns (“Is_Apple,” “Is_Banana,” “Is_Orange”) with 0s and 1s, preserving independence among categories. This ensures that the model treats each fruit equally, preventing unintended hierarchy or bias.
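
A minimal sketch of the two encodings on that hypothetical Fruit column (assuming pandas is available; pd.get_dummies performs the one-hot step):

import pandas as pd

df = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Orange', 'Apple']})

# Direct conversion: imposes an artificial order (Apple < Banana < Orange)
df['Fruit_label'] = df['Fruit'].map({'Apple': 1, 'Banana': 2, 'Orange': 3})

# One-hot encoding: one independent binary column per category
one_hot = pd.get_dummies(df['Fruit'], prefix='Is')
print(one_hot)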

9
Q

(2.1) Explain the advantages of stratified sampling over standard random sampling. (3 marks)

A

Stratified sampling is advantageous over standard random sampling for three key reasons:
1. Representative Samples: Stratified sampling ensures that every subgroup, or stratum, within the population is adequately represented in the sample. This guarantees a more accurate reflection of the entire population's characteristics.
2. Bias Reduction: It reduces the risk of sampling bias. In standard random sampling, there's a chance of disproportionately selecting samples from one subgroup. Stratified sampling systematically selects from each stratum, minimizing such bias.
3. Precision: When there's significant variability between subgroups, stratified sampling provides more precise estimates. It minimizes variation within the sample, resulting in more accurate insights into each stratum's characteristics.
In summary, stratified sampling enhances representativeness, reduces bias, and increases precision compared to standard random sampling, making it a valuable sampling method.

10
Q

(2.2) Describe three common ways of handling missing values. (4 marks)

A

Three common ways to handle missing values are:
1. Deletion or Removal: In this method, rows or columns with missing data are entirely removed from the dataset. It's simple but can lead to a loss of valuable information and reduced sample size.
2. Imputation: Imputation involves filling in missing values with estimated or calculated values. Common techniques include mean, median, or mode imputation, where the missing values are replaced with the mean, median, or mode of the observed data in the same column. Another approach is regression imputation, where a regression model is used to predict missing values based on other variables.
3. Advanced Imputation: More advanced techniques include using machine learning algorithms to predict missing values based on the relationships between variables. Methods like K-nearest neighbors (KNN) imputation, decision tree imputation, or matrix factorization imputation can be employed to handle missing data more effectively.
Each of these methods has its advantages and limitations, and the choice depends on the nature of the data and the specific problem at hand.

11
Q

Assume that you are given a set of records as shown in the following table, where the last column contains the target variable. Present the procedure of using Gain Ratio to identify which attribute should be split. You need to show all steps of your calculation in detail.
(6 marks)
Case | Lecturer experience | Programming Subject? | Student satisfaction
1 | Strong | No | Low
2 | Weak | No | Low
3 | Weak | Yes | Low
4 | Weak | Yes | Low
5 | Strong | No | High
6 | Strong | No | High
7 | Strong | Yes | High
8 | Weak | Yes | High

A

To calculate the Gain Ratio for each attribute, follow these steps:
Step 1: Calculate the Entropy of the Target Variable (Student Satisfaction):

Calculate the proportion of each class (Low and High satisfaction) in the target variable.

Calculate the entropy using the formula:

Entropy(S) = -p(Low) * log2(p(Low)) - p(High) * log2(p(High))
Step 2: Calculate the Information Gain for Each Attribute:

Calculate the entropy of the target variable for each unique value of an attribute (Lecturer Experience, Programming Subject).

Calculate the weighted average entropy using the formula:

Information Gain(Attribute) = Entropy(S) - Σ [(|Sv| / |S|) * Entropy(Sv)]
Step 3: Calculate the Split Information for Each Attribute:

Calculate the proportion of each unique value of the attribute.

Calculate the split information using the formula:

Split Information(Attribute) = - Σ (|Sv| / |S|) * log2(|Sv| / |S|)
Step 4: Calculate the Gain Ratio for Each Attribute:

Calculate the Gain Ratio using the formula:

Gain Ratio(Attribute) = Information Gain(Attribute) / Split Information(Attribute)
Step 5: Compare Gain Ratios:

Compare the Gain Ratios for each attribute.

The attribute with the highest Gain Ratio is the best attribute to split on, as it provides the most information gain while considering its potential for overfitting (split information).
In summary, calculate the Gain Ratio for each attribute, and the one with the highest Gain Ratio is chosen as the best attribute to split on. This attribute will be the root of the decision tree in a decision tree classifier.
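
For this particular table, a short computation (a minimal sketch using only the standard library) gives Gain Ratio(Lecturer experience) ≈ 0.19 and Gain Ratio(Programming Subject) = 0, so Lecturer experience is the attribute to split on:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr_idx, target_idx):
    n = len(rows)
    base_entropy = entropy([r[target_idx] for r in rows])
    # Partition the target labels by the attribute's values
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr_idx], []).append(r[target_idx])
    info = sum(len(p) / n * entropy(p) for p in partitions.values())
    split_info = -sum((len(p) / n) * math.log2(len(p) / n) for p in partitions.values())
    return (base_entropy - info) / split_info if split_info > 0 else 0.0

# Rows: (Lecturer experience, Programming Subject?, Student satisfaction)
data = [
    ('Strong', 'No', 'Low'),   ('Weak', 'No', 'Low'),
    ('Weak', 'Yes', 'Low'),    ('Weak', 'Yes', 'Low'),
    ('Strong', 'No', 'High'),  ('Strong', 'No', 'High'),
    ('Strong', 'Yes', 'High'), ('Weak', 'Yes', 'High'),
]
print(gain_ratio(data, 0, 2))  # Lecturer experience -> ~0.189
print(gain_ratio(data, 1, 2))  # Programming Subject  -> 0.0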

12
Q

Why can an ensemble classifier (such as a Random Forest) enhance the performance of individual classifiers?

A

Ensemble classifiers, like Random Forests, enhance performance by combining multiple individual classifiers:
1. Reducing Variance: Ensemble methods reduce the risk of overfitting by averaging or combining the predictions of multiple weak learners. This helps to smooth out noisy data and improves generalization.
2. Improved Robustness: By combining diverse models, ensembles become more robust to outliers and errors in individual models. They are less likely to be misled by a single incorrect prediction.
3. Better Generalization: Ensembles capture different patterns in the data. Combining these patterns results in a more accurate and stable prediction, especially when the data is complex or contains hidden relationships.
4. Reduced Bias: Ensembles can reduce bias by incorporating a variety of modeling techniques. This means that they are more likely to capture the true underlying patterns in the data.
In summary, ensemble classifiers combine the strength of multiple models, reducing variance, improving robustness, enhancing generalization, and reducing bias, which collectively enhance their performance compared to individual classifiers.

13
Q

(4.1) Use an example to illustrate the conditional independence assumption, and explain why it is important to the Naïve Bayes classifier. (3 marks)

A

Example: Let’s consider a spam email classification scenario where we have two features: “contains the word ‘free’” (F) and “contains the word ‘discount’” (D), and we want to classify an email as either spam (S) or not spam (NS).
Explanation:
⦁ According to the conditional independence assumption, the probability of an email containing both “free” and “discount” given that it is spam is equal to the product of the probabilities of it containing “free” given that it is spam (P(F | S)) and containing “discount” given that it is spam (P(D | S)).

Importance to Naïve Bayes Classifier:
It simplifies computation, makes high-dimensional data tractable, and decouples the features so their likelihoods can be estimated independently.
(4.1) In Naïve Bayesian classifiers, the numerical underflow and the zero count are two important issues. Explain these two issues and describe at least one common technique to overcome each issue. (4 marks)
1. Numerical Underflow: This issue occurs when multiplying many probabilities together, which can result in extremely small values that may lead to numerical precision errors. To overcome this, we can work in the log space by taking the logarithm of probabilities and summing them instead of multiplying.
2. Zero Count: When a feature in the test data has a value that was not seen in the training data, it results in a zero probability estimate. Laplace smoothing (add-one smoothing) is a common technique to address this. It involves adding a small constant to all counts to ensure that no probability becomes zero, allowing the model to make reasonable predictions even for unseen data.

Need further reading, answer might not be complete

14
Q

Explain in which situations sensitivity and specificity are more important than accuracy as performance metrics of a classifier.

A

1. Class Imbalance: One class significantly outnumbers the other. Sensitivity is crucial for detecting the minority class correctly.
2. Costly Errors: Different types of classification errors have varying consequences. Sensitivity and specificity help balance the trade-off based on error costs.
3. Security and Anomaly Detection: Detecting attacks or anomalies accurately is critical. False negatives can lead to security breaches.
4. Legal or Ethical Implications: Minimizing false negatives is a top priority in situations where the consequences of missing a positive case are severe.
In these scenarios, sensitivity and specificity provide a more nuanced evaluation of classifier performance than accuracy.

15
Q

(5.1) Use an example to explain how the MapReduce model can process the outer join operation. (3 marks)

A

In MapReduce, an outer join operation can be achieved by emitting key-value pairs from two datasets, one for employees and another for departments, during the mapping phase. Keys are assigned based on a common attribute like department ID. In the reducing phase, records with the same key are grouped together, allowing reducers to combine employee and department information when there’s a match and handle cases with missing matches. This process enables efficient outer join operations on large datasets in a distributed manner.
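
A minimal single-process Python sketch of the idea (hypothetical employee and department records keyed on a department ID; the dictionary of groups stands in for the framework's shuffle):

from collections import defaultdict

employees   = [('D1', 'Alice'), ('D1', 'Bob'), ('D3', 'Carol')]   # (dept_id, employee)
departments = [('D1', 'Sales'), ('D2', 'HR')]                     # (dept_id, dept_name)

# Map phase: tag each record with its source table and emit (dept_id, record)
mapped = [(k, ('EMP', v)) for k, v in employees] + \
         [(k, ('DEPT', v)) for k, v in departments]

# Shuffle phase: group all records that share the same key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: join matching records, keeping unmatched ones (full outer join)
for dept_id, values in groups.items():
    emps  = [v for tag, v in values if tag == 'EMP'] or [None]
    depts = [v for tag, v in values if tag == 'DEPT'] or [None]
    for e in emps:
        for d in depts:
            print(dept_id, e, d)  # e or d is None when there is no match on that side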

16
Q

Why is Apache Spark suitable for large-scale machine learning? Use an example to support your answer. (3 marks)

A

Apache Spark is suitable for large-scale machine learning due to its in-memory processing capabilities and distributed computing framework. It efficiently handles large datasets and iterative algorithms commonly used in machine learning.
For example, consider a large-scale recommendation system. Spark’s ability to cache data in memory allows it to store user-product interactions, facilitating rapid model updates. Its distributed nature handles parallel processing of recommendations for multiple users, making it ideal for large-scale scenarios.

17
Q

(5.2) Why is Apache Spark more suitable for data-parallel computation than for model-parallel computation? Also use an example to support your answer. (4 marks)

A

Apache Spark is more suitable for data-parallel computation than for model-parallel computation because its strengths lie in processing large volumes of data in parallel across a cluster of machines. It excels at parallelizing data transformations and actions.
For example, in a large-scale data analysis task like log processing, Apache Spark can efficiently distribute and process log files across nodes. It can perform operations like filtering, mapping, and aggregations on this data in parallel, making it well-suited for data-parallel tasks.
In contrast, model-parallel computation involves distributing and training parts of a machine learning model across different nodes. Spark is not as inherently designed for this type of computation, and it may require additional custom implementations and coordination. This makes it less suitable for model-parallel tasks where the focus is on distributing and training a complex model across multiple machines.

18
Q

Why is a classical Perceptron (i.e., a single layer of linear threshold units) not preferable to use? (2 marks)

A

A classical Perceptron (single layer of linear threshold units) is not preferable because it:
1. Handles Only Linearly Separable Data: Can only solve problems where data is linearly separable, limiting its use in complex, real-world tasks.
2. Lacks Capacity for Deep Learning: Lacks hidden layers for hierarchical feature learning, unlike modern neural networks capable of more sophisticated tasks.
As a result, it's not suitable for most real-world machine learning problems.

19
Q

Why is the softmax function suitable for multiple classification but not regression? Also write some sample code to illustrate the difference between implementing an ANN classifier and regressor in Keras. (6 marks)

A

The softmax function is suitable for multiple classification but not regression because:
• Output Distribution: Softmax transforms the output into a probability distribution, ensuring that the sum of outputs is 1. This is suitable for classifying data into multiple classes, where each class represents a mutually exclusive category.
• Categorical Output: In classification, we want to determine the class or category to which an input belongs. Softmax assigns a probability to each class, and the class with the highest probability is the predicted class.

# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense

X_classification = np.random.rand(100, 5)  # Example input features
y_classification = np.random.randint(0, 3, size=(100,))  # Example classification labels

X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(X_classification, y_classification, test_size=0.2, random_state=42)

y_train_cls_one_hot = np.eye(3)[y_train_cls]  # Assuming 3 classes
y_test_cls_one_hot = np.eye(3)[y_test_cls]

model_cls = Sequential()
model_cls.add(Dense(10, input_dim=5, activation='relu'))
model_cls.add(Dense(3, activation='softmax'))  # Output layer with softmax for classification

model_cls.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model_cls.fit(X_train_cls, y_train_cls_one_hot, epochs=10, batch_size=16, validation_data=(X_test_cls, y_test_cls_one_hot))

X_regression = np.random.rand(100, 5)  # Example input features
y_regression = np.random.rand(100, 1)  # Example regression targets

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_regression, y_regression, test_size=0.2, random_state=42)

model_reg = Sequential()
model_reg.add(Dense(10, input_dim=5, activation='relu'))
model_reg.add(Dense(1, activation='linear'))  # Output layer with linear activation for regression

model_reg.compile(loss='mean_squared_error', optimizer='adam', metrics=['mse'])

model_reg.fit(X_train_reg, y_train_reg, epochs=10, batch_size=16, validation_data=(X_test_reg, y_test_reg))

In this example, we create two separate models for classification and regression using Keras. The key differences lie in the output layers: softmax activation for classification and linear activation for regression. Additionally, the loss functions used are different: ‘categorical_crossentropy’ for classification and ‘mean_squared_error’ for regression.

20
Q

Summarise the difference between pre-processing and data mining. (4 marks)

A

Pre-processing:

Cleans and organizes raw data for analysis.
Involves tasks like normalization and handling missing values.
Data Mining:

Discovers patterns and insights in large datasets.
Uses techniques like clustering and classification.

21
Q

What is the difference between noise and outliers? (4 marks)

A

Noise:

Random or irrelevant information in data.
Present in small amounts, considered unwanted variability.
Outliers:

Data points significantly deviate from the majority.
Can skew results and require careful handling.

22
Q

Assume that a Python list is defined as follows:
X = [1, 2, 3, 4]
Write the Python code to filter the numbers bigger than 2 in X.

A

def filter_numbers_greater_than_two(input_list):
    # Use a list comprehension to filter numbers greater than 2
    filtered_list = [num for num in input_list if num > 2]
    return filtered_list

# Example usage with the provided list X
X = [1, 2, 3, 4]
result = filter_numbers_greater_than_two(X)
print(result)  # [3, 4]

23
Q

Implement from scratch a Python function for simple numerical encoding. This function takes a list of string values as input and returns a vector of integers as output. Write down the Python code. (4 marks)

A

def numerical_encoding(input_list):
    # Create a dictionary to map unique strings to integers
    string_to_int_mapping = {string: index for index, string in enumerate(set(input_list))}

    # Use list comprehension to encode each string in the input list
    encoded_vector = [string_to_int_mapping[string] for string in input_list]

    return encoded_vector

# Example usage
input_strings = ["apple", "banana", "orange", "apple", "orange"]
result = numerical_encoding(input_strings)
print(result)

24
Q

Why is the Naïve Bayes classifier efficient (e.g., compared with a Decision Tree)?

A

Simplicity and Speed:

Naïve Bayes: The algorithm is relatively simple and easy to implement. It is computationally inexpensive, requiring less training time.
Decision Tree: Decision trees can become complex, especially with large datasets or deep trees, leading to longer training times and increased computational costs.
Low Parameter Sensitivity:

Naïve Bayes: Typically has fewer hyperparameters to tune, making it less sensitive to parameter optimization.
Decision Tree: May require tuning of parameters, and sensitivity to hyperparameters can impact performance and require more extensive optimization.
Handling of Irrelevant Features:

Naïve Bayes: The “naïve” assumption that features are independent simplifies the model and makes it robust to irrelevant features.
Decision Tree: Can be sensitive to irrelevant features, and pruning is often needed to reduce overfitting.
Efficient with High-Dimensional Data:

Naïve Bayes: Performs well in high-dimensional spaces, making it suitable for datasets with a large number of features.
Decision Tree: Can struggle with high-dimensional data and may require more data preprocessing and feature selection.
Effective for Text Classification:

Naïve Bayes: Particularly effective in natural language processing tasks, such as spam filtering and text classification.
Decision Tree: May not perform as well in text-based tasks due to the high dimensionality and sparsity of text data.
Less Prone to Overfitting:

Naïve Bayes: The simplicity of the model makes it less prone to overfitting, especially with smaller datasets.
Decision Tree: Can be prone to overfitting, and proper pruning is essential to balance model complexity.

25
Q

Explain how the MapReduce model can process the vector-by-vector multiplication (i.e., dot product of two vectors).

A

The MapReduce model is a programming paradigm designed for processing and generating large datasets in a parallel and distributed computing environment. The vector-by-vector multiplication, or dot product, of two vectors can be efficiently computed using the MapReduce model. Here’s a high-level explanation of how this process works:

Map Phase:

Input: The input to the MapReduce job consists of key-value pairs, where the key is an identifier for a specific vector element, and the value is the actual value of that element.
Map Function: In the Map phase, each mapper takes a portion of the vectors and emits intermediate key-value pairs. The key remains the identifier, and the value is the product of the corresponding elements in the two vectors.
# Example Map function pseudocode
function map(key, value):
    for i in range(length of vectors):
        emit(i, vector1[i] * vector2[i])
Shuffle and Sort Phase:

The MapReduce framework groups intermediate key-value pairs by key and sorts them. This ensures that all values corresponding to the same identifier are grouped together.
Reduce Phase:

Input: The input to the Reduce phase is a set of key-value pairs where each key corresponds to a vector element, and the values are the products from different vectors that share the same identifier.
Reduce Function: In the Reduce phase, each reducer receives a key along with its associated values. The reducer then sums up these values to compute the dot product for that specific vector element.
# Example Reduce function pseudocode
function reduce(key, values):
    dot_product = sum(values)
    emit(key, dot_product)
Output:

The final output of the MapReduce job consists of key-value pairs, where the key is the identifier of a vector element, and the value is the corresponding dot product.
This MapReduce approach allows for parallelization and distributed computation, making it efficient for large-scale vector-by-vector multiplication. Each mapper works independently on a portion of the data, and the reducer combines the results to produce the final dot product. This parallelization is especially beneficial for handling massive datasets in distributed computing environments.
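
A runnable single-process sketch of the same idea (hypothetical vectors; the grouping dictionary plays the role of the shuffle-and-sort step):

from collections import defaultdict

vector1 = [1, 2, 3]
vector2 = [4, 5, 6]

# Map phase: emit (index, element-wise product) pairs
mapped = [(i, a * b) for i, (a, b) in enumerate(zip(vector1, vector2))]

# Shuffle phase: group values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: sum the values for each key, then combine into the dot product
partials = {key: sum(values) for key, values in groups.items()}
print(sum(partials.values()))  # 32 = 1*4 + 2*5 + 3*6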

26
Q

Define a new DataFrame that includes only the “StockCode” and “Quantity” columns of df_RD. Count the unique values in the “InvoiceNo” column, and compute the total order price (UnitPrice * Quantity) per CustomerID. (5 marks)

A

# Import necessary functions and classes from PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, countDistinct, sum

spark = SparkSession.builder.appName("example").getOrCreate()

# Select the "StockCode" and "Quantity" columns
selected_df = df_RD.select("StockCode", "Quantity")

# Count the unique values in the "InvoiceNo" column
unique_invoice_count = df_RD.select(countDistinct("InvoiceNo").alias("UniqueInvoices"))

# Compute the total order price (UnitPrice * Quantity) per CustomerID
total_order_price_per_customer = df_RD.withColumn("TotalOrderPrice", col("UnitPrice") * col("Quantity")) \
    .groupBy("CustomerID") \
    .agg(sum("TotalOrderPrice").alias("TotalOrderPricePerCustomer"))

selected_df.show()
unique_invoice_count.show()
total_order_price_per_customer.show()

27
Q

Explain the major difference between a Spark DataFrame and a Pandas DataFrame as data structures. (2 marks)

A

Major Difference:

Spark DataFrame: Distributed data structure for big data processing across a cluster of machines, leveraging lazy evaluation and immutability in a distributed environment.
Pandas DataFrame: Single-node, in-memory data structure designed for operations on smaller datasets with eager evaluation and mutability on a single machine.

28
Q

Why can pre-processing improve the quality of a data pipeline?

A

Data pre-processing is a required first step before any machine learning machinery can be applied. It consists of data cleaning, data transformation and feature selection.
• Through data cleaning, inconsistent data, missing information, errors, and outliers can be corrected before the data are submitted to the machine learning model.
• Through data transformation, the training/testing data are scaled and transformed. For example, categorical data can be transformed to numerical data without losing their original characteristics.
• Through feature selection, relevant features from big data can be selected to serve as input to the machine learning model.

29
Q

What is the difference between noise and outliers? (4 marks)

A

Outliers are observed data that are distant (different) from the remaining observed data. They are also referred to as abnormalities, discordants, deviants and anomalies.
• Noise is corrupted or distorted data containing false information. In other words, noise consists of mislabelled examples or errors in attribute values.
Hence, we can say outliers are data that do not “fit in” with the other data we are analysing. An outlier can be a valid data point, or it can be noise. As such, outliers often contain interesting and useful information about the underlying system, e.g., in an intrusion detection system, an outlier can be an instance of intrusion.

30
Q

Why is large-scale machine learning challenging?

A

In machine learning, scale adds complexity. The larger the data set, the longer it takes to process. At certain points, however, scale makes trivial operations costly, forcing us to re-evaluate algorithms in light of the complexity of those operations.
As a data set gets larger, it may no longer fit in main memory, and hence alternative approaches such as distributed processing, parallelism and/or multiprocessing may be involved. This in turn affects the models used in machine learning.

31
Q

Assume the Python list is defined as “X = list(range(10))”. Implement a Python function that takes X as input and returns another list named Y which contains all the even numbers in X. Present the Python code of your implementation. Also present the output of the command “print(Y)”.

A

def evenList(X):
    evenX = []
    for n in X:
        if n % 2 == 0:
            evenX.append(n)
    return evenX

X = list(range(10))
Y = evenList(X)
print(Y)  # Output: [0, 2, 4, 6, 8]

32
Q

Assume that a Python dictionary is defined as follows: Word_Counts= {“a”: 1, “b”: 3, “c”: 2}

Sort the elements in Word_Counts from the highest to the lowest count of the keys. Present the Python code of your implementation.
A

# Given a list named X which contains words (as strings),
# implement a Python function to compute the word(s)
# with the highest frequency in X.
# Write down the Python code.
from collections import defaultdict

def highestFreq(X):
    freqCount = defaultdict(int)
    # For each substring in X
    for sub in X:
        # Split the substring into words
        for word in sub.split():
            # For each word, accumulate the count of frequency
            freqCount[word] += 1
    # Determine the highest count of frequency
    result = max(freqCount, key=freqCount.get)
    highestCount = max(freqCount.values())
    # Return the result
    return {'word': result, 'highestCount': highestCount}
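
The sorting asked for in the question itself can be done with the built-in sorted(); a minimal sketch using the dictionary given in the question:

Word_Counts = {"a": 1, "b": 3, "c": 2}
sorted_counts = sorted(Word_Counts.items(), key=lambda item: item[1], reverse=True)
print(sorted_counts)  # [('b', 3), ('c', 2), ('a', 1)]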

33
Q

Suppose you are designing a DT classifier to process the following data set:
Present the process of using InfoGain to split the data set according to the “student” feature. Detail the steps in your computation process.

A

Refer to notes and understand

34
Q

Explain a method to handle zero counts and a method to handle numerical underflow in the Naïve Bayes classifier. Use examples to support your answers.
Lecture 5 – Bayes Classifier, Classification, Evaluation and Model Enhancement, slides 25, 26.

A

Refer to notes and understand

35
Q

Why is the Naïve Bayes classifier efficient (e.g., compared with a Decision Tree)?

A

Naïve Bayes can perform quite well, and it doesn't overfit nearly as much, so there is no need to prune or post-process the model.

Naïve Bayes also does quite well when the training data doesn't contain all possibilities, so it can be very good with low amounts of data.

36
Q

Considering the following confusion matrix, what are the TP, FP, TN, FN,
precision, recall and F
measure?

A

Refer to notes and understand
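
The confusion matrix itself is not reproduced here, but once TP, FP, TN and FN are read off it the metrics follow mechanically; a minimal sketch with placeholder counts:

def classification_metrics(TP, FP, TN, FN):
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)        # also called sensitivity / TPR
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (TP + TN) / (TP + FP + TN + FN)
    return precision, recall, f_measure, accuracy

print(classification_metrics(TP=40, FP=10, TN=45, FN=5))  # placeholder counts, not the exam's matrix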

37
Q

What is a MapReduce?

A

• Software framework and programming model used for processing very large amounts of data.
• It works in two phases:
Mapping phase
In this phase, data is split and mapped into multiple groups.

Reduce phase
In this phase, the mapped data are shuffled and reduced into individual groups.

• MapReduce programs are parallel in nature; hence, they are useful for performing large-scale data analysis using multiple machines in a cluster.

Input Splits:
The data set is divided into fixed-size chunks (blocks), each consumed by a single map.

Mapping:
Data in each chunk is passed to a mapping function to produce counts of occurrences of each word, and to prepare a list of key-value pairs where the key is the word and the value is the frequency of occurrences.

Shuffling:
The shuffling process consolidates the relevant records from the Mapping phase by clubbing together the same words and accumulating their frequencies.

Reducing:
In this phase, the output values from the Shuffling phase are aggregated by combining all the words into a single output, that is, producing a complete dataset.
38
Q

Use an example to explain how the MapReduce model can
process a “word count” problem.

A

Map function:
The dataset is split into blocks/chunks of fixed size.
A Map function is then assigned to process each block; that is, each Map function operates on only one block.
Each Map function outputs (produces) sets of key-value pair records.
The sets of key-value pair records are passed to the Partitioner, which ensures each key-value pair record is passed to one and only one Reducer.

Refer to notes and understand
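
A minimal single-process sketch of word count in the MapReduce style (hypothetical input lines; the grouping dictionary plays the role of shuffle/partition):

from collections import defaultdict

lines = ['big data is big', 'data mining is fun']   # hypothetical input blocks

# Map phase: each line emits (word, 1) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the pairs by word
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: sum the counts for each word
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'mining': 1, 'fun': 1}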

39
Q

Explain how the MapReduce model can process the relational algebra operation “selection”. Use a concrete example to support your answer.

A

Lecture 6 – Handling Massive Datasets, slides 7-9.
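
A minimal sketch of selection in the MapReduce style (hypothetical (id, age) records; the map emits only tuples that satisfy the predicate, and the reduce is simply the identity):

records = [('r1', 25), ('r2', 40), ('r3', 31)]   # hypothetical (id, age) tuples

# Map phase: emit the record only if it satisfies the selection predicate (age > 30)
mapped = [(rec, rec) for rec in records if rec[1] > 30]

# Reduce phase: identity - output each selected record unchanged
selected = [value for key, value in mapped]
print(selected)  # [('r2', 40), ('r3', 31)]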

40
Q

What is ROC?

A

A Receiver Operator Characteristic (ROC) curve is a graphical plot used to show the diagnostic ability of binary classifiers. A ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR). The true positive rate is the proportion of observations that were correctly predicted to be positive out of all positive observations (TP/(TP + FN)). Similarly, the false positive rate is the proportion of observations that are incorrectly predicted to be positive out of all negative observations (FP/(TN + FP)). For example, in medical testing, the true positive rate is the rate in which people are correctly identified to test positive for the disease in question.

The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1-FPR). Classifiers that gives curves closer to the top-left corner indicate a better performance. As a baseline, a random classifier is expected to give points lying along the diagonal (FPR=TPR). The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test. Please refer to Lecture Week 5 – Classification, slide 85 for examples.

41
Q

Assume that a Bayesian classifier returns the following outcomes for a
binary classification problem, which are sorted by decreasing
probability values. P (resp., N) refers to a record belonging to a positive
(resp., negative) class.

A

refer to notes and understand

42
Q

Precision (Exactness) – What percentage of
tuples that the classifier labelled as
positive are actually positive.

A

Precision = True Positive / (True Positive + False Positive) = TP / (TP + FP)

43
Q

Recall (completeness) – What percentage
of positive tuples did the classifier labelled
as positive?

A

Recall = True Positive / (True Positive + False Negative) = TP / (TP + FN)

44
Q

True Positive Rate (TPR) – The ratio of the
true positive over the sum of true positive
and false negative.

A

TPR = True Positive / (True Positive + False Negative) = TP / (TP + FN)

45
Q

False Positive Rate (FPR) – The ratio of the
false positive over the sum of false positive
and true negative.

A

FPR = False Positive / (False Positive + True Negative) = FP / (FP + TN)

46
Q

Naïve Bayes classifier and
model evaluation

A

• Use a classifier that produces a posterior probability P(+|A) for each test instance
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP / (TP + FN)
• FP rate, FPR = FP / (FP + TN)

Understand how to construct a ROC curve using this procedure.
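
A minimal sketch of that procedure (hypothetical posterior scores and true labels; each unique score is used as a threshold):

scores = [0.95, 0.85, 0.75, 0.60, 0.45, 0.30]   # hypothetical P(+|A), sorted decreasingly
labels = ['P', 'P', 'N', 'P', 'N', 'N']         # true classes of the corresponding instances

num_pos = labels.count('P')
num_neg = labels.count('N')

for threshold in scores:
    predicted_pos = [lab for s, lab in zip(scores, labels) if s >= threshold]
    TP = predicted_pos.count('P')
    FP = predicted_pos.count('N')
    TPR = TP / num_pos   # TP / (TP + FN)
    FPR = FP / num_neg   # FP / (FP + TN)
    print(threshold, round(TPR, 2), round(FPR, 2))  # one (FPR, TPR) point on the ROC curve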

47
Q

i. Count the unique values in the ‘Invoice’ column.
ii. Define a new DataFrame that includes only the ‘StockCode’ and ‘Quantity’ columns of df_RD.

A

refer to notes and understand

48
Q

Explain the main differences between Spark Data Frame and Pandas
Data Frame as data structure.

A

Pandas and Spark DataFrames are designed for structured and semi-structured data processing, and they share some similar properties. The main differences between Pandas and PySpark DataFrames are:
• Operations on a PySpark DataFrame run in parallel on different nodes in a cluster, which is not possible with pandas.
• Operations on a PySpark DataFrame are lazy in nature, whereas with pandas we get the result as soon as we apply any operation.
• A PySpark DataFrame cannot be changed because it is immutable; we need to transform it instead. This is not the case with pandas.
• The pandas API supports more operations than the PySpark DataFrame API; in this respect pandas is still more powerful than Spark.
• Complex operations are easier to perform in pandas than in a PySpark DataFrame.
• In addition to the above points, pandas and PySpark DataFrames have some basic differences in column selection, filtering, adding columns, etc.

49
Q

Spark MLlib is similar to Scikit-Learn in terms of APIs. But what is the most important feature of Spark MLlib?

A

Lecture 6 – Handling Massive Datasets, slides 42-58. In short: unlike Scikit-Learn, Spark MLlib runs training and prediction in a distributed fashion across a cluster, so it scales to datasets that do not fit on a single machine.

50
Q

Why is the softmax function suitable for multiple classification but not regression?

A

The softmax function is a function that turns a vector of K real values
into a vector of K real values that sum to 1. The input values can be
positive, negative, zero, or greater than one, but the softmax
transforms them into values between 0 and 1, so that they can be
interpreted as probabilities. If one of the inputs is small or negative,
the softmax turns it into a small probability, and if an input is large,
then it turns it into a large probability, but it will always remain
between 0 and 1.
• The softmax function can be used in a classifier only when the classes are mutually exclusive.

Regression, in contrast, involves processes that estimate the relationships between a dependent variable, also called the ‘outcome variable’, and one or more independent variables, also called ‘predictors’. Regression analyses dependencies amongst variables by estimating the effect that changing one independent variable has on the dependent variable while holding all the other independent variables constant. Because the regression target is an unbounded continuous value rather than a probability distribution over classes, forcing the output to sum to 1 with softmax would not make sense.

51
Q

Implement a feedforward neural network by using the Keras API in TensorFlow for a regression problem. Assume that the data set has four numerical features and one numerical target variable. The network has one hidden layer with the sigmoid activation function. The number of neurons in the hidden layer is considered as a hyperparameter which you need to fine-tune. Write down the Python code of the implementation.

A

# Import necessary libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

np.random.seed(42)
X = np.random.rand(1000, 4)  # Example: 1000 samples, 4 features
y = 3 * X[:, 0] + 2 * X[:, 1] - 1.5 * X[:, 2] + 5 * X[:, 3] + 2 * np.random.randn(1000)  # Example: Regression formula with noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

def create_regression_model(hidden_neurons):
    model = Sequential()
    model.add(Dense(hidden_neurons, input_dim=4, activation='sigmoid'))  # Hidden layer with sigmoid activation
    model.add(Dense(1, activation='linear'))  # Output layer for regression with linear activation
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

hidden_neurons = 16  # Adjust as needed

regression_model = create_regression_model(hidden_neurons)
regression_model.fit(X_train_scaled, y_train, epochs=50, batch_size=32, validation_data=(X_test_scaled, y_test))

loss = regression_model.evaluate(X_test_scaled, y_test)
print(f"Mean Squared Error on Test Set: {loss}")

52
Q

In Naïve Bayesian classifiers, the numerical underflow and the zero count are two important issues. Explain these two issues and describe at least one common technique to overcome each issue.

A

Numerical Underflow:

Issue: In Naïve Bayesian classifiers, the product of probabilities of individual features is calculated to determine the likelihood of a class given the features. When dealing with a large number of small probabilities (especially in continuous data), the product can become extremely small, leading to numerical underflow. As computers have finite precision, extremely small values might be rounded to zero, causing loss of information and accuracy.
Technique to Overcome: To mitigate numerical underflow, one common technique is to work in log-space. Instead of computing the product of probabilities, the log of the probabilities is summed. This helps prevent extremely small values and allows for more stable computations.
Zero Count:

Issue: The Naïve Bayes assumption implies that the occurrence of a feature is independent of other features given the class. If a certain feature has never been observed with a specific class in the training data, the probability becomes zero when using the basic probability formula. This zero probability can dominate the entire product, making predictions unreliable.
Technique to Overcome: Additive smoothing (Laplace smoothing) is a common technique to address zero counts. It involves adding a small constant (usually 1) to the count of each possible feature value for each class. This way, even if a feature has never been observed with a particular class, it still contributes to the probability calculation, avoiding zero probabilities.
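
A minimal sketch of both fixes (hypothetical counts; log-probabilities are summed instead of multiplying raw probabilities, and add-one smoothing keeps unseen feature values from producing zeros):

import math

# Hypothetical training counts for one class: feature value -> count
feature_counts = {'free': 3, 'discount': 1}   # 'urgent' was never seen with this class
class_total = 10
vocabulary_size = 5

def smoothed_log_prob(value):
    # Laplace (add-one) smoothing: unseen values get a small, non-zero probability
    count = feature_counts.get(value, 0)
    return math.log((count + 1) / (class_total + vocabulary_size))

# Work in log space: sum the logs instead of multiplying many tiny probabilities
log_likelihood = sum(smoothed_log_prob(v) for v in ['free', 'discount', 'urgent'])
print(log_likelihood)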