CSCI316 Exam card revision - csv Flashcards
Implement from scratch a Python function to compute the Gini index of a list. This function takes a list of categorical values as input and returns the Gini index as output. Write down the Python code. (4 marks)
def gini_index(values):
    # Calculate the total number of values
    total_values = len(values)
    # Calculate the count of each unique value in the list
    value_counts = {}
    for value in values:
        if value in value_counts:
            value_counts[value] += 1
        else:
            value_counts[value] = 1
    # Gini index = 1 minus the sum of squared class probabilities
    gini = 1.0
    for count in value_counts.values():
        probability = count / total_values
        gini -= probability ** 2
    return gini
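For instance, calling the function on a couple of small hypothetical lists (not part of the original answer) shows the expected behaviour: a 50/50 split gives 0.5, while a pure list gives 0.
print(gini_index(["A", "A", "B", "B"]))  # 0.5
print(gini_index(["A", "A", "A", "A"]))  # 0.0 (pure list)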
Given a list named X which contains words (as strings), implement a Python function to compute the word(s) with the highest frequency in X. Write down the Python code. (5 marks)
from collections import Counter

def most_frequent_words(word_list):
    # Count the frequency of each word in the list
    word_counts = Counter(word_list)
    # Find the maximum frequency in the counts
    max_frequency = max(word_counts.values())
    # Find the word(s) with the maximum frequency
    most_frequent = [word for word, count in word_counts.items() if count == max_frequency]
    return most_frequent
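A quick usage example with a hypothetical word list (assumed here only for illustration); note that ties are returned together.
X = ["data", "mining", "data", "spark", "spark"]
print(most_frequent_words(X))  # ['data', 'spark'] -- both appear twice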
(2.1) Explain why pre-processing is important in big data. (3 marks)
Pre-processing is essential in big data to clean, transform, and prepare data for meaningful analysis and modeling. It enhances data quality, reduces noise, improves model performance, and ensures that the insights drawn from big data are accurate and actionable.
Explain the advantages and disadvantages of data aggregation. (3 marks)
Advantages of data aggregation: it simplifies the data, reduces storage requirements, speeds up query performance, enhances visualization, and can improve privacy and security.
Disadvantages of data aggregation: loss of detail, sampling bias, information loss, limited analysis, difficulty in drilling down, and complex aggregation rules.
Explain undersampling and oversampling, and when you will apply them. (3 marks)
Undersampling:
Explanation: Undersampling involves reducing the number of samples in the majority class(es) to create a more balanced dataset. This is typically done by randomly selecting a subset of samples from the majority class to match the number of samples in the minority class.
Oversampling:
Explanation: Oversampling involves increasing the number of samples in the minority class by generating synthetic samples or duplicating existing ones. This is done to balance the class distribution in the dataset.
When to Apply Undersampling and Oversampling:
Undersampling is typically applied in scenarios where you have a large dataset, and reducing the size of the majority class does not result in a significant loss of information. It can be suitable for situations where computation resources are limited, and a smaller dataset is more manageable.
Oversampling is applied when you have a limited amount of data in the minority class, and simply discarding samples from the majority class would lead to a substantial loss of information. It helps balance the class distribution and allows the model to learn from the minority class more effectively. Oversampling is particularly useful when the overall dataset is small and every sample carries valuable information.
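A minimal sketch of both techniques using simple random resampling with scikit-learn (the array sizes are hypothetical); libraries such as imbalanced-learn offer more sophisticated options (e.g., SMOTE).
import numpy as np
from sklearn.utils import resample

X_majority = np.random.rand(900, 3)   # hypothetical majority-class samples
X_minority = np.random.rand(100, 3)   # hypothetical minority-class samples

# Undersampling: shrink the majority class down to the minority-class size
X_major_down = resample(X_majority, replace=False, n_samples=len(X_minority), random_state=42)

# Oversampling: duplicate minority samples (with replacement) up to the majority-class size
X_minor_up = resample(X_minority, replace=True, n_samples=len(X_majority), random_state=42)

print(X_major_down.shape, X_minor_up.shape)   # (100, 3) (900, 3)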
(2.1) Explain why we cannot reuse the training data for testing in data mining. (2 marks)
We cannot reuse the training data for testing in data mining because it would lead to biased and unreliable model evaluation. The fundamental reason for this is that using the same data for both training and testing introduces a form of “data leakage” or “information leakage” that can artificially inflate the performance metrics of the model
Reusing the training data for testing leads to overfitting, a lack of generalization, and biased evaluation.
Explain the concept of feature selection and feature generation, and in what situation to use each method. (3 marks)
Feature Selection:
• Definition: Feature selection is the process of choosing a subset of the most relevant and informative features (variables or attributes) from the original set of features. It involves eliminating redundant, irrelevant, or noisy features while retaining those that contribute the most to the model’s performance.
• When to Use: When the data are high-dimensional, contain redundant or irrelevant features, or suffer from the curse of dimensionality.
Feature Generation:
• Definition: Feature generation (also called feature construction) creates new features from the existing ones, for example by combining, transforming, or extracting attributes.
• When to Use: When the original features do not adequately capture the patterns the model needs, or when domain knowledge suggests that derived representations of the raw data will be more informative.
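A rough sketch of both ideas on synthetic data (scikit-learn assumed): SelectKBest keeps the most informative existing features (selection), while PolynomialFeatures builds new interaction features from the originals (generation).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Feature selection: keep the 3 features most related to the target
X_selected = SelectKBest(score_func=f_classif, k=3).fit_transform(X, y)

# Feature generation: construct pairwise interaction terms from the original features
X_generated = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(X)

print(X.shape, X_selected.shape, X_generated.shape)   # (200, 10) (200, 3) (200, 55)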
Explain why we need to convert strings to numerical values in data mining. Describe a concrete example to demonstrate the advantage(s) of one-hot encoding compared with the direct conversion of strings to numerical values. (4 marks)
Converting strings to numerical values is crucial in data mining for several reasons:
1. Algorithm Compatibility: Many machine learning algorithms require numerical input data. Algorithms like linear regression, decision trees, and support vector machines operate on mathematical equations, necessitating numerical input.
2. Distance Metrics: Clustering and similarity-based algorithms rely on distance metrics, which require numerical data to calculate distances accurately.
3. Statistical Analysis: Numerical data is vital for statistical analyses and hypothesis testing in data mining.
4. Efficiency: Numerical data often leads to faster model training and inference compared to string or categorical data.
A concrete example highlighting the advantage of one-hot encoding over direct conversion: Suppose we have a dataset with a categorical feature “Fruit” containing values: “Apple,” “Banana,” and “Orange.” Converting these to numerical values (e.g., 1, 2, 3) implies an ordinal relationship, which might be misleading. One-hot encoding creates binary columns (“Is_Apple,” “Is_Banana,” “Is_Orange”) with 0s and 1s, preserving independence among categories. This ensures that the model treats each fruit equally, preventing unintended hierarchy or bias.
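A minimal pandas sketch of the two encodings (the "Fruit" column mirrors the example above; pandas is assumed):
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Orange", "Apple"]})

# Direct conversion: implies an ordering (Apple < Banana < Orange) that does not exist
df["Fruit_code"] = df["Fruit"].map({"Apple": 1, "Banana": 2, "Orange": 3})

# One-hot encoding: one independent binary column per category (Is_Apple, Is_Banana, Is_Orange)
one_hot = pd.get_dummies(df["Fruit"], prefix="Is")
print(one_hot)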
(2.1) Explain the advantages of stratified sampling over standard random sampling. (3 marks)
Stratified sampling is advantageous over standard random sampling for three key reasons:
1. Representative Samples: Stratified sampling ensures that every subgroup, or stratum, within the population is adequately represented in the sample. This guarantees a more accurate reflection of the entire population’s characteristics.
2. Bias Reduction: It reduces the risk of sampling bias. In standard random sampling, there’s a chance of disproportionately selecting samples from one subgroup. Stratified sampling systematically selects from each stratum, minimizing such bias.
3. Precision: When there’s significant variability between subgroups, stratified sampling provides more precise estimates. It minimizes variation within the sample, resulting in more accurate insights into each stratum’s characteristics.
In summary, stratified sampling enhances representativeness, reduces bias, and increases precision compared to standard random sampling, making it a valuable sampling method.
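A brief scikit-learn sketch of the difference (synthetic data with a hypothetical 95/5 class split):
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 4)
y = np.array([0] * 950 + [1] * 50)   # 5% minority class

# Standard random split: the minority proportion in the test set can drift
_, _, _, y_test_rand = train_test_split(X, y, test_size=0.2, random_state=1)

# Stratified split: each class keeps its 95/5 proportion in both train and test sets
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

print(y_test_rand.mean(), y_test_strat.mean())   # the stratified split stays at ~0.05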
(2.2) Describe three common ways of handling missing values. (4 marks)
Three common ways to handle missing values are:
1. Deletion or Removal: In this method, rows or columns with missing data are entirely removed from the dataset. It’s simple but can lead to a loss of valuable information and reduced sample size.
2. Imputation: Imputation involves filling in missing values with estimated or calculated values. Common techniques include mean, median, or mode imputation, where the missing values are replaced with the mean, median, or mode of the observed data in the same column. Another approach is regression imputation, where a regression model is used to predict missing values based on other variables.
3. Advanced Imputation: More advanced techniques include using machine learning algorithms to predict missing values based on the relationships between variables. Methods like K-nearest neighbors (KNN) imputation, decision tree imputation, or matrix factorization imputation can be employed to handle missing data more effectively.
Each of these methods has its advantages and limitations, and the choice depends on the nature of the data and the specific problem at hand.
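A brief sketch of the three approaches on a hypothetical DataFrame (pandas and scikit-learn assumed):
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 35], "income": [50000, 62000, np.nan, 58000]})

dropped = df.dropna()                                      # 1. Deletion: remove rows with missing values
mean_filled = df.fillna(df.mean())                         # 2. Imputation: replace with the column mean
knn_filled = KNNImputer(n_neighbors=2).fit_transform(df)   # 3. Advanced: KNN-based imputation
print(dropped, mean_filled, knn_filled, sep="\n")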
Assume that you are given a set of records as shown in the following table, where the last column contains the target variable. Present the procedure of using Gain Ratio to identify which attribute should be split. You need to show all steps of your calculation in detail.
(6 marks)
Case | Lecturer experience | Programming Subject? | Student satisfaction
1    | Strong              | No                   | Low
2    | Weak                | No                   | Low
3    | Weak                | Yes                  | Low
4    | Weak                | Yes                  | Low
5    | Strong              | No                   | High
6    | Strong              | No                   | High
7    | Strong              | Yes                  | High
8    | Weak                | Yes                  | High
To calculate the Gain Ratio for each attribute, follow these steps:
Step 1: Calculate the Entropy of the Target Variable (Student Satisfaction):
• Calculate the proportion of each class (Low and High satisfaction) in the target variable.
• Calculate the entropy using the formula:
  Entropy(S) = -p(Low) * log2(p(Low)) - p(High) * log2(p(High))
Step 2: Calculate the Information Gain for Each Attribute:
• Calculate the entropy of the target variable for each unique value of an attribute (Lecturer Experience, Programming Subject).
• Calculate the weighted average entropy using the formula:
  Information Gain(Attribute) = Entropy(S) - Σ [(|Sv| / |S|) * Entropy(Sv)]
Step 3: Calculate the Split Information for Each Attribute:
• Calculate the proportion of each unique value of the attribute.
• Calculate the split information using the formula:
  Split Information(Attribute) = - Σ (|Sv| / |S|) * log2(|Sv| / |S|)
Step 4: Calculate the Gain Ratio for Each Attribute:
• Calculate the Gain Ratio using the formula:
  Gain Ratio(Attribute) = Information Gain(Attribute) / Split Information(Attribute)
Step 5: Compare Gain Ratios:
• Compare the Gain Ratios for each attribute.
• The attribute with the highest Gain Ratio is the best attribute to split on, as it provides the most information gain while considering its potential for overfitting (split information).
In summary, calculate the Gain Ratio for each attribute, and the one with the highest Gain Ratio is chosen as the best attribute to split on. This attribute will be the root of the decision tree in a decision tree classifier.
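As an illustration only (not part of the original answer), the sketch below applies these steps to the table above in Python. For this data it gives Gain Ratio ≈ 0.19 for Lecturer experience and 0 for Programming Subject, so Lecturer experience should be split.
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def gain_ratio(rows, attr_index, target_index):
    target = [r[target_index] for r in rows]
    overall = entropy(target)
    info, split_info = 0.0, 0.0
    for v in set(r[attr_index] for r in rows):
        subset = [r[target_index] for r in rows if r[attr_index] == v]
        weight = len(subset) / len(rows)
        info += weight * entropy(subset)           # weighted average entropy
        split_info -= weight * log2(weight)        # split information
    return (overall - info) / split_info

# (Lecturer experience, Programming Subject?, Student satisfaction), from the table
data = [("Strong", "No", "Low"), ("Weak", "No", "Low"), ("Weak", "Yes", "Low"),
        ("Weak", "Yes", "Low"), ("Strong", "No", "High"), ("Strong", "No", "High"),
        ("Strong", "Yes", "High"), ("Weak", "Yes", "High")]

print(gain_ratio(data, 0, 2))  # Lecturer experience: ~0.19
print(gain_ratio(data, 1, 2))  # Programming Subject?: 0.0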
Why can an ensemble classifier (such as a Random Forest) enhance the performance of individual classifiers?
Ensemble classifiers, like Random Forests, enhance performance by combining multiple individual classifiers:
1. Reducing Variance: Ensemble methods reduce the risk of overfitting by averaging or combining the predictions of multiple weak learners. This helps to smooth out noisy data and improves generalization.
2. Improved Robustness: By combining diverse models, ensembles become more robust to outliers and errors in individual models. They are less likely to be misled by a single incorrect prediction.
3. Better Generalization: Ensembles capture different patterns in the data. Combining these patterns results in a more accurate and stable prediction, especially when the data is complex or contains hidden relationships.
4. Reduced Bias: Ensembles can reduce bias by incorporating a variety of modeling techniques. This means that they are more likely to capture the true underlying patterns in the data.
In summary, ensemble classifiers combine the strength of multiple models, reducing variance, improving robustness, enhancing generalization, and reducing bias, which collectively enhance their performance compared to individual classifiers.
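A small illustrative sketch (synthetic data, scikit-learn assumed) comparing a single decision tree with a Random Forest on the same data; on noisy data like this the forest typically scores higher on the held-out folds.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("Single tree accuracy:", cross_val_score(tree, X, y, cv=5).mean())
print("Random Forest accuracy:", cross_val_score(forest, X, y, cv=5).mean())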
(4.1) Use an example to illustrate the conditional independence assumption, and explain why it is important to the Naïve Bayes classifier. (3 marks)
Example: Let’s consider a spam email classification scenario where we have two features: “contains the word ‘free’” (F) and “contains the word ‘discount’” (D), and we want to classify an email as either spam (S) or not spam (NS).
Explanation:
• According to the conditional independence assumption, the probability of an email containing both “free” and “discount” given that it is spam equals the product of the individual conditional probabilities: P(F, D | S) = P(F | S) * P(D | S).
Importance to Naïve Bayes Classifier:
It simplifies computation, makes high-dimensional data tractable, and decouples the features so that each conditional probability can be estimated independently.
(4.1) In Naïve Bayesian classifiers, the numerical underflow and the zero count are two important issues. Explain these two issues and describe at least one common technique to overcome each issue. (4 marks)
1. Numerical Underflow: This issue occurs when multiplying many probabilities together, which can result in extremely small values that may lead to numerical precision errors. To overcome this, we can work in the log space by taking the logarithm of probabilities and summing them instead of multiplying.
2. Zero Count: When a feature in the test data has a value that was not seen in the training data, it results in a zero probability estimate. Laplace smoothing (add-one smoothing) is a common technique to address this. It involves adding a small constant to all counts to ensure that no probability becomes zero, allowing the model to make reasonable predictions even for unseen data.
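A minimal sketch of both fixes (the counts below are hypothetical, not from the original answer): Laplace (add-one) smoothing for zero counts, and log-space summation to avoid underflow.
import math

# Hypothetical training counts of a word per class
word_count_in_spam = 0        # word never seen in spam -> zero-count problem
total_words_in_spam = 1000
vocabulary_size = 5000

# Laplace smoothing: add 1 to every count so no probability is exactly zero
p_word_given_spam = (word_count_in_spam + 1) / (total_words_in_spam + vocabulary_size)

# Log space: sum log-probabilities instead of multiplying many small probabilities
feature_probs = [0.01, 0.002, 0.03, 0.005]        # hypothetical P(feature_i | spam)
log_score = math.log(0.4) + sum(math.log(p) for p in feature_probs)  # log prior + log likelihoods
print(p_word_given_spam, log_score)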
Explain in which situations sensitivity and specificity are more important than accuracy as performance metrics of a classifier.
1. Class Imbalance: One class significantly outnumbers the other. Sensitivity is crucial for detecting the minority class correctly.
2. Costly Errors: Different types of classification errors have varying consequences. Sensitivity and specificity help balance the trade-off based on error costs.
3. Security and Anomaly Detection: Detecting attacks or anomalies accurately is critical. False negatives can lead to security breaches.
4. Legal or Ethical Implications: Minimizing false negatives is a top priority in situations where the consequences of missing a positive case are severe.
In these scenarios, sensitivity and specificity provide a more nuanced evaluation of classifier performance than accuracy.
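A small numeric sketch of the three metrics using hypothetical confusion-matrix counts for an imbalanced problem: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP).
TP, FN, TN, FP = 40, 10, 900, 50   # hypothetical counts

sensitivity = TP / (TP + FN)       # proportion of actual positives that are caught
specificity = TN / (TN + FP)       # proportion of actual negatives correctly rejected
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(sensitivity, specificity, accuracy)   # accuracy looks high (0.94) even though sensitivity is only 0.8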
(5.1) Use an example to explain how the MapReduce model can process the outer join operation. (3 marks)
In MapReduce, an outer join operation can be achieved by emitting key-value pairs from two datasets, one for employees and another for departments, during the mapping phase. Keys are assigned based on a common attribute like department ID. In the reducing phase, records with the same key are grouped together, allowing reducers to combine employee and department information when there’s a match and handle cases with missing matches. This process enables efficient outer join operations on large datasets in a distributed manner.
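A conceptual sketch of the map and reduce functions in Python (the record layouts are hypothetical and no specific MapReduce framework API is assumed): mappers tag each record with its source table, and the reducer joins the tagged records per department ID, padding with None when one side has no match.
def map_employee(emp):          # emp = (emp_id, name, dept_id)
    yield emp[2], ("EMP", emp)

def map_department(dept):       # dept = (dept_id, dept_name)
    yield dept[0], ("DEPT", dept)

def reduce_outer_join(dept_id, tagged_records):
    employees = [r for tag, r in tagged_records if tag == "EMP"]
    departments = [r for tag, r in tagged_records if tag == "DEPT"]
    # Full outer join: emit all pairs, padding with None where there is no match
    for emp in employees or [None]:
        for dept in departments or [None]:
            yield dept_id, (emp, dept)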
Why is Apache Spark suitable for large-scale machine learning? Use an example to support your answer. (3 marks)
Apache Spark is suitable for large-scale machine learning due to its in-memory processing capabilities and distributed computing framework. It efficiently handles large datasets and iterative algorithms commonly used in machine learning.
For example, consider a large-scale recommendation system. Spark’s ability to cache data in memory allows it to store user-product interactions, facilitating rapid model updates. Its distributed nature handles parallel processing of recommendations for multiple users, making it ideal for large-scale scenarios.
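A hedged PySpark sketch of such a recommender using the ALS algorithm from Spark MLlib (the ratings file and column names are hypothetical):
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recommender").getOrCreate()
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)  # userId, itemId, rating

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating", rank=10, maxIter=5)
model = als.fit(ratings)                   # training is distributed across the cluster
recs = model.recommendForAllUsers(10)      # top-10 recommendations per user, computed in parallel
recs.show(5)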
(5.2) Why is Apache Spark more suitable for data-parallel computation than for model-parallel computation? Also use an example to support your answer. (4 marks)
Apache Spark is more suitable for data-parallel computation than for model-parallel computation because its strengths lie in processing large volumes of data in parallel across a cluster of machines. It excels at parallelizing data transformations and actions.
For example, in a large-scale data analysis task like log processing, Apache Spark can efficiently distribute and process log files across nodes. It can perform operations like filtering, mapping, and aggregations on this data in parallel, making it well-suited for data-parallel tasks.
In contrast, model-parallel computation involves distributing and training parts of a machine learning model across different nodes. Spark is not as inherently designed for this type of computation, and it may require additional custom implementations and coordination. This makes it less suitable for model-parallel tasks where the focus is on distributing and training a complex model across multiple machines.
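A minimal PySpark sketch of the data-parallel log-processing example above (the log directory and line format are hypothetical): the same filter and aggregation run on every partition in parallel.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-analysis").getOrCreate()
logs = spark.read.text("hdfs:///logs/")                      # each line is one log record

errors_per_host = (logs
    .filter(F.col("value").contains("ERROR"))                # data-parallel filter
    .withColumn("host", F.split(F.col("value"), " ")[0])     # assumes the host is the first token
    .groupBy("host").count())                                # parallel aggregation
errors_per_host.show()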
Why is a classical Perceptron (i.e., a single layer of linear threshold units) not preferable to use? (2 marks)
A classical Perceptron (single layer of linear threshold units) is not preferable because it:
1. Handles Only Linearly Separable Data: Can only solve problems where data is linearly separable, limiting its use in complex, real-world tasks.
2. Lacks Capacity for Deep Learning: Lacks hidden layers for hierarchical feature learning, unlike modern neural networks capable of more sophisticated tasks.
As a result, it’s not suitable for most real-world machine learning problems.
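An illustrative sketch (scikit-learn assumed) of the linear-separability limitation: a single-layer Perceptron cannot learn XOR because no single linear boundary separates its classes.
import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                      # XOR labels: not linearly separable

clf = Perceptron(max_iter=1000, tol=1e-3).fit(X, y)
print(clf.score(X, y))   # stays below 1.0: the Perceptron cannot fit XOR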
Why is the softmax function suitable for multi-class classification but not regression? Also write some sample code to illustrate the difference between implementing an ANN classifier and regressor in Keras. (6 marks)
The softmax function is suitable for multiple classification but not regression because:
• Output Distribution: Softmax transforms the output into a probability distribution, ensuring that the sum of outputs is 1. This is suitable for classifying data into multiple classes, where each class represents a mutually exclusive category.
• Categorical Output: In classification, we want to determine the class or category to which an input belongs. Softmax assigns a probability to each class, and the class with the highest probability is the predicted class.
# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense

# Classification model: softmax output, categorical cross-entropy loss
X_classification = np.random.rand(100, 5)  # Example input features
y_classification = np.random.randint(0, 3, size=(100,))  # Example classification labels (3 classes)
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(X_classification, y_classification, test_size=0.2, random_state=42)
y_train_cls_one_hot = np.eye(3)[y_train_cls]  # One-hot encode the 3 classes
y_test_cls_one_hot = np.eye(3)[y_test_cls]
model_cls = Sequential()
model_cls.add(Dense(10, input_dim=5, activation='relu'))
model_cls.add(Dense(3, activation='softmax'))  # Output layer with softmax for classification
model_cls.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model_cls.fit(X_train_cls, y_train_cls_one_hot, epochs=10, batch_size=16, validation_data=(X_test_cls, y_test_cls_one_hot))

# Regression model: linear output, mean squared error loss
X_regression = np.random.rand(100, 5)  # Example input features
y_regression = np.random.rand(100, 1)  # Example regression targets
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_regression, y_regression, test_size=0.2, random_state=42)
model_reg = Sequential()
model_reg.add(Dense(10, input_dim=5, activation='relu'))
model_reg.add(Dense(1, activation='linear'))  # Output layer with linear activation for regression
model_reg.compile(loss='mean_squared_error', optimizer='adam', metrics=['mse'])
model_reg.fit(X_train_reg, y_train_reg, epochs=10, batch_size=16, validation_data=(X_test_reg, y_test_reg))
In this example, we create two separate models for classification and regression using Keras. The key differences lie in the output layers: softmax activation for classification and linear activation for regression. Additionally, the loss functions used are different: ‘categorical_crossentropy’ for classification and ‘mean_squared_error’ for regression.
Summarise the difference between pre-processing and data mining. (4 marks)
Pre-processing:
Cleans and organizes raw data for analysis.
Involves tasks like normalization and handling missing values.
Data Mining:
Discovers patterns and insights in large datasets.
Uses techniques like clustering and classification.