Week 2 Flashcards

1
Q

What is “Churn Rate”?

A

refers to the percentage of customers who stop doing business with a company over a specific period. It is a key metric, especially for subscription-based businesses, as it indicates how well the company retains its customers.

2
Q

How to calculate ‘Customer Churn’?

A

Churn Rate = (Number of Customers Lost in a Period / Total Customers at the Beginning of the Period) × 100
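As a quick sketch of the formula in Python (the figures below are made up for illustration):

```python
# Hypothetical example: 25 of 500 customers churned during the period.
customers_at_start = 500  # total customers at the beginning of the period
customers_lost = 25       # customers lost during the period

churn_rate = (customers_lost / customers_at_start) * 100
print(churn_rate)  # 5.0
```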

3
Q

What is “Customer Tenure”?

A

refers to the length of time (in months) a customer has been with the company. It is a measure of customer loyalty and can indicate how long a customer has remained subscribed to the company’s services.

Tenure is often used alongside metrics like “Churn” to assess customer retention and loyalty trends.

4
Q

What are the common techniques used in handling missing data?

A

Identifying Missing Data
Removing Missing Data
Imputing Missing Data

5
Q

What is ‘Identifying Missing Data’?

A

detecting where data is absent or incomplete in a dataset.

6
Q

What is ‘Removing Missing Data’?

A

involves deleting rows or columns that contain missing values.

7
Q

What is ‘Imputing Missing Data’?

A

involves filling in the missing values with substituted values without deleting rows or columns.

8
Q

What are the 2 Common Approaches to impute missing data?

A

Mean/Median/Mode Imputation and Forward/Backward Fill

9
Q

What is ‘Mean/Median/Mode’ imputation?

A

replaces missing values with a central tendency value to ensure that the dataset can still be used for analysis or modeling. While easy to implement, these methods can distort the original distribution of the dataset and may introduce bias.
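A minimal pandas sketch of mean/median/mode imputation (the toy Series is made up for illustration):

```python
import pandas as pd

# Toy column with one missing value.
s = pd.Series([10.0, 20.0, None, 30.0])

mean_filled = s.fillna(s.mean())      # NaN -> 20.0 (mean of 10, 20, 30)
median_filled = s.fillna(s.median())  # NaN -> 20.0
mode_filled = s.fillna(s.mode()[0])   # NaN -> first modal value
```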

10
Q

What is ‘Forward/Backward Fill’?

A

involves filling missing values with the previous or next available values, respectively, based on the order of the data.
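A minimal pandas sketch, using an invented Series to show both directions of filling:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])
forward = s.ffill()   # each gap takes the previous value: 1, 1, 1, 4
backward = s.bfill()  # each gap takes the next value:     1, 4, 4, 4
```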

11
Q

What is ‘Visual Inspection’ in Identifying Missing Data?

A

uses graphical representations to identify data gaps, patterns, and abnormalities.

12
Q

What are ‘Data Summary Tables’ in Identifying Missing Data?

A

show an overview of the missing values in the dataset.

13
Q

What are ‘Box Plots’ in Identifying Missing Data?

A

summarize the distribution of a variable and show outliers. If data is missing, there may be visible gaps or unusual behavior in the plot.

14
Q

What are ‘Heatmaps’ in Identifying Missing Data?

A

a missing data heatmap highlights where missing data exists in a dataset. It uses color to indicate missing vs. non-missing values across the entire dataset.

15
Q

What is ‘Descriptive Statistics’ in Identifying Missing Data?

A

summarizes and describes the basic characteristics of a dataset.

16
Q

What are ‘Null’ or ‘NaN’ Counts?

A

counts the number of missing values in each column (or feature) of the dataset. In Python, missing data typically appears as NaN (Not a Number) or None (null).
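A minimal pandas sketch of per-column null counts (the small DataFrame is invented for illustration):

```python
import pandas as pd

# Hypothetical dataset with one gap in each column.
df = pd.DataFrame({"age": [25, None, 40], "city": ["NY", "LA", None]})
null_counts = df.isnull().sum()  # NaN/None count per column
print(null_counts)  # age: 1, city: 1
```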

17
Q

What is ‘Percentage of Missing Data’?

A

provides a proportion of missing data relative to the total number of entries in a column.
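A minimal pandas sketch of this metric, on a made-up column where half the entries are missing:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, None]})
# Fraction of missing entries per column, expressed as a percentage.
missing_pct = df.isnull().mean() * 100
print(missing_pct["age"])  # 50.0
```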

18
Q

What are the 3 patterns in ‘Pattern Analysis of Missing Data’?

A

Missing Completely at Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR)

19
Q

What are the 3 types of Missing Data?

A

Missing Completely at Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR)

20
Q

What is ‘Missing Completely at Random (MCAR)’?

A

where the absence of data is unrelated to any other variables in the dataset. Removing this type of missing data generally won’t introduce bias.

This occurs when the reason for the absence of a value is entirely random and unrelated to any other variables in the dataset. For example, a survey respondent accidentally skips a question, resulting in a missing value in the dataset.

21
Q

What is ‘Missing At Random (MAR)’?

A

when the missingness is related to observed data, meaning that it is not random but can be explained by other variables.

the absence of data isn’t random and can be explained by other observed variables in the dataset. For example, in a health survey, individuals working night shifts may be less likely to respond to a survey conducted during daytime hours. The missingness of their responses is related to their work schedules, an observed variable, but not directly to their health status, which is the variable of interest.

22
Q

What is ‘Missing Not At Random (MNAR)’?

A

when the missing data is related to the unobserved value, making removal potentially biased.

This occurs when the absence of data is directly related to the value itself, even after accounting for other variables. For example, in mental health research, individuals with more severe symptoms are less likely to complete assessments due to the nature of their condition. Here, the missing data is directly related to the severity of the unobserved symptoms, not to the assessment itself.

23
Q

What is the ‘Low Percentage of Missing Data’ factor?

A

If only a small percentage of the dataset has missing values, dropping rows or columns may be an acceptable solution.

Example:
If less than 5% of the rows contain missing values, the loss of information may be minimal

24
Q

What are ‘Irrelevant Features or Low-Impact Variables’?

A

if the feature or variable with missing data is not crucial for the analysis, it can be removed without significantly impacting the model’s performance.

25
Q

What are ‘Sparsely Populated Columns’?

A

if a column has a significant number of missing values (e.g., >50%), it might be better to remove the column entirely, as imputing values would introduce too much noise or bias.

26
Q

What is ‘Removing Rows with Missing Data’?

A

you can remove rows that contain missing data using dropna() with its default axis=0. This removes any row that contains at least one missing value.
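A minimal pandas sketch of row-wise removal (the DataFrame here is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, None]})
cleaned = df.dropna()  # default axis=0: drop any row containing a missing value
# only the first row (a=1.0, b=4.0) has no missing values
```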

27
Q

What is ‘Removing Columns with Missing Data’?

A

to remove columns with missing data, you can use dropna() with the axis=1 argument. This will remove any column that contains at least one missing value.
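A minimal pandas sketch of column-wise removal, again on invented data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, None, 6.0]})
cleaned = df.dropna(axis=1)  # drop any column containing a missing value
# column "b" is removed; only "a" remains
```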

28
Q

What is the ‘Single Imputation Method’?

A

In single imputation, each missing value in a dataset is replaced with a single estimated value. These methods are generally easier to implement than multiple imputation methods. However, they treat the imputed values as if they were true values, ignoring the uncertainty associated with the imputation process.

29
Q

What is ‘Regression Imputation’?

A

utilizes a regression model built on observed data to predict the missing values based on relationships with other variables.

30
Q

What is ‘Hot Deck Imputation’?

A

estimates missing values by randomly selecting similar values from “donor” records within the dataset. This method retains the original pattern of associations in the dataset but may introduce randomness due to the selection process.

31
Q

What is the ‘Multiple Imputation Method’?

A

creates several imputed datasets and analyzes them together. These techniques account for the uncertainty of the imputation process and provide more accurate results than single imputation. However, they are generally computationally expensive and require larger sample sizes to provide accurate predictions.

32
Q

What is ‘Outlier Detection and Removal’?

A

outliers are data points that differ significantly from the rest of the dataset. Outlier detection and removal ensures data quality.

33
Q

What are the 3 different types of Outliers?

A

Global or point outliers
Collective outliers
Contextual or conditional outliers

34
Q

What is a ‘Global Outlier’?

A

Global outliers, also called point outliers, are the simplest form of outlier: a single data point that deviates from all the rest of the data points in a given dataset. Most outlier detection procedures are targeted at determining global outliers. (In the accompanying diagram, the green data point is the global outlier.)

35
Q

What is a ‘Collective Outlier’?

A

When a group of data points deviates from the rest of a dataset, they are called collective outliers. The individual data objects may not be outliers on their own, but taken together they behave as one. Identifying collective outliers requires background knowledge about the relationships among the data objects. For example, in an Intrusion Detection System, a DoS packet sent from one system to another is normal behavior on its own; but when many computers send such packets simultaneously, the behavior is abnormal, and together the packets form a collective outlier. (In the accompanying diagram, the green data points as a whole represent the collective outlier.)

36
Q

What is a ‘Contextual Outlier’?

A

As the name suggests, a “contextual” outlier only appears within a context; contextual outliers are also known as conditional outliers. They occur when a data object deviates from the other data points only under a specific condition in the dataset. Data objects have two kinds of attributes: contextual attributes (such as time or location) and behavioral attributes (the measured values). Contextual outlier analysis lets users examine outliers under different contexts and conditions, which is useful in many applications, such as isolated background noise in speech recognition. For example, a temperature reading of 45 degrees Celsius may be an outlier in the rainy season but a normal data point in summer. (In the accompanying diagram, a green dot representing a low temperature in June is a contextual outlier, since the same value in December would not be.)

37
Q

What is ‘Outlier Detection’

A

identifies data points that significantly differ from the rest of the dataset.

38
Q

What are the techniques used in Outlier Detection?

A

Statistical Outlier Detection and Visual Outlier Detection

39
Q

What is ‘Statistical Outlier Detection’

A

identifies outliers using statistical methods based on the distribution and characteristics of the dataset.
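One common statistical method (the z-score rule, used here as an illustrative assumption, on made-up data) flags points that lie more than a chosen number of standard deviations from the mean:

```python
import numpy as np

# Invented sample: 95.0 stands far from the cluster around 10-13.
data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])
z_scores = np.abs((data - data.mean()) / data.std())
outliers = data[z_scores > 2]  # flag points more than 2 standard deviations out
```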

40
Q

What is ‘Visual Outlier Detection’

A

uses graphical methods to identify outliers in data.

41
Q

What is ‘Outlier Handling’?

A

is the process and set of techniques used to identify, analyze, and manage outliers in datasets. Be cautious when removing outliers, as doing so can lead to loss of valuable information; if you need to remove outliers, investigate before taking action.

42
Q

What is ‘Noise’ in Data?

A

refers to random errors or variations in measured data that do not reflect the actual values of the underlying phenomena.

43
Q

What is ‘Data Smoothing’?

A

eliminates or reduces the impact of random noise, outliers, and irregularities, typically by applying an algorithm that removes noise from a data set. This allows important patterns to stand out more clearly.

44
Q

What is a ‘Moving Average’?

A

a method used in data preprocessing to smooth out fluctuations in data. It calculates the average of a specified number of data points (called the “window size”) and slides this window across the dataset. This helps reduce noise and reveals underlying trends.
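A minimal pandas sketch of a moving average with a window size of 3, on an invented series:

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])
smoothed = s.rolling(window=3).mean()  # average over a sliding window of 3 points
# first two entries are NaN (window not yet full); the third is (2+4+6)/3 = 4.0
```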

45
Q

What does ‘Generating New Features’ in Feature Engineering mean?

A

involves identifying aspects of the data that are not directly observable and making them explicit.

46
Q

What are ‘Aggregated Features’ in Feature Engineering?

A

summarize data across multiple rows or over a group, often through operations like mean, sum, max, or count.
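A minimal pandas sketch of aggregation over groups (the transactions table is invented for illustration):

```python
import pandas as pd

# Hypothetical transactions: two purchases by customer A, one by B.
df = pd.DataFrame({"customer": ["A", "A", "B"], "amount": [10, 20, 5]})
agg = df.groupby("customer")["amount"].agg(["mean", "sum", "max", "count"])
# customer A: mean 15, sum 30, max 20, count 2
```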

47
Q

What are ‘Interaction Features’ in Feature Engineering?

A

Interaction features are new features created by combining two or more existing features in a dataset. These combinations capture the relationship or interaction between the original features, which may improve the performance of machine learning models.

Think of interaction features as a way to understand how two factors work together, rather than individually. For example, the effect of one feature on the target variable might depend on the value of another feature. By explicitly creating features that represent these relationships, we give models a chance to learn more complex patterns.
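A minimal pandas sketch of one common interaction, the product of two features (the columns here are hypothetical):

```python
import pandas as pd

# Hypothetical columns; their product captures how the two act together.
df = pd.DataFrame({"length": [2, 3], "width": [4, 5]})
df["length_x_width"] = df["length"] * df["width"]  # interaction feature
```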

48
Q

What is ‘Temporal Feature’ in Feature Engineering?

A

derived from date or time-related data. They capture patterns based on the passage of time (e.g., age, seasonality).

Example:
You can calculate the age of the house from the Year Built column
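A minimal pandas sketch of deriving house age from a Year Built column (the data and the reference year are assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame({"YearBuilt": [1990, 2010]})
current_year = 2024  # reference year, fixed here for the example
df["HouseAge"] = current_year - df["YearBuilt"]  # temporal feature
```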

49
Q

What are ‘Text-Based Features’ in Feature Engineering?

A

are numerical or categorical representations derived from textual data. Since machine learning models cannot directly understand raw text, we transform text into features that models can process.

50
Q

What is ‘Bag of Words (BoW)’ in Feature Engineering?

A

one of the simplest and most widely used methods for creating text-based features. It represents a piece of text (like a sentence or document) as a collection of words, disregarding grammar, word order, and context but keeping the frequency of words.
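A minimal pure-Python sketch of Bag of Words on two invented documents (libraries such as scikit-learn provide this too, but a hand-rolled version shows the idea):

```python
from collections import Counter

docs = ["the cat sat", "the cat ran"]
# Vocabulary: all distinct words, in a fixed (sorted) order.
vocab = sorted({word for doc in docs for word in doc.split()})
# Each document becomes a vector of word counts over the vocabulary.
bow = [[Counter(doc.split())[word] for word in vocab] for doc in docs]
# vocab == ['cat', 'ran', 'sat', 'the']; bow[0] == [1, 0, 1, 1]
```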

51
Q

What are ‘Domain-Specific Features’ in Feature Engineering?

A

are tailored to the problem domain, incorporating expert knowledge or industry-specific metrics.

52
Q

What is ‘One-Hot Encoding’ in Feature Engineering?

A

converts categorical values into binary columns.
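A minimal pandas sketch using `get_dummies` on an invented categorical column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
encoded = pd.get_dummies(df, columns=["color"])  # one binary column per category
# resulting columns: color_blue, color_red
```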

53
Q

What is ‘Label Encoding’ in Feature Engineering?

A

assigns a unique integer to each category.
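A minimal pandas sketch of label encoding via category codes (scikit-learn's LabelEncoder is another common option); the data is invented:

```python
import pandas as pd

s = pd.Series(["red", "blue", "red"])
# Categories are ordered alphabetically: blue -> 0, red -> 1.
codes = s.astype("category").cat.codes
```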
