Week 2 Flashcards

1
Q

What is “Churn Rate”?

A

refers to the percentage of customers who stop doing business with a company over a specific period. It is a key metric, especially for subscription-based businesses, as it indicates how well the company retains its customers.

2
Q

How to calculate ‘Customer Churn’?

A

Churn Rate = (Number of Customers Lost in a Period / Total Customers at the Beginning of the Period) × 100
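As a quick sketch of the formula in Python (the figures below are made up for illustration):

```python
# Hypothetical example: 25 of 500 customers churned during the period.
customers_at_start = 500  # total customers at the beginning of the period
customers_lost = 25       # customers lost during the period

churn_rate = (customers_lost / customers_at_start) * 100
print(churn_rate)  # 5.0
```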

3
Q

What is “Customer Tenure”?

A

refers to the length of time (in months) a customer has been with the company. It is a measure of customer loyalty and can indicate how long a customer has remained subscribed to the company’s services.

Tenure is often used alongside metrics like “Churn” to assess customer retention and loyalty trends.

4
Q

What are the common techniques used in handling missing data?

A

Identifying Missing Data
Removing Missing Data
Imputing Missing Data

5
Q

What is ‘Identifying Missing Data’?

A

detecting where data is absent or incomplete in a dataset.

6
Q

What is ‘Removing Missing Data’?

A

involves deleting rows or columns that contain missing values.

7
Q

What is ‘Imputing Missing Data’?

A

involves filling in the missing values with substituted values without deleting rows or columns.

8
Q

What are the 2 Common Approaches to impute missing data?

A

Mean/Median/Mode Imputation and Forward/Backward Fill

9
Q

What is ‘Mean/Median/Mode’ imputation?

A

replaces missing values with a central tendency value to ensure that the dataset can still be used for analysis or modeling. While easy to implement, these methods can distort the original distribution of the dataset and may introduce bias.
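A minimal pandas sketch of mean/median/mode imputation (the toy Series is made up for illustration):

```python
import pandas as pd

# Toy column with one missing value.
s = pd.Series([10.0, 20.0, None, 30.0])

mean_filled = s.fillna(s.mean())      # NaN -> 20.0 (mean of 10, 20, 30)
median_filled = s.fillna(s.median())  # NaN -> 20.0
mode_filled = s.fillna(s.mode()[0])   # NaN -> first modal value
```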

10
Q

What is ‘Forward/Backward Fill’?

A

involves filling missing values with the previous or next available values, respectively, based on the order of the data.
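A minimal pandas sketch, using an invented Series to show both directions of filling:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])
forward = s.ffill()   # each gap takes the previous value: 1, 1, 1, 4
backward = s.bfill()  # each gap takes the next value:     1, 4, 4, 4
```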

11
Q

What is ‘Visual Inspection’ in Identifying Missing Data?

A

uses graphical representations to identify data gaps, patterns, and abnormalities.

12
Q

What are ‘Data Summary Tables’ in Identifying Missing Data?

A

show an overview of the missing values in the dataset.

13
Q

What are ‘Box Plots’ in Identifying Missing Data?

A

summarize the distribution of a variable and show outliers. If data is missing, there may be visible gaps or unusual behavior in the plot.

14
Q

What are ‘Heatmaps’ in Identifying Missing Data?

A

a missing data heatmap highlights where missing data exists in a dataset. It uses color to indicate missing vs. non-missing values across the entire dataset.

15
Q

What is ‘Descriptive Statistics’ in Identifying Missing Data?

A

summarizes and describes the basic characteristics of a dataset.

16
Q

What are ‘Null’ or ‘NaN’ Counts?

A

counts the number of missing values in each column (or feature) of the dataset. In Python, missing data typically appears as NaN (Not a Number) or None (null).
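A minimal pandas sketch of per-column null counts (the small DataFrame is invented for illustration):

```python
import pandas as pd

# Hypothetical dataset with one gap in each column.
df = pd.DataFrame({"age": [25, None, 40], "city": ["NY", "LA", None]})
null_counts = df.isnull().sum()  # NaN/None count per column
print(null_counts)  # age: 1, city: 1
```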

17
Q

What is ‘Percentage of Missing Data’?

A

provides a proportion of missing data relative to the total number of entries in a column.
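A minimal pandas sketch of this metric, on a made-up column where half the entries are missing:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, None]})
# Fraction of missing entries per column, expressed as a percentage.
missing_pct = df.isnull().mean() * 100
print(missing_pct["age"])  # 50.0
```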

18
Q

What are the 3 patterns in ‘Pattern Analysis of Missing Data’?

A

Missing Completely at Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR)

19
Q

What are the 3 types of Missing Data?

A

Missing Completely at Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR)

20
Q

What is ‘Missing Completely at Random (MCAR)’?

A

where the absence of data is unrelated to any other variables in the dataset. Removing this type of missing data generally won’t introduce bias.

This occurs when the reason for the absence of a value is entirely random and unrelated to any other variables in the dataset. For example, a survey respondent accidentally skips a question, resulting in a missing value in the dataset.

21
Q

What is ‘Missing At Random (MAR)’?

A

when the missingness is related to observed data, meaning that it is not random but can be explained by other variables.

the absence of data isn’t random and can be explained by other observed variables in the dataset. For example, in a health survey, individuals working night shifts may be less likely to respond to a survey conducted during daytime hours. The missingness of their responses is related to their work schedules, an observed variable, but not directly to their health status, which is the variable of interest.

22
Q

What is ‘Missing Not At Random (MNAR)’?

A

when the missing data is related to the unobserved value, making removal potentially biased.

This occurs when the absence of data is directly related to the value itself, even after accounting for other variables. For example, in mental health research, individuals with more severe symptoms are less likely to complete assessments due to the nature of their condition. Here, the missing data is directly related to the severity of the unobserved symptoms, not to the assessment itself.

23
Q

What is the ‘Low Percentage of Missing Data’ factor?

A

If only a small percentage of the dataset has missing values, dropping rows or columns may be an acceptable solution.

Example:
If less than 5% of the rows contain missing values, the loss of information may be minimal

24
Q

What are ‘Irrelevant Features or Low-Impact Variables’?

A

if the feature or variable with missing data is not crucial for the analysis, it can be removed without significantly impacting the model’s performance.

25
Q

What are ‘Sparsely Populated Columns’?

A

if a column has a significant number of missing values (e.g., >50%), it might be better to remove the column entirely, as imputing values would introduce too much noise or bias.

26
Q

What is ‘Removing Rows with Missing Data’?

A

you can remove rows that contain missing data using dropna() with its default axis=0. This removes any row that contains at least one missing value.
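A minimal pandas sketch of row-wise removal (the DataFrame here is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, None]})
cleaned = df.dropna()  # default axis=0: drop any row containing a missing value
# only the first row (a=1.0, b=4.0) has no missing values
```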

27
Q

What is ‘Removing Columns with Missing Data’?

A

to remove columns with missing data, you can use dropna() with the axis=1 argument. This will remove any column that contains at least one missing value.
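A minimal pandas sketch of column-wise removal, again on invented data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, None, 6.0]})
cleaned = df.dropna(axis=1)  # drop any column containing a missing value
# column "b" is removed; only "a" remains
```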

28
Q

What is the ‘Single Imputation Method’?

A

In single imputation, each missing value in a dataset is replaced with a single estimated value. These methods are generally easier to implement than multiple imputation methods. However, they treat the imputed values as if they were true values, ignoring the uncertainty associated with the imputation process.

29
Q

What is ‘Regression Imputation’?

A

utilizes a regression model built on observed data to predict the missing values based on relationships with other variables.

30
Q

What is ‘Hot Deck Imputation’?

A

estimates missing values by randomly selecting similar values from “donor” records within the dataset. This method retains the original pattern of associations in the dataset but may introduce randomness due to the selection process.

31
Q

What is the ‘Multiple Imputation Method’?

A

creates several imputed datasets and analyzes them together. These techniques account for the uncertainty of the imputation process and provide more accurate results than single imputation. However, they are generally computationally expensive and require larger sample sizes to provide accurate predictions.

32
Q

What is ‘Outlier Detection and Removal’?

A

outliers are data points that differ significantly from the rest of the dataset. Outlier detection and removal ensures data quality.

33
Q

What are the 3 different types of Outliers?

A

Global or point outliers
Collective outliers
Contextual or conditional outliers

34
Q

What is a ‘Global Outlier’?

A

Global outliers, also called point outliers, are the simplest form of outlier: a single data point that deviates from all the rest of the data points in a given dataset. Most outlier detection procedures are targeted at determining global outliers. (In the accompanying diagram, the green data point is the global outlier.)

35
Q

What is a ‘Collective Outlier’?

A

When a group of data points deviates from the rest of a dataset, they are called collective outliers. The individual data objects may not be outliers on their own, but taken together they behave as one. Identifying collective outliers requires background knowledge about the relationships among the data objects. For example, in an Intrusion Detection System, a DoS packet sent from one system to another is normal behavior on its own; but when many computers send such packets simultaneously, the behavior is abnormal, and together the packets form a collective outlier. (In the accompanying diagram, the green data points as a whole represent the collective outlier.)

36
Q

What is a ‘Contextual Outlier’?

A

As the name suggests, a “contextual” outlier only appears within a context; contextual outliers are also known as conditional outliers. They occur when a data object deviates from the other data points only under a specific condition in the dataset. Data objects have two kinds of attributes: contextual attributes (such as time or location) and behavioral attributes (the measured values). Contextual outlier analysis lets users examine outliers under different contexts and conditions, which is useful in many applications, such as isolated background noise in speech recognition. For example, a temperature reading of 45 degrees Celsius may be an outlier in the rainy season but a normal data point in summer. (In the accompanying diagram, a green dot representing a low temperature in June is a contextual outlier, since the same value in December would not be.)

37
Q

What is ‘Outlier Detection’

A

identifies data points that significantly differ from the rest of the dataset.

38
Q

What are the techniques used in Outlier Detection?

A

Statistical Outlier Detection and Visual Outlier Detection

39
Q

What is ‘Statistical Outlier Detection’

A

identifies outliers using statistical methods based on the distribution and characteristics of the dataset.
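One common statistical method (the z-score rule, used here as an illustrative assumption, on made-up data) flags points that lie more than a chosen number of standard deviations from the mean:

```python
import numpy as np

# Invented sample: 95.0 stands far from the cluster around 10-13.
data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])
z_scores = np.abs((data - data.mean()) / data.std())
outliers = data[z_scores > 2]  # flag points more than 2 standard deviations out
```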

40
Q

What is ‘Visual Outlier Detection’

A

uses graphical methods to identify outliers in data.

41
Q

What is ‘Outlier Handling’?

A

is the process and set of techniques used to identify, analyze, and manage outliers in datasets. Be cautious when removing outliers, as doing so can lead to loss of valuable information; if you need to remove outliers, investigate before taking action.

42
Q

What is ‘Noise’ in Data?

A

refers to random errors or variations in measured data that do not reflect the actual values of the underlying phenomena.

43
Q

What is ‘Data Smoothing’?

A

eliminates or reduces the impact of random noise, outliers, and irregularities, typically by applying an algorithm that removes noise from a data set. This allows important patterns to stand out more clearly.

44
Q

What is a ‘Moving Average’?

A

a method used in data preprocessing to smooth out fluctuations in data. It calculates the average of a specified number of data points (called the “window size”) and slides this window across the dataset. This helps reduce noise and reveals underlying trends.
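A minimal pandas sketch of a moving average with a window size of 3, on an invented series:

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])
smoothed = s.rolling(window=3).mean()  # average over a sliding window of 3 points
# first two entries are NaN (window not yet full); the third is (2+4+6)/3 = 4.0
```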

45
Q

What does ‘Generating New Features’ in Feature Engineering mean?

A

involves identifying aspects of the data that are not directly observable and making them explicit.

46
Q

What are ‘Aggregated Features’ in Feature Engineering?

A

summarize data across multiple rows or over a group, often through operations like mean, sum, max, or count.
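A minimal pandas sketch of aggregation over groups (the transactions table is invented for illustration):

```python
import pandas as pd

# Hypothetical transactions: two purchases by customer A, one by B.
df = pd.DataFrame({"customer": ["A", "A", "B"], "amount": [10, 20, 5]})
agg = df.groupby("customer")["amount"].agg(["mean", "sum", "max", "count"])
# customer A: mean 15, sum 30, max 20, count 2
```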

47
Q

What are ‘Interaction Features’ in Feature Engineering?

A

Interaction features are new features created by combining two or more existing features in a dataset. These combinations capture the relationship or interaction between the original features, which may improve the performance of machine learning models.

Think of interaction features as a way to understand how two factors work together, rather than individually. For example, the effect of one feature on the target variable might depend on the value of another feature. By explicitly creating features that represent these relationships, we give models a chance to learn more complex patterns.
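A minimal pandas sketch of one common interaction, the product of two features (the columns here are hypothetical):

```python
import pandas as pd

# Hypothetical columns; their product captures how the two act together.
df = pd.DataFrame({"length": [2, 3], "width": [4, 5]})
df["length_x_width"] = df["length"] * df["width"]  # interaction feature
```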

48
Q

What is ‘Temporal Feature’ in Feature Engineering?

A

derived from date or time-related data. They capture patterns based on the passage of time (e.g., age, seasonality).

Example:
You can calculate the age of the house from the Year Built column
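A minimal pandas sketch of deriving house age from a Year Built column (the data and the reference year are assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame({"YearBuilt": [1990, 2010]})
current_year = 2024  # reference year, fixed here for the example
df["HouseAge"] = current_year - df["YearBuilt"]  # temporal feature
```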

49
Q

What are ‘Text-Based Features’ in Feature Engineering?

A

are numerical or categorical representations derived from textual data. Since machine learning models cannot directly understand raw text, we transform text into features that models can process.

50
Q

What is ‘Bag of Words (BoW)’ in Feature Engineering?

A

one of the simplest and most widely used methods for creating text-based features. It represents a piece of text (like a sentence or document) as a collection of words, disregarding grammar, word order, and context but keeping the frequency of words.
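A minimal pure-Python sketch of Bag of Words on two invented documents (libraries such as scikit-learn provide this too, but a hand-rolled version shows the idea):

```python
from collections import Counter

docs = ["the cat sat", "the cat ran"]
# Vocabulary: all distinct words, in a fixed (sorted) order.
vocab = sorted({word for doc in docs for word in doc.split()})
# Each document becomes a vector of word counts over the vocabulary.
bow = [[Counter(doc.split())[word] for word in vocab] for doc in docs]
# vocab == ['cat', 'ran', 'sat', 'the']; bow[0] == [1, 0, 1, 1]
```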

51
Q

What are ‘Domain-Specific Features’ in Feature Engineering?

A

are tailored to the problem domain, incorporating expert knowledge or industry-specific metrics.

52
Q

What is ‘One-Hot Encoding’ in Feature Engineering?

A

converts categorical values into binary columns.
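A minimal pandas sketch using `get_dummies` on an invented categorical column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
encoded = pd.get_dummies(df, columns=["color"])  # one binary column per category
# resulting columns: color_blue, color_red
```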

53
Q

What is ‘Label Encoding’ in Feature Engineering?

A

assigns a unique integer to each category.
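A minimal pandas sketch of label encoding via category codes (scikit-learn's LabelEncoder is another common option); the data is invented:

```python
import pandas as pd

s = pd.Series(["red", "blue", "red"])
# Categories are ordered alphabetically: blue -> 0, red -> 1.
codes = s.astype("category").cat.codes
```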
