Chapter 1 Flashcards

Question

What is the role of data mining in detecting deviations in sales data?

Answer 1

Data mining plays a role in detecting deviations in sales data by identifying items with sales that significantly differ from what is expected compared to the previous year. These deviations can then be further investigated.

Answer 2

Because they are one of the most commonly available and richest information repositories. They provide a significant source of data for data mining analysis.

Answer 3

The primary function of a data warehouse is to serve as a repository for information collected from multiple sources, which is stored under a unified schema and typically resides at a single site.

Answer 4

In a transactional database, each record captures a transaction, such as a customer's purchase or a flight booking, and includes a unique transaction identity number (trans ID) and a list of the items involved in the transaction. In contrast, a data warehouse may store data from various sources related to transactions, including additional information like item descriptions, details about salespeople, or branch information.

Answer 5

Data in a data warehouse is typically structured under a unified schema, bringing together data from various sources into a single, organized repository.

Answer 6

Transactional databases typically store information related to individual transactions, including transaction details such as the items purchased, customer information, and transaction identity numbers.

Answer 7

The primary purpose of a data warehouse is to consolidate and store data from multiple sources for analysis and reporting, whereas a transactional database primarily focuses on capturing and managing individual transactions in real-time.

Answer 8

What types of data fall into the category of spatial data?

Answer 9

Hypertext and multimedia data encompass data types such as images, videos, and audio, in addition to textual data.

Answer 10

Data mining tasks can be categorized into two main categories: descriptive and predictive.

Answer 11

Descriptive mining tasks aim to characterize properties of the data in a target dataset.

Answer 12

Predictive mining tasks involve performing induction on current data in order to make predictions about future or unknown data.

Answer 13

Data mining functionalities are used to specify the types of patterns to be discovered in data mining tasks.

Answer 14

Data mining functionalities can be used to mine various kinds of patterns, including outlier detection, association rules, classification, clustering, and regression.

Answer 15

Outliers in a dataset are data objects that do not conform to the general behavior or model of the data. They are data points that are considerably different from the rest of the data.

Answer 16

Outlier mining or anomaly mining is valuable in applications such as credit card fraud detection and network intrusion detection, where identifying unusual patterns is crucial for security and fraud prevention.

Answer 17

The primary goal of classification in data mining is to find a model or function that describes and distinguishes data classes. ## Footnote E.g. an applicant for a loan can repay on time, repay late, or declare bankruptcy; a credit card transaction can be normal or fraudulent; an email can be normal or spam.

Answer 18

The derived classification model can be represented in various forms, including classification rules (IF-THEN rules), decision trees, mathematical formulae, or neural networks. It is used to predict the class label of objects for which the class label is unknown.

Answer 19

Classification predicts categorical (discrete, unordered) labels, while regression models continuous-valued functions. Regression is used to predict missing or unavailable numerical data values, rather than discrete class labels.

Answer 20

In the context of data mining, the term "prediction" encompasses both numeric prediction (regression) and class label prediction (classification).

Answer 21

The primary objective of Association Rule Mining is to discover hidden relationships or associations between items in a dataset, often represented as "if-then" rules.

Answer 22

Market basket analysis is an application of Association Rule Mining that helps retailers understand customer purchasing patterns and optimize product placement.

Answer 23

1. Machine learning 2. Pattern recognition 3. Visualization 4. Algorithms 5. High-performance computing 6. Information retrieval 7. Data warehouse 8. Database system 9. Statistics 10. Applications

Answer 24

1. Telecommunication Industry 2. Credit Card companies 3. Insurance companies 4. Retail & Marketing 5. Medical companies 6. Pharmaceutical

Answer 25

Business Intelligence is used to maximize the return on marketing campaigns, detect fraudulent transactions, automate the loan application process, and identify and treat the most valued customers.

Answer 26

The first phase in the CRISP-DM process is the Business Understanding Phase.

Answer 27

Data exploration and initial data collection are performed in the Data Understanding Phase of CRISP-DM.

Answer 28

The primary goal of the Data Preparation Phase in CRISP-DM is to clean, transform, and preprocess the data for modeling.

Answer 29

Machine learning algorithms are applied to the prepared data during the Modeling Phase of CRISP-DM.

Answer 30

The model's performance is assessed and validated in the Evaluation Phase of CRISP-DM.

Answer 31

The final phase of CRISP-DM where the results are put into practical use is the Deployment Phase.

Answer 32

Data sets are made up of data objects and their attributes.

Answer 33

The rows of a database correspond to the data objects, and the columns correspond to the attributes.

Answer 34

An attribute is a data field, representing a characteristic or feature of a data object.

Answer 35

Examples of data objects include customers in a sales database, patients in a medical database, and students, professors, and courses in a university database.

Answer 36

An attribute is a data field that represents a characteristic or feature of a data object.

Answer 37

Terms such as attribute, dimension, feature, and variable are often used interchangeably in the literature. In data mining literature, the term "feature" is commonly used, while statisticians prefer the term "variable."

Answer 38

The type of an attribute is determined by the set of possible values that the attribute can have.

Answer 39

A nominal attribute is one where each value represents a category or state. It is referred to as "categorical" because it involves categories rather than numerical values.

Answer 40

A binary attribute is one that has two possible values, typically representing two states or categories. ## Footnote An example given in the text is the "smoker" attribute, where 1 indicates that the patient smokes, and 0 indicates that the patient does not smoke.

Answer 41

An ordinal attribute has possible values with a meaningful order or ranking among them, but the magnitude between successive values is not known. It differs from nominal and binary attributes in that it has an ordered relationship among its values, but it differs from numeric attributes in that it lacks precise numerical measurement. ## Footnote An example of an ordinal attribute mentioned in the text is "Drink size" at a fast-food restaurant, which can have values such as small, medium, and large. Its characteristic is that there is a meaningful order (small < medium < large), but we don't know the exact magnitude of the differences between sizes.

Answer 42

Nominal, binary, and ordinal attributes are qualitative and describe features of objects without providing precise size or quantity measurements. In contrast, numeric attributes are quantitative and represent measurable quantities using integer or real values.

Answer 43

Key motivations for data exploration include: * Understanding the characteristics of large and messy data sets. * Selecting the appropriate tool for data preprocessing or analysis. * Utilizing human abilities to recognize patterns in the data, complementing data analysis tools.

Answer 44

Common techniques used in data exploration include: * Summary statistics * Data visualization

Answer 45

Summary statistics are used to summarize a set of observations to provide an understanding of the typical values in the data and how they vary. This helps in gaining insights into the data's characteristics.

Answer 46

The two main types of descriptive statistics encountered in research papers are: 1. Measures of central tendency (e.g., averages). 2. Measures of dispersion (e.g., standard deviation).

Answer 47

The choice between measures of central tendency and measures of dispersion in summary statistics depends on the type of variables being analyzed. ## Footnote The mode can be used for all data types, the median can be used for ordinal and numeric data types, and the mean can only be used for numeric data types.

Answer 48

The primary purpose of data visualization is to convert data into visual or tabular formats to facilitate the analysis of data characteristics and relationships among data items or attributes.

Answer 49

Data visualization is considered powerful because humans have a well-developed ability to analyze large amounts of information presented visually. It allows for the detection of general patterns, trends, outliers, and unusual patterns within the data.

Answer 50

Common graphical techniques used in data exploration include: * Histograms and boxplots for numeric variables to learn about their distribution, detect outliers, and find relevant information. * Bar charts and pie charts for categorical variables to show the frequency of each value. * Scatter plots for pairs of numeric variables to explore possible relationships, the type of relationship, and detect outliers.

Answer 51

A line graph is often used for time series data to display trends or patterns over time.

Answer 52

The primary purpose of a bar chart is to display how many occurrences of each value occur in a dataset or, for continuous data, how many values are in each of a series of ranges or "bins." It provides a visual representation of the distribution of the outcome variable.

Answer 53

A heatmap is a graphical representation where individual values of a matrix are displayed as colors. It is useful for visualizing the concentration of values between two dimensions of a matrix. Heatmaps are particularly helpful in finding patterns and providing a perspective of depth. Darker shades in a heatmap correspond to stronger (positive or negative) correlations, making it easy to spot high and low correlations.

Answer 54

The primary purpose of a scatterplot is to visualize the relationship between two numerical variables. It shows the relationship between these variables, which can be positive, negative, or have no clear pattern.

Answer 55

A boxplot is a standardized way of displaying the distribution of data based on a five-number summary, which includes the minimum, first quartile (Q1), median, third quartile (Q3), and maximum values. It provides a visual representation of the data's spread and key statistics.
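
The five-number summary behind a boxplot can be computed directly. The following is a minimal sketch using Python's standard `statistics` module; the function name `five_number_summary` and the sample data are illustrative assumptions, and quartile values depend on the interpolation method chosen (here the module's default).

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, med, q3 = quantiles(data, n=4)
    return data[0], q1, med, q3, data[-1]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data
print(five_number_summary(sample))
```

Other tools may use a different quartile convention, so Q1 and Q3 can differ slightly between packages on small samples.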

Answer 56

A boxplot can provide information about: 1. Outliers and their values. 2. Whether the data is symmetrical or skewed. 3. How tightly the data is grouped or dispersed.

Answer 57

The three fundamental issues are: 1. What problems should we expect when working with the dataset? 2. How do we detect those problems within the dataset? 3. How can we solve or address those problems to prepare the data for analysis?

Answer 58

Measures for data quality include: 1. Accuracy: Determining whether the data is correct or incorrect, accurate or inaccurate. 2. Completeness: Ensuring that all necessary records are represented in the dataset. 3. Consistency: Checking for consistency within the dataset, such as modifications that are applied consistently or inconsistencies like "dangling" data. 4. Timeliness: Assessing whether the data is updated in a timely manner. 5. Interpretability: Evaluating how easily the data can be understood and interpreted.

Answer 59

* Faulty data collection instruments. * Human or computer errors during data entry. * Users purposely submitting incorrect data for mandatory fields. * Errors in data transmission. * Technology limitations, such as buffer size limitations for coordinating data transfer. * Inconsistencies in naming conventions or data codes. * Inconsistent input field formats, such as date formats.

Answer 60

Data may be incomplete for various reasons, including: * Attributes of interest may not always be available, such as missing customer information in sales transaction data. * Certain data may not have been included initially because they were not considered important at the time of data entry. * Relevant data may not be recorded due to misunderstanding or equipment malfunctions. * Data that were inconsistent with other recorded data may have been deleted.

Answer 61

Data preprocessing is necessary because much of the raw data contained in databases is incomplete, inconsistent, and noisy. Databases may have various issues, including obsolete or redundant fields, missing values, outliers, and data not in a suitable form for data mining models. To make the data useful for data mining purposes, it needs to undergo preprocessing, which includes data cleaning and data transformation.

Answer 62

Data cleaning routines aim to "clean" the data by accomplishing the following objectives: 1. Filling in missing values. 2. Smoothing noisy data. 3. Identifying or removing outliers. 4. Resolving inconsistencies within the data.

Answer 63

Cleaning data is important because dirty data, which contains errors, missing values, or inconsistencies, can lead to a lack of trust in the results of data mining applications. Additionally, dirty data can confuse the data mining process and result in unreliable output. Cleaning the data helps ensure the quality and reliability of the data mining results.

Answer 64

The major tasks in data preprocessing include: 1. Data Integration: Integrating multiple databases, data cubes, or files, which may have attributes with different names for the same concept, leading to inconsistencies and redundancies. This task also involves detecting and removing redundancies resulting from data integration. 2. Data Reduction: Obtaining a smaller-volume representation of the dataset that produces the same or nearly the same analytical results. 3. Data Cleaning: Performing routines to **unify data format, fill in missing values, identify and smooth out noisy data, correct inconsistent data, and remove duplicate records**. This ensures the quality and consistency of the data.

Answer 65

Common reasons for missing values in data include: * Information is not collected for certain cases. * People decline to answer specific questions in a survey. * Attributes may not be applicable to all cases, such as income data for children.

Answer 66

We should carefully consider how we handle missing data because the absence of information is rarely beneficial, and having more information is usually better for analysis.

Answer 67

One common method of handling missing values is to delete the records or fields with missing values from the analysis.

Answer 68

Simply deleting records with missing values may be dangerous because the pattern of missing values could be systematic, leading to a biased subset of the data. Additionally, it might result in the loss of valuable information in other fields, even if just one field has missing values.

Answer 69

It is considered wasteful to omit information in all the other fields because it's inefficient to discard valuable data in those fields just because one field has missing values.

Answer 70

If only 5% of data values are missing from a data set of 30 variables, and the missing values are spread evenly throughout the data, almost 80% of the records would have at least one missing value.
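
The "almost 80%" figure can be checked with a short calculation. This sketch assumes each value is missing independently with probability 0.05, which is one way to read "spread evenly"; the function name `share_with_missing` is illustrative.

```python
def share_with_missing(p_missing: float, n_fields: int) -> float:
    """Probability that a record has at least one missing field,
    assuming fields go missing independently with the same probability."""
    # A record is complete only if all n_fields values are present.
    return 1 - (1 - p_missing) ** n_fields

share = share_with_missing(0.05, 30)
print(f"{share:.1%}")  # about 78.5%
```

Under this assumption only about one record in five is fully complete, which is why deleting every incomplete record discards most of the data.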

Answer 71

Data analysts have turned to methods that replace the missing value with a substituted value based on various criteria when they choose not to simply delete the missing data.

Answer 72

Common criteria for choosing replacement values for missing data include: * Replacing the missing value with a constant specified by the analyst. * Replacing the missing value with a measure of central tendency, such as the mean or median for numeric variables, or the mode for categorical variables. * Replace the missing value with a measure of central tendency (mean or median for numeric variables) or the mode (for categorical variables) belonging to the same class. * Replace the missing values with a value generated at random from the observed distribution of the variable. * Replace the missing values with imputed values based on the other characteristics of the record. ## Footnote The choice between mean and median depends on the data distribution, with the mean suitable for normal (symmetric) data distributions and the median for skewed data distributions.
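
The constant and central-tendency replacement strategies can be sketched in a few lines. This is a minimal illustration using only the standard library, with `None` standing in for a missing value; the function name `impute`, its parameters, and the sample data are assumptions for the example.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean", constant=None):
    """Return a copy of values with None replaced according to strategy."""
    observed = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant  # value specified by the analyst
    else:
        fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

incomes = [30, 32, 35, None, 200]      # skewed by the value 200
colors = ["red", "blue", "red", None]  # categorical, so use the mode

print(impute(incomes, "mean"))    # the fill value is pulled up by 200
print(impute(incomes, "median"))  # the fill value stays near the bulk of the data
print(impute(colors, "mode"))
```

On the skewed income list the mean-based fill lands far from most observations while the median-based fill does not, which matches the guidance in the footnote.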

Answer 73

This can be achieved using methods like regression, inference-based tools using Bayesian formalism, or decision tree induction. For instance, by utilizing the other customer attributes in a dataset, one can construct a decision tree to predict the missing values for attributes like income.

Answer 74

If customers are classified according to credit risk, missing values for income may be replaced with the mean income value for customers in the same credit risk category as the given tuple. The choice between mean and median depends on the data distribution within that class, with the median being a better choice for skewed data distributions.
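
Class-wise replacement can be sketched as follows. The record layout (dictionaries with `risk` and `income` keys), the function name `impute_by_class`, and the sample customers are hypothetical; the median is used as the default statistic, and any other function of a list could be passed instead.

```python
from collections import defaultdict
from statistics import median

def impute_by_class(records, class_key, value_key, stat=median):
    """Fill missing value_key entries with a statistic computed per class."""
    observed = defaultdict(list)
    for r in records:
        if r[value_key] is not None:
            observed[r[class_key]].append(r[value_key])
    # One fill value per class; a class with no observed values raises KeyError below.
    fills = {label: stat(vals) for label, vals in observed.items()}
    return [
        {**r, value_key: fills[r[class_key]]} if r[value_key] is None else dict(r)
        for r in records
    ]

customers = [
    {"risk": "low", "income": 80},
    {"risk": "low", "income": 90},
    {"risk": "low", "income": None},
    {"risk": "high", "income": 20},
    {"risk": "high", "income": 30},
    {"risk": "high", "income": None},
]
filled = impute_by_class(customers, "risk", "income")
print(filled[2]["income"], filled[5]["income"])
```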

Answer 75

An outlier in data analysis is an observation that is unlike the other observations and represents an extreme value that goes against the trend of the remaining data.

Answer 76

Identifying outliers is important because they may represent errors in data entry. Even if an outlier is a valid data point and not an error, certain statistical methods are sensitive to the presence of outliers and may deliver unreliable results, such as skewed mean values or overly wide ranges of the data.

Answer 77

Outlier detection is related to but distinct from noise detection. Outliers can be considered as interesting and/or unknown patterns hidden in data, which may lead to new insights, the discovery of system faults, or the identification of fraudulent activities. Noise detection, on the other hand, typically focuses on identifying random or erroneous fluctuations in data without the same potential for uncovering valuable patterns or anomalies.

Answer 78

Outliers may not always be easily apparent in large data sets due to the complexity and volume of the data. Techniques such as data visualization, statistical tests, and machine learning algorithms can be used to identify outliers in these datasets, helping data analysts and researchers detect and address them effectively.

Answer 79

It is essential to employ these methods because outliers are not always easily apparent, especially in large data sets. Failure to address outliers can lead to biased or inaccurate results in data mining analyses.

Answer 80

Outliers are important in data mining because they are data points that deviate significantly from the rest of the dataset.

Answer 81

Outlier analysis helps identify data errors, improving data quality and enhancing the reliability of data analysis and modeling.

Answer 82

Outlier analysis reveals relationships and patterns in data that may be absent when only focusing on central tendencies, thereby enhancing the understanding of the data.

Answer 83

Outliers can influence the results of statistical models, and identifying and handling them appropriately through outlier analysis can help improve the accuracy of these models, leading to more reliable data analysis outcomes.

Answer 84

Preventing misleading results is crucial in data mining, and outlier analysis plays a role in avoiding incorrect conclusions by identifying and managing outliers that can impact data analysis and modeling outcomes.

Answer 85

Outlier analysis can help detect unusual behavioral patterns and foreign transactions, which can seriously affect various domains such as business, health, and security choices.

Answer 86

1. Better Data Quality 2. Enhances Understanding of Data 3. Improves Accuracy of Statistical Models 4. Prevents Misleading Results 5. Detects Fraud and Anomalies

Answer 87

Common methods for detecting outliers in data are categorized into various approaches, including: 1. Statistical methods 2. Distance-based measures 3. Density-based measures 4. Clustering-based measures

Answer 88

Statistical tests for identifying outliers rely on assumptions about the distribution of the data and use thresholds based on standard deviation, z-scores, or interquartile range to flag data points that deviate significantly from the expected values.

Answer 89

Distance-based measures in outlier detection use the concept of nearest neighbors to determine if a data point is far away from most of the others, suggesting that it may be an outlier.

Answer 90

Density-based measures in outlier detection use the concept of local density to find data points that are in sparse regions, which may indicate their status as outliers.

Answer 91

Clustering-based measures in outlier detection use the concept of clusters to find data points that do not belong to any cluster or are distant from their cluster centers, thus suggesting that they might be outliers.

Answer 92

The Z-score method is used to identify outliers, and a data value is considered an outlier if it has a Z-score that is either less than -3 or greater than 3.

Answer 93

The Z-score technique assumes a Gaussian distribution of the data. It defines outliers as data points that are in the tails of the distribution and are far from the mean, typically with Z-scores much less than -3 or greater than 3.
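
A minimal sketch of the Z-score rule, using the threshold of 3 from the cards above. The function name `zscore_outliers` and the sample data are illustrative; the population standard deviation is used here, which is an assumption, since the cards do not say which variant is intended.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose Z-score is below -threshold or above +threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all values identical, so nothing can be flagged
    return [v for v in values if abs((v - mu) / sigma) > threshold]

# Hypothetical measurements clustered around 50, plus one extreme value.
data = [48, 50, 52, 49, 51, 50, 47, 53, 50, 49,
        51, 50, 48, 52, 50, 49, 51, 50, 50, 120]
print(zscore_outliers(data))
```

Because the extreme value itself inflates the mean and standard deviation, very small samples may never produce a Z-score beyond 3, which is one practical limit of this method.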

Answer 94

The 68-95-99.7 rule, also known as the empirical rule, states that in a normal distribution, approximately 68.27% of values lie within 1 standard deviation of the mean, 95.45% within 2 standard deviations, and 99.73% within 3 standard deviations.
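
These percentages follow from the normal distribution: the share of values within k standard deviations of the mean is erf(k / sqrt(2)). A quick check with the standard library (the function name `share_within` is illustrative):

```python
import math

def share_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, f"{share_within(k):.2%}")
```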

Answer 95

A box plot diagram helps in identifying outliers by utilizing quartiles. It defines the upper limit and lower limit beyond which any data point will be considered an outlier. The commonly used quartiles in this method are the first quartile and the third quartile.

Answer 96

According to basic standards followed by statisticians, a convenient definition of an outlier is a data point that falls more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile. This is used to identify outliers in a dataset.
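
The 1.5 × IQR rule can be sketched as below. The function name `iqr_outliers` and the sample data are assumptions for illustration, and the quartiles come from the default method of `statistics.quantiles`; other quartile conventions can shift the limits slightly.

```python
from statistics import quantiles

def iqr_outliers(values, factor=1.5):
    """Return values lying more than factor * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - factor * iqr  # lower limit
    upper = q3 + factor * iqr  # upper limit
    return [v for v in values if v < lower or v > upper]

observations = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 100]  # hypothetical
print(iqr_outliers(observations))
```

Unlike the Z-score rule, the limits here are built from quartiles, so a single extreme value has little effect on where they fall.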

Answer 97

A limitation in using the distribution model to find outliers is that in many cases, the distribution of the data set is not previously known. This can pose a challenge for identifying outliers effectively.

Answer 98

The distance-based approach flags a data point as an outlier if it is mapped beyond a certain threshold away from other data points. This threshold determines whether a data point is considered an outlier.

Answer 99

The density approach groups together data points into clusters, using the distance between each cluster point to set the boundary of the grouping. Outliers are data points that exist outside of the cluster, beyond a user-defined threshold.

Answer 100

Distance-based algorithms identify outliers by measuring the average distance of the nearest k neighbors. Outliers tend to have a higher average distance than other normal data points, and this property is used to detect them.

Answer 101

Each data point is ranked based on its distance to its kth nearest neighbor. The top n points in this ranking are declared as outliers. The values of k and n can be specified through parameters, such as the number of neighbors and the number of outliers.
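
The ranking procedure can be sketched with a brute-force implementation. This is a small illustration using Euclidean distance, not an efficient algorithm; the function names and sample points are assumptions.

```python
import math

def kth_neighbor_distances(points, k):
    """Distance from each point to its kth nearest neighbor (Euclidean)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def top_n_outliers(points, k, n):
    """Rank points by kth-neighbor distance and return the n highest."""
    scores = kth_neighbor_distances(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return [points[i] for i in ranked[:n]]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]  # hypothetical 2-D data
print(top_n_outliers(points, k=2, n=1))
```

Every point is compared with every other point, so the cost grows quadratically with the number of points, which is why large or high-dimensional datasets become expensive to process this way.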

Answer 102

Common distance measures used in distance-based outlier detection include Euclidean distance for real values and Jaccard similarity measures for binary and categorical values. The choice of distance measure can impact the algorithm's execution, with high-dimensional datasets becoming expensive to process due to the need to calculate distances with other data points in high-dimensional space.

Answer 103

If the value of k is set to 1, two outliers that are located next to each other but far away from other data points may not be identified as outliers. This occurs because the algorithm only considers the nearest neighbor, which could be another outlier.
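
The masking effect for k = 1 can be reproduced on a toy example. The sketch below scores each point by its kth-nearest-neighbor distance; the point coordinates and function names are hypothetical.

```python
import math

def knn_score(points, index, k):
    """Distance from points[index] to its kth nearest neighbor."""
    p = points[index]
    dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != index)
    return dists[k - 1]

def flag_top(points, k, n):
    """Return the n points with the largest kth-neighbor distance."""
    order = sorted(range(len(points)), key=lambda i: knn_score(points, i, k), reverse=True)
    return {points[i] for i in order[:n]}

cluster = [(0, 0), (0, 1), (1, 0), (1, 1)]
pair = [(10, 10), (10, 10.5)]  # two outliers sitting next to each other
data = cluster + pair

print(flag_top(data, k=1, n=2))  # the pair is missed: each has a very close neighbor
print(flag_top(data, k=2, n=2))  # the pair is found once the second neighbor counts
```

With k = 1 each member of the pair scores 0.5, lower than the cluster points, so neither is flagged; raising k to 2 forces the score to reach back to the main cluster.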

Answer 104

If the value of k is set to a large number, a group of normal data points that form a cluster might be mislabeled as outliers. This can happen if the number of data points in that cluster is few, and the cluster is far away from other data points.

Answer 105

Normalizing numeric attributes is important to ensure that attributes with a higher absolute scale, such as income, do not dominate attributes with a lower scale, like credit score. This normalization helps prevent certain attributes from disproportionately influencing the outlier detection process.
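
One common way to put attributes on a comparable scale is min-max normalization, sketched below. The card does not name a specific method, so this choice, the function name `min_max`, and the sample values are assumptions.

```python
def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute carries no information
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20000, 50000, 80000]  # large absolute scale
credit_scores = [300, 575, 850]  # much smaller scale

print(min_max(incomes))
print(min_max(credit_scores))
```

After rescaling, a difference between the lowest and highest income counts the same as a difference between the lowest and highest credit score when distances are computed.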

Answer 106

Outliers occur less frequently compared to normal data points, which means that they are less common in the dataset.

Answer 107

Outliers occupy low-density areas in data space, while normal data points occupy high-density areas. The distinction is based on the frequency of occurrence.

Answer 108

Density, in the context of density-based outlier detection, is a count of data points in a normalized unit space and is inversely proportional to the distances between data points. This means that as data points become closer together, the density in that region increases.
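
A simple way to approximate this notion of density is to count how many other points fall within a fixed radius of each point. The radius value, function name, and sample points below are assumptions for illustration, and the data is assumed to be already normalized.

```python
import math

def local_density(points, radius):
    """For each point, count the other points within the given radius."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]  # hypothetical 2-D data
print(local_density(points, radius=1.5))
```

Points with a count far below that of their peers sit in sparse regions and are candidates for outliers.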

Answer 109

In clustering-based outlier detection, objects are clustered or grouped based on the principle of maximizing intra-class similarity (similarity among objects within the same cluster) and minimizing inter-class similarity (similarity between objects in different clusters). Clusters are formed to ensure that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters.

Answer 110

Outliers may be detected as values that fall outside of the sets of clusters. When outliers are identified, they are typically removed or smoothed, meaning they are either eliminated from the dataset or their impact is reduced.

Answer 111

In a 2-D plot of customer data using clustering-based outlier detection, the cluster centroids marked with a "+" represent the average point in space for that cluster.

Answer 112

Understanding why outliers occur helps determine the appropriate action to perform after outlier detection. Depending on the application, outliers may need to be isolated and acted upon, such as in credit card transaction fraud monitoring. In other cases, outliers should be filtered out because they can skew the final outcome, as in the case of eliminating ultra-high-income earners to generalize a country's income patterns.

Answer 113

The challenge when deciding to remove outliers is that you have to consider how to effectively remove them without losing too much informational value. An outlier in one column may not be an outlier in the rest of the columns, and removing it may result in the loss of valuable information held by the outlier in the other features.

Answer 114

If an outlier is removed from the dataset, especially when it is not an outlier in all columns or features, the consequence can be the loss of information it holds in the rest of the features. This loss of information may impact the overall understanding and analysis of the data.

Answer 115

1. Delete 2. Transformation 3. Replacement methods

Answer 116

Two common transformation methods to handle outliers are log transformation and square root transformation. These transformations can reduce the impact of outliers on statistical models.
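
The effect of both transformations can be seen on a small skewed sample. In this sketch the data and the `spread_ratio` helper (maximum divided by median, a rough gauge of how extreme the largest value is) are illustrative assumptions. The log transformation requires strictly positive values, and the square root requires non-negative values.

```python
import math
from statistics import median

incomes = [20, 30, 40, 50, 1000]  # hypothetical, with one extreme value

log_incomes = [math.log(v) for v in incomes]    # requires v > 0
sqrt_incomes = [math.sqrt(v) for v in incomes]  # requires v >= 0

def spread_ratio(values):
    """Largest value divided by the median."""
    return max(values) / median(values)

print(spread_ratio(incomes))       # 25.0 on the raw scale
print(spread_ratio(sqrt_incomes))  # smaller after the square root
print(spread_ratio(log_incomes))   # smaller still after the log
```

The log compresses large values more strongly than the square root, so it is the more aggressive of the two.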

Answer 117

Some methods for replacing outliers in data include using the mean, median, mode, or values based on percentiles or ranges to replace extreme values. This is done to mitigate the effects of outliers and create a dataset with less extreme values. The advantage of replacement methods is that they can preserve the size and structure of the data set, but the disadvantage is that they may distort the distribution or variance of the data.

Answer 118

The potential disadvantage of replacing outliers with point statistics is that it can create bias in the data, especially when there are a lot of outliers. This approach can distort the distribution and variance of the data.

Answer 119

The purpose of inferring the values of outliers using a prediction or classification model is to mitigate the effects of outliers and replace them with values that are more representative of the dataset. This process is called imputation.

Answer 120

The technique is called "Winsorizing," and it involves replacing outliers with the smallest and largest values among the observations that are not suspicious or extreme.
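
A minimal sketch of this idea: values flagged as outliers are pulled in to the nearest non-flagged extreme. How outliers are flagged is left to a caller-supplied function; the name `winsorize`, the `is_outlier` parameter, and the cutoffs in the example are assumptions.

```python
def winsorize(values, is_outlier):
    """Clamp outliers to the smallest and largest non-outlying values."""
    kept = [v for v in values if not is_outlier(v)]
    lo, hi = min(kept), max(kept)
    # Values below lo are raised to lo; values above hi are lowered to hi.
    return [min(max(v, lo), hi) for v in values]

readings = [1, 10, 11, 12, 13, 100]  # hypothetical data
print(winsorize(readings, lambda v: v < 5 or v > 50))
```

The dataset keeps its original size, but its variance shrinks, which is the trade-off noted for replacement methods in general.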

Answer 121

The term used to describe the process of finding distinct patterns in data is "outlier detection." These distinct patterns are often referred to as "outliers" or "anomalies."

(145 cards)