Chapter 1 Flashcards

1
Q

How much human-made information does Google estimate exists in the world today?

A

Google estimates that there are 300 exabytes (300 followed by 18 zeros, in bytes) of human-made information in the world today.

2
Q

How does the amount of human-made information today compare to just four years ago?

A

Only four years ago, there were just 30 exabytes of human-made information, which means we’ve seen a tenfold increase in a relatively short span of time.

3
Q

What has led to the explosive growth of available data volume in today’s world?

A

The explosive growth of available data volume is a result of the computerization of society and the rapid development of powerful data collection and storage tools.

4
Q

Why is data mining important in today’s world?

A

Data mining is important because we live in a world where vast amounts of data are collected daily, and analyzing this data is crucial for gaining insights and making informed decisions in various fields.

5
Q

Big data consists of

A
  1. Network
  2. Collection
  3. Storage
  4. Research
  5. Analysis
  6. Volume
  7. Visualization
  8. Cloud technology
6
Q

What is the relationship between data, information, and knowledge in the context of data mining?

A

Data consists of raw facts, while information involves patterns and relationships within data. Knowledge, in turn, is the understanding of a subject gained by synthesizing information to identify historical patterns and future trends.

7
Q

Why is data mining essential in today’s business world, and what types of data do businesses typically deal with?

A

Data mining is essential in the business world because companies handle vast data sets, including sales transactions, stock trading records, product descriptions, sales promotions, company profiles, and customer feedback. For instance, large retailers like Wal-Mart manage hundreds of millions of transactions per week across their global branches.

8
Q

Can you explain the concept of the Data Pyramid and its components briefly?

A

The Data Pyramid represents the hierarchy from raw data at the base to wisdom at the top. It includes data, information (patterns and relationships within data), and knowledge (understanding gained from synthesizing information). Wisdom involves using knowledge to make informed decisions.

9
Q

What is the significance of understanding historical patterns and future trends in the context of data mining?

A

Understanding historical patterns and future trends is significant in data mining because it allows businesses to make informed decisions, optimize strategies, and respond effectively to changing market conditions based on the insights gained from data analysis.

10
Q

How did data mining contribute to President Obama’s victory in the 2012 presidential election?

A

Data mining helped identify likely voters and predict polling outcomes, enabling efficient allocation of campaign resources.

11
Q

What is the primary purpose of developing data mining tools?

A

The primary purpose of developing data mining tools is to uncover valuable information from large datasets and transform it into organized knowledge.

12
Q

What factors have fueled the remarkable growth of data mining and knowledge discovery?

A

Factors contributing to the growth of data mining include data warehousing, increased data access from web sources, global economic competition, improved computing power, and the availability of commercial data mining software.

13
Q

How is data mining characterized, and what makes it promising?

A

Data mining is characterized as a field that turns data into knowledge and is described as young, dynamic, and promising due to its ability to extract valuable insights from data.

14
Q

What are the key factors that have driven the remarkable growth in the field of data mining and knowledge discovery?

A

The key factors driving the remarkable growth in data mining and knowledge discovery include data warehousing, increased access to data from web sources, competitive pressures in a globalized economy, the tremendous growth in computing power and storage capacity, and the availability of commercial data mining software suites.

15
Q

Why might the term “knowledge mining” be less accurate in describing the process of extracting insights from large datasets?

A

The term “knowledge mining” may be less accurate because it does not convey the emphasis on mining from large volumes of data. This is why the term “knowledge discovery from data,” or KDD, is often used instead.

16
Q

In the knowledge discovery process, what does the term “Data Warehouses” typically refer to?

A

“Data Warehouses” typically refers to a centralized storage system for large amounts of data.

17
Q

What does data mining involve in the knowledge discovery process?

A

In the knowledge discovery process, data mining involves discovering patterns and valuable insights from extensive datasets.

18
Q

After data mining, what is the subsequent step in the process of knowledge discovery?

A

The subsequent step after data mining in the process of knowledge discovery is identifying patterns within the data.

19
Q

What is the ultimate outcome of the knowledge discovery process, particularly in relation to the identified patterns?

A

The ultimate outcome of the knowledge discovery process, especially regarding the identified patterns, is the generation of valuable knowledge.

20
Q

What is the primary objective of data mining?

A

The primary objective of data mining is to discover meaningful new correlations, patterns, and trends within large datasets.

21
Q

Can you name some of the technologies and techniques commonly used in data mining?

A

Some of the technologies and techniques commonly used in data mining include pattern recognition technologies, statistical methods, and mathematical techniques.

22
Q

What Kinds of Data Can Be Mined?

A

As a general technology, data mining can be applied to any kind of data as long as the data are meaningful for a target application.

The most basic forms of data for mining applications are database data, data warehouse data, and transactional data.

23
Q

What can data mining systems analyze when mining relational databases?

A

Data mining systems, when mining relational databases, can analyze trends or data patterns.

For example, they can predict credit risk based on customer data or detect deviations in sales.

24
Q

How can data mining be applied to predict credit risk for new customers?

A

Data mining can predict credit risk for new customers by analyzing factors such as income, age, and previous credit information.

25
Q

What is the role of data mining in detecting deviations in sales data?

A

Data mining plays a role in detecting deviations in sales data by identifying items with sales that significantly differ from what is expected compared to the previous year. These deviations can then be further investigated.

26
Q

Why are relational databases considered important in the context of data mining?

A

Because they are one of the most commonly available and richest information repositories. They provide a significant source of data for data mining analysis.

27
Q

What is the primary function of a data warehouse?

A

The primary function of a data warehouse is to serve as a repository for information collected from multiple sources, which is stored under a unified schema and typically resides at a single site.

28
Q

How does the content of a record in a transactional database differ from that in a data warehouse?

A

In a transactional database, each record captures a transaction, such as a customer’s purchase or a flight booking, and includes a unique transaction identity number (trans ID) and a list of the items involved in the transaction. In contrast, a data warehouse may store data from various sources related to transactions, including additional information like item descriptions, details about salespeople, or branch information.

29
Q

What is the typical structure of data in a data warehouse?

A

Data in a data warehouse is typically structured under a unified schema, bringing together data from various sources into a single, organized repository.

30
Q

What kind of information is usually stored in a transactional database?

A

Transactional databases typically store information related to individual transactions, including transaction details such as the items purchased, customer information, and transaction identity numbers.

31
Q

What distinguishes a data warehouse from a transactional database in terms of its purpose?

A

The primary purpose of a data warehouse is to consolidate and store data from multiple sources for analysis and reporting, whereas a transactional database primarily focuses on capturing and managing individual transactions in real-time.

32
Q

What types of data fall into the category of spatial data?

A

Spatial data includes space-related data such as geographic (map) data, remote sensing and medical imaging data, and VLSI chip layout data.

33
Q

Besides textual data, what other forms of data are included in hypertext and multimedia data?

A

Hypertext and multimedia data encompass data types such as images, videos, and audio, in addition to textual data.

34
Q

How can data mining tasks be categorized based on their objectives?

A

Data mining tasks can be categorized into two main categories: descriptive and predictive.

35
Q

What is the main goal of descriptive mining tasks?

A

Descriptive mining tasks aim to characterize properties of the data in a target dataset.

36
Q

What is the primary objective of predictive mining tasks?

A

Predictive mining tasks involve performing induction on current data in order to make predictions about future or unknown data.

37
Q

What is the primary purpose of data mining functionalities?

A

Data mining functionalities are used to specify the types of patterns to be discovered in data mining tasks.

38
Q

What kinds of patterns can be mined using data mining functionalities?

A

Data mining functionalities can be used to mine various kinds of patterns, including outlier detection, association rules, classification, clustering, and regression.

39
Q

What are outliers in a dataset, and how can they be described?

A

Outliers in a dataset are data objects that do not conform to the general behavior or model of the data. They are data points that are considerably different from the rest of the data.

40
Q

Provide examples of real-world applications where outlier mining or anomaly mining is valuable.

A

Outlier mining or anomaly mining is valuable in applications such as credit card fraud detection and network intrusion detection, where identifying unusual patterns is crucial for security and fraud prevention.

41
Q

What is the primary goal of classification in data mining?

A

The primary goal of classification in data mining is to find a model or function that describes and distinguishes data classes.

For example, a credit card transaction can be normal or fraudulent, and an email can be normal or spam.

42
Q

How is the derived classification model represented, and what is its practical use?

A

The derived classification model can be represented in various forms, including classification rules (IF-THEN rules), decision trees, mathematical formulae, or neural networks. It is used to predict the class label of objects for which the class label is unknown.

43
Q

What is the primary distinction between classification and regression in data mining?

A

Classification predicts categorical (discrete, unordered) labels, while regression models continuous-valued functions. Regression is used to predict missing or unavailable numerical data values, rather than discrete class labels.
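
To make the contrast concrete, here is a minimal sketch, assuming scikit-learn is available; the feature matrix and target values are made-up toy data, not data from the text:

```python
# Classification predicts a discrete label; regression predicts a continuous value.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[25, 30_000], [40, 80_000], [35, 52_000], [50, 110_000]]  # [age, income]

clf = DecisionTreeClassifier().fit(X, ["high_risk", "low_risk", "high_risk", "low_risk"])
print(clf.predict([[30, 45_000]]))  # a categorical class label

reg = DecisionTreeRegressor().fit(X, [500.0, 2200.0, 1400.0, 3100.0])  # e.g. credit limit
print(reg.predict([[30, 45_000]]))  # a continuous numeric value
```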

44
Q

What does the term “prediction” encompass in the context of data mining?

A

In the context of data mining, the term “prediction” encompasses both numeric prediction (regression) and class label prediction (classification).

45
Q

What is the primary objective of Association Rule Mining?

A

The primary objective of Association Rule Mining is to discover hidden relationships or associations between items in a dataset, often represented as “if-then” rules.
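
As an illustration of the “if-then” form, this pure-Python sketch computes support and confidence for one hypothetical rule over made-up basket data:

```python
# Toy market-basket data; evaluate the hypothetical rule IF {bread} THEN {milk}.
baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

antecedent, consequent = {"bread"}, {"milk"}
n_both = sum(1 for b in baskets if antecedent <= b and consequent <= b)
n_ante = sum(1 for b in baskets if antecedent <= b)

support = n_both / len(baskets)  # fraction of all baskets with bread AND milk
confidence = n_both / n_ante     # of the bread baskets, the fraction that also have milk
print(f"support={support:.2f}, confidence={confidence:.2f}")  # 0.50, 0.67
```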

46
Q

What is market basket analysis, and how does Association Rule Mining play a role in it?

A

Market basket analysis is an application of Association Rule Mining that helps retailers understand customer purchasing patterns and optimize product placement.

47
Q

Which Technologies Are Used in Data Mining?

A
  1. Machine Learning
  2. Pattern recognition
  3. Visualization
  4. Algorithms
  5. High performance computing
  6. Information retrieval
  7. Data warehouse
  8. Database system
  9. Statistics
  10. Applications
48
Q

Which Kinds of Applications Are Targeted?

A
  1. Telecommunication Industry
  2. Credit Card companies
  3. Insurance companies
  4. Retail & Marketing
  5. Medical companies
  6. Pharmaceutical
49
Q

What are some of the key objectives or applications of Business Intelligence in various industries?

A

Business Intelligence is used to maximize the return on marketing campaigns, detect fraudulent transactions, automate the loan application process, and identify and treat the most valued customers.

50
Q

What is the first phase in the CRISP-DM process?

A

The first phase in the CRISP-DM process is the Business Understanding Phase.

51
Q

In which phase of CRISP-DM is data exploration and initial data collection performed?

A

Data exploration and initial data collection are performed in the Data Understanding Phase of CRISP-DM.

52
Q

What is the primary goal of the Data Preparation Phase in CRISP-DM?

A

The primary goal of the Data Preparation Phase in CRISP-DM is to clean, transform, and preprocess the data for modeling.

53
Q

During which phase of CRISP-DM are machine learning algorithms applied to the prepared data?

A

Machine learning algorithms are applied to the prepared data during the Modeling Phase of CRISP-DM.

54
Q

In CRISP-DM, when is the model’s performance assessed and validated?

A

The model’s performance is assessed and validated in the Evaluation Phase of CRISP-DM.

55
Q

What is the final phase of CRISP-DM where the results of the data mining process are put into practical use?

A

The final phase of CRISP-DM where the results are put into practical use is the Deployment Phase.

56
Q

What are data sets made up of?

A

Data sets are made up of data objects and their attributes.

57
Q

What is the correspondence between data objects and attributes in a database?

A

The rows of a database correspond to the data objects, and the columns correspond to the attributes.

58
Q

What is an attribute in data mining?

A

An attribute is a data field, representing a characteristic or feature of a data object.

59
Q

What are some examples of data objects in various types of databases

A

Examples of data objects include customers in a sales database, patients in a medical database, and students, professors, and courses in a university database.

60
Q

How is an attribute defined in the context of data objects?

A

An attribute is a data field that represents a characteristic or feature of a data object.

61
Q

What terms are often used interchangeably in the literature to refer to attributes, and what terms are preferred in data mining and statistics literature, respectively?

A

Terms such as attribute, dimension, feature, and variable are often used interchangeably in the literature. In data mining literature, the term “feature” is commonly used, while statisticians prefer the term “variable.”

62
Q

How are attribute types determined, and what aspect defines the type of an attribute?

A

The type of an attribute is determined by the set of possible values that the attribute can have.

63
Q

What is a “nominal” attribute, and why is it sometimes referred to as “categorical”?

A

A nominal attribute is one where each value represents a category or state. It is referred to as “categorical” because it involves categories rather than numerical values.

64
Q

What is a “binary” attribute?

A

A binary attribute is one that has two possible values, typically representing two states or categories.

An example given in the text is the “smoker” attribute, where 1 indicates that the patient smokes, and 0 indicates that the patient does not smoke.

65
Q

What characterizes an “ordinal” attribute, and how does it differ from nominal, binary, or numeric attributes?

A

An ordinal attribute has possible values with a meaningful order or ranking among them, but the magnitude between successive values is not known. It differs from nominal and binary attributes in that it has an ordered relationship among its values, but it differs from numeric attributes in that it lacks precise numerical measurement.

An example of an ordinal attribute mentioned in the text is “Drink size” at a fast-food restaurant, which can have values such as small, medium, and large. Its characteristic is that there is a meaningful order (small < medium < large), but we don’t know the exact magnitude of the differences between sizes.

66
Q

How do nominal, binary, and ordinal attributes differ from numeric attributes in terms of their nature and representation?

A

Nominal, binary, and ordinal attributes are qualitative and describe features of objects without providing precise size or quantity measurements. In contrast, numeric attributes are quantitative and represent measurable quantities using integer or real values.

67
Q

What are some key motivations for conducting data exploration?

A

Key motivations for data exploration include:

  • Understanding the characteristics of large and messy data sets.
  • Selecting the appropriate tool for data preprocessing or analysis.
  • Utilizing human abilities to recognize patterns in the data, complementing data analysis tools.
68
Q

What techniques are commonly used in data exploration

A

Common techniques used in data exploration include:

  • Summary statistics
  • Data visualization
69
Q

What is the primary purpose of summary statistics in data exploration?

A

Summary statistics are used to summarize a set of observations to provide an understanding of the typical values in the data and how they vary. This helps in gaining insights into the data’s characteristics.

70
Q

What are the two main types of descriptive statistics mentioned in the text, often encountered in research papers?

A

The two main types of descriptive statistics encountered in research papers are:

  1. Measures of central tendency (e.g., averages).
  2. Measures of dispersion (e.g., standard deviation).
71
Q

What determines the choice of measures of central tendency and measures of dispersion in summary statistics?

A

The choice between measures of central tendency and measures of dispersion in summary statistics depends on the type of variables being analyzed.

Mode can be used for all data types, median can be used for ordinal and numeric data types, and mean can be used only for numeric data types.
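
A minimal sketch with Python’s standard statistics module (made-up values) showing which measure applies to which attribute type:

```python
import statistics as st

incomes = [30, 35, 40, 45, 200]           # numeric, skewed by one large value
colors = ["red", "blue", "red", "green"]  # nominal (categorical)

print(st.mean(incomes))    # 70.0 -- numeric only, pulled up by the extreme value
print(st.median(incomes))  # 40  -- ordinal or numeric, robust to the skew
print(st.mode(colors))     # 'red' -- works for any data type, including nominal
```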

72
Q

What is the primary purpose of data visualization in data exploration?

A

The primary purpose of data visualization is to convert data into visual or tabular formats to facilitate the analysis of data characteristics and relationships among data items or attributes.

73
Q

Why is data visualization considered one of the most powerful techniques for data exploration?

A

Data visualization is considered powerful because humans have a well-developed ability to analyze large amounts of information presented visually. It allows for the detection of general patterns, trends, outliers, and unusual patterns within the data.

74
Q

What are some graphical techniques commonly used in graphical analysis to explore data, and how are they used to gain insights into the data?

A

Common graphical techniques used in data exploration include (a minimal plotting sketch follows this list):

  • Histograms and boxplots for numeric variables to learn about their distribution, detect outliers, and find relevant information.
  • Bar charts and pie charts for categorical variables to show the frequency of each value.
  • Scatter plots for pairs of numeric variables to explore possible relationships, the type of relationship, and detect outliers.
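
A minimal matplotlib sketch of three of the plots named above, using randomly generated stand-in data:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 500)        # a numeric variable
y = 2 * x + rng.normal(0, 5, 500)  # a second, related numeric variable

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=20)           # distribution of a numeric variable
axes[0].set_title("Histogram")
axes[1].boxplot(x)                 # spread and potential outliers
axes[1].set_title("Boxplot")
axes[2].scatter(x, y, s=5)         # relationship between two numeric variables
axes[2].set_title("Scatter plot")
plt.tight_layout()
plt.show()
```
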
75
Q

What is the purpose of using a line graph in data visualization, and what type of data is it commonly used for?

A

A line graph is often used for time series data to display trends or patterns over time.

76
Q

What is the primary purpose of a bar chart in data visualization?

A

The primary purpose of a bar chart is to display how many occurrences of each value occur in a dataset or, for continuous data, how many values are in each of a series of ranges or “bins.” It provides a visual representation of the distribution of the outcome variable.

77
Q

What is a heatmap in the context of data visualization, and what kind of data does it help visualize?

A

A heatmap is a graphical representation where individual values of a matrix are displayed as colors. It is useful for visualizing the concentration of values between two dimensions of a matrix. Heatmaps are particularly helpful in finding patterns and providing a perspective of depth. Darker shades in a heatmap correspond to stronger (positive or negative) correlations, making it easy to spot high and low correlations.
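
A minimal sketch of a correlation heatmap, assuming seaborn and pandas are available; the data frame and the injected correlation are made up:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
df["b"] = 0.9 * df["a"] + rng.normal(0, 0.3, 100)  # inject one strong correlation

# Darker/warmer cells make the strongly correlated pair (a, b) easy to spot.
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```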

78
Q

What is the primary purpose of a scatterplot in data visualization, and what type of relationship does it show?

A

The primary purpose of a scatterplot is to visualize the relationship between two numerical variables. It shows the relationship between these variables, which can be positive, negative, or have no clear pattern.

79
Q

What is a boxplot, and how does it display the distribution of data?

A

A boxplot is a standardized way of displaying the distribution of data based on a five-number summary, which includes the minimum, first quartile (Q1), median, third quartile (Q3), and maximum values. It provides a visual representation of the data’s spread and key statistics.
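
A minimal numpy sketch computing the five-number summary that a boxplot draws (made-up sample):

```python
import numpy as np

data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 12, 30])

print({
    "min": data.min(),
    "Q1": np.percentile(data, 25),
    "median": np.median(data),
    "Q3": np.percentile(data, 75),
    "max": data.max(),
})
```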

80
Q

What information can a boxplot provide about the data it represents?

A

A boxplot can provide information about:

  1. Outliers and their values.
  2. Whether the data is symmetrical or skewed.
  3. How tightly the data is grouped or dispersed.
81
Q

What are the three fundamental issues that need to be addressed when approaching a new dataset?

A

The three fundamental issues are:

  1. What problems should we expect when working with the dataset?
  2. How do we detect those problems within the dataset?
  3. How can we solve or address those problems to prepare the data for analysis?
82
Q

What are some measures used to assess data quality?

A

Measures for data quality include:

  1. Accuracy: Determining whether the data is correct or incorrect, accurate or inaccurate.
  2. Completeness: Ensuring that all necessary records are represented in the dataset.
  3. Consistency: Checking for consistency within the dataset, such as modifications that are applied consistently or inconsistencies like “dangling” data.
  4. Timeliness: Assessing whether the data is updated in a timely manner.
  5. Interpretability: Evaluating how easily the data can be understood and interpreted.
83
Q

What are some possible reasons for inaccurate data?

A
  • Faulty data collection instruments.
  • Human or computer errors during data entry.
  • Users purposely submitting incorrect data for mandatory fields.
  • Errors in data transmission.
  • Technology limitations, such as buffer size limitations for coordinating data transfer.
  • Inconsistencies in naming conventions or data codes.
  • Inconsistent input field formats, such as date formats.
84
Q

why might data be incomplete?

A

Data may be incomplete for various reasons, including:

  • Attributes of interest may not always be available, such as missing customer information in sales transaction data.
  • Certain data may not have been included initially because they were not considered important at the time of data entry.
  • Relevant data may not be recorded due to misunderstanding or equipment malfunctions.
  • Data that were inconsistent with other recorded data may have been deleted.
85
Q

Why is data preprocessing necessary for raw data contained in databases?

A

Data preprocessing is necessary because much of the raw data contained in databases is incomplete, inconsistent, and noisy. Databases may have various issues, including obsolete or redundant fields, missing values, outliers, and data not in a suitable form for data mining models. To make the data useful for data mining purposes, it needs to undergo preprocessing, which includes data cleaning and data transformation.

86
Q

What are the objectives of data cleaning routines, and what do they aim to achieve?

A

Data cleaning routines aim to “clean” the data by accomplishing the following objectives:

  1. Filling in missing values.
  2. Smoothing noisy data.
  3. Identifying or removing outliers.
  4. Resolving inconsistencies within the data.
87
Q

Why is it important to clean data, and how can dirty data impact data mining results?

A

Cleaning data is important because dirty data, which contains errors, missing values, or inconsistencies, can lead to a lack of trust in the results of data mining applications. Additionally, dirty data can confuse the data mining process and result in unreliable output. Cleaning the data helps ensure the quality and reliability of the data mining results.

88
Q

What are the major tasks involved in data preprocessing?

A

The major tasks in data preprocessing include:

  1. Data Integration: Integrating multiple databases, data cubes, or files, which may have attributes with different names for the same concept, leading to inconsistencies and redundancies. This task also involves detecting and removing redundancies resulting from data integration.
  2. Data Reduction: Obtaining a smaller-volume representation of the dataset that produces the same or nearly the same analytical results.
  3. Data Cleaning: Performing routines to unify data format, fill in missing values, identify and smooth out noisy data, correct inconsistent data, and remove duplicate records. This ensures the quality and consistency of the data.
89
Q

What are some common reasons for missing values in data?

A

Common reasons for missing values in data include:
* Information is not collected for certain cases.
* People decline to answer specific questions in a survey.
* Attributes may not be applicable to all cases, such as income data for children.

90
Q

Why should we carefully consider how we handle missing data in data analysis

A

We should carefully consider how we handle missing data because the absence of information is rarely beneficial, and having more information is usually better for analysis.

91
Q

What is one common method of handling missing values

A

One common method of handling missing values is to delete the records or fields with missing values from the analysis.

92
Q

What are the potential drawbacks of simply deleting records with missing values?

A

Simply deleting records with missing values may be dangerous because the pattern of missing values could be systematic, leading to a biased subset of the data. Additionally, it might result in the loss of valuable information in other fields, even if just one field has missing values.

93
Q

Why is it considered wasteful to omit information in all the other fields due to the presence of missing values in one field?

A

It is considered wasteful because deleting a record discards the valid values in all of its other fields simply because a single field is missing, sacrificing information that could still contribute to the analysis.

94
Q

What percentage of missing data values in a data set of 30 variables, when spread evenly throughout the data, would result in almost 80% of the records having at least one missing value?

A

If only 5% of data values are missing from a data set of 30 variables, and the missing values are spread evenly throughout the data, almost 80% of the records would have at least one missing value.
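
A quick back-of-the-envelope check of that claim, assuming each of the 30 values in a record is missing independently with probability 0.05:

$$P(\text{record has at least one missing value}) = 1 - (1 - 0.05)^{30} = 1 - 0.95^{30} \approx 1 - 0.215 = 0.785 \approx 80\%$$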

95
Q

What approach have data analysts adopted to handle missing values when they are not simply deleted?

A

Data analysts have turned to methods that replace the missing value with a substituted value based on various criteria when they choose not to simply delete the missing data.

96
Q

What are some common criteria for choosing replacement values for missing data?

A

Common criteria for choosing replacement values for missing data include:

  • Replacing the missing value with a constant specified by the analyst.
  • Replacing the missing value with a measure of central tendency, such as the mean or median for numeric variables, or the mode for categorical variables.
  • Replacing the missing value with a measure of central tendency (mean or median for numeric variables) or the mode (for categorical variables) belonging to the same class.
  • Replacing the missing values with a value generated at random from the observed distribution of the variable.
  • Replacing the missing values with imputed values based on the other characteristics of the record.

The choice between mean and median depends on the data distribution, with the mean suitable for normal (symmetric) data distributions and the median for skewed data distributions.
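
A minimal pandas sketch of three of these criteria; the frame, column names, and values are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "risk": ["low", "low", "high", "high", "low"],
    "income": [52.0, np.nan, 18.0, np.nan, 61.0],
})

# 1) A constant specified by the analyst
df["income_const"] = df["income"].fillna(0)

# 2) An overall measure of central tendency (median, suitable for skewed data)
df["income_median"] = df["income"].fillna(df["income"].median())

# 3) The central tendency within the same class (here, the "risk" group)
df["income_class"] = df["income"].fillna(
    df.groupby("risk")["income"].transform("median")
)
print(df)
```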

97
Q

How does it work: replacing the missing values with imputed values based on the other characteristics of the record?

A

This can be achieved using methods like regression, inference-based tools using Bayesian formalism, or decision tree induction. For instance, by utilizing the other customer attributes in a dataset, one can construct a decision tree to predict the missing values for attributes like income.

98
Q

Give an example: replacing the missing value with a measure of central tendency (mean or median for numeric variables) or the mode (for categorical variables) belonging to the same class.

A

If customers are classified according to credit risk, missing values for income may be replaced with the mean income value for customers in the same credit risk category as the given tuple. The choice between mean and median depends on the data distribution within that class, with the median being a better choice for skewed data distributions.

99
Q

What is an outlier in the context of data analysis, and how does it differ from other observations?

A

An outlier in data analysis is an observation that is unlike the other observations and represents an extreme value that goes against the trend of the remaining data.

100
Q

Why is it important to identify outliers in data analysis, and what potential issues can outliers introduce into statistical methods?

A

Identifying outliers is important because they may represent errors in data entry. Even if an outlier is a valid data point and not an error, certain statistical methods are sensitive to the presence of outliers and may deliver unreliable results, such as skewed mean values or overly wide ranges of the data.

101
Q

How does outlier detection differ from noise detection, and what are some potential insights or discoveries that can result from identifying outliers?

A

Outlier detection is related to but distinct from noise detection. Outliers can be considered as interesting and/or unknown patterns hidden in data, which may lead to new insights, the discovery of system faults, or the identification of fraudulent activities. Noise detection, on the other hand, typically focuses on identifying random or erroneous fluctuations in data without the same potential for uncovering valuable patterns or anomalies.

102
Q

Why might outliers not always be easily apparent in large data sets, and what techniques can be used to identify them?

A

Outliers may not always be easily apparent in large data sets due to the complexity and volume of the data. Techniques such as data visualization, statistical tests, and machine learning algorithms can be used to identify outliers in these datasets, helping data analysts and researchers detect and address them effectively.

103
Q

Why is it essential to detect, remove, replace, or accommodate outliers in data mining?

A

It is essential to employ these methods because outliers are not always easily apparent, especially in large data sets. Failure to address outliers can lead to biased or inaccurate results in data mining analyses.

104
Q

Why are outliers considered important in data mining, and how can they affect the quality and reliability of data mining results?

A

Outliers are important in data mining because they are data points that deviate significantly from the rest of the dataset; if left unexamined, they can distort analysis results and reduce the quality and reliability of data mining outcomes.

105
Q

How does outlier analysis contribute to better data quality in data mining?

A

Outlier analysis helps identify data errors, improving data quality and enhancing the reliability of data analysis and modeling.

106
Q

What does outlier analysis reveal about data that might be missed when only focusing on central tendencies, and how does this enhance the understanding of data?

A

Outlier analysis reveals relationships and patterns in data that may be absent when only focusing on central tendencies, thereby enhancing the understanding of the data.

107
Q

In what way can the accuracy of statistical models be improved through outlier analysis, and why is this improvement important?

A

Outliers can influence the results of statistical models, and identifying and handling them appropriately through outlier analysis can help improve the accuracy of these models, leading to more reliable data analysis outcomes.

108
Q

Why is it crucial to prevent misleading results in data mining, and how can outlier analysis contribute to this prevention?

A

Preventing misleading results is crucial in data mining, and outlier analysis plays a role in avoiding incorrect conclusions by identifying and managing outliers that can impact data analysis and modeling outcomes.

109
Q

What are some practical applications of outlier analysis, and how can it contribute to detecting fraud and anomalies?

A

Outlier analysis can help detect unusual behavioral patterns and foreign transactions, which can seriously affect various domains such as business, health, and security choices.

110
Q

Importance of Outlier Analysis in Data Mining

A
  1. Better Data Quality
  2. Enhances Understanding of Data
  3. Improves Accuracy of Statistical Models
  4. Prevents Misleading Results
  5. Detects Fraud and Anomalies
111
Q

What are some common methods for detecting outliers in data, and how are they categorized?

A

Common methods for detecting outliers in data are categorized into various approaches, including:
1. Statistical methods
2. Distance-based measures
3. Density-based measures
4. Clustering-based measures

112
Q

How do statistical tests identify outliers, and what are some of the key statistical measures used in this process?

A

Statistical tests for identifying outliers rely on assumptions about the distribution of the data and use thresholds based on standard deviation, z-scores, or interquartile range to flag data points that deviate significantly from the expected values.

113
Q

What is the fundamental concept behind distance-based measures in outlier detection, and how do they assess whether a data point is an outlier?

A

Distance-based measures in outlier detection use the concept of nearest neighbors to determine if a data point is far away from most of the others, suggesting that it may be an outlier.

114
Q

How do density-based measures use the concept of local density to detect outliers, and what kind of regions do they focus on?

A

Density-based measures in outlier detection use the concept of local density to find data points that are in sparse regions, which may indicate their status as outliers.

115
Q

What is the primary concept employed by clustering-based measures in identifying outliers, and what role do clusters play in this process?

A

Clustering-based measures in outlier detection use the concept of clusters to find data points that do not belong to any cluster or are distant from their cluster centers, thus suggesting that they might be outliers.

116
Q

What is the Z-score method used for in identifying outliers, and what are the criteria for flagging a data value as an outlier using this method?

A

The Z-score method is used to identify outliers, and a data value is considered an outlier if it has a Z-score that is either less than -3 or greater than 3.
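
A minimal numpy sketch of this criterion; the data are randomly generated with one planted extreme value:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, 200), 120.0)  # 200 typical points plus one extreme

z = (x - x.mean()) / x.std()  # Z-score: distance from the mean in standard deviations
print(x[np.abs(z) > 3])       # flags the planted value, ~[120.]
```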

117
Q

What assumption does the Z-score technique make about the distribution of data, and how does it define outliers in this context?

A

The Z-score technique assumes a Gaussian distribution of the data. It defines outliers as data points that are in the tails of the distribution and are far from the mean, typically with Z-scores much less than -3 or greater than 3.

118
Q

What does the 68-95-99.7 rule, represented by a bell curve in statistics, signify about data distribution?

A

The 68-95-99.7 rule, also known as the empirical rule, states that in a normal distribution, approximately 68.27% of values lie within 1 standard deviation of the mean, 95.45% within 2 standard deviations, and 99.73% within 3 standard deviations.

119
Q

How does a box plot diagram contribute to the identification of outliers, and what quartiles are typically used in this method?

A

A box plot diagram helps in identifying outliers by utilizing quartiles. It defines the upper limit and lower limit beyond which any data point will be considered an outlier. The commonly used quartiles in this method are the first quartile and the third quartile.

120
Q

What is the convenient definition of an outlier according to basic standards followed by statisticians when using the interquartile range (IQR) method, and how is it determined?

A

According to basic standards followed by statisticians, a convenient definition of an outlier is a data point that falls more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile. This is used to identify outliers in a dataset.
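
A minimal numpy sketch of the 1.5 × IQR rule on a made-up sample:

```python
import numpy as np

data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 12, 30])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])  # [30] falls above the upper fence
```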

121
Q

What is a limitation of using the distribution model to find outliers, and why might it be problematic in many cases?

A

A limitation of using the distribution model to find outliers is that, in many cases, the distribution of the data set is not known in advance. This can pose a challenge for identifying outliers effectively.

122
Q

How does the distance-based approach contribute to outlier detection, and what criteria are used to flag a data point as an outlier using this method?

A

The distance-based approach flags a data point as an outlier if it is mapped beyond a certain threshold away from other data points. This threshold determines whether a data point is considered an outlier.

123
Q

What role does the density approach play in outlier detection, and how does it group data points for this purpose?

A

The density approach groups together data points into clusters, using the distance between each cluster point to set the boundary of the grouping. Outliers are data points that exist outside of the cluster, beyond a user-defined threshold.

124
Q

How do distance-based algorithms utilize the property of average distance to identify outliers in data?

A

Distance-based algorithms identify outliers by measuring the average distance of the nearest k neighbors. Outliers tend to have a higher average distance than other normal data points, and this property is used to detect them.

125
Q

How are data points ranked in distance-based outlier detection, and what determines which points are declared as outliers?

A

Each data point is ranked based on its distance to its kth nearest neighbor. The top n points in this ranking are declared as outliers. The values of k and n can be specified through parameters, such as the number of neighbors and the number of outliers.
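
A minimal sketch of this ranking, assuming scikit-learn; the cluster and the planted far-away point are made up:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])  # one planted outlier

k, n = 5, 3  # number of neighbors, number of outliers to report
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbor
dist, _ = nn.kneighbors(X)

kth_dist = dist[:, k]              # distance to the kth nearest neighbor
top_n = np.argsort(kth_dist)[-n:]  # the n points with the largest kth-NN distance
print(top_n)                       # index 100 (the planted point) should rank highest
```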

126
Q

What types of distance measures are commonly used in distance-based outlier detection, and how does the choice of distance measure affect the algorithm’s execution?

A

Common distance measures used in distance-based outlier detection include Euclidean distance for real values and Jaccard similarity measures for binary and categorical values. The choice of distance measure can impact the algorithm’s execution, with high-dimensional datasets becoming expensive to process due to the need to calculate distances with other data points in high-dimensional space.

127
Q

What happens if the value of k is set to 1 in distance-based outlier detection, and why does this occur?

A

If the value of k is set to 1, two outliers that are located next to each other but far away from other data points may not be identified as outliers. This occurs because the algorithm only considers the nearest neighbor, which could be another outlier.

128
Q

What potential issue arises when the value of k is set to a large number in distance-based outlier detection, and what data points might be mislabeled as outliers in such cases?

A

If the value of k is set to a large number, a group of normal data points that form a cluster might be mislabeled as outliers. This can happen if the number of data points in that cluster is few, and the cluster is far away from other data points.

129
Q

Why is it important to normalize numeric attributes in distance-based outlier detection, and what is the purpose of this normalization?

A

Normalizing numeric attributes is important to ensure that attributes with a higher absolute scale, such as income, do not dominate attributes with a lower scale, like credit score. This normalization helps prevent certain attributes from disproportionately influencing the outlier detection process.
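
A minimal min-max normalization sketch with made-up attribute values:

```python
import numpy as np

income = np.array([25_000, 48_000, 61_000, 120_000], dtype=float)
credit_score = np.array([480, 610, 705, 790], dtype=float)

def min_max(a):
    return (a - a.min()) / (a.max() - a.min())  # rescale to [0, 1]

# Both attributes now contribute on the same [0, 1] scale, so income's
# larger raw magnitude no longer dominates distance computations.
print(min_max(income), min_max(credit_score))
```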

130
Q

What characterizes the occurrence of outliers in comparison to normal data points

A

Outliers occur less frequently compared to normal data points, which means that they are less common in the dataset.

131
Q

How do outliers and normal data points differ in terms of their location in data space?

A

Outliers occupy low-density areas in data space, while normal data points occupy high-density areas. The distinction is based on the frequency of occurrence.

132
Q

What is the relationship between density and the distances between data points in the context of density-based outlier detection?

A

Density, in the context of density-based outlier detection, is a count of data points in a normalized unit space and is inversely proportional to the distances between data points. This means that as data points become closer together, the density in that region increases.

133
Q

How are objects clustered or grouped in clustering-based outlier detection, and what principle guides this clustering process?

A

In clustering-based outlier detection, objects are clustered or grouped based on the principle of maximizing intra-class similarity (similarity among objects within the same cluster) and minimizing interclass similarity (similarity between objects in different clusters). Clusters are formed to ensure that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters.

134
Q

How are outliers typically detected in clustering-based outlier detection, and what action is taken when outliers are identified?

A

Outliers may be detected as values that fall outside of the sets of clusters. When outliers are identified, they are typically removed or smoothed, meaning they are either eliminated from the dataset or their impact is reduced.
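
One simple way to realize this idea, sketched here with scikit-learn’s KMeans on made-up data: cluster the points, then flag those unusually far from their assigned centroid:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.5, (50, 2)),  # one dense group
    rng.normal(5, 0.5, (50, 2)),  # a second dense group
    [[10.0, -5.0]],               # a point far from both groups
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

threshold = d.mean() + 3 * d.std()  # a user-defined cutoff
print(np.where(d > threshold)[0])   # index 100 (the planted point) should appear
```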

135
Q

In a 2-D plot of customer data using clustering-based outlier detection, what do the cluster centroids marked with a “+” represent?

A

In a 2-D plot of customer data using clustering-based outlier detection, the cluster centroids marked with a “+” represent the average point in space for that cluster.

136
Q

Why is it important to understand why outliers occur in the context of outlier detection, and how does this understanding influence the action taken after outlier detection?

A

Understanding why outliers occur helps determine the appropriate action to perform after outlier detection. Depending on the application, outliers may need to be isolated and acted upon, such as in credit card transaction fraud monitoring. In other cases, outliers should be filtered out because they can skew the final outcome, as in the case of eliminating ultra-high-income earners to generalize a country’s income patterns.

137
Q

What challenge arises when deciding to remove outliers in data mining, and how can it impact the information value of the dataset?

A

The challenge when deciding to remove outliers is that you have to consider how to effectively remove them without losing too much informational value. An outlier in one column may not be an outlier in the rest of the columns, and removing it may result in the loss of valuable information held by the outlier in the other features.

138
Q

What is the potential consequence of removing an outlier from a dataset, especially when the outlier is not an outlier in all columns or features?

A

If an outlier is removed from the dataset, especially when it is not an outlier in all columns or features, the consequence can be the loss of information it holds in the rest of the features. This loss of information may impact the overall understanding and analysis of the data.

139
Q

List different ways of:

Handling Outliers

A
  1. Delete
  2. Transformation
  3. Replacement methods
140
Q

What are some transformation methods that can be used to handle outliers in data?

A

Two common transformation methods to handle outliers are log transformation and square root transformation. These transformations can reduce the impact of outliers on statistical models.
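
A minimal numpy sketch of both transformations on made-up data with one extreme value:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 1000], dtype=float)

print(np.log1p(data).round(2))  # log(1 + x): 1000 compresses to ~6.91
print(np.sqrt(data).round(2))   # milder compression: 1000 becomes ~31.62
```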

141
Q

What are some of the methods for replacing outliers in data, and why might this be done?

A

Some methods for replacing outliers in data include using the mean, median, mode, or values based on percentiles or ranges to replace extreme values. This is done to mitigate the effects of outliers and create a dataset with less extreme values. The advantage of replacement methods is that they can preserve the size and structure of the data set, but the disadvantage is that they may distort the distribution or variance of the data.

142
Q

What is the potential disadvantage of replacing outliers with point statistics, and how does this approach affect the data?

A

The potential disadvantage of replacing outliers with point statistics is that it can create bias in the data, especially when there are a lot of outliers. This approach can distort the distribution and variance of the data.

143
Q

What is the purpose of inferring the values of outliers using a prediction or classification model, and what is this process called

A

The purpose of inferring the values of outliers using a prediction or classification model is to mitigate the effects of outliers and replace them with values that are more representative of the dataset. This process is called imputation.

144
Q

What is the technique called where outliers are replaced with the smallest and largest values of a dataset with observations closest to them that are not suspicious?

A

The technique is called “Winsorizing”: outliers are replaced with the closest observations that are not considered suspicious, i.e., the smallest and largest non-extreme values of the dataset.
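
A minimal sketch using scipy’s winsorize on a made-up sample; the lowest and highest 10% of values are replaced by their nearest non-extreme neighbors:

```python
import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([3, 4, 5, 5, 6, 6, 7, 8, 9, 250], dtype=float)

w = winsorize(x, limits=[0.1, 0.1])  # here: 3 -> 4 and 250 -> 9
print(np.asarray(w))                 # [4. 4. 5. 5. 6. 6. 7. 8. 9. 9.]
```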

145
Q

What is the term used to describe the process of finding patterns in data that are distinct from the majority of the data, and what are these distinct patterns often referred to as?

A

The term used to describe the process of finding patterns in data that are distinct from the majority of the data is “outlier detection.” These distinct patterns are often referred to as “outliers” or “anomalies.”