DM mod 2 Flashcards

1
Q

What is preprocessing in data mining?

A

Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.

2
Q

What is data cleaning?

A

Real data often contains irrelevant and missing parts. Data cleaning is done to handle this; it involves dealing with missing data, noisy data, etc.

3
Q

How do we handle missing data during data cleaning?

A

This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:
Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.

Fill the missing values:
There are various ways to do this. You can fill the missing values manually, with the attribute mean, or with the most probable value.
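The mean-imputation option above can be sketched in a few lines of plain Python (the `ages` values are made-up example data, with `None` standing for a missing entry):

```python
def fill_missing_with_mean(values):
    """Replace None (missing) entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [25, None, 30, 35, None]
print(fill_missing_with_mean(ages))  # missing entries become 30.0, the mean
```

In practice a library such as pandas (`DataFrame.fillna`) would do this per attribute, but the idea is the same.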

4
Q

What is noisy data in data cleaning ?

A

Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc.

5
Q

How to handle noisy data in data cleaning ?

A

It is handled by three methods:
1- Binning
2- Regression
3- Clustering

6
Q

What is binning method in data cleaning ?

A

This method works on sorted data in order to smooth it. The data is divided into segments (bins) of equal size, and each segment is handled separately. For example, all values in a segment can be replaced by the segment mean, or by the nearest segment boundary value.
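Smoothing by bin means can be sketched as follows (the input values and bin size are illustrative, not from any particular dataset):

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the data, split it into equal-size bins, and replace every
    value in a bin by that bin's mean (smoothing by bin means)."""
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_vals = data[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

# Three bins of three values each; each value becomes its bin's mean.
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Replacing the mean with `min(bin_vals)`/`max(bin_vals)` per element would give smoothing by bin boundaries instead.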

7
Q

How is noisy data handled using regression?

A

Here data can be made smooth by fitting it to a regression function. The regression may be linear (one independent variable) or multiple (several independent variables).

8
Q

How is noisy data handled using clustering?

A

This approach groups similar data into clusters. Values that fall outside the clusters are treated as outliers (though some outliers may go undetected).

9
Q

What is data transformation ?

A

This step is taken in order to transform the data into forms appropriate for the mining process.

10
Q

List four ways data transformation is done.

A

1- Normalization
2- Attribute selection
3- Discretization
4- Concept hierarchy generation

11
Q

How is data transformation done using normalization?

A

It is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
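Min-max normalization into such a range can be sketched like this (the input values are arbitrary examples):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min maps to new_min and max to new_max."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

print(min_max_normalize([20, 40, 60, 80, 100]))
# [0.0, 0.25, 0.5, 0.75, 1.0]
```

Passing `new_min=-1.0` would give the -1.0 to 1.0 range mentioned above.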

12
Q

How is data transformation done using attribute selection?

A

In this strategy, new attributes are constructed from the given set of attributes to help the mining process.

13
Q

How is data transformation done using discretization?

A

Data discretization converts a large number of data values into a smaller number, so that evaluation and management of the data become easier. In other words, it converts the values of a continuous attribute into a finite set of intervals with minimal information loss. There are two forms: supervised discretization, which makes use of the class information, and unsupervised discretization, which does not. Unsupervised discretization is classified by how the operation proceeds: it works either by a top-down splitting strategy or a bottom-up merging strategy.

Now, we can understand this concept with the help of an example.

Suppose we have an Age attribute with the values:
Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

We can categorise these ages into intervals such as child (1, 5, 4, ...), young, mature, old, etc.
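A sketch of that categorisation in code (the interval boundaries below are hypothetical choices, since the card doesn't fix them):

```python
def discretize_age(age):
    """Map a continuous age onto one of a few interval labels.
    Cut points (12, 19, 60) are illustrative, not canonical."""
    if age <= 12:
        return "child"
    elif age <= 19:
        return "young"
    elif age <= 60:
        return "mature"
    return "old"

ages = [1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
        31, 33, 36, 42, 44, 46, 70, 74, 78, 77]
labels = [discretize_age(a) for a in ages]
```

Each continuous value is replaced by its interval label, which is exactly the discretization step described above.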

14
Q

How is data transformation done using concept hierarchy generation?

A

Here attribute values are generalized from a lower level to a higher level in a hierarchy. For example, the attribute “city” can be generalized to “country”.
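The city-to-country roll-up can be sketched as a simple mapping (the cities and countries here are made-up examples):

```python
# Hypothetical concept hierarchy: roll city values up to the country level.
city_to_country = {"Paris": "France", "Lyon": "France", "Delhi": "India"}

cities = ["Paris", "Delhi", "Lyon"]
countries = [city_to_country[c] for c in cities]
print(countries)  # ['France', 'India', 'France']
```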

15
Q

What is data reduction ?

A

Data mining has to handle huge amounts of data, and analysis becomes harder as the volume grows. Data reduction techniques address this: they aim to obtain a reduced representation of the data, increasing storage efficiency and reducing data storage and analysis costs.

16
Q

What are the four ways used in data reduction ?

A

1- Data cube aggregation
2- Attribute subset selection
3- Numerosity reduction
4- Dimensionality reduction

17
Q

What is data cube aggregation ?

A

This technique aggregates data into a simpler form. Data cube aggregation is a multidimensional aggregation that uses aggregation at various levels of a data cube to represent the original data set, thus achieving data reduction.

For example, suppose you have All Electronics sales per quarter for the years 2018 to 2022. To get the annual sales, you just aggregate the sales per quarter for each year. The aggregated data is much smaller in size, so we achieve data reduction without losing the information we need.

The data cube presents precomputed and summarized data, which makes multidimensional analysis and data mining fast.
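The quarterly-to-annual roll-up described above can be sketched like this (the sales figures are made-up placeholders, not the All Electronics data):

```python
# Aggregating quarterly sales (hypothetical figures) up to annual totals:
# one roll-up step along the time dimension of a data cube.
quarterly = {
    (2018, "Q1"): 200, (2018, "Q2"): 250, (2018, "Q3"): 180, (2018, "Q4"): 300,
    (2019, "Q1"): 220, (2019, "Q2"): 260, (2019, "Q3"): 190, (2019, "Q4"): 310,
}

annual = {}
for (year, _quarter), amount in quarterly.items():
    annual[year] = annual.get(year, 0) + amount

print(annual)  # {2018: 930, 2019: 980}
```

Eight quarterly cells collapse into two annual cells: the representation shrinks while the annual totals are preserved exactly.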

18
Q

What is Dimensionality Reduction ?

A

Dimensionality reduction eliminates weakly important, outdated, or redundant attributes from the data set under consideration, keeping only the attributes required for our analysis and thereby reducing the volume of the original data. Three methods of dimensionality reduction are:
Wavelet Transform
Principal Component Analysis
Attribute Subset Selection

19
Q

What is Numerosity Reduction ?

A

Numerosity reduction reduces the original data volume and represents it in a much smaller form. There are two types: parametric and non-parametric numerosity reduction.

20
Q

What is Parametric numerosity reduction ?

A

Parametric: Parametric numerosity reduction stores only the parameters of a model instead of the original data. Regression and log-linear models are examples of this approach.

21
Q

What is non-parametric numerosity reduction ?

A

Non-Parametric: A non-parametric numerosity reduction technique does not assume any model. It produces a more uniform reduction irrespective of data size, but may not achieve as high a reduction as the parametric approach. Non-parametric techniques include histograms, clustering, sampling, data cube aggregation, and data compression.

22
Q

What is data compression?

A

Data compression modifies, encodes, or converts the structure of data so that it consumes less space, building a compact representation of the information by removing redundancy. Compression from which the original data can be restored exactly is called lossless compression; compression from which the original cannot be fully restored is called lossy compression. Dimensionality and numerosity reduction methods can also be used for data compression.

23
Q

What is lossless compression?

A

Lossless data compression uses algorithms that restore the precise original data from the compressed data. Encoding techniques such as run-length encoding allow a simple and effective data size reduction.
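A minimal run-length encoding sketch, showing the round trip that makes it lossless (the input string is an arbitrary example):

```python
def run_length_encode(s):
    """Represent each run of repeated characters as a [char, count] pair."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([ch, 1])      # start a new run
    return runs

def run_length_decode(runs):
    """Expand [char, count] pairs back into the exact original string."""
    return "".join(ch * count for ch, count in runs)

encoded = run_length_encode("aaabbc")
print(encoded)                        # [['a', 3], ['b', 2], ['c', 1]]
assert run_length_decode(encoded) == "aaabbc"  # exact original restored
```

The decode step recovers the input byte-for-byte, which is precisely the lossless property described above.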

24
Q

What is lossy compression?

A

In lossy compression, the decompressed data may differ from the original but is still useful enough to retrieve information from. For example, JPEG is a lossy image format, yet we can recover meaning equivalent to the original image. Methods such as the Discrete Wavelet Transform and PCA (principal component analysis) are examples of this kind of compression.

25
Q

What is Discretization Operation ?

A

The data discretization technique divides attributes of a continuous nature into data with intervals. Many constant values of the attributes are replaced with labels of small intervals, so mining results can be presented in a concise and easily understandable way.

26
Q

What is top-down discretization?

A

If you first choose one or a few points (so-called breakpoints or split points) to divide the whole range of attribute values, and then repeat this on the resulting parts until the end, the process is known as top-down discretization, also called splitting.

27
Q

What is bottom-up discretization?

A

If you first consider all the constant values as split points and then discard some of them by merging neighbouring values into intervals, the process is called bottom-up discretization, also called merging.

28
Q

What are the benefits of data reduction?

A

The main benefit of data reduction is simple: the more data you can fit into a terabyte of disk space, the less capacity you will need to purchase. Other benefits include:

Data reduction can save energy.
Data reduction can reduce your physical storage costs.
Data reduction can decrease your data center footprint.
Data reduction greatly increases the efficiency of a storage system and directly impacts your total spending on capacity.

29
Q

What is attribute subset selection ? in your own words

A

It is a process in which we select the attributes that are most relevant and remove the irrelevant ones.
There are four ways in which we can do this:
1- Forward selection
2- Backward elimination
3- Combined forward and backward
4- Decision tree induction

30
Q

Attribute subset selection

What is forward attribute selection ?

A

{} —-added–> {A1} —-added–> {A1, A4} —-added–> {A1, A4, A5}

In forward selection we start with an empty subset and repeatedly add the most relevant attributes to it.
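The greedy loop behind forward selection can be sketched as follows; the scoring function and attribute weights here are toy stand-ins for whatever relevance measure (e.g. information gain) would be used in practice:

```python
def forward_select(attributes, score, k):
    """Greedy forward selection: start with an empty subset and repeatedly
    add the attribute that most improves the subset score."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy scoring: each attribute has a fixed relevance weight (made-up numbers).
weights = {"A1": 0.9, "A2": 0.1, "A3": 0.2, "A4": 0.7, "A5": 0.6}
score = lambda subset: sum(weights[a] for a in subset)

print(forward_select(["A1", "A2", "A3", "A4", "A5"], score, 3))
# picks the three highest-weight attributes: ['A1', 'A4', 'A5']
```

Backward elimination (the next card) is the mirror image: start with all attributes and greedily remove the one whose removal hurts the score least.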

31
Q

Attribute subset selection

What is backward attribute selection ?

A

Ir : {A1, A2, A3, A4, A5} —-removed–> Fr : {A1, A4, A5}

In backward elimination we start with the full set of attributes and remove the irrelevant ones from it; we don't use any auxiliary subset.