DM mod 2 Flashcards
What is preprocessing in data mining?
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.
What is data cleaning?
Raw data can have many irrelevant and missing parts. Data cleaning deals with these problems. It involves handling missing data, noisy data, etc.
How to handle missing data while data cleaning?
This situation arises when some values are missing from the dataset. It can be handled in various ways.
Some of them are:
Ignore the tuples:
Drop the affected tuple entirely. This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.
Fill the missing values:
There are various ways to do this. You can fill the missing values manually, with the attribute mean, or with the most probable value.
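Both approaches can be sketched in a few lines of Python. This is a minimal illustration, not a definitive method: the records, the `max_missing` threshold, and the helper names `drop_sparse_rows` and `fill_with_mean` are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 41, "income": None},
    {"age": None, "income": None},
    {"age": 33, "income": 56000},
]

def drop_sparse_rows(rows, max_missing=1):
    """Ignore tuples that have more than max_missing missing values."""
    return [r for r in rows if sum(v is None for v in r.values()) <= max_missing]

def fill_with_mean(rows, attribute):
    """Fill missing values of one attribute with the mean of its observed values."""
    observed = [r[attribute] for r in rows if r[attribute] is not None]
    attribute_mean = mean(observed)
    return [
        {**r, attribute: attribute_mean if r[attribute] is None else r[attribute]}
        for r in rows
    ]

# Drop the tuple with two missing values, then fill the remaining gaps.
kept = drop_sparse_rows(rows)
filled = fill_with_mean(fill_with_mean(kept, "age"), "income")
print(filled)
```

Here the fourth record is dropped because both of its values are missing, and the remaining gaps are filled with the mean of the observed values for that attribute.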
What is noisy data in data cleaning?
Noisy data is meaningless data that machines cannot interpret. It can result from faulty data collection, data entry errors, etc.
How to handle noisy data in data cleaning?
Noisy data can be handled by three methods:
1- Binning method
2- Regression
3- Clustering
What is the binning method in data cleaning?
This method works on sorted data in order to smooth it. The sorted data is divided into segments (bins) of equal size, and each segment is handled separately. All values in a segment can be replaced by the segment mean, or each value can be replaced by the nearest segment boundary.
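The two smoothing options can be sketched as follows. This is a minimal example under stated assumptions: the `prices` list is illustrative data, the bin size of 3 is arbitrary, and the function names are hypothetical.

```python
def equal_size_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum or maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data
bins = equal_size_bins(prices, 3)
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

In this sketch a value that is equally close to both boundaries is mapped to the lower one; that tie-breaking rule is a choice made for the example.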
How is noisy data handled using regression?
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
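For the linear case, one way to picture this is to fit a least-squares line and replace each noisy value with the value on the line. The sketch below assumes simple linear regression with one independent variable; the data points and the helper name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative noisy measurements that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
# Smoothing: replace each observed y with the fitted value on the line.
smoothed = [slope * x + intercept for x in xs]
print(round(slope, 2), round(intercept, 2))
print([round(y, 2) for y in smoothed])
```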
How is noisy data handled with clustering?
This approach groups similar data into clusters. Outliers fall outside the clusters, although some may go undetected.
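The idea can be illustrated with a deliberately simple one-dimensional grouping. Note the assumptions: this gap-based grouping is a stand-in for a real clustering algorithm, and the `max_gap` and `min_size` thresholds, the readings, and the function names are all hypothetical.

```python
def cluster_1d(values, max_gap):
    """Group sorted values; start a new cluster when the gap exceeds max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def find_outliers(clusters, min_size=2):
    """Treat values in clusters smaller than min_size as outliers."""
    return [v for c in clusters if len(c) < min_size for v in c]

readings = [10, 11, 12, 13, 50, 51, 52, 53, 200]  # illustrative data
clusters = cluster_1d(readings, max_gap=5)
outliers = find_outliers(clusters)
print(clusters)  # [[10, 11, 12, 13], [50, 51, 52, 53], [200]]
print(outliers)  # [200]
```

A noisy value that happens to land inside a dense group would not be flagged here, which matches the caveat that some outliers may go undetected.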
What is data transformation?
This step transforms the data into forms suitable for the mining process.
List four ways data transformation happens.
1- Normalization
2- Attribute selection
3- Discretization
4- Concept hierarchy generation
How is data transformation done using normalization?
Normalization scales the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
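Min-max scaling is one common way to do this (the card does not name a specific method, so treat this as an assumed choice). The sketch below maps values linearly onto a target range; the sample scores and the function name are illustrative.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # All values are identical; map them all to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

scores = [20, 40, 60, 80, 100]  # illustrative data
print(min_max_normalize(scores))             # [0.0, 0.25, 0.5, 0.75, 1.0]
print(min_max_normalize(scores, -1.0, 1.0))  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```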
How is data transformation done using attribute selection?
In this strategy, new attributes are constructed from the given set of attributes to help the mining process. (This strategy is also known as attribute construction.)
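As a small illustration, a new attribute can be derived from existing ones. The example below is an assumption made for this card: it builds a hypothetical `area` attribute from `width` and `height`.

```python
# Hypothetical records with two existing attributes.
plots = [
    {"width": 10, "height": 20},
    {"width": 5, "height": 8},
    {"width": 12, "height": 3},
]

def add_area(records):
    """Construct a new 'area' attribute from the existing width and height."""
    return [{**r, "area": r["width"] * r["height"]} for r in records]

enriched = add_area(plots)
print(enriched[0])  # {'width': 10, 'height': 20, 'area': 200}
```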
How is data transformation done using discretization?
Data discretization converts a large number of data values into a smaller number so that evaluating and managing the data becomes easier. In other words, it converts the values of a continuous attribute into a finite set of intervals with minimal loss of information. There are two forms of data discretization: supervised and unsupervised. Supervised discretization uses the class information; unsupervised discretization does not. Discretization can also be described by the direction in which it proceeds: top-down splitting or bottom-up merging.
For example, suppose the attribute Age has the following values:
Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77
These ages can be grouped into categories such as child (1, 5, 4, ...), young, mature, and old.
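The Age example can be sketched in Python. The card only names the categories, so the cut points used here (11, 31, and 70) are assumptions chosen to separate the listed values; the helper name `discretize` is also hypothetical.

```python
from bisect import bisect_right

ages = [1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77]

# Assumed cut points: the lowest age counted as young, mature, and old.
cut_points = [11, 31, 70]
labels = ["child", "young", "mature", "old"]

def discretize(age):
    """Map a numeric age to the label of the interval it falls into."""
    return labels[bisect_right(cut_points, age)]

# Collect the ages under each category, preserving input order.
groups = {label: [] for label in labels}
for age in ages:
    groups[discretize(age)].append(age)

for label, members in groups.items():
    print(label, members)
```

With these assumed cut points, the ages below 11 fall under child, 11 to 30 under young, 31 to 69 under mature, and 70 and above under old.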
How is data transformation done using concept hierarchy generation?
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be converted to “country”.
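The city-to-country example can be sketched with a lookup table. The mapping, the sample records, and the function name `generalize` are illustrative assumptions; a real hierarchy would come from domain knowledge or the database schema.

```python
from collections import Counter

# Hypothetical one-level hierarchy: city -> country.
city_to_country = {
    "Paris": "France",
    "Lyon": "France",
    "Mumbai": "India",
    "Delhi": "India",
}

records = [
    {"customer": "A", "city": "Paris"},
    {"customer": "B", "city": "Mumbai"},
    {"customer": "C", "city": "Lyon"},
    {"customer": "D", "city": "Delhi"},
    {"customer": "E", "city": "Paris"},
]

def generalize(records, hierarchy):
    """Replace the low-level 'city' attribute with the higher-level 'country'."""
    return [
        {"customer": r["customer"], "country": hierarchy[r["city"]]}
        for r in records
    ]

generalized = generalize(records, city_to_country)
counts = Counter(r["country"] for r in generalized)
print(counts)
```

After generalization, five distinct city values collapse into two country values, which is what makes higher-level summaries possible.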
What is data reduction?
Data mining is used to handle huge amounts of data, and analysis becomes harder as the volume grows. Data reduction techniques are used to address this. They aim to increase storage efficiency and reduce data storage and analysis costs.