Lesson 1 Flashcards
What is data mining?
The process of discovering patterns, relationships, and useful information in large data sets.
Importance of data mining
- Extracting valuable insights
- Improving decision making
- Enhancing customer experience
- Fraud detection and risk management
Steps in knowledge discovery
1. Data preparation (data cleaning, integration, transformation, and selection)
2. Data mining (intelligent methods are applied to extract patterns)
3. Pattern evaluation
4. Knowledge presentation
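The four steps above can be sketched on a toy data set. This is only an illustration: the transactions, the function names, and the min_count threshold are made up for the example, and counting item pairs is just one simple stand-in for the data mining step.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw transactions; None and repeated items stand in for dirty data.
raw = [
    ["bread", "milk"],
    ["bread", "milk", None],
    ["milk", "eggs"],
    ["bread", "milk", "eggs"],
    ["bread", "bread", "eggs"],
]

def prepare(transactions):
    """Step 1: data preparation (drop missing items, remove repeated items)."""
    return [sorted({item for item in t if item is not None}) for t in transactions]

def mine_pairs(transactions):
    """Step 2: data mining (count how often each pair of items occurs together)."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(t, 2))
    return counts

def evaluate(counts, min_count):
    """Step 3: pattern evaluation (keep only pairs seen at least min_count times)."""
    return {pair: n for pair, n in counts.items() if n >= min_count}

def present(patterns):
    """Step 4: knowledge presentation (format the patterns as readable lines)."""
    return [f"{a} + {b}: {n} transactions" for (a, b), n in sorted(patterns.items())]

clean = prepare(raw)
patterns = evaluate(mine_pairs(clean), min_count=3)
report = present(patterns)
print(report)
```

Each function maps to one step, so a step can be swapped out (for example, a different mining method) without touching the others.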
Pyramid method of data mining
Data
Information
Knowledge
Wisdom
What is a data warehouse?
A centralised repository designed to store large amounts of structured data from multiple sources for analysis and reporting.
Purpose of a data warehouse
Data integration (combines data from different sources in one location)
Data consistency and quality
Historical data storage
Supports business intelligence (BI)
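Difference between an operational database and a data warehouse
1. An operational database supports day-to-day business operations, while a data warehouse supports analysis and reporting.
2. An operational database holds current, real-time transaction data, while a data warehouse holds historical and aggregated data.
3. An operational database is highly normalised, while a data warehouse is denormalised for faster queries.
4. An operational database handles CRUD operations, while a data warehouse handles OLAP queries.
5. An operational database is optimised for quick inserts and updates, while a data warehouse is optimised for complex queries and summaries.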
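The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. The orders table and its rows are hypothetical, and both workloads run against the same in-memory database only to keep the example short; in practice the analytical query would run on a separate warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Operational style: single-row inserts and updates on current transactions.
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, year INTEGER)"
)
cur.executemany(
    "INSERT INTO orders (customer, amount, year) VALUES (?, ?, ?)",
    [("alice", 20.0, 2023), ("bob", 35.0, 2023), ("alice", 15.0, 2024)],
)
cur.execute("UPDATE orders SET amount = 40.0 WHERE id = 2")  # quick single-row update

# Warehouse style: a query that summarises historical data across many rows.
cur.execute("SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year")
totals_by_year = cur.fetchall()
conn.close()

print(totals_by_year)
```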
Components of data warehouse architecture
1. Data sources
2. ETL (extract, transform, and load) layer
3. Data storage layer
4. Metadata and management layer
5. Data access layer
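The ETL layer in the list above can be sketched in a few lines. This is a minimal illustration, assuming a CSV export as the data source and a plain dictionary as the storage layer; the function names and the sample rows are invented for the example.

```python
import csv
import io

# Hypothetical source: CSV text standing in for an export from an operational system.
source_csv = "name,amount\n Alice ,10\nbob,5\nALICE,7\n"

def extract(text):
    """Extract: read the rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy the names and convert the amounts to numbers."""
    return [
        {"name": r["name"].strip().lower(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, warehouse):
    """Load: add each amount to the per-customer total in the target store."""
    for r in rows:
        warehouse[r["name"]] = warehouse.get(r["name"], 0.0) + r["amount"]
    return warehouse

warehouse = load(transform(extract(source_csv)), {})
print(warehouse)
```

The transform step is where inconsistent spellings such as " Alice " and "ALICE" are reconciled before the data is stored.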
Types of data
1. Structured data: organised in a predefined format of rows and columns.
2. Semi-structured data: does not follow a strict schema.
3. Unstructured data: has no predefined format.
Examples of types of data
1. Structured: phone numbers, addresses
2. Semi-structured: XML, web pages
3. Unstructured: images, text documents, social media videos
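The three types can be compared by how a program reads them. The sketch below uses made-up sample values; it shows fixed columns for structured data, tagged but irregular records for semi-structured data, and free text for unstructured data.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, so every row has the same fields.
structured = list(csv.DictReader(io.StringIO("phone,address\n555-0100,1 Main St\n")))

# Semi-structured: tags describe the data, but records may have different fields.
xml_text = (
    "<people>"
    "<person><name>Ann</name><phone>555-0100</phone></person>"
    "<person><name>Ben</name></person>"
    "</people>"
)
people = [{child.tag: child.text for child in person} for person in ET.fromstring(xml_text)]

# Unstructured: free text with no predefined fields; only the raw content is available.
unstructured = "Met Ann today, she said to call her next week."
word_count = len(unstructured.split())

print(structured, people, word_count)
```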
Data types and attributes
1. Nominal (categories without a meaningful order; names of things or symbols)
2. Numeric (quantitative; integer or real values)
3. Binary (a nominal attribute with only two categories, 0 or 1)
4. Ordinal (categories with a meaningful order or ranking)
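The attribute types above differ in which operations make sense on them. The record below is hypothetical and only serves to show one operation per type.

```python
# Hypothetical record with one attribute of each type.
record = {"color": "red", "income": 52000.5, "smoker": True, "size": "medium"}

# Nominal: categories have no order, so only equality checks are meaningful.
is_red = record["color"] == "red"

# Numeric: arithmetic is meaningful.
monthly_income = record["income"] / 12

# Binary: two states, commonly coded as 0 or 1.
smoker_code = int(record["smoker"])

# Ordinal: categories have a rank, so they can be compared once mapped to positions.
SIZE_ORDER = ["small", "medium", "large"]
size_rank = SIZE_ORDER.index(record["size"])
is_bigger_than_small = size_rank > SIZE_ORDER.index("small")

print(is_red, monthly_income, smoker_code, size_rank, is_bigger_than_small)
```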
What is a cluster?
A collection of data objects such that the objects within the cluster are similar to one another and dissimilar to the objects in other clusters.
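One way to see this definition in action is a small k-means sketch on one-dimensional values. K-means is not named in these notes; it is used here only as one common clustering method, and the values and starting centres are made up.

```python
def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest centre, then move each centre to its cluster mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old centre if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters, centers

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
clusters, centers = kmeans_1d(values, centers=[0.0, 5.0])
print(clusters, centers)
```

Values inside each resulting group are close to one another and far from the values in the other group, which is the similarity property the definition describes.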
Methods for handling missing values
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant, such as "unknown", to fill in the missing value
4. Use a measure of central tendency, such as the mean
5.
What is noise?
A random error or variance in a measured variable.
What is data cleaning?
The process of removing noisy data, filling in missing values, and identifying outliers in the data.
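Identifying outliers can be sketched with a simple rule. The notes do not say which method to use, so the rule here (flag values more than two standard deviations from the mean) is an assumption, as are the sample readings and the find_outliers name.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds threshold standard deviations."""
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - centre) / spread > threshold]

readings = [10, 12, 11, 13, 12, 95]
outliers = find_outliers(readings)
cleaned = [v for v in readings if v not in outliers]
print(outliers, cleaned)
```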
How to fill in missing data?
Ignore the record with the missing value
Fill in the values manually
Use a constant value
Use a measure of central tendency, such as the mean or median
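These options can be compared side by side on a small column. The ages list is hypothetical, with None marking a missing entry.

```python
from statistics import mean, median

# Hypothetical column with missing entries marked as None.
ages = [25, None, 31, 40, None, 28]

# Ignore: drop the entries whose value is missing.
known = [a for a in ages if a is not None]

# Constant: replace missing entries with a fixed placeholder.
with_constant = [a if a is not None else "unknown" for a in ages]

# Central tendency: replace missing entries with the mean or the median of the known values.
with_mean = [a if a is not None else mean(known) for a in ages]
with_median = [a if a is not None else median(known) for a in ages]

print(known, with_constant, with_mean, with_median)
```

Dropping entries shrinks the data set, while a placeholder keeps every record but changes the column's type, so the choice depends on what the later analysis needs.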
What is PCA?
Principal component analysis (PCA) is a linear method that projects the original data onto a smaller, lower-dimensional space.
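A minimal sketch of the idea, reducing two-dimensional points to one dimension using only the standard library. The function name and sample points are made up, and the closed-form angle used here works only for the two-dimensional case; real data sets would normally use a numerical library.

```python
import math

def pca_2d_to_1d(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Direction of largest variance: angle of the leading eigenvector.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))

    # Each point is reduced from two coordinates to one.
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, projected

points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
direction, projected = pca_2d_to_1d(points)
print(direction, projected)
```

Because the sample points lie on a straight line, a single coordinate along that line keeps all of their variation, which is why the smaller space loses nothing in this case.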