5. Data warehouse and data mining Flashcards
Data warehouse
- A data warehouse is a relational database designed for query and analysis rather than transaction processing. It contains historical data derived from transaction data from one or more sources.
- A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support for decision-makers for data modeling and analysis.
- It is a database that stores information oriented to satisfy decision-making requirements
- It is a group of decision-support technologies intended to enable knowledge workers to make better decisions
- A data warehouse is a collection of data specific to the entire organisation, not only to a particular group of users
- It is not used for daily operations and transaction processing, but for making decisions
- It supports a small number of clients with relatively long interactions
- It includes current and historical data to provide a historical perspective of the information
Characteristics of data warehouses
- Subject oriented
- Integrated
- Time variant
- Non volatile
1- Subject oriented
A data warehouse focuses on querying and analysing data for decision makers. Hence, it typically provides a concise and straightforward view around a particular subject rather than the organisation's ongoing operations as a whole, and it excludes data that is not useful for that subject
2- Integrated
It integrates various heterogeneous data sources such as RDBMS, flat files, and online transaction records
3- Time variant
Historical information is kept in the data warehouse
Goals of data warehouse
- Help in reporting as well as analysis
- Maintain the organisation's historical information
- Be the foundation for decision making
Components of data warehouse
- Source data component
- Data staging component
- Data storage component
- Information delivery component
- Metadata component
- Management and control component
Source data component:
Source data coming into the data warehouse may be grouped into four broad categories:
1. Internal data (private data)
2. Production data (from the enterprise's different operational systems)
3. External data
4. Archived data (old data is periodically taken and stored in archive files)
Data staging component
After extracting the data from the various sources, we have to prepare it for storage in the warehouse: the data must be changed, converted, and made ready in a format suitable for querying and analysis
3 functions:
1. Data extraction
2. Data transformation
3. Data loading
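The staging functions above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the raw rows, the `sales` table, and the in-memory SQLite database standing in for warehouse storage are all hypothetical and do not come from these notes.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from different sources.
RAW_ROWS = [
    {"id": "1", "amount": " 120.50 ", "region": "north"},
    {"id": "2", "amount": "80", "region": "SOUTH"},
    {"id": "3", "amount": "", "region": "East"},  # missing amount
]


def extract():
    """Extraction: pull rows out of the source systems (here, a constant)."""
    return list(RAW_ROWS)


def transform(rows):
    """Transformation: fix types and formats, drop rows that cannot be used."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip rows with no amount to analyse
        cleaned.append((int(row["id"]), float(amount), row["region"].title()))
    return cleaned


def load(conn, rows):
    """Loading: write the prepared rows into (assumed) warehouse storage."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER, amount REAL, region TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
loaded = conn.execute("SELECT id, amount, region FROM sales ORDER BY id").fetchall()
print(loaded)
```

Keeping each function separate mirrors the three staging functions, so any one step can be changed without touching the others.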
Data storage component
Data storage for data warehousing is a separate repository from the operational systems, whose own repositories mostly hold only current data
Information delivery component
Enables the process of subscribing to data warehouse files and having them transferred to one or more destinations according to a customer-specified scheduling algorithm
Metadata component
It is like a data dictionary, where we keep the data about logical data structures, the data about the records and addresses, the information about the indexes, and so on
Data marts
Subset of an organisational information store, generally oriented to a specific purpose or primary data subject, which may be distributed to support business needs
- Data marts are analytical record stores designed to focus on particular business functions for a specific community within an organisation
- Data marts are derived from subsets of data in the data warehouse
Types:
1. Independent data marts (bottom up approach)
Independent data marts are created first, and then a data warehouse is designed by integrating the multiple independent data marts
2. Dependent data marts (top down approach)
Treated as subsets of a data warehouse: first a data warehouse is created, from which various data marts can then be created
- These data marts are dependent on data warehouse and extract essential records from it
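A dependent data mart can be pictured as a filtered subset of the warehouse. The sketch below models it as a SQL view over a hypothetical `warehouse_sales` table in an in-memory SQLite database; the table, view, and column names are assumptions for illustration, and real data marts are often separate physical stores rather than views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical enterprise-wide warehouse table.
conn.execute("CREATE TABLE warehouse_sales (dept TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("marketing", "banner", 300.0),
        ("finance", "audit", 900.0),
        ("marketing", "flyer", 150.0),
    ],
)

# Dependent data mart: the subset of warehouse records one community needs.
conn.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT product, amount FROM warehouse_sales WHERE dept = 'marketing'"
)

mart_rows = conn.execute(
    "SELECT product, amount FROM marketing_mart ORDER BY product"
).fetchall()
print(mart_rows)
```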
Management and control component
It manages and controls the elements that coordinate the services and functions within the data warehouse. It also controls the data transformation, the data transfer into data warehouse storage, and the delivery to the clients
Data mining
- Data mining is the process of extracting information from huge sets of data to identify patterns, trends, and useful data that allow a business to make data-driven decisions
- Data Mining is a process used by organizations to extract specific data from huge databases to solve business problems. It primarily turns raw data into useful information.
Datasets on which data mining is done
- Data warehouses
- Relational databases
- Data repositories
- Object-relational databases
- Transactional databases
Advantages:
1. Enables organisations to obtain knowledge-based data
2. Helps organisations make profitable modifications in operation and production
3. Compared to other statistical data applications, data mining is cost efficient
4. Helps in the decision-making process
5. It can be introduced in new systems as well as existing platforms
Disadvantages:
1. There is a possibility that organisations may sell useful customer data to other organisations for money
- As per the report, American Express has sold its customers' credit card purchase data to other organisations
2. Data mining analytics software is difficult to use and operate, and needs advanced training to work with
3. Over reliance on data
4. Misinterpretation of results
5. Performance issues
Online transaction processing (OLTP)
- Meant for handling small transactions, and usually serves as a single source of storage
- Built to prioritise transactions rather than data analysis
- Example: an online movie ticket booking website. If two people try to book the same seat for the same movie and the same show time at the same moment, whoever completes the transaction first gets the ticket
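The seat-booking example can be sketched with a conditional update, so that only the first transaction to complete wins. This is a minimal illustration, assuming a hypothetical `seats` table and `book` helper on an in-memory SQLite database; it is not how any particular ticketing site works.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seats (seat TEXT PRIMARY KEY, booked_by TEXT)")
conn.execute("INSERT INTO seats VALUES ('A1', NULL)")
conn.commit()


def book(conn, seat, user):
    """Try to book a seat; return True only if this user got it."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            # The update only matches while the seat is still free.
            "UPDATE seats SET booked_by = ? WHERE seat = ? AND booked_by IS NULL",
            (user, seat),
        )
        return cur.rowcount == 1


first = book(conn, "A1", "alice")   # completes first, gets the seat
second = book(conn, "A1", "bob")    # seat already taken, update matches no row
owner = conn.execute("SELECT booked_by FROM seats WHERE seat = 'A1'").fetchone()[0]
print(first, second, owner)
```

Putting the "is it still free?" check inside the update itself avoids a gap between checking and booking in which a second user could slip in.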
Advantages:
1. Responds very quickly
2. Allows us to perform actions such as reading, writing, and deleting data quickly
3. Ensures the consistency of data in real time
4. Ensures high availability by providing real-time access to the data
5. Highly scalable and can handle an increasing number of users and transactions
6. Security
7. Helps in making better decisions
Disadvantages:
1. It is not fail-safe: if there is a hardware failure, the whole transaction is affected
2. Allows users to access and change the same data at the same time, which may lead to conflicts
3. Limited analysis capabilities
4. High maintenance costs
Online analytical processing (OLAP)
- Usually more suited to analytics and data mining
- A data warehouse system is an OLAP system
- Many companies compare their sales for the current month with previous months and keep track of the business. This data is stored in another location, in a completely separate database. In this situation, the company uses OLAP databases
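The month-over-month comparison above can be sketched as an aggregate query over historical data. The `sales` table, its columns, and the sample figures below are made up for illustration, and SQLite is used here only as a stand-in for a separate analytical database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01", 100.0), ("2024-01", 50.0), ("2024-02", 200.0), ("2024-03", 120.0)],
)

# Aggregate individual sales into one total per month.
totals = conn.execute(
    "SELECT month, SUM(amount) FROM sales GROUP BY month ORDER BY month"
).fetchall()

# Pair each month with the one before it and compute the change in sales.
changes = [
    (cur_month, cur_total - prev_total)
    for (_, prev_total), (cur_month, cur_total) in zip(totals, totals[1:])
]
print(totals)
print(changes)
```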
Benefits:
1. Maintains consistency of information and calculations
2. Provides a single platform for planning, analysis, and budgeting for business analytics
3. We can easily apply security restrictions to protect data
4. Enables complex queries and data analysis by providing multi dimensional view of data
5. It also supports data mining and predictive analytics by providing access to historical data and trends
6. Helps in handling large volumes of data
Drawbacks:
1. IT professionals are always needed to handle them
2. Because data from many departments is stored in a single OLAP system, cooperation between people in various departments is sometimes needed, which leads to dependency problems
3. Expensive to implement and maintain, especially for large data sets
4. OLAP services are optimised for read-heavy workloads, so write operations may be slower and less efficient
5. OLAP services may not be suitable for real-time analysis or decision making, as data is typically updated on a periodic basis
Data mining techniques
- Classification
- clustering
- regression analysis
- association rules
a. Lift (measures the accuracy of the confidence relative to how often item B is purchased)
Lift = Confidence / (Item B / Entire data set)
b. Support (measures how often items A and B are purchased together compared to the overall data set)
Support = (Item A + Item B) / Entire data set
c. Confidence (measures how often item B is purchased when item A is purchased as well)
Confidence = (Item A + Item B) / Item A
Here (Item A + Item B) counts the transactions that contain both items.
- outlier detection
- sequential patterns
- prediction
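The association rule measures (support, confidence, lift) can be worked through on a small example. The baskets and function names below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical shopping baskets; each set is one transaction.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]


def support(items, transactions):
    """Fraction of all transactions that contain every item in `items`."""
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(a, b, transactions):
    """Of the transactions containing A, the fraction that also contain B."""
    return support(a | b, transactions) / support(a, transactions)


def lift(a, b, transactions):
    """Confidence of A -> B relative to how often B is bought overall."""
    return confidence(a, b, transactions) / support(b, transactions)


a, b = {"bread"}, {"butter"}
support_ab = support(a | b, transactions)       # 2 of 4 baskets hold both
confidence_ab = confidence(a, b, transactions)  # 2 of the 3 bread baskets
lift_ab = lift(a, b, transactions)              # confidence / support of butter
print(support_ab, confidence_ab, lift_ab)
```

A lift above 1 suggests that buying item A makes item B more likely than its overall purchase rate alone would predict.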