DM Mod 1 Flashcards

1
Q

What is Data Mining?

A

Data mining is the process of extracting patterns, trends, and other useful information from very large data sets so that a business can make data-driven decisions.

In other words, data mining examines data from several perspectives to uncover hidden patterns and categorize them into useful information. The data is collected and assembled in repositories such as data warehouses, where data mining algorithms analyze it efficiently to support decision making and, ultimately, to cut costs and generate revenue.

Data mining is also called Knowledge Discovery in Databases (KDD).

2
Q

Supervised learning

A

Supervised learning, as the name indicates, involves a supervisor acting as a teacher. The machine is trained on well-labeled data, meaning each training example is already tagged with the correct answer. The algorithm analyses these training examples, and when it is later given a new set of examples, it uses what it learned from the labeled data to predict the correct outcome. Supervised learning is divided into two categories of algorithms:

1- Classification: A classification problem is one where the output variable is a category, such as "red" or "blue", or "disease" or "no disease".

2- Regression: A regression problem is one where the output variable is a real value, such as a dollar amount or a weight.

Advantages:
Supervised learning allows you to collect data and produce outputs based on previous experience.
It helps to optimize performance criteria with the help of experience.
It helps to solve various types of real-world computation problems.

Disadvantages:
Classifying big data can be challenging.
Training needs a lot of computation time.
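
The two categories above can be illustrated with a minimal sketch. It assumes made-up toy data and uses two simple stand-in techniques chosen only for illustration: a 1-nearest-neighbour rule for classification and an ordinary least-squares line for regression. The names classify_1nn and fit_line are hypothetical.

```python
# Classification: labeled examples map features -> category.
# (height_cm, weight_kg) -> label; the numbers are invented for illustration.
labeled_points = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]


def classify_1nn(point, training):
    """Return the label of the closest training example (1-nearest neighbour)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(training, key=lambda example: sq_dist(example[0], point))
    return label


# Regression: labeled examples map x -> real value.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # lies exactly on y = 2x


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# New, unseen examples are answered from what the labeled data taught the model.
predicted_label = classify_1nn((152.0, 52.0), labeled_points)
slope, intercept = fit_line(xs, ys)
predicted_value = slope * 5.0 + intercept
```

In both cases the correct answers (labels or target values) are supplied during training, which is what makes the learning supervised.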

3
Q

Unsupervised learning

A

Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. The task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data. Unlike supervised learning, no teacher is provided, which means the machine receives no labeled examples. The machine is therefore left to find the hidden structure in unlabeled data by itself.

For instance, suppose it is given a set of pictures of dogs and cats that it has never seen. The machine has no idea about the features of dogs and cats, so it cannot label the pictures as "dogs" and "cats". But it can categorize them according to their similarities, patterns, and differences, splitting the set into two parts: the first may contain all the pictures with dogs in them and the second all the pictures with cats. Nothing was learned beforehand, which means there was no training data or examples. This lets the model work on its own to discover patterns and information that were previously undetected. It mainly deals with unlabeled data.

Unsupervised learning is divided into two categories of algorithms:
1- Clustering: A clustering problem is one where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
2- Association: An association rule learning problem is one where you want to discover rules that describe large portions of your data, such as people who buy X also tend to buy Y.
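
Both categories can be sketched on toy data. The following is only an illustration under stated assumptions: the spending figures and shopping baskets are invented, k-means is used as one common clustering technique, and the association part simply counts co-occurring pairs to estimate the confidence of a rule "X -> Y". The helper name kmeans_1d is hypothetical.

```python
from collections import Counter
from itertools import combinations


def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means on 1-D data; `centers` holds the starting guesses."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster (unchanged if empty).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Clustering: group customers by monthly spend without any labels.
spend = [10, 12, 11, 95, 100, 105]
centers, clusters = kmeans_1d(spend, centers=[0, 50])

# Association: which items tend to be bought together?
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter"},
]
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1
item_counts = Counter(item for basket in baskets for item in basket)

# Confidence of "bread -> butter" = baskets with both / baskets with bread.
confidence = pair_counts[("bread", "butter")] / item_counts["bread"]
```

No labels are supplied at any point: the groups and the rule are derived from the structure of the data alone.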

4
Q

What is OLAP

A

OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into information through fast, consistent, interactive access to a wide variety of possible views of data that has been transformed from raw information to reflect the real dimensionality of the enterprise as understood by its users.

OLAP implements the multidimensional analysis of business information and supports complex calculations, trend analysis, and sophisticated data modeling.

It is rapidly becoming the essential foundation for intelligent solutions, including business performance management, planning, budgeting, forecasting, financial reporting, analysis, simulation models, knowledge discovery, and data warehouse reporting.

OLAP enables end users to perform ad hoc analysis of data in multiple dimensions, providing the insight and understanding they require for better decision making.

How does OLAP work?
Fundamentally, OLAP is a very simple concept. It pre-calculates most of the queries that are typically very hard to execute over tabular databases, namely aggregation, joining, and grouping. These queries are calculated during a process that is usually called "building" or "processing" the OLAP cube. This process typically runs overnight, so by the time end users get to work, the data has been updated.
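
The pre-calculation idea can be sketched as follows. This is a minimal illustration, not how any particular OLAP product builds its cubes: the sales rows are invented, build_cube is a hypothetical name, and None is used here as a stand-in for "all members of this dimension".

```python
from collections import defaultdict

# Raw tabular data: (branch, year, amount). Values are made up.
sales_rows = [
    ("A", 1997, 100),
    ("A", 1998, 150),
    ("B", 1997, 200),
    ("B", 1998, 50),
]


def build_cube(rows):
    """'Processing' step: pre-compute totals for every grouping we may query."""
    totals = defaultdict(int)
    for branch, year, amount in rows:
        totals[(branch, year)] += amount  # finest level
        totals[(branch, None)] += amount  # per branch, all years
        totals[(None, year)] += amount    # per year, all branches
        totals[(None, None)] += amount    # grand total
    return dict(totals)


# Built once (for example overnight); afterwards each query is a dictionary lookup.
cube = build_cube(sales_rows)
branch_a_total = cube[("A", None)]
total_1997 = cube[(None, 1997)]
grand_total = cube[(None, None)]
```

The expensive grouping and aggregation work happens once in build_cube, so answering a question later does not require rescanning the raw rows.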

5
Q

Difference between OLTP and OLAP

A

OLTP (On-Line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, and DELETE). The emphasis in OLTP systems is on very fast query processing and on maintaining data integrity in multi-access environments, and effectiveness is measured by the number of transactions per second. An OLTP database holds detailed, current data, and the schema used to store the transactional data is the entity model (usually 3NF).

OLAP (On-Line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used by data mining techniques. An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).

The differences between OLAP and OLTP systems are as follows.

1) Users: OLTP systems are designed for office workers, while OLAP systems are designed for decision-makers. So while an OLTP system may be accessed by hundreds or even thousands of users in a huge enterprise, an OLAP system is likely to be accessed only by a select group of managers and may be used by only dozens of users.

2) Functions: OLTP systems are mission-critical. They support the day-to-day operations of an enterprise and are largely performance and availability driven. They carry out simple, repetitive operations. OLAP systems are management-critical: they support the enterprise's decision-making tasks through detailed investigation.

3) Nature: Although SQL queries often return a set of records, OLTP systems are designed to process one record at a time, for example, a record related to a customer who may be on the phone or in the store. OLAP systems are not designed to deal with individual customer records. Instead, they involve queries that deal with many records at a time and provide summary or aggregate information to a manager. OLAP applications work on data stored in a data warehouse that has been extracted from many tables and possibly from more than one enterprise database.

4) Design: OLTP databases are designed to be application-oriented, while OLAP databases are designed to be subject-oriented. OLTP systems view the enterprise data as a collection of tables (possibly based on an entity-relationship model). OLAP systems view enterprise information as multidimensional.

5) Data: OLTP systems usually deal only with the current status of data. For example, a record about an employee who left three years ago may not be available in the Human Resources system. The old data may have been archived on some type of stable storage media and may not be accessible online. OLAP systems, on the other hand, need historical data covering several years, since trends are often essential in decision making.
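
The contrast in workload can be sketched with Python's built-in sqlite3 module. This is only an illustration on a single in-memory table: the orders table and its rows are invented, and a real OLAP system would normally run against a separate warehouse rather than the transactional database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer TEXT, year INTEGER, amount REAL)"
)

# OLTP style: many short transactions, each touching a single record.
insert_sql = "INSERT INTO orders (customer, year, amount) VALUES (?, ?, ?)"
with conn:
    conn.execute(insert_sql, ("alice", 2022, 40.0))
with conn:
    conn.execute(insert_sql, ("bob", 2022, 60.0))
with conn:
    conn.execute(insert_sql, ("alice", 2023, 25.0))
with conn:
    # Correct one order; only the current value is kept.
    conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (30.0, 3))

# OLAP style: one read-only query that scans many rows and aggregates them.
yearly_totals = conn.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()
```

Each `with conn:` block commits one small transaction, while the final query summarises every row at once, which mirrors the "one record at a time" versus "many records at a time" distinction above.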

6
Q

What is a Data Warehouse?

A

A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction processing.

It includes historical data derived from transaction data from one or more sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on supporting decision-makers in data modeling and analysis.

A Data Warehouse is a collection of data specific to the entire organization, not only to a particular group of users.

It is not used for daily operations and transaction processing but for making decisions.

A Data Warehouse can be viewed as a data system with the following attributes:
It is a database designed for investigative tasks, using data from various applications.
It supports a relatively small number of clients with relatively long interactions.
It includes current and historical data to provide a historical perspective of information.
Its usage is read-intensive.
It contains a few large tables.

7
Q

Data Warehouse Architecture

A

A data warehouse is a heterogeneous collection of different data sources organised under a unified schema. There are two approaches to constructing a data warehouse: the top-down approach and the bottom-up approach. The top-down approach is explained below.

  1. Top-down approach:
    The essential components are discussed below:
    A- External Sources – An external source is any source from which data is collected, irrespective of the type of data. The data can be structured, semi-structured, or unstructured.
    B- Stage Area – Since the data extracted from the external sources does not follow a particular format, it needs to be validated before it is loaded into the data warehouse. An ETL tool is recommended for this purpose:
    E (Extract): Data is extracted from the external data sources.
    T (Transform): Data is transformed into the standard format.
    L (Load): Data is loaded into the data warehouse after it has been transformed into the standard format.
    C- Data Warehouse – After cleansing, the data is stored in the data warehouse as the central repository, together with its metadata. The data marts then hold the function-specific subsets of this data.
    Note that the data warehouse stores the data in its purest form in this top-down approach.
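
The extract, transform, and load steps of the stage area can be sketched in a few lines. This is a minimal illustration rather than a real ETL tool: the raw records, the two accepted date formats, and the function names extract, transform, and load are all assumptions made for the example.

```python
from datetime import datetime

# Pretend external source: records arrive in inconsistent formats.
raw_records = [
    {"name": " Alice ", "joined": "2021-03-05", "sales": "120.50"},
    {"name": "BOB", "joined": "05/04/2021", "sales": "80"},
    {"name": "", "joined": "2021-01-01", "sales": "10"},  # invalid: no name
]


def extract():
    """E: pull the records from the external source."""
    return list(raw_records)


def transform(records):
    """T: validate each record and convert it to one standard format."""
    cleaned = []
    for rec in records:
        name = rec["name"].strip().title()
        if not name:
            continue  # validation failed, skip the record
        joined = None
        # Assumed input formats: ISO dates and day/month/year.
        for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
            try:
                joined = datetime.strptime(rec["joined"], fmt).date()
                break
            except ValueError:
                continue
        if joined is None:
            continue
        cleaned.append(
            {"name": name, "joined": joined.isoformat(), "sales": float(rec["sales"])}
        )
    return cleaned


def load(records, warehouse):
    """L: append the standardized records to the warehouse store."""
    warehouse.extend(records)
    return warehouse


warehouse = load(transform(extract()), [])
```

Here the warehouse is just a Python list; in practice the load step would write to the warehouse database.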

Data Marts – A data mart is also part of the storage component. It stores the information of a particular function of an organization that is handled by a single authority. An organization can have as many data marts as it has functions. A data mart can also be described as containing a subset of the data stored in the data warehouse.

Advantages of Top-Down Approach – Since the data marts are created from the data warehouse, this approach provides a consistent dimensional view of the data marts. The model is also considered the most robust to business changes, which is why big organizations prefer it. Creating a data mart from the data warehouse is easy.

Disadvantages of Top-Down Approach – The cost and time taken in design and maintenance are very high.

8
Q

What is Data Cube or OLAP approach in Data Mining

A

A data cube is a grouping of data in a multidimensional matrix. In data warehousing, we generally deal with multidimensional data models, since the data is described by multiple dimensions and multiple attributes. This multidimensional data is represented as a data cube, which can stand for a space of three or more dimensions. The data cube shows pictorially how the different attributes of the data are arranged in the data model. For example, a 3D cube might have the dimensions branch (A, B, C, D), item type (home, entertainment, computer, phone, security), and year (1997, 1998, 1999).

9
Q

What is Data cube classification:

A

The data cube can be classified into two categories:
1- Multidimensional data cube: It stores large amounts of data in a multi-dimensional array. It increases efficiency by keeping an index for each dimension, so it is able to retrieve data fast.
2- Relational data cube: It stores large amounts of data in relational tables. Each relational table represents a dimension of the data cube. It is slower than a multidimensional data cube.
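
The difference between the two storage styles can be sketched on a tiny two-dimensional example. The numbers and the helper names md_lookup and relational_lookup are invented for illustration, and the relational form is shown as a single flattened table of cells to keep the sketch short.

```python
branches = ["A", "B"]
years = [1997, 1998]

# Multidimensional form: one array cell per (branch, year) combination,
# plus an index per dimension mapping each member to its array position.
branch_index = {b: i for i, b in enumerate(branches)}
year_index = {y: i for i, y in enumerate(years)}
md_cube = [
    [100, 150],  # branch A: 1997, 1998
    [200, 50],   # branch B: 1997, 1998
]


def md_lookup(branch, year):
    """Direct positional access through the dimension indexes."""
    return md_cube[branch_index[branch]][year_index[year]]


# Relational form: one row per cell, located by filtering on the key columns.
relational_cube = [
    {"branch": "A", "year": 1997, "sales": 100},
    {"branch": "A", "year": 1998, "sales": 150},
    {"branch": "B", "year": 1997, "sales": 200},
    {"branch": "B", "year": 1998, "sales": 50},
]


def relational_lookup(branch, year):
    """Scan the rows until the matching cell is found."""
    for row in relational_cube:
        if row["branch"] == branch and row["year"] == year:
            return row["sales"]
    return None


md_value = md_lookup("B", 1997)
relational_value = relational_lookup("B", 1997)
```

Both lookups return the same value; the array form reaches the cell by position, while the row form has to search, which is the intuition behind the speed difference described above.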

10
Q

Describe various data cube operations like Roll up, drill down, slicing, dicing, pivot

A

Data cube operations are used to manipulate data to meet the needs of users. These operations help to select particular data for analysis. There are mainly 5 operations, listed below:
1- Roll-up: this operation aggregates data along a dimension, combining similar attribute values into a coarser level.
For example, if the data cube displays the daily income of a customer, we can use a roll-up operation to find the customer's monthly income.
2- Drill-down: this operation is the reverse of the roll-up operation. It allows us to take particular information and subdivide it further for finer-granularity analysis. It zooms into more detail.
For example, if India is a value in a country column and we wish to see villages in India, the drill-down operation splits India into states, districts, towns, cities, and villages, and then displays the required information.
3- Slicing: this operation filters out the unnecessary portions by fixing a single value on one dimension. It is used when the user doesn't need everything in a particular dimension, only a particular attribute value.
For example, country="jamaica" will display data only about Jamaica and will not display the other countries present in the country list.
4- Dicing: this operation does a multidimensional cut: it cuts not just one dimension but can also go to other dimensions and cut a certain range of each. The result looks like a subcube cut out of the whole cube.
For example, the user wants to see the annual salary of Jharkhand state employees.
5- Pivot: this operation is very important from a viewing point of view. It rotates the data cube to give a different view. It doesn't change the data present in the data cube.
For example, if the user is comparing year versus branch, using the pivot operation, the user can change the viewpoint and now compare branch versus item type.
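
The operations above can be sketched on a small cube stored as a dictionary. This is an illustration under stated assumptions: the cell values are invented, the function names are hypothetical, and roll-up is shown in its dimension-reduction form (summing a whole dimension away) rather than climbing a hierarchy such as day to month. Drill-down is the reverse step: going back from the rolled-up result to the more detailed cube.

```python
from collections import defaultdict

DIMS = ("branch", "item_type", "year")

# Each cell: (branch, item_type, year) -> units sold (made-up numbers).
cube = {
    ("A", "home", 1997): 10,
    ("A", "phone", 1997): 20,
    ("A", "home", 1998): 30,
    ("B", "home", 1997): 40,
    ("B", "phone", 1998): 50,
}


def roll_up(cube, drop_dim):
    """Aggregate one dimension away, e.g. sum over all item types."""
    pos = DIMS.index(drop_dim)
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:pos] + key[pos + 1:]] += value
    return dict(out)


def slice_cube(cube, dim, member):
    """Fix a single member on one dimension."""
    pos = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[pos] == member}


def dice(cube, allowed):
    """Keep cells whose members are in the allowed set for every listed dimension."""
    return {
        k: v
        for k, v in cube.items()
        if all(k[DIMS.index(d)] in members for d, members in allowed.items())
    }


def pivot(cube, order):
    """Reorder the dimensions of every key; the values are unchanged."""
    positions = [DIMS.index(d) for d in order]
    return {tuple(k[p] for p in positions): v for k, v in cube.items()}


by_branch_year = roll_up(cube, "item_type")
only_1997 = slice_cube(cube, "year", 1997)
sub_cube = dice(cube, {"branch": {"A"}, "item_type": {"home"}})
rotated = pivot(cube, ("year", "branch", "item_type"))
```

Note that pivot returns the same set of values under reordered keys, which matches the point that pivoting changes the view and not the data.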

11
Q

Advantages of data cubes:

A

Data cubes help in giving a summarised view of data.
Data cubes store large amounts of data in a simple way.
Data cube operations provide quick and better analysis.
Data cubes improve the performance of data queries.

12
Q

What is ROLAP

A

Relational Online Analytical Processing (ROLAP): ROLAP is used for large data volumes, and the data is stored in relational tables. In ROLAP, a static multidimensional view of the data is created.

13
Q

What is MOLAP

A

Multidimensional Online Analytical Processing (MOLAP): MOLAP is used for limited data volumes, and the data is stored in a multidimensional array. In MOLAP, a dynamic multidimensional view of the data is created.

14
Q

difference between ROLAP and MOLAP:

A

The main difference between ROLAP and MOLAP is where the data comes from. In ROLAP, data is fetched from the data warehouse (relational tables). In MOLAP, data is fetched from a multidimensional database (MDDB). What the two have in common is that both are forms of OLAP.

15
Q

What is schema in data warehousing

A

A schema is a logical description of the entire database. It includes the name and description of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also needs to maintain a schema. A database uses the relational model, while a data warehouse uses the star, snowflake, or fact constellation schema.

16
Q

What is star schema

A

Each dimension in a star schema is represented by only one dimension table. This dimension table contains the set of attributes.

Consider the sales data of a company with respect to four dimensions, namely time, item, branch, and location. There is a fact table at the center. It contains the keys to each of the four dimensions. The fact table also contains the measures, namely dollars sold and units sold.

Note: Each dimension has only one dimension table, and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, the cities "Vancouver" and "Victoria" are both in the Canadian province of British Columbia, so the entries for these cities repeat the same values for the attributes province_or_state and country.
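
A cut-down version of this star schema can be sketched with Python's built-in sqlite3 module. Only the location and item dimensions are included to keep the example short (time and branch are omitted), and the table names, item columns, and sample rows are assumptions made for illustration.

```python
import sqlite3

star = sqlite3.connect(":memory:")
star.executescript(
    """
    CREATE TABLE location_dim (
        location_key INTEGER PRIMARY KEY,
        street TEXT, city TEXT, province_or_state TEXT, country TEXT
    );
    CREATE TABLE item_dim (
        item_key INTEGER PRIMARY KEY,
        item_name TEXT, type TEXT, brand TEXT
    );
    -- Fact table at the centre: one key per dimension plus the measures.
    CREATE TABLE sales_fact (
        item_key INTEGER REFERENCES item_dim(item_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        dollars_sold REAL,
        units_sold INTEGER
    );
    INSERT INTO location_dim VALUES
        (1, '1 Main St', 'Vancouver', 'British Columbia', 'Canada'),
        (2, '2 Bay St', 'Victoria', 'British Columbia', 'Canada');
    INSERT INTO item_dim VALUES (1, 'Laptop', 'computer', 'Acme');
    INSERT INTO sales_fact VALUES (1, 1, 1000.0, 2), (1, 2, 500.0, 1);
    """
)

# A typical star query: join the fact table to a dimension and aggregate.
sales_by_province = star.execute(
    """
    SELECT l.province_or_state, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON l.location_key = f.location_key
    GROUP BY l.province_or_state
    """
).fetchall()

# The redundancy mentioned in the note: the province/country pair is stored twice.
repeated_rows = star.execute(
    "SELECT COUNT(*) FROM location_dim WHERE province_or_state = 'British Columbia'"
).fetchone()[0]
```

Every query needs only one join per dimension, at the cost of repeating values such as 'British Columbia' and 'Canada' in the dimension table.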

17
Q

What is snowflake schema

A

Unlike in the star schema, some dimension tables in the snowflake schema are normalized. The normalization splits the data up into additional tables.

For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item table and the supplier table. The item dimension table now contains the attributes item_key, item_name, type, brand, and supplier_key. The supplier key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.

Note: Due to normalization in the snowflake schema, redundancy is reduced, so the schema becomes easier to maintain and saves storage space.
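
The item and supplier split can be sketched with Python's built-in sqlite3 module. The column names follow the attributes listed above; the table names and sample rows are invented for illustration.

```python
import sqlite3

snowflake = sqlite3.connect(":memory:")
snowflake.executescript(
    """
    -- Supplier details live in their own table ...
    CREATE TABLE supplier_dim (
        supplier_key INTEGER PRIMARY KEY,
        supplier_type TEXT
    );
    -- ... and the item dimension only keeps a key pointing at them.
    CREATE TABLE item_dim (
        item_key INTEGER PRIMARY KEY,
        item_name TEXT, type TEXT, brand TEXT,
        supplier_key INTEGER REFERENCES supplier_dim(supplier_key)
    );
    INSERT INTO supplier_dim VALUES (1, 'wholesale');
    INSERT INTO item_dim VALUES
        (1, 'Laptop', 'computer', 'Acme', 1),
        (2, 'Phone', 'phone', 'Acme', 1);
    """
)

# Reading the supplier type now takes an extra join.
items_with_supplier = snowflake.execute(
    """
    SELECT i.item_name, s.supplier_type
    FROM item_dim AS i
    JOIN supplier_dim AS s ON s.supplier_key = i.supplier_key
    ORDER BY i.item_key
    """
).fetchall()

# The supplier type is stored once, however many items share the supplier.
supplier_rows = snowflake.execute("SELECT COUNT(*) FROM supplier_dim").fetchone()[0]
```

The trade-off is visible here: 'wholesale' is stored a single time, but queries that need it must join through one more table than in the star schema.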

18
Q

What is Fact Constellation Schema

A

A fact constellation has multiple fact tables. It is also known as a galaxy schema.

Consider two fact tables, namely sales and shipping. The sales fact table is the same as the one in the star schema. The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location. The shipping fact table also contains two measures, namely dollars cost and units shipped.

Dimension tables can also be shared between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.
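
Sharing a dimension between two fact tables can be sketched with Python's built-in sqlite3 module. This is a heavily reduced illustration: only the item dimension is kept, and the column names units_sold and units_shipped and the sample rows are assumptions made for the example.

```python
import sqlite3

galaxy = sqlite3.connect(":memory:")
galaxy.executescript(
    """
    -- One dimension table ...
    CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
    -- ... referenced by two separate fact tables.
    CREATE TABLE sales_fact (
        item_key INTEGER REFERENCES item_dim(item_key),
        units_sold INTEGER
    );
    CREATE TABLE shipping_fact (
        item_key INTEGER REFERENCES item_dim(item_key),
        units_shipped INTEGER
    );
    INSERT INTO item_dim VALUES (1, 'Laptop');
    INSERT INTO sales_fact VALUES (1, 5), (1, 3);
    INSERT INTO shipping_fact VALUES (1, 6);
    """
)

# Because both facts share item_dim, they can be compared item by item.
sold_vs_shipped = galaxy.execute(
    """
    SELECT i.item_name,
           (SELECT SUM(s.units_sold) FROM sales_fact AS s
             WHERE s.item_key = i.item_key),
           (SELECT SUM(sh.units_shipped) FROM shipping_fact AS sh
             WHERE sh.item_key = i.item_key)
    FROM item_dim AS i
    """
).fetchone()
```

Each fact table is aggregated separately and then lined up through the shared dimension key, which avoids double counting when an item has several rows in both tables.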

19
Q

Difference between database and data warehouse

Which is designed for analysis?

A

data warehouse

20
Q

Difference between database and data warehouse

Which uses OLAP

A

data warehouse

21
Q

Difference between database and data warehouse

In which is the data up to date?

A

database