Part 2: Chapter 1, 2, 3 Flashcards

Question 1

Q

Data

Answer

A

Collection of facts usually obtained as the result of experiences, web page visits, observations, or experiments. Data may consist of numbers, words, images, …
- Data is the lowest level of abstraction (from which information and knowledge are derived).

Question 2

Q

CRISP-DM

Answer

A

Cross-Industry Standard Process for Data Mining: a standard process for carrying out data mining projects. 6 steps:

Business understanding: what is it all about and what do we want to achieve.
Data understanding
Data preparation: collect data from all sources and clean them.
Model building: select the best model that derives the results you want to know.
Testing and evaluation
Deployment: how are we going to use the results in real-time.

The first 3 steps take roughly 85% of the project time.
The last 3 steps are supported by 'SEMMA':
- Sample
- Explore
- Modify
- Model
- Assess

Question 3

Q

Data mining

Answer

A

The process of discovering new, valuable knowledge in databases. Data mining = machine learning (AI) + statistics + databases.
2 types of data mining:
1. Hypothesis-driven data mining. This is the classical statistical hypothesis testing. Beforehand you have a hypothesis that you want to test.
2. Discovery-driven data mining. This is modern explorative research. You are exploring the data and the hypothesis develops while looking at the data.
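The contrast between the two types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are made-up toy data, the attribute names (`discount`, `weekend`, `newsletter`, `spend`) and the helper `mean_difference` are hypothetical, and a plain difference in means stands in for a proper statistical test.

```python
from statistics import mean

# Made-up toy records, purely for illustration (not real data).
records = [
    {"discount": 1, "weekend": 0, "newsletter": 1, "spend": 30},
    {"discount": 1, "weekend": 1, "newsletter": 0, "spend": 50},
    {"discount": 0, "weekend": 1, "newsletter": 1, "spend": 45},
    {"discount": 0, "weekend": 0, "newsletter": 0, "spend": 20},
    {"discount": 1, "weekend": 1, "newsletter": 1, "spend": 55},
    {"discount": 0, "weekend": 0, "newsletter": 1, "spend": 25},
]


def mean_difference(rows, attribute, target="spend"):
    """Mean of `target` where the attribute is 1 minus the mean where it is 0."""
    with_attr = [r[target] for r in rows if r[attribute] == 1]
    without_attr = [r[target] for r in rows if r[attribute] == 0]
    return mean(with_attr) - mean(without_attr)


# Hypothesis-driven: the question ("does a discount raise spending?")
# is fixed before looking at the data, and only that question is checked.
hypothesis_effect = mean_difference(records, "discount")

# Discovery-driven: scan every attribute and let the data suggest
# which relationship deserves a hypothesis.
candidates = ["discount", "weekend", "newsletter"]
effects = {a: mean_difference(records, a) for a in candidates}
strongest = max(effects, key=lambda a: abs(effects[a]))

print(hypothesis_effect, strongest)
```

In the hypothesis-driven branch the attribute is chosen in advance. In the discovery-driven branch the attribute of interest comes out of the exploration itself, and would still need to be confirmed on fresh data.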

Question 4

Q

Similarities between data mining and statistics

Answer

A

- Both are scientific fields for analyzing data.
- Both involve complex processes for learning from data that require profound understanding and mastery.

Question 5

Q

Differences between data mining and statistics

Answer

A

Purpose of data: in statistics, data are primarily collected to test a hypothesis formulated beforehand. Data mining deals mostly with data gathered from operational processes.
Amount of data: data mining usually deals with vast amounts of data.
Data analysis: the results of a data mining process should be easy for human decision makers to understand and easy to explain to them, and they should comply with the business objectives.

Question 6

Q

5 V's of big data

Answer

A

Volume: the incredible amounts of data generated from different sources.
Variety: all the different types of data we can use.
Velocity: the speed at which vast amounts of data are being generated, collected, and analyzed.
Value: the worth of the data being extracted (often forgotten).
Veracity: the quality or trustworthiness of the data.

Question 7

Q

Principles from logic

Answer

A

Deduction: All birds can fly. Koko is a bird. Therefore, Koko can fly. -> Sound: the conclusion follows necessarily from the premises.
Abduction: All birds can fly. Koko can fly. Therefore, Koko is a bird. -> Not sound: airplanes can fly as well.
Induction: Koko is a bird and Koko can fly. Tweety is a bird and Tweety can fly. Therefore, all birds can fly. -> Not sound, because we only have 2 cases. However, this is the kind of reasoning used in data mining.
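Why induction is not sound can be shown with a minimal Python sketch. The observations and the helper `induce_rule` are hypothetical, and the third bird ("Pingu") is an invented counterexample that does not appear in the card above.

```python
# Toy observations matching the card: two birds, both can fly.
observations = [
    {"name": "Koko", "is_bird": True, "can_fly": True},
    {"name": "Tweety", "is_bird": True, "can_fly": True},
]


def induce_rule(cases):
    """Induction: accept 'all birds can fly' if every observed bird flies."""
    birds = [c for c in cases if c["is_bird"]]
    return bool(birds) and all(c["can_fly"] for c in birds)


# With only two cases the rule appears to hold.
rule_holds = induce_rule(observations)

# A single new observation (invented for this example) overturns it.
observations.append({"name": "Pingu", "is_bird": True, "can_fly": False})
rule_after_counterexample = induce_rule(observations)

print(rule_holds, rule_after_counterexample)
```

The induced rule is only as good as the cases seen so far, which is why patterns found by data mining should be validated on data that was not used to find them.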

Question 8

Q

Selection bias

Answer

A

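Selection bias occurs when the dataset (sample) is not a good representation of the real world. Make sure that your sample is representative of the population you want to draw conclusions about.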
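A small Python sketch of the effect, under stated assumptions: the "population" is a made-up list of ages from 18 to 80, and the biased sample only reaches people under 40 (for example, through a channel that mainly younger people use). None of these numbers come from the course material.

```python
import random
from statistics import mean

# Hypothetical population: one person of every age from 18 to 80.
population = list(range(18, 81))
population_mean = mean(population)

# Biased sample: only people under 40 were reached.
biased_sample = [age for age in population if age < 40]
biased_mean = mean(biased_sample)

# Random sample: every person has the same chance of being selected.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
random_sample = rng.sample(population, 30)
random_mean = mean(random_sample)

print(population_mean, biased_mean, random_mean)
```

The biased sample underestimates the average age by a wide margin, while the random sample lands closer to the population mean. Any model built on the biased sample would inherit that distortion.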

Question 9

Q

Social and legal aspects

Answer

A

Social impacts:
+ privacy: access to detailed personal information.
+ ethics: handling misinformation and discrimination.
Legal aspects:
+ National and EU legislation
+ General Data Protection Regulation (GDPR)

(9 cards)