Part 2: Chapter 1, 2, 3 Flashcards
Data
Collection of facts, usually obtained as the result of experiences, web page visits, observations, or experiments. Data may consist of numbers, words, images, …
- Data is the lowest level of abstraction (from which information and knowledge are derived).
CRISP-DM
Cross-Industry Standard Process for Data Mining: a standard process model for carrying out data mining projects. 6 steps:
- Business understanding: what is it all about and what do we want to achieve.
- Data understanding
- Data preparation: collect the data from all sources and clean it.
- Model building: select the model that best produces the results you want to know.
- Testing and evaluation
- Deployment: how are we going to use the results in real-time.
First 3 steps = +/- 85% of the project time.
Last 3 steps are supported by 'SEMMA':
- Sample
- Explore
- Modify
- Model
- Assess
Datamining
The process of discovering new, valuable knowledge in databases. Datamining = machine learning (AI) + statistics + databases.
2 types of datamining:
1. Hypothesis-driven data mining. This is the classical statistical hypothesis testing. Beforehand you have a hypothesis that you want to test.
2. Discovery-driven data mining. This is modern explorative research. You are exploring the data and the hypothesis develops while looking at the data.
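The two types can be contrasted in a small Python sketch. The sample data, the function names, and the choice of a permutation test and a correlation ranking are all assumptions made for illustration; they are not part of the flashcards and are only one of many ways to do either kind of analysis.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_rounds=2000, seed=0):
    """Hypothesis-driven: the hypothesis ('the group means differ')
    is fixed BEFORE looking at the data, then tested."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)  # relabel the observations at random
        a = pooled[:len(group_a)]
        b = pooled[len(group_a):]
        if abs(mean(a) - mean(b)) >= observed:
            hits += 1
    return hits / n_rounds  # share of shuffles at least as extreme


def pearson(xs, ys):
    """Plain Pearson correlation (assumes neither column is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def rank_columns_by_correlation(table, target):
    """Discovery-driven: no hypothesis up front; scan every column
    and see which one relates most strongly to the target."""
    scores = {
        name: pearson(column, table[target])
        for name, column in table.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical data, invented for this example.
p_value = permutation_p_value([1, 2, 3, 4, 5], [101, 102, 103, 104, 105])
table = {
    "sales": [1, 2, 3, 4, 5],
    "ads": [2, 4, 6, 8, 10],
    "noise": [3, 1, 4, 1, 5],
}
ranking = rank_columns_by_correlation(table, "sales")
print(p_value, ranking)
```

In the first function the question is fixed before the data is examined; in the second, the ranking suggests a hypothesis ("ads relate to sales") that would still need to be tested on fresh data.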
Similarities between datamining and statistics
- Scientific fields for analyzing data
- Complex processes for learning from data that require profound understanding and mastery.
Differences between datamining and statistics
- Purpose of data: in statistics, data are primarily collected to test a hypothesis formulated beforehand. Datamining deals mostly with data gathered from operational processes.
- Amount of data: datamining usually deals with vast amounts of data.
- Data analysis: the results of the datamining process should be easy to understand by, and easy to explain to, human decision makers, and should comply with the business objectives.
5 V’s of big data
- Volume: the incredible amounts of data generated from different sources.
- Variety: all the different types of data we can use.
- Velocity: the speed at which vast amounts of data are being generated, collected, and analyzed.
- Value: the worth of the data being extracted (often forgotten).
- Veracity: the quality or trustworthiness of the data.
Principles from logic
- Deduction: all birds can fly, Koko is a bird, therefore Koko can fly. -> sound: the conclusion follows from the premises.
- Abduction: all birds can fly, Koko can fly. Therefore, Koko is a bird. -> not sound: airplanes can fly as well.
- Induction: Koko is a bird, Koko can fly, Tweety is a bird, Tweety can fly. Therefore, all birds can fly. -> not sound, because we only have 2 cases. However, this is the kind of reasoning used in data mining.
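The three principles can be written out as a minimal Python sketch. The function names, the dictionary layout, and the extra case "Pingu" are hypothetical and only serve to illustrate why induction is useful but not sound.

```python
def induce_rule(cases):
    """Induction: generalize 'all birds can fly' from observed cases.
    Holds only as long as no counterexample has been seen."""
    birds = [c for c in cases if c["is_bird"]]
    return bool(birds) and all(c["can_fly"] for c in birds)


def deduce_can_fly(all_birds_fly, is_bird):
    """Deduction: apply the general rule to one known bird.
    Returns True when the conclusion follows, None when the rule is silent."""
    return True if (all_birds_fly and is_bird) else None


def abduce_is_bird(all_birds_fly, can_fly):
    """Abduction: guess a cause from the rule and an observation.
    The result is a guess, not a proof (airplanes fly too)."""
    return "possibly a bird" if (all_birds_fly and can_fly) else "unknown"


observed = [
    {"name": "Koko", "is_bird": True, "can_fly": True},
    {"name": "Tweety", "is_bird": True, "can_fly": True},
]
rule = induce_rule(observed)  # learned from only 2 cases

# One new observation (hypothetical) is enough to break the induced rule.
with_counterexample = observed + [
    {"name": "Pingu", "is_bird": True, "can_fly": False}
]
rule_after = induce_rule(with_counterexample)

print(rule, rule_after)
```

A learned rule is always provisional in this sense: it summarizes the cases seen so far and can be overturned by new data.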
Selection bias
Occurs when the dataset (sample) is not a good representation of the real world. Make sure that your sample is representative.
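A small simulation can make the effect visible. This is a sketch under invented assumptions: the population of customer ages, the age range, the cut-off of 30, and all function names are hypothetical.

```python
import random
from statistics import mean


def make_population(n, rng):
    """Hypothetical 'real world': ages spread evenly between 18 and 80."""
    return [rng.randint(18, 80) for _ in range(n)]


def representative_sample(population, k, rng):
    """Every member of the population has the same chance of selection."""
    return rng.sample(population, k)


def biased_sample(population, k, rng, max_age=30):
    """Selection bias: only people under max_age can be reached
    (for example, a survey run on a single social media channel)."""
    reachable = [age for age in population if age < max_age]
    return rng.sample(reachable, k)


rng = random.Random(42)  # fixed seed so the run is repeatable
population = make_population(10_000, rng)
fair = representative_sample(population, 500, rng)
skewed = biased_sample(population, 500, rng)

print("population mean age:", mean(population))
print("representative sample:", mean(fair))
print("biased sample:", mean(skewed))
```

The representative sample lands close to the population mean, while the biased sample is far below it, so any model built on the biased sample would describe only the reachable subgroup.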
Social and legal aspects
- Social impacts:
+ privacy: access to detailed personal information.
+ ethics: handling misinformation and discrimination.
- Legal aspects:
+ National and EU
+ General Data Protection Regulation (GDPR)