Preprocessing Flashcards

1
Q

Which dimensions are used to measure data quality?

A
  • Completeness: is the data fully available? What to do if not?
  • Consistency: differences in data units or name conventions?
  • Timeliness: measurements from different time periods?
    Outdated measuring devices?
  • Believability: is the data source reliable?
  • Interpretability: how easily can the data be understood?
2
Q

For data cleaning, which types of errors should you be aware of?

A
  • Incomplete: lacking attribute values, lacking certain attributes of
    interest, or only aggregate data available
  • Noisy: containing noise, errors, or outliers
  • Inconsistent: containing discrepancies in codes or names
  • Intentionally imprecise
    – Jan. 1 as everyone’s birthday
3
Q

How can missing data be handled?

A
  • Ignore: little effect when the dataset is large
  • Correct the entries manually
  • Correct the entries automatically (global constant, mean, or most probable
    value using inference such as a Bayesian formula or a decision tree based on
    other attributes)
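The automatic strategies above can be sketched as follows. This is a minimal illustration, assuming missing entries are encoded as None; the function name fill_missing and the sample values are hypothetical, and inference-based filling (Bayesian formula, decision tree) is not covered.

```python
from statistics import mean, mode


def fill_missing(values, strategy="mean", constant=0):
    """Replace None entries with a global constant, the mean, or the mode."""
    observed = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant
    elif strategy == "mean":
        fill = mean(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Illustrative values only.
ages = [20, None, 30, 40, None]
filled_mean = fill_missing(ages, "mean")
filled_const = fill_missing(ages, "constant", constant=-1)
```

The mean is computed from the observed entries only, so the missing entries do not distort it.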
4
Q

What is data integration?

A

Data integration combines data from multiple sources into a coherent store.

5
Q

Which methods can be used to detect redundant attributes?

A

  • chi-square test (nominal attributes)
  • correlation analysis (numeric attributes)

6
Q

What are the advantages of data integration?

A
  • reduce/avoid redundancies and inconsistencies
  • improve mining speed and quality
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Describe the chi-square test mathematically.

A
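A sketch of the standard formulation, assuming a contingency table of two nominal attributes A (r distinct values) and B (c distinct values) over n records. The symbols o_ij (observed count) and e_ij (expected count under independence) are notation chosen here for illustration:

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}},
\qquad
e_{ij} = \frac{\mathrm{count}(A = a_i) \cdot \mathrm{count}(B = b_j)}{n}
```

The statistic is compared against the chi-square distribution with (r - 1)(c - 1) degrees of freedom.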
8
Q

What does a high chi-square value mean?

A

The data distributions are statistically different (the observed counts deviate strongly from the expected ones).

9
Q

What does a low chi-square value mean?

A

distributions are similar

10
Q

How does ChiMerge work?

A

Start with a set of intervals and recursively check, using the chi-square test, whether the label distributions in two adjacent intervals are similar; if they are, merge the two intervals.
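The merging loop described above can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes each interval is given as a (lower bound, per-class counts) pair, merges the most similar adjacent pair first, and stops once every adjacent pair reaches a chosen threshold. The names chi2_pair and chimerge and the sample intervals are hypothetical.

```python
def chi2_pair(a, b):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    total = sum(a) + sum(b)
    stat = 0.0
    for row in (a, b):
        for j in range(len(a)):
            col_total = a[j] + b[j]
            expected = sum(row) * col_total / total
            if expected > 0:  # skip classes absent from both intervals
                stat += (row[j] - expected) ** 2 / expected
    return stat


def chimerge(intervals, threshold):
    """Merge adjacent intervals whose label distributions are similar.

    intervals: list of (lower_bound, class_counts), sorted by lower_bound.
    """
    intervals = [(lo, list(counts)) for lo, counts in intervals]
    while len(intervals) > 1:
        stats = [chi2_pair(intervals[i][1], intervals[i + 1][1])
                 for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= threshold:
            break  # all adjacent pairs differ enough; stop merging
        lo, c1 = intervals[i]
        _, c2 = intervals[i + 1]
        intervals[i:i + 2] = [(lo, [x + y for x, y in zip(c1, c2)])]
    return intervals


# Illustrative input: two classes, four initial intervals.
start = [(1, [2, 0]), (3, [3, 0]), (7, [0, 4]), (9, [0, 2])]
# 2.706 is the chi-square critical value for 1 degree of freedom at the 0.10 level.
merged = chimerge(start, threshold=2.706)
```

In this example the first two and the last two intervals have identical label distributions, so they are merged, while the boundary at 7 is kept.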

11
Q

Describe Pearson’s product moment coefficient mathematically.

A
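A sketch of the usual definition for two numeric attributes A and B with n value pairs (a_i, b_i). The notation is an assumption chosen here: \bar{A} and \bar{B} are the means, \sigma_A and \sigma_B the (population) standard deviations:

```latex
r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \, \sigma_A \sigma_B}
        = \frac{\sum_{i=1}^{n} a_i b_i - n \bar{A} \bar{B}}{n \, \sigma_A \sigma_B},
\qquad -1 \le r_{A,B} \le 1
```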
12
Q

Pearson’s product moment coefficient

What does it mean if r > 0?

A

A and B are positively correlated

13
Q

Pearson’s product moment coefficient

What does it mean if r = 0?

A

uncorrelated, not necessarily independent
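A small sketch of why r = 0 does not imply independence: in the made-up data below, y is fully determined by x (y = x squared), yet the linear correlation vanishes because the relationship is symmetric around the mean of x. The helper pearson_r is a hypothetical name and uses population standard deviations.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product moment coefficient (population form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


xs = [-2, -1, 0, 1, 2]
squares = [x * x for x in xs]   # fully dependent on xs, but not linearly
doubled = [2 * x for x in xs]   # perfectly linear in xs

r_squares = pearson_r(xs, squares)
r_doubled = pearson_r(xs, doubled)
```

Here r_squares is zero although squares depends entirely on xs, while r_doubled is 1 (up to floating-point rounding).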

14
Q

Pearson’s product moment coefficient

What does it mean if r < 0?

A

negatively correlated

15
Q

How is the covariance computed?

A
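A sketch of the standard definition, assuming n value pairs (a_i, b_i) and writing \bar{A} = E[A] and \bar{B} = E[B] for the expected values (notation chosen here for illustration):

```latex
\mathrm{Cov}(A, B) = E\big[(A - \bar{A})(B - \bar{B})\big]
                   = \frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})
```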
16
Q

What does a covariance greater than zero mean?

A

A and B tend to be both larger or both smaller than their expected values.

17
Q

What does a covariance less than zero mean?

A

If A is larger than its expected value, B is likely to be smaller than its expected value.

18
Q

How can the covariance be simplified?

A
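A sketch of the usual simplification, obtained by expanding the product inside the expectation and using linearity of E; \bar{A} = E[A] and \bar{B} = E[B] are notation assumed here:

```latex
\mathrm{Cov}(A, B)
  = E\big[(A - \bar{A})(B - \bar{B})\big]
  = E[A B] - \bar{A}\,E[B] - \bar{B}\,E[A] + \bar{A}\bar{B}
  = E[A B] - \bar{A}\bar{B}
```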
19
Q
A
20
Q

How is element ij of a covariance matrix computed?

A

Element ij is the covariance between feature i and feature j.
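A minimal sketch of this computation, assuming the data is a list of samples with d feature values each and using the population normalization 1/n (a sample covariance would divide by n - 1 instead). The function name and data are hypothetical.

```python
def covariance_matrix(rows):
    """Return the d x d matrix whose entry [i][j] is Cov(feature i, feature j)."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(row[k] for row in rows) / n for k in range(d)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / n
            for j in range(d)
        ]
        for i in range(d)
    ]


# Illustrative data: the second feature is twice the first.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
sigma = covariance_matrix(data)
```

The diagonal entries are the variances of the individual features, and the matrix is symmetric because Cov(i, j) = Cov(j, i).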

21
Q

Which binning strategies exist?

A
  • equal-width
  • equal-frequency, also called equal-depth (same number of samples per bin)
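The two strategies can be sketched as follows. This is a minimal illustration on made-up values, assuming the values are not all identical (otherwise the bin width would be zero); the function names are hypothetical.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        idx = min(int((v - lo) / width), k - 1)  # the maximum goes into the last bin
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Split the sorted values into k bins with (nearly) the same number of samples."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


# Illustrative values only.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
by_width = equal_width_bins(prices, 3)
by_frequency = equal_frequency_bins(prices, 3)
```

With equal-width binning the bin sizes differ (2, 3, and 4 values here), whereas equal-frequency binning puts 3 values into each bin.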
22
Q

Which smoothing strategies can be applied after binning?

A
23
Q

Which two approaches to dimensionality reduction exist?

A
  • Feature selection: A process that chooses an optimal
    subset of features according to an objective function
  • Feature extraction: refers to the mapping of the original
    high-dimensional data onto a lower-dimensional space
24
Q

What does descriptive dimensionality reduction minimize?

A

the information loss

25
Q

What does predictive dimensionality reduction maximize?

A

the class discrimination

26
Q

How is the separation quality determined?

A
27
Q

Was ist entity identification, or entity resolution?

A

Entity identification, or entity resolution, is a common challenge in data integration. It involves identifying and linking records that refer to the same entities (such as individuals, companies, or products) across different data sources. This process can be complex due to variations in how entities are represented (different names, addresses, etc.) in various sources.

28
Q

Which of the following statements on boxplots are correct?

  • The whiskers extend to Q1 and Q3.
  • It is possible to create a boxplot for ordinal data.
  • The box of a boxplot contains half of the data.
  • Boxplots show the min, max, Q1, Q3, and mean of a distribution.
A
  • The box of a boxplot contains half of the data.
  • It is possible to create a boxplot for ordinal data.
29
Q

Which of the following statements are correct?
1. Boxplots contain more information than histograms.
2. The area of the bars in a histogram matters.
3. Buckets of histograms can have varying size.
4. In histograms, the buckets can be reordered.

A
  2. The area of the bars in a histogram matters.
  3. Buckets of histograms can have varying size.
30
Q

Which of the following statements are correct?
1. Equal width binning results in bins with the same number of elements.
2. Binning is a simple way of data discretization.
3. Equal width binning is not influenced by outliers.
4. Bins can be used for data smoothing.

A

2 and 4

31
Q

How can bins be used for data smoothing?

A

Data smoothing is a technique used to reduce the noise within a dataset to highlight the underlying trend, and binning is one method to achieve this. By grouping data points into bins, and then replacing each data point by a central value of the bin (like the mean or median), the minor fluctuations or the noise in the data can be smoothed out. This process simplifies the data, making it easier to observe long-term trends and patterns.
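Smoothing by bin means, as described above, can be sketched like this. The example assumes equal-frequency bins over the sorted values and uses made-up numbers; smoothing by bin medians would work the same way with the median in place of the mean. The function name is hypothetical.

```python
def smooth_by_bin_means(values, k):
    """Sort the values, split them into k equal-frequency bins,
    and replace every value by the mean of its bin."""
    ordered = sorted(values)
    n = len(ordered)
    smoothed = []
    for i in range(k):
        bin_values = ordered[i * n // k:(i + 1) * n // k]
        if bin_values:
            bin_mean = sum(bin_values) / len(bin_values)
            smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# Illustrative values only.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, 3)
```

Each group of three neighbouring values collapses to a single representative value, which removes the small fluctuations inside a bin.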

32
Q

Which of the following statements about outliers is correct?
1. Graphical outlier detection is suitable for high-dimensional data.
2. Outlier detection can be used to prevent credit card fraud.
3. It is always useful to automatically remove all outliers to reduce noise.

A
  2. Outlier detection can be used to prevent credit card fraud.
33
Q

Which of the statements are not correct?
1. It is never possible to automatically detect redundant attributes.
2. Data integration may lead to redundancy in the dataset.
3. Data integration combines data from multiple sources.
4. Entity identification is a problem in data integration.

A
  1. It is never possible to automatically detect redundant attributes.
34
Q

How are the expected values in the contingency table computed for the chi-square test?

A
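A sketch of the standard approach, assuming independence of the two attributes: the expected count of a cell is its row total times its column total, divided by the overall number of records. The function names and the observed counts below are illustrative assumptions, and every expected count is assumed to be greater than zero.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Illustrative 2 x 2 contingency table (made-up counts).
observed = [[250, 200],
            [50, 1000]]
expected = expected_counts(observed)
statistic = chi_square(observed)
```

For this table the expected counts are 90, 360, 210, and 840, and the resulting statistic is large (about 507.9), which indicates that the two attributes are strongly related.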