Theory - Data Pipeline Flashcards

1
Q

What is Data?

A

Data refers to a collection of raw facts, observations, or statistics.

2
Q

What is Capta?

A

Capta is actively “taken”, whereas data is assumed to be a “given” that can be
recorded and observed.

3
Q

What is a multivariate data type?

A

Multiple variables within a single record representing a composite data item

4
Q

Can you list various data types?

A
  • 1D (e.g., sets and sequences)
  • 2D (e.g., geo-spatial maps)
  • 3D (e.g., shapes)
  • nD (e.g., high-dimensional)
  • Networks (e.g., relational and graphs)
  • Temporal (e.g., timelines)
  • Trees (e.g., hierarchies)
  • Text (e.g., document corpus)
  • Multimedia (e.g., images and videos)
4
Q

What is a scalar data type?

A

An individual number in a data record

5
Q

What is a vector data type?

A

A representation form for multivariate data

6
Q

What are Data Models?

A

Conceptual models that consist of a formal description of the data w.r.t. the task semantics to support reasoning and problem-solving.

7
Q

Can you list 4 data classes?

A
  • Nominal (labels or categories)
  • Ordinal (ordered)
  • Numeric - Interval (location of zero arbitrary)
  • Numeric - Ratio (zero fixed)
8
Q

Can you list 2 different types of data representation?

A
  • Symbolic (Explicit, used by deductive methods)
  • Sub-symbolic (Implicit, used by ANN)

Criteria: Logical relationships between dimensions

9
Q

How can the Mental Model of the Users about the Data be expressed?

A

Intelligible vs. Non-Intelligible Data

Criteria: Correlation with the mental model of a person

10
Q

How can we collect data?

A
  • Observations
    • Field Studies
    • User Behavior Analysis
  • Surveys
    • Questionnaires
    • Crowdsourcing
  • Sensors
    • Automatic Collection
10
Q

Can you list 3 data formats?

A
  • Structured
  • Unstructured
  • Semi-Structured Data

Criteria: Arrangement and organisation of the data

11
Q

Can you list 3 Common Data Exchange Formats?

A
  • XML: eXtensible Markup Language
  • JSON: JavaScript Object Notation
  • CSV: Comma-Separated Values
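
To make the three formats concrete, the sketch below writes one record in each of them using only Python's standard library. The record and its field names are invented for illustration and are not part of the original card.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical record, invented for illustration only.
# All values are strings so that each format can round-trip them unchanged.
record = {"id": "1", "city": "Vienna", "temp_c": "21.5"}


def to_json(rec):
    """Serialize the record as a JSON object."""
    return json.dumps(rec)


def to_csv(rec):
    """Serialize the record as a header line plus one data line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rec), lineterminator="\n")
    writer.writeheader()
    writer.writerow(rec)
    return buf.getvalue()


def to_xml(rec):
    """Serialize the record as one XML element with a child per field."""
    root = ET.Element("record")
    for key, value in rec.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_json(record))
    print(to_csv(record), end="")
    print(to_xml(record))
```

JSON and XML carry the field names with every record, while CSV states them once in the header line, which is one reason CSV suits flat tables and the other two suit nested data.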
12
Q

List as many possible data issues as you can think of

A
  • Typos
  • Missing Data or Fields
  • Different Units
  • Non-Uniform Data Types
  • Abbreviations
  • Variations of the Same Thing
  • Duplicates
  • Encoding Issues
  • Dashes & Parentheses
  • Delimiters
  • White spaces
  • Noise
  • Outliers
  • Measurement Errors
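
A few of the issues above (white spaces, variations of the same thing, different units, duplicates) can be handled with small helper functions. The following is a minimal sketch, not a complete cleaning routine; the function names and the choice of metres as the target unit are assumptions made for this example.

```python
import re


def clean_name(raw):
    """Collapse runs of whitespace, trim the ends, and lower-case the text."""
    return re.sub(r"\s+", " ", raw).strip().lower()


def to_metres(value, unit):
    """Convert a length to metres; only 'm' and 'cm' are handled in this sketch."""
    factors = {"m": 1.0, "cm": 0.01}
    return value * factors[unit]


def deduplicate(names):
    """Drop entries that are equal after cleaning, keeping first-seen order."""
    seen = set()
    result = []
    for name in names:
        key = clean_name(name)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


if __name__ == "__main__":
    print(clean_name("  New   York "))
    print(to_metres(250, "cm"))
    print(deduplicate(["Vienna", " vienna ", "Graz"]))
```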
12
Q

How can we generate data from ML models?

A
  • Inputs and Outputs
    • Distributions
    • Relations
    • Perturbations (what-if analysis)
  • Parameters
    • Logging of Weights, Biases, Activations, etc.
    • Model Architectures and Parameters
  • Scores and Performance Measures
  • Temporal Evolution
    • Training-Testing Comparisons
    • Data Distribution Shifts
    • Model Development Evolution Cycles
13
Q

Can you list the 7 data cleaning pipeline steps?

A

Missing Value Treatment
Noise Treatment
Outlier Detection
Normalization
Data Reduction
Data Smoothing
Data Augmentation

14
Q

Can you list 4 solutions to missing values with their advantages and disadvantages?

A
  • Ignore the tuple
    + Can be easily done
    + No computational effort
    - Loss of information
    - Unnecessary if the attribute is not needed
  • Enter value manually
    + Effective for small datasets
    - Need to know the value
    - Time consuming
    - Not feasible with large datasets
  • Use attribute mean
    + Simple to implement
    - Not the most accurate approximation of the value
  • Use most probable value
    + Most accurate approximation of the value
    - Most computational effort
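
Two of these strategies, ignoring the tuple and using the attribute mean, are simple enough to sketch directly. The records below are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical records; None marks a missing "age" value.
rows = [
    {"name": "a", "age": 20},
    {"name": "b", "age": None},
    {"name": "c", "age": 40},
]


def ignore_tuples(records, attr):
    """Drop every record whose value for attr is missing."""
    return [r for r in records if r[attr] is not None]


def fill_with_mean(records, attr):
    """Return copies of the records with missing attr values set to the mean."""
    known = [r[attr] for r in records if r[attr] is not None]
    fill = mean(known)
    return [dict(r, **{attr: fill if r[attr] is None else r[attr]}) for r in records]


if __name__ == "__main__":
    print(ignore_tuples(rows, "age"))
    print(fill_with_mean(rows, "age"))
```

Both functions return new lists and leave the input untouched, so the two strategies can be compared on the same data.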
15
Q

Can you mention 4 sources of noise?

A
  • Random noise (white noise)
  • Noise introduced by measurement tools
  • Human error
  • Fraud
16
Q

Can you mention 4 methods of removing noise?

A
  • Equal-width binning
  • Equal-depth binning
  • Outlier Detection via Clustering
  • Data Smoothing via Regression
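
The two binning variants differ only in how the bin boundaries are chosen. The sketch below assumes that noise is then reduced by replacing each value with its bin mean; the card does not name the smoothing step, so that part is an assumption. The code also assumes the values are not all identical, since the bin width would be zero otherwise.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value would land in bin k, so clamp it to k - 1.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


def equal_depth_bins(values, k):
    """Split the sorted values into k bins holding (almost) equally many items."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def smooth_by_bin_means(bins):
    """Replace every value with the mean of its bin; empty bins are skipped."""
    return [[sum(b) / len(b)] * len(b) for b in bins if b]


if __name__ == "__main__":
    values = [4, 8, 15, 21, 21, 24, 25, 28, 34]
    print(equal_width_bins(values, 3))
    print(equal_depth_bins(values, 3))
    print(smooth_by_bin_means(equal_depth_bins(values, 3)))
```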
17
Q

List 2 ways of performing Data Reduction

A
  • Reducing the data points (Sampling)
  • Reducing the dimensions of every data point (Dimensionality Reduction, e.g., PCA)
17
Q

Can you provide 3 different normalization techniques?

A
  • Linear Normalization
  • Square Root Normalization
  • Logarithmic Normalization
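
The card gives only the names, so the sketch below rests on assumptions: "linear" is read as min-max scaling to [0, 1], and the square root and logarithmic variants apply their transform first and then the same scaling. Values are assumed to be non-negative and not all identical.

```python
import math


def linear_normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def sqrt_normalize(values):
    """Take square roots first, then scale linearly to [0, 1]."""
    return linear_normalize([math.sqrt(v) for v in values])


def log_normalize(values):
    """Take log(1 + v) first (so that 0 is allowed), then scale linearly."""
    return linear_normalize([math.log1p(v) for v in values])


if __name__ == "__main__":
    data = [0, 9, 99]
    print(linear_normalize(data))
    print(sqrt_normalize(data))
    print(log_normalize(data))
```

The square root and logarithmic variants compress large values, which spreads out the small ones when the data is heavily skewed.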
18
Q

List 3 reasons to perform Data Reduction

A
  • Too much data can harm the model because of redundancies in the data.
  • Interpreting a model whose input data has high dimensionality can be impossible.
  • Training a model on too much data can be hard because of the computational complexity.
19
Q

List 4 methods for Sampling

A
  • Random Sampling
    + Can be easily done
    + Least biased sampling method
    - Chance that some characteristics are not taken into account
  • Systematic Random Sampling
    + Can be easily done
    + Good sample for the ordered attribute
    - Possible bias towards the other attributes
    - Problem with periodicities
  • Stratified Random Sampling
    + Good sample of the data
    + Different sampling strategies can be used within each stratum
    - Computationally more expensive
  • Cluster Random Sampling
    + Cheap method when it is geographically convenient
    + Easy to increase sample size
    - Least representative of the population
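
The four methods can be sketched with Python's `random` module. The function names, the `key` callbacks that assign an item to a stratum or cluster, and the fixed seeds are assumptions made for this example.

```python
import random


def random_sample(data, n, seed=0):
    """Draw n items uniformly at random without replacement."""
    return random.Random(seed).sample(data, n)


def systematic_sample(data, step, seed=0):
    """Pick a random start, then take every step-th item of the ordered data."""
    start = random.Random(seed).randrange(step)
    return data[start::step]


def stratified_sample(data, key, per_stratum, seed=0):
    """Group items into strata by key, then sample randomly inside each one."""
    rng = random.Random(seed)
    strata = {}
    for item in data:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


def cluster_sample(data, key, n_clusters, seed=0):
    """Pick whole clusters at random and keep every item of the chosen ones."""
    rng = random.Random(seed)
    clusters = {}
    for item in data:
        clusters.setdefault(key(item), []).append(item)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for c in chosen for item in clusters[c]]


if __name__ == "__main__":
    data = list(range(100))
    print(random_sample(data, 5))
    print(systematic_sample(data, 20))
    print(stratified_sample(data, lambda x: x % 4, 2))
    print(cluster_sample(data, lambda x: x // 10, 2))
```

Stratified sampling draws from every group, whereas cluster sampling keeps only a few groups in full, which is why the latter is cheaper but less representative.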
20
Q

What is Data Augmentation?

A

Modifying the dimensions of a datapoint, without altering its label, to produce a new datapoint

21
Q

List 2 methods that can be used for Data Augmentation

A
  • Adding datapoints
  • Adding new dimensions
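
Both methods can be illustrated on a numeric feature vector. In this sketch a new datapoint is produced by adding small Gaussian noise while keeping the label, and a new dimension is derived as the product of the first two features. Both choices, like the function names, are assumptions made for the example.

```python
import random


def jitter(point, label, scale=0.05, seed=0):
    """Return a noisy copy of the point with the label left unchanged."""
    rng = random.Random(seed)
    new_point = [x + rng.gauss(0.0, scale) for x in point]
    return new_point, label


def add_product_dimension(point):
    """Return the point extended by one derived dimension (feature 0 * feature 1)."""
    return point + [point[0] * point[1]]


if __name__ == "__main__":
    print(jitter([1.0, 2.0], "cat"))
    print(add_product_dimension([2.0, 3.0]))
```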
22
Q

How do we integrate Data?

A
  • Join Databases
    + easy approach: fast and low effort
    - does not always work or yield desired results
  • Reconcile and Match Data (e.g., http://openrefine.org)
    + can consider tasks and semantics
    - more time consuming, requires entity resolution/matching
  • Crowd-Sourcing (e.g., Annotated Knowledge Graphs)
    + high quality data
    - costly and task-specific
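
The first option, a plain join, can be sketched on two lists of records. The tables below are hypothetical; they are chosen so that one key differs only in spelling, which shows why a join "does not always work" without prior reconciliation.

```python
def inner_join(left, right, key):
    """Combine records from both tables whose values for key are exactly equal."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Hypothetical tables; "Vienna" and "vienna" refer to the same entity.
population = [{"city": "Vienna", "pop": 2}, {"city": "Graz", "pop": 1}]
weather = [{"city": "vienna", "temp": 21}, {"city": "Graz", "temp": 19}]

if __name__ == "__main__":
    # Only the Graz rows match; the Vienna rows are silently dropped.
    print(inner_join(population, weather, "city"))
```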
23
Q

How to match Data Items?

A
  • Exact Matches
  • Based on IDs
  • Based on Similarity
    • Euclidean Norm
    • Cosine Similarity
    • Edit Distance
  • Based on Co-Occurrence
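
The three similarity measures can be written out in a few lines each. The sketch assumes numeric vectors of equal length for the first two and reads "edit distance" as the Levenshtein distance on strings; cosine similarity is undefined for all-zero vectors, which the code does not guard against.

```python
import math


def euclidean_distance(a, b):
    """Length of the difference vector between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(euclidean_distance([0, 0], [3, 4]))
    print(cosine_similarity([1, 2], [2, 4]))
    print(edit_distance("kitten", "sitting"))
```

Euclidean distance is sensitive to magnitude, cosine similarity only to direction, and edit distance tolerates typos in text keys, so the choice depends on what kind of variation should count as "the same item".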
24
Q

What are the three types of Modeling?

A
  • Supervised learning
    • Classification: decide which category a datapoint belongs to
    • Regression: estimate the relationship among variables
  • Unsupervised learning
    • Clustering: group similar data together
    • Dimension reduction: find a subgroup of dimensions that captures as much information as possible
  • Reinforcement learning: makes decisions based on a set of actions in an environment
25
Q

What is Anscombe’s Quartet?

A

Four datasets with nearly identical summary statistics that are clearly different and visually distinct when plotted.
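
The claim can be checked numerically. The sketch below uses the values of Anscombe's four datasets and computes, for each, the means, the variance of x, and the Pearson correlation, rounded to two decimals. The helper names are made up for this example.

```python
from statistics import mean, variance

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs)
    sy = sum((y - my) ** 2 for y in ys)
    return cov / (sx * sy) ** 0.5


def summary(xs, ys):
    """Mean of x, mean of y, sample variance of x, correlation (2 decimals)."""
    return (
        round(mean(xs), 2),
        round(mean(ys), 2),
        round(variance(xs), 2),
        round(pearson(xs, ys), 2),
    )


if __name__ == "__main__":
    for name, (xs, ys) in quartet.items():
        print(name, summary(xs, ys))
```

All four datasets yield the same rounded summary, so the differences between them only become apparent in a scatter plot.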

26
Q

What does Guidance refer to?

A

An AI-assisted process that aims to actively resolve a knowledge gap encountered by users while solving an analytical task

27
Q

What are the dimensions of Feedback?

A
  • Intent
  • Expression Form
  • Observation Actuality
  • Observation Granularity