Theory - Data Pipeline Flashcards by Dimitrios Dekas

What is Data?

Data refers to a collection of raw facts, observations, or statistics.

What is Capta?

Capta is “taken” actively, whereas data is assumed to be a “given” that can
be recorded and observed.

What is a multivariate data type?

Multiple variables within a single record representing a composite data item

Can you list various data types?

1D (e.g., sets and sequences)
2D (e.g., geo-spatial maps)
3D (e.g., shapes)
nD (e.g., high-dimensional)
Networks (e.g., relational and graphs)
Temporal (e.g., timelines)
Trees (e.g., hierarchies)
Text (e.g., document corpus)
Multimedia (e.g., images and videos)

What is a scalar data type?

An individual number in a data record

What is a vector data type?

A representation form of multivariate data: several values in one record treated as a single unit

What are Data Models?

Conceptual models that consist of a formal description of the data w.r.t. the task semantics to support reasoning and problem-solving.

Can you list 4 data classes?

Nominal (labels or categories)
Ordinal (ordered)
Numeric - Interval (location of zero arbitrary)
Numeric - Ratio (zero fixed)

Can you list 2 different types of data representation?

Symbolic (Explicit, used by deductive methods)
Sub-symbolic (Implicit, used by artificial neural networks, ANNs)

Criteria: Logical relationships between dimensions

How can the Mental Model of the Users about the Data be expressed?

Intelligible vs. Non-Intelligible Data

Criteria: Correlation with the mental model of a person

How can we collect data?

Observations
- Field Studies
- User Behavior Analysis
Surveys
- Questionnaires
- Crowdsourcing
Sensors
- Automatic Collection

Can you list 3 data formats?

Structured
Unstructured
Semi-Structured Data

Criteria: Arrangement and organisation of the data

Can you list 3 Common Data Exchange Formats?

XML: eXtensible Markup Language
JSON: JavaScript Object Notation
CSV: Comma-Separated Values
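
As a rough illustration, the same record can be written in all three formats with Python's standard library. The record and its field names are invented for this sketch and are not part of the deck.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical record, invented for illustration only.
record = {"id": "1", "city": "Utrecht", "temp_c": "21.5"}

# JSON: key-value pairs, can be nested.
as_json = json.dumps(record)

# CSV: a header row followed by one row per record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record), lineterminator="\n")
writer.writeheader()
writer.writerow(record)
as_csv = buffer.getvalue()

# XML: one child element per field under a root element.
root = ET.Element("record")
for key, value in record.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")
```

The CSV output carries no type or nesting information, whereas JSON and XML can express nested structure.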

List as many possible data issues as you can think of

Typos
Missing Data or Fields
Different Units
Non-Uniform Data Types
Abbreviations
Variations of the Same Thing
Duplicates
Encoding Issues
Dashes & Parentheses
Delimiters
White spaces
Noise
Outliers
Measurement Errors

How can we generate data from ML models?

Inputs and Outputs
- Distributions
- Relations
- Perturbations (what-if analysis)
Parameters
- Logging of Weights, Biases, Activations, etc.
- Model Architectures and Parameters
Scores and Performance Measures
Temporal Evolution
- Training-Testing Comparisons
- Data Distribution Shifts
- Model Development Evolution Cycles

Can you list the 7 data cleaning pipeline steps?

Missing Value Treatment
Noise Treatment
Outlier Detection
Normalization
Data Reduction
Data Smoothing
Data Augmentation

Can you list 4 solutions to missing values with their advantages and disadvantages?

Ignore the tuple
+ Can be easily done
+ No computational effort
- Loss of information
- Unnecessary if the attribute is not needed
Enter value manually
+ For small datasets effective
- Need to know the value
- Time consuming
- Not feasible with large datasets
Use attribute mean
+ Simple to implement
- Not the most accurate approximation of the value
Use most probable value
+ Most accurate approximation of the value
- Most computational effort
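
A minimal sketch of two of these options, "ignore the tuple" and "use attribute mean", assuming records are plain dictionaries and a missing value is stored as None. The sample rows and the function names are hypothetical.

```python
from statistics import mean

# Hypothetical records; None marks a missing "age" value.
rows = [
    {"name": "a", "age": 30},
    {"name": "b", "age": None},
    {"name": "c", "age": 50},
]

def ignore_tuples(records, attribute):
    """Drop every record whose attribute is missing."""
    return [r for r in records if r[attribute] is not None]

def fill_with_mean(records, attribute):
    """Replace missing values with the mean of the observed values."""
    observed = [r[attribute] for r in records if r[attribute] is not None]
    fill = mean(observed)
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in records
    ]

dropped = ignore_tuples(rows, "age")   # loses record "b" entirely
imputed = fill_with_mean(rows, "age")  # keeps "b" with an approximated age
```

Dropping is cheapest but loses the whole record; the mean keeps the record at the price of an approximation.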

Can you mention 4 sources of noise?

Random noise (white noise)
Noise introduced by measurement tools
Human error
Fraud

Can you mention 4 methods of removing noise?

Equal-width binning
Equal-depth binning
Outlier Detection via Clustering
Data Smoothing via Regression
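
The two binning schemes can be sketched as follows. Replacing each value by its bin mean is one common way to use the bins for smoothing; that step is an assumption here, since the card only names the binning schemes. The price list is made up.

```python
from statistics import mean

def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width.

    Assumes the values are not all identical (width would be zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # Clamp so that the maximum value lands in the last bin.
        index = min(int((v - lo) / width), k - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, k):
    """Split the sorted values into k bins of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

def smooth_by_bin_means(bins):
    """Replace every value by the mean of its bin, skipping empty bins."""
    return [[mean(b)] * len(b) for b in bins if b]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
width_bins = equal_width_bins(prices, 3)
depth_bins = equal_depth_bins(prices, 3)
smoothed = smooth_by_bin_means(depth_bins)
```

Equal-width bins can end up with very different sizes when the data is skewed, while equal-depth bins hold the same number of values but cover ranges of different width.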

List 2 ways of performing Data Reduction

Reducing the data points (Sampling)
Reducing the dimensions of every data point (Dimensionality Reduction, e.g., PCA)

Can you provide 4 different normalization techniques?

Linear Normalization
Square Root Normalization
Logarithmic Normalization
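
A minimal sketch of the three techniques. It assumes that "linear" means min-max scaling to [0, 1] and that the square-root and logarithmic variants apply the transform first and then rescale; the card does not give formulas, so these definitions are assumptions.

```python
import math

def linear_normalize(values):
    """Min-max scaling to [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def sqrt_normalize(values):
    """Square-root transform, then min-max scaling; assumes v >= 0."""
    return linear_normalize([math.sqrt(v) for v in values])

def log_normalize(values):
    """Natural-log transform, then min-max scaling; assumes v > 0."""
    return linear_normalize([math.log(v) for v in values])

data = [1, 4, 9, 100]  # made-up values with one large entry
linear = linear_normalize(data)
rooted = sqrt_normalize(data)
logged = log_normalize(data)
```

With a single large value in the data, the linear version squeezes the small values close to 0, while the square-root and logarithmic versions spread them out more.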

List 3 reasons to perform Data Reduction

Too much data can harm the model because of data redundancies.
Interpreting a model whose input data has high dimensionality can be impossible.
Training a model on too much data can be hard because of computational complexity.

List 4 methods for Sampling

Random Sampling
+ Can be easily done
+ Least biased sampling method
- Chance that some characteristics are not taken into account
Systematic Random Sampling
+ Can be easily done
+ Good sample for the ordered attribute
- Possible bias towards the other attributes
- Problem with periodicities
Stratified Random Sampling
+ Good sample of the data
+ Different sampling strategies can be used within each stratum
- Computationally more expensive
Cluster Random Sampling
+ Cheap method when it is geographically convenient
+ Easy to increase sample size
- Least representative of the population
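
The four methods can be sketched with the standard library as below. The population, the "region" attribute used as stratum and cluster key, and all function names are hypothetical.

```python
import random
from collections import defaultdict

def random_sample(items, n, rng):
    """Draw n items uniformly without replacement."""
    return rng.sample(items, n)

def systematic_sample(items, step, rng):
    """Take every step-th item of an ordered list from a random start."""
    start = rng.randrange(step)
    return items[start::step]

def stratified_sample(items, key, fraction, rng):
    """Sample the same fraction from every stratum."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample

def cluster_sample(items, key, n_clusters, rng):
    """Pick whole clusters at random and keep all of their members."""
    clusters = defaultdict(list)
    for item in items:
        clusters[key(item)].append(item)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for c in chosen for item in clusters[c]]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
population = [{"id": i, "region": "NESW"[i % 4]} for i in range(40)]

simple = random_sample(population, 8, rng)
systematic = systematic_sample(population, 5, rng)
stratified = stratified_sample(population, lambda p: p["region"], 0.2, rng)
clustered = cluster_sample(population, lambda p: p["region"], 2, rng)
```

Only the stratified sample is guaranteed to contain every region; the cluster sample contains two regions completely and the others not at all.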

What is Data Augmentation?

Modifying the dimensions of a datapoint, without altering its label, to produce a new datapoint

List 2 methods that can be used for Data Augmentation

Adding datapoints
Adding new dimensions
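
Both methods can be sketched on a tiny made-up dataset. Using small random jitter to create new datapoints and a product of two features as a new dimension are assumptions chosen for illustration; the card does not prescribe a particular technique.

```python
import random

def add_datapoints(points, labels, noise, rng):
    """Append a jittered copy of every point, keeping its label."""
    jittered = [[v + rng.uniform(-noise, noise) for v in p] for p in points]
    return points + jittered, labels + labels

def add_dimension(points, derive):
    """Append a derived feature to every point."""
    return [p + [derive(p)] for p in points]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
points = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical feature vectors
labels = ["cat", "dog"]            # hypothetical labels

more_points, more_labels = add_datapoints(points, labels, 0.1, rng)
wider_points = add_dimension(points, lambda p: p[0] * p[1])
```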

How do we integrate Data?

Join Databases
+ easy approach: fast and low effort
- does not always work or yield desired results
Reconcile and Match Data (e.g., http://openrefine.org)
+ can consider tasks and semantics
- more time consuming, requires entity resolution/matching
Crowd-Sourcing (e.g., Annotated Knowledge Graphs)
+ high quality data
- costly and task-specific

How to match Data Items?

Exact Matches
Based on IDs
Based on Similarity
- Euclidean Norm
- Cosine Similarity
- Edit Distance
Based on Co-Occurrence
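
The three similarity measures can be sketched in plain Python. Edit distance is implemented here as Levenshtein distance, which is one common definition and an assumption about what the card intends.

```python
import math

def euclidean_distance(a, b):
    """Euclidean norm of the difference between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def edit_distance(s, t):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete a character from s
                current[j - 1] + 1,      # insert a character into s
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]
```

Euclidean distance and cosine similarity apply to numeric records, while edit distance is suited to matching strings such as names that contain typos.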

What are the three types of Modeling?

Supervised learning
- Classification: decide which category a datapoint belongs to
- Regression: estimate the relationship among variables
Unsupervised learning
- Clustering: group similar data together
- Dimension reduction: find a subgroup of dimensions that captures as much information as possible
Reinforcement learning: takes decisions based on a set of actions in an environment

What is Anscombe's Quartet?

Four datasets with nearly identical summary statistics that are clearly different and visually distinct when plotted.
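
A short check on the first two datasets of the quartet shows the effect. The helper name `pearson` is chosen for this sketch; the values are those of Anscombe's datasets I and II, which share the same x values.

```python
from statistics import mean

# Datasets I and II of Anscombe's quartet (shared x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5

# Mean of y and correlation with x, rounded to two decimals.
summary_1 = (round(mean(y1), 2), round(pearson(x, y1), 2))
summary_2 = (round(mean(y2), 2), round(pearson(x, y2), 2))
```

Both summaries agree after rounding, although dataset I looks roughly linear and dataset II follows a curve when plotted.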

What does Guidance refer to?

An AI-assisted process that aims to actively resolve a knowledge gap that users encounter while solving an analytical task

What are the dimensions of Feedback?

Intent
Expression Form
Observation Actuality
Observation Granularity

Theory - Data Pipeline Flashcards

(31 cards)