Theory - Data Pipeline Flashcards
What is Data?
Data refers to a collection of raw facts, observations, or statistics.
What is Capta?
Capta is actively “taken”, whereas data is assumed to be a “given” that can be
recorded and observed.
What is a multivariate data type?
Multiple variables within a single record representing a composite data item
Can you list various data types?
- 1D (e.g., sets and sequences)
- 2D (e.g., geo-spatial maps)
- 3D (e.g., shapes)
- nD (e.g., high-dimensional)
- Networks (e.g., relational and graphs)
- Temporal (e.g., timelines)
- Trees (e.g., hierarchies)
- Text (e.g., document corpus)
- Multimedia (e.g., images and videos)
What is a scalar data type?
An individual number in a data record
What is a vector data type?
Representation forms of multivariate data
What are Data Models?
Conceptual models that consist of a formal description of the data w.r.t. the task semantics to support reasoning and problem-solving.
Can you list 4 data classes?
- Nominal (labels or categories)
- Ordinal (ordered)
- Numeric - Interval (location of zero arbitrary)
- Numeric - Ratio (zero fixed)
Can you list 2 different types of data representation?
- Symbolic (Explicit, used by deductive methods)
- Sub-symbolic (Implicit, used by ANN)
Criteria: Logical relationships between dimensions
How can the Mental Model of the Users about the Data be expressed?
Intelligible vs. Non-Intelligible Data
Criteria: Correlation with the mental model of a person
How can we collect data?
- Observations
- Field Studies
- User Behavior Analysis
- Surveys
- Questionnaires
- Crowdsourcing
- Sensors
- Automatic Collection
Can you list 3 data formats?
- Structured
- Unstructured
- Semi-Structured Data
Criteria: Arrangement and organisation of the data
Can you list 3 Common Data Exchange Formats?
- XML: eXtensible Markup Language
- JSON: JavaScript Object Notation
- CSV: Comma-Separated Values
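As a minimal sketch of how these three formats differ, the same record can be serialized with Python's standard library. The record and its field names are hypothetical and only serve as an illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical record used only for illustration.
record = {"id": "1", "city": "Vienna", "temperature_c": "21.5"}

# JSON: key-value pairs as text.
as_json = json.dumps(record)

# CSV: one header row, then one row per record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record), lineterminator="\n")
writer.writeheader()
writer.writerow(record)
as_csv = buffer.getvalue()

# XML: a tree of tagged elements.
root = ET.Element("record")
for key, value in record.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_csv)
print(as_xml)
```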
List as many possible data issues as you can think of
- Typos
- Missing Data or Fields
- Different Units
- Non-Uniform Data Types
- Abbreviations
- Variations of the Same Thing
- Duplicates
- Encoding Issues
- Dashes & Parentheses
- Delimiters
- White spaces
- Noise
- Outliers
- Measurement Errors
How can we generate data from ML models?
- Inputs and Outputs
- Distributions
- Relations
- Perturbations (what-if analysis)
- Parameters
- Logging of Weights, Biases, Activations, etc.
- Model Architectures and Parameters
- Scores and Performance Measures
- Temporal Evolution
- Training-Testing Comparisons
- Data Distribution Shifts
- Model Development Evolution Cycles
Can you list the 7 data cleaning pipeline steps?
- Missing Value Treatment
- Noise Treatment
- Outlier Detection
- Normalization
- Data Reduction
- Data Smoothing
- Data Augmentation
Can you list 4 solutions to missing values with their advantages and disadvantages?
- Ignore the tuple
  + Can be easily done
  + No computational effort
  - Loss of information
  - Unnecessary if the attribute is not needed
- Enter value manually
  + Effective for small datasets
  - Need to know the value
  - Time consuming
  - Not feasible with large datasets
- Use attribute mean
  + Simple to implement
  - Not the most accurate approximation of the value
- Use most probable value
  + Most accurate approximation of the value
  - Most computational effort
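A minimal sketch of three of these strategies, assuming records are dictionaries and `None` marks a missing value. The data and function names are hypothetical. The most frequent observed value is used here as a crude stand-in for the "most probable value"; in practice that value is often inferred with a model such as regression or a decision tree.

```python
from collections import Counter
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "city": "Graz"},
    {"age": None, "city": "Wien"},
    {"age": 35, "city": None},
    {"age": 30, "city": "Wien"},
]

def ignore_tuples(rows):
    """Drop every record that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_with_mean(rows, key):
    """Replace missing numeric values of `key` with the attribute mean."""
    attribute_mean = mean(r[key] for r in rows if r[key] is not None)
    return [{**r, key: attribute_mean if r[key] is None else r[key]} for r in rows]

def fill_with_mode(rows, key):
    """Replace missing values of `key` with the most frequent observed value."""
    observed = Counter(r[key] for r in rows if r[key] is not None)
    most_common_value = observed.most_common(1)[0][0]
    return [{**r, key: most_common_value if r[key] is None else r[key]} for r in rows]

print(ignore_tuples(rows))
print(fill_with_mean(rows, "age"))
print(fill_with_mode(rows, "city"))
```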
Can you mention 4 sources of noise?
- Random noise (white noise)
- Noise introduced by measurement tools
- Human error
- Fraud
Can you mention 4 methods of removing noise?
- Equal-width binning
- Equal-depth binning
- Outlier Detection via Clustering
- Data Smoothing via Regression
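The two binning methods can be sketched as follows, with smoothing by bin means as one common way to use the bins. The function names and the sample values are assumptions made for illustration.

```python
from statistics import mean

def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        if width == 0:
            index = 0  # all values are identical
        else:
            # The maximum lies on the upper edge, so clamp it into the last bin.
            index = min(int((v - lo) / width), k - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, k):
    """Split the sorted values into k bins holding (almost) the same number of items."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

def smooth_by_bin_means(bins):
    """Replace every value by the mean of its bin; empty bins stay empty."""
    return [[mean(b)] * len(b) if b else [] for b in bins]

# Hypothetical noisy measurements.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(prices, 3))
print(equal_depth_bins(prices, 3))
print(smooth_by_bin_means(equal_depth_bins(prices, 3)))
```

Equal-width bins can end up with very different numbers of items when the data is skewed, while equal-depth bins keep the counts balanced at the cost of uneven interval widths.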
List 2 ways of performing Data Reduction
- Reducing the data points (Sampling)
- Reducing the dimensions of every data point (Dimensionality Reduction) (PCA)
Can you provide 3 different normalization techniques?
- Linear Normalization
- Square Root Normalization
- Logarithmic Normalization
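A minimal sketch of the three techniques, assuming each one maps the values to the range [0, 1] and that the square root and logarithmic variants apply the transform before rescaling. The exact definitions used in a given course or library may differ.

```python
import math

def linear_normalize(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def sqrt_normalize(values):
    """Take square roots first, then rescale; assumes non-negative values."""
    return linear_normalize([math.sqrt(v) for v in values])

def log_normalize(values):
    """Take logarithms first, then rescale; assumes strictly positive values."""
    return linear_normalize([math.log(v) for v in values])

# Hypothetical values spanning two orders of magnitude.
data = [1, 10, 100]
print(linear_normalize(data))
print(sqrt_normalize(data))
print(log_normalize(data))
```

On skewed data like this, the linear variant pushes the middle value close to 0, while the square root and logarithmic variants spread the small values out more evenly.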
List 3 reasons to perform Data Reduction
- Too much data can harm the model because of redundancies in the data.
- Interpreting a model whose input data has high dimensionality can be impossible.
- Training a model on too much data can be hard because of the computational complexity.
List 4 methods for Sampling
- Random Sampling
  + Can be easily done
  + Least biased sampling method
  - Chance that some characteristics are not taken into account
- Systematic Random Sampling
  + Can be easily done
  + Good sample for the ordered attribute
  - Possible bias towards the other attributes
  - Problem with periodicities
- Stratified Random Sampling
  + Good sample of the data
  + Different sampling strategies can be used within a stratum
  - Computationally more expensive
- Cluster Random Sampling
  + Cheap method when it is geographically convenient
  + Easy to increase sample size
  - Least representative of the population
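The four sampling methods above can be sketched with the standard `random` module. The population, the `region` attribute, and the function names are hypothetical, and a seeded generator is used only to make the run repeatable.

```python
import random

def simple_random_sample(items, n, rng):
    """Pick n items uniformly at random without replacement."""
    return rng.sample(items, n)

def systematic_sample(items, step, rng):
    """Pick a random start, then take every step-th item of the ordered list."""
    start = rng.randrange(step)
    return items[start::step]

def group_by(items, key):
    """Group items into a dict of lists by the value of key(item)."""
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    return groups

def stratified_sample(items, key, fraction, rng):
    """Sample the same fraction from every stratum (at least one item each)."""
    sample = []
    for members in group_by(items, key).values():
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample

def cluster_sample(items, key, n_clusters, rng):
    """Pick whole clusters at random and keep all of their members."""
    clusters = group_by(items, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

# Hypothetical population: 40 records spread evenly over 4 regions.
population = [{"id": i, "region": "ABCD"[i % 4]} for i in range(40)]
rng = random.Random(0)

by_region = lambda item: item["region"]
simple = simple_random_sample(population, 8, rng)
systematic = systematic_sample(population, 5, rng)
stratified = stratified_sample(population, by_region, 0.2, rng)
clustered = cluster_sample(population, by_region, 2, rng)
print(len(simple), len(systematic), len(stratified), len(clustered))
```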
What is Data Augmentation?
Modifying dimensions of one datapoint without altering its labels to produce a new one
List 2 methods that can be used for Data Augmentation
- Adding datapoints
- Adding new dimensions
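Both methods can be sketched on labelled points stored as `(features, label)` pairs. The jitter scale, the derived product feature, and the function names are assumptions chosen for illustration; real augmentation strategies depend on the data type.

```python
import random

def add_jittered_points(points, copies, scale, rng):
    """Adding datapoints: perturb the features slightly and keep the label."""
    new_points = []
    for features, label in points:
        for _ in range(copies):
            jittered = [x + rng.gauss(0, scale) for x in features]
            new_points.append((jittered, label))
    return points + new_points

def add_product_dimension(points):
    """Adding a dimension: append the product of the first two features."""
    return [(features + [features[0] * features[1]], label)
            for features, label in points]

# Hypothetical two-dimensional labelled points.
points = [([1.0, 2.0], "a"), ([3.0, 4.0], "b")]
rng = random.Random(42)

more_points = add_jittered_points(points, copies=2, scale=0.05, rng=rng)
wider_points = add_product_dimension(points)
print(len(more_points), wider_points)
```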
How do we integrate Data?
- Join Databases
  + easy approach: fast and low effort
  - does not always work or yield desired results
- Reconcile and Match Data (e.g., http://openrefine.org)
  + can consider tasks and semantics
  - more time consuming, requires entity resolution/matching
- Crowd-Sourcing (e.g., Annotated Knowledge Graphs)
  + high quality data
  - costly and task-specific
How to match Data Items?
- Exact Matches
- Based on IDs
- Based on Similarity
  - Euclidean Norm
  - Cosine Similarity
  - Edit Distance
- Based on Co-Occurrence
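The three similarity measures can be sketched in plain Python. The Euclidean norm is applied to the difference of two vectors, and the edit distance is assumed to be the Levenshtein distance (insertions, deletions, and substitutions each cost 1).

```python
import math

def euclidean_distance(a, b):
    """Euclidean norm of the difference between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def edit_distance(s, t):
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(euclidean_distance([0, 0], [3, 4]))
print(cosine_similarity([1, 2], [2, 4]))
print(edit_distance("kitten", "sitting"))
```

A smaller Euclidean or edit distance means a closer match, whereas a cosine similarity closer to 1 means a closer match, so thresholds have to be set per measure.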
What are the three types of Modeling?
- Supervised learning
  - Classification: decide which category a datapoint belongs to
  - Regression: estimate the relationship among variables
- Unsupervised learning
  - Clustering: group similar data together
  - Dimension reduction: find a subgroup of dimensions that captures as much information as possible
- Reinforcement learning: make decisions by choosing among a set of actions in an environment
What is Anscombe’s Quartet?
Four datasets with nearly identical summary statistics that look clearly different when plotted.
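A short sketch that checks the summary statistics, using the quartet values as commonly published (Anscombe, 1973). The helper `pearson` is written out by hand here so the block needs only the standard library.

```python
import math
from statistics import mean, variance

# Datasets I to III share the same x values; dataset IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return covariance / spread

quartet = {"I": (x123, y1), "II": (x123, y2), "III": (x123, y3), "IV": (x4, y4)}
for name, (x, y) in quartet.items():
    # The four rows agree to about two decimal places.
    print(name, mean(x), round(mean(y), 2), variance(x), round(pearson(x, y), 3))
```

The statistics agree, yet a scatter plot of each dataset shows a linear trend, a curve, a line with one outlier, and a vertical cluster with one outlier, which is why plotting the data matters.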
What does Guidance refer to?
An AI-assisted process that aims to actively resolve a knowledge gap encountered by users while solving an analytical task
What are the dimensions of Feedback?
- Intent
- Expression Form
- Observation Actuality
- Observation Granularity