Theory - Data Pipeline Flashcards
What is Data?
Data refers to a collection of raw facts, observations, or statistics.
What is Capta?
Capta is actively “taken”, whereas data is assumed to be a “given” that can
be recorded and observed.
What is a multivariate data type?
Multiple variables within a single record representing a composite data item
Can you list various data types?
- 1D (e.g., sets and sequences)
- 2D (e.g., geo-spatial maps)
- 3D (e.g., shapes)
- nD (e.g., high-dimensional)
- Networks (e.g., relational and graphs)
- Temporal (e.g., timelines)
- Trees (e.g., hierarchies)
- Text (e.g., document corpus)
- Multimedia (e.g., images and videos)
What is a scalar data type?
An individual number in a data record
What is a vector data type?
Representation forms of multivariate data
What are Data Models?
Conceptual models that consist of a formal description of the data w.r.t. the task semantics to support reasoning and problem-solving.
Can you list 4 data classes?
- Nominal (labels or categories)
- Ordinal (ordered)
- Numeric - Interval (location of zero arbitrary)
- Numeric - Ratio (zero fixed)
Can you list 2 different types of data representation?
- Symbolic (Explicit, used by deductive methods)
- Sub-symbolic (Implicit, used by ANN)
Criteria: Logical relationships between dimensions
How can the Mental Model of the Users about the Data be expressed?
Intelligible vs. Non-Intelligible Data
Criteria: Correlation with the mental model of a person
How can we collect data?
- Observations
- Field Studies
- User Behavior Analysis
- Surveys
- Questionnaires
- Crowdsourcing
- Sensors
- Automatic Collection
Can you list 3 data formats?
- Structured
- Unstructured
- Semi-Structured Data
Criteria: Arrangement and organisation of the data
Can you list 3 Common Data Exchange Formats?
- XML: eXtensible Markup Language
- JSON: JavaScript Object Notation
- CSV: Comma-Separated Values
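As a minimal sketch of how the three exchange formats differ, the snippet below serializes one record with Python's standard library. The record and its field names (`id`, `name`, `city`) are hypothetical and chosen only for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical record used only for illustration.
record = {"id": 1, "name": "Ada", "city": "London"}

# JSON: key-value text format that keeps basic types (numbers, strings).
as_json = json.dumps(record)

# CSV: one header row, then one delimited row per record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record), lineterminator="\n")
writer.writeheader()
writer.writerow(record)
as_csv = buffer.getvalue()

# XML: nested tagged elements; every value is stored as text.
root = ET.Element("record")
for key, value in record.items():
    ET.SubElement(root, key).text = str(value)
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_csv)
print(as_xml)
```

Note that only JSON preserves the numeric type of `id` here; CSV and XML hand it back as text, which is one way non-uniform data types can creep in.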
List as many possible data issues as you can think of
- Typos
- Missing Data or Fields
- Different Units
- Non-Uniform Data Types
- Abbreviations
- Variations of the Same Thing
- Duplicates
- Encoding Issues
- Dashes & Parentheses
- Delimiters
- White spaces
- Noise
- Outliers
- Measurement Errors
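A few of these issues (white space, variations of the same thing, abbreviations, duplicates, different units) can be handled with simple rules. The sketch below is illustrative only: the sample values, the `ABBREVIATIONS` mapping, and the helper names `clean_city` and `height_in_cm` are assumptions, not part of any particular library.

```python
# Hypothetical raw values showing white space, case variations,
# abbreviations, duplicates, and mixed units (cm vs. m).
raw_cities = ["  New York", "new york ", "NYC", "Boston", "Boston"]
raw_heights = ["180 cm", "1.75 m", "165cm"]

# Assumed mapping of known abbreviations to a canonical form.
ABBREVIATIONS = {"nyc": "new york"}


def clean_city(value):
    """Collapse white space, lower-case, and expand known abbreviations."""
    key = " ".join(value.split()).lower()
    return ABBREVIATIONS.get(key, key)


def height_in_cm(value):
    """Convert a height string given in 'cm' or 'm' to centimetres."""
    text = value.strip().lower()
    if text.endswith("cm"):
        return float(text[:-2])
    if text.endswith("m"):
        return float(text[:-1]) * 100
    raise ValueError(f"unknown unit in {value!r}")


# dict.fromkeys removes duplicates while keeping first-seen order.
cities = list(dict.fromkeys(clean_city(c) for c in raw_cities))
heights = [height_in_cm(h) for h in raw_heights]

print(cities)
print(heights)
```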
How can we generate data from ML models?
- Inputs and Outputs
- Distributions
- Relations
- Perturbations (what-if analysis)
- Parameters
- Logging of Weights, Biases, Activations, etc.
- Model Architectures and Parameters
- Scores and Performance Measures
- Temporal Evolution
- Training-Testing Comparisons
- Data Distribution Shifts
- Model Development Evolution Cycles
Can you list the 7 data cleaning pipeline steps?
- Missing Value Treatment
- Noise Treatment
- Outlier Detection
- Normalization
- Data Reduction
- Data Smoothing
- Data Augmentation
Can you list 4 solutions to missing values with their advantages and disadvantages?
- Ignore the tuple
  + Can be easily done
  + No computational effort
  - Loss of information
  - Unnecessary if the attribute is not needed
- Enter value manually
  + Effective for small datasets
  - Need to know the value
  - Time consuming
  - Not feasible with large datasets
- Use attribute mean
  + Simple to implement
  - Not the most accurate approximation of the value
- Use most probable value
  + Most accurate approximation of the value
  - Most computational effort
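Two of these options, ignoring the tuple and using the attribute mean, can be sketched in a few lines. The records and the `age` attribute are hypothetical. Using the most probable value would need a predictive model (for example a regression or a classifier), which is not shown here.

```python
from statistics import mean

# Hypothetical records; None marks a missing 'age' value.
rows = [
    {"name": "a", "age": 20},
    {"name": "b", "age": None},
    {"name": "c", "age": 40},
]

# Ignore the tuple: drop every record with a missing value.
dropped = [row for row in rows if row["age"] is not None]

# Use attribute mean: fill the gap with the mean of the observed values.
age_mean = mean(row["age"] for row in rows if row["age"] is not None)
imputed = [
    {**row, "age": age_mean if row["age"] is None else row["age"]}
    for row in rows
]

print(len(dropped))       # records left after dropping
print(imputed[1]["age"])  # filled-in value for record 'b'
```

Dropping loses one of three records, while mean imputation keeps all of them at the cost of a value that may not match the true one.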
Can you mention 4 sources of noise?
- Random noise (white noise)
- Noise introduced by measurement tools
- Human error
- Fraud
Can you mention 4 methods of removing noise?
- Equal-width binning
- Equal-depth binning
- Outlier Detection via Clustering
- Data Smoothing via Regression
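The two binning methods can be sketched as follows, assuming the common variant in which each value is then replaced by the mean of its bin. The function names and the sample `prices` list are illustrative assumptions; clustering-based outlier detection and regression smoothing are not shown.

```python
from statistics import mean


def equal_width_bins(values, k):
    """Split values into k bins that each span the same value range."""
    ordered = sorted(values)
    low, high = ordered[0], ordered[-1]
    width = (high - low) / k
    if width == 0:
        return [ordered]  # all values identical: a single bin
    bins = [[] for _ in range(k)]
    for v in ordered:
        # The maximum would fall just past the last bin, so clamp it.
        index = min(int((v - low) / width), k - 1)
        bins[index].append(v)
    return bins


def equal_depth_bins(values, k):
    """Split values into k bins holding (nearly) the same number of items."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def smooth_by_bin_means(bins):
    """Replace every value by the mean of its bin (empty bins are skipped)."""
    return [[mean(b)] * len(b) for b in bins if b]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical noisy values
print(equal_width_bins(prices, 3))
print(equal_depth_bins(prices, 3))
print(smooth_by_bin_means(equal_depth_bins(prices, 3)))
```

Equal-width bins can end up with very different numbers of values, whereas equal-depth bins keep the counts balanced but may span very different ranges.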
List 2 ways of performing Data Reduction
- Reducing the data points (Sampling)
- Reducing the dimensions of every data point (Dimensionality Reduction, e.g., PCA)
Can you provide 3 different normalization techniques?
- Linear Normalization
- Square Root Normalization
- Logarithmic Normalization
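These names are used for slightly different formulas in different sources. The sketch below assumes one common reading: linear normalization is min-max scaling to [0, 1], and the square-root and logarithmic variants first transform the values and then apply the same scaling. The function names are illustrative, and the use of `log1p` (so that 0 stays valid) is an assumption.

```python
import math


def linear_normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant input: nothing to scale
    return [(v - low) / (high - low) for v in values]


def sqrt_normalize(values):
    """Square-root transform, then min-max scaling (non-negative input)."""
    return linear_normalize([math.sqrt(v) for v in values])


def log_normalize(values):
    """Log transform, then min-max scaling; log1p keeps 0 a valid input."""
    return linear_normalize([math.log1p(v) for v in values])


values = [0, 16, 100]  # hypothetical skewed data
print(linear_normalize(values))
print(sqrt_normalize(values))
print(log_normalize(values))
```

The square-root and logarithmic variants compress large values, so the middle value 16 moves further from 0 than under linear scaling.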
List 3 reasons to perform Data Reduction
- Too much data can harm the model because of data redundancies
- Interpreting a model with input data of high dimensionality can be impossible
- Training a model with too much data can be hard because of computational complexity
List 4 methods for Sampling
- Random Sampling
  + Can be easily done
  + Least biased sampling method
  - Chance that some characteristics are not taken into account
- Systematic Random Sampling
  + Can be easily done
  + Good sample for the ordered attribute
  - Possible bias towards the other attributes
  - Problem with periodicities
- Stratified Random Sampling
  + Good sample of the data
  + Different sampling strategies can be used within a stratum
  - Computationally more expensive
- Cluster Random Sampling
  + Cheap method when it is geographically convenient
  + Easy to increase sample size
  - Least representative of the population
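The four sampling methods can be sketched with Python's `random` module. The population, the `group` attribute used as the stratum, the consecutive-slice clusters, and all function names are hypothetical choices made for illustration. The sketch also assumes the requested sample size is not larger than the population.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 ids, each tagged with one of two groups.
population = [{"id": i, "group": "A" if i < 80 else "B"} for i in range(100)]


def random_sample(items, n):
    """Random sampling: every item is equally likely to be drawn."""
    return rng.sample(items, n)


def systematic_sample(items, n):
    """Systematic sampling: random start, then every step-th item."""
    step = len(items) // n
    start = rng.randrange(step)
    return items[start::step][:n]


def stratified_sample(items, fraction, key):
    """Stratified sampling: draw the same fraction from every stratum."""
    strata = {}
    for item in items:
        strata.setdefault(item[key], []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample


def cluster_sample(items, cluster_size, n_clusters):
    """Cluster sampling: pick whole clusters at random, keep all members."""
    clusters = [items[i:i + cluster_size]
                for i in range(0, len(items), cluster_size)]
    chosen = rng.sample(clusters, n_clusters)
    return [item for cluster in chosen for item in cluster]


print([p["id"] for p in systematic_sample(population, 10)])
print(len(stratified_sample(population, 0.1, "group")))
print(len(cluster_sample(population, 10, 2)))
```

Stratified sampling keeps the 80/20 split between the two groups in the sample, which plain random sampling only achieves on average.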
What is Data Augmentation?
Creating a new data point by modifying the dimensions of an existing one without altering its label
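As a minimal sketch of this idea, the function below adds small random jitter to each dimension of a numeric data point and returns it with the original label. The name `augment`, the `scale` parameter, and the sample point are assumptions; real pipelines typically use domain-specific transformations (for example flipping or cropping images).

```python
import random


def augment(features, label, rng, scale=0.05):
    """Return a jittered copy of the features with the label unchanged."""
    noisy = [x + rng.uniform(-scale, scale) for x in features]
    return noisy, label


rng = random.Random(42)  # fixed seed so the sketch is repeatable
original = ([1.0, 2.0, 3.0], "cat")  # hypothetical labelled data point
new_features, new_label = augment(original[0], original[1], rng)

print(new_features)
print(new_label)
```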