Theory - Data Pipeline Flashcards
What is Data?
Data refers to a collection of raw facts, observations, or statistics.
What is Capta?
Capta is actively “taken”, whereas data is assumed to be a “given” that
can be recorded and observed.
What is a multivariate data type?
Multiple variables within a single record representing a composite data item
Can you list various data types?
- 1D (e.g., sets and sequences)
- 2D (e.g., geo-spatial maps)
- 3D (e.g., shapes)
- nD (e.g., high-dimensional)
- Networks (e.g., relational and graphs)
- Temporal (e.g., timelines)
- Trees (e.g., hierarchies)
- Text (e.g., document corpus)
- Multimedia (e.g., images and videos)
What is a scalar data type?
An individual number in a data record
What is a vector data type?
A representation form of multivariate data (an ordered collection of scalar values)
What are Data Models?
Conceptual models that consist of a formal description of the data w.r.t. the task semantics to support reasoning and problem-solving.
Can you list 4 data classes?
- Nominal (labels or categories)
- Ordinal (ordered)
- Numeric - Interval (location of zero arbitrary)
- Numeric - Ratio (zero fixed)
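To illustrate the four classes above, the sketch below records which operations are meaningful for each class and shows why ratios break down on an interval scale. The `MEANINGFUL_OPS` table, the `supports` helper, and the temperature values are assumptions made for illustration; they are not part of the original cards.

```python
# Hypothetical lookup: which operations are meaningful for each data class.
MEANINGFUL_OPS = {
    "nominal": {"equality"},
    "ordinal": {"equality", "order"},
    "interval": {"equality", "order", "difference"},
    "ratio": {"equality", "order", "difference", "ratio"},
}


def supports(data_class: str, operation: str) -> bool:
    """Return True if the operation is meaningful for the given data class."""
    return operation in MEANINGFUL_OPS[data_class]


# Celsius is an interval scale (zero is arbitrary); Kelvin is a ratio scale.
celsius_a, celsius_b = 10.0, 20.0
kelvin_a, kelvin_b = celsius_a + 273.15, celsius_b + 273.15

# Differences agree across both scales, ratios do not.
difference_celsius = celsius_b - celsius_a
difference_kelvin = kelvin_b - kelvin_a
ratio_celsius = celsius_b / celsius_a
ratio_kelvin = kelvin_b / kelvin_a
```

The Celsius ratio is 2.0 while the Kelvin ratio is close to 1.035, so "twice as hot" only makes sense on the scale with a fixed zero.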
Can you list 2 different types of data representation?
- Symbolic (Explicit, used by deductive methods)
- Sub-symbolic (Implicit, used by artificial neural networks, ANNs)
Criteria: Logical relationships between dimensions
How can the users' mental model of the data be expressed?
Intelligible vs. Non-Intelligible Data
Criteria: Correlation with the mental model of a person
How can we collect data?
- Observations
- Field Studies
- User Behavior Analysis
- Surveys
- Questionnaires
- Crowdsourcing
- Sensors
- Automatic Collection
Can you list 3 data formats?
- Structured
- Unstructured
- Semi-Structured Data
Criteria: Arrangement and organisation of the data
Can you list 3 Common Data Exchange Formats?
- XML: eXtensible Markup Language
- JSON: JavaScript Object Notation
- CSV: Comma-Separated Values
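As a minimal sketch of how the same record looks in each of these exchange formats, the example below serializes one record with Python's standard library. The record contents and the helper names `to_json`, `to_csv`, and `to_xml` are assumptions chosen for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical example record; all values are kept as strings for simplicity.
record = {"id": "1", "city": "Vienna", "temp_c": "21.5"}


def to_json(rec: dict) -> str:
    """Serialize the record as a JSON object."""
    return json.dumps(rec)


def to_csv(rec: dict) -> str:
    """Serialize the record as a header row plus one data row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rec), lineterminator="\n")
    writer.writeheader()
    writer.writerow(rec)
    return buffer.getvalue()


def to_xml(rec: dict) -> str:
    """Serialize the record as one XML element with a child per field."""
    root = ET.Element("record")
    for key, value in rec.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")
```

JSON and XML carry the field names with every record, while CSV states them once in the header row and relies on column position afterwards.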
List as many possible data issues as you can think of
- Typos
- Missing Data or Fields
- Different Units
- Non-Uniform Data Types
- Abbreviations
- Variations of the Same Thing
- Duplicates
- Encoding Issues
- Dashes & Parentheses
- Delimiters
- White spaces
- Noise
- Outliers
- Measurement Errors
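A few of the issues listed above (white spaces, variations of the same thing, different units, missing data, duplicates) can be handled with a small cleaning pass. The sketch below is one possible approach under stated assumptions: the sample rows, the `height_to_cm` and `clean` helpers, and the choice of centimetres as the target unit are all hypothetical.

```python
import re

# Hypothetical raw rows showing several issues at once.
raw_rows = [
    {"name": "  Alice ", "height": "175 cm"},  # stray white space
    {"name": "alice", "height": "1.75 m"},     # same person, other unit and casing
    {"name": "Bob", "height": ""},             # missing value
]


def height_to_cm(text: str):
    """Parse '175 cm' or '1.75 m' into centimetres; return None if unparseable."""
    match = re.fullmatch(r"\s*([\d.]+)\s*(cm|m)\s*", text)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    # Round to avoid tiny floating-point differences after unit conversion.
    return round(value * 100, 2) if unit == "m" else round(value, 2)


def clean(rows: list) -> list:
    """Normalize names and units, then drop rows that become exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        name = row["name"].strip().title()
        height = height_to_cm(row["height"])
        key = (name, height)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "height_cm": height})
    return cleaned


cleaned_rows = clean(raw_rows)
```

Missing values are kept as `None` rather than guessed, so a later step can decide whether to drop or impute them.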
How can we generate data from ML models?
- Inputs and Outputs
- Distributions
- Relations
- Perturbations (what-if analysis)
- Parameters
- Logging of Weights, Biases, Activations, etc.
- Model Architectures and Parameters
- Scores and Performance Measures
- Temporal Evolution
- Training-Testing Comparisons
- Data Distribution Shifts
- Model Development Evolution Cycles
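To make the perturbation (what-if analysis) item above concrete, the sketch below varies one input of a model and logs how the output changes. The `toy_model` function is a hypothetical stand-in with fixed weights, not a trained model, and `perturbation_log` is an illustrative helper name.

```python
def toy_model(x1: float, x2: float) -> float:
    """Hypothetical stand-in for a trained model: fixed weights and bias."""
    return 2.0 * x1 - 0.5 * x2 + 1.0


def perturbation_log(model, baseline, feature_index, deltas):
    """Shift one input feature by each delta and record the output change."""
    base_output = model(*baseline)
    rows = []
    for delta in deltas:
        perturbed = list(baseline)
        perturbed[feature_index] += delta
        output = model(*perturbed)
        rows.append(
            {"delta": delta, "output": output, "change": output - base_output}
        )
    return rows


# What happens to the output if the first feature moves by -1, 0, or +1?
log = perturbation_log(
    toy_model, baseline=(1.0, 2.0), feature_index=0, deltas=[-1.0, 0.0, 1.0]
)
```

Each logged row is itself a data record generated from the model, which can then be plotted or compared across model versions.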