Lecture 2: Data Foundations and Tasks Flashcards
When should you not visualize?
When the question and the dataset are both well defined
-> use statistics/machine learning instead
Where does insight generation come from?
Humans
Pros of having a computer in the loop?
Scale
‣ Drawing by hand is infeasible
‣ Interaction allows users to ‘drill down’ into data
‣ Integration with algorithms
Efficiency
‣ Re-use charts for different datasets
Quality
‣ Precise data-driven rendering
Storytelling
Why use Interactivity?
Limitations of people and displays
Single static view can only show one aspect.
Name methods of Data Acquisition (Raw Data)
Measurements
Modeling/Simulation
Artificial
Measurements
‣ Real world data
‣ e.g.: computed tomography (CT) / magnetic resonance (MR), lab results, production sensor data
Modeling / Simulation
‣ e.g.: flow visualization, biological processes (pathways), climate change model, engine model
Artificial
‣ Human generated data
‣ e.g.: social networks, text, painting, movie, workflows
How can we handle missing data values?
Discard bad records
Assign sentinel value
Assign average value
Assign value based on nearest neighbor
Communicate in visualization
Discard bad records
‣ Commonly applied
‣ Con: loss of data
Assign sentinel value
‣ e.g., -1, NaN
‣ Must be handled explicitly when statistics are computed
Assign average value
‣ Pro: affects statistics minimally
‣ Con: non-existent data values are introduced
Assign value based on nearest neighbor
Communicate in visualization
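The first four strategies above can be sketched in a few lines of Python. This is only a minimal illustration: the `records` list, the use of `None` as the "missing" marker, and all function names are assumptions made for this example, not part of the lecture.

```python
import math
from statistics import mean

# Hypothetical toy records: (position, value); None marks a missing value.
records = [(0, 2.0), (1, None), (3, 4.0), (4, 9.0)]

def discard(recs):
    """Discard bad records: drop every record whose value is missing."""
    return [(x, v) for x, v in recs if v is not None]

def with_sentinel(recs, sentinel=math.nan):
    """Assign a sentinel value (here NaN) to missing values."""
    return [(x, sentinel if v is None else v) for x, v in recs]

def with_average(recs):
    """Assign the average of the known values to missing values."""
    avg = mean(v for _, v in recs if v is not None)
    return [(x, avg if v is None else v) for x, v in recs]

def with_nearest_neighbor(recs):
    """Assign the value of the known record closest in position."""
    known = discard(recs)

    def nearest(x):
        return min(known, key=lambda kv: abs(kv[0] - x))[1]

    return [(x, nearest(x) if v is None else v) for x, v in recs]
```

Note that a sentinel such as NaN propagates through arithmetic, which is why it has to be filtered out before statistics are computed.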
What is done during Data Processing & Cleaning (in order)?
Handling Missing Values, Normalization, Sanity Check, Data Reduction (Filtering, Aggregation), Data Transformation/Mapping
What is Normalization?
Makes seemingly unrelated data comparable.
Transforms the dataset so that the result satisfies a particular statistical property (e.g., all values scaled to [0, 1]).
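Two common normalizations, as a hedged sketch: min-max scaling (result lies in [0, 1]) and z-score standardization (result has mean 0 and standard deviation 1). The card does not name specific methods, so these two and the function names are chosen here purely for illustration.

```python
from statistics import mean, pstdev

def min_max(values):
    """Scale values linearly so that min maps to 0 and max maps to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Shift and scale values to mean 0 and (population) std. deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

After either transformation, attributes measured in different units (e.g., price and temperature) share a common scale and can be compared in one chart.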
Why is a Sanity Check important?
Catches impossible data values; watch out for (wrong) assumptions about the data
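A sanity check can be as simple as a set of per-attribute validity rules. The sketch below is an assumption-laden illustration: the `people` records, the attribute names, and the plausible ranges are all made up for this example.

```python
def sanity_check(records, rules):
    """Return (record index, attribute, value) for each value violating a rule."""
    problems = []
    for i, rec in enumerate(records):
        for attr, is_valid in rules.items():
            if attr in rec and not is_valid(rec[attr]):
                problems.append((i, attr, rec[attr]))
    return problems

# Hypothetical data: the third height looks like millimetres stored as cm.
people = [
    {"age": 34, "height_cm": 172},
    {"age": -3, "height_cm": 180},
    {"age": 51, "height_cm": 1750},
]

# Hypothetical plausibility ranges; real rules depend on the domain.
rules = {
    "age": lambda a: 0 <= a <= 130,
    "height_cm": lambda h: 30 <= h <= 280,
}

problems = sanity_check(people, rules)
```

The rules themselves encode assumptions, so they deserve the same scrutiny as the data.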
Data Reduction
What is Filtering?
Eliminating some items or attributes
Data Reduction
What are some Approaches of Data Filtering?
‣ User-defined attributes / criteria
* Clipping (min, max)
* Threshold value (cut-off value)
* Interactive filtering/zooming
‣ Sampling
* e.g., take every x-th element, random sampling
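The filtering approaches above, sketched on a plain list of numbers. Clipping is interpreted here as keeping only the items inside a [min, max] range, which is an assumption about the card's intent; the function names are likewise illustrative.

```python
import random

def clip_range(values, lo, hi):
    """Clipping: keep only values inside the closed range [lo, hi]."""
    return [v for v in values if lo <= v <= hi]

def threshold(values, cutoff):
    """Threshold: keep only values at or above the cut-off value."""
    return [v for v in values if v >= cutoff]

def every_kth(values, k):
    """Systematic sampling: keep every k-th element, starting with the first."""
    return values[::k]

def random_sample(values, n, seed=None):
    """Random sampling: draw n distinct elements (seed makes it repeatable)."""
    return random.Random(seed).sample(values, n)
```

Interactive filtering typically calls functions like these again whenever the user moves a slider or zooms.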
Data Reduction
What is Data Aggregation?
Representing a group of items/attributes by a new item/attribute
Data Reduction
What are some Types of Data Aggregation?
Item aggregation
‣ Using statistics
e.g., average, min/max, count, sum
‣ Clustering
Attribute aggregation
‣ Dimensionality reduction aka embeddings / projections
e.g., t-SNE, PCA, UMAP
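Item aggregation using statistics can be sketched as a group-by followed by summary functions. The `sales` records and the attribute names below are hypothetical; attribute aggregation (dimensionality reduction) is not covered by this sketch.

```python
from collections import defaultdict
from statistics import mean

def aggregate(items, key, attr):
    """Replace each group of items by one summary item per group."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item[attr])
    return {
        group: {
            "count": len(vals),
            "sum": sum(vals),
            "min": min(vals),
            "max": max(vals),
            "average": mean(vals),
        }
        for group, vals in groups.items()
    }

# Hypothetical items: one record per sale.
sales = [
    {"region": "north", "amount": 10},
    {"region": "north", "amount": 30},
    {"region": "south", "amount": 5},
]

summary = aggregate(sales, key="region", attr="amount")
```

Three items become two aggregated items, one per region, each described by new attributes.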
How does Data Transformation / Mapping work?
In data space
‣ Convert from source data system to target data system
e.g., temperature conversion
In visual space
‣ Mapping of data to geometric primitives (points, lines, etc.) and their attributes (color, position, size, etc.)
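Both kinds of mapping can be illustrated briefly: a data-space conversion (Celsius to Fahrenheit) and a visual-space mapping from a data value to a pixel position. The `linear_scale` helper and the chosen domain and pixel range are assumptions for this sketch, not part of the lecture.

```python
def celsius_to_fahrenheit(c):
    """Data space: convert from the source unit to the target unit."""
    return c * 9 / 5 + 32

def linear_scale(domain, out_range):
    """Visual space: build a function mapping data values to e.g. pixels.

    Assumes the two domain bounds differ (otherwise division by zero).
    """
    d0, d1 = domain
    r0, r1 = out_range

    def scale(v):
        t = (v - d0) / (d1 - d0)   # relative position within the domain
        return r0 + t * (r1 - r0)  # same relative position in the range

    return scale

# Hypothetical chart: temperatures 0..40 °C drawn on an 800-pixel-wide axis.
x_position = linear_scale(domain=(0, 40), out_range=(0, 800))
```

The same scale function works for any visual attribute with a numeric range, such as size or a color ramp parameter.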
Name data types
Structural interpretation of data
Items
Attributes
Links
Positions
Grids
Different from data types in programming
Items
‣ Discrete individual entity
‣ e.g., machine, worker, city
Attributes
‣ Measured, observed, or logged properties of items
‣ Aka variable, dimension, feature
‣ e.g., age, price, temperature
Links
‣ Relationship between items
‣ e.g., Facebook friendship, connections between circuit elements
Positions
‣ Spatial data providing location in 2D or 3D space
‣ e.g., long/lat pair of city, pixel in photo, voxels in MRI scan
Grids
‣ Sampling strategy for continuous data
‣ e.g., grid of weather stations in a region
Name dataset types
Tables
Networks & Trees
Fields
Geometry
Clusters, Sets, Lists
A dataset is a collection of information that is the target of analysis
Name Attribute Types
Categorical (nominal)
Ordered: Ordinal
Ordered: Quantitative
Which classes of values & measurements are there?
Categorical (nominal)
‣ Compare equality, no implicit order
‣ e.g., fruit, gender, product category, file types
Ordered
‣ Ordinal
* Greater/less than defined
* e.g., shirt size, rankings
‣ Quantitative
* Arithmetic possible
* e.g., length, weight, count
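The attribute type determines which operations are meaningful. A small sketch with made-up example data: categorical values support only equality (so counting and the mode), ordinal values additionally support sorting (so the median), and quantitative values support arithmetic (so the mean). The `SIZE_ORDER` list is an assumed ordering for shirt sizes.

```python
from collections import Counter
from statistics import mean

def mode_of(categories):
    """Categorical: only equality is defined, so count equal values."""
    return Counter(categories).most_common(1)[0][0]

# Ordinal: the order must be stated explicitly; distances are undefined.
SIZE_ORDER = ["S", "M", "L", "XL"]

def median_size(sizes):
    """Ordinal: sort by rank and take the middle element (lower middle if even)."""
    ranked = sorted(sizes, key=SIZE_ORDER.index)
    return ranked[(len(ranked) - 1) // 2]

def average_length(lengths):
    """Quantitative: arithmetic is defined, so a mean is meaningful."""
    return mean(lengths)
```

Averaging shirt sizes or sorting fruit names alphabetically would run without error but would not be meaningful for the attribute type.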
What are the types of ordering Directions?
Sequential
Diverging
Cyclic
Sequential
‣ Homogeneous from min to max
‣ e.g., number of people per country
Diverging
‣ Two or more sequences that meet at a common zero point
‣ e.g., elevation dataset (above sea level & below sea level)
Cyclic
‣ Values wrap around, as with time (hour, week, month, year)
‣ e.g., seasons of the year
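One place the ordering direction matters is when a value is normalized before being mapped to a color scale. The sketch below shows one possible normalization per direction; the function names and target intervals are assumptions for illustration only.

```python
def sequential(v, vmin, vmax):
    """Sequential: position between min and max, in [0, 1]."""
    return (v - vmin) / (vmax - vmin)

def diverging(v, vmin, mid, vmax):
    """Diverging: signed position relative to the midpoint, in [-1, 1]."""
    if v >= mid:
        return (v - mid) / (vmax - mid)
    return -(mid - v) / (mid - vmin)

def cyclic(v, period):
    """Cyclic: position within one period, in [0, 1); wraps around."""
    return (v % period) / period
```

For an elevation dataset, `diverging(v, lowest, 0, highest)` would place sea level at 0 so that each side can get its own color ramp, while `cyclic(hour, 24)` maps hour 24 back onto hour 0.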
What is Task Abstraction?
The formulation of domain-independent tasks.