Lecture 3 Flashcards
Semantics
The semantics of the data is its real-world meaning. For instance, does a word represent a human first name, or is it the shortened version of a company name where the full name can be looked up in an external list, or is it a city, or is it a fruit? Does a number represent a day of the month, or an age, or a measurement of height, or a unique code for a specific person, or a postal code for a neighbourhood, or a position in space?
Data Sources
Since every visualization starts with the data that is to be displayed, a first step in addressing the design of visualizations is to examine the characteristics of the data.
Data comes from many sources; it can be gathered from sensors or surveys, or it can be generated by simulations and computations.
Data can be raw (untreated), or it can be derived from raw data via some process, such as smoothing, noise removal, scaling, or interpolation. It can also have a wide range of characteristics and structures.
A typical data set used in visualization consists of a list of n records. Each record consists of m (one or more) observations or variables. An observation may be a single number/symbol/string or a more complex structure. A variable may be classified as either independent or dependent. An independent variable is one whose value is not controlled or affected by another variable, such as the time variable in a time-series data set. A dependent variable is one whose value is affected by a variation in one or more associated independent variables. Temperature for a region would be considered a dependent variable, as its value is affected by variables such as date, time, or location.
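As a minimal sketch (with hypothetical field names and values), such a data set can be represented in Python as a list of records; here "date" and "location" are independent variables and "temperature" is dependent:

    # A data set of n records, each holding m observations (variables).
    # "date" and "location" are independent; "temperature" depends on them.
    records = [
        {"date": "2024-07-01", "location": "Berlin", "temperature": 24.5},
        {"date": "2024-07-01", "location": "Oslo",   "temperature": 18.2},
        {"date": "2024-07-02", "location": "Berlin", "temperature": 26.1},
    ]
    n = len(records)         # number of records
    m = len(records[0])      # observations per record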
Types of Data
In its simplest form, each observation or variable of a data record represents a single piece of information. We can categorize this information as being
ordinal (numeric) or nominal (nonnumeric). Subcategories of each can be readily defined.
Ordinal. The data take on numeric values:
• binary—assuming only values of 0 and 1;
• discrete—taking on only integer values, or values from a specific subset (e.g., {2, 4, 6});
• continuous—representing real values (e.g., in the interval [0, 5]).
Nominal. The data take on nonnumeric values:
• categorical—a value selected from a finite (often short) list of possibilities (e.g., red, blue, green);
• ranked—a categorical variable that has an implied ordering (e.g., small, medium, large);
• arbitrary—a variable with a potentially infinite range of values with no implied ordering (e.g., addresses).
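To make the taxonomy concrete, here is an illustrative sketch in Python with hypothetical example values for each category:

    # Hypothetical example values for each category.
    ordinal_binary      = 1                 # assumes only 0 or 1
    ordinal_discrete    = 4                 # from the subset {2, 4, 6}
    ordinal_continuous  = 3.7               # a real value in [0, 5]
    nominal_categorical = "red"             # from {red, blue, green}
    nominal_ranked      = "medium"          # implied order: small < medium < large
    nominal_arbitrary   = "221B Baker St"   # no implied ordering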
Structure Within Records
A scalar field is univariate, with a single value attribute at each point in space. One example of a 3D scalar field is a time-varying medical scan; another is the temperature in a room at each point in 3D space. The geometric intuition is that each point in a scalar field has a single value. A point in space can have several different numbers associated with it; if there is no underlying connection between them then they are simply multiple separate scalar fields.
A vector field is multivariate, with a list of multiple attribute values at each point. The geometric intuition is that each point in a vector field has a direction and magnitude, like an arrow that can point in any direction and that can be any length. The length might mean the speed of a motion or the strength of a force. A concrete example of a 3D vector field is the velocity of air in the room at a specific time point, where there is a direction and speed at each point. The dimensionality of the field determines the number of components in the direction vector; its length can be computed directly from these components, using the standard Euclidean distance formula. The standard cases are two, three, or four components.
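As the paragraph notes, the length can be computed from the components with the standard Euclidean formula; a minimal sketch in Python, using hypothetical velocity components:

    import math

    # Velocity at one point of a 3D vector field (components hypothetical).
    vx, vy, vz = 3.0, 4.0, 0.0

    # Euclidean length of the direction vector = speed of the motion.
    speed = math.sqrt(vx**2 + vy**2 + vz**2)   # 5.0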
A tensor field has an array of attributes at each point, representing a more complex multivariate mathematical structure than the list of numbers in a vector. A physical example is stress, which in the case of a 3D field can be defined by nine numbers that represent forces acting in three orthogonal directions. The geometric intuition is that the full information at each point in a tensor field cannot be represented by just an arrow and would require a more complex shape such as an ellipsoid.
Data Preprocessing – Metadata and Statistics
Data preprocessing (data cleansing) is the first step in data visualization.
Metadata – data (information) about data
Metadata helps in understanding the context of the data and provides guidance for preprocessing.
It provides information such as the format of individual fields, base reference points, units of measurement, and the symbols or numbers used.
Statistical analysis of the data yields measures such as the mean and median, and it helps with outlier detection, clustering, and finding correlations.
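As a sketch of this step, one might compute basic statistics with Python's standard statistics module and flag simple outliers (values are hypothetical):

    import statistics

    values = [12.1, 11.8, 12.4, 11.9, 54.0, 12.2]   # hypothetical measurements

    mean = statistics.mean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values)

    # Flag values more than 2 standard deviations from the mean as outliers.
    outliers = [v for v in values if abs(v - mean) > 2 * stdev]   # [54.0]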
Data Preprocessing – Discarding the Bad Records
Sometimes we delete the records with missing values.
Deleting data has its own pros and cons.
Pro:
- Easy to implement.
Con:
- Data loss.
- Sometimes the missing data is of more interest than the recorded data, e.g., when it indicates malfunctioning sensors.
As a rule of thumb, never delete records if those with missing values make up more than 2% of the whole dataset.
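A minimal sketch of this strategy, with a hypothetical field name and the 2% rule of thumb from above:

    records = [
        {"id": 1, "temp": 21.5},
        {"id": 2, "temp": None},   # record with a missing value
        {"id": 3, "temp": 22.1},
    ]

    complete = [r for r in records if r["temp"] is not None]
    missing_fraction = 1 - len(complete) / len(records)

    # Discard only if few records are affected (the 2% rule of thumb above).
    if missing_fraction <= 0.02:
        records = complete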
Data Preprocessing – Assigning a Sentinel Value
Sometimes, we assign a sentinel value to the missing data records.
For example, for a variable with range 0–100, one can use −5 to represent a missing value.
Pro:
- Easy to visualize the erroneous data.
Con:
- Care must be taken not to perform statistical analysis on the sentinel values.
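A sketch using the −5 sentinel from the example above (variable names are hypothetical):

    SENTINEL = -5   # outside the valid 0-100 range, so it marks a missing value

    scores = [87, SENTINEL, 42, 63, SENTINEL]

    # The sentinel stands out when plotted, but it must be excluded before
    # computing statistics such as the mean.
    valid = [s for s in scores if s != SENTINEL]
    mean = sum(valid) / len(valid)   # 64.0, unaffected by the sentinels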
Data Preprocessing – Assigning Average Value
This is one of the simplest strategies for dealing with bad or missing data.
Calculate the average value of the variable (dimension) and use it to replace each missing value.
Pro:
- It minimally affects the overall statistics for the variable.
Con:
- May not be a good guess.
- May mask or obscure outliers.
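A minimal sketch of mean substitution (hypothetical values):

    values = [10.0, None, 12.0, 11.0, None]

    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)   # 11.0

    # Replace each missing entry with the mean of the observed entries;
    # the variable's mean is unchanged, but outliers may be masked.
    filled = [mean if v is None else v for v in values]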
Data Preprocessing – Assigning Values Based on Nearest Neighbors
In this case, we find the record most similar to the record in question, based on the differences across all the other variables, and assign that record's value to the missing field.
Pro:
- Better approximation.
Con:
- The variable in question may depend mostly on only a subset of the other dimensions, rather than on all of them.
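A sketch of nearest-neighbor substitution, using Euclidean distance over the other variables (field names and values are hypothetical; in practice the variables would first be normalized to comparable scales):

    import math

    records = [
        {"height": 170, "age": 30, "weight": 68.0},
        {"height": 182, "age": 45, "weight": 85.0},
        {"height": 171, "age": 31, "weight": None},   # missing value
    ]

    target = records[2]
    complete = [r for r in records if r["weight"] is not None]

    # Distance over the other variables (here: height and age).
    def distance(a, b):
        return math.hypot(a["height"] - b["height"], a["age"] - b["age"])

    nearest = min(complete, key=lambda r: distance(r, target))
    target["weight"] = nearest["weight"]   # 68.0, copied from the closest record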
Data Preprocessing – Compute a Substitute Value
This method is based on scientific research and is known as imputation.
In this method, we seek substitute values in which we can have high statistical confidence.
For a normal distribution, we impute missing values with the mean.
For a skewed distribution (right-skewed/positive or left-skewed/negative), we use the median as the imputation value.
Pro:
- Mostly accurate.
Con:
- A significant amount of money and energy must be devoted to research and experiments.
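A sketch of this strategy as described above: impute the mean for roughly symmetric data and the median for skewed data. The skewness check here is deliberately simplified to a mean-versus-median comparison, and all values are hypothetical:

    import statistics

    values = [1.0, 1.2, 1.1, 9.5, None, 1.3]   # right-skewed: one large value

    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    median = statistics.median(observed)

    # Crude symmetry check: if mean and median nearly agree, treat the data
    # as normally distributed and impute the mean; otherwise use the median.
    if abs(mean - median) < 0.1 * statistics.stdev(observed):
        impute = mean
    else:
        impute = median          # chosen here, since the data is skewed

    filled = [impute if v is None else v for v in values]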
Normalization
Normalization is the process of transforming a data set so that the results satisfy a particular statistical property, typically mapping values into the range 0.0 to 1.0.
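A common instance is min-max normalization, x' = (x − min) / (max − min), which linearly maps values into [0.0, 1.0]; a minimal sketch with hypothetical values:

    values = [5.0, 10.0, 15.0, 20.0]
    lo, hi = min(values), max(values)

    # Min-max normalization: x' = (x - min) / (max - min).
    normalized = [(v - lo) / (hi - lo) for v in values]
    # -> [0.0, 0.333..., 0.666..., 1.0]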
Segmentation
Sometimes, for the sake of analysis and visualization we need to separate data into contiguous regions, where each region corresponds to a particular classification of data.
This is called segmentation.
Segmentation can be achieved by the following approaches:
Segmentation Approaches
- Top-down approach – start with a single cluster containing all the data, then move down by repeatedly splitting, increasing the number of clusters.
- Bottom-up approach – start with one cluster per record, then repeatedly merge clusters (sketched below).
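A minimal sketch of the bottom-up approach on one-dimensional data: start with one cluster per record and repeatedly merge the two clusters whose centers are closest, until the desired number of segments remains (values hypothetical):

    values = [1.0, 1.2, 5.0, 5.1, 9.8]
    clusters = [[v] for v in values]   # one cluster per record

    def center(c):
        return sum(c) / len(c)

    while len(clusters) > 3:   # stop at 3 segments
        # Find the pair of clusters whose centers are closest.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: abs(center(clusters[p[0]]) - center(clusters[p[1]])),
        )
        clusters[i] += clusters.pop(j)   # merge cluster j into cluster i

    # clusters -> [[1.0, 1.2], [5.0, 5.1], [9.8]]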
Mapping Nominal Dimensions
Sometimes we need to map nominal data to either number or color.
This gives us a better visualization and also helps in clustering the data.
For example, mapping the make and model of a car to an integer.
Color coding is also used extensively to represent data.
It is commonly used in maps to show different geographical regions, e.g., green for plains and brown for mountains and plateaus.
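A sketch of both mappings; the category lists and color choices are hypothetical:

    makes = ["Toyota", "Ford", "Toyota", "BMW"]

    # Map each nominal value (car make) to an integer code.
    codes = {m: i for i, m in enumerate(sorted(set(makes)))}
    encoded = [codes[m] for m in makes]   # [2, 1, 2, 0]

    # Map nominal region types to colors for a map display.
    region_colors = {"plain": "green", "mountain": "brown", "plateau": "brown"}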