All Glossary Terms Flashcards
(30 cards)
The data wrangling step in which errors in the raw data are corrected.
Cleaning
Documentation of characteristics of the wrangled data such as names and definitions of the fields, units of measure used in the fields, the source(s) of the raw data, relationship(s) of the wrangled data with other data, and other attributes.
Data dictionary
The process of cleaning, transforming, and managing data so it is more reliable and can be more easily accessed and used for analysis.
Data wrangling
A tag or marker that separates structured data into various fields.
Delimiters
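As a minimal sketch of how a delimiter separates data into fields, the following uses Python's standard `csv` module. The sample text, the field names, and the helper `split_into_fields` are hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical raw text in which a comma delimits the fields.
raw_text = "name,units,price\nwidget,3,2.50\ngadget,10,0.75\n"


def split_into_fields(text, delimiter=","):
    """Return a list of records, each a list of field values (as strings)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [row for row in reader]


records = split_into_fields(raw_text)
header, rows = records[0], records[1:]
```

Passing a different `delimiter` (for example `"\t"`) handles tab-delimited text in the same way.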
The data wrangling step in which the analyst becomes familiar with the data in order to conceptualize how it might be used and potentially discovers issues that will need to be addressed later in the data wrangling process.
Discovery
A field that takes a value of 0 or 1 to indicate the absence or presence of some categorical effect.
Dummy variable
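A dummy variable could be built along these lines. This is only a sketch: the records, the `member` field, and the helper `add_dummy` are assumptions made up for the example.

```python
# Hypothetical records with a categorical field named "member".
records = [
    {"customer": "A", "member": "yes"},
    {"customer": "B", "member": "no"},
    {"customer": "C", "member": "yes"},
]


def add_dummy(rows, field, present_value, new_field):
    """Return copies of rows with a 0/1 field marking where field == present_value."""
    return [
        {**row, new_field: 1 if row[field] == present_value else 0}
        for row in rows
    ]


with_dummy = add_dummy(records, "member", "yes", "is_member")
```

Each returned record keeps its original fields and gains `is_member`, which is 1 when the effect is present and 0 when it is absent.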
The data wrangling step in which the raw data are augmented by incorporating values from other data sets and/or applying transformations to portions of the existing data to ensure that all data that will be required for the ensuing analyses will be included in the resulting data set.
Enriching
A characteristic of the observations in a data set.
Field
A data file in which structured data are arrayed as a rectangle, with each row representing an observation or record, and each column representing a unique variable or field.
Flat file
Instances for which there is no appropriate reason for the value of a field to be missing.
Illegitimately missing data
Instances for which there is an appropriate reason for the value of a field to be missing.
Legitimately missing data
Systematic replacement of missing values with values that seem reasonable.
Imputation
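One simple form of imputation replaces each missing value with the mean of the observed values. The sketch below assumes missing values are represented as `None`; the data and the helper `impute_mean` are hypothetical, and mean replacement is only one of several possible strategies.

```python
from statistics import mean

# Hypothetical field with two missing values, represented here as None.
ages = [30, None, 20, 40, None]


def impute_mean(values):
    """Replace each None with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


imputed = impute_mean(ages)
```

Whether a mean, a median, or a model-based value "seems reasonable" depends on the field and on why the values are missing.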
Instances for which the tendency for a record to be missing a value of some field is related to the value of some other field(s) in the record.
Missing at random
Instances for which the tendency for a record to be missing a value of some field is entirely random.
Missing completely at random
Instances for which the tendency for a record to be missing a value of some field is related to the missing value.
Missing not at random
Data that are stored in a manner that allows mathematical operations to be performed on them. Data of this type generally represent a count or measurement.
Numeric data
Combining multiple data sets that each have different data for individual records, when each record occurs no more than once in each data set.
One-to-one merger
Combining multiple data sets that each have different data for individual records, when at least one record occurs more than once in at least one of the data sets.
One-to-many merger
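The difference between the two kinds of merger can be sketched with plain dictionaries and lists. The data sets, the key `customer_id`, and both helper functions are hypothetical and serve only to illustrate the definitions.

```python
# Hypothetical data sets keyed by a customer ID.
customers = {101: {"name": "Ana"}, 102: {"name": "Ben"}}
addresses = {101: {"city": "Lima"}, 102: {"city": "Oslo"}}
# Customer 101 occurs more than once in the orders data set.
orders = [
    {"customer_id": 101, "item": "pen"},
    {"customer_id": 101, "item": "ink"},
    {"customer_id": 102, "item": "pad"},
]


def merge_one_to_one(left, right):
    """Combine two data sets in which each key occurs at most once."""
    return {key: {**left[key], **right[key]} for key in left if key in right}


def merge_one_to_many(one_side, many_side, key_field):
    """Attach the single matching record from one_side to every row of many_side."""
    return [
        {**row, **one_side[row[key_field]]}
        for row in many_side
        if row[key_field] in one_side
    ]


profiles = merge_one_to_one(customers, addresses)
order_details = merge_one_to_many(customers, orders, "customer_id")
```

The one-to-one result has one combined record per customer, while the one-to-many result repeats a customer's fields on every matching order.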
The data wrangling step in which a file containing the wrangled data and documentation of the file’s contents are made available to its intended users in a format they can use.
Publishing
Data that have not been processed or prepared for analysis.
Raw data
A grouping of characteristics for a particular observation in a data set.
Record
Data that do not have the same level of organization as structured data, but that allow for isolation of some elements of the raw data when they are imported.
Semi-structured data
Data organized so that the values for each variable are stored in a single field.
Stacked data
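To illustrate stacked data, the sketch below converts a hypothetical table with one column per quarter into a form where all sales values sit in a single field. The store names, the quarter columns, and the helper `stack` are assumptions for the example.

```python
# Hypothetical unstacked data: sales values are spread across two fields.
unstacked = [
    {"store": "North", "Q1": 10, "Q2": 12},
    {"store": "South", "Q1": 7, "Q2": 9},
]


def stack(rows, id_field, value_fields, variable_name, value_name):
    """Return one record per (row, value field) pair, with all values in one field."""
    stacked = []
    for row in rows:
        for field in value_fields:
            stacked.append(
                {
                    id_field: row[id_field],
                    variable_name: field,
                    value_name: row[field],
                }
            )
    return stacked


stacked = stack(unstacked, "store", ["Q1", "Q2"], "quarter", "sales")
```

After stacking, every sales value is stored in the single `sales` field, and the `quarter` field records which column it came from.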
Data sets that are arrayed in a predetermined pattern that makes them easy to manage and search.
Structured data