Process Data from Dirty to Clean (Terms) Flashcards
A range of values that conveys how likely a statistical estimate reflects the population
Confidence interval
A character that indicates the beginning or end of a data item
Delimiter
A data value that cannot be left blank or empty
Mandatory
A file containing a chronologically ordered list of modifications made to a project
Changelog
A function that removes leading, trailing, and repeated spaces in data
TRIM
A function that returns a segment from the middle of a text string
MID
A function that returns a set number of characters from the left side of a text string
LEFT
A function that returns a set number of characters from the right side of a text string
RIGHT
A function that returns the length of a text string by counting the number of characters it contains
LEN
A group of characters within a cell, most often composed of letters
Text string
A keyword that is added to a SQL SELECT statement to retrieve only non-duplicate entries
DISTINCT
A number that contains a decimal
Float
A process that ensures certain conditions for multiple data fields are satisfied
Cross-field validation
A process to confirm that a data-cleaning effort was well executed and the resulting data is accurate and reliable
Verification
A process to determine if a survey or experiment has meaningful results
Hypothesis testing
A professional who develops processes and procedures to effectively store and organize data
Data warehousing specialist
A professional who transforms data into a useful format for analysis and gives it a reliable infrastructure
Data engineer
A rule that says the values in a table must match a prescribed pattern
Regular expression (RegEx)
A spreadsheet function that calculates the number of days, months, or years between two dates
DATEDIF
A spreadsheet function that counts the total number of values within a specified range
COUNTA
A spreadsheet function that divides text around a specified character and puts each fragment into a new, separate cell
Split
A spreadsheet function that joins together two or more text strings
CONCATENATE
A spreadsheet function that returns the number of cells in a range that match a specified value
COUNTIF
A spreadsheet function that vertically searches for a certain value in a column to return a corresponding piece of information
VLOOKUP
A spreadsheet tool that automatically searches for and eliminates duplicate entries from a spreadsheet
Remove duplicates
A spreadsheet tool that changes how cells appear when values meet specific conditions
Conditional formatting
A SQL function that adds strings together to create new text strings that can be used as unique keys
CONCAT
A SQL function that converts data from one datatype to another
CAST
A SQL function that extracts a substring from a string variable
SUBSTR
A SQL function that returns non-null values in a list
COALESCE
A SQL statement that returns records that meet conditions by including an if/then statement in a query
CASE
A subset of a text string
Substring
A tool for checking the accuracy and quality of data
Data validation
A tool for determining how many characters can be keyed into a spreadsheet field
Field length
A tool that finds a specified search term and replaces it with something else
Find and replace
A value that can’t have a duplicate
Unique
A way of selecting a sample from a population so that every possible type of the sample has an equal chance of being chosen
Random sampling
An agreement that unites two organizations into a single new one
Merger
An indication that a value does not exist in a dataset
Null
Any data that has been superseded by newer and more accurate information
Outdated data
Any record that inadvertently shares data with another record
Duplicate data
Converting data from one type to another
Typecasting
Data that is complete but inaccurate
Incorrect/inaccurate data
Data that uses different formats to represent the same thing
Inconsistent data
Data that is complete, correct, and relevant to the problem being solved
Clean data
Data that is incomplete, incorrect, or irrelevant to the problem to be solved
Dirty data
Data that is missing important fields
Incomplete data
How well two or more datasets are able to work together
Compatibility
Nontechnical traits and behaviors that relate to how people work
Soft skills
Numerical values that fall between predefined maximum and minimum values
Data range
Skills and qualities that can transfer from one job or industry to another
Transferable skills
The accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle
Data integrity
The average number of people who typically complete a survey
Estimated response rate
The criteria that determine whether a piece of a data is clean and valid
Data constraints
The degree to which data conforms to constraints when it is input, collected, or created
Validity
The degree to which data conforms to the actual entity being measured or described
Accuracy
The degree to which data contains all desired components or measures
Completeness
The degree to which data is repeatable from different points of entry or collection
Consistency
The maximum amount that sample results are expected to differ from those of the actual population
Margin of error
The number of characters in a text string
Length
The predetermined structure of a language that includes all required words, symbols, and punctuation, as well as their proper placement
Syntax
The probability that a sample size accurately reflects the greater population
Confidence level
The probability that a test of significance will recognize an effect that is present
Statistical power
The probability that sample results are not due to random chance
Statistical significance
The process of changing data to make it more organized and easier to read
Data manipulation
The process of combining two or more datasets into a single dataset
Data merging
The process of copying data from a storage device to computer memory or from one computer to another
Data transfer
The process of matching fields from one data source to another
Data mapping
The process of storing data in multiple locations
Data replication
The process of testing two variations of the same web page to determine which page is more successful at attracting user traffic and generating revenue
A/B testing