COURSERA - Getting and cleaning data Flashcards
THE COURSE GOAL
Raw data -> Processing script -> tidy data -> data analysis -> data communication
DEFINING DATA
Start with a SET OF ITEMS: the population
Determine VARIABLES that need to be measured
Determine whether the relevant values of each VARIABLE are QUALITATIVE or QUANTITATIVE
QUALITATIVE: sex, country of origin, etc.
QUANTITATIVE: height, weight, blood pressure, etc.
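A minimal sketch of the QUALITATIVE / QUANTITATIVE split, assuming a small set of made-up records. The field names, the values, and the helper `variable_kind` are hypothetical and only serve as illustration. The rule used here (numeric values mean QUANTITATIVE) is a simplification: numeric codes such as postal codes are still qualitative, so a real data set needs a human decision per variable.

```python
from numbers import Number

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"sex": "F", "country_of_origin": "Peru", "height_cm": 162.0, "weight_kg": 58.5},
    {"sex": "M", "country_of_origin": "Kenya", "height_cm": 178.5, "weight_kg": 74.0},
]

def variable_kind(values):
    """Return 'QUANTITATIVE' if every value is a number, else 'QUALITATIVE'."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "QUANTITATIVE"
    return "QUALITATIVE"

# Classify each variable (column) by looking at all of its values.
kinds = {
    name: variable_kind([row[name] for row in records])
    for name in records[0]
}

for name, kind in kinds.items():
    print(f"{name}: {kind}")
```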
RAW vs PROCESSED DATA
Data is deemed RAW or processed depending on the analysis required.
RAW data is in its original format and needs processing before the planned analysis can be run.
Processing data involves operations such as merging, subsetting, transforming, etc.
Processing steps need to be recorded and transmitted to the analysis stage.
PROCESSED data is ready for the planned analysis.
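The card above can be sketched as a short script that merges, subsets, and transforms raw data while recording each step. This is only an illustration under stated assumptions: the two raw tables, their field names, and the derived `bmi` variable are hypothetical, and the `steps` list is just one simple way to keep a record for the analysis stage.

```python
# Hypothetical raw inputs; names and values are invented for illustration.
raw_people = [
    {"id": 1, "sex": "F", "height_cm": 162.0},
    {"id": 2, "sex": "M", "height_cm": 178.5},
    {"id": 3, "sex": "F", "height_cm": 170.0},
]
raw_weights = [
    {"id": 1, "weight_kg": 58.5},
    {"id": 2, "weight_kg": 74.0},
    {"id": 3, "weight_kg": 66.0},
]

steps = []  # record of what was done, handed over with the processed data

# Merge: join the two raw tables on "id".
weights_by_id = {row["id"]: row["weight_kg"] for row in raw_weights}
merged = [{**p, "weight_kg": weights_by_id[p["id"]]} for p in raw_people]
steps.append("merged raw_people with raw_weights on id")

# Subset: keep only the rows needed for the planned analysis.
subset = [row for row in merged if row["sex"] == "F"]
steps.append("kept rows where sex == 'F'")

# Transform: derive a new variable (body mass index) from existing ones.
processed = [
    {**row, "bmi": round(row["weight_kg"] / (row["height_cm"] / 100) ** 2, 1)}
    for row in subset
]
steps.append("added bmi = weight_kg / (height_cm / 100) ** 2, rounded to 1 decimal")

for step in steps:
    print(step)
```

The raw lists are never modified, so the original data stays available next to the processed result and the recorded steps.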
DATA PROCESSING PIPELINE
A pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.
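One way to picture a pipeline is a chain of Python generators, where each stage consumes the output of the previous one. This is a sketch, not a definitive implementation: the stage names (`read_raw`, `parse`, `keep_positive`) and the sample lines are hypothetical. Generators run the stages in an interleaved, item-by-item fashion rather than truly in parallel, and no explicit buffer is used here.

```python
def read_raw(lines):
    """Stage 1: emit raw text lines one at a time."""
    for line in lines:
        yield line

def parse(lines):
    """Stage 2: split each 'name,value' line into a (name, float) pair."""
    for line in lines:
        name, value = line.split(",")
        yield name.strip(), float(value)

def keep_positive(pairs):
    """Stage 3: drop pairs whose value is not positive."""
    for name, value in pairs:
        if value > 0:
            yield name, value

# Hypothetical raw input, invented for illustration.
raw_lines = ["a, 1.5", "b, -2.0", "c, 3.0"]

# Connect the stages in series: the output of one is the input of the next.
result = list(keep_positive(parse(read_raw(raw_lines))))
print(result)
```

Each item flows through all three stages before the next item is read, which is why no intermediate list is ever built.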