Getting and cleaning data - the basics Flashcards
THE COURSE GOAL
Raw data -> Processing script -> tidy data -> data analysis -> data communication
DEFINING DATA
- Start with a SET OF ITEMS (the population).
- Determine the VARIABLES that need to be measured.
- Determine what type of values of the VARIABLES are relevant => QUALITATIVE or QUANTITATIVE
- QUALITATIVE: sex, country of origin, etc.
- QUANTITATIVE: height, weight, blood pressure, etc.
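A minimal sketch (hypothetical values) of how the two kinds of variables appear in R:
sex    <- factor(c("M", "F", "F"))   #QUALITATIVE: a factor with discrete levels
height <- c(180, 165, 172)           #QUANTITATIVE: numeric values, here in cm
str(sex); str(height)                #shows the factor vs numeric types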
RAW vs PROCESSED DATA
Data is deemed RAW or PROCESSED depending on the analysis required. RAW data is in its original format and needs processing for the purpose of the planned analysis. Processing involves operations such as merging, subsetting, transforming, etc. The processing steps need to be recorded and transmitted to the analysis stage. PROCESSED data is ready for the planned analysis.
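A minimal sketch (hypothetical data frames) of the processing operations listed above:
demographics <- data.frame(id = 1:3, sex = c("M", "F", "F"))
measurements <- data.frame(id = 1:3, height = c(180, 165, 172))
tall <- measurements[measurements$height > 170, ]          #subsetting
measurements$height_m <- measurements$height / 100         #transforming
combined <- merge(demographics, measurements, by = "id")   #merging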
DATA PROCESSING PIPELINE
A pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.
COMPONENTS FROM RAW TO TIDY
The documented path for the PROCESSING PIPELINE
- Raw data
- Tidy data
- Code book describing the variables/values in the tidy data set (referred to as metadata)
- The storyline of actions leading from RAW to TIDY (R scripts etc.)
CHARACTERISTICS of RAW DATA
What’s RAW to you might be someone else’s processed data.
- RAW is therefore data that you have run no software on.
- Data whose numbers you have not manipulated.
- Data from which you have not removed any elements.
- Data that you have not summarised in any way.
CHARACTERISTICS of TIDY DATA
You should organise the resulting tidy data so that:
- Each variable is in a distinct column. Columns should be clearly labelled with a human-understandable variable name.
- Each separate observation of those variables is in a distinct row.
- Each type of variable has its own table. “Type” is defined on a case-by-case basis; the criteria can be, for example, different sources of RAW DATA. Distinct tables should be stored in distinct files (one file per table).
- Multiple tables should contain a column linking one table to the others, which is useful for merging (see the sketch below).
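A minimal sketch (hypothetical tables) of the linking column idea:
subjects <- data.frame(id = 1:2, sex = c("M", "F"))        #one table: subject-level variables
visits   <- data.frame(id = c(1, 1, 2),                    #another table: visit-level variables
                       blood_pressure = c(120, 118, 110))
merge(subjects, visits, by = "id")                         #the shared "id" column makes merging possible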
CODE BOOK CONTENTS
The code book should:
- Be a text file (markdown, WORD, etc.)
- Contain a section entitled “study design” explaining the choice of raw data and how it was retrieved.
- Contain a section entitled “code book” describing each variable and its units.
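A minimal sketch of such a code book (the variable names are hypothetical):
study design - where the raw data came from and how/when it was retrieved
code book - id: subject identifier (integer); height: standing height in centimetres (numeric)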
THE STORYLINE / INSTRUCTIONS LIST
The aim is to make sure the recipient can re-run the elements of the PROCESSING PIPELINE to go from RAW to TIDY data. The instructions list should consist of:
- A computer script or a set of them. Make sure you specify what version of the script language was used.
- The input for the script is the RAW DATA already described.
- The output is the TIDY DATA also described.
- If necessary, detailed instructions of how the script(s) are run.
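A minimal sketch of recording the language version and the run command (run_analysis.R is a hypothetical script name):
R.version.string            #prints the exact R version used; copy it into the instructions list
#the recipient then re-runs the pipeline with a single command, e.g.:
#  Rscript run_analysis.R   #reads the RAW data described above, writes the TIDY data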
MANIPULATING LOCAL DIRECTORIES
- getwd()
- setwd()
- file.exists("directory") - checks existence
- dir.create("directory") - creates directory
Example:
if (!file.exists("data")) {
  dir.create("data")
}
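For completeness, the first two commands in use (assuming the "data" directory created above):
getwd()            #print the current working directory
setwd("./data")    #move into the data sub-directory (relative path)
setwd("../")       #move back up to the previous directory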
DOWNLOADING FILES FROM THE INTERNET
This is useful in the STORYLINE to document a reproducible process.
#copy the URL from the browser and assign it to "fileUrl" in the R script
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54- …"
#use the "download.file" command, specify a local "destfile" and use method "curl" for an https URL
download.file(fileUrl, destfile = "./data/cameras.csv", method = "curl")
#check the files downloaded
list.files("./data")
#record and display the date of download
dateDownloaded <- date()
dateDownloaded
READING DOWNLOADED FILES
- read.table("file path", sep=" ", header=TRUE or FALSE): always state these parameters as a minimum for read.table()
- read.csv("file path"): with a csv file the separator defaults to a comma and the header is assumed to be TRUE.
- Resolve typical issues (a sketch follows this list):
quote="" tells R not to treat any character as a quoting character (useful when stray quotation marks appear inside values)
na.strings= sets the characters that represent missing data
nrows= sets the number of rows to read
skip= number of lines to skip before starting to read
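A sketch of these parameters applied to the cameras.csv file downloaded earlier (each argument is only needed when the file requires it):
cameraData <- read.table("./data/cameras.csv", sep = ",", header = TRUE,
                         na.strings = "NA",     #cells containing "NA" become missing values
                         nrows = 10, skip = 0)  #read only the first 10 data rows, skip nothing
head(cameraData)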
READING EXCEL FILES
Specified subsets of an Excel file can be read:
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru…"
download.file(fileUrl, destfile = "./data/cameras.xlsx", method = "curl")
dateDownloaded <- date()
library(xlsx)
cameraData <- read.xlsx("./data/cameras.xlsx", sheetIndex = 1, header = TRUE)
colIndex <- 2:3
rowIndex <- 1:4
cameraDataSubset <- read.xlsx("./data/cameras.xlsx", sheetIndex = 1,
                              colIndex = colIndex, rowIndex = rowIndex)
HANDLING XML FILES - basics
library(XML)
Assign a variable to the file URL
fileUrl <- "http://www.w3schools.com/xml/simple.xml"
Access the content by parsing it
doc <- xmlTreeParse(fileUrl,useInternal=TRUE)
DRILLING THROUGH XML FILES - level 1
Assign a variable to the root node
rootNode <- xmlRoot(doc)
Determine the first level node name
> xmlName(rootNode)
[1] "breakfast_menu"
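Two follow-up calls for inspecting the root node (same w3schools sample file):
xmlSize(rootNode)   #number of child nodes under the root
names(rootNode)     #names of those child nodes (the "food" items of the menu)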