Getting and cleaning data - the basics Flashcards
THE COURSE GOAL
Raw data -> Processing script -> tidy data -> data analysis -> data communication
DEFINING DATA
- Start with a SET OF ITEMS (the population).
- Determine the VARIABLES that need to be measured.
- Determine what type of values of the VARIABLES are relevant => QUALITATIVE or QUANTITATIVE
- QUALITATIVE: sex, country of origin, etc.
- QUANTITATIVE: height, weight, blood pressure, etc.
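A minimal sketch (hypothetical values) of how the two kinds of variables appear in R:
sex    <- factor(c("M", "F", "F"))   #QUALITATIVE: a factor with discrete levels
height <- c(180, 165, 172)           #QUANTITATIVE: numeric values, here in cm
str(sex); str(height)                #shows the factor vs numeric types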
RAW vs PROCESSED DATA
Data is deemed RAW or PROCESSED depending on the analysis required. RAW data is in its original format and needs processing for the purpose of the planned analysis. Processing involves operations such as merging, subsetting, transforming, etc. The processing steps need to be recorded and transmitted to the analysis stage. PROCESSED data is ready for the planned analysis.
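A minimal sketch (hypothetical data frames) of the processing operations listed above:
demographics <- data.frame(id = 1:3, sex = c("M", "F", "F"))
measurements <- data.frame(id = 1:3, height = c(180, 165, 172))
tall <- measurements[measurements$height > 170, ]          #subsetting
measurements$height_m <- measurements$height / 100         #transforming
combined <- merge(demographics, measurements, by = "id")   #merging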
DATA PROCESSING PIPELINE
A pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.
COMPONENTS FROM RAW TO TIDY
The documented path for the PROCESSING PIPELINE
- Raw data
- Tidy data
- Code book describing the variables/values in the tidy data set (referred to as metadata)
- The storyline of actions leading from RAW to TIDY (R scripts etc.)
CHARACTERISTICS of RAW DATA
What’s RAW to you might be someone else’s processed data.
- RAW is therefore data that you have run no software on.
- Data whose numbers you have not manipulated.
- Data from which you have not removed any elements.
- Data that you have not summarised in any way.
CHARACTERISTICS of TIDY DATA
You should organise the resulting tidy data so that:
- Each variable is in a distinct column. Columns should be clearly labelled with a human-understandable variable name.
- Each separate observation of those variables is in a distinct row.
- Each type of variable has its own table. “Type” is defined on a case-by-case basis; the criteria can be, for example, different sources of RAW DATA. Distinct tables should be stored in distinct files (one file per table).
- Multiple tables should contain a column linking one table to the others, which is useful for merging (see the sketch below).
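A minimal sketch (hypothetical tables) of the linking column idea:
subjects <- data.frame(id = 1:2, sex = c("M", "F"))        #one table: subject-level variables
visits   <- data.frame(id = c(1, 1, 2),                    #another table: visit-level variables
                       blood_pressure = c(120, 118, 110))
merge(subjects, visits, by = "id")                         #the shared "id" column makes merging possible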
CODE BOOK CONTENTS
The code book should:
- Be a text file (markdown, WORD, etc.)
- Contain a section entitled “study design” explaining the choice of raw data and how it was retrieved.
- Contain a section entitled “code book” describing each variable and its units.
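A minimal sketch of such a code book (the variable names are hypothetical):
study design - where the raw data came from and how/when it was retrieved
code book - id: subject identifier (integer); height: standing height in centimetres (numeric)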
THE STORYLINE / INSTRUCTIONS LIST
The aim is to make sure the recipient can re-run the elements of the PROCESSING PIPELINE to go from RAW to TIDY data. The instructions list should consist of:
- A computer script or a set of them. Make sure you specify what version of the script language was used.
- The input for the script is the RAW DATA already described.
- The output is the TIDY DATA also described.
- If necessary, detailed instructions of how the script(s) are run.
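A minimal sketch of recording the language version and the run command (run_analysis.R is a hypothetical script name):
R.version.string            #prints the exact R version used; copy it into the instructions list
#the recipient then re-runs the pipeline with a single command, e.g.:
#  Rscript run_analysis.R   #reads the RAW data described above, writes the TIDY data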
MANIPULATING LOCAL DIRECTORIES
- getwd()
- setwd()
- file.exists("directory") - checks existence
- dir.create("directory") - creates directory
Example:
if (!file.exists("data")) {
  dir.create("data")
}
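For completeness, the first two commands in use (assuming the "data" directory created above):
getwd()            #print the current working directory
setwd("./data")    #move into the data sub-directory (relative path)
setwd("../")       #move back up to the previous directory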
DOWNLOADING FILES FROM THE INTERNET
This is useful in the STORYLINE to document a reproducible process.
#copy the URL from the browser and assign it to "fileUrl" in the R script
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54- …"
#use the "download.file" command, specify a local "destfile" and use method "curl" for an https URL
download.file(fileUrl, destfile = "./data/cameras.csv", method = "curl")
#check the files downloaded
list.files("./data")
#record and display the date of download
dateDownloaded <- date()
dateDownloaded
READING DOWNLOADED FILES
- read.table("file path", sep=" ", header=TRUE or FALSE): always state these parameters as a minimum for read.table()
- read.csv("file path"): with a csv file the separator defaults to a comma and the header is assumed to be TRUE.
- Resolve typical issues (a sketch follows this list):
quote="" tells R not to treat any character as a quoting character (useful when stray quotation marks appear inside values)
na.strings= sets the characters that represent missing data
nrows= sets the number of rows to read
skip= number of lines to skip before starting to read
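A sketch of these parameters applied to the cameras.csv file downloaded earlier (each argument is only needed when the file requires it):
cameraData <- read.table("./data/cameras.csv", sep = ",", header = TRUE,
                         na.strings = "NA",     #cells containing "NA" become missing values
                         nrows = 10, skip = 0)  #read only the first 10 data rows, skip nothing
head(cameraData)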
READING EXCEL FILES
Specified subsets of an Excel file can be read:
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru…"
download.file(fileUrl, destfile = "./data/cameras.xlsx", method = "curl")
dateDownloaded <- date()
library(xlsx)
cameraData <- read.xlsx("./data/cameras.xlsx", sheetIndex = 1, header = TRUE)
colIndex <- 2:3
rowIndex <- 1:4
cameraDataSubset <- read.xlsx("./data/cameras.xlsx", sheetIndex = 1,
                              colIndex = colIndex, rowIndex = rowIndex)
HANDLING XML FILES - basics
library(XML)
Assign a variable to the file URL
fileUrl <- "http://www.w3schools.com/xml/simple.xml"
Access the content by parsing it
doc <- xmlTreeParse(fileUrl,useInternal=TRUE)
DRILLING THROUGH XML FILES - level 1
Assign a variable to the root node
rootNode <- xmlRoot(doc)
Determine the first level node name
> xmlName(rootNode)
[1] "breakfast_menu"
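Two follow-up calls for inspecting the root node (same w3schools sample file):
xmlSize(rootNode)   #number of child nodes under the root
names(rootNode)     #names of those child nodes (the "food" items of the menu)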