Getting and cleaning data - the basics Flashcards

1
Q

THE COURSE GOAL

A

Raw data -> Processing script -> tidy data -> data analysis -> data communication

2
Q

DEFINING DATA

A
  • Start with a SET OF ITEMS (the population)
  • Determine VARIABLES that need to be measured
  • Determine what type of values of the VARIABLES are relevant => QUALITATIVE or QUANTITATIVE
  • QUALITATIVE: sex, country of origin, etc.
  • QUANTITATIVE: height, weight, blood pressure, etc.
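
As a minimal sketch of how the two kinds of values are commonly held in R, qualitative variables can be stored as factors and quantitative ones as numeric vectors. The data frame below and all its values are invented for illustration.

```r
# Hypothetical sample of a population; every value here is made up.
people <- data.frame(
  sex       = factor(c("F", "M", "F")),      # qualitative
  country   = factor(c("FR", "US", "FR")),   # qualitative
  height_cm = c(165.2, 180.4, 158.9),        # quantitative
  weight_kg = c(60.1, 82.3, 55.0)            # quantitative
)

# Flag which columns are qualitative (factor) and which are quantitative (numeric).
is_qualitative  <- sapply(people, is.factor)
is_quantitative <- sapply(people, is.numeric)
```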
3
Q

RAW vs PROCESSED DATA

A

Data is deemed RAW or PROCESSED depending on the analysis required. RAW data is in its original format and needs processing before the planned analysis. Processing data involves operations such as merging, subsetting, transforming, etc. Processing steps need to be recorded and passed on to the analysis stage. PROCESSED data is ready for the planned analysis.
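
The three operations named above can be sketched in base R as follows. The two tables, their column names, and the BMI threshold are assumptions made up for this example.

```r
# Hypothetical raw tables; names and values are invented for illustration.
patients <- data.frame(id = c(1, 2, 3), height_cm = c(170, 182, 165))
visits   <- data.frame(id = c(1, 2, 3), weight_kg = c(70, 95, 58))

# Merging: combine the two tables on the shared "id" column.
merged <- merge(patients, visits, by = "id")

# Transforming: derive a new variable (body mass index).
merged$bmi <- merged$weight_kg / (merged$height_cm / 100)^2

# Subsetting: keep only the rows relevant to the planned analysis.
processed <- merged[merged$bmi < 25, ]
```

Keeping these steps in a script is what makes them recordable for the analysis stage.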

4
Q

DATA PROCESSING PIPELINE

A

A pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.
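
A minimal sketch of the "connected in series" idea, using plain R functions as the elements. The function names and values are hypothetical, and the stages run one after another here, not in parallel.

```r
# Each stage is a function; the output of one is the input of the next.
drop_missing <- function(x) x[!is.na(x)]
to_metres    <- function(x) x / 100
average      <- function(x) mean(x)

raw_heights_cm <- c(170, NA, 180, 160)   # invented values

# Run the stages in series.
result <- average(to_metres(drop_missing(raw_heights_cm)))
```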

5
Q

COMPONENTS FROM RAW TO TIDY

A

The documented path for the PROCESSING PIPELINE

  • Raw data
  • Tidy data
  • Code book describing variables / values in the tidy data set (referred to as metadata)
  • The storyline of actions leading from RAW to TIDY (R scripts etc.)
6
Q

CHARACTERISTICS of RAW DATA

A

What’s RAW to you might be someone else’s processed data.

  • RAW is therefore data that you have run no software on.
  • Data that you have not manipulated the numbers of.
  • Data from data sets that you have not removed elements from.
  • Data that you have not summarised in any way.
7
Q

CHARACTERISTICS of TIDY DATA

A

You should organise the resulting tidy data so that:

  • Each variable is in a distinct column. Columns should be clearly labelled with a human-readable variable name.
  • Each separate observation of that variable is in a distinct row.
  • Each type of variable has its own table. "Type" is defined on a case-by-case basis; the criteria can be, for example, different sources of RAW DATA. Distinct tables should be in distinct files (one file per table).
  • Multiple tables should contain a column linking one table to the others. Useful for merging.
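
The rules above can be sketched on a small example, assuming an invented table that stores one column per year. The table names, column names, and values are all hypothetical.

```r
# Hypothetical untidy table: one column per year instead of a "year" variable.
untidy <- data.frame(
  country = c("FR", "US"),
  y2019   = c(10, 20),
  y2020   = c(11, 21)
)

# Tidy version: one column per variable, one row per observation.
tidy <- data.frame(
  country = rep(untidy$country, times = 2),
  year    = rep(c(2019, 2020), each = nrow(untidy)),
  cases   = c(untidy$y2019, untidy$y2020)
)

# A second table about the same countries, linked through the "country" column.
countries <- data.frame(
  country   = c("FR", "US"),
  continent = c("Europe", "America")
)
linked <- merge(tidy, countries, by = "country")
```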
8
Q

CODE BOOK CONTENTS

A

The code book should:

  • Be a text file (Markdown, Word, etc.)
  • Contain a section entitled "study design" explaining the choice of raw data and how it was retrieved.
  • Contain a section entitled “code book” describing each variable and its units.
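
One possible way to produce such a file from R is sketched below. The file name, the variables, and their descriptions are placeholders invented for this example, and a temporary folder is used so nothing is written into a real project.

```r
# Invented variables for illustration; replace with the real ones.
codebook <- c(
  "# Study design",
  "Raw data: describe which source was chosen and how it was retrieved.",
  "",
  "# Code book",
  "- height_cm: standing height, in centimetres",
  "- weight_kg: body weight, in kilograms"
)

# Write the code book as a Markdown text file.
codebook_file <- file.path(tempdir(), "CodeBook.md")
writeLines(codebook, codebook_file)
```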
9
Q

THE STORYLINE / INSTRUCTIONS LIST

A

The aim is to make sure the recipient can re-run the elements of the PROCESSING PIPELINE to go from RAW to TIDY data. The instructions list should consist of:

  • A computer script or a set of them. Make sure you specify what version of the script language was used.
  • The input for the script is the RAW DATA already described.
  • The output is the TIDY DATA also described.
  • If necessary, detailed instructions of how the script(s) are run.
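
For the version requirement, a minimal sketch is to save the R version and session details next to the scripts. The output file name is an assumption, and a temporary folder stands in for the project folder.

```r
# Record the R version used, so the recipient can match it.
r_version <- R.version.string

# Store it, with the full session details, in a text file.
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(c(r_version, capture.output(sessionInfo())), info_file)
```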
10
Q

MANIPULATING LOCAL DIRECTORIES

A
  • getwd() - returns the current working directory
  • setwd() - sets the working directory
  • file.exists("directory") - checks existence
  • dir.create("directory") - creates directory

Example:

if (!file.exists("data")) {
  dir.create("data")
}

11
Q

DOWNLOADING FILES FROM THE INTERNET

A

This is useful in the STORYLINE to document a reproducible process.

# copy the URL from the browser and assign it to "fileUrl" in the R script
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54- …"

# use download.file(), specify the local "destfile" and use "curl" for https
download.file(fileUrl, destfile = "./data/cameras.csv", method = "curl")

# check the files downloaded
list.files("./data")

# record and display the download date
dateDownloaded <- date()
dateDownloaded
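
The same steps can be wrapped in a small helper so that re-running the script does not download the file again. This is only a sketch: `fetch_once` is a hypothetical name, and a local file addressed through a `file://` URL (assuming a Unix-like path) stands in for the remote source so the example runs offline.

```r
# Hypothetical helper: create the target folder if needed, download the file
# only if it is not already present, and return the current date string to log.
fetch_once <- function(url, destfile) {
  if (!file.exists(dirname(destfile))) {
    dir.create(dirname(destfile), recursive = TRUE)
  }
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile, quiet = TRUE)
  }
  date()
}

# Stand-in for a remote file: a local CSV with invented content.
src <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10"), src)

dest <- file.path(tempdir(), "data", "copy.csv")
dateDownloaded <- fetch_once(paste0("file://", src), dest)
```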

12
Q

READING DOWNLOADED FILES

A
  • read.table("file path", sep = " ", header = TRUE or FALSE): always state these parameters as a minimum for read.table()
  • read.csv("file path"): for a csv file the separator defaults to a comma and header defaults to TRUE.
  • Resolve typical issues:
    quote = "" disables quoting, so quotation marks that show up in files are read as ordinary characters
    na.strings = sets the strings that represent missing data
    nrows = sets the number of rows to read
    skip = sets the number of lines to skip before starting to read
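
A short sketch of these parameters working together, on an invented file written to a temporary location. The file content, the ";" separator, and the "unknown" marker are assumptions made for this example.

```r
# Write a small, invented file with an extra first line and a missing value.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "exported by some tool",
  "id;name;speed",
  "1;O'Brien;35",
  "2;Smith;unknown",
  "3;Jones;50"
), path)

cameras <- read.table(
  path,
  sep = ";", header = TRUE,
  skip = 1,                 # ignore the first line of the file
  quote = "",               # read the apostrophe as an ordinary character
  na.strings = "unknown",   # text that stands for missing data
  nrows = 2                 # read only the first two data rows
)
```

Without quote = "", the apostrophe in "O'Brien" would be taken as the start of a quoted string and the rows would not be split as intended.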
13
Q

READING EXCEL FILES

A

read.xlsx() can read a whole sheet or a specified subset of an Excel file.

fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru…"
download.file(fileUrl, destfile = "./data/cameras.xlsx", method = "curl")
dateDownloaded <- date()

library(xlsx)
cameraData <- read.xlsx("./data/cameras.xlsx", sheetIndex = 1, header = TRUE)

# read only columns 2 to 3 and rows 1 to 4
colIndex <- 2:3
rowIndex <- 1:4
cameraDataSubset <- read.xlsx("./data/cameras.xlsx", sheetIndex = 1,
                              colIndex = colIndex, rowIndex = rowIndex)

14
Q

HANDLING XML FILES - basics

A

library(XML)

# assign the file URL to a variable
fileUrl <- "http://www.w3schools.com/xml/simple.xml"

# access the content by parsing it
doc <- xmlTreeParse(fileUrl, useInternalNodes = TRUE)

15
Q

DRILLING THROUGH XML FILES - level 1

A

Assign the root node to a variable

rootNode <- xmlRoot(doc)

Determine the first level node name

> xmlName(rootNode)
[1] "breakfast_menu"

16
Q

DRILLING THROUGH XML FILES - level +2

A

Displays the XML content of the 1st level.

We've determined the "root" node

> xmlName(rootNode)
[1] "breakfast_menu"

and the subsection node names

> names(rootNode)
  food   food   food   food   food
"food" "food" "food" "food" "food"

Drill down one level by using rootNode[[1]] and get names

> rootNode[[1]]
> names(rootNode[[1]])
         name         price   description      calories
       "name"       "price" "description"    "calories"
17
Q

DRILLING THROUGH XML FILES - level +2

A

Drill down one more level by using rootNode[[1]][[1]] and get names

> rootNode[[1]][[1]]
> names(rootNode[[1]][[1]])
  name
"name"
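
The drill-down steps from the last cards can be tried without a network connection on a small inline document. This is a sketch that assumes the XML package is installed. The menu content is invented and much shorter than the real file.

```r
library(XML)  # assumes the XML package is installed

# Invented stand-in for the remote menu file, parsed from a string.
xml_text <- paste0(
  "<breakfast_menu>",
  "<food><name>Waffles</name><price>5.95</price></food>",
  "<food><name>Toast</name><price>4.50</price></food>",
  "</breakfast_menu>"
)

doc <- xmlTreeParse(xml_text, asText = TRUE, useInternalNodes = TRUE)
rootNode <- xmlRoot(doc)

root_name   <- xmlName(rootNode)              # name of the root node
child_names <- unname(names(rootNode))        # names of its direct children
field_names <- unname(names(rootNode[[1]]))   # names inside the first <food>
inner_name  <- xmlName(rootNode[[1]][[1]])    # first element inside that <food>
```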