Handling Data Flashcards

1
Q

data()

A
  • Function that lists the built-in data sets available in R.
  • Many packages ship their own built-in data sets.

2
Q

save()

A
  • Function that allows the selective saving of objects.
  • save(junk, junk2, file = "junky.RData") = saves specifically the objects junk and junk2 to an external file named junky.RData in the working directory.
  • The file name does not need to bear any relationship to its contents.
3
Q

load()

A
  • Function that loads saved R objects.
  • load("junky.RData") = reloads the objects stored in junky.RData.
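A minimal sketch of a save() and load() round trip. The objects and the temporary file path are made up for illustration; only the names junk and junk2 come from the cards above.

```r
# Hypothetical objects; the names follow the save() card.
junk  <- c(1, 2, 3)
junk2 <- data.frame(id = 1:2, label = c("a", "b"))

# Save to a temporary file rather than the working directory.
rdata_file <- tempfile(fileext = ".RData")
save(junk, junk2, file = rdata_file)

# Remove the objects, then restore them from the file.
rm(junk, junk2)
restored <- load(rdata_file)  # load() returns the names of the restored objects
print(restored)
```

Capturing the return value of load() is one way to see exactly which objects a file brought back.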

4
Q

load(url("website_url"))

A
  • Nested functions that load an R data file (.RData) from a remote URL.
  • Always check the results of a remote load by reviewing the Environment tab.
5
Q

Reading Excel Files

A
  • Useful packages: XLConnect, xlsx, gdata, readxl, etc.
  • Function (package readxl) to read data from an Excel file into R: read_excel("file_name", sheet = number or name, col_names = TRUE, col_types = NULL, na = "", skip = number of rows to skip, 0 by default)
  • You can always learn more about this function using help(read_excel).
6
Q

Reading Text Files

A
  • Extensions: .txt, .csv, .dat, .tab.
  • In RStudio, Import Dataset in the Environment tab opens a dialog for importing a local text file.
  • If the file is at a Web URL:
    1) Enter the URL.
    2) Choose heading “Yes” if variable names are present.
    3) Leave Strings as Factors unchecked.
    4) Set encoding to automatic.
    5) RStudio should then identify the separator (for example, tab) on its own.
  • You can check your results using the View() function.
7
Q

How to Read Text Files (functions)

A

1) read.csv()
2) read.delim()
3) read.table()

8
Q

Reading Non-conforming Yet Formatted Data

A

1) readLines()

2) scan()

9
Q

read.csv("url_name")

A
  • Function used for reading comma separated files (typically with a .csv extension).
10
Q

read.delim("url_name")

A
  • Function used for reading delimited text files; the default separator is a tab.
11
Q

read.table("url_name")

A
  • Function used for almost any type of text file, as long as a separator exists.
  • More general than either read.csv() or read.delim().
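A small sketch comparing the three reading functions. The sample data and temporary files are invented for illustration; with a real data set you would pass a file path or URL instead.

```r
# Made-up sample data written to temporary files.
csv_file <- tempfile(fileext = ".csv")
tab_file <- tempfile(fileext = ".txt")
writeLines(c("name,score", "Ann,90", "Bob,85"), csv_file)
writeLines(c("name\tscore", "Ann\t90", "Bob\t85"), tab_file)

from_csv   <- read.csv(csv_file)                               # comma separated
from_delim <- read.delim(tab_file)                             # tab separated
from_table <- read.table(tab_file, header = TRUE, sep = "\t")  # separator given explicitly

print(from_csv)
```

Note that read.table() does not assume a header row, so header = TRUE has to be stated, while read.csv() and read.delim() assume one by default.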
12
Q

readLines("url_name")

A
  • Function that will read all or part of a text file.
  • Useful for data files that are irregular, have no delimiter (commas or a separator) or do not conform to a standard format.
  • Will read virtually any file.
13
Q

scan("url_name")

A
  • Function similar to readLines() but it splits the input into fields of a declared type, so it keeps a record of the structure or patterning in the data if you need that information.
  • More restrictive than readLines().
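A brief sketch of readLines() and scan() on the same irregular file. The file contents are invented for illustration.

```r
# A made-up file with a header line that is not part of the data.
raw_file <- tempfile(fileext = ".txt")
writeLines(c("# irregular header", "3 1 4", "1 5 9"), raw_file)

lines <- readLines(raw_file)         # one character string per line
first <- readLines(raw_file, n = 1)  # only the first line

# scan() parses fields of a declared type; skip the header line first.
values <- scan(raw_file, what = numeric(), skip = 1, quiet = TRUE)
print(values)
```

readLines() accepts the header without complaint, whereas scan() would stop with an error if it met the non-numeric header while expecting numbers.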
14
Q

NA / Missing Values

A
  • A place within a vector may be reserved for the missing element by assigning the special value NA.
  • Usually any operation involving NA results in an NA.
  • All types of vectors (character, logical, numeric) can use NA to represent missing values.
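A short sketch of how NA propagates through operations. The vector is made up for illustration.

```r
x <- c(4, NA, 10)

x + 1                   # the NA position stays NA
mean(x)                 # NA, because one element is missing
mean(x, na.rm = TRUE)   # computed from the non-missing values only
is.na(c("a", NA))       # NA also works in character vectors
```

Many summary functions accept na.rm = TRUE as a way to skip missing values instead of returning NA.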
15
Q

Numeric Vectors and NA Entries

A
  • Numeric vectors can also contain the special values Inf and -Inf (positive and negative “infinity”) and NaN (not a number).
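A quick sketch of where these special values come from and how they relate to NA.

```r
pos <- 1 / 0    # Inf
neg <- -1 / 0   # -Inf
nan <- 0 / 0    # NaN

is.finite(c(1, pos, neg, nan, NA))  # only the first element is finite
is.na(nan)                          # TRUE: NaN counts as missing
is.nan(NA)                          # FALSE: NA is not NaN
```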
16
Q

na.omit() / complete.cases()

A
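  • na.omit() removes observations (rows) with missing values from a dataset.
  • complete.cases() returns a logical vector marking the rows that have no missing values, which can be used to keep only those rows.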
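A minimal sketch of both functions on a small data frame. The data frame survey and its values are hypothetical.

```r
# Hypothetical data with missing values in two rows.
survey <- data.frame(
  id   = 1:4,
  age  = c(23, NA, 31, 40),
  city = c("Oslo", "Rome", NA, "Lima")
)

complete.cases(survey)                       # FALSE marks rows containing an NA
kept_a <- na.omit(survey)                    # drops incomplete rows directly
kept_b <- survey[complete.cases(survey), ]   # same rows via logical indexing
print(kept_a)
```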
17
Q

Package foreign

A
  • Package that allows us to read data files written by competitors of R (DBF, Stata, Epi Info, Minitab, Octave, SPSS, SAS and Systat).
  • SAS files must be in transport (XPORT) format for package foreign.
  • SPSS files require the option to.data.frame = TRUE to be converted into data frames.
  • Stata files can be read up to Stata 12.
18
Q

Package haven

A
  • Package that will only read data files from Stata, SPSS and SAS.
  • Can also be used to write Stata and SPSS files.
  • SAS files can be read without conversion (.sas7bdat files).
  • SPSS files are converted into data frames automatically.
  • Stata files can be read up to Stata 13.
19
Q

Stata

A
  • A competing statistical software package.
  • R can both read and write Stata files.

20
Q

read.dta()

A
  • Function in package foreign that can read local Stata files or remotely stored Stata files using a web address.
  • Returns a data frame.
21
Q

read_dta() + read_stata()

A

  • Functions in package haven that can read local Stata files or remotely stored Stata files using a web address.
  • Return a data frame.

22
Q

write.dta()

A
  • Function in package foreign that will export R data to Stata.
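A minimal sketch of a Stata round trip with package foreign, assuming the package is installed (it is normally shipped with R as a recommended package). The data frame scores and the temporary file are hypothetical.

```r
library(foreign)  # assumed to be installed

# Hypothetical data to export.
scores <- data.frame(id = 1:3, score = c(2.5, 3.5, 4.5))

dta_file <- tempfile(fileext = ".dta")
write.dta(scores, dta_file)   # export to a Stata file
back <- read.dta(dta_file)    # read it back as a data frame
print(back)
```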
23
Q

write_dta()

A
  • Function in package haven that will export R data to Stata.
24
Q

SPSS

A
  • A competing statistical software package.
  • R can both read and write SPSS files.
  • Package haven does a much better job of reading these files than package foreign.
25
read.spss()
- Function in package foreign that is used to convert SPSS data files into R objects.
- Can read local or remotely stored files.
- Needs the option to.data.frame = TRUE to make the resulting R object a data frame; otherwise read.spss() returns a list.
26
read_spss() + read_sav()
- Functions in package haven that read SPSS data files and convert them into data frames.
27
write_sav()
- Function in package haven that can write SPSS files.
- Package foreign cannot write native SPSS (.sav) files.
28
Differences Between Packages Foreign and Haven (SPSS)
- foreign preserves the labels (labelled variables become factors), whereas haven keeps the variables as numbers and stores the labels alongside them.
- These numbers can be converted back into labelled factors using the function as_factor().
29
SAS
- A competing statistical software package.
- Package foreign can read SAS data files, but those files must be in a portable format created by SAS software, or the user must have a copy of SAS software on the local computer.
30
read_sas()
- Function in package haven that imports standard SAS files directly to an R data frame.
- Does not require SAS.
31
write.table(object_to_write, file = "name_of_file", sep = "\t", row.names = FALSE)
- Function that coerces an R object to a data frame (if it isn't already one) and then saves it as an external text file that can be imported into non-R applications.
- The data frame is written with space delimited columns by default, but sep can change this to tab ("\t"), comma or virtually anything.
- row.names = FALSE leaves out the column of row numbers.
- Missing values are written as NA by default, which can also be changed.
32
write.csv(object_to_write, file = "name_of_file", row.names = FALSE)
- Function that is a special case of write.table().
- When used, the result will be a comma separated values file.
- .csv files are a common format for data that is to be exchanged between software.
33
Results of write.table() or write.csv()
- Tab separated (delimited) files will usually have quoted character values and a "jagged" appearance, with what appear to be spaces between variables (fields).
- Comma separated (delimited) files will also have quoted character values, but commas separate the fields; they also appear "jagged", but the separation between the fields is easier to see.
- Both will typically have a first row which has the names of the columns.
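A small sketch that writes the same data frame with write.table() and write.csv() and then inspects the raw lines. The data frame scores and the temporary files are made up for illustration.

```r
# Hypothetical data with one missing value.
scores <- data.frame(name = c("Ann", "Bob"), score = c(90, NA))

tab_file <- tempfile(fileext = ".txt")
csv_file <- tempfile(fileext = ".csv")

write.table(scores, file = tab_file, sep = "\t", row.names = FALSE)
write.csv(scores, file = csv_file, row.names = FALSE)

# Look at the raw text: character values are quoted, NA is written as NA.
print(readLines(tab_file))
print(readLines(csv_file))
```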
34
subset(x, subset, select, drop = FALSE, ...)
- Function that returns subsets of a vector, matrix or data frame that meet certain conditions.
- x: the data frame / object.
- subset: a logical condition selecting the rows (or elements) to keep.
- select: which columns to keep.
- drop: passed on to the [ indexing operator; if TRUE, a result with a single column may be simplified to a vector.
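A minimal sketch of subset() on a data frame. The data frame survey and the cutoff of 30 are hypothetical.

```r
# Hypothetical data.
survey <- data.frame(
  id   = 1:4,
  age  = c(23, 35, 31, 40),
  city = c("Oslo", "Rome", "Oslo", "Lima")
)

# Keep rows where age is above 30, and only the id and age columns.
older <- subset(survey, subset = age > 30, select = c(id, age))
print(older)
```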
35
Substitutions
- Entries of a data frame or matrix can be changed by assigning the desired values directly to the indexed positions.
36
is.na()
- Function that returns TRUE for every NA in an object.
- Combined with indexing, it lets you replace all NAs with something of your choice.
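A short sketch of replacing missing values with is.na() and indexing. The vector and the replacement value 0 are arbitrary choices for illustration.

```r
x <- c(4, NA, 10, NA)

is.na(x)            # logical vector, TRUE where a value is missing
x[is.na(x)] <- 0    # assign the replacement to the missing positions
print(x)
```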
37
$
- Operator that extracts the column with the given name from a data frame, or adds a new column when used on the left side of an assignment.
- data_frame_name$column_name
38
Adding new columns + Combining vectors with cbind() + Adding new rows + Combining data frames with rbind()
- Can use cbind() and rbind() to add new columns or rows to a data frame.
- A new column can also be added by assignment with $ (e.g., junk$Name).
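A small sketch of growing a data frame by column and by row. The data frame junk reuses the name from the card; its contents are invented.

```r
# Hypothetical starting data frame.
junk <- data.frame(id = 1:2, score = c(90, 85))

junk$Name <- c("Ann", "Bob")                 # new column via $
junk <- cbind(junk, passed = c(TRUE, TRUE))  # new column via cbind()

# A new row must supply the same column names.
new_row <- data.frame(id = 3L, score = 70, Name = "Cy", passed = FALSE)
junk <- rbind(junk, new_row)
print(junk)
```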
39
Merging Data Frames
- Data frames can be merged with merge(), but they must share matching information (such as a common key column).
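A minimal sketch of merging two data frames on a shared key. The data frames and the key column id are hypothetical.

```r
# Hypothetical data frames sharing the key column id.
people <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
scores <- data.frame(id = c(2, 3, 4), score = c(85, 70, 60))

inner <- merge(people, scores, by = "id")              # only ids present in both
outer <- merge(people, scores, by = "id", all = TRUE)  # all ids, NA where unmatched
print(outer)
```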
40
List Extraction Methods
- Can use $, [] or [[]].
- [[]] and $ extract a single component itself (for example, a vector).
- [] returns a sub-list containing the selected components.
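A quick sketch of the three extraction methods on a small list. The list record is made up for illustration.

```r
record <- list(name = "Ann", scores = c(90, 85))

as_list   <- record["scores"]    # a list of length 1
as_vector <- record[["scores"]]  # the numeric vector itself
by_name   <- record$scores       # same as [["scores"]]

str(as_list)
str(as_vector)
```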
41
Web Scraping
- Needs the XML package.
- Try the readHTMLTable() function first: it parses an HTML page and retrieves the table elements.
- Use readHTMLTable("url_name", stringsAsFactors = FALSE).
42
unlist()
- Function that converts a list into a vector.
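A short sketch of unlist() flattening a list, including a nested one. The list parts is hypothetical.

```r
parts <- list(a = 1, b = c(2, 3), c = list(4))

flat <- unlist(parts)  # nested components are flattened too
print(flat)
```

This is often useful after scraping, when results arrive as a list but a plain vector is easier to work with.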
43
Package rvest
- Package that also contains many tools to scrape webpages.