Data Scientist's Toolbox Flashcards

1
Q

Command Line Interface

A

A text window where you type commands to run programs and navigate the computer's file structure; for example, Git Bash on Windows and Terminal on Mac

2
Q

Working Directory

A

Whatever directory you are currently in when using the Command Line Interface.

3
Q

Path

A

All of the directories you need to navigate through to get back to your root directory from the directory you’re in now

4
Q

Home Directory

A

The directory the command line interface opens into on your machine

5
Q

Root Directory

A

The highest point in the directory structure you can go to; nothing on your computer sits above it

6
Q

pwd

A

A command that shows the path to the working directory you’re currently in. Stands for “print working directory”

7
Q

The CLI Command recipe

A

command flags arguments

“command” is the CLI command that does a certain task

“flags” are options given to the command to trigger certain behaviours, preceded by a “-”

“arguments” are what the command is going to act on, or further options; not every command takes them

Eg. pwd is a command with no flags or arguments
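
As a rough sketch of the recipe above, the lines below run a bare command and then a command with one flag and one argument. The scratch directory from `mktemp -d` and the names `scratch` and `listing` are illustrative assumptions, not part of the original card.

```shell
# Command only: no flags, no arguments.
pwd

# Command + flag + argument.
scratch=$(mktemp -d)           # hypothetical scratch directory for the demo
touch "$scratch/notes.txt"     # give the directory something to list
listing=$(ls -a "$scratch")    # "ls" = command, "-a" = flag, "$scratch" = argument
echo "$listing"
```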

8
Q

CLI Command: clear

A

Clears everything on the screen and leaves a fresh prompt; the working directory does not change

9
Q

CLI Command: ls

ls -a?

ls -al?

A

“ls” lists files and folders in the current directory

“ls -a” lists hidden and unhidden files and folders

“ls -l” lists details for files and folders in the current directory

“ls -al” lists details for hidden AND unhidden files and folders

-a and -l are flags
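
A small sketch of how these flags differ, assuming a throwaway directory made with `mktemp -d`. The file names `visible.txt` and `.hidden` and the variables `demo`, `plain`, and `all` are made up for illustration.

```shell
demo=$(mktemp -d)                          # hypothetical scratch directory
touch "$demo/visible.txt" "$demo/.hidden"  # one normal file, one hidden file

plain=$(ls "$demo")       # hidden entries are left out
all=$(ls -a "$demo")      # hidden entries are included, along with . and ..

echo "$plain"
echo "$all"
ls -al "$demo"            # details for hidden and unhidden entries together
```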

10
Q

How to identify a hidden file or folder in the CLI

A

It will start with a “.” and only be visible with certain commands

11
Q

CLI Command: cd

A

“cd” stands for “change directory”

“cd” accepts as an argument the directory you want to change to

“cd” with no argument goes back to the home directory

“cd ..” goes up one level in the directory path
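
The moves described above can be sketched as follows, assuming a scratch directory from `mktemp -d`. The names `start`, `demo`, `child`, and `up` are illustrative only.

```shell
start=$(pwd)              # remember where we began
demo=$(mktemp -d)         # hypothetical scratch directory
mkdir "$demo/child"

cd "$demo/child"          # cd with an argument: change to that directory
pwd
cd ..                     # go up one level, back to the scratch directory
up=$(pwd)
echo "$up"

# "cd" with no argument would go to the home directory;
# it is not run here so the demo stays inside the scratch directory.
cd "$start"               # return to where we started
```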

12
Q

What does “/” represent in the CLI?

A

The Root Directory

13
Q

What does “~” represent in the CLI?

A

The Home directory

14
Q

CLI Command: mkdir

A

“mkdir” stands for “make directory”

same as making a new folder in the GUI

mkdir accepts as an argument the name of the directory being created

15
Q

CLI Command: touch

A

“touch” creates an empty file

“touch” accepts as an argument the name of the file being created
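
A minimal sketch combining “mkdir” from the previous card with “touch”, assuming a scratch directory from `mktemp -d`. The names `project` and `todo.txt` are hypothetical.

```shell
demo=$(mktemp -d)                  # hypothetical scratch directory
mkdir "$demo/project"              # make a new directory
touch "$demo/project/todo.txt"     # create an empty file inside it

ls -l "$demo/project"              # the long listing shows the new, empty file
```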

16
Q

CLI Command: cp

A

“cp” stands for “copy”

“cp” takes as a first argument a file, and as a second argument (separated by a space) the name of the path to where the file should be copied

the “-r” (recursive) flag can also be used to indicate the contents of a directory should be copied to the new directory
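
As a sketch of both forms, assuming a scratch directory from `mktemp -d`; the directory and file names (`src`, `backup`, `src_copy`, `a.txt`) are invented for the example.

```shell
demo=$(mktemp -d)                      # hypothetical scratch directory
mkdir "$demo/src" "$demo/backup"
echo "hello" > "$demo/src/a.txt"

cp "$demo/src/a.txt" "$demo/backup"    # copy one file into another directory
cp -r "$demo/src" "$demo/src_copy"     # copy a directory together with its contents

ls "$demo/backup" "$demo/src_copy"
```

The original file stays where it was; “cp” only adds copies.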

17
Q

CLI Command: rm

A

“rm” stands for “remove”

“rm” takes as an argument the name of a file you want to remove

“rm” can be modified with the flag “-r” and a directory as an argument to remove an entire directory (Note: this cannot be undone)
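
A cautious sketch of both uses, confined to a scratch directory from `mktemp -d` so nothing of value is deleted. The names `draft.txt`, `old`, and `data.csv` are hypothetical.

```shell
demo=$(mktemp -d)                  # hypothetical scratch directory
mkdir "$demo/old"
touch "$demo/draft.txt" "$demo/old/data.csv"

rm "$demo/draft.txt"               # remove a single file
rm -r "$demo/old"                  # remove a directory and everything inside it

ls -a "$demo"                      # only . and .. remain
```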

18
Q

CLI Command: mv

A

“mv” stands for “move” and can be used to move a file from one directory to another

it accepts as a first argument the name of the file to be moved, and as a second argument the directory to move it to

“mv” can also be used to rename files by using the desired filename as the second argument
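
Both uses can be sketched in a scratch directory from `mktemp -d`; the file and directory names below are assumptions made for the example.

```shell
demo=$(mktemp -d)                      # hypothetical scratch directory
mkdir "$demo/archive"
touch "$demo/report.txt"

mv "$demo/report.txt" "$demo/archive"  # move the file into another directory
mv "$demo/archive/report.txt" "$demo/archive/report_final.txt"  # rename it in place

ls "$demo/archive"
```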

19
Q

CLI Command: echo

A

“echo” will print whatever arguments are provided after

Useful for printing out the contents of variables that have been stored
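
A short sketch of printing literal text and a stored variable; the variable names `greeting` and `shown` are illustrative.

```shell
greeting="hello world"       # hypothetical variable for the demo
echo "literal text"          # prints its arguments as given
shown=$(echo "$greeting")    # the shell substitutes the variable's value before echo runs
echo "$shown"
```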

20
Q

CLI command: date

A

“date” will print out the current date and time

21
Q

Adding to an index

A

“git add .” adds all new files

“git add -u” updates tracking for files that were renamed or deleted

“git add -A” does both of the previous (note the capital A)

  • should be done before committing
22
Q

Committing to a local repository

A

“git commit -m “message”” where “message” is a useful description of the work that was done

Note: this only updates the local repo, not the remote one
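
A minimal sketch of staging and then committing, run in a throwaway repository so it touches nothing real. The repository location, the file `analysis.txt`, and the identity passed with `-c` are all assumptions made for the demo; in normal use your own Git identity is already configured.

```shell
repo=$(mktemp -d)                  # hypothetical scratch repository
cd "$repo"
git init -q
echo "first draft" > analysis.txt

git add -A                         # stage new, changed, and deleted files
# -c supplies a throwaway identity for this one command (assumed values)
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit -q -m "Add first draft of analysis"

git log --oneline                  # the commit now exists in the local repo only
```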

23
Q

Pushing to Github

A

“git push” takes all commits made since the last push and sends them to GitHub

24
Q

Branch Commands

A

“git checkout -b branchname” creates a new branch with the name given in place of branchname and switches to it

“git branch” shows what branch you are on

“git checkout master” switches to the master branch
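
A sketch of the three commands in a throwaway repository. The branch name `experiment` and the `-c` identity are assumptions for the demo. Because the default branch may be called master or main depending on the Git setup, the sketch records its name in `default_branch` rather than hard-coding it.

```shell
repo=$(mktemp -d)                  # hypothetical scratch repository
cd "$repo"
git init -q
# a branch only really exists once it has a commit, so make an empty one
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit -q --allow-empty -m "Initial commit"

default_branch=$(git rev-parse --abbrev-ref HEAD)   # master or main

git checkout -q -b experiment      # create a new branch and switch to it
git branch                         # the current branch is marked with *
on_new=$(git rev-parse --abbrev-ref HEAD)

git checkout -q "$default_branch"  # switch back to the default branch
```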

25
Q

Pull Request

A
  • a feature of GitHub rather than of Git itself
  • if you fork someone’s repo or work on a separate branch, your changes eventually need to be merged back; a pull request asks for them to be reviewed and merged, and it’s opened from GitHub
26
Q

How to install R packages? Multiple packages?

A

Type “install.packages(“packagename”)”

For multiple packages: “install.packages(c(“packagename”, “packagename2”))”

Or use the “Tools” menu in RStudio

then use “library(packagename)” to load the package

Note: a package won’t load if the packages it depends on are not installed

then “search()” to see the list of attached packages, whose functions are now available

27
Q

Types of Data Science Questions in order of difficulty

A

From easiest to hardest:

Descriptive
Exploratory
Inferential
Predictive
Causal
Mechanistic
28
Q

Descriptive Analysis: Goal? Example?

A

Goal: Describe a set of Data

  • commonly applied to census data
  • description and interpretation are different steps
  • descriptions usually can’t be generalized without additional modelling

Eg. www.census.gov/2010census
books.google.com/ngrams

29
Q

Exploratory Analysis: Goal? Example(s)?

A

Goal - find relationships you didn’t know about

  • good for discovering new connections
  • useful for defining future studies
  • exploratory analysis usually not the final say
  • should not be used for generalizing or predicting without further analysis
  • correlation does not imply causation

Eg. sdss.org - sloan digital sky survey

30
Q

Inferential Analysis: Goal? Example(s)?

A

Goal: use a relatively small sample of data to say something about a bigger population

  • commonly the goal of statistical models
  • involves estimating both the quantity cared about and the uncertainty of the estimate
  • depends on population and sampling scheme

Eg. Effect of Air Pollution Control on Life Expectancy in the United States: An analysis of 545 US Counties for the period from 2000 to 2007

31
Q

Predictive Analysis: Goal? Example(s)?

A

Goal: To use the data on some objects to predict values for another object

  • If X predicts Y, it does not mean that X causes Y
  • Accurate prediction depends heavily on measuring the right variables
  • more data and a simple model tend to work really well
  • prediction is very hard

Example: 538 blog (Nate Silver) predicting the presidential election; Target using the purchases someone has made to determine that they’re pregnant

32
Q

Causal Analysis: Goal? Example(s)?

A

Goal: Find out what happens to one variable when another variable changes

  • Usually randomized studies are required to identify causation
  • there are some approaches to inferring causation in non-randomized studies, but they’re finicky
  • causal relationships are usually identified as “average effects” but may not apply to every individual in the population
  • causal models are the “gold standard” for data analysis

Eg. Medical studies for new processes or drugs

33
Q

Mechanistic Analysis: Goal? Example(s)?

A

Goal: Understand the exact changes in variables that lead to change in other variables for individual objects

  • incredibly hard to infer, except in simple situations
  • usually modelled by a deterministic set of equations (physical/engineering science) where all variables can be carefully controlled
  • generally the random component of the data is measurement error
  • if the equations are known but the parameters are not, they may be inferred by data analysis

Example: Pavement design - what changes lead to different outcomes in function? www.fhwa.dot.gov/resourcecenter/teams/pavement/pave_3pdg.pdf

34
Q

Definition: Data

A

“Data are values of qualitative or quantitative variables, belonging to a set of items”

Set of items: Also known as the population; the set of objects you’re interested in

Qualitative: Country of origin, sex, treatment

Quantitative: Height, weight

35
Q

Variability

A

Also called dispersion, scatter, or spread; describes how stretched or squeezed the distribution of a measured variable is. Commonly measured by variance, standard deviation, or interquartile range

36
Q

Confounding

A

When an extraneous variable correlates with both measured variables, creating the false appearance of a direct relationship between those two variables. Eg. As shoe size goes up, so does literacy, but this is actually because as people become older (to a certain point) both literacy and shoe size increase

37
Q

Spurious Correlation

A

When two variables are correlated but the correlation is almost entirely due to some other, far more important, variable

38
Q

Values of good experiments

A
  • Can be replicated
  • Measure variability
  • Generalize to the problem
  • Are transparent
39
Q

Data dredging

A
  • data dredging is when many different hypotheses are tested against a large data set in search of some correlation; with enough testing some correlation will eventually be found, but there is a good chance it is a false positive