INTRO+DATASETS Flashcards

1
Q

What is data science?

A

The process of building, cleaning, and structuring datasets in order to analyse them and extract meaning.

2
Q

Process of Data Science

A
  1. Ask an interesting question
  2. Get the data
  3. Explore the data
  4. Model the data
  5. Visualize and communicate the results
3
Q

Key principles in data science (DS)

A
  • Draw on many data sources
  • Understand how the data were collected
  • Use statistical models
  • Understand correlations
  • Have good communication skills
4
Q

What does the discussion of probability include

A

Random experiments that produce a set of possible outcomes (the number of outcomes can be infinite).

5
Q

elements of probability model (uncertainty of experiment)

A
  • Sample space (Ω, omega): the set that contains all possible outcomes. Outcomes are mutually exclusive and collectively exhaustive. An event is a collection of one or more outcomes, i.e. a subset of the sample space.
  • Probability function P(A): assigns each event A a number between 0 and 1. The complement of event A is A^c, with P(A^c) = 1 - P(A).
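A minimal sketch of these two elements in R, assuming a fair six-sided die as the random experiment. The die and the names omega, A, and prob are illustrative and do not come from the cards.

```r
# Sample space for one roll of a fair die (assumed example)
omega <- 1:6

# Event A: the roll is even (a subset of the sample space)
A <- c(2, 4, 6)

# With equally likely outcomes, P(event) = outcomes in event / outcomes in omega
prob <- function(event) length(event) / length(omega)

p_A <- prob(A)                # 0.5

# Complement of A: every outcome in omega that is not in A
A_c <- setdiff(omega, A)      # 1 3 5
p_A_c <- prob(A_c)            # 1 - p_A = 0.5
```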
6
Q

conditional probability

A

Probability of event A given that event B has occurred: P(A | B) = P(A ∩ B) / P(B). P(B) is the denominator, so it must be greater than 0.
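As a rough illustration, the conditional probability can be computed by counting outcomes. The fair die and the events A and B below are assumed for the example only.

```r
omega <- 1:6          # fair die (assumed example)
A <- c(2, 4, 6)       # event A: the roll is even
B <- c(4, 5, 6)       # event B: the roll is greater than 3

# Equally likely outcomes, so probability is a ratio of counts
prob <- function(event) length(event) / length(omega)

# P(A | B) = P(A and B) / P(B); P(B) is the denominator
p_A_given_B <- prob(intersect(A, B)) / prob(B)   # (2/6) / (3/6) = 2/3
```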

7
Q

independent

A

A and B are independent if the occurrence of B provides no information about A. Equivalently, P(A ∩ B) = P(A) * P(B).
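The product rule can be checked numerically. This sketch assumes a fair die; the events and the helper is_independent are hypothetical names chosen for the example.

```r
omega <- 1:6          # fair die (assumed example)
prob <- function(event) length(event) / length(omega)

# Independent if P(e1 and e2) equals P(e1) * P(e2), up to rounding error
is_independent <- function(e1, e2) {
  isTRUE(all.equal(prob(intersect(e1, e2)), prob(e1) * prob(e2)))
}

even    <- c(2, 4, 6)   # P = 1/2
at_most_two <- c(1, 2)  # P = 1/3
above_three <- c(4, 5, 6)  # P = 1/2

indep_1 <- is_independent(even, at_most_two)   # TRUE:  1/6 equals 1/2 * 1/3
indep_2 <- is_independent(even, above_three)   # FALSE: 2/6 differs from 1/2 * 1/2
```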

8
Q

Variable?

A

A variable is any characteristic observed in a study; it summarises all the outcomes of a random process.

9
Q

quantitative variable

A

Takes numeric values, with a meaningful distance between any two data points.

10
Q

types of categorical variable

A
  • ordinal (categories have a natural order)
  • nominal (categories have no natural order)

11
Q

types of quantitative variable

A
  • discrete (separate numbers)
  • continuous (possible values form an interval)

12
Q

distribution of a variable (probability distribution)

A

A list of the possible outcomes together with the probability associated with each outcome.

13
Q

Cumulative probability distribution

A

probability that the discrete variable is less than or equal to a particular value.
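A small sketch of a discrete distribution and its cumulative version in R. The example variable (number of heads in two fair coin tosses) and the column names are assumptions made for illustration.

```r
# Number of heads in two fair coin tosses (assumed example)
outcomes <- c(0, 1, 2)
probs    <- c(0.25, 0.50, 0.25)   # probabilities sum to 1

# Probability distribution: each outcome with its probability
distribution <- data.frame(outcome = outcomes, probability = probs)

# Cumulative distribution: P(X <= value), a running sum of the probabilities
distribution$cumulative <- cumsum(distribution$probability)   # 0.25 0.75 1.00
```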

14
Q

probability density function (used for a continuous variable, as it is impossible to list every value and the probability of each)

A

A probability density function (PDF) describes a continuous variable: the area under the PDF over an interval is the probability that the value falls within that interval.

15
Q

cumulative distribution function

A

Cumulative distribution function (CDF) is the probability that the variable is less than or equal to a particular value.
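The link between a PDF and a CDF can be checked numerically in R. This sketch assumes an exponential distribution with rate 1 (the default of dexp and pexp), chosen only as a convenient continuous example; a and b are arbitrary interval endpoints.

```r
a <- 0.5
b <- 2

# P(a <= X <= b) as the area under the PDF between a and b
area_under_pdf <- integrate(dexp, lower = a, upper = b)$value

# The same probability from the CDF: P(X <= b) - P(X <= a)
cdf_difference <- pexp(b) - pexp(a)

# Both values agree up to numerical integration error
```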

16
Q

modal category?

A

category with the highest frequency

17
Q

Bar plot (common way to display categorical variable)

A

One vertical bar for each possible category that could occur,
with the height proportional to the frequency of that category.
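A minimal sketch of how the category frequencies, and from them the modal category, might be obtained in R. The colours vector is made-up example data.

```r
# Made-up categorical data
colours <- c("red", "blue", "red", "green", "red", "blue")

# Frequency of each category (table() orders categories alphabetically)
freq <- table(colours)    # blue 2, green 1, red 3

# Modal category: the one with the highest frequency
modal_category <- names(freq)[which.max(freq)]   # "red"

# To draw the bar plot for these counts: barplot(freq, ylab = "Frequency")
```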

18
Q

Histogram (quantitative variable)

A
  • Divide the range of data into intervals of equal width.
  • Count the number of observations that fall within each interval.
  • Label the intervals on the x-axis.
  • Draw a bar over each interval, with height equal to the count for that interval
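The steps above can be sketched with R's hist(). The data and the interval width of 2 are assumptions for the example; plot = FALSE returns the counts without drawing, and leaving it out draws the bars.

```r
# Made-up quantitative data
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)

# Intervals of equal width 2 covering the range of the data
breaks <- seq(0, 10, by = 2)

# Count the observations in each interval (right-closed by default)
h <- hist(x, breaks = breaks, plot = FALSE)

h$breaks   # interval boundaries: 0 2 4 6 8 10
h$counts   # counts per interval: 3 5 1 0 1
```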
19
Q

Weakness of range?

A

sensitive to extreme observations

20
Q

variance definition

A

The average of the squared deviations from the mean.
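A short worked example in R, using made-up data. Note that R's var() divides by n - 1 (the sample variance), while a plain average of squared deviations divides by n.

```r
# Made-up data with mean 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

deviations <- x - mean(x)

# Plain average of squared deviations (divide by n)
pop_var <- mean(deviations^2)                        # 32 / 8 = 4

# Sample variance (divide by n - 1), which is what var() computes
samp_var <- sum(deviations^2) / (length(x) - 1)      # 32 / 7
```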

21
Q

Empirical rule for standard deviation (SD)

A
  • For roughly bell-shaped data, about 68% of observations fall within ±1 SD of the mean
  • About 95% fall within ±2 SD
  • Almost all (about 99.7%) fall within ±3 SD (useful as a check for outliers)
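These percentages can be reproduced from the standard normal CDF in R, on the assumption that the data follow a normal distribution. The names k and within_k_sd are illustrative.

```r
# Number of standard deviations either side of the mean
k <- 1:3

# P(-k <= Z <= k) for a standard normal variable Z
within_k_sd <- pnorm(k) - pnorm(-k)

round(within_k_sd, 4)   # 0.6827 0.9545 0.9973
```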
22
Q

interquartile range

A

The difference between the upper and lower quartiles (Q3 - Q1); robust to outliers.

23
Q

5 number summary

A

Minimum, lower quartile, median (X0.5), upper quartile, maximum (the minimum and maximum are taken without considering outliers).
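A sketch comparing the range, the interquartile range, and the five-number summary in R on made-up data with one extreme value. Base R's fivenum() keeps the outlier in the minimum and maximum, whereas boxplot.stats() reports whisker ends that exclude points more than 1.5 times the IQR beyond the quartiles, which appears to be the convention the card describes.

```r
# Made-up data with one extreme observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

full_range <- diff(range(x))   # 99: dominated by the extreme value
iqr <- IQR(x)                  # 4: unaffected by the extreme value

# Five-number summary including the extreme value
five <- fivenum(x)             # 1 3 5 7 100

# Box plot statistics: min and max exclude flagged outliers
bs <- boxplot.stats(x)
bs$stats                       # 1 3 5 7 8
bs$out                         # 100
```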

24
Q

when does an association exist?

A

An association exists if particular values of one variable (the response, or dependent, variable) are more likely to occur with certain values of another variable (the explanatory, or independent, variable).

25
covariance
measures the extent to which two variables move in the same direction
26
correlation
covariance between two variables divided by the product of their standard deviations
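A quick check of this definition in R, using made-up paired data. cov(), sd(), and cor() are base R functions; cor_manual is just a name for the hand-computed value.

```r
# Made-up paired observations
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Covariance: positive here, so x and y tend to move in the same direction
cov_xy <- cov(x, y)                       # 1.5

# Correlation: covariance divided by the product of the standard deviations
cor_manual <- cov_xy / (sd(x) * sd(y))

# cor_manual matches the built-in cor(x, y)
```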
27
To check your working directory
getwd()
28
get data types
class(a) (where a is an object that has already been assigned)
29
Class of TRUE/FALSE values?
logical
30
Create a vector of numbers and name it x
x = c(1, 2, 3, 4) | printing x gives 1 2 3 4 | class(x) is "numeric"
31
length of vector
length(x)
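The last few cards can be tried together in one short session. The variable flag is an illustrative name that is not in the cards.

```r
x <- c(1, 2, 3, 4)    # numeric vector

class(x)              # "numeric"
length(x)             # 4

# A comparison produces TRUE/FALSE values, whose class is "logical"
flag <- x > 2         # FALSE FALSE TRUE TRUE
class(flag)           # "logical"
```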
32
alternative ways to write | x = matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
x=matrix(c(1,2,3,4),2,2) | x=matrix(1:4,2,2)
33
Fill a matrix by row first?
y=matrix(1:4,2,2,byrow=TRUE)
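A side-by-side sketch of the two fill orders, for comparison. By default matrix() fills column by column; byrow = TRUE fills row by row.

```r
# Default: filled down the columns
x <- matrix(1:4, nrow = 2, ncol = 2)
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4

# byrow = TRUE: filled across the rows
y <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4

# y is the transpose of x
same_as_transpose <- identical(y, t(x))   # TRUE
```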
34
class of matrix?
'matrix' 'array'
35
dimension of matrix
dim(x) returns 2 2 (rows, then columns)
36
extract component from row 2, column 3 of matrix A
A[2,3]
37
Extract the first row of A
A[1,]
38
Delete the first row of A
A[-1,] (returns A without its first row; assign the result, e.g. A = A[-1,], to keep the change)
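The indexing cards can be tried on a small example. The 2 x 3 matrix A below is assumed for illustration; drop = FALSE is a standard argument of `[` that keeps the result as a matrix.

```r
A <- matrix(1:6, nrow = 2, ncol = 3)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

dim(A)                        # 2 3 (rows, then columns)
A[2, 3]                       # 6: row 2, column 3
A[1, ]                        # 1 3 5: the whole first row

# A negative index drops that row; A itself is unchanged
B <- A[-1, ]                  # 2 4 6, simplified to a vector
B2 <- A[-1, , drop = FALSE]   # still a 1 x 3 matrix
```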
39
list out all objects?
ls()
40
Remove one or all objects?
rm(x) removes one object | rm(list=ls()) removes all objects (the list argument must be a character vector of object names)
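A cautious way to try ls() and rm() without clearing your own workspace is to work inside a scratch environment. The environment e and the objects in it are assumptions for this sketch; in an ordinary session the envir argument is left out.

```r
# Scratch environment so the example does not touch the global workspace
e <- new.env()
assign("x", 1, envir = e)
assign("y", 2, envir = e)

before <- ls(envir = e)              # "x" "y"

# Remove one object by name
rm("x", envir = e)
after_one <- ls(envir = e)           # "y"

# Remove everything that ls() reports (a character vector of names)
rm(list = ls(envir = e), envir = e)
after_all <- ls(envir = e)           # character(0)
```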