INTRO+DATASETS Flashcards by Eliza Ong

What is Data Science

the process of building, cleaning and structuring datasets in order to analyse them and extract meaning

Process of Data Science

Ask an interesting question
Get the data
Explore the data
Model the data
Visualize and communicate the results

key principles in DS

get many data sources
understand how data collected
use statistical models
understand correlations
good communication skills

What does the discussion of probability include

- random experiments, each of which results in one of a set of possible outcomes (there can be infinitely many outcomes)

elements of probability model (uncertainty of experiment)

sample space Ω (omega): the set that contains all possible outcomes. Outcomes are mutually exclusive and collectively exhaustive. An event is a collection of one or more outcomes, i.e. a subset of the sample space.
probability function P(A): assigns event A a number between 0 and 1. The complement of event A is A^c, with P(A^c) = 1 - P(A).

conditional probability

probability of event A given that event B has occurred: P(A|B) = P(A ∩ B) / P(B), with P(B) > 0 in the denominator.
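
A minimal R sketch of this definition, assuming a fair six-sided die as a hypothetical example (the events A and B below are made up for illustration and are not part of the deck):

```r
# Sample space of one fair die roll; every outcome has probability 1/6.
omega <- 1:6
p <- rep(1 / 6, 6)

A <- c(2, 4, 6)   # hypothetical event: the roll is even
B <- c(4, 5, 6)   # hypothetical event: the roll is at least 4

p_B <- sum(p[omega %in% B])                       # P(B)
p_A_and_B <- sum(p[omega %in% intersect(A, B)])   # P(A and B)

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_A_given_B <- p_A_and_B / p_B
print(p_A_given_B)   # 2/3: two of the three outcomes in B are even
```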

independent

A and B are independent if the occurrence of B provides no information about A, i.e. P(A|B) = P(A). Equivalently, P(A ∩ B) = P(A) × P(B).
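
The product rule can be checked numerically. The sketch below assumes a fair die again; the events and the helper names `prob` and `is_independent` are hypothetical, chosen only for illustration:

```r
omega <- 1:6
p <- rep(1 / 6, 6)

# Probability of an event (a subset of the sample space).
prob <- function(event) sum(p[omega %in% event])

# Independence check: P(E1 and E2) equals P(E1) * P(E2), up to rounding.
is_independent <- function(e1, e2) {
  isTRUE(all.equal(prob(intersect(e1, e2)), prob(e1) * prob(e2)))
}

A <- c(2, 4, 6)      # roll is even,       P(A) = 1/2
B <- c(1, 2, 3, 4)   # roll is at most 4,  P(B) = 2/3
C <- c(4, 5, 6)      # roll is at least 4, P(C) = 1/2

is_independent(A, B)   # TRUE:  P(A and B) = 1/3 = 1/2 * 2/3
is_independent(A, C)   # FALSE: P(A and C) = 1/3, but 1/2 * 1/2 = 1/4
```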

Variable?

a variable is any characteristic observed in a study; it summarises all the outcomes of a random process

quantitative variable

there is meaningful distance between any 2 points of data

types of categorical variable

- ordinal (categories have a natural order)
- nominal (categories have no natural order)

types of quantitative variable

- discrete (separate numbers)
- continuous (possible values form an interval)

distribution of a variable (probability distribution)

list of the possible outcomes together with their associated probabilities

Cumulative probability distribution

probability that the discrete variable is less than or equal to a particular value.
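
As a small illustration, the cumulative distribution of a discrete variable is the running sum of its probability distribution. The sketch assumes a fair die as a hypothetical example:

```r
values <- 1:6
pmf <- rep(1 / 6, 6)   # probability of each face of a fair die

# Cumulative distribution: P(X <= value) is the running sum of the pmf.
cdf <- cumsum(pmf)
names(cdf) <- values

cdf["4"]   # P(X <= 4) = 4/6
cdf["6"]   # P(X <= 6) = 1
```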

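probability density function (used for a continuous variable, as it is impossible to list every value and the probability of each)

The probability density function (PDF) describes how probability is spread over the values of a continuous variable. The probability that the value falls within an interval is the area under the PDF over that interval; the density at a single point is not itself a probability.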
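
A short R sketch of the link between the density and interval probabilities, assuming a standard normal variable as a hypothetical example:

```r
# P(0 <= X <= 1.5) for a standard normal X: the area under the density.
area <- integrate(dnorm, lower = 0, upper = 1.5)$value

# The same probability from the cumulative distribution function.
via_cdf <- pnorm(1.5) - pnorm(0)

print(c(area = area, via_cdf = via_cdf))

# A density value is not a probability: it can exceed 1.
peak <- dnorm(0, mean = 0, sd = 0.1)
print(peak)   # about 3.99
```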

cumulative distribution function

Cumulative distribution function (CDF) is the probability that the variable is less than or equal to a particular value.

modal category?

category with the highest frequency

Bar plot (a common way to display a categorical variable)

One vertical bar for each possible category that could occur,
with the height proportional to the frequency of that category.
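
A minimal sketch in R, using made-up categorical data (the `colour` vector is hypothetical): `table()` counts the frequency of each category, the modal category is the one with the largest count, and `barplot()` draws one bar per category.

```r
# Hypothetical categorical data, made up for illustration.
colour <- c("red", "blue", "red", "green", "red", "blue")

freq <- table(colour)                   # frequency of each category
modal <- names(freq)[which.max(freq)]   # category with the highest frequency
print(modal)                            # "red"

# One vertical bar per category, height proportional to its frequency.
barplot(freq, xlab = "Colour", ylab = "Frequency")
```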

Histogram (quantitative variable)

Divide the range of data into intervals of equal width.
Count the number of observations that fall within each interval.
Label the intervals on the x-axis.
Draw a bar over each interval, with height equal to the number of observations in it.
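
The steps above can be sketched in R with `hist()`. The data and the interval width of 2 are assumptions made for illustration; note that by default R closes each interval on the right, so the value 2 is counted in (0, 2].

```r
# Hypothetical quantitative data.
x <- c(1, 2, 2, 3, 5, 6, 6, 7, 9, 10)

# Intervals of equal width covering the range of the data.
breaks <- seq(0, 10, by = 2)

# Count the observations in each interval without drawing yet.
h <- hist(x, breaks = breaks, plot = FALSE)
print(h$counts)   # 3 1 3 1 2

# Draw one bar over each interval, with the intervals on the x-axis.
plot(h, main = "Histogram of x", xlab = "x")
```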

Weakness of range?

sensitive to extreme observations

variance definition

average of the squared deviations from the mean
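
A brief R sketch of this definition on made-up data. One caveat worth knowing: R's built-in `var()` computes the sample variance, which divides by n - 1 rather than n, so it differs slightly from the plain average.

```r
# Hypothetical data with mean 5.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Average of the squared deviations from the mean (divides by n).
pop_var <- mean((x - mean(x))^2)
print(pop_var)    # 4

# R's var() is the sample variance (divides by n - 1).
samp_var <- var(x)
print(samp_var)   # 32/7, about 4.57

# The standard deviation is the square root of the variance.
print(sqrt(pop_var))   # 2
```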

empirical rules of SD

For an approximately bell-shaped distribution:
about 68% of observations fall within ±1 SD of the mean
about 95% fall within ±2 SD
almost all (about 99.7%) fall within ±3 SD (useful as a check for outliers)
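
These percentages can be reproduced from the normal distribution. The helper name `within_k_sd` is hypothetical; `pnorm()` is R's standard normal cumulative distribution function.

```r
# Probability that a normal variable lies within k SDs of its mean.
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, within_k_sd)
print(round(coverage, 4))   # 0.6827 0.9545 0.9973
```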

interquartile range

range between upper and lower quartiles (robust to outliers)

5 number summary

min, lower quartile, median (x_0.5), upper quartile, max (min and max taken NOT considering outliers)
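
A small R sketch of the five-number summary and of why the interquartile range is more robust than the range. The data are made up, and the value 100 is a hypothetical outlier. Note that `fivenum()` and `IQR()` work on all the values they are given, so any outliers have to be dropped beforehand if they should not count towards min and max.

```r
# Hypothetical data.
x <- c(1, 3, 5, 7, 9, 11, 13)

print(fivenum(x))   # min, lower quartile, median, upper quartile, max: 1 4 7 10 13
print(IQR(x))       # upper quartile - lower quartile = 6

# Add one extreme observation and compare the two measures of spread.
x_out <- c(x, 100)
print(diff(range(x)))       # 12
print(diff(range(x_out)))   # 99: the range is sensitive to the outlier
print(IQR(x_out))           # 7: the IQR barely changes
```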

when does an association exist?

if particular values of one variable (the response/dependent variable) are more likely to occur with certain values of another variable (the explanatory/independent variable)

covariance

measures the extent to which two variables move in the same direction

correlation

covariance between two variables divided by the product of their standard deviations
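
The relationship between the two can be checked in R. The vectors `x` and `y` are made-up data; `cov()`, `sd()` and `cor()` are base R functions.

```r
# Hypothetical paired observations.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Covariance: positive when the variables tend to move in the same direction.
print(cov(x, y))   # 1.5

# Correlation: covariance divided by the product of the standard deviations.
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Should agree with the built-in correlation function.
print(all.equal(r_manual, cor(x, y)))   # TRUE
```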

To check your working directory

getwd()

get data types

class(a) (where a is an object that has already been assigned a value)

true or false class?

logical

Creating a vector of numbers and name it x

x = c(1, 2, 3, 4) (x prints as 1 2 3 4; class: numeric)

length of vector

length(x)
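
The cards on classes and vectors can be tried together in one short session. This is only a sketch; the object names `a`, `x` and `s` are arbitrary.

```r
a <- TRUE
print(class(a))    # "logical"

x <- c(1, 2, 3, 4)
print(x)           # 1 2 3 4
print(class(x))    # "numeric"
print(length(x))   # 4

# The colon operator creates integers rather than doubles.
s <- 1:4
print(class(s))    # "integer"
```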

alternative ways to write | x = matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)

x=matrix(c(1,2,3,4),2,2) | x=matrix(1:4,2,2)

fill the matrix by row first?

y=matrix(1:4,2,2,byrow=TRUE)

class of matrix?

'matrix' 'array'

dimension of matrix

dim(x) returns 2 2 (rows, then columns)

extract component from row 2, column 3 of matrix A

A[2,3]

obtain the first row of A

A[1,]

delete first row of A

A[-1,]
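
A short sketch that puts the matrix cards together, assuming a hypothetical 2 x 3 matrix `A`. Two details are worth noting: `A[-1, ]` returns a copy without the first row and leaves `A` itself unchanged, and a single remaining row is returned as a plain vector unless `drop = FALSE` is given.

```r
# Filled column by column (the default): rows are 1 3 5 and 2 4 6.
A <- matrix(1:6, nrow = 2, ncol = 3)

# Filled row by row: rows are 1 2 3 and 4 5 6.
B <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

print(dim(A))     # 2 3 (rows, then columns)
print(A[2, 3])    # 6: element in row 2, column 3
print(A[1, ])     # 1 3 5: first row
print(A[-1, ])    # 2 4 6: everything except the first row, as a vector

# Keep the result as a 1 x 3 matrix instead of a vector.
print(dim(A[-1, , drop = FALSE]))   # 1 3
```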

list out all objects?

ls()

remove one or all object?

- rm(x) removes one object | - rm(list=ls()) removes all objects (the list argument must be a character vector of object names)

INTRO+DATASETS Flashcards

(40 cards)