S1 Study Flashcards
What are the five steps of data science?
Ask, Get, Explore, Model, Communicate
What type of variable is gender?
Categorical
What type of variable is height?
Numerical
What is the width of a bar on a histogram called?
Bin size (also called bin width)
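As a rough illustration of how bin size changes a histogram, the sketch below groups values into equal-width bins in plain Python. The heights and the bin sizes are made-up sample values, and `histogram_counts` is a hypothetical helper written for this card, not a library function.

```python
from collections import Counter


def histogram_counts(values, bin_size):
    """Count how many values fall into each bin of width bin_size.

    Each bin is labelled by its lower edge and covers
    the range [edge, edge + bin_size).
    """
    counts = Counter((v // bin_size) * bin_size for v in values)
    return dict(sorted(counts.items()))


heights_cm = [150, 152, 158, 161, 165, 167, 171, 178]  # made-up sample data

# A smaller bin size gives more, narrower bars.
print(histogram_counts(heights_cm, 10))
# A larger bin size gives fewer, wider bars.
print(histogram_counts(heights_cm, 20))
```

The same data produces three bars with a bin size of 10 and two bars with a bin size of 20, which is why the choice of bin size affects how the distribution looks.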
What are the three V’s of data science?
Volume
Velocity
Variety
What does EDA stand for?
Exploratory Data Analysis
What is the goal of EDA?
To understand the data better and search for patterns
What is data wrangling?
Transforming raw data to be usable later in the process
Which of these situations calls for a boxplot?
- The data has lots of outliers
- The data is skewed
- The data has high dimensionality
- We want to see some specific features
The data is skewed
What is GDPR?
General Data Protection Regulation
According to GDPR, can anyone ask a search engine to remove irrelevant data about them from its search results?
Yes
This holds regardless of where the search engine's servers are located, since its services are offered in the EU
Name 3 personal data protection rights guaranteed by GDPR
The right to…
- transfer my data to another server
- delete false data about me
- challenge an outcome based on my data
Not included: the right to withdraw a paper based on my data
True or False
To scrape data properly, you should provide a user agent string
True
True or False
To scrape data properly, you should request data at a reasonable rate
True
True or False
To scrape data properly, you should not use an adblocker
False
True or False
To scrape data properly, you should launch the project when the site isn’t busy
False
True or False
To scrape data properly, you should use the site's API, when one exists, rather than your own scraping code
True
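The two "True" habits about identifying yourself and pacing requests can be sketched with Python's standard library. This is a minimal sketch under assumptions: the user agent string, the contact address, the one-second delay, and the helper names `build_request` and `fetch_all` are all hypothetical choices for illustration, not values from the course.

```python
import time
import urllib.request

# Hypothetical user agent: says who is scraping and how to reach them.
USER_AGENT = "example-study-bot/0.1 (contact: you@example.com)"
# Assumed pause between requests; a real site may state its own limit.
MIN_DELAY_SECONDS = 1.0


def build_request(url):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def _download(request):
    """Default fetcher: perform the HTTP request and return the body."""
    with urllib.request.urlopen(request) as response:
        return response.read()


def fetch_all(urls, fetch=_download, delay=MIN_DELAY_SECONDS):
    """Fetch each URL in turn, sleeping between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # keep the request rate reasonable
        results.append(fetch(build_request(url)))
    return results
```

The `fetch` parameter exists only so the pacing logic can be tried without touching the network, for example by passing a function that returns the URL instead of downloading it.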
How do we check if a distribution is skewed and in what direction?
Answer in terms of a physical diagram
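Skewed right: the peak ‘tilts’ to the left, the long tail points right, and the skew is positive
Skewed left: the peak ‘tilts’ to the right, the long tail points left, and the skew is negative
Not skewed: the shape is symmetrical and the skew is 0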
How do we check if a distribution is skewed and in what direction?
Answer in terms of mathematics
Mean > median -> positive skew
Mean = median -> symmetric
Mean < median -> negative skew
Think of the skew’s sign as the size of the mean relative to the median (a mean above the median means a positive, ‘high’ skew)
Is the distribution shown positively skewed, negatively skewed, or not skewed?
2 3 5 6 7 7
(Hint: their sum is 30)
The values are already in order, so the median can be read from the middle
Median is 5.5 (between 5 and 6)
Mean is 30/6 = 5
Since the mean < median, the distribution is negatively skewed
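The mean-versus-median check from the cards above can be sketched in a few lines of Python. This is a rule-of-thumb classifier, not a formal skewness statistic, and `skew_direction` is a hypothetical helper name.

```python
from statistics import mean, median


def skew_direction(values):
    """Classify skew by comparing the mean with the median.

    mean > median -> "positive", mean < median -> "negative",
    mean == median -> "symmetric" (by this rule of thumb only).
    """
    m = mean(values)
    med = median(values)
    if m > med:
        return "positive"
    if m < med:
        return "negative"
    return "symmetric"


data = [2, 3, 5, 6, 7, 7]  # the distribution from the card
print("mean:", mean(data))
print("median:", median(data))
print("skew:", skew_direction(data))
```

For the card's data the mean is 5 and the median is 5.5, so the function reports a negative skew, matching the worked answer.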
What is the symbol for Sample Variance?
Sx² (x is a subscript)
What is the symbol for Standard Deviation?
Sx (x is a subscript)
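To connect the two symbols to numbers, here is a minimal Python sketch that computes both for the list 2 3 5 6 7 7 used in an earlier card. It assumes the sample versions of the formulas (dividing by n - 1), which is what `statistics.variance` and `statistics.stdev` use. The variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev, variance

data = [2, 3, 5, 6, 7, 7]  # values reused from the skew card

# Sample variance by hand: sum of squared deviations over (n - 1).
x_bar = mean(data)
squared_deviations = [(x - x_bar) ** 2 for x in data]
manual_variance = sum(squared_deviations) / (len(data) - 1)

# The standard deviation is the square root of the variance.
manual_stdev = sqrt(manual_variance)

# Library versions, which also divide by (n - 1).
library_variance = variance(data)
library_stdev = stdev(data)

print("sample variance:", manual_variance, library_variance)
print("standard deviation:", manual_stdev, library_stdev)
```

The squared deviations sum to 22, so the sample variance is 22 / 5 = 4.4 and the standard deviation is its square root.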