Statistics Flashcards
How do we use statistics in data science?
Data scientists use statistics to gather, review, analyze, and draw conclusions from data, as well as apply quantified mathematical models to appropriate variables.
What is central tendency?
A descriptive summary of a dataset through a single value that reflects the centre of the data distribution
What is the mean?
The arithmetic average: the sum of all values divided by the number of values
What is the median?
The middle value when the data is sorted. If the number of elements is even, it is the mean of the two middle values. Unlike the mean, the median is largely unaffected by outliers.
What is the mode?
The value that occurs most frequently. If several values tie for the highest frequency, the set is multimodal.
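A minimal sketch of the three measures of central tendency, using Python's standard `statistics` module. The sample values are made up for illustration, with 100 included as a deliberate outlier to show how it pulls the mean but not the median.

```python
import statistics

data = [2, 3, 3, 5, 7, 10, 100]  # illustrative values; 100 is a deliberate outlier

mean_value = statistics.mean(data)      # sum of values / number of values
median_value = statistics.median(data)  # middle value of the sorted data
mode_value = statistics.mode(data)      # most frequent value

# With an even number of elements, the median is the mean of the two middle values.
even_median = statistics.median([1, 2, 3, 4])

# multimode returns every value tied for the highest frequency.
all_modes = statistics.multimode([1, 1, 2, 2, 3])

print(mean_value, median_value, mode_value, even_median, all_modes)
```

Here the outlier drags the mean well above most of the data, while the median stays at 5.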
What is variability?
The extent to which data points diverge from the average value, and from each other
What is the range?
The difference between the highest and lowest values
What is the interquartile range?
The spread of the middle half of a distribution
What is standard deviation?
The square root of the variance. It measures the typical distance of data points from the mean, in the same units as the data.
What is variance?
The average of the squared distances from the mean.
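The measures of variability above can be sketched with the standard `statistics` module. The data values are invented for illustration. This sketch uses the population versions (`pvariance`, `pstdev`), which divide by n; `statistics.variance` and `statistics.stdev` divide by n - 1 instead.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values with mean 5

# Range: difference between the highest and lowest values.
value_range = max(data) - min(data)

# Interquartile range: spread of the middle half (third quartile minus first).
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Population variance: average of the squared distances from the mean.
variance = statistics.pvariance(data)

# Population standard deviation: square root of the variance.
std_dev = statistics.pstdev(data)

print(value_range, iqr, variance, std_dev)
```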
What is correlation?
The strength and direction of the relationship between two or more variables in a dataset. Positive correlation means larger values of x correspond to larger values of y. Negative correlation means larger values of x correspond to smaller values of y. Weak or no correlation means there is little or no consistent relationship.
What is covariance?
A measure of the joint variability of two variables
What is the correlation coefficient?
A statistical measure of the strength and direction of the linear relationship between two variables, ranging from -1 to 1
np.corrcoef(x, y)
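To make the link between covariance and the correlation coefficient concrete, here is a pure-Python sketch. The helper names `covariance` and `pearson_r` and the sample data are hypothetical, chosen for illustration. With NumPy installed, `np.corrcoef(x, y)[0, 1]` should give the same coefficient, since `np.corrcoef` returns a correlation matrix.

```python
import math

def covariance(x, y):
    """Sample covariance: sum of paired deviation products divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Covariance rescaled by both standard deviations, so it lies in [-1, 1]."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises with x: positive correlation
y_down = [10, 8, 6, 4, 2]  # falls as x rises: negative correlation

cov_up = covariance(x, y_up)
r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)

print(cov_up, r_up, r_down)
```

Covariance depends on the units of the data, while the coefficient is unitless, which is why the latter is easier to compare across datasets.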
What is the population?
A set of all the elements you’re interested in.
What is a sample?
A subset of the population. A representative sample preserves the essential statistical features of the overall population.
What is an outlier?
An observation that lies an abnormal distance from other values in a dataset.
Outliers can be caused by:
- errors in data entry or measurement
- sampling problems and unusual conditions
- natural variation
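The flashcard does not say how to decide what counts as an "abnormal distance". One common convention, assumed here rather than taken from the cards, flags values more than 1.5 times the interquartile range beyond the first or third quartile. The function name `iqr_outliers`, the parameter `k`, and the data are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles.

    This is one common rule of thumb, not the only definition of an outlier.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # 102 could be a data entry error
flagged = iqr_outliers(data)

print(flagged)
```

Whether a flagged value is an error or natural variation still has to be judged from context.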
What are percentiles?
A percentile is the value below which a given percentage of the data falls. For example, about 95% of values lie below the 95th percentile.
What are quartiles?
The three percentiles (25th, 50th, and 75th) that divide the sorted dataset into four equal parts.
The first quartile separates the smallest 25% of items from the rest of the dataset.
The second quartile is the median. Approximately 25% of the items lie between the first and second quartiles and another 25% between the second and third quartiles.
The third quartile separates the largest 25% of items from the rest of the dataset.
statistics.quantiles(x,n=4)
np.percentile(y,95)
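A short sketch of the two calls above using only the standard library. The data is invented for illustration. `statistics.quantiles(x, n=100, method="inclusive")` is used here as a stand-in for `np.percentile`; its linear interpolation should match NumPy's default, but treat that equivalence as an assumption to verify.

```python
import statistics

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # illustrative values

# n=4 returns the three cut points: first quartile, median, third quartile.
q1, q2, q3 = statistics.quantiles(x, n=4)

# The second quartile is the median.
median_value = statistics.median(x)

# n=100 returns 99 cut points; index 94 is the 95th percentile.
p95 = statistics.quantiles(x, n=100, method="inclusive")[94]

print(q1, q2, q3, median_value, p95)
```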
What is a joint plot?
A joint plot comprises three plots: a bivariate plot showing the relationship between x and y, a histogram at the top showing the distribution of x, and a histogram on the right showing the distribution of y.
sns.jointplot