Statistics Flashcards
How do we use statistics in data science?
Data scientists use statistics to gather, review, analyze, and draw conclusions from data, as well as apply quantified mathematical models to appropriate variables.
What is central tendency?
A descriptive summary of a dataset through a single value that reflects the centre of the data distribution.
What is the mean?
The arithmetic average: the sum of all values divided by the number of values.
What is the median?
The middle value when the data is sorted. If the number of elements is even, it is the mean of the two middle values. Unlike the mean, the median is robust to outliers.
What is the mode?
The value that occurs most frequently. If several values tie for the highest frequency, the set is multimodal.
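The three measures of central tendency above can be compared on a small example. This is a minimal sketch using Python's standard `statistics` module; the values in `data` are made up for illustration, with 100 acting as an outlier.

```python
import statistics

# Hypothetical dataset; 100 is a deliberate outlier.
data = [2, 3, 3, 5, 7, 100]

# Mean: sum of the values divided by how many there are (120 / 6).
mean = statistics.mean(data)

# Median: the count is even, so it is the mean of the two middle
# values of the sorted data (3 and 5).
median = statistics.median(data)

# Mode(s): multimode returns every value tied for the highest frequency.
modes = statistics.multimode(data)

print(mean, median, modes)  # 20 4.0 [3]

# A set with two equally frequent values is multimodal.
print(statistics.multimode([1, 1, 2, 2]))  # [1, 2]
```

The outlier pulls the mean up to 20 while the median stays at 4.0, which is why the median is often preferred for skewed data.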
What is variability?
The extent to which data points diverge from the average value, and from each other
What is the range?
The difference between the highest and lowest values
What is the interquartile range?
The spread of the middle half of a distribution: the difference between the third quartile (Q3) and the first quartile (Q1).
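The range and interquartile range can be sketched with NumPy as follows. The dataset is hypothetical, and the quartiles use `np.percentile` with its default linear interpolation; other quartile conventions can give slightly different results on small samples.

```python
import numpy as np

# Hypothetical dataset, already sorted for readability.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Range: highest value minus lowest value.
value_range = data.max() - data.min()

# Interquartile range: third quartile minus first quartile.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print(value_range)  # 8
print(q1, q3, iqr)  # 3.0 7.0 4.0
```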
What is standard deviation?
The square root of the variance. It measures the typical distance of data points from the mean, in the same units as the data.
What is variance?
The average of the squared distances from the mean.
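A minimal sketch of variance and standard deviation, computed by hand and then checked against NumPy. The data is hypothetical. Note that `np.var` and `np.std` divide by n (the population formulas) by default; passing `ddof=1` divides by n - 1, which is the usual choice when the data is a sample.

```python
import numpy as np

# Hypothetical dataset with mean 5.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Variance by hand: average of the squared distances from the mean.
mean = data.mean()
variance = ((data - mean) ** 2).mean()

# Standard deviation: square root of the variance.
std_dev = np.sqrt(variance)

print(variance, std_dev)            # 4.0 2.0
print(np.var(data), np.std(data))   # same values from NumPy's built-ins

# Sample variance divides by n - 1 instead of n.
sample_variance = np.var(data, ddof=1)
```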
What is correlation?
The strength and direction of the relationship between two variables in a dataset. Positive correlation means larger values of x tend to correspond to larger values of y. Negative correlation means larger values of x tend to correspond to smaller values of y. Weak or no correlation means there is little or no consistent relationship.
What is covariance?
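A measure of the joint variability of two variables: positive when they tend to move together, negative when they tend to move in opposite directions. Its magnitude depends on the units of the variables.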
What is the correlation coefficient?
A statistical measure of the strength and direction of the linear relationship between two variables, ranging from -1 to 1. In NumPy, np.corrcoef computes it.
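Covariance and the correlation coefficient can be sketched with `np.cov` and `np.corrcoef`. Both functions return a 2x2 matrix for two input variables, and the off-diagonal entry is the value of interest. The arrays `x`, `y`, and `y_neg` are hypothetical and chosen to be perfectly linear.

```python
import numpy as np

# Hypothetical data: y is exactly 2 * x, y_neg moves in the opposite direction.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
y_neg = -y

# Sample covariance (np.cov divides by n - 1 by default).
cov_xy = np.cov(x, y)[0, 1]

# Pearson correlation coefficient.
r_pos = np.corrcoef(x, y)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]

# The coefficient is the covariance rescaled by both standard deviations,
# which removes the units and bounds the result between -1 and 1.
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(cov_xy)        # 5.0
print(r_pos, r_neg)  # approximately 1.0 and -1.0
```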
What is the population?
A set of all the elements you’re interested in.
What is a sample?
A subset of the population. Ideally it is representative, meaning it preserves the essential statistical features of the overall population.
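The relationship between a population and a sample can be illustrated with a simple random sample. This is only a sketch: the population is simulated, and its size, mean of 50, and standard deviation of 10 are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated population: 10,000 values drawn from a normal distribution.
population = [random.gauss(50, 10) for _ in range(10_000)]

# Simple random sample of 500 elements, drawn without replacement.
sample = random.sample(population, 500)

pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# A reasonably large random sample should have a mean close to the population's.
print(round(pop_mean, 2), round(sample_mean, 2))
```

Random selection is one common way to make a sample representative; a sample chosen in a biased way may not preserve the population's features even if it is large.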