Analysis Basics Flashcards
What do you use to visualize the distribution or spread of a variable?
- Histogram
- Box plot
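A minimal sketch of both plots, assuming matplotlib is available (the cards do not name a plotting library). The grades list is made up for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical exam grades, made up for illustration.
grades = [50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70]

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: counts how many values fall into each bin.
counts, bin_edges, _ = ax_hist.hist(grades, bins=5)
ax_hist.set_title("Histogram of grades")
ax_hist.set_xlabel("Grade")
ax_hist.set_ylabel("Frequency")

# Box plot: shows the median, the quartiles, and points beyond the whiskers.
ax_box.boxplot(grades)
ax_box.set_title("Box plot of grades")
ax_box.set_ylabel("Grade")

fig.tight_layout()
plt.show()
```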
What do you do to understand the distribution?
Examine the “measures of central tendency”. These describe the “middle” of the data: the mean, median, and mode.
A simple average based on adding together all of the values in the sample set and then dividing the total by the number of samples.
Mean
The value in the middle of all of the sample values when they are sorted in order.
Median
The most commonly occurring value in the sample set.
Mode
This refers to a tie for the most common value.
Bimodal or Multimodal
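A short sketch of the three measures, assuming pandas (suggested by the DataFrame cards in this set). The sample values are hypothetical and chosen so that two values tie for most common.

```python
import pandas as pd

# Hypothetical sample of weekly study hours, made up for illustration.
hours = pd.Series([1, 2, 2, 3, 4, 4, 10])

mean_val = hours.mean()      # sum of the values divided by the number of values
median_val = hours.median()  # middle value once the samples are sorted
modes = hours.mode()         # Series holding every value tied for most frequent

print(mean_val, median_val, modes.tolist())
```

Here 2 and 4 each occur twice, so mode() returns both values and the sample is bimodal. The single large value (10) pulls the mean above the median.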
It refers to the data that we have on hand.
Samples
It refers to all the data that we can collect.
Population
Which function is used to estimate the distribution of a variable for the full population?
Probability Density Function (PDF)
What type of distribution has the mean and mode at the center, with symmetric tails on either side?
Normal Distribution
What type of distribution has the “bell shape” characteristic?
Normal Distribution
It refers to a tendency to select certain types of values more frequently than others, in a way that misrepresents the underlying population, or ‘real world’.
Bias
What are the things to remember when examining real world data?
- Check for missing values and badly recorded data
- Consider removal of obvious outliers
- Consider what real-world factors might affect your analysis and consider if your dataset size is large enough to handle this
- Check for biased raw data and consider your options to fix this, if found
It is a value that lies significantly outside the range of the rest of the distribution.
Outlier
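The cards do not say how to decide what counts as "significantly outside". One common convention, used here purely as an assumption, is to flag values more than 1.5 times the interquartile range (IQR) beyond the quartiles. The data is made up.

```python
import pandas as pd

# Hypothetical study hours; 100 is an implausible entry.
hours = pd.Series([9, 10, 10, 11, 12, 12, 13, 14, 100])

q1 = hours.quantile(0.25)
q3 = hours.quantile(0.75)
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the quartiles is an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = hours[(hours < lower) | (hours > upper)]
trimmed = hours[(hours >= lower) & (hours <= upper)]

print(outliers.tolist())
```

Whether to drop the flagged values is a judgment call; an extreme value can be a recording error or a genuine observation.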
Which type of distribution has most of the data on the left side, with a long tail to the right created by values at the extreme high end, which pull the mean to the right?
Right skewed
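A small illustration of right skew, assuming pandas and a made-up set of salaries (in thousands). One very high value drags the mean well above the median.

```python
import pandas as pd

# Hypothetical salaries in thousands; one extreme value at the high end.
salaries = pd.Series([30, 32, 35, 36, 38, 40, 250])

mean_salary = salaries.mean()
median_salary = salaries.median()
skewness = salaries.skew()  # positive for a right-skewed sample

print(mean_salary, median_salary, skewness)
```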
How do you measure variability (variance) in the data?
- Range
- Variance
- Standard Deviation
This refers to the difference between the maximum and minimum. There’s no built-in function for this, but it’s easy to calculate using the min and max functions.
Range
This refers to the average of the squared differences from the mean. You can use the built-in var function to find this.
Variance
This refers to the square root of the variance. You can use the built-in std function to find this.
Standard Deviation
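The three measures can be sketched with pandas as follows; the grades are hypothetical. Note that pandas var and std default to the sample versions (dividing by n - 1), so pass ddof=0 if you want the plain average of the squared differences.

```python
import pandas as pd

# Hypothetical grades, made up for illustration.
grades = pd.Series([50, 60, 70, 80, 90])

value_range = grades.max() - grades.min()  # no single built-in, so use max and min
sample_variance = grades.var()             # divides by n - 1 (pandas default, ddof=1)
population_variance = grades.var(ddof=0)   # divides by n
std_dev = grades.std()                     # square root of the sample variance

print(value_range, sample_variance, population_variance, std_dev)
```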
It is a built-in method of the DataFrame object that returns the main descriptive statistics for all numeric columns.
df.describe()
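A quick sketch of describe() on a small DataFrame. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical student data; "Name" is non-numeric and is left out of the summary.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cal", "Dee"],
    "StudyHours": [10, 12, 8, 15],
    "Grade": [50, 60, 45, 75],
})

summary = df.describe()  # count, mean, std, min, quartiles, and max per numeric column
print(summary)
```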
When comparing numeric variables, how do you deal with numeric data in different scales?
Normalize the data
It is a technique that distributes the values proportionally on a scale of 0 to 1.
MinMax scaling
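MinMax scaling can be written directly in pandas as (x - min) / (max - min), applied per column. This is a sketch with made-up data; libraries such as scikit-learn offer an equivalent scaler, but none is required here.

```python
import pandas as pd

# Hypothetical columns on very different scales.
df = pd.DataFrame({
    "StudyHours": [8, 10, 12, 16],
    "Grade": [40, 50, 70, 100],
})

# Per column: subtract the minimum, then divide by the column's range.
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)
```

After scaling, each column's smallest value maps to 0 and its largest to 1, so the two variables can be compared on the same axis.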
This value, which ranges from -1 to 1, indicates the strength and direction of the relationship between two variables.
Correlation
Values above 0 indicate a positive correlation (high values of one variable tend to coincide with high values of the other), while values below 0 indicate a negative correlation (high values of one variable tend to coincide with low values of the other).
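A brief sketch of both signs of correlation, using the pandas corr method (Pearson by default). The columns and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: grades rise with study hours, absences fall.
df = pd.DataFrame({
    "StudyHours": [1, 2, 3, 4, 5],
    "Grade": [52, 58, 65, 71, 80],
    "Absences": [9, 7, 6, 3, 1],
})

r_pos = df["StudyHours"].corr(df["Grade"])     # close to +1
r_neg = df["StudyHours"].corr(df["Absences"])  # close to -1

print(r_pos, r_neg)
```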
What do you use to visualize the correlation between two numeric variables?
- Scatter plot
It is a line added to a scatter plot to show the general trend in the data.
Regression line (line of best fit)
What is the slope-intercept form of a linear equation?
y = mx + b
Where:
- y and x are the coordinate variables
- m is the slope of the line
- b is the y-intercept (where the line crosses the y-axis)
It is the line that gives us the lowest value for the sum of the squared errors.
Least Squares Regression
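To make "lowest sum of squared errors" concrete, here is a plain-Python sketch using the standard closed-form formulas for the slope and intercept. The function names and data are hypothetical.

```python
def sum_squared_errors(x, y, m, b):
    """Sum of squared differences between each y and the line's prediction m*x + b."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))


def least_squares_fit(x, y):
    """Slope and intercept of the line that minimizes the sum of squared errors."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    m = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
         / sum((xi - x_mean) ** 2 for xi in x))
    b = y_mean - m * x_mean
    return m, b


# Hypothetical sample points, made up for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

m, b = least_squares_fit(x, y)
best_sse = sum_squared_errors(x, y, m, b)
other_sse = sum_squared_errors(x, y, 0.5, 2.5)  # some other candidate line

print(m, b, best_sse, other_sse)
```

Any other line, such as the y = 0.5x + 2.5 candidate above, gives a larger sum of squared errors than the fitted one.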
This returns (among other things) the coefficients you need for the slope-intercept equation: the slope (m) and intercept (b), based on a given pair of variable samples you want to compare.
linregress function (from scipy.stats)
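A minimal sketch of linregress, assuming it is imported from scipy.stats. The sample data is made up.

```python
from scipy import stats

# Hypothetical paired samples, made up for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

result = stats.linregress(x, y)
m, b = result.slope, result.intercept

# Use y = mx + b to compute the regression line's value at each x.
predicted = [m * xi + b for xi in x]

print(m, b, result.rvalue)
```

The result also carries rvalue (the correlation coefficient), pvalue, and stderr. The predicted values can be drawn over a scatter plot as the regression line.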