Exploratory Data Analysis Flashcards
What is a normal distribution
A normal distribution is a probability distribution that is symmetric around its mean, so the mean, median, and mode are all equal. Commonly cited examples that are approximately normal include people's heights and weights, and IQ scores
What is a skewed distribution
A skewed distribution is a probability distribution where the data is not symmetric around the mean, and one tail of the distribution has more extreme values than the other. There are two types of skewed distributions: left-skewed (negative skew) and right-skewed (positive skew)
What is an example of a left-skewed distribution
Prices of used cars in a market where more cars have a high price than a low price, so the few cheap cars form a tail on the left
What is an example of a right-skewed distribution
The distribution of age at first marriage, where more people marry at a younger age than at an older age, so the tail extends to the right
What is a uniform distribution
A uniform distribution is a probability distribution where all values have an equal chance of occurring. This means that the probability of any value within a given range is the same.
What is an example of uniform distribution
Rolling a fair die, where each number has an equal chance of being rolled
What is a bi-modal distribution
A bi-modal distribution is a probability distribution with two distinct peaks, or modes, in the data. This often indicates that the data contains two underlying subpopulations that are distinct from each other.
What is an example of bi-modal distribution
An example of a bi-modal distribution is the distribution of heights for a population that includes both adults and children
What are some key features of a normal distribution
Some key features of a normal distribution include the fact that it is symmetric, the mean, median and mode are all equal, and the frequency falls off in both directions away from the centre
What is the area under the curve of a normal distribution
The area under the curve of a normal distribution is equal to 1, meaning that the probabilities of all possible outcomes sum to 1
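As a rough numerical check of this card, the sketch below integrates the normal probability density function with the trapezoidal rule. The function names `normal_pdf` and `area_under_curve`, the integration range of 8 standard deviations either side of the mean, and the step count are illustrative assumptions, not part of the original cards.

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def area_under_curve(mean=0.0, sd=1.0, width=8.0, steps=10_000):
    """Trapezoidal-rule area between mean - width*sd and mean + width*sd."""
    lo = mean - width * sd
    hi = mean + width * sd
    h = (hi - lo) / steps
    # End points count half, interior points count fully
    total = 0.5 * (normal_pdf(lo, mean, sd) + normal_pdf(hi, mean, sd))
    for i in range(1, steps):
        total += normal_pdf(lo + i * h, mean, sd)
    return total * h

# The area is very close to 1 whatever the mean and standard deviation are
print(area_under_curve())
print(area_under_curve(mean=170, sd=10))
```

The mean and standard deviation change where the curve sits and how wide it is, but the total area stays at 1.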
What are the two parameters that determine the shape of a normal distribution
The two parameters that determine a normal distribution are the mean, which sets the centre of the curve, and the standard deviation, which sets how widely it spreads
What is the empirical rule
The empirical rule is a statistical rule of thumb that states that for a normal distribution, approximately 68% of the data falls within one standard deviation of the mean, 95% falls within two standard deviations of the mean, and 99.7% falls within three standard deviations of the mean
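The percentages in the empirical rule can be reproduced from the error function: for a normal distribution, the fraction of values within k standard deviations of the mean is erf(k / sqrt(2)). The helper name `fraction_within` is an illustrative choice.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.1%}")
# within 1 standard deviation(s): 68.3%
# within 2 standard deviation(s): 95.4%
# within 3 standard deviation(s): 99.7%
```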
Can a distribution be both normal and skewed
No, a distribution cannot be both normal and skewed. A normal distribution is always symmetric, while a skewed distribution is not symmetric
What is the difference between a normal distribution and a uniform distribution
A normal distribution is bell-shaped and symmetric around the mean, while a uniform distribution is flat and all values are equally likely
What is the difference between skewed left and skewed right
Skewed left and skewed right refer to the direction of the tail of the distribution. In a skewed left distribution, the tail is on the left side and the mean is typically smaller than the median. In a skewed right distribution, the tail is on the right side and the mean is typically larger than the median
How do skewed distributions impact statistical analysis
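Skewed distributions can have a significant impact on statistical analysis because the mean is pulled towards the tail and the standard deviation is inflated by the extreme values, so neither describes a typical value well. The median and the IQR are less affected and are often better summaries for skewed data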
What is the relationship between the mean and median in a skewed distribution
In a skewed distribution, the mean and median can be different from each other. The mean is pulled towards the tail of the distribution, while the median remains in the centre
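A small illustration of the mean being pulled towards the tail, using Python's standard `statistics` module. The two data lists are made-up values chosen only to create one long tail each.

```python
import statistics

# Hypothetical data: one extreme value creates the tail in each list
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]   # tail on the right
left_skewed = [-20, 6, 6, 7, 7, 7, 8, 8, 9]   # tail on the left

for name, data in (("right-skewed", right_skewed), ("left-skewed", left_skewed)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{name}: mean={mean:.2f}, median={median}")
# right-skewed: mean=4.67, median=3
# left-skewed: mean=4.22, median=7
```

In both cases the median stays with the bulk of the data while the mean moves towards the extreme value.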
Why is a perfectly flat uniform distribution rare
A perfectly flat uniform distribution is rare because it would require an infinite sample size, which is not practical in most cases. In reality, even if the underlying distribution is uniform, a finite sample will show some small variation due to sampling
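The sampling variation described above can be seen by simulating rolls of a fair die. This is a minimal sketch: the function name `roll_counts`, the seed, and the sample sizes are arbitrary assumptions.

```python
import random
from collections import Counter

def roll_counts(n_rolls, seed=0):
    """Count how often each face (1 to 6) appears in n_rolls of a fair die."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return [counts[face] for face in range(1, 7)]

for n in (60, 6000):
    counts = roll_counts(n)
    shares = [round(c / n, 3) for c in counts]
    print(n, counts, shares)
```

With a small number of rolls the counts are usually visibly uneven; with more rolls each share moves closer to 1/6 but rarely matches it exactly.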
What is the relationship between the mean and median in a uniform distribution
The mean and median are equal. This is because every value has the same frequency of occurrence, so the distribution is symmetric and both measures fall at the midpoint of the range
How does a bi-modal distribution differ from a normal distribution
A normal distribution is symmetrical with a single peak, whereas a bi-modal distribution has two peaks and need not be symmetrical
What are some examples of phenomena that may exhibit a bi-modal distribution
Income distributions in certain societies, test scores where the group taking the test splits into two distinct clusters, or bi-modal response patterns in psychological studies
What is the Inter Quartile Range (IQR)
The Inter Quartile Range is the range between the first and third quartiles of a dataset
What is the IQR used for
The IQR is used to measure the spread of the middle 50% of the data, which lies between the first quartile (Q1) and the third quartile (Q3)
What are the 6 different data points usually found on a box plot
Minimum, Quartile 1 (Q1), Median (Q2), Quartile 3 (Q3), Maximum, Extreme values (outliers)
How to calculate the quartiles for the IQR if n is odd
Sort the n samples in ascending order and label them o_0, o_1, …, o_{n-1}, with n = 2k + 1 for k ∈ ℕ
The first quartile is defined as the median of the samples {o_0, o_1, o_2, …, o_{k-1}}
The second quartile is the usual median value: o_k
The third quartile is defined as the median of the samples {o_{k+1}, o_{k+2}, o_{k+3}, …, o_{n-1}}
How to calculate the quartiles for the IQR if n is even
Sort the n samples in ascending order and label them o_0, o_1, …, o_{n-1}, with n = 2k for k ∈ ℕ
The first quartile is defined as the median of the samples {o_0, o_1, o_2, …, o_{k-1}}
The second quartile is the usual median value: (o_{k-1} + o_k) / 2
The third quartile is defined as the median of the samples {o_k, o_{k+1}, o_{k+2}, …, o_{n-1}}
Formula for IQR
IQR = Q3 - Q1
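The median-of-halves method from the preceding cards can be sketched in Python as follows. The function names `quartiles` and `iqr` are illustrative, and the sketch assumes 0-based indexing of the sorted samples, with the upper half starting at o_{k+1} when n is odd and at o_k when n is even. Other quartile conventions exist, so library functions may return slightly different values.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    k = n // 2
    lower = ordered[:k]  # o_0 ... o_{k-1}
    # For odd n the median o_k itself is left out of the upper half
    upper = ordered[k + 1:] if n % 2 else ordered[k:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

def iqr(values):
    """Inter Quartile Range: Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1

print(quartiles([1, 3, 5, 7, 9, 11, 13]))      # odd n = 7  -> (3, 7, 11)
print(quartiles([1, 3, 5, 7, 9, 11, 13, 15]))  # even n = 8 -> (4.0, 8.0, 12.0)
print(iqr([1, 3, 5, 7, 9, 11, 13]))            # 11 - 3 = 8
```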
What are outliers in statistics
Outliers are data points that are significantly different from the other observations in a dataset
Why is it important to identify outliers in data
Identifying outliers is important because they can significantly affect the mean and standard deviation of a dataset, and can also impact the results of statistical analyses
How can outliers be detected using IQR
Outliers can be detected using the 1.5 × IQR rule. Any data point below Q1 - (1.5 × IQR) or above Q3 + (1.5 × IQR) is considered an outlier
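A minimal sketch of the 1.5 × IQR rule, assuming the median-of-halves quartile definition used elsewhere in these cards. The function name `iqr_outliers` and the sample data are hypothetical.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying outside Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    ordered = sorted(values)
    n = len(ordered)
    k = n // 2
    q1 = statistics.median(ordered[:k])
    # For odd n the median itself is excluded from the upper half
    q3 = statistics.median(ordered[k + 1:] if n % 2 else ordered[k:])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical data: Q1 = 15, Q3 = 19, IQR = 4, so the fences are 9 and 25
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 55]
print(iqr_outliers(data))  # [55]
```

Flagging a value this way only marks it for investigation; it does not by itself show that the value is an error.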
Are outliers always bad data points that should be removed
No. Outliers can be valid data points that represent a real phenomenon in a population. However, they should be investigated further to determine whether they are valid or the result of errors or anomalies
What are some benefits of using visualisation tools in data analysis
Visualisation tools can help identify patterns and trends, communicate insights, present information succinctly, provide evidence and support, and influence and persuade
How do visualisations support data analysis
Visualisations can provide insights beyond just the numbers and statistics, making it easier to identify patterns and communicate results. They can also raise further questions and lines of inquiry
What are some common types of visualisation tools used in data analysis
Some common types of visualisation tools include scatter plots, histograms, bar charts, line charts, heat maps, and pie charts
How can visualisation be used to support decision-making
Visualisation can provide a clear and concise way to present information and insights, allowing decision makers to quickly understand key trends and patterns and make more informed decisions
What are some best practices for creating effective visualisations
Some best practices for creating effective visualisations include choosing the appropriate type of visualisation for the data being presented, using clear and concise labels and legends, avoiding clutter and unnecessary details, and ensuring the visualisation is easily understandable by the intended audience
What is Anscombe’s Quartet
Anscombe’s Quartet is a collection of four datasets that have nearly identical summary statistics, such as the mean and standard deviation, but look very different when plotted
What is the purpose of Anscombe’s Quartet
The purpose of Anscombe’s Quartet is to demonstrate how descriptive statistics alone can hide the underlying structure of the data, and the importance of visualising data to fully understand it
How many datasets are included in Anscombe’s Quartet
Anscombe’s Quartet consists of four datasets
What is the significance of the identical mean and standard deviation values in Anscombe’s Quartet
The identical mean and standard deviation values in Anscombe’s Quartet highlight the limitations of relying solely on summary statistics to understand a dataset, and demonstrate the importance of visualising data to uncover patterns and relationships
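The summary statistics can be checked directly. The sketch below uses the quartet's values as they are commonly published and computes the mean and sample standard deviation of each dataset with the standard `statistics` module. The variable name `quartet` and the dictionary layout are illustrative choices.

```python
import statistics

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I to III
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    print(f"{name}: "
          f"mean x={statistics.mean(x):.2f}, sd x={statistics.stdev(x):.2f}, "
          f"mean y={statistics.mean(y):.2f}, sd y={statistics.stdev(y):.2f}")
```

The four printed lines agree to two decimal places, yet a scatter plot of each dataset shows a different pattern: roughly linear, curved, linear with one outlier, and a vertical cluster with one extreme point.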
What is simulated annealing in the context of generating datasets
Simulated annealing is a technique used to generate datasets with desired statistical properties: start with a set of data points in roughly the right place, then iteratively make small adjustments to their positions until the desired outcome is achieved (i.e. matching mean and standard deviation)
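The idea can be sketched in a heavily simplified form. The code below nudges one random point at a time towards a circle, occasionally accepts a worse move while the "temperature" is still high, and rejects any move that would shift the mean or standard deviation by more than a small tolerance. Every name, the circle target, the tolerance, and the cooling schedule are assumptions made for illustration, not a reference implementation.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def summary(points):
    """Mean and sample standard deviation of the x and y coordinates."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (mean(xs), mean(ys), sample_sd(xs), sample_sd(ys))

def close_enough(stats, target, tol=0.01):
    return all(abs(s - t) <= tol for s, t in zip(stats, target))

def distance_to_circle(point, centre, radius):
    return abs(math.dist(point, centre) - radius)

def anneal_towards_circle(points, centre, radius, steps=5000, seed=0):
    rng = random.Random(seed)
    points = list(points)
    target = summary(points)  # statistics that must be preserved
    for step in range(steps):
        temperature = 0.4 * (1 - step / steps)  # cools linearly towards 0
        i = rng.randrange(len(points))
        old = points[i]
        new = (old[0] + rng.gauss(0, 0.1), old[1] + rng.gauss(0, 0.1))
        better = (distance_to_circle(new, centre, radius)
                  < distance_to_circle(old, centre, radius))
        # Skip worse moves unless the temperature lets them through
        if not better and rng.random() >= temperature:
            continue
        points[i] = new
        if not close_enough(summary(points), target):
            points[i] = old  # undo: the summary statistics drifted too far
    return points

data_rng = random.Random(1)
start = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(40)]
result = anneal_towards_circle(start, centre=(0.0, 0.0), radius=1.5)
print([round(v, 2) for v in summary(start)])
print([round(v, 2) for v in summary(result)])
```

Because every accepted move is checked against the starting statistics, the output keeps the mean and standard deviation within the tolerance even though the individual points have moved.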
What is a scatter chart used for
A scatter chart is used to show the relationship between two variables
How do you plot data on a scatter chart
You choose two variables and plot one on the x-axis and the other on the y-axis
Can you plot more than two variables on a scatter chart
Yes, you can sometimes plot a third variable by using colour or size of the marker
What is the syntax for creating a scatter chart in Python using Matplotlib
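The syntax for creating a scatter chart in Python using Matplotlib is plt.scatter(x, y), where plt is the conventional alias for matplotlib.pyplot and x and y are equal-length sequences of values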
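A short sketch of a scatter chart that also encodes a third variable as marker colour, as mentioned in the earlier card. It uses the Axes method `ax.scatter`, which takes the same arguments as `plt.scatter`. The function name `scatter_with_colour` and the data are hypothetical.

```python
import matplotlib.pyplot as plt

def scatter_with_colour(x, y, third, third_label="third variable"):
    """Scatter plot of y against x, coloured by a third variable."""
    fig, ax = plt.subplots()
    points = ax.scatter(x, y, c=third, cmap="viridis")
    fig.colorbar(points, ax=ax, label=third_label)  # legend for the colour scale
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    return fig, ax

# Hypothetical example data
hours_studied = [1, 2, 3, 4, 5, 6]
test_score = [52, 55, 61, 70, 74, 83]
age = [18, 22, 19, 25, 21, 30]

fig, ax = scatter_with_colour(hours_studied, test_score, age, third_label="age")
# Call plt.show() to display the chart in an interactive session
```

Marker size can be varied in the same way with the `s` argument of `scatter`.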
What is the purpose of a bar/column chart in data visualisation
Bar/column charts are used to compare the counts or frequencies of different categories or variables. The height of the bars or columns represents the observed count for each category, making it easy to see which categories have more or less counts than others
How do you create a bar chart in Python using Matplotlib
To create a bar chart in Python using Matplotlib, you can use the function plt.bar(x, height), where x is an array or list of bar positions or category labels for the x-axis and height is an array or list of values for the heights of the bars. You can also add labels, titles, and other formatting options to the chart using additional Matplotlib functions
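A minimal bar chart sketch with axis labels and a title. It uses the Axes method `ax.bar`, which takes the same `x` and `height` arguments as `plt.bar`. The function name `bar_chart` and the data are hypothetical.

```python
import matplotlib.pyplot as plt

def bar_chart(categories, counts, title):
    """Bar chart with one bar per category; bar height is the count."""
    fig, ax = plt.subplots()
    ax.bar(categories, counts)
    ax.set_xlabel("Category")
    ax.set_ylabel("Count")
    ax.set_title(title)
    return fig, ax

# Hypothetical example data
fruit = ["apple", "banana", "cherry"]
sold = [12, 7, 15]

fig, ax = bar_chart(fruit, sold, "Fruit sold (hypothetical data)")
# Call plt.show() to display the chart in an interactive session
```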
What is a line chart used for
A line chart is used to show how a trend occurs over a series of observations. It is commonly used to visualise changes over time or across a range