Exploratory Data Analysis Flashcards
What is a normal distribution
A normal distribution is a probability distribution that is symmetric around its mean; common examples include people's heights and weights, and IQ scores. In a normal distribution, the mean, median, and mode are all equal
What is a skewed distribution
A skewed distribution is a probability distribution where the data is not symmetric around the mean, and one tail of the distribution has more extreme values than the other. There are two types of skewed distributions: left-skewed (negative skew) and right-skewed (positive skew)
What is an example of a left-skewed distribution
Prices of used cars, where there are more cars with a high price than with a low price
What is an example of a right-skewed distribution
Distributions of age at first marriage, where there are more people who get married at a younger age than at an older age
What is a uniform distribution
A uniform distribution is a probability distribution where all values have an equal chance of occurring. This means that the probability of any value within a given range is the same.
What is an example of a uniform distribution
Rolling a fair die, where each number has an equal chance of being rolled
What is a bi-modal distribution
A bi-modal distribution is a probability distribution where there are two distinct peaks, or modes, in the data. This indicates that there are two underlying subpopulations within the data that are distinct from each other.
What is an example of a bi-modal distribution
An example of a bi-modal distribution is the distribution of heights for a population that includes both adults and children
What are some key features of a normal distribution
Some key features of a normal distribution include the fact that it is symmetric, the mean, median and mode are all equal, and the frequency falls off in both directions away from the centre
What is the area under the curve of a normal distribution
The area under the curve of a normal distribution is equal to 1, meaning that the probabilities of all possible outcomes sum to 1
What are the two parameters that determine the shape of a normal distribution
The two parameters that determine the shape of a normal distribution are the mean and the standard deviation
What is the empirical rule
The empirical rule is a statistical rule of thumb that states that for a normal distribution, approximately 68% of the data falls within one standard deviation of the mean, 95% of the data falls within two standard deviations of the mean, and 99.7% of the data falls within three standard deviations of the mean
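The 68/95/99.7 figures can be sanity-checked with a quick simulation. The sketch below (Python standard library only) draws samples from a normal distribution and counts how many fall within one and two standard deviations of the mean:

```python
# Sanity-check the empirical rule by sampling from a normal distribution.
# Standard library only.
import random

random.seed(42)                      # fixed seed for reproducibility
mu, sigma = 0.0, 1.0
samples = [random.gauss(mu, sigma) for _ in range(100_000)]

within_1sd = sum(abs(x - mu) <= sigma for x in samples) / len(samples)
within_2sd = sum(abs(x - mu) <= 2 * sigma for x in samples) / len(samples)

print(round(within_1sd, 2))          # close to 0.68
print(round(within_2sd, 2))          # close to 0.95
```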
Can a distribution be both normal and skewed
No, a distribution cannot be both normal and skewed. A normal distribution is always symmetric, while a skewed distribution is not symmetric
What is the difference between normal distribution and a uniform distribution
A normal distribution is bell-shaped and symmetric around the mean, while a uniform distribution is flat and all values are equally likely
What is the difference between skewed left and skewed right
Skewed left and skewed right refer to the direction of the tail of the distribution. In a skewed left distribution, the tail is on the left side and the mean is smaller than the median. In a skewed right distribution, the tail is on the right side and the mean is larger than the median
How do skewed distributions impact statistical analysis
Skewed distributions can have a significant impact on statistical analysis because they can influence the interpretation of measures such as the mean and standard deviation
What is the relationship between the mean and median in a skewed distribution
In a skewed distribution, the mean and median can be different from each other. The mean is pulled towards the tail of the distribution, while the median remains in the centre
Why is a perfectly flat uniform distribution rare
A perfectly flat uniform distribution is rare because it would require an infinite sample size, which is not practical in most cases. In reality, even if the distributions are uniform, there will be some small variation due to sampling
What is the relationship between mean and median in uniform distribution
The mean and median are equal. This is because every value in the distribution has the same frequency of occurrence and contributes equally to the calculation of both mean and median
How does a bi-modal distribution differ from a normal distribution
A normal distribution is symmetrical with a single peak, whereas a bi-modal distribution has two peaks (and is often, though not necessarily, asymmetrical)
What are some examples of phenomena that may exhibit a bi-modal distribution
Income distributions in certain societies, test scores when a class splits into two clusters of ability, or bi-modal response patterns in psychological studies
What is the Inter Quartile Range (IQR)
The Inter Quartile Range is the range between the first and third quartiles of a dataset
What is the IQR used for
The IQR is used to measure the spread of data by identifying the range between the first quartile (Q1) and the third quartile (Q3)
What are the 6 different data points usually found on a box plot
Minimum, Quartile 1, Median (Q2), Quartile 3, Maximum, Extreme values (outliers)
How to calculate IQR if n is odd
We can define n as being 2k + 1 for k ∈ ℕ
The first quartile is defined as the median of the samples {o0 , o1 , o2 , …, ok-1}
The second quartile is the usual median value: ok
Third quartile is defined as median of the samples {ok+1 , ok+2 , ok+3 , …, o2k} (the last index is n-1 = 2k)
How to calculate IQR if n is even
We can define n as being 2k for k ∈ ℕ
The first quartile is defined as the median of the samples {o0 , o1 , o2 , …, ok-1}
The second quartile is the usual median value: (ok-1 + ok) / 2
Third quartile is defined as median of the samples {ok , ok+1 , ok+2 , …, o2k-1} (the last index is n-1 = 2k-1)
Formula for IQR
IQR = Q3 - Q1
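The odd-n and even-n recipes above can be sketched as a small helper. This is an illustrative implementation of the median-of-halves rule from these cards (the function name `quartiles` is my own); note that libraries such as NumPy default to different interpolation methods, so their results can differ slightly:

```python
from statistics import median

def quartiles(data):
    """Q1, Q2, Q3 using the median-of-halves rule described above
    (the median itself is excluded from the halves when n is odd)."""
    xs = sorted(data)
    k = len(xs) // 2                 # n = 2k or n = 2k + 1
    q1 = median(xs[:k])              # lower half: o0 .. o(k-1)
    q2 = median(xs)
    q3 = median(xs[-k:])             # upper half: last k samples
    return q1, q2, q3

q1, q2, q3 = quartiles([1, 3, 5, 7, 9, 11, 13, 15, 17])   # n = 9, k = 4
print(q1, q2, q3)        # 4.0 9 14.0
print(q3 - q1)           # IQR = 10.0
```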
What are outliers in statistics
Outliers are data points that are significantly different from the other observations in a dataset
Why is it important to identify outliers in data
Identifying outliers is important because they can significantly affect the mean and standard deviation of a dataset, and can also impact the results of statistical analyses
How can outliers be detected using IQR
Outliers can be detected using the IQR and the 1.5 × IQR rule. Any data points outside the range [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] are considered outliers
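A minimal sketch of the 1.5 × IQR rule (the helper name `iqr_outliers` is my own, and the quartiles use the same median-of-halves rule as the cards above):

```python
from statistics import median

def iqr_outliers(data):
    """Return the points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    xs = sorted(data)
    k = len(xs) // 2
    q1, q3 = median(xs[:k]), median(xs[-k:])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

print(iqr_outliers([2, 3, 4, 5, 6, 7, 50]))   # [50]
```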
Are outliers always bad data points that should be removed
Outliers can be valid data points that represent a real phenomenon in a population. However, they should be investigated further to determine if they are valid or if they are the result of errors or anomalies
What are some benefits of using visualisation tools in data analysis
Visualisation tools can help identify patterns and trends, communicate insights, present information succinctly, provide evidence and support, and influence and persuade
How do visualisations support data analysis
Visualisations can provide insights beyond just the numbers and statistics, making it easier to identify patterns and communicate results. They can also raise further questions and lines of inquiry
What are some common types of visualisation tools used in data analysis
Some common types of visualisation tools include scatter plots, histograms, bar charts, line charts, heat maps, and pie charts
How can visualisation be used to support decision-making
Visualisation can provide a clear and concise way to present information and insights, allowing decision makers to quickly understand key trends and patterns and make more informed decisions
What are some best practices for creating effective visualisation
Some best practices for creating effective visualisations include choosing the appropriate type of visualisation for the data being presented, using clear and concise labels and legends, avoiding clutter and unnecessary details and ensuring the visualisation is easily understandable by the intended audience
What is Anscombe’s Quartet
Anscombe’s Quartet is a collection of four datasets that have nearly identical descriptive statistics (including mean and standard deviation), but look very different when plotted
What is the purpose of Anscombe’s Quartet
The purpose of Anscombe’s Quartet is to demonstrate how descriptive statistics alone can hide underlying data and the importance of visualising data to fully understand it
How many datasets are included in Anscombe’s Quartet
Anscombe’s Quartet consists of four datasets
What is the significance of the identical mean and standard deviation values in Anscombe’s Quartet
The identical mean and standard deviation values in Anscombe’s Quartet highlight the limitations of relying solely on summary statistics to understand a dataset, and demonstrate the importance of visualising data to uncover patterns and relationships
What is simulated annealing in the context of generating datasets
Simulated annealing is a technique used to generate datasets with desired statistical properties by starting with a set of data points in roughly the right place, then iteratively adjusting their positions until the desired outcome is achieved (i.e. matching mean and standard deviation)
What is a scatter chart used for
A scatter chart is used to show the relationship between two variables
How do you plot data on a scatter chart
You choose two variables and plot one on the x-axis and the other on the y-axis
Can you plot more than two variables on a scatter chart
Yes, you can sometimes plot a third variable by using colour or size of the marker
What is the syntax for creating a scatter chart in Python using Matplotlib
The syntax for creating a scatter chart in Python using Matplotlib is plt.scatter(x,y)
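A slightly fuller sketch, assuming Matplotlib is installed; it also encodes a hypothetical third variable as marker size, as mentioned in the earlier card:

```python
import matplotlib
matplotlib.use("Agg")                # non-interactive backend: no window needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
sizes = [20, 40, 60, 80, 100]        # hypothetical third variable as marker size

fig, ax = plt.subplots()
sc = ax.scatter(x, y, s=sizes)
ax.set_xlabel("x variable")
ax.set_ylabel("y variable")
fig.savefig("scatter.png")
```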
What is the purpose of a bar/column chart in data visualisation
Bar/column charts are used to compare the counts or frequencies of different categories or variables. The height of the bars or columns represents the observed count for each category, making it easy to see which categories have more or less counts than others
How do you create bar chart in Python using Matplotlib
To create a bar chart in Python using Matplotlib, you can use the function plt.bar(x,height) where x is an array or list of values for the x-axis and height is an array or list of values for the heights of the bars. You can also add labels, titles, and other formatting options to the chart using additional Matplotlib functions
What is a line chart used for
A line chart is used to show how a trend occurs over a series of observations. It is commonly used to visualise changes over time or across a range
What is the x-axis in a line chart
The x-axis in a line chart represents the dimensions across which you want the trend to be measured. For example, if you are plotting stock prices over time, the x-axis would represent time
What is the y-axis in a line chart
The y-axis in a line chart represents the measured value. For example, if you are plotting stock prices over time, the y-axis would represent the price of the stock
How can you create a line chart in Python
You can create a line chart in Python using the “plt.plot” function. This function takes two arguments: the x-axis values and the y-axis values. For example, plt.plot([1,2,3,4,5],[10,15,20,25,30]) would create a line chart with x-axis values of 1 through 5 and y-axis values of 10,15,20,25,30
What type of data is best visualised using histograms
Histograms are best used to visualise the distribution of continuous data, where the data is divided into intervals or bins along the x-axis and the frequency of observations falling into each bin is represented by the height of the bars on the y-axis
What does the height of each bar in a histogram represent
The height of each bar in a histogram represents the frequency or count of observations that fall into the corresponding bin or interval along the x-axis
How do you create a histogram in Python
plt.hist(data, bins). Pass in the dataset as the first argument. The ‘bins’ parameter specifies the number of bins to use for grouping the data, e.g. plt.hist(data, bins=5)
What does a pie chart show
A pie chart shows how categories share proportions of a whole
What does each ‘wedge’ in a pie chart represent
Each ‘wedge’ in a pie chart represents a category with the size denoting the amount
How is the size of each ‘wedge’ in a pie chart determined
The size of each ‘wedge’ in a pie chart is determined by the proportion of the whole that the corresponding category represents
What is the Python command to create a pie chart
The Python command to create a pie chart is plt.pie(x) where x is a list or array of data to be represented in the chart
What is the purpose of a box and whisker plot in data visualisation
A box and whisker plot is used to show the distribution of a dataset and to visualise the “5 number” summary statistic, including the minimum, Q1, median, Q3 and maximum
How does a box and whisker plot describe the IQR
The box in a box and whisker plot describes the IQR which is the range between the Q1 and Q3. The box covers the middle 50% of the data, with the median line in the centre
How is a box and whisker plot created in Python
A box and whisker plot can be created in Python using plt.boxplot(x), where x is the data you want visualised, as an array or list
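A runnable sketch, assuming Matplotlib is installed; `boxplot` returns a dictionary of the drawn artists (box, whiskers, median line, outlier markers):

```python
import matplotlib
matplotlib.use("Agg")                # non-interactive backend
import matplotlib.pyplot as plt

data = [2, 3, 4, 5, 6, 7, 50]        # 50 should appear as an outlier marker
fig, ax = plt.subplots()
result = ax.boxplot(data)            # dict of artists: 'boxes', 'whiskers', 'fliers'...
fig.savefig("boxplot.png")
```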
What are some examples of advanced visualisation tools
Parallel co-ordinate plots, glyphs, quiver plots, violin plots, and dendrograms
How can parallel coordinate plots be useful for visualising data
Parallel coordinate plots are useful for visualising high-dimensional data by mapping each variable onto a separate axis and then connecting the values for each observation with a line. This allows for patterns and relationships to be seen across multiple variables at once
What is a dendrogram used for in data visualisation
A dendrogram is a diagram used to represent a hierarchical clustering of observations or variables. It shows the relationships between the clusters and the individual items being clustered
What is a violin plot and how does it differ from a box and whisker plot
A violin plot is a type of plot that combines a box and whisker plot with a density plot. It shows the same information as a box and whisker plot, but also provides a visual representation of the density of the data at different values. This makes it more informative than a box and whisker plot in situations where data is not normally distributed
What are some further choices to consider when creating visualisations
Some further choices to consider include colours, markers, line types, and layout
What is Feature Scaling
Feature Scaling is a technique used to handle the variance in scale between different features in a dataset
Why is Feature Scaling important
Many methods struggle to handle the variance in scale between different features in a dataset. Thus, scaling the features into a suitable range helps to counteract the issue
What are the two common methods of Feature Scaling
The two common methods of Feature Scaling are Normalisation and Standardisation
What is normalisation in feature scaling
Normalisation is a technique used to scale features into the range of 0-1. It involves shifting or translating the points by subtracting the minimum value and rescaling the values by dividing the difference between the maximum and minimum values
What kind of data works well with normalisation in feature scaling
Normalisation works well for data that has a fixed natural range, with no values outside that range, and that need not follow a Normal distribution; examples include images with pixel values between 0 and 255, or a skewed population of coursework grades
What is the formula for normalisation
X’ = (X - Xmin) / (Xmax - Xmin), where X’ is the new normalised dataset, X is the original dataset, Xmin is the minimum observed value, and Xmax is the maximum observed value for the feature. This formula is used to shift and rescale the data to a range of 0 to 1
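The formula translates directly into a few lines of Python; this is an illustrative helper (the name `normalise` is my own):

```python
def normalise(xs):
    """Min-max scaling: x' = (x - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

print(normalise([10, 20, 30, 40]))   # [0.0, 0.333..., 0.666..., 1.0]
```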
What is standardisation in feature scaling
Standardisation is a technique used in feature scaling to transform a dataset so that it has a mean of 0 and a standard deviation of 1. It involves subtracting the mean value of each feature from the original data and then dividing by the standard deviation of that feature. This rescales the data to have a more standardised range and allows for easier comparison between different features in the dataset
What is the formula for standardisation
The formula for standardisation is (X - μ) / σ, where X is the original feature, μ is the mean of the feature, σ is the standard deviation of the feature
What does the subtraction in the standardisation formula do
The subtraction in the standardisation formula does mean centring of the feature, i.e. it subtracts the mean of the feature from each observation of the feature
What does the division by sigma in the standardisation formula do
The division by sigma in the standardisation formula does rescaling of the feature, i.e. it scales the feature so that it has a standard deviation of 1
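Putting the two steps together, a standard-library sketch (the helper name `standardise` is my own; `pstdev` is the population standard deviation):

```python
from statistics import mean, pstdev

def standardise(xs):
    """z = (x - mu) / sigma, using the population standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

z = standardise([2, 4, 6, 8])
print(z)   # mean of z is 0 and stdev of z is 1 (up to float rounding)
```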
When does standardisation work well for data
Standardisation works well for data where you do not want to bound the range of observed values, and that follows a Normal distribution
What is the difference between normalisation and standardisation in feature scaling
Normalisation scales the features into the range of 0-1 by subtracting the minimum value and dividing by the range, while standardisation scales the features to have a mean of 0 and a standard deviation of 1 by subtracting the mean and dividing by the standard deviation
When is it appropriate to use normalisation in feature scaling
Normalisation is appropriate when the data has a fixed natural range and no values outside that range, and when the data does not follow a Normal distribution
When is it appropriate to use standardisation in feature scaling
Standardisation is appropriate when the data does not need to be bound by an upper or lower limit, and when the data follows a Normal distribution. It is also useful when you want to compare features that have different units or scales, and when you want to reduce the effect of outliers
Do all machine learning methods require feature scaling
No, not all machine learning methods require feature scaling. Some methods, such as decision trees, can handle large disparities in feature ranges and do not require scaling. However, other methods, such as k-nearest neighbour and support vector machines, are sensitive to the scale of the features and may require scaling for optimal performance
What is the curse of dimensionality
The curse of dimensionality is the difficulty in dealing with high dimensionality in our data, which can lead to issues such as difficulty in visualisation, increased complexity in modelling, potential redundant information, and longer computation times
What are some examples of high-dimensional datasets
Mushroom dataset (22D), Lung Cancer dataset (23D), and any dataset with a large number of features
How can high dimensionality affect our ability to model and analyse data
High dimensionality can make it difficult to visualise the data and can lead to redundant information, increased complexity, and longer computation times. It can also make it harder to identify important features and patterns in the data
What are some techniques for visualising high-dimensional data
Some techniques for visualising high-dimensional data include scatter plot matrices (SPloMs), parallel coordinate plots, and glyph plots. These techniques allow for the visualisation of multiple dimensions at once and can help identify patterns and relationships in the data
How can dimensionality reduction help address the curse of dimensionality
Dimensionality reduction techniques can help address the curse of dimensionality by reducing the number of features in the data while retaining important information. This can simplify the data, make it easier to analyse and model, and reduce computation time
What is SPloM
SPloM stands for Scatter Plot Matrix, which is a visualisation technique used to explore the relationship between multiple variables in a dataset
How is SPloM constructed
SPloM is constructed by creating a matrix of scatter plots where each row and column represents a variable in the dataset. Along the diagonal of the matrix, histograms of each variable are plotted to show the distribution of that variable
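A minimal sketch using pandas' built-in `scatter_matrix` helper, assuming pandas and Matplotlib are installed; the data values are made up purely for illustration:

```python
import matplotlib
matplotlib.use("Agg")                # non-interactive backend
import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up data purely for illustration
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 60, 65, 80, 90],
    "age":    [20, 25, 30, 35, 40],
})

axes = scatter_matrix(df, diagonal="hist")   # 3x3 grid, histograms on the diagonal
print(axes.shape)                            # (3, 3)
```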
What information can be obtained from the diagonal of a SPloM
The diagonal of a SPloM shows the distribution of each variable in the dataset. This can give an idea of the central tendency, spread, and skewness of each variable
What are some limitations of SPloM
SPloM can become cluttered and difficult to interpret when there are too many variables in the dataset. Additionally, SPloM only shows pairwise relationships between variables, so it may miss patterns that involve three or more variables at once
In what ways can the dimensionality of data limit the usefulness of SPloM
As the dimensionality of data increases, the number of scatter plots required in the SPloM also increases. This can make the visualisation difficult to interpret and may result in overplotting, where data points in the scatter plots overlap and become difficult to distinguish. Additionally, with higher dimensionality, there is an increased likelihood of redundant information or irrelevant variables, which may obscure important relationships between variables
What is the Parallel Coordinates method
Parallel Coordinates is a visualisation technique used to plot high-dimensional data. It allows us to plot multiple variables simultaneously and observe the relationships between them
How is the Parallel Coordinates plot constructed
In the Parallel Coordinates plot, each variable is assigned an axis, and the observed measurements of the variable are plotted along the corresponding axis. The plot consists of a series of lines, with each line representing a single observation and connecting the values of each variable for that observation
What is the advantage of using Parallel Coordinates for visualising high-dimensional data
Parallel Coordinates can handle numerous variables, making it useful for visualising high-dimensional data. It also allows us to observe patterns and relationships between variables that may not be easily seen with other visualisation techniques
What are some limitations of Parallel Coordinates
One limitation of Parallel Coordinates is that it can become difficult to interpret as the number of variables increases. Additionally, it may not be suitable for data that contains categorical variables or data with extreme outliers
How does the number of observed samples affect the usefulness of Parallel Coordinates
Parallel Coordinates can become difficult to interpret as the number of observed samples increases, as the plot can become cluttered and difficult to read. Therefore it may not be suitable for visualising very large datasets
What is a Glyph Plot
A Glyph Plot is a visualisation technique that uses symbols or shapes to display features of high-dimensionality
How are features displayed in a Glyph Plot
Features are displayed as components of glyphs (symbols or shapes)
What is Chernoff Face?
A Chernoff Face is a type of Glyph Plot where components of a face (such as eyes, nose, mouth, etc) are controlled by data
How are components of a Chernoff Face controlled by data
The components of a Chernoff Face are controlled by data through various visual properties such as size, shape, placement, and orientation
What is the maximum number of dimensions that can be displayed using Chernoff Faces
Up to 18 dimensions can be displayed
What are the advantages of glyph plots
They can display numerous features, up to 18 dimensions. They can be visually appealing and intuitive, as they use familiar shapes and symbols to represent features. They can highlight patterns and correlations between features, such as the relationship between the size of the nose in a Chernoff Face and a particular feature in the data
What are the disadvantages of glyph plots
They may be difficult to interpret for those who are not familiar with the particular glyph used. They may be less precise than other visualisation techniques, as the relationship between glyph components and the underlying data may be unclear. They may not be suitable for all types of data or all research questions, particularly those that require a high degree of precision or accuracy
What is dimensionality reduction
Dimensionality reduction refers to the process of reducing the number of features (or dimensions) of a dataset while retaining as much of the original information as possible
Why might someone want to reduce the dimensions of their data
There are several reasons why someone might want to reduce the dimensions of their data, including:
To simplify the data and make it easier to visualise or interpret
To reduce noise and redundancy in the data
To improve the performance of machine learning algorithms that might struggle with high-dimensional data
What is the disadvantage of arbitrarily removing features to reduce the dimension of data
The disadvantage of arbitrarily removing features to reduce the dimension of data is that important information might be lost. It is important to use a statistical approach that considers the relationships between the features in the dataset
What are some statistical approaches to reduce the dimension of data
There are several statistical approaches to reduce the dimensions of data including:
Principal Component Analysis (PCA)
Correspondence Analysis (CA)
Multi-Correspondence Analysis (MCA)
Factor-Analysis of Mixed Data (FAMD)
Multi-Factor Analysis (MFA)
What is Principal Component Analysis (PCA)
PCA is a linear transformation technique that identifies the most important directions (or principal components) of a dataset and projects the data onto a lower-dimensional space while retaining as much of the original variance as possible
What is Correspondence Analysis (CA)
CA is a method for visualising the relationships between categorical variables in a high-dimensional dataset by representing the variables as points in a low-dimensional space
What is Multi-Correspondence Analysis (MCA)
MCA is a technique that extends Correspondence Analysis to handle datasets with multiple categorical variables
What is Factor-Analysis of Mixed Data (FAMD)
FAMD is a dimensionality reduction technique that is specifically designed to handle datasets with both categorical and continuous variables
What is Multi-Factor Analysis (MFA)
MFA is a method for reducing the dimensions of a dataset with multiple blocks of variables (e.g., different types of data collected from different sources) by identifying common factors that are shared across the blocks
How does PCA create new axes (Look back on notes)
PCA creates new axes by first finding the direction of highest variance across the observed samples; each subsequent axis is chosen to be orthogonal (perpendicular) to the previous axes while capturing the highest remaining variance
What is the significance of variance in PCA
In PCA, variance is used to measure the amount of information or variability present in a particular direction or axis. The axes with the highest variance are considered the most important, as they capture the most significant patterns in the data
Can PCA be used for feature selection
Yes, PCA can be used for feature selection, as it can identify the most important features that contribute to the variability in the data. By selecting only the most important features, the dimensionality of the data can be reduced, which can improve the accuracy and efficiency of machine learning models
What is Scikit-learn (sklearn)
Scikit-learn is a Python package that provides a range of machine learning tools and algorithms, including functionality to perform PCA on data
What is the purpose of performing PCA using sklearn
The purpose of performing PCA using sklearn is to simplify complex data sets by reducing their dimensionality and finding patterns or trends that may not be easily visible in the original data
What input does sklearn’s PCA function require (Check notes)
Sklearn’s PCA function requires the input data to be passed in as an array or a matrix
What output does sklearn’s PCA function provide
Sklearn’s PCA function provides the projected samples into the new set of principal component axes
Can we specify the number of principal components to be returned by sklearn’s PCA function using the “n_components” parameter
Yes, we can specify the number of principal components to be returned by sklearn’s PCA function using the “n_components” parameter
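A minimal end-to-end sketch, assuming NumPy and scikit-learn are installed; the random data is purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features (illustrative)

pca = PCA(n_components=2)                # keep the 2 highest-variance axes
X_reduced = pca.fit_transform(X)         # samples projected onto the new axes

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # fraction of variance captured per axis
```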