Exploratory Data Analysis Flashcards
What is a normal distribution
A normal distribution is a probability distribution that is symmetric around its mean; common examples include people's heights and weights, and IQ scores. In a normal distribution, the mean, median, and mode are all equal
What is a skewed distribution
A skewed distribution is a probability distribution where the data is not symmetric around the mean, and one tail of the distribution has more extreme values than the other. There are two types of skewed distributions: left-skewed (negative skew) and right-skewed (positive skew)
What is an example of a left-skewed distribution
Prices of used cars, where there are more cars with a high price than with a low price
What is an example of a right-skewed distribution
Distributions of age at first marriage, where there are more people who get married at a younger age than at an older age
What is a uniform distribution
A uniform distribution is a probability distribution where all values have an equal chance of occurring. This means that the probability of any value within a given range is the same.
What is an example of a uniform distribution
Rolling a fair die, where each number has an equal chance of being rolled
What is a bi-modal distribution
A bi-modal distribution is a probability distribution where there are two distinct peaks, or modes, in the data. This indicates that there are two underlying subpopulations within the data that are distinct from each other.
What is an example of a bi-modal distribution
An example of a bi-modal distribution is the distribution of heights for a population that includes both adults and children
What are some key features of a normal distribution
Some key features of a normal distribution include the fact that it is symmetric, the mean, median and mode are all equal, and the frequency falls off in both directions away from the centre
What is the area under the curve of a normal distribution
The area under the curve of a normal distribution is equal to 1, meaning that the probabilities of all possible outcomes sum to 1
What are the two parameters that determine the shape of a normal distribution
The two parameters that determine the shape of a normal distribution are the mean and the standard deviation
What is the empirical rule
The empirical rule is a statistical rule of thumb that states that for a normal distribution, approximately 68% of the data falls within one standard deviation of the mean, 95% of the data falls within two standard deviations of the mean, and 99.7% of the data falls within three standard deviations of the mean
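The 68/95/99.7 figures can be sanity-checked with a quick simulation. The sketch below (Python standard library only) draws samples from a normal distribution and counts how many fall within one and two standard deviations of the mean:

```python
# Sanity-check the empirical rule by sampling from a normal distribution.
# Standard library only.
import random

random.seed(42)                      # fixed seed for reproducibility
mu, sigma = 0.0, 1.0
samples = [random.gauss(mu, sigma) for _ in range(100_000)]

within_1sd = sum(abs(x - mu) <= sigma for x in samples) / len(samples)
within_2sd = sum(abs(x - mu) <= 2 * sigma for x in samples) / len(samples)

print(round(within_1sd, 2))          # close to 0.68
print(round(within_2sd, 2))          # close to 0.95
```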
Can a distribution be both normal and skewed
No, a distribution cannot be both normal and skewed. A normal distribution is always symmetric, while a skewed distribution is not symmetric
What is the difference between normal distribution and a uniform distribution
A normal distribution is bell-shaped and symmetric around the mean, while a uniform distribution is flat and all values are equally likely
What is the difference between skewed left and skewed right
Skewed left and skewed right refer to the direction of the tail of the distribution. In a skewed left distribution, the tail is on the left side and the mean is smaller than the median. In a skewed right distribution, the tail is on the right side and the mean is larger than the median
How do skewed distributions impact statistical analysis
Skewed distributions can have a significant impact on statistical analysis because they can influence the interpretation of measures such as the mean and standard deviation
What is the relationship between the mean and median in a skewed distribution
In a skewed distribution, the mean and median can be different from each other. The mean is pulled towards the tail of the distribution, while the median remains in the centre
Why is a perfectly flat uniform distribution rare
A perfectly flat uniform distribution is rare because it would require an infinite sample size, which is not practical in most cases. In reality, even if the distributions are uniform, there will be some small variation due to sampling
What is the relationship between mean and median in uniform distribution
The mean and median are equal. This is because every value in the distribution has the same frequency of occurrence and contributes equally to the calculation of both mean and median
How does a bi-modal distribution differ from a normal distribution
A normal distribution is symmetrical with a single peak, whereas a bi-modal distribution has two peaks (and is often, though not necessarily, asymmetrical)
What are some examples of phenomena that may exhibit a bi-modal distribution
Income distributions in certain societies, test scores when a class splits into two clusters of ability, or bi-modal response patterns in psychological studies
What is the Inter Quartile Range (IQR)
The Inter Quartile Range is the range between the first and third quartiles of a dataset
What is the IQR used for
The IQR is used to measure the spread of data by identifying the range between the first quartile (Q1) and the third quartile (Q3)
What are the 6 different data points usually found on a box plot
Minimum, Quartile 1, Median (Q2), Quartile 3, Maximum, Extreme values (outliers)
How to calculate IQR if n is odd
We can define n as being 2k + 1 for k ∈ ℕ
The first quartile is defined as the median of the samples {o0 , o1 , o2 , …, ok-1}
The second quartile is the usual median value: ok
Third quartile is defined as median of the samples {ok+1 , ok+2 , ok+3 , …, o2k} (the last index is n-1 = 2k)
How to calculate IQR if n is even
We can define n as being 2k for k ∈ ℕ
The first quartile is defined as the median of the samples {o0 , o1 , o2 , …, ok-1}
The second quartile is the usual median value: (ok-1 + ok) / 2
Third quartile is defined as median of the samples {ok , ok+1 , ok+2 , …, o2k-1} (the last index is n-1 = 2k-1)
Formula for IQR
IQR = Q3 - Q1
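The odd-n and even-n recipes above can be sketched as a small helper. This is an illustrative implementation of the median-of-halves rule from these cards (the function name `quartiles` is my own); note that libraries such as NumPy default to different interpolation methods, so their results can differ slightly:

```python
from statistics import median

def quartiles(data):
    """Q1, Q2, Q3 using the median-of-halves rule described above
    (the median itself is excluded from the halves when n is odd)."""
    xs = sorted(data)
    k = len(xs) // 2                 # n = 2k or n = 2k + 1
    q1 = median(xs[:k])              # lower half: o0 .. o(k-1)
    q2 = median(xs)
    q3 = median(xs[-k:])             # upper half: last k samples
    return q1, q2, q3

q1, q2, q3 = quartiles([1, 3, 5, 7, 9, 11, 13, 15, 17])   # n = 9, k = 4
print(q1, q2, q3)        # 4.0 9 14.0
print(q3 - q1)           # IQR = 10.0
```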
What are outliers in statistics
Outliers are data points that are significantly different from the other observations in a dataset
Why is it important to identify outliers in data
Identifying outliers is important because they can significantly affect the mean and standard deviation of a dataset, and can also impact the results of statistical analyses
How can outliers be detected using IQR
Outliers can be detected using the IQR and the 1.5 × IQR rule. Any data points outside the range [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] are considered outliers
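A minimal sketch of the 1.5 × IQR rule (the helper name `iqr_outliers` is my own, and the quartiles use the same median-of-halves rule as the cards above):

```python
from statistics import median

def iqr_outliers(data):
    """Return the points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    xs = sorted(data)
    k = len(xs) // 2
    q1, q3 = median(xs[:k]), median(xs[-k:])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

print(iqr_outliers([2, 3, 4, 5, 6, 7, 50]))   # [50]
```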
Are outliers always bad data points that should be removed
Outliers can be valid data points that represent a real phenomenon in a population. However, they should be investigated further to determine if they are valid or if they are the result of errors or anomalies
What are some benefits of using visualisation tools in data analysis
Visualisation tools can help identify patterns and trends, communicate insights, present information succinctly, provide evidence and support, and influence and persuade
How do visualisations support data analysis
Visualisations can provide insights beyond just the numbers and statistics, making it easier to identify patterns and communicate results. They can also raise further questions and lines of inquiry
What are some common types of visualisation tools used in data analysis
Some common types of visualisation tools include scatter plots, histograms, bar charts, line charts, heat maps, and pie charts
How can visualisation be used to support decision-making
Visualisation can provide a clear and concise way to present information and insights, allowing decision makers to quickly understand key trends and patterns and make more informed decisions
What are some best practices for creating effective visualisation
Some best practices for creating effective visualisations include choosing the appropriate type of visualisation for the data being presented, using clear and concise labels and legends, avoiding clutter and unnecessary details and ensuring the visualisation is easily understandable by the intended audience
What is Anscombe’s Quartet
Anscombe’s Quartet is a collection of four datasets that have nearly identical descriptive statistics (including mean and standard deviation), but look very different when plotted
What is the purpose of Anscombe’s Quartet
The purpose of Anscombe’s Quartet is to demonstrate how descriptive statistics alone can hide underlying data and the importance of visualising data to fully understand it
How many datasets are included in Anscombe’s Quartet
Anscombe’s Quartet consists of four datasets
What is the significance of the identical mean and standard deviation values in Anscombe’s Quartet
The identical mean and standard deviation values in Anscombe’s Quartet highlight the limitations of relying solely on summary statistics to understand a dataset, and demonstrate the importance of visualising data to uncover patterns and relationships
What is simulated annealing in the context of generating datasets
Simulated annealing is a technique used to generate datasets with desired statistical properties by starting with a set of data points in roughly the right place, then iteratively adjusting their positions until the desired outcome is achieved (i.e. matching mean and standard deviation)
What is a scatter chart used for
A scatter chart is used to show the relationship between two variables
How do you plot data on a scatter chart
You choose two variables and plot one on the x-axis and the other on the y-axis
Can you plot more than two variables on a scatter chart
Yes, you can sometimes plot a third variable by using colour or size of the marker
What is the syntax for creating a scatter chart in Python using Matplotlib
The syntax for creating a scatter chart in Python using Matplotlib is plt.scatter(x,y)
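A slightly fuller sketch, assuming Matplotlib is installed; it also encodes a hypothetical third variable as marker size, as mentioned in the earlier card:

```python
import matplotlib
matplotlib.use("Agg")                # non-interactive backend: no window needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
sizes = [20, 40, 60, 80, 100]        # hypothetical third variable as marker size

fig, ax = plt.subplots()
sc = ax.scatter(x, y, s=sizes)
ax.set_xlabel("x variable")
ax.set_ylabel("y variable")
fig.savefig("scatter.png")
```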
What is the purpose of a bar/column chart in data visualisation
Bar/column charts are used to compare the counts or frequencies of different categories or variables. The height of the bars or columns represents the observed count for each category, making it easy to see which categories have more or less counts than others
How do you create bar chart in Python using Matplotlib
To create a bar chart in Python using Matplotlib, you can use the function plt.bar(x,height) where x is an array or list of values for the x-axis and height is an array or list of values for the heights of the bars. You can also add labels, titles, and other formatting options to the chart using additional Matplotlib functions
What is a line chart used for
A line chart is used to show how a trend occurs over a series of observations. It is commonly used to visualise changes over time or across a range
What is the x-axis in a line chart
The x-axis in a line chart represents the dimensions across which you want the trend to be measured. For example, if you are plotting stock prices over time, the x-axis would represent time
What is the y-axis in a line chart
The y-axis in a line chart represents the measured value. For example, if you are plotting stock prices over time, the y-axis would represent the price of the stock
How can you create a line chart in Python
You can create a line chart in Python using the “plt.plot” function. This function takes two arguments: the x-axis values and the y-axis values. For example, plt.plot([1,2,3,4,5],[10,15,20,25,30]) would create a line chart with x-axis values of 1 through 5 and y-axis values of 10,15,20,25,30
What type of data is best visualised using histograms
Histograms are best used to visualise the distribution of continuous data, where the data is divided into intervals or bins along the x-axis and the frequency of observations falling into each bin is represented by the height of the bars on the y-axis
What does the height of each bar in a histogram represent
The height of each bar in a histogram represents the frequency or count of observations that fall into the corresponding bin or interval along the x-axis
How do you create a histogram in Python
plt.hist(data, bins). Pass in the dataset as the first argument. The ‘bins’ parameter specifies the number of bins to use for grouping the data, e.g. plt.hist(data, bins=5)
What does a pie chart show
A pie chart shows how categories share proportions of a whole
What does each ‘wedge’ in a pie chart represent
Each ‘wedge’ in a pie chart represents a category with the size denoting the amount
How is the size of each ‘wedge’ in a pie chart determined
The size of each ‘wedge’ in a pie chart is determined by the proportion of the whole that the corresponding category represents
What is the Python command to create a pie chart
The Python command to create a pie chart is plt.pie(x) where x is a list or array of data to be represented in the chart
What is the purpose of a box and whisker plot in data visualisation
A box and whisker plot is used to show the distribution of a dataset and to visualise the “5 number” summary statistic, including the minimum, Q1, median, Q3 and maximum
How does a box and whisker plot describe the IQR
The box in a box and whisker plot describes the IQR which is the range between the Q1 and Q3. The box covers the middle 50% of the data, with the median line in the centre
How is a box and whisker plot created in Python
A box and whisker plot can be created in Python using plt.boxplot(x), where x is the data you want visualised, as an array or list
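A runnable sketch, assuming Matplotlib is installed; `boxplot` returns a dictionary of the drawn artists (box, whiskers, median line, outlier markers):

```python
import matplotlib
matplotlib.use("Agg")                # non-interactive backend
import matplotlib.pyplot as plt

data = [2, 3, 4, 5, 6, 7, 50]        # 50 should appear as an outlier marker
fig, ax = plt.subplots()
result = ax.boxplot(data)            # dict of artists: 'boxes', 'whiskers', 'fliers'...
fig.savefig("boxplot.png")
```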
What are some examples of advanced visualisation tools
Parallel co-ordinate plots, glyphs, quiver plots, violin plots, and dendrograms
How can parallel coordinate plots be useful for visualising data
Parallel coordinate plots are useful for visualising high-dimensional data by mapping each variable onto a separate axis and then connecting the values for each observation with a line. This allows for patterns and relationships to be seen across multiple variables at once
What is a dendrogram used for in data visualisation
A dendrogram is a diagram used to represent a hierarchical clustering of observations or variables. It shows the relationships between the clusters and the individual items being clustered
What is a violin plot and how does it differ from a box and whisker plot
A violin plot is a type of plot that combines a box and whisker plot with a density plot. It shows the same information as a box and whisker plot, but also provides a visual representation of the density of the data at different values. This makes it more informative than a box and whisker plot in situations where data is not normally distributed
What are some further choices to consider when creating visualisations
Some further choices to consider include colours, markers, line types, and layout
What is Feature Scaling
Feature Scaling is a technique used to handle the variance in scale between different features in a dataset
Why is Feature Scaling important
Many methods struggle to handle the variance in scale between different features in a dataset. Thus, scaling the features into a suitable range helps to counteract the issue
What are the two common methods of Feature Scaling
The two common methods of Feature Scaling are Normalisation and Standardisation
What is normalisation in feature scaling
Normalisation is a technique used to scale features into the range of 0-1. It involves shifting or translating the points by subtracting the minimum value and rescaling the values by dividing the difference between the maximum and minimum values
What kind of data works well with normalisation in feature scaling
Normalisation works well for data that has a fixed natural range, with no values outside that range, and that need not follow a Normal distribution; examples include images with pixel values between 0 and 255, or a skewed population of coursework grades
What is the formula for normalisation
X’ = (X - Xmin) / (Xmax - Xmin), where X’ is the new normalised dataset, X is the original dataset, Xmin is the minimum observed value, and Xmax is the maximum observed value for the feature. This formula is used to shift and rescale the data to a range of 0 to 1
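The formula translates directly into a few lines of Python; this is an illustrative helper (the name `normalise` is my own):

```python
def normalise(xs):
    """Min-max scaling: x' = (x - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

print(normalise([10, 20, 30, 40]))   # [0.0, 0.333..., 0.666..., 1.0]
```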
What is standardisation in feature scaling
Standardisation is a technique used in feature scaling to transform a dataset so that it has a mean of 0 and a standard deviation of 1. It involves subtracting the mean value of each feature from the original data and then dividing by the standard deviation of that feature. This rescales the data to have a more standardised range and allows for easier comparison between different features in the dataset
What is the formula for standardisation
The formula for standardisation is (X - μ) / σ, where X is the original feature, μ is the mean of the feature, σ is the standard deviation of the feature
What does the subtraction in the standardisation formula do
The subtraction in the standardisation formula does mean centring of the feature, i.e. it subtracts the mean of the feature from each observation of the feature
What does the division by sigma in the standardisation formula do
The division by sigma in the standardisation formula does rescaling of the feature, i.e. it scales the feature so that it has a standard deviation of 1
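Putting the two steps together, a standard-library sketch (the helper name `standardise` is my own; `pstdev` is the population standard deviation):

```python
from statistics import mean, pstdev

def standardise(xs):
    """z = (x - mu) / sigma, using the population standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

z = standardise([2, 4, 6, 8])
print(z)   # mean of z is 0 and stdev of z is 1 (up to float rounding)
```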
When does standardisation work well for data
Standardisation works well for data where you do not want to bound the range of observed values, and that follows a Normal distribution
What is the difference between normalisation and standardisation in feature scaling
Normalisation scales the features into the range of 0-1 by subtracting the minimum value and dividing by the range, while standardisation scales the features to have a mean of 0 and a standard deviation of 1 by subtracting the mean and dividing by the standard deviation
When is it appropriate to use normalisation in feature scaling
Normalisation is appropriate when the data has a fixed natural range and no values outside that range, and when the data does not follow a Normal distribution
When is it appropriate to use standardisation in feature scaling
Standardisation is appropriate when the data does not need to be bound by an upper or lower limit, and when the data follows a Normal distribution. It is also useful when you want to compare features that have different units or scales, and when you want to reduce the effect of outliers
Do all machine learning methods require feature scaling
No, not all machine learning methods require feature scaling. Some methods, such as decision trees, can handle large disparities in feature ranges and do not require scaling. However, other methods, such as k-nearest neighbour and support vector machines, are sensitive to the scale of the features and may require scaling for optimal performance
What is the curse of dimensionality
The curse of dimensionality is the difficulty in dealing with high dimensionality in our data, which can lead to issues such as difficulty in visualisation, increased complexity in modelling, potential redundant information, and longer computation times
What are some examples of high-dimensional datasets
Mushroom dataset (22D), Lung Cancer dataset (23D), and any dataset with a large number of features
How can high dimensionality affect our ability to model and analyse data
High dimensionality can make it difficult to visualise the data and can lead to redundant information, increased complexity, and longer computation times. It can also make it harder to identify important features and patterns in the data
What are some techniques for visualising high-dimensional data
Some techniques for visualising high-dimensional data include scatter plot matrices (SPloMs), parallel coordinate plots, and glyph plots. These techniques allow for the visualisation of multiple dimensions at once and can help identify patterns and relationships in the data
How can dimensionality reduction help address the curse of dimensionality
Dimensionality reduction techniques can help address the curse of dimensionality by reducing the number of features in the data while retaining important information. This can simplify the data, make it easier to analyse and model, and reduce computation time
What is SPloM
SPloM stands for Scatter Plot Matrix, which is a visualisation technique used to explore the relationship between multiple variables in a dataset
How is SPloM constructed
SPloM is constructed by creating a matrix of scatter plots where each row and column represents a variable in the dataset. Along the diagonal of the matrix, histograms of each variable are plotted to show the distribution of that variable
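A minimal sketch using pandas' built-in `scatter_matrix` helper, assuming pandas and Matplotlib are installed; the data values are made up purely for illustration:

```python
import matplotlib
matplotlib.use("Agg")                # non-interactive backend
import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up data purely for illustration
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 60, 65, 80, 90],
    "age":    [20, 25, 30, 35, 40],
})

axes = scatter_matrix(df, diagonal="hist")   # 3x3 grid, histograms on the diagonal
print(axes.shape)                            # (3, 3)
```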
What information can be obtained from the diagonal of a SPloM
The diagonal of a SPloM shows the distribution of each variable in the dataset. This can give an idea of the central tendency, spread, and skewness of each variable
What are some limitations of SPloM
SPloM can become cluttered and difficult to interpret when there are too many variables in the dataset. Additionally, SPloM only shows pairwise relationships between variables, so it may miss patterns that involve three or more variables at once
In what ways can the dimensionality of data limit the usefulness of SPloM
As the dimensionality of data increases, the number of scatter plots required in the SPloM also increases. This can make the visualisation difficult to interpret and may result in overplotting, where data points in the scatter plots overlap and become difficult to distinguish. Additionally, with higher dimensionality, there is an increased likelihood of redundant information or irrelevant variables, which may obscure important relationships between variables
What is the Parallel Coordinates method
Parallel Coordinates is a visualisation technique used to plot high-dimensional data. It allows us to plot multiple variables simultaneously and observe the relationships between them
How is the Parallel Coordinates plot constructed
In the Parallel Coordinates plot, each variable is assigned an axis, and the observed measurements of the variable are plotted along the corresponding axis. The plot consists of a series of lines, with each line representing a single observation and connecting the values of each variable for that observation
What is the advantage of using Parallel Coordinates for visualising high-dimensional data
Parallel Coordinates can handle numerous variables, making it useful for visualising high-dimensional data. It also allows us to observe patterns and relationships between variables that may not be easily seen with other visualisation techniques
What are some limitations of Parallel Coordinates
One limitation of Parallel Coordinates is that it can become difficult to interpret as the number of variables increases. Additionally, it may not be suitable for data that contains categorical variables or data with extreme outliers
How does the number of observed samples affect the usefulness of Parallel Coordinates
Parallel Coordinates can become difficult to interpret as the number of observed samples increases, as the plot can become cluttered and difficult to read. Therefore it may not be suitable for visualising very large datasets
What is a Glyph Plot
A Glyph Plot is a visualisation technique that uses symbols or shapes to display features of high-dimensionality
How are features displayed in a Glyph Plot
Features are displayed as components of glyphs (symbols or shapes)
What is Chernoff Face?
A Chernoff Face is a type of Glyph Plot where components of a face (such as eyes, nose, mouth, etc) are controlled by data
How are components of a Chernoff Face controlled by data
The components of a Chernoff Face are controlled by data through various visual properties such as size, shape, placement, and orientation
What is the maximum number of dimensions that can be displayed using Chernoff Faces
Up to 18 dimensions can be displayed
What are the advantages of glyph plots
They can display numerous features, up to 18 dimensions. They can be visually appealing and intuitive, as they use familiar shapes and symbols to represent features. They can highlight patterns and correlations between features, such as the relationship between the size of the nose in a Chernoff Face and a particular feature in the data
What are the disadvantages of glyph plots
They may be difficult to interpret for those who are not familiar with the particular glyph used. They may be less precise than other visualisation techniques, as the relationship between glyph components and the underlying data may be unclear. They may not be suitable for all types of data or all research questions, particularly those that require a high degree of precision or accuracy
What is dimensionality reduction
Dimensionality reduction refers to the process of reducing the number of features (or dimensions) of a dataset while retaining as much of the original information as possible
Why might someone want to reduce the dimensions of their data
There are several reasons why someone might want to reduce the dimensions of their data, including:
To simplify the data and make it easier to visualise or interpret
To reduce noise and redundancy in the data
To improve the performance of machine learning algorithms that might struggle with high-dimensional data
What is the disadvantage of arbitrarily removing features to reduce the dimension of data
The disadvantage of arbitrarily removing features to reduce the dimension of data is that important information might be lost. It is important to use a statistical approach that considers the relationships between the features in the dataset
What are some statistical approaches to reduce the dimension of data
There are several statistical approaches to reduce the dimensions of data including:
Principal Component Analysis (PCA)
Correspondence Analysis (CA)
Multi-Correspondence Analysis (MCA)
Factor-Analysis of Mixed Data (FAMD)
Multi-Factor Analysis (MFA)
What is Principal Component Analysis (PCA)
PCA is a linear transformation technique that identifies the most important directions (or principal components) of a dataset and projects the data onto a lower-dimensional space while retaining as much of the original variance as possible
What is Correspondence Analysis (CA)
CA is a method for visualising the relationships between categorical variables in a high-dimensional dataset by representing the variables as points in a low-dimensional space
What is Multi-Correspondence Analysis (MCA)
MCA is a technique that extends Correspondence Analysis to handle datasets with multiple categorical variables
What is Factor-Analysis of Mixed Data (FAMD)
FAMD is a dimensionality reduction technique that is specifically designed to handle datasets with both categorical and continuous variables
What is Multi-Factor Analysis (MFA)
MFA is a method for reducing the dimensions of a dataset with multiple blocks of variables (e.g., different types of data collected from different sources) by identifying common factors that are shared across the blocks
How does PCA create new axes (Look back on notes)
PCA creates new axes by first finding the direction of highest variance across the observed samples; each subsequent axis is chosen to be orthogonal (perpendicular) to the previous axes while capturing the highest remaining variance
What is the significance of variance in PCA
In PCA, variance is used to measure the amount of information or variability present in a particular direction or axis. The axes with the highest variance are considered the most important, as they capture the most significant patterns in the data
Can PCA be used for feature selection
Yes, PCA can be used for feature selection, as it can identify the most important features that contribute to the variability in the data. By selecting only the most important features, the dimensionality of the data can be reduced, which can improve the accuracy and efficiency of machine learning models
What is Scikit-learn (sklearn)
Scikit-learn is a Python package that provides a range of machine learning tools and algorithms, including functionality to perform PCA on data
What is the purpose of performing PCA using sklearn
The purpose of performing PCA using sklearn is to simplify complex data sets by reducing their dimensionality and finding patterns or trends that may not be easily visible in the original data
What input does sklearn’s PCA function require (Check notes)
Sklearn’s PCA function requires the input data to be passed in as an array or a matrix
What output does sklearn’s PCA function provide
Sklearn’s PCA function provides the projected samples into the new set of principal component axes
Can we specify the number of principal components to be returned by sklearn’s PCA function using the “n_components” parameter
Yes, we can specify the number of principal components to be returned by sklearn’s PCA function using the “n_components” parameter
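A minimal end-to-end sketch, assuming NumPy and scikit-learn are installed; the random data is purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features (illustrative)

pca = PCA(n_components=2)                # keep the 2 highest-variance axes
X_reduced = pca.fit_transform(X)         # samples projected onto the new axes

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # fraction of variance captured per axis
```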