Exploratory Data Analysis Flashcards

1
Q

What is a normal distribution

A

A normal distribution is a probability distribution that is symmetric around its mean; common examples include people's heights and weights, and IQ scores. In a normal distribution, the mean, median, and mode are all equal

2
Q

What is a skewed distribution

A

A skewed distribution is a probability distribution where the data is not symmetric around the mean, and one tail of the distribution has more extreme values than the other. There are two types of skewed distributions: left-skewed (negative skew) and right-skewed (positive skew)

3
Q

What is an example of a left-skewed distribution

A

Prices of used cars, where there are more cars with a high price than with a low price

4
Q

What is an example of a right-skewed distribution

A

Distributions of age at first marriage, where there are more people who get married at a younger age than at an older age

5
Q

What is a uniform distribution

A

A uniform distribution is a probability distribution where all values have an equal chance of occurring. This means that the probability of any value within a given range is the same.

6
Q

What is an example of a uniform distribution

A

Rolling a fair die, where each number has an equal chance of being rolled

7
Q

What is a bi-modal distribution

A

A bi-modal distribution is a probability distribution where there are two distinct peaks, or modes, in the data. This often indicates that there are two underlying subpopulations within the data that are distinct from each other

8
Q

What is an example of a bi-modal distribution

A

An example of a bi-modal distribution is the distribution of heights for a population that includes both adults and children

9
Q

What are some key features of a normal distribution

A

Some key features of a normal distribution include the fact that it is symmetric, the mean, median and mode are all equal, and the frequency falls off in both directions away from the centre

10
Q

What is the area under the curve of a normal distribution

A

The area under the curve of a normal distribution is equal to 1, meaning that the probabilities of all possible outcomes sum to 1

11
Q

What are the two parameters that determine the shape of a normal distribution

A

The two parameters that determine the shape of a normal distribution are the mean and the standard deviation

12
Q

What is the empirical rule

A

The empirical rule is a statistical rule of thumb which states that, for a normal distribution, approximately 68% of the data falls within one standard deviation of the mean, 95% falls within two standard deviations, and 99.7% falls within three standard deviations
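
As a quick numeric check (an illustrative sketch, assuming SciPy is available; the probabilities are properties of the standard normal distribution):

from scipy.stats import norm

# Probability mass within 1, 2 and 3 standard deviations of the mean
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))
# prints approximately 0.683, 0.954 and 0.997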

13
Q

Can a distribution be both normal and skewed

A

No, a distribution cannot be both normal and skewed. A normal distribution is always symmetric, while a skewed distribution is not symmetric

14
Q

What is the difference between normal distribution and a uniform distribution

A

A normal distribution is bell-shaped and symmetric around the mean, while a uniform distribution is flat and all values are equally likely

15
Q

What is the difference between skewed left and skewed right

A

Skewed left and skewed right refer to the direction of the tail of the distribution. In a left-skewed distribution, the tail is on the left side and the mean is smaller than the median. In a right-skewed distribution, the tail is on the right side and the mean is larger than the median

16
Q

How do skewed distributions impact statistical analysis

A

Skewed distributions can have a significant impact on statistical analysis because they can influence the interpretation of measures such as the mean and standard deviation

17
Q

What is the relationship between the mean and median in a skewed distribution

A

In a skewed distribution, the mean and median can be different from each other. The mean is pulled towards the tail of the distribution, while the median remains in the centre

18
Q

Why is a perfectly flat uniform distribution rare

A

A perfectly flat uniform distribution is rare because it would require an infinite sample size, which is not practical in most cases. In reality, even if the distributions are uniform, there will be some small variation due to sampling

19
Q

What is the relationship between mean and median in uniform distribution

A

The mean and median are equal. This is because every value in the distribution has the same frequency of occurrence and contributes equally to the calculation of both mean and median

20
Q

How does a bi-modal distribution differ from a normal distribution

A

A normal distribution is symmetrical with a single peak, whereas a bi-modal distribution has two peaks and is generally not symmetrical

21
Q

What are some examples of phenomena that may exhibit a bi-modal distribution

A

Income distributions in certain societies, test scores for a class containing two distinct ability groups, or bi-modal response patterns in psychological studies

22
Q

What is the Inter Quartile Range (IQR)

A

The Inter Quartile Range is the range between the first and third quartiles of a dataset

23
Q

What is the IQR used for

A

The IQR is used to measure the spread of data by identifying the range between the first quartile (Q1) and the third quartile (Q3)

24
Q

What are the 6 different data points usually found on a box plot

A

Minimum, Quartile 1, Median (Q2), Quartile 3, Maximum, Extreme values (outliers)

25
Q

How to calculate IQR if n is odd

A

We can define n as 2k + 1 for k ∈ ℕ, with the sorted observations o_0, o_1, …, o_(n-1)
The first quartile (Q1) is the median of the lower half {o_0, o_1, o_2, …, o_(k-1)}
The second quartile (Q2) is the usual median value: o_k
The third quartile (Q3) is the median of the upper half {o_(k+1), o_(k+2), o_(k+3), …, o_(n-1)}

26
Q

How to calculate IQR if n is even

A

We can define n as 2k for k ∈ ℕ, with the sorted observations o_0, o_1, …, o_(n-1)
The first quartile (Q1) is the median of the lower half {o_0, o_1, o_2, …, o_(k-1)}
The second quartile (Q2) is the usual median value: (o_(k-1) + o_k) / 2
The third quartile (Q3) is the median of the upper half {o_k, o_(k+1), o_(k+2), …, o_(n-1)}

27
Q

Formula for IQR

A

IQR = Q3 - Q1
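
An illustrative sketch in Python (assuming NumPy; note that np.percentile interpolates between values by default, so its quartiles can differ slightly from the median-of-halves method on the earlier cards):

import numpy as np

data = np.array([3, 5, 7, 8, 12, 13, 14, 18, 21])   # made-up sample
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
print(q1, q3, iqr)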

28
Q

What are outliers in statistics

A

Outliers are data points that are significantly different from the other observations in a dataset

29
Q

Why is it important to identify outliers in data

A

Identifying outliers is important because they can significantly affect the mean and standard deviation of a dataset, and can also impact the results of statistical analyses

30
Q

How can outliers be detected using IQR

A

Outliers can be detected using the IQR and the 1.5 × IQR rule: any data points below Q1 - (1.5 × IQR) or above Q3 + (1.5 × IQR) are considered outliers
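
A minimal sketch of the rule (assuming NumPy; the values are made up, with one deliberately extreme point):

import numpy as np

data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 45])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)   # any points outside [lower, upper], here the value 45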

31
Q

Are outliers always bad data points that should be removed

A

Outliers can be valid data points that represent a real phenomenon in a population. However, they should be investigated further to determine whether they are valid or the result of errors or anomalies

32
Q

What are some benefits of using visualisation tools in data analysis

A

Visualisation tools can help identify patterns and trends, communicate insights, present information succinctly, provide evidence and support, and influence and persuade

33
Q

How do visualisations support data analysis

A

Visualisations can provide insights beyond just the numbers and statistics, making it easier to identify patterns and communicate results. They can also raise further questions and lines of inquiry

34
Q

What are some common types of visualisation tools used in data analysis

A

Some common types of visualisation tools include scatter plots, histograms, bar charts, line charts, heat maps, and pie charts

35
Q

How can visualisation be used to support decision-making

A

Visualisation can provide a clear and concise way to present information and insights, allowing decision makers to quickly understand key trends and patterns and make more informed decisions

36
Q

What are some best practices for creating effective visualisation

A

Some best practices for creating effective visualisations include choosing the appropriate type of visualisation for the data being presented, using clear and concise labels and legends, avoiding clutter and unnecessary details and ensuring the visualisation is easily understandable by the intended audience

37
Q

What is Anscombe’s Quartet

A

Anscombe’s Quartet is a collection of four datasets that share almost identical means and standard deviations, but look very different when plotted

38
Q

What is the purpose of Anscombe’s Quartet

A

The purpose of Anscombe’s Quartet is to demonstrate how descriptive statistics alone can hide underlying data and the importance of visualising data to fully understand it

39
Q

How many datasets are included in Anscombe’s Quartet

A

Anscombe’s Quartet consists of four datasets

40
Q

What is the significance of the identical mean and standard deviation values in Anscombe’s Quartet

A

The identical mean and standard deviation values in Anscombe’s Quartet highlight the limitations of relying solely on summary statistics to understand a dataset, and demonstrate the importance of visualising data to uncover patterns and relationships

41
Q

What is simulated annealing in the context of generating datasets

A

Simulated annealing is a technique used to generate datasets with desired statistical properties by starting with a set of data points in roughly the right place and then iteratively adjusting their positions until the desired outcome is achieved (i.e. a matching mean and standard deviation)
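
A minimal sketch of the idea (not the exact algorithm from the notes; it assumes NumPy, and the target mean and standard deviation are made up):

import numpy as np

rng = np.random.default_rng(0)
target_mean, target_std = 50.0, 10.0

def error(x):
    # Distance between the current sample statistics and the targets
    return abs(x.mean() - target_mean) + abs(x.std() - target_std)

x = rng.uniform(30, 70, size=100)      # points in roughly the right place
temperature = 1.0
for step in range(20000):
    candidate = x.copy()
    i = rng.integers(len(x))
    candidate[i] += rng.normal(0, 1)   # small random perturbation of one point
    delta = error(candidate) - error(x)
    # Always accept improvements; accept worse moves with a probability
    # that shrinks as the temperature cools
    if delta < 0 or rng.random() < np.exp(-delta / temperature):
        x = candidate
    temperature *= 0.9995              # cooling schedule

print(x.mean(), x.std())               # should end up close to 50 and 10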

42
Q

What is a scatter chart used for

A

A scatter chart is used to show the relationship between two variables

43
Q

How do you plot data on a scatter chart

A

You choose two variables and plot one on the x-axis and the other on the y-axis

44
Q

Can you plot more than two variables on a scatter chart

A

Yes, you can sometimes plot a third variable by using colour or size of the marker

45
Q

What is the syntax for creating a scatter chart in Python using Matplotlib

A

The syntax for creating a scatter chart in Python using Matplotlib is plt.scatter(x,y)
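
For example (a minimal sketch with made-up data, assuming Matplotlib is installed):

import matplotlib.pyplot as plt

# Hypothetical data: engine size vs fuel consumption
engine_size = [1.0, 1.4, 1.6, 2.0, 2.5, 3.0]
fuel_use = [4.5, 5.1, 5.6, 6.8, 8.0, 9.3]

plt.scatter(engine_size, fuel_use)
plt.xlabel("Engine size (litres)")
plt.ylabel("Fuel consumption (L/100 km)")
plt.show()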

46
Q

What is the purpose of a bar/column chart in data visualisation

A

Bar/column charts are used to compare the counts or frequencies of different categories or variables. The height of the bars or columns represents the observed count for each category, making it easy to see which categories have more or fewer counts than others

47
Q

How do you create a bar chart in Python using Matplotlib

A

To create a bar chart in Python using Matplotlib, you can use the function plt.bar(x,height) where x is an array or list of values for the x-axis and height is an array or list of values for the heights of the bars. You can also add labels, titles, and other formatting options to the chart using additional Matplotlib functions
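
For example (a minimal sketch with made-up category counts):

import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]
counts = [23, 17, 35, 29]

plt.bar(categories, counts)   # x-axis labels and bar heights
plt.xlabel("Category")
plt.ylabel("Count")
plt.show()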

48
Q

What is a line chart used for

A

A line chart is used to show how a trend occurs over a series of observations. It is commonly used to visualise changes over time or across a range

49
Q

What is the x-axis in a line chart

A

The x-axis in a line chart represents the dimensions across which you want the trend to be measured. For example, if you are plotting stock prices over time, the x-axis would represent time

50
Q

What is the y-axis in a line chart

A

The y-axis in a line chart represents the measured value. For example, if you are plotting stock prices over time, the y-axis would represent the price of the stock

51
Q

How can you create a line chart in Python

A

You can create a line chart in Python using the “plt.plot” function. This function takes two arguments: the x-axis values and the y-axis values. For example, plt.plot([1,2,3,4,5],[10,15,20,25,30]) would create a line chart with x-axis values of 1 through 5 and y-axis values of 10,15,20,25,30

52
Q

What type of data is best visualised using histograms

A

Histograms are best used to visualise the distribution of continuous data, where the data is divided into intervals or bins along the x-axis and the frequency of observations falling into each bin is represented by the height of the bars on the y-axis

53
Q

What does the height of each bar in a histogram represent

A

The height of each bar in a histogram represents the frequency or count of observations that fall into the corresponding bin or interval along the x-axis

54
Q

How do you create a histogram in Python

A

plt.hist(data, bins). Pass in the dataset as the first argument; the ‘bins’ parameter specifies the number of bins to use for grouping the data, e.g. plt.hist(data, bins=5)
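
For example (a minimal sketch; the height data are randomly generated purely for illustration):

import numpy as np
import matplotlib.pyplot as plt

heights = np.random.default_rng(0).normal(loc=170, scale=10, size=500)
plt.hist(heights, bins=20)   # 20 equal-width bins along the x-axis
plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
plt.show()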

55
Q

What does a pie chart show

A

A pie chart shows how categories share proportions of a whole

56
Q

What does each ‘wedge’ in a pie chart represent

A

Each ‘wedge’ in a pie chart represents a category with the size denoting the amount

57
Q

How is the size of each ‘wedge’ in a pie chart determined

A

The size of each ‘wedge’ in a pie chart is determined by the proportion of the whole that the corresponding category represents

58
Q

What is the Python command to create a pie chart

A

The Python command to create a pie chart is plt.pie(x) where x is a list or array of data to be represented in the chart

59
Q

What is the purpose of a box and whisker plot in data visualisation

A

A box and whisker plot is used to show the distribution of a dataset and to visualise the “5 number” summary: the minimum, Q1, median, Q3 and maximum

60
Q

How does a box and whisker plot describe the IQR

A

The box in a box and whisker plot describes the IQR, which is the range between Q1 and Q3. The box covers the middle 50% of the data, with the median line in the centre

61
Q

How is a box and whisker plot created in Python

A

A box and whisker plot can be created in Python using plt.boxplot(x), where x is the data to be visualised (an array or list)
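
For example (a minimal sketch with randomly generated data):

import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(0).normal(50, 10, size=200)
plt.boxplot(data)   # box spans Q1 to Q3, the line is the median, points beyond the whiskers are outliers
plt.ylabel("Value")
plt.show()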

62
Q

What are some examples of advanced visualisation tools

A

Parallel co-ordinate plots, glyphs, quivers, violin plots, and dendrograms

63
Q

How can parallel coordinate plots be useful for visualising data

A

Parallel coordinate plots are useful for visualising high-dimensional data by mapping each variable onto a separate axis and then connecting the values for each observation with a line. This allows for patterns and relationships to be seen across multiple variables at once

64
Q

What is a dendrogram used for in data visualisation

A

A dendrogram is a diagram used to represent a hierarchical clustering of observations or variables. It shows the relationships between the clusters and the individual items being clustered

65
Q

What is a violin plot and how does it differ from a box and whisker plot

A

A violin plot is a type of plot that combines a box and whisker plot with a density plot. It shows the same information as a box and whisker plot, but also provides a visual representation of the density of the data at different values. This makes it more informative than a box and whisker plot in situations where data is not normally distributed

66
Q

What are some further choices to consider when creating visualisations

A

Some further choices to consider include colours, markers, line types, and layout

67
Q

What is Feature Scaling

A

Feature Scaling is a technique used to handle the variance in scale between different features in a dataset

67
Q

Why is Feature Scaling important

A

Many methods struggle to handle the variance in scale between different features in a dataset. Thus, scaling the features into a suitable range helps to counteract the issue

68
Q

What are the two common methods of Feature Scaling

A

The two common methods of Feature Scaling are Normalisation and Standardisation

69
Q

What is normalisation in feature scaling

A

Normalisation is a technique used to scale features into the range of 0-1. It involves shifting or translating the points by subtracting the minimum value, then rescaling them by dividing by the difference between the maximum and minimum values

70
Q

What kind of data works well with normalisation in feature scaling

A

Normalisation works well for data that has a fixed natural range with no values outside that range and does not follow a Normal distribution, such as images with pixel values between 0 and 255, or a skewed population of coursework grades

71
Q

What is the formula for normalisation

A

X’ = (X - Xmin) / (Xmax - Xmin), where X’ is the new normalised dataset, X is the original dataset, Xmin is the minimum observed value, and Xmax is the maximum observed value for the feature. This formula is used to shift and rescale the data to a range of 0 to 1
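
A minimal sketch of the formula (assuming NumPy; the scikit-learn alternative shown at the end does the same rescaling):

import numpy as np

x = np.array([2.0, 5.0, 9.0, 11.0, 20.0])
x_norm = (x - x.min()) / (x.max() - x.min())   # all values now lie in [0, 1]
print(x_norm)

# Equivalent with scikit-learn (expects a 2-D array of shape (n_samples, n_features))
from sklearn.preprocessing import MinMaxScaler
x_norm_sk = MinMaxScaler().fit_transform(x.reshape(-1, 1))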

72
Q

What is standardisation in feature scaling

A

Standardisation is a technique used in feature scaling to transform a dataset so that it has a mean of 0 and a standard deviation of 1. It involves subtracting the mean value of each feature from the original data and then dividing by the standard deviation of that feature. This rescales the data to have a more standardised range and allows for easier comparison between different features in the dataset

73
Q

What is the formula for standardisation

A

The formula for standardisation is (X - μ) / σ, where X is the original feature, μ is the mean of the feature, σ is the standard deviation of the feature
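
A minimal sketch of the formula (assuming NumPy; np.std defaults to the population standard deviation, which matches scikit-learn's StandardScaler):

import numpy as np

x = np.array([2.0, 5.0, 9.0, 11.0, 20.0])
x_std = (x - x.mean()) / x.std()
print(x_std.mean(), x_std.std())   # approximately 0 and 1

# Equivalent with scikit-learn (2-D input expected)
from sklearn.preprocessing import StandardScaler
x_std_sk = StandardScaler().fit_transform(x.reshape(-1, 1))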

74
Q

What does the subtraction in the standardisation formula do

A

The subtraction in the standardisation formula performs mean centring of the feature, i.e. it subtracts the mean of the feature from each observation of the feature

75
Q

What does the division by sigma in the standardisation formula do

A

The division by sigma in the standardisation formula rescales the feature, i.e. it scales the feature so that it has a standard deviation of 1

76
Q

When does standardisation work well for data

A

Standardisation works well when you do not want to bound the range of observed values and the data follows a Normal distribution

77
Q

What is the difference between normalisation and standardisation in feature scaling

A

Normalisation scales the features into the range of 0-1 by subtracting the minimum value and dividing by the range, while standardisation scales the features to have a mean of 0 and a standard deviation of 1 by subtracting the mean and dividing by the standard deviation

78
Q

When is it appropriate to use normalisation in feature scaling

A

Normalisation is appropriate when the data has a fixed natural range and no values outside that range, and when the data does not follow a Normal distribution

79
Q

When is it appropriate to use standardisation in feature scaling

A

Standardisation is appropriate when the data does not need to be bound by an upper or lower limit, and when the data follows a Normal distribution. It is also useful when you want to compare features that have different units or scales, and when you want to reduce the effect of outliers

80
Q

Do all machine learning methods require feature scaling

A

No, not all machine learning methods require feature scaling. Some methods, such as decision trees, can handle large disparities in feature ranges and do not require scaling. However, other methods, such as k-nearest neighbour and support vector machines, are sensitive to the scale of the features and may require scaling for optimal performance

81
Q

What is the curse of dimensionality

A

The curse of dimensionality is the difficulty in dealing with high dimensionality in our data, which can lead to issues such as difficulty in visualisation, increased complexity in modelling, potential redundant information, and longer computation times

82
Q

What are some examples of high-dimensional datasets

A

Mushroom dataset (22D), Lung Cancer dataset (23D), and any dataset with a large number of features

83
Q

How can high dimensionality affect our ability to model and analyse data

A

High dimensionality can make it difficult to visualise the data and can lead to redundant information, increased complexity, and longer computation times. It can also make it harder to identify important features and patterns in the data

84
Q

What are some techniques for visualising high-dimensional data

A

Some techniques for visualising high-dimensional data include scatter plot matrices (SPloMs), parallel coordinate plots, and glyph plots. These techniques allow for the visualisation of multiple dimensions at once and can help identify patterns and relationships in the data

85
Q

How can dimensionality reduction help address the curse of dimensionality

A

Dimensionality reduction techniques can help address the curse of dimensionality by reducing the number of features in the data while retaining important information. This can simplify the data, make it easier to analyse and model, and reduce computation time

86
Q

What is SPloM

A

SPloM stands for Scatter Plot Matrix, which is a visualisation technique used to explore the relationship between multiple variables in a dataset

87
Q

How is SPloM constructed

A

SPloM is constructed by creating a matrix of scatter plots where each row and column represents a variable in the dataset. Along the diagonal of the matrix, histograms of each variable are plotted to show the distribution of that variable
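
For example (a minimal sketch using pandas' scatter_matrix on randomly generated 4-dimensional data):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])

scatter_matrix(df, diagonal="hist")   # histograms of each variable on the diagonal
plt.show()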

88
Q

What information can be obtained from the diagonal of a SPloM

A

The diagonal of a SPloM shows the distribution of each variable in the dataset. This can give an idea of the central tendency, spread, and skewness of each variable

89
Q

What are some limitations of SPloM

A

SPloM can become cluttered and difficult to interpret when there are too many variables in the dataset. Additionally, SPloM is limited to detecting linear relationships between variables and may not be able to capture non-linear relationships

90
Q

In what ways can the dimensionality of data limit the usefulness of SPloM

A

As the dimensionality of data increases, the number of scatter plots required in the SPloM also increases. This can make the visualisation difficult to interpret and may result in over plotting, where data points in the scatter plots overlap and become difficult to distinguish. Additionally, with higher dimensionality, there is an increased likelihood of redundant information or irrelevant variables, which may obscure important relationships between variables

91
Q

What is the Parallel Coordinates method

A

Parallel Coordinates is a visualisation technique used to plot high-dimensional data. It allows us to plot multiple variables simultaneously and observe the relationships between them

92
Q

How is the Parallel Coordinates plot constructed

A

In a Parallel Coordinates plot, each variable is assigned an axis, and the observed measurements of the variable are plotted along the corresponding axis. The plot consists of a series of lines, with each line representing a single observation and connecting the values of each variable for that observation
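
For example (a minimal sketch using pandas' parallel_coordinates with a tiny made-up dataset):

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "height": [1.6, 1.7, 1.8, 1.1, 1.2, 1.3],
    "weight": [60, 70, 80, 20, 25, 30],
    "age":    [30, 35, 40, 6, 7, 8],
    "group":  ["adult"] * 3 + ["child"] * 3,
})
# One axis per numeric column, one line per observation, coloured by group
parallel_coordinates(df, class_column="group")
plt.show()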

93
Q

What is the advantage of using Parallel Coordinates for visualising high-dimensional data

A

Parallel Coordinates can handle numerous variables, making it useful for visualising high-dimensional data. It also allows us to observe patterns and relationships between variables that may not be easily seen with other visualisation techniques

94
Q

What are some limitations of Parallel Coordinates

A

One limitation of Parallel Coordinates is that it can become difficult to interpret as the number of variables increases. Additionally, it may not be suitable for data that contain categorical variables or data with extreme outliers

95
Q

How does the number of observed samples affect the usefulness of Parallel Coordinates

A

Parallel Coordinates can become difficult to interpret as the number of observed samples increases, as the plot can become cluttered and difficult to read. Therefore it may not be suitable for visualising very large datasets

96
Q

What is a Glyph Plot

A

A Glyph Plot is a visualisation technique that uses symbols or shapes to display the features of high-dimensional data

97
Q

How are features displayed in a Glyph Plot

A

Features are displayed as components of glyphs (symbols or shapes)

98
Q

What is Chernoff Face?

A

A Chernoff Face is a type of Glyph Plot where components of a face (such as eyes, nose, mouth, etc) are controlled by data

99
Q

How are components of a Chernoff Face controlled by data

A

The components of a Chernoff Face are controlled by data through various visual properties such as size, shape, placement, and orientation

100
Q

What is the maximum number of dimensions that can be displayed using Chernoff Faces

A

Up to 18 dimensions can be displayed

101
Q

What are the advantages of glyph plots

A

They can display numerous features, up to 18 dimensions. They can be visually appealing and intuitive, as they use familiar shapes and symbols to represent features. They can highlight patterns and correlations between features, such as the relationship between the size of the nose in a Chernoff Face and a particular feature in the data

102
Q

What are the disadvantages of glyph plots

A

They may be difficult to interpret for those who are not familiar with the particular glyph used. They may be less precise than other visualisation techniques, as the relationship between glyph components and the underlying data may be unclear. They may not be suitable for all types of data or all research questions, particularly those that require a high degree of precision or accuracy

103
Q

What is dimensionality reduction

A

Dimensionality reduction refers to the process of reducing the number of features (or dimensions) of a dataset while retaining as much of the original information as possible

104
Q

Why might someone want to reduce the dimensions of their data

A

There are several reasons why someone might want to reduce the dimensions of their data, including:
To simplify the data and make it easier to visualise or interpret
To reduce noise and redundancy in the data
To improve the performance of machine learning algorithms that might struggle with high-dimensional data

105
Q

What is the disadvantage of arbitrarily removing features to reduce the dimension of data

A

The disadvantage of arbitrarily removing features to reduce the dimension of data is that important information might be lost. It is important to use a statistical approach that considers the relationships between the features in the dataset

106
Q

What are some statistical approaches to reduce the dimension of data

A

There are several statistical approaches to reduce the dimensions of data including:
Principal Component Analysis (PCA)
Correspondence Analysis (CA)
Multi-Correspondence Analysis (MCA)
Factor-Analysis of Mixed Data (FAMD)
Multi-Factor Analysis (MFA)

107
Q

What is Principal Component Analysis (PCA)

A

PCA is a linear transformation technique that identifies the most important directions of variation (the principal components) in a dataset and projects the data onto a lower-dimensional space while retaining as much of the original information as possible

108
Q

What is Correspondence Analysis (CA)

A

CA is a method for visualising the relationships between categorical variables in a high-dimensional dataset by representing the variables as points in a low-dimensional space

109
Q

What is Multi-Correspondence Analysis (MCA)

A

MCA is a technique that extends Correspondence Analysis to handle datasets with multiple categorical variables

110
Q

What is Factor-Analysis of Mixed Data (FAMD)

A

FAMD is a dimensionality reduction technique that is specifically designed to handle datasets with both categorical and continuous variables

111
Q

What is Multi-Factor Analysis (MFA)

A

MFA is a method for reducing the dimensions of a dataset with multiple blocks of variables (e.g., different types of data collected from different sources) by identifying common factors that are shared across the blocks

112
Q

How does PCA create new axes (Look back on notes)

A

PCA creates new axes by first finding the direction of highest variance across the observed samples, and then repeatedly finding further axes that are orthogonal (perpendicular) to the existing axes and capture the most remaining variance

113
Q

What is the significance of variance in PCA

A

In PCA, variance is used to measure the amount of information or variability present in a particular direction or axis. The axes with the highest variance are considered the most important, as they capture the most significant patterns in the data

114
Q

Can PCA be used for feature selection

A

Yes, PCA can be used for feature selection, as it can identify the most important features that contribute to the variability in the data. By selecting only the most important features, the dimensionality of the data can be reduced, which can improve the accuracy and efficiency of machine learning models

116
Q

What is Scikit-learn (sklearn)

A

Scikit-learn is a Python package that provides a range of machine learning tools and algorithms, including functionality to perform PCA on data

117
Q

What is the purpose of performing PCA using sklearn

A

The purpose of performing PCA using sklearn is to simplify complex data sets by reducing their dimensionality and finding patterns or trends that may not be easily visible in the original data

118
Q

What input does sklearn’s PCA function require (Check notes)

A

Sklearn’s PCA function requires the input data to be passed in as an array or matrix, with one row per sample and one column per feature

119
Q

What output does sklearn’s PCA function provide

A

Sklearn’s PCA function provides the projected samples into the new set of principal component axes

120
Q

Can we specify the number of principal components to be returned by sklearn’s PCA function using the “n_components” parameter

A

Yes, we can specify the number of principal components to be returned by sklearn’s PCA function using the “n_components” parameter
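
Putting the last few cards together, a minimal sketch (the data are randomly generated with deliberately correlated features; standardising before PCA is a common, but optional, first step):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
# Five features built from two underlying signals plus a little noise
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(100, 3))])

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)              # keep only the two most important axes
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                 # (100, 2): samples projected onto the new axes
print(pca.explained_variance_ratio_)   # share of the total variance each axis captures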