DSE Flashcards

Question 1

Q

What is noise?

Answer

A

Anything that may make the data unreliable, such as random variation in the measurements. Confounding variables (uncontrolled factors that also affect the result) are a related problem.

Noise stops you from getting perfectly accurate data; thus, everything you measure is a sample and not the ideal or final result.

However, you can cut down the noise as much as possible by repeating the measurement under different conditions and testing with more randomly sampled data.

This averages out the random variations so that you can reach a particular accuracy.

Question 2

Q

What are the 3 main components of summary statistics?

Answer

A

Central Tendency

Dispersion

Skewness and Kurtosis

Question 3

Q

What is Central Tendency?

Answer

A

Think “what is the centre of the distribution?” It is measured by the mode, median and mean.

Question 4

Q

What is Dispersion?

Answer

A

Think “how widely spread is my distribution about the centre?” Clustered or spread out?

Question 5

Q

What is Skewness?

Answer

A

Think about whether the distribution is lopsided: does the peak lean toward the low or high side of the x-axis, with a longer tail stretching out on the other side?

Question 6

Q

What are the problems with means?

Answer

A

Heavily influenced by extreme values. A single very large (or very small) outlier can shift the mean and make it unreliable.

The median, by contrast, is not strongly influenced by outliers.
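
The effect of one outlier on the mean can be checked with a minimal MATLAB sketch. The data values are hypothetical and chosen only for illustration.

```matlab
% Hypothetical data: four typical values and one large outlier.
x = [1 2 3 4 100];

m  = mean(x);     % pulled upward by the outlier
md = median(x);   % stays at the middle value

fprintf('mean = %g, median = %g\n', m, md);   % prints "mean = 22, median = 3"
```

Without the value 100 the mean would be 2.5, so one point moves it a long way while the median barely changes.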

Question 7

Q

What are the problems with the median?

Answer

A

It uses only the middle value (or the two middle values), disregarding the size of all the other data.

Question 8

Q

What are the problems with the mode?

Answer

A

Some datasets are multimodal, meaning they have more than one mode.

It works best in discrete datasets and not continuous ones, since continuous values rarely repeat exactly.
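
A short MATLAB sketch of the multimodal case, using hypothetical data. It relies on the three-output form of mode(), where the third output is a cell array holding all tied values.

```matlab
% Hypothetical data: 2 and 3 both appear twice, so there are two modes.
x = [1 2 2 3 3 4];

[m, f, allModes] = mode(x);
% m           : 2, because mode() returns the smallest of the tied values
% f           : 2, the number of times that value occurs
% allModes{1} : every tied value, here 2 and 3

disp(allModes{1})
```

Calling mode(x) with one output would report only 2, so it is worth checking the third output before trusting a single mode.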

Question 9

Q

Relationship between mean, median, and mode in a symmetrical dataset.

Answer

A

All three sit in the middle: mean = median = mode.

Question 10

Q

Relation between mean, median, and mode in a positively skewed dataset (peak on the left, long tail to the right).

Answer

A

Mode < Median < Mean.

Question 11

Q

Relation between mean, median, and mode in a negatively skewed dataset (peak on the right, long tail to the left).

Answer

A

Mean < Median < Mode.

Question 12

Q

What do you have to do to find the Standard Error?

Answer

A

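Divide the standard deviation by the square root of the sample size: std / sqrt(n), where n is the number of data points in the sample.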
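
As a minimal sketch, the standard error can be computed in MATLAB as follows. The measurements are hypothetical, and the variable names are chosen only for this example.

```matlab
% Hypothetical repeated measurements of the same quantity.
x = [4.8 5.1 5.0 4.9 5.2];

n   = numel(x);        % number of data points in the sample
s   = std(x);          % sample standard deviation (n - 1 denominator)
sem = s / sqrt(n);     % standard error of the mean, about 0.0707 here

fprintf('std = %.4f, standard error = %.4f\n', s, sem);
```

Because n sits under a square root, quadrupling the number of measurements only halves the standard error.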

Question 13

Q

What are Quartiles?

Answer

A

Three values, Q1, Q2 and Q3, that split the sorted data into four equal parts.

Question 14

Q

What is Q1?

Answer

A

The first (lower) quartile: the 25th percentile.

It splits the lowest 25% of the data from the highest 75%.

Question 15

Q

What is Q2?

Answer

A

The median. It cuts the data in half.

Question 16

Q

What is Q3?

Answer

A

The third (upper) quartile: the 75th percentile.

It splits the highest 25% of the data from the lowest 75%.

Question 17

Q

What is IQR?

Answer

A

Interquartile range.

It is the distance between Q1 and Q3 (IQR = Q3 - Q1), covering the middle 50% of the data.

Question 18

Q

What is kurtosis?

Answer

A

How heavy the “tails” of the distribution are, usually judged against a normal distribution.

Question 19

Q

What is the five-number summary of a dataset?

Answer

A

Min
Q1
Median
Q3
Max
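
The five values, and the IQR from them, can be sketched in MATLAB as below. The data are hypothetical, and the sketch assumes quantile() is available (in many releases it is part of the Statistics and Machine Learning Toolbox). Quartile values depend on the interpolation rule used, so other software may give slightly different Q1 and Q3.

```matlab
% Hypothetical data, deliberately unsorted.
x = [7 1 5 3 8 2 6 4];

q        = quantile(x, [0.25 0.5 0.75]);   % Q1, Q2 (median), Q3
fiveNum  = [min(x), q, max(x)];            % min, Q1, median, Q3, max
iqrValue = q(3) - q(1);                    % interquartile range, Q3 - Q1

disp(fiveNum)
disp(iqrValue)
```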

Question 20

Q

What is standard deviation?

Answer

A

A measure of spread: how spread out or clustered the data are around the mean. It indicates the consistency and uncertainty of the data.

Think of it as roughly the typical distance between a data point and the mean.

Question 21

Q

What does a small or large std mean?

Answer

A

A small std means the data points are close to the mean.
A large std means the data points are far from the mean.

Question 22

Q

Why should you use n-1 if you are researching a sample dataset?

Answer

A

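A sample tends to underestimate the spread (variance) of the true population: the deviations are measured from the sample's own mean, which sits closer to the sample points than the true mean does, and the sample may also miss extreme values. Dividing by n - 1 instead of n makes the std slightly larger to compensate.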
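
A small MATLAB sketch of the difference, using hypothetical data. MATLAB's std() divides by n - 1 by default; passing 1 as the second argument divides by n instead.

```matlab
% Hypothetical sample with mean 5; the squared deviations sum to 32.
x = [2 4 4 4 5 5 7 9];

sN   = std(x, 1);   % divides by n:     sqrt(32/8) = 2
sNm1 = std(x);      % divides by n - 1: sqrt(32/7), about 2.14

fprintf('n: %.4f, n - 1: %.4f\n', sN, sNm1);
```

The n - 1 version is always the larger of the two, and the gap shrinks as the sample grows.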

Question 23

Q

Why do you use n-1 and not n+1?

Answer

A

Because the std needs to be larger to compensate for the underestimate, we subtract 1 from n.

In the std formula, n is in the denominator, and the bigger the denominator, the smaller the answer. Using n + 1 would shrink the std when we need it larger.

Remember, extreme values lie far from the mean, and a larger std reflects data spread further from the mean.

Question 24

Q

Why not n-2 or some other value?

Answer

A

What if the dataset has only 2 data points? Dividing by n - 2 = 0 would give a math error.

n - 2 would overcorrect and overcompensate for the bias.

It may result in an answer that is too large.

n - 1 does not give the true value, but it is the best estimate: the Goldilocks correction for the std.

Question 25

Q

What are the general principles of data visualisation?

Answer

A

Keep it simple

No unnecessary decoration, variations, colours or 3D effects.

Choose colours appropriate for the audience.

Prefer 2D over 3D.

Question 26

Q

What is a scatter plot used for?

Answer

A

Used for displaying the relationship between two or more sampled variables.

plot(), or plot3() for a 3D scatter plot.

Use a LineSpec marker (for example 'o') so the points show clearly.
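
A minimal sketch of a scatter plot made with plot() and a LineSpec, using hypothetical data. The 'o' LineSpec draws circle markers with no connecting line.

```matlab
% Hypothetical paired samples.
x = [1 2 3 4 5 6];
y = [2.1 3.9 6.2 7.8 10.1 12.2];

h = plot(x, y, 'o');   % markers only, so each sample shows as a separate point
xlabel('x');
ylabel('y');
```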

Question 27

Q

What is an error bar for?

Answer

A

Represents uncertainty or variance in measured values.

May be the standard deviation, SEM, or a specified confidence interval.

Can also be used on x-values if there is a sampling variation.

errorbar()
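
As a sketch, errorbar() takes the x-values, the y-values and the size of each bar. All numbers here are hypothetical; err could hold standard deviations or SEM values.

```matlab
% Hypothetical mean values and their uncertainties.
x   = 1:4;
y   = [2.0 2.9 4.1 5.2];
err = [0.2 0.3 0.25 0.4];

e = errorbar(x, y, err, 'o');   % bars extend err above and below each point
xlabel('x');
ylabel('y');
```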

Question 28

Q

What is a box plot?

Answer

A

Shows the five-number summary of the data: min, Q1, median, Q3 and max.

Displays these summary statistics for each group of points.

boxplot()

Question 29

Q

What are line plots?

Answer

A

y = mx + b

Used when x affects y.

If you have more than one y value per x value, a scatter plot is more useful.

Otherwise, if there is just one y value per x value, use a line plot.

x is often time, mass or length.

plot()

Question 30

Q

What is a bar chart?

Answer

A

Good for categorical data.

Can be vertical or horizontal; vertical is easier, but horizontal is useful for longer names.

bar() for vertical

barh() for horizontal

bar3() for 3D visualisation if needed.

Can also be used to show accumulated totals.

Question 31

Q

What is a histogram?

Answer

A

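Shows the distribution of numerical data by dividing the data into bins.

histogram()

MATLAB will choose a reasonable bin size; however, you can vary it.

Don’t make the bins too wide or too narrow: wide bins can hide important details and narrow bins can look like a mess.

To set the bin width, do

histogram(data, 'BinWidth', <number>)

Passing the number on its own, histogram(data, <number>), sets the number of bins instead.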
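
A minimal sketch of both ways to control the bins, using hypothetical random data. The 'BinWidth' name-value pair sets the width, while a plain number as the second argument sets the bin count.

```matlab
rng(0);                  % fixed seed so the sketch is repeatable
data = randn(1000, 1);   % hypothetical data

h1 = histogram(data, 'BinWidth', 0.5);   % every bin is 0.5 wide

figure;                                  % new figure so the first plot is kept
h2 = histogram(data, 20);                % exactly 20 bins
```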

Question 32

Q

Density plot

Answer

A

Useful for physical and geographical data, e.g. rainfall maps.

Choose a sensible colour scale.

pcolor() or contour() or surf().

Question 33

Q

How do you get all the unique values in the data?

Answer

A

uniqueValues = unique(array) in MATLAB
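
A short example with hypothetical data. unique() returns each value once, sorted in ascending order.

```matlab
a = [3 1 2 3 1 5];   % hypothetical data with repeats

u = unique(a);       % [1 2 3 5]
disp(u)
```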

Question 34

Q

Vector plot

Answer

A

Plots vector quantities over an area, such as wind speed and direction, magnetic fields or fluid flow.

quiver()
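
A minimal sketch of quiver() on a grid. The vector field is hypothetical: a simple rotation about the origin, chosen only because it is easy to recognise in the plot.

```matlab
% Grid of positions.
[x, y] = meshgrid(-2:0.5:2, -2:0.5:2);

% Hypothetical field: each arrow points at right angles to its position.
u = -y;   % x-component of the vector at each grid point
v =  x;   % y-component of the vector at each grid point

q = quiver(x, y, u, v);
axis equal
```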

Question 35

Q

3D visualisation

Answer

A

mesh() or surf()

Useful for showing one dependent variable (z) against two independent variables (x and y).
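
As a sketch, surf() draws a surface of z values over a grid of x and y positions. The function used for z is hypothetical, picked only to give a smooth hill.

```matlab
% Grid of x and y positions.
[x, y] = meshgrid(-2:0.1:2, -2:0.1:2);

% Hypothetical height at each grid point.
z = exp(-(x.^2 + y.^2));

s = surf(x, y, z);   % mesh(x, y, z) gives a wireframe version
xlabel('x');
ylabel('y');
zlabel('z');
colorbar
```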

Question 36

Q

Why convert data to a mathematical representation?

Answer

A

To aid future visualisation of trends, summarise relationships, infer trends and make predictions, and model or confirm physical laws that predict behaviour.

Question 37

Q

What are the two common correlations of data?

Answer

A

Linear and Rank order

Question 38

Q

What is a linear correlation?

Answer

A

As the name says, the plotted data follow a roughly straight-line pattern.

To find it:

corr(x, y, 'Type', 'Pearson') or corr(x, y)

To skip NaN, add 'Rows', 'complete'.
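
A minimal sketch of the NaN handling, with hypothetical data. It assumes corr() is available (it is part of the Statistics and Machine Learning Toolbox) and uses column vectors, since corr() compares columns.

```matlab
% Hypothetical column vectors; the last x value is missing.
x = [1 2 3 4 5 NaN]';
y = [2.1 3.9 6.0 8.1 9.9 12]';

rAll      = corr(x, y);                       % NaN, because of the missing value
rComplete = corr(x, y, 'Rows', 'complete');   % uses only rows without NaN

fprintf('all rows: %g, complete rows: %.4f\n', rAll, rComplete);
```

The complete-rows result is close to 1 here because the remaining five points lie almost on a straight line.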

Question 39

Q

What is a rank order correlation (aka Spearman)?

Answer

A

When y always goes up as x goes up, but not by the same amount each time. This is called monotonic; the curve can bend or wiggle, as long as it is always increasing.

It is not monotonic when it goes up, then down, then up.

y consistently going down as x goes up is also monotonic; it gives a negative correlation.

corr(x, y, 'Type', 'Spearman')

To skip NaN, add 'Rows', 'complete'.
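
A short sketch comparing the two correlation types on a monotonic but non-linear relationship. The data are hypothetical, and corr() is assumed to be available from the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical data: y always increases with x, but not in a straight line.
x = (1:6)';
y = x.^3;

rPearson  = corr(x, y, 'Type', 'Pearson');    % below 1: not a straight line
rSpearman = corr(x, y, 'Type', 'Spearman');   % 1: the rank order matches exactly

fprintf('Pearson = %.4f, Spearman = %.4f\n', rPearson, rSpearman);
```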

Question 40

Q

How to label figures?

Answer

A

Create the plot first, then label it:

plot(x, y);

title("<put>");
xlabel("<put>");
ylabel("<put>");
grid(<put>);

Each <put> is your own text; grid takes "on" or "off".

Question 41

Q

How to avoid overfitting?

Answer

A

Use your judgement, and test and validate the model on separate test sets.

Add noise to the data and refit to see how much it influences the fitted model.

If any parameter comes out negative, assume it is 0.
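
One way to check a fit against a separate test set is sketched below. Everything here is an assumption made for illustration: the data are synthetic (a straight line plus noise), the models are polynomials fitted with polyfit(), and the degrees 1 and 6 are arbitrary choices.

```matlab
rng(1);   % fixed seed so the sketch is repeatable

% Hypothetical noisy data from a straight line, split into two sets.
xTrain = linspace(0, 1, 12)';
yTrain = 2*xTrain + 1 + 0.2*randn(size(xTrain));
xTest  = linspace(0.04, 0.96, 12)';
yTest  = 2*xTest + 1 + 0.2*randn(size(xTest));

degrees  = [1 6];                 % simple model vs flexible model
trainErr = zeros(size(degrees));
testErr  = zeros(size(degrees));

for k = 1:numel(degrees)
    p = polyfit(xTrain, yTrain, degrees(k));                      % fit on training data only
    trainErr(k) = sqrt(mean((polyval(p, xTrain) - yTrain).^2));   % RMS error on seen data
    testErr(k)  = sqrt(mean((polyval(p, xTest)  - yTest).^2));    % RMS error on unseen data
end

disp([degrees; trainErr; testErr])
```

The flexible model will fit the training data at least as closely as the simple one. If its test error is clearly worse, that is a sign it is fitting the noise rather than the trend.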

Question 42

Q

What are some outlier ethical considerations?

Answer

A

DO NOT REMOVE THEM (unless they come from a collection error). Removing genuine outliers is a serious ethical violation.
