DSE Flashcards

1
Q

What is noise?

A

Random variation that may make the data unreliable. It often comes from uncontrolled factors, usually known as confounding variables.

Noise stops you from getting perfectly accurate data; thus, everything you measure is a sample and not the ideal or final result.

However, you can cut as much noise as possible by repeating the measurement in different conditions and testing with more random data.

Repeating measurements so that the random variations average out lets you achieve a particular accuracy.
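
A minimal MATLAB sketch of the idea, assuming a made-up true value and noise level (trueValue, noiseStd and the repeat counts are hypothetical, chosen only for illustration). It compares the spread of single readings with the spread of averaged readings:

```matlab
rng(0);                      % fix the seed so the run is repeatable
trueValue = 10;              % hypothetical "ideal" value being measured
noiseStd  = 2;               % hypothetical noise level
nRepeats  = 100;             % readings averaged per experiment
nTrials   = 1000;            % number of experiments

% One noisy reading per experiment
singleReading = trueValue + noiseStd * randn(nTrials, 1);

% Mean of nRepeats noisy readings per experiment
averagedReading = mean(trueValue + noiseStd * randn(nTrials, nRepeats), 2);

spreadSingle   = std(singleReading);     % roughly noiseStd
spreadAveraged = std(averagedReading);   % roughly noiseStd / sqrt(nRepeats)
fprintf('single: %.2f, averaged: %.2f\n', spreadSingle, spreadAveraged);
```

The averaged readings scatter far less around the true value, which is why repeating measurements helps.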

2
Q

What are the 3 main components of summary statistics?

A

Central Tendency

Dispersion

Skewness and Kurtosis

3
Q

What is Central Tendency?

A

Think “what is the centre of the distribution?” It is measured by the mean, median and mode.
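
As a quick illustration, the three measures can be computed in MATLAB with mean(), median() and mode(). The data below are made up for the example:

```matlab
x = [2 3 3 5 7 10];   % hypothetical data

m  = mean(x);     % arithmetic average: 30 / 6 = 5
md = median(x);   % middle of the sorted data: (3 + 5) / 2 = 4
mo = mode(x);     % most frequent value: 3

fprintf('mean = %g, median = %g, mode = %g\n', m, md, mo);
```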

4
Q

What is Dispersion?

A

Think “how widely spread is my distribution about the centre?” Clustered or spread out?

5
Q

What is Skewness?

A

Think “is the distribution lopsided?” Skewness describes whether the bulk of the data (the mode) leans towards the low or high side of the x-axis, leaving a longer tail on the other side.

6
Q

What are the problems with means?

A

Heavily influenced by extreme values. A single very large value (an outlier) can pull the mean away from the bulk of the data and make it unreliable.

The median, by contrast, is not heavily influenced by outliers.
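
A small sketch with made-up numbers shows how one outlier moves the mean far more than the median:

```matlab
x    = [4 5 5 6 7];   % hypothetical data without an outlier
xOut = [x 100];       % same data plus one outlier

meanClean   = mean(x);        % 5.4
meanOutlier = mean(xOut);     % about 21.17, dragged up by the outlier

medianClean   = median(x);    % 5
medianOutlier = median(xOut); % 5.5, barely moves

fprintf('mean:   %.2f -> %.2f\n', meanClean, meanOutlier);
fprintf('median: %.2f -> %.2f\n', medianClean, medianOutlier);
```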

7
Q

What are the problems with the median?

A

It uses only the middle value of the sorted data, disregarding the size of all other values.

8
Q

What are the problems with the mode?

A

Some datasets are multimodal, meaning they have more than one mode.

It works best for discrete datasets rather than continuous ones, since continuous values rarely repeat exactly.

9
Q

Relationship between mean, median, and mode in a symmetrical dataset.

A

All in the middle: the mean, median and mode coincide (for a single-peaked distribution).

10
Q

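Relation between mean, median, and mode in a positively skewed dataset (skewed to the right, with a long tail of high values).

A

Mode < Median < Mean. The long tail of high values pulls the mean up the most.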

11
Q

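Relation between mean, median, and mode in a negatively skewed dataset (skewed to the left, with a long tail of low values).

A

Mean < Median < Mode. The long tail of low values pulls the mean down the most.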
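
The two orderings can be checked on a small made-up dataset. xPos below has a long tail of high values (positive skew); xNeg is its mirror image, with a long tail of low values (negative skew). Both datasets are hypothetical:

```matlab
xPos = [1 2 2 2 3 3 4 5 9 15];   % long tail of high values (positive skew)
xNeg = 16 - xPos;                % mirror image: long tail of low values

% Positive skew: mode (2) < median (3) < mean (4.6)
posOrder = [mode(xPos) median(xPos) mean(xPos)];

% Negative skew: mean (11.4) < median (13) < mode (14)
negOrder = [mean(xNeg) median(xNeg) mode(xNeg)];

disp(posOrder);
disp(negOrder);
```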

12
Q

What do you have to do to find the Standard Error?

A

std / sqrt(n), where n is the number of data points in the sample.
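
A short MATLAB sketch of this formula, using made-up data:

```matlab
x = [4 8 6 5 7];            % hypothetical sample

n   = numel(x);             % number of data points
s   = std(x);               % sample standard deviation (n-1 in the denominator)
sem = s / sqrt(n);          % standard error of the mean

fprintf('std = %.4f, standard error = %.4f\n', s, sem);
```

For this sample the variance is 2.5, so the standard error is sqrt(2.5 / 5), about 0.7071.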

13
Q

What are Quartiles?

A

Three values, Q1, Q2 and Q3, that split the sorted data into four equal parts.

14
Q

What is Q1?

A

The first (lower) quartile: the 25th percentile of the values.

It splits off the lowest 25% of data from the highest 75%.

15
Q

What is Q2?

A

The median. It cuts the data in half.

16
Q

What is Q3?

A

The third (upper) quartile: the 75th percentile of the values.

It splits off the highest 25% of data from the lowest 75%.

17
Q

What is IQR?

A

Interquartile Range.

It is the distance between Q1 and Q3 (IQR = Q3 - Q1), which covers the middle 50% of the data.

18
Q

What is kurtosis?

A

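How heavy the “tails” of the distribution are compared with a normal distribution. High kurtosis means more of the data sits in extreme values far from the centre.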

19
Q

What is the five-number summary of a dataset?

A

Min
Q1
Median
Q3
Max
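
A possible way to compute these five numbers in MATLAB, using made-up data. This sketch assumes quantile() is available (in some MATLAB releases it ships with the Statistics and Machine Learning Toolbox). Different tools use slightly different quartile conventions, so Q1 and Q3 may differ a little from a hand calculation:

```matlab
x = [7 1 3 9 5 11 13];               % hypothetical data

q = quantile(x, [0.25 0.50 0.75]);   % Q1, median (Q2), Q3

fiveNum = [min(x) q(1) q(2) q(3) max(x)];   % min, Q1, median, Q3, max
iqrValue = q(3) - q(1);                     % interquartile range, Q3 - Q1

disp(fiveNum);
fprintf('IQR = %g\n', iqrValue);
```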

20
Q

What is standard deviation?

A

A measure of spread: how spread out or clustered the data is around the mean. It measures the consistency and uncertainty of the data.

Think of it as the typical distance between a data point and the mean.
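
To make the idea concrete, here is a sketch that computes the sample standard deviation by hand and compares it with MATLAB's std(). The data are made up:

```matlab
x = [4 8 6 5 7];                     % hypothetical sample

mu        = mean(x);                 % centre of the data
deviation = x - mu;                  % distance of each point from the mean
n         = numel(x);

% Square the distances, average them (dividing by n-1), then take the root
sManual  = sqrt(sum(deviation.^2) / (n - 1));
sBuiltin = std(x);                   % same calculation, built in

fprintf('manual = %.4f, std() = %.4f\n', sManual, sBuiltin);
```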

21
Q

What does a small or large std mean?

A

A small std means the data points are close to the mean.
A large std means the data points are far from the mean.

22
Q

Why should you use n-1 if you are researching a sample dataset?

A

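A sample tends to underestimate the spread or variance of the true population: it may miss extreme values or outliers, and the deviations are measured from the sample's own mean, which sits closer to the sample points than the true mean does. Funnily enough, dividing by n-1 instead of n increases the std to compensate.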

23
Q

Why do you use n-1 and not n+1?

A

Because we need a slightly larger std to compensate for the underestimate, we subtract 1 from n.

In the std formula, n is in the denominator, and the bigger the denominator, the smaller the answer. We need a larger std, so the denominator must get smaller (n-1), not bigger (n+1).

Remember, outliers lie far from the mean, and a large std means the data are spread far from the mean.
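
In MATLAB, the second argument of std() selects the denominator: std(x, 0) divides by n-1 (the default) and std(x, 1) divides by n. A small sketch with made-up data shows that n-1 gives the larger value:

```matlab
x = [2 4 4 4 5 5 7 9];    % hypothetical sample, mean = 5

sSample     = std(x, 0);  % divides by n-1: sqrt(32 / 7), about 2.1381
sPopulation = std(x, 1);  % divides by n:   sqrt(32 / 8) = 2

fprintf('n-1: %.4f, n: %.4f\n', sSample, sPopulation);
```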

24
Q

Why not -2 or xyz?

A

What if the dataset has only 2 data points? With n-2 you would divide by zero and get a math error.

-2 would overcorrect: it overcompensates for the bias and overestimates the spread.

It may result in an answer that is too large.

-1 does not give the true value, but it is the best estimate: the Goldilocks solution for the std.

25
Q

What are the general principles of data visualisation?

A

Keep it simple

No unnecessary decor, variations, colours or 3D imaging.

Colour appropriate for audience.

Prefer 2D over 3D wherever possible.

26
Q

What is a scatter plot used for?

A

Used to display the relationship between two or more sampled variables.

plot(), or plot3() for a 3D scatter plot.

Use LineSpec (for example, a marker such as 'o') to show the points clearly.
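
A minimal sketch of a scatter plot drawn with plot() and a marker-only LineSpec. The x and y values are made up:

```matlab
x = [1 2 3 4 5 6];
y = [2.1 3.9 6.2 7.8 10.1 12.2];   % hypothetical measurements

% 'o' is a LineSpec with a marker but no line style,
% so the points are drawn as circles and not joined up
h = plot(x, y, 'o');
xlabel('x');
ylabel('y');
```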

27
Q

What is an error bar for?

A

Represents uncertainty or variance in measured values.

May be the standard deviation, SEM, or a specified confidence interval.

Can also be used on x-values if there is a sampling variation.

errorbar()
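
A short sketch of errorbar() with made-up values. The err vector is a hypothetical uncertainty for each point (it could be a standard deviation or an SEM):

```matlab
x   = 1:4;
y   = [2.0 4.1 5.9 8.2];   % hypothetical mean values
err = [0.3 0.2 0.4 0.3];   % hypothetical uncertainty of each value

% Draw each point as a circle with a vertical bar of length err above and below
e = errorbar(x, y, err, 'o');
xlabel('x');
ylabel('y');
```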

28
Q

What is a box plot?

A

Shows the five-number summary of the data: min, Q1, median, Q3, max.

Shows these summary statistics for each group of data.

boxplot()

29
Q

What are line plots?

A

y = mx + b

When x affects y.

If you have more than one y value per x value, a scatter plot is more useful.

Otherwise, if there is just one y value per x value, use a line plot.

x is often time, mass or length.

plot()

30
Q

What is a bar chart?

A

Good for categorical data.

Can be vertical or horizontal; vertical is easier, but horizontal is useful for longer names.

bar() for vertical

barh() for horizontal

bar3() for 3D visualisation if needed.

Can be used for accumulation.

31
Q

What is a histogram?

A

Distribution of numerical data, divide data into bins.

histogram()

MATLAB will choose a reasonable bin size; however, you can vary it.

Don’t make the bins too wide or too narrow: too wide can hide important details, and too narrow can look like a mess.

To edit BinWidth, do

histogram(data, 'BinWidth', <number>)

(histogram(data, <number>) sets the number of bins instead.)
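
A small sketch of setting the bin width, using made-up data. histcounts() is used alongside histogram() only to inspect the bins numerically; it accepts the same 'BinWidth' option:

```matlab
data = [1 2 2 3 3 3 4 4 5 9];   % hypothetical data

% Count the values per bin, with every bin 2 units wide
[counts, edges] = histcounts(data, 'BinWidth', 2);

% Draw the matching histogram
histogram(data, 'BinWidth', 2);
xlabel('value');
ylabel('count');
```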

32
Q

Density plot

A

Useful for physical and geographical data, e.g. rainfall maps.

Choose a sensible colour scale.

pcolor() or contour() or surf().

33
Q

How do you get all unique values?

A

variable = unique(array) in MATLAB
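
For example, with a made-up array:

```matlab
array = [3 1 2 3 1];          % hypothetical data with repeats

variable = unique(array);     % returns [1 2 3]: repeats removed, sorted ascending

disp(variable);
```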

34
Q

Vector plot

A

Plots vector quantities over an area, such as wind speed and direction, magnetic fields or fluid flow.

quiver()
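
A minimal quiver() sketch. The vector field below (a simple rotation about the origin) is made up purely for illustration:

```matlab
% Grid of positions
[x, y] = meshgrid(-2:1:2, -2:1:2);

% Hypothetical vector field: each arrow points around the origin
u = -y;   % x-component of the vector at each grid point
v =  x;   % y-component of the vector at each grid point

q = quiver(x, y, u, v);
axis equal;
xlabel('x');
ylabel('y');
```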

35
Q

3D visualisation

A

mesh() or surf()

Useful for showing one variable (z) as a function of two independent variables (x and y).
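
A short surf() sketch, drawing a made-up function z over a grid of x and y values. Swapping surf for mesh gives a wireframe version:

```matlab
% Grid of x and y values
[x, y] = meshgrid(-2:0.2:2, -2:0.2:2);

% Hypothetical surface height at each grid point
z = x .* exp(-x.^2 - y.^2);

s = surf(x, y, z);   % mesh(x, y, z) draws the same data as a wireframe
xlabel('x');
ylabel('y');
zlabel('z');
colorbar;
```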

36
Q

Why convert data to a mathematical representation?

A

To aid visualisation of trends, summarise relationships, infer trends and make predictions, and model or confirm physical laws that predict behaviour.

37
Q

What are the two common correlations of data?

A

Linear and Rank order

38
Q

What is a linear correlation?

A

As the name says, the plotted data follow a straight-line pattern: y changes in proportion to x.

To find it

corr(x, y, 'Type', 'Pearson') or corr(x, y)

To skip NaN, add 'Rows', 'complete'.
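
A sketch with made-up, perfectly linear data. corr() is part of the Statistics and Machine Learning Toolbox and correlates columns, so x and y are column vectors here. The 'Rows', 'complete' option drops any row that contains a NaN:

```matlab
x = (1:10)';          % column vector
y = 3 * x + 2;        % hypothetical, perfectly linear in x

r = corr(x, y);       % Pearson is the default type; r = 1 here

yMissing    = y;
yMissing(4) = NaN;    % simulate one missing measurement

rNaN      = corr(x, yMissing);                      % NaN: the missing value spoils it
rComplete = corr(x, yMissing, 'Rows', 'complete');  % ignores the row with NaN

fprintf('r = %.2f, with NaN = %.2f, complete rows = %.2f\n', r, rNaN, rComplete);
```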

39
Q

What is a rank-order (aka Spearman) correlation?

A

When y always goes up when x goes up, but not necessarily by the same amount each time. You call this monotonic; the curve can be wavy, but it is always increasing.

It is not monotonic when it goes up, then down, then up.

y always going down as x goes up also counts as monotonic (a negative correlation).

corr(x, y, 'Type', 'Spearman')

To skip NaN, add 'Rows', 'complete'.
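
A sketch comparing the two correlation types on made-up data that always increases but not in a straight line. As with Pearson, corr() comes from the Statistics and Machine Learning Toolbox and expects column vectors:

```matlab
x = (1:10)';
y = exp(x);                                % hypothetical: always rising, far from linear

rSpearman = corr(x, y, 'Type', 'Spearman'); % 1: the ranks of x and y agree perfectly
rPearson  = corr(x, y, 'Type', 'Pearson');  % below 1: the relationship is not a line

fprintf('Spearman = %.3f, Pearson = %.3f\n', rSpearman, rPearson);
```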

40
Q

How to label figures?

A

plot(x, y)

title("<put>");
xlabel("<put>");
ylabel("<put>");
grid(<put>);   % "on" or "off"

41
Q

How to avoid overfitting?

A

Use your judgement, and test the model against separate validation/test sets.

Add noise to the data and refit to see how much it changes the fitted model.

If any parameter comes out negative, assume it is 0.
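
One way to sketch the validation idea in MATLAB, under made-up assumptions: the "true" relationship f, the noise level and the polynomial degrees are all hypothetical. A simple and a flexible polynomial are fitted to training data, then both are scored on held-out data:

```matlab
rng(1);                                   % repeatable noise
f = @(x) 2*x + 1;                         % hypothetical true relationship

xTrain = linspace(-1, 1, 12)';            % points used for fitting
xVal   = linspace(-0.95, 0.95, 12)';      % held-out points used only for checking
yTrain = f(xTrain) + 0.2 * randn(size(xTrain));
yVal   = f(xVal)   + 0.2 * randn(size(xVal));

pLow  = polyfit(xTrain, yTrain, 1);       % simple model (straight line)
pHigh = polyfit(xTrain, yTrain, 8);       % flexible model, can chase the noise

% Root-mean-square error of a fitted polynomial p on data (x, y)
rmse = @(p, x, y) sqrt(mean((polyval(p, x) - y).^2));

fprintf('degree 1: train %.3f, validation %.3f\n', ...
    rmse(pLow, xTrain, yTrain), rmse(pLow, xVal, yVal));
fprintf('degree 8: train %.3f, validation %.3f\n', ...
    rmse(pHigh, xTrain, yTrain), rmse(pHigh, xVal, yVal));
```

The flexible model always fits the training data at least as well. If its validation error is clearly worse than the simple model's, that is a sign of overfitting.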

42
Q

What are some outlier ethical considerations?

A

DO NOT REMOVE THEM (unless they come from a collection error). Removing genuine outliers is a serious ethical violation.