Chapter 3 - Exploring Data Flashcards

1
Q

What is data exploration?

A

A preliminary investigation of the data in order to better understand its specific characteristics. It helps with selecting appropriate preprocessing and data analysis techniques.

2
Q

What are summary statistics?

A

Quantities such as the mean and standard deviation that capture various characteristics of a potentially large set of values with a single number or a small set of numbers.

3
Q

Define frequency of a data set

A

frequency(v_i) = (number of objects with attribute value v_i) / (total number of objects). E.g., for the values 1 2 3 1 1, the frequency of 1 is 3/5, or 0.6.
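
A minimal Python sketch of this definition, using the five values from the card. The function name `frequencies` is an illustrative choice, not something from the source.

```python
from collections import Counter

def frequencies(values):
    """Return {value: relative frequency} for a list of attribute values."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}

freq = frequencies([1, 2, 3, 1, 1])
print(freq[1])  # 3 of the 5 values are 1
```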

4
Q

Define mode of a data set

A

The mode of a categorical attribute is the value that appears most frequently (has the highest frequency).
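
As a rough illustration, the mode can be found with Python's standard `statistics.multimode`, which returns every value tied for the highest count. The color data is made up for this example.

```python
from statistics import multimode

# Made-up categorical attribute values
colors = ["red", "blue", "red", "green", "blue", "red"]

modes = multimode(colors)  # list of the most frequent value(s)
print(modes)
```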

5
Q

What is a percentile?

A

The value below which a given percentage of the data falls. E.g., if 80% of people are shorter than you, you are at the 80th percentile.
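
The height example can be sketched as follows. This assumes the simple "percentage of values strictly below x" reading of the card; statistical libraries use several slightly different percentile conventions. The name `percentile_rank` and the heights are illustrative.

```python
def percentile_rank(values, x):
    """Percentage of values that are strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Made-up heights in cm
heights = [150, 155, 160, 165, 170, 175, 180, 185, 190, 195]

rank = percentile_rank(heights, 190)  # 8 of the 10 heights are below 190
print(rank)
```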

6
Q

What do the mean and median measure?

A

The location of a set of values

7
Q

Define mean

A

The mean is the average of a set of numbers

8
Q

How is the mean of a set of numbers computed?

A

Add the values of the numbers and divide by the total number of numbers in the set

9
Q

What is the median?

A

The median is the middle number in a set of values when they are arranged in order (highest to lowest, or lowest to highest).

If the set has an even number of values, then the median is the average of the middle two numbers.
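
A short sketch of the mean and of both median cases, using Python's standard `statistics` module on made-up numbers:

```python
from statistics import mean, median

odd = [7, 1, 5]       # sorted: 1, 5, 7
even = [7, 1, 5, 3]   # sorted: 1, 3, 5, 7

print(mean(odd))      # (7 + 1 + 5) / 3
print(median(odd))    # middle value of the sorted list
print(median(even))   # average of the two middle values, 3 and 5
```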

10
Q

What is a trimmed mean?

A

A mean computed after a percentage of the values at the beginning and end of the sorted data is thrown out.

This is done because the mean is sensitive to outliers.
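
A minimal sketch of a trimmed mean. It assumes `p` is the fraction dropped from each end of the sorted data; some definitions instead split a total percentage between the two ends. The function name and data are illustrative.

```python
def trimmed_mean(values, p):
    """Mean after dropping a fraction p (0 <= p < 0.5) from each end."""
    data = sorted(values)
    k = int(len(data) * p)          # number of values dropped per end
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 5, 90]          # 90 is an outlier

plain = sum(data) / len(data)       # pulled upward by the outlier
trimmed = trimmed_mean(data, 0.2)   # drops 1 and 90, averages 2..5
print(plain, trimmed)
```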

11
Q

If the distribution of values in a data set is skewed, is the mean a good indicator of the middle of the set of values?

A

No. In this case, the median is a better indicator of the middle.

12
Q

What do range and variance measure in a data set?

A

The spread of a set of values.

They indicate whether values are widely spread out or relatively concentrated around a single point.

13
Q

How is the range of a data set calculated?

A

The largest value minus the smallest value.

14
Q

Why is variance a preferred measure of spread over range?

A

The range can be misleading if most of the values are concentrated in a narrow band of values, but there are also a relatively small number of more extreme values.

15
Q

What symbol is used for variance?

A

$s_x^2$

16
Q

What is variance?

A

The average of the squared differences from the mean.

17
Q

How is variance calculated?

A

To calculate the variance follow these steps:

Work out the Mean (the simple average of the numbers)

Then for each number: subtract the Mean and square the result (the squared difference).

Then work out the average of those squared differences.
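
The three steps above can be sketched directly in Python. This follows the card and divides by the number of values (the population variance); the sample variance divides by one less than that, which is what `statistics.variance` computes. The data is made up.

```python
def variance(values):
    """Average of the squared differences from the mean."""
    mean = sum(values) / len(values)                  # step 1: the mean
    squared_diffs = [(v - mean) ** 2 for v in values] # step 2: squared differences
    return sum(squared_diffs) / len(squared_diffs)    # step 3: their average

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5
result = variance(data)           # squared differences sum to 32; 32 / 8
print(result)
```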

18
Q

What is the standard deviation?

A

A measure of how spread out the data is.

Its symbol is σ (the Greek letter sigma).

The formula is the square root of the variance.
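
A brief sketch of that relationship, using the standard library's population variance (`pvariance`) and population standard deviation (`pstdev`) on made-up data:

```python
import math
from statistics import pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values with mean 5

var = pvariance(data)             # average squared difference from the mean
std_dev = math.sqrt(var)          # standard deviation = square root of variance
print(std_dev, pstdev(data))      # both routes give the same number
```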

19
Q

What can be the issue with using standard deviation on data?

A

The mean can be distorted by outliers, and since the variance is computed using the mean, it is also sensitive to outliers.

20
Q

What is multivariate data?

A

Data that consists of several attributes.

21
Q

How can measures of location be obtained for data that consists of several attributes (multivariate data)?

A

Compute the mean or median separately for each attribute.

22
Q

What is the covariance of two attributes?

A

A measure of the degree to which two attributes vary together. Its value depends on the magnitudes of the variables.

A value of 0 indicates that two attributes do not have a (linear) relationship.

23
Q

Is it possible to judge the degree of relationship between two variables by looking only at the value of the covariance?

A

No. Because covariance depends on the magnitudes of the variables, correlation is preferred to covariance.
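
A small sketch of why covariance alone is hard to interpret. Rescaling one attribute changes the covariance but leaves the correlation untouched. It assumes the sample covariance (dividing by n - 1) and Pearson correlation; the helper names and data are illustrative.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
y_scaled = [v * 100 for v in y]   # same attribute in different units

print(covariance(x, y), covariance(x, y_scaled))    # grows with the units
print(correlation(x, y), correlation(x, y_scaled))  # unchanged by the units
```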

24
Q

What does the skewness of a set of values measure?

A

The degree to which the values are symmetrically distributed around the mean.
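
One common way to put a number on this is the third central moment divided by the variance raised to the power 1.5. The sketch below assumes that population form; other skewness estimators exist. The data is made up.

```python
def skewness(values):
    """Third central moment divided by variance ** 1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]          # mirror image around the mean
right_tail = [1, 1, 2, 2, 3, 10]     # one large value stretches the right side

print(skewness(symmetric))    # zero for symmetric data
print(skewness(right_tail))   # positive for a long right tail
```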

25
Q

What is visual data mining?

A

The use of visualization techniques in data mining

26
Q

What is selection (in visualization)?

A

The elimination or de-emphasis of certain objects and attributes.

It is needed because there is no satisfactory general approach to representing data with many attributes (graphs have only 2 or 3 dimensions).

27
Q

What are stem and leaf plots useful for?

A

They provide insight into the distribution of one-dimensional integer or continuous data.

28
Q

What is the following type of visual data representation?

A

Stem and leaf plot

29
Q

What is a histogram?

A

A plot that displays the distribution of values for an attribute by dividing the possible values into bins and showing the number of objects that fall into each bin.
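
The binning step behind a histogram can be sketched as below. This assumes equal-width bins and that every value lies between `lo` and `hi`; the function name and data are illustrative.

```python
def histogram(values, lo, hi, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The top edge (v == hi) is placed in the last bin
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    return counts

values = [1, 2, 2, 3, 5, 6, 7, 9, 10]
counts = histogram(values, lo=0, hi=10, bins=5)   # bins of width 2
print(counts)
```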

30
Q

What is a Pareto histogram?

A

A normal histogram, except that categories are sorted by count so that the count is decreasing from left to right.
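
A rough sketch of the ordering a Pareto histogram uses, based on `Counter.most_common` from the standard library. The defect categories are made up.

```python
from collections import Counter

# Made-up categorical observations
defects = ["scratch", "dent", "scratch", "crack", "scratch", "dent"]

# most_common() returns (category, count) pairs sorted by decreasing count
pareto_order = Counter(defects).most_common()
print(pareto_order)
```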

31
Q

What are box plots?

A

Another method for showing the distribution of the values of a single numerical attribute.

32
Q

What is the following visual representation of data?

A

scatter plot

33
Q

What is the following visual representation of data?

A

Contour plot

34
Q

What is the following visual representation of data?

A

surface plot

35
Q

What type of visual representation is the following?

A

vector field plot

a characteristic may have both a magnitude and direction

36
Q

What type of visual representation is the following?

A

Data matrix

An image can be regarded as a rectangular array of pixels.

Each pixel is characterized by its color and brightness.

37
Q

What type of visual representation of data is the following?

A

parallel coordinates plot

There is one coordinate axis for each attribute, but the different axes are parallel to one another instead of perpendicular.

Each object is represented as a line instead of a point.

The value of each attribute of an object is mapped to a point on the coordinate axis associated with that attribute.

38
Q

What type of visual representation is the following?

A

Chernoff face

each attribute is associated with a specific feature of a face

each attribute value is used to determine the way that the facial feature is expressed

39
Q

Define data cube

A

a multidimensional representation of the data, together with all possible totals (aggregates)

40
Q

Define cross tabulation

A

Cross-tabulation is about taking two variables and tabulating the results of one variable against the other variable.

An example would be the cross-tabulation of course performance against mode of study.
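
A minimal sketch of such a cross-tabulation. The student records are invented for illustration; each cell counts how many records share a given (mode of study, performance) pair.

```python
from collections import Counter

# Invented records: (mode of study, course performance)
records = [
    ("full-time", "pass"), ("full-time", "pass"), ("full-time", "fail"),
    ("part-time", "pass"), ("part-time", "fail"), ("part-time", "fail"),
]

table = Counter(records)   # cell counts keyed by (mode, performance)

for mode in ("full-time", "part-time"):
    print(mode, table[(mode, "pass")], table[(mode, "fail")])
```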

41
Q

Define pivoting

A

Refers to aggregating over all dimensions except two.

The result is a two-dimensional cross-tabulation with the two specified dimensions as the only remaining dimensions.
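
Pivoting can be sketched on a tiny three-dimensional array stored as a dictionary. The sales figures, dimension order (product, location, month), and the `pivot` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, location, month) -> amount
sales = {
    ("pen", "north", "jan"): 10,
    ("pen", "north", "feb"): 5,
    ("pen", "south", "jan"): 7,
    ("ink", "north", "jan"): 3,
    ("ink", "south", "feb"): 4,
}

def pivot(cube, keep):
    """Sum over every dimension except the two index positions in `keep`."""
    i, j = keep
    table = defaultdict(int)
    for key, amount in cube.items():
        table[(key[i], key[j])] += amount
    return dict(table)

# Aggregate away the month, leaving a product-by-location table
by_product_location = pivot(sales, keep=(0, 1))
print(by_product_location)
```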

42
Q

Define slicing and dicing

A

slicing: selecting a group of cells from an entire multidimensional array by specifying specific values for one or more dimensions.
dicing: selecting a subset of cells by specifying a range of attribute values (this is the equivalent of defining a subarray from the complete array).
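
Both operations can be sketched as filters over a small array stored as a dictionary. The data, the dimension order (product, location, month number), and the helper names are hypothetical.

```python
# Hypothetical sales facts: (product, location, month number) -> amount
sales = {
    ("pen", "north", 1): 10,
    ("pen", "north", 2): 5,
    ("pen", "south", 1): 7,
    ("ink", "north", 3): 3,
    ("ink", "south", 2): 4,
}

def slice_cube(cube, dim, value):
    """Slice: keep cells whose dimension `dim` equals one specific value."""
    return {k: v for k, v in cube.items() if k[dim] == value}

def dice_cube(cube, allowed):
    """Dice: keep cells whose dimensions fall in the given ranges or sets.

    `allowed` maps a dimension index to a container of permitted values.
    """
    return {k: v for k, v in cube.items()
            if all(k[d] in vals for d, vals in allowed.items())}

january = slice_cube(sales, dim=2, value=1)                 # month == 1
pens_q = dice_cube(sales, {0: {"pen"}, 2: range(1, 3)})     # pens, months 1-2
print(january)
print(pens_q)
```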

43
Q

Define “roll up” given a multi-dimensional array of sales data

A

aggregating the sales across all the dates in a month

44
Q

Define “drill down” given a multi-dimensional array of sales data

A

splitting monthly sales totals into daily sales totals
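
Roll up and drill down can be sketched as a pair of operations on daily sales. The dates and amounts are hypothetical, and dates are assumed to be ISO strings so that the first seven characters identify the month.

```python
from collections import defaultdict

# Hypothetical daily sales totals keyed by ISO date
daily = {"2024-01-05": 100, "2024-01-20": 50, "2024-02-03": 70}

def roll_up(daily_sales):
    """Aggregate daily totals into monthly totals."""
    monthly = defaultdict(int)
    for date, amount in daily_sales.items():
        monthly[date[:7]] += amount   # "YYYY-MM" prefix is the month
    return dict(monthly)

def drill_down(daily_sales, month):
    """Split one month's total back into its daily totals."""
    return {d: a for d, a in daily_sales.items() if d.startswith(month)}

monthly = roll_up(daily)
january_days = drill_down(daily, "2024-01")
print(monthly)
print(january_days)
```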