Chapters 3 - Exploring Data Flashcards by Cameron Chambers

What is data exploration?

A preliminary investigation of the data in order to better understand its specific characteristics. Helps with selecting appropriate preprocessing and data analysis techniques.

What are summary statistics?

Quantities such as the mean and standard deviation that capture various characteristics of a potentially large set of values with a single number or a small set of numbers.

Define frequency of a data set

frequency(v_i) = (number of objects with attribute value v_i) / (total number of objects). E.g., for the values 1, 2, 3, 1, 1, the frequency of 1 is 3/5, or 0.6.
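
As a minimal sketch of this definition (the function name `frequencies` is made up for illustration), relative frequencies can be computed by counting each value and dividing by the number of objects:

```python
from collections import Counter

def frequencies(values):
    """Return a dict mapping each distinct value to its relative frequency."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

freq = frequencies([1, 2, 3, 1, 1])
print(freq[1])  # 0.6 (three of the five values are 1)
```

The frequencies of all distinct values always sum to 1.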

Define mode of a data set

The mode of a categorical attribute is the value that appears most frequently (has the highest frequency).
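
A small illustration, assuming a made-up list of color labels: the mode is simply the value with the highest count. The helper name `mode` is hypothetical, and ties are resolved in favor of the value seen first.

```python
from collections import Counter

def mode(values):
    """Return the most frequent value (on a tie, the one encountered first)."""
    return Counter(values).most_common(1)[0][0]

colors = ["red", "blue", "red", "green", "red", "blue"]
print(mode(colors))  # red
```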

What is a percentile?

The value below which a given percentage of the data falls. E.g., if 80% of people are shorter than you, you are at the 80th percentile.
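
The height example can be sketched as follows. The data and the helper `percentile_rank` are hypothetical, and the sketch assumes "below" means strictly less than; other conventions exist.

```python
def percentile_rank(values, x):
    """Percentage of values that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

heights_cm = [150, 155, 160, 165, 170, 172, 175, 178, 180, 190]
print(percentile_rank(heights_cm, 180))  # 80.0
```

Eight of the ten heights are below 180 cm, so a person of that height is at the 80th percentile of this sample.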

What do the mean and median measure?

The location of a set of values

Define mean

The mean is the average of a set of numbers

How is the mean of a set of numbers computed?

Add the values of the numbers and divide by the total number of numbers in the set

What is the median?

The median is the middle number in a set of values when they are arranged in order (e.g., from highest to lowest).

If the set has an even number of values, then the median is the average of the middle two numbers.
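
Both cases (odd and even number of values) can be sketched in a few lines. This is an illustrative implementation, not the only way to do it; Python's built-in `statistics.median` behaves the same way.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 5]))     # 5
print(median([7, 1, 5, 3]))  # 4.0 (average of 3 and 5)
```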

What is a trimmed mean?

A mean computed after a percentage of the values at the beginning and end of the sorted data is thrown out.

This is done because the mean is sensitive to outliers.
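
A rough sketch of the idea, assuming the given percentage is dropped from each end of the sorted data (some definitions split the percentage between the two ends instead). The function name and data are made up, and the percentage must be below 50 so that some values remain.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent` % of the sorted values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)  # number of values dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(sum(data) / len(data))   # 14.5 (ordinary mean, pulled up by the outlier 100)
print(trimmed_mean(data, 10))  # 5.5  (1 and 100 are discarded)
```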

If the distribution of values in a data set is skewed, is the mean a good indicator of the middle of the set of values?

No. In this case, the median is a better indicator of the middle.

What do range and variance measure in a data set?

The spread of a set of values.

They indicate whether the values are widely spread out or relatively concentrated around a single point.

How is the range of a data set calculated?

Subtract the smallest value from the largest value.

Why is variance a preferred measure of spread over range?

The range can be misleading if most of the values are concentrated in a narrow band, but there are also a relatively small number of more extreme values.
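
To illustrate, here is a sketch with two made-up data sets that have the same range. In the first, nearly all values sit at 50 with two extremes; in the second, the values are spread evenly. The variance (computed here with the standard library's `statistics.pvariance`, which divides by the number of values) tells them apart, while the range does not.

```python
from statistics import pvariance

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

concentrated = [0, 50, 50, 50, 50, 50, 50, 50, 50, 100]
spread_out = [0, 10, 20, 30, 40, 60, 70, 80, 90, 100]

print(value_range(concentrated), value_range(spread_out))  # 100 100
print(pvariance(concentrated) < pvariance(spread_out))     # True
```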

What symbol is used for variance?

s²_x

What is variance?

The average of the squared differences from the mean.

How is variance calculated?

To calculate the variance follow these steps:

Work out the Mean (the simple average of the numbers)

Then for each number: subtract the Mean and square the result (the squared difference).

Then work out the average of those squared differences.
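
The steps above can be sketched directly in code. Note that this follows the card's description and divides by the number of values; the sample variance used in many textbooks divides by one less than the number of values instead. The data set is made up.

```python
def variance(values):
    # Step 1: work out the mean.
    mean = sum(values) / len(values)
    # Step 2: for each number, subtract the mean and square the result.
    squared_diffs = [(v - mean) ** 2 for v in values]
    # Step 3: average the squared differences.
    return sum(squared_diffs) / len(squared_diffs)

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data))  # 4.0 (mean is 5, squared differences sum to 32)
```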

What is the standard deviation?

a measure of how spread out the data is

Its symbol is σ (the Greek letter sigma)

The formula is easy: it is the square root of the variance
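
A minimal sketch, assuming the variance is the plain average of squared differences from the mean (dividing by the number of values); the function name and data are illustrative only.

```python
import math

def std_dev(values):
    """Square root of the average squared difference from the mean."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance)

print(std_dev([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0 (the variance is 4.0)
```

Unlike the variance, the standard deviation is expressed in the same units as the data.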

What can be the issue with using standard deviation on data?

The mean can be distorted by outliers, and since the variance is computed using the mean, it is also sensitive to outliers.

What is multivariate data?

Data that consists of several attributes.

How can measures of location be obtained for data that consists of several attributes (multivariate data)?

Compute the mean or median separately for each attribute.
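
As a sketch with hypothetical (height, weight) records, the rows can be transposed into one column per attribute, and the mean and median taken column by column:

```python
from statistics import median

# Hypothetical objects with two attributes: (height_cm, weight_kg)
rows = [(170, 65), (160, 55), (180, 90)]

columns = list(zip(*rows))  # one tuple of values per attribute
means = [sum(col) / len(col) for col in columns]
medians = [median(col) for col in columns]

print(means)    # [170.0, 70.0]
print(medians)  # [170, 65]
```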

What is the covariance of two attributes?

A measure of the degree to which two attributes vary together. Its value depends on the magnitudes of the variables.

A value of 0 indicates that two attributes do not have a (linear) relationship.
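
The card gives no formula, so the sketch below assumes the usual sample covariance (sum of products of deviations from the means, divided by n - 1). The data are made up; rescaling one attribute by 100 shows how the value depends on magnitude.

```python
def covariance(xs, ys):
    """Sample covariance: average product of deviations, using n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
y_scaled = [100 * v for v in y]  # same relationship, different units

print(covariance(x, y))         # 5.0
print(covariance(x, y_scaled))  # 500.0
```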

Is it possible to judge the degree of relationship between two variables by looking only at the value of the covariance?

No. Because the covariance depends on the magnitudes of the variables, correlation is preferred to covariance.
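
A sketch of why correlation is easier to interpret, assuming the standard Pearson correlation (covariance divided by the product of the two standard deviations). The study-hours data are hypothetical. Rescaling one attribute leaves the correlation unchanged, and the result always lies between -1 and 1.

```python
import math

def correlation(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)  # the 1/(n - 1) factors cancel out

hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
scores_scaled = [10 * s for s in scores]

r = correlation(hours, scores)
r_scaled = correlation(hours, scores_scaled)
print(math.isclose(r, r_scaled))  # True
```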

What does the skewness of a set of values measure?

The degree to which the values are symmetrically distributed around the mean.
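
The card does not give a formula. One common choice, assumed here, is the third central moment divided by the cube of the standard deviation; it is 0 for symmetric data and positive when there is a long tail to the right. The data are made up.

```python
def skewness(values):
    """Third central moment divided by the standard deviation cubed."""
    n = len(values)
    mean = sum(values) / n
    second_moment = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / second_moment ** 1.5

print(skewness([1, 2, 3, 4, 5]))       # 0.0 (symmetric around the mean)
print(skewness([1, 2, 3, 4, 20]) > 0)  # True (long right tail)
```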

What is visual data mining?

The use of visualization techniques in data mining

What is selection (visualization)?

The elimination or de-emphasis of certain objects and attributes. It is needed because there is no satisfactory general approach to representing data with many attributes (graphs have only 2 or 3 dimensions).

What are stem and leaf plots useful for?

Providing insight into the distribution of one-dimensional integer or continuous data.
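
A minimal text-based sketch, assuming non-negative integers where the stem is everything but the last digit and the leaf is the last digit. The scores are made up.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (value // 10) to the sorted list of its leaves (value % 10)."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

scores = [12, 15, 21, 23, 23, 30, 38]
for stem, leaves in stem_and_leaf(scores).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 1 | 2 5
# 2 | 1 3 3
# 3 | 0 8
```

The length of each row shows how many values fall under that stem, so the plot doubles as a rough histogram.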

What is the following type of visual data representation?

Stem and leaf plot

What is a histogram?

A plot that displays the distribution of values for an attribute by dividing the possible values into bins and showing the number of objects that fall into each bin.
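
The binning step can be sketched as below, assuming equal-width bins and values that all lie within the stated bounds. The function signature and the ages are hypothetical.

```python
def histogram(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width bins on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # Clamp the index so that v == high lands in the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

ages = [3, 7, 12, 18, 25, 31, 33, 38, 40]
print(histogram(ages, 0, 40, 4))  # [2, 2, 1, 4]
```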

What is a pareto histogram?

A normal histogram, except that categories are sorted by count so that the count is decreasing from left to right.
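
As a sketch with made-up defect categories, the Pareto ordering is just the category counts sorted from largest to smallest:

```python
from collections import Counter

defects = ["scratch", "dent", "scratch", "crack", "scratch", "dent"]

# most_common() returns (category, count) pairs in decreasing order of count.
pareto = Counter(defects).most_common()
print(pareto)  # [('scratch', 3), ('dent', 2), ('crack', 1)]

# Crude text rendering: one row per category, tallest first.
for category, count in pareto:
    print(f"{category:<8}{'#' * count}")
```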

What are box plots?

Another method for showing the distribution of the values of a single numerical attribute.

What is the following visual representation of data?

scatter plot

What is the following visual representation of data?

Contour plot

What is the following visual representation of data?

surface plot

What type of visual representation is the following?

Vector field plot. Used when a characteristic has both a magnitude and a direction.

What type of visual representation is the following?

Data matrix image. An image can be regarded as a rectangular array of pixels; each pixel is characterized by its color and brightness.

What type of visual representation of data is the following?

Parallel coordinates plot. It has one coordinate axis for each attribute, but the different axes are parallel to one another instead of perpendicular. Each object is represented as a line instead of a point: each attribute value of the object is mapped to a point on the coordinate axis associated with that attribute.

What type of visual representation is the following?

Chernoff face. Each attribute is associated with a specific feature of a face, and each attribute value is used to determine the way that the facial feature is expressed.

Define data cube

a multidimensional representation of the data, together with all possible totals (aggregates)

Define cross tabulation

Cross-tabulation is about taking two variables and tabulating the results of one variable against the other variable. An example would be the cross-tabulation of course performance against mode of study.
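
A sketch of the course-performance example with entirely made-up records: counting each (mode of study, performance) pair gives the cells of the table.

```python
from collections import Counter

# Hypothetical records: (mode of study, course performance)
records = [
    ("full-time", "pass"), ("full-time", "pass"), ("full-time", "fail"),
    ("part-time", "pass"), ("part-time", "fail"), ("part-time", "fail"),
]

table = Counter(records)  # one count per (mode, performance) cell
modes = sorted({mode for mode, _ in records})
grades = sorted({grade for _, grade in records})  # ['fail', 'pass']

for mode in modes:
    print(mode, [table[(mode, grade)] for grade in grades])
# full-time [1, 2]
# part-time [2, 1]
```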

Define pivoting

Refers to aggregating over all dimensions except two. The result is a two-dimensional cross tabulation with the two specified dimensions as the only remaining dimensions.
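
A small sketch, assuming the cube is stored as a dictionary keyed by (product, location, month) with sum as the aggregate. The cube contents and the `pivot` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales cube: (product, location, month) -> units sold
cube = {
    ("pen", "north", "jan"): 5,
    ("pen", "north", "feb"): 7,
    ("pen", "south", "jan"): 2,
    ("ink", "north", "jan"): 1,
    ("ink", "south", "feb"): 4,
}

def pivot(cube, keep):
    """Sum over every dimension except the two key positions listed in `keep`."""
    table = defaultdict(int)
    for key, value in cube.items():
        table[(key[keep[0]], key[keep[1]])] += value
    return dict(table)

by_product_location = pivot(cube, keep=(0, 1))  # aggregates over month
print(by_product_location[("pen", "north")])    # 12
```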

Define slicing and dicing

Slicing: selecting a group of cells from an entire multidimensional array by specifying a specific value for one or more dimensions. Dicing: selecting a subset of cells by specifying a range of attribute values (this is the equivalent of defining a subarray from the complete array).
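
Both operations can be sketched as filters over a cube stored as a dictionary keyed by (product, location, day). The cube and the helper names `slice_cube` and `dice_cube` are made up for illustration.

```python
# Hypothetical sales cube: (product, location, day of month) -> units sold
cube = {
    ("pen", "north", 1): 5,
    ("pen", "north", 2): 7,
    ("pen", "south", 3): 2,
    ("ink", "north", 1): 1,
    ("ink", "south", 4): 4,
}

def slice_cube(cube, dim, value):
    """Slicing: keep cells whose key has one specific value in dimension `dim`."""
    return {k: v for k, v in cube.items() if k[dim] == value}

def dice_cube(cube, dim, low, high):
    """Dicing: keep cells whose key falls within a range in dimension `dim`."""
    return {k: v for k, v in cube.items() if low <= k[dim] <= high}

north = slice_cube(cube, dim=1, value="north")
days_2_to_3 = dice_cube(cube, dim=2, low=2, high=3)
print(len(north))           # 3
print(sorted(days_2_to_3))  # [('pen', 'north', 2), ('pen', 'south', 3)]
```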

Define "roll up" given a multi-dimensional array of sales data

aggregating the sales across all the dates in a month

Define "drill down" given a multi-dimensional array of sales data

splitting monthly sales totals into daily sales totals
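
Roll up and drill down can be sketched as inverse views of the same data. The daily figures below are hypothetical, and dates are assumed to be ISO strings so that the first seven characters identify the month. Drilling down is only possible if the daily detail has been kept.

```python
from collections import defaultdict

# Hypothetical daily sales totals keyed by ISO date
daily_sales = {
    "2024-01-30": 10,
    "2024-01-31": 15,
    "2024-02-01": 8,
    "2024-02-02": 12,
}

def roll_up(daily):
    """Aggregate daily totals into monthly totals."""
    monthly = defaultdict(int)
    for date, amount in daily.items():
        monthly[date[:7]] += amount  # "YYYY-MM" prefix
    return dict(monthly)

def drill_down(daily, month):
    """Split one month's total back into its daily totals."""
    return {date: amount for date, amount in daily.items() if date.startswith(month)}

monthly_sales = roll_up(daily_sales)
print(monthly_sales)                       # {'2024-01': 25, '2024-02': 20}
print(drill_down(daily_sales, "2024-02"))  # {'2024-02-01': 8, '2024-02-02': 12}
```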

(44 cards)