Chapter 3 - Exploring Data Flashcards
What is data exploration?
A preliminary investigation of the data in order to better understand its specific characteristics. Helps with selecting appropriate preprocessing and data analysis techniques.
What are summary statistics?
Quantities such as the mean and standard deviation that capture various characteristics of a potentially large set of values with a single number or a small set of numbers.
Define frequency of a data set
frequency(vi) = (number of objects with attribute value vi) / (number of objects), e.g. for the values 1 2 3 1 1, the frequency of 1 is 3/5 or 0.6
Define mode of a data set
The mode of a categorical attribute is the value that appears most frequently (has the highest frequency).
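A minimal sketch in Python of both definitions, using the values 1 2 3 1 1 and the standard library's `collections.Counter`. The variable names are illustrative only.

```python
from collections import Counter

values = [1, 2, 3, 1, 1]  # five objects with one attribute each

counts = Counter(values)

# Relative frequency: count of each value divided by the number of objects.
frequencies = {value: count / len(values) for value, count in counts.items()}

# The mode is the value with the highest frequency.
mode = counts.most_common(1)[0][0]

print(frequencies)  # {1: 0.6, 2: 0.2, 3: 0.2}
print(mode)         # 1
```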
What is a percentile?
The value below which a given percentage of the data falls, e.g. if 80% of people are shorter than you, you are at the 80th percentile.
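A small sketch of the idea in Python, assuming made-up height data and a hypothetical helper `percent_below` that counts how many values fall strictly below a given value.

```python
heights_cm = [150, 155, 160, 165, 170, 175, 180, 185, 190, 195]  # made-up data

def percent_below(data, x):
    """Percentage of values in data that are strictly less than x."""
    return 100 * sum(v < x for v in data) / len(data)

# 8 of the 10 heights are below 190, so 190 sits at the 80th percentile here.
print(percent_below(heights_cm, 190))  # 80.0
```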

What do the mean and median measure?
The location of a set of values
Define mean
The mean is the average of a set of numbers
How is the mean of a set of numbers computed?
Add the values of the numbers and divide by the total number of numbers in the set
What is the median?
The median is the middle number in a set of values when arranged from highest to lowest.
If the set has an even number of values, then the median is the average of the middle two numbers.
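The two computations can be sketched in Python as follows. The example values are arbitrary, and the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics

values = [13, 1, 4, 2]  # arbitrary example values

# Mean: add the values and divide by how many there are.
mean = sum(values) / len(values)

# Median: sort, then take the middle value (or average the middle two).
ordered = sorted(values)
n = len(ordered)
if n % 2 == 1:
    median = ordered[n // 2]
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

assert median == statistics.median(values)  # cross-check with the standard library

print(mean)    # 5.0
print(median)  # 3.0
```

The single large value 13 pulls the mean up to 5.0, while the median stays at 3.0.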
What is a trimmed mean?
A mean computed after a percentage of the values at the beginning and end of the sorted data is thrown out.
This is done because the mean is sensitive to outliers.
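A minimal sketch in Python, assuming the percentage is dropped from each end of the sorted data (conventions differ on whether it is per end or in total). The function name `trimmed_mean` and the data are hypothetical.

```python
def trimmed_mean(data, percent):
    """Mean after dropping `percent` % of the values from each end of the sorted data."""
    ordered = sorted(data)
    k = int(len(ordered) * percent / 100)  # number of values to drop per end
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("percent is too large: no values left")
    return sum(kept) / len(kept)

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # 100 is an outlier

print(sum(values) / len(values))  # 14.5 (plain mean, pulled up by the outlier)
print(trimmed_mean(values, 10))   # 5.5  (drops 1 and 100)
```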
If the distribution of values in a data set is skewed, is the mean a good indicator of the middle of the set of values?
No. In this case, the median is a better indicator of the middle.
What do range and variance measure in a data set?
The spread of a set of values.
They indicate whether the values are widely spread out or relatively concentrated around a single point.
How is the range of a data set calculated?
The largest value minus the smallest value.
Why is variance a preferred measure of spread over range?
The range can be misleading if most of the values are concentrated in a narrow band of values, but there are also a relatively small number of more extreme values.
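A short Python sketch of this point, using two made-up data sets with the same range but very different concentration. `statistics.pvariance` (population variance) is used here for illustration.

```python
import statistics

def value_range(data):
    """Largest value minus smallest value."""
    return max(data) - min(data)

# Mostly concentrated around 50, with two extreme values.
concentrated = [0, 48, 49, 50, 50, 50, 51, 52, 100]
# Evenly spread between 0 and 100.
spread_out = [0, 12, 25, 38, 50, 62, 75, 88, 100]

print(value_range(concentrated), value_range(spread_out))  # 100 100
# The variance tells the two data sets apart; the range does not.
print(statistics.pvariance(concentrated) < statistics.pvariance(spread_out))  # True
```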
What symbol is used for variance?
s_x^2 (s squared, with subscript x)
What is variance?
The average of the squared differences from the mean.
How is variance calculated?
To calculate the variance follow these steps:
Work out the Mean (the simple average of the numbers)
Then for each number: subtract the Mean and square the result (the squared difference).
Then work out the average of those squared differences.
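The three steps above can be sketched in Python as follows. The data is arbitrary, and dividing by the number of values gives the population variance; a sample variance would divide by n - 1 instead.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # arbitrary example values

# Step 1: work out the mean.
mean = sum(values) / len(values)

# Step 2: for each number, subtract the mean and square the result.
squared_differences = [(v - mean) ** 2 for v in values]

# Step 3: average the squared differences.
variance = sum(squared_differences) / len(values)

assert variance == statistics.pvariance(values)  # cross-check with the standard library

print(mean)      # 5.0
print(variance)  # 4.0
```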
What is the standard deviation?
A measure of how spread out the data is.
Its symbol is σ (the Greek letter sigma).
The formula is the square root of the variance.
What can be the issue with using standard deviation on data?
The mean can be distorted by outliers, and since the variance is computed using the mean, it is also sensitive to outliers.
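A small Python sketch, with made-up data, of the standard deviation as the square root of the variance and of how a single outlier inflates it. Population variance (`statistics.pvariance`) is assumed.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # arbitrary example values
with_outlier = values + [100]       # same data plus one extreme value

# Standard deviation = square root of the variance.
std = math.sqrt(statistics.pvariance(values))
std_with_outlier = math.sqrt(statistics.pvariance(with_outlier))

print(std)                     # 2.0
print(std_with_outlier > std)  # True: one outlier inflates the spread
```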
What is multivariate data?
Data that consists of several attributes.
How can measures of location be obtained for data that consists of several attributes (multivariate data)?
Compute the mean or median separately for each attribute.
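A minimal sketch in Python, assuming each object is a tuple of two hypothetical attributes (height, weight). The rows are transposed so that each attribute can be summarized on its own.

```python
import statistics

# Hypothetical objects: (height in cm, weight in kg).
rows = [(170, 65), (180, 80), (160, 50)]

# Transpose rows into one sequence per attribute.
columns = list(zip(*rows))

# Compute each measure of location separately for each attribute.
means = [sum(column) / len(column) for column in columns]
medians = [statistics.median(column) for column in columns]

print(means)    # [170.0, 65.0]
print(medians)  # [170, 65]
```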
What is the covariance of two attributes?
A measure of the degree to which two attributes vary together; its value depends on the magnitude of the variables.
A value of 0 indicates that two attributes do not have a (linear) relationship.
Is it possible to judge the degree of relationship between two variables by looking only at the value of the covariance?
No. Because the covariance depends on the magnitude of the variables, correlation is preferred to covariance.
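A Python sketch of why this is so, assuming the sample covariance (dividing by n - 1) and the correlation defined as the covariance divided by the product of the standard deviations. The helper names and data are illustrative.

```python
import math

def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance rescaled by the two standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
y_scaled = [100 * v for v in y]  # same relationship, different units

print(covariance(x, y), covariance(x, y_scaled))    # 5.0 500.0
print(correlation(x, y), correlation(x, y_scaled))  # 1.0 1.0
```

Rescaling one attribute changes the covariance by a factor of 100, while the correlation stays the same.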
What does the skewness of a set of values measure?
The degree to which the values are symmetrically distributed around the mean.
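One common way to quantify this, sketched in Python, is the average cubed deviation from the mean divided by the cube of the standard deviation. This population-style formula is an assumption here (other definitions apply small-sample corrections), and it requires data that is not constant.

```python
def skewness(data):
    """Average cubed deviation divided by the standard deviation cubed."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n  # population variance
    m3 = sum((v - mean) ** 3 for v in data) / n  # average cubed deviation
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 1, 10]  # long tail toward large values

print(skewness(symmetric))         # 0.0
print(skewness(right_skewed) > 0)  # True
```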
What is visual data mining?
The use of visualization techniques in data mining
What is selection (visualization)?
The elimination or de-emphasis of certain objects and attributes.
There is no satisfactory general approach to representing data with many attributes (graphs have 2 or 3 dimensions).
What are stem and leaf plots useful for?
They provide insight into the distribution of one-dimensional integer or continuous data.
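A minimal text-based sketch in Python, assuming non-negative integers where the stem is the tens part and the leaf is the last digit. The function name `stem_and_leaf` and the data are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one text line per stem, with the leaves listed in sorted order."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)  # stem = tens part, leaf = last digit
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(groups.items())
    ]

for line in stem_and_leaf([23, 12, 38, 15, 21, 30, 23]):
    print(line)
# 1 | 2 5
# 2 | 1 3 3
# 3 | 0 8
```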
What is the following type of visual data representation?

Stem and leaf plot
What is a histogram?
A plot that displays the distribution of values for attributes by dividing the possible values into bins and showing the number of objects that fall into each bin.
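The binning step can be sketched in Python as follows, assuming equal-width bins and values that lie within the given bounds. The function name `histogram_counts` and the data are illustrative.

```python
def histogram_counts(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width bins on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # A value equal to `high` is placed in the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

values = [1, 2, 2, 3, 5, 6, 8, 9, 9, 10]
counts = histogram_counts(values, low=0, high=10, bins=5)

print(counts)  # [1, 3, 1, 1, 4]
for i, count in enumerate(counts):
    print(f"bin {i}: " + "*" * count)  # crude text rendering of the bars
```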

What is a pareto histogram?
A normal histogram, except that categories are sorted by count so that the count is decreasing from left to right.
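A short Python sketch of the ordering, using made-up category data. `Counter.most_common()` returns the categories sorted by decreasing count, which is the left-to-right order of the bars.

```python
from collections import Counter

# Made-up categorical data.
defects = ["scratch", "dent", "scratch", "crack", "scratch", "dent"]

# Categories sorted so that the count decreases from left to right.
pareto_order = Counter(defects).most_common()

print(pareto_order)  # [('scratch', 3), ('dent', 2), ('crack', 1)]
```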

What are box plots?
Another method for showing the distribution of the values of a single numerical attribute.

What is the following visual representation of data?

scatter plot
What is the following visual representation of data?

Contour plot
What is the following visual representation of data?

surface plot
What type of visual representation is the following?

vector field plot
a characteristic may have both a magnitude and direction
What type of visual representation is the following?

Data matrix
An image can be regarded as a rectangular array of pixels.
Each pixel is characterized by its color and brightness.
What type of visual representation of data is the following?

parallel coordinates plot
There is one coordinate axis for each attribute, but the different axes are parallel to one another instead of perpendicular.
Each object is represented as a line instead of a point.
Each attribute value of an object is mapped to a point on the coordinate axis associated with that attribute.
What type of visual representation is the following?

Chernoff face
each attribute is associated with a specific feature of a face
each attribute value is used to determine the way that the facial feature is expressed
Define data cube
a multidimensional representation of the data, together with all possible totals (aggregates)

Define cross tabulation
Cross-tabulation is about taking two variables and tabulating the results of one variable against the other variable.
An example would be the cross-tabulation of course performance against mode of study.
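A minimal sketch of such a cross-tabulation in Python. The records are hypothetical and only illustrate the counting; they are not data from the course.

```python
from collections import Counter

# Hypothetical records: (mode of study, course performance).
records = [
    ("full-time", "pass"), ("full-time", "pass"), ("full-time", "fail"),
    ("part-time", "pass"), ("part-time", "fail"), ("part-time", "fail"),
]

# Count how often each combination of the two variables occurs.
table = Counter(records)

modes = sorted({mode for mode, _ in records})
results = sorted({result for _, result in records})

# Print one row per mode of study and one column per performance value.
print("mode of study".ljust(14) + "".join(r.rjust(6) for r in results))
for mode in modes:
    print(mode.ljust(14) + "".join(str(table[(mode, r)]).rjust(6) for r in results))
```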
Define pivoting
Refers to aggregating over all dimensions except two.
The result is a two-dimensional cross tabulation with the two specified dimensions as the only remaining dimensions.
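A small Python sketch, assuming a hypothetical sales array stored as a dictionary keyed by (product, location, month). The helper `pivot` sums over every dimension except the two that are kept.

```python
from collections import defaultdict

# Hypothetical sales data: (product, location, month) -> units sold.
sales = {
    ("shoes", "north", "jan"): 10,
    ("shoes", "north", "feb"): 20,
    ("shoes", "south", "jan"): 5,
    ("hats", "north", "jan"): 7,
    ("hats", "south", "feb"): 3,
}

def pivot(cube, first, second):
    """Aggregate (sum) over all dimensions except positions `first` and `second`."""
    result = defaultdict(int)
    for key, amount in cube.items():
        result[(key[first], key[second])] += amount
    return dict(result)

# Keep product (position 0) and location (position 1); month is summed away.
by_product_and_location = pivot(sales, 0, 1)
print(by_product_and_location[("shoes", "north")])  # 30
```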
Define slicing and dicing
slicing: selecting a group of cells from an entire multidimensional array by specifying specific values for one or more dimensions.
dicing: selecting a subset of cells by specifying a range of attribute values (this is the equivalent of defining a subarray from the complete array).
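Both operations can be sketched in Python on a hypothetical array stored as a dictionary keyed by (product, location, day). The data and variable names are illustrative.

```python
# Hypothetical sales data: (product, location, day of month) -> units sold.
cube = {
    ("shoes", "north", 1): 10,
    ("shoes", "north", 2): 20,
    ("shoes", "south", 3): 5,
    ("hats", "north", 2): 7,
    ("hats", "south", 4): 3,
}

# Slicing: fix a specific value for one dimension (location == "north").
north_slice = {key: v for key, v in cube.items() if key[1] == "north"}

# Dicing: select a range of attribute values (days 2 through 3).
day_dice = {key: v for key, v in cube.items() if 2 <= key[2] <= 3}

print(sorted(north_slice))
print(sorted(day_dice))
```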
Define “roll up” given a multi-dimensional array of sales data
aggregating the sales across all the dates in a month
Define “drill down” given a multi-dimensional array of sales data
splitting monthly sales totals into daily sales totals
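A minimal Python sketch of both directions, assuming hypothetical daily sales keyed by date. `roll_up` and `drill_down` are illustrative helper names.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales totals.
daily_sales = {
    date(2024, 1, 1): 5,
    date(2024, 1, 2): 7,
    date(2024, 2, 1): 4,
}

def roll_up(daily):
    """Aggregate daily sales into monthly totals keyed by (year, month)."""
    monthly = defaultdict(int)
    for day, amount in daily.items():
        monthly[(day.year, day.month)] += amount
    return dict(monthly)

def drill_down(daily, year, month):
    """Split one monthly total back into its daily sales."""
    return {
        day: amount
        for day, amount in daily.items()
        if (day.year, day.month) == (year, month)
    }

print(roll_up(daily_sales))  # {(2024, 1): 12, (2024, 2): 4}
print(drill_down(daily_sales, 2024, 1))
```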