Chapter 3 - Exploring Data Flashcards
What is data exploration?
A preliminary investigation of the data in order to better understand its specific characteristics. Helps with selecting appropriate preprocessing and data analysis techniques.
What are summary statistics?
Quantities such as the mean and standard deviation that capture various characteristics of a potentially large set of values with a single number or a small set of numbers.
Define frequency of a data set
frequency(vi) = number of objects with attribute value vi / number of objects, e.g. for the values 1 2 3 1 1, the frequency of 1 is 3/5 or 0.6
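A minimal sketch of the frequency definition, using the five example values 1 2 3 1 1 (three of which are 1). The helper name `frequencies` is made up for illustration.

```python
from collections import Counter

def frequencies(values):
    """Map each distinct value to (count of that value) / (number of objects)."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

data = [1, 2, 3, 1, 1]
freq = frequencies(data)
print(freq[1])  # 0.6, because 3 of the 5 values are 1
```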
Define mode of a data set
The mode of a categorical attribute is the value that appears most frequently (has the highest frequency).
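As a small illustration, the mode can be found by counting each value and taking the most common one. The `colors` list is hypothetical example data.

```python
from collections import Counter

# Hypothetical categorical attribute values.
colors = ["red", "blue", "red", "green", "red", "blue"]

# most_common(1) returns a list holding the (value, count) pair with the highest count.
mode_value, mode_count = Counter(colors).most_common(1)[0]
print(mode_value, mode_count)  # red 3
```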
What is a percentile?
The value below which a given percentage of the data falls, e.g. if 80% of people are shorter than you, you are at the 80th percentile.
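The height example can be sketched as follows. The function `percent_below` and the list of heights are assumptions made for illustration; it simply reports what percentage of the values fall strictly below a given value.

```python
def percent_below(values, x):
    """Percentage of values that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical heights in centimetres.
heights = [150, 155, 160, 165, 170, 175, 180, 185, 190, 195]

rank = percent_below(heights, 190)
print(rank)  # 80.0, so a height of 190 sits at the 80th percentile here
```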

What do the mean and median measure?
The location of a set of values
Define mean
The mean is the average of a set of numbers
How is the mean of a set of numbers computed?
Add the values of the numbers and divide by the total number of numbers in the set
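A minimal sketch of that computation, with a made-up list of numbers:

```python
def mean(values):
    """Sum of the values divided by how many values there are."""
    return sum(values) / len(values)

numbers = [2, 4, 6, 8]
average = mean(numbers)
print(average)  # 5.0
```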
What is the median?
The median is the middle number in a set of values when they are arranged in order (from highest to lowest, or lowest to highest).
If the set has an even number of values, then the median is the average of the middle two numbers.
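Both cases (odd and even number of values) can be sketched like this. The function name and the sample lists are illustrative only.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_median = median([7, 1, 3])       # sorted: 1 3 7
even_median = median([7, 1, 3, 5])   # sorted: 1 3 5 7
print(odd_median)   # 3
print(even_median)  # 4.0
```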
What is a trimmed mean?
A mean computed after a specified percentage of the values at the beginning and end of the sorted data is thrown out.
This is done because the ordinary mean is sensitive to outliers.
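A rough sketch of a trimmed mean, assuming the stated percentage is removed from each end of the sorted data (conventions differ on how the percentage is split and rounded). The function name and data are hypothetical.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the values from each end of the sorted data."""
    if not 0 <= percent < 50:
        raise ValueError("percent must be in [0, 50)")
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)  # number of values dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 5, 90]       # 90 is an outlier
plain = trimmed_mean(data, 0)    # ordinary mean
trimmed = trimmed_mean(data, 20) # drops 1 and 90
print(plain)    # 17.5
print(trimmed)  # 3.5
```

With the outlier removed, the trimmed mean lands close to the bulk of the values, while the ordinary mean is pulled far above them.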
If the distribution of values in a data set is skewed, is the mean a good indicator of the middle of the set of values?
No. In this case, the median is a better indicator of the middle.
What do range and variance measure in a data set?
The spread of a set of values.
They indicate whether the values are widely spread out or relatively concentrated around a single point.
How is the range of a data set calculated?
The largest value minus the smallest value.
Why is variance a preferred measure of spread over range?
The range can be misleading if most of the values are concentrated in a narrow band of values, but there are also a relatively small number of more extreme values.
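To make this concrete, here is a small sketch with two made-up data sets that have the same range but very different concentration. It uses `statistics.pvariance` from the Python standard library, which divides by the number of values.

```python
import statistics

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

concentrated = [0, 50, 50, 50, 50, 100]  # mostly in a narrow band, two extremes
spread_out = [0, 20, 40, 60, 80, 100]    # evenly spread

range_a = value_range(concentrated)
range_b = value_range(spread_out)
var_a = statistics.pvariance(concentrated)
var_b = statistics.pvariance(spread_out)

print(range_a, range_b)  # 100 100
print(var_a < var_b)     # True (about 833.3 versus about 1166.7)
```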
What symbol is used for variance?
s_x^2 (s squared, with subscript x)
What is variance?
The average of the squared differences from the mean.
How is variance calculated?
To calculate the variance, follow these steps:
Work out the mean (the simple average of the numbers).
Then, for each number, subtract the mean and square the result (the squared difference).
Then work out the average of those squared differences.
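The steps above can be sketched as follows. This follows the card's wording and divides by the number of values; note that many texts define the sample variance with a divisor of n - 1 instead. The function name and data are illustrative.

```python
def variance(values):
    """Average of the squared differences from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n                               # step 1: the mean
    squared_diffs = [(v - mean) ** 2 for v in values]    # step 2: squared differences
    return sum(squared_diffs) / n                        # step 3: their average

data = [2, 4, 4, 4, 5, 5, 7, 9]
var = variance(data)
print(var)  # 4.0 (mean is 5, squared differences sum to 32, and 32 / 8 = 4)
```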
What is the standard deviation?
A measure of how spread out the data is.
Its symbol is σ (the Greek letter sigma).
The formula is easy: it is the square root of the variance.
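A short sketch of the standard deviation as the square root of the variance, again dividing by the number of values (an assumption that matches the "average of squared differences" wording). The names and data are illustrative.

```python
import math

def std_dev(values):
    """Square root of the average squared difference from the mean."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(var)

data = [2, 4, 4, 4, 5, 5, 7, 9]
sigma = std_dev(data)
print(sigma)  # 2.0 (the variance of this data is 4.0)
```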
What can be the issue with using standard deviation on data?
The mean can be distorted by outliers, and since the variance is computed using the mean, it is also sensitive to outliers.
What is multivariate data?
Data that consists of several attributes.
How can measures of location be obtained for data that consists of several attributes (multivariate data)?
Compute the mean or median separately for each attribute.
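As an illustration, each object can be stored as a row and the mean taken column by column. The rows and the helper `attribute_means` are hypothetical.

```python
def attribute_means(rows):
    """Mean of each attribute (column) computed separately."""
    columns = zip(*rows)  # transpose: one tuple per attribute
    return [sum(col) / len(col) for col in columns]

# Hypothetical objects with two attributes each.
rows = [
    (1.0, 10.0),
    (2.0, 20.0),
    (3.0, 30.0),
]

means = attribute_means(rows)
print(means)  # [2.0, 20.0]
```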
What is the covariance of two attributes?
A measure of the degree to which two attributes vary together. Its value depends on the magnitudes of the variables.
A value of 0 indicates that two attributes do not have a (linear) relationship.
Is it possible to judge the degree of relationship between two variables by looking only at the value of the covariance?
No. Because the covariance depends on the magnitudes of the variables, correlation is preferred to covariance.
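A sketch of why correlation is easier to interpret: rescaling one attribute changes the covariance but leaves the correlation unchanged. The divisor n is used here for simplicity (many texts use n - 1, which cancels in the correlation anyway); the function names and data are made up.

```python
import math

def covariance(xs, ys):
    """Average product of the deviations of x and y from their means (divides by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
x_scaled = [100 * v for v in x]  # same attribute in different units

cov_xy = covariance(x, y)
cov_scaled = covariance(x_scaled, y)
corr_xy = correlation(x, y)
corr_scaled = correlation(x_scaled, y)

print(cov_xy)      # 4.0
print(cov_scaled)  # 400.0
# Both correlations are approximately 1, since y is an exact linear function of x.
print(corr_xy, corr_scaled)
```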
What does the skewness of a set of values measure?
The degree to which the values are symmetrically distributed around the mean.
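One common way to quantify skewness is the third central moment divided by the variance raised to the power 3/2. That particular formula is an assumption here rather than something stated on the card; the function name and data are illustrative.

```python
def skewness(values):
    """Third central moment divided by (second central moment) ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]     # mirrored around the mean of 3
right_skewed = [1, 1, 1, 2, 10] # long tail toward large values

skew_sym = skewness(symmetric)
skew_right = skewness(right_skewed)
print(skew_sym)        # 0.0
print(skew_right > 0)  # True
```

A value near 0 suggests a roughly symmetric distribution, while a positive value suggests a longer tail on the high side.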











