Chapter 3 - Exploring Data Flashcards
What is data exploration?
A preliminary investigation of the data to better understand its specific characteristics. It helps with selecting appropriate preprocessing and data analysis techniques.
What are summary statistics?
Quantities such as the mean and standard deviation that capture various characteristics of a potentially large set of values with a single number or a small set of numbers.
Define frequency of a data set
frequency(v_i) = (number of objects with attribute value v_i) / (total number of objects). E.g., for the values 1 2 3 1 1, the frequency of 1 is 3/5, or 0.6.
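As a minimal sketch of this definition, the snippet below counts each distinct value and divides by the number of objects. The function name `frequencies` is a hypothetical helper chosen for illustration.

```python
from collections import Counter

def frequencies(values):
    """Map each distinct value to its relative frequency (count / total)."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

freq = frequencies([1, 2, 3, 1, 1])
print(freq[1])  # 0.6, since 1 appears 3 times out of 5 values
```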
Define mode of a data set
The mode of a categorical attribute is the value that appears most frequently (has the highest frequency).
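One possible way to find the mode, assuming the values are hashable, is to count occurrences and take the most common one. The helper name `mode` and the sample colors are illustrative only. If several values tie, this sketch returns the one that was encountered first.

```python
from collections import Counter

def mode(values):
    """Return the value with the highest count (first one seen on ties)."""
    counts = Counter(values)
    value, _count = counts.most_common(1)[0]
    return value

colors = ["red", "blue", "red", "green", "red", "blue"]
print(mode(colors))  # red
```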
What is a percentile?
The value below which a given percentage of the data falls. E.g., if 80% of people are shorter than you, you are at the 80th percentile.
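The height example can be sketched as follows, assuming the simple "percentage of values strictly below" definition (other percentile conventions exist). The function name `percentile_rank` and the height data are made up for illustration.

```python
def percentile_rank(values, x):
    """Percentage of values that are strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical heights in cm
heights = [150, 155, 160, 165, 170, 175, 180, 185, 190, 195]
print(percentile_rank(heights, 190))  # 80.0, so 190 is at the 80th percentile
```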
What do the mean and median measure?
The location of a set of values
Define mean
The mean is the average of a set of numbers
How is the mean of a set of numbers computed?
Add the values together and divide by the number of values in the set.
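A minimal sketch of that computation (the function name `mean` is just an illustrative choice):

```python
def mean(values):
    """Sum of the values divided by how many values there are."""
    return sum(values) / len(values)

print(mean([2, 4, 6, 8]))  # 5.0
```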
What is the median?
The median is the middle number in a set of values when they are arranged in order (highest to lowest, or lowest to highest).
If the set has an even number of values, then the median is the average of the two middle numbers.
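Both cases (odd and even number of values) can be sketched like this. The function name `median` is illustrative; Python's standard `statistics.median` does the same job.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```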
What is a trimmed mean?
A mean computed after a percentage of the values at the beginning and end of the sorted data is thrown out.
This is done because the mean is sensitive to outliers.
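A rough sketch of a trimmed mean, assuming the same percentage is dropped from each end of the sorted data and that the percentage is below 50. The name `trimmed_mean` and the sample data are hypothetical.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent` % of the values from each end of the sorted data."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)  # number of values to drop per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]   # 1000 is an outlier
print(sum(data) / len(data))    # 104.5 (plain mean, pulled up by the outlier)
print(trimmed_mean(data, 10))   # 5.5 (drops the 1 and the 1000)
```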
If the distribution of values in a data set is skewed, is the mean a good indicator of the middle of the set of values?
No. In this case, the median is a better indicator of the middle.
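To illustrate, here is a small sketch using Python's standard `statistics` module on made-up, right-skewed income figures (the data are hypothetical):

```python
import statistics

# Hypothetical incomes (in thousands); one very large value skews the data
incomes = [30, 32, 35, 38, 40, 1000]

skewed_mean = statistics.mean(incomes)      # roughly 195.8, far from most values
skewed_median = statistics.median(incomes)  # 36.5, close to the typical value
print(skewed_mean, skewed_median)
```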
What do range and variance measure in a data set?
The spread of a set of values.
They indicate whether the values are widely spread out or relatively concentrated around a single point.
How is the range of a data set calculated?
The largest value minus the smallest value.
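As a short sketch (the name `value_range` is chosen only to avoid shadowing Python's built-in `range`):

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

print(value_range([3, 7, 1, 9, 4]))  # 8
```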
Why is variance a preferred measure of spread over range?
The range can be misleading if most of the values are concentrated in a narrow band, but there are also a relatively small number of more extreme values.
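The sketch below uses two made-up data sets with the same range but very different spread. It relies on `statistics.pvariance` from the standard library, which divides by the number of values.

```python
import statistics

# Hypothetical data: both sets span 0 to 100
concentrated = [0, 50, 50, 50, 50, 100]   # most values sit in a narrow band
spread_out = [0, 0, 0, 100, 100, 100]     # values sit at the extremes

range_a = max(concentrated) - min(concentrated)
range_b = max(spread_out) - min(spread_out)
var_a = statistics.pvariance(concentrated)
var_b = statistics.pvariance(spread_out)

print(range_a, range_b)  # 100 100 (the range cannot tell them apart)
print(var_a < var_b)     # True (the variance can)
```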
What symbol is used for variance?
s_x^2 (s squared, with subscript x)
What is variance?
The average of the squared differences from the mean.
How is variance calculated?
To calculate the variance, follow these steps:
Work out the mean (the simple average of the numbers).
Then, for each number, subtract the mean and square the result (the squared difference).
Then work out the average of those squared differences.
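The steps above can be sketched as follows. This version divides by the number of values, exactly as the steps describe; note that the sample variance commonly used in statistics divides by one less than the number of values instead. The function name `variance` and the data are illustrative.

```python
def variance(values):
    """Average of the squared differences from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n                              # step 1: the mean
    squared_diffs = [(v - mean) ** 2 for v in values]   # step 2: squared differences
    return sum(squared_diffs) / n                       # step 3: their average

print(variance([2, 4, 4, 4, 5, 5, 7, 9]))  # 4.0
```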