Task 2 the characteristic score Flashcards
Cases
are the objects described by a set of data (customers, companies, subjects in a study)
Label
is a special variable used in some data sets to distinguish the different cases
Variable
is a characteristic of a case
→different cases can have different values of the variables
Categorical Variable
A categorical variable has no numerical meaning; it simply describes a quality or characteristic of a case. Numbers in a categorical variable designate a category rather than a measured quantity. For example, you could code gender with a 1 for males and a 2 for females, but you can't calculate with these numbers
Quantitative Variable
Quantitative variables are measured and expressed numerically, have numeric meaning and are used for calculation. Not everything written in numbers is quantitative: zip codes, for example, are only labels and you can't calculate with them
Nominal Variable
Nominal variables are categorical. Their categories are equal, in other words they don't differ in terms of order. You can't put them in order and you can't calculate with them, e.g. gender (male, female, transgender) or eye colour (blue, green, brown, hazel)
Ordinal variable
Ordinal variables are categorical variables whose categories have a clear order, e.g. education status (middle school, high school, college), coded 1, 2 and 3, or economic status (low, middle, high), again coded 1, 2 and 3 in a clear order
Interval Variable
Interval variables are quantitative. Their values are equally spaced, in other words the distance between adjacent values is always the same. E.g. three people's incomes of 15,000, 20,000 and 25,000: the interval is always 5,000
Distribution of variables
It tells us what values a variable takes and how often it takes these values
Frequency table
Lists how often each value occurs, so you can see the peak, or two peaks (bimodal). Used for categorical variables (nominal/ordinal)
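A frequency table can be sketched in a few lines of Python. The eye-colour observations below are made up purely for illustration; `Counter` from the standard library does the counting.

```python
from collections import Counter

# Hypothetical observations of a nominal variable (not from the flashcards).
eye_colours = ["blue", "brown", "brown", "green", "brown", "blue", "hazel"]

# The distribution: which values occur and how often each one occurs.
frequency_table = Counter(eye_colours)

# Print the table, most frequent value first.
for value, count in frequency_table.most_common():
    print(f"{value:<6} {count}")
```

The most frequent value ends up at the top, which is where the peak of the distribution sits.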
Pie chart
We can see the proportions, and which category is the largest and which the smallest
Used for nominal variables
Bar chart
Shows the frequency of each category as a bar. Mostly used with categorical variables: preferred for ordinal ones, but also applicable to nominal variables
Stem-and-leaf plot
We can see the peak and the outliers; it is basically just a frequency table turned on its side
Used for quantitative variables, both interval and ratio
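As a rough illustration, a stem-and-leaf plot for whole-number scores can be built by hand. The helper name `stem_and_leaf` and the scores are assumptions for this sketch, which only handles non-negative integers with the tens as stems and the units as leaves.

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Group non-negative integer scores by tens (stem) and list the units (leaves)."""
    stems = defaultdict(list)
    for score in sorted(scores):
        stems[score // 10].append(score % 10)
    return dict(stems)

# Hypothetical scores; 58 sits far from the rest.
scores = [12, 15, 21, 23, 23, 27, 31, 34, 58]
plot = stem_and_leaf(scores)

# Print every stem between the smallest and largest, including empty ones,
# so that gaps before an outlier stay visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

The longest row marks the peak, and a row separated from the others by empty stems points to a possible outlier.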
Distribution: Shapes
Trends, Peak, Outlier
skewed distribution
A frequency distribution in which most scores fall in categories above or below the middle
Histogram
useful for quantitative variables
Mode (MO)
Simply the score that occurs most often
Tells something about the frequency of the scores
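A small sketch of finding the mode with Python's standard library. The scores are invented; `statistics.multimode` (Python 3.8+) returns all most frequent values, so it also covers the bimodal case.

```python
from statistics import multimode

# Hypothetical scores with a single most frequent value.
scores = [2, 3, 3, 5, 7, 7, 7, 9]
modes = multimode(scores)

# Hypothetical scores with two equally frequent values (bimodal).
bimodal_modes = multimode([1, 1, 2, 3, 3])

print(modes)
print(bimodal_modes)
```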
Median (M)
Lies in the middle if we order the scores from highest to lowest
Take the total number of scores (N), add 1 and divide by 2 to obtain the rank of the middle score: (N + 1)/2 = middle score rank
Count through the ordered scores up to this rank and take the value of that score
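The rank rule can be sketched as follows. The function name `median_by_rank` is hypothetical, and averaging the two middle scores when N is even is a common convention that the card itself does not spell out.

```python
def median_by_rank(scores):
    """Median via the (N + 1) / 2 rank rule."""
    if not scores:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(scores)       # ascending or descending gives the same middle
    n = len(ordered)
    rank = (n + 1) / 2             # 1-based rank of the middle score
    if rank.is_integer():
        return ordered[int(rank) - 1]
    # Even N: the rank falls between two scores.
    # Assumption: average the two neighbours.
    lower = ordered[int(rank) - 1]
    upper = ordered[int(rank)]
    return (lower + upper) / 2

odd_median = median_by_rank([7, 1, 5, 3, 9])   # N = 5, rank 3
even_median = median_by_rank([1, 3, 5, 7])     # N = 4, rank 2.5
print(odd_median, even_median)
```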
Arithmetic mean ($\bar{x}$)
Add up all the separate scores and divide the sum by the total number of scores (N)
$\bar{x} = \frac{\sum x_i}{N}$
Sum of squares (variation)
Calculate the difference of each score from the mean and square all the differences before adding them up
$\sum (x_i - \bar{x})^2$
Variance
Divide the variation (sum of squares) by the total number of scores (N) minus 1
$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{N - 1}$
Standard deviation
Take the square root of the variance
$s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}}$
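The chain from mean to sum of squares to variance to standard deviation can be followed step by step in code. The five scores are made up, and the variable names are only illustrative.

```python
import math

# Hypothetical scores.
scores = [2, 4, 4, 6, 9]
n = len(scores)

# Arithmetic mean: sum of the scores divided by N.
mean = sum(scores) / n

# Sum of squares (variation): squared deviations from the mean, added up.
sum_of_squares = sum((x - mean) ** 2 for x in scores)

# Variance: sum of squares divided by N - 1.
variance = sum_of_squares / (n - 1)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(mean, sum_of_squares, variance, std_dev)
```

For these scores the deviations are -3, -1, -1, 1 and 4, so the sum of squares is 9 + 1 + 1 + 1 + 16 = 28 and the variance is 28 / 4 = 7.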
Split up in quartiles (Q1, Q2, Q3)
First quartile (Q1): 25% of all scores lie below it and 75% above
Second quartile (Q2): the median
Third quartile (Q3): 75% of all scores lie below it and 25% above
IQR (interquartile range)
The middle half of the scores lies between Q1 and Q3
IQR = Q3 - Q1
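Quartiles can be computed in several slightly different ways. The sketch below assumes one common textbook convention, which is not stated on the cards: Q1 and Q3 are the medians of the lower and upper halves of the ordered scores, leaving out the overall median when N is odd. The function name `quartiles` and the data are illustrative.

```python
from statistics import median

def quartiles(scores):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd N the middle score is the median itself and belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

# Hypothetical scores.
scores = [1, 3, 4, 6, 7, 8, 10, 12, 15]
q1, q2, q3 = quartiles(scores)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Other conventions, such as the default of `statistics.quantiles`, can give slightly different quartiles for the same data.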
Five Number Summary
The lowest score (minimum), Q1, the median, Q3 and the highest score (maximum). You can leave out the outliers.
→ can be summarised in a boxplot (each horizontal bar indicates a number from the summary)
1.5 × IQR criterion
Used to identify outliers: every score below Q1 - (1.5 × IQR) is an outlier
Every score above Q3 + (1.5 × IQR) is an outlier
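A minimal sketch of the criterion in code. The function name `iqr_outliers` and the scores are assumptions; Q1 and Q3 are passed in by hand (here taken as the medians of the lower and upper halves of the data), because the way quartiles are computed varies between textbooks.

```python
def iqr_outliers(scores, q1, q3):
    """Return the scores flagged as outliers by the 1.5 x IQR criterion."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr   # anything below this is an outlier
    upper_fence = q3 + 1.5 * iqr   # anything above this is an outlier
    return [x for x in scores if x < lower_fence or x > upper_fence]

# Hypothetical scores; Q1 = 3.5 and Q3 = 11.0 for these data.
scores = [1, 3, 4, 6, 7, 8, 10, 12, 40]
outliers = iqr_outliers(scores, q1=3.5, q3=11.0)
print(outliers)
```

Here the IQR is 7.5, so the fences lie at 3.5 - 11.25 = -7.75 and 11.0 + 11.25 = 22.25, and only the score 40 falls outside them.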
Linear Transformation
Centring
Shift all the scores such that the mean becomes 0
→ subtract the mean from all scores → $C = X - \bar{x}$
The shape of the distribution is not affected at all
Z-scores
Z-scores indicate how many standard deviations a measurement has scored above or below the mean
$z = \frac{x_i - \bar{x}}{s_x}$
Centring
Shift all the scores such that the mean becomes 0
→ subtract the mean from all scores → $C = X - \bar{x}$
The shape of the distribution is not affected at all
The standard deviation does not change
Standardising
Make sure that the scale obtains a mean of 0, and a standard deviation of 1
The results are so-called z-scores
→ $z = \frac{x_i - \bar{x}}{s_x}$
Z-scores indicate how many standard deviations a measurement has scored above or below the mean
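Centring and standardising can be compared side by side in a short sketch. The scores are invented, and the standard deviation uses the N - 1 formula via `statistics.stdev`.

```python
import math
import statistics

# Hypothetical scores.
scores = [2, 4, 4, 6, 9]
mean = statistics.mean(scores)
sd = statistics.stdev(scores)      # sample standard deviation (N - 1)

# Centring: subtract the mean from every score.
centred = [x - mean for x in scores]

# Standardising: subtract the mean, then divide by the standard deviation.
z_scores = [(x - mean) / sd for x in scores]

# Centring moves the mean to 0 but leaves the standard deviation unchanged.
print(statistics.mean(centred), math.isclose(statistics.stdev(centred), sd))

# Standardising gives a mean of 0 and a standard deviation of 1.
print(round(statistics.mean(z_scores), 10), round(statistics.stdev(z_scores), 10))
```

A z-score of 1.5, for example, means the score lies one and a half standard deviations above the mean.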
Multiplying
We multiply all the scores by a certain number
→ e.g. for every litre sold you get $20, so you multiply your X by 20
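The litre example can be played through in code. The sales figures are made up, and the observation that the mean and the standard deviation are both multiplied by the same factor is a standard property added here; the card itself does not state it.

```python
import math
import statistics

# Hypothetical litres sold on three days.
litres = [3, 5, 10]

# Each litre sold brings in $20, so every score is multiplied by 20.
revenue = [20 * x for x in litres]

# Multiplying by a positive number scales the mean and the standard deviation.
print(statistics.mean(litres), statistics.mean(revenue))
print(math.isclose(statistics.stdev(revenue), 20 * statistics.stdev(litres)))
```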
Order of mean, median and mode in view of skewing
Right skewed: Mean > Median > Mode
Left skewed: Mean < Median < Mode
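The ordering can be checked on a small made-up data set. Treat it as a rule of thumb: it holds for the data below, but unusual distributions can break it.

```python
import statistics

# Hypothetical right-skewed scores: a long tail towards high values.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]

# Mirror the scores to get a left-skewed set with a long tail towards low values.
left_skewed = [20 - x for x in right_skewed]

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(name, statistics.mean(data), statistics.median(data), statistics.mode(data))
```

In the right-skewed set the few high scores pull the mean furthest towards the tail, the median follows, and the mode stays at the peak.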