Module 1: Introduction to Data Analysis Flashcards

1
Q

prepare your data for analysis

A

The basic process of statistical analysis is as follows:
-Design a study: Nature of hypothesis and data
-Choose an analysis
-Determine power
-Collect data
-Screen data
-Do analysis
-Check diagnostics (may need to go back to the second step to find a different analysis if the data doesn’t fit the model)
-Report

Before doing any statistical analysis, you should take the time to carefully understand the concept that you’re exploring. Consider the following:
-What is the research question?
-What are the concepts and what variables are associated with them?
-What types of variables are they?
-What is the sample size?

Note: The thing to remember is that you need to come up with a methodical way of working with your data. There is no one ‘right way’, but you do need to adopt a method that you consistently use, and document it. This allows you to easily reproduce your own work and easily find problems with your own work. It is also important to have a clean backup of the raw data file that you never touch, just in case.

  1. What is the research question?
    Dependence
    If there is a dependent variable, then the question is a dependence research question.

Consider the following research question: Statistics anxiety (DV) is a result of a weak maths self-concept (IV), little previous experience of maths (IV), gender (IV) and age (IV).

When trying to predict or explain something such as statistics anxiety from one or more independent variables—such as a weak maths self-concept (IV), little previous experience of maths (IV), gender (IV) and age (IV)—the research question is a dependent question.
Interdependence
Where there are no dependent variables in the research question, then the question is an interdependent question—that is, it’s about things that are related to each other. There’s no prediction or expectation of a direction. For example, we might expect depression to be related to a person’s level of wellbeing—one thing is not causing the other, they are just related.
In factor analysis, it might be expected that there is one anger factor, implying a ten-item questionnaire to measure anger in its different expressions. As can be seen in the following figure, all ten items are correlated with each other; that is, they all overlap. This would be an interdependence question because there are no dependent variables.
In the example of ‘statistics anxiety’, anxiety is the dependent variable, and you need to know how it was measured. It is important to know what the numbers mean because this will influence how you interpret the results. The variables in this study were measured in the following ways:

Statistical anxiety was measured by the statistical anxiety rating scale, which is one total score ranging from 1–100, where 100 = high anxiety.
Maths self-concept may be measured on a maths self-concept scale, which is one score from 1–70, where 70 indicates a strong maths self-concept, as seen in someone who sees themselves as a ‘maths person’.
There may be a measure of the number of maths or statistics courses a person has completed which is a count variable (how many statistics courses, how many maths courses) that provides a numeric result.
Age is measured in years and gender is self-nominated.
The variables are described as either quantitative data or qualitative data and how the data is represented can be further categorised as nominal, ordinal or interval data.

  1. Statistical anxiety is quantitative (1–100)
  2. Maths self-concept is quantitative (1–70)
  3. Number of courses and age are quantitative (0–∞)
  4. Gender is qualitative (man, woman, non-binary)

Nominal
Nominal data is where you have groups or categories that are mutually exclusive—that is, you can’t be more than one. This data is descriptive and is qualitative data. For example, groups might be:

Old, Young
Man, Woman
Labor, Liberal, Green

go to card 14

2
Q

use SPSS to review your data.

A

2

3
Q

(Tabachnick & Fidell, 2018, pp. 53–79)

A

Data screening processes vary depending on whether data is grouped or ungrouped, that is, on how the data has been collected and how it is being analysed.
Proofreading is a good way to check data, but for large sets, screening for accuracy involves using descriptive statistics and graphical representations of the variables.
For continuous variables, are all data within range? Are means and SDs plausible? For discrete variables (e.g. religion), are any answers out of range? Have codes for missing values been accurately set up?
INFLATED CORRELATION; Composite variables are constructed from multiple parts. If these parts actually contain some of the same items, it is easy to erroneously create greater levels of correlation than actually exist.
DEFLATED CORRELATION; A falsely small correlation between two variables is erroneously created when the sample has restricted the range of one or both variables.
If a correlation is too small to be assessed due to restricted sampling, it may be possible to calculate an adjusted correlation using knowledge of the population distribution, but this runs the risk of creating internal inconsistencies.
MISSING DATA creates problems for analyses. It occurs when, for example, rats die, equipment malfunctions, respondents don’t answer, or some other error occurs. How big a problem missing data creates depends on both how much is missing and why it is missing.
The pattern of missing data is actually more important than how much is missing. Missing values scattered randomly throughout a data matrix are less serious than even a small quantity of non-randomly missing values, as the latter affect the generalisability of results. For example, in a questionnaire with both attitudinal and demographic questions, some respondents refuse to answer questions regarding income. It is likely that this refusal is related to attitude, and if the cases with missing data are deleted, the sample values of the attitudinal variable will be distorted.
MCAR = Missing Completely At Random. This means the missing pattern cannot be predicted, and it is the most ideal situation if one has to have missing data.
MAR = Missing At Random. This is a misnomer because there is some ability to predict it. Sometimes called Ignorable Nonresponse.
MNAR = Missing Not At Random. Here, the missingness is related to the variable itself and hence cannot be ignored; it may have a profound effect on interpretation.
It is best to test whether data is missing in a meaningful way by checking for patterns and then for significant differences.
DELETING CASES OR VARIABLES; Sometimes missing data may be managed by simply deleting cases with missing data (this can be the default setting in SPSS), and this is acceptable if the missingness is truly random, the amount is small, and the sample size is big enough to cope. If the missing data falls on only a few non-essential variables, entire variables may instead be dropped. It is not good to delete data when missingness is not random, when the sample size disallows it, or when there has been considerable effort in collecting the data in the first place.
ESTIMATING MISSING DATA; Sometimes missing data is replaced with estimates. Sometimes these are quite good guesses; sometimes values are simply replaced with the mean (a method less popular now with access to highly sophisticated software packages); sometimes the previous data point is used. Sometimes the estimate is softened to, for example, “within the bottom half of the range”. This obviously has an effect upon the validity and interpretation of the data and its relationships.
REGRESSION; A more sophisticated method sometimes used to estimate values for missing data. Other variables are used as IVs to write a regression equation with the variable containing the missing data as the DV. Cases with complete data are used to generate the equation, which is then used to predict the missing values. Sometimes, after initial predictions, subsequent rounds of prediction are performed and the estimates gradually converge.
An advantage of using a regression estimate is that it is not as blind as a guess, nor the same as inserting the mean. A disadvantage is that the scores have decreased variation and “fit” together better than they should. Another disadvantage is the requirement for a fairly good-fitting IV to create the equation: if none of the other IVs is actually a good predictor of the variable with the missing data, the whole process is no more useful than simply replacing the missing data with the mean.
If a regression estimate falls outside the range of the pre-existing complete data (data without missing components), it is considered unacceptable to use it.
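A minimal sketch of this regression approach, assuming one complete IV (maths self-concept) predicting the incomplete DV (statistics anxiety); all data values here are invented for illustration:

```python
# Hedged sketch: estimating a missing DV value by regressing it on a
# complete IV, using complete cases only. Data are illustrative.

def fit_line(xs, ys):
    """Least-squares slope and intercept from complete cases."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Complete cases: (maths_self_concept, stats_anxiety)
complete = [(20, 80), (35, 65), (50, 50), (65, 35)]
xs = [c[0] for c in complete]
ys = [c[1] for c in complete]
slope, intercept = fit_line(xs, ys)

# A case with a missing anxiety score but a known self-concept of 40
estimate = intercept + slope * 40
# As noted above: reject the estimate if it falls outside the observed
# range of the complete data.
usable = min(ys) <= estimate <= max(ys)
```

The range check at the end mirrors the rule in the text that an out-of-range regression estimate should not be used.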
EXPECTATION MAXIMISATION (EM); A method used with randomly missing data. It forms a missing-data correlation (or covariance) matrix by assuming a specific distribution (e.g. normal). Step 1 (E) estimates the missing values from their expectations, which are then filled in. Step 2 (M) then performs maximum likelihood estimation. After convergence, the EM variance-covariance matrix is formed and used for the missing values.
MULTIPLE IMPUTATION; Used, for example, where the missing data is on a dichotomous (only two options) DV. Decide which other variables to use as predictors, and a logistic regression produces an equation for estimating the missing values. Multiple random samples with the replaced missing data are then used to determine the distribution of the variable.
It is often good for longitudinal studies (e.g. within-subject IVs or time-series analysis) and makes no assumption about whether data is randomly missing.
It is the method of choice when the data is being analysed by a different agency than the one that collected it.
USING A MISSING DATA CORRELATION MATRIX; Another option for randomly missing data. All available pairs of values are used to calculate each of the possible correlations; each missing value reduces the possible pairings. Depending on the missing variables, each correlation is then based on a different number of cases and subsets, and the size of some correlations will place constraints on the sizes of others. An eigenvalue represents the amount of variance explained by a factor in factor analysis; eigenvalues (variances) may be inflated with this method, and one should be particularly cautious about interpreting a negative eigenvalue.
TREAT MISSING DATA AS DATA; It is possible that missingness itself may be a good predictor of a variable of interest. Create a dummy variable with complete cases assigned “0” and incomplete cases assigned “1”. The liability of the missing data then becomes an asset: the mean is inserted for missing values, and the dummy variable is used as another variable in the analysis.
REPEATING ANALYSES BOTH WITH AND WITHOUT MISSING DATA; Use some method of determining estimates of the missing data. Run analyses both on the full data set including the estimated values, and on complete cases only. This is of particular importance when n is small, there is a significant amount of missing data, or the data is missing in a non-random pattern. If the results are very similar, you can have confidence in them; if dissimilar, you need to investigate why.
CHOOSING A METHOD TO DEAL WITH MISSING DATA:
1. observe pattern and try to determine if random or not.
2. It is reasonable to delete cases if the missing data is random, the missing amounts are very few, and the missing data is not all on one variable.
3. Deletion of one entire variable is reasonable if that variable has a lot of missing data and is not critical to the analysis. If the variable is important, use a dummy variable coding for missing values with mean substitution.
4. Avoid mean substitution unless a very small amount is missing, there is no other option, and you are very experienced regarding the expected research result.
5. EM is often the best and simplest method, provided preliminary analysis shows the data to be MCAR or MAR. Ensure the program provides adequate standard errors for the covariance matrix. It is best for methods that do not rely upon inferential statistics (such as exploratory factor analysis), and best if there is not too much missing data.
6. Multiple imputation is considered the most respectable method but is also the hardest to implement.
7. A missing data correlation matrix is good if the software allows it, the amount of missing data is small, and the data set is large.
8. Repeating the analyses both with and without missing data is highly recommended, whichever method has been selected.
OUTLIERS; An outlier is a case with an extreme value on one variable (univariate outlier), or a strange combination of scores on several variables (multivariate outlier). When considering scatterplots and lines of best fit, outliers have far more impact upon the creation of the best-fit line than the cases in the main scatter.
Outliers occur in univariate and multivariate forms, on dichotomous and continuous variables, and on both IVs and DVs. They can lead to both Type 1 and Type 2 errors, and it may be difficult to determine which. They also lead to results which can only be generalised to a similar population with the same outliers.
Reasons for their occurrence may be:
a) Incorrect data entry, i.e. check it.
b) Failure to specify missing value codes in computer syntax, such that codes are read as data, i.e. check it.
c) The outlier does not come from the intended sample population, i.e. it can be safely deleted.
d) The population actually has a more extreme distribution of the variable than a normal distribution, i.e. keep it, but consider altering the value of the variable such that the outlier has less of an impact.
For example, a 15-year-old may be within normal range, and a $45,000 income may be within normal range, but both in the same case is unusual and likely to be a multivariate outlier.
Among dichotomous variables, cases on the “wrong” side of a very uneven split, are likely to be univariate outliers.
In continuous variables, how to search for outliers depends on whether cases are grouped. If analyses are ungrouped (e.g. regression, canonical correlation, factor analysis, structural equation modelling, or some forms of time-series analysis), search for outliers all at once. If analyses are grouped (e.g. ANCOVA, MANOVA, MANCOVA, profile analysis, discriminant analysis, logistic regression, survival analysis, or multilevel modelling), the search for outliers is done at the level of each group.
Univariate outliers on continuous variables can be found by checking z scores, histograms, normal probability plots, etc.
Once potential univariate outliers are identified, the researcher decides whether or not to transform them. TRANSFORMATION = altering data such that outliers are pulled closer to the centre.
If acceptable, transformations are done prior to searching for multivariate outliers.
MAHALANOBIS DISTANCE = the distance of a case from the centroid of the remaining cases. (The centroid is formed at the intersection of the means of all the variables and can be thought of as the point around which the data is clustered in multiple dimensions.) Mahalanobis distances can be calculated to attempt to identify multivariate outliers, but beware that under some conditions the calculation will mask a real outlier, or falsely produce one.
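A hedged sketch of the Mahalanobis distance for the two-variable case, with the 2×2 covariance matrix inverted by hand; the data points are invented for illustration:

```python
# Hedged sketch: squared Mahalanobis distance of one case from the
# centroid of the remaining cases, for two variables. Data illustrative.

def mahalanobis_sq_2d(case, others):
    n = len(others)
    mx = sum(p[0] for p in others) / n  # centroid x
    my = sum(p[1] for p in others) / n  # centroid y
    # Sample covariance matrix of the remaining cases
    sxx = sum((p[0] - mx) ** 2 for p in others) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in others) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in others) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant for the 2x2 inverse
    dx, dy = case[0] - mx, case[1] - my
    # d^2 = diff' * inverse(cov) * diff, expanded for the 2x2 case
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

data = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (10.0, 3.0)]
# Distance of the last case (an apparent multivariate outlier) from the rest
d2 = mahalanobis_sq_2d(data[-1], data[:-1])
```

A large d² relative to a chi-square critical value (df = number of variables) is the usual flag for a multivariate outlier.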
Other ways of trying to find multivariate outliers include Leverage, Discrepancy, and Influence.
Sometimes, as you find and remove some multivariate outliers, the data becomes more consistent, which then allows you to find further multivariate outliers that were not initially apparent.
Do a trial run with and without the latest round of outliers deleted to see if they make a difference to the results (if not, there is no need to delete or modify them). This step involves identifying outliers; it is not yet the step for deciding whether to delete them.
If possible, you need to determine why outliers are extreme. It is important to know upon which variables there are deviant cases, so that:
a) you can decide if they are properly part of your sample
b) if you are going to modify or delete scores, you know which ones
c) it provides an indication of the kinds of cases to which your results are not generalisable.
When trying to describe one or several outliers, create a dummy grouping variable with outlier(s) assigned one value and non-outliers another. Use the dummy variable as the DV in discriminant analysis, logistic regression, or regression. The goal is to identify variables that distinguish outliers from other cases. Once those variables are identified, their means for outliers and non-outliers are examined.
REDUCING THE INFLUENCE OF OUTLIERS
1. Having identified univariate outliers, check for data accuracy. If accurate, consider whether one variable may be responsible for all the outliers; if so, deleting that variable will significantly reduce the outliers. The variable can be deleted if it is highly correlated with the other variables, or if it is not critical to the analysis.
2. If (1) does not apply, decide whether the outliers are properly part of the population (delete if not).
3. Reduce impact of univariate outliers via transformation.
4. Another option is to change outlying scores (e.g. by plus or minus one unit, so they are closer to the mean). This is often fine if the actual measurement of scores was somewhat arbitrary to begin with.
5. Transformation or changing scores may not truly work for some multivariate outliers, as the issue is not so much with each individual variable score as with the combination of scores across multiple variables. Sometimes, if left with just a few of these problem outliers, they are deleted.
6. ANY transformations, changes or deletions MUST be reported in the results section, including rationale.
MULTIVARIATE NORMALITY is the assumption that each variable, and all linear combinations of the variables, are normally distributed. When this assumption is met, the residuals (the leftovers not accounted for by the multivariate analysis, i.e. the “error” between obtained and predicted scores) are also normally distributed and independent. It is, however, not easy to test this assumption, which is made as part of significance testing. For ungrouped analyses, the assumption applies to the distributions of the variables or to the residuals. For grouped analyses, it applies to the sampling distributions of the means of the variables.
HOMOSCEDASTICITY = where the variance of the error terms (the differences between actual and predicted values) is constant across all levels of the independent variable in a regression model. An example would be predicting house prices from square footage: if homoscedasticity holds, the variability in the error (the difference between actual and predicted price) will be approximately the same for all house sizes, neither increasing nor decreasing with square footage.
NORMALITY= NORMAL DISTRIBUTION
It is useful early on to determine if each variable (for multivariate analysis) is normally distributed. Whilst not essential, it makes the solutions of analyses a fair bit easier. Assess the normality of variables either statistically or graphically.
There are 2 components to normality;
1. Skewness, to do with symmetry: if skewed, the mean is not in the centre of the distribution.
2. Kurtosis, to do with how peaked or flat a curve is.
When there is a normal distribution, the level of skew = 0 and the level of kurtosis = 0.
Nonnormal kurtosis can result in an underestimation of the variance of a variable.
There are significance tests for both skew and kurtosis, which test the obtained value against the null hypothesis. By convention, alpha levels of either 0.01 or 0.001 are usually used for smaller samples; for larger samples, it is recommended to look at the shape of the distribution instead.
Standard errors for both skewness and kurtosis decrease as N increases; thus the null hypothesis is likely to be rejected in larger samples even when there are only mild deviations from normality.
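The two normality components can be sketched with the simple standardised-moment formulas. Note this is a hedged illustration: SPSS reports bias-corrected versions, so its output will differ slightly, and the sample here is invented.

```python
# Hedged sketch: sample skewness and excess kurtosis from standardised
# moments. Both equal 0 for a perfectly normal distribution.
import math

def skew_kurtosis(xs):
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    skew = sum(((x - m) / sd) ** 3 for x in xs) / n
    # Subtract 3 so a normal distribution scores 0 (excess kurtosis)
    kurt = sum(((x - m) / sd) ** 4 for x in xs) / n - 3
    return skew, kurt

# A perfectly symmetric sample: skewness should come out as 0,
# and this flat-ish shape gives negative (platykurtic) kurtosis.
s, k = skew_kurtosis([1, 2, 3, 4, 5])
```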
With ungrouped data, if nonnormality is found, transformation of variables may be considered. Bear in mind that even if each variable is normally distributed (or transformed to be so), there is no guarantee that all linear combinations of the variables are normally distributed. But the assumption of multivariate normality is more likely to be correct if all the variables are normally distributed.
With grouped data, normality is assessed per each group. Transformation is considered if there are few error degrees of freedom and there is nonnormality. The same transformation is applied to all groups even though the transformation may not be optimal for all.
LINEARITY is the assumption that there is a straight-line relationship between two variables (where one or both may be combinations of multiple variables). Pearson’s r only captures linear relationships amongst variables; any substantial nonlinear relationships would be missed by Pearson’s r.
Nonlinearity is diagnosed either from residuals plots in analyses involving a predicted variable, or from bivariate scatterplots between pairs of variables. In plots where standardised residuals are plotted against predicted values, nonlinearity is indicated when most of the residuals are above the zero line on the plot at some predicted values, and below the zero line at other predicted values.
Linearity between variables can be assessed roughly, by inspecting bivariate scatterplots. If both variables are normally distributed and linearly related, the scatterplot will be oval-shaped.
Obviously sometimes variables are just not linearly related.
Often variables have both a linear and a curvilinear relationship, e.g. symptoms drop off with increasing dose (linear) but reach a point beyond which no further drop in symptoms occurs with increased dose.
With only a handful of variables, it is easy to assess each possible paired relationship. With many variables, however, use statistics on skewness to screen for only the pairs likely to deviate from linear relationships, and also check pairs of variables likely to have nonlinear relationships with scatterplots.
HOMOSCEDASTICITY
For ungrouped data, it is the assumption that the variability in scores on one continuous variable is roughly the same at all values of another continuous variable.
For grouped data, it is the assumption that the variability in the DV is approximately the same at all levels of the grouping variable.
If not homoscedastic, the data is heteroscedastic.
When the assumption of multivariate normality is met, the relationships between variables are homoscedastic.
Depending on the situation, heteroscedasticity may not be fatal for an analysis, but the argument will be weakened.
When data is grouped, homoscedasticity is known as homogeneity of variance.
Transformation will often convert heteroscedastic data to homoscedastic, but then results can only be interpreted in terms of the transformed data.

4
Q

skewness/kurtosis

A

4

5
Q

homoscedasticity

A

homoscedasticity or heteroscedasticity

6
Q

common data transformations

A

Although data transformations are often recommended to overcome problems of outliers, nonnormality, nonlinearity, and heteroscedasticity, it should be remembered that an analysis is done on variables, and transformed variables may be difficult to interpret in the real world.
If you do transform, remember to check the variables for normality afterwards.
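A sketch of common transformations (square root, log, inverse) for pulling in a positive skew; the shift constant and data are illustrative, and which transformation to use depends on the severity of the skew:

```python
# Hedged sketch: common skew-reducing transformations. A constant is
# added first if the variable contains zeros or negatives, since log
# and inverse need strictly positive values.
import math

def transform(xs, how="log"):
    shift = 1 - min(xs) if min(xs) <= 0 else 0  # make all values positive
    if how == "sqrt":      # mild positive skew
        return [math.sqrt(x + shift) for x in xs]
    if how == "log":       # substantial positive skew
        return [math.log10(x + shift) for x in xs]
    if how == "inverse":   # severe skew; note this reverses score order
        return [1 / (x + shift) for x in xs]
    raise ValueError(how)

skewed = [1, 2, 3, 50, 400]
logged = transform(skewed, "log")  # long right tail pulled in
```

After any such transformation, re-check normality as the card advises, since interpretation now applies to the transformed variable.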

7
Q

Multicollinearity

A

Multicollinearity occurs in a correlation matrix when variables are too highly correlated (this is a problem, e.g. at correlations > 0.90).
Singularity is a problem in a correlation matrix when there is a redundant variable (e.g. one that could equally be produced by a combination of the other variables).
For example, scores on the Wechsler Adult Intelligence Scale (WAIS) and scores on the Stanford-Binet Intelligence Scale are likely to have a problem of multicollinearity because they are different measures of the same thing. A total WAIS score, however, is singular with its subscales because it is determined by combining the subscale scores.
Either bivariate or multivariate correlations can show multicollinearity or singularity. If a bivariate correlation (bivariate correlations consider how two variables covary in linear fashion, although in the big picture there may be more variables involved) comes out at > 0.90, there is a collinearity issue, which can usually be solved by removing some redundant variables. When the situation occurs in a multivariate correlation, however, it is much more difficult to determine which variables are the redundant ones.
Redundant variables will weaken your analysis and inflate the error size. Think carefully before including two variables with a correlation > 0.70 in the same analysis.
Multicollinearity or singularity causes issues when trying to invert the correlation matrix, as the maths does not work.

8
Q

Review of statistical power (2023) courtesy of Dr Benedict Williams

A

Statistical power is the ability of a test to find an effect when it actually exists.

Power can be affected by:

sample size, where power increases with number (n)
effect size, which is a measure of the strength of the effect. For example, Cohen’s d (the standardised difference between two means) has the following rules of thumb: 0.20 = small, 0.50 = medium, 0.80 = large (Cohen, 1992). Each effect size has its own ranges and cut-off values that dictate what is small, medium, and large; whenever you report an effect size, you should consult the cut-off figures specific to that measure.
alpha (p value): as alpha increases, power increases. For example, p < .05 has more power than p < .001, as there is more chance of detecting an effect or relationship.
Power can range from 0 (no power) to 1 (complete power). Cohen (1992) recommends 0.80 as the optimal level.
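Cohen’s d, the standardised difference between two means mentioned above, can be sketched with a pooled standard deviation; the group scores here are invented:

```python
# Hedged sketch: Cohen's d as (mean difference) / (pooled SD).
# Data are illustrative, not from the text.
import math

def cohens_d(a, b):
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

control = [50, 55, 60, 65, 70]
treatment = [58, 63, 68, 73, 78]
d = cohens_d(treatment, control)  # about 1.0: "large" by Cohen's rules
```

An estimate like this (often taken from past literature, as the card notes) is what goes into a power analysis to determine the required sample size.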

Power analysis is about ensuring that your statistical procedure has adequate power. In order to do this, you need to know the following:

What statistical test you will be using
The sample size
Your chosen alpha level
Your estimate of the effect size

Estimating the effect size
Hypothesis 1 is that statistics anxiety (DV) is a result of a weak maths self-concept (IV), little previous
experience of maths (IV), and age (IV) (dependence).

The bigger the sample size, the smaller the type 1 and type 2 errors get.
Power is how often, if the effect is real, we manage to detect it.
Past literature helps to determine effect size for your study.
Effect size helps to determine how big a sample you need.

9
Q

Tools for statistical power (2023) courtesy of Dr Benedict Williams

A

THE NULL HYPOTHESIS SIGNIFICANCE TEST
Reject the null hypothesis if the statistics reach a certain significance level.
i.e. you are often trying to support the hypothesis that there is a significant effect, e.g. of a treatment (the null hypothesis being that there is no effect).
A Type 1 error is when the treatment does not work, but the data seems to support that it does (e.g. it could just be chance that participants got better).
A Type 2 error is where the treatment does work, but the test has not been able to find the effect.
You correctly reject the null hypothesis when the treatment works and your data shows it.
A true negative is where the null hypothesis is true (i.e. no difference between treatment and control) and the data is in line with this.

10
Q

Why screen data? (2023) courtesy of Dr Benedict Williams

A

The following steps need to be undertaken before you can begin your analysis:
-Screen for out-of-range and miscoded data (data entry errors)
-Identify and deal with outliers (errors or unusual cases)
-Identify and deal with statistical assumption violations
-Identify and deal with missing data

One way to identify univariate outliers is to calculate a ‘Z’ score. ‘Z’ scores have cut-off points that correspond to ‘p’ values and you can define someone as being a statistical outlier if their ‘z’ score is outside p < .001 (Z=+/-3.29) or p < .05 (Z=+/-1.96).
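The z-score cut-offs above can be sketched as follows; the score list is invented. Note that an extreme value inflates the SD it is judged against, so an apparent outlier may still fall short of the stricter cut-off:

```python
# Hedged sketch: flagging univariate outliers by z score, using the
# cut-offs in the text (|z| > 3.29 for p < .001, |z| > 1.96 for p < .05).
import math

def z_scores(xs):
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))  # sample SD
    return [(x - m) / sd for x in xs]

scores = [48, 50, 52, 49, 51, 50, 47, 53, 50, 95]  # 95 looks extreme
zs = z_scores(scores)
outliers_001 = [x for x, z in zip(scores, zs) if abs(z) > 3.29]
outliers_05 = [x for x, z in zip(scores, zs) if abs(z) > 1.96]
```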

DECISIONS relating to univariate outliers;
Are they real or mistakes? For clues, look at responses to other variables and make careful decisions about whether to include or exclude the data. A general rule is that if you can’t decide, or the answer isn’t clear, undertake your analysis with the data in and then repeat the analysis with the data out, to see if your results are biased. Decisions on what to do depend on the following:

Patterns of answers to other variables
Expectations that arise from your knowledge of the area (past research and theory)
Sample size
Statistical technique you intend to use

DEALING with missing data
There is often a problem with missing values and there are many ways researchers deal with missing values. When screening data you need to identify missing values and when you detect missing values, determine if it’s a problem, or not.
System missing data is where you find a blank, or perhaps a dot, in the cell where someone has not provided a response.
Discrete missing data is where you give SPSS a defined value to help determine why it is missing.
When defining missing values use a value that is not in the range of the variable such as 999. For example, from the previous ‘attitude’ exercise, the response required was to be between 1 and 9. If you know different reasons for missing values, use different discrete values, as per the following:

999 where the respondent missed the response entirely
998 if they answered unsure or don’t know.
This helps us to understand our missing values, where missing data can have lots of meanings and we need to understand why we have missing data to know how best to deal with it. The reasons for missing values are important because they can impact the quality of results.
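The coding scheme above can be sketched as a recoding step before analysis; the response list and label names here are illustrative:

```python
# Hedged sketch: separating discrete missing-value codes from valid
# responses. Codes follow the text: 999 = skipped entirely,
# 998 = unsure / don't know; valid attitude responses run 1-9.
MISSING_CODES = {999: "skipped", 998: "dont_know"}

responses = [3, 7, 999, 5, 998, 9]

valid = [r for r in responses if r not in MISSING_CODES]
reasons = [MISSING_CODES[r] for r in responses if r in MISSING_CODES]
out_of_range = [r for r in valid if not 1 <= r <= 9]  # data-entry errors
```

Keeping the reason labels, rather than just blanks, is what lets you later check whether the missingness is random or systematic.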

Random reasons are those reasons which are different for each respondent. For example:

accidentally missed a question
dropped out of the study
ran out of time
didn’t know the answer.

Systematic reasons occur when more than one person misses responding but for the same reason. Those reasons affect some or all respondents systematically. Reason/s could include:

Unable to answer the question
Question is not appropriate
Problem/s with the question
Equipment failure

11
Q

Week 1: Data screening and missing variables (2020) courtesy of Professor Christine Critchley

A

11

12
Q

concepts

A

The three key broad concepts you need to understand to be successful are:

probability
averages
variability.

13
Q

preparing data 2

A

Ordinal data also has categories, but the categories are ordered on a scale. For example, an ordinal variable might be statistics anxiety which may have four categories from 1 (not at all anxious) to 4 (extremely anxious). As shown in the following image, they can be ordered along a continuum from low anxiety to high anxiety. What makes them ordinal is that the differences between the categories are unequal, which makes this data qualitative.

Anxiety (2020) created by Swinburne Online
Interval
Interval data is also measured along a continuum, but categories are ordered on a scale that is measurable, and the differences between the categories are equal. For example, the variables of height, shown in the following image, are equal and the difference between 1.0 m and 1.5 m is the same as the difference between 2.0 m and 2.5 m. This data is quantitative.

14
Q

sample size

A

Sample size affects the construct of statistical power.

When working with data, you want enough statistical power to find an effect, but not so much power that you find an effect that doesn’t actually exist. Getting this wrong can lead to errors categorised as:
Type I error: Detecting a relationship when one does not exist (risk rises with larger n).
Type II error: Not detecting a significant relationship when one does exist (risk rises with smaller n).

15
Q

popular methods for replacing missing values

A
16
Q

Video why screen data

A

Study design must consider how to measure phenomena and what statistics can apply to such measures, before any data collection.
You should spend far more time in preparation than in doing any analysis.
Rarely, it might turn out that the intended analysis will not work due to x, y, z, and the analysis needs to change. You still need to indicate the original intention and why it changed.
Data must be screened for
-errors
-unusual/unexpected data
-appropriateness for specific analysis
before any statistics are computed.

17
Q

out of range values video

A

Doing frequency tables helps to spot out of range values.
It is recommended to get into the habit of coding for missing data. If you code for missing data, then when you come across it, you will know you have looked at it because it is coded; whereas if it is just a dot, you are not sure if it is a missing value or if you forgot to type it in.
Using negative codes or very large values for missing data is recommended because if you forget to tell the software, it is hopefully easier to work backwards and find why you are getting weird results.
For quantitative data, start with descriptives and check e.g. the minimum and maximum.
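The two checks in this card (a frequency table for categorical codes, min/max descriptives for quantitative values) can be sketched as follows; the variable names, codes, and ranges are illustrative:

```python
# Hedged sketch: spotting out-of-range entries. A frequency table
# exposes stray categorical codes; min/max exposes impossible values
# on a quantitative variable.
from collections import Counter

gender = [1, 2, 1, 3, 2, 1, 33, 2]   # valid codes 1-3; 33 is a likely typo
attitude = [4, 7, 2, 95, 6]          # valid range 1-9

freq = Counter(gender)               # the frequency table
bad_codes = sorted(v for v in freq if v not in (1, 2, 3))
range_ok = 1 <= min(attitude) and max(attitude) <= 9
```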

18
Q

dealing with outliers video

A

To find outliers in qualitative variables, look for values with very low frequency. Usually, with fewer than 5 cases a category cannot be analysed meaningfully, and with less than 5% the calculations will be biased or junk. Therefore, either record that the cases were present but not used in the analysis, or if appropriate combine groups (e.g. French, Turkish becomes “other”), or delete if appropriate.
When running the analysis, do it in various ways to see which strategy for dealing with the outliers was best.
Outliers for quantitative variables are found by using histograms or boxplots.
A process not mentioned in many texts is “winsorizing”, where outliers are modified such that they are less extreme and have less impact on the analysis, but are still retained in the data set.
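Winsorizing can be sketched as clamping scores to chosen percentile cut-offs; the 5th/95th percentile limits and the data are illustrative, and this simple index-based percentile is cruder than what statistical packages use:

```python
# Hedged sketch: winsorizing by clamping to percentile cut-offs, so
# extreme cases are pulled in but kept in the data set.
def winsorize(xs, lo_pct=0.05, hi_pct=0.95):
    s = sorted(xs)
    lo = s[int(lo_pct * (len(s) - 1))]   # lower clamp value
    hi = s[int(hi_pct * (len(s) - 1))]   # upper clamp value
    return [min(max(x, lo), hi) for x in xs]

scores = [12, 15, 14, 13, 16, 14, 15, 13, 14, 99]  # 99 is extreme
clean = winsorize(scores)  # 99 is pulled down; the case is retained
```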