PA DX INTENSIVE (On-going) (4/29/24) Flashcards
- What is the best statistical tool to use in establishing the split-half reliability of a test with a limited number of items?
a. Spearman-Brown formula
b. Cronbach’s Coefficient Alpha
c. Kuder-Richardson 20
d. Pearson r
a. Spearman-Brown formula
Reliability - Consistency, accuracy, dependability of test results
Test-retest reliability
Administering a test at 2 different times
time sampling (consistency of test over time)
Pearson r
Parallel Forms reliability
Compares 2 equivalent forms of a test that measure the same attributes
Item sampling (different items, but same difficulty, number of items, and content)
Alternate forms/equivalent forms
Internal consistency (measures only one construct)
Split-half reliability
Divide into halves then score separately (odd-even or random)
Spearman-Brown formula (the more items, the higher reliability); good if you have limited # of items
Kuder-Richardson 20
Used for dichotomous items; only one correct answer
Test has varying degrees of item difficulty
Kuder-Richardson 21
Used for dichotomous items; only one correct answer
Test has same level of item difficulty (usually for speed test)
Cronbach’s Coefficient Alpha
Used for polychotomous (several possible answers); no one correct answer
Used for Likert scales
Dichotomous items: Only 1 correct answer; Polychotomous: No one correct answer; does NOT refer to number of choices
Interrater Reliability
Consistency of judges/raters evaluating the same behavior
Observer differences
Kappa statistics
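A minimal sketch of how some of these coefficients could be computed (numpy assumed; the 0/1 item matrix and all values are invented for illustration, not taken from any actual test):
```python
import numpy as np

# Hypothetical 0/1 (dichotomous) responses: 6 examinees x 6 items
X = np.array([
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
])

# Split-half: correlate odd-item totals with even-item totals (Pearson r)
odd, even = X[:, 0::2].sum(axis=1), X[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]

# Spearman-Brown correction: estimate reliability of the full-length test
r_full = (2 * r_half) / (1 + r_half)

# Cronbach's alpha (reduces to KR-20 when items are scored 0/1)
k = X.shape[1]
alpha = (k / (k - 1)) * (1 - X.var(axis=0, ddof=1).sum() / X.sum(axis=1).var(ddof=1))

print(f"split-half r = {r_half:.2f}, Spearman-Brown = {r_full:.2f}, alpha = {alpha:.2f}")
```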
A study reveals that the higher a person’s food intake is in the evening, the lower his sleep quality gets. This kind of result shows
a. Negative relationship
b. Positive relationship
c. Significant difference
d. No significant relationship
a. Negative relationship
The Children’s Personality Questionnaire-R is an example of what kind of test?
a. Unstructured test
b. Projective test
c. Structured test
d. Intelligence test
c. Structured test
Ability Test
- Achievement test - previous learning (past)
- Aptitude test - potential for learning or acquiring new skill (future)
- Intelligence test - general potential (present)
Personality test - overt and covert dispositions
- Structured - usually self-report; evaluate yourself
- Projective - either stimulus or response is ambiguous; unstructured
The test taker neglected to have breakfast before the test. As he proceeds with the examination, his stomach starts to growl. What type of challenge to internal validity does this situation pose?
a. Testing
b. Instrumentation
c. History
d. Selection
c. History
Threats to internal validity
> History - occurrence of outside events that affect performance (e.g., something that happened before or during the test)
> Maturation - internal/physical changes (longitudinal studies)
> Testing - effects of pretest to the post test (practice effect)
> Instrumentation - inconsistent use of measurement instrument (mistake in test material or administration)
> Statistical regression - tendency of extreme scores to regress toward mean score
> Selection - no random assignment
- Quasi experiment: no random assignment or no control group
> Subject mortality - loss of subjects
> Selection interaction - family of threats (multiple threats)
Which of the following elements must be present before an experiment can be called as a true experiment?
a. Random assignment
b. Control group
c. Any of the above
d. All of the above
d. All of the above
A true experiment requires both random assignment and a control group; if either one is missing, it is a quasi-experiment
The Draw-A-Person test was developed by:
a. Lewis Terman
b. John Buck
c. R.B. Cattell
d. Florence Goodenough
d. Florence Goodenough
> Terman - Revised the Binet-Simon scale into the Stanford-Binet
> Buck - Developed the House-Tree-Person test
> R.B. Cattell - Developed the CFIT and 16 PF, conceptualized fluid and crystallized intelligence, popularized factor analysis
> James Cattell - Coined the term “mental test” and launched the beginning of mental testing
> Goodenough - Developed the Draw-A-Person test
When the distribution of scores includes outliers, it is better not to use
a. Mean
b. Median
c. Mode
d. All of the above
a. Mean
- Never get the mean if there are outliers
- Get Median if there are outliers; least affected by extreme scores
This is the test used to determine the current developmental level of infants.
a. Apgar test
b. Kaufman Assessment Battery
c. Woodcock-Johnson III
d. Bayley Scale
d. Bayley Scale
– Apgar - given to newborns to check for abnormalities; administered twice (1 minute after birth, then 5 minutes after birth); given a third time at 10 minutes if the results are still poor
– Kaufman Assessment Battery - Intelligence test for young children
– Woodcock-Johnson III - Intelligence test; For detecting learning disabilities
– Bayley Scale - for current developmental level (more on motor skills)
In order to determine the concurrent validity of a test, the statistical tool to be employed is
a. Spearman-Brown formula
b. Kuder-Richardson 20
c. Point-biserial correlation
d. Pearson r
d. Pearson r
Validity - meaning and usefulness of results; if test is appropriate
Criterion Validity
- How well it corresponds to a particular criterion
- Types:
> Criterion test - a well-established test that is already known to be valid
> Criterion data - any data you can use that’s related to your test/as a basis (eg. performance appraisal, diagnosis, records)
- Predictive validity - forecasting function; the test predicts a criterion measured later
- Concurrent validity - relationship between test and criterion measured at about the same time; no significant time has passed
Content validity
- Adequacy of representation of conceptual domain the test is designed to cover
- Experts judge validity of test items, use critical/logical thinking skills
Construct validity
- Degree to which a test measures what it purports to measure
- Used if test measures abstract variables (exists but hard to measure)
- Based on theoretical perspective
- All encompassing; if you establish construct validity, you establish other validities
- Convergent Validity - correlates highly with measures of related constructs; the theory says the constructs are related
- Divergent/discriminant validity - low correlations with measures of unrelated constructs; the theory says the constructs are unrelated
Face validity
- The test is subjectively viewed as measuring what it purports to measure
- Physical appearance of test
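As a rough illustration of establishing concurrent validity with Pearson r (the scores below are invented; a real analysis would use actual test and criterion data gathered at roughly the same time):
```python
import numpy as np

# Hypothetical scores: a new test and an already-established criterion test,
# both administered to the same people with no significant time gap
new_test  = np.array([12, 18, 25, 30, 35, 41, 47, 52])
criterion = np.array([10, 20, 22, 33, 31, 45, 44, 55])

r = np.corrcoef(new_test, criterion)[0, 1]   # Pearson r = validity coefficient
print(f"concurrent validity coefficient: r = {r:.2f}")
```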
The split-half reliability is used to determine
a. If the test can disregard bias despite the number of factors being measured
b. If all the items in the test measures the same dimension
c. Whether consistent scores would be obtained regardless of the characteristics of the test taker
d. All of the above
b. If all the items in the test measures the same dimension
Split half - internal consistency
Reliability - Consistency, accuracy, dependability of test results
— Test-retest reliability
- Administering a test at 2 different times
- time sampling (consistency of test over time)
- Pearson r
— Parallel Forms reliability
- Compares 2 equivalent forms of a test that measure the same attributes
- Item sampling (Diff items, but same difficulty/# of items/content)
- Alternate forms/equivalent forms
— Internal consistency (measures only one construct)
– Split–half reliability
- Divide into halves then score separately (odd-even or random)
- Spearman-Brown formula (the more items, the higher reliability); good if you have limited # of items
– Kuder-Richardson 20
- Used for dichotomous items; only one correct answer
- Test has varying degrees of item difficulty
– Kuder-Richardson 21
- Used for dichotomous items; only one correct answer
- Test has same level of item difficulty (usually for speed test)
– Cronbach’s Coefficient Alpha
- Used for polychotomous (several possible answers); no one correct answer
- Used for likert scales
Dichotomous items: Only 1 correct answer; Polychotomous: No one correct answer; does NOT refer to number of choices
— Interrater Reliability
- Consistency of judges/raters evaluating the same behavior
- Observer differences
- Kappa statistics
The 16-PF is a personality test that could also determine unusual responses. Specifically, which of the following?
a. Impression Management
b. Infrequency
c. Acquiescence
d. All of the above
d. All of the above
16 PF can detect 3 kinds of unusual responses
-> Impression management: social desirability
------ IM score is 95% and above: person is faking good
------ IM score is 5% and below: person is faking bad
-> Infrequency: person is playing safe
------ IN score is 95% and above: person is playing safe
-> Acquiescence: tendency to agree with most questions
------ IAC score is 95% and above: person is agreeing to everything
When our code of ethics conflicts with the law, what is the best step to do?
a. Maintain our stance regardless of the law as our ethical code is formulated to protect the safety of our clients and patients
b. Adhere to the law, but at the same time try to minimize the inconvenience it might inflict to the people involved
c. Obey the law as it is the highest order of the land.
d. Resolve the conflict while being committed to the code of ethics
d. Resolve the conflict while being committed to the code of ethics
-> Do D (resolve conflict while adhering to ethics) first as much as possible
-> If you’ve tried everything and it truly cannot be resolved, ADHERE TO THE LAW
For individuals who are legally incapable of providing consent, we must
a. Nevertheless explain appropriately to the client
b. Obtain informed assent from them
c. Obtain appropriate permission from their legally authorized person
d. All of the above
d. All of the above
—> Informed assent: For minors (17 and below)
—> Informed consent: For adults (18 and above)
—> If the client is incapable of providing consent, obtain assent from the client and permission (consent) from the legally authorized person
Mark and Arvin are scheduled to have their height and weight taken at the clinic. They said it was unnecessary because they had just taken their measurements the day before. To their surprise, Mark’s weight went from 55 kg to 58 kg, and Arvin’s weight moved from 60 kg to 61 kg, while both of them maintained their height. This could imply that
a. The weighing scale has systematic error
b. The weighing scale has random error
c. Mark ate more than Arvin did the night before
d. None of the above
b. The weighing scale has random error
—-> Systematic error - the error is fixed, so it is still possible to recover the true score (e.g., consistently adding 5 kg to all readings)
—-> Random error - the error is not consistent, which makes it difficult to recover the true score
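A toy simulation of the difference (the weights, bias, and noise range are made up; only the contrast between a fixed offset and inconsistent readings matters):
```python
import random

true_weight = 55.0                                            # hypothetical true weight in kg

def biased_scale(w, bias=5.0):
    """Systematic error: a fixed bias, so the true score is still recoverable."""
    return w + bias

def noisy_scale(w, spread=3.0):
    """Random error: inconsistent noise, so repeated readings disagree."""
    return w + random.uniform(-spread, spread)

print([biased_scale(true_weight) for _ in range(3)])          # always 60.0
print([round(noisy_scale(true_weight), 1) for _ in range(3)]) # varies each run
```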
When releasing test data, we divulge which of the following?
a. Release raw and scaled scores
b. Release client’s responses to the test questions
c. Observation notes
d. None of the above
d. None of the above
—-> Don’t release raw and scaled scores because the client does not know the interpretation
—-> Don’t release responses because that’s confidential
—-> Don’t release observation notes because you’re not obligated to give them (they may also be messy or judgmental)
In conducting research, which one of the following types of statements is always false?
a. Analytical statement
b. Falsifiable statement
c. Contradictory statement
d. Hypothetical statement
c. Contradictory statement
> Analytical statement - always true
> Falsifiable statement - can be disproved by research
> Contradictory statement - always false / self-contradicting
—- We don’t want analytical and contradictory statements
—- We WANT falsifiable statements
One of the primary objectives in developing this scale was to provide an intelligence test suitable for adults, as previously available tests were all designed for school children.
a. Wechsler-Bellevue Intelligence Scale
b. Raven’s Progressive Matrices
c. Stanford-Binet Intelligence Scale
d. Culture Fair Intelligence Test
a. Wechsler-Bellevue Intelligence Scale
—-> The Stanford-Binet was first developed for schoolchildren (gifted or not) and was verbal
—-> Wechsler argued that a test was needed for adults, with a nonverbal component
—-> In later revisions, the SB added nonverbal content
States that measurement error is always random, and advocates standardization of tests.
a. Domain Sampling Model
b. Classical Test Score Theory
c. Item Discriminability Analysis
d. Item Response Theory
b. Classical Test Score Theory
In reliability, we have two theories: CTT and IRT
——– Classical test score theory ——
—-> Assumes that each person has a true score that would be obtained if there were no errors in measurement (the true score is possible to obtain, but it is hard to reach because random error is always present; therefore, we need to standardize the test to minimize the error)
—-> Total score = true score + error
—-> Advocates standardization - Minimize error by standardizing test
—-> Domain sampling model - considers the problem created by using a limited number of items (the more items, the higher the reliability)
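In symbols, the “more items, higher reliability” idea is usually carried by the Spearman-Brown prophecy formula, r_new = n·r_old / (1 + (n − 1)·r_old); a small sketch with made-up reliability values:
```python
def spearman_brown(r_old: float, n: float) -> float:
    """Predicted reliability when a test is lengthened by a factor of n
    (Spearman-Brown prophecy formula, in the spirit of the domain sampling model)."""
    return (n * r_old) / (1 + (n - 1) * r_old)

# Classical test theory: observed score = true score + error (X = T + E);
# adding parallel items averages out random error, so reliability rises.
print(round(spearman_brown(0.60, 2), 2))   # doubling the test: 0.60 -> 0.75
print(round(spearman_brown(0.60, 3), 2))   # tripling the test: 0.60 -> 0.82
```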
A company, in order to determine whether the new computer system will significantly reduce the amount of time spent in data processing, divided their Finance department into two groups. The first group used the new system, while the second group used the old one. However, upon learning of this set-up, the workers in the second group worked overtime to make sure that they would finish processing their data faster. This is an example of
a. Reactivity
b. Pygmalion effect
c. Rosenthal effect
d. John Henry effect
d. John Henry effect
Reactivity - altering behavior due to awareness of being observed
—-> Hawthorne Effect - They know they’re being studied/observed; therefore, they perform better.
—-> John Henry Effect - Control group in competition with experimental group
How do you establish Alternate Forms reliability?
a. You administer form A of your personality inventory to your sample. After they are finished, you administer form B.
b. You administer form A to your sample, wait a couple of weeks, and then administer form B to the same sample.
c. All of the above
d. None of the above
c. All of the above
Can do either
Subjects serve more than one condition of the independent variable.
a. Between-subjects design
b. Within-subjects design
c. Mixed design
d. Factorial design
b. Within-subjects design
———- Experimental designs ———-
> Between subjects - subjects serve only one condition of IV.
———-> Two/more diff groups, only one condition of IV each.
Ex. I have a total of 60 respondents; I gave white chocolate to 30 respondents and dark chocolate to the other 30
> Within-subjects - aka repeated measures; subjects serve more than one condition of IV
———-> One group, more than one condition of IV.
———-> Longitudinal studies
Ex. I have a total of 60 respondents; I gave both white and dark chocolate to all 60 respondents
> Mixed - one factor is within subject, the other is between subject; has two IV.
———-> Each group only experiences one condition of 1 IV, but also multiple conditions of the other IV.
Ex. Whitening Soap & Frequency of Bathing (2, 4, or 6 times)
- Kagayaku: Group 1 (30 respondents), tested under all three bathing frequencies
- Kayakukayamu: Group 2 (30 respondents), tested under all three bathing frequencies
- Kayanatinglahat: Group 3 (30 respondents), tested under all three bathing frequencies
- Total of 90 respondents all in all
- Between-subjects factor: the whitening soap (each group uses only one brand)
- Within-subjects factor: the frequency of bathing (every group experiences all frequencies)
Michelle developed a test with two sets. In order to identify its reliability, she should employ what statistical tool?
a. Pearson r
b. Kuder-Richardson 20
c. Cronbach’s alpha
d. Kappa statistics
a. Pearson r
- Two sets of the same test: correlate one set with the other (parallel/alternate forms)
(See the full reliability notes above.)
In reliability, what range estimate is acceptable in the clinical setting?
a. .50
b. .80
c. .70
d. .95
d. .95
It must be nearly certain/almost perfect because, in the clinical setting, the result determines a person’s fate
For basic research, .70 is acceptable
Harry wants to increase the reliability of his 35-item test. In order to do so, what should he do?
a. Conduct pilot testing
b. Add more items
c. find experts to validate the test
d. correlate it to similar tests
b. Add more items
The more the merrier!
(See the Classical Test Score Theory and Domain Sampling Model notes above.)
According to the code of ethics, which of the following is true when it comes to disclosing information?
a. We disclose information only when the client provides permission to do so
b. We disclose information to the source of referral even without the consent of the client
c. When the people need to be protected from harm
d. All of the above
d. All of the above
A - true; disclose with consent
B - true; need to give result to source of referral
C - true; duty to protect
Allen asked his professor for help regarding a sensitive case of a research subject he is currently handling in his study. He gave his professor the necessary information about the case, but not the name of the subject. Allen is protecting his subject’s
a. Anonymity
b. Confidentiality
c. Obscurity
d. Privacy
a. Anonymity
> Anonymity - Protect identity
Confidentiality - Protect information (test scores)
Alice was given an intelligence test on Monday and she obtained a score of 100. She took the same test on Wednesday, and she obtained a score of 130. Based on this, the intelligence test is therefore
a. Reliable but not valid
b. Valid but not reliable
c. Not reliable and not valid
d. Reliable and valid
c. Not reliable and not valid
A test can be reliable but not valid; but a test CANNOT be valid unless it’s reliable
Reliability limits the validity of the test
Obtained when the test measures what it purports to measure.
a. Criterion validity
b. Construct validity
c. Reliability
d. Concurrent validity
b. Construct validity
Construct validity - degree to which a test measures what it purports to measure
(See the full validity notes above.)
Which of the following personality tests does not score ambiguous responses?
a. Sack’s Sentence Completion Test
b. Rotter Incomplete Sentence Blank
c. Purpose in Life test
d. None of the above
d. None of the above
A and B are sentence completion tests; they have ambiguous responses
C also scores ambiguous responses
If there is evidence that the association between two variables is not significantly different from 0, then we
a. Reject the null hypothesis
b. Reject the alternative hypothesis
c. Accept the alternative hypothesis
d. Both a and b
b. Reject the alternative hypothesis
Not significantly different from 0 - there is NO significant difference
Null: There is NO significant difference
Alternative: There IS a significant difference
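A minimal sketch of that decision rule (scipy assumed; the paired data echo the earlier food-intake/sleep-quality example and are entirely invented):
```python
from scipy.stats import pearsonr

food_intake   = [2, 3, 5, 6, 8, 9]      # hypothetical evening food intake
sleep_quality = [8, 9, 6, 5, 4, 2]      # hypothetical sleep quality ratings

r, p = pearsonr(food_intake, sleep_quality)
if p < 0.05:
    print(f"r = {r:.2f}: the association differs from 0 -> reject the null")
else:
    print(f"r = {r:.2f}: no evidence the association differs from 0 -> keep the null, reject the alternative")
```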
At the very least, what should be the item difficulty of a multiple-choice item with four choices for it to be reasonable?
a. .25
b. .30
c. .20
d. .15
b. .30
> Item analysis - Set of methods to evaluate items
> Item difficulty - proportion of people who got the item correct; the higher the value, the easier the question/test; the lower the value, the harder the question/test
—– Technically a measure of item easiness (a higher value means an easier item)
—– Optimal difficulty: halfway between 100% and the level of success expected by chance alone (the item’s difficulty index should be higher than the chance probability); e.g., for a four-choice item, chance = .25, so optimal = (1.00 + .25) / 2 ≈ .63
—– .30-.70 to maximize information about the differences among individuals
> Item Discriminability - determines whether people who have done well on item have also done well on the whole test (can discriminate high scorers from low scorers)
—–> Extreme group model: compares those who have done well to those who have done poorly
—————In each item, check which group scored more (high scorers should have more to have good discriminability)
ex. If the group of top performers got the item correct and the low scorers did not, accept the item.
If the group of top performers got the item wrong and the low scorers got it correct, reject the item.
—–> Point-Biserial method: correlation between performance on the item and performance on the whole test; used when one of the variables is dichotomous/categorical and the other is continuous
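A rough sketch of these item-analysis steps with an invented 0/1 response matrix (numpy assumed; note the item-total correlation here leaves the item inside the total, which real analyses often correct for):
```python
import numpy as np

# Hypothetical 0/1 responses: 8 examinees x 5 items (1 = correct)
X = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])
total = X.sum(axis=1)

# Item difficulty: proportion answering correctly (higher value = easier item)
difficulty = X.mean(axis=0)

# Optimal difficulty: halfway between 1.00 and chance (four choices -> chance = .25)
optimal = (1.00 + 0.25) / 2                                  # 0.625

# Extreme-group discriminability: top scorers' difficulty minus bottom scorers'
order = np.argsort(total)
bottom, top = order[:3], order[-3:]
discrimination = X[top].mean(axis=0) - X[bottom].mean(axis=0)

# Point-biserial: Pearson r between each dichotomous item and the total score
point_biserial = [round(np.corrcoef(X[:, j], total)[0, 1], 2) for j in range(X.shape[1])]

print("difficulty:", np.round(difficulty, 2))
print("optimal difficulty (4 choices):", optimal)
print("extreme-group discrimination:", np.round(discrimination, 2))
print("point-biserial:", point_biserial)
```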
This is developed when the classical test score theory is deemed inadequate in identifying the true ability of the test takers.
a. Domain Sampling Model
b. Item discriminability
c. Item Response Theory
d. Item Analysis
c. Item Response Theory (IRT)
Item response theory (IRT)
—– Focuses on the range of item difficulty that helps assess an individual’s ability
—– Need an item bank, with each item having its own difficulty
—– Item branching - administering items based on the response to the previous item; can administer harder or easier items to gauge the ability of the test taker
—– Computerized Adaptive Testing (CAT)
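A toy sketch of the item-branching idea only (not any specific IRT model or CAT algorithm; the item bank and difficulty values are invented):
```python
# Hypothetical item bank: item id -> difficulty (higher value = harder item)
bank = {"i1": 0.2, "i2": 0.35, "i3": 0.5, "i4": 0.65, "i5": 0.8}

def next_item(current_id, correct, seen):
    """Branch to the closest harder unseen item after a correct answer,
    or to the closest easier unseen item after a wrong answer."""
    d = bank[current_id]
    pool = {i: diff for i, diff in bank.items()
            if i not in seen and (diff > d if correct else diff < d)}
    if not pool:
        return None                       # nothing suitable left -> stop testing
    return min(pool, key=lambda i: abs(pool[i] - d))

# Start in the middle of the bank and branch on a simulated answer pattern
seen, item = {"i3"}, "i3"
for correct in [True, False, True]:       # hypothetical responses
    item = next_item(item, correct, seen)
    if item is None:
        break
    seen.add(item)
    print(item, bank[item])               # i4 0.65, then i2 0.35, then i5 0.8
```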
Test-retest reliability only applies to
a. Overt behaviors
b. Covert traits
c. Stable traits
d. Dominant traits
c. Stable traits
> Test-retest: consistency of the test over time (only meaningful for traits that are stable across time)
(See the full reliability notes above.)
The researchers are in the middle of the experiment seeking to identify if noise affects concentration when suddenly the aircon turned off. If the experiment continues, what could be an extraneous variable in this case?
a. Noise
b. Concentration
c. Volume of the sound
d. Temperature
d. Temperature
Extraneous variables - variables that are not part of the experiment, but they do exist and affect the results
Jovie is a newly-hired psychometrician in a company. Just before her scheduled employee testing, her boss spoke to her and asked if she could finish the assessment, which normally takes 2 hours, within 30 minutes justifying that the testing procedures are just formality and the employee would be accepted no matter what. What should Jovie do?
a. Finish the assessment within 30 minutes, as apparently the test would not be used as a basis for hiring selection
b. Compromise with the boss to give her at least 30 minutes more
c. Agree, but inform the applicant about the change
d. Do not agree and explain the testing procedures to the boss
d. Do not agree and explain the testing procedures to the boss
Establishing this psychometric property requires good logical skills and intuition.
a. Construct validity
b. Content validity
c. Face validity
d. Interrater Reliability
b. Content validity
Content validity - experts judge the validity of test items using critical/logical thinking skills
(See the full validity notes above.)
He recognized the need for the rapid classification of recruits with respect to general intellectual level during World War I.
a. Alfred Binet
b. Robert Yerkes
c. Karl Pearson
d. Sir Francis Galton
b. Robert Yerkes
WWI: Army Alpha and Army Beta; Developed by Yerkes
Army Alpha - Literates; verbal
Army Beta - Illiterates; nonverbal
The more items a test has, the higher the reliability it will possess. The concept behind this is called
a. Domain Sampling Model
b. Item Response Theory
c. Classical Test Score Theory
d. Item Bank Analysis
a. Domain Sampling Model
See number 18
In test-retest reliability, carryover effects do not harm the reliability when
a. The changes in score happened on only a proportion of the test takers
b. The changes in score happened on all of the test takers
c. The changes in score affected only few test items
d. The changes in score affected all the test items
b. The changes in score happened on all of the test takers
Which of the following is an example of external consistency?
a. Interrater reliability
b. Alternate Forms reliability
c. Split-half reliability
d. None of the above
a. Interrater reliability
External Consistency - Interrater
Temporal Consistency - Test retest
Form Consistency - Alternate forms
Split-Half Consistency - Split-half