Assessment Flashcards
Practicality
Time-efficient
Not excessively expensive
★ Reliability
No errors in scoring
Consistent and dependable: a reliable test should yield similar results when given to the same Ss on different occasions.
Subjectivity
Inter-rater reliability
Two Ts evaluate by using the same rating scale.
Failure stems from lack of scoring criteria.
Subjectivity of the raters
Subjectivity doesn’t enter into the scoring process.
Intra-rater reliability: causes of violation and solution
Violation of such reliability can occur because of unclear scoring criteria, fatigue, or bias.
*solution: careful specification of an analytic scoring instrument can increase both inter- and intra-rater reliability.
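A rough way to check inter-rater reliability is to see how often two Ts give the same score to the same performance. The sketch below is only an illustration: the function name `exact_agreement` and the sample ratings are made up, and simple percent agreement is just one of several possible indices.

```python
def exact_agreement(ratings_a, ratings_b):
    """Proportion of performances on which two raters gave the identical score."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must score the same non-empty set of performances")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


# Hypothetical scores from two Ts using the same 1-5 rating scale
teacher_1 = [4, 3, 5, 2]
teacher_2 = [4, 3, 4, 2]
print(exact_agreement(teacher_1, teacher_2))
```

The same function can be used for intra-rater reliability by comparing one T's first and second scoring of the same papers.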
test reliability
items that have more than one correct answer
student-related reliability
temporary illness, fatigue
Validity
Test measures exactly what it is supposed to measure.
Authenticity
lg is natural; items are contextualized
includes meaningful, relevant, interesting topics
simulates real-world tasks
provides some thematic organization to items through an episode.
e.g.) reading passages selected from real-world sources that test-takers are likely to encounter /
listening comprehension sections feature natural lg with hesitations, white noise, and interruptions.
Topics and situations are interesting and relevant to my life.
Tasks replicate, or clearly approximate, real-world tasks.
Washback
formative
Give learners feedback that enhances their lg development.
How a test influences both teaching and learning
Ts can provide information that washes back to Ss in the form of useful diagnoses of strengths and weaknesses.
I expected the teacher to go over the test and give “advice” on what I should focus on in the near future.
No “feedback or comments” from the teacher were given.
How to increase washback?
Comment generously and specifically on test performance.
“comments and feedback”
Letter grades and numerical scores give no information of intrinsic interest to the S.
Formative tests, by definition, provide washback in the form of information to the learner on progress and goals.
Informal assessment: T provides interactive feedback -> washback increases
Formal assessment: T provides information on Ss’ progress toward goals -> washback increases
Criterion validity: definition and examples of its two types
Validity is measured by comparing a new test with an existing one: the extent to which the criterion of the test has actually been reached.
1) Predictive validity: e.g.) placement tests, admissions assessment batteries, and achievement tests designed to determine Ss’ readiness to move on to another unit.
2) Concurrent validity: e.g.) a high score on the test is matched by actual proficiency in the lg.
Formative test
Formative tests, by definition, provide washback in the form of information to the learner on progress and goals.
Evaluating Ss in the process of forming their competencies and skills. The key is the delivery (by the T) and internalization (by the S) of feedback.
All kinds of informal assessment are formative.
Gather information on the developmental “process” of their speaking
Assess their performance regularly
Summative test
Measure what Ss have grasped at the end of a course or unit of instruction.
* Evaluate only product not process
Summative test fails to provide crucial info.
(cf. a formative test provides information)
One major test at the end of semester
Norm-referenced
Purpose: to place test-takers along a mathematical continuum in rank order
Primary concern: practicality, reliability, validity
Such tests must have fixed, predetermined responses.
Use the test results to award scholarships to the top 10%.
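A norm-referenced decision depends only on where each test-taker stands relative to the others. The sketch below ranks a group and picks the top 10%, and also computes a simple percentile rank. The names `top_fraction` and `percentile_rank`, the sample scores, and the rule of awarding at least one scholarship are assumptions for illustration; ties are not handled.

```python
def top_fraction(scores, fraction=0.10):
    """Return the names ranked in the top `fraction` of the group, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)  # rank order
    n_awards = max(1, int(len(ranked) * fraction))  # assumption: at least one award
    return ranked[:n_awards]


def percentile_rank(scores, name):
    """Percentage of the group scoring strictly below the named test-taker."""
    below = sum(1 for s in scores.values() if s < scores[name])
    return 100 * below / len(scores)


# Hypothetical scores for ten test-takers
scores = {"S1": 95, "S2": 88, "S3": 82, "S4": 79, "S5": 75,
          "S6": 70, "S7": 66, "S8": 61, "S9": 55, "S10": 48}
print(top_fraction(scores))
print(percentile_rank(scores, "S1"))
```

Note that the cut depends on the group: the same raw score could win a scholarship in one group and not in another.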
Criterion-referenced
The test is criterion-referenced, assessing the extent to which the students achieved the goals of the class.
Primary concern: authenticity, washback
(The goal is to use the ability in real life. That is, authenticity is the degree of match between the test and real life; washback concerns feedback.)
Give test-takers feedback in the form of grades.
The distribution of Ss’ scores across a continuum may be of little concern as long as the instrument assesses appropriate objectives.
The Ss who get over 10 out of 16 will pass the conversation course.
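In a criterion-referenced decision each S is compared with a fixed criterion, not with classmates. The sketch below applies the cut-off from the example above, reading “over 10” as strictly more than 10 (an assumption); the function name `passes` and the sample scores are made up.

```python
PASS_CUTOFF = 10  # from the example: over 10 out of 16
MAX_SCORE = 16


def passes(score, cutoff=PASS_CUTOFF):
    """True if the score is strictly above the cut-off (assumed reading of 'over 10')."""
    if not 0 <= score <= MAX_SCORE:
        raise ValueError("score must be between 0 and 16")
    return score > cutoff


# Hypothetical class: everyone can pass, whatever the score distribution looks like
class_scores = {"S1": 16, "S2": 12, "S3": 11, "S4": 10}
results = {name: passes(score) for name, score in class_scores.items()}
print(results)
```

Unlike the top-10% rule, the number of Ss who pass is not fixed in advance.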
Test administration reliability
Classroom conditions for the test are equal for all students.
ex) aural comprehension test -> street noise
Content validity
1) The tests assess real course objectives, direct testing
2) It requires test-takers to perform the behavior that is being measured.
Items focus on previously practiced in-class reading skills.
Construct validity
e.g.) conducting an oral interview
major components of oral proficiency: pronunciation, fluency, grammatical accuracy, vocab use, socio-linguistic appropriateness
e.g.) a simple written vocab quiz, covering the content of a recent unit -> have Ss correctly define a set of words.
However, if the objective is communicative use of words, writing definitions certainly fails to match a construct of communicative lg use.
Face validity
Whether the test looks as if it is measuring what it is supposed to measure.
Tests that relate to their course work / familiar tasks / clear directions
The printing was too small. I had to read five pages in one hour.
Lots of tasks were unfamiliar
I’ve never done those kinds of tasks in class.
material that she had not dealt with in class
It seemed like a writing test rather than a listening test.
The exam “looks like” one that high school Ss normally take.
needs analysis (needs assessment)
process of assessing the needs of Ss
Before designing a course, it is necessary to make decisions about what would be taught and how it would be taught.
survey and interview
Info about what my Ss needed to learn or change, their learning styles, interests, proficiency levels etc.
Based on the info, I decided on the course objectives, contents and activities.
a proficiency test/ standardized test
not linked to any particular textbook or specific course of study. (not limited to single skill in the lg. Rather, it tests “overall proficiency”.)
Summative and norm-referenced: provide results in the form of a single score, measure performance against a norm (w/ equated scores and percentile rank)
Does not provide diagnostic feedback
summative feedback
Ss will receive a total score for the reading section
constructed-response item
-
Item Response Distribution
- a certain wrong alternative was chosen by a greater number of high group students than low group students.
- more students chose the wrong alternative than those who chose the correct answer.
- A certain wrong alternative did not work as a distractor.
the reliability of the test ***
Item 18 deteriorates the internal consistency of the test.
This happens when low ability group Ss answer the item correctly more often than high ability group Ss.
Item Facility
Item Difficulty
The extent to which an item is easy or difficult for the proposed group of test-takers
Shows the proportion of Ss who chose the correct answer
Mr. Park divided the number of Ss who correctly answered a particular item by the total number of Ss who took the test.
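The calculation Mr. Park did can be written as a one-line function. This is a minimal sketch; the name `item_facility` and the sample numbers are made up for illustration.

```python
def item_facility(n_correct, n_total):
    """Proportion of test-takers who answered the item correctly (0.0 to 1.0)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_correct <= n_total:
        raise ValueError("n_correct must be between 0 and n_total")
    return n_correct / n_total


# Hypothetical item: 15 of 20 Ss answered correctly
print(item_facility(15, 20))
```

A value near 1.0 means an easy item for this group, and a value near 0.0 means a difficult one.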
Item Discrimination
The extent to which an item differentiates btw high- and low-ability test-takers
Item 20 shows the highest discrimination among the five items.
Item 2 does not distinguish the upper level Ss from the lower level Ss.
e.g.) On a certain item, high- and low-ability Ss got the same score -> the item has poor ID, because it didn’t discriminate btw the two groups. INTERNAL CONSISTENCY
Many upper group students incorrectly chose option C. (Item 2 does not distinguish the upper level Ss from the lower level Ss.)
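One common way to put a number on item discrimination is to compare how many Ss in the high group and in the low group answered the item correctly. The sketch below assumes two groups of equal size; the function name and the sample counts are made up.

```python
def item_discrimination(high_correct, low_correct, group_size):
    """ID = (correct in high group - correct in low group) / size of one group."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return (high_correct - low_correct) / group_size


# Hypothetical items, each with 10 Ss in the high group and 10 in the low group
print(item_discrimination(7, 2, 10))  # high group does clearly better
print(item_discrimination(5, 5, 10))  # both groups do the same: no discrimination
print(item_discrimination(3, 6, 10))  # low group does better: negative ID
```

An ID of 0 matches the example above (both groups get the same score), and a negative ID matches the case where the low group outperforms the high group.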
Distractor
no one from the upper group and lower group chose option B.
Distractors a and b seem to be fulfilling their function of attracting some attention from lower-ability Ss.
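Distractor efficiency can be checked by tabulating which option each group chose. The sketch below counts choices per option and flags the two problems mentioned in these cards: a distractor nobody chose, and a distractor that attracts more high-group than low-group Ss. All names and the response data are hypothetical.

```python
from collections import Counter


def response_distribution(high_choices, low_choices, options="ABCD"):
    """Map each option to (number chosen in high group, number chosen in low group)."""
    high = Counter(high_choices)
    low = Counter(low_choices)
    return {opt: (high[opt], low[opt]) for opt in options}


def flag_distractors(distribution, key):
    """Return a note for each wrong option that is not working as a distractor."""
    flags = {}
    for opt, (high, low) in distribution.items():
        if opt == key:
            continue  # the key is not a distractor
        if high + low == 0:
            flags[opt] = "not chosen by anyone"
        elif high > low:
            flags[opt] = "attracts more high-group than low-group Ss"
    return flags


# Hypothetical item with key A, 10 Ss per group
high_group = ["A", "A", "A", "A", "A", "A", "C", "C", "C", "D"]
low_group = ["A", "A", "A", "A", "C", "D", "D", "D", "D", "D"]
dist = response_distribution(high_group, low_group)
print(dist)
print(flag_distractors(dist, key="A"))
```

Here option D is left unflagged because it mainly attracts low-group Ss, which is what a distractor is supposed to do.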
portfolios
collections of Ss’ work
useful for assessing student performance: 1. Ss have ownership over the process of learning, 2. Portfolios allow T to pay attention to Ss’ progress as well as achievement.
Alternatives
portfolios / performance-based assessment / conferences / journals / self-assessment / peer-assessment / observation
performance-based assessment
The T observes the performance
The task is evaluated through “direct observation” by the T.
The task calls for the integration of language skills.
analytic rating scales
Provide diagnostic information
holistic rating scales (holistic scoring method)
-
discrete point test
assessing one point at a time
On the assumption that lg can be broken down into component parts and that those parts can be tested successfully.
e.g.) grammar and vocab items in multiple choice format / large-scale standardized entrance exams
Integrative test: types and what it emphasizes
Cloze test
Dictation
emphasizing communication and authenticity / communicative competence
Cloze test: types / features
Fixed-ratio cloze: Every nth word is deleted in a text.
Rational-deletion cloze: Words are deleted in a text on a rational basis (e.g. prepositions, sentence connectors) to assess specified grammatical or rhetorical categories.
Rational deletion offers more washback; expectancy grammar (ability to predict the next item)
Feature) integrative + indirect testing that measures reading ability.
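The two deletion procedures can be sketched as two small functions. This is an illustration only: the function names, the blank marker, and the sample sentences are made up, and punctuation attached to a word is deleted along with it.

```python
BLANK = "_____"


def fixed_ratio_cloze(text, n=7):
    """Delete every nth word; return the gapped text and the deleted words."""
    words = text.split()
    answers = []
    for i in range(n - 1, len(words), n):  # positions n, 2n, 3n, ... (1-based)
        answers.append(words[i])
        words[i] = BLANK
    return " ".join(words), answers


def rational_deletion_cloze(text, targets):
    """Delete only the words in `targets` (e.g. a chosen set of prepositions)."""
    words = text.split()
    answers = []
    for i, word in enumerate(words):
        if word.lower() in targets:
            answers.append(word)
            words[i] = BLANK
    return " ".join(words), answers


print(fixed_ratio_cloze("one two three four five six seven eight nine ten", n=3))
print(rational_deletion_cloze("The cat sat on the mat in the hall", {"on", "in"}))
```

In the fixed-ratio version the deleted words are decided by position alone, while in the rational-deletion version the T decides which category to test.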
Rational deletion cloze
specific content words are chosen to be deleted
-> more washback, expectancy grammar (ability to predict the next item)
Scoring is more difficult in a rational deletion cloze than in a C-test.
Cloze test scoring method types: acceptable word method
a scoring method that accepts any suitable, grammatically and rhetorically acceptable word that fits the blank in the original text.
(high face validity)
C-test: definition and features
The second half of every other word is deleted
it has a higher scoring reliability
/ lower validity
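The C-test deletion rule can be sketched as follows. Several details are assumptions made for this illustration, not part of the definition above: deletion starts from the second word, a word with an odd number of letters loses the larger part, and one-letter words are left intact. The function name and sample sentence are made up.

```python
def c_test(text, blank="_"):
    """Delete the second half of every other word, starting with the second word."""
    words = text.split()
    answers = []
    for i in range(1, len(words), 2):  # second, fourth, sixth word, ...
        word = words[i]
        if len(word) < 2:
            continue  # assumption: one-letter words are left intact
        keep = len(word) // 2  # assumption: odd-length words lose the larger part
        answers.append(word[keep:])
        words[i] = word[:keep] + blank * (len(word) - keep)
    return " ".join(words), answers


print(c_test("Students read short texts every day"))
```

Because the first half of each word is given, usually only one answer fits each blank, which is in line with the higher scoring reliability noted above.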
Cloze test: definition / which competences Ss use / types
an integrative measure not only of reading ability but of other lg abilities
- Ss use linguistic competence (formal schemata) / background experience (content schemata) / strategic competence
Fixed-ratio deletion
Rational deletion
Cloze test scoring method types: exact word method
a scoring method that is limited to accepting the same word found in the original text
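The difference between the exact word method and the acceptable word method shows up clearly when the same responses are scored both ways. In this sketch the function names, the sample answers, and the sets of acceptable words are made up; in practice the T has to decide which alternatives count as acceptable.

```python
def score_exact(responses, original_words):
    """One point per blank only when the response is the word from the original text."""
    return sum(r.strip().lower() == o.lower()
               for r, o in zip(responses, original_words))


def score_acceptable(responses, acceptable_sets):
    """One point per blank when the response is in that blank's set of acceptable words."""
    return sum(r.strip().lower() in accepted
               for r, accepted in zip(responses, acceptable_sets))


# Hypothetical two-blank cloze
original = ["big", "ran"]
acceptable = [{"big", "large"}, {"ran"}]
student = ["large", "ran"]
print(score_exact(student, original))
print(score_acceptable(student, acceptable))
```

The exact word method needs no judgment from the scorer, while the acceptable word method gives credit for "large" but depends on the list the T draws up.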
dictation
It taps into grammatical and discourse competence
Subjective testing
Low reliability/ high validity
Constructed response items
e.g.) open-ended response
Objective testing
It has predetermined fixed responses
High test reliability, Low validity
Selected response
e.g.) T/F, multiple choice items
Direct testing
It involves the test-taker in actually performing the target task.
High content validity
e.g.) Oral presentation, to test performance directly
Indirect testing
Learners are not performing the task itself but rather a task that is related in some way
Achievement tests
Limited to particular material and are offered after a course has focused on the objectives in question
Determine whether the course objectives have been met by the end of a given period of instruction
Summative: administered at the end of a lesson, unit, or term of study
Formative: when offering feedback about the quality of a learner’s performance
Placement tests
to place a student into a particular level of a lg curriculum or school
Diagnostic
Formative (correct/incorrect responses provide Ts with useful information on what may or may not be emphasized in the weeks to come)
Diagnostic tests
To diagnose aspects of a lg that a S needs to develop or that a course should include
-> Should elicit info on what Ss need to work on in the future. Therefore, a diagnostic test will typically offer more detailed, subcategorized information on the learner.
Constructed response items
A type of test item or task that requires test-takers to respond to a series of open-ended questions by writing, speaking, or doing something rather than choosing answers from a ready-made list.
computer adaptive testing
computer testing software that adjusts the questions depending on Ss’ performance on previous test items.
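The idea of adjusting questions to performance can be shown with a very simple up/down rule: a correct answer leads to a harder item and a wrong answer to an easier one. This is a toy sketch under that assumption only; operational adaptive tests use more elaborate statistical models, and all names and the 1-5 difficulty scale here are made up.

```python
def next_difficulty(current, was_correct, lowest=1, highest=5):
    """Move one level up after a correct answer, one level down after a wrong one."""
    step = 1 if was_correct else -1
    return min(highest, max(lowest, current + step))  # stay within the scale


def run_adaptive(responses, start=3):
    """Return the sequence of difficulty levels presented to one test-taker."""
    level = start
    path = [level]
    for was_correct in responses:
        level = next_difficulty(level, was_correct)
        path.append(level)
    return path


# Hypothetical S: correct, correct, wrong, correct
print(run_adaptive([True, True, False, True]))
```

Two Ss who answer differently therefore see different items, unlike on a fixed paper test.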
Alternative tests (Performance-based assessment )
it requires Ss to perform, create, produce, or do s/t.
use real-world contexts.
focus on process as well as products
tap into higher level thinking and problem-solving skills
provide info about both strengths and weaknesses of Ss
involve “an integration of lg skills”
Performance-based assessment: points for the T to note
- state the overall goal of the performance
- specify the objectives (criteria) of the performance in detail
- prepare Ss for performance in stepwise progress
- use a reliable evaluation form, checklist.
- treat performances as opportunities for giving feedback and provide that feedback systematically
- if possible, utilize self- and peer- assessment judiciously.
Rubrics
validity ↑, reliability ↑
A rubric is a device used to evaluate open-ended, oral and written responses of learners
- usually composed of a set of criteria or competencies, each with descriptions of levels of expectation
- some rubrics involve scaling
Rubric-based assessment
Not only were rubrics beneficial for teachers, but Ss were also able to better focus their efforts, produce work of higher quality, earn better grades, and feel less anxious about assignments.
Pro) rubrics provide points for Ss to focus on and goals to pursue
Con) simplicity (marking a few points on a chart and considering our job done!) may mask the depth and breadth of a S’s attainment.
Portfolios
a purposeful collection of Ss’ work that demonstrates their efforts, progress and achievements.
Advantages) foster intrinsic motivation, responsibility and ownership
- promote S-T interaction w/ the T as a facilitator
- facilitate critical thinking, self-assessment and revision process
- offer opportunities for collaborative work w/peers
Portfolios: points to note
- State objectives clearly
- Give guidelines on what materials to include
(a sample portfolio from a previous S can help stimulate some thoughts on what to include)
- Communicate assessment criteria to Ss. (self-assessment: formative)
- Provide positive washback when giving final assessments
e.g.) a holistic scoring scale ranging from 1 to 6.
narrative evaluation of perceived strengths and weaknesses by the T
Journals
the most formative of all the alternatives in assessment
CONTENT VALIDITY ↑, WASHBACK ↑ ↑
a log of one’s thoughts, feelings, reactions, assessments, ideas, or progress toward goals, usually written w/ little attention to structure, form, or correctness.
“written conversation between T and Ss”
Dialogue journals
They imply an interaction between the T and the S through dialogues or responses
Advantages) practice in writing fluently, using writing as a thinking process, emphasizing a student’s own voice, affording a unique opportunity for a teacher to offer various kinds of feedback
* Ts become better acquainted with their Ss in terms of both their learning progress and their affective states
: meet Ss’ individual needs
Disadvantage) It’s difficult to set up criteria for evaluation
Note) T should provide optimal feedback in responses.
- cheerleading feedback; instructional feedback, in which you suggest strategies or materials; reality-check feedback -> help Ss set more realistic expectations for their lg abilities
self-assessment / peer-assessment
autonomy, develop motivation / cooperative learning
Observation
observe Ss in the classroom
assess Ss w/o their awareness
naturalness of their linguistic performance is maximized
Can take the form of recording, checklist, rating scales
Holistic scoring
an approach that uses a “single general scale” to give a global rating for a test-taker’s lg production
Pro) fast evaluation
Con) no diagnostic info is available (no washback potential), raters need to be extensively trained to use the scale accurately
Analytic scoring
An approach that separately rates a number of predetermined aspects (e.g. grammar, content, organization) of a test-taker’s lg production (e.g. writing)
=> enabling learners to home in on weaknesses and capitalize on strengths
PRACTICALITY ↓, in that more time is required for T to attend to details but ultimately Ss receive more information about their writing
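What makes analytic scoring diagnostic is that each aspect keeps its own rating. The sketch below returns a total together with the weakest and strongest aspect; the function name, the three categories, and the 1-5 ratings are made up for illustration.

```python
def analytic_score(ratings):
    """Return (total, weakest aspect, strongest aspect) from per-aspect ratings."""
    if not ratings:
        raise ValueError("at least one rated aspect is required")
    total = sum(ratings.values())
    weakest = min(ratings, key=ratings.get)
    strongest = max(ratings, key=ratings.get)
    return total, weakest, strongest


# Hypothetical ratings for one piece of writing, each aspect on a 1-5 scale
essay = {"grammar": 3, "content": 5, "organization": 4}
print(analytic_score(essay))
```

A holistic score would report only a single global number, so the S would not learn that grammar is the aspect to work on.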
Primary trait scoring
e.g.) persuasive writing -> score by focusing only on how well the text persuades
It allows both writer and evaluator to focus on function
Multiple choice items
Practicality ↑: time-saving scoring procedures, Reliability ↑: pre-determined correct responses
Multiple choice items are all receptive, or selective-response, items in that the test-taker chooses from a set of responses.
STEM: the body of the item that presents a stimulus
Options / Alternatives - KEY
Guidelines for designing multiple choice items.
1) Design each item to measure a single objective.
e.g.) if WH-questions are the objective, measure only that
+) Do not provide inadvertent (unintentional) clues
2) State both stem and options as simply and directly as possible - remove needless redundancy from options and stem
3) Make certain that the intended answer is clearly the only correct one (only one correct answer)
Past exam item) make sure the distractors are the same grammatical class as the key / make sure the key cannot be selected based on Ss' world knowledge.