Because the ETOE™ oral performance examination requires human scoring, candidates do not receive preliminary results upon completing it. Official results are sent by email approximately six to eight (6-8) weeks after the last date of the corresponding testing window (that is, after all testing in the window is done, not after the date of the candidate’s own exam).

The ETOE™ oral performance examination consists of 23 audio-recorded candidate responses, 22* of which are scored by human raters.

Raters score the examination by applying Behaviorally Anchored Rating Scales, which were developed and validated by CCHI’s Subject Matter Experts under the guidance of a nationally recognized psychometrician.

Each of the item types (“activities”) on the ETOE™ examination is scored with its own rubric, which consists of three to five of the following scales:

  1. Quality of Speech
  2. Task Completion
  3. Accuracy and Cohesion/Coherence
  4. Lexical Content
  5. Grammar

For example, the item types Memory Capacity, Restate the Meaning, Equivalence of Meaning, and Reading Comprehension are evaluated using all five scales. Listening Comprehension is evaluated using four scales: Quality of Speech, Accuracy and Cohesion/Coherence, Lexical Content, and Grammar. Shadowing is evaluated using three scales: Quality of Speech, Task Completion, and Accuracy and Cohesion/Coherence.
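
The assignment of scales to item types described above can be written as a small lookup table. This is only an illustrative Python sketch: the names `SCALES` and `RUBRICS` are hypothetical and are not part of CCHI’s scoring system.

```python
# Illustrative lookup only; the identifiers below are hypothetical.
SCALES = [
    "Quality of Speech",
    "Task Completion",
    "Accuracy and Cohesion/Coherence",
    "Lexical Content",
    "Grammar",
]

RUBRICS = {
    "Memory Capacity": list(SCALES),
    "Restate the Meaning": list(SCALES),
    "Equivalence of Meaning": list(SCALES),
    "Reading Comprehension": list(SCALES),
    # Listening Comprehension uses every scale except Task Completion.
    "Listening Comprehension": [s for s in SCALES if s != "Task Completion"],
    # Shadowing uses only the first three scales.
    "Shadowing": SCALES[:3],
}

for item_type, scales in RUBRICS.items():
    print(f"{item_type}: {len(scales)} scales")
```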

All scales have equal weight and are applied independently. The main criterion across all scales is how accurately candidates maintain the meaning of the original speech or text (the “prompt”).

The brief descriptions below are provided as general guidance. Raters receive extensive, continuous training in how to apply these scales, and their performance is rigorously monitored to ensure the validity and reliability of the scoring.

  1. Quality of Speech: Quality of speech focuses on the physical characteristics of the speech produced by the candidate. On this scale, common errors include false starts, hesitations, numerous self-repairs, poor pronunciation, intonation, or pace that hinder understanding.
  2. Task Completion: Raters evaluate whether the candidate completed the specified task in a relevant manner, that is, whether the candidate followed the instructions. For example, for Memory Capacity items, raters expect candidates to repeat every word in a message they have heard; thus, errors here would be omitting words or paraphrasing them. In Equivalence of Meaning, the expectation is to paraphrase the five underlined terms coherently and without losing the meaning. Thus, errors in this case would be repeating these five words verbatim or just giving their definitions instead of producing a coherent message.
  3. Accuracy and Cohesion/Coherence: This scale focuses on relevance (logical response) and completeness of the information presented in the prompt. Most common errors include omissions (especially, omission of key points of the prompt), additions that affect the meaning, providing irrelevant information, or producing incoherent statements.
  4. Lexical Content: Raters evaluate how accurately the candidate preserves the “units of information” of the source speech or text. A unit of information can be an individual word, a group of words, or a phrase that communicates a single concept. On this scale, errors include providing inaccurate factual information (e.g., in Reading Comprehension, providing incorrect medical information when answering questions about the text); inaccurate use of vocabulary (choice of words), especially if the usage changes the meaning of the original prompt; and changing the register when there is no need to do so. Register is a variety of language used for a particular purpose or in a particular social setting, that is, the level of formality chosen by the speaker. For example, the text of the Reading Comprehension activity is of a somewhat high register. When answering questions about the text, candidates are expected to speak at the same register or at a neutral register, so using a low register in the response may lower the grade on this scale (note that we say “may” here, not “will”). At the same time, in the Restate the Meaning activity, there are some items where restating is possible only by changing the register, and in this case the grade will not be lowered. All raters are trained on when a change of register does or does not constitute an error.
  5. Grammar: Grammar is the set of rules that govern how sentences, phrases, and words are put together in a given language. Raters evaluate the candidate’s command of English grammar. On this scale, errors include changes in verb tense or agreement, use of incorrect pronouns, inaccurate word order (syntax), use of incomplete sentences, and similar problems that hinder the listener’s understanding of the message.

Raters do not know candidates’ identities when scoring examinations. Each oral response is scored independently by two raters. Raters do not score the entire exam of one candidate; they score individual responses. If two raters disagree by one point on a particular score for a particular response, that response is then scored by a third rater. Raters do not know whether a candidate passes or fails the exam because they do not score a whole exam and have no access to other raters’ scores or the final score.
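
The double-scoring rule can be sketched as a simple check. This is a hypothetical Python illustration, not CCHI’s procedure: the function name, the example ratings, and the reading of “disagree by one point” as “differ by at least one point” are all assumptions, and how a third rating is combined with the first two is not described here, so the sketch only flags the response.

```python
def needs_third_rater(score_a: int, score_b: int, threshold: int = 1) -> bool:
    """Return True when two independent ratings differ by at least `threshold` points.

    The threshold of 1 is an assumed reading of the rule described in the text.
    """
    return abs(score_a - score_b) >= threshold


# Hypothetical ratings given by two raters to the same response on one scale.
print(needs_third_rater(3, 3))  # False: the raters agree
print(needs_third_rater(3, 4))  # True: the response goes to a third rater
```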

Total scores for each of the exam’s subdomains (“activities”) are weighted according to CCHI’s proprietary formula, which is based on the ETOE™ exam specifications. The passing score (passing standard) is determined by teams of Subject Matter Experts (SMEs) and the CCHI Commissioners through a standard setting process (see the detailed explanation below). The raw score is then scaled (via a mathematical formula) to a range of 300 to 600, with the passing score set at 450. Because different forms of the test may differ slightly in difficulty, a statistical procedure called equating is used to ensure that the passing score of 450 is comparable from form to form (see the explanation of equating below).
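
CCHI’s scaling formula is proprietary and is not given here, so the following Python sketch only illustrates the general idea of mapping a raw score onto a 300 to 600 scale with the passing point fixed at 450. The piecewise-linear rule, the function name `scale_score`, and every raw-score value are assumptions made for illustration.

```python
def scale_score(raw, raw_cut, raw_max, low=300, cut=450, high=600):
    """Map a raw score to the reporting scale (hypothetical piecewise-linear rule)."""
    if raw <= raw_cut:
        # At or below the raw passing point: stretch [0, raw_cut] onto [low, cut].
        return low + (cut - low) * raw / raw_cut
    # Above the raw passing point: stretch (raw_cut, raw_max] onto (cut, high].
    return cut + (high - cut) * (raw - raw_cut) / (raw_max - raw_cut)


# Two hypothetical forms whose raw passing points differ after equating.
print(scale_score(60, raw_cut=60, raw_max=100))  # 450.0
print(scale_score(63, raw_cut=63, raw_max=100))  # 450.0
print(scale_score(60, raw_cut=63, raw_max=100))  # below 450 on the easier form
```

In this sketch, the same raw score of 60 passes on the form with a raw passing point of 60 but falls short on the easier form, while both raw passing points land on the same scaled score of 450.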

The Score Report, in addition to the overall test score, indicates how candidates scored on the exam subdomains (Listening Comprehension, Shadowing, Memory Capacity, Restate the Meaning, Equivalence of Meaning, and Reading Comprehension) to help identify strengths and weaknesses for future study.

Keep in mind that the Score Report states two separate things: the overall test score, and how well you did in specific parts of the test. There is no relationship between the percentages reported for the parts of the test (subdomains) and the overall scaled score.

We report the percentage correct for 6 subdomains: Listening Comprehension, Shadowing, Memory Capacity, Restate the Meaning, Equivalence of Meaning, and Reading Comprehension. The percentage correct for a part of the test (subdomain, e.g., Memory Capacity) is computed as the portion of the points that you earned relative to the number of points it is possible to earn in that part. For example, if the maximum number of points that it is possible to earn in a part of the test is 72 and you earned 51 points, the percentage on your score report would be 71% out of 100% possible in that subdomain.
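
The arithmetic in this example can be checked with a few lines of Python. The function name is hypothetical, and rounding to the nearest whole percent is an assumption based on the 71% shown above.

```python
def percent_correct(points_earned, points_possible):
    """Percentage of available points earned, rounded to a whole percent (assumed)."""
    return round(100 * points_earned / points_possible)


# 51 of 72 possible points is about 70.8%, reported as 71%.
print(percent_correct(51, 72))  # 71
```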

Your total score is not the average of your performance in the subdomains. The total score is based on the full examination. There is no pass or fail status associated with an individual subdomain. The percentages reported for subdomains are intended only as a guide and should be interpreted cautiously because each subdomain includes only a small number of items. If you failed an exam, you need to practice all types of activities in order to improve your score. For more information on the domains, see the Test Content Outline.

Explanation of Standard Setting

To establish the passing score for the ETOE™ exam, CCHI uses the Extended Modified Angoff method, which has an established history of determining credible passing standards for credentialing examinations, together with the Beuk Relative-Absolute Compromise method.

The Extended Modified Angoff method involves two basic elements: conceptualization of a minimally competent candidate and the estimation, as assigned by SMEs, of the average score a minimally competent candidate would receive on each item. A minimally competent candidate is described as an individual who would be able to demonstrate just enough knowledge and skills to pass the examination. In general, such a candidate has enough interpreting skills to practice safely and competently, but does not demonstrate the skill level to be considered an expert.

SMEs rate each test item by estimating the score a minimally competent candidate would receive on it. They then compare their ratings with empirical data collected for each item during the pilot phase and discuss the ratings as a group, with the goal of reaching as close a consensus as possible. The SMEs’ ratings are then averaged, and this “provisional cut score” is further reviewed and validated.
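
The averaging step can be illustrated with a short Python sketch. The ratings below are invented, and summing the per-item averages to obtain a provisional cut score is an assumption about how the averaged ratings are combined; it is a common way to apply an extended Angoff procedure, not a description of CCHI’s exact computation.

```python
from statistics import mean

# Hypothetical ratings: each row is one SME, each column is one test item.
# Each value is the score that SME expects a minimally competent candidate to earn.
ratings = [
    [3, 2, 4],
    [3, 3, 4],
    [2, 3, 3],
    [2, 2, 3],
]

# Average the SMEs' ratings item by item, then add the item averages together.
item_means = [mean(item_ratings) for item_ratings in zip(*ratings)]
provisional_cut_score = sum(item_means)

print(item_means)             # [2.5, 2.5, 3.5]
print(provisional_cut_score)  # 8.5
```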

To establish an operational cut score, SMEs are also asked to make a specific prediction about the test as a whole. This prediction is then used to adjust the panel-recommended rating; this adjustment is known as the Beuk Relative-Absolute Compromise method.

For more information about the standard setting methods, see:

  • Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.) (pp. 508‐600). Washington, DC: American Council on Education.
  • Beuk, C. H. (1984). A method of reaching a compromise between absolute and relative standards in examinations. Journal of Educational Measurement, 21, 147-152.
  • Hambleton, R. K., & Plake, B. S. (1995). Using an extended Angoff procedure to set standards on complex performance assessments. Applied Measurement in Education, 8, 41‐56.
  • Plake, B. S., & Cizek, G. J. (2012). Variations on a theme: The Modified Angoff, Extended Angoff, and Yes/No standard setting methods. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods and innovations (pp. 181‐200). New York, NY: Routledge.

Explanation of Equating

Following best testing practices, CCHI administers several versions of the same exam (called test forms) to candidates. One reason for this is to be fair both to candidates who take the exam for the first time and to those who retake it. Ideally, each candidate should receive a version of the exam that is new to them.

Different test forms may be of slightly different difficulty, because of the natural variations in the language of the test items (e.g., dialogs). And, again, it is important to be fair to all candidates regardless of which form they took. To achieve this fairness, the test forms undergo a procedure called equating.

Equating is a mathematical calculation that ensures that the test forms have their passing points at the same level of candidate performance, i.e., that the forms are “equal” and “fair.” Test forms are equated to the “standard.” The “standard” is the form that the SMEs used to establish the passing score, and all subsequently developed forms are equated to it. Let’s say the standard is Form 1, and Forms 2 and 3 are equated to Form 1. Because of this equating, Forms 2 and 3 will have different raw passing points, but these will then be scaled to represent the same passing score of 450 points. As a result of equating, a slightly easier form will require the candidate to earn more points on the test items (the “raw score”) to pass the exam, and a slightly more difficult form will allow the candidate to pass with fewer points.

Equating calculations are done by psychometricians and then reviewed and approved by CCHI.

As an analogy, if a second-grade mathematics test included both addition and multiplication problems, you might expect the addition problems to be easier and the multiplication problems to be harder. Let’s say Class 1 has an exam with 75 addition questions and 25 multiplication questions, whereas Class 2 has an exam with 65 addition questions and 35 multiplication questions. Then, to be fair to both classes, the final grades on the two exams would have to be mathematically adjusted. Let’s say each addition question is worth 1 point, and each multiplication question is worth 4 points. Now imagine these four students:

  • Student A from Class 1, who correctly answers all the addition questions and misses all the multiplication questions, would have a final score of 75 and would have answered 75% of the questions correctly.
  • Student B from Class 1, who misses all the addition questions but answers all the multiplication questions correctly, would have a final score of 100 but would have answered only 25% of the questions correctly.
  • Student C from Class 2, who correctly answers all the addition questions and misses all the multiplication questions, would have a final score of 65 and would have answered 65% of the questions correctly.
  • Student D from Class 2, who misses all the addition questions but answers all the multiplication questions correctly, would have a final score of 140 but would have answered only 35% of the questions correctly.
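
The four cases above can be reproduced with a short Python sketch. The point values and question counts come from the analogy itself; the function and variable names are illustrative.

```python
POINTS = {"addition": 1, "multiplication": 4}
EXAMS = {
    "Class 1": {"addition": 75, "multiplication": 25},
    "Class 2": {"addition": 65, "multiplication": 35},
}


def summarize(exam, answered_correctly):
    """Return (points earned, percent of questions correct) for one student."""
    counts = EXAMS[exam]
    points = sum(POINTS[kind] * counts[kind] for kind in answered_correctly)
    correct = sum(counts[kind] for kind in answered_correctly)
    percent = 100 * correct / sum(counts.values())
    return points, percent


# Each student answers every question of the listed kinds correctly and misses the rest.
students = {
    "A": ("Class 1", ["addition"]),
    "B": ("Class 1", ["multiplication"]),
    "C": ("Class 2", ["addition"]),
    "D": ("Class 2", ["multiplication"]),
}
for name, (exam, kinds) in students.items():
    points, percent = summarize(exam, kinds)
    print(f"Student {name}: {points} points, {percent:.0f}% correct")
```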

To conclude, the percentage scores should be seen as an indication that you did better in one domain than in another on that particular test. You cannot compare percentages across tests because each test has a mix of items with differing difficulties and, therefore, different weights in the final overall exam score.

When CCHI applies for accreditation or re-accreditation of its exams, the equating procedures are submitted to the accrediting body for review. Accreditation is a form of final review and confirmation that the accredited exam meets all the requirements to be fair and reliable.


* The last response, consisting of the audio response in the candidate’s Language Other Than English, is not scored. It is collected by CCHI and will be used in a required continuing education activity for CoreCHI-P certificants. More information about this usage will be available once the first CoreCHI-P credentials are awarded.
