
Comprehensive services to ensure program validity, credibility, and defensibility.

Solutions through Innovation

Psychometric Services

SMT's psychometric services focus on maintaining the consistency of meaning of reported scores and on assessing the overall quality of exam questions and test forms. Both classical and item response theory-based approaches are used to achieve these goals.


Classical test theory (CTT), or true score theory, assumes that every test-taker has a true score on an exam: the score that would be expected if the candidate took the exam an infinite number of times. Since it is not possible to administer the exam an infinite number of times, any single administration of the exam produces an observed score composed of the candidate's true score and an error component. CTT expresses candidate performance and reliability information at the test level, and is useful in determining exam quality for smaller testing programs.
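The CTT decomposition above can be sketched with a small simulation (illustrative only; the candidate counts, score scale, and error variance below are hypothetical, not SMT's actual procedure):

```python
import random

random.seed(0)

# CTT sketch: each candidate has a fixed true score T; any single
# administration yields an observed score X = T + E, where E is random
# measurement error.
n_candidates = 5000
true_scores = [random.gauss(70, 10) for _ in range(n_candidates)]
errors = [random.gauss(0, 5) for _ in range(n_candidates)]
observed = [t + e for t, e in zip(true_scores, errors)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Test-level reliability under CTT: the proportion of observed-score
# variance attributable to true-score variance.
reliability = variance(true_scores) / variance(observed)
print(round(reliability, 2))  # close to 100 / (100 + 25) = 0.8
```

Because error variance adds to observed-score variance, reliability falls as measurement error grows, which is why CTT reliability is a test-level summary rather than an item-level one.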


Item response theory (IRT), or modern latent trait theory, in contrast to CTT, expresses candidate performance at the item level. IRT provides a framework for evaluating the performance of individual items on an exam so that "good" items can be distinguished from "bad" items. This can improve exam programs by determining which items are appropriate for candidates and by providing useful information for distinguishing between competent and non-competent practitioners. In addition, IRT is a test-independent measurement model, so candidate performances can be compared across multiple forms of an exam. IRT is very powerful, but it carries strong assumptions and sample-size requirements that render it suitable only for larger testing programs.
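A minimal sketch of the item-level view, using the two-parameter logistic (2PL) model, one common IRT model (the parameter values below are hypothetical):

```python
import math

# 2PL item response function: probability that a candidate with ability
# theta answers an item correctly, given discrimination a and difficulty b.
def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A "good" item (high discrimination) separates low- and high-ability
# candidates far more sharply than a "bad" one (low discrimination).
good = p_correct(1.0, a=2.0, b=0.0) - p_correct(-1.0, a=2.0, b=0.0)
bad = p_correct(1.0, a=0.3, b=0.0) - p_correct(-1.0, a=0.3, b=0.0)
print(round(good, 2), round(bad, 2))
```

Because the model is expressed in terms of a candidate's ability rather than a particular test form, the same ability estimate can be compared across forms built from calibrated items.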


To ensure that passing standards are consistent across forms of an exam, SMT utilizes various equating methodologies. Equating is a statistical process that produces comparable scores on forms that differ in difficulty; to equate, there must be a common scale that links the forms together. Maintaining this scale is crucial for a valid and defensible exam program.
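Linear equating, one common methodology, can be sketched as follows (the summary statistics are hypothetical; operational equating designs are more involved):

```python
# Linear equating sketch: map a score x on Form X onto the scale of Form Y
# by matching the means and standard deviations of the two forms.
mean_x, sd_x = 62.0, 8.0   # Form X summary statistics (hypothetical)
mean_y, sd_y = 58.0, 10.0  # Form Y summary statistics (hypothetical)

def equate(x):
    # Standardize on Form X, then re-express on Form Y's scale.
    return mean_y + sd_y * (x - mean_x) / sd_x

# A raw score of 70 on Form X corresponds to 68 on Form Y's scale,
# so candidates are held to a comparable standard despite the
# difference in form difficulty.
print(equate(70.0))
```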


One important aspect of scale maintenance is pretesting: the addition of zero-weighted (pretest) items to an exam. These items do not count toward a candidate's score. Pretesting provides verification that the items are relevant to competency and contribute toward measuring a candidate's proficiency in the material. The statistical data gathered from pretesting are analyzed to determine whether an item performs within an acceptable range. Once an item's statistical performance is verified, it is added to the scale and can be used as a weighted item on future exam forms.


An item analysis is a useful tool for detecting exam items that are performing poorly and are not discriminating well between high-ability and low-ability candidates. Items that are determined to be statistically flawed are presented to SMEs for review and, if necessary, editing. An item analysis includes the following calculations:

  • Number and percentage of candidates selecting each response
  • Difficulty index (p-value) for each item
  • Discrimination index (r-bis) for each item
  • Mean score and standard deviation
  • Reliability coefficient (KR-20)
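Several of the calculations above can be sketched over a tiny hypothetical scored-response matrix (illustrative values only; the point-biserial below is a common discrimination index closely related to the r-bis named above):

```python
import math

# Rows = candidates, columns = items; 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
n, k = len(responses), len(responses[0])
totals = [sum(row) for row in responses]  # each candidate's raw score

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Difficulty index (p-value): proportion answering each item correctly.
p_values = [mean([row[j] for row in responses]) for j in range(k)]

# Item-total point-biserial correlation as a discrimination index.
def point_biserial(j):
    item = [row[j] for row in responses]
    cov = mean([i * t for i, t in zip(item, totals)]) - mean(item) * mean(totals)
    return cov / math.sqrt(var(item) * var(totals))

# KR-20 reliability coefficient for dichotomously scored items.
kr20 = (k / (k - 1)) * (1 - sum(p * (1 - p) for p in p_values) / var(totals))

print(p_values, round(kr20, 2))
```

An item with a very high or very low p-value, or a low (or negative) point-biserial, would be flagged for SME review.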

IRT Parameters

For exam programs that utilize IRT methodology for scale maintenance, IRT fit statistics are produced and analyzed for all items on the exam form. IRT fit statistics indicate whether the statistical performance of each item is at an acceptable level and whether each item fits the established exam scale. If not, these items are recommended for SME review to determine whether they should be rewritten and pretested, or removed from the item bank.
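One family of fit statistics can be sketched under a simple Rasch model (illustrative only; operational fit statistics and flagging thresholds vary by program, and the ability values below are hypothetical):

```python
import math

# Rasch item response function: difficulty b, ability theta.
def rasch_p(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def outfit(item_responses, thetas, b):
    # Outfit mean-square: mean of squared standardized residuals between
    # observed 0/1 responses and model-expected probabilities. Values near
    # 1.0 indicate acceptable fit; large values suggest the item misfits
    # the established scale.
    total = 0.0
    for x, theta in zip(item_responses, thetas):
        p = rasch_p(theta, b)
        total += (x - p) ** 2 / (p * (1 - p))
    return total / len(item_responses)

thetas = [-1.0, -0.5, 0.0, 0.5, 1.0]
well_fitting = [0, 0, 1, 1, 1]  # responses consistent with b = 0.0
misfitting = [1, 1, 0, 0, 0]    # responses reversed against ability
print(round(outfit(well_fitting, thetas, 0.0), 2),
      round(outfit(misfitting, thetas, 0.0), 2))
```

The misfitting response pattern produces a much larger outfit value, which is the kind of signal that would route an item to SME review.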


Technical test reports (TTRs) provide a comprehensive look at how an exam performed. Each TTR provides an evaluation of how an exam fulfills the recommendations of the authoritative Standards for Educational and Psychological Testing and includes the following where applicable:

  • Mean and standard deviation
  • Standard error of measurement
  • KR-20 reliability
  • Decision consistency reliability
  • Aggregate p-value and r-bis statistics
  • Raw to scaled score conversion
  • Passing percentage
  • Aggregate IRT statistics and test characteristic curve (TCC)
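The raw-to-scaled score conversion listed above can be sketched as a linear transformation (the anchor points below are hypothetical; programs choose their own reporting scales):

```python
# Report scores on a fixed scale so the passing score reads the same
# (here, 500) on every form, even as the equated raw cut score moves
# from form to form.
raw_cut, raw_max = 62, 100         # this form's raw passing score and maximum
scaled_cut, scaled_max = 500, 800  # fixed reporting-scale anchor points

def scale(raw):
    slope = (scaled_max - scaled_cut) / (raw_max - raw_cut)
    return round(scaled_cut + slope * (raw - raw_cut))

print(scale(62), scale(100))  # the cut and the maximum map to 500 and 800
```

Because the cut score is pinned to the same scaled value on every form, candidates and stakeholders see a stable passing standard regardless of form difficulty.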