    Policymakers and school administrators have embraced value-added models of teacher effectiveness as tools for educational improvement. Teacher value-added estimates may be viewed as a kind of complicated test score, which suggests using a test validation framework to examine their reliability and validity. Validation begins with an interpretive argument for inferences or actions based on value-added scores. That argument addresses (a) the meaning of the scores themselves--whether they measure the intended construct; (b) their generalizability--whether, for example, the results are stable from year to year or across different student tests; and (c) the relation of value-added scores to broader notions of teacher effectiveness--whether teachers' effectiveness in raising test scores can serve as a proxy for other aspects of teaching quality. Next, the interpretive argument directs attention to rationales for the expected benefits of particular uses or interpretations of value-added scores, as well as to plausible unintended consequences. This kind of systematic analysis raises serious questions about some popular policy prescriptions based on teacher value-added scores.
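
As a point of reference, a teacher value-added score is typically an estimated teacher effect from a regression of students' current test scores on their prior scores (and often other covariates). The sketch below illustrates that basic idea with simulated data; the variable names, model, and numbers are illustrative assumptions, not the specific models analyzed in the paper.

```python
# Minimal value-added sketch (simulated, hypothetical data): estimate teacher effects
# from a regression of current scores on prior scores plus teacher indicators.
import numpy as np

rng = np.random.default_rng(0)
n_teachers, n_students = 20, 25

teacher = np.repeat(np.arange(n_teachers), n_students)        # teacher assignment per student
prior = rng.normal(0.0, 1.0, size=teacher.size)               # prior-year score
true_effect = rng.normal(0.0, 0.2, size=n_teachers)           # "true" teacher effects (unknown in practice)
current = 0.7 * prior + true_effect[teacher] + rng.normal(0.0, 0.5, size=teacher.size)

# Design matrix: prior score plus one indicator column per teacher (no separate intercept).
X = np.column_stack([prior, (teacher[:, None] == np.arange(n_teachers)).astype(float)])
coef, *_ = np.linalg.lstsq(X, current, rcond=None)

# A teacher's value-added score is the estimated teacher effect, centered on the average teacher.
value_added = coef[1:] - coef[1:].mean()
print(np.round(value_added[:5], 2))
```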

    In 1998 and again in 2002, samples of eighth-grade students in California were tested in reading as part of the state-level component of the National Assessment of Educational Progress (NAEP). In each of these years, all eighth graders in the state were also required to participate in the state's accountability testing, which included the reading test of the Stanford Achievement Tests, Ninth Edition (SAT 9). State-level comparisons of performance on these two assessments showed improvement on the SAT 9 but a slight (not statistically significant) decline on NAEP. To examine whether this trend discrepancy might be attributable to content differences between the two tests, SAT 9 reading items were coded into categories corresponding to the NAEP content strands, plus a category for items not aligned to the NAEP framework. Analyses of performance within strands indicate that content differences probably cannot explain the discrepant trends on the state accountability test versus NAEP, although differences related to item format remain a strong possibility. Implications and alternative explanations are discussed.
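
The within-strand comparison described here amounts to regrouping item-level results by NAEP content strand and comparing each strand's performance across the two years. The sketch below shows that bookkeeping with made-up item proportions correct; the strand labels and numbers are illustrative, not the paper's data.

```python
# Sketch (made-up numbers): group SAT 9 reading items by NAEP content strand
# and compare within-strand performance across the two years.
from collections import defaultdict

# (strand, year, proportion correct) for each item; all values are illustrative.
items = [
    ("literary", 1998, 0.62), ("literary", 2002, 0.66),
    ("informational", 1998, 0.58), ("informational", 2002, 0.61),
    ("not_aligned", 1998, 0.55), ("not_aligned", 2002, 0.60),
]

by_cell = defaultdict(list)
for strand, year, p in items:
    by_cell[(strand, year)].append(p)

for strand in ("literary", "informational", "not_aligned"):
    mean_1998 = sum(by_cell[(strand, 1998)]) / len(by_cell[(strand, 1998)])
    mean_2002 = sum(by_cell[(strand, 2002)]) / len(by_cell[(strand, 2002)])
    print(f"{strand:14s} change in mean proportion correct: {mean_2002 - mean_1998:+.3f}")
```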

    The paper initially describes the sources of uncertainty in National Assessment of Educational Progress (NAEP) data and standard errors. As NAEP sample sizes have increased, the program has attained greater precision; for this reason, exclusion effects are increasingly important. Two scenarios of revised NAEP results are presented (for New York City and for the nation) that reflect the possible results if all excluded students had been included in the data analysis: the overall NAEP results from the two recalculation scenarios vary considerably. Even where exclusion rates are constant, exclusion may affect score comparisons; when exclusion rates are not constant over time, the effects of exclusions on data comparisons can be significant. NAEP results can be affected by the percentage of students identified as Students with Disabilities (SD) or limited English proficient (LEP) in states or districts, as well as by exclusion rates. The paper presents estimated recalculated results for the Trial Urban District Assessments under the two scenarios to show how the rank orders of the districts' performance might have changed substantially. Student subgroup results may also change with increased inclusion. The effects of exclusions on NAEP data reliability can be minimized (1) by minimizing exclusions; (2) by establishing exclusion criteria that are as clear and objective as possible and working to ensure that those criteria are adhered to; and (3) by making practices and criteria across states as uniform as possible. Remedies for the general effect of exclusions include the following: (1) efforts to minimize exclusions should continue; (2) NAEP users should be reminded more often that the students tested do not represent the entire population; and (3) research should continue on the utility of imputation models that might be used to adjust for the effects of exclusions. Consequences of differential exclusion policies may be serious: if such policies vary for the two time points, groups, or jurisdictions compared, then the fundamental inference that the tested samples represent the full student populations is compromised. The observed change, contrast, or performance gap in fact represents some mixture of differences in actual student achievement distributions and differences in the decision rules determining whom to test. The increasing accuracy of NAEP statistics has made even small distortions more important than they once were. For the most part, the effects of exclusions on reliability can be offset by increasing sample sizes; the effects of exclusions on validity are more problematic. It is important that NAEP continue to keep exclusions to a minimum, and that efforts be made to work toward more uniform policies and practices for determining which students should be excluded from NAEP and which should be tested. [This paper was commissioned by NAGB to serve as background information for conference attendees at the NAGB Conference on Increasing the Participation of SD and LEP Students in NAEP.]
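
The recalculation scenarios rest on simple weighted-average arithmetic: the result that would have been observed with full inclusion blends the observed results with an assumption about how the excluded students would have scored. The sketch below shows that arithmetic with hypothetical numbers; it does not reproduce the paper's specific scenarios or data.

```python
# Recalculation arithmetic (hypothetical numbers, not the paper's scenarios): blend the
# observed mean with an assumed mean for the students who were excluded.
def recalculated_mean(observed_mean, exclusion_rate, assumed_excluded_mean):
    """Estimated mean if excluded students had been tested and scored at the assumed level."""
    return (1 - exclusion_rate) * observed_mean + exclusion_rate * assumed_excluded_mean

observed = 238.0                           # hypothetical reported scale score
for rate in (0.04, 0.08):                  # two hypothetical exclusion rates
    for scenario in (200.0, 215.0):        # two hypothetical assumptions about excluded students
        adjusted = recalculated_mean(observed, rate, scenario)
        print(f"exclusion {rate:.0%}, assumed excluded mean {scenario:.0f} -> {adjusted:.1f}")
```

Because the adjustment scales with the exclusion rate, jurisdictions or years with different rates are shifted by different amounts, which is why differential exclusion distorts comparisons even when each jurisdiction's own results look stable.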

    Large-scale testing programs often require multiple forms to maintain test security over time or to enable the measurement of change without repeating the identical questions. The comparability of scores across forms is consequential: Students are admitted to colleges based on their test scores, and the meaning of a given scale score one year should be the same as for the previous year. Agencies set scale-score cut points defining passing levels for professional certification, and fairness requires that these standards be held constant over time. Large-scale evaluations or comparisons of educational programs may require pretest and posttest scale scores in a common metric. In short, to allow interchangeable use of alternate forms of tests built to the same content and statistical specifications, scores based on different sets of items must often be placed on a common scale, a process called test equating.
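
One common equating method is equipercentile equating, in which a score on the new form is mapped to the reference-form score with the same percentile rank. The sketch below illustrates the idea with simulated scores; it is a bare-bones illustration under assumed score distributions, not the smoothed procedures used by any particular operational program.

```python
# Simplified equipercentile equating sketch (simulated scores): map each new-form score
# to the old-form score that has the same percentile rank.
import numpy as np

rng = np.random.default_rng(1)
old_form = rng.normal(50, 10, size=5000)   # scores on the reference form
new_form = rng.normal(47, 11, size=5000)   # scores on the new (slightly harder) form

def equate(scores, new_ref, old_ref):
    """Percentile rank on the new form -> corresponding quantile of the old form."""
    ranks = np.searchsorted(np.sort(new_ref), scores) / len(new_ref)
    return np.quantile(old_ref, np.clip(ranks, 0.0, 1.0))

print(np.round(equate(np.array([37.0, 47.0, 58.0]), new_form, old_form), 1))
```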

    Problems of scale typically arise when comparing test score trends, gaps, and gap trends across different tests. To overcome some of these difficulties, we can express the difference between the observed test performance of two groups with graphs or statistics that are metric-free (i.e., invariant under positive monotonic transformations of the test score scale). In a series of studies presented in three parts, we develop a framework for the application of metric-free methods to routine policy questions. The first part introduces metric-free methodology and demonstrates the advantages of these methods when test score scales do not have defensible interval properties. The second part uses metric-free methods to compare gaps in Hispanic-White achievement in California across four testing programs over a 7-year period. The third part uses metric-free methods to compare trends on "high-stakes" state reading test scores with state score trends on the National Assessment of Educational Progress from 2002 to 2003. As a whole, this series of studies represents an argument for the usefulness of metric-free methods for quantifying trends and gaps and for their superiority when comparing trends and gaps across tests with different score scales.
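
A concrete example of a metric-free statistic is the probability that a randomly chosen student from one group outscores a randomly chosen student from the other: any positive monotonic rescaling of the scores leaves that probability unchanged. The sketch below demonstrates the invariance with simulated scores; the particular statistics and transformation shown are illustrative and not necessarily those used in the studies.

```python
# Sketch (simulated data): a metric-free gap statistic -- P(a random member of group A
# outscores a random member of group B) -- is unchanged by monotonic rescaling.
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
group_a = rng.normal(0.3, 1.0, size=1000)
group_b = rng.normal(0.0, 1.0, size=1000)

def gap_stats(a, b):
    diff = a[:, None] - b[None, :]
    p = float((diff > 0).mean() + 0.5 * (diff == 0).mean())  # P(A > B), ties counted half
    v = np.sqrt(2) * NormalDist().inv_cdf(p)                 # a normal-deviate (effect-size-like) transform
    return round(p, 3), round(float(v), 3)

print(gap_stats(group_a, group_b))                       # gap on the original scale
print(gap_stats(np.exp(group_a), np.exp(group_b)))       # identical under a monotonic transform
```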
