    Test scores are commonly reported in a small number of ordered categories. Examples of such reporting include state accountability testing, Advanced Placement tests, and English proficiency tests. This paper introduces and evaluates methods for estimating achievement gaps on a familiar standard-deviation-unit metric using data from these ordered categories alone. These methods hold two practical advantages over alternative achievement gap metrics. First, they require only categorical proficiency data, which are often available where means and standard deviations are not. Second, they result in gap estimates that are invariant to score scale transformations, providing a stronger basis for achievement gap comparisons over time and across jurisdictions. We find three candidate estimation methods that recover full-distribution gap estimates well when only censored data are available. [This paper is published in "Journal of Educational and Behavioral Statistics" v37 n4 p489-517 Aug 2012 (EJ973866).]
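
    A minimal sketch of the kind of estimator this abstract describes: if both groups' scores are assumed normal on some common, unknown monotone transformation of the scale ("respective normality"), the probit-transformed cumulative proportions below each cut score fall on a line, and an SD-unit gap can be recovered from category counts alone. The function below is an illustrative least-squares probit fit, not necessarily the paper's exact estimator.

```python
import numpy as np
from scipy.stats import norm

def coarsened_gap(counts_a, counts_b):
    """Estimate an SD-unit gap from counts in K ordered proficiency categories.

    Assumes 'respective normality': both groups are normal on some common,
    unknown monotone transformation of the score scale. Illustrative
    least-squares probit fit; not necessarily the paper's exact estimator.
    """
    counts_a = np.asarray(counts_a, dtype=float)
    counts_b = np.asarray(counts_b, dtype=float)
    # Cumulative proportions below each of the K-1 interior cut scores.
    cum_a = np.cumsum(counts_a)[:-1] / counts_a.sum()
    cum_b = np.cumsum(counts_b)[:-1] / counts_b.sum()
    # Probit transform: z_gk = (c_k - mu_g) / sigma_g for each cut c_k.
    za, zb = norm.ppf(cum_a), norm.ppf(cum_b)
    # Regressing A's probits on B's gives slope sigma_b/sigma_a and
    # intercept (mu_b - mu_a)/sigma_a.
    slope, intercept = np.polyfit(zb, za, 1)
    sigma_ratio = 1.0 / slope               # sigma_a / sigma_b
    mean_diff = -intercept / slope          # (mu_a - mu_b) in sigma_b units
    pooled_sd = np.sqrt((sigma_ratio**2 + 1.0) / 2.0)  # in sigma_b units
    return mean_diff / pooled_sd

# Four ordered categories (e.g., Below Basic ... Advanced); gap of roughly 0.5 SD.
print(coarsened_gap([120, 260, 430, 190], [260, 330, 310, 100]))
```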

    Problems of scale typically arise when comparing test score trends, gaps, and gap trends across different tests. To overcome some of these difficulties, we can express the difference between the observed test performance of two groups with graphs or statistics that are metric-free (i.e., invariant under positive monotonic transformations of the test score scale). In a series of studies broken into three parts, we develop a framework for the application of metric-free methods to routine policy questions. The first part introduces metric-free methodology and demonstrates the advantages of these methods when test score scales do not have defensible interval properties. The second part uses metric-free methods to compare Hispanic-White achievement gaps in California across four testing programs over a 7-year period. The third part uses metric-free methods to compare trends in "high-stakes" state reading test scores to state score trends on the National Assessment of Educational Progress from 2002 to 2003. As a whole, this series of studies represents an argument for the usefulness of metric-free methods for quantifying trends and gaps and the superiority of metric-free methods for comparing trends and gaps across tests with different score scales. (Contains 16 figures, 4 tables, and 3 notes.)
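
    One widely used metric-free statistic of this kind is the probability that a randomly chosen member of one group outscores a randomly chosen member of the other, along with its probit transform; both are unchanged by any positive monotone rescaling of the scores. The sketch below is illustrative and is not necessarily the exact statistic used in these studies.

```python
import numpy as np
from scipy.stats import norm

def metric_free_gap(scores_a, scores_b):
    """Two metric-free gap summaries for groups A and B.

    p_ab: probability that a randomly drawn A score exceeds a randomly
          drawn B score (ties counted half).
    v:    sqrt(2) * Phi^{-1}(p_ab), which equals the standardized mean
          difference if both distributions are normal with equal variances.
    Both are invariant under positive monotone rescalings of the scores.
    Illustrative; not necessarily the exact statistics used in these studies.
    """
    a = np.asarray(scores_a)[:, None]
    b = np.asarray(scores_b)[None, :]
    p_ab = (a > b).mean() + 0.5 * (a == b).mean()
    v = np.sqrt(2.0) * norm.ppf(p_ab)
    return p_ab, v

rng = np.random.default_rng(0)
a = rng.normal(0.5, 1.0, 2000)
b = rng.normal(0.0, 1.0, 2000)
print(metric_free_gap(a, b))                  # p_ab near 0.64, v near 0.5
print(metric_free_gap(np.exp(a), np.exp(b)))  # identical despite rescaling
```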

    For many teachers, the classroom observation has been the only opportunity to receive direct feedback from another school professional. As such, it is an indispensable part of every teacher evaluation system. Yet it also requires a major time commitment from teachers, principals, and peer observers. To justify the investment of time and resources, a classroom observation should be both accurate and reliable. In this paper, the authors evaluate the accuracy and reliability of school personnel in performing classroom observations. The authors also examine different combinations of observers and lessons that produce reliability of 0.65 or above when using school personnel. They asked principals and peers in Hillsborough County, Florida, to watch and score videos of classroom teaching for 67 teacher-volunteers, using lessons captured during the 2011-12 school year. Each of 129 observers provided 24 scores on lessons assigned to them, yielding more than 3,000 video scores for this analysis. The authors briefly summarize seven key findings: (1) Observers rarely used the top or bottom categories ("unsatisfactory" and "advanced") on the four-point observation instrument; (2) Compared to peer raters, administrators differentiated more among teachers; (3) Administrators rated their own teachers 0.1 points higher than administrators from other schools and 0.2 points higher than peers; (4) Although administrators scored their own teachers higher, their rankings were similar to the rankings produced by others outside their schools; (5) Allowing teachers to choose their own videos generated higher average scores, although the relative ranking of teachers was preserved whether or not videos were chosen; (6) When an observer formed a positive (or negative) impression of a teacher in the first several videos, that impression tended to linger; and (7) There are a number of different ways to achieve reliability of 0.65 or above. The authors conclude by discussing the implications for the design of teacher evaluation systems in practice. (Contains 7 figures, 10 tables, and 21 footnotes.)
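
    A generalizability-theory-style projection illustrates how combinations of lessons and observers trade off against a 0.65 reliability target: averaging over more lessons and more observers shrinks lesson- and rater-related error variance relative to stable teacher differences. The variance components below are hypothetical, chosen only for illustration; they are not estimates from this study.

```python
def projected_reliability(n_lessons, n_obs, var_teacher, var_lesson, var_resid):
    """Projected reliability of a teacher's mean observation score when each
    of n_lessons lessons is scored by n_obs observers.

    Stable between-teacher variance is signal; lesson-to-lesson and
    rater/occasion variance are averaged down by the design.
    """
    error = var_lesson / n_lessons + var_resid / (n_lessons * n_obs)
    return var_teacher / (var_teacher + error)

# Hypothetical variance components (not estimates from this study),
# compared against a 0.65 reliability target.
components = dict(var_teacher=0.14, var_lesson=0.04, var_resid=0.16)
for n_lessons, n_obs in [(1, 1), (2, 1), (3, 1), (2, 2), (4, 1)]:
    r = projected_reliability(n_lessons, n_obs, **components)
    print(f"{n_lessons} lesson(s) x {n_obs} observer(s): reliability = {r:.2f}")
```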

    Ho and Reardon (2012) present methods for estimating achievement gaps when test scores are coarsened into a small number of ordered categories, preventing fine-grained distinctions between individual scores. They demonstrate that gaps can nonetheless be estimated with minimal bias across a broad range of simulated and real coarsened data scenarios. In this paper, we extend this previous work to obtain practical estimates of the imprecision imparted by the coarsening process and of the bias imparted by measurement error. In the first part of the paper, we derive standard error estimates and demonstrate that coarsening leads to only very modest increases in standard errors under a wide range of conditions. In the second part of the paper, we describe and evaluate a practical method for disattenuating gap estimates to account for bias due to measurement error. [This paper was published in "Journal of Educational and Behavioral Statistics" v40 n2 p158-189 Apr 2015 (EJ1057843).]
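
    A classical-test-theory sketch of the disattenuation idea: measurement error inflates the observed-score standard deviation, so a gap expressed in observed-score SD units understates the true-score gap by roughly a factor of the square root of the reliability (assuming similar reliability in both groups). This is illustrative only, not necessarily the paper's exact adjustment.

```python
import math

def disattenuate_gap(gap_hat, reliability):
    """Adjust an SD-unit gap estimate for attenuation due to measurement error.

    Under classical test theory with (roughly) equal reliability in both
    groups, error variance inflates the observed-score SD, so the gap in
    true-score SD units is larger by a factor of 1/sqrt(reliability).
    Illustrative sketch; not necessarily the paper's exact adjustment.
    """
    return gap_hat / math.sqrt(reliability)

# A gap of 0.45 observed-score SDs on a test with reliability 0.85
# corresponds to roughly 0.49 true-score SDs.
print(round(disattenuate_gap(0.45, 0.85), 3))
```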