In 2009, researchers from the Buros Center for Testing at the University of Nebraska-Lincoln (my graduate school) and from the Center for Educational Assessment at the University of Massachusetts Amherst published the Final Report for their Evaluation of the National Assessment of Educational Progress (NAEP), which was conducted under a congressional mandate. The full report can be found here.
In the executive summary, the authors state the following:
Comparing student achievement on NAEP across states is complicated. To appreciate the challenges in making state-by-state comparisons, it is necessary to understand the sampling design adopted by NAEP and its potential impact on the results and their interpretations. In NAEP’s multistage cluster sampling procedure, not all students take the assessment, and those students who do take NAEP respond to a subset of the NAEP items in each content area. While this allows for a broad sampling of items from any one content domain, the extent to which subgroups of students are represented adequately in NAEP’s state samples is of concern.
As reported in the current evaluation, NAEP’s sampling procedures do not ensure adequate representation of various subgroups (including those defined by race and ethnicity) within some states, putting valid interpretations about subgroup performances within a state and across states at risk. Using NAEP to verify state results regarding the achievement of students with disabilities is also problematic because decisions about inclusion and allowable accommodations are made at the state level. Because states vary in their inclusion rates and in their treatment of accommodations for NAEP, the validity of state-by-state comparisons is debatable.
Below are the main concerns and recommendations the researchers stated regarding NAEP and appropriate uses of its data.
- There is not an organized validity framework for the exam, which is needed given the complexity and multiple uses of NAEP. According to the report, "An organized validity framework takes into account the history of the assessment program, current learning theory, and content-performance expectations from the subject-matter field and related professions. It also addresses contemporary issues in current interpretations and uses of the assessment and anticipates future appropriate and inappropriate uses and consequences of the assessment."
- Additional studies are warranted if NAEP is to be used to verify state assessment results. According to the report, "As reported in the current evaluation, there are numerous factors that can jeopardize the validity of interpretations when using NAEP to verify state results. These include differences in content being assessed, differences in standard setting policies and procedures, differences in the definition of the achievement levels, and differences in the representation of the NAEP state samples. Additional alignment studies that evaluate the congruency between the content assessed by NAEP and state content standards and assessment are crucial. The sampling procedures for NAEP should also be studied. Representation of subgroups across states varies considerably as do the inclusion and exclusion rates for students with disabilities, impacting the validity of the use of NAEP results for state-by-state comparisons and for verifying state assessment results."
- Revise the review processes for NAEP technical reports and manuals to facilitate their timely release. "Currently, release of NAEP technical documentation can be years after results have been released, exceeding what testing programs should tolerate... There are several reasons for releasing timely technical documentation; primarily, it assists users in understanding appropriate uses and limitations of NAEP scores."
- Other measures of U.S. students’ educational achievement do not provide strong sources of external validity evidence for NAEP achievement levels. "It is a challenge to gather validity evidence from multiple sources outside a standard-setting study that can be used to evaluate achievement levels. Furthermore, external data are not perfect evaluation evidence due to potential differences in content, sample, and purpose. For example, some tests (like well-known college admissions tests—e.g., the SAT and ACT) involve self-selected samples of college-bound seniors, not a nationally representative sample. In many cases external tests serve purposes that are very different from NAEP. As the differences between what tests purport to do and what they measure increase, the utility of these measures as external evidence decreases."
- NAEP should continue to explore methodologies for setting achievement levels. "Stakeholders continue to use achievement levels as one means of interpreting NAEP results. NAEP has engaged in extensive research on standard-setting since 1992 to improve its practice. Some of this research includes the pilot studies done on the new Mapmark method (Schulz and Mitzel, 2005). However, because this new methodology is not widely used, more research on whether it is appropriate for other NAEP subject areas is needed. Although we conclude that the new methodology worked well with the experts involved in the study on the 2005 grade 12 mathematics assessment, the degree to which the method will work with experts from other subject areas cannot be determined from this evaluation."
- NAEP should prioritize gathering external validity evidence that evaluates the intended uses and interpretations of its achievement levels. "The validity evidence collected by NAEP from internal and procedural sources suggest that the methodology was implemented as intended and that panelists had a positive experience with the process. However, the reasonableness of the results is a judgmental decision by policymakers who should consider additional sources of information. External validity evidence is an additional source of information to help policymakers make the final policy decisions about NAEP achievement levels. Such evidence may include results from additional standard-setting methods, state university entrance levels at the high school level, and transcript studies that evaluate course performance. The extent to which the sources of evidence may converge is affected by the intended uses and interpretations of NAEP’s achievement levels as articulated in a validity framework."
- Current NAEP inclusion and participation policies and rates may not provide evidence to support intended uses and interpretations of NAEP. "As mentioned earlier, the intended uses and interpretations of NAEP results should be defined in a validity framework and related to how different types of students and schools are included in the results. Unlike state assessment programs developed for NCLB, all students do not take NAEP. Further, those who take NAEP do not take a full assessment but rather a sample of its content. Thus, those included or excluded can influence the results and any score interpretations. This is particularly true for students with disabilities (SWD) and English language learners (ELL). Decisions about inclusion and accommodations of SWD and ELL are made at the state level... Beyond inclusion policies, participation is also an important consideration. NAEP remains a voluntary assessment for students. Therefore, nonresponse and refusal to participate represent potential threats to the validity of NAEP scores, particularly for grade 12 and private school samples. For example, Chromy (2005) noted that recent student participation rates for grade 12 (74 percent) were considerably lower than grade 4 (94 percent) and grade 8 (92 percent). It is also unclear whether current sampling plans include all potential subgroups of interest within a state, such as students with specific ethnicities, disabilities, varying language proficiencies, and free and reduced-priced lunch program status."
- Intended users were not familiar with NAEP scale scores and had difficulty distinguishing between achievement levels on NAEP and those that were developed by states for NCLB reporting purposes. "Most participants in our utility studies identified NAEP with state-level results. This represents a communications challenge for the future because of stakeholders’ familiarity with the reporting scales and achievement levels used for their state’s own NCLB assessment. For example, there was confusion among participants between state and NAEP achievement level results. This led to recognition that states’ definitions of Proficient are perhaps different from NAEP’s definition of Proficient. However, the nature of such differences is not readily apparent. Another source of confusion is that NAEP defines three achievement levels (i.e. basic, proficient, and advanced), yet often indirectly reports student performance at four levels (i.e. below basic, basic, proficient, and advanced). No policy definition for the achievement level below basic exists."
- Prioritize score reporting and interpretation as an area for research in the NAEP program. "Systematic studies of methods to report NAEP scale scores and achievement levels should be carried out with stakeholder groups prior to their operational use. Although some of this research may include print media, a more critical focus for evaluation is the expanding presence of NAEP on the World Wide Web. Where appropriate, the NAEP elements on the Web should be revised to represent empirical findings about ease of use, stakeholder interests, and accepted Web site development practices. Because NAEP reporting continues to invest in the use of interactive, online tools, the utility of these features must also be assessed."