Chester,  

I read the opinion piece you co-authored with William Bennett entitled "The Real Improvement in Texas Schools," which appeared in The New York Times on October 27, 2000.

As my charitable foundation has granted over $5M to education in the past few years, I have a serious interest in the subject of educational effectiveness.

Since you clearly have a much greater understanding of the recent RAND study than most people, I have a few questions about your opinion piece which I hope you will answer for me and for others who have read the RAND paper.

In your article, you write:

"He [Bush] can rightly claim very impressive accomplishments in the area of education; the closer people look at the Texas record, the better it is. Nothing in the latest report contradicts this."

I do not understand why, if there are such impressive educational gains in Texas as you suggest above, they failed to show up at all in Table 1 and Table 2 of Klein's paper. How do you explain that? Does this mean that NAEP data is useless, as you seem to imply in your editorial? But if that is the case, then don't we need to totally discount the earlier RAND report, since it was based exclusively on NAEP data? Is that what you are saying here? Or was there a calculation error? A methodology error? An error in the NAEP data? Table 1 and Table 2 of Klein's paper used NAEP scale scores taken right off the NCES web site, and NAEP scores, as you know, are the widely accepted "gold standard" measure of academic improvement. Even the Bush campaign admits this, as they have clearly given credibility to the earlier RAND study, which also used only NAEP scores. So are you disagreeing with the Bush campaign here?

You failed to mention in your article that the earlier 250-page RAND study (Grissmer) focused on a date range that had very little overlap with the time Bush was in office (it ran from 1990 to 1996). Was that an oversight in writing your article? Or did you not realize this when you read the study?

In your article, you write:

"First, the 14-page study is relatively unsophisticated and is based on incomplete data."

That's a little too vague for me. Can you tell me specifically what data is missing? Klein used the same NAEP data as the earlier RAND report (both Texas and national data), supplemented it with more recent information that was not available for the earlier RAND study, and supplemented it also with Texas TAAS data (which the earlier RAND study didn't use at all). So if any data is missing, shouldn't we be challenging the older RAND report as incomplete, since it was missing the NAEP 1998 test data and all of the TAAS data?

Also, if the "missing data" in the newer Klein RAND report proves your point, why did you fail to mention this data and what it shows? Without supplying this "missing" data, your article draws a conclusion based on missing data. I'm sure you would not want to suffer from the same problem as you claim the Klein paper suffers from. So what is the data that is missing?

I heard that in a TV interview on the Today Show, a Bush aide claimed that the data on the Texas Education Agency's web site was not accurate. This seems strange, since there is no disclaimer about this on the Texas web site and Texas has had two or more years to fix any problems. However, if there is a problem with the TAAS data on the web site, then as far as I can tell there are only three possibilities here, and all three work against your point:

  • If the missing data shows that the posted TAAS scores were too low, then the true improvement was even larger than reported, which makes the TAAS scores that everyone has been using even less credible: an already "impossible to achieve" improvement becomes even harder to believe. 
  • If the missing data confirms the known TAAS scores were accurate, it gives you nothing "new." 
  • If the missing data shows that all the publicly available TAAS scores were too high, then it helps prove Klein's point that TAAS scores can't be trusted and it would confirm that there is no Texas miracle as you (and Bush) have claimed.

Did you ever call Klein to ask him about his paper, or to tell him about his missing or incorrect data so that he could issue a retraction? Is there a reason you didn't post the corrected data on your web site? Is there a reason you didn't publish a paper in EPAA or some other peer-reviewed educational journal that contains the missing data, or even gives someone a hint about what this data is, where you found it, and how you know it disproves Klein's findings? Many respected researchers have been fooled by Klein's paper, but without people like you stepping forward to share the truth, they remain in the dark. Why are you keeping this information secret?

In my research, the only comparative study I could find covering how things went over the first four years after Bush took office was Table 2 of Klein's paper. Is there another peer-reviewed study of NAEP data covering the period when Bush took office, one I am not aware of, that shows a result contradicting Klein's? If not, then shouldn't Table 2 (which showed no gain) be the most interesting and relevant measure of academic progress?

In your editorial, you said that Klein's study was "flawed and misleading." Yet Klein's paper used more recent NAEP data than the earlier RAND report (the Grissmer study). And it did not use scores that were manipulated by statistical modeling assumptions, and therefore prone to error (as in the Grissmer study), because each racial group was treated separately in Klein's paper. So the potential for biases stemming from modeling and assumptions was eliminated in Klein's paper, but not in Grissmer's study. So what specific flaws are present in Klein's paper? For example, Table 1 and Table 2 use the NAEP data directly. No fancy adjustments like in Grissmer, no nothing. Not even a hint of "fuzzy math." It was a simple subtraction. What is flawed and misleading about the data in Table 1 and Table 2? I checked this myself against the NAEP site and there is no error or omission. You can replicate these calculations yourself in minutes. Was there some mistake we both made? Why did you not point out what the error was?
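The "simple subtraction" in Klein's Tables 1 and 2 can be sketched in a few lines. The scale scores below are placeholders, not the actual NAEP values (those are on the NCES web site); the point is only the arithmetic: a state's relative gain is its own gain minus the nation's gain over the same period.

```python
# Illustrative sketch of the subtraction in Klein's Tables 1 and 2.
# The scale scores here are invented placeholders, NOT actual NAEP values.

def relative_gain(state_before, state_after, nation_before, nation_after):
    """Gain of a state relative to the nation, in NAEP scale-score points."""
    return (state_after - state_before) - (nation_after - nation_before)

# Hypothetical example: the state gains 4 points while the nation also
# gains 4 points, so the relative gain is zero, i.e., no improvement
# relative to the rest of the US.
print(relative_gain(212, 216, 211, 215))  # 0
```

Anyone can run the same subtraction on the published NAEP scale scores and check the tables in minutes.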

I also do not understand how you can possibly give any credibility at all to the TAAS scores when, according to Fig. 3, there was a .75 effect size for Black 8th graders (if we believe the TAAS scores), while there was only a .12 increase in 8th grade NAEP. As you know, a .75 increase is completely absurd. An improvement like that in the two years after Bush took office is more than miraculous. It's unbelievable! As a sophisticated scholar of educational improvement, you are well aware that such a large effect size substantially closes the gap between whites and blacks. In short, if the TAAS scores are true, Bush has accomplished the greatest achievement ever made in education in all of history (by a long shot). That is the kind of effect that should be proclaimed far and wide in every reputable education journal. Yet we find none of this. Why? Is there a conspiracy to keep such an enormous effect a secret? Why isn't this being copied by other states? By other countries? Why aren't there tons of academic papers examining this incredible miracle? I couldn't find any! Don't you find that a bit odd? I sure did!
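To see why the .75 figure is so implausible, it helps to put the effect sizes side by side. This is a back-of-the-envelope sketch: the 15-point-gain-on-SD-20 example and the one-standard-deviation black-white gap are assumed ballpark figures for illustration, not numbers from Klein's paper; only the .75 and .12 effect sizes come from Fig. 3.

```python
def effect_size(gain, sd):
    """Standardized effect size: raw score gain divided by the score SD."""
    return gain / sd

# Hypothetical illustration: a 15-point gain on a test with SD 20 is 0.75.
print(effect_size(15, 20))          # 0.75

taas_d = 0.75   # Fig. 3 effect size for Black 8th graders on TAAS
naep_d = 0.12   # corresponding 8th grade NAEP gain
gap_d = 1.0     # assumed ballpark black-white gap in SD units (not from Klein)

print(round(taas_d / gap_d, 2))     # TAAS implies 75% of the gap closed
print(round(taas_d / naep_d, 2))    # 6.25: over six times the NAEP gain
```

Under that ballpark gap assumption, the TAAS numbers would mean most of the black-white gap closed in two years, while NAEP shows only about an eighth of that movement.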

Could it be that the huge TAAS improvement is due to one or more of these factors:

(1) The repetition of certain item types that are easy to coach (e.g., TX reuses word problems that are nearly identical across tests, changing only the numbers, which makes TAAS "coachable" in a way that national tests are not),

(2) The high stakes (and perhaps unrealistic achievement goals) for teachers, which lead some of them to bend the rules through word lists, practice on test items, stretching time limits (which is harder to do in 8th grade), and other methods of cheating that have been well documented in the press (and also in this more recent article),

(3) TX assessing just a narrow slice of the content domain that is covered by NAEP (particularly beyond grade 4). The test names are the same, reading and math, but what is tested is likely to be much more limited with TAAS, which narrows the curriculum. Again, coaching can help here, unlike on a national test like NAEP, which measures overall ability in the subject (see (7)).

(4) TX relies exclusively on multiple-choice tests (even for writing), but NAEP uses both multiple-choice and open-ended questions.

(5) Systematic differences in who takes each test -- there is more pressure on schools to exclude low scoring students from taking TAAS than from taking NAEP, and this difference may be increasing over time.

(6) Making the questions easier each year (on a national exam this would be easily noticed; TAAS tests are supposedly equated, but of course the equating might not be accurate),

(7) More and more time spent coaching students on how to answer each type of question on the exam. Even Texas educators themselves have complained to the legislature about the amount of time they spend teaching to the test rather than fostering learning. The test should be an assessment of learning, not the end goal, shouldn't it? Coaching will tend to increase the disparity between TAAS and NAEP if there is little content overlap between the exams, as noted in (3).

Don McLeroy, a State Board of Education member from Bryan, says that teachers are using too much time preparing for the TAAS test and too little time fostering learning. He recommends that parents be given the opportunity to choose a placement test that will provide feedback on a student's true abilities and what the student still needs to learn (from a Texas Senate Education Committee meeting).

If none of these are possible reasons for the TAAS scores going up, then how do you know that?

The issue is how much carryover there is from TAAS to NAEP in the skills and knowledge that are producing the high TAAS scores. For instance, are the kids really learning how to read better (which should show up on almost any test), or are they just learning how to answer certain highly specific item types (which would only show up on TAAS)? Don't the results show the latter at best? At worst, many or all of the seven factors above could be operating, and true learning (the type of learning that would show up on any exam), even of a subset of material, may not be happening.

Let's say you are right. So what did Texas fundamentally do to account for such a change? Certainly we should be implementing it elsewhere. I'd think the Fordham Foundation (which you head) would want to write up exactly what Texas did to achieve such a remarkable result (one that is even more impressive when you consider that Texas recently received a "D" in teacher quality from Education Week) and encourage others to follow this new method of instruction. I fail to understand why you have not done this. It is clearly core to your mission. But it's not on your site anywhere. Why is that? In fact, you don't even mention the incredible Texas miracle at all in your December 2000 recommendations to Congress. Why is that?

Also, with such an incredible effect size on TAAS as we see in Figure 3, don't you find it extraordinarily odd that it did not show up in NAEP, SAT, ACT, or even in Texas's own TASP scores? Those scores were either flat or declining under Bush (except 1996 4th grade math). Why is that? What is the reason to believe the TAAS scores and disbelieve these other independent measures? Is there a rule we can use for when we should ignore NAEP, SAT, ACT, and TASP scores? Are these national, gold-standard test scores ever useful? For what? And if the NAEP scores are useless as a measure of comparative performance between students in different states, then we have to discount all those claims of Texas leading the nation, don't we?

I'm sure you know it's been well established (and it is very consistent with common sense) that high-stakes testing results are less accurate than low-stakes testing results. For example, if I want to find out how many people in my company can perform a sophisticated calculation by hand, say computing the square root of 2, I'll bet very few can do it. A low-stakes test would reveal that a low percentage of people could do this calculation. But if I tell people the night before the test that the next day I'll test them on square roots, and that their continued employment at Propel depends upon their test score, I'm sure you'll agree that one heck of a lot more people will pass the test. Does that mean I suddenly created a bunch of mathematical wizards at my company? Or that my employees tested well on material they were heavily incentivized to prepare for? I think we clearly know it's the latter. So the bottom line is that if we want to ascertain the true mathematical prowess of my employees, I should make it a low-stakes test. That being the case, if we are to believe the high-stakes test results and completely discount the low-stakes test results, you must conclusively show that the low-stakes tests are inaccurate and the high-stakes tests are the true measure of broad academic proficiency. If you can do that, then you will have shown that NAEP is worthless. That would be a major upset in the industry (as well as disproving the Bush claims about Texas's rank relative to the rest of the country, which are based exclusively on NAEP data). Yet this hasn't been done. Why?

You could argue that students do better on TAAS because they are more motivated to get a high score, since it is a higher-stakes test. But the differences in student motivation between TAAS and NAEP are essentially constant over time. Thus, these differences cannot explain the disparity in gain scores between TAAS and NAEP; if motivation were the whole story, the rates of improvement on the two tests should still track each other. Something else is operating (such as the seven points above).

If Texas is doing such a great job in education, then how could it be that, as The New York Times reported, officials with the University of Texas system presented a report to a Texas House subcommittee in February 1999 complaining of "marked declines in the number of students who are prepared academically for higher education"? Shouldn't we see the exact opposite occurring?

Also, have you read the two-year study by Walt Haney on TAAS scores? It certainly seems to me that all trusted third-party data shows nothing special is happening in Texas. Don't you find it odd that the only data showing things are peachy in Texas are the TAAS scores and a single NAEP score (4th grade math in 1996)? Are there any other independent tests that confirm the TAAS scores? Haney's paper was also peer reviewed. Have you seen any peer-reviewed study that disputes Haney's findings? Certainly, the only material I've seen is work like Klein's paper and a huge number of news reports from Texas newspapers and other sources saying that TAAS scores are completely untrustworthy [Dallas Morning News, 8/17/99, 9/22/99, 4/30/99; Houston Chronicle, editorial, 8/8/95; U.S. News & World Report, 7/19/99]. So if there is independent third-party research proving that the TAAS scores are trustworthy and the NAEP scores (and the scores on all other exams) are not, would you mind giving me a reference to that paper?

It certainly appears you and Grissmer were fooled by a jump in 1996 4th grade math scores due to a content overlap with the NAEP exam. Had you ever considered that? If you have, then what data do you have to reject the hypothesis? I talked to Grissmer about it and he said it was quite reasonable... in fact it was the only reasonable explanation he was aware of that fits the facts. Do you have an explanation that fits the facts that we don't know about? Why have you not revealed it? Have you spoken with Grissmer about it?

RAND has stated that both reports are correct. I believe that is the case as well. The score jump in the 4th grade math scores was a one time artifact of the content overlap. Therefore, all the data is consistent and shows no educational gain since 1994. I have not seen an explanation consistent with both RAND reports that comes to an opposite conclusion. Do you have such an explanation that accounts for Klein's findings and explains them? If you have, why did you not include it in your editorial?

Figure 4 of Klein's paper clearly showed that the achievement gap widened, while TAAS showed it narrowing. Is there an error in Figure 4? If so, can you explain what it is? A calculation error? A methodology error? Or do you agree with Klein's conclusion (which is contrary to claims made by the Bush campaign)?

How do you explain the correlations in Klein's paper, given that TAAS scores behave "differently" from all other independent tests? For example:

  • The independent tests correlated strongly with one another, and TAAS math correlated strongly with TAAS reading, but the independent tests and the TAAS tests didn't correlate with each other at all (with the school as the unit of analysis). The only reasonable explanation for this is that TAAS is basically useless in determining academic proficiency. Is there another possibility? 
  • SES had a strong negative correlation with the independent tests, but only a small negative to slightly positive correlation with TAAS. That suggests either that Texas has an incredible educational program in which disadvantaged kids get a much better education than advantaged kids, or that the TAAS tests are so easy after coaching that everyone does equally well. Is there another possibility?
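The correlation pattern in these two bullets can be illustrated with toy numbers. The five hypothetical "schools" below are invented for illustration and are not Klein's school-level data; `pearson_r` is just the standard Pearson correlation coefficient.

```python
# Toy illustration of the correlation pattern, school as unit of analysis.
# The numbers are invented, NOT Klein's data; they only show how a
# "coachable" score can decouple from SES while an independent test tracks it.

def pearson_r(xs, ys):
    """Standard Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Five hypothetical schools, ordered from low to high disadvantage.
pct_disadvantaged = [10, 30, 50, 70, 90]
independent_score = [80, 72, 65, 58, 50]   # tracks SES strongly (negative r)
taas_score        = [88, 90, 89, 90, 88]   # near ceiling, unrelated to SES

print(round(pearson_r(pct_disadvantaged, independent_score), 2))  # -1.0
print(round(pearson_r(pct_disadvantaged, taas_score), 2))         # 0.0
```

On real school-level data one would of course use many more schools, but the qualitative pattern, an independent test tracking SES while TAAS does not, is exactly the disparity the bullets above describe.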

In your conclusion you say that Bush can take credit for these improvements. Even if you honestly believe that the 1996 4th grade math scores truly went up, how can you possibly know that the gain happened after 1994, rather than, say, in 1993? After all, NAEP exams happen only every four years. How do you know this jump was due to Bush rather than to his predecessor? How can you prove from the earlier RAND report that things improved AFTER 1994? Even Grissmer admits that the data after 1994 is flat in Texas. And there is no way to know when scores improved during the 1992 to 1996 period: was the gain evenly distributed, or concentrated near the beginning, middle, or end of the four-year period? So how do you know it was after 1994? What data do you have that you are not telling us? Isn't it possible that all the improvement happened before Bush took office? How can you prove otherwise?

Also, as I'm sure you are aware, Blacks and Latinos in Texas have been scoring highly relative to the rest of the country for at least a decade. Nobody knows why, but someone has to be first! So it's a bit unfair to ascribe such a ranking to Bush, don't you think? Shouldn't we be looking at the data from before Bush took office and comparing it with the data from after he took office, so we can determine cause and effect? From Table 1 and Table 2, it sure doesn't look like much improvement to me. How about to you? What are you seeing in these numbers that I am not?

We can look at this same issue another way. Texas likes to point to its high NAEP scores (e.g., blacks in TX score higher than blacks elsewhere) as evidence of the effectiveness of its high stakes testing program. This program is so effective that it even worked backwards in time. How else can you explain the fact that the NAEP scores for blacks and Hispanics in TX "led the nation" before the TAAS program was even implemented in 1994 (see Tables 1 & 2 in Klein's report)?

Lastly, I should point out that the reason the RAND report was delayed was the additional rounds of peer review (well above the normal peer review process at RAND) that this report went through. I am sure you noticed in the preface to the Klein paper that the external reviewers were Robert Linn and Richard Jaeger, who I understand are two of the top people in the field. Dan Koretz and David Grissmer (the author of the earlier RAND paper that you have said you trust) also reviewed the paper. I'm sure you are also aware that Gene Glass published Klein's paper in a respected peer-reviewed journal (EPAA). So with all these rounds of peer review by eminent scholars, I find it remarkable that you have uncovered errors that these and other experts missed. Rather than keeping these errors confidential, you would do a great public service by educating us all on how so many credible researchers could have been fooled. Why have you not done this? Isn't educating us all on educational issues exactly what the Fordham Foundation is about?

In summary, it seems to me that there are only two possibilities here for NAEP scores:

  • If NAEP scores are useful as a measure of educational improvement, then Klein's Table 1 and 2 show no gains in Texas (relative to the rest of the US). 
  • If NAEP scores are not useful as a measure of educational improvement, then why did the Bush campaign (and you yourself) point to the earlier RAND report as proof of Texas's gains in education, since that report was based exclusively on NAEP data?

And for TAAS scores,

  • If TAAS scores are a true measure of improvement in academic proficiency, then Texas has achieved the most remarkable educational feat in recorded history, all the trusted measures must be abandoned, and most of the questions raised above apply.
  • If TAAS scores are not a good measure, for the seven reasons explained above, then that answers all of the other questions above.

I look forward to hearing your response.

Steve Kirsch

Here is the e-mail reply from Bennett's office (this is really interesting):