The Pennsylvania Department of Education is testing a computer technology that purportedly scores long-answer essay responses on its state-mandated exam as accurately as human graders, with the aim of reducing both the number of scorers needed and the time it takes to return results.

The technology is an artificial intelligence system called IntelliMetric, developed by Vantage Learning of Bucks County, Pa.

Jerry Bennett, school profiles manager for the Pennsylvania Department of Education’s evaluation and reports division, said he was aware of the capabilities of artificial intelligence, though he was skeptical at first. “I had to see it operate before I could believe it,” Bennett said. “So far, the data I’ve looked at says it’s as accurate as two human scorers.”

Scott Elliott, chief operating officer of Vantage Learning, admitted that it seems hard to comprehend that a computer could automatically mark prose for focus, content, style, organization, and conventions—and be more accurate than humans.

“When we introduced this several years ago, there was much skepticism,” Elliott said. But now, “the proof in the pudding is that, after 30 studies over three years, it’s more accurate than expert scorers.”

Before IntelliMetric can begin grading, it has to be “trained” to recognize what answers should look like for each possible score of each essay question.

“We would have experts score several hundred papers according to state standard rubrics, and then we would feed [this information] into the computer,” Elliott said.

Assuming that essay answers are graded on a scale of one to five, IntelliMetric analyzes the previously marked responses, along with their scores, and it “learns” the pattern of the scores.

“The computer doesn’t do any scoring per se; it learns what the pattern of a two or three or four is,” Elliott said. IntelliMetric reflects the marking style of whoever scored the sample answers, since those are the answers it learns from.

The software “emulates the process an expert scorer would do,” Elliott said. It analyzes the text the same way a human would for grammar, structure, and content.
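Vantage has not published IntelliMetric's internal workings, but the training-and-scoring loop Elliott describes can be sketched in miniature. The following is an illustrative stand-in only, not Vantage's method: it builds one averaged word-frequency "pattern" per score from hypothetical expert-graded samples, then assigns a new essay the score whose pattern it most resembles. All sample essays and scores here are invented for demonstration.

```python
# Illustrative sketch only -- NOT Vantage's proprietary IntelliMetric algorithm.
# Idea: "train" on expert-scored essays by learning a feature pattern per score,
# then score new text by similarity to the nearest learned pattern.
from collections import Counter
import math

def features(text):
    """Crude stand-in for the linguistic features a real system would extract."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count profiles."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(scored_samples):
    """scored_samples: list of (essay_text, expert_score) pairs.
    Builds one combined feature profile ("pattern") per score level."""
    profiles = {}
    for text, score in scored_samples:
        profiles.setdefault(score, Counter()).update(features(text))
    return profiles

def predict(profiles, text):
    """Assign the score whose learned pattern the new essay most resembles."""
    f = features(text)
    return max(profiles, key=lambda s: cosine(profiles[s], f))

# Hypothetical training set: a few expert-scored answers on a 1-to-5 rubric.
samples = [
    ("the passage shows strong organization and clear focus on the theme", 5),
    ("clear focus well organized ideas with strong supporting detail", 5),
    ("some ideas but little organization and weak detail", 2),
    ("weak ideas little detail no organization", 2),
]
model = train(samples)
print(predict(model, "clear organization and strong focus on the theme"))  # prints 5
```

A real system would use far richer features (grammar, structure, content) and hundreds of training papers per question, as the article describes, but the workflow is the same: experts score a sample, the machine learns the pattern, and new answers are scored against it.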

Unlike typical assessments that take four to six months to return results, IntelliMetric provides an immediate response in three to six seconds.

That’s one of the major attractions for the Pennsylvania System of School Assessment (PSSA), the state-mandated exam that is administered in the spring but not reported until the fall.

“We’ve been doing a number of studies to see how [IntelliMetric] works,” Bennett said. In the fall of 1999, the department tested IntelliMetric using answers to an open-ended reading question, in which 11th graders read a passage and responded in writing. In the spring of 2000, about 5,000 11th graders responded to between one and three different open-ended questions.

In December, approximately 27,000 students in grades six, nine, and 11 will field-test 30 new open-ended questions that will be used on future standardized tests. Bennett said his agency will use those answers to further evaluate IntelliMetric's accuracy.

In all cases, both humans and the computer score the papers, and the two sets of marks are then compared. So far, they have matched closely.

“We demonstrated [that] IntelliMetric was as accurate, or more accurate, than human scorers,” Elliott said. “It matches [an expert reviewer’s scores] more often than experts will match each other.”

He said experts match each other in the mid-90 percent range, while IntelliMetric matches 99 to 100 percent of the time.
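The agreement figures Elliott cites can be measured as an exact-match rate: the share of essays on which two scorers assign the same score. A minimal sketch, using invented score lists for illustration:

```python
# Illustrative only: measuring "agreement" between two scorers as the
# fraction of essays on which both assigned the same score.
def exact_agreement(scores_a, scores_b):
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)

# Hypothetical 1-to-5 scores from two human experts on ten essays;
# they disagree on two of them.
human_1 = [3, 4, 2, 5, 3, 4, 1, 3, 4, 2]
human_2 = [3, 4, 2, 4, 3, 4, 1, 3, 5, 2]
print(exact_agreement(human_1, human_2))  # prints 0.8
```

On this measure, Elliott's claim is that expert pairs agree in the mid-90 percent range, while IntelliMetric matches an expert 99 to 100 percent of the time.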

“It doesn’t need a cup of coffee, it doesn’t get tired, it isn’t subject to the kinds of biases humans have,” Elliott said.

This computerized marking technology also could make the marking process cheaper, Bennett said.

“What makes our testing different from other states is [that] we do a lot of the open-ended stuff,” he said. “There’s a lot of scoring. You have to hire a lot of people.”

But IntelliMetric won’t eliminate the need for human test-markers.

“You couldn’t rely on the computer totally. You still need to train the computer, so you’re still going to need human scorers,” Bennett said.

Before the state department of education can adopt this technology, its technical advisory committee would have to approve it. But first, “they want to see more data,” Bennett said.

If IntelliMetric continues to prove it works well, Bennett said, the department will work it into the system, although some technical considerations would need to be addressed first.

If the PSSA is administered on computers, the department would have to ensure equal distribution of computer hardware throughout the state. Officials also would have to determine whether test results vary with a student’s typing experience and ability.

Although the state department of education is testing this technology for its state-mandated exam, IntelliMetric also has practical uses in the classroom. English teachers who want to assign more writing but hold back because of the grading workload could find some relief with IntelliMetric.

“This allows them to much more constructively use essay-type response, because [students] can get immediate feedback,” Elliott said.

Because IntelliMetric learns the pattern of the original marker, Bennett said, answers would be marked as if the teacher actually graded them.

“If teachers participated in the original training of the computer, then it’s like the teacher scoring” the essays, Bennett said.

Publishers and assessment organizations—including the College Board, Edison Schools, Peterson’s, and the Thompson Learning Company—already use IntelliMetric.

“Because of the sensitive nature of this [technology], we are continuing to pilot it, and we expect this to be used live [in public schools] by next year,” Elliott said.
