Monday, April 7, 2014

The Computer Illiterati Conspiracy (or "Why the Average Teaching Assistant Makes Six Times as Much as College Presidents")



With a growing college population, and the implementation of the Common Core Standards for K-12 students, Automated Essay Scoring (AES for short) is slated to become one of the most lucrative fields in the education market within a few years. Teachers might be good enough when it comes to assessing their students' writing, but they are painfully slow (a computer algorithm can churn out grades for tens of thousands of essays in a matter of seconds); they are also inconsistent and biased, and – perish the thought! – they want to get paid for their services.

These are the arguments put forward by ed policy makers and supported by one-dimensional (not to say shoddy) research, such as a much-quoted 2012 study from the University of Akron in which the authors compared human readers scoring student essays "drawn from six states that annually administer high-stakes writing assessments" with the performance of nine essay algorithms grading the same essays. They concluded that:
"automated essay scoring was capable of producing scores similar to human scores for extended-response writing items with equal performance for both source-based and traditional writing genre [sic!] Because this study incorporated already existing data (and the limitations associated with them), it is highly likely that the estimate provided represent a floor for what automated essay scoring can do under operational conditions." (2–3)
Between the lines of academic jargon in the last sentence we find a startling claim: if the high correlation between human readers and their silicon counterparts only represents a "floor" of what the programs are capable of, then the implication must surely be that they are, for all intents and purposes, better graders than the teachers. And true enough, the authors go on to deplore the human raters' inconsistency and inability to follow simple instructions:
"The limitation of human scoring as a yardstick for automatic scoring is underscored by the human ratings used for some of the tasks in this study, which displayed strange statistical properties and in some cases were in conflict with documented adjudication procedure." (27)
This is nonsense; nonsense wrapped in academic abstraction, but nonsense nonetheless. When teachers stray from "documented adjudication procedure," it is precisely because they are experienced and creative readers who know full well that an essay might be great even though it does not conform to – and sometimes consciously flouts – rigid evaluation criteria. And as for their grading exhibiting (gah!) "strange statistical properties," it is important to realize that this is not a sign of human fallibility. Quite the contrary. If there is a huge discrepancy between two readers evaluating the same essay, it may well mean that at least one of them (possibly both, although the one recommending the conservative grade might be wary of repercussions for not following the criteria to the letter) has discovered an outstanding essay.
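For readers who wonder what these agreement comparisons actually look like under the hood, here is a minimal sketch in Python. The scores are invented for the sake of illustration – they are not data from the Akron study – but they show how a single 6-versus-1 split between two human readers drags down the human-to-human correlation, while a machine built to mimic the "average" rater sails through with near-perfect numbers. That, in a nutshell, is the kind of thing that gets branded a "strange statistical property":

# Illustrative only: roughly how human-vs-human and human-vs-machine
# agreement is compared in AES validation studies. The scores are made up
# for this example; they are NOT data from the Akron study.

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def exact_agreement(xs, ys):
    """Share of essays given the identical score by both raters."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)

# Ten essays scored on a 1-6 scale.
rater_a = [4, 3, 5, 2, 4, 6, 3, 4, 5, 2]
rater_b = [4, 3, 4, 2, 4, 1, 3, 4, 5, 2]   # one 6-versus-1 disagreement
machine = [4, 3, 5, 2, 4, 5, 3, 4, 5, 2]   # hugs the "average" human score

print("human vs human  :", round(pearson(rater_a, rater_b), 2),
      round(exact_agreement(rater_a, rater_b), 2))    # low correlation
print("human vs machine:", round(pearson(rater_a, machine), 2),
      round(exact_agreement(rater_a, machine), 2))    # near-perfect correlation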

Computer algorithms will always penalize innovation, but surely the students are not supposed to pen Pulitzer-winning essays? Isn't the point of the essays rather to gauge whether they can craft coherent texts according to the K-12 Common Core Standards (the ones listed below are for informative/explanatory essays)?
"Introduce a topic clearly, provide a general observation and focus, and group related information logically; include formatting (e.g., headings), illustrations, and multimedia when useful to aiding comprehension.
Develop the topic with facts, definitions, concrete details, quotations, or other information and examples related to the topic.
Link ideas within and across categories of information using words, phrases, and clauses (e.g., in contrast, especially).
Use precise language and domain-specific vocabulary to inform about or explain the topic.
Provide a concluding statement or section related to the information or explanation presented."
Yes, but even though these criteria are highly mechanical and wouldn't necessarily (if you'll excuse my anthropomorphizing) recognize a good essay if it bit them in the face, the AES systems still fall woefully short. They can do a word count and a spell check; they can look for run-on sentences and sentence fragments, and measure the ratio of linking words and academic adverbs. The fourth bullet point shouldn't pose much of a problem either, since they have been fed hundreds of human-graded texts and have extrapolated the "domain-specific" words that correlate with high grades. But what about factual accuracy and logical progression – surely a piece of cake for the silicon cookie monster? Not quite.
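To make the point concrete, here is a deliberately crude sketch of such a surface-feature scorer, written in Python. It is not e-Rater or any other real AES engine, and the word lists and weights are invented for the example; all it shows is how sheer length, linking words and "domain-specific" vocabulary can be rewarded with no regard for sense or truth:

# A caricature of surface-feature essay scoring. This is NOT e-Rater or any
# real AES engine; the word lists are invented and the weights are arbitrary.
import re

LINKING_WORDS = {"moreover", "however", "in addition", "thus", "furthermore",
                 "in contrast", "therefore", "consequently"}
# Pretend these were extrapolated from human-graded essays on the same prompt.
DOMAIN_WORDS = {"teaching assistants", "accommodations", "capitalism",
                "tuition", "luxury", "revenue"}

def crude_score(essay: str) -> float:
    """Return a 1-6 score from surface features alone; no notion of truth or logic."""
    text = essay.lower()
    words = re.findall(r"[a-z']+", text)
    length_points  = min(len(words) / 150, 2.0)   # longer is "better"
    linking_points = min(0.5 * sum(text.count(w) for w in LINKING_WORDS), 2.0)
    domain_points  = min(0.5 * sum(text.count(w) for w in DOMAIN_WORDS), 2.0)
    return round(1 + length_points + linking_points + domain_points, 1)

nonsense = ("Moreover, teaching assistants are paid an excessive amount of "
            "money. In addition, luxury dorms generate revenue. Thus, "
            "accommodations and capitalism are, however, myriad.")
print(crude_score(nonsense))   # well above the midpoint, despite being empty of sense

Feed it a paragraph stuffed with "moreover" and "capitalism" and the score climbs towards the ceiling – which, as we shall see, is precisely the vulnerability that can be exploited.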

One of the most vocal critics of automated essay assessment, Les Perelman, director of writing at M.I.T., has taken one of the most commonly used automatic scoring systems for a spin. The e-Rater is used not by K-12 schools but by the ETS to grade graduate-level GRE essays (i.e. one of the most high-stakes tests on the market). So how does it measure up? Let us not even consider creativity, subtlety, style and beauty (all important traits in grad school work), but look at the rudimentary skills outlined in the Common Core Standards. Is the e-Rater able to discriminate factual accuracy from outlandish claims, logical progression from a narrative mess, sense from nonsense? The following essay, written by Perelman, received the highest grade possible – 6/6 (an essay with this score "sustains insightful in-depth analysis of complex ideas"):
Question: "The rising cost of a college education is the fault of students who demand that colleges offer students luxuries unheard of by earlier generations of college students -- single dorm rooms, private bathrooms, gourmet meals, etc." Discuss the extent to which you agree or disagree with this opinion. Support your views with specific reasons and examples from your own experience, observations, or reading. 

In today's society, college is ambiguous. We need it to live, but we also need it to love. Moreover, without college most of the world's learning would be egregious. College, however, has myriad costs. One of the most important issues facing the world is how to reduce college costs. Some have argued that college costs are due to the luxuries students now expect. Others have argued that the costs are a result of athletics. In reality, high college costs are the result of excessive pay for teaching assistants. 

I live in a luxury dorm. In reality, it costs no more than rat infested rooms at a Motel Six. The best minds of my generation were destroyed by madness, starving hysterical naked, and publishing obscene odes on the windows of the skull. Luxury dorms pay for themselves because they generate thousand and thousands of dollars of revenue. In the Middle Ages, the University of Paris grew because it provided comfortable accommodations for each of its students, large rooms with servants and legs of mutton. Although they are expensive, these rooms are necessary to learning. The second reason for the five-paragraph theme is that it makes you focus on a single topic. Some people start writing on the usual topic, like TV commercials, and they wind up all over the place, talking about where TV came from or capitalism or health foods or whatever. But with only five paragraphs and one topic you're not tempted to get beyond your original idea, like commercials are a good source of information about products. You give your three examples, and zap! you're done. This is another way the five-paragraph theme keeps you from thinking too much. 

Teaching assistants are paid an excessive amount of money. The average teaching assistant makes six times as much money as college presidents. In addition, they often receive a plethora of extra benefits such as private jets, vacations in the south seas, a staring roles in motion pictures. Moreover, in the Dickens novel Great Expectation, Pip makes his fortune by being a teaching assistant. It doesn't matter what the subject is, since there are three parts to everything you can think of. If you can't think of more than two, you just have to think harder or come up with something that might fit. An example will often work, like the three causes of the Civil War or abortion or reasons why the ridiculous twenty-one-year-old limit for drinking alcohol should be abolished. A worse problem is when you wind up with more than three subtopics, since sometimes you want to talk about all of them.
Factual accuracy aside, where is the "in-depth analysis" and the logical progression? This hilarious rant has the trappings of an excellent essay – an advanced vocabulary, plenty of academic linking words as well as a good portion of "domain words" used in student essays on the same topic that scored highly ("teaching assistants", "accommodations", "capitalism") – and the machine cannot tell the difference. The algorithm can be easily fooled, something ETS made no secret of in a 2001 paper. But while admitting that utter nonsense can score highly, they also claim that this is of little relevance since students do not set out to trick an algorithm; they write with human beings in mind (there is still a human reader involved in the GRE scoring process), and the overlap between essays deemed good by humans and the algorithms is almost complete. We can illustrate this with a Venn diagram of essays receiving high scores:
[Venn diagram: essays receiving high scores from human readers (the blue bull's eye) and from the algorithm (the green circle), with an almost complete overlap]
It won't be long, however, before the human readers are given the boot. If you plug the high predictive validity, specious though it might be, into a cost-benefit analysis, you would fool many a school board. And here's the rub: with no human reader involved, the green circle is a much more comfortable target to aim for than the blue bull's eye. Chances are that K-12 teachers, pressured to teach the Common Core tests rather than the skills these tests are supposed to measure, will be forced to coach their students to produce impressive-sounding gibberish, perhaps along the lines of:
"You see, start out with a phrase such as 'In today's society', 'During the Middle Ages', or, why not, 'In stark contrast to'. Then you rephrase the essay prompt and begin the second paragraph. Start with a linking word; "thus" or "firstly" are always a safe bet. And whatever you do, don't forget the advanced content words; if you're supposed to write about whether technology is good for mankind, how about a liberal sprinkling of "interaction", "alienation", "reliance" and "Luddite"... Oh yes, the last word will almost guarantee that you'll get an A! In the thirds paragraph..."
As loath as I am to beat the dystopian drum here, there is a real risk that the focus on discrete metrics (and consequently on uniformity and rote-learning) in the Common Core Standards, rather than promoting transparency and equity, might make us blind to the intrinsic worth and unique skills of each student. No longer human beings, they are now points in a big data matrix, in which their performance is mapped with mathematical precision to the performance of their peers. This breaking down of students (pun very much intended) into metrics will most likely lead to a kind of "lessergy" where total ability bears no relation to the sum of their artificially measured skills. A car made out of papier-mâché parts, which might have the same dimensions and at first glance pass for the real thing, will not perform very well on the road. And in much the same way, a student taught to fool the AES algorithms will hardly have gained any real-life skills in writing or critical thinking.

AES is of course only one facet of the big data-fication of education, but it is one of the most egregious ones. Until the "two cultures" divide has been bridged, policy makers will be as dumbfounded and seduced when told about the "chi-square" correlations of automated essay scoring algorithms, and the "strange statistical properties" of human raters, as Diderot was when (if we are to believe the anecdote) Euler explained that given the equation:

(a + b^n)/n = x

...there is a God.

When I first read Hard Times 12 years ago, I thought it was a clunky, over-the-top satire. Now it seems eerily prophetic (yes, when he wasn't busy earning millions as a high-flying TA, Dickens actually found time to whip up a couple of novels):
"Utilitarian economists, skeletons of schoolmasters, Commissioners of Fact, genteel and used-up infidels, gabblers of many little dog’s-eared creeds [...]  Cultivate in them, while there is yet time, the utmost graces of the fancies and affections, to adorn their lives so much in need of ornament"
Perhaps this is precisely what is needed – a grassroots movement of teachers and educators, writers and poets, students and parents, who can do just that: cultivate some fancies and affections into the Commissioners of Fact, and tell the technocrats and Taylorists that there is more to life than what is dreamt of in their philosophies. Until then, a good way to start would be to sign this petition against Machine Scoring in High-Stakes Essays (with Noam Chomsky as one of the signatories).


