Big Data, IBM tells us, is "emerging as the world's newest resource for competitive advantage." Algorithms and neural networks fed with big data are driving cars, translating between languages, creating forged video clips of ex-presidents, classifying images, and -- and here is where our Skynet fears kick in -- beating the best human Go players, e-sports "athletes," and fighter pilots.
Software is already reading X-rays and, apparently, doing a better job than many radiologists. Not a day goes by without a newspaper article or TV segment on how big data and AI are ganging up and coming for our jobs. Implicit in this reporting is the idea that humans are fallible, labile, and emotional; we tire easily and our perceptions are distorted by animosities and prejudice. Computers on the other hand...
We cannot chalk this idea of algorithmic impartiality up to journalistic sensationalism. The actors who develop the software are often just as guilty, as are many academics. In a much-quoted paper from 2013 -- "The Future of Employment: How Susceptible Are Jobs to Computerisation?" -- Carl Benedikt Frey and Michael A. Osborne at Oxford University write:
"Computerisation of cognitive tasks is also aided by another core comparative advantage of algorithms: their absence of some human biases. An algorithm can be designed to ruthlessly satisfy the small range of tasks it is given. Humans, in contrast, must fulfill a range of tasks unrelated to their occupation, such as sleeping, necessitating occasional sacrifices in their occupational performance (Kahneman, et al., 1982). The additional constraints under which humans must operate manifest themselves as biases. Consider an example ofhuman bias: Danziger, et al. (2011) demonstrate that experienced Israeli judges are substantially more generous in their rulings following a lunch break. It can thus be argued that many roles involving decision-making will benefit from impartial algorithmic solutions."
It could indeed be argued that "impartial algorithmic solutions" should supplant human jobs, but it would not make for a very good argument, for the simple reason that there is no such thing as an impartial algorithm. Attempts to outsource judicial decision-making to algorithms have proven catastrophic. Far from being impartial, the algorithms have turned out to be deeply racist. Not only do they replicate the prejudice of the humans who created the data in the first place, they amplify it and remove any transparency (neural networks are "black boxes": they will tell you what they believe the answer is, but they are incapable of telling you why), all while giving it a veneer of objectivity.
This cluelessness is hardly new. What follows is a brief account of the first US big data revolution. Just like Messrs. Frey and Osborne, many of the actors believed that mechanization itself produced objective truth, and just like most developers in Silicon Valley, the principal players were convinced that the problems they tackled were purely technical.
This is the story of how the US Census Bureau used big data to locate and round up 110,000 Japanese-Americans in 1942.
"The Electrical Enumerating Machine"
The Constitution of the United States mandates a nationwide census every ten years. What the Founding Fathers had in mind was a tally of all "free Persons" and three-fifths of the slaves, so as to apportion taxes and congressional seats. The census was thus born out of a desire to carry out a white supremacist agenda under the guise of scientific objectivity (as in the three-fifths ratio), and it soon became clear that this agenda could be furthered by adding other, purportedly objective, demographic categories to the questionnaires -- or, in Paul Schor's words:
"The study of statistics leads almost naturally to the study of the processes by which elites objectify other classes of the population ... The US census fits well in this process of constituting groups of individuals as social problems--especially from 1840 on, when it aimed to find answers to the big political questions about the population, such as slavery and the harmfulness of freedom for blacks, the inassimilability of new immigrants and the "racial suicide" of Anglo-Saxons, racial mixing, and the degeneracy of blacks. This is revealed in the multiplication of racial categories to distinguish groups that sometimes were numerically insignificant; thus the 2,039 Japanese enumerated in 1890 contrasts with the treatment of the white race, which was never defined during the entirety of the period under study." -- Schor, Paul, "Counting Americans: How the US Census Classified the Nation", 2017, p. 3.
With the added groups and categories, and the population boom (from 2.5 million in 1777 to 63 million in 1890), tallying the data and crunching the numbers by hand became a problem. Not only was it time-consuming, it was almost impossible to do what we today would call a custom search. If we were interested in, say, the number of illiterate men over the age of 50 nationwide, the answer could perhaps be found through extensive cross-referencing, but only at great labor. The 1880 census had taken eight years to finalize, and with new data categories added for each census, there was a risk that the 1890 census would not be finished within the ten-year window.
Tabulating multiple categories by hand was a Sisyphean task.
Seventy-five years earlier, a maverick British mathematician, Charles Babbage, had floated the idea of using steam power to compute mathematical tables. His ideas were ahead of their time (his later design, the Analytical Engine, was the first proof of concept for a general-purpose computer as we understand the term today, with a CPU, a memory bank, and a printer). Babbage's "engines" would receive their input from punch cards. Lack of funding (and the fact that brilliant conceptual ideas appealed to him far more than humdrum but commercially viable designs) meant that he had to abandon his projects.
In 1888, however, with the Census Bureau struggling to keep up, there was suddenly a market for tabulation machines. Realizing that part of the work could be done by mechanical means, the bureau announced a competition: whoever could tabulate sample data the fastest with the help of machinery would land a lucrative contract for the 1890 census. The winner, an ex-bureau employee named Herman Hollerith, bested his opponents by solving the task almost ten times faster. How? Just like Babbage before him, he based his design on punch cards.
Hollerith punch card used in the 1890 U.S. census.
Using a special hole punch, a clerk would translate all the demographic data on each census record into discrete data points on a card. The first four columns, for example, signified state, county, and census district. Another column signified race. In his patent application, Hollerith explains that it is not very difficult to make a tally of individual categories, such as the number of men or women in the U.S.; this can be done manually. But things get dicier if you want to run a tally combining several variables. (Hollerith's examples are telling in light of the Schor quote above):
"it is required to know the number of native whites,
or of native white males of given ages, or groups of ages, &c., as
in what is technically known as the age and sex tally; ... The labor and expense of such tallies, especially when
counting combinations of items made by the usual methods, are very
great." U.S. Patent US 395782
So, he constructed a machine -- a tabulator -- that could read the punch cards, and could be set through electrical relays to count combinations of categories. The total tallies were then displayed on dials.
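The tabulator's trick -- a single pass over the cards that counts any wired-up combination of categories -- is easy to sketch in modern terms. In the toy example below, the field names and records are invented for illustration (a real Hollerith card encoded hole positions, not labels), but the logic is the same: one pass, arbitrary combinations.

```python
from collections import Counter

# Hypothetical, simplified card deck: each "card" is one person's record,
# with one punched value per demographic column. Field names and values
# are invented for illustration only.
cards = [
    {"sex": "M", "race": "White",    "age_group": "40-44", "literate": True},
    {"sex": "F", "race": "White",    "age_group": "20-24", "literate": True},
    {"sex": "M", "race": "Japanese", "age_group": "20-24", "literate": True},
    {"sex": "M", "race": "White",    "age_group": "40-44", "literate": False},
]

def tabulate(cards, fields):
    """Count combinations of the given fields in one pass over the deck,
    much as the relay-wired tabulator counted combinations of punched
    positions and showed the totals on its dials."""
    return Counter(tuple(card[f] for f in fields) for card in cards)

# A single-category tally -- easy by hand, as Hollerith's patent notes:
by_sex = tabulate(cards, ["sex"])
# A combination tally -- the "age and sex tally" the machine made cheap:
by_sex_and_age = tabulate(cards, ["sex", "age_group"])
```

The point is that adding another variable to the tally costs nothing extra: the clerk feeds the same deck through once, and the relays (here, the tuple key) do the cross-classification.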
Hollerith tabulator dials. The 1890 machine featured 40 of these; thanks to the double dials, each one could display a number from 0 to 10,000.
Machines based on Hollerith's original design were used by the agency until 1950, when they were partly replaced by computers. Reflecting on the Hollerith revolution in 1965, bureau director Ross Eckler noted how:
"The Superintendent of the
Census of 1890 could rightly take pride in the gains that were
accomplished through the use of the new equipment which initiated the
use of punch cards in extensive statistical tabulations, though perhaps
he did not realize the outstanding importance of the innovation which
first reduced the data on the census schedule to a form which could be
classified and counted by purely mechanical devices." -- Truesdell, p. 1.
"Purely mechanical" does more, linguistically, than explain that the human had been removed from the equation; the phrase also suggests that the classifications -- and the tallies computed by the machine -- had objective validity.
The quote is from the preface of a book by Leon E. Truesdell, who served as the bureau's chief demographer until 1955. Truesdell will soon make a more full-fledged appearance, but what is striking is that his account -- preoccupied with the problem of developing machinery to tally an ever-expanding array of demographic categories -- never once reflects on the categories themselves or what they refer to. His notion of progress is purely technological, as he writes in the epigraph, reflecting on the rapid digitization of the agency between 1950 and 1965:
"[I]n a few years, the electronic computer, with its supporting devices for assembling census data, had made far more progress than the punch card had made in 60 years. For the contribution of [the] FOSDIC [computer] is in addition to the fantastic increase in operational speed of the computer and the tremendous increase in the possibilities for complex cross-classification, checking for consistency, inflation from sample, and even adjustment of variable data." -- Truesdell, p. 208.
Internment by Big Data
"Jp" in Column 14 denoted Japanese ancestry. The IBM machines were able to quickly identify all punch cards for Japanese-Americans on the West Coast; these were then de-anonymized and tied to the individual records with names and addresses.
After the bombing of Pearl Harbor, the Census Bureau -- with state-of-the-art punch card technology and census data on everyone in the US -- was inundated with requests from the military, and it was more than happy to "help with the war effort." The following quote is from Margo Anderson's social history of the American census:
"In January 1942 [the Census Bureau] acknowledged they
were getting many requests from the military and surveillance agencies
for information on Germans, Italians, and especially Japanese. Leon
Truesdell, chief population statistician, said, 'We got a request
yesterday, for example, from one of the Navy offices in Los Angeles,
wanting figures in more or less geographic detail for the the Japanese
residents in Los Angeles, and we are getting that out.' Assistant
Director Virgil Reed followed up, noting that, for requests for data on
the Japanese, Germans, and Italians, "some of them wanted them by much
finer divisions than States and cities; some of them wanted, I believe
several of them, them by census tract even, Truesdell agreed: 'That Los
Angeles request I just referred to asked for census tracts." [Director
James Clyde] Capt was pleased with these new efforts, bragging, 'We
think it is pretty valuable. Those who got it thought they were pretty
valuable. That is, if they knew there were 801 Japs in a community and
only found 800 of them, then they have something to check up on ...
We're by law required to keep confidential information by individuals
... But in the end, if the defense authorities found 200 Japs missing
and they wanted them names of the Japs in that area, I would give them
further means of checking individuals."
-- The American Census: A Social History, Second Edition, Anderson, 1990, pp. 194-195.
For our friend Truesdell, the request for the whereabouts of Japanese-Americans in Los Angeles was a technical challenge, and one he relished. Unfortunately, as director Capt noted (while at the same time expressing his willingness to flout the law), the agency was not allowed to provide information on individuals. Together with the commerce secretary, Capt lobbied Congress. He wrote an amendment to an omnibus war powers bill that was adopted and passed into law in March 1942.
The New York Times reported how the amendment would remove the confidentiality protection of census records and allow for data sharing between agencies: "[Such] data, now a secret under law, government officials believe, would be of material aid in mopping up those who had eluded the general evacuation orders." (Quoted from Anderson, p. 196)
Capt's aggressive lobbying, Anderson notes, led to "the provision of technical expertise and small-area tabulations to the army for the roundup, evacuation, and incarceration of the Japanese-ancestry population -- more than 110,000 men, women, and children -- from the West Coast of the United States." (Anderson, p. 196)
Computing Before Computers, Aspray, 1990, p. 143.
PS. I started looking into the story of how big data was used to round up Japanese-Americans after watching this 1981 ABC Nightline interview with Steve Jobs and David Burnham. It is well worth watching.