Why a Lexical Historical Archive is Not Useful

This post follows up on my previous one, which proposed a “lexical” historical archive. That proposal is recapped briefly below.

I work at one of the world’s fastest-growing tech firms. My company recently migrated to a new cloud information storage and indexing system called “Highspot.” The indexing system uses string search, filters, and “spots” (subject areas you can navigate into, and from which you can easily navigate to other subject areas) to help users find the materials they are looking for. Compare this to a traditional cloud storage system like Google Drive, which relies on linear directory paths: once you drill down to a specific folder within a folder within a folder, it’s not easy to jump to another folder without first moving back up to the most general folder. The Google Drive configuration poses a problem for people who don’t know exactly what they’re looking for, or who didn’t themselves index the information they are looking for. This is often the case in fast-growth institutions with high churn (lots of new people being hired and old employees leaving), where many materials are generated each day. Purportedly, Highspot addresses these problems. You click into the ‘spot’ that denotes your general subject area, then check more specific filters that apply to the file, such as “pitch deck,” “retail,” and “created in 2020,” and maybe enter a quote that you remember from the file, like “80% of people don’t know what their customers are thinking,” and the technology finds the file for you. I think it is a clever solution for my company’s information indexing and storage needs.
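
To make the contrast concrete, here is a minimal sketch of the kind of lookup described above: pick a spot, narrow by filters, then match a remembered quote. It is only an illustration under my own assumptions and says nothing about how Highspot is actually built; the `Doc` record, the `find` function, and the sample files are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical record for one stored file (not Highspot's data model)."""
    title: str
    spot: str                      # general subject area
    tags: frozenset = frozenset()  # filters such as "pitch deck" or "2020"
    text: str = ""                 # searchable contents of the file


def find(docs, spot, tags=(), quote=""):
    """Return docs in `spot` that carry every tag and contain `quote`."""
    wanted = set(tags)
    needle = quote.lower()
    return [
        d for d in docs
        if d.spot == spot
        and wanted <= d.tags            # every requested filter is present
        and needle in d.text.lower()    # plain case-insensitive string search
    ]


# Made-up sample data, for illustration only.
DOCS = [
    Doc("Retail pitch 2020", "Sales",
        frozenset({"pitch deck", "retail", "2020"}),
        "80% of people don't know what their customers are thinking."),
    Doc("Retail pricing sheet", "Sales",
        frozenset({"retail", "2020"}),
        "Pricing tiers for retail customers."),
    Doc("Onboarding guide", "HR",
        frozenset({"guide"}),
        "Welcome to the company."),
]

hits = find(DOCS, "Sales", {"pitch deck", "retail"}, "80% of people")
print([d.title for d in hits])
```

Notice that nothing in the lookup depends on where a file sits in a folder tree. Every file is reachable from its spot, its filters, and its words alone, which is exactly what separates this approach from a directory path.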

I have been reflecting on the way that we store and retrieve information. Last week, I proposed that we create a “lexical” archive that looks up historical sources by searching for specific words that might have persisted from the past to the present, and so might help us make connections between past and present. Such words included “engineer,” “entrepreneur,” etc. I proposed that we use Google’s search tools, like the BigQuery API, or Google’s digital collections, like Google Books.

As I thought more about what I was really proposing, I found myself becoming more skeptical about the idea. By basing the archive on ‘word search,’ I was essentially proposing that the archive revolve around a ‘googling’ function. This proposal was very characteristic of the ‘machine age’ in which I have grown up: String-searching is enabled by our new, fast computation abilities, allowing for tools like Google (and every other search bar out there) to exist. I had already had some inkling, due to my previous learning, that ‘googling’ capabilities had enabled human vices. They allowed for lazier scholarship, as people stopped reading books in their entirety and instead ‘power-skimmed’ the digitized version of a book, or googled within it, for the topic areas they were looking for (thanks to Professor Grafton for the word ‘power-skim’!). This inkling of worry turned into wide-eyed alarm as I realized that googling was antithetical to the very practice of history and that I myself had fallen into the trap of lazily turning to ‘googling’ to solve my research needs.

‘Googling’ reflects a new attitude that humans have towards information. When Paul Duguid traced the genealogy of the word “information,” he found that there has been a change in how “information” is conceived and used: Once, information was synonymous with education, denoting the process of being developed; now, information has become a material particle that can be delivered. In recent years, he suggests, information has become impersonal because one doesn’t need to know the “systems” around it. He cites “modularized views” where people extract items from their original context and distribute them in new ones.

Google subscribed to exactly this notion of information as simply bits. In 2009, The Chronicle of Higher Education published a piece about the total catastrophe that the Google Books project was. Google Books was Google’s project, begun in 2004, to digitize all the books in the world and create the world’s largest digital library. The project was a disaster. Early on, Google made the choice not to include any library metadata (authors, date of publication, genre, etc.) with the books. The thought was: The books can speak for themselves. They have title pages and covers from which we can scrape all the data we need. Yet Google had not anticipated that not all book structures were the same: Where exactly was that title page with the title and author and publication date? Was it the second page? The cover? The last page? Books could be misleading: A title page could have multiple dates. Not to mention the challenge of categorizing books into genres: A book like Susan Bordo’s Unbearable Weight: Feminism, Western Culture, and the Body was assigned to Health & Fitness. This was hardly the descriptor Bordo would have chosen. Google had chosen to forgo the information indexing expertise of librarians because it saw the book as a simple repository of bits, characters, words, and thus, information. “But books aren't simply vehicles for communicating information,” said the article’s author Geoffrey Nunberg aptly, “and managing a vast library collection requires different skills, approaches, and data than those that enabled Google to dominate Web searching.”

Google’s approach to information de-contextualized it. Google threw all its books into a bucket of bits, stripped of all the accompanying information that situated the books in time and place. Clearly, Google thought that this was a problem that could be solved by string-searching capabilities that would mine the books’ contents for information, but Google was mistaken. After doing a little research on the theory of the archive, I found that, with the digital archive I proposed, I had fallen into the same trap as Google. By proposing a ‘word search’ functionality as the primary way of getting around the archive, I was essentially de-contextualizing the information in the archive. Kate Theimer in Digital Humanities, commenting on the recent upsurge in digital ‘archives,’ made a distinction between the archive and just any digital ‘collection.’ She said, “what defines the work of an archivist is ...whole collections pass into the hands of the archivist, rather than individual materials. The original order is to be preserved,” retaining the integrity of the context of the collection of materials. The difference between just any digital collection and an archive is, in my mind, much like the difference between my company’s storage drive on Highspot and my own Google Drive. Highspot is as decontextualized as one can get; it relies on a string-searching algorithm, and the technology organizes the information that one interacts with. Meanwhile, on my own Google Drive, I created, over time, the folders and indexing systems that store my information. The structure of my Google Drive is the unique result of a historical process; to receive and preserve it exactly as it is, with the same internal structure, is what it means to create an ‘archive.’ The structure (or context) itself is as important and historically relevant as the information.

The search-based digital collection I had proposed was not an archive by Theimer’s definition of the word (a definition I am persuaded to accept) because I was proposing a functionality that stripped information from its context and created distance between the user and the information’s all-important context.

Now, in the course of my research, I happened upon a suggestion that by nature, a digital archive is de-contextualized. Miguel Escobar Varela at the National University of Singapore pointed out that the stuff in digital archives lacks marginalia and physical markups, either because it was born digital (think, maps that were constructed online) or because it is frozen in time by the archive, sanitized of human touch for all time. All this got me thinking about whether the digital age is by nature ‘decontextualized.’ After all, it seems to have ushered in a newfound attitude that information is a series of bits that can be easily abducted from its original context and re-purposed. I don’t think it is a coincidence that we call the age ‘digital,’ which means discrete, separable.

For example, I think about how humans have had an overwhelmingly negative experience with social media. This ‘digital’ form of socialization de-contextualizes the way we give, receive and process information because we receive information as a floating bit of material; it often is not given to us intentionally, thoughtfully, and personally. Discussions over social media are almost never productive. This ‘digital’ form of socialization is stripped of important contextual ‘clues’ we use to interact. It is purely visual. We are not witness to body language, handwriting, vocal cues, physical proximity, all the important elements of socialization that we humans naturally use.

In this post alone, I cannot resolve whether the digital age is inherently de-contextualized and what that means for humans. However, I hope to spend the next few posts continuing this discussion of how the digital age interacts with the human, hopefully with some historical perspectives thrown in.