Weeks 15, 16, and 17: TEI Trials and Errors

About two and a half weeks ago, I submitted my Free Speech Research Project for IRB approval. While I wait for a response, I have decided to delve into more analytical work. I will be attending DHSI in early June to learn about ethical data visualization. Although the description for the program doesn’t list any prior experience with TEI or R as a prerequisite, I decided to go ahead and practice with both of them.

This post will be dedicated to my experience with TEI, and the following post will deal with R. I had been interested in learning TEI for a while, as it seems to be one of the major cornerstones of digital humanities work, particularly when working with text. Although people often associate TEI with close reading techniques, I intended to automate the TEI conversion process, making this project more closely associated with distant reading. Distant reading was something I read about when I first started this position, and I found the concept fascinating mainly because I feel like students and people my age instinctively employ distant reading techniques when faced with a large text or corpus. When I studied abroad, I had only three days to read Bleak House, which was not enough time, so I found secondary sources, word clouds, timelines, videos, any kind of visualization I could get my hands on so I could absorb the text more quickly. Essentially, I was trying to do this:

Artist: Jean-Marc Cote Source: http://canyouactually.com/100-years-ago-artists-were-asked-to-imagine-what-life-would-be-like-in-the-year-2000/

And while this is not exactly what digital humanists mean when they say distant reading, it’s a great illustration of the way the internet has drastically changed the way that we understand art and literature. Take This is America, for example. News outlets and social media users immediately attempted to dissect the video. None succeeded in capturing every nuance, but collectively they produced a fairly solid interpretation of the work within hours of its release, using blog posts, Tweets, articles, videos, essays, and more. Because explanations and interpretations are so readily provided, artists like Donald Glover and Hiro Murai are pushed to create increasingly nuanced and layered works that resist simple interpretations and challenge their audience to dwell on their meaning. We’ve all witnessed times when every single news media outlet comes out with the exact same article on a pop culture phenomenon (just look up articles about Taylor Swift’s Reputation, all of which list each easter egg in the video with ease). The reactions to This is America were diverse and, even cobbled together, still seem to fall short of encapsulating the sense of mesmerization that the video provokes. The internet has so drastically changed the way we consume and interpret art that artists who want to challenge their audience have to create works that elude the media’s customary act of reacting and labelling. At the same time, scholars have to create new ways of understanding pieces of art. The distant reading techniques employed by digital scholars are an attempt to answer this call.

Before I started anything with my project, I decided to read more about distant reading and its current relationship to the Humanities. Two quotes from these readings really stood out to me.

  • “Computer scientists tend toward problem solving, humanities scholars towards knowledge acquisition and dissemination” (Jänicke). What happens when we collapse that binary? What if the goal of the humanities becomes solving problems and the goal of computer scientists becomes acquiring and disseminating knowledge? How does our ability to understand art and data morph into something new when we treat art like data and data like art?
  • “In distant reading and cultural analytics the fundamental issues of digital humanities are present: the basic decisions about what can be measured (parameterized), counted, sorted, and displayed are interpretative acts that shape the outcomes of the research projects. The research results should be read in relation to those decisions, not as statements of self-evident fact about the corpus under investigation” (UCLA). Art that is created to answer a call for increasingly difficult interpretive puzzles is a product of the analytical environment into which it was born, and so is the analysis itself. You cannot create data or answers or knowledge or solutions in a vacuum. The decisions I make during this project will impact the results. The fact that I am using digital techniques to perform my analysis doesn’t absolve me of responsibility for what I produce.

I designed a project that I felt could incorporate aspects of both quotes. I wanted to solve a problem and acquire knowledge while understanding that whatever I produced would be inextricably tied to my own research decisions. This project also had to include aspects of TEI and R, as well as some of the other tools and skills I’ve learned about thus far. My plan was this:

  1. Find copies of Marx’s first three volumes of Capital and systematically scrape them from the web. (I chose to exclude the fourth volume as it was not available online in the same format as the other three volumes.)
  2. Encode the scraped texts (saved as a plain text file from the CSV file generated by the scraper) as TEI texts.
  3. Use those TEI documents to do some topic modeling or sentiment analysis in R, and put them into Voyant (because why not?).

Capital seemed like a good choice because I have an interest in Marxist theory, so I felt I would intuitively understand some of the results from my experiments, and it would be fun to learn about Marx’s most important work using new techniques that–as far as I know–haven’t been applied to this work before. Capital is also highly structured with parts, chapters, sections, and a variety of textual elements including tables, quotes, foreign phrases, and footnotes, which would give me plenty of opportunities to learn the options available in the TEI schema. Also, May 5 was Marx’s 200th birthday, and it just seemed fitting.

I coded the first chapter by hand so that I understood the structure: why some elements can be used together while others cannot, how best to divide the text using the <div> tag, how to label tags, and how to convert special characters into forms that TEI accepts (e.g. &amp; into &#38;).

Using an initial HTML version of Chapter 1, I gradually worked my way through finding and replacing HTML elements with TEI elements.
Final structure of TEI encoded chapter
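
The find-and-replace pass described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the exact replacements I made by hand: the tag mapping in HTML_TO_TEI and the function name html_to_tei are hypothetical, and a real chapter would need more rules (footnotes, tables, foreign phrases).

```python
import re

# Hypothetical mapping from HTML tags to TEI elements. The actual list of
# replacements used for Chapter 1 is not given in this post.
HTML_TO_TEI = {
    "h3": "head",
    "blockquote": "quote",
    "i": "hi",
    "em": "hi",
}


def html_to_tei(fragment: str) -> str:
    """Rename a few HTML tags to TEI elements and escape bare ampersands."""
    # Escape ampersands that do not already begin an entity (&amp;, &#38;, ...).
    fragment = re.sub(r"&(?!#?\w+;)", "&#38;", fragment)
    for html_tag, tei_tag in HTML_TO_TEI.items():
        # <hi> carries a rend attribute so the italics are not lost.
        opening = f'<{tei_tag} rend="italic">' if tei_tag == "hi" else f"<{tei_tag}>"
        # Opening tags: any HTML attributes (class, id, ...) are dropped.
        fragment = re.sub(rf"<{html_tag}(\s[^>]*)?>", opening, fragment)
        # Closing tags are renamed one-to-one.
        fragment = re.sub(rf"</{html_tag}>", f"</{tei_tag}>", fragment)
    return fragment


if __name__ == "__main__":
    sample = '<h3 class="t">Commodities</h3><p>Use <i>value</i> & exchange</p>'
    print(html_to_tei(sample))
```

Paragraph tags are left alone because <p> exists in both HTML and TEI. Regular expressions are enough for a tidy, predictable source file, but they will not cope with badly nested HTML, which is one reason a dedicated converter ended up being the better route for the full volumes.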

Once I completed the first chapter, I felt I had a sufficient understanding of the elements to know what to look for in a TEI document created using an online converter tool. For this step, I used OxGarage to convert the Word document versions of the volumes into P5 TEI. I then cleaned up the documents in order to make them valid. This step required very little effort. Most of my edits pertained to fixing mistakes in the Word documents, like a broken link to a footnote or an incorrect, invisible numbered list before some paragraphs.

This whole process took me about a week to complete. I summarized it above, but the actual steps looked like this:

  1. Learn how TEI works on TEI by Example (TBE).
  2. Find the first three volumes of Capital online here.
  3. Try to scrape them from the web using my favorite web scraper tool.
  4. Realize the web scraper won’t work the way I want it to because of the way the volumes are displayed with links to sections that are on the same page as other sections.
  5. Download a bunch of versions of the volumes, attempt to scrape straight HTML (didn’t work), try to find a simple way to convert the text into some kind of structured document.
  6. Decide to just encode the first chapter of volume 1 and figure out what to do with the remaining chapters and volumes after that.
  7. Search for some kind of software that I could use to edit TEI.
  8. Download some kind of free nonsense software that I immediately deleted because the interface was ugly.
  9. Download a trial version of Oxygen.
  10. Familiarize myself with Oxygen’s interface.
  11. Code Chapter 1 of Capital Volume 1 by hand following TBE’s examples.
  12. Validate my code using Oxygen’s built-in validating tool and double-check it using TBE’s free validating tool.
  13. Attempt to accurately convert the three volumes in their entirety using all kinds of online file conversion tools. (I thought I would go from HTML to XML to TEI.)
  14. Find OxGarage. Convert the Word doc versions of the volumes into TEI.
  15. Make minimal edits to the resulting TEI documents.
  16. Validate all of the documents using the same tools I used for Chapter 1.
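
For anyone without Oxygen, a rough first check can be scripted. The sketch below is not a substitute for the validators mentioned in the steps above: it only tests that a document is well-formed XML and has the basic TEI skeleton (a TEI root element in the TEI namespace, with teiHeader and text children). It does not check the document against the TEI schema. The function name check_tei is my own.

```python
import xml.etree.ElementTree as ET

# The standard TEI P5 namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def check_tei(xml_text: str) -> list[str]:
    """Return a list of problems found; an empty list means the basics look fine.

    This checks well-formedness and the top-level skeleton only.
    It does NOT validate against the TEI schema.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Unclosed tags, stray ampersands, and similar errors end up here.
        return [f"not well-formed: {err}"]

    problems = []
    if root.tag != f"{{{TEI_NS}}}TEI":
        problems.append("root element is not <TEI> in the TEI namespace")
    for required in ("teiHeader", "text"):
        if root.find(f"{{{TEI_NS}}}{required}") is None:
            problems.append(f"missing <{required}>")
    return problems


if __name__ == "__main__":
    sample = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
        "<teiHeader/><text><body><p>Sample</p></body></text></TEI>"
    )
    print(check_tei(sample) or "looks fine")
```

A document that passes this check can still be invalid TEI, so the schema-aware validators remain the final word.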

Once I had all of my TEI volumes complete and validated, I immediately put them into Voyant for some unsurprising results.

Initial Output
After Some Manipulation

Naturally, the most common words were “capital”, “production”, “value”, “labour”, “money”, etc. I knew this would be the case, but the more I attempted to manipulate the layout, the slower the interface became, until it stopped loading altogether. I ultimately abandoned Voyant and decided to try the Topic Modeling GUI instead. I forgot, though, that the GUI only works with plain text files. I had hit another roadblock, and after searching for a solution, I decided to move on to coding with R, as that seemed to be the most logical way to use my TEI documents. Coding with R will be the subject of my next blog post.
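
In hindsight, the plain-text roadblock is one a short script could get around. The following is a minimal sketch in Python (not the R approach I ended up taking), assuming the documents use the standard TEI namespace and keep their content inside a body element. The function names tei_body_text and top_words are hypothetical. It pulls the text out of the body, leaving the header metadata behind, and counts the most frequent words, much as Voyant does.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Prefix map for the standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_body_text(xml_text: str) -> str:
    """Return the text inside <body> with all markup stripped."""
    root = ET.fromstring(xml_text)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return ""
    # Joining with spaces keeps adjacent paragraphs from running together.
    # It can leave a stray space before punctuation that follows an inline
    # element, which does not matter for word counts.
    return " ".join(" ".join(body.itertext()).split())


def top_words(text: str, n: int = 5, stopwords=frozenset()):
    """Return the n most common lowercase words, skipping any stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stopwords).most_common(n)


if __name__ == "__main__":
    sample = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
        "<teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt>"
        "</fileDesc></teiHeader>"
        "<text><body><p>Capital is value.</p>"
        "<p>Value creates <hi>capital</hi>, and labour creates value.</p>"
        "</body></text></TEI>"
    )
    text = tei_body_text(sample)
    print(text)
    print(top_words(text, n=2))
```

Writing the returned string to a .txt file would give a tool that expects plain text something to work with, though footnotes and tables would be flattened into the running text along with everything else.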

