TextGrid and Monk

Martin Mueller — Mon, 02/09/2009 - 14:49

Last week I was in Göttingen to celebrate the release of TextGrid, a "modular platform for collaborative textual editing" and a "community grid for the humanities." The beta version is now available for downloading and testing. From a scholarly and a technical perspective this is a very impressive project. At the moment all the texts in it are German, but the project is multilingual and cosmopolitan in its ambitions. 

Like the Monk Project, TextGrid is a multi-institutional project involving half a dozen partners from different universities. The coordinator, however, is the Göttingen State and University Library, one of the largest German research libraries. From several conversations with librarians I took away the sense that they take a very generous and dynamic view of the challenges for a research library in the digital world. They see the library as a hub for the creation and manipulation of digital data -- a kitchen, if you will, containing not only a pantry and refrigerator but stoves, ovens, pots and pans, knives, and food mills for the digital chefs of the future.

Or a witch's kitchen: Göttingen is close to the Brocken, the highest mountain in northern Germany and the famous gathering place for the witches at Walpurgisnacht. The digital library as witch's kitchen may have its threatening side, but it is a more promising vision than Snow White in the repository of her glass coffin. And it is a vision in which the distinction between 'content' and 'tools' is increasingly blurred -- as it is in the laboratory of any scientist.

Monk has focused on text mining and has taken its texts from where it can get them, with a lot of ingenuity and labor spent on getting diverse texts to play nicely with each other. TextGrid starts from the other end: the normative user so far has been the scholar or a team of scholars creating digital texts of 'reference quality'. Both Monk and TextGrid have similar approaches to managing 'digital intertextuality'. TextGrid has formulated a 'baseline encoding' with XSLT stylesheets that let scholars transform their editions into a common interchange format. In the Monk Project Brian Pytlik Zillig and Steve Ramsay have created procedures for moving diverse texts into a version of P5 TEI that is a close cousin of TEI-Lite but supports linguistic annotation. Both approaches are governed by a desire to identify the 'highest common factor' rather than the lowest common denominator. The higher you can pitch the plateau of an interchange format, the more digital affordances you create for scholarly analysis. 

In the Monk environment all texts are lemmatized and morphosyntactically tagged before being incorporated into the Monk datastore. TextGrid includes a tool that lets you lemmatize a text. The two projects take a similar approach to lemmatization. TextGrid maps wordforms to a 'hyperlemma', which is typically the modern form of the lemma. This is a complex procedure and necessarily arbitrary at the margins, since TextGrid will include texts from Middle High German on, and Middle High German differs more from modern German than Chaucer's language does from modern English. Monk has an easier time of it, since its texts don't reach back as far in time, and establishing a common lemma for words from 1500 on is typically not very hard. You get an idea of the problems if you look at the last word in the first line of Chaucer's General Prologue and ask whether 'sote' should be mapped to a lemma 'soot' (as it is in the OED) or whether you should just think of it as a dialect variant of 'sweet', which it is from an etymological perspective. At these margins decisions are not 'right' or 'wrong' but justify themselves in terms of the forms of analysis they support. From the perspective of digital intertextuality, 'lumping' may on balance have greater payoffs than 'splitting'.
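
The trade-off between lumping and splitting can be made concrete with a minimal sketch in Python. The spelling tables below are invented for illustration and are not data from TextGrid or Monk; the names `LUMPED`, `SPLIT`, and `lemma_counts` are likewise hypothetical.

```python
from collections import Counter

# Toy lookup tables, invented for illustration only.
# "Lumping" folds the variant 'sote' into the modern lemma 'sweet';
# "splitting" keeps it under a separate lemma 'soot'.
LUMPED = {"sote": "sweet", "swete": "sweet", "sweete": "sweet", "sweet": "sweet"}
SPLIT = {**LUMPED, "sote": "soot"}


def lemma_counts(tokens, table):
    """Count lemmata, falling back to the lowercased wordform if it is not in the table."""
    return Counter(table.get(t.lower(), t.lower()) for t in tokens)


tokens = ["sote", "swete", "Sweete", "sweet", "sote"]
lumped = lemma_counts(tokens, LUMPED)
split = lemma_counts(tokens, SPLIT)
print(lumped)
print(split)
```

With the lumped table a search for 'sweet' finds all five tokens. With the split table it finds three, and the other two have to be looked for under 'soot'.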

TextGrid has been built on the Eclipse platform. This is a very interesting choice. I don't understand enough about the technology to weigh the pros and cons of the platform. From a scholarly perspective, however, there is something very attractive about the fact that the user interface seen by the reader or analyst of textual data is very much the same as the user interface seen by the creator or curator. What you see when you download TextGrid looks pretty much the same as the Eclipse platform.

This breaks all rules of good Web design, where things should always be simple, and whatever Eclipse is, it is not simple. But the identity of user and developer interface has real advantages for long-term scholarly use. It encourages transparency and makes it more likely that users know about the structure and analytical potential of their data. Know Your Data is pretty good advice for scholarly work in any environment. For scholars who work with data over any length of time, the best platform is not the platform that lets you do simple things immediately but the platform that supports the most informed use over time. Ben Shneiderman has argued that good software should have low thresholds and high ceilings. This is not unlike the dedication of the Shakespeare Folio by its editors "to the great variety of readers, from the most able to him that can but spell." If you have to compromise, as you usually do, I think Web designers tend to err on the side of lowering the ceilings, not to speak of the fundamental design constraints that the Web browser still poses for user activities of any complexity.

The choice of Eclipse has parallels in the Digital Humanities world. It was used for the Pliny annotation tool by John Bradley, a very sophisticated developer with a deep sense of how scholars in the humanities work. 

Platforms enable and constrain. TextGrid uses the Vex XML editor that is built into Eclipse. This is a very capable tool, but it has one very unfortunate limitation: you cannot enter 'invalid' text. This turns many editorial activities into the impossible task of making omelettes without breaking eggs. The designer of Vex acknowledges this shortcoming and argues that "Vex should therefore permit invalid documents to be edited. Validation should happen in the background." I hope somebody will rise to that challenge.
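
The idea of validating in the background can be sketched in a few lines of Python. This is not how Vex works. It is a toy in which a hypothetical `Buffer` class accepts any edit and merely reports problems, and in which a well-formedness check from the standard library stands in for real schema validation.

```python
import xml.etree.ElementTree as ET


class Buffer:
    """A hypothetical edit buffer that never rejects an edit."""

    def __init__(self, text=""):
        self.text = text

    def replace(self, old, new):
        # Accept the edit unconditionally, even if it leaves the document broken.
        self.text = self.text.replace(old, new)

    def problems(self):
        # Stand-in for validation: report a parse error instead of blocking the edit.
        try:
            ET.fromstring(self.text)
            return []
        except ET.ParseError as err:
            return [str(err)]


buf = Buffer("<l>Whan that Aprill</l>")
buf.replace("</l>", "")                   # mid-edit: the closing tag is gone
broken = buf.problems()                   # a problem is reported, but the edit stands
buf.text += " with his shoures sote</l>"  # the editor finishes the line
fixed = buf.problems()                    # no problems once the tag is restored
```

The point of the design is that the broken intermediate state is allowed to exist: the eggs get broken, and the checker only comments on the omelette.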

If you stipulate that the XML editor will be more flexible, the TextGrid editing platform has a lot going for it. I created a project and populated it with a one megabyte text file that I subsequently called up for editing over a domestic broadband connection. It took about 20 seconds to load but response time thereafter was good. If I don't have to worry about the storage or backup of my documents and can get at them anywhere, anytime, to work on them, I consider some slowdown in operations a cost that is well worth paying.

TextGrid has sophisticated and quite intuitive provisions for user management, which appear to accommodate most of the workflow issues one would associate with a team project. How this will work with a dozen contributors scattered over five continents I don't know. But if it works it will be very good indeed.

The philological background of TextGrid shows up in two other features. There is deep integration with a number of German dictionaries that have been digitized at the University of Trier. Grimm's Wörterbuch, the German equivalent of the OED, is in the public domain. So are a number of specialized dictionaries. The TextGrid interface supports annotation: you can link easily from locations in texts to dictionaries.

TextGrid will support the alignment of a facsimile image with the textual transcription, although this feature is not yet activated in the public beta. I saw it implemented at the meeting with the example of a medieval manuscript. It would work equally well for the task of creating digital editions from OCA texts and keeping a lifeline to the printed page at all stages of editorial labor and subsequent readerly use.

The ideal toolkit for text-based scholarship of any kind should support the entire cycle of scholarly labor from creation through curation and analysis of textual data. Nothing close to such a toolkit currently exists. Pieces of it exist here and there, in Monk, PhiloLogic, WordHoard, Pliny, TextGrid, and others. TextGrid and Monk probably can learn a good deal from each other.
