Who doesn’t love a good XML file? You know the kind: the tags are closed, the syntax is correct, and it’s been parsed and validated. Now let’s see a show of hands: who loves getting an XML file into that state? Anyone?
If you haven’t guessed, this week’s archiving adventures have been primarily about encoding XML files. These files are, essentially, the finding aids for the collections I am working on and will live online at the OhioLINK finding aid repository. (Warning: This post will include jargon, but I will do my best to explain all of it.)
Encoded Archival Description
The first thing you need to know is that Encoded Archival Description (EAD) is the standard for archival encoding, maintained by the Library of Congress and the Society of American Archivists. Should you be interested in learning about its origins and development, visit the Library of Congress’ EAD website. Basically, the idea of EAD is to standardize and accommodate a diversity of international archival descriptive practices to make archival materials accessible to users. EAD deals in data structure as opposed to data content: it defines how the description in a finding aid is structured, not how to process, house, arrange, number, or describe the materials themselves. EAD is encoded in XML (Extensible Markup Language) and has over 100 tag elements. If you’re not familiar with XML, it’s a markup language that computers can read, built on tags, not unlike HTML, but also not like HTML.
What are tags? Tags are the encoding bits (to use really technical terminology) that go around your content and maintain the structural order of that content. For example:
<unittitle>Box 7: Photographs</unittitle>
“Unittitle” is the tag, signifying the title or name of the thing being described, in this case a box of photographs. Each tag is enclosed in angle brackets. Whatever text comes after the opening tag is the content for that tag (in this case Box 7: Photographs) until you close it with a closing tag, indicated by the slash. (So yeah, like HTML.) This would be a descriptive element tag.
There are also hierarchical element tags that determine ordering or level. For example, in a particular collection you may have (from largest entity to smallest) series, box, folder, item. Say we have a huge collection, like the Bebe Miller Company collection at TRI. There are hundreds of costumes in this collection, so all of the costumes are arranged into a Costumes series, just as all the videos and films are arranged into a series together. Then there are several boxes within each series, and then within boxes, folders and/or items, depending on the material and how detailed the descriptions need to be. (This is where the phrases “folder level description” and “item level description” come from. Also, smaller collections, like the Dalcroze Society of America Collection at TRI, which is three boxes, don’t need series level organization.) The levels fit into each other like Russian nesting dolls. To reflect this hierarchical organization within a collection, the level, or <c>, tags are used: <c01>, <c02>, <c03>, etc. Imagine it working this way:
<series> <box> <folder> <item></item> </folder> </box> </series>
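To make the nesting a bit more concrete, here is a minimal Python sketch (not part of the TRI workflow) that builds numbered component tags with the standard library. The `add_component` helper and the sample titles are made up for illustration, and a real finding aid carries far more inside each component than a title.

```python
import xml.etree.ElementTree as ET

def add_component(parent, depth, title):
    """Append a numbered component (<c01>, <c02>, ...) that holds a title.

    Hypothetical helper: real components carry much more description.
    """
    component = ET.SubElement(parent, f"c{depth:02d}")
    did = ET.SubElement(component, "did")
    ET.SubElement(did, "unittitle").text = title
    return component

# Made-up sample data: series > box > folder > item.
dsc = ET.Element("dsc")
series = add_component(dsc, 1, "Costumes")
box = add_component(series, 2, "Box 1")
folder = add_component(box, 3, "Folder 1")
add_component(folder, 4, "Item 1")

ET.indent(dsc)  # pretty-printing helper, available in Python 3.9+
nested_xml = ET.tostring(dsc, encoding="unicode")
print(nested_xml)
```

Each call nests the new component inside the one before it, which is the nesting-doll arrangement in tag form.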
So, that’s the gist of the EAD XML encoding. And where our adventure actually begins…
At TRI we use the PastPerfect program for our own in-house record keeping, AND we create EAD finding aids for the OhioLINK repository. They serve different needs and hold different information. This week I created/edited three finding aids for Dalcroze Collections: Irwin Spector Collection, Dalcroze School of Music Collection, and the Dalcroze Society of America Collection. The first two finding aids already exist on the OhioLINK repository, but needed updating (particularly the Spector collection since only the Dalcroze half of the collection was previously processed). The DSA finding aid needed to be created from scratch.
Either way, I spent most of this week converting Excel spreadsheet data into XML documents via a mail merge. (What?!) Most of the data lives in PastPerfect, and it can be exported into an Excel spreadsheet (yay spreadsheet!). Now, if you are working with a large collection, and in this case I would say anything with more than 20 items total is large, you really don’t want to manually encode the tags for each item. Not only would that be mind-numbingly tedious, you also leave yourself open to a lot of typing errors. But how to get all those happy EAD XML tags around the data? That’s where the mail merge comes in. Beth Kattelman, one of the curators at TRI, created a mail merge document with the appropriate XML tags so that the process could be automated. And by automated I mean that the bulk of the encoding for the items is done by the computer, but you can still have a lot of cleanup to do.
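For the code-curious, the same idea can be sketched in a few lines of Python. To be clear, this is not Beth’s mail merge document: the column names, the sample rows, and the template are all assumptions made for illustration, and it reads a CSV export rather than an Excel file. It shows the general technique of pouring each spreadsheet row into a tag template, escaping characters like & along the way.

```python
import csv
import io
from xml.sax.saxutils import escape

# Hypothetical CSV export; the real column names and data will differ.
EXPORT = """unitid,title
IS.3.2,Letters & notes
IS.3.20,Concert programs
"""

# Assumed template: one component per spreadsheet row.
ITEM_TEMPLATE = """<c02>
  <did>
    <unitid>{unitid}</unitid>
    <unittitle>{title}</unittitle>
  </did>
</c02>"""

def merge_rows(csv_text):
    """Fill the template once per row, escaping &, < and > in the data."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(
        ITEM_TEMPLATE.format(unitid=escape(row["unitid"]), title=escape(row["title"]))
        for row in rows
    )

merged = merge_rows(EXPORT)
print(merged)
```

The escaping step matters: a stray ampersand in a title is enough to make the whole file fail to parse.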
For instance, the Excel spreadsheet had to be cleaned up and re-organized because the numbering method we use doesn’t jibe with Excel in a seamless fashion. IS.3.20, meaning Irwin Spector Collection, box 3, folder 20, ends up being read by Excel as a decimal number, meaning 3.20 (box 3, folder 20) is the same as 3.2 (box 3, folder 2). And all of the teens and 100s come before the 20s, 30s, etc. because they are read as decimals: 1.122 (box 1, folder 122) is a smaller decimal number than 1.2 (box 1, folder 2). Without going into too much detail or getting long-winded (too late!), there was a lot of trial and error on my part in the arranging and merging and cleaning up of various documents to get the items into XML format. Now I’ve got a system for myself that makes everything more automated and less manual-clean-up-reliant.
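Here is a small Python sketch of one way around the decimal problem (my own system was spreadsheet-based, so treat this as an illustration only). The `folder_sort_key` function is a hypothetical helper: it splits an identifier on the dots and compares box and folder as whole numbers instead of as one decimal.

```python
def folder_sort_key(identifier):
    """Turn 'IS.3.20' into ('IS', 3, 20) so box and folder sort as integers."""
    prefix, *numbers = identifier.split(".")
    return (prefix, *(int(n) for n in numbers))

# Read as decimals, folder 20 and folder 2 collapse into the same value.
print(float("3.20") == float("3.2"))  # True

identifiers = ["IS.1.122", "IS.3.20", "IS.1.2", "IS.3.2", "IS.1.13"]
ordered = sorted(identifiers, key=folder_sort_key)
print(ordered)
# ['IS.1.2', 'IS.1.13', 'IS.1.122', 'IS.3.2', 'IS.3.20']
```

Keeping the identifier as text and sorting on the pieces means IS.3.2 and IS.3.20 stay distinct, and folder 122 lands after folder 13 where it belongs.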
Getting your coding into a program that can read and help edit the XML. NoteTab Light is a free program, or for a small amount of money you can purchase NoteTab Pro; it reads and edits all kinds of text files. To make this program all the more useful, I needed to find the EAD Cookbook. Not as delicious as it might sound. It’s a library of files that you import into NoteTab, created by one Michael J. Fox (no, not that Michael J. Fox). The NoteTab-specific files were created by Chris Prom and add an EAD-specific library of functions to NoteTab: proper element tags and the ability to Parse and Validate the encoding, for example. Both the Cookbook and the NoteTab materials are free! They were made specifically to help archivists (who have varying degrees of XML and computer savvy) navigate encoding finding aids.
Next, next step
Editing the XML code in NoteTab. Once you get the Cookbook installed properly, get the hang of the syntax, tags, etc., and get a handle on navigating a new program, it’s really not so bad. Until then, you will be swearing at your computer under your breath. I will say that NoteTab is not the friendliest program for human reading: it doesn’t highlight, bold, or color tag elements automatically, it doesn’t give the option to collapse hierarchical tags for scrolling ease, and code lines aren’t numbered (if anyone reading this has any tips for managing these issues, please let me know!). I ended up toggling between NoteTab and a program called Microsoft Visual Studio Tools because its reader does automate visual elements like color, bolding, and collapsibility. This program also has built-in generic XML error detection, though not EAD-specific abilities. Like I said, once you get your workflow going, you can move pretty quickly.
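If you just want a quick generic check outside of either program, Python’s standard library can tell you whether a file is well-formed and roughly where it breaks. This is only a sketch: `check_well_formed` is a hypothetical helper, and like the generic error detection mentioned above it knows nothing about EAD rules, so it catches unclosed or mismatched tags but will not tell you whether the encoding is valid EAD.

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return None if the text parses, else a message locating the error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return f"line {line}, column {column}: {err}"
    return None

good = "<c01><did><unittitle>Box 7: Photographs</unittitle></did></c01>"
bad = "<c01><did><unittitle>Box 7: Photographs</did></c01>"  # unittitle never closed

good_result = check_well_formed(good)
bad_result = check_well_formed(bad)
print(good_result)
print(bad_result)
```

The line number in the message is handy when the editor itself doesn’t number lines.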
Next, next, next step
Encode the front matter for your finding aid. Front matter includes accession information, ownership information, dates, biography/historical note of the collection, processing and finding aid creation information, arrangement of the collection, scope and content and extent, restrictions on use if any, institutional information about the housing archive, and so on. I find that creating all of the information in a word processing program (like Word or Pages) is best for me because I can just edit the content. Once it’s ready, then it can be encoded into the XML document.
Upload! The edited/updated DSM and IS finding aids should be showing up any day on the OhioLINK repository. I didn’t have access to upload them myself because I did not originally create them, so Beth Kattelman is uploading them for me. The DSA finding aid I will wait to create on the OhioLINK EAD finding aid creation tool until after we send the front matter to the DSA to check over and approve, so it may not show up for another week or two.
Here ends part one of my EAD XML encoding adventure. Special thanks to Tara, Brian, and August (and Orville) for their support the past three days as I figured this out. They may be encoding their own finding aids shortly… With the majority of the Dalcroze work completed, I will be moving on to the Sandra Hughes Collection: checking, processing, and entering into PastPerfect as needed, then creating the EAD finding aid. Productive!