PEMM + Google Sheets

What we did & why; what worked, what didn’t

By Rebecca Sutton Koeser

Plenty of people use Google Sheets for data curation; what would it look like to use Google Sheets as a lightweight relational database? For the Princeton Ethiopian Miracles of Mary (PEMM) project, led by Wendy Laura Belcher, we had a chance to experiment with this. Our hope was to set the PEMM team up with a solution that they would be able to manage themselves, instead of implementing a custom relational database, which would require developer time to make changes and incur a heavier maintenance cost.

The PEMM project is gathering information on a collection of Ethiopian folktales called the Täˀammərä Maryam (The Miracles of Mary), written from the 1300s through the 1900s in the ancient African language of Gəˁəz (classical Ethiopic), with the goal of answering important questions about dating and story origins, and determining whether there is a particular order or set of stories that typically appear together in manuscripts. 

In the first year of dataset curation support from CDH, Wendy and one collaborator worked on a semi-structured text file based on a digitized version of an early catalog of the miracles of Mary. We wanted to get her data into a more structured environment where more collaborators could get involved, there would be more checks to ensure accurate data entry, and Wendy could start to do analysis and begin answering her research questions. 

photo of hands with interleaved fingers

[In our early planning meetings, Wendy used to make a gesture with interleaved fingers, saying “my data is like this!” — which I understood as her sense of her data as relational, with connected information that doesn’t fit in a single table or spreadsheet. © Rebecca Sutton Koeser, all rights reserved.]

So, we started planning for the move to a set of related spreadsheets; we modeled and designed the data structure much as we would for a relational database. The process of discussing and deciding how to model your data is always so valuable! It forces you to make important decisions. I still remember the “aha” moment when we decided to differentiate canonical stories from “story instances”: this made it possible to separate the description and documentation for story occurrences in a particular manuscript, while still linking them to the canonical story, when known.

svg image of current data model

[Initial data model diagram included in the PEMM charter]

Once the initial data model was settled, we needed to create the spreadsheets in Google Sheets. We could have done this manually, and that probably would have been faster! But we wanted the process to be automated and tested, so we wrote custom code to create the spreadsheets with the structure we had determined. In a relational database, connections between different tables would use database identifiers and what are called “foreign keys”; in our spreadsheets, we used a vertical lookup from another sheet. For example, story instances have lookups to both canonical story and manuscript, so that a particular story can be identified and located without having to duplicate that information in each row.
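
As a rough illustration (not the project's actual code), here is a minimal Python sketch of what a vertical lookup does when it stands in for a foreign key. The sheet names, column names, and sample rows are all hypothetical, invented for this example; they are not the real PEMM spreadsheet structure.

```python
# Hypothetical sheets, each represented as a list of row dictionaries.
canonical_stories = [
    {"story_id": "S1", "title": "Example story A"},
    {"story_id": "S2", "title": "Example story B"},
]
manuscripts = [
    {"manuscript_id": "M1", "name": "Example manuscript"},
]
story_instances = [
    {"manuscript_id": "M1", "story_id": "S2", "order": 1},
    {"manuscript_id": "M1", "story_id": "", "order": 2},  # canonical story not yet known
]


def vlookup(key, rows, key_column, value_column, default=""):
    """Return value_column from the first row whose key_column equals key."""
    for row in rows:
        if row[key_column] == key:
            return row[value_column]
    return default


def resolve(instance):
    """Add display values to a story instance without storing them in its row."""
    return {
        **instance,
        "story_title": vlookup(
            instance["story_id"], canonical_stories, "story_id", "title"
        ),
        "manuscript_name": vlookup(
            instance["manuscript_id"], manuscripts, "manuscript_id", "name"
        ),
    }


resolved = [resolve(instance) for instance in story_instances]
```

Each story instance stores only identifiers; the title and manuscript name are looked up when needed, which is the same trade a foreign key makes in a relational database.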

After the spreadsheet was created and we’d tested that everything was working correctly, I wrote a Python script to parse and convert the data from Wendy’s semi-structured text file into a set of CSV files for import into Google Sheets. As a result, Wendy and her team were able to start working with the data in a new way. Immediately, errors that had not been noticeable in the text file were obvious and easy to correct in the spreadsheet.
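
The layout of the original text file is not described here, so the following is only a sketch of the general approach under an invented format: blank-line-separated records made of "Field: value" lines. The field names and sample text are assumptions for illustration, not the actual PEMM data or script.

```python
import csv
import io

# Invented sample input; the real file's structure may differ entirely.
SAMPLE_TEXT = """\
ID: 1
Title: Example story A
Incipit: first words of story A

ID: 2
Title: Example story B
"""

FIELDS = ["ID", "Title", "Incipit"]


def parse_records(text):
    """Split text into blank-line-separated blocks of 'Field: value' lines."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            field, _, value = line.partition(":")
            record[field.strip()] = value.strip()
        if record:
            records.append(record)
    return records


def to_csv(records, fieldnames):
    """Serialize records as CSV text; missing fields become empty cells."""
    output = io.StringIO()
    writer = csv.DictWriter(
        output, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return output.getvalue()


records = parse_records(SAMPLE_TEXT)
csv_text = to_csv(records, FIELDS)
```

In this sketch, a record with no incipit produces a visibly empty cell in the CSV, which hints at why gaps and inconsistencies are easier to spot in a spreadsheet than in free text.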

Getting the data into Google Sheets didn’t just help with data entry and review. It also meant we could start using the in-progress data with code and tools. We wrote code to regularly pull data from the Google Sheets document as CSV for use in an “Incipit Search Tool”, to help the project team identify stories based on the first meaningful lines of the story. Because the tool is regularly updated with data from the spreadsheet, the search results improved as the team added more information.
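
A simplified sketch of this pattern in Python follows. It assumes the sheet is shared so that its CSV export URL is readable without authentication, and it swaps in a plain substring match because the real tool's matching method is not described here. The column names and sample rows are invented for illustration.

```python
import csv
import io
import urllib.request


def export_url(document_id, sheet_gid):
    """Build a CSV export URL for one tab of a Google Sheets document."""
    return (
        f"https://docs.google.com/spreadsheets/d/{document_id}"
        f"/export?format=csv&gid={sheet_gid}"
    )


def fetch_csv(url):
    """Download CSV text (needs network access and a readable sheet)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def normalize(text):
    """Lowercase and collapse whitespace so small entry differences still match."""
    return " ".join(text.lower().split())


def search_incipits(csv_text, query, column="incipit"):
    """Return the rows whose incipit contains the query text."""
    needle = normalize(query)
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row for row in rows if needle in normalize(row[column])]


# Invented sample data standing in for a downloaded sheet.
SAMPLE_CSV = "story_id,incipit\nS1,There was a  certain man\nS2,A woman once lived\n"
matches = search_incipits(SAMPLE_CSV, "a certain MAN")
```

Because the search reads whatever CSV was most recently fetched, rerunning `fetch_csv` on a schedule is enough to keep results in step with the spreadsheet.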

We also wrote a script to synchronize data from Google Sheets to CSV files in a data repository on GitHub. I expected that synchronizing the data would be relatively easy, but I was pleased that we were able to get not just the data, but also information about data contributors. We use the revision information to identify the authors of recent edits, and then include them as co-authors on the git commit. This aligns with CDH values, since we care about giving people credit for their work and being open and transparent about our own work and processes. Having a copy of the data in GitHub provides an open, non-proprietary backup that can be used with other tools as data work moves forward, and also makes it easier to see changes over time by comparing revisions. We set up automatic data checks with GoodTables, and the data could be used for preliminary analysis or display with a static site generator such as Jekyll or Hugo.
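
One way to credit contributors on a commit is with `Co-authored-by:` trailers at the end of the commit message, a convention GitHub recognizes. The Python sketch below shows only that step; the function name, contributor list, and commit summary are hypothetical, and how editors are retrieved from the revision information is left out.

```python
def commit_message(summary, contributors):
    """Build a commit message crediting contributors with Co-authored-by trailers."""
    # Drop repeated contributors while keeping order of first appearance.
    unique = list(dict.fromkeys(contributors))
    trailers = [f"Co-authored-by: {name} <{email}>" for name, email in unique]
    if not trailers:
        return summary
    # Trailers go in the final paragraph, separated from the summary by a blank line.
    return summary + "\n\n" + "\n".join(trailers)


# Placeholder names and addresses, not real contributors.
recent_editors = [
    ("Example Editor", "editor@example.com"),
    ("Another Editor", "another@example.com"),
    ("Example Editor", "editor@example.com"),
]
message = commit_message("Sync data from Google Sheets", recent_editors)
```

The resulting text can be passed to `git commit -m`, so that everyone who edited the sheet since the last sync is listed on the commit.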

Ultimately, the experiment provided many benefits to the project. Since we successfully transitioned Wendy and her project data into Google Sheets, she’s been able to recruit more collaborators, make significant progress on her project, and begin to do some preliminary analysis. But from the technical side, I’m not sure that this experiment with Google Sheets was a complete success.

Continue to the CDH blog for Koeser’s reflections on lessons learned in the experiment.

Funding Opportunities

In order to develop new humanities-focused curricula for data and computer science undergraduates, the Humanities Computing Curriculum Committee (HC3) seeks to hire three Princeton PhD students as Humanities Data Teaching Fellows. Fellows will receive a $5,000 stipend. Applications are due tomorrow (!) February 12. Read more about the program here.

Princeton faculty, staff, postdoctoral fellows, and graduate students are eligible for our new CDH Data Fellowships. Become part of a cohort to promote ongoing conversations about the methodological, cultural, and ethical questions raised by producing and analyzing humanities data. Funding is available up to $3,000 per project, and faculty are eligible to apply for summer salary of up to $5,000. Deadline: March 1. More detail here.

The deadline for the fall 2021 cohort of Graduate Fellows will be March 15. This program is designed to welcome graduate students into the community of Princeton DH researchers and introduce them to the broader landscape of DH. Fellows receive $750/semester. More information is available here.

Library of Congress Labs is recruiting scholars to learn more about their research methods, sources, and needs in order to inform their future digital strategy. More information here.

Upcoming Events:

Modern America Workshop, “Public Engagement: Finding Audiences Beyond Conferences and Classrooms,” February 11, 4:30pm to 6:00pm

Intro to ArcGIS Pro software to explore geographic data, tables, and create maps, February 16, 11:00am

CITP Seminar: Predict and Surveil: Data, Discretion, and the Future of Policing, February 16, 12:30pm

Getting Started with the Library: Data Statistical Services and Princeton Research Data Service, February 23, 12:15pm

Digital Qualitative Analysis with Open Source Software, February 24, 11:00am

Getting Started with the Library: Center for Digital Humanities and Stokes Viz Hub, February 24, 12:30pm
Copyright © 2021 Center for Digital Humanities, All rights reserved.