PEMM + Google Sheets
What we did & why; what worked, what didn’t
By Rebecca Sutton Koeser
Plenty of people use Google Sheets for data curation; what would it look like to use Google Sheets as a lightweight relational database? For the Princeton Ethiopian Miracles of Mary (PEMM) project, led by Wendy Laura Belcher, we had a chance to experiment with this. Our hope was to set the PEMM team up with a solution that they would be able to manage themselves, instead of implementing a custom relational database, which would require developer time to make changes and incur a heavier maintenance cost.
The PEMM project is gathering information on a collection of Ethiopian folktales called the Täˀammərä Maryam (The Miracles of Mary), written from the 1300s through the 1900s in the ancient African language of Gəˁəz (classical Ethiopic), with the goal of answering important questions about dating and story origins, and determining whether there is a particular order or set of stories that typically appear together in manuscripts.
In the first year of dataset curation support from CDH, Wendy and one collaborator worked on a semi-structured text file based on a digitized version of an early catalog of the miracles of Mary. We wanted to get her data into a more structured environment where more collaborators could get involved, there would be more checks to ensure accurate data entry, and Wendy could start to do analysis and begin answering her research questions.
[In our early planning meetings, Wendy used to make a gesture with interleaved fingers, saying “my data is like this!” — which I understood as her sense of her data as relational, with connected information that doesn’t fit in a single table or spreadsheet. © Rebecca Sutton Koeser, all rights reserved.]
So, we started planning for the move to a set of related spreadsheets; we modeled and designed the data structure similar to the way we would a relational database. The process of discussing and deciding how to model your data is always so valuable! It forces you to make important decisions. I still remember the “aha” moment when we decided to differentiate canonical stories from “story instances”: this made it possible to separate the description and documentation for story occurrences in a particular manuscript, while still linking them to the canonical story, when known.
[Initial data model diagram included in the PEMM charter]
Once the initial data model was settled, we needed to create the spreadsheets in Google Sheets. We could have done this manually, and that probably would have been faster! But we wanted the process to be automated and tested, so we wrote custom code to create the spreadsheets with the structure we had determined. In a relational database, connections between different tables would use database identifiers and what are called “foreign keys”; in our spreadsheets, we used a vertical lookup from another sheet. For example, story instances have lookups to both canonical story and manuscript, so that a particular story can be identified and located without having to duplicate that information in each row.
After the spreadsheet was created and we’d tested that everything was working correctly, I wrote a Python script to parse and convert the data from Wendy’s semi-structured text file into a set of CSV files for import into Google Sheets. As a result, Wendy and her team were able to start working with the data in a new way. Immediately, errors that had not been noticeable in the text file were obvious and easy to correct in the spreadsheet.
Getting the data into Google Sheets didn’t just help with data entry and review. It also meant we could start using the in-progress data with code and tools. We wrote code to regularly pull data from the Google Sheets document as CSV for use in an “Incipit Search Tool”, to help the project team identify stories based on the first meaningful lines of the story. Because the tool is regularly updated with data from the spreadsheet, the search results improved as they added more information.
We also wrote a script to synchronize data from Google Sheets to CSV files in a data repository on GitHub. I expected it would be straightforward to synchronize the data would be relatively easy, but I was pleased that we were able to get not just the data, but information about data contributors. We use the revision information to identify the authors of recent edits, and then include them as co-authors on the git commit. This aligns with CDH values, since we care about giving people credit for their work and being open and transparent about our own work and processes. Having a copy of the data in GitHub provides an open, non-proprietary backup that can be used with other tools as data work moves forward, and also makes it easier to see changes over time by comparing revisions. We set up automatic data checks with GoodTables, and the data could be used for preliminary analysis or display with a static site generator such as Jekyll or Hugo.
Ultimately, the experiment provided many benefits to the project. Since we successfully transitioned Wendy and her project data into Google Sheets, she’s been able to recruit more collaborators, make significant progress on her project, and begin to do some preliminary analysis. But from the technical side, I’m not sure that this experiment with Google Sheets was a complete success.
Continue to the CDH blog for Koeser’s reflections on lessons learned in the experiment.