GSoC 2020: CDLI Accounting Corpus Analysis and Visualizations
The CDLI hosts a broad assortment of ancient text data, including thousands of administrative documents which record the amounts of goods involved in transactions. These documents are a valuable window into the economy of the ancient Near East, but the corpora are far too large to assess properly by hand. Moreover, the numeric notations used by ancient scribes are opaque to modern readers, and the need to convert to modern notation adds an extra complication to texts which may already be challenging to interpret.
This project has sought to make these accounts more accessible to experts and laypeople alike. We provide tools for converting Sumerian numbers to modern Arabic notation and use the converted values to compute a variety of statistics about counted objects. We also derive information about collocated items and descriptors, and present this information through a series of interactive visualizations designed to help users understand how different items were represented in the ancient Mesopotamian economy.
The code for this project is hosted in the cdli-accounting-viz repository; see the README for installation instructions. It has been merged into the rest of the CDLI framework in merge request !151.
- Numeral conversion
  - Code to convert between Sumerian and Arabic numerals
  - Test cases extracted from Sumerian grammars
- Text segmentation
  - Code to identify the boundary between successive entries in a text when this is not indicated by a line break in the ATF
- Commodity identification
  - Code to scrape data from the ePSD and construct a dictionary in python
  - POS tag projection to identify nouns
  - WordNet rules to score how likely an item is to be a counted object
  - Identification using common determinatives
  - Syntactic rules to distinguish between counted objects and modifiers, based on word order and proximity to the numeral
  - Rules for handling alternate spellings in the corpus vs. the dictionary
- Visualizations
  - Online interface for displaying the extracted commodity data
- API
  - Endpoints to perform numeral conversion (canParse, convert)
  - Endpoints to serve commodity information (commodify)
  - Endpoints to serve data for visualizations (dictionary, getNumberSystems, summaryStats, collocations, collocationsGraph, modifiers, modifiersGraph, allValues, concordance, similar)
- Framework integration
  - Include all components as a submodule in the CDLI framework
  - Documentation and user guide
Stretch Goals:
- Support for extra languages
The main objectives were all successfully completed; there was not enough time to address the stretch goal of supporting extra languages. We will discuss each objective in turn, beginning with the numeral conversion. Suggestions for future work are given below, and can also be found on the issues page.
We consulted a range of Sumerian grammars to learn the rules underlying Sumerian metrology, and implemented these rules in the convert python module. This module supports conversion of the following types of numbers:
- cardinal and ordinal counts
- dates
- lengths
- surface area
- volume
- dry capacity
- liquid capacity
- weight
- bricks (Sumerian uses a distinct notation for counting bricks, which differs slightly from the usual cardinal notation)
To simplify conversion, we chose for each system a default unit. All values in that system will be output as a multiple of that unit. This makes conversion straightforward and is appropriate for data which will be processed by a computer. However, this approach can produce values which are cumbersome for human use: imagine measuring all distances in metres, even if the distance is orders of magnitude larger or smaller than a metre. A logical next step would be to support conversion not only between Sumerian and Arabic notations, but between Sumerian notations of different size, for example to convert a length in gi to the equivalent length measured in usz.
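To illustrate the approach, here is a minimal sketch of a conversion for the cardinal system, assuming a hypothetical value table and function name (the real convert module implements all of the systems listed above):

```python
import re

# Hypothetical value table for the cardinal system; the real convert
# module covers more units and all of the number systems listed above.
CARDINAL_VALUES = {
    "disz": 1,
    "u": 10,
    "gesz2": 60,
    "szar2": 3600,
}

def convert_cardinal(notation):
    """Convert a cardinal notation like '2(gesz2) 3(u) 4(disz)' into
    a multiple of the default unit (here disz = 1)."""
    total = 0
    # Each sign is written count(unit), e.g. 3(u) means three tens.
    for count, unit in re.findall(r"(\d+)\(([^)]+)\)", notation):
        total += int(count) * CARDINAL_VALUES[unit]
    return total

print(convert_cardinal("2(gesz2) 3(u) 4(disz)"))  # 2*60 + 3*10 + 4 = 154
```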
Test cases were extracted from the grammars (mainly Edzard 2003 and Jagersma 2010) to verify the correctness of the conversion. Although the conversion passes all of these tests, inspection of the data suggests there are some more complicated notations which the grammars do not address and which the conversion module cannot yet handle. These primarily involve discontinuities between the unit and the digit in surface area notations. The existing conversion is sufficient for the cardinal numbers which make up most of the counts in the corpus, but future work may wish to extend the conversion module to better support these less common notations. Future work could also focus on disambiguating ambiguous notations: there are cases where a notation may have multiple values and the correct reading is not clear, and allowing the conversion to take some surrounding context into account could improve disambiguation in these cases.
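To give a sense of these tests, the following hypothetical cases (run against the convert_cardinal sketch above) capture their general shape; the actual suite is larger and covers every supported number system:

```python
# Hypothetical grammar-style test cases for the cardinal system,
# where 1(gesz2) = 60 and 1(u) = 10.
GRAMMAR_TEST_CASES = [
    ("1(disz)", 1),
    ("1(u) 5(disz)", 15),
    ("1(gesz2) 1(u)", 70),
]

for notation, expected in GRAMMAR_TEST_CASES:
    assert convert_cardinal(notation) == expected
```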
The CDLI's data is recorded in ATF, which explicitly marks the boundaries between entries. For this reason, text segmentation turned out to be largely unnecessary for this project. However, a few texts record multiple items on a single line, and we have created a simple text segmentation module which divides such lines using the numerals as delimiters.
Note that some lines contain two numerals without counting multiple commodities, as in 3(asz) gu2 siki a-ra2 2(disz)-kam, "3 talents of wool, the 2nd time". Other cases assign an amount of goods to each of some number of people, in which case the total of items recorded depends on both counts, and segmenting the entry would make the count incorrect. Future work should integrate the segmentation module more closely with the commodity identification (discussed next) to disambiguate such cases and ensure the correctness of the converted counts in all cases.
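The following sketch shows the basic delimiter idea on a hypothetical example line; it deliberately ignores the hard cases just described, which is why closer integration with commodity identification is needed:

```python
import re

# A numeral sign has the form count(unit), e.g. 3(asz) or 2(disz).
NUMERAL = re.compile(r"\d+\([^)]+\)")

def segment_line(line):
    """Split a line into entries, treating each numeral as the start
    of a new entry. A deliberate simplification: a form like
    a-ra2 2(disz)-kam contains a numeral which does NOT begin a new
    entry, and detecting that requires commodity context."""
    starts = [m.start() for m in NUMERAL.finditer(line)]
    if len(starts) <= 1:
        return [line.strip()]
    starts.append(len(line))
    return [line[a:b].strip() for a, b in zip(starts, starts[1:])]

print(segment_line("3(asz) sze gur 2(disz) udu"))
# ['3(asz) sze gur', '2(disz) udu']
```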
To identify words as counted objects, we employ a variety of annotation projections and simple hand-crafted rules. There is, at present, no training data available to construct a more advanced machine learning model, nor are there tools like dependency parsers which could identify words as dependents of the numeral.
The first tool we employ is a dictionary derived from the ePSD. We provide scripts in the code/dict directory which will scrape the ePSD and process it into an .npz file. During processing, we add definitions for a word's inflected forms and we project POS tags from the English definitions.
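As an illustration of the projection step, the sketch below tags an English gloss with NLTK and transfers a coarse tag to the Sumerian headword; the "to ..." heuristic for verb glosses and the function name are simplifying assumptions rather than the exact logic of the real scripts:

```python
import nltk  # assumes the punkt and tagger data packages are installed

def project_pos(definition):
    """Project a coarse POS tag onto a Sumerian headword from its
    English gloss; a hypothetical simplification of the real scripts."""
    # Verb glosses conventionally begin with "to ...".
    if definition.startswith("to "):
        return "V"
    # Otherwise tag the gloss and use the head (final) token; NN*
    # tags mark the headword as a noun.
    tags = nltk.pos_tag(nltk.word_tokenize(definition))
    return "N" if tags and tags[-1][1].startswith("NN") else None

print(project_pos("barley"))        # 'N'
print(project_pos("to weigh out"))  # 'V'
```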
Next, semantic.py uses WordNet to find hypernyms of a word's English definition. We define some high-level synsets as being "commodity-like": these are broad classes such as "metal", "goods", "wooden object", and other categories which are likely to represent counted objects. We also blacklist synsets which are more likely to represent the owners, recipients, or donors involved in a transaction, and black- or whitelist individual words to handle edge cases which are handled incorrectly by rules that are otherwise useful. In addition to recording which words are likely to be counted objects, we record whether a word can be an adjectival modifier or a container, using similar sets of rules. Finally, we look for common determinatives, which identify commodities with high precision but low recall.
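The core of the hypernym check can be expressed compactly with NLTK's WordNet interface; the whitelist below is a hypothetical fragment, and the real rules also include the blacklist and per-word overrides described above:

```python
from nltk.corpus import wordnet as wn  # assumes the WordNet data is installed

# Hypothetical fragment of the "commodity-like" whitelist.
COMMODITY_SYNSETS = {
    wn.synset("metal.n.01"),
    wn.synset("commodity.n.01"),
    wn.synset("food.n.01"),
}

def is_commodity_like(english_word):
    """True if any noun sense of the word has a whitelisted synset
    in its hypernym closure."""
    for sense in wn.synsets(english_word, pos=wn.NOUN):
        hypernyms = set(sense.closure(lambda s: s.hypernyms()))
        if (hypernyms | {sense}) & COMMODITY_SYNSETS:
            return True
    return False

print(is_commodity_like("copper"))  # True: copper -> ... -> metal
```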
The information from these modules is passed to the commodify module, which determines the role of each word in a line. We use proximity to the numeral to determine whether a word could plausibly be a counted object; for words which are sufficiently close to the number, we use hand-crafted rules based on word order and semantic information to determine whether a word is an object, a vessel containing counted objects, a modifier describing the counted object, or none of the above.
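The sketch below shows the shape of these rules, using hypothetical tokens and semantic flags; the actual commodify module layers many more word-order and context rules on top of this idea:

```python
def label_entry(words, lexicon, window=3):
    """Assign a role to each token following a numeral, based on
    proximity to the number and the word's semantic flags."""
    labels = []
    for i, word in enumerate(words):
        flags = lexicon.get(word, set())
        if i >= window:
            labels.append((word, None))    # too far from the numeral
        elif "commodity" in flags:
            labels.append((word, "OBJECT"))
        elif "container" in flags:
            labels.append((word, "VESSEL"))
        elif "modifier" in flags:
            labels.append((word, "MODIFIER"))
        else:
            labels.append((word, None))
    return labels

# Illustrative flags: dug "pot" can contain goods; i3-gesz "sesame
# oil" is a commodity.
lexicon = {"dug": {"container"}, "i3-gesz": {"commodity"}}
print(label_entry(["dug", "i3-gesz"], lexicon))
# [('dug', 'VESSEL'), ('i3-gesz', 'OBJECT')]
```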
By manually evaluating lines sampled at random from the corpus, we find that this module achieves upwards of 90% accuracy in its labelings. We are aware of a variety of ways to improve the accuracy, but these will require time or better linguistic resources to fully realize. We have laid the groundwork for their implementation in the current codebase and leave the final implementation as future work.
First, some words have multiple spellings, and not all spellings are attested in the dictionary. We have implemented a mapping which will standardize sign names and spellings across the corpus before commodity information is extracted, but we have not fully populated this mapping with known spelling variants. These will have to be provided by experts or extracted from a sign list.
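The mapping itself is a plain substitution table applied before dictionary lookup, along these lines (the entries here are illustrative only):

```python
# Illustrative fragment of the spelling-standardization table; the
# real mapping must be populated from a sign list or by experts.
SPELLING_VARIANTS = {
    "sig2": "siki",  # e.g. map a variant reading of "wool" to one form
}

def standardize(line):
    """Rewrite each token to its canonical spelling before lookup."""
    return " ".join(SPELLING_VARIANTS.get(tok, tok) for tok in line.split())

print(standardize("3(asz) gu2 sig2"))  # '3(asz) gu2 siki'
```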
We have also implemented some capacity for identifying implicit objects. This currently works for rations where a text records a person and a volume of food without explicitly writing the food object. Other kinds of rations and implicit objects exist, and the existing functionality could be extended to detect these. This will require the commodify module to take the entire text into account, since the implied object is usually specified at the start of a document and must be remembered.
Finally, there are cases where a counted object is separated from the numeral by some intervening phrase. This often arises when the method of measurement is specified, as in "three talents, weighed using stones, of cloth". We have added rules to detect the most common of these constructions, but a general-purpose system for identifying such discontinuities would be useful.
We have constructed an online interface with interactive visualizations to help users explore the information extracted in the preceding steps.
The visualizations are described more fully in the user guide; this document will provide only a summary.
There are five modules designed to show a wide range of information about objects in the corpus:
- A histogram shows an item's overall distribution, revealing the quantities in which it is recorded and whether the distribution is unimodal or multimodal. Summary statistics give an overview of the central tendency and variation in the counts associated with an object.
- The concordance shows all of the entries which record a given object, and can be sorted by the value of the associated count or the number of times the entry occurs. This can be used to identify the contexts which cause an item to be counted in large or small quantities, and to determine which contexts are most typical for a given item.
- A list of nearby items shows which objects are recorded alongside an item of interest. A tabular view presents the raw frequency data, and a graph view additionally reveals how frequently these objects occur with one another (in any context, and not necessarily with the search term). This reveals clusters of items which tend to co-occur, and can be used to identify and compare administrative subgenres.
- A list of descriptors shows the adjectives and other modifiers which are used to describe an item. Once again, a table view shows the raw counts and a graph view shows which adjectives are used together. This can reveal different use cases for an item and show what information was relevant enough to be specified by the scribes.
- A list shows which items have the most similar distribution to a given object. This can be used to evaluate hypotheses that two items have similar patterns of use in the ancient economy.
The tools developed for this project are served via a flask application which runs on the CDLI framework. This API serves the information used to construct the visualizations, allowing other services to use this information for their own purposes. The API also serves the numeral conversion and commodity extraction functions.
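As a sketch of how one such endpoint might be wired up (the route and parameter names here are illustrative, reusing the convert_cardinal function from the earlier sketch):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/convert")
def convert_endpoint():
    # Hypothetical wiring: read a notation from the query string and
    # return its converted value as JSON.
    notation = request.args.get("query", "")
    return jsonify({"query": notation, "value": convert_cardinal(notation)})

if __name__ == "__main__":
    app.run()
```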
There was not enough time to complete the stretch goal of supporting extra languages. The linguistic resources for Sumerian were scarce and sometimes of questionable accuracy; considerable time was required to acquire and prepare them for use, and repeating this process for another language would have precluded other, more useful tasks. That time was instead spent satisfying requests from prospective users for features which will make the tool more useful in their work, including the ability to browse a custom corpus and the addition of information about which datapoints come from which texts.
This Summer of Code project was one of the larger projects I have participated in, and allowed me to improve a variety of skills related to collaborating on a large, shared codebase.
I learned how to dockerize a project to ensure consistent behaviour across devices, and how to use flask to serve an application as a web service so that users do not have to install it on their own machines. I learned the specifics of the GitLab continuous integration framework, and became better acquainted with tools such as nginx, cake, and node.js. This project has also given me the opportunity to exercise and strengthen existing skills in python, web development, and d3.js for constructing visualizations.
Outside the technological domain, I have also improved my ability to read Sumerian in order to evaluate how my tools are performing. I have become better acquainted with the workflow of Assyriologists and the tools which are useful to them, and I have a better appreciation for the state of the art in digital Assyriology, including how many useful tools exist for other languages but have yet to be developed for these ancient ones.
This project was completed by Logan Born under the mentorship of Max Ionov.