
Previous work

In recent years, several works have proposed traditional machine learning approaches to the study of ancient texts. This body of work has focused on optical character recognition and visual analysis31,32,33,34, writer identification35,36,37, text analysis38,39,40,41,42,43,44, stylometrics45 and document dating46. It is only very recently that scholarship has begun to use deep learning and neural networks for optical character recognition47,48,49,50,51,52,53,54,55, text analysis56, machine translation of ancient texts57,58,59, authorship attribution60,61 and deciphering ancient languages62,63, and to apply them to the study of the form and style of epigraphic monuments64.

The closest work to Ithaca is our 2019 research on ancient text restoration: Pythia15. Pythia was to our knowledge the first ancient text restoration model to use deep neural networks, and was followed by blank language models18, Babylonian65 and Korean text translation and restoration17, Latin BERT for language modelling, part-of-speech tagging, word sense disambiguation and word similarity16, and the classification of Cuneiform tablets by period66.

Ithaca is to our knowledge the first model to tackle the three central tasks in the epigrapher’s workflow holistically. Not only does it advance the previous state-of-the-art set by Pythia, but it also uses deep learning for geographical and chronological attribution for the very first time and on an unprecedented scale. Ithaca offers interpretable outputs, showcasing the rising importance of cooperation between human experts and machine learning67—as exemplified by our experimental evaluation.

Most importantly, this work shows how matching human experts with deep learning architectures to tackle tasks collaboratively can surpass the individual (unaided) performance of both humans and model on the same tasks. Indeed, recent medical research68,69 further confirms the importance of hybrid architectures in addressing real-world problems. The present work makes human expert interaction possible by visualizing the output probability distributions for all tasks using multiple charts and maps, and augmenting their interpretability by means of saliency maps. It is our hope that this work may set a new standard for the field of digital epigraphy, by using advanced deep learning architectures to support the work of ancient historians.

Generating the I.PHI corpus

When restoring damaged inscriptions, epigraphers conjecture the total number of missing characters based on grammatical and syntactical considerations, and on the reconstructed physical form of the text5. Conjectured missing characters that cannot be restored are conventionally marked with periods or hyphens, one hyphen equating to one missing character. Moreover, PHI presents interpretive transcriptions of the texts (including capitalization, punctuation, word division, lower-case letter conversion).

Thus, moving from the PHI dataset, we substantially expand the ruleset for filtering human annotations previously conceived for Pythia, rendering the text machine-actionable. We removed 9,441 duplicate texts and filtered out all inscriptions under 50 characters in length, whereas, in Pythia’s dataset, we had excluded all texts with fewer than 100 characters. To increase the amount of available text, we retained the supplements proposed by epigraphers (conventionally added between square brackets), and we matched the number of unrestored characters with an equal number of ‘–’ symbols, as is commonly done by epigraphers (Extended Data Fig. 1).

Each PHI inscription is assigned to a region of the ancient Mediterranean world (Extended Data Fig. 2), and includes an additional metadata string referring to the date proposed by epigraphers for the text (Extended Data Fig. 1). The chronological information is noted in a variety of formats (historical eras, precise year intervals); in several languages (including Latin); ranging before (bce) and after (ce) the Common Era; lacking in standardized notation (‘early’, ‘first half’, ‘1st half’, ‘beginning’, ‘beg.’) and often using fuzzy wording (‘late 7th/6th ac.’, ‘ca. 100 a.?’, ‘bef. 64 ad’). After crafting an extended ruleset, we succeeded in generating well-defined date intervals for 60% of all PHI inscriptions, as the chronological metadata of the remaining 40% is either missing or unprocessable. The resulting I.PHI dataset contains 1.93× more inscriptions than Pythia’s dataset. The texts whose numerical PHI identifier (PHI ID) ended in 3 or 4 were held out and used as test and validation sets, respectively (Extended Data Table 1).
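The held-out split described above can be sketched in a few lines. This is a minimal illustration, assuming PHI IDs are available as integers; the function name `assign_split` is hypothetical and not taken from the Ithaca codebase.

```python
def assign_split(phi_id: int) -> str:
    """Assign an inscription to a split from the last digit of its PHI ID."""
    last_digit = phi_id % 10
    if last_digit == 3:
        return "test"        # IDs ending in 3 form the test set
    if last_digit == 4:
        return "validation"  # IDs ending in 4 form the validation set
    return "train"           # everything else is used for training


if __name__ == "__main__":
    for phi_id in (1203, 14, 25):
        print(phi_id, assign_split(phi_id))
```

Because the split depends only on the identifier, it is deterministic and reproducible without storing a separate list of held-out texts.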

Ithaca architecture

Inputs

For each inscription, the input of the model consists of (1) a sequence of character embeddings (real-valued vectors, each representing the character of the alphabet that occurs at the corresponding position of the inscription); (2) an equally long sequence of word embeddings (real-valued vectors, each representing the vocabulary word at the corresponding character position of the inscription; Fig. 2); and (3) positional embeddings (also real-valued vectors, each representing a position of the input sequence). The first two kinds of embeddings are randomly initialized and learned when training Ithaca (via backpropagation). The positional embeddings are also trainable and they are initialized with a separate sinusoidal function per dimension22 to maintain a symmetrical distance between neighbouring steps and smoothly decay over the maximum length of 768 characters. Our vocabulary includes every word appearing more than 10 times in I.PHI (35,884 words), while damaged or ‘unknown’ (under-represented) words are rendered with an ‘[unk]’ symbol. The joint use of character and word embeddings enables the architecture of Ithaca to be both character- and context-aware70,71,72. Finally, the input sequence is padded with a start-of-sentence character ‘<’.
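To make the pairing of character and word inputs concrete, the sketch below builds a vocabulary of words appearing more than 10 times and then repeats each word's token at every character position it covers. It is only an illustration: the helper names, the whitespace tokenization and the choice to map space positions to a separate placeholder are assumptions, not details given in the text.

```python
from collections import Counter
from typing import Iterable, List, Set

UNK = "[unk]"   # symbol for damaged or under-represented words
SPACE = " "     # assumed placeholder for positions between words


def build_vocab(texts: Iterable[str], min_count: int = 10) -> Set[str]:
    """Keep every word that appears more than `min_count` times."""
    counts = Counter(word for text in texts for word in text.split())
    return {word for word, count in counts.items() if count > min_count}


def words_per_character(text: str, vocab: Set[str]) -> List[str]:
    """Return one word token per character position of `text`."""
    aligned: List[str] = []
    for k, word in enumerate(text.split(" ")):
        if k > 0:
            aligned.append(SPACE)            # the separating space itself
        token = word if word in vocab else UNK
        aligned.extend([token] * len(word))  # repeat token over the word's characters
    return aligned


if __name__ == "__main__":
    vocab = {"θεος"}
    print(words_per_character("θεος τυχη", vocab))
```

The returned list has exactly as many entries as the text has characters, so it can be looked up in a word-embedding table and concatenated with the character embeddings position by position.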

Torso

The three input sequences are combined by concatenating the different embeddings per-character position and the resulting sequence is fed through the torso of the model. The architecture of Ithaca’s torso consists of eight stacked transformer decoder blocks, inspired by the large-scale transformer model BigBird73. Every block uses four sparse attention heads (using global, local and random attention mechanisms), which reduce the context-length dependency from quadratic to linear, therefore enabling the model to handle lengthier sequences73 compared with classical transformers. Furthermore, the attention mechanism is ‘multi-head’ (Fig. 2) in the sense that it can learn to consider different types of information extracted from the input. For example, different attention heads may be sensitive to particular character sequences, or more perceptive to certain words and phrases with distinctive morphosyntactic or semantic features. Finally, to overcome problems that hinder the stacking of such complicated blocks, each transformer block uses residual connections and layer normalization (shown as ‘add and normalize’ in Fig. 2).

Ithaca’s torso outputs a sequence whose length is equal to the number of input characters, and each item in this sequence is a 2,048-dimensional embedding vector. Each task head consists of a two-layer feedforward network followed by a softmax function. There are three different task heads, handling region attribution, chronological attribution and restoration respectively. To predict the regions and dates, Ithaca uses the first output embedding (t = 1) and passes it on to the two corresponding heads. This arrangement is similar to that of DocBERT74 and works better than other pooling methods (such as mean- and max-pooling over the output embeddings) in our experimental evaluation. Finally, for the restoration task, Ithaca uses the remaining output embeddings (t > 1) as there is a direct correspondence with the input text characters: for each missing character position, the corresponding output embedding of the torso is fed to the head of the restoration task, which predicts the missing character.

Data preparation and augmentation

I.PHI may be the first multitask dataset of machine-actionable epigraphical text, but its size is still several orders of magnitude smaller than typical modern language datasets. To avert the risk of overfitting, which is common in large-scale deep neural network architectures, we apply several data augmentation methods, described below, to artificially increase the size of I.PHI’s training set. Our preliminary experimental evaluation found that these methods are crucial in achieving the reported performance. These augmentation methods are applied anew whenever a training inscription is re-encountered in each training epoch.

Text clipping

For each inscription, we select an arbitrary section of its text and ignore the remaining text. We implement this by first sampling a segment length between 50 and 768 characters, and then sampling the starting index of the segment. This method helps Ithaca to generalize and improve the handling of partial inputs.
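A minimal sketch of this clipping step follows. It assumes uniform sampling for both the segment length and the start index, and it returns texts that are already at or below the minimum length unchanged; neither detail is specified in the text, and `clip_text` is a hypothetical name.

```python
import random

MIN_LEN = 50    # minimum segment length in characters
MAX_LEN = 768   # maximum input length of the model


def clip_text(text: str, rng: random.Random) -> str:
    """Return a random contiguous segment of `text`."""
    if len(text) <= MIN_LEN:
        return text  # assumption: texts this short are left untouched
    # First sample the segment length, then the starting index.
    seg_len = rng.randint(MIN_LEN, min(MAX_LEN, len(text)))
    start = rng.randint(0, len(text) - seg_len)
    return text[start:start + seg_len]


if __name__ == "__main__":
    rng = random.Random(0)
    segment = clip_text("α" * 1000, rng)
    print(len(segment))
```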

Text masking

Forcing the model to rely on contextual information often leads to improvements in prediction. To achieve this in our model, during training, we randomly hide up to half of the input text by replacing sequences of characters sampled from a geometric distribution (P = 0.1) with ‘–’. This span masking is intended to replicate the distribution over the length of missing characters estimated from the dataset, and uses the hidden ground-truth characters as target labels for the restoration task.
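The span masking can be illustrated as follows. This is a sketch under stated assumptions: the masking budget is drawn uniformly between zero and half of the text, span starts are drawn uniformly, and span lengths come from a geometric distribution with P = 0.1 via inverse-transform sampling. The function names are hypothetical.

```python
import math
import random
from typing import Dict, Tuple

MASK = "–"  # symbol used for a missing character


def sample_geometric(p: float, rng: random.Random) -> int:
    """Sample from a geometric distribution on {1, 2, ...}."""
    u = 1.0 - rng.random()  # u lies in (0, 1]
    return max(1, math.ceil(math.log(u) / math.log(1.0 - p)))


def mask_spans(text: str, rng: random.Random, p: float = 0.1,
               max_fraction: float = 0.5) -> Tuple[str, Dict[int, str]]:
    """Hide up to `max_fraction` of `text`; return masked text and targets."""
    chars = list(text)
    budget = rng.randint(0, int(len(chars) * max_fraction))
    masked = [False] * len(chars)
    n_masked, attempts = 0, 0
    while n_masked < budget and attempts < 10 * len(chars):
        attempts += 1
        span = min(sample_geometric(p, rng), budget - n_masked)
        start = rng.randrange(len(chars))
        for i in range(start, min(start + span, len(chars))):
            if not masked[i]:
                masked[i] = True
                n_masked += 1
    masked_text = "".join(MASK if m else c for c, m in zip(chars, masked))
    # The hidden characters become the labels of the restoration task.
    targets = {i: c for i, (c, m) in enumerate(zip(chars, masked)) if m}
    return masked_text, targets


if __name__ == "__main__":
    print(mask_spans("abcdefghij" * 3, random.Random(1))[0])
```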

Word deletion

During training, we also delete words from each input text (without replacing them with any special characters in this case) with a 20% probability. Here, the goal is again to increase variability in the training data to improve the model’s ability to generalize over all possible ways in which inscriptions are damaged75.
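As a rough illustration, word deletion could look like the sketch below. It reads the 20% figure as an independent per-word deletion probability and splits words on single spaces; both are assumptions, since the text does not say how the probability is applied.

```python
import random


def delete_words(text: str, rng: random.Random, p: float = 0.2) -> str:
    """Drop each word independently with probability `p`, adding no placeholder."""
    words = text.split(" ")
    kept = [word for word in words if rng.random() >= p]
    return " ".join(kept)


if __name__ == "__main__":
    print(delete_words("εδοξεν τηι βουληι και τωι δημωι", random.Random(0)))
```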

Sentence swap

By randomly swapping sentences in the input text with a 25% probability, we generate multiple input–label pairs for the auxiliary task of next-sentence prediction (NSP)75 (see below).

Data circularity

Ithaca’s source dataset (PHI) is a synthesis of generations of scholarly research. Epigraphers typically restore texts and attribute them chronologically through a process of induction. Textual restorations are proposed on the basis of parallels, mediated by wider historical and linguistic knowledge; chronological attributions are proposed partly from archaeological and contextual information, partly from textual form and content, and partly from textual and material parallels. The texts on which Ithaca trains include previous scholarly restorations; and the dates recorded are the product of accumulated scholarly knowledge and induction from archaeological, historical and textual study. This might be thought to imply circularity, but that would be true only if Ithaca were operating in a world of objective data and aiming to offer a single objectively true solution. Rather, Ithaca is an assistive tool aiming to improve on and facilitate a scholarly process of induction, model uncertainty and propose possible solutions for the scholar to consider.

Considering textual restoration, Ithaca avoids the risk of ‘history from square brackets’76,77,78 (assuming any proposed restoration to be ground truth, meaning the accepted consensus, rather than merely one of several hypotheses), because none of Ithaca’s proposed restorations are assumed to be objectively certain—instead, they are presented as plausible suggestions. Furthermore, the inclusion of existing scholarly conjectures within the training set itself does not constitute a form of ‘history from square brackets’, as such conjectures are themselves plausible restorations achieved by a process of induction and considered acceptable by one or more experts, and as such are precisely the sort of result that Ithaca itself aims to generate. The value of Ithaca is indeed its ability to learn from the largest possible dataset of attested and possible texts, making the underlying process of inductive reasoning as powerful as possible, and so generating possible restorations for scholars to evaluate.

As for chronological attribution, the dataset on which Ithaca trains is founded in the past study of multiple elements (such as archaeological provenance, material form, textual content and form). Ithaca in turn learns through close attention to the text alone. The attributions proposed by Ithaca therefore have their basis in the inductive study of a vast textual dataset and its correlation to chronological data that are more broadly derived. Ithaca is therefore able to bring some refinement to those attempts to date the texts through the application of machine learning specifically to the textual patterns in that data. Thus, Ithaca is, in this case, a part of that scholarly process, and no more or less circular in its reasoning than any other scholar.

For the task of restoration, we use the text-masking augmentation method to mask parts of the input and produce ground truths. We subsequently use a cross-entropy loss to train Ithaca to predict the missing characters. The cross-entropy loss is also used for geographical attribution, using the region metadata as target labels. We further apply label smoothing with a coefficient of 10% to avoid overfitting and to provide historians with a smoother distribution of predicted hypotheses. For the task of chronological attribution, Ithaca discretizes all dates between 800 bc and ad 800 with a bin size of 10 years. This range covers the majority of the PHI dataset entries and encompasses the conventional date range for Greek epigraphy. The processed ground-truth date intervals are discretized into bins of equal probability, forming the target probability distribution. The limitations of discretizing and amalgamating date ranges of different levels of precision based on past scholarship have been noted79,80—the scale of data on which Ithaca trains, together with the increased attention to textual patterns (compared with the previous paragraph), at least partially meet that challenge. We then use the Kullback–Leibler divergence to minimize the difference between target and predicted probability distribution (Fig. 3c).

Finally, to allow for better modelling of context, we introduce a next sentence prediction loss, an auxiliary function common to language modelling tasks81. During training, we randomly shuffle some of the sentences of the input text, and at the end of each (non-final) sentence (marked by a full stop, ʻ.ʼ) we predict whether the next sentence is in the correct order (valid) or a product of the shuffling augmentation. By deploying the torso’s output embeddings for the full stops, we introduce an additional feedforward network that uses binary cross-entropy to predict the validity of the next sentence whenever a ʻ.ʼ character appears.
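One way to generate input-label pairs for this auxiliary task is sketched below. It assumes that sentences end with a full stop, that each adjacent pair is swapped independently with a 25% probability, and that a boundary is labelled valid when the following sentence is the original successor. These choices, and the name `swap_sentences`, are illustrative assumptions.

```python
import random
from typing import List, Tuple


def swap_sentences(text: str, rng: random.Random,
                   p: float = 0.25) -> Tuple[str, List[bool]]:
    """Swap adjacent sentences at random; return the new text and NSP labels."""
    sentences = [s for s in text.split(".") if s]
    order = list(range(len(sentences)))
    for i in range(len(order) - 1):
        if rng.random() < p:
            order[i], order[i + 1] = order[i + 1], order[i]
    # One label per non-final sentence: True if the next sentence is the
    # one that followed it in the original text.
    labels = [order[k + 1] == order[k] + 1 for k in range(len(order) - 1)]
    shuffled = ".".join(sentences[j] for j in order) + "."
    return shuffled, labels


if __name__ == "__main__":
    print(swap_sentences("a.b.c.", random.Random(0)))
```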

Using this setup, Ithaca was trained for a week on 128 Tensor Processing Units (TPU) v4 pods on the Google Cloud Platform. The effective batch size was 8,192 texts and a LAMB optimizer82 was used to optimize Ithaca’s parameters with a learning rate of 3 × 10⁻⁴. Using Bayesian optimization hyperparameter search, the loss functions of each task were combined using the following function:

$$L=3\times L_{\mathrm{Restoration}}+2\times L_{\mathrm{Region}}+1.25\times L_{\mathrm{Date}}+0.01\times L_{\mathrm{NSP}}.$$

We do not use a separate masked (token) language modelling loss, which is commonly used when pretraining language models, as it is very similar to the restoration loss, although the latter masks characters instead of tokens.
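As a small worked example of the weighting, the sketch below combines four per-task loss values with the coefficients given above. The dictionary keys and the function name are illustrative only.

```python
from typing import Dict

# Coefficients taken from the combined loss in the text.
LOSS_WEIGHTS: Dict[str, float] = {
    "restoration": 3.0,
    "region": 2.0,
    "date": 1.25,
    "nsp": 0.01,
}


def combined_loss(losses: Dict[str, float]) -> float:
    """Weighted sum of the per-task losses."""
    return sum(weight * losses[name] for name, weight in LOSS_WEIGHTS.items())


if __name__ == "__main__":
    # With every task loss equal to 1, the total is the sum of the weights.
    print(combined_loss({"restoration": 1.0, "region": 1.0,
                         "date": 1.0, "nsp": 1.0}))
```

The very small NSP weight means the auxiliary task nudges the representation without competing with the three main objectives.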

To obtain Ithaca’s textual restoration predictions, we select a sequence of missing characters to predict and use Beam Search with a beam width of 100. Instead of using a standard sequential Beam Search, we take advantage of Ithaca’s non-autoregressive nature83,84,85, and use a non-sequential one instead. Each beam starts with the prediction scoring the highest confidence86, then proceeds iteratively to restore at each time-step the characters of which the certainty is the highest. We found that this version of Beam Search performed substantially better in our evaluation metrics. For region attribution, the outputs are presented as a plot of the top 10 predictions; for chronological attributions, we visualize the model’s predictive distribution over possible date bins. Finally, to reduce the variance of random segment selections, we repeat the process ten times and report results averaged over the iterations.
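The non-sequential decoding idea can be illustrated with the sketch below, which is one possible reading of the description rather than the published implementation. It assumes a hypothetical `predict` callable that, given a partially restored text, returns a character distribution for every position still marked ‘–’. At each step every beam fills the position where the model is most certain, and hypotheses are ranked by summed log-probability.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical model interface: text -> {masked position: {char: prob}}.
Predictor = Callable[[str], Dict[int, Dict[str, float]]]


def non_sequential_beam_search(text: str, predict: Predictor,
                               beam_width: int = 100,
                               mask: str = "–") -> List[Tuple[str, float]]:
    """Restore all masked characters, most confident position first."""
    beams: List[Tuple[float, str]] = [(0.0, text)]
    for _ in range(text.count(mask)):
        candidates: Dict[str, float] = {}
        for score, hyp in beams:
            dists = predict(hyp)
            # Pick the masked position with the highest top probability.
            pos = max(dists, key=lambda i: max(dists[i].values()))
            for ch, prob in dists[pos].items():
                if prob <= 0.0:
                    continue
                new_hyp = hyp[:pos] + ch + hyp[pos + 1:]
                new_score = score + math.log(prob)
                if new_score > candidates.get(new_hyp, -math.inf):
                    candidates[new_hyp] = new_score
        beams = sorted(((s, h) for h, s in candidates.items()),
                       reverse=True)[:beam_width]
    return [(hyp, score) for score, hyp in beams]
```

In this sketch the model is queried again after every filled character, so later predictions can condition on earlier restorations regardless of their position in the text.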

Ancient historian baseline

The evaluators for ancient text restoration were two graduate students of ancient history, with 7 years of historical and linguistic training and specializing in Greek history and epigraphic documents. Thus, they can be assumed to be more capable than the ‘average’ ancient historian, but not yet equivalent to the (very small number of) established specialists in the field. The scholars were allowed to use the training set to search for textual ‘parallels’, and made an average of 50 restorations in 2 h.

Although Ithaca can indeed propose restoration hypotheses faster, and model its prediction uncertainty, it cannot make choices on the basis of historical and material context. Thus, the experimental setup cannot be considered to be a direct comparison between human historians and machine learning, nor are the evaluators assumed to be a proxy for all historians. Instead, the experiment was intended to measure the difficulty of the task and the potential for cooperative artificial intelligence.

Onomastics baseline

Greek nomenclature is commonly used by epigraphers as one of several elements to inform their attribution predictions87. Inspired by this method in the wider epigraphic workflow, we designed an ‘onomastic’ baseline, of which the predictions are based exclusively on the metadata associated with Greek personal names. Five annotators searched for name(s) appearing in a set of inscriptions in the Lexicon of Greek Personal Names (LGPN), a database recording the geographical and chronological distribution of ancient names27, and based their attribution hypotheses on the LGPN’s distribution data. Evaluators were also provided with the inscription’s date or place of writing for the geographical or chronological attribution tasks, respectively.

Restoration metrics

To evaluate different restoration methods, for every inscription, we predict a sequence of 1–10 contiguous missing characters. These lengths account for 83% of the distribution of missing character lengths in I.PHI, and enable comparisons with both previous work and the human baselines. Note that, thanks to the text-masking augmentation adopted during training, Ithaca could potentially restore up to half of the input text.

Although the number of characters to be predicted reflects the difficulty of the task, the restored sequences in the test sets held out for human evaluation might not necessarily maintain the same distribution of lengths (as they were a subset of the test set). Thus, instead of reporting only the average scores over the entire test set (as done in previous work), we chose to account for these length discrepancies and compute the average scores for each restored sequence length. First, we computed a separate character error rate (CER) for all samples of each length (between 1 and 10 characters),

$$\mathrm{CER}_{l}=\frac{1}{\sum_{i}^{N}I_{\mathrm{len}_{i}=l}}\sum_{i}^{N}I_{\mathrm{len}_{i}=l}\times \frac{\mathrm{EditDistance}(\mathrm{pred}_{i},\mathrm{target}_{i})}{l},$$

where I is the indicator function, len_i denotes the length of the i-th sample, N is the number of samples, pred_i is the predicted sequence of missing characters of the i-th sample and target_i the corresponding target sequence. We next calculate the average for all lengths:

$$\mathrm{CER}_{\mathrm{score}}=\frac{1}{L}\sum_{l}^{L}\mathrm{CER}_{l},$$

where L = 10 is the maximum length.

As human annotators annotated only a subset of the test set owing to time constraints, we macro-average over sequence lengths: this assigns equal importance to all sample lengths, representing the difficulty of the task independently of dataset statistics and therefore enabling a fair comparison of the methods. Similarly, for accuracy, we first computed a separate accuracy per length, and then the average:

$$\mathrm{accuracy}_{l}=\frac{1}{\sum_{i}^{N}I_{\mathrm{len}_{i}=l}}\sum_{i}^{N}I_{\mathrm{len}_{i}=l}\times I_{\mathrm{pred}_{i}=\mathrm{target}_{i}},$$

$$\mathrm{accuracy}_{\mathrm{score}}=\frac{1}{L}\sum_{l}^{L}\mathrm{accuracy}_{l}.$$
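The two macro-averaged restoration metrics can be sketched as follows. The sketch uses a standard Levenshtein edit distance and, as an assumption, averages only over the lengths that actually occur in the sample set; this matches the formulas whenever every length from 1 to L is represented. All function names are hypothetical.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (predicted sequence, target sequence)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def macro_average(samples: List[Sample],
                  per_sample: Callable[[str, str], float],
                  max_len: int = 10) -> float:
    """Average a per-sample score within each target length, then across lengths."""
    per_length = []
    for length in range(1, max_len + 1):
        group = [(p, t) for p, t in samples if len(t) == length]
        if group:
            per_length.append(sum(per_sample(p, t) for p, t in group) / len(group))
    if not per_length:
        raise ValueError("no samples with a target length in range")
    return sum(per_length) / len(per_length)


def cer_score(samples: List[Sample]) -> float:
    return macro_average(samples, lambda p, t: edit_distance(p, t) / len(t))


def accuracy_score(samples: List[Sample]) -> float:
    return macro_average(samples, lambda p, t: float(p == t))


if __name__ == "__main__":
    data = [("a", "a"), ("b", "a"), ("ab", "ab")]
    print(cer_score(data), accuracy_score(data))
```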

As our model outputs a predictive distribution in the chronological attribution task, we introduce an interpretable metric to measure the distance in years between a prediction and the ground-truth interval (Fig. 3c). More specifically, we use a distance metric between the mean of the predictive distribution (pred_avg) and the target ground-truth interval; the latter is defined by a minimum (gt_min) and a maximum (gt_max) date in years:

$$\mathrm{Years}=\left\{\begin{array}{ll}0, & \mathrm{if}\;\mathrm{gt}_{\max}\ge \mathrm{pred}_{\mathrm{avg}}\ge \mathrm{gt}_{\min}\\ |\mathrm{pred}_{\mathrm{avg}}-\mathrm{gt}_{\max}|, & \mathrm{if}\;\mathrm{pred}_{\mathrm{avg}} > \mathrm{gt}_{\max}\\ |\mathrm{pred}_{\mathrm{avg}}-\mathrm{gt}_{\min}|, & \mathrm{if}\;\mathrm{pred}_{\mathrm{avg}} < \mathrm{gt}_{\min}\end{array}\right.$$
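A direct transcription of this metric is sketched below, together with a helper that computes the mean of a binned predictive distribution. The helper assumes 10-year bins starting at 800 bc, bc years encoded as negative numbers, and bin centres as representative years; these are illustrative assumptions, as are the function names.

```python
from typing import List


def predictive_mean(probs: List[float], min_year: int = -800,
                    bin_size: int = 10) -> float:
    """Mean year of a binned distribution, using bin centres."""
    return sum(p * (min_year + (i + 0.5) * bin_size)
               for i, p in enumerate(probs))


def years_distance(pred_avg: float, gt_min: float, gt_max: float) -> float:
    """Distance in years between a predicted date and a ground-truth interval."""
    if gt_min <= pred_avg <= gt_max:
        return 0.0                     # prediction falls inside the interval
    if pred_avg > gt_max:
        return abs(pred_avg - gt_max)  # prediction is too late
    return abs(pred_avg - gt_min)      # prediction is too early


if __name__ == "__main__":
    # A prediction of 420 bc against a ground-truth interval of 424-423 bc.
    print(years_distance(-420, -424, -423))
```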

Model selection

The final model was obtained by storing the best-performing model on the validation set by using a combined metric that sums the accuracy for textual restoration and geographical attribution, and the distance in years divided by 100 for chronological attribution to make the magnitude comparable. The extensive computational resources required to train our model made the Pareto frontier computation infeasible.

Ithaca’s predictions are 5× closer to ground truths than those recorded in the onomastics baseline (144.4 years). More specifically, Ithaca’s average date prediction is within 28.7 years of the ground-truth date interval, and the median is only 3 years. The results are shown in detail in Extended Data Fig. 3.

Restoring full texts with Ithaca

To overcome memory constraints and length limitations for long inscriptions (>768 characters), Ithaca can be applied iteratively to restore all missing text in a damaged inscription. We experimented with this option on inscription IG II² 116, which is missing 378 characters, and compared Ithaca’s predictions with those of our previous work Pythia on the same text, using the authoritative edition published by Rhodes and Osborne as ground truths88. The models’ correct restorations are highlighted in green (Extended Data Fig. 4), and the erroneous ones in red. In a real-world scenario, both Ithaca and Pythia would provide a ranked set of 20 restoration hypotheses. The comparison in performance between Pythia and Ithaca is stark (74 versus 45 mistakes): moreover, in all cases in which the restoration is in red, the ground-truth sequence existed within the beam of Ithaca’s top 20 hypotheses.

Epigraphers determine the original location where an inscription was written by examining the personal names, local or regional dialectal varieties, and idiosyncratic lexicon or style of an inscription. Moving from this methodological premise, and to discover underlying patterns in Ithaca’s geographical predictions, we compute statistics to track the words that appear most frequently in texts whose region Ithaca predicts correctly. Thus, for each word of the test set, we compute an average accuracy and a frequency of appearance. This visualization is intended to evaluate whether the occurrence of particular words could be correlated to the model’s geographical attributions.

The most frequent words that appear in texts with high prediction accuracy clustered primarily in inscriptions from the region of Delphi, and pertained to the epigraphic genre of ‘manumission inscriptions’ (Extended Data Table 2 for an example). Ancient Greek society depended heavily on unfree labour, but slaves could be freed through a process known as ‘manumission’, which was publicly documented and certified by inscriptions89,90. Over 1,000 such texts dating between around 201 bc and ad 100 have been found in Delphi91,92. The words appearing in Ithaca’s accuracy statistics are identified as typical of these manumission texts, which are in turn distinctive of this region (for example, ἐπίστευσε, άποδμενος, καταδουλισμωι, βεβαιωτήρ, ωνάν): these words could therefore be underpinning the correct attribution predictions (a detailed example is offered in Extended Data Table 2). Further study can now be dedicated to investigating stylized manumissions as distinctive of Delphi.

To further assess the impact of Ithaca’s output visualization techniques in a real-world scenario, we also analysed the saliency maps for geographical attribution of the manumission inscriptions. Indeed, the saliency maps for the Delphic inscription BCH 66/67 (1942/3) 82,9, for example, highlight words typically found in manumission texts and which also appear in Ithaca’s word statistics: these words (ἐπίστευσε, ἐλευθερος, ποιέουσα, ἀποτρέχουσα) have the most important role in the geographical attribution of the inscription, while also betraying the text’s genre as a typical slave manumission inscription (Extended Data Fig. 5b).

Redating disputed Athenian decrees

In the absence of helpful internal evidence of a text’s date (for example, the mention of known historical figures93), epigraphers typically derive an approximate date on the basis of a text’s content, letterforms and grammatical criteria. For example, one of the most notorious methodological debates in epigraphy concerns the ‘three-bar sigma’ dating convention, which holds that no Athenian public document containing the three-bar sigma letter (ϟ) could be dated after the year 446/5 bc, when the letter was supplanted by the four-bar sigma (Σ). On the basis of this chronological benchmark, a group of inscriptions whose interpretation is central to the political history of Classical Athens, and which feature the earlier letter ϟ, were dated to pre-446/5 bc by many authoritative corpora28,94. This set of decrees exists in the PHI dataset (Extended Data Table 3), and their dating labels follow the conventional ‘higher’ dating of the three-bar sigma criterion.

However, this orthodox dating system soon proved to be problematic: the high dates proposed for these decrees did not agree with contemporary literary accounts reporting on Athenian imperialist policies. Few historians contested the validity of the sigma criterion29,95, but in 1990 photo-enhancement and laser scanning confirmed the down-dating of an inscription featuring the three-bar sigma (the Egesta decree, IG I³ 11) from 458 to 418 bc96. Over the following decade, the sigma’s traditional cut-off date was revisited, and the dates of other decrees were also pushed back28,97.

Ithaca’s predictions for this set of disputed inscriptions independently align with the most recent dating breakthroughs (Extended Data Fig. 6). For example, the (in)famous Chalcis decree (IG I³ 40; Extended Data Fig. 7), which records an oath of allegiance sworn by the city of Chalcis to Athens98 and is traditionally dated to 446/5 bc28, is attributed by Ithaca to 420 bc, therefore concurring with the lower dating hypothesis of 424/3 bc proposed by more recent scholarship99. Perhaps the most compelling example of Ithaca’s prediction independently aligning with a lower dating hypothesis is the decree of Kleinias (IG I³ 34)100, regulating the collection of tribute across the Athenian empire. The sigma dating system would assign the inscription to 448/7 bc28, but scholars have recently challenged this orthodoxy and proposed the later date of 425/4 bc101. Ithaca’s prediction agrees precisely with the latter, dating the famous decree to 424 bc.

Ithaca has re-dated a number of these key inscriptions with striking accuracy (Extended Data Table 3). Although it may seem slight, this 40/30-year chronological reorganization has considerable implications for our grasp of Athenian imperial behaviour, leading historians to a more profound understanding of one of the most momentous periods of ancient history28,97. The fact that Ithaca was trained on the largest available dataset of Greek epigraphic texts makes it possible to challenge or overcome individual biases or, indeed, errors in the existing academic tradition, notwithstanding the fact that the dataset in question is originally based on the accumulated academic tradition.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.