Testing the predictive strength of the comparative method: An ongoing experiment on unattested words in Western Kho-Bwa languages

Although it is well known to most historical linguists that the comparative method could in principle be used to predict hitherto unobserved words in genetically related languages, the task of word prediction is rarely discussed in the linguistic literature. Here, we introduce ‘reflex retrodiction’ as a new task for historical linguistics and report an ongoing experiment in which we use a computer-assisted workflow to retrodict reflexes for so far unobserved words in eight varieties of Western Kho-Bwa (a subgroup of Sino-Tibetan). Since, at the time of writing this report, the experiment is still ongoing, we do not report concrete results, but instead provide an estimate of our expectations by testing the performance of the computational part of our workflow on existing language data. Our results suggest that reflex retrodiction has the potential of becoming a useful tool for historically oriented fieldwork.


Introduction
It is well known that the comparative method can be used not only to reconstruct languages no longer reflected in writing systems, but also to predict structures or words in languages that have not yet been investigated or observed. Thus, when Saussure (1879) proposed, based on comparative and internal evidence, the existence of coefficients sonantiques in the system of the Indo-European proto-language, he predicted that, if ever a language was found that retained these elements, these new sounds would surface as segmental elements in certain cognate sets of the so far undetected language. These sounds are nowadays known as laryngeals (*h₁, *h₂, *h₃, see Meier-Brügger 2002), and when Hittite was identified as an Indo-European language (Hrozný 1915), one of the two sounds prognosticated by Saussure could indeed be identified in several word forms, thus providing evidence for Saussure's original 'prediction'.
Saussure's prediction was not planned as such, and it is unlikely that Saussure even thought of his theory in this way. It is, however, well known that prediction in this sense, which is more appropriately called retrodiction (since it is not directed towards future events), is possible in our discipline, even if it is less frequently discussed as such in the literature. When dealing with linguistic retrodiction, linguists try to infer the structure of so far unobserved datapoints based on the data available to them at a given point in time. Classical examples of linguistic retrodiction are the universals of grammar proposed by Greenberg (1963). As these universals are usually stated in the form of implications, we can, provided the universal holds, infer the presence of one structural feature if we know the feature that implies it (see also Blevins 2004 on predictions in the field of historical phonology). Another example is the common practice in historical linguistics to retrodict missing reflexes of cognate sets when searching for etymologies in a given language (see, for example, Michael et al. 2015: 196). Even among speakers living in contact areas, we can at times observe how they learn to guess how a word unknown to them would sound in the target language (Branner 2006: 215).
In order to test the predictive force and the usefulness of prediction studies in historical linguistics, we are currently carrying out an experiment on missing words in Western Kho-Bwa language data. Western Kho-Bwa is a branch of the Sino-Tibetan language family that has not been thoroughly investigated so far. The main idea of this ongoing experiment is to use a computer-assisted workflow by which missing reflexes in an etymological dataset of eight Western Kho-Bwa language varieties are predicted, using computational techniques which are later refined manually. These missing reflexes can then be directly tested in fieldwork, by comparing predictions against attested reflexes.
In the following sections, we introduce reflex retrodiction as a new explicit task of historical linguistics and point to existing automated solutions (Section 2). We then present our experiment in detail, providing information on its background, on the language varieties involved, and how we plan to evaluate the results (Section 3). Given that the experiment is still ongoing at the time of writing this paper, we then provide a succinct outlook on our expectations, by testing the performance of the algorithm on existing Western Kho-Bwa language data (Section 4).

Reflex retrodiction as a new task for historical linguistics
Although prediction is rarely mentioned and mostly implicitly practised in historical linguistics, we consider it a vital aspect of the comparative method, and we think that a more explicit discussion of prediction techniques could play a vital role for the future of our discipline. While our linguistic knowledge derived from the techniques for historical language comparison could be used for a wide range of predictions targeting different linguistic domains, we think that the task of reflex retrodiction deserves more attention in particular. Reflex retrodiction is hereby understood as the task by which a linguist tries to predict the form of the reflex of a given proto-form or a cognate set attested in different languages.
Linguists apply reflex retrodiction routinely when searching for thus far unattested cognates in a specific language. Cognate sets are often spotty, showing reflexes only in a small sample of all languages under investigation, especially when initial research only considers cognate sets that share the same meaning across all languages. Hence, the actual search for the missing words in other regions of the lexicon can turn out to be very tedious and time-consuming. In order to ease this search, scholars intuitively predict missing forms, based on known sound laws or known patterns of sound correspondences. When asking informants or sifting through dictionaries, they search for forms that match their guess, which drastically reduces the search space. If a form comes close to the researcher's prediction (including forms that are not completely identical, but similar enough), they can directly add them to their list of attested reflexes for a given cognate set.
In the following sections, we will briefly discuss how reflex retrodiction is carried out traditionally, and which automated methods have been proposed so far. We will conclude by discussing the potential of more formalized approaches to reflex retrodiction when dealing with unstudied or understudied language families and linguistic sub-groups.

Classical approaches for reflex retrodiction
In principle, there are two basic ways in which reflex retrodiction can be carried out: top-down or pattern-based. Top-down approaches start from a given proto-form and an ordered list of sound laws, which researchers apply step by step, until the form in the language missing the reflex has been derived. Applying this technique successfully requires both a very good knowledge of the sound change processes (both the individual sound changes as well as the diachronic order in which they occurred) of all the languages under investigation and a reliable proto-form. Given the complexity of sound-law-based derivations, which require a very detailed knowledge of the sound change processes that led to the diversification of a given language family, top-down approaches are only applicable when dealing with very well-attested and deeply investigated language families, such as Indo-European.
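The top-down procedure can be illustrated with a minimal sketch. The three rules below are invented toy sound laws, not the laws of any actual language, and the function name derive is our own; the sketch only shows that the derived reflex depends on the order in which the laws apply.

```python
import re

# Toy, invented sound laws (NOT the laws of any real language),
# listed in their assumed chronological order.
SOUND_LAWS = [
    (r"p(?=[aeiou])", "f"),  # 1. p > f before vowels
    (r"f(?=i)", "h"),        # 2. f > h before i (can apply to the output of law 1)
    (r"a$", ""),             # 3. loss of word-final a
]


def derive(proto_form, laws):
    """Apply each sound law once, in the order given, to a proto-form."""
    form = proto_form
    for pattern, replacement in laws:
        form = re.sub(pattern, replacement, form)
    return form


# Same laws, different relative chronology of laws 1 and 2.
reflex = derive("pita", SOUND_LAWS)
reordered = derive("pita", [SOUND_LAWS[1], SOUND_LAWS[0], SOUND_LAWS[2]])
print(reflex, reordered)
```

With these toy rules, swapping the first two laws yields a different outcome, which is why the diachronic order of the changes has to be known in addition to the changes themselves.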
Pattern-based approaches use knowledge of observed correspondence patterns to fill gaps where reflexes are missing in cognate sets. There are two basic ways in which this can be done. Firstly, one can use pairwise sound correspondences to try to predict a word form unknown in one language from a word form known in another language. An example for this would be to use the German Dorf 'village' to predict the English counterpart thorp, which nowadays is only attested in village names. The disadvantage of pairwise sound correspondences is that we often face complex correspondences by which one sound in one language may have two or more counterparts in the other language. Although phonetic conditions can at times give us hints as to the choice of the correct sound, this is not necessarily given in all cases, specifically also because conditions for mergers or splits in sound change can easily be lost during language change.
To circumvent the problem of missing information when trying to predict words from one language into another, one can, secondly, predict reflexes from sound correspondence patterns across multiple languages. In the linguistic literature, we often find examples of recurring correspondences across more than one language, which are usually used to illustrate how certain proto-sounds are reflected in the languages under investigation (Clackson 2007: 37). Correspondence patterns across multiple languages have, of course, a greater predictive force, given that evidence lost in the majority of languages may still be present somewhere. An example is the vocalism of Indo-European, which can barely be reconstructed without resorting to Ancient Greek (Meier-Brügger 2002). The disadvantage of correspondence patterns, however, is that they are difficult to formalise. Furthermore, it is unlikely that linguists can remember in enough detail the complexity that correspondence patterns across multiple languages can show in reality.
Given that the distinction between pairwise and multiple language comparison is essentially arbitrary, it is obvious that linguists pursuing reflex retrodiction in practice will resort to an intuitive weighting of evidence. If linguists know that for a given unattested reflex a specific language provides the crucial information, such as the vowel in Indo-European, they will naturally start with that language. In cases where the situation is less clear, they will successively increase the number of witnesses in order to come up with the form that, in their opinion, best matches the evidence. It is also obvious that intuitive correspondence-based retrodiction is very hard to formalise for computational applications, given that the weightings will usually be language-specific and that humans are very flexible in taking different kinds of evidence into account. It would probably even be incorrect to say that a given form predicted by a linguist has been arrived at by correspondence-based retrodiction alone, given that linguists who study a language family closely usually also have at least a rough idea about the major sound changes that took place in the past in order to produce the patterns we observe at present.

Automatic methods for reflex retrodiction
While the task of reflex retrodiction is not strictly divided into different strategies and the distinction between pairwise sound correspondences and correspondence patterns across multiple languages is usually not made in practice, automated approaches that have been proposed so far tend to follow one of these two major strategies. Since, as we have emphasised in the previous section, reflex retrodiction by means of sound laws is usually only applicable to well-studied language families whose history is already well-understood by experts, we will not discuss automated approaches for this task in detail here. For readers interested in this topic, we recommend the very detailed survey on the broader task of 'computerised forwards reconstruction' by Sims-Williams (2018).
Although not necessarily labelled as such, automated approaches for reflex retrodiction based on pairwise sound correspondences have been used for quite some time. An example is the early work by Chen (1997) on mutual dialect intelligibility, in which the author proposes an automated measure to assess how well speakers can understand words from different dialects, assuming that these words are, in fact, cognate. If we turn this idea around and ask how well speakers could predict the pronunciation of a word, taking the potential knowledge of pairwise sound correspondences into account, we would have a first idea for developing an automated method for reflex retrodiction based on language pairs. In times of growing popularity of machine learning, in particular neural network approaches, as a powerful tool for multiple different purposes, it is not surprising that scholars have already tested the power of these tools for the purpose of reflex retrodiction. Thus, Dekker (2018) uses the data provided by the NorthEuralex project of Dellert & Jäger (2017) along with methods for automated cognate detection as provided by the LingPy software package to test the power of different neural network approaches and settings to handle the task of pairwise word prediction across different languages. The results are generally promising, showing at times rather low differences between predicted and attested words. A drawback is that the method does not use phonetic transcriptions as input, but instead converts the data to the reduced sound class system proposed by the ASJP project (Wichmann et al. 2016), the so-called ASJP code (Brown et al. 2008), which consists of only 40 symbols instead of the much richer inventory offered by the International Phonetic Alphabet (IPA Handbook 1999).
Thus, while very interesting as a pilot study, the approach is less feasible for people interested in practical applications, although we hope that the author will find time in the future to increase the flexibility of the workflow, allowing scholars to use the method for their own work.
In contrast to the pairwise approach proposed by Dekker, List (2019) uses sound correspondence patterns across multiple languages for the task of reflex retrodiction. The basic goal of the algorithm proposed by List is not to predict missing reflexes across different languages, but rather to identify sound correspondence patterns in multilingual datasets containing many 'gappy' or 'patchy' cognate sets. If cognate sets show reflexes in only a few languages, it is often not clear which of the observed sound correspondences, derived from phonetically aligned cognate sets, should belong to the same correspondence pattern.
As an example, consider data from four Western Kho-Bwa varieties in Table 1. In this example, we can easily spot two clear-cut correspondence patterns for the initial consonant, showing exactly the same set of reflexes in 'push' and 'human being' and another set of reflexes in 'know' and 'poison'. Based on these data only, this division is straightforward. For the concepts 'burn, roast' and 'scratch' there are missing values for one variety each. Based on the data provided here, we would predict the initial of Jerigaon 'burn, roast' as [r] and that of Jerigaon 'scratch' as [d]. In the case of 'fireplace, hearth', however, we would have a hard time guessing the correct initial, let alone the entire segment, based on the available data, since the correspondences of the initials could, due to the two gaps in Khoina and Jerigaon, be assigned to both the pattern of 'push' and 'human being', and to the pattern of 'know' and 'poison'. Whenever we assign this cognate set as a whole to one of the patterns, we make an implicit, testable prediction for the corresponding initial of the missing forms. In this case, this would either be [r] or [d].

[Table 1. Columns: Concept, Khispi, Duhumbi, Khoina, Jerigaon]
The algorithm proposed by List (2019) essentially infers correspondence patterns from aligned cognate sets, by treating columns (also called sites) in all aligned cognate sets in the data as the nodes of a network, with edges displaying those sites which are compatible with each other. By modelling the data as a network, the cluster problem can then be treated as the well-known minimum clique cover problem in graph theory (Bhasker & Samad 1991), for which approximate solutions exist (Welsh & Powell 1967). Given that the algorithm assigns all alignment sites in a given dataset to unique correspondence patterns, all 'gappy' sites which are assigned to a given correspondence pattern contain inherently a prediction regarding the form of missing reflexes. Thus, when assigning the initial site of the alignment from the cognate set for 'fireplace, hearth' to the pattern of 'push' in Table 1, we would predict that the reflexes for the two missing words should start with [r]. If we assigned it to the pattern 'know', we would predict it to be [d].
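The idea of compatible alignment sites can be made concrete with a small sketch. It is not the implementation of List (2019): the greedy clustering below is a simple stand-in for an approximate clique cover, the compatibility criterion (no conflict and at least one shared non-gap value) is an assumption, and the sites are invented placeholders for four hypothetical varieties rather than the data of Table 1.

```python
# Toy alignment sites for four hypothetical varieties; None marks a gap.
# The symbols are invented placeholders, not actual Western Kho-Bwa data.
SITES = {
    "set_a": ("x", "y", "r", "r"),
    "set_b": ("x", "y", "r", "r"),
    "set_c": ("x", "y", "d", "d"),
    "set_d": ("x", "y", "d", "d"),
    "set_e": ("x", "y", "r", None),
    "set_f": ("x", "y", "d", None),
    "set_g": ("x", "y", None, None),
}


def compatible(site_a, site_b):
    """Assumed criterion: no conflicting values and at least one shared value."""
    shared = 0
    for a, b in zip(site_a, site_b):
        if a is None or b is None:
            continue
        if a != b:
            return False
        shared += 1
    return shared > 0


def merge(consensus, site):
    """Fill gaps in the consensus with values from a compatible site."""
    return tuple(s if c is None else c for c, s in zip(consensus, site))


def greedy_patterns(sites):
    """Greedy clustering (a stand-in for an approximate clique cover)."""
    patterns = []  # each entry: [consensus, list of member names]
    ordered = sorted(sites.items(), key=lambda kv: sum(v is None for v in kv[1]))
    for name, site in ordered:
        for pattern in patterns:
            if compatible(pattern[0], site):
                pattern[0] = merge(pattern[0], site)
                pattern[1].append(name)
                break
        else:
            patterns.append([site, [name]])
    return patterns


def compatible_patterns(site, patterns):
    """All pattern consensuses a (possibly gappy) site could be assigned to."""
    return [consensus for consensus, _ in patterns if compatible(consensus, site)]


patterns = greedy_patterns(SITES)
for consensus, members in patterns:
    print(consensus, members)
print(compatible_patterns(SITES["set_g"], patterns))
```

In this toy setting, the sites with one gap receive a single prediction from the consensus of their pattern, whereas the site with two gaps is compatible with both patterns, mirroring the ambiguity described for 'fireplace, hearth'.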
The predictive force of the algorithm for sound correspondence pattern inference was tested as part of the initial study and revealed a rather high accuracy of automated reflex retrodictions, with accuracy scores ranging between 50% and 80% for varying data sets. The reasons for the difference in the accuracy scores are still not fully understood, but it seems clear that they are not only related to the genetic diversity of the languages in question, but also to phonotactic aspects, such as the size of the phoneme inventories of the target languages. Despite these aspects remaining unclear for the moment, we consider the results interesting enough to justify further testing of the method.

A prediction experiment on Western Kho-Bwa languages
In order to test the predictive force of the comparative method, we designed a prediction experiment for hitherto unobserved words in the Western Kho-Bwa languages. The main idea of this ongoing experiment is to use the information provided by regular sound correspondences across a set of eight Western Kho-Bwa varieties to retrodict pronunciations for reflexes (words and morphemes) that have so far not yet been elicited in field work. Once predicted and registered with the Open Science Framework (https://osf.io), follow-up field work will allow us to verify the retrodictions that were made before, thereby testing not only the current knowledge of sound correspondences, but also the general predictive power of the comparative method.

Background on the experiment
The starting point of this experiment was an initial etymological data set, assembled by Bodt during fieldwork conducted in Arunachal Pradesh between 2012 and 2017. These data were initially only available in a nonstandardised form, namely a manually prepared Word table. The data set was first converted to a spreadsheet with standardised notations. During one week of intensive work, we then normalised the data to a level where it could not only be automatically processed with the help of different software tools provided by the LingPy Python package, but also sufficiently post-edited and corrected with the help of the web-based EDICTOR tool (List 2017). The initial goal was to prepare the data to such an extent that we could annotate the data and pursue the workflow of computer-assisted language comparison which Hill and List developed as part of a long-term project aimed at the reconstruction of Proto-Burmish (Hill & List 2017).
While this work was on-going, List finished his article on the automated inference of sound correspondence patterns across multiple languages, mentioned above (List 2019). In order to evaluate the performance of the method, he designed an experiment in which the data was split into parts of different sizes, and the information present in the inferred correspondence patterns was used to retrodict the most likely pronunciation of word forms that were artificially deleted from the data. As mentioned before, this experiment turned out to be surprisingly successful, reaching accuracy levels between 50% (Burmish languages) and 80% (Polynesian languages).
The Western Kho-Bwa etymological data set was based on initial fieldwork aimed at obtaining a first understanding of the possible genetic relatedness of these languages. Since not all concepts were elicited for all the varieties, there was a considerable number of missing words in the data, ranging from 5% (Duhumbi) to 34% (Shergaon) of the words, with an average of 22% (see Table 3.1). Given that Bodt's data was missing potentially crucial witnesses for a historical reconstruction of the subgroup, and List had just finished his draft on the algorithm that could be used as an automated method for reflex retrodiction, it was clear that the Western Kho-Bwa language data would be a very good test case to check both how well the algorithm performed on the reflex retrodiction task and how well expert predictions on word forms would perform in general. Additionally, such a test would also give us a clearer picture of the power of the comparative method and the regularity of sound change. Given that the prediction of unattested words, be it based on human assessment or automated methods, heavily relies on the classical assumption that sound change is regular, the accuracy of word prediction also gives us direct insights into the regularity of sound change within a given language family. If language families of similar time depths differ with respect to the degree to which they can be successfully predicted, one could explain this with different degrees of overall sound change regularity, provoked by processes that disguise or counteract the regularity of sound change, such as borrowing, but also high amounts of productive word formation, as can be observed in many subgroups of Sino-Tibetan (List 2016b).

Background on Western Kho-Bwa languages
In 1952, Stonor, basing himself on local sources, reported that the two languages 'Sulung' or 'Puroik' and 'Bugun' are mutually intelligible (Stonor 1952). [...] (Tayeng 1990) and Dondrup (Dondrup 1988, 1990, 2004). On the Chinese side of the border, the first Puroik data were published as part of the large-scale survey Tibeto-Burman Phonology and Lexicon (Sūn 1991). Based on these materials and his own field work data, Jackson Sun (Sun 1992, 1993) was the first to suggest that Puroik, Bugun, Sherdukpen and 'Lishpa-Butpa' are not just a random residue when all other major languages are subtracted, but that they might belong together and form a coherent linguistic group. Other researchers after him either adopted his view or independently reached the same conclusion (Burling 2003, Rutgers 1999). Van Driem (2001) dubbed the group 'Kho-Bwa cluster' in his handbook Languages of the Himalayas, by combining his provisional reconstructions for 'water' and 'fire' in the subgroup. More recent publications include the Puroik description from China by Lı̌ (2004) [...] mention 'Kho-Bwa' as a (potential) branch of Tibeto-Burman in western Arunachal Pradesh. Although the exact phonological shape of the reconstructions kho 'water' and bwa 'fire' needs to be established, we follow Lieberherr & Bodt (2017) and others before them in using 'Kho-Bwa' as a label for these languages. Besides the fact that this terminology is already established to some extent, it has the advantage of not being biased toward one language like 'Bugunish' (Sun 1993), or a region like 'Kamengic' (Blench & Post 2014, Post & Burling 2017). Furthermore, 'Kho-Bwa' offers an exhaustive definition of the group: Any language of western Arunachal Pradesh in which the word for 'water' starts with k and the word for 'fire' starts with b is a 'Kho-Bwa' language.
The Western Kho-Bwa languages (Bodt 2014a,b) are eight distinct linguistic varieties spoken in the western part of the Kho-Bwa area: the valleys of the Gongri and Tenga rivers. The languages belonging to this subgroup are Khispi (Lishpa), Duhumbi (Chugpa), Sartang and Sherdukpen. Sartang has four distinct speech varieties, whereas Sherdukpen has two. The number of speakers of these linguistic varieties combined is around 8,500, and considering the low speaker population and the rapid socioeconomic and cultural changes in this area, all varieties must be considered endangered.

Linguistic data on Kho-Bwa used in our study
Our linguistic data reflects eight distinct Western Kho-Bwa varieties. The data in its current form is presented in the form of a spreadsheet file that can be directly imported by the LingPy software (http://lingpy.org); by the LingRex package (List 2018), which provides the code for automated reflex retrodiction as presented in List (2019); and by the EDICTOR interface (http://edictor.digling.org), a web-based tool that allows for a quick manual correction of automated analyses (List 2017).
The basic structure of this data format is a header in the first row, which indicates the content of the cells in each column, and one word per language and per row. In addition to basic columns, such as a unique identifier (ID), the name of the language variety (DOCULECT), an elicitation gloss for the concept (CONCEPT), the original data entry for the given word (VALUE), and a semi-automatically segmented form (TOKENS), the data contains very detailed manually corrected analyses of cognate relations (CROSSIDS), expressed in the form of partial, cross-semantic cognates (List 2016a), morphological glosses (MORPHEMES), the prosodic structure of each entry (STRUCTURE), and a phonetic alignment analysis of the data (ALIGNMENT). An example of the data format employed in our approach is provided in Table 3, where reflexes for the concept 'burn, roast' are given across five varieties (with three entries missing so far). The column with the prosodic structure (STRUCTURE) plays an important role in clustering the alignment sites into correspondence patterns, since, as a rule, the algorithm will only try to cluster those alignment columns into the same partition that are identical with respect to their prosodic label. This also reflects the classical practice of distinguishing between sound correspondences of the initials and the rhymes in comparative analyses of Sino-Tibetan languages, and South-East-Asian languages in general.

Our format comes very close to the specifications required by the Cross-Linguistic Data Formats initiative (https://cldf.clld.org), which seeks to increase the overall comparability of linguistic data by encouraging scholars to adhere to general standards by linking their data to reference catalogues, such as Glottolog for languages (Hammarström et al. 2018), Concepticon for concepts (List et al. 2016), and the transcription system advocated by the Cross-Linguistic Transcription Systems initiative (CLTS, https://clts.clld.org, Anderson et al. forthcoming).
When registering the experiment, our data was still missing the links to the Concepticon and the CLTS transcription system, but in the meantime, we have prepared the data in CLDF format, and it can be found on GitHub (https://github.com/lexibank/bodtkhobwa) and Zenodo (Version 1.0.1, https://zenodo.org/record/2632545).

Computer-assisted reflex retrodiction
Our computer-assisted workflow for word prediction consists of two parts, an automated and a manual task. In the automated task, we employ the automated correspondence pattern recognition method by List (2019) in order to predict the missing words in the data. The result of this analysis is a table of morphemes as predicted by the algorithm. These come in three variants, as shown in Table 4. The difference between these three variants is that they display different degrees of uncertainty. At times, an alignment site could be assigned to different correspondence patterns, as we have seen for the concept 'hearth, fireplace' in our example in Table 1. If we have to decide between two or more correspondence patterns, the algorithm orders these patterns in decreasing order of the number of alignment sites supporting a given pattern. The variant shown as Word1 in the table only picks the first value for a given language, while Word2 picks the first two values (if more than one is found), and displays them in a single segment slot, separated by a pipe (|) symbol. If no correspondence pattern can be found for a given alignment site (which may happen if the sites do not occur regularly in the data), the algorithm displays this by using Ø as our symbol for missing data. That a certain prediction suffers from uncertainty in one of the alignment sites is further displayed in the Column Qu. by a question mark.
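How ranked candidates per alignment site could be turned into such variants is sketched below. This is an illustration only, not the LingRex output routine: the function names, the space-separated segment layout, and the rule that a site counts as uncertain when it has no candidate or more than one are our assumptions, and the candidate lists are invented.

```python
MISSING = "Ø"  # symbol for alignment sites without any matching pattern


def render(candidates_per_site, n):
    """Show up to n ranked candidates per site, joined by a pipe symbol."""
    slots = []
    for ranked in candidates_per_site:
        slots.append("|".join(ranked[:n]) if ranked else MISSING)
    return " ".join(slots)


def quality_flag(candidates_per_site):
    """Assumed rule: flag a prediction if any site lacks a unique candidate."""
    uncertain = any(len(ranked) != 1 for ranked in candidates_per_site)
    return "?" if uncertain else ""


# Invented candidates for a three-segment morpheme, ranked by support:
# two competing initials, one certain nucleus, no pattern for the coda.
prediction = [["r", "d"], ["a"], []]

word1 = render(prediction, 1)
word2 = render(prediction, 2)
print(word1, word2, quality_flag(prediction))
```

Keeping the ranked candidates per site and deriving the display variants from them means that the same underlying prediction can be shown with more or less of its uncertainty exposed.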
Hence, in Table 4, we see predictions for the same three concepts with missing values in Table 1: 'burn, roast', 'scratch' and 'fireplace, hearth'. As we predicted earlier, the 'best fit' for the initial for the concept 'burn, roast' in Jerigaon is indeed an [r], and the 'best fit' for the concept 'scratch' in Jerigaon is indeed a [d]. The automatic analysis already comes up with predictions for the initial for the concept 'fireplace, hearth' in Khoina and Jerigaon because, unlike the data presented in Table 1, the concept actually has attested reflexes in Khoitam, Rahung, Rupa and Shergaon, on the basis of which the reflexes can be assigned to the correspondence set 'push', rather than to the correspondence set 'know'. That the concept 'fireplace, hearth' nonetheless has a question mark in the Column Qu. is because the algorithm cannot assign a value to the alignment sites of the rhymes (i.e. the nucleus in the case of the prefix and the coda in the case of the root) of the predicted word.
Given that our algorithm predicts all missing words mechanically regardless of whether this makes sense in terms of lexical considerations, it is clear that the selection of items to be explicitly elicited does not need to contain all items for which predictions could be made. The automated transcriptions provided in the supplementary material may thus contain predictions that we already know are unlikely to exist. This may be due to lexical innovations, borrowings, or because no concepts exist for a given elicitation gloss. For this reason, Bodt made a manual analysis, extracting those predictions that, on the basis of his experience, would be most promising and interesting. This list comprises 630 detailed predictions (including full words with prefix and main root), as well as an informed guess by Bodt that at times overrides the automated prediction, especially in those cases where the automated prediction would only predict a suffix instead of a full word form, or where the automated prediction could not be resolved fully from the data fed to the algorithm. Essentially, this allows us to compare two different kinds of predictions: the fully automated ones, and the ones corrected by the expert, which are also shared in the supplementary material. So, to continue with the example in Table 4 above, the manually adjusted prediction for the concept 'fireplace, hearth' in Khoina ([br ɔ p]) and Jerigaon ([br ɔ p]) is based on the evidence from the other Sartang and Sherdukpen varieties, including contraction of the fire-prefix to the root.

Status and time line of the experiment
As mentioned before, the initial fieldwork was carried out by Bodt between 2012 and 2017, with the major part of the data on the Western Kho-Bwa varieties collected in 2014. A first overview of the results, including a Word table with provisional proto-forms and the underlying sound correspondences, was presented at the South East Asian Linguistic Seminar in Padang, Indonesia, in May 2017. In June 2018, the spreadsheet was sufficiently enhanced by converting it into formats that can be computationally processed. On August 20, 2018, List carried out the experiment on automated word prediction. On October 3, 2018, Bodt used the automated predictions to come up with a list of sensible manual predictions to be checked during his follow-up field work. This list contained 630 different word forms in total, and about 65 words on average per variety. On October 5, 2018, List registered the experiment, including the code, the data, and both the automatic and the manually corrected reflex retrodictions, with the Open Science Framework (https://osf.io) at https://osf.io/evcbp/. The basic idea of registering an experiment is to deposit a hypothesis with some provider prior to testing it, in order to make sure that the hypothesis was not created after the scholars inspected data and results. In the case of our word prediction experiment, the hypothesis consists of the 630 predictions we have come up with. That means that we do not provide a single hypothesis to be tested, but a rather long list of predictions that can all be tested individually. The field work to check the reflex retrodictions against the real word forms was carried out in October and November 2018. We are now in the process of comparing the accuracy of the word predictions and will share our results in the form of a publication and talks, starting with the International Historical Linguistics Meeting in Canberra 2019.

Future evaluation of the experiment
Evaluating the accuracy of word predictions can be done in a very straightforward way by comparing the predicted word form with the attested word form segment by segment. A metric would then identify all those cases in which a predicted segment differs from an attested one, and yield the accuracy of a predicted word by dividing the number of correctly predicted sound segments by the total number of sound segments compared. This procedure is already implemented in the LingRex software we used for this study, and is likewise described and illustrated in List (2019). Thus, the reflex retrodiction task can be easily automated and tested on controlled data sets, that is, in those cases where data is artificially distorted and we know in advance that each missing word indeed has a counterpart in a cognate set of the given language.
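The segment-wise comparison described above can be illustrated with a minimal sketch. It is not the LingRex implementation; the function name `segment_accuracy` is our own, the two forms are assumed to be already aligned and of equal length, and the attested form in the example is invented purely for illustration.

```python
def segment_accuracy(predicted, attested):
    """Share of aligned positions where the predicted segment matches.

    Assumption: both forms are lists of segments that have already been
    aligned, so they have the same length (gaps would be explicit symbols).
    """
    if len(predicted) != len(attested):
        raise ValueError("forms must be aligned to the same length")
    if not predicted:
        raise ValueError("cannot score an empty form")
    correct = sum(p == a for p, a in zip(predicted, attested))
    return correct / len(predicted)


# Hypothetical example: the attested form is invented for illustration.
predicted = ["b", "r", "ɔ", "p"]
attested = ["b", "r", "o", "p"]
print(segment_accuracy(predicted, attested))  # 0.75
```

With three of four segments matching, the score is 0.75; averaging such scores over all checked words would give an overall figure for a variety.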
When working with real-language data, and with words that are really unattested at the time of the prediction, however, it is also possible that the predicted word does not exist in the target language, but has been lost due to lexical replacement. As a result, any metric that is meant to judge the accuracy of a prediction experiment such as the one we have conducted on Western Kho-Bwa language data must first assess whether the words attested for a given semantic slot are indeed cognate with the cognate set that was used to predict the unknown word form. If it turns out that the attested word is not cognate with the words used for the prediction, this should not count as a failure of the method; such cases should instead be excluded when computing the accuracy of the prediction experiment.
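The exclusion of non-cognate forms could be handled as in the following sketch. The record layout (keys `predicted`, `attested`, `cognate`) and the function name are assumptions made for this illustration, the cognacy judgment is taken to be supplied by the fieldworker, and all forms shown are invented.

```python
def mean_accuracy_over_cognates(items):
    """Average segment accuracy, skipping attested forms judged non-cognate.

    Assumption: each item is a dict with aligned, equal-length segment
    lists under 'predicted' and 'attested', and a boolean 'cognate'.
    Returns None when no item is cognate, since nothing can be scored.
    """
    scores = []
    for item in items:
        if not item["cognate"]:
            continue  # lexical replacement: not a failure of the method
        pairs = list(zip(item["predicted"], item["attested"]))
        correct = sum(p == a for p, a in pairs)
        scores.append(correct / len(pairs))
    return sum(scores) / len(scores) if scores else None


# Invented data for illustration only.
items = [
    {"predicted": ["b", "r", "ɔ", "p"], "attested": ["b", "r", "o", "p"], "cognate": True},
    {"predicted": ["m", "i"], "attested": ["t", "a"], "cognate": False},
    {"predicted": ["n", "a"], "attested": ["n", "a"], "cognate": True},
]
print(mean_accuracy_over_cognates(items))  # 0.875
```

The second item is ignored entirely, so the result is the mean of 0.75 and 1.0 rather than being dragged down by a replaced word.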
We are still discussing and evaluating the most useful metrics for assessing the accuracy of both the automated and the manually corrected predictions. Ideally, we would have addressed this problem before registering the experiment. However, as we consider our research a pilot study on the task of reflex retrodiction, we hope that our colleagues will understand that we were not able to predict completely how this could be carried out in an optimal manner.

Testing automated predictions on Western Kho-Bwa
Before conducting an experiment of this kind, it is useful to compute the rate of accuracy we might expect from a random sampling of the data alone. For this purpose, we randomly deleted words from the existing data and then used the distorted data set to predict the deleted words. The accuracy is then computed for each word form by counting how many times the algorithm proposes the correct word form and how many times it fails. This can be expressed as a percentage, our accuracy score. After 100 trials, documented in the supplementary material, the accuracy of the prediction experiment on the data reached 59% (0.5854), with an average proportion of 61% of the data being retained. Comparing this score with those for other data sets, as reported in List (2019), we can see that the Western Kho-Bwa language varieties are less easy to predict than Polynesian languages or Chinese dialects, and seem to be about as challenging as the Burmish languages in the sample of Hill & List (2017). The fact that the prediction did not reach higher scores may also result from the fact that the original data is already sparse with respect to mutual coverage.
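The logic of such a deletion test can be sketched as follows. This is a simplified illustration and not the code used in the study: the data layout (a dictionary keyed by cognate set and variety), the fixed `retain` proportion, and all function names are assumptions, and `majority_predictor` is a deliberately naive placeholder rather than the correspondence-pattern method of List (2019).

```python
import random
from collections import Counter


def majority_predictor(retained, key):
    """Toy stand-in predictor: most frequent retained form in the cognate set."""
    cogid, _variety = key
    forms = [form for (c, _v), form in retained.items() if c == cogid]
    return Counter(forms).most_common(1)[0][0] if forms else None


def run_trial(data, predict, retain=0.61, rng=random):
    """Delete a random share of words, predict them, return the hit rate."""
    keys = sorted(data)
    kept = set(rng.sample(keys, round(len(keys) * retain)))
    retained = {k: data[k] for k in kept}
    deleted = [k for k in keys if k not in kept]
    if not deleted:
        raise ValueError("nothing was deleted; lower the retain proportion")
    hits = sum(predict(retained, k) == data[k] for k in deleted)
    return hits / len(deleted)


def mean_accuracy(data, predict, trials=100, retain=0.61, seed=0):
    """Average whole-word accuracy over repeated random deletions."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    return sum(run_trial(data, predict, retain, rng) for _ in range(trials)) / trials


# Invented toy data: (cognate set id, variety) -> word form.
toy = {(1, "A"): "pa", (1, "B"): "pa", (1, "C"): "ba",
       (2, "A"): "mi", (2, "B"): "mi", (2, "C"): "mi"}
print(mean_accuracy(toy, majority_predictor, trials=100))
```

Passing the predictor as a function keeps the evaluation loop independent of the prediction method, so the same loop could in principle score any algorithm or a set of manual predictions.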

Conclusion
With languages disappearing at rates never experienced before, reflex retrodiction could become more and more important as a practical tool for historically oriented linguistic fieldwork. As the speakers of these languages are becoming fewer in number and older, there is a genuine risk that words that are important from a historical-comparative viewpoint may be lost before they are recorded. Retrodicting these words may help to render the search for cognate forms in these languages less time-consuming and more efficient. For example, rather than having to ask how to say 'to carry by hand', 'to carry on the shoulder', or 'to carry on the back' in a given language, one could directly retrodict the missing forms and ask whether they exist in the target language, and if they exist, one could further ask for their exact semantic interpretation. Additionally, reflex retrodiction will make it easier to elicit cognates of certain words that contain rare segments in a given variety, such as marginally occurring distinctive onsets or rhymes. These actually attested forms can then be used to strengthen purported sound correspondences between linguistic varieties and to reconstruct proto-forms.
In this paper, we have described an initial attempt to test how reflex retrodiction could be used in actual fieldwork. Our experiment proposes a first workflow that illustrates how similar experiments could be carried out by colleagues working on other language families. There is no need to follow our workflow completely: scholars could just use their intuition before going back to the field to make lists of forms they think they should check again, which may already be a regular (though largely unreported) practice among field linguists. By sharing these predictions with the public through registering experiments with the Open Science Framework, scholars can not only share their current state of knowledge with the community, but also test it against the data they observe. Ideally, this can help to strengthen specific hypotheses, and it can also help to increase the awareness that sound change does, in reality, proceed to a large extent along regular pathways. That this seems to be the case has already been shown in the original study on sound correspondence patterns by List (2019), upon which our automated prediction procedure is based. In that study, the automated predictions reached results ranging from 53% to 82% in prediction accuracy when only half of the data was considered. However, given that we expect human prediction to exceed the accuracy of our automated predictions, it would be very interesting, not only for linguists but also for other scientific fields and the wider public, if linguists around the world tested and reported their predictions during their fieldwork. Saying that sound change is more or less regular is one thing, but demonstrating that it allows us to guess the pronunciation of words with an amazingly high accuracy adds practical proof to our otherwise theoretical and formal discipline.
Readers may ask themselves why we report this experiment at a stage where the major work of checking how well the reflex retrodiction works in the end has not yet been carried out. We decided to report this study already at this stage because we hope to get some feedback from our colleagues. We are not only interested in receiving suggestions for improving our current study, but would also like to hear how field workers dealing with historical language comparison of hitherto poorly investigated languages are making or have made use of reflex retrodiction.

Comments invited
PiHPh relies on post-publication review of the papers that it publishes. If you have any comments on this piece, please add them to its comments site. You are encouraged to consult this site after reading the paper, as there may be comments from other readers there, and replies from the author. This paper's site is here: https://doi.org/10.2218/pihph.4.2019.3037