Reconstructing Proto-Germanic
For my Machine Learning and for Language Processing course I undertook a project aiming to reconstruct the extinct language Proto-Germanic using machine learning methods. The paper can be found here. Moreover, as part of the project I constructed a dataset of Germanic language cognates which I have made freely available for anyone to use. The dataset can be found on my Github.
Abstract
We can learn what extinct languages sounded like through proto-form reconstruction by tracing back phonological and grammatical shifts. Meloni et al. (2021) and Kim et al. (2023) have developed supervised neural models that have achieved state-of-the-art results for reconstructing Latin and Middle Chinese, as the wealth of written records for these languages provide ample data to train their models on. We provide the first cognate dataset of Germanic languages for reconstructing Proto-Germanic, a language with no written records. Training a neural transformer model on variations of the dataset, reveals that not all descendant languages contribute equally and that removing sparse entries greatly improves performance, achieving comparable results with other datasets. Error analysis exposes a need for grammatical information to infer complex morphological patterns while analyzing the model’s reconstructions shows that the model learns meaningful generalizations and is able to infer patterns of phonological change.