Abstract
Corpus Linguistics Tools for Sahidic Coptic
Amir Zeldes¹ & Caroline T. Schroeder²
¹ Humboldt-Universität zu Berlin, ² University of the Pacific
Coptic, the language of Christian Egypt in the Hellenistic era of the first millennium, offers both an opportunity and a challenge for digital humanities research in the 21st century. On the one hand, there are comparatively few digital resources available: no publicly available automatic tokenization, part-of-speech tagging, or corpus search software, nor any guidelines on how to undertake these tasks (we are aware of only one incomplete and unreleased effort to tag Coptic, in Orlandi 2004; our work is based partly on Orlandi’s lexical resources, kindly made available to us). On the other hand, an explosion of work in digital humanities (standards like TEI/EpiDoc for manuscript digitization, cf. Cayless et al. 2009, or digital infrastructure like Perseus, cf. Crane et al. 2009, to name just two) has produced a wide range of resources to draw on in bringing Coptic to the level of technology now enjoyed by, e.g., Greek and Latin.
To seize these opportunities, we have endeavored to develop comprehensive, freely available tools for the automatic linguistic processing of Coptic manuscripts, whose output can be corrected manually and made available online. We present the first publicly available tokenizer (lexicon- and rule-based) for the main Sahidic dialect of Coptic, as well as two corresponding part-of-speech tagging schemes and training models, one fine-grained and one coarse-grained. Tokenization for Coptic is a non-trivial task: manuscripts are written in scriptio continua (without spaces), but Coptic word forms are linguistically segmented at two levels, both into minimal morphemes and into larger word forms. The larger word forms correspond to nominal or verbal complexes, including related prepositions and articles (nouns) and multiple concatenated conjugation bases with subject/object pronouns and allomorphy (verbs). Our tokenizer currently addresses only the first task, and assumes that a human annotator has already separated the scriptio continua into the coarse word forms. Example (1) shows morpheme borders added by the tokenizer, represented by pipe symbols. In some cases, a single letter can stand for two sounds that belong to different morphemes. In such cases the tokenizer saves the original diplomatic form and also outputs an alternative orthography which allows the morphemes to be represented separately. This is shown in (2) for the letter theta, which stands for a /t/ followed by an /h/ coming from different morphemes (individual letters are transliterated in angle brackets). In words of Greek origin, theta, phi and chi should be retained, while coincidental combinations of multiple morphemes leading to these letters must be disentangled.
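The segmentation step described above can be illustrated with a minimal Python sketch. It is not the authors' tokenizer. The toy lexicon, the sample word forms, the function names (`segment`, `variants`, `tokenize`) and the longest-match strategy are all assumptions made for illustration. The sketch takes one pre-separated word form, inserts pipe symbols at morpheme borders, and expands theta, phi or chi into two letters only when the original spelling cannot be covered by lexicon entries, so that a lexicon entry of Greek origin keeps its theta.

```python
from itertools import product

# Toy lexicon: an assumption for illustration, not the authors' lexical resource.
LEXICON = {"ⲡ", "ⲡⲉ", "ⲧ", "ϩⲉ", "ⲣⲱⲙⲉ", "ⲑⲣⲟⲛⲟⲥ"}

# Letters that may hide two sounds belonging to different morphemes.
EXPANSIONS = {"ⲑ": "ⲧϩ", "ⲫ": "ⲡϩ", "ⲭ": "ⲕϩ"}


def segment(form, lexicon):
    """Cover `form` with lexicon entries, trying longer prefixes first.

    Returns a list of morphemes, or None if no full cover exists.
    """
    if not form:
        return []
    for end in range(len(form), 0, -1):
        prefix = form[:end]
        if prefix in lexicon:
            rest = segment(form[end:], lexicon)
            if rest is not None:
                return [prefix] + rest
    return None


def variants(form):
    """Yield the original spelling first, then spellings with letters expanded."""
    options = [(ch, EXPANSIONS[ch]) if ch in EXPANSIONS else (ch,) for ch in form]
    for combo in product(*options):
        yield "".join(combo)


def tokenize(word, lexicon=LEXICON):
    """Keep the diplomatic form and add a pipe-delimited segmentation."""
    for candidate in variants(word):
        morphemes = segment(candidate, lexicon)
        if morphemes is not None:
            return {"diplomatic": word, "segmented": "|".join(morphemes)}
    # No analysis found: fall back to the unsegmented word form.
    return {"diplomatic": word, "segmented": word}


print(tokenize("ⲡⲣⲱⲙⲉ")["segmented"])     # ⲡ|ⲣⲱⲙⲉ
print(tokenize("ⲑⲉ")["segmented"])        # ⲧ|ϩⲉ
print(tokenize("ⲡⲉⲑⲣⲟⲛⲟⲥ")["segmented"])  # ⲡⲉ|ⲑⲣⲟⲛⲟⲥ
```

Trying the unexpanded spelling first is the design choice that keeps theta intact whenever a lexicon entry already contains it. A realistic system would also need rules for allomorphy and ambiguity, which this sketch does not attempt.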