A part-of-speech tagger for Nepali

This investigation is devoted to the development of the first automated part-of-speech tagger for Nepali. It is being undertaken as a collaborative effort by partners in the Nelralec project.

Background

The Nelralec project (Nepali Language Resources and Localisation for Education and Communication) is a three-year research project funded by the EU Asia IT&C committee. Known in Nepali as Bhasha Sanchar, the project seeks to address a variety of needs in terms of computational support for the Nepali language, ranging from text-to-speech software and a localised operating system, to educational structures and language resources to support the development of corpus and computational linguistics in Nepal, through to the implementation of new corpus-based lexicography techniques in a new, empirical Nepali dictionary.

As a partner in the Nelralec project, Lancaster has contributed expertise in the areas of corpus design, encoding, annotation and analysis. Part of our ongoing co-operation with the partners in Nepal has been to develop a framework for part-of-speech tagging in Nepali.

Part-of-speech tagging

Part-of-speech (POS) tagging is also known as morphosyntactic categorisation or syntactic wordclass tagging (see van Halteren 1999). A POS analysis is the basic grammatical task of assigning every word in a sentence or text to the correct morphosyntactic category - noun, verb, adjective, adverb, and so on. In POS tagging, labels or tags are added to every word in a text to indicate its category.

While it is possible to assign these tags manually, it is highly desirable to automate the process, as otherwise the process of applying a POS analysis to a large corpus becomes prohibitively work-intensive. Several techniques have been developed to accomplish this, primarily in the 1980s and 1990s (see, especially, Garside et al. 1987; Brill 1995; Karlsson et al. 1995).

We judged POS tagging for Nepali to be an important goal for the Nelralec project for a number of reasons. Firstly, annotating our new corpus, the Nepali National Corpus (NNC), with POS tags would help ensure its status as a state-of-the-art language resource. Secondly, the annotations could be exploited in a variety of relevant ways: to assist in corpus-based lexicography, to provide an enhanced resource for language engineering applications, and to widen the range of analyses available to future researchers using the corpus to investigate the grammatical and textual structures of Nepali.

The Nelralec tagset

The first prerequisite for an automated POS tagger is a tagset - that is, a set of exhaustive categories into which any token in the language can be placed. While the nature of language is such that there will always be words that are hard to classify or ambiguous between two categories, the tagset categories should be designed in such a way as to minimise these problems.

The Nepali tagset used on the Nelralec project was developed by a team of linguists from Tribhuvan University (especially Yogendra Yadava, Ram Lohani, and Bhim Regmi) and Lancaster University (Andrew Hardie). Starting from an initial set of categories based mainly on previous analyses of Nepali grammar (for instance Acharya 1991), the tagset was applied to small data samples, discussed, revised, and then re-tested over a period of several weeks, before being finalised at a meeting in Kathmandu in June 2005.

Following the example of Hardie's (2003) work on the morphosyntactic analysis of Urdu, which demonstrated that the frameworks of analysis developed for European languages were easily extensible to Indo-Aryan languages, the tagset is in general in keeping with the EAGLES guidelines for morphosyntactic annotation of corpora.

The tagset is fully hierarchical - that is, in a tag such as VVYN1F, the first letter (V-) indicates the class of all verbs, the first two letters (VV-) indicate finite verbs, the first three letters (VVY-) indicate third person finite verbs, and so on, until at the lowest level of the hierarchy the fully specific tag VVYN1F indicates a very tightly defined, narrow category (feminine singular non-honorific third person finite verbs, such as che).
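Because each level of the hierarchy is a prefix of the full tag, a search for a broad category reduces to a prefix match. The following is a minimal Python sketch of that idea; the function names are illustrative and are not part of the Nelralec tools, and only the meanings of V-, VV- and VVY- are taken from the description above.

```python
def tag_hierarchy(tag: str) -> list[str]:
    """Return every prefix of a hierarchical tag, most general first."""
    return [tag[:i] for i in range(1, len(tag) + 1)]


def in_category(tag: str, category: str) -> bool:
    """True if `tag` belongs to the (possibly partial) category prefix."""
    return tag.startswith(category)


levels = tag_hierarchy("VVYN1F")
# levels[0] is "V" (all verbs), levels[1] is "VV" (finite verbs),
# levels[2] is "VVY" (third person finite verbs), and so on.
print(levels)

# Selecting every finite verb from a list of tags by prefix.
# The tag list here is a toy example, not real corpus output.
tags = ["VVYN1F", "V", "VVY"]
finite_verbs = [t for t in tags if in_category(t, "VV")]
print(finite_verbs)
```

A corpus query for "all finite verbs" can therefore be written once as the prefix VV, without enumerating every fully specified tag beneath it.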

The final definition of the Nelralec tagset has been disseminated as Hardie et al. (2005) and is available at this link.

The tagset has two main structural features of note. Firstly, the Nepali postpositions, which are preferentially written as affixes on the noun or other word that they govern, are treated as separate tokens in this scheme of analysis. This gives the tagset the flexibility needed to handle the very large array of possible configurations of case.

Secondly, tense, aspect and modality are not marked up on finite verbs, which are classified solely according to their agreement marking - a necessary simplification for dealing with the very complex verbal inflections of Nepali, which, together with the use of compound verbs, could not be indicated by the tagset without the use of thousands of additional categories.
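The first of these features, treating postpositions as separate tokens, can be illustrated with a small Python sketch. It is a deliberately naive suffix splitter, not the project's tokeniser: the postposition list and the exceptions set are tiny hypothetical samples chosen for illustration, and the function name is an assumption.

```python
# Hypothetical, deliberately tiny resources; not the project's actual rules.
POSTPOSITIONS = ("लाई", "ले", "को", "मा")
# Words that merely end in the same characters as a postposition.
# "आमा" ("mother") ends in "मा" but must stay a single token.
EXCEPTIONS = {"आमा"}


def split_postpositions(word: str) -> list[str]:
    """Split trailing postpositions off `word` into separate tokens."""
    tail: list[str] = []
    while word not in EXCEPTIONS:
        for pp in POSTPOSITIONS:
            # Require a non-empty stem so a bare postposition is left alone.
            if word.endswith(pp) and len(word) > len(pp):
                tail.insert(0, pp)
                word = word[: -len(pp)]
                break
        else:
            break  # no postposition matched; stop stripping
    return [word] + tail


print(split_postpositions("घरमा"))     # "in the house": stem plus postposition
print(split_postpositions("आमालाई"))   # stem is protected by the exceptions set
```

The exceptions set shows why plain suffix matching is not enough on its own: some wordforms end in a string that looks like a postposition but are single words.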

Manually annotated data

Many methodologies for automated POS tagging require training data (for example, as a basis for a wordform-and-tags lexicon, or to estimate tag transition probabilities in an n-gram stochastic tagger based on a hidden Markov model). To create a suitable resource for training and testing our automated system, a team of analysts in Kathmandu, led by Ram Lohani, undertook a stage of manual analysis - that is, the insertion by hand of tags into one of the texts.
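As a rough illustration of how such training data can be used, the Python sketch below derives a wordform-and-tags lexicon and bigram tag transition probabilities from a manually tagged sample. The function name, the `<s>` sentence-start symbol, and the toy English data with its N and V tags are assumptions made for the example; the estimates are plain relative frequencies with no smoothing.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Build a lexicon and bigram transition probabilities.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns (lexicon, transition_probs).
    """
    lexicon = defaultdict(Counter)      # wordform -> counts of each tag
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    for sentence in tagged_sentences:
        prev = "<s>"  # assumed sentence-start symbol
        for word, tag in sentence:
            lexicon[word][tag] += 1
            transitions[prev][tag] += 1
            prev = tag
    # Relative-frequency estimate of P(tag | previous tag).
    probs = {
        prev: {tag: n / sum(counts.values()) for tag, n in counts.items()}
        for prev, counts in transitions.items()
    }
    return lexicon, probs


# Toy data standing in for a manually tagged corpus.
sample = [
    [("dogs", "N"), ("run", "V")],
    [("dogs", "N"), ("bark", "V")],
    [("run", "N")],
]
lexicon, probs = train(sample)
print(dict(lexicon["run"]))  # "run" has been seen with two different tags
print(probs["N"])            # distribution over tags that follow N
```

A real system would need smoothing for transitions that never occur in the training sample, which this sketch leaves out.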

At first, the whole process of tagging - tokenisation, supplying a tag, compiling lists of morphological rules and exceptions, and so on - had to be done entirely by hand. This is a slow and laborious process. However, as the quantity of linguistic knowledge in the manually tagged dataset grew, it became possible to incorporate that knowledge into a preliminary version of the automatic tagger (see below), which was then run on the texts prior to manual analysis.

This meant that many of the most common wordforms could be tagged in advance of manual analysis, and even unknown words could be given a set of probable tags. The analysts' task was, at this stage, more closely akin to post-editing than to from-scratch annotation, and thus, our productivity increased greatly as a result of this "bootstrapping" procedure.
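The pre-tagging step described here can be sketched in Python as follows. This is an illustrative reconstruction rather than the project's code: the function name, the suffix-rule fallback for unknown words, and the toy English resources are all assumptions.

```python
def pre_tag(tokens, lexicon, suffix_rules, open_class):
    """Propose candidate tags for each token ahead of manual checking.

    lexicon:      wordform -> set of tags seen in already-tagged data
    suffix_rules: suffix -> set of tags that the suffix suggests
    open_class:   fallback tags for words that nothing else covers
    """
    proposals = []
    for token in tokens:
        if token in lexicon:
            candidates = sorted(lexicon[token])
        else:
            # Unknown word: collect tags suggested by any matching suffix.
            guessed = {
                tag
                for suffix, tags in suffix_rules.items()
                if token.endswith(suffix)
                for tag in tags
            }
            candidates = sorted(guessed) or sorted(open_class)
        proposals.append((token, candidates))
    return proposals


# Toy resources for illustration only.
toy_lexicon = {"the": {"DET"}, "run": {"N", "V"}}
toy_suffixes = {"ing": {"V"}}
toy_open = {"N", "V", "ADJ"}

for token, candidates in pre_tag(
    ["the", "run", "jumping", "xyz"], toy_lexicon, toy_suffixes, toy_open
):
    print(token, candidates)
```

An analyst then only has to confirm a single proposed tag or choose among a short list, rather than supply every tag unaided.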

Over a period of several months, we were successful in manually tagging a 350,000-word subsection of the 1-million-word Nepali National Corpus Core Sample.

An automated part-of-speech tagger

The software framework used for the Nepali POS tagger is Unitag (see Hardie 2004, 2005). This unified tagging system, originally developed to tag Urdu, is now entirely language-independent, and based entirely on Unicode. It consists of a powerful morphological and lexical analysis system, and twin disambiguation modules, one based on hand-written rules and the other using a probabilistic system based on a Markov model.
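To give a sense of what probabilistic disambiguation with a Markov model involves, here is a generic Viterbi sketch in Python. It is not Unitag's implementation: the function name, the data layout, the `<s>` start symbol and the `floor` value used for unseen transitions are all assumptions made for the example.

```python
import math


def viterbi(candidates, transitions, floor=1e-6):
    """Choose the most probable tag sequence under a bigram model.

    candidates:  one dict per token, mapping each possible tag to a
                 lexical probability greater than zero
    transitions: previous tag -> {tag: probability}, with "<s>" as start
    floor:       assumed probability for transitions absent from the table
    """
    # best[tag] = (log score of the best path ending in tag, that path)
    best = {"<s>": (0.0, [])}
    for options in candidates:
        new_best = {}
        for tag, lex_p in options.items():
            score, path = max(
                (
                    prev_score
                    + math.log(transitions.get(prev, {}).get(tag, floor)),
                    prev_path,
                )
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[tag] = (score + math.log(lex_p), path + [tag])
        best = new_best
    return max(best.values())[1]


# Toy example: the second token is ambiguous between N and V.
toy_candidates = [{"DET": 1.0}, {"N": 0.5, "V": 0.5}]
toy_transitions = {"<s>": {"DET": 1.0}, "DET": {"N": 0.9, "V": 0.1}}
print(viterbi(toy_candidates, toy_transitions))
```

In this toy case the lexical probabilities are tied, so the choice is decided entirely by the transition probabilities from the preceding tag.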

The process of developing the Nepali tagger is therefore primarily a matter of developing the appropriate Nepali-specific linguistic knowledge bases for the system. These resources include tokenisation rules and an exceptions list, a wordform-and-tags lexicon, a database of morphological patterns that indicate part-of-speech, a set of contextual disambiguation rules, and a tag transition probability matrix. Some of these can be derived automatically from the training data; others have to be built or compiled by hand.

The development of these resources allowed us to raise the accuracy rate of the tagger to a level of 93% (this may vary depending on the type of text).
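Assuming the figure is a token-level accuracy (the proportion of tokens whose automatic tag matches the manual one), it could be computed along the following lines. The function name and the toy tag sequences are illustrative only.

```python
def tagging_accuracy(gold, predicted):
    """Proportion of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy comparison: three of the four tags agree.
print(tagging_accuracy(["N", "V", "N", "V"], ["N", "V", "V", "V"]))  # 0.75
```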

The Nepali tagger can be downloaded from this link.

References

Acharya, J (1991) A descriptive grammar of Nepali. Washington, D.C.: Georgetown University Press.

Garside, R, Leech, G and Sampson, G (eds.) (1987) The computational analysis of English. London: Longman.

van Halteren, H (ed.) (1999) Syntactic wordclass tagging. Dordrecht: Kluwer Academic Publishers.

Hardie, A (2003) Developing a tagset for automated part-of-speech tagging in Urdu. In: Archer, D, Rayson, P, Wilson, A, and McEnery, T (eds.) (2003) Proceedings of the Corpus Linguistics 2003 conference. UCREL Technical Papers Volume 16. Department of Linguistics, Lancaster University.

Hardie, A (2004) The computational analysis of morphosyntactic categories in Urdu. PhD thesis, University of Lancaster.

Hardie, A (2005) Automated part-of-speech analysis of Urdu: conceptual and technical issues. In: Yadava, Y, Bhattarai, G, Lohani, RR, Prasain, B and Parajuli, K (eds.) Contemporary issues in Nepalese linguistics. Kathmandu: Linguistic Society of Nepal.

Hardie, A, Lohani, R, Regmi, B and Yadava, Y (2005) Categorisation for automated morphosyntactic analysis of Nepali: introducing the Nelralec Tagset (NT-01). Nelralec/Bhasha Sanchar Working Paper 2.

Karlsson, F, Voutilainen, A, Heikkilä, J and Anttila, A (eds.) (1995) Constraint Grammar: a language-independent system for parsing unrestricted text. Berlin: Mouton de Gruyter.

"A part-of-speech tagger for Nepali" is part of the Nelralec project, funded by the EU-Asia IT&C Programme

Last updated 17th May 2006.