The Sixth
INTERNATIONAL SYMPOSIUM ON MALAY/INDONESIAN LINGUISTICS
Nirwana Resort Hotel, Bintan Island, Riau, Indonesia
Gerry Knowles, University of Lancaster, g.knowles@lancaster.ac.uk
Zuraidah Mohd Don, University of Malaya, z.mohddon@lancaster.ac.uk

This paper reports on an investigation, currently supported (July - October 2002) by Dewan Bahasa dan Pustaka, into the automatic grammatical tagging of a corpus of Malay texts. We reported our first attempts at tagging and lemmatising at the Leipzig symposium in 2001; since then we have developed an annotated lexicon of over 10,000 words.

Malay is of considerable interest to the corpus linguist on account of its grammatical classes. Conventional notions of 'parts of speech' really belong to Indo-European and can be highly confusing when imposed on Malay. We are developing a data-driven approach to grammatical class in Malay, and setting up classes whose validity would be difficult to dispute. We also need a more flexible approach to the lemma than has hitherto been recognised in corpus linguistics.

The nature of grammatical class has important consequences for the design of a parser for Malay. An encouraging proportion of the data can be handled by a simple set of ordered rules using our tags; it is not at all clear how to parse Malay texts using conventional 'part-of-speech' tags. The parser also forces a clear distinction between properties specific to Malay and properties which Malay shares with other languages.