UWC Xhosa Department



The core activity of the Xhosa Department at the University of the Western Cape (UWC) is the teaching of Xhosa to both mother-tongue speakers (Xhosa Studies) and non-mother-tongue speakers (Xhosa Language Acquisition). Since 2007, the Department has also volunteered to offer credit-bearing introductory courses to Pharmacy, Community and Health Sciences, and Dentistry.



It is heartening that other faculties realise the role that the Xhosa Department can play in their curricula. Since 1999, the Iilwimi Sentrum has made use of departmental staff members in rendering language services.



A vibrant team of just under ten staff members is responsible for all the teaching, research and community services offered. In recent years, conference attendance has picked up, as has the reworking of presentations (papers) into publications (articles). All staff members currently also participate jointly in a large departmental research project, the eXe-Files, a project in the broad field of computational linguistics.

To learn more about the Xhosa Department and all its activities, please browse the menu on the left. Under ‘Staff’ each member is briefly presented, under ‘Modules’ an overview of the modules is offered, under ‘Research’ the overarching departmental research project is outlined, under ‘Products’ a few books, CD-ROMs and online tools produced by department members are listed, under ‘Links’ a number of useful web pages are brought together, and if you want to contact the department, please go to ‘Contact Us’.



Research

Over the years, the members of the Xhosa Department have been involved in various research projects; see the Staff pages and the Products page for some examples. In addition to their own projects, all members are currently involved in one overarching project: the eXe-Files. This project is presented here.

Technical Details

Project title:     The eXe-Files: An innovative electronic Xhosa corpus to boost new research and the creation of modern tools

Auspices:     Supported by UWC’s Department of Research Development, and co-sponsored by an anonymous publisher and TshwaneDJe HLT.

Period:     Start: mid-2006 / Stop: end-2009

Principal researcher:     G-M de Schryver (Prof.)

Other project leader:     S.J. Neethling (Prof.)

Other co-researchers:    T.V. Mabeqa (Ms.); L.K. Mletshe (Mr.); N.L. Mpolweni (Ms.); T. Ntwana Mgijima (Ms.); N. Skade (Mr.); A. van Huyssteen (Mrs.)

Research assistant:     S. Dlamini (Mrs.)

Introduction: Computational Linguistics

Computational linguistics is all around us today, although we do not always realise this: web search engines such as Google, spellcheckers, electronic dictionaries, automated translation from one language into another, and so on, all make use of the results from this field. In nearly all instances, an electronic corpus of language data is one of the core components. In this project, the intention is to build an innovative Xhosa corpus, the eXe-Files, which will then make it possible to research the language in a new way, and to produce the first ICT products for and in Xhosa.

‘X’ stands for the language being worked on and with, Xhosa, while ‘exe-files’ stands for the tangible outcomes, metaphorically seen as ‘executables’.*

_________________

* From Wikipedia: “.exe is the common filename extension for denoting an executable … An executable or executable file, in computer science, is a file whose contents are meant to be interpreted as a program by a computer.”

Feasibility: Other South African Examples



Over the past few years, the feasibility of both corpus-based research and corpus-based products has already been proven for a number of South African languages. In 2006, for example, the corpus-based study ‘Locative trigrams in Northern Sotho, preceded by analyses of formative bigrams’ was published in Linguistics (44/1: 135-193), the top journal of the field, by G-M de Schryver and E. Taljard. This clearly indicates that this new approach to fundamental studies in linguistics can truly reach the international scene, and as a by-product, the African languages are given the widest possible international exposure.

As an example of products, in 2003, corpus-based spellcheckers, commissioned by the Department of Arts and Culture (DAC), were compiled and released by D.J. Prinsloo and G-M de Schryver for all official South African languages. The release of those tools was accompanied by a number of research articles published in the local accredited journals.

Goal: Uplift Xhosa

Over the past decade, the focus in South Africa has primarily been on Northern Sotho and Zulu. The time has now come to uplift Xhosa within the broad field of computational linguistics. In this regard the project team is very fortunate indeed, as it includes no fewer than five mother-tongue speakers, some of whom have published novels in Xhosa with top publishers such as Oxford University Press. Three language specialists complete the team.

The eXe-Files: General Outline

The first step in the project is to bring the following types of data together: (1) existing electronic files in Xhosa that are available from the project members, (2) freely available Internet data in Xhosa, and (3) selected sections from existing published material in Xhosa. These data are processed according to the very latest principles in corpus building, with the aim of producing a corpus that is both balanced and representative of the language. The target size is ‘ten million running words of text’ (the ‘tokens’). The data are only processed computationally, however: the original format of the material is transformed into what one can view as mostly a set of language statistics, among them the much smaller number of different (unique) orthographic words in the corpus (known as ‘types’).
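The distinction between tokens and types can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the function name `corpus_statistics`, the simple letter-based tokenisation rule, and the short greeting used as sample text are all assumptions made for illustration only.

```python
import re
from collections import Counter


def corpus_statistics(text):
    """Return (number of tokens, number of types, frequency table).

    Simplifying assumption: a token is a lowercased run of letters,
    optionally joined by internal hyphens or apostrophes. A real corpus
    would need a tokeniser tuned to Xhosa orthography.
    """
    tokens = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text.lower())
    frequencies = Counter(tokens)  # one entry per type
    return len(tokens), len(frequencies), frequencies


# Invented sample text, standing in for a multi-million-word corpus.
sample = "Molo, unjani? Ndiphilile, enkosi. Molo, molo!"

n_tokens, n_types, freq = corpus_statistics(sample)
print("tokens:", n_tokens)   # running words of text
print("types:", n_types)     # different orthographic words
print("most frequent:", freq.most_common(1))
```

On this sample the six running words reduce to four types, because ‘molo’ occurs three times. The same reduction, at a much larger scale, is what turns ten million tokens into a far smaller list of types.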

In the second step these statistics and the small contexts around them are used to study the Xhosa language in a new way, within a corpus linguistics framework. Basically, all traditional language fields are eligible for study, consequently becoming ‘corpus-based literature studies’, ‘corpus-based education studies’, ‘corpus-based translation studies’, and so on.
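One common way of inspecting the ‘small contexts’ around a word is a keyword-in-context (KWIC) concordance. The sketch below is a hedged illustration of that general technique, not a description of the tools the project uses. The function name `kwic`, the `window` parameter and the sample text are hypothetical.

```python
import re


def kwic(text, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence.

    Assumption: tokens are lowercased runs of letters; `window` is the
    number of tokens kept on each side of the keyword.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Invented sample text, standing in for real corpus data.
sample = "Molo, unjani? Ndiphilile, enkosi. Molo, molo!"

for left, node, right in kwic(sample, "molo", window=2):
    print(f"{left:>20}  [{node}]  {right}")
```

Each output line shows one occurrence of the search word with its immediate neighbours, so patterns of usage can be compared across many occurrences at a glance.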

The fields just mentioned are indeed also envisioned as research fields. Other fields that are already being researched are the principles of localisation into Xhosa (from both a linguistic and a cultural perspective), automated part-of-speech tagging (contrasting, among others, machine-learning techniques with finite-state morphological analyses), and lemmatisation approaches (needed in lexicography).

Linking research with tools, all of the following are envisioned as practical outcomes: a new approach to comparative Xhosa literature, the proposal of new (corpus-based) techniques for outcomes-based and task-oriented Xhosa education, automatic Xhosa term extractors, localised Xhosa software, supervised and unsupervised POS-taggers for Xhosa, and finally a new type of corpus-based Xhosa dictionary.

Publications in each of those fields, as well as on the corpus-building process itself, are being prepared. The eXe-Files, in short, provides the team with the building blocks to be a respected international player in computational linguistics, while placing the spotlight on Xhosa.

The eXe-Files: Sub-fields

From 2007 onwards, the project has been split along the lines of the following sub-fields, each with its own manager:

    Corpus Building: G-M de Schryver

    Lemmatisation and POS-tagging: S.J. Neethling

    Localisation and Terminology: N. Skade

    Spellchecking and (Machine) Translation: T. Ntwana

    Lexicography: A. van Huyssteen