Artificial Intelligence
Lelapa launches Africa’s first AI large language model
Swahili, Yoruba, isiXhosa, Hausa, and isiZulu all form part of the multilingual InkubaLM, aimed at enhancing low-resource African languages.
Swahili, Yoruba, isiXhosa, Hausa, and isiZulu have come to AI. These languages form part of InkubaLM, the continent’s first multilingual AI large language model (LLM), designed by pan-African AI initiative Lelapa AI to support and enhance low-resource African languages.
This model promises to accelerate the development of digital solutions for African languages as well as preservation efforts for languages that have historically been underrepresented in the digital landscape.
“As AI practitioners, we are committed to forging an inclusive future through the power of AI,” says Pelonomi Moiloa, CEO of Lelapa AI. “No one should have to assimilate to a culture outside of their own in order to access cutting-edge technology.
“While AI holds the promise of global prosperity, the challenge lies in the resources required for large models, which are often out of reach for the majority of the world. Open-source models have attempted to bridge this gap, but much more can be done to make models cost-effective, accessible, and locally relevant.”
As the world becomes increasingly connected, the need to preserve and promote linguistic diversity has never been more critical. Many languages, spoken by millions of people, lack the digital resources and tools necessary to thrive. Addressing this pressing issue, Lelapa AI has developed InkubaLM as a state-of-the-art language model that leverages cutting-edge AI to provide robust support for these languages.
InkubaLM, which translates to “Dung Beetle Language Model”, is a robust, compact model designed to serve African communities without requiring extensive resources.
Like the dung beetle, which can move 250 times its own weight, InkubaLM exemplifies the strength of smaller models, says Lelapa. Accompanied by two datasets, InkubaLM marks the first of many initiatives to distribute the resource load, ensuring African communities are empowered without having to start from scratch in building models and solutions.
InkubaLM is accompanied by two datasets, Inkuba-Mono and Inkuba-Instruct. Inkuba-Mono is a monolingual dataset used to pre-train the InkubaLM models: Lelapa collected open-source data in the five African languages from repositories on Hugging Face, GitHub, and Zenodo. After preprocessing, researchers used 1.9 billion tokens of data to train the InkubaLM models.
Inkuba-Instruct currently provides essential tools for translation, transcription, and natural language processing. The instruction dataset covers six tasks: Machine Translation, Sentiment Analysis, Named Entity Recognition (NER), Part-of-Speech (POS) Tagging, Question Answering, and News Topic Classification. For each task, it covers five African languages: Hausa, Swahili, Zulu, Yoruba, and Xhosa.
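To make the task-and-language structure above concrete, here is a minimal sketch of what a single record in an instruction dataset like Inkuba-Instruct might look like. The field names, language codes, and prompt wording are illustrative assumptions for this article, not the dataset’s actual schema.

```python
# Hypothetical sketch of an instruction-tuning record; the schema below is
# an assumption for illustration, not Inkuba-Instruct's real format.

TASKS = [
    "machine_translation",
    "sentiment_analysis",
    "named_entity_recognition",
    "part_of_speech_tagging",
    "question_answering",
    "news_topic_classification",
]

LANGUAGES = ["hau", "swa", "zul", "yor", "xho"]  # ISO 639-3-style codes

def make_record(task: str, language: str, instruction: str,
                text: str, target: str) -> dict:
    """Bundle one instruction-tuning example into a flat record."""
    return {
        "task": task,
        "language": language,
        "instruction": instruction,
        "input": text,
        "output": target,
    }

# One illustrative record: sentiment analysis in Swahili.
example = make_record(
    task="sentiment_analysis",
    language="swa",
    instruction="Classify the sentiment of the following Swahili sentence.",
    text="Huduma ilikuwa nzuri sana.",  # "The service was very good."
    target="positive",
)

# Six tasks across five languages gives 30 task-language combinations.
print(len(TASKS) * len(LANGUAGES))  # 30
```

Flat records like this are convenient because the same model can be trained across all task-language combinations with a single prompt template.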
Lelapa AI says it is committed to making this technology widely available, offering open access to researchers, educators, and language communities so that Africans can be included in the digital economy.
“InkubaLM is a foundation that can be built on,” Lelapa said in a statement. “The aim is to gain good functionality for the five selected languages, as conventional LLMs do not perform well in these languages and are therefore not a truly viable option for users developing the applications most needed and relevant for mainstream Africans. InkubaLM can be fine-tuned further for different language tasks, avoiding the need to train a model from scratch to handle these languages. As for the datasets, they mean that any language model can be better tuned to suit these languages.”
Atnafu Tonja, fundamental research lead at Lelapa AI, says: “Our language model is not just a technological achievement; it is a step towards greater linguistic equality and cultural preservation.”
- To find out more about InkubaLM, see this blog post.