Launching LazarusNLP: Reviving Indonesia's Dying Languages through NLP

Today we are launching LazarusNLP, an independent research group dedicated to leveraging Natural Language Processing (NLP) to preserve and revitalize the diverse languages of Indonesia. In a nation boasting over 700 distinct languages, our mission is to ensure that each language receives the attention it deserves in the digital age.

This post discusses the gaps in NLP research and development for Indonesian languages and introduces our initial projects. We are excited to share our work and invite the community to join us in our mission!

You can try out our projects in our web app demo, available at our 🤗 HuggingFace Space.

Background

Indonesia's linguistic landscape is rich and varied, with languages evolving independently across different regions. Despite the prevalence of Indonesian (Bahasa Indonesia) as the national language, many of these regional languages face the threat of extinction. UNESCO has identified 137 Indonesian languages as vulnerable or endangered, highlighting the urgent need for action¹.

While advancements in NLP have benefited major languages like Indonesian, there has been a glaring gap in applying these technologies to Indonesia's regional languages. This neglect risks further marginalizing these languages in an increasingly digital world. Our mission is to bridge this gap by developing NLP tools and resources tailored to the unique linguistic features of each Indonesian language, and in doing so to help revive Indonesia's dying languages.

Projects

IndoT5: T5 Language Models for the Indonesian Language

IndoT5 is a T5-based language model trained specifically for the Indonesian language. With just 8 hours of training on a limited budget, we developed a sequence-to-sequence, encoder-decoder model that can be fine-tuned for tasks such as summarization, chit-chat, and question answering. Despite these constraints, our model is competitive when evaluated on the IndoNLG (text generation) benchmark.

Indonesian Sentence Embedding Models

We trained open-source sentence embedding models for Indonesian, enabling applications such as information retrieval (useful for retrieval-augmented generation!), semantic text similarity, and zero-shot text classification. We build on existing pre-trained Indonesian language models such as IndoBERT, apply state-of-the-art unsupervised techniques, and evaluate on established sentence embedding benchmarks.
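To make the retrieval use case concrete, here is a minimal sketch of how sentence embeddings are typically compared: documents are ranked by cosine similarity to a query vector. The vectors below are hypothetical 3-dimensional stand-ins, not outputs of our models, and the function names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scores = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings; a real sentence embedding model would
# produce vectors with hundreds of dimensions.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
query_vec = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query_vec, doc_vecs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In a retrieval-augmented generation setup, the top-ranked documents would then be passed to a generative model as context.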

Indonesian Natural Language Inference (NLI) Models

We release open-source, lightweight NLI models that are competitive with larger models on the IndoNLI benchmark while using significantly fewer parameters. We applied knowledge distillation to small pre-trained language models such as IndoBERT Lite. These models offer an efficient option for tasks that require natural language inference, such as cross-encoder-based semantic search, while keeping computational cost low.

🤗 HuggingFace Collection
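For readers unfamiliar with knowledge distillation, the sketch below shows one common formulation: the student is trained to match the teacher's temperature-softened label distribution via a KL-divergence term. This is a generic textbook version and not necessarily the exact objective we used. The logits, the temperature value, and the function names are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(p_teacher, p_student)
    )
    return kl * temperature ** 2


# Hypothetical logits over three NLI labels:
# (entailment, neutral, contradiction)
teacher = [4.0, 1.0, -2.0]
close_student = [3.5, 1.2, -1.8]
far_student = [-2.0, 1.0, 4.0]

print(distillation_loss(close_student, teacher))  # small loss
print(distillation_loss(far_student, teacher))    # much larger loss
```

In practice this term is usually combined with the ordinary cross-entropy loss on the gold labels, so the student learns from both the teacher and the annotated data.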

Many-to-Many Multilingual Translation Models

We adapted mT5 to 45 languages of Indonesia to build a robust baseline model for multilingual translation. This baseline can be further fine-tuned for niche domains and low-resource languages, contributing to greater linguistic inclusivity. Our models are competitive with existing multilingual translation models on the NusaX benchmark.
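To give a sense of what "many-to-many" implies at this scale, the sketch below enumerates directed translation pairs for a set of languages. With n languages there are at most n × (n − 1) source-to-target directions. The four ISO 639-3 codes are a hypothetical subset chosen for illustration, and the count for 45 languages is an upper bound that assumes every pair is covered.

```python
from itertools import permutations


def translation_directions(languages):
    """All ordered (source, target) pairs with source != target."""
    return list(permutations(languages, 2))


def count_directions(num_languages):
    """Upper bound on directed pairs among num_languages languages."""
    return num_languages * (num_languages - 1)


# Hypothetical subset: Indonesian, Javanese, Sundanese, Minangkabau
langs = ["ind", "jav", "sun", "min"]
directions = translation_directions(langs)

print(len(directions))         # 12 directions for 4 languages
print(count_directions(45))    # 1980 directions for 45 languages
```

A single many-to-many model covers all of these directions at once, whereas pairwise models would need one model per direction.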

Future Plans

Our journey has just begun. Looking ahead, we are committed to expanding our repository of open-source pre-trained language models, with a focus on Indonesia's languages, multilinguality, culture, and code-switching. By democratizing access to NLP tools for all Indonesian languages, we aim to catalyze a renaissance in linguistic diversity.

Join us in our mission to breathe new life into Indonesia's linguistic tapestry!

Contact Us

We are always open to collaboration and welcome contributions from the community. If you are interested in our work or have ideas to share, please reach out to us at lazarusnlp(at)gmail(dot)com.

Written by David Samuel Setiawan, Steven Limcorn, and Wilson Wongso. Last updated 19 February 2024.

¹ Moseley, Christopher, ed. (2010). Atlas of the World’s Languages in Danger. Memory of Peoples (3rd ed.). Paris: UNESCO Publishing. ISBN 978-92-3-104096-2.