Machine Translation

Indo-mT5

Indo-mT5 is mT5 fine-tuned for machine translation of regional languages of Indonesia. We release our dataset creation scripts, training code, and fine-tuned models for others to use.

There are two types of models:

Multilingual: Many-to-many, multilingual translation model.
Bilingual: Unidirectional, bilingual translation model.

We also experiment with two language settings:

Baseline: Model trained on 7 languages (ace, ban, bug, ind, jav, min, sun).
All: Model trained on 45 languages as listed here.
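To give a sense of scale for the many-to-many setting, the sketch below enumerates the ordered translation directions among the Baseline languages. It assumes every ordered pair of distinct languages counts as one direction; the helper name `translation_directions` is hypothetical and not part of this repository.

```python
from itertools import permutations

# Baseline language codes as listed above.
BASELINE = ["ace", "ban", "bug", "ind", "jav", "min", "sun"]


def translation_directions(languages):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))


directions = translation_directions(BASELINE)
print(len(directions))  # 42, i.e. 7 * 6
```

Under the same assumption, the All setting with 45 languages has 45 * 44 = 1980 directions. A single multilingual model covers all of them, whereas a unidirectional bilingual model covers exactly one.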

Training

Our experiments are conducted in these steps:

Multilingual Training on Bible: We first fine-tune mT5 for multilingual translation on the parallel Bible dataset, creating Indo-mT5.
Multilingual Training on NusaX: We take Indo-mT5 and fine-tune it on multilingual pairs of the NusaX dataset.
Bilingual Training on NusaX: We take Indo-mT5 and fine-tune it on bilingual pairs of the NusaX dataset.
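A text-to-text model such as mT5 needs to be told which direction to translate in when one model serves several language pairs. The sketch below shows one common way to do this, a task prefix on the input. The prefix wording and the function `format_example` are assumptions for illustration only; the training scripts in this repository may encode the direction differently.

```python
def format_example(source_text, target_text, source_lang, target_lang):
    """Build one text-to-text training example.

    The "translate X to Y: " prefix is a hypothetical convention,
    not necessarily the one used by the actual training code.
    """
    prefix = f"translate {source_lang} to {target_lang}: "
    return {"input": prefix + source_text, "target": target_text}


example = format_example("Selamat pagi", "Sugeng enjing", "ind", "jav")
print(example["input"])   # translate ind to jav: Selamat pagi
print(example["target"])  # Sugeng enjing
```

With a scheme like this, a multilingual run mixes examples with many different prefixes, while a bilingual run only ever sees one.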

Combined with the two settings (Baseline and All), these steps give six training scripts, each with a matching evaluation script:

Dataset	Config	Type	Training Script	Evaluation Script
Bible	Baseline	Multilingual	train_bible_baseline.sh	eval_bible_baseline.sh
Bible	All (v2)	Multilingual	train_bible_all.sh	eval_bible_all.sh
NusaX	Baseline	Multilingual	train_nusax_baseline_multilingual.sh	eval_nusax_baseline_multilingual.sh
NusaX	All (v2)	Multilingual	train_nusax_all_multilingual.sh	eval_nusax_all_multilingual.sh
NusaX	Baseline	Bilingual	train_nusax_baseline_bilingual.sh	eval_nusax_baseline_bilingual.sh
NusaX	All (v2)	Bilingual	train_nusax_all_bilingual.sh	eval_nusax_all_bilingual.sh
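The script names in the table follow a regular pattern, which the sketch below reproduces. The helper `script_names` is hypothetical and only mirrors the naming visible in the table; it makes no assumption about where the scripts live in the repository.

```python
def script_names(dataset, config, model_type=None):
    """Return (training script, evaluation script) for one table row.

    dataset: "bible" or "nusax"; config: "baseline" or "all";
    model_type: "multilingual" or "bilingual" (NusaX rows only).
    """
    parts = [dataset, config]
    if dataset == "nusax":
        if model_type not in ("multilingual", "bilingual"):
            raise ValueError("NusaX scripts need a model type")
        parts.append(model_type)
    stem = "_".join(parts)
    return f"train_{stem}.sh", f"eval_{stem}.sh"


print(script_names("bible", "all"))
# ('train_bible_all.sh', 'eval_bible_all.sh')
print(script_names("nusax", "baseline", "bilingual"))
# ('train_nusax_baseline_bilingual.sh', 'eval_nusax_baseline_bilingual.sh')
```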

Results

We evaluated our models on NusaX (Winata et al., 2022) and compared them to existing models. A dash (-) marks a language for which no score is reported.

`ind -> x`

Model	#params	`ace`	`ban`	`bbc`	`bjn`	`bug`	`jav`	`mad`	`min`	`nij`	`sun`	avg
IndoGPT (Winata et al., 2022)	117M	9.60	14.17	8.20	22.23	5.18	24.05	14.44	26.95	17.56	23.15	16.55
IndoBART v2 (Winata et al., 2022)	132M	19.21	27.08	18.41	40.03	11.06	39.97	28.95	48.48	27.11	38.46	29.88
mBART-50 Large (Winata et al., 2022)	610M	17.21	22.67	17.79	34.26	10.78	35.33	28.63	43.87	25.91	31.21	26.77
mT5 Base (Winata et al., 2022)	580M	14.79	18.07	18.22	38.64	6.68	33.48	0.96	45.84	13.59	33.79	22.41
NLLB-200 Distilled (zero-shot)	600M	2.74	4.87	-	-	1.66	17.66	-	9.79	-	11.92	8.11
Indo-mT5 NusaX Multilingual	580M	16.02	22.48	-	-	8.86	33.65	-	33.65	-	29.76	24.07
Indo-mT5 NusaX Bilingual	580M	17.99	27.03	-	-	10.80	39.63	-	51.56	-	35.16	30.36
Indo-mT5 v2 NusaX Multilingual	580M	14.28	19.19	14.86	28.39	8.05	28.70	20.95	32.70	22.30	26.19	21.56
Indo-mT5 v2 NusaX Bilingual	580M	17.58	24.24	16.69	38.81	10.20	37.87	29.77	50.90	26.93	34.22	28.72

`x -> ind`

Model	#params	`ace`	`ban`	`bbc`	`bjn`	`bug`	`jav`	`mad`	`min`	`nij`	`sun`	avg
IndoGPT (Winata et al., 2022)	117M	7.01	13.23	5.27	19.53	1.98	27.31	13.75	23.03	10.83	23.18	14.51
IndoBART v2 (Winata et al., 2022)	132M	24.44	40.49	19.94	47.81	12.64	50.64	36.10	58.38	33.50	45.96	36.99
mBART-50 Large (Winata et al., 2022)	610M	18.45	34.23	17.43	41.73	10.87	39.66	32.11	59.66	29.84	35.19	31.92
mT5 Base (Winata et al., 2022)	580M	18.59	21.73	12.85	42.29	2.64	45.22	32.35	58.65	25.61	36.58	29.65
NLLB-200 Distilled (zero-shot)	600M	9.42	21.24	-	-	6.18	30.54	-	40.49	-	26.91	22.46
Indo-mT5 NusaX Multilingual	580M	23.94	35.30	-	-	16.68	29.76	-	48.10	-	36.54	31.72
Indo-mT5 NusaX Bilingual	580M	24.78	42.15	-	-	16.27	47.26	-	62.94	-	42.39	39.30
Indo-mT5 v2 NusaX Multilingual	580M	21.01	30.43	18.57	34.21	14.42	35.19	27.04	42.64	26.90	33.78	28.42
Indo-mT5 v2 NusaX Bilingual	580M	22.87	39.48	20.48	44.53	15.97	45.20	36.65	60.97	32.38	39.80	35.83
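When comparing the avg column across rows, note that the figures are consistent with a mean over only the languages that have a score, so rows with dashes average over six languages rather than ten. The sketch below checks this for one row; `row_average` is a hypothetical helper, and `None` stands in for a dash.

```python
from statistics import mean


def row_average(scores):
    """Mean of the reported scores, rounded to two decimals."""
    reported = [s for s in scores if s is not None]
    return round(mean(reported), 2)


# "Indo-mT5 NusaX Multilingual" in the ind -> x table, in column order
# ace, ban, bbc, bjn, bug, jav, mad, min, nij, sun.
row = [16.02, 22.48, None, None, 8.86, 33.65, None, 33.65, None, 29.76]
print(row_average(row))  # 24.07
```

Averages computed over different language subsets are therefore not directly comparable; a per-language comparison is safer.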

Parallel Bible Dataset Creation

This section covers how we created our Bible machine-translation dataset.

Overview

Scrape Bible Data
Align Bible Verses
Load as Machine-Translation Dataset

Bible Scraping

python utils/scrape_parallel.py \
    --codes abun aceh ambdr aralle balantak bali bambam bauzi berik bugis dairi duri ende galela gorontalo iban jawa kaili_daa karo kupang lampung madura makasar mamasa manggarai mentawai meyah minang mongondow napu ngaju nias rote sabu sangir sasak simalungun sunda taa tabaru tb toba toraja uma yali yawa \
    --outdir corpus \
    -j 4

Align Bible Verses

for LANGUAGE in abun aceh ambdr aralle balantak bali bambam bauzi berik bugis dairi duri ende galela gorontalo iban jawa kaili_daa karo kupang lampung madura makasar mamasa manggarai mentawai meyah minang mongondow napu ngaju nias rote sabu sangir sasak simalungun sunda taa tabaru tb toba toraja uma yali yawa
do
    python utils/align.py --path "corpus/${LANGUAGE}.json" --outdir corpus_aligned
done

You can read more about aligning Bible verses in our blog post.
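As a rough illustration of what alignment has to achieve, the sketch below pairs two translations by shared verse ID. It is a deliberately simplified stand-in, not the logic of utils/align.py: it assumes each translation is a mapping from a verse ID string to verse text, which may not match the actual layout of the scraped JSON files, and the function name `align_pair` and the ID format are hypothetical.

```python
def align_pair(reference, other):
    """Return (verse_id, reference_text, other_text) for shared verse IDs.

    Both arguments are assumed to be dicts of verse ID -> verse text.
    Verses missing from either side are dropped.
    """
    shared = sorted(set(reference) & set(other))
    return [(vid, reference[vid], other[vid]) for vid in shared]


# Toy data with placeholder text; real verse IDs and text will differ.
ref = {"GEN.1.1": "ref verse 1", "GEN.1.2": "ref verse 2", "GEN.1.3": "ref verse 3"}
oth = {"GEN.1.1": "other verse 1", "GEN.1.3": "other verse 3", "GEN.1.4": "other verse 4"}

for row in align_pair(ref, oth):
    print(row)
# ('GEN.1.1', 'ref verse 1', 'other verse 1')
# ('GEN.1.3', 'ref verse 3', 'other verse 3')
```

Real translations can also differ in how verses are grouped or numbered, which a plain ID match like this does not handle.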

Data Loading Script

The data loading script performs two steps:

Split the unique verse IDs into train/test/validation subsets.
Within each subset, generate every ordered language pair (permutation) for each verse ID.

You can find our data loading implementation in src/alkitab-sabda-mt.py.
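The two steps can be sketched as follows. Splitting by verse ID before pairing means a verse never appears in more than one subset, in any language. The function names, split fractions, seed, and the `{verse_id: {lang: text}}` layout are all assumptions for illustration; this sketch also reads "permutations" as ordered language pairs, and the actual script in src/alkitab-sabda-mt.py may differ.

```python
import random
from itertools import permutations


def split_verse_ids(verse_ids, test_frac=0.1, valid_frac=0.1, seed=42):
    """Shuffle the unique verse IDs and cut them into three disjoint subsets."""
    ids = sorted(set(verse_ids))
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_frac)
    n_valid = int(len(ids) * valid_frac)
    return {
        "test": ids[:n_test],
        "validation": ids[n_test:n_test + n_valid],
        "train": ids[n_test + n_valid:],
    }


def make_pairs(verses, ids):
    """Emit one example per ordered language pair for each verse ID."""
    examples = []
    for vid in ids:
        by_lang = verses[vid]
        for src, tgt in permutations(sorted(by_lang), 2):
            examples.append({
                "verse_id": vid,
                "source_lang": src,
                "target_lang": tgt,
                "source": by_lang[src],
                "target": by_lang[tgt],
            })
    return examples


# Toy corpus: 10 verses, 3 languages, placeholder text.
verses = {
    f"v{i}": {lang: f"{lang} text {i}" for lang in ("ind", "jav", "sun")}
    for i in range(10)
}
splits = split_verse_ids(verses)
train = make_pairs(verses, splits["train"])
print(len(splits["train"]), len(train))  # 8 48
```

Each verse with three languages yields 3 * 2 = 6 examples, so the 8 training verses produce 48 pairs.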