mMARCO
mMARCO is a multilingual version of the MS MARCO passage ranking dataset, translated via Google Translate API. It supports up to 14 languages, including Indonesian.
Unlike the original MS MARCO dataset, this version only has query-positive-negative triplets. In the original version, for instance, it had a list of passages which may be relevant to the query, and a label for the most relevant passage.
Bi-Encoder with MultipleNegativesRankingLoss
IndoBERT Base
python train_bi-encoder_mmarco_mnrl.py \
--model-name indobenchmark/indobert-base-p1 \
--train-dataset-name unicamp-dl/mmarco \
--train-dataset-config indonesian \
--max-seq-length 32 \
--max-train-samples 1000000 \
--num-epochs 5 \
--train-batch-size 128 \
--learning-rate 2e-5 \
--warmup-ratio 0.1
References
@misc{bonifacio2021mmarco,
title={mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset},
author={Luiz Henrique Bonifacio and Vitor Jeronymo and Hugo Queiroz Abonizio and Israel Campiotti and Marzieh Fadaee and and Roberto Lotufo and Rodrigo Nogueira},
year={2021},
eprint={2108.13897},
archivePrefix={arXiv},
primaryClass={cs.CL}
}