SCT
SCT is an efficient self-supervised cross-view training method for sentence embeddings that also supports knowledge distillation from a fine-tuned sentence embedding teacher model. Like ConGen, the technique trains the student model to mimic the teacher model's logits over an instance queue, and to generalize to augmented texts for robustness. Unlike ConGen, the instance queue is filled with randomly generated (fake) sentence embeddings instead of actual sentence vectors.
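To make the objective concrete, the cross-view distillation step can be sketched roughly as follows. This is a minimal PyTorch sketch based only on the description above, not the actual `train_sct_distillation.py` implementation; the soft cross-entropy form, matching embedding dimensions, and tensor shapes are assumptions.

import torch
import torch.nn.functional as F

def sct_distillation_loss(student_emb, teacher_emb, queue,
                          student_temp=0.5, teacher_temp=0.5):
    """Soft cross-entropy between the student's and the teacher's similarity
    distributions over the instance queue of random sentence embeddings."""
    student_emb = F.normalize(student_emb, dim=-1)          # (batch, dim)
    teacher_emb = F.normalize(teacher_emb, dim=-1)          # (batch, dim)
    student_logits = student_emb @ queue.T / student_temp   # (batch, queue_size)
    teacher_logits = teacher_emb @ queue.T / teacher_temp
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    return F.cross_entropy(student_logits, teacher_probs)

# Instance queue of random ("fake") embeddings, as opposed to ConGen's queue
# of real sentence vectors; sizes mirror the CLI defaults below.
queue = F.normalize(torch.randn(65536, 768), dim=-1)

# In training, student_emb would come from the student encoding the augmented
# view, and teacher_emb from the frozen teacher encoding the original text.
student_emb, teacher_emb = torch.randn(128, 768), torch.randn(128, 768)
loss = sct_distillation_loss(student_emb, teacher_emb, queue)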
Training via SCT requires an unsupervised corpus, which is readily available for Indonesian; in our experiments, we used Wikipedia texts. For data augmentation, Limkonchotiwat et al. (2023) proposed back-translation via an NMT model or the Google Translate API. However, since that is costly to compute for 1 million texts, we opted for a simple single-word deletion technique instead. Interestingly, we found that using a back-translated corpus resulted in a poorer model than using single-word deletion. We hypothesize that this is due to the quality of open-source Indonesian machine translation models; further study is required.
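The single-word deletion augmentation mentioned above amounts to something like the following illustrative sketch with naive whitespace tokenization; the exact corruption applied by the script's `--do_corrupt` option may differ in its details.

import random

def delete_single_word(text: str) -> str:
    """Drop one randomly chosen word from the sentence (whitespace tokenization)."""
    words = text.split()
    if len(words) < 2:  # nothing sensible to delete
        return text
    drop_idx = random.randrange(len(words))
    return " ".join(words[:drop_idx] + words[drop_idx + 1:])

print(delete_single_word("Ibu kota Indonesia adalah Jakarta"))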
SCT Distillation with Single-word Deletion
IndoBERT Base
python train_sct_distillation.py \
--model-name indobenchmark/indobert-base-p1 \
--train-dataset-name LazarusNLP/wikipedia_id_backtranslated \
--train_text_column_1 text \
--do_corrupt \
--max-seq-length 128 \
--num-epochs 20 \
--train-batch-size 128 \
--early-stopping-patience 7 \
--learning-rate 1e-4 \
--teacher-model-name sentence-transformers/paraphrase-multilingual-mpnet-base-v2 \
--queue-size 65536 \
--student-temp 0.5 \
--teacher-temp 0.5
SCT Distillation with Back-translated Corpus
IndoBERT Base
python train_sct_distillation.py \
--model-name indobenchmark/indobert-base-p1 \
--train-dataset-name LazarusNLP/wikipedia_id_backtranslated \
--train_text_column_1 text \
--train_text_column_2 text_bt \
--max-seq-length 128 \
--num-epochs 20 \
--train-batch-size 128 \
--early-stopping-patience 7 \
--learning-rate 1e-4 \
--teacher-model-name sentence-transformers/paraphrase-multilingual-mpnet-base-v2 \
--queue-size 65536 \
--student-temp 0.5 \
--teacher-temp 0.5
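After either run finishes, the resulting student can be loaded with the Sentence Transformers library like any other model. The output path below is a placeholder for wherever the training script saves its checkpoint, assuming it is stored in the standard Sentence Transformers format.

from sentence_transformers import SentenceTransformer

# Placeholder path; point this at the directory produced by train_sct_distillation.py
model = SentenceTransformer("path/to/trained-sct-student")
embeddings = model.encode(["Ibu kota Indonesia adalah Jakarta."])
print(embeddings.shape)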
References
@article{10.1162/tacl_a_00620,
author = {Limkonchotiwat, Peerat and Ponwitayarat, Wuttikorn and Lowphansirikul, Lalita and Udomcharoenchaikit, Can and Chuangsuwanich, Ekapol and Nutanong, Sarana},
title = "{An Efficient Self-Supervised Cross-View Training For Sentence Embedding}",
journal = {Transactions of the Association for Computational Linguistics},
volume = {11},
pages = {1572-1587},
year = {2023},
month = {12},
issn = {2307-387X},
doi = {10.1162/tacl_a_00620},
url = {https://doi.org/10.1162/tacl_a_00620},
eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00620/2196817/tacl_a_00620.pdf},
}