ConGen
ConGen is an unsupervised knowledge distillation technique that aims to control and generalize a smaller student model using a larger sentence embedding teacher model. In short, the technique trains the student to mimic the teacher's logits over an instance queue drawn from the training data (control), while also generalizing to augmented versions of the input texts for robustness (generalization).
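To make the objective concrete, below is a minimal PyTorch sketch of the distillation loss as we understand it from the paper. The tensor names, the function signature, and the assumption that student and teacher embeddings share the same dimensionality are ours; this is not the actual implementation in train_con_gen.py.

import torch
import torch.nn.functional as F

def congen_loss(student_emb, teacher_emb, queue, student_temp=0.5, teacher_temp=0.5):
    """ConGen distillation objective (sketch).

    student_emb: student embeddings of the augmented texts, shape (B, D)
    teacher_emb: teacher embeddings of the original texts, shape (B, D)
    queue: instance queue of teacher embeddings, shape (K, D)

    Assumes student and teacher embeddings share dimensionality D;
    a projection head on the student would be needed otherwise.
    """
    student_emb = F.normalize(student_emb, dim=-1)
    teacher_emb = F.normalize(teacher_emb, dim=-1)
    queue = F.normalize(queue, dim=-1)

    # Cosine-similarity logits of each sentence against every queue entry: (B, K)
    student_logits = student_emb @ queue.T / student_temp
    teacher_logits = teacher_emb @ queue.T / teacher_temp

    # The temperature-sharpened teacher distribution is the soft target;
    # no gradient flows back into the teacher.
    targets = F.softmax(teacher_logits, dim=-1).detach()
    return -(targets * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()

During training, the queue is typically refreshed first-in-first-out with the teacher embeddings of each incoming batch, so the soft targets always cover recent instances; the --queue-size flag below controls K.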
Training via ConGen requires an unsupervised corpus, which is readily available for Indonesian; in our experiments, we used Wikipedia texts. For data augmentation, Limkonchotiwat et al. (2022) proposed back-translation via an NMT model or the Google Translate API. Since that is costly to compute for 1 million texts, we opted for a simple single-word deletion technique instead.
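A possible implementation of this augmentation is sketched below; the function name and the choice to leave one-word texts untouched are ours, and the actual logic in train_con_gen.py may differ.

import random

def single_word_deletion(text: str) -> str:
    """Return the text with one uniformly chosen word removed."""
    words = text.split()
    if len(words) < 2:  # nothing to safely delete from a one-word text
        return text
    del words[random.randrange(len(words))]
    return " ".join(words)

# e.g. single_word_deletion("ibu kota indonesia adalah jakarta")
# -> "ibu kota adalah jakarta"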
ConGen with Single-word Deletion
IndoBERT Base
python train_con_gen.py \
--model-name indobenchmark/indobert-base-p1 \
--train-dataset-name LazarusNLP/wikipedia_id_20230520 \
--max-seq-length 32 \
--max-train-samples 1000000 \
--num-epochs 20 \
--train-batch-size 128 \
--early-stopping-patience 7 \
--learning-rate 1e-4 \
--teacher-model-name sentence-transformers/paraphrase-multilingual-mpnet-base-v2 \
--queue-size 65536 \
--student-temp 0.5 \
--teacher-temp 0.5
IndoBERT Lite Base
python train_con_gen.py \
--model-name indobenchmark/indobert-lite-base-p1 \
--train-dataset-name LazarusNLP/wikipedia_id_20230520 \
--max-seq-length 32 \
--max-train-samples 1000000 \
--num-epochs 20 \
--train-batch-size 128 \
--early-stopping-patience 7 \
--learning-rate 3e-4 \
--teacher-model-name sentence-transformers/paraphrase-multilingual-mpnet-base-v2 \
--queue-size 65536 \
--student-temp 0.05 \
--teacher-temp 0.05
SimCSE-IndoBERT Base
python train_con_gen.py \
--model-name LazarusNLP/simcse-indobert-base \
--train-dataset-name LazarusNLP/wikipedia_id_20230520 \
--max-seq-length 32 \
--max-train-samples 1000000 \
--num-epochs 20 \
--train-batch-size 128 \
--early-stopping-patience 7 \
--learning-rate 1e-4 \
--teacher-model-name sentence-transformers/paraphrase-multilingual-mpnet-base-v2 \
--queue-size 65536 \
--student-temp 0.5 \
--teacher-temp 0.5
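After training, the distilled student is a drop-in sentence encoder. A minimal usage sketch follows, assuming train_con_gen.py saves the student in Sentence Transformers format; the output path below is hypothetical.

from sentence_transformers import SentenceTransformer

# Hypothetical output directory; point this at wherever train_con_gen.py
# saved the distilled student model.
model = SentenceTransformer("outputs/congen-indobert-base")
embeddings = model.encode(["Ibu kota Indonesia adalah Jakarta."])
print(embeddings.shape)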
References
@inproceedings{limkonchotiwat-etal-2022-congen,
    title = "{ConGen}: Unsupervised Control and Generalization Distillation For Sentence Representation",
    author = "Limkonchotiwat, Peerat and
      Ponwitayarat, Wuttikorn and
      Lowphansirikul, Lalita and
      Udomcharoenchaikit, Can and
      Chuangsuwanich, Ekapol and
      Nutanong, Sarana",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    year = "2022",
    publisher = "Association for Computational Linguistics",
}