publications
2024
- arXiv 2024LangSAMP: Language-Script Aware Multilingual PretrainingYihong Liu, Haotian Ye, Chunlan Ma, and 2 more authorsarXiv preprint arXiv:2409.18199, 2024
Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings – learnable vectors assigned to different languages. These embeddings are discarded for two main reasons: (1) mPLMs are expected to have a single, unified parameter set across all languages, and (2) they need to function seamlessly as universal text encoders without requiring language IDs as input. However, this removal increases the burden on token embeddings to encode all language-specific information, which may hinder the model’s ability to produce more language-neutral representations. To address this challenge, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning while maintaining a simple architecture. Specifically, we integrate these embeddings into the output of the transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline. Extensive analysis further shows that language/script embeddings encode language/script-specific information, which improves the selection of source languages for crosslingual transfer. We make our code and models publicly available at https://github.com/cisnlp/LangSAMP.
- COLING 2025How Transliterations Improve Crosslingual AlignmentYihong Liu, Mingyang Wang, Amir Hossein Kargaran, and 6 more authorsarXiv preprint arXiv:2409.17326, 2024
Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives on both original and transliterated data can improve crosslingual alignment. This improvement further leads to better crosslingual transfer performance. However, it remains unclear how and why a better crosslingual alignment is achieved, as this technique only involves transliterations, and does not use any parallel data. This paper attempts to explicitly evaluate the crosslingual alignment and identify the key elements in transliteration-based approaches that contribute to better performance. For this, we train multiple models under varying setups for two pairs of related languages: (1) Polish and Ukrainian and (2) Hindi and Urdu. To assess alignment, we define four types of similarities based on sentence representations. Our experiments show that adding transliterations alone improves the overall similarities, even for random sentence pairs. With the help of auxiliary alignment objectives, especially the contrastive objective, the model learns to distinguish matched from random pairs, leading to better alignments. However, we also show that better alignment does not always yield better downstream performance, suggesting that further research is needed to clarify the connection between alignment and performance.
- EMNLP 2024SynthEval: Hybrid Behavioral Testing of NLP Models with Synthetic EvaluationRaoyuan Zhao, Abdullatif Köksal, Yihong Liu, and 3 more authorsIn Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024
Traditional benchmarking in NLP typically involves using static, held-out test sets and calculating aggregated statistics based on diverse examples. However, this approach often results in an overestimation of performance and lacks the ability to offer comprehensive, interpretable, and dynamic assessments of NLP models. Recently, works like DynaBench and Checklist have addressed these limitations through behavioral testing of NLP models with test types generated by a multi-step human-annotated pipeline. Unfortunately, manually creating a variety of test types requires significant human labor, thus weakening efficiency. In this work, we propose SynthEval, a hybrid behavioral testing framework that leverages large language models (LLMs) to generate a wide range of test types for a comprehensive evaluation of NLP models. The SynthEval framework first generates sentences via LLMs using controlled generation, and then identifies challenging examples by comparing the predictions made by LLMs with task-specific NLP models. In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the task-specific models consistently exhibit. We apply SynthEval to two classification tasks and show that our framework is effective in identifying weaknesses of strong models on these tasks.
- arXiv 2024Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin ScriptsChunlan Ma, Yihong Liu, Haotian Ye, and 1 more authorarXiv preprint arXiv:2407.02320, Nov 2024
Decoder-only large language models (LLMs) excel in high-resource languages across various tasks through few-shot or even zero-shot in-context learning (ICL). However, their performance often does not transfer well to low-resource languages, especially those written in non-Latin scripts. Inspired by recent work that leverages transliteration in encoder-only models, we investigate whether transliteration is also effective in improving LLMs’ performance for low-resource languages written in non-Latin scripts. To this end, we propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both. We apply these methods to several representative LLMs of different sizes on various tasks including text classification and sequential labeling. Our findings show that the effectiveness of transliteration varies by task type and model size. For instance, all models benefit from transliterations for sequential labeling (with increases of up to 25%).
- EMNLP 2024Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training AlignmentOrgest Xhelili, Yihong Liu, and Hinrich SchuetzeIn Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024
TMultilingual pre-trained models (mPLMs) have shown impressive performance on cross-lingual transfer tasks. However, the transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language, even though the two languages may be related or share parts of their vocabularies. Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method aiming to improve the cross-lingual alignment between languages using diverse scripts. We select two areal language groups, \textbfMediterranean-Amharic-Farsi and \textbfSouth+East Asian Languages, wherein the languages are mutually influenced but use different scripts. We apply our method to these language groups and conduct extensive experiments on a spectrum of downstream tasks. The results show that after PPA, models consistently outperform the original model (up to 50% for some tasks) in English-centric transfer. In addition, when we use languages other than English as sources in transfer, our method obtains even larger improvements. We will make our code and models publicly available at https://github.com/cisnlp/Transliteration-PPA
- COLING 2025TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated DataYihong Liu, Chunlan Ma, Haotian Ye, and 1 more authorarXiv preprint arXiv:2405.09913, Nov 2024
Transliterating related languages that use different scripts into a common script shows effectiveness in improving crosslingual transfer in downstream tasks. However, this methodology often makes pretraining a model from scratch unavoidable, as transliteration brings about new subwords not covered in existing multilingual pretrained language models (mPLMs). This is not desired because it takes a lot of computation budget for pretraining. A more promising way is to make full use of available mPLMs. To this end, this paper proposes a simple but effective framework: Transliterate-Merge-Initialize (TransMI), which can create a strong baseline well-suited for data that is transliterated into a common script by exploiting an mPLM and its accompanied tokenizer. TransMI has three stages: (a) transliterate the vocabulary of an mPLM into a common script; (b) merge the new vocabulary with the original vocabulary; and (c) initialize the embeddings of the new subwords. We applied TransMI to three recent strong mPLMs, and our experiments demonstrate that TransMI not only preserves their ability to handle non-transliterated data, but also enables the models to effectively process transliterated data: the results show a consistent improvement of 3% to 34%, varying across different models and tasks. We make our code and models publicly available.
- ACL 2024TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language ModelsYihong Liu, Chunlan Ma, Haotian Ye, and 1 more authorIn Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2024
The world’s more than 7000 languages are written in at least 293 scripts. Due to various reasons, many closely related languages use different scripts, which poses a difficulty for multilingual pretrained language models (mPLMs) in learning crosslingual knowledge through lexical overlap. As a consequence, mPLMs are faced with a script barrier: representations from different scripts are located in different subspaces, which can result in crosslingual transfer involving languages of different scripts performing suboptimally. To address this problem, we propose TransliCo, a framework that optimizes the Transliteration Contrastive Modeling (TCM) objective to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script (in our case Latin), which enhances uniformity in the representation space for different scripts. Using Glot500-m, an mPLM pretrained on over 500 languages, as our source model, we fine-tune it on a small portion (5%) of its training data, and refer to the resulting model as Furina. We show that Furina not only better aligns representations from distinct scripts but also outperforms the original Glot500-m on various zero-shot crosslingual transfer tasks. Additionally, we achieve consistent improvement in a case study on the Indic group where the languages exhibit areal features but use different scripts. We make our code and models publicly available.
- Insights 2024MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot TransferHaotian Ye, Yihong Liu, Chunlan Ma, and 1 more authorIn Proceedings of the Fifth Workshop on Insights from Negative Results in NLP, Jun 2024
Transformer-based pre-trained language models (PLMs) have achieved remarkable performance in various natural language processing (NLP) tasks. However, pre-training such models can take considerable resources that are almost only available to high-resource languages. On the contrary, static word embeddings are easier to train in terms of computing resources and the amount of data required. In this paper, we introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer, a novel and challenging task that is especially relevant to low-resource languages for which static word embeddings are available. To tackle the task, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language. In this way, we can train the PLM on source-language training data and perform zero-shot transfer to the target language by simply swapping the embedding layer. However, through extensive experiments on two classification datasets, we show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines. In this paper, we attempt to explain this negative result and provide several thoughts on possible improvement.
- NAACL 2024OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued PretrainingYihong Liu, Peiqin Lin, Mingyang Wang, and 1 more authorIn Findings of the Association for Computational Linguistics: NAACL 2024, Jun 2024
Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this method usually randomly initializes the embeddings of new subwords and introduces substantially more embedding parameters to the model, thus weakening the efficiency. To address these issues, we propose a novel framework: \textbfOne \textbfFor \textbfAll (\textbfOFA), which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively. OFA takes advantage of external well-aligned multilingual static word vectors and injects the alignment knowledge into the subword embeddings. In addition, OFA applies matrix factorization and replaces the cumbersome embeddings with two lower-dimensional matrices, which largely reduces the number of parameters. We show OFA accelerates the convergence of continued pretraining, which is environmentally friendly as much fewer carbon footprints are generated. Through extensive experiments, we demonstrate OFA can achieve competitive or better performance than default continued pretraining baselines on a wide range of crosslingual downstream tasks. We make our code and models publicly available.
2023
- EMNLP 2023Crosslingual Transfer Learning for Low-Resource Languages Based on Multilingual Colexification GraphsYihong Liu, Haotian Ye, Leonie Weissweiler, and 2 more authorsIn Findings of the Association for Computational Linguistics: EMNLP 2023, Dec 2023
In comparative linguistics, colexification refers to the phenomenon of a lexical form conveying two or more distinct meanings. Existing work on colexification patterns relies on annotated word lists, limiting scalability and usefulness in NLP. In contrast, we identify colexification patterns of more than 2,000 concepts across 1,335 languages directly from an unannotated parallel corpus. We then propose simple and effective methods to build multilingual graphs from the colexification patterns: \textbfColexNet and \textbfColexNet+. ColexNet’s nodes are concepts and its edges are colexifications. In ColexNet+, concept nodes are additionally linked through intermediate nodes, each representing an ngram in one of 1,334 languages. We use ColexNet+ to train \overrightarrow\mboxColexNet+, high-quality multilingual embeddings that are well-suited for transfer learning. In our experiments, we first show that ColexNet achieves high recall on CLICS, a dataset of crosslingual colexifications. We then evaluate \overrightarrow\mboxColexNet+ on roundtrip translation, sentence retrieval and sentence classification and show that our embeddings surpass several transfer learning baselines. This demonstrates the benefits of using colexification as a source of information in multilingual NLP.
- arXiv 2023A study of conceptual language similarity: comparison and evaluationHaotian Ye, Yihong Liu, and Hinrich SchützearXiv preprint arXiv:2305.13401, Dec 2023
An interesting line of research in natural language processing (NLP) aims to incorporate linguistic typology to bridge linguistic diversity and assist the research of low-resource languages. While most works construct linguistic similarity measures based on lexical or typological features, such as word order and verbal inflection, recent work has introduced a novel approach to defining language similarity based on how they represent basic concepts, which is complementary to existing similarity measures. In this work, we study the conceptual similarity in detail and evaluate it extensively on a binary classification task.
- ACL 2023A Crosslingual Investigation of Conceptualization in 1335 LanguagesYihong Liu, Haotian Ye, Leonie Weissweiler, and 4 more authorsIn Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2023
Languages differ in how they divide up the world into concepts and words; e.g., in contrast to English, Swahili has a single concept for ‘belly’ and ‘womb’. We investigate these differences in conceptualization across 1,335 languages by aligning concepts in a parallel corpus. To this end, we propose Conceptualizer, a method that creates a bipartite directed alignment graph between source language concepts and sets of target language strings. In a detailed linguistic analysis across all languages for one concept (‘bird’) and an evaluation on gold standard data for 32 Swadesh concepts, we show that Conceptualizer has good alignment accuracy. We demonstrate the potential of research on conceptualization in NLP with two experiments. (1) We define crosslingual stability of a concept as the degree to which it has 1-1 correspondences across languages, and show that concreteness predicts stability. (2) We represent each language by its conceptualization pattern for 83 concepts, and define a similarity measure on these representations. The resulting measure for the conceptual similarity between two languages is complementary to standard genealogical, typological, and surface similarity measures. For four out of six language families, we can assign languages to their correct family based on conceptual similarity with accuracies between 54% and 87%
- IWSLT 2023On the Copying Problem of Unsupervised NMT: A Training Schedule with a Language Discriminator LossYihong Liu, Alexandra Chronopoulou, Hinrich Schütze, and 1 more authorIn Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), Jul 2023
Although unsupervised neural machine translation (UNMT) has achieved success in many language pairs, the copying problem, i.e., directly copying some parts of the input sentence as the translation, is common among distant language pairs, especially when low-resource languages are involved. We find this issue is closely related to an unexpected copying behavior during online back-translation (BT). In this work, we propose a simple but effective training schedule that incorporates a language discriminator loss. The loss imposes constraints on the intermediate translation so that the translation is in the desired language. By conducting extensive experiments on different language pairs, including similar and distant, high and low-resource languages, we find that our method alleviates the copying problem, thus improving the translation performance on low-resource languages.
2022
- ACL 2022Flow-Adapter Architecture for Unsupervised Machine TranslationYihong Liu, Haris Jabbar, and Hinrich SchuetzeIn Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022
In this work, we propose a flow-adapter architecture for unsupervised NMT. It leverages normalizing flows to explicitly model the distributions of sentence-level latent representations, which are subsequently used in conjunction with the attention mechanism for the translation task. The primary novelties of our model are: (a) capturing language-specific sentence representations separately for each language using normalizing flows and (b) using a simple transformation of these latent representations for translating from one language to another. This architecture allows for unsupervised training of each language independently. While there is prior work on latent variables for supervised MT, to the best of our knowledge, this is the first work that uses latent variables and normalizing flows for unsupervised MT. We obtain competitive results on several unsupervised MT benchmarks.
2021
- JournalA label-oriented loss function for learning sentence representationsYihong Liu, Wei Guan, Dongxu Lu, and 1 more authorComputer Speech & Language, May 2021
Neural network methods which leverage word-embedding obtained from unsupervised learning models have been widely adopted in many natural language processing (NLP) tasks, including sentiment analysis and sentence classification. Existing sentence representation generation approaches which serve for classification tasks generally rely on complex deep neural networks but relatively simple loss functions, such as cross entropy loss function. These approaches cannot produce satisfactory separable sentence representations because the usage of cross entropy may ignore the sentiment and semantic information of the labels. To extract useful information from labels for improving the distinguishability of the obtained sentence representations, this paper proposes a label-oriented loss function. The proposed loss function takes advantage of the word-embeddings of labels to guide the production of meaningful sentence representations which serve for downstream classification tasks. Compared with existing end-to-end approaches, the evaluation experiments on several datasets illustrate that using the proposed loss function can achieve competitive and even better classification results.