Hippolyte Gisserot-Boukhlef

I am a second-year PhD student in artificial intelligence at CentraleSupélec, Université Paris-Saclay, conducting research in collaboration with the Artefact Research Center through a CIFRE partnership.

In today’s rapidly evolving NLP landscape, where generative models often fall short of addressing all challenges and remain highly resource-intensive, my research centers on the role of embeddings and effective representation learning. I investigate the entire pipeline, from pretraining strategies to downstream evaluation, with the aim of identifying the key factors that truly drive model performance.

Feel free to explore my website to learn more about my research, publications, and ongoing projects. Don’t hesitate to get in touch!

news

Jul 08, 2025	Our paper, EuroBERT: Scaling Multilingual Encoders for European Languages, has been accepted to COLM 2025.
Jul 02, 2025	We release Should We Still Pretrain Encoders with Masked Language Modeling?, a large-scale study comparing causal and bidirectional pretraining objectives for text representation learning.
Jun 23, 2025	EuroBERT is nominated for the 2025 Datacraft Awards in the AI and Society category.

selected publications

Should We Still Pretrain Encoders with Masked Language Modeling?

Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, and 5 more authors

2025

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 38 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models, reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts to foster further research.

EuroBERT: Scaling Multilingual Encoders for European Languages

Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, and 16 more authors

2025

General-purpose multilingual vector representations, used in retrieval, regression and classification, are traditionally obtained from bidirectional encoder models. Despite their wide applicability, encoders have been recently overshadowed by advances in generative decoder-only models. However, many innovations driving this progress are not inherently tied to decoders. In this paper, we revisit the development of multilingual encoders through the lens of these advances, and introduce EuroBERT, a family of multilingual encoders covering European and widely spoken global languages. Our models outperform existing alternatives across a diverse range of tasks, spanning multilingual capabilities, mathematics, and coding, and natively support sequences of up to 8,192 tokens. We also examine the design decisions behind EuroBERT, offering insights into our dataset composition and training pipeline. We publicly release the EuroBERT models, including intermediate training checkpoints, together with our training framework.