XNLI

The Cross-Lingual NLI Corpus (XNLI)

Alexis Conneau
Guillaume Lample
Ruty Rinott
Holger Schwenk
Ves Stoyanov
Facebook AI

Adina Williams
Sam Bowman
NYU

Introduction

The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment labels and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. Together with the English originals, this results in 112.5k annotated pairs across 15 languages. Each premise can be paired with the corresponding hypothesis in any of the 15 languages, giving more than 1.5M combinations. The corpus is designed to evaluate how well a system performs inference in any language (including low-resource ones like Swahili or Urdu) when only English NLI data is available at training time. One solution is cross-lingual sentence encoding, for which XNLI serves as an evaluation benchmark.
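
The totals quoted above follow from simple arithmetic. The short sketch below reproduces them; it assumes that a "combination" means a premise in one of the 15 languages paired with its hypothesis in any of the 15 languages, which is one reading of the paragraph rather than a definition taken from the corpus documentation.

```python
# Counts stated in the introduction.
TEST_PAIRS = 5_000
DEV_PAIRS = 2_500
LANGUAGES = 15  # English plus the 14 translation languages

# Pairs available in each single language.
pairs_per_language = TEST_PAIRS + DEV_PAIRS

# Annotated pairs summed over all languages.
annotated_pairs = pairs_per_language * LANGUAGES

# Assumption: every premise language can be combined with every
# hypothesis language for the same underlying pair.
cross_lingual_combinations = pairs_per_language * LANGUAGES * LANGUAGES

print(pairs_per_language)          # 7500
print(annotated_pairs)             # 112500
print(cross_lingual_combinations)  # 1687500
```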

Examples

Language	Genre	Premise	Label	Hypothesis
English	Face-to-face conversation	There's so much you could talk about on that I'll just skip that.	contradictory	I want to tell you everything I know about that!
French	Letters	Cet investissement a permis la rénovation et la vente de 60 maisons à des acheteurs modestes et la réhabilitation de plus de 100 appartements abordables et de grande qualité.	entailment	Les appartements étaient des dépotoirs et personne ne les a réparés.
Greek	Telephone Speech	Το κορίτσι που μπορεί να με βοηθήσει είναι στον δρόμο προς την πόλη.	neutral	Η κοπέλα που θα με βοηθήσει είναι 5 μίλια μακριά.
Bulgarian	9/11 Report	При измерване на ефективността, съвършенството е недостижимо.	entailment	Можете да бъдете перфектни, ако се опитате достатъчно.
Urdu	Fiction	دھکےلو، کپتان، اور انہیں ایک کشتی بھیجنے کا اشارہ کرو اور ان کو یقین دلاو کہ مس یہاں ہے۔	contradiction	کشتی کو بلانے کی کوئی ضرورت نہ تھی کیوں کہ مس کبھی آئی ہی نہیں

Download

XNLI is distributed in a single ZIP file containing the corpus as both JSON lines (jsonl) and tab-separated text (txt). The English training data can be found on the MultiNLI website.

Download: XNLI 1.0 (17MB, ZIP)
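
As a rough illustration of working with the JSON lines format, the sketch below counts gold labels per language. The field names `language` and `gold_label` are assumptions about the record layout and should be checked against the files in the ZIP; the sample records are made up for the example and are not taken from the corpus.

```python
import json
from collections import Counter, defaultdict


def label_counts_by_language(lines):
    """Count gold labels per language from an iterable of JSON-lines strings."""
    counts = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumed field names; verify them against the actual files.
        counts[record["language"]][record["gold_label"]] += 1
    return {lang: dict(c) for lang, c in counts.items()}


# Tiny in-memory sample standing in for a real file (invented records).
sample = [
    '{"language": "en", "gold_label": "neutral", "sentence1": "A.", "sentence2": "B."}',
    '{"language": "en", "gold_label": "entailment", "sentence1": "C.", "sentence2": "D."}',
    '{"language": "fr", "gold_label": "neutral", "sentence1": "E.", "sentence2": "F."}',
]
print(label_counts_by_language(sample))
```

With an extracted file, the same function can be given an open file handle, for example `with open(path, encoding="utf-8") as f: label_counts_by_language(f)`, where `path` is wherever the jsonl file was unpacked.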

Data description paper and citation

A description of the data is available as a PDF or inside the corpus ZIP package. If you use the corpus in an academic paper, please cite us:

@InProceedings{conneau2018xnli,
  author = "Conneau, Alexis
                 and Rinott, Ruty
                 and Lample, Guillaume
                 and Williams, Adina
                 and Bowman, Samuel R.
                 and Schwenk, Holger
                 and Stoyanov, Veselin",
  title = "XNLI: Evaluating Cross-lingual Sentence Representations",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods 
               in Natural Language Processing",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  location = "Brussels, Belgium",
}

Baselines and Code

The XNLI paper presents several baselines for language adaptation.

Code will soon be made available. We also release the machine-translated data for reproducing the TRANSLATE-TRAIN and TRANSLATE-TEST baselines:

Download: XNLI-MT 1.0 (445MB, ZIP)
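
Since the benchmark is about inference in each target language, results are naturally reported per language. The sketch below shows one way to compute per-language accuracy for any classifier. The `predict` callable, the dictionary keys (`language`, `premise`, `hypothesis`, `label`), and the toy examples are all hypothetical and are not part of the released code or data.

```python
from collections import defaultdict


def accuracy_by_language(examples, predict):
    """Return {language: accuracy} for a predict(premise, hypothesis) callable."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        lang = ex["language"]
        total[lang] += 1
        if predict(ex["premise"], ex["hypothesis"]) == ex["label"]:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def always_neutral(premise, hypothesis):
    """Placeholder classifier used only to exercise the function."""
    return "neutral"


# Invented examples; a real evaluation would use the XNLI dev or test pairs.
toy_examples = [
    {"language": "en", "premise": "A.", "hypothesis": "B.", "label": "neutral"},
    {"language": "en", "premise": "C.", "hypothesis": "D.", "label": "entailment"},
    {"language": "fr", "premise": "E.", "hypothesis": "F.", "label": "neutral"},
]
scores = accuracy_by_language(toy_examples, always_neutral)
print(scores)
```

Keeping the classifier behind a single callable means the same loop can score a model trained on English only, one trained on translated training data, or an English model applied to translated test data, without changing the evaluation code.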

License

See details in the XNLI paper.

Acknowledgments

This project has benefited from financial support to Samuel R. Bowman by Google, Tencent Holdings, and Samsung Research.