The Multi-Genre NLI Corpus
Adina Williams
Nikita Nangia
Sam Bowman
NYU
Introduction
The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.
Examples
Premise | Label | Hypothesis
Fiction
The Old One always comforted Ca'daan, except today. | neutral | Ca'daan knew the Old One very well.
Letters
Your gift is appreciated by each and every student who will benefit from your generosity. | neutral | Hundreds of students will benefit from your generosity.
Telephone Speech
yes now you know if if everybody like in August when everybody's on vacation or something we can dress a little more casual or | contradiction | August is a black out month for vacations in the company.
9/11 Report
At the other end of Pennsylvania Avenue, people began to line up for a White House tour. | entailment | People formed a line at the end of Pennsylvania Avenue.
Download
MultiNLI is distributed in a single ZIP file containing the corpus as both JSON lines (jsonl) and tab-separated text (txt).
Download: MultiNLI 1.0 (227MB, ZIP)
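For reference, here is a minimal Python sketch of reading the jsonl distribution. It assumes the standard file layout of the 1.0 release and the corpus field names (gold_label, sentence1, sentence2, genre), and skips pairs without an annotator consensus label (marked "-"); treat the exact paths as an assumption to check against the downloaded archive.

import json

def load_nli_jsonl(path):
    """Read one MultiNLI .jsonl file into a list of example dicts."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            # Pairs without annotator consensus carry a '-' gold label; skip them.
            if ex.get("gold_label") == "-":
                continue
            examples.append(ex)
    return examples

train = load_nli_jsonl("multinli_1.0/multinli_1.0_train.jsonl")
print(len(train), train[0]["genre"], train[0]["gold_label"])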
Previous versions
MultiNLI 0.9 differs from MultiNLI 1.0 only in the pairID and promptID fields in the training and development sets (and the attached paper), so results achieved on version 0.9 are still valid on 1.0. Version 0.9 can be downloaded here.
The Stanford NLI Corpus (SNLI)
MultiNLI is modeled after SNLI. The two corpora are distributed in the same formats, and for many applications, it may be productive to treat them as a single, larger corpus. You can find out more about SNLI here and download it from an NYU mirror here.
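Since the two corpora share the same jsonl schema, combining them is just concatenation; a small self-contained sketch (file names assume the standard 1.0 releases of each corpus):

import json

def read_jsonl(path):
    # Minimal jsonl reader; both corpora use the same field names
    # (sentence1, sentence2, gold_label, ...).
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

combined_train = (read_jsonl("snli_1.0/snli_1.0_train.jsonl")
                  + read_jsonl("multinli_1.0/multinli_1.0_train.jsonl"))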
Data description paper and citation
A description of the data can be found here (PDF) or in the corpus package zip. If you use the corpus in an academic paper, please cite us:
@InProceedings{N18-1101,
  author    = "Williams, Adina and Nangia, Nikita and Bowman, Samuel",
  title     = "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference",
  booktitle = "Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)",
  year      = "2018",
  publisher = "Association for Computational Linguistics",
  pages     = "1112--1122",
  location  = "New Orleans, Louisiana",
  url       = "http://aclweb.org/anthology/N18-1101"
}
Baselines, code, and analysis
The data description paper presents the following baselines:
Model | Matched Test Acc. | Mismatched Test Acc.
Most Frequent Class | 36.5% | 35.6% |
CBOW | 65.2% | 64.6% |
BiLSTM | 67.5% | 67.1% |
ESIM | 72.4% | 71.9% |
Note that ESIM relies on attention between the two sentences, so it would have been ineligible for inclusion in the RepEval competition. The three neural models are trained on a mix of MultiNLI and SNLI and use GloVe word vectors. Code (TensorFlow/Python) is available here, along with a script to reproduce the categories used in the paper's error analysis.
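The released baseline code is the reference implementation; the block below is only a rough Keras sketch of a CBOW-style pair classifier in the same spirit (a shared bag-of-words sentence encoder, concatenation with difference and product features, and a softmax over the three labels). The vocabulary size, hidden size, and optimizer are illustrative assumptions, and in practice the embedding layer would be initialized with GloVe vectors as in the paper.

import tensorflow as tf

NUM_WORDS = 20000   # vocabulary size (illustrative assumption)
EMB_DIM = 300       # GloVe dimensionality used by the paper's baselines
NUM_CLASSES = 3     # entailment / neutral / contradiction

def cbow_encoder():
    """Encode a sentence as the average of its word embeddings."""
    tokens = tf.keras.Input(shape=(None,), dtype="int32")
    emb = tf.keras.layers.Embedding(NUM_WORDS, EMB_DIM, mask_zero=True)(tokens)
    avg = tf.keras.layers.GlobalAveragePooling1D()(emb)
    return tf.keras.Model(tokens, avg)

premise = tf.keras.Input(shape=(None,), dtype="int32")
hypothesis = tf.keras.Input(shape=(None,), dtype="int32")
encoder = cbow_encoder()  # one encoder shared by both sentences
p_vec, h_vec = encoder(premise), encoder(hypothesis)

# Common sentence-pair feature combination: concatenation, difference, product.
features = tf.keras.layers.Concatenate()(
    [p_vec, h_vec,
     tf.keras.layers.Subtract()([p_vec, h_vec]),
     tf.keras.layers.Multiply()([p_vec, h_vec])])
hidden = tf.keras.layers.Dense(300, activation="relu")(features)
probs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(hidden)

model = tf.keras.Model([premise, hypothesis], probs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])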
Additional analysis-oriented datasets are available as part of GLUE and here.
Test set and leaderboard
To evaluate your system on the full test set, use the following Kaggle in Class competitions. You do not need to submit code to evaluate your model, and you may evaluate under a pseudonym, but you are expected to post a brief description of your model in the competition discussion board.
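As a sketch, predictions can be written out as a two-column CSV keyed by pairID; the header names used here are an assumption and should be checked against the competition's sample submission file.

import csv

def write_kaggle_submission(pair_ids, predicted_labels, path):
    """Write predictions as a pairID,gold_label CSV (assumed submission format)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["pairID", "gold_label"])
        for pid, label in zip(pair_ids, predicted_labels):
            writer.writerow([pid, label])

# Example: write_kaggle_submission(["0001m"], ["entailment"], "submission.csv")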
These competitions will be open indefinitely. Evaluations on a subset of the test set had previously been conducted with different leaderboards through the RepEval 2017 Workshop.
Researchers interested in multi-task learning and general-purpose representation learning can also access the test set through a separate leaderboard on the GLUE platform.
The best result (state of the art) that we've seen written up in a paper is 82.1/81.4 (matched/mismatched test accuracy) from Radford et al. 2018.
Related Resources
The XNLI corpus provides additional development and test sets for MultiNLI in fifteen languages.
Evaluations on the hard subset of the test set used in Gururangan et al. '18 are available separately (matched/mismatched).
License
See details in the data description paper.
Thanks
This work was made possible by a Google Faculty Research Award. We also thank George Dahl, the organizers of the RepEval 2016 and RepEval 2017 workshops, Andrew Drozdov, Angeliki Lazaridou, and our other NYU colleagues for help and advice.