For papers after August 2023, see Google Scholar.
Studying Large Language Model Generalization with Influence Functions Unpublished manuscript, 2023
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning Unpublished manuscript, 2023
Measuring Faithfulness in Chain-of-Thought Reasoning Unpublished manuscript, 2023
Inverse Scaling: When Bigger Isn't Better (data) Transactions on Machine Learning Research, 2023
ScoNe: Benchmarking Negation Reasoning in Language Models With Fine-Tuning and In-Context Learning (data) Proceedings of ACL, 2023
What Do NLP Researchers Believe? Results of the NLP Community Metasurvey (results viewer) Proceedings of ACL, 2023
Instruction Induction: From Few Examples to Natural Language Task Descriptions (code and data) Proceedings of ACL, 2023
Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs Unpublished manuscript, 2023
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting (code and data) Unpublished manuscript, 2023
Eight Things to Know about Large Language Models Unpublished manuscript, 2023
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (code and data) Transactions on Machine Learning Research, 2023
Improving Code Generation by Training with Natural Language Feedback (code and data) Unpublished manuscript, 2023
Pretraining Language Models with Human Preferences (code and data) Proceedings of ICML, 2023
The Capacity for Moral Self-Correction in Large Language Models Unpublished manuscript, 2023
(QA)²: Question Answering with Questionable Assumptions (data) Proceedings of ACL, 2023
Discovering Language Model Behaviors with Model-Written Evaluations (data, interactive viewer) Unpublished manuscript, 2022
What Artificial Neural Networks Can Tell Us about Human Language Acquisition In Shalom Lappin and Jean-Philippe Bernardy (Eds.), Algebraic Structures in Natural Language, 2022
Constitutional AI: Harmlessness from AI Feedback Unpublished manuscript, 2022
Measuring Progress on Scalable Oversight for Large Language Models (data) Unpublished manuscript, 2022
SQuALITY: Building a Long-Document Summarization Dataset the Hard Way (data and code) Proceedings of EMNLP, 2022
SocioProbe: What, When, and Where Language Models Learn about Sociodemographics (code) Proceedings of EMNLP, 2022
Two-Turn Debate Doesn't Help Humans Answer Hard Reading Comprehension Questions Proceedings of the NeurIPS ML Safety Workshop, 2022
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned Unpublished manuscript, 2022
Language Models (Mostly) Know What They Know Unpublished manuscript, 2022
QuALITY: Question Answering with Long Input Texts, Yes! (data and code) Proceedings of NAACL, 2022
The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail Proceedings of ACL, 2022
BBQ: A Hand-Built Bias Benchmark for Question Answering (data and code) Findings of ACL, 2022
What Makes Reading Comprehension Questions Difficult? (preprint) Proceedings of ACL, 2022
Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions Proceedings of The First Workshop on Learning with Natural Language Supervision, 2022
Clean or Annotate: How to Spend a Limited Data Collection Budget Proceedings of the Third Workshop on Deep Learning for Low-Resource NLP, 2022
Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair Unpublished manuscript, 2021
NOPE: A Corpus of Naturally-Occurring Presuppositions in English (corpus page) Proceedings of CoNLL, 2021
Does Putting a Linguist in the Loop Improve NLU Data Collection? (code and data) Findings of EMNLP, 2021
Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers Proceedings of BlackboxNLP, 2021
What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks? (code and data) Proceedings of ACL, 2021
Comparing Test Sets with Item Response Theory (code and data) Proceedings of ACL, 2021
When Do You Need Billions of Words of Pretraining Data? (code) Proceedings of ACL, 2021
What Will it Take to Fix Benchmarking in Natural Language Understanding? (slides for the 45-minute version of the talk) Proceedings of NAACL, 2021
Asking Crowdworkers to Write Entailment Examples: The Best of Bad Options (code and data) Proceedings of AACL, 2020
English Intermediate-Task Training Improves Zero-Shot Cross-Lingual Transfer Too (code) Proceedings of AACL, 2020
Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) (code and data) Proceedings of EMNLP, 2020
Precise Task Formalization Matters in Winograd Schema Evaluations (code) Proceedings of EMNLP (short paper), 2020
New Protocols and Negative Results for Textual Entailment Data Collection (code and data) Proceedings of EMNLP, 2020
CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models (code and data) Proceedings of EMNLP, 2020
Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data (code and data) Proceedings of the Workshop on Insights from Negative Results in NLP, 2020
Do self-supervised neural networks acquire a bias towards structural linguistic generalizations? (code and data) Proceedings of CogSci, 2020
Self-Training for Unsupervised Parsing with PRPN (code) Proceedings of IWPT, 2020
Intermediate-Task Transfer Learning with Pretrained Language Models: When and Why Does It Work? Proceedings of ACL, 2020
jiant: A Software Toolkit for Research on General-Purpose Text Understanding Models (project site) Proceedings of ACL (demonstration track), 2020
BLiMP: A Benchmark of Linguistic Minimal Pairs for English (project site) Transactions of the ACL (TACL), 2020
Learning to Learn Morphological Inflection for Resource-Poor Languages Proceedings of AAAI, 2020
Do Attention Heads in BERT Track Syntactic Dependencies? Unpublished manuscript, 2019
Inducing Constituency Trees through Neural Machine Translation Unpublished manuscript, 2019
Neural Unsupervised Parsing Beyond English Proceedings of The Workshop on Deep Learning for Low-Resource NLP (DeepLo), 2019
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (project site, baseline code) Proceedings of NeurIPS, 2019
Can Unconditional Language Models Recover Arbitrary Sentences? Proceedings of NeurIPS, 2019
Investigating BERT’s Knowledge of Language: Five Analysis Methods with NPIs (code and data) Proceedings of EMNLP, 2019
Towards Realistic Practices In Low-Resource Natural Language Processing: The Development Set Proceedings of EMNLP, 2019
Neural Network Acceptability Judgments (corpus page) Transactions of the ACL (TACL), 2019
Can You Tell Me How to Get Past Sesame Street: Sentence-Level Pretraining Beyond Language Modeling (code) Proceedings of ACL, 2019
Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark Proceedings of ACL, 2019
Probing What Different NLP Tasks Teach Machines about Function Word Comprehension (code) Proceedings of *SEM, 2019. Best Paper Award
Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks Unpublished manuscript, 2019
On Measuring Social Biases in Sentence Encoders Proceedings of NAACL, 2019
Identifying and Reducing Gender Bias in Word-Level Language Models Proceedings of the NAACL Student Research Workshop, 2019
Grammatical Analysis of Pretrained Sentence Encoders with Acceptability Judgments (data) Unpublished manuscript, 2019
Looking for ELMo's Friends: Sentence-Level Pretraining Beyond Language Modeling (code) Unpublished manuscript (superseded by Can You Tell Me How to Get Past Sesame Street, above), 2019
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding (project site) Proceedings of ICLR, 2019
What do you learn from context? Probing for sentence structure in contextualized word representations (code) Proceedings of ICLR, 2019
Language Modeling Teaches You More Syntax than Translation Does: Lessons Learned Through Auxiliary Task Analysis Unpublished manuscript, 2018
Verb Argument Structure Alternations in Word and Sentence Embeddings (corpus page) Proceedings of SCiL, 2018
A Stable and Effective Learning Strategy for Trainable Greedy Decoding Proceedings of EMNLP, 2018
XNLI: Cross-lingual Sentence Understanding through Inference (corpus page) Proceedings of EMNLP, 2018
Grammar Induction with Neural Language Models: An Unusual Replication (code) Proceedings of EMNLP (short paper), 2018
The Lifted Matrix-Space Model for Semantic Composition Proceedings of CoNLL, 2018
ListOps: A Diagnostic Dataset for Latent Tree Learning (code and data) Proceedings of the NAACL Student Research Workshop, 2018
Training a Ranking Function for Open-Domain Question Answering Proceedings of the NAACL Student Research Workshop, 2018
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference (corpus page) Proceedings of NAACL, 2018
Do latent tree learning models identify meaningful structure in sentences? (code) Transactions of the ACL (TACL), 2018
Annotation Artifacts in Natural Language Inference Data (data on the MultiNLI corpus page) Proceedings of NAACL (short paper), 2018
Ruminating Reader: Reasoning with Gated Multi-Hop Attention Proceedings of the Workshop on Machine Reading for Question Answering, 2018
The RepEval 2017 Shared Task: Multi-Genre Natural Language Inference with Sentence Representations Proceedings of RepEval 2017: The Second Workshop on Evaluating Vector Space Representations for NLP, 2017
Detecting and Explaining Crisis Proceedings of The 2017 Computational Linguistics and Clinical Psychology Workshop, 2017
Sequential Attention Proceedings of the 2nd Workshop on Representation Learning for NLP, 2017
Discourse-Based Objectives for Fast Unsupervised Sentence Representation Learning Unpublished manuscript, 2017
Modeling natural language semantics in learned representations Stanford University Dissertation, 2016
Generating Sentences from a Continuous Space Proceedings of CoNLL, 2016
A Fast Unified Model for Parsing and Sentence Understanding (code) Proceedings of ACL, 2016
Tree-structured composition in neural networks without tree-structured architectures Proceedings of the NIPS Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches, 2015
A large annotated corpus for learning natural language inference (corpus page) Proceedings of EMNLP, 2015. Best New Data Set or Resource Award
Recursive Neural Networks Can Learn Logical Semantics (code and data, poster) Proceedings of The 3rd Workshop on Continuous Vector Space Models and their Compositionality, 2015
Learning Distributed Word Representations for Natural Logic Reasoning Proceedings of the AAAI Spring Symposium on Knowledge Representation and Reasoning, 2015
A Gold Standard Dependency Corpus for English Proceedings of LREC, 2014
Can recursive neural tensor networks learn logical reasoning? (code and data) Unpublished manuscript, 2014
Idiosyncratic transparent vowels in Kazakh Proceedings of AMP, 2013. A typo in item (39) in the published version is corrected here.
More constructions, more genres: Extending Stanford Dependencies Proceedings of DepLing, 2013
Two arguments for vowel harmony by trigger competition Proceedings of CLS, 2013
Automatic animacy classification (poster) Proceedings of The NAACL Student Research Workshop, 2012
Vowel harmony, opacity, and finite-state OT Technical report TR-2011-03, Department of Computer Science, The University of Chicago, 2011
Speech recognition with segmental conditional random fields: A summary of the JHU CLSP 2010 Summer Workshop Proceedings of ICASSP, 2011
Modeling pronunciation variation with context-dependent articulatory feature decision trees Proceedings of Interspeech, 2010
An aside: My Erdős number is 4, by way of Karen Livescu, Kamalika Chaudhuri, and Fan Chung, by way of Chris Manning, Val Spitkovsky, and Daniel Kleitman, or by way of Victor O.K. Li, Kuang Xu, and Joel H. Spencer.
A few technical points to keep in mind when discussing technologies like ChatGPT Slides for the panel AI FUTURES with Critical AI, 2023
Why Adversarially-Collected Test Sets Don't Work as Benchmarks Slides for an Invited Talk at The First Workshop on Dynamic Adversarial Data Collection (DADC), 2022
Crowdsourcing Beyond Annotation: Case Studies in Benchmark Data Collection EMNLP Tutorial, 2021
How do we fix natural language understanding evaluation? Slides for a virtual invited talk at the CMU ML Department, 2020
Evaluating Recent Progress Toward General-Purpose Language Understanding Models Slides for a virtual invited talk at Google Research, 2020
Evaluating Recent Progress Toward General-Purpose Language Understanding Models (video) Invited talk slides for the Allen Institute for AI and the University of Washington, 2019
Task-Independent Language Understanding Invited talk slides for Cornell and IBM Research, 2019
Task-Independent Sentence Understanding Models *SEM/SemEval joint invited talk slides, 2019
Deep Learning for Natural Language Inference NAACL tutorial, 2019
A large annotated corpus of entailments and contradictions Talk slides from California Universities Semantics and Pragmatics, 2015
Computational Linguistics Guest lecture for an introductory linguistics class with Asya Pereltsvaig, 2015
Neural networks for natural language understanding Guest lecture for Chris Potts and Bill MacCartney's computational natural language understanding class, 2015
vector-entailment: A MATLAB toolkit for tree-structured recursive neural networks, 2015
Transparent vowels in ABC: open issues ABC↔Conference invited talk handout, 2014
Seto vowel harmony and neutral vowels Presentation at LSA, 2013
Measuring amok Course paper, 2012