S2ORC: The Semantic Scholar Open Research Corpus

Kyle Lo, Lucy Lu Wang, Mark E Neumann, Rodney Michael Kinney, and Daniel S. Weld
ACL  2020

Tl;DR: We introduce S2ORC, a large contextual citation graph of English-language academic papers from multiple scientific domains; the corpus consists of 81.1M papers, 380.5M citation edges, and associated paper metadata.
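
The released corpus is distributed as gzipped JSONL files. Below is a minimal sketch of streaming one metadata shard and counting citation edges; field names such as paper_id and outbound_citations are assumptions and should be checked against the schema of the S2ORC release you download.

    # Sketch: stream an S2ORC metadata shard and count citation edges.
    # Field names (paper_id, outbound_citations) are assumptions; verify them
    # against the schema of the release you download.
    import gzip
    import json

    def iter_papers(path):
        """Yield one metadata record per line of a gzipped JSONL shard."""
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

    edge_count = 0
    for paper in iter_papers("metadata_0.jsonl.gz"):
        edge_count += len(paper.get("outbound_citations") or [])
    print(f"citation edges in this shard: {edge_count}")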

Stolen Probability: A Structural Weakness of Neural Language Models

David Demeter, Gregory Kimmel, and Doug Downey
ACL  2020

Tl;DR: We show that the softmax output layer common in neural language models has a structural limitation: words whose embeddings lie in the interior of the convex hull of the embedding space can never be assigned the highest probability by the model, no matter the context.

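A tiny numerical illustration of the claim (a sketch, not the paper's construction): with a dot-product softmax, a word whose embedding lies inside the convex hull of the other embeddings can never attain the highest logit for any context vector.

    # Sketch: an embedding in the interior of the convex hull of the other
    # embeddings can never win the dot-product softmax argmax.
    import numpy as np

    rng = np.random.default_rng(0)
    hull_words = rng.normal(size=(50, 16))       # 50 "hull" word embeddings
    interior = hull_words.mean(axis=0)           # a convex combination => interior point
    W = np.vstack([hull_words, interior])        # last row is the interior word

    wins = 0
    for _ in range(10_000):
        h = rng.normal(size=16)                  # random context vector
        logits = W @ h                           # softmax is monotone in the logits
        wins += int(np.argmax(logits) == len(W) - 1)
    print(f"interior word had the highest logit {wins} / 10000 times")  # expect 0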

Language (Re)modelling: Towards Embodied Language Understanding

Ronen Tamari, Chen Shani, Tom Hope, Miriam R. L. Petruck, Omri Abend, and Dafna Shahaf
ACL  2020

Tl;DR: We bring together ideas from cognitive science and AI/NLU, arguing that grounding by analogical inference and executable simulation will greatly benefit NLU systems. We propose a system architecture along with a roadmap towards realizing this vision.

Don't Stop Pretraining: Adapt Language Models to Domains and Tasks

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith
ACL  2020

Tl;DR: We argue that textual domains comprise a spectrum of granularities, and show that continued pretraining along this spectrum, from broad domains down to a task's own unlabeled data, improves the performance of language models on downstream NLP tasks.
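
A minimal sketch of continued masked-LM pretraining on in-domain text with Hugging Face Transformers; the file name, base model, and hyperparameters are placeholders, not the paper's settings.

    # Sketch of domain-/task-adaptive continued pretraining (MLM) on a plain-text file.
    # "domain.txt" and all hyperparameters are placeholders.
    from datasets import load_dataset
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("roberta-base")

    dataset = load_dataset("text", data_files={"train": "domain.txt"})["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])

    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
    args = TrainingArguments(output_dir="adapted-roberta", num_train_epochs=1,
                             per_device_train_batch_size=8)
    Trainer(model=model, args=args, train_dataset=dataset,
            data_collator=collator).train()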

SPECTER: Document-level Representation Learning using Citation-informed Transformers

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S. Weld
ACL  2020

Tl;DR: We propose a document representation model that incorporates inter-document context into pretrained language models.
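
A sketch of embedding a paper with the released model on the Hugging Face hub, following the usual recipe of encoding title + [SEP] + abstract and taking the first token's representation (treat the pooling choice here as an assumption):

    # Sketch: embed a paper (title + abstract) with the released SPECTER weights.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
    model = AutoModel.from_pretrained("allenai/specter")

    title = "SPECTER: Document-level Representation Learning using Citation-informed Transformers"
    abstract = "..."  # the paper abstract goes here
    text = title + tokenizer.sep_token + abstract

    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    embedding = outputs.last_hidden_state[:, 0, :]   # first-token representation as the paper embedding
    print(embedding.shape)                           # (1, 768)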

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E. Peters, and Arman Cohan
preprint  2020

Tl;DR: We introduce the Longformer, with an attention mechanism that scales linearly with sequence length, achieving state-of-the-art results on multiple character-level language modeling and document-level tasks.
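
A sketch of encoding a long document with the released checkpoint via Hugging Face Transformers; the global-attention pattern shown (first token only) is just an illustration.

    # Sketch: encode a long document with Longformer, giving the first token global attention.
    import torch
    from transformers import LongformerModel, LongformerTokenizer

    tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
    model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

    text = " ".join(["A long document."] * 500)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1   # global attention on the start token

    outputs = model(**inputs, global_attention_mask=global_attention_mask)
    print(outputs.last_hidden_state.shape)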

Abductive Commonsense Reasoning

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Scott Wen-tau Yih, and Yejin Choi
ICLR  2020

Tl;DR: We conceptualize a new task of Abductive NLI and introduce a challenge dataset, ART, that consists of over 20k commonsense narrative contexts and 200k explanations, formulated as multiple choice questions for easy automatic evaluation.

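Each ART instance pairs two observations with two candidate explanations; a toy example of the multiple-choice format (the field names here are illustrative, not necessarily the released schema):

    # Toy instance in the abductive NLI multiple-choice format; field names are illustrative.
    instance = {
        "obs1": "Jenny left her laptop on the train.",
        "obs2": "A week later, the laptop arrived at her door in a package.",
        "hyp1": "Another passenger found it and mailed it back using the address on the case.",
        "hyp2": "Jenny bought a new laptop online.",
        "label": 1,   # index of the more plausible explanation (hyp1)
    }

    chosen = instance["hyp1"] if instance["label"] == 1 else instance["hyp2"]
    print("More plausible explanation:", chosen)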

Just Add Functions: A Neural-Symbolic Language Model

David Demeter and Doug Downey
AAAI  2020

Tl;DR: We introduce Neural-Symbolic Language Models (NSLMs), which combine a neural language model with simple symbolic functions encoding prior knowledge about classes of tokens (e.g., numerals), improving perplexity over purely neural baselines.

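A minimal sketch of the general mixture idea (not the paper's exact formulation): blend the neural LM's next-token distribution with a symbolic function's distribution via a learned gate.

    # Sketch of a neural-symbolic mixture over the vocabulary (general idea only):
    # p(w | h) = g(h) * p_neural(w | h) + (1 - g(h)) * p_symbolic(w | h)
    import torch
    import torch.nn as nn

    class MixtureHead(nn.Module):
        def __init__(self, hidden_size, vocab_size):
            super().__init__()
            self.neural = nn.Linear(hidden_size, vocab_size)
            self.gate = nn.Linear(hidden_size, 1)

        def forward(self, h, p_symbolic):
            """h: (batch, hidden); p_symbolic: (batch, vocab) from a symbolic function."""
            p_neural = torch.softmax(self.neural(h), dim=-1)
            g = torch.sigmoid(self.gate(h))              # how much to trust the neural LM
            return g * p_neural + (1.0 - g) * p_symbolic

    head = MixtureHead(hidden_size=32, vocab_size=100)
    h = torch.randn(4, 32)
    p_symbolic = torch.full((4, 100), 1.0 / 100)         # e.g., a uniform "symbolic" prior
    p = head(h, p_symbolic)
    print(p.sum(dim=-1))                                 # each row sums to 1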

Pretrained Language Models for Sequential Sentence Classification

Arman Cohan, Iz Beltagy, Daniel King, Bhavana Dalvi, and Daniel S. Weld
EMNLP  2019

Tl;DR: We present a model based on pretrained language models for classifying sentences in the context of surrounding sentences. It achieves SOTA results on 4 datasets from 2 different domains. We also release a challenging dataset of 2K discourse facets in the CS domain.
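
A sketch of the basic joint-encoding setup (an illustration, not the released code): concatenate an abstract's sentences with [SEP] tokens, encode them together, and classify each sentence from its [SEP] representation.

    # Sketch: encode all sentences of an abstract jointly and classify each
    # sentence from its [SEP] token representation.
    import torch.nn as nn
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    encoder = BertModel.from_pretrained("bert-base-uncased")
    classifier = nn.Linear(encoder.config.hidden_size, 5)   # e.g., 5 rhetorical roles

    sentences = ["We study X.", "We propose Y.", "Y outperforms Z."]
    inputs = tokenizer(" [SEP] ".join(sentences), return_tensors="pt")

    hidden = encoder(**inputs).last_hidden_state[0]          # (seq_len, hidden)
    sep_positions = (inputs["input_ids"][0] == tokenizer.sep_token_id).nonzero(as_tuple=True)[0]
    logits = classifier(hidden[sep_positions])               # one prediction per sentence
    print(logits.shape)                                      # (num_sentences, 5)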

SciBERT: A Pretrained Language Model for Scientific Text

Iz Beltagy, Kyle Lo, and Arman Cohan
EMNLP  2019

Tl;DR: SciBERT is a BERT-based language model pretrained on a large corpus of scientific publications, improving performance on a range of downstream scientific NLP tasks.
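
The released weights are a drop-in replacement for BERT; a minimal usage sketch with Hugging Face Transformers:

    # Sketch: load SciBERT and encode a sentence from a scientific abstract.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
    model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

    inputs = tokenizer("The corpus consists of full-text scientific papers.",
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)   # (1, seq_len, 768)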