Extracting a Knowledge Base of Mechanisms from COVID-19 Papers

Tom Hope*, Aida Amini*, David Wadden, Madeleine van Zuylen, Eric Horvitz, Roy Schwartz, and Hannaneh Hajishirzi
preprint  2020

Tl;DR: To navigate the collection of COVID-19 papers from different domains, we present a KB of mechanisms relating to COVID-19, to support domain-agnostic search and exploration of general activities, functions, influences and associations in these papers.

S2ORC: The Semantic Scholar Open Research Corpus

Kyle Lo, Lucy Lu Wang, Mark E Neumann, Rodney Michael Kinney, and Daniel S. Weld
ACL  2020

Tl;DR: We introduce S2ORC, a large contextual citation graph of English-language academic papers from multiple scientific domains; the corpus consists of 81.1M papers, 380.5M citation edges, and associated paper metadata.

SciREX: A Challenge Dataset for Document-Level Information Extraction

Sarthak Jain, Madeleine van Zuylen, Hanna Hajishirzi, and Iz Beltagy
ACL  2020

Tl;DR: We introduce a new dataset called SciREX that requires understanding of the whole document to annotate entities and their document-level relationships, which usually span beyond sentences or even sections.

SPECTER: Document-level Representation Learning using Citation-informed Transformers

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S. Weld
ACL  2020

Tl;DR: We propose a document representation model that incorporates inter-document context into pretrained language models.

Fact or Fiction: Verifying Scientific Claims

David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi
preprint  2020

Tl;DR: We construct SciFact, a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts annotated with labels and rationales. We develop baseline models for SciFact, and demonstrate that these models benefit from combined training on a large dataset of claims about Wikipedia articles, together with the new SciFact data.

CORD-19: The Covid-19 Open Research Dataset

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, and 10 more...
ACL, NLP-COVID workshop   2020

Tl;DR: The Covid-19 Open Research Dataset (CORD-19) is a growing resource of scientific papers on Covid-19 and related historical coronavirus research. CORD-19 is designed to facilitate the development of text mining and information retrieval systems over its rich collection of metadata and structured full text papers.

ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar
BioNLP  2019

Tl;DR: We created a spaCy pipeline for biomedical and scientific text processing. The core models include dependency parsing, part of speech tagging, and named entity recognition models retrained on general biomedical text, and custom tokenization. We also release four specific named entity recognition models for more focused biomedical entity recognition. Additionally, we include optional components for abbreviation resolution, simple entity linking to UMLS, and sentence splitting.

Structural Scaffolds for Citation Intent Classification in Scientific Publications

Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady
NAACL  2019

Tl;DR: We propose a new scaffolding model for classifying citation intents using two auxiliary tasks to handle low-resource training data. We additionally propose SciCite, a multi-domain dataset of citation intents.

GrapAL: Querying Semantic Scholar's Literature Graph

Christine Betts, Joanna L. Power, and Waleed Ammar
NAACL, Demo   2019

Tl;DR: We introduce GrapAL (Graph database of Academic Literature), a versatile tool for exploring and investigating scientific literature which satisfies a variety of use cases and information needs requested by researchers.

A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications

Dongyeop Kang, Waleed Ammar, Bhavana Dalvi, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard Hovy, and Roy Schwartz
NAACL  2018

Tl;DR: We present the first public dataset of scientific peer reviews available for research purposes, containing 14.7K paper drafts and the corresponding accept/reject decisions in top-tier venues.