Publications

TREC-COVID: Rationale and Structure of an Information Retrieval Shared Task for COVID-19

Kirk Roberts, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, Kyle Lo, Ian Soboroff, Ellen Voorhees, Lucy Lu Wang, and William R. Hersh
JAMIA  2020

TL;DR: This article describes the rationale and structure of TREC-COVID, a still-ongoing IR evaluation that is creating a new paradigm for search evaluation in rapidly evolving crisis scenarios.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF

SciSight: Combining faceted navigation and research group detection for COVID-19 exploratory scientific search

Tom Hope, Jason Portenoy*, Kishore Vasan*, Jonathan Borchardt*, Eric Horvitz, Daniel S. Weld, Marti A. Hearst, and Jevin D. West
preprint  2020

TL;DR: SciSight is a novel framework for exploratory search of COVID-19 research that integrates two key capabilities: first, exploring interactions between biomedical facets (e.g., proteins, genes, drugs, diseases, patient characteristics); and second, discovering groups of researchers and how they are connected.

S2ORC: The Semantic Scholar Open Research Corpus

Kyle Lo, Lucy Lu Wang, Mark E. Neumann, Rodney Michael Kinney, and Daniel S. Weld
ACL  2020

TL;DR: We introduce S2ORC, a large contextual citation graph of English-language academic papers from multiple scientific domains; the corpus consists of 81.1M papers, 380.5M citation edges, and associated paper metadata.

Stolen Probability: A Structural Weakness of Neural Language Models

David Demeter, Gregory Kimmel, and Doug Downey
ACL  2020

TL;DR: We show that the softmax output common in neural language models leads to a limitation: some words (in particular, those with an embedding interior to the convex hull of the embedding space) can never be assigned high probability by the model, no matter what the context.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF
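
The convex-hull limitation is easy to check numerically. Below is a minimal sketch (ours, not the authors' code): if an embedding is a convex combination x_w = Σ λ_i x_i of the others, then x_w·h ≤ max_i x_i·h for every context vector h, so the interior word can never win the argmax of the dot-product logits.

```python
# Numeric check of the "stolen probability" claim: an embedding strictly
# inside the convex hull of the others never receives the highest
# dot-product logit, for any context vector h.
import numpy as np

rng = np.random.default_rng(0)
hull = rng.normal(size=(50, 16))        # embeddings of 50 "hull" words
weights = rng.dirichlet(np.ones(50))    # convex-combination weights
interior = weights @ hull               # an interior embedding

vocab = np.vstack([hull, interior])     # interior word gets index 50
for _ in range(10_000):
    h = rng.normal(size=16)             # arbitrary context vector
    assert (vocab @ h).argmax() != 50   # interior word never wins
print("interior word never received the highest logit")
```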

Language (Re)modelling: Towards Embodied Language Understanding

Ronen Tamari, Chen Shani, Tom Hope, Miriam R. L. Petruck, Omri Abend, and Dafna Shahaf
ACL  2020

TL;DR: We bring together ideas from cognitive science and AI/NLU, arguing that grounding by analogical inference and executable simulation will greatly benefit NLU systems. We propose a system architecture along with a roadmap towards realizing this vision.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF

Don't Stop Pretraining: Adapt Language Models to Domains and Tasks

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith
ACL  2020

TL;DR: We argue that textual domains comprise a spectrum of granularities, and show that continued pretraining along this spectrum, from broad domain corpora down to the task's own unlabeled data, improves the performance of language models on NLP tasks.

SUPP.AI: Finding Evidence for Supplement-Drug Interactions

Lucy Lu Wang, Oyvind Tafjord, Sarthak Jain, Arman Cohan, Sam Skjonsberg, Carissa Schoenick, Nick Botner, and Waleed Ammar
ACL Demo  2020

TL;DR: We extracted evidence of supplement-drug interactions from 22M scientific articles. Using transfer learning, we fine-tune the BERT language model on labeled evidence of drug-drug interactions, and use the resulting model to detect supplement interaction evidence. We surface these interactions on a demo website, SUPP.AI, and provide the dataset and model for use by other researchers.
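
A minimal sketch of the transfer step described above, using the Hugging Face transformers API; the base checkpoint, binary label scheme, and example sentence are our illustrative assumptions, not the authors' released pipeline.

```python
# Sketch: a fine-tunable BERT classifier for "does this sentence state an
# interaction?" (label 1). Training on DDI-labeled evidence is omitted.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def interaction_probability(sentence: str) -> float:
    """Probability that a sentence is interaction evidence."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(interaction_probability(
    "St. John's wort reduced plasma concentrations of the drug."))
```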

SciREX: A Challenge Dataset for Document-Level Information Extraction

Sarthak Jain, Madeleine van Zuylen, Hannaneh Hajishirzi, and Iz Beltagy
ACL  2020

TL;DR: We introduce SciREX, a new dataset for document-level information extraction whose entity and relation annotations usually span beyond single sentences or even sections, requiring understanding of the whole document.

High-Precision Extraction of Emerging Concepts from Scientific Literature

Daniel King, Doug Downey, and Daniel S. Weld
SIGIR  2020

TL;DR: A novel, unsupervised method for extracting scientific concepts from papers, based on the intuition that each scientific concept is likely to be introduced or popularized by a single paper that is disproportionately cited by subsequent papers mentioning the concept.
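
A toy rendering of that intuition (not the paper's implementation): score a candidate phrase by how concentrated the citations of the papers mentioning it are on a single earlier paper.

```python
# Hypothetical scorer: a phrase is likely a real emerging concept if most
# papers mentioning it cite one common "origin" paper.
from collections import Counter

def concept_score(mentioning_papers, citations):
    """mentioning_papers: ids of papers whose text mentions the phrase.
    citations: dict mapping paper id -> set of cited paper ids."""
    counts = Counter()
    for pid in mentioning_papers:
        counts.update(citations.get(pid, set()))
    if not counts:
        return 0.0, None
    origin, n = counts.most_common(1)[0]
    return n / len(mentioning_papers), origin  # near 1.0 => likely concept

citations = {"p2": {"p1"}, "p3": {"p1"}, "p4": {"p0", "p1"}}
print(concept_score(["p2", "p3", "p4"], citations))  # (1.0, 'p1')
```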

SPECTER: Document-level Representation Learning using Citation-informed Transformers

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S. Weld
ACL  2020

TL;DR: We propose a document representation model that incorporates inter-document context into pretrained language models.
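
The trained model is released as allenai/specter on the Hugging Face hub; a minimal embedding sketch, assuming the title-plus-abstract input format from the project README:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

papers = [{"title": "SPECTER", "abstract": "Document-level embeddings ..."}]
texts = [p["title"] + tokenizer.sep_token + p["abstract"] for p in papers]
inputs = tokenizer(texts, padding=True, truncation=True,
                   max_length=512, return_tensors="pt")
# The [CLS] vector serves as the document embedding.
embeddings = model(**inputs).last_hidden_state[:, 0, :]
print(embeddings.shape)  # (num_papers, hidden_size)
```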

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E. Peters, and Arman Cohan
preprint  2020

TL;DR: We introduce the Longformer, with an attention mechanism that scales linearly with sequence length, achieving state-of-the-art results on multiple character-level language modeling and document-level tasks.
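
A minimal usage sketch with the released allenai/longformer-base-4096 checkpoint; which tokens receive global attention is task-specific, and the single-token choice below is only illustrative.

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = " ".join(["a long document"] * 800)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# Sliding-window (local) attention everywhere; give the <s> token global
# attention so it can aggregate over the whole sequence.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```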

Building a Better Search Engine for Semantic Scholar

Sergey Feldman
blog  2020

TL;DR: 2020 is the year of search for Semantic Scholar, a free, AI-powered research tool for scientific literature from the Allen Institute for AI. One of our biggest endeavors this year is to improve the relevance of our search engine, and my mission since the start of the year has been to figure out how to use about three years of search log data to build a better search ranker.

Fact or Fiction: Verifying Scientific Claims

David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi
preprint  2020

TL;DR: We construct SciFact, a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts annotated with labels and rationales. We develop baseline models for SciFact, and demonstrate that these models benefit from combined training on a large dataset of claims about Wikipedia articles together with the new SciFact data.

TLDR: Extreme Summarization of Scientific Documents

Isabel Cachola, Kyle Lo, Arman Cohan, and Daniel S. Weld
preprint  2020

TL;DR: We introduce TLDR generation for scientific papers, a new automatic summarization task with high source compression, and provide a new dataset and models for effective TLDR generation.

Gender trends in CS authorship

Lucy Lu Wang, Gabriel Stanovsky, Luca Weihs, and Oren Etzioni
CACM  2020

TL;DR: An analysis of 2.87 million computer science papers reveals that, if current trends continue, parity between the number of male and female authors will not be reached in this century. Even with optimistic projection models, gender parity is not forecast to be reached in CS until 2100, whereas it is projected to be reached within two to three decades in the biomedical literature.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF

SLEDGE: A Simple Yet Effective Baseline for Coronavirus Scientific Knowledge Search

Sean MacAvaney, Arman Cohan, and Nazli Goharian
preprint  2020

TL;DR: We present SLEDGE, a search system that uses SciBERT to effectively re-rank articles related to SARS-CoV-2. SLEDGE achieves state-of-the-art results on round 1 of the TREC-COVID benchmark.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF

CORD-19: The COVID-19 Open Research Dataset

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, and 10 more...
ACL NLP-COVID Workshop  2020

TL;DR: The COVID-19 Open Research Dataset (CORD-19) is a growing resource of scientific papers on COVID-19 and related historical coronavirus research. CORD-19 is designed to facilitate the development of text mining and information retrieval systems over its rich collection of metadata and structured full text papers.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF

Explanation-Based Tuning of Opaque Machine Learners with Application to Paper Recommendation

Benjamin Charles Germain Lee, Kyle Lo, Doug Downey, and Daniel S. Weld
preprint  2020

TL;DR: We developed a general approach for actionable explanations, which you can try within our Semantic Sanity prototype. User studies of the approach have shown that it leads to higher perceived user control, trust, and satisfaction.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF

TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection

Ellen M. Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang
preprint  2020

TL;DR: TREC-COVID is a community evaluation designed to build a test collection that captures the information needs of biomedical researchers using the scientific literature during a pandemic.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF

Abductive Commonsense Reasoning

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Scott Wen-tau Yih, and Yejin Choi
ICLR  2020

TL;DR: We conceptualize the new task of Abductive NLI and introduce a challenge dataset, ART, which consists of over 20k commonsense narrative contexts and 200k explanations, formulated as multiple-choice questions for easy automatic evaluation.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF

Citation Text Generation

Kelvin Luu, Rik Koncel-Kedziorski, Kyle Lo, Isabel Cachola, and Noah A. Smith
preprint  2020

TL;DR: We introduce the task of citation text generation: given a pair of scientific documents, explain their relationship in natural language, in the manner of a citation from one document to the other.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF

Just Add Functions: A Neural-Symbolic Language Model

David Demeter and Doug Downey
AAAI  2020

TL;DR: We present Neural-Symbolic Language Models (NSLMs), which incorporate simple symbolic functions into a neural language model to better handle classes of words, such as numbers, that standard neural models capture poorly.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF

Pretrained Language Models for Sequential Sentence Classification

Arman Cohan, Iz Beltagy, Daniel King, Bhavana Dalvi, and Daniel S. Weld
EMNLP  2019

TL;DR: We present a model based on pretrained language models for classifying sentences in the context of surrounding sentences, achieving SOTA results on 4 datasets from 2 different domains. We also release a challenging dataset of 2K discourse facets in the CS domain.

SciBERT: A Pretrained Language Model for Scientific Text

Iz Beltagy, Kyle Lo, and Arman Cohan
EMNLP  2019

TL;DR: SciBERT is a BERT-based language model pretrained on a large corpus of scientific text.
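
The released checkpoint can be loaded directly from the Hugging Face hub; a minimal sketch (the example sentence is ours) for extracting contextual embeddings of scientific text:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

inputs = tokenizer("The patient was administered 5 mg of haloperidol.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # one contextual vector per token
```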

ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar
BioNLP  2019

TL;DR: We created a spaCy pipeline for biomedical and scientific text processing. The core models include dependency parsing, part-of-speech tagging, and named entity recognition models retrained on general biomedical text, plus custom tokenization. We also release four specialized named entity recognition models for more focused biomedical entity recognition. Additionally, we include optional components for abbreviation resolution, simple entity linking to UMLS, and sentence splitting.
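
A minimal usage sketch, assuming scispacy and the en_core_sci_sm model package are installed and a spaCy v3-style pipeline; the example sentence is ours:

```python
import spacy
from scispacy.abbreviation import AbbreviationDetector  # registers the factory

nlp = spacy.load("en_core_sci_sm")      # core scientific-text model
nlp.add_pipe("abbreviation_detector")   # optional component noted above

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an inherited disease.")
print([(ent.text, ent.label_) for ent in doc.ents])
for abrv in doc._.abbreviations:
    print(abrv, "->", abrv._.long_form)
```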

Combining Distant and Direct Supervision for Neural Relation Extraction

Iz Beltagy, Kyle Lo, and Waleed Ammar
NAACL  2019

TL;DR: We improve relation extraction models by combining distant supervision data with additional directly supervised data, which we use as supervision for the attention weights. We find that joint training on both types of supervision leads to a better model because it improves the model's ability to identify noisy sentences.

Structural Scaffolds for Citation Intent Classification in Scientific Publications

Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady
NAACL  2019

TL;DR: We propose a new scaffolding model for classifying citation intents, using two auxiliary tasks to handle low-resource training data. We additionally propose SciCite, a multi-domain dataset of citation intents.

GrapAL: Querying Semantic Scholar's Literature Graph

Christine Betts, Joanna L. Power, and Waleed Ammar
NAACL Demo  2019

TL;DR: We introduce GrapAL (Graph database of Academic Literature), a versatile tool for exploring and investigating scientific literature that satisfies a variety of use cases and information needs expressed by researchers.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF

China catching up to US in AI research

Field Cady and Oren Etzioni
blog  2019

TL;DR: We analyzed over two million academic papers, and found that China has already surpassed the US in published AI papers. If current trends continue, China is poised to overtake the US in the most-cited 50% of papers this year, in the most-cited 10% of papers next year, and in the most-cited 1% of papers by 2025.

Quantifying Sex Bias in Clinical Studies at Scale With Automated Data Extraction

Sergey Feldman, Waleed Ammar, Kyle Lo, Elly Trepman, Madeleine van Zuylen, and Oren Etzioni
JAMA  2019

TL;DR: We extracted counts of women and men from over 40k published clinical trial articles and found substantial underrepresentation of female participants in 7 of 11 disease categories, especially HIV/AIDS, chronic kidney diseases, and cardiovascular diseases.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF

Construction of the Literature Graph in Semantic Scholar

Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, et al.
NAACL  2018

TL;DR: This paper introduces the Semantic Scholar literature graph, consisting of more than 280M nodes, representing papers, authors, entities and various interactions between them. [acknowledgements: TAGME entity linker (https://tagme.d4science.org/)]

  • s2 View and cite on Semantic Scholar
  • PDF View PDF

Extracting Scientific Figures with Distantly Supervised Neural Networks

Noah Siegel, Nicholas Lourie, Russell Power, and Waleed Ammar
JCDL  2018

TL;DR: We induce high-quality training labels for the task of figure extraction in a large number of scientific documents, with no human intervention.

Content-Based Citation Recommendation

Chandra Bhagavatula, Sergey Feldman, Russell Power, and Waleed Ammar
NAACL  2018

TL;DR: We embed a given query document into a vector space, then use its nearest neighbors as candidates, and rerank the candidates using a discriminative model trained to distinguish between observed and unobserved citations.
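
A toy sketch of that two-stage pipeline (the encoder and reranker below are trivial stand-ins, not the paper's trained models):

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in document encoder: hashed bag-of-words, L2-normalized."""
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

corpus = {
    "p1": "neural methods for citation recommendation",
    "p2": "protein folding dynamics",
    "p3": "citation graphs and paper recommendation",
}
corpus_vecs = {pid: embed(t) for pid, t in corpus.items()}
query = embed("recommending citations with neural document embeddings")

# Stage 1: nearest-neighbor candidate selection in the embedding space.
candidates = sorted(corpus_vecs, key=lambda p: -(query @ corpus_vecs[p]))[:2]

# Stage 2: rerank candidates; a trained discriminative model would go here.
def rerank_score(pid: str) -> float:
    return float(query @ corpus_vecs[pid])

print(sorted(candidates, key=rerank_score, reverse=True))
```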

Ontology Alignment in the Biomedical Domain Using Entity Definitions and Context

Lucy Lu Wang, Chandra Bhagavatula, Mark Neumann, Kyle Lo, Christopher Wilhelm, and Waleed Ammar
BioNLP  2018

TL;DR: This ontology matcher can be used to generate alignments between entities in two biomedical ontologies. The matcher uses entity definitions and usage context retrieved from the Semantic Scholar corpus to assist in entity matching.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF

Does arXiv help increase citation counts?

Sergey Feldman, Kyle Lo, and Waleed Ammar
preprint  2018

TL;DR: We explore the degree to which papers prepublished on arXiv garner more citations, in an attempt to paint a sharper picture of fairness issues related to prepublishing. We observe that papers submitted to arXiv before acceptance have, on average, 65% more citations in the following year than papers submitted after acceptance, even after accounting for variables such as venue and author influence.

  • s2 View and cite on Semantic Scholar

A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications

Dongyeop Kang, Waleed Ammar, Bhavana Dalvi, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard Hovy, and Roy Schwartz
NAACL  2018

TL;DR: We present the first public dataset of scientific peer reviews available for research purposes, containing 14.7K paper drafts and the corresponding accept/reject decisions in top-tier venues.

Semi-supervised End-to-End Entity and Relation Extraction

Waleed Ammar, Matthew E. Peters, Chandra Bhagavatula, and Russell Power
SemEval  2017

TL;DR: Our submission to the SemEval 2017 Task 10 (ScienceIE) shared task placed 1st in end-to-end entity and relation extraction and 2nd in relation-only extraction. We find that pretraining forward and backward neural language models produces word representations that can drastically improve model performance. This finding led to the later development of ELMo contextualized embeddings.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF

Identifying Meaningful Citations

Marco Valenzuela, Vu Ha, and Oren Etzioni
AAAI Workshop  2015

TL;DR: We introduce the novel task of identifying important citations in scholarly literature, i.e., citations that indicate that the cited work is used or extended in the new effort. We believe this task is a crucial component in algorithms that detect and follow research topics and in methods that measure the quality of publications.

  • s2 View and cite on Semantic Scholar
  • PDF View PDF