*Result*: Natural language processing methods for detecting and measuring the impact of scientific work beyond academia

Title:
Natural language processing methods for detecting and measuring the impact of scientific work beyond academia
Publisher Information:
University of Warwick, 2022.
Publication Year:
2022
Collection:
University of Warwick
Document Type:
*Dissertation/Thesis* Electronic Thesis or Dissertation
Language:
English
Accession Number:
edsble.887986
Database:
British Library EThOS

*Further Information*

*Scientific research has a profoundly important impact on our society and the environment. However, the multifaceted nature of this impact makes it particularly difficult to measure and, as shown in this thesis, it cannot be measured using traditional academic impact metrics that focus on counting citations and publications. Furthermore, existing societal and environmental impact metrics are either applicable to only one scientific discipline or geography, or are expensive processes run irregularly by government agencies. This thesis investigates natural language processing methods for identifying and measuring societal and environmental scientific impact and how such impact is reported in the news. A novel regression task and model are presented for identifying and quantifying this impact based on text extracted from scientific papers and the news articles that discuss them. This is enabled by developing methods for linking and comparing news articles with the academic papers that they discuss, whilst accounting for the structural and linguistic differences between the two types of document. Text encoding strategies for representation and comparison of long documents are also a focus of the thesis. A new cross-domain co-reference resolution task between news articles and scientific papers is introduced so that co-referring entities may be used as anchors for aligning the two types of document. Through comparisons of news article excerpts and sentences from corresponding scientific papers, it is shown that the discourse structure and argumentation in scientific papers are likely predictors of which information will be presented prominently in news articles. This work introduces several novel natural language task settings for which no pre-existing datasets are available.
This has necessitated the production of new human-annotated datasets, which were built using bespoke annotation tools that use semi-supervised learning to accelerate the labelling process and minimise the cognitive load of the task on the annotator. The thesis also makes use of low-resource approaches, including few-shot and multi-task learning, to facilitate the development of accurate models with small datasets. The resulting annotated datasets, annotation tools and guidelines, along with state-of-the-art machine learning models, are all made available as open assets. This thesis contributes new ways to measure the societal and environmental impact of scientific work and to help scientists and funding bodies understand how work is being used by others, justify the spending of public funding and inform better public engagement.*