Scientific software citation intent classification using large language models

15 October 2024

Software plays a crucial role in research across all disciplines, from data processing to simulation and visualisation. It has consequently become increasingly common to mention or cite software in research publications, either through a formal citation, a link to the software, or simply the name of the software.

At the Mapping the Impact of Research Software in Science hackathon hosted by the Chan Zuckerberg Initiative, we looked at recognising, analysing and visualising the use of software in research. One of the projects concentrated on understanding the intent behind mentioning software. The results were recently presented at NSLP 2024 and published in a paper.

"Why did you mention this?"

In recent years, investigations into research software mentions have collected a broad range of data and revealed a lot about the “who” and “what” of software mentions – who cites software, what software is cited in which papers, and who maintains that software. The “why”, on the other hand, isn’t as well understood.

As with many big questions around software mentions, finding out why a researcher included a software name in their paper is quite challenging. Take for example the mention of “pandas” in a publication. Did the authors analyse the efficiency of the Python pandas library? Did they summarise a related publication that used pandas to analyse a large dataset, or did they use it themselves for that purpose? Did they rewrite parts of the pandas library? Or did they not work with software at all, and instead work out what pandas like to eat for lunch (bamboo, probably)?

Finding out why a piece of software is mentioned in a publication allows us to better understand how software is developed and used by the research community. For example, in my research into RSE repositories, one of the main challenges was to identify software developed specifically for research. In the end, we had to manually go through research publications and look for researchers describing how they had developed a piece of software for the research project. This manual labour limited the size of the resulting dataset. If there were an automated way of classifying the intent behind a software citation, it would be possible to analyse much larger volumes of data about scientific software.

Functional intent classes

As illustrated in the pandas example above, there are a variety of reasons for citing software in a research paper. Assuming that we already know that the word in question refers to software (and not a panda bear), we can roughly summarise the intent into three categories:

Creation: The paper describes or acknowledges the creation of research software.
Usage: The paper describes the use of research software in any part of the research procedure, for any purpose.
Related: The paper describes the research software for any other reason.
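As a minimal sketch, the three classes above could be represented in code as a small enumeration attached to each software mention. The names Intent and LabelledMention, and the example sentence, are hypothetical and made up for illustration; they are not taken from the paper or its dataset.

```python
from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    """The three functional intent classes described above."""
    CREATION = "creation"  # the paper describes or acknowledges creating the software
    USAGE = "usage"        # the paper describes using the software in the research
    RELATED = "related"    # the software is mentioned for any other reason


@dataclass(frozen=True)
class LabelledMention:
    """One software mention, its surrounding sentence, and an intent label."""
    software: str
    sentence: str
    intent: Intent


# Invented example record; the label is a human reading of the sentence.
example = LabelledMention(
    software="pandas",
    sentence="We used pandas to analyse the dataset.",
    intent=Intent.USAGE,
)
print(example.intent.value)
```

Keeping the label as an enumeration rather than a free-text string means a typo such as "useage" fails loudly instead of silently creating a fourth class.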

Large language models to the rescue

The key to understanding the intent of any word in a text is context. From the surrounding sentence, we can usually tell whether the software was created or used for research. Consider for example the following sentence, taken from Steketee et al.^[1]:

Statistical analyses were carried out using Graphpad Prism, SPSS, Microsoft Excel and R.

With the context, it’s easy to determine that the authors didn’t create any of these tools but used them for the research they are presenting.

Large language models (LLMs) have proven effective at contextualising language, for example when summarising or analysing large volumes of text. As such, they are a promising tool for determining the intent behind software citations in research publications. During the hackathon, we explored multiple such models, including BERT-based models, GPT-3.5 and GPT-4, using fine-tuning as well as zero-shot and few-shot learning. The results suggest that through careful fine-tuning, LLMs can achieve an accuracy of over 80% for software citation intent classification. Further research will show how these kinds of models can be used effectively to enrich existing datasets about research software mentions with intent information, making it possible to ask even bigger questions about how we write and reuse software in the research community.
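To give a rough idea of what zero-shot and few-shot classification could look like in practice, here is a minimal sketch that builds a prompt and parses a model's reply. It is not the setup used in the paper: the prompt wording, the function names build_prompt and parse_label, and the three few-shot examples (with made-up software names) are all assumptions for illustration. The sketch deliberately stops short of calling any model API.

```python
from typing import Optional, Sequence, Tuple

LABELS = ("creation", "usage", "related")

Example = Tuple[str, str, str]  # (sentence, software name, label)

# Hypothetical few-shot examples with invented software names.
FEW_SHOT_EXAMPLES: Sequence[Example] = (
    ("We developed FooSim to model the process.", "FooSim", "creation"),
    ("Plots were generated with BarPlot.", "BarPlot", "usage"),
    ("BazTool has been proposed for similar tasks.", "BazTool", "related"),
)


def build_prompt(
    sentence: str,
    software: str,
    examples: Sequence[Example] = FEW_SHOT_EXAMPLES,
) -> str:
    """Build a classification prompt; pass examples=() for zero-shot."""
    lines = [
        "Classify why the software is mentioned in the sentence.",
        "Answer with one word: " + ", ".join(LABELS) + ".",
        "",
    ]
    for ex_sentence, ex_software, ex_label in examples:
        lines += [
            f"Sentence: {ex_sentence}",
            f"Software: {ex_software}",
            f"Intent: {ex_label}",
            "",
        ]
    # The final block is left open so the model completes the label.
    lines += [f"Sentence: {sentence}", f"Software: {software}", "Intent:"]
    return "\n".join(lines)


def parse_label(reply: str) -> Optional[str]:
    """Return the label if the reply starts with one, otherwise None."""
    words = reply.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!\"'")
    return first if first in LABELS else None


prompt = build_prompt(
    "Statistical analyses were carried out using Graphpad Prism, "
    "SPSS, Microsoft Excel and R.",
    "SPSS",
)
print(prompt)
```

The prompt would be sent to whichever model is being evaluated, and parse_label applied to its reply. Returning None for an unrecognised reply, rather than guessing a class, keeps malformed answers visible when measuring accuracy.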

Scientific Software Citation Intent Classification Using Large Language Models:
https://link.springer.com/chapter/10.1007/978-3-031-65794-8_6

References

[1] Steketee, P. C., Vincent, I. M., Achcar, F., Giordani, F., Kim, D. H., Creek, D. J., & Barrett, M. P. (2018). Benzoxaborole treatment perturbs S-adenosyl-L-methionine metabolism in Trypanosoma brucei. PLoS Neglected Tropical Diseases, 12(5), e0006450.

Preview image: Jamillah Knowles & We and AI / Better Images of AI / People and Ivory Tower AI 2 / CC-BY 4.0.

Author

Ms Kara Moraw

k.moraw@epcc.ed.ac.uk
