Transforming Scholarly Landscapes: Influence of Large Language Models on Academic Fields Beyond Computer Science


This repository contains the code to obtain and preprocess the dataset introduced in the paper "Transforming Scholarly Landscapes: Influence of Large Language Models on Academic Fields beyond Computer Science". The code is released under the Apache 2.0 license.

The code in this repository extracts data and metadata from the Semantic Scholar corpus (this requires a Semantic Scholar API key, see below), preprocesses it, and saves it locally in .jsonl format, which is compatible with common visualization tools such as Tableau. Although this data supports broad analyses of scholarly documents, we use it to investigate the increasing application of Large Language Models (LLMs) across diverse fields outside computer science.
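In the .jsonl format, each line of a file is one standalone JSON object, so the data can be streamed record by record instead of being loaded all at once. A minimal sketch of reading such a file is given here; the file name `papers.jsonl` and the `title` field are assumptions for illustration, not names confirmed by this repository.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one record per non-empty line of a .jsonl file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical file name; substitute whatever the fetch step produced.
    sample = Path("papers.jsonl")
    if sample.exists():
        for record in read_jsonl(sample):
            print(record.get("title"))  # "title" is an assumed field name
```

Because `read_jsonl` is a generator, only one record is held in memory at a time, which matters for corpus-sized files.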

Contact person: Aniket Pramanick

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

Abstract

Large Language Models (LLMs) have ushered in a transformative era in Natural Language Processing (NLP), reshaping research and extending NLP's influence to other fields of study. However, there is little to no work examining the degree to which LLMs influence other research fields. This work empirically and systematically examines the influence and use of LLMs in fields beyond NLP. We curate $106$ LLMs and analyze $\sim$ 148k papers citing LLMs to quantify their influence and reveal trends in their usage patterns. Our analysis reveals not only the increasing prevalence of LLMs in non-CS fields but also the disparities in their usage, with some fields utilizing them more frequently than others since 2018, notably Linguistics and Engineering together accounting for $\sim$ 45% of LLM citations. Our findings further indicate that most of these fields predominantly employ task-agnostic LLMs, proficient in zero or few-shot learning without requiring further fine-tuning, to address their domain-specific problems. This study sheds light on the cross-disciplinary impact of NLP through LLMs, providing a better understanding of the opportunities and challenges.

Getting Started

Follow the instructions below to create the Python environment to fetch the data.

$ conda create -n llmtrends pip python=3.9 
$ conda activate llmtrends
$ pip install -r requirements.txt

Usage - Fetch Data

To analyze the dataset, first collect it with the commands below. You will need a Semantic Scholar API key; replace YOUR API KEY in each command with it.

  1. To fetch the entire Semantic Scholar corpus, use:
python -c "from code.get_s2roc import *; s2_api_key = 'YOUR API KEY'; get_s2roc_files()"

  2. To fetch only the extracted abstracts of the scholarly documents, use:
python -c "from code.get_s2roc import *; s2_api_key = 'YOUR API KEY'; get_abstracts()"

  3. To fetch the research papers, use:
python -c "from code.get_s2roc import *; s2_api_key = 'YOUR API KEY'; get_papers()"

  4. To fetch only the names of the authors of the scholarly documents, use:
python -c "from code.get_s2roc import *; s2_api_key = 'YOUR API KEY'; get_authors()"

  5. To fetch the citation graph, use:
python -c "from code.get_s2roc import *; s2_api_key = 'YOUR API KEY'; get_citations()"

  6. To fetch the publication venues of the research papers, use:
python -c "from code.get_s2roc import *; s2_api_key = 'YOUR API KEY'; get_publication_venues()"
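Typing the key directly into a shell command leaves it in the shell history. One alternative, sketched here, is to read the key from an environment variable before calling the fetch functions. The variable name `S2_API_KEY` and the helper `load_api_key` are assumptions introduced for this sketch and are not part of this repository; the commented-out lines only mirror the command pattern above.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None,
                 name: str = "S2_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    `S2_API_KEY` is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return key


if __name__ == "__main__":
    try:
        s2_api_key = load_api_key()
    except RuntimeError as err:
        print(err)
    else:
        # Mirrors the commands above; run from the repository root.
        # from code.get_s2roc import *
        # get_abstracts()
        print("API key loaded.")
```

Failing early with a clear message avoids starting a long download with an empty key.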

Usage - Analyze Data

We used Tableau for Students to analyze the data and create all plots, but any other visualization software can be used as well.
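For a quick look at the data without a visualization tool, records can also be tallied directly in Python. The sketch below counts records per year and field of study; the field names `year` and `field` and the sample records are invented for illustration and do not reflect the actual schema or results.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_year_and_field(records: Iterable[Mapping]) -> Counter:
    """Count records per (year, field) pair, skipping incomplete ones."""
    counts: Counter = Counter()
    for record in records:
        year, field = record.get("year"), record.get("field")
        if year is not None and field:
            counts[(year, field)] += 1
    return counts


if __name__ == "__main__":
    # Invented sample records, not real data from the corpus.
    sample = [
        {"year": 2023, "field": "Linguistics"},
        {"year": 2023, "field": "Linguistics"},
        {"year": 2022, "field": "Engineering"},
        {"year": None, "field": "Medicine"},
    ]
    for (year, field), n in sorted(count_by_year_and_field(sample).items()):
        print(year, field, n)
```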

Citation

If you use this code in your work, please cite our paper as follows:

@article{pramanick2024llmtrends,
  title={Transforming Scholarly Landscapes: Influence of Large Language Models on Academic Fields beyond Computer Science},
  author={Pramanick, Aniket and Hou, Yufang and Mohammad, Saif and Gurevych, Iryna},
  journal={arXiv preprint arXiv:2409.19508},
  year={2024},
  url={https://arxiv.org/abs/2409.19508}
}

Disclaimer

This repository contains experimental software and is published for the sole purpose of giving additional background details on the associated paper.