Software contributions to academic research are relatively invisible, especially to the formalized scholarly reputation system based on bibliometrics. In this article, we introduce a gold-standard dataset of software mentions from the manual annotation of 4,971 academic PDFs in biomedicine and economics. The dataset is intended to be used for the automatic extraction of software mentions from research publications in PDF format via supervised learning at scale. We provide a description of the dataset and an extended discussion of its creation process, including improved text conversion of academic PDFs. Finally, we reflect on our challenges and lessons learned during the dataset creation, in the hope of encouraging more discussion about creating datasets for machine learning use.
Ultimately we hope that CiteAs will increase the visibility of research software, improving incentives for the software work needed to advance research.
CCS CONCEPTS • Human-centered computing → Collaborative and social computing systems and tools.
In this paper, we investigate progress toward improved software citation by examining current software citation practices. We first introduce our machine-learning-based data pipeline that extracts software mentions from the CORD-19 corpus, a regularly updated collection of more than 280,000 scholarly articles on COVID-19 and related historical coronaviruses. We then closely examine a stratified sample of extracted software mentions from recent CORD-19 publications to understand the status of software citation, and search online for the mentioned software projects and their citation requests. We evaluate both the practice of referencing software in publications and that of making software citable, comparing them with earlier findings and recent advocacy recommendations. We found increased mentions of software versions, increased open-source practices, and improved software accessibility. Yet we also found a continued high number of informal mentions that did not sufficiently credit software authors. Existing software citation requests were diverse, but they neither matched software citation advocacy recommendations nor were frequently followed by researchers authoring papers. Finally, we discuss implications for software citation advocacy and for standard-making efforts seeking to improve the situation. Our results show the diversity of software citation practices and how they differ from advocacy recommendations, provide a baseline for assessing the progress of software citation implementation, and enrich the understanding of existing challenges.
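The abstract does not give implementation details for drawing the stratified sample of extracted mentions. As a hedged illustration only, stratified sampling over a pool of extracted mentions can be sketched as follows; the function name, the stratification key (publication year), and the example data are all hypothetical, not taken from the paper:

```python
import random
from collections import defaultdict

def stratified_sample(mentions, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items from each stratum defined by `key`."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    strata = defaultdict(list)
    for m in mentions:
        strata[key(m)].append(m)
    sample = []
    for group in strata.values():
        k = min(per_stratum, len(group))  # stratum may be smaller than quota
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical extracted mentions: (software name, publication year)
mentions = [
    ("SPSS", 2020), ("BLAST", 2020), ("R", 2020),
    ("GROMACS", 2021), ("Stata", 2021),
]
subset = stratified_sample(mentions, key=lambda m: m[1], per_stratum=2)
```

Stratifying (e.g., by year or discipline) ensures rarer strata are not swamped by the most common ones when manually reviewing a small subset of a large extraction.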
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and indicate whether the citing article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.