GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications

Lopez, Patrice

doi:10.1007/978-3-642-04346-8_62

Cited by 164 publications

(131 citation statements)

References 2 publications

(2 reference statements)

Supporting

Mentioning

130

Contrasting

Unclassified


“…In a recent survey and evaluation, several non-commercial reference parsing tools, Tkaczyk et al (2018) found that the best three performing ones all use a CRF approach: GROBID (Lopez, 2009), CERMINE (Tkaczyk et al, 2015) and ParsCit (Councill et al, 2008). All three benefit from task-specific tuning using extra annotated data, with GROBID showing the best off-the-shelf results.…”

Section: Related Work (mentioning)

confidence: 99%

Deep Reference Mining From Scholarly Literature in the Arts and Humanities

Alves

Colavizza

Kaplan

2018

Front. Res. Metr. Anal.

We consider the task of reference mining: the detection, extraction and classification of references within the full text of scholarly publications. Reference mining brings forward specific challenges, such as the need to capture the morphology of highly abbreviated words and the dependence among the elements of a reference, both following codified reference styles. This task is particularly difficult, and little explored, with respect to the literature in the arts and humanities, where references are mostly given in footnotes. We apply a deep learning architecture for reference mining from the full text of scholarly publications. We explore and discuss three architectural components: word and character-level word embeddings, different prediction layers (Softmax and Conditional Random Fields) and multi-task over single-task learning. Our best model uses both pre-trained word embeddings and characters embeddings, and a BiLSTM-CRF architecture. We test our solution on a dataset of annotated references from the historiography on Venice and, using a linear-chain CRF classifier as a baseline, we show that this deep learning architecture improves by a considerable margin. Furthermore, multi-task learning performs almost on par with a single-task approach. We thus confirm that there are important gains to be had by adopting deep learning for the task of reference mining.

“…The system based on TeamBeam algorithm proposed by Kern et al [13] is able to extract a basic set of metadata from PDF documents using an enhanced Maximum Entropy classifier. Lopez [14] proposes GROBID system for analysing scientific texts in PDF format. GROBID uses CRF in order to extract document's metadata, full text and a list of parsed bibliographic references.…”

Section: State Of the Art (mentioning)

confidence: 99%

“…Reference sections are typically located in the documents using heuristics [6,7,16,17] or machine learning [14,18].…”

Section: State Of the Art (mentioning)

confidence: 99%

CERMINE: automatic extraction of structured metadata from scientific literature

Tkaczyk

Szostek

Fedoryszak

et al. 2015

IJDAR

CERMINE is a comprehensive open-source system for extracting structured metadata from scientific articles in a born-digital form. The system is based on a modular workflow, whose loosely coupled architecture allows for individual component evaluation and adjustment, enables effortless improvements and replacements of independent parts of the algorithm and facilitates future architecture expanding. The implementations of most steps are based on supervised and unsupervised machine learning techniques, which simplifies the procedure of adapting the system to new document layouts and styles. The evaluation of the extraction workflow carried out with the use of a large dataset showed good performance for most metadata types, with the average F score of 77.5%. CERMINE system is available under an open-source licence and can be accessed at http://cermine.ceon.pl. In this paper, we outline the overall workflow architecture and provide details about individual steps implementations. We also thoroughly compare CERMINE to similar solutions, describe evaluation methodology and finally report its results.

“…In essence, they require to be as neatly associated to the originated researcher or institution, as a warrant for the trustfulness of the content, but also in order to allow an adequate citation of the work. [Footnote 20: See in particular the importance of machine learning techniques in this respect (Lopez 2009).] More generally, scientific data have to be, even more than publications, associated with precise metadata (in the same way as what we have for publications with bibliographical data).…”

Section: Characterising Research Data (mentioning)

confidence: 99%

Scholarly Communication

Romary

2014

Encyclopedia of Social Network Analysis and Mining

The chapter tackles the role of scholarly publication in the research process (quality, preservation) and looks at the consequences of new information technologies in the organization of the scholarly communication ecology. It will then show how new technologies have had an impact on the scholarly communication process and made it depart from the traditional publishing environment. Developments will address new editorial processes, dissemination of new content and services, as well as the development of publication archives. This last aspect will be covered on all levels (open access, scientific, technical and legal aspects). A view on the possible evolutions of the scientific publishing environment will be provided.
