Abstract. Based on state-of-the-art machine learning techniques, GROBID (GeneRation Of BIbliographic Data) performs reliable bibliographic data extraction from scholarly articles, combined with multi-level term extraction. These two types of extraction present synergies and correspond to complementary descriptions of an article. The tool is intended as a component for enhancing existing and future large repositories of technical and scientific publications.

Objectives
The purpose of this demonstration is to show the digital library community a practical example of the accuracy of current state-of-the-art machine learning techniques applied to information extraction from scholarly articles. The demonstration is based on the web application at the following address: http://grobid.no-ip.org.

Bibliographical Data Extraction
After the selection of a PDF document, GROBID extracts the bibliographical data corresponding to the header information (title, authors, abstract, etc.) and to each reference (title, authors, journal title, issue, number, etc.). The references are associated with their respective citation contexts. The result of the citation extraction can be exported as a whole or per reference in different formats (BibTeX and TEI) and as COinS. The automatic extraction of bibliographical data is a challenging task because of the high variability of bibliographical formats and presentations. We have applied Conditional Random Fields to this task following the approach of [1], implemented with the Mallet toolkit [2], based on approx. 1000 training examples for header information and 1200 training examples for cited references. An evaluation with the reference CORA dataset showed a reliable level of accuracy: 98.6% per header field and 74.9% per complete header instance, and 95.7% per citation field and 78.9% per citation instance.
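The gap between the per-field and per-instance figures reported above follows from how the two metrics are counted: a single wrong field makes the whole instance count as wrong. The sketch below illustrates that distinction on toy data. The record layout, the exact-match criterion, and the function names are assumptions made for illustration, not the evaluation code used by the authors.

```python
from typing import Dict, List

# Hypothetical record layout: field name -> extracted string.
Record = Dict[str, str]


def field_accuracy(gold: List[Record], pred: List[Record]) -> float:
    """Fraction of gold fields whose predicted value matches exactly."""
    total = 0
    correct = 0
    for g, p in zip(gold, pred):
        for name, value in g.items():
            total += 1
            if p.get(name) == value:
                correct += 1
    return correct / total if total else 0.0


def instance_accuracy(gold: List[Record], pred: List[Record]) -> float:
    """Fraction of records in which every gold field matches exactly."""
    if not gold:
        return 0.0
    hits = sum(
        1
        for g, p in zip(gold, pred)
        if all(p.get(name) == value for name, value in g.items())
    )
    return hits / len(gold)


# Toy data (invented for illustration): one wrong author in the second record.
gold = [
    {"title": "A", "author": "X", "year": "2001"},
    {"title": "B", "author": "Y", "year": "2002"},
]
pred = [
    {"title": "A", "author": "X", "year": "2001"},
    {"title": "B", "author": "Z", "year": "2002"},
]

print(field_accuracy(gold, pred))     # 5 of 6 fields match
print(instance_accuracy(gold, pred))  # 1 of 2 records fully match
```

With three fields per record, one field error lowers field accuracy only slightly but halves instance accuracy in this toy case, which is the same pattern as the reported header and citation scores.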
The motion of a thin viscous layer of fluid on a horizontal solid surface bounded laterally by a dry spot and a vertical solid wall is considered. A lubrication model with contact line motion is studied. We find that for a container of fixed length the axisymmetric equilibrium solutions with small dry spots are unstable to axisymmetric disturbances. As the size of the dry spot increases, the equilibrium solutions become unstable to nonaxisymmetric disturbances. In addition, we present numerical solutions of the nonlinear evolution equations in the axisymmetric and nonaxisymmetric cases for different values of the parameters. The axisymmetric results show good agreement with existing experimental results.
A thin layer of liquid advancing over a dry, heated, inclined plate is studied. A lubrication model with contact line motion is derived. The plate is at constant temperature, and the surface Biot number is specified. The steady-state solution is obtained numerically. In addition, the steady-state solution is studied analytically in the neighbourhood of the contact line. A linear stability analysis about the steady state is then performed. The effects of gravity, thermocapillarity and contact line motion are discussed. In particular, we determine a band of unstable wavenumbers, and the maximum growth rate as a function of these parameters.
Software contributions to academic research are relatively invisible, especially to the formalized scholarly reputation system based on bibliometrics. In this article, we introduce a gold‐standard dataset of software mentions from the manual annotation of 4,971 academic PDFs in biomedicine and economics. The dataset is intended to be used for automatic extraction of software mentions from PDF format research publications by supervised learning at scale. We provide a description of the dataset and an extended discussion of its creation process, including improved text conversion of academic PDFs. Finally, we reflect on our challenges and lessons learned during the dataset creation, in hope of encouraging more discussion about creating datasets for machine learning use.
It has been observed that when a thin liquid film coats an initially dry inclined plane, a spanwise instability occurs at the leading edge. Here we develop a model for the evolution of this coating film which includes inertia, gravity, surface tension and the contact angle at the leading edge of the film. A Kármán–Pohlhausen method is used to include inertia. We determine steady state profiles of the film and investigate their stability. The predictions of the model are compared to some recent experiments and we find good agreement. This theory gives improvement over a lubrication theory in experiments where Reynolds numbers are significantly larger than one.
Experimental results are presented for the motion of a dry spot in a thin viscous film on a horizontal surface. These include global and spatial measurements of dry spot diameter, front velocities, static and dynamic contact angle, and the shape of the liquid–solid interface. Data are presented as a function of initial fluid depth for both an advancing fluid front of a collapsing dry spot and a receding fluid front of an opening dry spot. Results for both cases show that the final or static hole diameter increases as the initial fluid depth decreases. Also, insight is obtained into the relationship between the contact angle and the velocity for both advancing and receding fluid fronts. The experimental results are compared to a lubrication model, and good agreement is obtained.
Patent prior-art search is concerned with finding all filed patents relevant to a given patent application. We report a comparison between two search approaches representing the state-of-the-art in patent prior-art search. The first approach uses simple and straightforward information retrieval (IR) techniques, while the second uses much more sophisticated techniques which try to model the steps taken by a patent examiner in patent search. Experiments show that the retrieval effectiveness using both techniques is statistically indistinguishable when patent applications contain some initial citations. However, the advanced search technique is statistically better when no initial citations are provided. Our findings suggest that less time and effort can be exerted by applying simple IR approaches when initial citations are provided.
This paper presents the system called PATATRAS (PATent and Article Tracking, Retrieval and AnalysiS) realized at the Humboldt University for the IP track of CLEF 2009. Our approach presents three main characteristics:
1. The usage of multiple retrieval models (KL, Okapi) and term index definitions (lemma, phrase, concept) for the three languages considered in the present track (English, French, German) producing ten different sets of ranked results.
2. The merging of the different results based on multiple regression models using an additional validation set created from the patent collection.
3. The exploitation of patent metadata and of the citation structures for creating restricted initial working sets of patents and for producing a final re-ranking regression model.
As we exploit specific metadata of the patent documents and the citation relations only at the creation of initial working sets and during the final post ranking step, our architecture remains generic and easy to extend.
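The merging step in point 2 can be sketched as a regression that learns, from a validation set, how much weight each retrieval run deserves. The example below is a minimal illustration only: it assumes two runs with scores already normalized to [0, 1], binary relevance labels, and an ordinary least-squares model fitted by gradient descent. The function names and the toy data are hypothetical and do not reproduce the models used in PATATRAS.

```python
from typing import Dict, List, Sequence, Tuple


def fit_linear(
    rows: Sequence[Sequence[float]],
    targets: Sequence[float],
    lr: float = 0.5,
    steps: int = 5000,
) -> Tuple[List[float], float]:
    """Least-squares linear regression fitted by batch gradient descent."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, targets):
            err = bias + sum(w * v for w, v in zip(weights, x)) - y
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def merge_runs(
    scores: Dict[str, Sequence[float]], weights: Sequence[float], bias: float
) -> List[str]:
    """Rank documents by the regression-weighted sum of their run scores."""
    fused = {
        doc: bias + sum(w * v for w, v in zip(weights, run_scores))
        for doc, run_scores in scores.items()
    }
    return sorted(fused, key=fused.get, reverse=True)


# Toy validation set (invented): (score from run A, score from run B) -> relevant?
validation_rows = [(0.9, 0.2), (0.8, 0.9), (0.2, 0.8), (0.1, 0.1)]
validation_labels = [1.0, 1.0, 0.0, 0.0]
weights, bias = fit_linear(validation_rows, validation_labels)

# Candidate documents for a new query, scored by the same two runs.
candidates = {"d1": (0.85, 0.10), "d2": (0.30, 0.95), "d3": (0.50, 0.50)}
print(merge_runs(candidates, weights, bias))
```

In this toy validation set run A separates relevant from non-relevant documents far better than run B, so the fitted weight for run A dominates and the merged ranking follows run A more closely.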