Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models, in all languages except English, very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web-crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web-crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best-performing model, CamemBERT, reaches or improves the state of the art in all four downstream tasks.
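As an illustration of the downstream usability of such a model, here is a minimal sketch (not taken from the paper) of loading the released CamemBERT checkpoint with the Hugging Face transformers library and probing it with a fill-mask query; the checkpoint name camembert-base is the publicly released base model, and the example sentence is invented.

```python
# Minimal sketch: load the public CamemBERT checkpoint and probe it with a
# masked-token query. Assumes `transformers` and `torch` are installed.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForMaskedLM.from_pretrained("camembert-base")

# Fill-mask probing: predict the masked token in a French sentence.
sentence = f"Le camembert est {tokenizer.mask_token} !"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and print the five most likely fillers.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_index].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```

For the tasks evaluated in the paper, the same checkpoint would typically be fine-tuned with a task-specific head (tagging, parsing, NER or NLI) rather than queried directly as above.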
This paper describes the Linguistic Annotation Framework under development within ISO TC37 SC4 WG1. The Linguistic Annotation Framework is intended to serve as a basis for harmonizing existing language resources as well as developing new ones.
Chris Armbruster, Laurent Romary. Comparing Repository Types - Challenges and barriers for subject-based repositories, research repositories, national repository systems and institutional repositories in serving scholarly communication.
We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitute the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.
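To make the corpus-construction step concrete, the following sketch illustrates the kind of language classification and filtering used to build OSCAR-style corpora from Common Crawl, using fastText's pretrained language-identification model; the lid.176.bin file name is the standard fastText release, while the confidence threshold and the sample sentences are assumptions made for the example.

```python
# Illustrative sketch of a language-classification filter: each line of raw
# text is scored with fastText's pretrained language-identification model
# (lid.176.bin) and kept only if the predicted language and confidence pass
# a threshold. The 0.8 threshold is an assumption for the example.
import fasttext

lid_model = fasttext.load_model("lid.176.bin")  # downloadable from the fastText website

def keep_line(line: str, target_lang: str = "fr", min_conf: float = 0.8) -> bool:
    """Return True if `line` is classified as `target_lang` with enough confidence."""
    labels, probs = lid_model.predict(line.replace("\n", " "))
    lang = labels[0].replace("__label__", "")
    return lang == target_lang and probs[0] >= min_conf

corpus = [
    "Le chat dort sur le canapé.",
    "The cat is sleeping on the couch.",
]
french_lines = [line for line in corpus if keep_line(line)]
print(french_lines)
```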
It is widely recognized that the proliferation of annotation schemes runs counter to the need to re-use language resources, and that standards for linguistic annotation are becoming increasingly mandatory. To answer this need, we have developed a framework comprised of an abstract model for a variety of different annotation types (e.g., morpho-syntactic tagging, syntactic annotation, co-reference annotation, etc.), which can be instantiated in different ways depending on the annotator's approach and goals. In this paper we provide an overview of the framework, demonstrate its applicability to syntactic annotation, and show how it can contribute to comparative evaluation of parser output and diverse syntactic annotation schemes.
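To give a concrete feel for how a single abstract model can be instantiated for different annotation types, here is a small illustrative sketch of a standoff annotation structure in which annotations reference spans of the primary text rather than modifying it; all class and field names are invented for this example and are not taken from the ISO specification.

```python
# Illustrative sketch only: a tiny standoff annotation structure in the spirit
# of an abstract annotation model, where annotations of different types
# (morpho-syntactic tags, syntactic relations, coreference) point at regions
# of the primary text instead of being interleaved with it.
from dataclasses import dataclass, field

@dataclass
class Region:
    start: int  # character offset into the primary text
    end: int

@dataclass
class Annotation:
    ann_type: str                 # e.g. "pos", "dependency", "coref"
    region: Region
    features: dict = field(default_factory=dict)

text = "Time flies"
annotations = [
    Annotation("pos", Region(0, 4), {"tag": "NOUN"}),
    Annotation("pos", Region(5, 10), {"tag": "VERB"}),
    Annotation("dependency", Region(0, 4), {"relation": "nsubj", "head": (5, 10)}),
]

# Two annotation schemes can attach different feature sets to the same regions
# without touching the underlying text, which is what makes them comparable.
for ann in annotations:
    print(ann.ann_type, text[ann.region.start:ann.region.end], ann.features)
```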
We present Grobid-quantities, an open-source application for extracting and normalising measurements from scientific and patent literature. Tools of this kind, aiming to understand and make unstructured information accessible, represent the building blocks for large-scale Text and Data Mining (TDM) systems. Grobid-quantities is a module built on top of Grobid [6] [13], a machine learning framework for parsing and structuring PDF documents. Designed to process large quantities of data, it provides a robust implementation accessible in batch mode or via a REST API. The machine learning engine architecture follows the cascade approach, where each model is specialised in the resolution of a specific task. The models are trained using the Conditional Random Fields (CRF) algorithm [12] for extracting quantities (atomic values, intervals and lists), units (such as length or weight) and different value representations (numeric, alphabetic or scientific notation). Identified measurements are normalised according to the International System of Units (SI). Thanks to its stable recall and reliable precision, Grobid-quantities has been integrated as the measurement-extraction engine in various TDM projects, such as Marve (Measurement Context Extraction from Text), for extracting semantic measurements and meaning in Earth Science [10]. At the National Institute for Materials Science in Japan (NIMS), it is used in an ongoing project to discover new superconducting materials. Normalised material characteristics (such as critical temperature and pressure) extracted from the scientific literature are a key resource for materials informatics (MI) [9].
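As a rough illustration of how the REST API can be consumed, the sketch below posts a sentence to a locally running Grobid-quantities service; the port (8060) and the processQuantityText endpoint correspond to a typical default deployment but should be checked against the installed version, and the example sentence and JSON handling are assumptions for the example.

```python
# Minimal sketch of calling a locally running Grobid-quantities REST service.
# URL, port and endpoint reflect a typical default deployment (assumption).
import requests

GROBID_QUANTITIES_URL = "http://localhost:8060/service/processQuantityText"

text = "The sample was heated to 300 K under a pressure of 2 GPa for 10 minutes."
response = requests.post(GROBID_QUANTITIES_URL, data={"text": text}, timeout=30)
response.raise_for_status()

# The service responds with JSON describing each extracted measurement
# (raw value, unit, and its normalisation to SI units).
for measurement in response.json().get("measurements", []):
    print(measurement)
```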
In this paper we provide a systematic and comprehensive set of modeling principles for representing etymological data in digital dictionaries using TEI. The purpose is to integrate in one coherent framework both digital representations of legacy dictionaries and born-digital lexical databases that are constructed manually or semi-automatically. We provide examples from many different types of etymological phenomena from traditional lexicographic practice, as well as analytical approaches from functional and cognitive linguistics such as metaphor, metonymy and grammaticalization, which in many lexicographical and formal linguistic circles have not often been treated as truly etymological in nature, and have thus been largely left out of etymological dictionaries. In order to fully and accurately express the phenomena and their structures, we have made several proposals for expanding and amending some aspects of the existing TEI framework. Finally, with reference to both synchronic and diachronic data, we also demonstrate how encoders may integrate semantic web/linked open data information resources into TEI dictionaries as a basis for the sense and/or the semantic domain of an entry and/or an etymon.
This paper delineates the main characteristics of the Episciences platform, an environment for overlay peer-reviewing that complements existing publication repositories, designed by the Centre pour la Communication Scientifique directe (CCSD) service unit. We describe the main characteristics of the platform and present the first experiment of launching two journals in the computer science domain onto it. Finally, we address a series of open questions related to the actual changes in editorial models (open submission, open peer-review, augmented publication) that such a platform is likely to raise, as well as some hints as to the underlying business model.
Keywords: overlay journal, editorial platform, scholarly communication, repositories, Open Access
Exploring new scholarly publication models
The recent debates on Open Access have mainly focused on opposing models: the so-called green model, where scientists deposit their (possibly published) research papers in open repositories, and the gold model, where publishers, usually following the payment of an author fee, freely release the publication online. This debate often misses two points. First, that what is at stake is to have a reliable and sustainable communication system for science where scientists themselves have their say and are provided with all means to quickly disseminate their results while receiving the appropriate feedback (usually embodied by peer-reviewing) from their communities. Second, that all data generated around the evaluation, the reviews and the associated discussions (forums, etc.) shall be monitored by the scientific community. Still, we know that alternative models to the traditional publisher-owned journals are possible, and experiments carried out in the human sciences with the OpenEdition endeavour, for instance, have shown that research communities may react favourably when a real alternative is being offered. Such initiatives provide a systemic concept of publishing (from scholarly blogs to journal publications) comprising both new editorial frameworks and business models. In this context, we present a new initiative to provide an overlay journal environment, i.e. a journal that is built as an additional peer-reviewing layer on top of a publication repository (see [9]). This