Word frequency is the most important variable in research on word processing and memory. Yet, the main criterion for selecting word frequency norms has been the availability of the measure, rather than its quality. As a result, much research is still based on the old Kucera and Francis frequency norms. By using the lexical decision times of recently published megastudies, we show how bad this measure is and what must be done to improve it. In particular, we investigated the size of the corpus, the language register on which the corpus is based, and the definition of the frequency measure. We observed that corpus size is of practical importance for small sizes (depending on the frequency of the word), but not for sizes above 16-30 million words. As for the language register, we found that frequencies based on television and film subtitles are better than frequencies based on written sources, certainly for the monosyllabic and bisyllabic words used in psycholinguistic research. Finally, we found that lemma frequencies are not superior to word form frequencies in English and that a measure of contextual diversity is better than a measure based on raw frequency of occurrence. Part of the superiority of the latter is due to the words that are frequently used as names. Assembling a new frequency norm on the basis of these considerations turned out to predict word processing times much better than did the existing norms (including Kucera & Francis and Celex). The new SUBTL frequency norms from the SUBTLEX(US) corpus are freely available for research purposes from http://brm.psychonomic-journals.org/content/supplemental, as well as from the University of Ghent and Lexique Web sites.
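The contextual-diversity measure mentioned above counts the number of distinct contexts (here, films or television programmes) in which a word occurs, rather than its raw number of occurrences. A minimal sketch of how both counts can be derived from a tokenised subtitle corpus (the function name and toy data are illustrative, not part of the actual SUBTLEX pipeline):

```python
from collections import Counter

def frequency_and_cd(documents):
    """documents: list of token lists, one per film or episode.
    Returns two Counters: raw corpus frequency of each word, and
    contextual diversity (number of documents containing the word)."""
    freq = Counter()
    cd = Counter()
    for tokens in documents:
        freq.update(tokens)        # every occurrence counts
        cd.update(set(tokens))     # each document counts at most once
    return freq, cd
```

In regression models of lexical decision times, log-transformed versions of these counts are typically used as predictors.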
Information about the affective meanings of words is used by researchers working on emotions and moods, word recognition and memory, and text-based sentiment analysis. Three components of emotions are traditionally distinguished: valence (the pleasantness of a stimulus), arousal (the intensity of emotion provoked by a stimulus), and dominance (the degree of control exerted by a stimulus). Thus far, nearly all research has been based on the ANEW norms collected by Bradley and Lang (1999) for 1,034 words. We extended that database to nearly 14,000 English lemmas, providing researchers with a much richer source of information, including gender, age, and educational differences in emotion norms. As an example of the new possibilities, we included stimuli from nearly all of the category norms (e.g., types of diseases, occupations, and taboo words) collected by Van Overschelde, Rawson, and Dunlosky (Journal of Memory and Language 50:289-335, 2004), making it possible to include affect in studies of semantic memory.
Concreteness ratings for 40 thousand generally known English word lemmas

Abstract: Concreteness ratings are presented for 37,058 English words and 2,896 two-word expressions (such as "zebra crossing" and "zoom in"), obtained from over four thousand participants by means of a norming study using internet crowdsourcing for data collection. Although the instructions stressed that the assessment of word concreteness should be based on experiences involving all senses and motor responses, a comparison with the existing concreteness norms indicates that participants, as before, largely focused on visual and haptic experiences. The reported dataset is a subset of a comprehensive list of English lemmas and contains all lemmas known by at least 85% of the raters. It can be used in future research as a reference list of generally known English lemmas.

Concreteness evaluates the degree to which the concept denoted by a word refers to a perceptible entity. The variable came to the foreground in Paivio's dual-coding theory (Paivio, 1971, 2013). According to this theory, concrete words are easier to remember than abstract words because they activate perceptual memory codes in addition to verbal codes. Schwanenflugel, Harnishfeger, and Stowe (1988) presented an alternative context availability theory, according to which concrete words are easier to process because they are related to strongly supporting memory contexts, whereas abstract words are not, as can be demonstrated by asking people how easy it is to think of a context in which the word can be used.
We present age-of-acquisition (AoA) ratings for 30,121 English content words (nouns, verbs, and adjectives). For data collection, this megastudy used the Web-based crowdsourcing technology offered by the Amazon Mechanical Turk. Our data indicate that the ratings collected in this way are as valid and reliable as those collected in laboratory conditions (the correlation between our ratings and those collected in the lab from U.S. students reached .93 for a subsample of 2,500 monosyllabic words). We also show that our AoA ratings explain a substantial percentage of the variance in the lexical-decision data of the English Lexicon Project, over and above the effects of log frequency, word length, and similarity to other words. This is true not only for the lemmas used in our rating study, but also for their inflected forms. We further discuss the relationships of AoA with other predictors of word recognition and illustrate the utility of AoA ratings for research on vocabulary growth.

Keywords: Word recognition, Age of acquisition, Ratings, Amazon Mechanical Turk

Researchers using words as stimulus materials typically control or manipulate their stimuli on a number of variables. The four that are most commonly used are word frequency, word length, similarity to other words, and word onset. In this article, we will argue that age of acquisition (AoA) should be part of this list, and we provide ratings for a substantial number of words in order to do so. First, however, we will discuss the evidence in favor of the big four.
We present word frequencies based on subtitles of British television programmes. We show that the SUBTLEX-UK word frequencies explain more of the variance in the lexical decision times of the British Lexicon Project than do the word frequencies based on the British National Corpus and the SUBTLEX-US frequencies. In addition to the word form frequencies, we also present measures of contextual diversity, part-of-speech-specific word frequencies, word frequencies in children's programmes, and word bigram frequencies, giving researchers of British English access to the full range of norms recently made available for other languages. Finally, we introduce a new measure of word frequency, the Zipf scale, which we hope will clear up current misunderstandings of the word frequency effect.
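The Zipf scale expresses a word's frequency as the base-10 logarithm of its frequency per billion words, i.e. log10(frequency per million words) + 3, giving an easily interpreted scale that runs roughly from 1 (very low frequency) to about 7 (very high frequency). A minimal sketch of the conversion (function and variable names are illustrative):

```python
import math

def zipf_value(count, corpus_size_tokens):
    """Convert a raw word count in a corpus of the given size
    into a Zipf value: log10(frequency per million words) + 3,
    equivalently log10 of the frequency per billion words."""
    freq_per_million = count / (corpus_size_tokens / 1_000_000)
    return math.log10(freq_per_million) + 3

# A word occurring once per million words gets Zipf 3;
# one occurring 100 times per million words gets Zipf 5.
```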
In this article, we present a new lexical database for French: Lexique. In addition to classical word information such as gender, number, and grammatical category, Lexique includes a series of interesting new characteristics. First, word frequencies are based on two cues: a contemporary corpus of texts and the number of Web pages containing the word. Second, the database is split into a graphemic table with all the relevant frequencies, a table structured around lemmas (particularly interesting for the study of the inflectional family), and a table about surface frequency cues. Third, Lexique is distributed under a GNU-like license, allowing people to contribute to it. Finally, a metasearch engine, Open Lexique, has been developed so that new databases can be added very easily to the existing ones. Lexique can either be downloaded or interrogated freely from http://www.lexique.org.
In psychology, attempts to replicate published findings are less successful than expected. For properly powered studies, the replication rate should be around 80%, whereas in practice less than 40% of the studies selected from different areas of psychology can be replicated. Researchers in cognitive psychology are hindered in estimating the power of their studies, because the designs they use present a sample of stimulus materials to a sample of participants, a situation not covered by most power formulas. To remedy the situation, we review the literature on the topic and introduce recent software packages, which we apply to the data of two masked priming studies with high power. We show how the power of each study can be estimated and how far the numbers of participants and stimuli could be reduced while keeping adequate power. On the basis of this analysis, we recommend that a properly powered reaction time experiment with repeated measures include at least 1,600 word observations per condition (e.g., 40 participants, 40 stimuli). This is considerably more than current practice. We also show that researchers must report the number of observations in meta-analyses, because the effect sizes currently reported depend on the number of stimuli presented to the participants. Our analyses can easily be applied to newly gathered datasets.
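The sample-size recommendation above can be explored with a Monte-Carlo simulation: generate many synthetic experiments with crossed participant and item random effects, analyse each one, and count how often the effect reaches significance. The sketch below uses a paired t-test on by-participant condition means; all parameter values (effect size, variance components) are illustrative assumptions for demonstration, not estimates from the studies analysed in the article:

```python
import numpy as np

def simulate_power(n_subj=40, n_items=40, effect_ms=20,
                   sd_subj=80, sd_item=40, sd_resid=150,
                   n_sims=500, seed=0):
    """Monte-Carlo power estimate for a two-condition repeated-measures
    RT design with crossed participant and item random effects.
    Each simulated experiment is analysed with a paired t-test on
    by-participant condition means. Parameter values are illustrative."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        subj = rng.normal(0, sd_subj, n_subj)[:, None, None]   # participant intercepts
        item = rng.normal(0, sd_item, n_items)[None, :, None]  # item intercepts
        cond = np.array([0.0, effect_ms])[None, None, :]       # fixed condition effect
        noise = rng.normal(0, sd_resid, (n_subj, n_items, 2))  # trial-level residuals
        rt = 600 + subj + item + cond + noise
        means = rt.mean(axis=1)                 # by-participant condition means
        diff = means[:, 1] - means[:, 0]
        t = diff.mean() / (diff.std(ddof=1) / np.sqrt(n_subj))
        if abs(t) > 2.02:                       # approx. two-sided .05 cutoff, df = 39
            hits += 1
    return hits / n_sims
```

Varying n_subj and n_items in such a simulation shows how power depends on the total number of observations per condition, the point the article makes analytically.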