Inference using significance testing and Bayes factors is compared and contrasted in five case studies based on real research. The first study illustrates that the methods will often agree, both in motivating researchers to conclude that H1 is supported better than H0, and the other way round, that H0 is better supported than H1. The next four, however, show that the methods will also often disagree. In these cases, the aim of the paper will be to motivate the sensible evidential conclusion, and then see which approach matches those intuitions. Specifically, it is shown that a high-powered non-significant result is consistent with no evidence for H0 over H1 worth mentioning, which a Bayes factor can show, and, conversely, that a low-powered non-significant result is consistent with substantial evidence for H0 over H1, again indicated by Bayesian analyses. The fourth study illustrates that a high-powered significant result may not amount to any evidence for H1 over H0, matching the Bayesian conclusion. Finally, the fifth study illustrates that different theories can be evidentially supported to different degrees by the same data; a fact that P-values cannot reflect but Bayes factors can. It is argued that appropriate conclusions match the Bayesian inferences, but not those based on significance testing, where they disagree.
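The contrast this abstract draws between P-values and Bayes factors can be made concrete with a rough numerical sketch. The snippet below approximates a Bayes factor BF01 (evidence for H0 over H1) for a two-group mean comparison using the BIC approximation (Wagenmakers, 2007). The function name and the illustrative computation are ours, not the paper's, and the BIC approximation assumes a default unit-information prior rather than the informed priors the case studies discussed above would motivate.

```python
import math

def bic_bf01(group1, group2):
    """Approximate BF01 for a two-group mean comparison via
    BF01 ≈ exp((BIC1 - BIC0) / 2); values > 1 favor H0 (no difference),
    values < 1 favor H1 (a difference)."""
    n1, n2 = len(group1), len(group2)
    n = n1 + n2
    pooled = group1 + group2
    grand_mean = sum(pooled) / n
    # Null model H0: a single common mean for both groups.
    ss0 = sum((v - grand_mean) ** 2 for v in pooled)
    # Alternative model H1: separate group means.
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    ss1 = (sum((v - m1) ** 2 for v in group1)
           + sum((v - m2) ** 2 for v in group2))
    # Gaussian-model BIC up to a shared additive constant:
    # n*ln(SS/n) + k*ln(n); only the one-parameter difference
    # between the models matters for the Bayes factor.
    bic0 = n * math.log(ss0 / n) + 1 * math.log(n)
    bic1 = n * math.log(ss1 / n) + 2 * math.log(n)
    return math.exp((bic1 - bic0) / 2)
```

With clearly separated groups the function returns a value well below 1 (evidence for H1); with identical groups it returns a value above 1 (evidence for H0), illustrating how a Bayes factor, unlike a non-significant P-value, can positively support the null.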
Researchers often conclude an effect is absent when a null-hypothesis significance test yields a non-significant p-value. However, it is neither logically nor statistically correct to conclude an effect is absent when a hypothesis test is not significant. We present two methods to evaluate the presence or absence of effects: equivalence testing (based on frequentist statistics) and Bayes factors (based on Bayesian statistics). In four examples from the gerontology literature, we illustrate different ways to specify alternative models that can be used to reject the presence of a meaningful or predicted effect in hypothesis tests. We provide detailed explanations of how to calculate, report, and interpret Bayes factors and equivalence tests. We also discuss how to design informative studies that can provide support for a null model or for the absence of a meaningful effect. The conceptual differences between Bayes factors and equivalence tests are discussed, and we also note when and why they might lead to similar or different inferences in practice. It is important that researchers be able to falsify predictions and to quantify the support for predicted null effects. Bayes factors and equivalence tests provide useful statistical tools to improve inferences about null effects.
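The equivalence-testing approach described above can be illustrated with a minimal one-sample TOST (two one-sided tests) against symmetric equivalence bounds. This is a hedged sketch, not code from the paper: the function name, data, and bounds are hypothetical, and real analyses would typically use a dedicated package such as TOSTER (R) or `statsmodels` (Python).

```python
import math
from scipy import stats

def tost_one_sample(x, low, high):
    """One-sample TOST against bounds [low, high].
    Returns the TOST p-value: the larger of the two one-sided
    t-test p-values. A small value supports equivalence, i.e.
    the mean lies inside the bounds."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    se = sd / math.sqrt(n)
    df = n - 1
    # Test 1 rejects H0a: mean <= low (mean is above the lower bound).
    t_low = (mean - low) / se
    p_low = stats.t.sf(t_low, df)
    # Test 2 rejects H0b: mean >= high (mean is below the upper bound).
    t_high = (mean - high) / se
    p_high = stats.t.cdf(t_high, df)
    return max(p_low, p_high)
```

If both one-sided tests reject, the observed effect is statistically smaller than the smallest effect size of interest, which licenses the conclusion that a meaningful effect is absent in exactly the sense the abstract describes.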
The rubber hand illusion is one reliable way to experimentally manipulate the experience of body ownership. However, debate continues about the necessary and sufficient conditions for eliciting the illusion. We measured proprioceptive drift and the subjective experience (via questionnaire) while manipulating two variables that have been suggested to affect the intensity of the illusion. First, the rubber hand was positioned either in a posturally congruent position or rotated by 180°. Second, either an anatomically congruent rubber hand or an anatomically incongruent one was used. We found in two independent experiments that a rubber hand rotated by 180° leads to increased proprioceptive drift during synchronous visuo-tactile stroking, although it does not lead to feelings of ownership (as measured by questionnaire). This dissociation between drift and ownership suggests that proprioceptive drift is not necessarily a valid proxy for the illusion when using hands rotated by 180°.
The ability to respond to hypnotic suggestions (hypnotizability) is a stable trait which can be measured in a standardized procedure consisting of a hypnotic induction and a series of hypnotic suggestions. The SWASH is a 10-item adaptation of an established scale, the Waterloo-Stanford Group C Scale of Hypnotic Suggestibility (WSGC). Development of the SWASH was motivated by three distinct aims: to reduce required screening time, to provide an induction which more accurately reflects current theoretical understanding, and to supplement the objective scoring with experiential scoring. Screening time was reduced by shortening the induction, removing two suggestions which may cause distress (dream and age regression), and by modifications which allow administration in lecture theatres, so that more participants can be screened simultaneously. Theoretical issues were addressed by removing references to sleep, absorption, and eye fixation and closure. Data from 418 participants at the University of Sussex and Lancaster University are presented, along with data from 66 participants who completed a retest screening. The subjective and objective scales were highly correlated. The subjective scale showed good reliability, and objective scale reliability was comparable to the WSGC. The addition of subjective scale responses to the post-hypnotic suggestion (PHS) item suggested a high probability that responses to PHS are inflated in WSGC screening. The SWASH is an effective measure of hypnotizability which reflects changes in conscious experience and presents practical and theoretical advantages over existing scales.
The self-concept maintenance theory holds that many people will cheat in order to maximize self-profit, but only to the extent that they can do so while maintaining a positive self-concept. Mazar, Amir, and Ariely (2008, Experiment 1) gave participants an opportunity and incentive to cheat on a problem-solving task. Prior to that task, participants either recalled the Ten Commandments (a moral reminder) or recalled 10 books they had read in high school (a neutral task). Results were consistent with the self-concept maintenance theory. When given the opportunity to cheat, participants given the moral-reminder priming task reported solving 1.45 fewer matrices than did those given a neutral prime (Cohen's d = 0.48); moral reminders reduced cheating. Mazar et al.'s article is among the most cited in deception research, but their Experiment 1 has not been replicated directly. This Registered Replication Report describes the aggregated result of 25 direct replications (total N = 5,786), all of which followed the same preregistered protocol. In the primary meta-analysis (19 replications, total n = 4,674), participants who were given an opportunity
Srull and Wyer (1979) demonstrated that exposing participants to more hostility-related stimuli caused them subsequently to interpret ambiguous behaviors as more hostile. In their Experiment 1, participants descrambled sets of words to form sentences. In one condition, 80% of the descrambled sentences described hostile behaviors, and in another condition, 20% described hostile behaviors. Following the descrambling task, all participants read a vignette about a man named Donald who behaved in an ambiguously hostile manner and then rated him on a set of personality traits. Next, participants rated the hostility of various ambiguously hostile behaviors (all ratings on scales from 0 to 10). Participants who descrambled mostly hostile sentences rated Donald and the ambiguous behaviors as approximately 3 scale points more hostile than did those who descrambled mostly neutral sentences. This Registered Replication Report describes the results of 26 independent replications (N = 7,373 in the total sample; k = 22 labs and N = 5,610 in the
Dijksterhuis and van Knippenberg (1998) reported that participants primed with a category associated with intelligence ("professor") subsequently performed 13% better on a trivia test than participants primed with a category associated with a lack of intelligence ("soccer hooligans"). In two unpublished replications of this study designed to verify the appropriate testing procedures, Dijksterhuis, van Knippenberg, and Holland observed a smaller difference between conditions (2%-3%) as well as a gender difference: Men showed the effect (9.3% and 7.6%), but women did not (0.3% and -0.3%). The procedure used in those replications served as the basis for this multilab Registered Replication Report. A total of 40 laboratories collected data for this project, and 23 of these laboratories met all inclusion criteria. Here we report the meta-analytic results for those 23 direct replications (total N = 4,493), which tested whether performance on a 30-item general-knowledge trivia task differed between these two priming conditions (results of supplementary analyses of the data from all 40 labs, N = 6,454, are also reported). We observed no overall difference in trivia performance between participants primed with the "professor" category and those primed with the "hooligan" category (0.14%) and no moderation by gender.
Past research has provided abundant evidence that playing violent video games increases aggressive behavior. So far, these effects have been explained mainly as the result of priming existing knowledge structures. The research reported here examined the role of denying humanness to other people in accounting for the effect that playing a violent video game has on aggressive behavior. In two experiments, we found that playing violent video games increased dehumanization, which in turn evoked aggressive behavior. Thus, it appears that video-game-induced aggressive behavior is triggered when victimizers perceive the victim to be less human.