Simple Interactive Statistical Analysis
Bonferroni
Input.
Input should be the pursued alpha level, a decimal number between zero and one, in the top box. The number of comparisons, a positive integer without decimals, is given in the second box. Set the mean r (correlation) to zero for full Bonferroni correction, or to a value between 0 and 1 for partial Bonferroni correction.
Explanation.
The Bonferroni correction/adjustment procedure is the most basic of SISA procedures; however, it concerns an issue about which there is much, and ongoing, discussion. Bonferroni correction addresses the question of whether, when more than one test is done in a particular study, the alpha level should be adjusted downward to account for chance capitalization.
The alpha level is the chance researchers accept of making a type one error. A type one error is the error of incorrectly declaring a difference, effect or relationship to be true because chance produced a particular state of events. Customarily the alpha level is set at 0.05: in no more than one in twenty statistical tests will the test show 'something' while in fact there is nothing. When more than one statistical test is done, the chance of finding at least one test statistically significant due to chance fluctuation, and of incorrectly declaring a difference or relationship to be true, increases. In five tests the chance of finding at least one difference or relationship significant due to chance fluctuation equals 0.22, or about one in five. In ten tests this chance increases to 0.40, which is about one in two. Using the Bonferroni method, the alpha level of each individual test is adjusted downward to ensure that the overall risk across a number of tests remains 0.05. Even if more than one test is done, the risk of finding a difference or effect incorrectly significant continues to be 0.05.
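The arithmetic above can be sketched in a few lines of Python. The function names are illustrative, not part of SISA; the familywise rate assumes the tests are independent:

```python
def familywise_alpha(alpha, k):
    """Chance of at least one false positive in k independent tests,
    each run at significance level alpha: 1 - (1 - alpha)^k."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(alpha, k):
    """Per-test alpha under the Bonferroni method: alpha / k,
    so the overall risk across k tests stays near alpha."""
    return alpha / k

# Five tests at 0.05: about a 0.22 chance of at least one spurious finding.
print(round(familywise_alpha(0.05, 5), 2))   # 0.23 before rounding down: 0.2262
print(round(familywise_alpha(0.05, 10), 2))  # roughly 0.40
print(bonferroni_alpha(0.05, 5))             # each test run at 0.01
```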
Although the logic is beautiful, there is a serious drawback. If the chance of incorrectly declaring a difference on an individual test, a type one error, is reduced, the chance of making a type two error is increased: no effect or difference is declared while in fact there is one. Thus, by reducing the chance of type one errors for individual tests, i.e. the chance of introducing ineffective medical treatments or ineffective improvements, the chance of type two errors is increased, i.e. the chance that effective treatments, effective educational methods, or improved production methods are not discovered. So, when is Bonferroni correction used correctly and when is it used incorrectly? There are three basic scenarios.
Scenario one. If a single hypothesis of no effect is tested using more than one test, and the hypothesis is rejected if one of the tests shows statistical significance, Bonferroni correction should be applied. For example, if in a factory there are five points where quality control is applied to the same product, and the product is rejected for the market if it fails at only one of these five points, then the chance of rejecting the product at each of the control points should be adjusted downward to keep the overall chance of incorrect rejection at a predefined level. In a similar situation, a doctor takes blood from three different places on the body to assess glucose levels. If one of the tests is positive, the patient is considered diabetic. Each of the tests should be made less sensitive to ensure that the risk of a false positive, the risk of incorrectly declaring the patient diabetic and giving him or her pointless medication, does not become unacceptably high due to repeated testing. Basically, scenario one is not considered problematic and you should apply Bonferroni correction in such cases.
Scenario one with correlated multiple outcomes. If you test for significance of a hypothesis using tests which are mutually correlated, the Bonferroni correction is too conservative. For example, if a number of tests are fully correlated, knowledge of the outcome of a single test would be sufficient to know the outcome of all the other tests; it would then be wrong to set the alpha level for that single test to approximately alpha divided by the number of tests. In the case of correlated outcomes a corrected alpha is required which lies between no correction at all and full Bonferroni correction. SISA allows you to add the mean correlation between the outcome variables as a parameter. For this you need the usual triangular matrix (without the diagonal) of the correlations between the outcome variables: sum the correlations and divide the result by the number of correlations used. A mean correlation of zero ('0') gives you full Bonferroni adjustment, a mean correlation of one gives no adjustment at all, and for other values of the correlation you will get a corrected alpha which is in between the two extremes. One would expect a set of Bonferroni-adjusted variables to have something in common, and therefore to be correlated!
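A minimal sketch of this partial correction, assuming the adjustment form discussed by Sankoh, Huque and Dubey (alpha_adj = 1 - (1 - alpha)^(1/k^(1 - mean_r))); whether SISA uses exactly this formula is an assumption, but it has the stated endpoints: full adjustment at mean r = 0 and no adjustment at mean r = 1:

```python
def mean_correlation(corrs):
    """Mean of the correlations from the triangular matrix
    (below the diagonal) between the outcome variables."""
    return sum(corrs) / len(corrs)

def partial_bonferroni_alpha(alpha, k, mean_r):
    """Assumed Sankoh/Huque/Dubey-style adjustment:
    alpha_adj = 1 - (1 - alpha)^(1 / k^(1 - mean_r)).
    mean_r = 0 gives full (Sidak-style) adjustment; mean_r = 1 gives alpha back."""
    return 1 - (1 - alpha) ** (1.0 / k ** (1.0 - mean_r))

# Five correlated outcome tests at an overall alpha of 0.05:
print(partial_bonferroni_alpha(0.05, 5, 0.0))  # full adjustment, about 0.0102
print(partial_bonferroni_alpha(0.05, 5, 0.5))  # in between
print(partial_bonferroni_alpha(0.05, 5, 1.0))  # no adjustment: 0.05
```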
One of the problems with scenario one is that purists would argue that all null hypotheses which are the subject of Bonferroni adjustment should be rejected if even one of them is rejected. Thus, in the example of the glucose tests, the patient is declared diabetic if only one test is positive, disregarding the fact that two tests were negative. Few scientists who apply Bonferroni adjustment are prepared to do this; they generally like to keep open the option of considering tests on their individual merits, which brings us to scenario two.
Scenario two is much more disputed. This is the case when in a single study more than one hypothesis is evaluated, each hypothesis with a single test. If the alpha level of each test is set at 0.05, at least one in twenty of the hypotheses tested will turn up significant due to chance fluctuation. For example, in a lifestyle study blood pressure, television viewing behaviour, leisure time physical activity, and cigarette smoking are studied. Explaining variables are age, gender, occupation and ethnic background. Now, if one is interested in the general question of whether the background variables are related to the lifestyle variables, and to that end a number of comparisons are made, this is scenario one and Bonferroni correction should be used. However, if one is interested in the specific relationship between, say, gender and television viewing, and the specific hypothesis is tested that the respondents' gender is not predictive of television viewing behaviour, then Bonferroni correction should not be used. Most statisticians are of the opinion that the study of a single topic or hypothesis should, in the case of using pre-defined statements and existing theory, not be affected by what goes on elsewhere in the world, or elsewhere in the study concerned, for that matter. Each little study done in the context of a larger study should be considered on its own merits. However, this point of view is not universally supported, and particularly in medicine there is an opinion that each test in a study should be considered in the light of the number of tests done in the study as a whole.
Scenario three concerns the situation when non-predefined hypotheses are pursued using many tests, one test for each hypothesis. Basically this concerns data 'dredging' or 'fishing': many among us will recognize correlation variables=all or t-test groups=sex(2) variables=all. Above all, this should not be done. Bonferroni correction is difficult in this situation, as the alpha level would have to be lowered very considerably given such a wealth of tests (potentially by a factor of r*(r-1)/2, where r is the number of variables), and most standard statistical packages are not able to provide small enough p-values to do it. SISA's advice, if you want to go ahead with it anyway, is to test at the 0.05 level for each test. After a relationship has been found, and this relationship is theoretically meaningful, the relationship should be confirmed in a separate study. This can be done after new data is collected or, in the same study, by using the 'split sample' method. The sample is split in two: one half is used to do the 'dredging', the other half is used to confirm the relationships found. The disadvantage of the split sample method is that you lose power (use the procedure power to estimate how much). A Bayesian method can be used if you want to formally incorporate the result of the original study or dredging in the confirmation process. But don't put too high a value on your original finding.
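To see why the dredging scenario is hopeless, the r*(r-1)/2 factor can be worked out directly. A short illustration (the function names are ours, not SISA's):

```python
def pairwise_tests(r):
    """Number of distinct correlations among r variables: r*(r-1)/2."""
    return r * (r - 1) // 2

def dredging_alpha(alpha, r):
    """Per-test alpha if full Bonferroni correction were applied
    to an all-pairs correlation run over r variables."""
    return alpha / pairwise_tests(r)

# Even a modest 20-variable dataset yields 190 correlations,
# pushing the per-test alpha down to about 0.00026.
print(pairwise_tests(20))           # 190
print(dredging_alpha(0.05, 20))     # ~0.000263
```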
Perneger TV. What is wrong with Bonferroni adjustments. British Medical Journal 1998;316:1236-1238.
Sankoh AJ, Huque MF, Dubey SD. Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Statistics in Medicine 1997;16:2529-2542.