DOI: 10.2307/2529310
OpenAccess: Closed

The Measurement of Observer Agreement for Categorical Data

J. R. Landis, Gary G. Koch

Categorical variable
Homogeneity (statistics)
This paper presents a general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies. The procedure essentially involves the construction of functions of the observed proportions which are directed at the extent to which the observers agree among themselves and the construction of test statistics for hypotheses involving these functions. Tests for interobserver bias are presented in terms of first-order marginal homogeneity and measures of interobserver agreement are developed as generalized kappa-type statistics. These procedures are illustrated with a clinical diagnosis example from the epidemiological literature.
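The abstract refers to kappa-type agreement statistics without giving formulas. As a rough illustration only, the sketch below computes the standard two-observer, unweighted kappa (observed agreement corrected for chance agreement) from a square table of counts. It is not the generalized procedure of the paper; the function name `cohen_kappa` and the example counts are hypothetical.

```python
def cohen_kappa(table):
    """Unweighted kappa for a square table of counts.

    table[i][j] is the number of subjects placed in category i by the
    first observer and in category j by the second observer.
    """
    k = len(table)
    if k == 0 or any(len(row) != k for row in table):
        raise ValueError("table must be square and non-empty")
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("table must contain at least one subject")

    # Marginal proportions for each observer.
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    # Observed agreement: proportion of subjects on the main diagonal.
    p_obs = sum(table[i][i] for i in range(k)) / n
    # Chance-expected agreement under independence of the two observers.
    p_exp = sum(row_marg[i] * col_marg[i] for i in range(k))

    if p_exp == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical counts for two observers and two diagnostic categories.
example_table = [[20, 5], [10, 15]]
example_kappa = cohen_kappa(example_table)
```

In this made-up table the observers agree on 35 of 50 subjects (0.7), chance agreement is 0.5, and kappa is therefore approximately 0.4.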
Referenced Papers:
DOI: 10.1080/01621459.1966.10502021
Cited 184 times
A Note on the Equivalence of Two Test Criteria for Hypotheses in Categorical Data
It is shown in this note that for testing linear hypotheses in categorical data Wald's statistic [1943] is equivalent (i.e., algebraically identical) to the $\chi^2_1$ statistic of Neyman [1949], and for testing nonlinear hypotheses it is equivalent to the $\chi^2_1$ statistic using Neyman's linearization technique, whenever these $\chi^2_1$ statistics are defined.
DOI: 10.1111/j.1467-9574.1975.tb00254.x
Cited 102 times
A review of statistical methods in the analysis of data arising from observer reliability studies (Part I)
This paper reviews research situations in medicine, epidemiology and psychiatry, in psychological measurement and testing, and in sample surveys in which the observer (rater or interviewer) can be an important source of measurement error. Moreover, most of the statistical literature in observer variability is surveyed with attention given to a notational unification of the various models proposed. In the continuous data case, the usual analysis of variance (ANOVA) components of variance models are presented with an emphasis on the intraclass correlation coefficient as a measure of reliability. Other modified ANOVA models, response error models in sample surveys, and related multivariate extensions are also discussed. For the categorical data case, special attention is given to measures of agreement and tests of hypotheses when the data consist of dichotomous responses. In addition, similarities between the dichotomous and continuous cases are illustrated in terms of intraclass correlation coefficients. Finally, measures of agreement, such as kappa and weighted kappa, are discussed in the context of nominal and ordinal data. A proposed unifying framework for the categorical data case is given in the form of concluding remarks.
DOI: 10.1080/00401706.1968.10490539
Cited 23 times
Hypotheses Of ‘No Interaction’ In Multi-dimensional Contingency Tables
The statistical analysis of multi-dimensional contingency tables is discussed from the point of view of the associated underlying model. Different formulations of hypotheses of ‘no interaction’ are considered. The corresponding test statistics are based on a general and computationally simple criterion originally due to Wald [1943]. The suggested methods are illustrated with several numerical examples.
DOI: 10.1037/h0031619
Cited 5,844 times
Measuring nominal scale agreement among many raters.
DOI: 10.1037/h0028106
Cited 1,287 times
Large sample standard errors of kappa and weighted kappa.
The statistics kappa (Cohen, 1960) and weighted kappa (Cohen, 1968) were introduced to provide coefficients of agreement between two raters for nominal scales. Kappa is appropriate when all disagreements may be considered equally serious, and weighted kappa is appropriate when the relative seriousness of the different possible disagreements can be specified. The papers describing these two statistics also present expressions for their standard errors. These expressions are incorrect, having been derived from the contradictory assumptions of fixed marginal totals and binomial variation of cell frequencies. Everitt (1968) derived the exact variances of weighted and unweighted kappa when the parameters are zero by assuming a generalized hypergeometric distribution. He found these expressions to be far too complicated for routine use, and offered, as alternatives, expressions derived by assuming binomial distributions. These alternative expressions are incorrect, essentially for the same reason as above. Assume that N subjects are distributed into $k^2$ cells by each of them being assigned to one of k categories by one rater and, independently, to one of the same k categories by a second
DOI: 10.1037/h0031643
Cited 608 times
Measures of response agreement for qualitative data: Some generalizations and alternatives.
DOI: 10.1080/00401706.1967.10490444
Open Access
Cited 41 times
A General Approach to the Estimation of Variance Components
A general method of estimation of variance components in random-effects models of the nested and/or classification type is considered. If a given parameter is estimable with respect to some particular experimental design (i.e., an unbiased estimate of the parameter may be obtained from the experiment), then the suggested estimator may be readily computed with only the aid of a desk calculator. The estimates are always unbiased and consistent (with respect to the structure of the experimental design); in the case of balanced experiments, they coincide with those obtained from the analysis of variance. Secondly, the problem of designing experiments to estimate variance components is briefly discussed from the point-of-view of the suggested estimation procedure. As a result, certain non-balanced designs are seen to yield more efficient estimators of particular parameters in specified situations than the corresponding balanced design using the same number of observations. Finally, the method of estimation is ...
DOI: 10.1080/00401706.1973.10489010
Cited 101 times
Errors of Measurement, Precision, Accuracy and the Statistical Comparison of Measuring Instruments
A very important and yet widely misunderstood concept or problem in science and technology is that of precision and accuracy of measurement. It is therefore necessary to define the terms precision and accuracy (or imprecision and inaccuracy) clearly and analytically if possible. Also, we need to establish and develop appropriate statistical tests of significance for these measures, since generally a relatively small number of measurements will be made or taken in most investigations. In this paper a discussion is given of some of the pertinent literature for estimating variances in errors of measurement, or the “imprecisions” of measurement, when two or three instruments are used to take the same observations on a series of items or characteristics. Also, present techniques for comparing the imprecision of measurement of one instrument with that of a second instrument through the use of statistical tests of significance are reviewed, as well as procedures for detecting the significance of the difference i...
DOI: 10.1177/001316446802800205
Cited 7 times
Estimating Individual Rater Reliabilities from Analysis of Treatment Effects
MAG: 2018385240
Cited 1,015 times
An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers.
This paper presents a general statistical methodology for the analysis of multivariate categorical data involving agreement among more than two observers. Since these situations give rise to very large contingency tables in which most of the observed cell frequencies are zero, procedures based on indicator variables of the raw data for individual subjects are used to generate first-order margins and main diagonal sums from the conceptual multidimensional contingency table. From these quantities, estimates are generated to reflect the strength of an internal majority decision on each subject. Moreover, a subset of observers who demonstrate a high level of interobserver agreement can be identified by using pairwise agreement statistics between each observer and the internal majority standard opinion on each subject. These procedures are all illustrated within the context of a clinical diagnosis example involving seven pathologists.
DOI: 10.1097/00010694-196006000-00016
Cited 2,602 times
The Analysis of Variance
Originally published in 1959, this classic volume has had a major impact on generations of statisticians. Newly issued in the Wiley Classics Series, the book examines the basic theory of analysis of variance by considering several different mathematical models. Part I looks at the theory of fixed-effects models with independent observations of equal variance, while Part II begins to explore the analysis of variance in the case of other models.
DOI: 10.2307/2529549
Cited 453 times
Measuring Agreement between Two Judges on the Presence or Absence of a Trait
At least a dozen indexes have been proposed for measuring agreement between two judges on a categorical scale. Using the binary (positive-negative) case as a model, this paper presents and critically evaluates some of these proposed measures. The importance of correcting for chance-expected agreement is emphasized, and identities with intraclass correlation coefficients are pointed out.
DOI: 10.1037/h0026256
Cited 6,652 times
Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit.
A previously described coefficient of agreement for nominal scales, kappa, treats all disagreements equally. A generalization to weighted kappa (Kw) is presented. The Kw provides for the incorporation of ratio-scaled degrees of disagreement (or agreement) to each of the cells of the $k \times k$ table of joi
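To make the idea of weighting concrete, here is a minimal sketch of a weighted kappa computed from agreement weights on the cells of a square table of counts. The default linear weights, 1 - |i - j| / (k - 1), are an assumption chosen for illustration and are not taken from the abstract above; the function name `weighted_kappa` and its `weights` parameter are likewise hypothetical.

```python
def weighted_kappa(table, weights=None):
    """Weighted kappa for a square table of counts.

    table[i][j] is the number of subjects placed in category i by the
    first rater and in category j by the second rater. weights[i][j] is
    an agreement weight in [0, 1], with 1 on the main diagonal.
    """
    k = len(table)
    if k < 2 or any(len(row) != k for row in table):
        raise ValueError("table must be square with at least two categories")
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("table must contain at least one subject")

    if weights is None:
        # Assumed default: linear agreement weights for ordered categories.
        weights = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]

    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    # Weighted observed agreement and weighted chance-expected agreement.
    p_obs = sum(weights[i][j] * table[i][j] / n
                for i in range(k) for j in range(k))
    p_exp = sum(weights[i][j] * row_marg[i] * col_marg[j]
                for i in range(k) for j in range(k))

    if p_exp == 1.0:
        raise ValueError("weighted kappa is undefined when chance agreement equals 1")
    return (p_obs - p_exp) / (1 - p_exp)
```

With identity weights (1 on the diagonal, 0 elsewhere) this reduces to unweighted kappa, which matches the description of kappa as treating all disagreements equally.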
DOI: 10.1080/00401706.1968.10490601
Cited 20 times
Some Further Remarks Concerning “A General Approach to the Estimation of Variance Components”
The estimates of Koch [1967a] have the undesirable property that they may change in value if the same constant is added to each of the observations. In this paper, an alternative procedure based on the same general principles is developed and applied to a variety of models. As before, the estimators obtained are unbiased and consistent. They are also reasonably easy to compute. Finally, in the case of balanced experiments, they coincide with those obtained from the analysis of variance. On the other hand, their structure is more complex than that of the estimators considered in the previous paper. In particular, the derivation of their covariance matrix is much more complicated, and hence no attempt has been made here to study its properties.
DOI: 10.1177/001316446002000104
Cited 29,343 times
A Coefficient of Agreement for Nominal Scales
DOI: 10.2307/2528901
Cited 1,414 times
Analysis of Categorical Data by Linear Models
Assume there are $n_{i\cdot}$, $i = 1, 2, \ldots, s$, samples from $s$ multinomial distributions each having $r$ categories of response. Then define any $u$ functions of the unknown true cell probabilities $\{\pi_{ij}: i = 1, 2, \ldots, s;\ j = 1, 2, \ldots, r\}$, where $\sum_j \pi_{ij} = 1$, that have derivatives up to the second order with respect to $\pi_{ij}$, and for which the matrix of first derivatives is of rank $u$. A general noniterative procedure is described for fitting these functions to a linear model, for testing the goodness-of-fit of the model, and for testing hypotheses about the parameters in the linear model. The special cases of linear functions and logarithmic functions of the $\pi_{ij}$ are developed in detail, and some examples of how the general approach can be used to analyze various types of categorical data are presented.
DOI: 10.1080/00401706.1959.10489861
Cited 21 times
The Measuring Process
This paper deals with the theory of a proposed method for the statistical study of measuring processes. The practical aspects of the method, including computational details, are discussed in a companion paper published in the ASTM Bulletin. In the present article a theoretical framework is proposed for the mathematical expression of the sources of variation in measuring methods and a suitable method of statistical analysis is described. Particular attention is given, both here and in the companion paper, to interlaboratory studies of test methods. An illustration based on data taken from the chemical literature is appended.
DOI: 10.2307/2529309
Open Access
Cited 554 times
A General Methodology for the Analysis of Experiments with Repeated Measurement of Categorical Data
This paper is concerned with the analysis of multivariate categorical data which are obtained from repeated measurement experiments. An expository discussion of pertinent hypotheses for such situations is given, and appropriate test statistics are developed through the application of weighted least squares methods. Special consideration is given to computational problems associated with the manipulation of large tables including the treatment of empty cells. Three applications of the methodology are provided.
DOI: 10.2307/2529683
Cited 41 times
An Analysis for Compounded Functions of Categorical Data
One area of application which has become increasingly important to statisticians and other researchers is the analysis of categorical data. Often the principal objective in such investigations is either the testing of appropriate hypotheses or the fitting of simplified models to the multi-dimensional contingency tables which arise when frequency counts are obtained for the respective cross-classifications of specific qualitative variables. Grizzle, Starmer, and Koch [1969] (subsequently abbreviated GSK) have described how linear regression models and weighted least squares can be used for this purpose. The resulting test statistics belong to the class of minimum modified chi-square due to Neyman [1949] which is equivalent to the general quadratic form criteria of Wald [1943]. As such, they have central $\chi^2$-distributions when the corresponding null hypotheses are true. Two alternative approaches to this methodology are that based on maximum likelihood as formulated by Bishop [1969; 1971] and Goodman [1970; 1971a, b] and that based on minimum discrimination information as formulated by Ku et al. [1971]. In each of the previously mentioned papers, primary emphasis was given to the formulation of models and the problems of analysis under various conditions of no interaction (see Roy and Kastenbaum [1956] or Bhapkar
DOI: 10.1016/0010-468x(76)90037-4
Open Access
Cited 110 times
A computer program for the generalized chi-square analysis of categorical data using weighted least squares (GENCAT)
GENCAT is a computer program which implements an extremely general methodology for the analysis of multivariate categorical data. This approach essentially involves the construction of test statistics for hypotheses involving functions of the observed proportions which are directed at the relationships under investigation and the estimation of corresponding model parameters via weighted least squares computations. Any compounded function of the observed proportions which can be formulated as a sequence of the following transformations of the data vector — linear, logarithmic, exponential, or the addition of a vector of constants — can be analyzed within this general framework. This algorithm produces minimum modified chi-square statistics which are obtained by partitioning the sums of squares as in ANOVA. The input data can be either: (a) frequencies from a multidimensional contingency table; (b) a vector of functions with its estimated covariance matrix; and (c) raw data in the form of integer-valued variables associated with each subject. The input format is completely flexible for the data as well as for the matrices.
DOI: 10.1080/01621459.1966.10480876
Cited 17 times
Assessing the Accuracy of Multivariate Observations
A generalization is given to the multivariate case of the linear model usually employed in the determination of accuracy of observations. Likelihood ratio tests are derived for testing hypotheses concerning systematic differences among observers, and a criterion is suggested for evaluating the magnitude of errors of measurement.
DOI: 10.2307/2281294
Cited 312 times
Statistical Theory in Research.
DOI: 10.1177/001316447303300309
Cited 2,506 times
The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability
DOI: 10.1080/01621459.1948.10483261
Cited 144 times
On Estimating Precision of Measuring Instruments and Product Variability
A measurement or observed value is discussed as the sum of two components: one the absolute value of the characteristic measured and the other an error of measurement. The variation in absolute values of the characteristic or items measured is termed product variability, whereas the variation in errors of measurement of an instrument is called the precision or reproducibility of measurement. Techniques are given for separating and estimating product variability and precision of measurement. Comparisons of the various techniques are also discussed for cases involving two or more instruments.
DOI: 10.1093/oxfordjournals.aje.a119582
Cited 29 times
DOI: 10.2307/2281708
Cited 60 times
Corrigenda: Measures of Association for Cross Classifications
DOI: 10.2307/2528038
Cited 39 times
On the Analysis of Contingency Tables with a Quantitative Response
This paper illustrates tests for some suitable hypotheses in analysis of contingency tables when some characters are quantitative. For a two-dimensional table tests are given for the hypothesis of homogeneity of mean scores, the hypothesis of linearity of regression of mean scores, and also for testing significance of regression of mean scores on the level of the other character. For a three-dimensional table some similar procedures are offered. It is briefly pointed out how such test criteria can be derived in a systematic manner by an application of a certain generalized least squares technique.
DOI: 10.2307/2528319
Cited 43 times
On the Hypotheses of 'No Interaction' in Contingency Tables
The statistical analysis of data in multi-dimensional contingency tables is discussed in terms of appropriate underlying probability models. Emphasis is placed on the distinction between 'factors' (such as treatments or blocks) which have fixed marginal totals and 'responses' (such as category of performance) which have random marginal totals. Hence, four principal cases arise: (i) the 'multi-response, no factor' tables, (ii) the 'multi-response, uni-factor' tables, (iii) the 'multi-response, multifactor' tables, (iv) the 'uni-response, multi-factor' tables. For situations (i) and (ii), the concept of 'no interaction' is related to questions regarding the pattern of association among responses. However, for situation (iv), it is related to how factors combine (e.g., additively) to determine the response distribution. Finally, for situation (iii), both types of questions arise. For each of the different types of tables, the problem of formulating appropriate hypotheses of 'no interaction' is considered. The corresponding test statistics are based upon a general and computationally simple criterion of Wald [1943]. The suggested methods are illustrated with several numerical examples.
DOI: 10.2307/2528934
Open Access
Cited 65 times
The Analysis of Categorical Data from Mixed Models
This paper is concerned with contingency tables which are analogous to the well-known mixed model in analysis of variance. The corresponding experimental situation involves exposing each of n subjects to each of the d levels of a given factor and classifying the d responses into one of r categories. The resulting data are represented in an $r \times r \times \cdots \times r$ contingency table of d dimensions. The hypothesis of principal interest is equality of the one-dimensional marginal distributions. Alternatively, if the r categories may be quantitatively scaled, then attention is directed at the hypothesis of equality of the mean scores over the d first order marginals. Test statistics are developed in terms of minimum Neyman $\chi^2$ or equivalently weighted least squares analysis of underlying linear models. As such, they bear a strong resemblance to the Hotelling $T^2$ procedures used with continuous data in mixed models. Several numerical examples are given to illustrate the use of the various methods discussed.
DOI: 10.2307/2556167
Cited 12 times
Reliability of Measurements for Studies of Cerebrovascular Atherosclerosis
For epidemiologic and comparative pathologic studies of cerebral atherosclerosis, assessment of reliability of measurements is necessary. Such a study is described along with the measurement method used. The development of the methodology for assessing reliability of data is presented. Within and between coder variability is estimated. For the biometrician, the salient feature is that several methods of determining reliability might have to be explored and tried before arriving at a method which is useful and acceptable to the clinician or clinical pathologist.
MAG: 2532157753
Cited 3 times
A general methodology for the measurement of observer agreement when the data are categorical
DOI: 10.1037/e465522008-010
Cited 15 times
A New Measure of Agreement Between Rank-Ordered Variables
MAG: 2798510847
Cited 1,468 times
Linear Models
“The Measurement of Observer Agreement for Categorical Data” is a paper by J. R. Landis and Gary G. Koch, published in the journal Biometrics in 1977. It was published by Wiley. It has an Open Access status of “closed”.