Somers' D or Kendall's tau-a with confidence intervals
somersd varlist [weight] [if] [in] [, taua transf(transformation_name) tdist cenind(cenind_list) cluster(varname) cfweight(expression) funtype(functional_type) wstrata(varlist) bstrata(varlist | _n) notree level(#) cimatrix(new_matrix) ]
where transformation_name is one of
iden | z | asin | rho | zrho | c
and functional_type is one of
wcluster | bcluster | vonmises
and cenind_list is a list of variable names and/or zeros.
fweights, iweights, and pweights are allowed; see weight.
bootstrap, by, jackknife, statsby, svy jackknife, svy bootstrap, svy brr and svy sdr are allowed; see prefix.
somersd computes confidence intervals for a wide range of rank statistics. The package includes three component modules, each with a .pdf manual distributed with the somersd package as an ancillary file. The modules are as follows:
Module      File           Calculates confidence intervals for
--------------------------------------------------------------------------
somersd     somersd.pdf    Kendall's tau-a and Somers' D
censlope    censlope.pdf   Theil-Sen median (and other percentile) slopes
cendif      cendif.pdf     Hodges-Lehmann median (and other percentile) differences
The modules censlope and cendif require the module somersd in order to work, and share many of its options.
The module somersd calculates values of Somers' D or Kendall's tau-a for the first variable of varlist as a predictor of each of the other variables in varlist, with estimates and jackknife variances and confidence intervals output as if for the parameters of a maximum likelihood fit. It is possible to use lincom to output confidence limits for differences between the population Somers' D or tau-a values.
Options for use with somersd
taua causes somersd to calculate Kendall's tau-a. If taua is absent, then Somers' D is calculated.
transf(transformation_name) specifies that the estimates are to be transformed, defining estimates and confidence limits for the transformed population value. iden (identity or untransformed) is the default. z specifies Fisher's z (the hyperbolic arctangent), asin specifies Daniels' arcsine, rho specifies Greiner's rho (Pearson correlation estimated using Greiner's relation), zrho specifies the z-transform of Greiner's rho, and c specifies Harrell's c. If the first variable of the varlist is a binary indicator of a disease and the other variables are quantitative predictors for that disease, then Harrell's c is the area under the receiver operating characteristic (ROC) curve. somersd recognizes the transformation names arctanh and atanh as synonyms for z, arcsin and arsin as synonyms for asin, sinph as a synonym for rho, zsinph as a synonym for zrho, and roc and auroc as synonyms for c. It also recognizes unambiguous abbreviations for transformation names, such as id for iden or aur for auroc. The transformations are calculated using a Mata function.
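As a rough illustration of what the six transformation names compute, the Python sketch below applies each one to a single estimate. The function name transform is hypothetical. The formulas are the standard textbook ones (Fisher's z as the hyperbolic arctangent, Greiner's relation rho = sin(pi*tau/2), Harrell's c = (D+1)/2), and the synonyms and abbreviations accepted by somersd are not handled. somersd itself performs these calculations in Mata, so this is a sketch and not its implementation.

```python
import math


def transform(estimate, name="iden"):
    """Apply a named transformation to a Somers' D or tau-a estimate.

    Hypothetical helper for illustration only; names follow transf().
    """
    if name == "iden":   # identity (untransformed)
        return estimate
    if name == "z":      # Fisher's z: hyperbolic arctangent
        return math.atanh(estimate)
    if name == "asin":   # Daniels' arcsine
        return math.asin(estimate)
    if name == "rho":    # Greiner's rho: sin(pi/2 * estimate)
        return math.sin(math.pi * estimate / 2)
    if name == "zrho":   # z-transform of Greiner's rho
        return math.atanh(math.sin(math.pi * estimate / 2))
    if name == "c":      # Harrell's c: (D + 1) / 2
        return (estimate + 1) / 2
    raise ValueError(f"unknown transformation: {name}")


if __name__ == "__main__":
    for name in ("iden", "z", "asin", "rho", "zrho", "c"):
        print(name, transform(0.5, name))
```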
tdist specifies that the estimates are assumed to have a t distribution with N-1 degrees of freedom, where N is the number of clusters if cluster() is specified, or the number of observations if cluster() is not specified. If tdist is not specified, then the standardized Somers' D estimates are assumed to be sampled from a standard Normal distribution. Simulation study data suggest that the tdist option should be recommended.
cenind(cenind_list) specifies a list of left- or right-censorship indicators, corresponding to the variables mentioned in the varlist. Each censorship indicator is either a variable name or a zero. If the censorship indicator corresponding to a variable is the name of a second variable, then this second variable is used to indicate the censorship status of the first variable, which is assumed to be left-censored (at or below its stated value) in observations in which the second variable is negative, right-censored (at or above its stated value) in observations in which the second variable is positive, and uncensored (equal to its stated value) in observations in which the second variable is zero. If the censorship indicator corresponding to a variable is a zero, then the variable is assumed to be uncensored. If cenind() is unspecified, then all variables in the varlist are assumed to be uncensored. If the list of censorship indicators specified by cenind() is shorter than the list of variables specified in the varlist, then the list of censorship indicators is completed with the required number of zeros on the right.
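The pairing of varlist entries with censorship indicators, including the completion with zeros on the right, can be sketched in Python as follows. The function name complete_cenind is hypothetical. The documentation does not state what somersd does when cenind() is longer than the varlist, so raising an error in that case is purely an assumption of this sketch.

```python
def complete_cenind(varlist, cenind_list=()):
    """Pair each variable with its censorship indicator ('0' = uncensored).

    Hypothetical helper: pads a short cenind list with zeros on the right.
    """
    if len(cenind_list) > len(varlist):
        # Assumption: the help text does not describe this case.
        raise ValueError("more censorship indicators than variables")
    padded = list(cenind_list) + ["0"] * (len(varlist) - len(cenind_list))
    return dict(zip(varlist, padded))


if __name__ == "__main__":
    # somersd studytime drug youth, cenind(censind)
    print(complete_cenind(["studytime", "drug", "youth"], ["censind"]))
    # somersd drug studytime, cenind(0 censind)
    print(complete_cenind(["drug", "studytime"], ["0", "censind"]))
```

In the first call only studytime is treated as possibly censored; in the second, the leading zero marks drug as uncensored so that censind applies to studytime.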
cluster(varname) specifies the variable which defines sampling clusters. If cluster() is specified, then the variances and confidence limits are calculated assuming that the data represent a sample of clusters from a population of clusters, rather than a sample of observations from a population of observations.
cfweight(expression) specifies an expression giving the cluster frequency weights. These cluster frequency weights must have the same value for all observations in a cluster. If cfweight() and cluster() are both specified, then each cluster in the dataset is assumed to represent a number of identical clusters equal to the cluster frequency weight for that cluster. If cfweight() is specified and cluster() is unspecified, then each observation in the dataset is treated as a cluster, and assumed to represent a number of identical one-observation clusters equal to the cluster frequency weight. For more details on the interpretation of weights, see Interpretation of weights below.
funtype(functional_type) specifies whether the Somers' D or Kendall's tau-a functionals estimated are between-cluster, within-cluster or Von Mises functionals. These three functional types are specified by the options funtype(bcluster), funtype(wcluster) or funtype(vonmises), respectively. If funtype() is not specified, then funtype(bcluster) is assumed, and between-cluster functionals are estimated. The within-cluster Somers' D is a generalization of the confidence interval corresponding to the sign test. The Gini coefficient is a special case of the clustered Von Mises Somers' D. For further details, see the manual somersd.pdf, distributed with somersd as an ancillary file.
wstrata(varlist) specifies a list of variables whose value combinations are the W strata. If wstrata() is specified, then somersd estimates stratified Somers' D or Kendall's tau-a parameters, applying only to pairs of observations within the same W stratum. These parameters can be used to measure associations within strata, such as associations between an outcome and an exposure within groups defined by values of a confounder, or by values of a propensity score based on multiple confounders.
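To make the restriction to within-stratum pairs concrete, the following Python sketch computes a sample Somers' D by brute force, counting only pairs of observations that share a stratum. It assumes uncensored, unweighted, unclustered data, and it assumes that the stratified parameter is the ratio of the within-stratum tau-a of X with Y to the within-stratum tau-a of X with X. The function names are hypothetical, and the sketch gives point estimates only, with no confidence limits.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def stratified_somers_d(x, y, strata):
    """Sample Somers' D(Y|X) using only pairs within the same stratum."""
    num = 0  # sum of sign(xi-xj)*sign(yi-yj) over same-stratum pairs
    den = 0  # sum of sign(xi-xj)**2 over same-stratum pairs
    n = len(x)
    for i in range(n):
        for j in range(n):
            if i != j and strata[i] == strata[j]:
                sx = sign(x[i] - x[j])
                num += sx * sign(y[i] - y[j])
                den += sx * sx
    return num / den


if __name__ == "__main__":
    x = [0, 1, 0, 1]   # exposure
    y = [1, 2, 3, 4]   # outcome
    w = ["a", "a", "b", "b"]   # confounder strata
    print(stratified_somers_d(x, y, w))          # within strata
    print(stratified_somers_d(x, y, ["all"] * 4))  # one stratum = unstratified
```

In this toy dataset the exposed observation has the higher outcome in both strata, so the within-stratum value is larger than the unstratified one, which is diluted by a cross-stratum discordant pair.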
bstrata(varlist | _n) specifies the B strata. If bstrata() is specified, then somersd estimates Somers' D or Kendall's tau-a parameters specific to pairs of observations from different B strata. These B strata are either combinations of values of a list of variables (if varlist is specified) or the individual observations (if _n is specified). B strata will not often be required. However, if we are estimating the within-cluster Kendall's tau-a (using the options taua funtype(wcluster)), then the additional option bstrata(_n) will ensure that the within-cluster Kendall's tau-a can take the whole range of values from -1 (in the case of complete discordance within clusters) to +1 (in the case of complete concordance within clusters).
notree specifies that somersd does not use the default search tree algorithm based on Newson (2006a), but instead uses a trivial algorithm, which compares every pair of observations and requires much more time with large datasets. This option is rarely used except to compare performance. Both algorithms are implemented in Mata, using a set of Mata functions, whose source code is distributed with the somersd package.
level(#) specifies the confidence level, as a percentage, for confidence intervals of the estimates; see level.
cimatrix(new_matrix) specifies an output matrix to be created, containing estimates and confidence limits for the untransformed Somers' D, Kendall's tau-a or Greiner's rho parameters. If transf() is specified, then the confidence limits will be asymmetric and based on symmetric confidence limits for the transformed parameters. This option (like level()) may be used in replay mode as well as in nonreplay mode.
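The asymmetry of back-transformed limits can be seen in a small Python sketch. It assumes the Fisher's z transformation, a Normal quantile (that is, tdist not specified), and an estimate and standard error already on the z scale; the function name and the input values are hypothetical and do not come from any somersd output.

```python
import math
from statistics import NormalDist


def backtransformed_ci(z_hat, se_z, level=95.0):
    """Back-transform a symmetric CI on the z scale to the D scale.

    Hypothetical helper: returns (estimate, lower, upper) for the
    untransformed parameter, using tanh as the inverse of Fisher's z.
    """
    q = NormalDist().inv_cdf(0.5 + level / 200.0)  # e.g. about 1.96 for 95%
    lower_z = z_hat - q * se_z   # symmetric limits on the z scale
    upper_z = z_hat + q * se_z
    return math.tanh(z_hat), math.tanh(lower_z), math.tanh(upper_z)


if __name__ == "__main__":
    est, lo, hi = backtransformed_ci(1.0, 0.2)
    print(est, lo, hi)
    print("distance below:", est - lo, "distance above:", hi - est)
```

For a positive estimate the lower limit lies further from the estimate than the upper limit, because tanh compresses values as they approach 1.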
For uncensored variables X and Y, Kendall's tau-a is defined as
tau_a(X,Y) = E[sign(X1-X2)*sign(Y1-Y2)]
where (X1,Y1) and (X2,Y2) are sampled from the bivariate distribution of X and Y. In the case of censored variables X and Y, with censorship indicators R and S, respectively, which are negative for left-censorship, positive for right-censorship, and zero for noncensorship, we define Kendall's tau-a as
tau_a(X,Y) = E[csign(X1,R1,X2,R2)*csign(Y1,S1,Y2,S2)]
where the function
csign(U,P,V,Q)
is defined as 1 if U>V and P>=0>=Q, -1 if U<V and P<=0<=Q, and 0 otherwise.
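The censored sign function csign(U,P,V,Q), with values U and V and censorship indicators P and Q, can be written out directly. The Python sketch below does so and uses it in a brute-force sample version of the censored Kendall's tau-a, taken here as a plain average over ordered pairs of distinct observations. That sample version is an assumption for illustration (unweighted, unclustered data); somersd uses a search tree algorithm in Mata.

```python
def csign(u, p, v, q):
    """Censored sign of u - v, given censorship indicators p and q.

    Negative indicator = left-censored, positive = right-censored,
    zero = uncensored.
    """
    if u > v and p >= 0 >= q:
        return 1    # u is at least its stated value, v at most its stated value
    if u < v and p <= 0 <= q:
        return -1   # u is at most its stated value, v at least its stated value
    return 0        # ordering of the true values is not known


def censored_tau_a(x, r, y, s):
    """Brute-force sample tau-a for X (censoring R) and Y (censoring S)."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += (csign(x[i], r[i], x[j], r[j])
                          * csign(y[i], s[i], y[j], s[j]))
    return total / (n * (n - 1))


if __name__ == "__main__":
    x, r = [1, 2, 3], [0, 0, 0]   # uncensored predictor
    y, s = [5, 7, 9], [0, 1, 0]   # second Y value is right-censored
    print(censored_tau_a(x, r, y, s))
```

A right-censored value can be known to exceed a smaller uncensored value, but not to fall below a larger one, which is why some pairs contribute zero.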
Somers' D is defined as
D(Y|X) = tau_a(X,Y)/tau_a(X,X)
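For uncensored data, the two definitions translate into a few lines of brute-force Python. This is a sketch of the sample statistics only, assuming unweighted and unclustered observations; the function names are hypothetical, no confidence limits are computed, and the pairwise loop corresponds to a trivial every-pair comparison, not to the search tree algorithm that somersd uses by default.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def tau_a(x, y):
    """Sample Kendall's tau-a: mean of sign(xi-xj)*sign(yi-yj) over i != j."""
    n = len(x)
    total = sum(sign(x[i] - x[j]) * sign(y[i] - y[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def somers_d(x, y):
    """Sample Somers' D(Y|X) = tau_a(X,Y) / tau_a(X,X)."""
    return tau_a(x, y) / tau_a(x, x)


if __name__ == "__main__":
    x = [0, 0, 1, 1]   # predictor (X-variable)
    y = [1, 3, 2, 4]   # outcome (Y-variable)
    print(tau_a(x, y), tau_a(x, x), somers_d(x, y))
```

Dividing by tau_a(X,X) removes the pairs that are tied on X from the comparison, so Somers' D is the difference between the proportions of concordant and discordant pairs among pairs with distinct X values.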
In the case of a binary X-variable, Somers' D is the parameter tested for a zero value by the Mann-Whitney U test. If X is a disease indicator and Y is a quantitative diagnostic measure, then Somers' D is related to the area A under the ROC curve by the formula
D(Y|X) = 2*A - 1
and confidence limits for A can be calculated by specifying the option transf(c). The covariance matrix is estimated by jackknifing the underlying U statistics and using Taylor polynomials. Confidence intervals for differences and other contrasts can be calculated using lincom. Confidence intervals for Theil-Sen median (and other percentile) slopes (or per-unit ratios) can be calculated using censlope, which is distributed as part of the somersd package. Confidence intervals for Hodges-Lehmann median (and other percentile) differences (and ratios) between two groups can be calculated using cendif, which is also distributed as part of the somersd package.
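The relation between Somers' D and the ROC area A, namely D(Y|X) = 2*A - 1, can be checked numerically. The Python sketch below computes A by counting (diseased, non-diseased) pairs, with ties counted as one half, and computes D from its ratio definition. It assumes uncensored, unweighted data; the function names and the toy values are hypothetical.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def roc_area(disease, score):
    """Area under the ROC curve by pair counting (ties count one half)."""
    cases = [s for d, s in zip(disease, score) if d == 1]
    controls = [s for d, s in zip(disease, score) if d == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def somers_d(x, y):
    """Sample Somers' D(Y|X) as a ratio of pairwise sign sums."""
    n = len(x)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    num = sum(sign(x[i] - x[j]) * sign(y[i] - y[j]) for i, j in pairs)
    den = sum(sign(x[i] - x[j]) ** 2 for i, j in pairs)
    return num / den


if __name__ == "__main__":
    disease = [0, 0, 0, 1, 1]   # X: disease indicator
    score = [1, 2, 4, 3, 5]     # Y: diagnostic measure
    a = roc_area(disease, score)
    d = somers_d(disease, score)
    print(a, d, 2 * a - 1)
```

Equivalently, A = (D + 1)/2, which is the rescaling applied when Harrell's c is requested.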
Full documentation of the somersd package (including methods and formulas) is provided in the files somersd.pdf, censlope.pdf, and cendif.pdf, which are distributed with the somersd package (see net). They can be viewed using the Adobe Acrobat Reader, which can be downloaded from the Adobe Acrobat website. somersd uses a library of Mata functions, and the source code for these functions is distributed with somersd as installation files.
For a comprehensive review of Kendall's tau-a, Somers' D and median differences, see Newson (2002). The statistical and computational methods used by the somersd package are described in detail in Newson (2006a), Newson (2006b) and Newson (2006c).
Interpretation of weights
somersd inputs up to two weight expressions, which are the ordinary Stata weights given by the weight and the cluster frequency weights given by the cfweight() option. Internally, somersd defines and uses three distinct sets of weights, which are the cluster frequency weights, the observation frequency weights, and the importance weights.
The cluster frequency weights must be the same for different observations in a cluster, and imply that each cluster in the input dataset represents a number of identical clusters equal to the cluster frequency weight in that cluster. If cluster() is not specified, then the individual observations are clusters, and the cluster frequency weight implies that each one-observation cluster represents a number of identical one-observation clusters equal to the cluster frequency weight. The cluster frequency weights are given by cfweight() if that option is specified; are set to 1 if cfweight() is unspecified and cluster() is specified; are equal to the ordinary Stata weights if neither cluster() nor cfweight() is specified and the ordinary Stata weights are fweights; and are equal to 1 otherwise.
The observation frequency weights are summed over all observations in the input dataset to produce the number of observations reported by somersd and returned in the estimation result e(N), and are not used in any other way. They are set by cfweight() if that option is specified and the ordinary Stata weights are not fweights, are equal to the ordinary Stata weights if cfweight() is unspecified and the ordinary Stata weights are fweights, are equal to the product of the cfweight() expression and the ordinary Stata weights if cfweight() is specified and the ordinary Stata weights are fweights, and are equal to 1 otherwise.
The importance weights are used as described in the Methods and Formulas section of the file somersd.pdf distributed with the somersd package. They are equal to the ordinary Stata weights if these are specified and either cluster() or cfweight() is specified, are equal to the ordinary Stata weights if neither of these two options is specified and the ordinary Stata weights are specified as pweights or iweights, and are equal to 1 otherwise.
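The rules in the three paragraphs above amount to a small decision table. The Python sketch below encodes them for a single observation, as one reading of the text and not as the somersd source code. The function name resolve_weights, the Weights tuple and the argument names are hypothetical: w is the ordinary Stata weight value (None if no weight is given), wtype is its type, cfw is the cfweight() value (None if unspecified), and cluster says whether cluster() is specified.

```python
from typing import NamedTuple


class Weights(NamedTuple):
    cluster_freq: float   # cluster frequency weight
    obs_freq: float       # observation frequency weight (summed into e(N))
    importance: float     # importance weight


def resolve_weights(w=None, wtype=None, cfw=None, cluster=False):
    """Resolve the three internal weights for one observation."""
    has_w = w is not None
    has_cfw = cfw is not None
    is_fw = has_w and wtype == "fweight"

    # Cluster frequency weights
    if has_cfw:
        cluster_freq = cfw
    elif cluster:
        cluster_freq = 1
    elif is_fw:
        cluster_freq = w
    else:
        cluster_freq = 1

    # Observation frequency weights
    if has_cfw and is_fw:
        obs_freq = cfw * w
    elif has_cfw:
        obs_freq = cfw
    elif is_fw:
        obs_freq = w
    else:
        obs_freq = 1

    # Importance weights
    if has_w and (cluster or has_cfw):
        importance = w
    elif has_w and wtype in ("pweight", "iweight"):
        importance = w
    else:
        importance = 1

    return Weights(cluster_freq, obs_freq, importance)


if __name__ == "__main__":
    print(resolve_weights(w=3, wtype="fweight"))
    print(resolve_weights(w=3, wtype="fweight", cluster=True))
    print(resolve_weights(w=2, wtype="pweight", cfw=5))
```

Note how the same fweight of 3 acts as a cluster frequency weight when neither cluster() nor cfweight() is given, but as an importance weight once cluster() is specified.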
. somersd foreign mpg weight, tr(z)
. somersd us gpm weight
. lincom (weight-gpm)/2
. somersd us gpm weight, tr(c)
. lincom weight-gpm
. somersd mpg weight displ, taua tr(z) cluster(manuf)
The following example demonstrates the cenind() option:
. use http://www.stata-press.com/data/r9/drugtr, clear
. gene youth=100-age
. gene byte censind=1-died
. somersd studytime drug youth, tr(c) cenind(censind)
. lincom drug-youth
. sts test drug, wilcoxon
. somersd drug studytime, tr(z) cenind(0 censind)
somersd saves the following in e():
Scalars
  e(N)             number of observations
  e(N_clust)       number of clusters
  e(df_r)          residual degrees of freedom
  e(denominator)   common denominator
  e(depvarsum)     sum of X-variable in estimation sample
Macros
  e(cmd)           somersd
  e(cmdline)       command as typed
  e(param)         parameter (somersd or taua)
  e(parmlab)       parameter label in output
  e(tdist)         tdist if specified
  e(depvar)        name of X-variable
  e(clustvar)      name of cluster variable
  e(vcetype)       title used to label standard error
  e(wtype)         weight type
  e(wexp)          weight expression
  e(cfweight)      cfweight() expression
  e(funtype)       funtype() option
  e(wstrata)       wstrata() option
  e(bstrata)       bstrata() option
  e(predict)       program called by predict (somers_p)
  e(transf)        transformation specified by transf()
  e(tranlab)       transformation label in output
  e(properties)    b V
Matrices
  e(b)             coefficient vector
  e(V)             variance-covariance matrix of the estimators
Functions
  e(sample)        marks estimation sample
Note that (confusingly) e(depvar) is the X-variable, or predictor variable, in the conventional terminology for defining Somers' D. somersd is also different from most estimation commands in that its results are not designed to be used by predict. If the user tries to do so, then the program somers_p is called, and tells the user that predict should not be used after somersd. The scalar e(denominator) contains the common denominator used in calculating the Somers' D or Kendall's tau-a statistics.
Roger Newson, Imperial College London, UK. Email: email@example.com
Newson, R. 2002. Parameters behind "nonparametric" statistics: Kendall's tau, Somers' D and median differences. Stata Journal 2: 45-64.
Newson, R. 2006a. Efficient calculation of jackknife confidence intervals for rank statistics. Journal of Statistical Software 15: 1-10.
Newson, R. 2006b. Confidence intervals for rank statistics: Somers' D and extensions. Stata Journal 6: 309-334.
Newson, R. 2006c. Confidence intervals for rank statistics: Percentile slopes, differences, and ratios. Stata Journal 6: 497-520.
Manual: [R] spearman, [R] ranksum, [R] signrank, [R] roc
STB: STB-52: sg123, STB-55: snp15, STB-57: snp15.1, STB-58: snp15.2, STB-58: snp16, STB-61: snp15.3, STB-61: snp16.1
Online: ktau, ranksum, signrank, roc, lincom, jknife, cendif, censlope, if installed