-------------------------------------------------------------------------------
help for somersd    (SJ6-3: snp15_6; SJ5-3: snp15_5; SJ3-3: snp15_4; STB-61: snp15_3; STB-58: snp15_2; STB-57: snp15)
-------------------------------------------------------------------------------

Somers' D or Kendall's tau-a with confidence intervals

    somersd varlist [weight] [if] [in] [, taua transf(transformation_name) tdist
        cenind(cenind_list) cluster(varname) cfweight(expression)
        funtype(functional_type) wstrata(varlist) bstrata(varlist|_n) notree
        level(#) cimatrix(new_matrix)]

where transformation_name is one of

    iden | z | asin | rho | zrho | c

and functional_type is one of

    wcluster | bcluster | vonmises

and cenind_list is a list of variable names and/or zeros.

fweights, iweights, and pweights are allowed; see weight.

bootstrap, by, jackknife, statsby, svy jackknife, svy bootstrap, svy brr, and svy sdr are allowed; see prefix.

Description

somersd computes confidence intervals for a wide range of rank statistics. It includes 3 component modules, each with a .pdf manual, which is distributed with the somersd package as an ancillary file. The modules are as follows:

    Module     File           Calculates confidence intervals for
    --------------------------------------------------------------------------
    somersd    somersd.pdf    Kendall's tau-a and Somers' D
    censlope   censlope.pdf   Theil-Sen median (and other percentile) slopes
    cendif     cendif.pdf     Hodges-Lehmann median (and other percentile) differences

The modules censlope and cendif require the module somersd in order to work, and share many of its options.

The module somersd calculates values of Somers' D or Kendall's tau-a for the first variable of varlist as a predictor of each of the other variables in varlist, with estimates and jackknife variances and confidence intervals output as if for the parameters of a maximum likelihood fit. It is possible to use lincom to output confidence limits for differences between the population Somers' D or tau-a values.

Options for use with somersd

taua causes somersd to calculate Kendall's tau-a. If taua is absent, then Somers' D is calculated.

transf(transformation_name) specifies that the estimates are to be transformed, defining a confidence level for the transformed population value. iden (identity or untransformed) is the default. z specifies Fisher's z (the hyperbolic arctangent), asin specifies Daniels' arcsine, rho specifies Greiner's rho (Pearson correlation estimated using Greiner's relation), zrho specifies the z-transform of Greiner's rho, and c specifies Harrell's c. If the first variable of the varlist is a binary indicator of a disease and the other variables are quantitative predictors for that disease, then Harrell's c is the area under the receiver operating characteristic (ROC) curve. somersd recognizes the transformation names arctanh and atanh as synonyms for z, arcsin and arsin as synonyms for asin, sinph as a synonym for rho, zsinph as a synonym for zrho, and roc and auroc as synonyms for c. It also recognizes unambiguous abbreviations for transformation names, such as id for iden or aur for auroc. The transformations are calculated using a Mata function.
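As an illustration only, the transformations named above might be sketched in Python as follows. The formulas are the standard ones suggested by the names (hyperbolic arctangent for z, arcsine for asin, Greiner's relation rho = sin(pi*tau/2) for rho, and c = (D+1)/2 for Harrell's c). The names TRANSFORMS and transform are hypothetical and are not part of somersd, which computes its transformations in Mata.

```python
import math

# Hypothetical lookup table (not part of somersd): each entry maps an
# untransformed estimate d (Somers' D or Kendall's tau-a, in [-1, 1])
# to the transformed scale.
TRANSFORMS = {
    "iden": lambda d: d,                                      # identity
    "z": lambda d: math.atanh(d),                             # Fisher's z
    "asin": lambda d: math.asin(d),                           # Daniels' arcsine
    "rho": lambda d: math.sin(math.pi * d / 2),               # Greiner's rho
    "zrho": lambda d: math.atanh(math.sin(math.pi * d / 2)),  # z of Greiner's rho
    "c": lambda d: (d + 1) / 2,                               # Harrell's c
}


def transform(name, d):
    """Apply the named transformation to the estimate d.

    math.atanh raises ValueError at d = -1 or d = 1, where Fisher's z
    is infinite; this sketch does not handle that boundary case.
    """
    return TRANSFORMS[name](d)
```

For example, transform("c", 0.5) gives 0.75, the ROC area corresponding to a Somers' D of 0.5.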

tdist specifies that the estimates are assumed to have a t distribution with N-1 degrees of freedom, where N is the number of clusters if cluster() is specified, or the number of observations if cluster() is not specified. If tdist is not specified, then the standardized Somers' D estimates are assumed to be sampled from a standard Normal distribution. Simulation study data suggest that the tdist option should be recommended.

cenind(cenind_list) specifies a list of left- or right-censorship indicators, corresponding to the variables mentioned in the varlist. Each censorship indicator is either a variable name or a zero. If the censorship indicator corresponding to a variable is the name of a second variable, then this second variable is used to indicate the censorship status of the first variable, which is assumed to be left-censored (at or below its stated value) in observations in which the second variable is negative, right-censored (at or above its stated value) in observations in which the second variable is positive, and uncensored (equal to its stated value) in observations in which the second variable is zero. If the censorship indicator corresponding to a variable is a zero, then the variable is assumed to be uncensored. If cenind() is unspecified, then all variables in the varlist are assumed to be uncensored. If the list of censorship indicators specified by cenind() is shorter than the list of variables specified in the varlist, then the list of censorship indicators is completed with the required number of zeros on the right.
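The rules above for reading a censorship indicator and for completing a short cenind() list can be sketched in Python. This is only an illustration of the description: the function names censor_status and complete_cenind are hypothetical, and raising an error for an over-long list is an assumption of the sketch, since the text above does not say how somersd handles that case.

```python
def censor_status(indicator_value):
    """Classify one observation from the sign of its censorship indicator."""
    if indicator_value < 0:
        return "left"        # true value is at or below the stated value
    if indicator_value > 0:
        return "right"       # true value is at or above the stated value
    return "uncensored"      # true value equals the stated value


def complete_cenind(varlist, cenind_list):
    """Pad the indicator list on the right with zeros to match varlist."""
    if len(cenind_list) > len(varlist):
        # Assumption of this sketch only; not documented behavior.
        raise ValueError("more censorship indicators than variables")
    return list(cenind_list) + ["0"] * (len(varlist) - len(cenind_list))
```

With varlist studytime drug youth and cenind(censind), complete_cenind returns censind 0 0, so that only studytime is treated as possibly censored.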

cluster(varname) specifies the variable which defines sampling clusters. If cluster() is specified, then the variances and confidence limits are calculated assuming that the data represent a sample of clusters from a population of clusters, rather than a sample of observations from a population of observations.

cfweight(expression) specifies an expression giving the cluster frequency weights. These cluster frequency weights must have the same value for all observations in a cluster. If cfweight() and cluster() are both specified, then each cluster in the dataset is assumed to represent a number of identical clusters equal to the cluster frequency weight for that cluster. If cfweight() is specified and cluster() is unspecified, then each observation in the dataset is treated as a cluster, and assumed to represent a number of identical one-observation clusters equal to the cluster frequency weight. For more details on the interpretation of weights, see Interpretation of weights below.

funtype(functional_type) specifies whether the Somers' D or Kendall's tau-a functionals estimated are between-cluster, within-cluster or Von Mises functionals. These three functional types are specified by the options funtype(bcluster), funtype(wcluster) or funtype(vonmises), respectively. If funtype() is not specified, then funtype(bcluster) is assumed, and between-cluster functionals are estimated. The within-cluster Somers' D is a generalization of the confidence interval corresponding to the sign test. The Gini coefficient is a special case of the clustered Von Mises Somers' D. For further details, see the manual somersd.pdf, distributed with somersd as an ancillary file.

wstrata(varlist) specifies a list of variables whose value combinations are the W strata. If wstrata() is specified, then somersd estimates stratified Somers' D or Kendall's tau-a parameters, applying only to pairs of observations within the same W stratum. These parameters can be used to measure associations within strata, such as associations between an outcome and an exposure within groups defined by values of a confounder, or by values of a propensity score based on multiple confounders.

bstrata(varlist|_n) specifies the B strata. If bstrata() is specified, then somersd estimates Somers' D or Kendall's tau-a parameters specific to pairs of observations from different B strata. These B strata are either combinations of values of a list of variables (if varlist is specified) or the individual observations (if _n is specified). B strata will not often be required. However, if we are estimating the within-cluster Kendall's tau-a (using the options taua funtype(wcluster)), then the additional option bstrata(_n) will ensure that the within-cluster Kendall's tau-a can take the whole range of values from -1 (in the case of complete discordance within clusters) to +1 (in the case of complete concordance within clusters).

notree specifies that somersd does not use the default search tree algorithm based on Newson (2006a), but instead uses a trivial algorithm, which compares every pair of observations and requires much more time with large datasets. This option is rarely used except to compare performance. Both algorithms are implemented in Mata, using a set of Mata functions, whose source code is distributed with the somersd package.

level(#) specifies the confidence level, as a percentage, for confidence intervals of the estimates; see level.

cimatrix(new_matrix) specifies an output matrix to be created, containing estimates and confidence limits for the untransformed Somers' D, Kendall's tau-a or Greiner's rho parameters. If transf() is specified, then the confidence limits will be asymmetric and based on symmetric confidence limits for the transformed parameters. This option (like level()) may be used in replay mode as well as in nonreplay mode.
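To see why back-transformed limits are asymmetric, consider the following minimal Python sketch for Fisher's z. It assumes that an estimate and standard error on the z scale are already available and that Normal critical values are used (as when tdist is not specified). The function name backtransformed_ci is hypothetical and is not the code used by somersd.

```python
import math
from statistics import NormalDist


def backtransformed_ci(z_est, z_se, level=95.0):
    """Return (estimate, lower, upper) on the untransformed scale.

    The limits are symmetric around z_est on Fisher's z scale and are
    mapped back with tanh, the inverse of the hyperbolic arctangent.
    """
    q = NormalDist().inv_cdf(0.5 + level / 200.0)  # about 1.96 for level=95
    lower_z = z_est - q * z_se
    upper_z = z_est + q * z_se
    return math.tanh(z_est), math.tanh(lower_z), math.tanh(upper_z)
```

For a positive estimate, the lower limit lies further from the back-transformed estimate than the upper limit does, because tanh flattens as it approaches 1.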

Remarks

For uncensored variables X and Y, Kendall's tau-a is defined as

    tau_a(X,Y) = E[sign(X1-X2)*sign(Y1-Y2)]

where (X1,Y1) and (X2,Y2) are sampled independently from the bivariate distribution of X and Y. In the case of censored variables X and Y, with censorship indicators R and S, respectively, which are negative for left-censorship, positive for right-censorship, and zero for noncensorship, we define Kendall's tau-a as

    tau_a(X,Y) = E[csign(X1,R1,X2,R2)*csign(Y1,S1,Y2,S2)]

where the function csign(U,P,V,Q) is defined as 1 if U>V and P>=0>=Q, -1 if U<V and P<=0<=Q, and 0 otherwise.
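The definitions above can be made concrete with a small Python sketch that computes the sample analogue of tau-a by averaging over all ordered pairs of distinct observations. It is unweighted and unclustered, and it compares every pair directly, so it illustrates the definition rather than the search tree algorithm that somersd uses by default. The function names are hypothetical.

```python
def csign(u, p, v, q):
    """Censored sign of u - v, given censorship indicators p and q."""
    if u > v and p >= 0 >= q:
        return 1    # u is known to be at least v
    if u < v and p <= 0 <= q:
        return -1   # u is known to be at most v
    return 0        # tied, or the ordering is not known


def tau_a(x, y, r=None, s=None):
    """Sample tau-a of x and y with optional censorship indicators r and s."""
    n = len(x)
    if r is None:
        r = [0] * n  # all x-values uncensored
    if s is None:
        s = [0] * n  # all y-values uncensored
    total = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += (csign(x[i], r[i], x[j], r[j])
                          * csign(y[i], s[i], y[j], s[j]))
    return total / (n * (n - 1))
```

With x = y = [1, 2, 3] and no censoring, tau_a returns 1. If the first y-value is right-censored (s = [1, 0, 0]), the pairs involving that observation contribute 0 and the result falls to 1/3.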

Somers' D is defined as

    D(Y|X) = tau_a(X,Y)/tau_a(X,X)

In the case of a binary X-variable, Somers' D is the parameter tested for a zero value by the Mann-Whitney U test. If X is a disease indicator and Y is a quantitative diagnostic measure, then Somers' D is related to the area A under the ROC curve by the formula

    A = [D(Y|X)+1]/2

and confidence limits for A can be calculated by specifying the option transf(c). The covariance matrix is estimated by jackknifing the underlying U statistics and using Taylor polynomials. Confidence intervals for differences and other contrasts can be calculated using lincom. Confidence intervals for Theil-Sen median (and other percentile) slopes (or per-unit ratios) can be calculated using censlope, which is distributed as part of the somersd package. Confidence intervals for Hodges-Lehmann median (and other percentile) differences (and ratios) between two groups can be calculated using cendif, which is also distributed as part of the somersd package.

Full documentation of the somersd package (including methods and formulas) is provided in the files somersd.pdf, censlope.pdf, and cendif.pdf, which are distributed with the somersd package (see net). They can be viewed using the Adobe Acrobat Reader, which can be downloaded from the Adobe Acrobat website. somersd uses a library of Mata functions, and the source code for these functions is distributed with somersd as installation files.

For a comprehensive review of Kendall's tau-a, Somers' D and median differences, see Newson (2002). The statistical and computational methods used by the somersd package are described in detail in Newson (2006a), Newson (2006b) and Newson (2006c).
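As a worked illustration of the definition D(Y|X) = tau_a(X,Y)/tau_a(X,X) and of the relation A = [D(Y|X)+1]/2 given in these Remarks, the following Python sketch computes the uncensored, unweighted sample versions by direct comparison of all pairs. The function names and the toy data are hypothetical, and no confidence intervals are computed.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def tau_a_uncensored(x, y):
    """Sample Kendall's tau-a: mean of sign products over ordered pairs."""
    n = len(x)
    total = sum(sign(x[i] - x[j]) * sign(y[i] - y[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def somers_d(y, x):
    """Sample Somers' D of y with respect to the predictor x."""
    return tau_a_uncensored(x, y) / tau_a_uncensored(x, x)


def roc_area(y, x):
    """ROC area for binary x and quantitative y, as (D + 1) / 2."""
    return (somers_d(y, x) + 1) / 2


# Toy data: x is a disease indicator, y is a diagnostic measure.
disease = [0, 0, 1, 1]
measure = [1, 3, 2, 4]
```

In the toy data, three of the four (non-diseased, diseased) pairs are concordant and one is discordant, so somers_d(measure, disease) is 0.5 and roc_area(measure, disease) is 0.75.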

Interpretation of weights

somersd inputs up to two weight expressions, which are the ordinary Stata weights given by the weight and the cluster frequency weights given by the cfweight() option. Internally, somersd defines and uses three distinct sets of weights, which are the cluster frequency weights, the observation frequency weights, and the importance weights.

The cluster frequency weights must be the same for different observations in a cluster, and imply that each cluster in the input dataset represents a number of identical clusters equal to the cluster frequency weight in that cluster. If cluster() is not specified, then the individual observations are clusters, and the cluster frequency weight implies that each one-observation cluster represents a number of identical one-observation clusters equal to the cluster frequency weight. The cluster frequency weights are given by cfweight() if that option is specified; are set to 1 if cfweight() is unspecified and cluster() is specified; are equal to the ordinary Stata weights if neither cluster() nor cfweight() is specified and the ordinary Stata weights are fweights; and are equal to 1 otherwise.

The observation frequency weights are summed over all observations in the input dataset to produce the number of observations reported by somersd and returned in the estimation result e(N), and are not used in any other way. They are set by cfweight() if that option is specified and the ordinary Stata weights are not fweights, are equal to the ordinary Stata weights if cfweight() is unspecified and the ordinary Stata weights are fweights, are equal to the product of the cfweight() expression and the ordinary Stata weights if cfweight() is specified and the ordinary Stata weights are fweights, and are equal to 1 otherwise.

The importance weights are used as described in the Methods and Formulas section of the file somersd.pdf distributed with the somersd package. They are equal to the ordinary Stata weights if these are specified and either cluster() or cfweight() is specified, are equal to the ordinary Stata weights if neither of these two options is specified and the ordinary Stata weights are specified as pweights or iweights, and are equal to 1 otherwise.
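The rules above can be summarized as a decision table. The Python sketch below encodes them as read from this section. It returns symbolic labels ("weight" for the ordinary Stata weights, "cfweight" for the cfweight() expression, "1" for unit weights) rather than numbers. The function name weight_roles and its arguments are hypothetical and are not part of somersd.

```python
def weight_roles(wtype=None, cluster=False, cfweight=False):
    """Return which source feeds each of the three internal weight sets.

    wtype is None (no ordinary weights), "fweight", "pweight" or "iweight";
    cluster and cfweight say whether those options were specified.
    """
    is_fweight = wtype == "fweight"

    # Cluster frequency weights
    if cfweight:
        cluster_freq = "cfweight"
    elif not cluster and is_fweight:
        cluster_freq = "weight"
    else:
        cluster_freq = "1"

    # Observation frequency weights (only used to compute e(N))
    if cfweight and is_fweight:
        obs_freq = "cfweight*weight"
    elif cfweight:
        obs_freq = "cfweight"
    elif is_fweight:
        obs_freq = "weight"
    else:
        obs_freq = "1"

    # Importance weights
    if wtype is not None and (cluster or cfweight):
        importance = "weight"
    elif wtype in ("pweight", "iweight"):
        importance = "weight"
    else:
        importance = "1"

    return {"cluster_freq": cluster_freq,
            "obs_freq": obs_freq,
            "importance": importance}
```

For instance, fweights with neither option act as cluster frequency weights and observation frequency weights, whereas fweights combined with cluster() act as observation frequency weights and importance weights.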

Examples

    . somersd foreign mpg weight, tr(z)

    . somersd us gpm weight
    . lincom (weight-gpm)/2

    . somersd us gpm weight, tr(c)
    . lincom weight-gpm

    . somersd mpg weight displ, taua tr(z) cluster(manuf)

The following example demonstrates the cenind() option:

    . use http://www.stata-press.com/data/r9/drugtr, clear
    . gene youth=100-age
    . gene byte censind=1-died
    . somersd studytime drug youth, tr(c) cenind(censind)
    . lincom drug-youth
    . sts test drug, wilcoxon
    . somersd drug studytime, tr(z) cenind(0 censind)

Saved results

somersd saves the following in e():

Scalars

    e(N)             number of observations
    e(N_clust)       number of clusters
    e(df_r)          residual degrees of freedom
    e(denominator)   common denominator
    e(depvarsum)     sum of X-variable in estimation sample

Macros

    e(cmd)           somersd
    e(cmdline)       command as typed
    e(param)         parameter (somersd or taua)
    e(parmlab)       parameter label in output
    e(tdist)         tdist if specified
    e(depvar)        name of X-variable
    e(clustvar)      name of cluster variable
    e(vcetype)       title used to label standard error
    e(wtype)         weight type
    e(wexp)          weight expression
    e(cfweight)      cfweight() expression
    e(funtype)       funtype() option
    e(wstrata)       wstrata() option
    e(bstrata)       bstrata() option
    e(predict)       program called by predict (somers_p)
    e(transf)        transformation specified by transf()
    e(tranlab)       transformation label in output
    e(properties)    b V

Matrices

    e(b)             coefficient vector
    e(V)             variance-covariance matrix of the estimators

Functions

    e(sample)        marks estimation sample

Note that (confusingly) e(depvar) is the X-variable, or predictor variable, in the conventional terminology for defining Somers' D. somersd is also different from most estimation commands in that its results are not designed to be used by predict. If the user tries to do so, then the program somers_p is called, and tells the user that predict should not be used after somersd. The scalar e(denominator) contains the common denominator used in calculating the Somers' D or Kendall's tau-a statistics.

Author

Roger Newson, Imperial College London, UK. Email: r.newson@imperial.ac.uk

References

Newson, R. 2002. Parameters behind "nonparametric" statistics: Kendall's tau, Somers' D and median differences. Stata Journal 2: 45-64.

Newson, R. 2006a. Efficient calculation of jackknife confidence intervals for rank statistics. Journal of Statistical Software 15: 1-10.

Newson, R. 2006b. Confidence intervals for rank statistics: Somers' D and extensions. Stata Journal 6: 309-334.

Newson, R. 2006c. Confidence intervals for rank statistics: Percentile slopes, differences, and ratios. Stata Journal 6: 497-520.

Also see

    Manual:  [R] spearman, [R] ranksum, [R] signrank, [R] roc
    STB:     STB-52: sg123, STB-55: snp15, STB-57: snp15.1, STB-58: snp15.2, STB-58: snp16; STB-61: snp15.3; STB-61: snp16.1
    Online:  ktau, ranksum, signrank, roc, lincom, jknife, cendif, censlope, if installed