Title
margdistfit -- Post-estimation command that compares the observed and theoretical marginal distributions.
Syntax
margdistfit [, { pp | qq | cumul | hangroot[(hangroot_options)] } sims(#) noparsamp obsopts(scatter_options) refopts(line_options) simopts(line_options) nosquare e(#) ]
Description
margdistfit is a post-estimation command for checking how well the distributional assumptions of a regression model fit the data. It does so by comparing the marginal distribution implied by the regression model to the observed distribution of the dependent variable. This comparison is done through a probability-probability plot, a quantile-quantile plot, a hanging rootogram, or a plot of the two cumulative distribution functions.
The key concept in this command is the marginal distribution. Regression models assume a distribution for the dependent variable, and this distribution can be described in terms of a small number of parameters: e.g. the mean and the standard deviation in case of the normal/Gaussian distribution. One or more of these distribution parameters, typically the mean, is allowed to differ from observation to observation depending on the values of the explanatory variables. As a consequence, the distribution of the explained variable implied by the model is a mixture distribution such that each observation has its own parameters. This is the marginal distribution.
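For a linear regression, the marginal cumulative distribution function described above can be written as the average of the observation-specific normal distribution functions: F(y) = (1/N) * sum over i of Phi((y - mu_i)/sigma). The following is a minimal sketch of that calculation, not the code margdistfit itself runs. It assumes that e(rmse) is used as the point estimate of sigma, and the names mu, sigma, tmp, and Fmarg are purely illustrative.

```stata
sysuse auto, clear
regress price mpg foreign

predict double mu, xb          // observation-specific mean
scalar sigma = e(rmse)         // assumed point estimate of the common sd

// marginal CDF evaluated at each observed price:
// the average of the N conditional normal CDFs
generate double Fmarg = .
forvalues i = 1/`=_N' {
    quietly generate double tmp = normal((price[`i'] - mu) / sigma)
    quietly summarize tmp, meanonly
    quietly replace Fmarg = r(mean) in `i'
    drop tmp
}
```

Plotting the empirical distribution function of price against Fmarg would give a hand-made version of the comparison that the pp option displays.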
To give an indication of how much deviation from the theoretical distribution is still legitimate, the graph also shows the distribution of several (by default 20) variables simulated under the assumption that the regression model is true. By default, the simulations reflect two sources of uncertainty: uncertainty about the parameter estimates, and the randomness inherent in drawing a variable from a distribution. This is achieved by creating the simulated variables in two steps: first the parameters are drawn from their sampling distribution, and then the simulated variable is drawn given those parameters.
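The two-step simulation can be illustrated for a single simulated variable after regress. This is only a sketch under simplifying assumptions, not margdistfit's own implementation: the coefficients are drawn from a multivariate normal distribution with mean e(b) and variance e(V), and the residual standard deviation is held fixed at e(rmse) rather than drawn from its sampling distribution. The names bstar, mustar, and ysim are illustrative.

```stata
sysuse auto, clear
set seed 12345
regress price mpg foreign

// Step 1: draw one coefficient vector from N(e(b), e(V))
mata:
b = st_matrix("e(b)")
V = st_matrix("e(V)")
bstar = b + rnormal(1, cols(b), 0, 1) * cholesky(V)'
st_matrix("bstar", bstar)
end
local names : colnames e(b)
matrix colnames bstar = `names'

// Step 2: draw the simulated outcome given those coefficients
matrix score double mustar = bstar
generate double ysim = mustar + rnormal(0, e(rmse))
```

Repeating both steps once per simulated variable gives draws that vary because of parameter uncertainty as well as random sampling. Skipping step 1 and using e(b) directly corresponds to the idea behind the noparsamp option.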
margdistfit may be used after estimating a model with regress, poisson, zip, nbreg, gnbreg, zinb, or betafit (the latter is available from ssc).
Options
pp specifies that a probability-probability plot is to be displayed. This graph is best for looking at the comparison of the theoretical and observed distribution in the middle of the distribution. It may not be combined with qq, cumul, or hangroot.
qq specifies that a quantile-quantile plot is to be displayed. This graph is best for looking at the comparison of the theoretical and observed distribution in the tails of the distribution. This is the default. It may not be combined with pp, cumul, or hangroot.
cumul specifies that the observed and theoretical cumulative distribution functions are to be graphed. It may not be combined with pp, qq, or hangroot.
hangroot[(hangroot_options)] specifies that a hanging rootogram is used to compare the observed and theoretical distributions. This requires that the hangroot package is installed, which is available from ssc. It may not be combined with pp, qq, or cumul.
sims(#) specifies the number of simulated variables. The default is 20.
noparsamp specifies that the simulated variables are drawn from the distribution with parameters fixed at the point estimates of the model, instead of first drawing the parameters from their sampling distribution.
obsopts(scatter_options) specifies options governing how the distribution of the observed variable looks.
refopts(line_options) specifies options governing how the reference line looks.
simopts(line_options) specifies options governing how the distributions of the simulated variables look.
nosquare specifies that the graph is not forced to be square. By default, the probability-probability and quantile-quantile plots are forced to be square because a perfect fit is represented by the 45 degree line. Forcing the graph to be square ensures that the 45 degree line truly has an angle of 45 degrees. This option is not allowed in combination with cumul or hangroot.
e(#) specifies the maximum error allowed when approximating the quantile function or the cumulative distribution function. The quantile function is computed using the algorithm discussed in Hoermann and Leydold (2003). A similar algorithm is used to compute the cumulative distribution function. Approximating the latter is, strictly speaking, not necessary, but it significantly speeds up the computation in medium to large datasets. With pp or cumul, # may be a number between 0 and 1e-3; the cumulative distribution function is computed directly instead of approximated when a number less than 1e-12 is specified. With qq, # may be a number between 1e-12 and 1e-3. The default is min(1e-6,10^-ceil(log10(N))), where N is the sample size.
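As a worked illustration of the default for e(#), the formula can be evaluated by hand after an estimation command. This sketch simply restates the formula given above, with N taken from e(N); the scalar name tol is illustrative. With the 74 observations of the auto data, ceil(log10(74)) is 2, so the default is min(1e-6, 1e-2) = 1e-6. The formula only yields a smaller tolerance once the sample size exceeds about one million.

```stata
sysuse auto, clear
regress price mpg foreign

// default maximal error: min(1e-6, 10^-ceil(log10(N)))
scalar tol = min(1e-6, 10^(-ceil(log10(e(N)))))
display "N = " e(N) ", default e() = " tol
```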
Examples
A well fitting model:
    sysuse nlsw88, clear
    gen lnw = ln(wage)
    reg lnw grade ttl_exp tenure union
    margdistfit, qq
A model that does not fit so well. Note that linear regression is typically quite robust against deviations from the assumed normal distribution. However, knowing that such deviations exist in your data, and substantively understanding why they are there, can add a lot of "flesh" to the "bare bones" of your model.
    sysuse auto, clear
    reg price mpg foreign
    margdistfit, pp
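A possible variation on this example combines several of the options documented above. It draws 50 simulated variables at the point estimates only (noparsamp) and changes the look of the simulated and reference lines. The particular colors are arbitrary choices for illustration; lcolor() is a standard line option.

```stata
sysuse auto, clear
reg price mpg foreign

// 50 simulations, no parameter sampling, light gray simulation lines
margdistfit, pp sims(50) noparsamp simopts(lcolor(gs12)) refopts(lcolor(black))
```

Comparing this graph with the one produced without noparsamp gives an impression of how much of the spread in the simulations is due to uncertainty about the parameter estimates.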
An example created to illustrate that the marginal distribution can look very different from what one may expect. I use regress, so I assume a normal distribution where the mean can change from observation to observation depending on the value of x. In this case the data were created such that we should see a distribution of y that consists of two humps, one at -2 and the other at 2, which is indeed the case.
    preserve
    set seed 12345
    drop _all
    set obs 500
    gen x = runiform() < .5
    gen y = -2 + 4*x + rnormal()
    regress y x
    margdistfit, hangroot(jitter(5))
    restore
An example that can be used to compare the fit of several count models.
The strange pattern in the last graph is due to the large sampling variability in the inflation parameter: by default, the parameters for each simulation are drawn from the sampling distribution. As a result, some of the samples are drawn from a distribution where the probability of a degenerate zero is 1 (that is, the distribution reduces to a spike at 0), while for the other samples that probability is 0 (that is, the distribution reduces to a negative binomial). In essence, this means that the zinb model is not appropriate for these data.
    preserve
    use http://www.stata-press.com/data/lf2/couart2, clear
    mkspline ment1 20 ment2 = ment

    // this is just to ensure that graph names do not conflict
    // with any graph name you have open
    tempname poisson zip nb zinb

    poisson art fem mar kid5 phd ment1 ment2
    margdistfit, hangroot(susp notheor jitter(2)) title(poisson) name(`poisson')

    zip art fem mar kid5 phd ment1 ment2, inflate(_cons)
    margdistfit, hangroot(susp notheor jitter(2)) title(zip) name(`zip')

    nbreg art fem mar kid5 phd ment1 ment2
    margdistfit, hangroot(susp notheor jitter(2)) title(nbreg) name(`nb')

    zinb art fem mar kid5 phd ment1 ment2, inflate(_cons)
    margdistfit, hangroot(susp notheor jitter(2)) title(zinb) name(`zinb')
    restore
Author
Maarten L. Buis
Universitaet Tuebingen
Institut fuer Soziologie
maarten.buis@uni-tuebingen.de
References
Hoermann, Wolfgang and Leydold, Josef. (2003). Continuous random variate generation by fast numerical inversion. ACM Transactions on Modeling and Computer Simulation, 13(4): 347--362.
Acknowledgement
Garry Anderson, David Ashcraft, Ronan Conroy, Nick Cox and Austin Nichols (in alphabetical order) made several useful comments.
Also see
Online: pnorm, qnorm
If installed: hangroot, qplot, pbeta, qbeta