-------------------------------------------------------------------------------
help for egenmore
-------------------------------------------------------------------------------

Extensions to generate (more extras)

    egen [type] newvar = fcn(arguments) [if exp] [in range] [, options]

Description

egen creates newvar of the optionally specified storage type equal to fcn(arguments). Depending on fcn(), arguments refers to an expression, a varlist, a numlist, or an empty string. The options are similarly function dependent.

Functions

(The option by(byvarlist) means that computations are performed separately for each group defined by byvarlist.)

Functions are grouped thematically as follows:

    Grouping and graphing
    Strings, numbers and conversions
    Dates, times and time series
    Summaries and estimates
    First and last
    Random numbers
    Row operations

Grouping and graphing

axis(varlist) [, gap label(lblvarlist) missing reverse] resembles egen's group(), but is specifically designed for constructing categorical axis variables for graphs, hence the name. It creates a single variable taking on values 1, 2, ... for the groups formed by varlist. varlist may contain string, numeric, or both string and numeric variables. The order of the groups is that of the sort order of varlist. gap overrides the default numbering of 1 up by adding a gap of 1 whenever a variable changes. label() specifies that labels are to be assigned based on the value labels or values of lblvarlist; if not specified, lblvarlist defaults to varlist. missing indicates that missing values in varlist (either numeric missing or "") are to be treated like any other value when assigning groups, instead of missing values being assigned to the group missing. reverse reverses labelling so that groups that would have been assigned values of 1 ... whatever are instead assigned values of whatever ... 1. (Stata 8 required.)

To order groups of a categorical variable according to their values of another variable, in preparation for a graph or table:

. egen meanmpg = mean(-mpg), by(rep78)
. egen Rep78 = axis(meanmpg rep78), label(rep78)
. tabstat mpg, by(Rep78) s(min mean max)

clsst(varname), values(numlist) [later] returns whichever of the numlist in values() is closest (differs by least, disregarding sign) to the numeric variable varname. later specifies that in the event of ties values specified later in the list overwrite values specified earlier. If varname is 15 then 10 and 20 specified by values(10 20) are equally close. For any observation containing 15 the default is that 10 is reported, whereas with later 20 is reported. For a numlist containing an increasing sequence, later implies choosing the higher of two equally close values. (Stata 6 required.)

. egen mpgclass = clsst(mpg), v(10(5)40)
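The tie rule can be checked on a tiny made-up dataset. This is only a sketch: the variable names x, near and nearlater are hypothetical, and it assumes that egenmore is installed.

```stata
* Toy data: 15 is equally close to 10 and 20
clear
input x
12
15
18
end

* Default: ties go to the value listed earlier in values()
egen near = clsst(x), values(10 20)

* later: ties go to the value listed later in values()
egen nearlater = clsst(x), values(10 20) later

list x near nearlater
```

Going by the description above, the observation with x equal to 15 should show 10 in near and 20 in nearlater, while the other two observations should agree.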

egroup(varlist) is an extension of egen's group() function with the extra option label(lblvarlist), which will attach the original values (or value labels if they exist) of lblvarlist as value labels. This option may not be combined with the label option. (Stata 7 required; superseded by axis() above.)

group2(varlist) is a generalisation of egen's group() with the extra option sort(egen_call). Groups of varlist will have values 1 upwards according to their values on the results of a specified egen_call. For example, group2(rep78) sort(mean(mpg)) will produce a variable such that the group of rep78 with the lowest mean of mpg will have value 1, that with the second lowest mean will have value 2, and so forth. As with group(), the label option will attach the original values of varlist (or value labels if they exist) as value labels. The argument of sort() must be a valid call to an egen function, official or otherwise. (Stata 7 required; use of egroup() or axis() above is now considered better style.)

mlabvpos(yvar xvar) [, log polynomial(#) matrix(5x5 matrix)] automatically generates a variable giving clock positions of marker labels given names of variables yvar and xvar defining the axes of a scatter plot. Thus the command generates a variable to be used in the scatter option mlabvpos().

The general idea is to pull marker labels away from the data region. So, marker labels in the lower left of the region are at clock positions 7 or 8, and those in the upper right are at clock positions 1 or 2, etc. More precisely, considering the following rectangle as the data region, marker labels are placed as follows:

    +--------------+
    |11 12 12 12  1|
    |10 11 12  1  2|
    | 9  9 12  3  3|
    | 8  7  6  5  4|
    | 7  6  6  6  5|
    +--------------+

Note that there is no attempt to prevent marker labels from overplotting, which is likely in any dataset with many observations. In such situations you might be better off simply randomizing clock positions with, say, ceil(uniform() * 12).

If yvar and xvar are highly correlated, then the clock positions are generated as follows (which is however the same general idea):

    +--------------+
    |      12  1  3|
    |   12 12  3  4|
    |11 11 12  5  5|
    |10  9  6  6   |
    | 9  7  6      |
    +--------------+

To calculate the positions, the x axis is first categorized into 5 equal intervals around the mean of xvar. Afterwards the residuals from regression of yvar on xvar are categorized into 5 equal intervals. Both categorized variables are then used to calculate the positions according to the first table above. The rule can be changed with the option matrix().

log indicates that residuals from regression are to be calculated using the logarithms of xvar. This might be useful if the scatter shows a strong curvilinear relationship.

polynomial(#) indicates that residuals are to be calculated from a regression of yvar on a polynomial of xvar. For example, use poly(2) if the scatter shows a U-shaped relationship.

matrix(#) is used to change the general rule for the plot positions. The positions are specified by a 5 x 5 matrix, in which cell [1,1] gives the clock position of marker labels in the upper left part of the data region, and so forth. (Stata 8.2 required.)

. egen clock = mlabvpos(mpg weight)
. scatter mpg weight, mlab(make) mlabvpos(clock)
. egen clock2 = mlabvpos(mpg weight), matrix(11 1 12 11 1 \ 10 2 12 10 2 \ 9 3 12 9 3 \ 8 4 6 8 4 \ 7 5 6 7 5)
. sc mpg weight, mlab(make) mlabvpos(clock2)

Strings, numbers and conversions

base(varname) [, base(#)] produces a string variable containing the digits of a base # (default 2, possible values 2(1)9) representation of varname, which must contain integers. Thus if varname contains values 0, 1, 2, 3, 4, and the default base is used, then the result will contain the strings "000", "001", "010", "011", "100". If any integer values are negative, all string values will start with - if negative and + otherwise. See also decimal(). The examples show how to unpack this string into individual digits if desired. (Stata 6 required.)

. egen binary = base(code)

Suppose binary is str5. To get individual str1 variables,

. forval i = 1/5 {
.     gen str1 code`i' = substr(binary, `i', 1)
. }

and to get individual numeric variables,

. forval i = 1/5 {
.     gen byte code`i' = real(substr(binary, `i', 1))
. }

decimal(varlist) [, base(#)] treats the values of varlist as indicating digits in a base # (default 2, possible values integers >= 2) representation of a number and produces the decimal equivalent. Thus if three variables are given with values in a single observation of 1 1 0, and the default base is used, the decimal result is 1 * 2^2 + 1 * 2^1 + 0 * 2^0 = 4 + 2 + 0 = 6. Similarly if base 5 is used, the decimal equivalent of 2 3 4 is 2 * 5^2 + 3 * 5^1 + 4 * 5^0 = 50 + 15 + 4 = 59. Note that the order of variables in varlist is crucial. (Stata 7 required.)

. egen decimal = decimal(q1-q8)
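The positional arithmetic can be reproduced with official commands as a cross-check. The following is a sketch on made-up data: the variable names d1-d3, dec2, dec5, byhand2 and byhand5 are hypothetical, and egenmore is assumed to be installed.

```stata
* Toy data: each observation holds three digits, most significant first
clear
input d1 d2 d3
1 1 0
0 1 1
end

* Default base 2, and the same digits read in base 5
egen dec2 = decimal(d1 d2 d3)
egen dec5 = decimal(d1 d2 d3), base(5)

* The same sums written out by hand
gen byhand2 = d1 * 2^2 + d2 * 2^1 + d3 * 2^0
gen byhand5 = d1 * 5^2 + d2 * 5^1 + d3 * 5^0

list
```

If decimal() behaves as described, dec2 should match byhand2 (6 and 3) and dec5 should match byhand5 (30 and 6). Swapping the order of d1 and d3 in the call changes the result, which is why the order of variables matters.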

incss(strvarlist), substr(substring) [insensitive] indicates occurrences of substring within any of the variables in a list of string variables by 1 and other observations by 0. insensitive makes comparison case-insensitive. (Stata 6 required; an alternative is now just to use foreach.)

. egen buick = incss(make), sub(buick) i
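The foreach alternative mentioned above might look like the following sketch. It uses the auto dataset shipped with Stata and the official functions strpos() and lower(), which need a more recent Stata than the version 6 required by incss(). The indicator name buick2 is hypothetical.

```stata
sysuse auto, clear

* Start at 0, then flag observations where any listed string variable
* contains "buick", ignoring case
gen byte buick2 = 0
foreach v of varlist make {
    replace buick2 = 1 if strpos(lower(`v'), "buick") > 0
}

tabulate buick2
```

More string variables can be added to the varlist of the loop, which mirrors the way incss() searches across a list of string variables.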

iso3166(varname) [, origin(codes|names) language(en|fr) verbose update] maps varname containing "official short country names" into a new variable containing the ISO 3166-1-alpha-2 code elements (e.g. DE for "Germany", GB for "United Kingdom" and HM for "Heard Island and McDonald Islands") and vice versa. The official short country names can be in English (default) or French. Correspondingly the function produces country names from ISO 3166-1-alpha-2 codes in English or French. (Version 9.2 required.)

origin(codes|names) declares the character of the country variable that is already in the data. The default is names, meaning that varname holds the "official short country names". This information may be stored as a string variable or as a numeric variable that is labeled accordingly. This default setting produces ISO 3166-1-alpha-2 codes from the country names. If country names should be produced from the two-letter codes, use egen newvar = iso3166(varname), origin(codes).

language(en|fr) defines the language in which the country names are stored, or should be produced. language(en) is for English names (default); language(fr) is for French names.

verbose: For the mapping from country names to ISO 3166-1-alpha-2 codes the program expects official short country names. It cannot handle unofficial country names such as "Great Britain", "Taiwan" or "Russia". Such unofficial country names result in the generation of missing values for the respective countries. By default iso3166() only returns the number of missing values it has produced. With verbose Stata also provides the list of unofficial country names in varname and a clickable link to the list of official country names. This is convenient if one wants to correct the information stored in varname before using iso3166(). For the transformation of ISO 3166-1-alpha-2 codes into country names, verbose does something equivalent.

update: The ISO 3166-1-alpha-2 codes are automatically looked up in information provided by the ISO 3166 Maintenance Agency of the International Organization for Standardization. The information is automatically downloaded from the internet when the user specifies iso3166() the first time, or whenever update is specified. Note: Updating the matching list regularly will guarantee that iso3166() always produces up-to-date country names. However, updating the match list may also produce missing values when running older do-files for data sets with countries that no longer exist (for example, Yugoslavia).

Note the implications: This function will only work if your copy of Stata can access the internet, at least for the first time it is called. The results of the function might not be fully reproducible in the future.

msub(strvar), find(findstr) [replace(replacestr) n(#) word] replaces occurrences of the words of findstr by the words of replacestr in the string variable strvar. The words of findstr and of replacestr are separated by spaces or bound by " ": thus find(a b "c d") includes three words, in turn "a", "b" and "c d", and double quotation marks "" should be used to delimit any word including one or more spaces. The number of words in findstr should equal that in replacestr, except that (1) an empty replacestr is taken to specify deletion; (2) a single word in replacestr is taken to mean that each word of findstr is to be replaced by that word. As quotation marks are used for delimiting, literal quotation marks should be included in compound double quotation marks, as in `"""'. By default all occurrences are changed. n(#) specifies that the first # occurrences only should be changed. word specifies that words in findstr are to be replaced only if they occur as separate words in strvar. The substitutions of msub() are made in sequence. (Stata 6 required; msub() depends on the built-in functions subinstr() and subinword().)

. egen newstr = msub(strvar), f(A B C) r(1 2 3)
  (replaces "A" by "1", "B" by "2", "C" by "3")

. egen newstr = msub(strvar), f(A B C) r(1 2 3) n(1)
  (replaces "A" by "1", "B" by "2", "C" by "3", first occurrence only)

. egen newstr = msub(strvar), f(A B C) r(1)
  (replaces "A" by "1", "B" by "1", "C" by "1")

. egen newstr = msub(strvar), f(A B C)
  (deletes "A", "B", "C")

. egen newstr = msub(strvar), f(" ")
  (deletes spaces)

. egen newstr = msub(strvar), f(`"""')
  (deletes quotation mark ")

. egen newstr = msub(strvar), f(frog) w
  (deletes "frog" only if occurring as single word)
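Because substitutions are made in sequence, the output of an earlier substitution can be changed again by a later one. The sketch below illustrates this with nested calls to the official function subinstr() on a made-up string; the names s, viasubinstr and viamsub are hypothetical, and the msub() line assumes that egenmore is installed.

```stata
clear
input str3 s
"abc"
end

* First "a" becomes "b" (giving "bbc"), then every "b" becomes "c"
gen viasubinstr = subinstr(subinstr(s, "a", "b", .), "b", "c", .)

* The corresponding find and replace lists in msub()
egen viamsub = msub(s), find(a b) replace(b c)

list
```

The nested subinstr() call yields "ccc", not "bcc". If msub() applies its substitutions sequentially as stated, viamsub should show the same value, so overlapping find and replace lists need some care.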

noccur(strvar), string(substr) creates a variable containing the number of occurrences of the string substr in string variable strvar. Note that occurrences must be disjoint (non-overlapping): thus there are two occurrences of "aa" within "aaaaa". (Stata 7 required.)

nss(strvar), find(substr) [insensitive] returns the number of occurrences of substr within the string variable strvar. insensitive makes counting case-insensitive. (Stata 6 required.)

The inclusion of noccur() and nss(), two almost identical functions, was an act of sheer inadvertence by the maintainer.
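Counting disjoint occurrences can also be sketched with official string functions: delete every occurrence of the substring and divide the drop in length by the length of the substring. The data and the names s, byhand and viaegen are made up for illustration, and the egen line assumes that egenmore is installed.

```stata
clear
input str5 s
"aaaaa"
"abcab"
end

* Removing each non-overlapping "aa" shortens the string by 2 characters
gen byhand = (length(s) - length(subinstr(s, "aa", "", .))) / length("aa")

egen viaegen = noccur(s), string(aa)

list
```

For "aaaaa" the by-hand count is 2, matching the example in the description of noccur(), and for "abcab" it is 0.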

ntos(numvar), from(numlist) to(list of string values) generates a string variable from a numeric variable numvar, mapping each numeric value in numlist to the corresponding string value. The number of elements in each list must be the same. String values containing blanks should be delimited by double quotation marks " ". Values not defined by the mapping are generated as missing. The type of the string variable is determined automatically. (Stata 6 required.)

. egen grade = ntos(Grade), from(1/5) to(Poor Fair Good "Very good" Excellent)

nwords(strvar) returns the number of words within the string variable strvar. Words are separated by spaces, unless bound by double quotation marks " ". (Stata 6 required; superseded by wordcount().)

repeat(), values(value_list) [by(byvarlist) block(#)] produces a repeated sequence of value_list. The items of value_list, which may be a numlist or a set of string values, are assigned cyclically to successive observations. The order of observations is determined (1) after noting any if or in restrictions; (2) within groups specified by by(), if issued; (3) by the current sort order. block() specifies that values should be repeated in blocks of the specified size: the default is 1. The variable type is determined smartly, and need not be specified. (Stata 8 required.)

. egen quarter = repeat(), v(1/4) block(3)
. egen months = repeat(), v(`c(Months)')
. egen levels = repeat(), v(10 50 200 500)

sieve(strvar), {keep(classes) | char(chars) | omit(chars)} selects characters from strvar according to a specified criterion and generates a new string variable containing only those characters. This may be done in three ways. First, characters are classified using the keywords alphabetic (any of a-z or A-Z), numeric (any of 0-9), space or other. keep() specifies one or more of those classes: keywords may be abbreviated by as little as one letter. Thus keep(a n) selects alphabetic and numeric characters and omits spaces and other characters. Note that keywords must be separated by spaces. Alternatively, char() specifies each character to be selected or omit() specifies each character to be omitted. Thus char(0123456789.) selects numeric characters and the stop (presumably as decimal point); omit(" ") strips spaces and omit(`"""') strips double quotation marks. (Stata 7 required.)

ston(strvar), from(list of string values) to(numlist) generates a numeric variable from a string variable strvar, mapping each string value to the corresponding numeric value in numlist. The number of elements in each list must be the same. String values containing blanks should be delimited by " ". Values not defined by the mapping are generated as missing. (Stata 6 required.)

. egen Grade = ston(grade), to(1/5) from(Poor Fair Good "Very good" Excellent)

wordof(strvar), word(#) returns the #th word of string variable strvar. word(1) is the first word, word(2) the second word, word(-1) the last word, and so forth. Words are separated by spaces, unless bound by quotation marks " ". (Stata 6 required; superseded by word().)

Dates, times and time series

bom(m y) [, lag(lag) format(format) work] creates an elapsed date variable containing the date of the beginning of month m and year y. m can be a variable containing integers between 1 and 12 inclusive or a single integer in that range. y can be a variable containing integers within the range covered by elapsed dates or a single integer within that range. Optionally lag() specifies a lag: the beginning of the month will be given for lag months before the current date. lag(1) refers to the previous month, lag(3) to 3 months ago and lag(-3) to 3 months hence. The lag may also be specified by a variable containing integers. Optionally a format, usually but not necessarily a date format, can be specified. work specifies that the first day must also be one of Monday to Friday. (Stata 6 required.)

. egen bom = bom(month year), f(%dd_m_y)

bomd(datevar) [, lag(lag) format(format) work] creates an elapsed date variable containing the date of the beginning of the month containing the date in an elapsed date variable datevar. Optionally lag() specifies a lag: the beginning of the month will be given for lag months before the current date. lag(1) refers to the previous month, lag(3) to 3 months ago and lag(-3) to 3 months hence. The lag may also be specified by a variable containing integers. Optionally a format, usually but not necessarily a date format, can be specified. work specifies that the first day must also be one of Monday to Friday. (Stata 6 required.)

. egen bomd = bomd(date), f(%dd_m_y)

Note that work knows nothing about holidays or any special days.

dayofyear(daily_date_variable) [, month(#) day(#)] generates the day of the year, counting from the start of the year, from a daily date variable. The start of the year is 1 January by default: month() and/or day() may be used to specify an alternative. This function thus is a generalisation of the date function doy(). (Stata 8 required.)

. egen dayofyear = dayofyear(date), m(10)

dhms(d h m s) [, format(format)] creates a date variable from a Stata date variable or date d, with a fractional part reflecting the number of hours, minutes and seconds past midnight. h can be a variable containing integers between 0 and 23 inclusive or a single integer in that range. m and s can be variables containing integers between 0 and 59 or single integer(s) in that range. Optionally a format, usually but not necessarily a date format, can be specified. The resulting variable, which is by default stored as a double, may be used in date and time arithmetic in which the time of day is taken into account. (Stata 6 required.)
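The fractional-day arithmetic implied by this description can be sketched with official commands: convert the time of day to seconds past midnight and divide by the 86400 seconds in a day. The data and the names d, h, m, s, secs, stamp and viaegen are made up, and the final egen line assumes that egenmore is installed.

```stata
clear
input d h m s
17000  6 30 0
17000 18  0 0
end

* Seconds past midnight, then the fraction of a day added to the date
gen long secs = 3600 * h + 60 * m + s
gen double stamp = d + secs / 86400

* The egenmore function, for comparison
egen double viaegen = dhms(d h m s)

format stamp viaegen %12.5f
list
```

The by-hand values are 17000 plus 23400/86400 for 06:30:00 and 17000.75 for 18:00:00. The difference between two such values, multiplied by 86400, gives an interval in seconds.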

elap(time) [, format(format)] creates a string variable which contains the number of days, hours, minutes and seconds associated with an integer variable containing a number of elapsed seconds. Such a variable might be the result of date/time arithmetic, where a time interval between two timestamps has been expressed in terms of elapsed seconds. Leading zeroes are included in the hours, minutes, and seconds fields. Optionally, a format can be specified. (Stata 6 required.)

elap2(time1 time2) [, format(format)] creates a string variable which contains the number of days, hours, minutes and seconds associated with a pair of time values, expressed as fractional days, where time1 is no greater than time2. Such time values may be generated by function dhms(). elap2() expresses the interval between these time values in readable form. Leading zeroes are included in the hours, minutes, and seconds fields. Optionally, a format can be specified. (Stata 6 required.)

eom(m y) [, lag(lag) format(format) work] creates an elapsed date variable containing the date of the end of month m and year y. m can be a variable containing integers between 1 and 12 inclusive or a single integer in that range. y can be a variable containing integers within the range covered by elapsed dates or a single integer within that range. Optionally lag() specifies a lag: the end of the month will be given for lag months before the current date. lag(1) refers to the previous month, lag(3) to 3 months ago and lag(-3) to 3 months hence. The lag may also be specified by a variable containing integers. Optionally a format, usually but not necessarily a date format, can be specified. work specifies that the last day must also be one of Monday to Friday. (Stata 6 required.)

. egen eom = eom(month year), f(%dd_m_y)

eomd(datevar) [, lag(lag) format(format) work] creates an elapsed date variable containing the date of the end of the month containing the date in an elapsed date variable datevar. Optionally lag() specifies a lag: the end of the month will be given for lag months before the current date. lag(1) refers to the previous month, lag(3) to 3 months ago and lag(-3) to 3 months hence. The lag may also be specified by a variable containing integers. Optionally a format, usually but not necessarily a date format, can be specified. work specifies that the last day must also be one of Monday to Friday. (Stata 6 required.)

Note that work knows nothing about holidays or any special days.

. egen eom = eomd(date), f(%dd_m_y)
. egen eopm = eomd(date), f(%dd_m_y) lag(1)

ewma(timeseriesvar), a(#) calculates the exponentially weighted moving average, which is

    ewma = timeseriesvar                             for the first observation
         = a * timeseriesvar + (1 - a) * L.ewma      otherwise

The data must have been declared time series data by tsset. Calculations start afresh after any gap with missing values. (Stata 6 required; superseded by tssmooth.)
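The recursion can be written out with official commands for a single series without gaps, which may help in checking results. This is a sketch only: the data, the names t, y, byhand and viaegen, and the choice a = 0.3 are made up, and the egen line assumes that egenmore is installed.

```stata
clear
input t y
1 10
2 12
3 11
4 15
end
tsset t

* First observation starts the recursion; replace then works down the
* data, so byhand[_n-1] is already the updated previous value
gen double byhand = y in 1
replace byhand = 0.3 * y + 0.7 * byhand[_n-1] in 2/l

egen double viaegen = ewma(y), a(0.3)

list
```

The by-hand sequence is 10, 10.6, 10.72 and 12.004. Panel data or series with gaps would need the restart rule described above, which this sketch does not handle.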

filter(timeseriesvar), lags(numlist) [coef(numlist) {normalise|normalize}] calculates the linear filter which is the sum of terms

    coef_i * Li.timeseriesvar    or    coef_i * Fi.timeseriesvar

coef() defaults to a vector the same length as lags() with each element 1.

filter(y), l(0/3) c(0.4(-0.1)0.1) calculates

    0.4 * y + 0.3 * L1.y + 0.2 * L2.y + 0.1 * L3.y

filter(y), l(0/3) calculates

    1 * y + 1 * L1.y + 1 * L2.y + 1 * L3.y

or

    y + L1.y + L2.y + L3.y

Leads are specified as negative lags.

normalise (or normalize, according to taste) specifies that coefficients are to be divided by their sum so that they add to 1 and thus specify a weighted mean.

filter(y), l(-2/2) c(1 4 6 4 1) n calculates

    (1/16) * F2.y + (4/16) * F1.y + (6/16) * y + (4/16) * L1.y + (1/16) * L2.y

The data must have been declared time series data by tsset. Note that this may include panel data, which are automatically filtered separately within each panel.

The order of terms in coef() is taken to be the same as that in lags(). (Stata 8 required; see also tssmooth.)

. egen f2y = filter(y), l(-1/1) c(0.25 0.5 0.25)
. egen f2y = filter(y), l(-1/1) c(1 2 1) n
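The first example above corresponds to a weighted sum that can be written directly with time-series operators, which gives a simple cross-check. This is a sketch on made-up data; the names t, y, byhand and viaegen are hypothetical, and the egen line assumes that egenmore is installed.

```stata
clear
input t y
1  2
2  4
3  6
4  8
5 10
end
tsset t

* Lags -1, 0, 1 pair with coefficients 0.25, 0.5, 0.25 in that order;
* lag -1 is a lead, written F1.
gen double byhand = 0.25 * F1.y + 0.5 * y + 0.25 * L1.y

egen double viaegen = filter(y), l(-1/1) c(0.25 0.5 0.25)

list
```

The by-hand version is missing at both ends of the series, because the lead or the lag is not available there. For this exactly linear series the interior values equal y itself.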

filter7(timeseriesvar), lags(numlist) coef(numlist) [{normalise|normalize}] calculates the linear filter which is the sum of terms

    coef_i * Li.timeseriesvar    or    coef_i * Fi.timeseriesvar

filter7(y), l(0/3) c(0.4(-0.1)0.1) calculates

    0.4 * y + 0.3 * L1.y + 0.2 * L2.y + 0.1 * L3.y

Leads are specified as negative lags.

normalise (or normalize, according to taste) specifies that coefficients are to be divided by their sum so that they add to 1 and thus specify a weighted mean.

filter7(y), l(-2/2) c(1 4 6 4 1) n calculates

    (1/16) * F2.y + (4/16) * F1.y + (6/16) * y + (4/16) * L1.y + (1/16) * L2.y

The data must have been declared time series data by tsset. Note that this may include panel data, which are automatically filtered separately within each panel.

The order of terms in coef() is taken to be the same as that in lags(). (Stata 7 required; see also tssmooth.)

foy(daily_date_variable) [, month(#) day(#)] generates the fraction of the year elapsed since the start of the year from a daily date variable. The start of the year is 1 January by default: month() and/or day() may be used to specify an alternative. If daily_date_variable is all integers, then the result is (day of year - 0.5) / number of days in year. If daily_date_variable contains non-integers, then the result is (day of year - 1) / number of days in year. (Stata 8 required.)

. egen frac = foy(date), m(10)

hmm(timevar) [, round(#) trim] generates a string variable showing timevar, interpreted as indicating time in minutes, represented as hours and minutes in the form "[...h]h:mm". For example, times of 9, 90, 900 and 9000 minutes would be represented as "0:09", "1:30", "15:00" and "150:00". The option round(#) rounds the result: round(1) rounds the time to the nearest minute. The option trim trims the result of leading zeros and colons, except that an isolated 0 is not trimmed. With trim "0:09" is trimmed to "9" and "0:00" is trimmed to "0".

hmm() serves equally well for representing times in seconds in minutes and seconds in the form "[...m]m:ss". (Stata 6 required.)
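The default "h:mm" representation can be mimicked for non-negative integer times with official functions, which shows what the conversion involves. This is a sketch: the names t, byhand and viaegen are hypothetical, neither round() nor trim is imitated, and the egen line assumes that egenmore is installed.

```stata
clear
input t
9
90
900
9000
end

* Whole hours, a colon, then minutes padded to two digits
gen byhand = string(floor(t / 60)) + ":" + string(mod(t, 60), "%02.0f")

egen viaegen = hmm(t)

list
```

The by-hand strings are "0:09", "1:30", "15:00" and "150:00", the same as the examples given in the description.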

hmmss(timevar) [, round(#) trim] generates a string variable showing timevar, interpreted as indicating time in seconds, represented as hours, minutes and seconds in the form "[...h:]mm:ss". For example, times of 9, 90, 900 and 9000 seconds would be represented as "00:09", "01:30", "15:00" and "2:30:00". The option round(#) rounds the result: round(1) rounds the time to the nearest second. The option trim trims the result of leading zeros and colons, except that an isolated 0 is not trimmed. With trim "00:09" is trimmed to "9" and "00:00" is trimmed to "0". (Stata 6 required.)

hms(h m s) [, format(format)] creates an elapsed time variable containing the number of seconds past midnight. h can be a variable containing integers between 0 and 23 inclusive or a single integer in that range. m and s can be variables containing integers between 0 and 59 or single integer(s) in that range. Optionally a format can be specified. (Stata 6 required.)

minutes(strvar) [, maxhour(#)] returns time in minutes given a string variable strvar containing a time in hours and minutes in the form "[..h]hh:mm". In particular, minutes are given as two digits between 00 and 59 and hours by default are given as two digits between 00 and 23. The maxhour() option may be used to change the (unreachable) limit: its default is 24. Note that, strange though it may seem, this function rather than seconds() is appropriate for converting times in the form "mm:ss" to seconds. The maximum number of minutes acceptable may need then to be specified by maxhour() [sic]. (Stata 8 required.)

ncyear(datevar), month(#) [day(#)] returns an integer variable labelled with labels such as "1952/53" for non-calendar years starting on the specified month and day. The day defaults to 1. datevar is treated as indicating elapsed dates. For more on dates, see help on dates. (Stata 6 required.)

. egen wtryear = ncyear(date), m(10)
  (years starting on 1 October)

. egen wwgyear = ncyear(date), m(1) d(21)
  (years starting on 21 January)

record(exp) [, by(byvarlist) min order(varlist)] produces the maximum (with min the minimum) value observed "to date" of the specified exp. Thus record(wage), by(id) order(year) produces the maximum wage so far in a worker's career, calculations being separate for each id and records being determined within each id in year order. Although explanation and example here refer to dates, nothing in record() restricts its use to data ordered in time. If not otherwise specified with by() and/or order(), records are determined with respect to the current order of observations. No special action is required for missing values, as internally record() uses either the max() or the min() function, both of which return results of missing only if all values are missing. (Stata 6 required.)

. egen hiwage = record(exp(lwage)), by(id) order(year)
. egen lowage = record(exp(lwage)), by(id) order(year) min
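A running maximum of this kind can be sketched with by: and max(), which is roughly what the description of record() implies. The data and the names id, year, wage, hi and viaegen are made up, and the egen line assumes that egenmore is installed.

```stata
clear
input id year wage
1 2001 10
1 2002  9
1 2003 12
2 2001 20
2 2002 25
end

* Within each id, in year order: start with the first wage, then carry
* forward the larger of the current wage and the record so far
bysort id (year): gen hi = wage if _n == 1
by id: replace hi = max(wage, hi[_n-1]) if _n > 1

egen viaegen = record(wage), by(id) order(year)

list, sepby(id)
```

The by-hand records are 10, 10, 12 for the first id and 20, 25 for the second. Replacing max() by min() gives the running minimum that corresponds to the min option.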

seconds(strvar) [, maxhour(#)] returns time in seconds given a string variable containing a time in hours, minutes and seconds in the form "[..h]hh:mm:ss". In particular, minutes and seconds are each given as two digits between 00 and 59 and hours by default are given as two digits between 00 and 23. The maxhour() option may be used to change the (unreachable) limit: its default is 24. (Stata 8 required.)

tod(time) [, format(format)] creates a string variable which contains the number of hours, minutes and seconds associated with an integer in the range 0 to 86399, one less than the number of seconds in a day. Such a variable is produced by hms(), which see above. Leading zeroes are included in the hours, minutes, and seconds fields. Colons are used as separators. Optionally a format can be specified. (Stata 6 required.)

Summaries and estimates

adjl(varname) [, by(byvarlist) factor(#)] calculates adjacent lower values. These are the smallest values within factor() times the interquartile range of the lower quartile. By default factor() is 1.5, defining the default lower value of a so-called whisker on a Stata box plot. (Stata 8 required.)

adju(varname) [, by(byvarlist) factor(#)] calculates adjacent upper values. These are the largest values within factor() times the interquartile range of the upper quartile. By default factor() is 1.5, defining the default upper value of a so-called whisker on a Stata box plot. (Stata 8 required.)

. egen adjl = adjl(mpg), by(foreign)
. egen adju = adju(mpg), by(foreign)

corr(varname1 varname2) [, covariance spearman taua taub by(byvarlist)] returns the correlation of varname1 with varname2. By default, this returns the Pearson correlation coefficient. covariance indicates that covariances should be calculated; spearman indicates that Spearman's rank correlation coefficient should be calculated; taua and taub return Kendall's tau-A and tau-B, respectively. (Stata 8 required.)

density(varname) [, width(#) start(#) frequency percent fraction by(byvarlist)] calculates the density (or optionally the frequency, fraction or percent) of values in bins of width width() (default 1) starting at start() (default minimum of the data). Note that each value produced will be identical for all observations in the same bin. Commonly for further use it will be desired to select one value from each bin, say by using egen's tag() function. (Stata 8 required.)

gmean(exp) [, by(byvarlist)] returns the geometric mean of exp. (Stata 6 required.)

. egen gmean = gmean(mpg), by(rep78)

hmean(exp) [, by(byvarlist)] returns the harmonic mean of exp. (Stata 6 required.)

. egen hmean = hmean(mpg), by(rep78)
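Both means follow from standard definitions: the geometric mean is the exponential of the mean logarithm, and the harmonic mean is the reciprocal of the mean reciprocal. The sketch below applies these with official commands to the auto dataset; the names meanlog, gm, meanrec and hm are hypothetical. It assumes strictly positive values, as is true of mpg.

```stata
sysuse auto, clear

* Geometric mean: exponentiate the group mean of the logarithms
egen double meanlog = mean(ln(mpg)), by(rep78)
gen double gm = exp(meanlog)

* Harmonic mean: reciprocal of the group mean of the reciprocals
egen double meanrec = mean(1 / mpg), by(rep78)
gen double hm = 1 / meanrec

tabstat mpg gm hm, by(rep78) s(mean)
```

Within each group the harmonic mean cannot exceed the geometric mean, which in turn cannot exceed the arithmetic mean, so the table offers a quick sanity check.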

nmiss(exp) [, by(byvarlist)] returns the number of missing values in exp. (Stata 6 required.) Remark: Why this was written is a mystery. The one-line command egen nmiss = sum(missing(exp)) (in Stata 9, egen nmiss = total(missing(exp))) shows that it is unnecessary.

. egen nmiss = nmiss(rep78), by(foreign)

nvals(varname) [, by(byvarlist) missing] returns the number of distinct values in varname. Missing values are ignored unless missing is specified. Remark: Much can be done by using egen function tag() and then summing values as desired. (Stata 6 required.)
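The remark about tag() can be made concrete as follows. This is a sketch using the auto dataset and the official egen functions tag() and total(), the latter requiring Stata 9 or later. The names first, ndistinct and viaegen are hypothetical, and the nvals() line assumes that egenmore is installed.

```stata
sysuse auto, clear

* Tag one observation per distinct combination of foreign and rep78;
* observations with missing rep78 are not tagged by default
egen byte first = tag(foreign rep78)

* Number of distinct nonmissing values of rep78 within each foreign group
egen ndistinct = total(first), by(foreign)

egen viaegen = nvals(rep78), by(foreign)

tabulate foreign, summarize(ndistinct)
```

If nvals() ignores missing values as stated, viaegen should agree with ndistinct in every observation.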

outside(varname) [, by(byvarlist) factor(#)] calculates outside values. These are any values more than factor() times the interquartile range from the nearer quartile, that is above the upper quartile or below the lower quartile. By default factor() is 1.5, defining the default outside values, those plotted separately, on a Stata box plot. Values not outside are returned as missing. (Stata 8 required.)

ridit(varname) [, by(byvarlist) missing percent reverse] calculates the ridit for varname, which is

    (1/2) count at this value + SUM counts in values below
    ------------------------------------------------------
                  SUM counts of all values

With terminology from Tukey (1977, pp.496-497), this could be called a `split fraction below'. The name `ridit' was used by Bross (1958): see also Fleiss (1981, pp.150-7) or Flora (1988). The numerator is a `split count'.

missing specifies that observations for which values of byvarlist are missing will be included in calculations if by() is specified. The default is to exclude them. percent scales the numbers to percents by multiplying by 100. reverse specifies the use of reverse cumulative probabilities (1 - fraction above). (Stata 6 required.)
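The ridit formula can be worked through by hand on a small made-up variable. This sketch assumes no missing values and no by() groups; the names x, n_here, firstobs, n_below, byhand and viaegen are hypothetical, and the egen line assumes that egenmore is installed.

```stata
clear
input x
1
1
2
3
3
3
end

sort x
* Count at each distinct value
by x: gen n_here = _N
* Flag the first observation of each distinct value
by x: gen byte firstobs = _n == 1
* Running total of counts up to and including this value, less the
* count at this value, leaves the count in values below
gen n_below = sum(firstobs * n_here) - n_here

gen double byhand = (0.5 * n_here + n_below) / _N

egen double viaegen = ridit(x)

list
```

By hand, the value 1 gets (1 + 0)/6, the value 2 gets (0.5 + 2)/6 and the value 3 gets (1.5 + 3)/6, that is, about 0.167, 0.417 and 0.75.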

semean(exp) [, by(byvarlist)] calculates the standard error of the mean of exp. (Stata 6 required.)

sumoth(exp) [, by(byvarlist)] returns the sum of the other values of exp in the same group. If by() is specified, distinct combinations of byvarlist define groups; otherwise all observations define one group. (Stata 6 required.)
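When there are no missing values, the sum of the other values is just the group total minus the observation's own value. The sketch below shows this with the official egen function total(), which requires Stata 9 or later. The names grouptotal, others and viaegen are hypothetical, and the sumoth() line assumes that egenmore is installed.

```stata
sysuse auto, clear

* Group total, then subtract each observation's own value
egen double grouptotal = total(mpg), by(foreign)
gen double others = grouptotal - mpg

egen double viaegen = sumoth(mpg), by(foreign)

list make foreign mpg others viaegen in 1/5
```

This shortcut relies on mpg having no missing values. With missing values present, subtracting a missing value would give missing, so the two approaches need not agree.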

var(exp)[,by(byvarlist)] creates a constant (withinbyvarlist) containing the variance ofexp. Note also the egen functionsd(). (Stata 6 required.)

wpctile(varname)[,p(#)weights(varname)altdefby(byvarlist)] is a hack on official Stata'segenfunctionpctile()allowing specification of weights in the calculation of percentiles. By default, the function creates a constant (withinbyvarlist) containing the#th percentile ofvarname. Ifp()is not specified, 50 is assumed, meaning medians.weights()requests weighted calculation of percentiles.altdefuses an alternative formula for calculating percentiles, which is not applicable with weights present.by()requests calculation by groups. You may also use theby:construct. (Stata 8.2 required.)

wtfreq(exp) [, by(byvarlist)] creates a constant (within byvarlist) containing the weighted frequency using exp as weights. (Such frequencies sum to _N.) (Stata 6 required.)

xtile(varname) [, percentiles(numlist) nquantiles(#) weights(varname) altdef by(byvarlist)] categorizes varname by specific percentiles. The function works like xtile. By default varname is dichotomized at the median. percentiles() requests percentiles corresponding to numlist: for example, p(25(25)75) creates a variable according to quartiles. Alternatively, you may specify n(4) to create a variable according to quartiles. weights() requests weighted calculation of percentiles. altdef uses an alternative formula for calculating percentiles. See xtile. by() requests calculation by groups. You may also use the by: construct. (Stata 8.2 required.)

. egen mpg4 = xtile(mpg), by(foreign) p(25(25)75)
. egen mpg10 = xtile(mpg), by(foreign) nq(10)

First and last

first(varname) [, by(byvarlist)] returns the first non-missing value of varname. `First' depends on the existing order of observations. varname may be numeric or string. (Stata 6 required.)

ifirst(numvar), value(#) [{before|after} by(byvarlist)] indicates the first occurrence of integer # within numvar by 1 and other observations by 0.

before indicates observations before the first occurrence by 1 and other observations by 0. after indicates observations after the first occurrence by 1 and other observations by 0. The default result, the before result and the after result always sum to 1 for observations analysed.

First occurrence is determined as follows: (1) if if or in is specified, any observations excluded are ignored; (2) if by() is specified, first is determined separately for each distinct group of observations; (3) first is first in current sort order. If # does not occur, all observations are before the first occurrence. (Stata 6 required.)

. gen warm = celstemp > 20
. egen fwarm = ifirst(warm), v(1) by(year)

ilast(numvar), value(#) [{before|after} by(byvarlist)] indicates the last occurrence of integer # within numvar by 1 and other observations by 0.

before indicates observations before the last occurrence by 1 and other observations by 0. after indicates observations after the last occurrence by 1 and other observations by 0. The default result, the before result and the after result always sum to 1 for observations analysed.

Last occurrence is determined as follows: (1) if if or in is specified, any observations excluded are ignored; (2) if by() is specified, last is determined separately for each distinct group of observations; (3) last is last in current sort order. If # does not occur, all observations are before the last occurrence. (Stata 6 required.)
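The three variants of ilast() can be compared on a small made-up dataset. This sketch assumes egenmore is installed; the data and the variable names year, day, celstemp, warm, lwarm, bwarm and awarm are invented for illustration, echoing the ifirst() example above.

```stata
* assumption: egenmore is installed; the data below are invented
clear
input year day celstemp
2001 1 18
2001 2 22
2001 3 19
2001 4 25
2001 5 17
2002 1 21
2002 2 15
2002 3 16
end

gen byte warm = celstemp > 20
sort year day

* last warm day in each year, then days before it and days after it
egen lwarm = ilast(warm), value(1) by(year)
egen bwarm = ilast(warm), value(1) before by(year)
egen awarm = ilast(warm), value(1) after by(year)

* the three indicators should sum to 1 in every observation
gen total = lwarm + bwarm + awarm
list, sepby(year)
```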

lastnm(varname) [, by(byvarlist)] returns the last non-missing value of varname. `Last' depends on the existing order of observations. varname may be numeric or string. Remark: lastnm() would have been better called last(), except that an egen program with that name for selecting the last `word' in a string was published in STB-50. (Stata 6 required.)
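Because first() and lastnm() depend on the order of observations, it is usually sensible to sort explicitly before calling them. A minimal sketch, assuming egenmore is installed and using the auto data; the names first78 and last78 are illustrative.

```stata
* assumption: egenmore is installed; auto.dta is used purely for illustration
sysuse auto, clear

* fix the order: cheapest to most expensive car within each group
sort foreign price

* repair record of the cheapest and of the most expensive car with a
* non-missing rep78, within each group
egen first78 = first(rep78), by(foreign)
egen last78  = lastnm(rep78), by(foreign)

list make price rep78 first78 last78 if foreign
```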

Random numbers

mixnorm() [, frac(#) mu1(#) mu2(#) var1(#) var2(#)] generates a new variable of specified type as a mixture of two Normal distributions, with the fraction frac(#) of the observations defined by the first distribution. Both options for means mu1(#) and mu2(#) default to 0; both options for variances var1(#) and var2(#) default to 1, while frac(#) defaults to 0.5. Only non-default parameters of the desired mixture need be specified. (Stata 8 required.)

. egen mixture = mixnorm(), frac(0.9) mu2(10) var2(4)

rndint(), max(#) [min(#)] generates random integers from a uniform distribution on min() to max(), inclusive. min(1) is the default. Remark: ceil(uniform() * #) is a direct way to get random integers from 1 to #. (Stata 6 required.)

. egen integ = rndint(), min(100) max(199)

rndsub() [, ngroup(#) {frac(#)|percent(#)} by(byvarlist)] randomly splits observations into groups or subsamples. The result is a categorical variable taking values from 1 upward labelling distinct groups.

ngroup(#) (default 2) defines the number of groups.

frac(#), which is only allowed with ngroup(2), specifies that the first group should contain 1/# of the observations and thus that the second group should contain the remaining observations.

percent(#), which is only allowed with ngroup(2), specifies that the first group should contain #% of the observations and thus that the second group should contain the remaining observations.

frac() and percent() may not be specified together. (Stata 6 required.)

. egen group = rndsub(), by(foreign)

. egen group = rndsub(), by(foreign) f(3)
(first group contains 1/3 of observations, second group contains 2/3)

. egen group = rndsub(), by(foreign) p(25)
(first group contains 25% of observations, second group contains 75%)

For reproducible results, set the seed of the random number generator beforehand and document your choice.

Note that to generate # random numbers the number of observations must be at least #. If there are no data in memory and you want 100 random numbers, type set obs 100 before using these functions.
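The two pieces of advice above can be combined in a short sketch. It assumes egenmore is installed; the seed value 12345 and the variable names die and grp are arbitrary choices for illustration.

```stata
* assumption: egenmore is installed; seed and names are arbitrary
clear
set obs 100          // the functions need observations to fill
set seed 12345       // record the seed so that results can be reproduced

* 100 throws of a six-sided die (min(1) is the default)
egen die = rndint(), max(6)

* random split of the 100 observations into 4 groups
egen grp = rndsub(), ngroup(4)

tabulate die grp
```

Rerunning the block from set seed onward, starting from the same data, should reproduce the same table.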

Row operations

rall(varlist), cond(condition) [symbol(symbol)] returns 1 for observations for which the condition specified is true for all variables in varlist and 0 otherwise. The condition should be specified using symbol(), by default @, as a placeholder for each variable. Thus, for example, rall(varlist), c(@ > 0 & @ < .) tests whether all variables in varlist are positive and non-missing. Note that conditions typically make sense only if variables are either all numeric or all string: one exception is missing(@). (Stata 6 required.)

rany(varlist), cond(condition) [symbol(symbol)] returns 1 for observations for which the condition specified is true for any variable in varlist and 0 otherwise. The condition should be specified using symbol(), by default @, as a placeholder for each variable. Thus, for example, rany(varlist), c(@ > 0 & @ < .) tests whether any variable in varlist is positive and non-missing. Note that conditions typically make sense only if variables are either all numeric or all string: one exception is missing(@). (Stata 6 required.)

rcount(varlist), cond(condition) [symbol(symbol)] returns the number of variables in varlist for which the condition specified is true. The condition should be specified using symbol(), by default @, as a placeholder for each variable. Thus, for example, rcount(varlist), c(@ > 0 & @ < .) counts for each observation how many variables in varlist are positive and non-missing. Note that conditions typically make sense only if variables are either all numeric or all string: one exception is missing(@). More precisely, rcount() gives the sum across varlist of condition, evaluated in turn for each variable. (Stata 6 required.)

For rall(), rany(), and rcount(), the symbol() option may be used to set an alternative to @ whenever the latter is inappropriate. For example, if string variables were being searched for literal occurrences of "@", some other symbol not appearing in text or in variable names should be used.
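The case of searching for a literal "@" might look like this. The sketch assumes egenmore is installed and that the placeholder is substituted textually wherever it appears in the condition; the data, the variable names s1, s2 and hasat, and the choice of X as placeholder are invented for illustration. X is safe here only because it appears nowhere else in the condition or in the variable names.

```stata
* assumption: egenmore is installed; data and names are invented
clear
input str12 s1 str12 s2
"a@b.org" "none"
"plain"   "text"
end

* X stands in for each variable, so the literal "@" is left alone
egen hasat = rany(s1 s2), c(strpos(X, "@") > 0) symbol(X)

list
```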

. egen any = rany(b c d e f) , c(@ == a)
. egen all = rall(b c d e f) , c(@ == a)
. egen count = rcount(b c d e f) , c(@ == a)
(values of b c d e f matched by (equal to) those of a?)

. egen anyw1 = rany(b c d e f) , c(abs(@ - a) <= 1)
. egen allw1 = rall(b c d e f) , c(abs(@ - a) <= 1)
. egen countw1 = rcount(b c d e f) , c(abs(@ - a) <= 1)
(values of b c d e f within 1 of those of a?)

From Stata 7, foreach provides an alternative that would now be considered better style:

. gen any = 0
. gen all = 1
. gen count = 0
. foreach v of var a b c d e f {
.     replace any = max(any, inrange(`v', 0, .))
.     replace all = min(all, inrange(`v', 0, .))
.     replace count = count + inrange(`v', 0, .)
. }

rowmedian(varlist) returns, for each observation, the median of the values of the variables in varlist. (Stata 9 required.)
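A minimal sketch of rowmedian() on invented data, assuming egenmore is installed. The variable names a, b, c and med are illustrative, and the data contain no missing values, since the treatment of missings is not described here.

```stata
* assumption: egenmore is installed; data and names are invented
clear
input a b c
1 5 3
2 9 8
7 7 1
end

* median of a, b and c within each observation (row)
egen med = rowmedian(a b c)

list
```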

rsum2(varlist) is a generalisation of egen's rsum() (from Stata 9: rowtotal()) function with the extra options allmiss and anymiss. As with rsum(), it creates the (row) sum of the variables in varlist, treating missing as 0. However, if the option allmiss is selected, the (row) sum for any observation for which all variables in varlist are missing is set equal to missing. Similarly, if the option anymiss is selected, the (row) sum for any observation for which any variable in varlist is missing is set equal to missing. (Stata 6 required.)
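The effect of allmiss and anymiss is easiest to see on a few invented rows. This sketch assumes egenmore is installed; the variable names x1, x2, x3, s, sall and sany are illustrative.

```stata
* assumption: egenmore is installed; data and names are invented
clear
input x1 x2 x3
1 2 3
4 . 6
. . .
end

egen s    = rsum2(x1 x2 x3)             // missing treated as 0 throughout
egen sall = rsum2(x1 x2 x3), allmiss    // missing only if all are missing
egen sany = rsum2(x1 x2 x3), anymiss    // missing if any is missing

list
```

Going by the description above, the third row should be 0 in s but missing in sall and sany, and the second row should be missing only in sany.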

References

Bross, I.D.J. 1958. How to use ridit analysis. Biometrics 14: 38-58.

Fleiss, J.L. 1981. Statistical Methods for Rates and Proportions. New York: John Wiley.

Flora, J.D. 1988. Ridit analysis. In Kotz, S. and Johnson, N.L. (eds) Encyclopedia of Statistical Sciences. New York: John Wiley. 8: 136-139.

Tukey, J.W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.

Maintainer

Nicholas J. Cox, Durham University, U.K. n.j.cox@durham.ac.uk

Acknowledgements

Kit Baum (baum@bc.edu) is the first author of record() and the author of dhms(), elap(), elap2(), hms(), tod() and mixnorm().

Ulrich Kohler (kohler@wzb.eu) is the author of xtile(), mlabvpos(), iso3166() and wpctile().

Steven Stillman (s.stillman@verizon.net) is the author of rsum2().

Nick Winter (njw3x@virginia.edu) is the author of corr() and noccur().

Kit Baum, Sascha Becker, Ronán Conroy, William Gould, Syed Islam, John Moran, Stephen Soldz, Richard Williams, Fred Wolfe and Gerald Wright provided stimulating and helpful comments.

Also see

STB: STB-50 dm70 for atan2(), pp(), rev(), rindex(), rmed(), rotate()

Manual: [D] egen (before Stata 9 [R] egen)

On-line: help for egen, dates, functions, means, numlist, seed, tsset, varlist (timeseries operators), circular (if installed), ntimeofday (if installed), stimeofday (if installed)