-------------------------------------------------------------------------------
help for egenmore
-------------------------------------------------------------------------------

Extensions to generate (more extras)

egen [type] newvar = fcn(arguments) [if exp] [in range] [, options]

Description

egen creates newvar of the optionally specified storage type equal to fcn(arguments). Depending on fcn(), arguments refers to an expression, a varlist, a numlist, or an empty string. The options are similarly function dependent.

Functions

(The option by(byvarlist) means that computations are performed separately for each group defined by byvarlist.)

Functions are grouped thematically as follows:

    Grouping and graphing
    Strings, numbers and conversions
    Dates, times and time series
    Summaries and estimates
    First and last
    Random numbers
    Row operations

Grouping and graphing

axis(varlist) [ , gap label(lblvarlist) missing reverse ] resembles egen's group(), but is specifically designed for constructing categorical axis variables for graphs, hence the name. It creates a single variable taking on values 1, 2, ... for the groups formed by varlist. varlist may contain string, numeric, or both string and numeric variables. The order of the groups is that of the sort order of varlist. gap overrides the default numbering of 1 up by adding a gap of 1 whenever a variable changes. label() specifies that labels are to be assigned based on the value labels or values of lblvarlist; if not specified, lblvarlist defaults to varlist. missing indicates that missing values in varlist (either numeric missing or "") are to be treated like any other value when assigning groups, instead of missing values being assigned to the group missing. reverse reverses labelling so that groups that would have been assigned values of 1 ... whatever are instead assigned values of whatever ... 1. (Stata 8 required.)

To order groups of a categorical variable according to their values of another variable, in preparation for a graph or table:

. egen meanmpg = mean(-mpg), by(rep78)
. egen Rep78 = axis(meanmpg rep78), label(rep78)
. tabstat mpg, by(Rep78) s(min mean max)

clsst(varname) , values(numlist) [ later ] returns whichever of the numlist in values() is closest (differs by least, disregarding sign) to the numeric variable varname. later specifies that in the event of ties values specified later in the list overwrite values specified earlier. If varname is 15 then 10 and 20 specified by values(10 20) are equally close. For any observation containing 15 the default is that 10 is reported, whereas with later 20 is reported. For a numlist containing an increasing sequence, later implies choosing the higher of two equally close values. (Stata 6 required.)

. egen mpgclass = clsst(mpg), v(10(5)40)

egroup(varlist) is an extension of egen's group() function with the extra option label(lblvarlist), which will attach the original values (or value labels if they exist) of lblvarlist as value labels. This option may not be combined with the label option. (Stata 7 required; superseded by axis() above.)

group2(varlist) is a generalisation of egen's group() with the extra option sort(egen_call). Groups of varlist will have values 1 upwards according to their values on the results of a specified egen_call. For example, group2(rep78) sort(mean(mpg)) will produce a variable such that the group of rep78 with the lowest mean of mpg will have value 1, that with the second lowest mean will have value 2, and so forth. As with group(), the label option will attach the original values of varlist (or value labels if they exist) as value labels. The argument of sort() must be a valid call to an egen function, official or otherwise. (Stata 7 required; use of egroup() or axis() above is now considered better style.)
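As a minimal sketch of the sort() behaviour described above, assuming egenmore is installed and using the auto dataset shipped with Stata (the variable name byMean is purely illustrative), the groups of rep78 can be numbered by their mean mpg and then inspected:

```stata
sysuse auto, clear
* number the groups of rep78 by increasing mean of mpg
egen byMean = group2(rep78), sort(mean(mpg)) label
* if group2() behaves as documented, the means rise with byMean
tabstat mpg, by(byMean) s(mean)
```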

mlabvpos(yvar xvar) [ , log polynomial(#) matrix(5x5 matrix) ] automatically generates a variable giving clock positions of marker labels given names of variables yvar and xvar defining the axes of a scatter plot. Thus the command generates a variable to be used in the scatter option mlabvpos().

The general idea is to pull marker labels away from the data region. So, marker labels in the lower left of the region are at clock positions 7 or 8, and those in the upper right are at clock-position 1 or 2, etc. More precisely, considering the following rectangle as the data region, then marker labels are placed as follows:

    +--------------+
    |11 12 12 12  1|
    |10 11 12  1  2|
    | 9  9 12  3  3|
    | 8  7  6  5  4|
    | 7  6  6  6  5|
    +--------------+

Note that there is no attempt to prevent marker labels from overplotting, which is likely in any dataset with many observations. In such situations you might be better off simply randomizing clock positions with say ceil(uniform() * 12).

If yvar and xvar are highly correlated, then the clock positions are generated as follows (which is, however, the same general idea):

    +--------------+
    |      12  1  3|
    |   12 12  3  4|
    |11 11 12  5  5|
    |10  9  6  6   |
    | 9  7  6      |
    +--------------+

To calculate the positions, the x axis is first categorized into 5 equal intervals around the mean of xvar. Afterwards the residuals from regression of yvar on xvar are categorized into 5 equal intervals. Both categorized variables are then used to calculate the positions according to the first table above. The rule can be changed with the option matrix().

log indicates that residuals from regression are to be calculated using the logarithms of xvar. This might be useful if the scatter shows a strong curvilinear relationship.

polynomial(#) indicates that residuals are to be calculated from a regression of yvar on a polynomial of xvar. For example, use poly(2) if the scatter shows a U-shaped relationship.

matrix(#) is used to change the general rule for the plot positions. The positions are specified by a 5 x 5 matrix, in which cell [1,1] gives the clock position of marker labels in the upper left part of the data region, and so forth. (Stata 8.2 required.)

. egen clock = mlabvpos(mpg weight)
. scatter mpg weight, mlab(make) mlabvpos(clock)
. egen clock2 = mlabvpos(mpg weight), matrix(11 1 12 11 1 \ 10 2 12 10 2 \ 9 3 12 9 3 \ 8 4 6 8 4 \ 7 5 6 7 5)
. sc mpg weight, mlab(make) mlabvpos(clock2)

Strings, numbers and conversions

base(varname) [ , base(#) ] produces a string variable containing the digits of a base # (default 2, possible values 2(1)9) representation of varname, which must contain integers. Thus if varname contains values 0, 1, 2, 3, 4, and the default base is used, then the result will contain the strings "000", "001", "010", "011", "100". If any integer values are negative, all string values will start with - if negative and + otherwise. See also decimal(). The examples show how to unpack this string into individual digits if desired. (Stata 6 required.)

. egen binary = base(code)

Suppose binary is str5. To get individual str1 variables,

. forval i = 1/5 {
.     gen str1 code`i' = substr(binary, `i', 1)
. }

and to get individual numeric variables,

. forval i = 1/5 {
.     gen byte code`i' = real(substr(binary, `i', 1))
. }

decimal(varlist) [ , base(#) ] treats the values of varlist as indicating digits in a base # (default 2, possible values integers >=2) representation of a number and produces the decimal equivalent. Thus if three variables are given with values in a single observation of 1 1 0, and the default base is used, the decimal result is 1 * 2^2 + 1 * 2^1 + 0 * 2^0 = 4 + 2 + 0 = 6. Similarly if base 5 is used, the decimal equivalent of 2 3 4 is 2 * 5^2 + 3 * 5^1 + 4 * 5^0 = 50 + 15 + 4 = 69. Note that the order of variables in varlist is crucial. (Stata 7 required.)

. egen decimal = decimal(q1-q8)
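The positional arithmetic behind decimal() can be checked on a toy dataset. The following is only a sketch, assuming egenmore is installed; the variable names d1, d2, d3, dec2 and dec5 are made up for illustration.

```stata
* digits 1 1 0 read in the default base 2
clear
input byte(d1 d2 d3)
1 1 0
end
egen dec2 = decimal(d1 d2 d3)
list

* digits 2 3 4 read in base 5
clear
input byte(d1 d2 d3)
2 3 4
end
egen dec5 = decimal(d1 d2 d3), base(5)
list

* the same sums evaluated by hand, for comparison
display 1*2^2 + 1*2^1 + 0*2^0
display 2*5^2 + 3*5^1 + 4*5^0
```

Reversing the order of the variables changes which digit is most significant, which is why the order of varlist matters.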

incss(strvarlist) , substr(substring) [ insensitive ] indicates occurrences of substring within any of the variables in a list of string variables by 1 and other observations by 0. insensitive makes comparison case-insensitive. (Stata 6 required; an alternative is now just to use foreach.)

. egen buick = incss(make), sub(buick) i

iso3166(varname) [, origin(codes|names) language(en|fr) verbose update] maps varname containing "official short country names" into a new variable containing the ISO 3166-1-alpha-2 code elements (e.g. DE for "Germany", GB for "United Kingdom" and HM for "Heard Island and McDonald Islands") and vice versa. The official short country names can be in English (default) or French. Correspondingly the function produces country names from ISO 3166-1-alpha-2 codes in English or French. (Stata 9.2 required.)

origin(codes|names) declares the character of the country variable that is already in the data. The default is names, meaning that varname holds the "official short country names". This information may be stored as a string variable or as a numeric variable that is labeled accordingly. This default setting produces ISO 3166-1-alpha-2 codes from the country names. If country names should be produced from the two letter codes, use egen newvar = iso3166(varname), origin(codes).

language(en|fr) defines the language in which the country names are stored, or should be produced. language(en) is for English names (default); language(fr) is for French names.

verbose For the mapping from country names to ISO 3166-1-alpha-2 codes the program expects official short country names. It cannot handle unofficial country names such as "Great Britain", "Taiwan" or "Russia". Such unofficial country names result in the generation of missing values for the respective countries. By default iso3166() only returns the number of missing values it has produced. With verbose Stata also provides the list of unofficial country names in varname and a clickable link to the list of official country names. This is convenient if one wants to correct the information stored in varname before using iso3166(). For the transformation of ISO 3166-1-alpha-2 codes into country names, verbose does something equivalent.

update The ISO 3166-1-alpha-2 codes are automatically looked up in information provided by the ISO 3166 Maintenance Agency of the International Organization for Standardization. The information is automatically downloaded from the internet when the user specifies iso3166() the first time, or whenever update is specified. Note: Updating the matching list regularly will guarantee that iso3166() always produces up-to-date country names. However, updating the matching list may also produce missing values when running older do-files for data sets with countries that no longer exist (for example, Yugoslavia).

Note the implications: This function will only work if your copy of Stata can access the internet, at least the first time it is called. The results of the function might not be fully reproducible in the future.

msub(strvar) , find(findstr) [ replace(replacestr) n(#) word ] replaces occurrences of the words of findstr by the words of replacestr in the string variable strvar. The words of findstr and of replacestr are separated by spaces or bound by " ": thus find(a b "c d") includes three words, in turn "a", "b" and "c d", and double quotation marks " " should be used to delimit any word including one or more spaces. The number of words in findstr should equal that in replacestr, except that (1) an empty replacestr is taken to specify deletion; (2) a single word in replacestr is taken to mean that each word of findstr is to be replaced by that word. As quotation marks are used for delimiting, literal quotation marks should be included in compound double quotation marks, as in `"""'. By default all occurrences are changed. n(#) specifies that the first # occurrences only should be changed. word specifies that words in findstr are to be replaced only if they occur as separate words in strvar. The substitutions of msub() are made in sequence. (Stata 6 required; msub() depends on the built-in functions subinstr() and subinword().)

. egen newstr = msub(strvar), f(A B C) r(1 2 3) (replaces "A" by "1", "B" by "2", "C" by "3")

. egen newstr = msub(strvar), f(A B C) r(1 2 3) n(1) (replaces "A" by "1", "B" by "2", "C" by "3", first occurrence only)

. egen newstr = msub(strvar), f(A B C) r(1) (replaces "A" by "1", "B" by "1", "C" by "1")

. egen newstr = msub(strvar), f(A B C) (deletes "A", "B", "C")

. egen newstr = msub(strvar), f(" ") (deletes spaces)

. egen newstr = msub(strvar), f(`"""') (deletes quotation mark ")

. egen newstr = msub(strvar), f(frog) w (deletes "frog" only if occurring as single word)

noccur(strvar) , string(substr) creates a variable containing the number of occurrences of the string substr in string variable strvar. Note that occurrences must be disjoint (non-overlapping): thus there are two occurrences of "aa" within "aaaaa". (Stata 7 required.)

nss(strvar) , find(substr) [ insensitive ] returns the number of occurrences of substr within the string variable strvar. insensitive makes counting case-insensitive. (Stata 6 required.)

The inclusion of noccur() and nss(), two almost identical functions, was an act of sheer inadvertence by the maintainer.
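A short sketch of both counting functions on made-up strings follows. It assumes egenmore is installed; the variable names s, n_aa, n_an and n_an_i are illustrative only.

```stata
clear
input str10 s
"aaaaa"
"banana"
"Anna"
end
* disjoint occurrences of "aa"
egen n_aa = noccur(s), string(aa)
* occurrences of "an", case-sensitive and then case-insensitive
egen n_an = nss(s), find(an)
egen n_an_i = nss(s), find(an) insensitive
list
```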

ntos(numvar) , from(numlist) to(list of string values) generates a string variable from a numeric variable numvar, mapping each numeric value in numlist to the corresponding string value. The number of elements in each list must be the same. String values containing blanks should be delimited by double quotation marks " ". Values not defined by the mapping are generated as missing. The type of the string variable is determined automatically. (Stata 6 required.)

. egen grade = ntos(Grade), from(1/5) to(Poor Fair Good "Very good" Excellent)

nwords(strvar) returns the number of words within the string variable strvar. Words are separated by spaces, unless bound by double quotation marks " ". (Stata 6 required; superseded by wordcount().)

repeat() , values(value_list) [ by(byvarlist) block(#) ] produces a repeated sequence of value_list. The items of value_list, which may be a numlist or a set of string values, are assigned cyclically to successive observations. The order of observations is determined (1) after noting any if or in restrictions; (2) within groups specified by by(), if issued; (3) by the current sort order. block() specifies that values should be repeated in blocks of the specified size: the default is 1. The variable type is determined smartly, and need not be specified. (Stata 8 required.)

. egen quarter = repeat(), v(1/4) block(3)
. egen months = repeat(), v(`c(Months)')
. egen levels = repeat(), v(10 50 200 500)

sieve(strvar) , { keep(classes) | char(chars) | omit(chars) } selects characters from strvar according to a specified criterion and generates a new string variable containing only those characters. This may be done in three ways. First, characters are classified using the keywords alphabetic (any of a-z or A-Z), numeric (any of 0-9), space or other. keep() specifies one or more of those classes: keywords may be abbreviated by as little as one letter. Thus keep(a n) selects alphabetic and numeric characters and omits spaces and other characters. Note that keywords must be separated by spaces. Alternatively, char() specifies each character to be selected or omit() specifies each character to be omitted. Thus char(0123456789.) selects numeric characters and the stop (presumably as decimal point); omit(" ") strips spaces and omit(`"""') strips double quotation marks. (Stata 7 required.)
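The three selection styles can be compared side by side. This is a sketch only, assuming egenmore is installed; the data and the variable names raw, digits, alnum and nospace are invented for illustration.

```stata
clear
input str20 raw
"Tel: 555-1234"
"A1 b2 C3"
end
* keep only the characters 0-9
egen digits = sieve(raw), keep(numeric)
* keep alphabetic and numeric characters, using abbreviated keywords
egen alnum = sieve(raw), keep(a n)
* strip spaces only
egen nospace = sieve(raw), omit(" ")
list
```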

ston(strvar) , from(list of string values) to(numlist) generates a numeric variable from a string variable strvar, mapping each string value to the corresponding numeric value in numlist. The number of elements in each list must be the same. String values containing blanks should be delimited by " ". Values not defined by the mapping are generated as missing. (Stata 6 required.)

. egen Grade = ston(grade), to(1/5) from(Poor Fair Good "Very good" Excellent)

wordof(strvar) , word(#) returns the #th word of string variable strvar. word(1) is the first word, word(2) the second word, word(-1) the last word, and so forth. Words are separated by spaces, unless bound by quotation marks " ". (Stata 6 required; superseded by word().)
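For instance, the first and last words of make in the auto dataset shipped with Stata could be extracted as sketched here. The sketch assumes egenmore is installed, and the new variable names are illustrative; the final line uses the built-in word() function that supersedes wordof().

```stata
sysuse auto, clear
egen first = wordof(make), word(1)
egen last = wordof(make), word(-1)
* built-in replacement for the first call
gen first2 = word(make, 1)
list make first last first2 in 1/5
```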

Dates, times and time series

bom(m y) [ , lag(lag) format(format) work ] creates an elapsed date variable containing the date of the beginning of month m and year y. m can be a variable containing integers between 1 and 12 inclusive or a single integer in that range. y can be a variable containing integers within the range covered by elapsed dates or a single integer within that range. Optionally lag() specifies a lag: the beginning of the month will be given for lag months before the current date. lag(1) refers to the previous month, lag(3) to 3 months ago and lag(-3) to 3 months hence. The lag may also be specified by a variable containing integers. Optionally a format, usually but not necessarily a date format, can be specified. work specifies that the first day must also be one of Monday to Friday. (Stata 6 required.)

. egen bom = bom(month year), f(%dd_m_y)

bomd(datevar) [ , lag(lag) format(format) work ] creates an elapsed date variable containing the date of the beginning of the month containing the date in an elapsed date variable datevar. Optionally lag() specifies a lag: the beginning of the month will be given for lag months before the current date. lag(1) refers to the previous month, lag(3) to 3 months ago and lag(-3) to 3 months hence. The lag may also be specified by a variable containing integers. Optionally a format, usually but not necessarily a date format, can be specified. work specifies that the first day must also be one of Monday to Friday. (Stata 6 required.)

. egen bomd = bomd(date), f(%dd_m_y)

Note that work knows nothing about holidays or any special days.

dayofyear(daily_date_variable) [ , month(#) day(#) ] generates the day of the year, counting from the start of the year, from a daily date variable. The start of the year is 1 January by default: month() and/or day() may be used to specify an alternative. This function thus is a generalisation of the date function doy(). (Stata 8 required.)

. egen dayofyear = dayofyear(date), m(10)

dhms(d h m s) [ , format(format) ] creates a date variable from Stata date variable or date d with a fractional part reflecting the number of hours, minutes and seconds past midnight. h can be a variable containing integers between 0 and 23 inclusive or a single integer in that range. m and s can be variables containing integers between 0 and 59 or single integer(s) in that range. Optionally a format, usually but not necessarily a date format, can be specified. The resulting variable, which is by default stored as a double, may be used in date and time arithmetic in which the time of day is taken into account. (Stata 6 required.)

elap(time) [ , format(format) ] creates a string variable which contains the number of days, hours, minutes and seconds associated with an integer variable containing a number of elapsed seconds. Such a variable might be the result of date/time arithmetic, where a time interval between two timestamps has been expressed in terms of elapsed seconds. Leading zeroes are included in the hours, minutes, and seconds fields. Optionally, a format can be specified. (Stata 6 required.)

elap2(time1 time2) [ , format(format) ] creates a string variable which contains the number of days, hours, minutes and seconds associated with a pair of time values, expressed as fractional days, where time1 is no greater than time2. Such time values may be generated by function dhms(). elap2() expresses the interval between these time values in readable form. Leading zeroes are included in the hours, minutes, and seconds fields. Optionally, a format can be specified. (Stata 6 required.)

eom(m y) [ , lag(lag) format(format) work ] creates an elapsed date variable containing the date of the end of month m and year y. m can be a variable containing integers between 1 and 12 inclusive or a single integer in that range. y can be a variable containing integers within the range covered by elapsed dates or a single integer within that range. Optionally lag() specifies a lag: the end of the month will be given for lag months before the current date. lag(1) refers to the previous month, lag(3) to 3 months ago and lag(-3) to 3 months hence. The lag may also be specified by a variable containing integers. Optionally a format, usually but not necessarily a date format, can be specified. work specifies that the last day must also be one of Monday to Friday. (Stata 6 required.)

. egen eom = eom(month year), f(%dd_m_y)

eomd(datevar) [ , lag(lag) format(format) work ] creates an elapsed date variable containing the date of the end of the month containing the date in an elapsed date variable datevar. Optionally lag() specifies a lag: the end of the month will be given for lag months before the current date. lag(1) refers to the previous month, lag(3) to 3 months ago and lag(-3) to 3 months hence. The lag may also be specified by a variable containing integers. Optionally a format, usually but not necessarily a date format, can be specified. work specifies that the last day must also be one of Monday to Friday. (Stata 6 required.)

Note that work knows nothing about holidays or any special days.

. egen eom = eomd(date), f(%dd_m_y)
. egen eopm = eomd(date), f(%dd_m_y) lag(1)

ewma(timeseriesvar) , a(#) calculates the exponentially weighted moving average, which is

    ewma = timeseriesvar                          for the first observation

         = a * timeseriesvar + (1 - a) * L.ewma   otherwise

The data must have been declared time series data by tsset. Calculations start afresh after any gap with missing values. (Stata 6 required; superseded by tssmooth.)
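The recursion can be reproduced by hand for comparison. The sketch below assumes egenmore is installed and uses an artificial series with no gaps; the names t, y, s and s_check and the choice a = 0.3 are illustrative. It relies on replace working through the observations in order, so that each value can use the one just computed.

```stata
clear
set obs 10
gen t = _n
gen y = mod(_n, 4)
tsset t
egen s = ewma(y), a(0.3)

* hand-rolled version of the same recursion
gen double s_check = y in 1
replace s_check = 0.3 * y + 0.7 * s_check[_n-1] in 2/l
list t y s s_check
```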

filter(timeseriesvar) , lags(numlist) [ coef(numlist) { normalise | normalize } ] calculates the linear filter which is the sum of terms

coef_i * Li.timeseriesvar or coef_i * Fi.timeseriesvar

coef() defaults to a vector the same length as lags() with each element 1.

filter(y), l(0/3) c(0.4(0.1)0.1) calculates

0.4 * y + 0.3 * L1.y + 0.2 * L2.y + 0.1 * L3.y

filter(y), l(0/3) calculates

1 * y + 1 * L1.y + 1 * L2.y + 1 * L3.y or y + L1.y + L2.y + L3.y

Leads are specified as negative lags. normalise (or normalize, according to taste) specifies that coefficients are to be divided by their sum so that they add to 1 and thus specify a weighted mean.

filter(y), l(-2/2) c(1 4 6 4 1) n calculates

(1/16) * F2.y + (4/16) * F1.y + (6/16) * y + (4/16) * L1.y + (1/16) * L2.y

The data must have been declared time series data by tsset. Note that this may include panel data, which are automatically filtered separately within each panel.

The order of terms in coef() is taken to be the same as that in lags(). (Stata 8 required; see also tssmooth.)

. egen f2y = filter(y), l(-1/1) c(0.25 0.5 0.25)

or, equivalently,

. egen f2y = filter(y), l(-1/1) c(1 2 1) n
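To see what the normalised three-term filter computes, it can be set beside an explicit expression in time-series operators. This is a sketch on an artificial series, assuming egenmore is installed; the names t, y, fy and fy_check are illustrative.

```stata
clear
set obs 8
gen t = _n
gen y = _n^2
tsset t
* weights 1 2 1 divided by their sum, i.e. 0.25 0.5 0.25
egen fy = filter(y), l(-1/1) c(1 2 1) n
* the same weighted mean written out with lead and lag operators
gen fy_check = 0.25 * F1.y + 0.5 * y + 0.25 * L1.y
list t y fy fy_check
```

The explicit expression is missing at both ends of the series, where a lead or a lag is unavailable.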

filter7(timeseriesvar) , lags(numlist) coef(numlist) [ { normalise | normalize } ] calculates the linear filter which is the sum of terms

coef_i * Li.timeseriesvar or coef_i * Fi.timeseriesvar

filter7(y), l(0/3) c(0.4(0.1)0.1) calculates

0.4 * y + 0.3 * L1.y + 0.2 * L2.y + 0.1 * L3.y

Leads are specified as negative lags. normalise (or normalize, according to taste) specifies that coefficients are to be divided by their sum so that they add to 1 and thus specify a weighted mean.

filter7(y), l(-2/2) c(1 4 6 4 1) n calculates

(1/16) * F2.y + (4/16) * F1.y + (6/16) * y + (4/16) * L1.y + (1/16) * L2.y

The data must have been declared time series data by tsset. Note that this may include panel data, which are automatically filtered separately within each panel.

The order of terms in coef() is taken to be the same as that in lags(). (Stata 7 required; see also tssmooth.)

foy(daily_date_variable) [ , month(#) day(#) ] generates the fraction of the year elapsed since the start of the year from a daily date variable. The start of the year is 1 January by default: month() and/or day() may be used to specify an alternative. If daily_date_variable is all integers, then the result is (day of year - 0.5) / number of days in year. If daily_date_variable contains non-integers, then the result is (day of year - 1) / number of days in year. (Stata 8 required.)

. egen frac = foy(date), m(10)

hmm(timevar) [ , round(#) trim ] generates a string variable showing timevar, interpreted as indicating time in minutes, represented as hours and minutes in the form "[...h]h:mm". For example, times of 9, 90, 900 and 9000 minutes would be represented as "0:09", "1:30", "15:00" and "150:00". The option round(#) rounds the result: round(1) rounds the time to the nearest minute. The option trim trims the result of leading zeros and colons, except that an isolated 0 is not trimmed. With trim "0:09" is trimmed to "9" and "0:00" is trimmed to "0".

hmm() serves equally well for representing times in seconds as minutes and seconds in the form "[...m]m:ss". (Stata 6 required.)

hmmss(timevar) [ , round(#) trim ] generates a string variable showing timevar, interpreted as indicating time in seconds, represented as hours, minutes and seconds in the form "[...h:]mm:ss". For example, times of 9, 90, 900 and 9000 seconds would be represented as "00:09", "01:30", "15:00" and "2:30:00". The option round(#) rounds the result: round(1) rounds the time to the nearest second. The option trim trims the result of leading zeros and colons, except that an isolated 0 is not trimmed. With trim "00:09" is trimmed to "9" and "00:00" is trimmed to "0". (Stata 6 required.)
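The worked values quoted for hmm() and hmmss() can be reproduced with a four-observation dataset. This is a sketch assuming egenmore is installed; the names t, as_hmm, as_hmmss and as_trim are illustrative.

```stata
clear
input double t
9
90
900
9000
end
* t read as minutes
egen as_hmm = hmm(t)
* t read as seconds
egen as_hmmss = hmmss(t)
* trimmed variant
egen as_trim = hmmss(t), trim
list
```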

hms(h m s) [ , format(format) ] creates an elapsed time variable containing the number of seconds past midnight. h can be a variable containing integers between 0 and 23 inclusive or a single integer in that range. m and s can be variables containing integers between 0 and 59 or single integer(s) in that range. Optionally a format can be specified. (Stata 6 required.)
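As a sketch of the seconds-past-midnight calculation, assuming egenmore is installed (the variable names h, m, s and secs are illustrative), the result for 14:30:15 can be compared with the sum worked out by hand:

```stata
clear
input byte(h m s)
14 30 15
0 0 59
end
egen secs = hms(h m s)
list

* 14 hours, 30 minutes and 15 seconds expressed in seconds
display 14*3600 + 30*60 + 15
```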

minutes(strvar) [ , maxhour(#) ] returns time in minutes given a string variable strvar containing a time in hours and minutes in the form "[..h]hh:mm". In particular, minutes are given as two digits between 00 and 59 and hours by default are given as two digits between 00 and 23. The maxhour() option may be used to change the (unreachable) limit: its default is 24. Note that, strange though it may seem, this function rather than seconds() is appropriate for converting times in the form "mm:ss" to seconds. The maximum number of minutes acceptable may need then to be specified by maxhour() [sic]. (Stata 8 required.)

ncyear(datevar) , month(#) [ day(#) ] returns an integer variable labelled with labels such as "1952/53" for non-calendar years starting on the specified month and day. The day defaults to 1. datevar is treated as indicating elapsed dates. For more on dates, see help on dates. (Stata 6 required.)

. egen wtryear = ncyear(date), m(10) (years starting on 1 October)

. egen wwgyear = ncyear(date), m(1) d(21) (years starting on 21 January)

record(exp) [ , by(byvarlist) min order(varlist) ] produces the maximum (with min the minimum) value observed "to date" of the specified exp. Thus record(wage), by(id) order(year) produces the maximum wage so far in a worker's career, calculations being separate for each id and records being determined within each id in year order. Although explanation and example here refer to dates, nothing in record() restricts its use to data ordered in time. If not otherwise specified with by() and/or order(), records are determined with respect to the current order of observations. No special action is required for missing values, as internally record() uses either the max() or the min() function, both of which return results of missing only if all values are missing. (Stata 6 required.)

. egen hiwage = record(exp(lwage)), by(id) order(year)
. egen lowage = record(exp(lwage)), by(id) order(year) min

seconds(strvar) [ , maxhour(#) ] returns time in seconds given a string variable containing a time in hours, minutes and seconds in the form "[..h]hh:mm:ss". In particular, minutes and seconds are each given as two digits between 00 and 59 and hours by default are given as two digits between 00 and 23. The maxhour() option may be used to change the (unreachable) limit: its default is 24. (Stata 8 required.)

tod(time) [ , format(format) ] creates a string variable which contains the number of hours, minutes and seconds associated with an integer in the range 0 to 86399, one less than the number of seconds in a day. Such a variable is produced by hms(), which see above. Leading zeroes are included in the hours, minutes, and seconds fields. Colons are used as separators. Optionally a format can be specified. (Stata 6 required.)

Summaries and estimates

adjl(varname) [ , by(byvarlist) factor(#) ] calculates adjacent lower values. These are the smallest values within factor() times the interquartile range of the lower quartile. By default factor() is 1.5, defining the default lower value of a so-called whisker on a Stata box plot. (Stata 8 required.)

adju(varname) [ , by(byvarlist) factor(#) ] calculates adjacent upper values. These are the largest values within factor() times the interquartile range of the upper quartile. By default factor() is 1.5, defining the default upper value of a so-called whisker on a Stata box plot. (Stata 8 required.)

. egen adjl = adjl(mpg), by(foreign)
. egen adju = adju(mpg), by(foreign)

corr(varname1 varname2) [ , covariance spearman taua taub by(byvarlist) ] returns the correlation of varname1 with varname2. By default, this returns the Pearson correlation coefficient. covariance indicates that covariances should be calculated; spearman indicates that Spearman's rank correlation coefficient should be calculated; taua and taub return Kendall's tau-A and tau-B, respectively. (Stata 8 required.)
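A brief sketch using the auto dataset shipped with Stata, assuming egenmore is installed (the names r and rs are illustrative), with an official command as a cross-check for one group:

```stata
sysuse auto, clear
* Pearson and Spearman correlations, separately for each value of foreign
egen r = corr(mpg weight), by(foreign)
egen rs = corr(mpg weight), by(foreign) spearman
tabstat r rs, by(foreign) s(mean)

* cross-check of the Pearson coefficient for domestic cars
correlate mpg weight if foreign == 0
```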

density(varname) [ , width(#) start(#) frequency percent fraction by(byvarlist) ] calculates the density (or optionally the frequency, fraction or percent) of values in bins of width width() (default 1) starting at start() (default minimum of the data). Note that each value produced will be identical for all observations in the same bin. Commonly for further use it will be desired to select one value from each bin, say by using egen's tag() function. (Stata 8 required.)

gmean(exp) [ , by(byvarlist) ] returns the geometric mean of exp. (Stata 6 required.)

. egen gmean = gmean(mpg), by(rep78)

hmean(exp) [ , by(byvarlist) ] returns the harmonic mean of exp. (Stata 6 required.)

. egen hmean = hmean(mpg), by(rep78)

nmiss(exp) [ , by(byvarlist) ] returns the number of missing values in exp. (Stata 6 required.) Remark: Why this was written is a mystery. The one-line command egen nmiss = sum(missing(exp)) (in Stata 9, egen nmiss = total(missing(exp))) shows that it is unnecessary.

. egen nmiss = nmiss(rep78), by(foreign)

nvals(varname) [ , by(byvarlist) missing ] returns the number of distinct values in varname. Missing values are ignored unless missing is specified. Remark: Much can be done by using egen function tag() and then summing values as desired. (Stata 6 required.)
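The remark about tag() can be illustrated as follows. This is a sketch assuming egenmore is installed and Stata 9 or later for total(); the variable names are illustrative. By default tag() does not tag observations with missing values, which matches the default of nvals().

```stata
sysuse auto, clear
egen nrep = nvals(rep78), by(foreign)

* the same count built from tag(): one tagged observation per distinct value
egen tagged = tag(foreign rep78)
egen nrep_check = total(tagged), by(foreign)
tabstat nrep nrep_check, by(foreign) s(mean)
```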

outside(varname) [ , by(byvarlist) factor(#) ] calculates outside values. These are any values more than factor() times the interquartile range from the nearer quartile, that is above the upper quartile or below the lower quartile. By default factor() is 1.5, defining the default outside values, those plotted separately, on a Stata box plot. Values not outside are returned as missing. (Stata 8 required.)

ridit(varname) [ , by(byvarlist) missing percent reverse ] calculates the ridit for varname, which is

    (1/2) count at this value + SUM counts in values below
    ------------------------------------------------------
                   SUM counts of all values

With terminology from Tukey (1977, pp.496-497), this could be called a `split fraction below'. The name `ridit' was used by Bross (1958): see also Fleiss (1981, pp.150-7) or Flora (1988). The numerator is a `split count'.

missing specifies that observations for which values of byvarlist are missing will be included in calculations if by() is specified. The default is to exclude them. percent scales the numbers to percents by multiplying by 100. reverse specifies the use of reverse cumulative probabilities (1 - fraction above). (Stata 6 required.)
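
As a worked illustration with made-up counts, suppose a variable takes the values 1, 2 and 3 with counts 2, 3 and 5, so that the total count is 10. The formula above then gives

    value 1: (0.5 * 2 + 0) / 10 = 0.10
    value 2: (0.5 * 3 + 2) / 10 = 0.35
    value 3: (0.5 * 5 + 5) / 10 = 0.75

With percent these would be 10, 35 and 75. Calls on the auto data (assumed to be in memory; variable names arbitrary) might look like this:

. egen rid = ridit(rep78)
. egen ridpc = ridit(rep78), by(foreign) percent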

semean(exp) [ , by(byvarlist) ] calculates the standard error of the mean of exp. (Stata 6 required.)
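
A sketch, assuming the auto data are in memory. The last three lines compute the usual definition, standard deviation divided by the root of the number of non-missing values, as a check; that this matches the function's internal calculation exactly is an assumption.

. egen se = semean(mpg), by(foreign)
. egen sdmpg = sd(mpg), by(foreign)
. egen nmpg = count(mpg), by(foreign)
. gen se2 = sdmpg / sqrt(nmpg)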

sumoth(exp) [ , by(byvarlist) ] returns the sum of the other values of exp in the same group. If by() is specified, distinct combinations of byvarlist define groups; otherwise all observations define one group. (Stata 6 required.)
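
A sketch, assuming the auto data are in memory and Stata 9 or later for total(). Where exp is never missing, as with weight here, the sum of the other values should equal the group total minus the observation's own value.

. egen othwt = sumoth(weight), by(foreign)
. egen totwt = total(weight), by(foreign)
. gen othwt2 = totwt - weight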

var(exp) [ , by(byvarlist) ] creates a constant (within byvarlist) containing the variance of exp. Note also the egen function sd(). (Stata 6 required.)

wpctile(varname) [ , p(#) weights(varname) altdef by(byvarlist) ] is a hack on official Stata's egen function pctile() allowing specification of weights in the calculation of percentiles. By default, the function creates a constant (within byvarlist) containing the #th percentile of varname. If p() is not specified, 50 is assumed, meaning medians. weights() requests weighted calculation of percentiles. altdef uses an alternative formula for calculating percentiles, which is not applicable with weights present. by() requests calculation by groups. You may also use the by: construct. (Stata 8.2 required.)
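
For example, assuming the auto data are in memory, and with weight used purely as a stand-in for a weighting variable:

. egen wmed = wpctile(mpg), weights(weight) by(foreign)
. egen wp90 = wpctile(mpg), p(90) weights(weight) by(foreign)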

wtfreq(exp) [ , by(byvarlist) ] creates a constant (within byvarlist) containing the weighted frequency using exp as weights. (Such frequencies sum to _N.) (Stata 6 required.)

xtile(varname) [ , percentiles(numlist) nquantiles(#) weights(varname) altdef by(byvarlist) ] categorizes varname by specific percentiles. The function works like xtile. By default varname is dichotomized at the median. percentiles() requests percentiles corresponding to numlist: for example, p(25(25)75) creates a variable according to quartiles. Alternatively, nquantiles() requests that number of quantile groups: n(4) also creates a variable according to quartiles. weights() requests weighted calculation of percentiles. altdef uses an alternative formula for calculating percentiles. See xtile. by() requests calculation by groups. You may also use the by: construct. (Stata 8.2 required.)

. egen mpg4 = xtile(mpg), by(foreign) p(25(25)75)
. egen mpg10 = xtile(mpg), by(foreign) nq(10)

First and last

first(varname) [ , by(byvarlist) ] returns the first non-missing value of varname. `First' depends on the existing order of observations. varname may be numeric or string. (Stata 6 required.)
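
Because the result depends on the order of observations, it is usually wise to sort beforehand. A sketch, assuming the auto data are in memory: after sorting on price, rep1 should hold, within each group of foreign, the rep78 value of the cheapest car for which rep78 is not missing.

. sort foreign price
. egen rep1 = first(rep78), by(foreign)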

ifirst(numvar) , value(#) [ { before | after } by(byvarlist) ] indicates the first occurrence of integer # within numvar by 1 and other observations by 0.

before indicates observations before the first occurrence by 1 and other observations by 0. after indicates observations after the first occurrence by 1 and other observations by 0. The values returned by default, with before, and with after always sum to 1 for the observations analysed.

First occurrence is determined as follows: (1) if if or in is specified, any observations excluded are ignored; (2) if by() is specified, first is determined separately for each distinct group of observations; (3) first is first in current sort order. If # does not occur, all observations are before the first occurrence. (Stata 6 required.)

. gen warm = celstemp > 20
. egen fwarm = ifirst(warm), v(1) by(year)

ilast(numvar) , value(#) [ { before | after } by(byvarlist) ] indicates the last occurrence of integer # within numvar by 1 and other observations by 0.

before indicates observations before the last occurrence by 1 and other observations by 0. after indicates observations after the last occurrence by 1 and other observations by 0. The values returned by default, with before, and with after always sum to 1 for the observations analysed.

Last occurrence is determined as follows: (1) if if or in is specified, any observations excluded are ignored; (2) if by() is specified, last is determined separately for each distinct group of observations; (3) last is last in current sort order. If # does not occur, all observations are before the last occurrence. (Stata 6 required.)
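
Using the indicator warm created in the ifirst() example above, a parallel sketch for ilast() might be as follows (lwarm and postwarm are arbitrary names):

. egen lwarm = ilast(warm), v(1) by(year)
. egen postwarm = ilast(warm), v(1) after by(year)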

lastnm(varname) [ , by(byvarlist) ] returns the last non-missing value of varname. `Last' depends on the existing order of observations. varname may be numeric or string. Remark: lastnm() would have been better called last(), except that an egen program with that name for selecting the last `word' in a string was published in STB-50. (Stata 6 required.)

Random numbers

mixnorm() [ , frac(#) mu1(#) mu2(#) var1(#) var2(#) ] generates a new variable of specified type as a mixture of two Normal distributions, with the fraction frac(#) of the observations defined by the first distribution. Both options for means mu1(#) and mu2(#) default to 0; both options for variances var1(#) and var2(#) default to 1, while frac(#) defaults to 0.5. Only non-default parameters of the desired mixture need be specified. (Stata 8 required.)

. egen mixture = mixnorm(), frac(0.9) mu2(10) var2(4)

rndint() , max(#) [ min(#) ] generates random integers from a uniform distribution on min() to max(), inclusive. min(1) is the default. Remark: Note that ceil(uniform() * #) is a direct way to get random integers from 1 to #. (Stata 6 required.)

. egen integ = rndint(), min(100) max(199)

rndsub() [ , ngroup(#) { frac(#) | percent(#) } by(byvarlist) ] randomly splits observations into groups or subsamples. The result is a categorical variable taking values from 1 upward labelling distinct groups.

ngroup(#) (default 2) defines the number of groups.

frac(#), which is only allowed with ngroup(2), specifies that the first group should contain 1 / # of the observations and thus that the second group should contain the remaining observations.

percent(#), which is only allowed with ngroup(2), specifies that the first group should contain #% of the observations and thus that the second group should contain the remaining observations.

frac() and percent() may not be specified together. (Stata 6 required.)

. egen group = rndsub(), by(foreign)

. egen group = rndsub(), by(foreign) f(3)
(first group contains 1/3 of observations, second group contains 2/3)

. egen group = rndsub(), by(foreign) p(25)
(first group contains 25% of observations, second group contains 75%)

For reproducible results, set the seed of the random number generator beforehand and document your choice.

Note that to generate # random numbers the number of observations must be at least #. If there are no data in memory and you want 100 random numbers, type set obs 100 before using these functions.
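
Putting these points together, a sketch that starts from an empty dataset. The seed 12345 is an arbitrary choice, and clear discards any data in memory, so save first if necessary. The last line shows the direct alternative to rndint() mentioned above.

. clear
. set obs 100
. set seed 12345
. egen die = rndint(), max(6)
. gen die2 = ceil(uniform() * 6)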

Row operations

rall(varlist) , cond(condition) [ symbol(symbol) ] returns 1 for observations for which the condition specified is true for all variables in varlist and 0 otherwise. The condition should be specified using symbol(), by default @, as a placeholder for each variable. Thus, for example, rall(varlist), c(@ > 0 & @ < .) tests whether all variables in varlist are positive and non-missing. Note that conditions typically make sense only if variables are either all numeric or all string: one exception is missing(@). (Stata 6 required.)

rany(varlist) , cond(condition) [ symbol(symbol) ] returns 1 for observations for which the condition specified is true for any variable in varlist and 0 otherwise. The condition should be specified using symbol(), by default @, as a placeholder for each variable. Thus, for example, rany(varlist), c(@ > 0 & @ < .) tests whether any variable in varlist is positive and non-missing. Note that conditions typically make sense only if variables are either all numeric or all string: one exception is missing(@). (Stata 6 required.)

rcount(varlist) , cond(condition) [ symbol(symbol) ] returns the number of variables in varlist for which the condition specified is true. The condition should be specified using symbol(), by default @, as a placeholder for each variable. Thus, for example, rcount(varlist), c(@ > 0 & @ < .) counts for each observation how many variables in varlist are positive and non-missing. Note that conditions typically make sense only if variables are either all numeric or all string: one exception is missing(@). More precisely, rcount() gives the sum across varlist of condition, evaluated in turn for each variable. (Stata 6 required.)

For rall(), rany(), and rcount(), the symbol() option may be used to set an alternative to @ whenever the latter is inappropriate. For example, if string variables were being searched for literal occurrences of "@", some other symbol not appearing in text or in variable names should be used.
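
A sketch of such a search, in which s1, s2 and s3 are hypothetical string variables and X is chosen as the placeholder because it appears neither in the variable names nor elsewhere in the condition:

. egen nat = rcount(s1 s2 s3), c(index(X, "@") > 0) symbol(X)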

. egen any = rany(b c d e f) , c(@ == a)
. egen all = rall(b c d e f) , c(@ == a)
. egen count = rcount(b c d e f) , c(@ == a)
(values of b c d e f matched by (equal to) those of a?)

. egen anyw1 = rany(b c d e f) , c(abs(@ - a) <= 1)
. egen allw1 = rall(b c d e f) , c(abs(@ - a) <= 1)
. egen countw1 = rcount(b c d e f) , c(abs(@ - a) <= 1)
(values of b c d e f within 1 of those of a?)

From Stata 7, foreach provides an alternative that would now be considered better style:

. gen any = 0
. gen all = 1
. gen count = 0
. foreach v of var a b c d e f {
.         replace any = max(any, inrange(`v', 0, .))
.         replace all = min(all, inrange(`v', 0, .))
.         replace count = count + inrange(`v', 0, .)
. }

rowmedian(varlist) returns, for each observation, the median of the values of the variables in varlist (the row median). (Stata 9 required.)
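
For example, with the hypothetical numeric variables b c d e f used in the examples above, an observation with values 3, 1, 4, 1 and 5 would be assigned 3 (rmed is an arbitrary name):

. egen rmed = rowmedian(b c d e f)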

rsum2(varlist) is a generalisation of egen's rsum() (from Stata 9: rowtotal()) function with the extra options allmiss and anymiss. As with rsum(), it creates the (row) sum of the variables in varlist, treating missing as 0. However, if the option allmiss is selected, the (row) sum for any observation for which all variables in varlist are missing is set equal to missing. Similarly, if the option anymiss is selected the (row) sum for any observation for which any variable in varlist is missing is set equal to missing. (Stata 6 required.)
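
A sketch of the two options, again with the hypothetical numeric variables b c d e f (new variable names arbitrary). The first total is missing only where all five variables are missing; the second is missing wherever at least one of them is missing.

. egen totall = rsum2(b c d e f), allmiss
. egen totany = rsum2(b c d e f), anymiss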

References

Bross, I.D.J. 1958. How to use ridit analysis. Biometrics 14: 38-58.

Fleiss, J.L. 1981. Statistical Methods for Rates and Proportions. New York: John Wiley.

Flora, J.D. 1988. Ridit analysis. In Kotz, S. and Johnson, N.L. (eds) Encyclopedia of Statistical Sciences. New York: John Wiley. 8: 136-139.

Tukey, J.W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.

Maintainer

Nicholas J. Cox, Durham University, U.K. n.j.cox@durham.ac.uk

Acknowledgements

Kit Baum (baum@bc.edu) is the first author of record() and the author of dhms(), elap(), elap2(), hms(), tod() and mixnorm().

Ulrich Kohler (kohler@wzb.eu) is the author of xtile(), mlabvpos(), iso3166() and wpctile().

Steven Stillman (s.stillman@verizon.net) is the author of rsum2().

Nick Winter (njw3x@virginia.edu) is the author of corr() and noccur().

Kit Baum, Sascha Becker, Ronán Conroy, William Gould, Syed Islam, John Moran, Stephen Soldz, Richard Williams, Fred Wolfe and Gerald Wright provided stimulating and helpful comments.

Also see

STB: STB-50 dm70 for atan2(), pp(), rev(), rindex(), rmed(), rotate()

Manual: [D] egen (before Stata 9 [R] egen)

On-line: help for egen, dates, functions, means, numlist, seed, tsset, varlist (timeseries operators), circular (if installed), ntimeofday (if installed), stimeofday (if installed)