```-------------------------------------------------------------------------------
help for egenmore
-------------------------------------------------------------------------------

Extensions to generate (more extras)

egen [type] newvar = fcn(arguments) [if exp] [in range] [, options]

Description

egen creates newvar of the optionally specified storage type equal to
fcn(arguments).  Depending on fcn(), arguments refers to an expression, a
varlist, a numlist, or an empty string. The options are similarly
function dependent.

Functions

(The option by(byvarlist) means that computations are performed
separately for each group defined by byvarlist.)

Functions are grouped thematically as follows:
Grouping and graphing
Strings, numbers and conversions
Dates, times and time series
Summaries and estimates
First and last
Random numbers
Row operations

Grouping and graphing

axis(varlist) [ , gap label(lblvarlist) missing reverse ] resembles
egen's group(), but is specifically designed for constructing
categorical axis variables for graphs, hence the name. It creates a
single variable taking on values 1, 2, ...  for the groups formed by
varlist.  varlist may contain string, numeric, or both string and
numeric variables.  The order of the groups is that of the sort order
of varlist.  gap overrides the default numbering of 1 up by adding a
gap of 1 whenever a variable changes.  label() specifies that labels
are to be assigned based on the value labels or values of lblvarlist;
if not specified, lblvarlist defaults to varlist.  missing indicates
that missing values in varlist (either numeric missing or "") are to
be treated like any other value when assigning groups, instead of
missing values being assigned to the group missing. reverse reverses
labelling so that groups that would have been assigned values of 1
...  whatever are instead assigned values of whatever ... 1. (Stata 8
required.)

To order groups of a categorical variable according to their values of
another variable, in preparation for a graph or table:

. egen meanmpg = mean(-mpg), by(rep78)
. egen Rep78 = axis(meanmpg rep78), label(rep78)
. tabstat mpg, by(Rep78) s(min mean max)

clsst(varname) , values(numlist) [ later ] returns whichever of the
numlist in values() is closest (differs by least, disregarding sign)
to the numeric variable varname. later specifies that in the event of
ties values specified later in the list overwrite values specified
earlier. If varname is 15 then 10 and 20 specified by values(10 20)
are equally close. For any observation containing 15 the default is
that 10 is reported, whereas with later 20 is reported. For a numlist
containing an increasing sequence, later implies choosing the higher
of two equally close values. (Stata 6 required.)

. egen mpgclass = clsst(mpg), v(10(5)40)

egroup(varlist) is an extension of egen's group() function with the extra
option label(lblvarlist), which will attach the original values (or
value labels if they exist) of lblvarlist as value labels.  This
option may not be combined with the label option.  (Stata 7 required;
superseded by axis() above.)

group2(varlist) is a generalisation of egen's group() with the extra
option sort(egen_call).  Groups of varlist will have values 1 upwards
according to their values on the results of a specified egen_call.
For example, group2(rep78) sort(mean(mpg)) will produce a variable
such that the group of rep78 with the lowest mean of mpg will have
value 1, that with the second lowest mean will have value 2, and so
forth.  As with group(), the label option will attach the original
values of varlist (or value labels if they exist) as value labels.
The argument of sort() must be a valid call to an egen function,
official or otherwise. (Stata 7 required; use of egroup() or axis()
above is now considered better style.)
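
A minimal sketch of the example just described, assuming the auto data
are in memory (the new variable name is illustrative only):

. egen repbympg = group2(rep78), sort(mean(mpg)) label
. tabstat mpg, by(repbympg) s(mean)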

mlabvpos(yvar xvar) [ , log polynomial(#) matrix(5x5 matrix) ]
automatically generates a variable giving clock positions of marker
labels given names of variables yvar and xvar defining the axes of a
scatter plot. Thus the command generates a variable to be used in the
scatter option mlabvpos().

The general idea is to pull marker labels away from the data region.
So, marker labels in the lower left of the region are at clock
positions 7 or 8, and those in the upper right are at clock positions
1 or 2, etc.  More precisely, considering the following rectangle as
the data region, then marker labels are placed as follows:

+--------------+
|11 12 12 12  1|
|10 11 12  1  2|
| 9  9 12  3  3|
| 8  7  6  5  4|
| 7  6  6  6  5|
+--------------+

Note that there is no attempt to prevent marker labels from
overplotting, which is likely in any dataset with many observations.
In such situations you might be better off simply randomizing clock
positions with say ceil(uniform() * 12).

If yvar and xvar are highly correlated, then the clock positions are
generated as follows (which is, however, the same general idea):

+--------------+
|      12  1  3|
|   12 12  3  4|
|11 11 12  5  5|
|10  9  6  6   |
| 9  7  6      |
+--------------+

To calculate the positions, the x axis is first categorized into 5
equal intervals around the mean of xvar. Afterwards the residuals
from regression of yvar on xvar are categorized into 5 equal
intervals. Both categorized variables are then used to calculate the
positions according to the first table above.  The rule can be
changed with the option matrix().

log indicates that residuals from the regression are to be calculated
using the logarithms of xvar. This might be useful if the scatter
shows a strong curvilinear relationship.

polynomial(#) indicates that residuals are to be calculated from a
regression of yvar on a polynomial of xvar. For example, use poly(2)
if the scatter shows a U-shaped relationship.

matrix(#) is used to change the general rule for the plot positions.
The positions are specified by a 5 x 5 matrix, in which cell [1,1]
gives the clock position of marker labels in the upper left part of
the data region, and so forth.  (Stata 8.2 required.)

. egen clock = mlabvpos(mpg weight)
. scatter mpg weight, mlab(make) mlabvpos(clock)
. egen clock2 = mlabvpos(mpg weight), matrix(11 1 12 11 1 \ 10 2 12 10 2
\ 9 3 12 9 3 \ 8 4 6 8 4 \ 7 5 6 7 5)
. sc mpg weight, mlab(make) mlabvpos(clock2)

Strings, numbers and conversions

base(varname) [ , base(#) ] produces a string variable containing the
digits of a base # (default 2, possible values 2(1)9) representation
of varname, which must contain integers. Thus if varname contains
values 0, 1, 2, 3, 4, and the default base is used, then the result
will contain the strings "000", "001", "010", "011", "100".  If any
integer values are negative, all string values will start with a sign
character (- if negative). The examples below show how to unpack this
string into individual digits if desired. (Stata 6 required.)

. egen binary = base(code)

Suppose binary is str5.  To get individual str1 variables,

. forval i = 1/5 {
.         gen str1 code`i' = substr(binary, `i',1)
. }

and to get individual numeric variables,

. forval i = 1/5 {
.         gen byte code`i' = real(substr(binary, `i', 1))
. }

decimal(varlist) [ , base(#) ] treats the values of varlist as indicating
digits in a base # (default 2, possible values integers >=2)
representation of a number and produces the decimal equivalent. Thus
if three variables are given with values in a single observation of 1
1 0, and the default base is used, the decimal result is 1 * 2^2 + 1
* 2^1 + 0 * 2^0 = 4 + 2 + 0 = 6.  Similarly if base 5 is used, the
decimal equivalent of 2 3 4 is 2 * 5^2 + 3 * 5^1 + 4 * 5^0 = 50 + 15
+ 4 = 59. Note that the order of variables in varlist is crucial.
(Stata 7 required.)

. egen decimal = decimal(q1-q8)

incss(strvarlist) , substr(substring) [ insensitive ] indicates
occurrences of substring within any of the variables in a list of
string variables by 1 and other observations by 0. insensitive makes
comparison case-insensitive. (Stata 6 required; an alternative is now
just to use foreach.)

. egen buick = incss(make), sub(buick) i

iso3166(varname) [, origin(codes|names) language(en|fr) verbose update]
maps varname containing "official short country names" into a new
variable containing the ISO 3166-1-alpha-2 code elements (e.g. DE for
"Germany", GB for "United Kingdom" and HM for "Heard Island and
McDonald Islands") and vice versa. The official short country names
can be in English (default) or French. Correspondingly the function
produces country names from ISO 3166-1-alpha-2 codes in English or
French. (Version 9.2 required.)

origin(codes|names) declares the character of the country variable
that is already in the data. The default is names, meaning that
varname holds the "official short country names". This information
may be stored as a string variable or as a numeric variable that is
labeled accordingly. This default setting produces ISO 3166-1-alpha-2
codes from the country names. If country names should be produced
from the two letter codes, use egen newvar = iso3166(varname),
origin(codes).

language(en|fr) defines the language in which the country names are
stored, or should be produced. language(en) is for English names
(default); language(fr) is for French names.

verbose For the mapping from country names to ISO 3166-1-alpha-2 codes
the program expects official short country names. It cannot handle
unofficial country names such as "Great Britain", "Taiwan" or
"Russia". Such unofficial country names result in the generation of
missing values for the respective countries. By default iso3166()
only returns the number of missing values it has produced. With
verbose Stata also provides the list of unofficial country names in
varname and a clickable link to the list of official country names.
This is convenient if one wants to correct the information stored in
varname before using iso3166(). For the transformation of ISO
3166-1-alpha-2 codes into country names, verbose does something
equivalent.

update The ISO 3166-1-alpha-2 codes are automatically looked up in
information provided by the ISO 3166 Maintenance Agency of the
International Organization for Standardization. The information is
downloaded on calling iso3166() the first time, or whenever update is
specified. Note:
Updating the matching list regularly will guarantee that iso3166()
always produces up-to-date country names. However, updating the match
list may also produce missing values when running older do-files for
data sets with countries that no longer exist (for example,
Yugoslavia).

Note the implications: This function will only work if your copy of
Stata can access the internet, at least for the first time it is
called.  The results of the function might not be fully reproducible
in the future.

msub(strvar) , find(findstr) [ replace(replacestr) n(#) word ] replaces
occurrences of the words of findstr by the words of replacestr in the
string variable strvar. The words of findstr and of replacestr are
separated by spaces or bound by " ": thus find(a b "c d") includes
three words, in turn "a", "b" and "c d", and double quotation marks "
" should be used to delimit any word including one or more spaces.
The number of words in findstr should equal that in replacestr,
except that (1) an empty replacestr is taken to specify deletion; (2)
a single word in replacestr is taken to mean that each word of
findstr is to be replaced by that word. As quotation marks are used
for delimiting, literal quotation marks should be included in
compound double quotation marks, as in `"""'.  By default all
occurrences are changed. n(#) specifies that the first # occurrences
only should be changed. word specifies that words in findstr are to
be replaced only if they occur as separate words in strvar. The
substitutions of msub() are made in sequence.  (Stata 6 required;
msub() depends on the built-in functions subinstr() and subinword().)

. egen newstr = msub(strvar), f(A B C) r(1 2 3)
(replaces "A" by "1", "B" by "2", "C" by "3")

. egen newstr = msub(strvar), f(A B C) r(1 2 3) n(1)
(replaces "A" by "1", "B" by "2", "C" by "3", first occurrence only)

. egen newstr = msub(strvar), f(A B C) r(1)
(replaces "A" by "1", "B" by "1", "C" by "1")

. egen newstr = msub(strvar), f(A B C)
(deletes "A", "B", "C")

. egen newstr = msub(strvar), f(" ")
(deletes spaces)

. egen newstr = msub(strvar), f(`"""')
(deletes quotation mark ")

. egen newstr = msub(strvar), f(frog) w
(deletes "frog" only if occurring as single word)

noccur(strvar) , string(substr) creates a variable containing the number
of occurrences of the string substr in string variable strvar.  Note
that occurrences must be disjoint (non-overlapping): thus there are
two occurrences of "aa" within "aaaaa". (Stata 7 required.)

nss(strvar) , find(substr) [ insensitive ] returns the number of
occurrences of substr within the string variable strvar.  insensitive
makes counting case-insensitive. (Stata 6 required.)

The inclusion of noccur() and nss(), two almost identical functions, was
an act of sheer inadvertence by the maintainer.
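
For illustration, assuming the auto data are in memory, either function
can count the occurrences of "a" in make (the new variable names are
illustrative only):

. egen na1 = noccur(make), string(a)
. egen na2 = nss(make), find(a) insensitive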

ntos(numvar) , from(numlist) to(list of string values) generates a string
variable from a numeric variable numvar, mapping each numeric value
in numlist to the corresponding string value.  The number of elements
in each list must be the same. String values containing blanks should
be delimited by double quotation marks " ". Values not defined by the
mapping are generated as missing. The type of the string variable is
determined automatically. (Stata 6 required.)

. egen grade = ntos(Grade), from(1/5) to(Poor Fair Good "Very good"
Excellent)

nwords(strvar) returns the number of words within the string variable
strvar. Words are separated by spaces, unless bound by double
quotation marks " ". (Stata 6 required; superseded by wordcount().)

repeat() , values(value_list) [ by(byvarlist) block(#) ] produces a
repeated sequence of value_list. The items of value_list, which may
be a numlist or a set of string values, are assigned cyclically to
successive observations. The order of observations is determined (1)
after noting any if or in restrictions; (2) within groups specified
by by(), if issued; (3) by the current sort order. block() specifies
that values should be repeated in blocks of the specified size: the
default is 1. The variable type is determined smartly, and need not
be specified. (Stata 8 required.)

. egen quarter = repeat(), v(1/4) block(3)
. egen months = repeat(), v(`c(Months)')
. egen levels = repeat(), v(10 50 200 500)

sieve(strvar) , { keep(classes) | char(chars) | omit(chars) } selects
characters from strvar according to a specified criterion and
generates a new string variable containing only those characters.
This may be done in three ways. First, characters are classified
using the keywords alphabetic (any of a-z or A-Z), numeric (any of
0-9), space or other. keep() specifies one or more of those classes:
keywords may be abbreviated by as little as one letter.  Thus keep(a
n) selects alphabetic and numeric characters and omits spaces and
other characters. Note that keywords must be separated by spaces.
Alternatively, char() specifies each character to be selected or
omit() specifies each character to be omitted. Thus char(0123456789.)
selects numeric characters and the stop (presumably as decimal
point); omit(" ") strips spaces and omit(`"""') strips double
quotation marks.  (Stata 7 required.)
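
A short sketch under assumed names (strvar as in the syntax above; the
new variable names are illustrative only):

. egen alnum = sieve(strvar), keep(a n)
. egen nospace = sieve(strvar), omit(" ")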

ston(strvar) , from(list of string values) to(numlist) generates a
numeric variable from a string variable strvar, mapping each string
value to the corresponding numeric value in numlist. The number of
elements in each list must be the same. String values containing
blanks should be delimited by " ". Values not defined by the mapping
are generated as missing. (Stata 6 required.)

. egen Grade = ston(grade), to(1/5) from(Poor Fair Good "Very good"
Excellent)

wordof(strvar) , word(#) returns the #th word of string variable strvar.
word(1) is the first word, word(2) the second word, word(-1) the last
word, and so forth. Words are separated by spaces, unless bound by
quotation marks " ". (Stata 6 required; superseded by word().)
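
A sketch, assuming the auto data are in memory (the new variable names
are illustrative only). For a value of make such as "AMC Concord" the
first word is "AMC" and the last word is "Concord":

. egen brand = wordof(make), word(1)
. egen lastword = wordof(make), word(-1)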

Dates, times and time series

bom(m y) [ , lag(lag) format(format) work ] creates an elapsed date
variable containing the date of the beginning of month m and year y.
m can be a variable containing integers between 1 and 12 inclusive or
a single integer in that range.  y can be a variable containing
integers within the range covered by elapsed dates or a single
integer within that range. Optionally lag() specifies a lag: the
beginning of the month will be given for lag months before the
current date. lag(1) refers to the previous month, lag(3) to 3 months
ago and lag(-3) to 3 months hence. The lag may also be specified by a
variable containing integers. Optionally a format, usually but not
necessarily a date format, can be specified.  work specifies that the
first day must also be one of Monday to Friday. (Stata 6 required.)

. egen bom = bom(month year), f(%dd_m_y)

bomd(datevar) [ , lag(lag) format(format) work ] creates an elapsed date
variable containing the date of the beginning of the month containing
the date in an elapsed date variable datevar.  Optionally lag()
specifies a lag: the beginning of the month will be given for lag
months before the current date. lag(1) refers to the previous month,
lag(3) to 3 months ago and lag(-3) to 3 months hence. The lag may
also be specified by a variable containing integers. Optionally a
format, usually but not necessarily a date format, can be specified.
work specifies that the first day must also be one of Monday to
Friday. (Stata 6 required.)

. egen bomd = bomd(date), f(%dd_m_y)

Note that work knows nothing about holidays or any special days.

dayofyear(daily_date_variable) [ , month(#) day(#) ] generates the day of
the year, counting from the start of the year, from a daily date
variable. The start of the year is 1 January by default: month()
and/or day() may be used to specify an alternative.  This function
thus is a generalisation of the date function doy().  (Stata 8
required.)

. egen dayofyear = dayofyear(date), m(10)

dhms(d h m s) [ , format(format) ] creates a date variable from Stata
date variable or date d, with a fractional part reflecting the number
of hours, minutes and seconds past midnight.  h can be a variable
containing integers between 0 and 23 inclusive or a single integer in
that range. m and s can be variables containing integers between 0
and 59 or single integer(s) in that range.  Optionally a format,
usually but not necessarily a date format, can be specified. The
resulting variable, which is by default stored as a double, may be
used in date and time arithmetic in which the time of day is taken
into account. (Stata 6 required.)
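
A sketch, assuming a daily date variable date and integer variables
hour, minute and second (all names illustrative only):

. egen when = dhms(date hour minute second)

If the fractional part is the elapsed fraction of a day, as the
description suggests, an observation timed at 06:00:00 carries a
fractional part of 6/24 = 0.25.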

elap(time) [ , format(format) ] creates a string variable which contains
the number of days, hours, minutes and seconds associated with an
integer variable containing a number of elapsed seconds. Such a
variable might be the result of date/time arithmetic, where a time
interval between two timestamps has been expressed in terms of
elapsed seconds. Leading zeroes are included in the hours, minutes,
and seconds fields. Optionally, a format can be specified. (Stata 6
required.)

elap2(time1 time2) [ , format(format) ] creates a string variable which
contains the number of days, hours, minutes and seconds associated
with a pair of time values, expressed as fractional days, where time1
is no greater than time2.  Such time values may be generated by
function dhms(). elap2() expresses the interval between these time
values in days, hours, minutes and seconds. Leading zeroes are included
in the hours, minutes, and seconds fields.  Optionally, a format can be
specified.
(Stata 6 required.)

eom(m y) [ , lag(lag) format(format) work ] creates an elapsed date
variable containing the date of the end of month m and year y. m can
be a variable containing integers between 1 and 12 inclusive or a
single integer in that range. y can be a variable containing integers
within the range covered by elapsed dates or a single integer within
that range. Optionally lag() specifies a lag: the end of the month
will be given for lag months before the current date. lag(1) refers
to the previous month, lag(3) to 3 months ago and lag(-3) to 3 months
hence. The lag may also be specified by a variable containing
integers. Optionally a format, usually but not necessarily a date
format, can be specified.  work specifies that the last day must also
be one of Monday to Friday.  (Stata 6 required.)

. egen eom = eom(month year), f(%dd_m_y)

eomd(datevar) [ , lag(lag) format(format) work ] creates an elapsed date
variable containing the date of the end of the month containing the
date in an elapsed date variable datevar.  Optionally lag() specifies
a lag: the end of the month will be given for lag months before the
current date. lag(1) refers to the previous month, lag(3) to 3 months
ago and lag(-3) to 3 months hence. The lag may also be specified by a
variable containing integers. Optionally a format, usually but not
necessarily a date format, can be specified.  work specifies that the
last day must also be one of Monday to Friday. (Stata 6 required.)

Note that work knows nothing about holidays or any special days.

. egen eom = eomd(date), f(%dd_m_y)
. egen eopm = eomd(date), f(%dd_m_y) lag(1)

ewma(timeseriesvar) , a(#) calculates the exponentially weighted moving
average, which is

ewma = timeseriesvar                          for the first observation
ewma = a * timeseriesvar + (1 - a) * L.ewma   otherwise

The data must have been declared time series data by tsset.
Calculations start afresh after any gap with missing values.  (Stata
6 required; superseded by tssmooth.)
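
As a worked illustration of the recursion, take a(0.2) and successive
values of timeseriesvar of 10, 20 and 30. Then ewma is 10 for the first
observation, 0.2 * 20 + 0.8 * 10 = 12 for the second and 0.2 * 30 + 0.8
* 12 = 15.6 for the third. A sketch of the call, assuming data already
tsset and a series y (names illustrative only):

. egen ysmooth = ewma(y), a(0.2)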

filter(timeseriesvar) , lags(numlist) [ coef(numlist) { normalise |
normalize } ] calculates the linear filter which is the sum of terms

coef_i * Li.timeseriesvar or coef_i * Fi.timeseriesvar

coef() defaults to a vector the same length as lags() with each
element 1.

filter(y), l(0/3) c(0.4(0.1)0.1) calculates

0.4 * y + 0.3 * L1.y + 0.2 * L2.y + 0.1 * L3.y

filter(y), l(0/3) calculates

1 * y + 1 * L1.y + 1 * L2.y + 1 * L3.y or y + L1.y + L2.y + L3.y

Leads are specified as negative lags.  normalise (or normalize,
according to taste) specifies that coefficients are to be divided by
their sum so that they add to 1 and thus specify a weighted mean.

filter(y), l(-2/2) c(1 4 6 4 1) n calculates

(1/16) * F2.y + (4/16) * F1.y + (6/16) * y + (4/16) * L1.y + (1/16) *
L2.y

The data must have been declared time series data by tsset.  Note
that this may include panel data, which are automatically filtered
separately within each panel.

The order of terms in coef() is taken to be the same as that in lags().

. egen f2y = filter(y), l(-1/1) c(0.25 0.5 0.25)
. egen f2y = filter(y), l(-1/1) c(1 2 1) n

filter7(timeseriesvar) , lags(numlist) coef(numlist) [ { normalise |
normalize } ] calculates the linear filter which is the sum of terms

coef_i * Li.timeseriesvar or coef_i * Fi.timeseriesvar

filter7(y), l(0/3) c(0.4(0.1)0.1) calculates

0.4 * y + 0.3 * L1.y + 0.2 * L2.y + 0.1 * L3.y

Leads are specified as negative lags.  normalise (or normalize,
according to taste) specifies that coefficients are to be divided by
their sum so that they add to 1 and thus specify a weighted mean.

filter7(y), l(-2/2) c(1 4 6 4 1) n calculates

(1/16) * F2.y + (4/16) * F1.y + (6/16) * y + (4/16) * L1.y + (1/16) *
L2.y

The data must have been declared time series data by tsset.  Note
that this may include panel data, which are automatically filtered
separately within each panel.

The order of terms in coef() is taken to be the same as that in lags().

foy(daily_date_variable) [ , month(#) day(#) ] generates the fraction of
the year elapsed since the start of the year from a daily date
variable. The start of the year is 1 January by default: month()
and/or day() may be used to specify an alternative.  If
daily_date_variable is all integers, then the result is
(day of year - 0.5) / number of days in year. If daily_date_variable
contains non-integers, then the result is (day of year - 1) / number
of days in year.  (Stata 8 required.)

. egen frac = foy(date), m(10)

hmm(timevar) [ , round(#) trim ] generates a string variable showing
timevar, interpreted as indicating time in minutes, represented as
hours and minutes in the form "[...h]h:mm".  For example, times of 9,
90, 900 and 9000 minutes would be represented as "0:09", "1:30",
"15:00" and "150:00". The option round(#) rounds the result: round(1)
rounds the time to the nearest minute. The option trim trims the
result of leading zeros and colons, except that an isolated 0 is not
trimmed. With trim "0:09" is trimmed to "9" and "0:00" is trimmed to
"0".

hmm() serves equally well for representing times in seconds in
minutes and seconds in the form "[...m]m:ss". (Stata 6 required.)

hmmss(timevar) [ , round(#) trim ] generates a string variable showing
timevar, interpreted as indicating time in seconds, represented as
hours, minutes and seconds in the form "[...h:]mm:ss". For example,
times of 9, 90, 900 and 9000 seconds would be represented as
"00:09", "01:30", "15:00" and "2:30:00". The option round(#) rounds
the result:  round(1) rounds the time to the nearest second. The
option trim trims the result of leading zeros and colons, except that
an isolated 0 is not trimmed. With trim "00:09" is trimmed to "9" and
"00:00" is trimmed to "0". (Stata 6 required.)

hms(h m s) [ , format(format) ] creates an elapsed time variable
containing the number of seconds past midnight. h can be a variable
containing integers between 0 and 23 inclusive or a single integer in
that range. m and s can be variables containing integers between 0
and 59 or single integer(s) in that range.  Optionally a format can
be specified. (Stata 6 required.)
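
A sketch, assuming integer variables hour, minute and second (names
illustrative only):

. egen secs = hms(hour minute second)

For a time of 01:30:15 the result would be 1 * 3600 + 30 * 60 + 15 =
5415 seconds past midnight.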

minutes(strvar) [ , maxhour(#) ] returns time in minutes given a string
variable strvar containing a time in hours and minutes in the form
"[..h]hh:mm".  In particular, minutes are given as two digits between
00 and 59 and hours by default are given as two digits between 00 and
23. The maxhour() option may be used to change the (unreachable)
limit: its default is 24. Note that, strange though it may seem, this
function rather than seconds() is appropriate for converting times in
the form "mm:ss" to seconds.  The maximum number of minutes
acceptable may need then to be specified by maxhour() [sic].  (Stata
8 required.)

ncyear(datevar) , month(#) [ day(#) ] returns an integer variable
labelled with labels such as "1952/53" for non-calendar years
starting on the specified month and day.  The day defaults to 1.
datevar is treated as indicating elapsed dates. For more on dates,
see help on dates. (Stata 6 required.)

. egen wtryear = ncyear(date), m(10)
(years starting on 1 October)

. egen wwgyear = ncyear(date), m(1) d(21)
(years starting on 21 January)

record(exp) [ , by(byvarlist) min order(varlist) ] produces the maximum
(with min the minimum) value observed "to date" of the specified exp.
Thus record(wage), by(id) order(year) produces the maximum wage so
far in worker's career, calculations being separate for each id and
records being determined within each id in year order. Although
explanation and example here refer to dates, nothing in record()
restricts its use to data ordered in time. If not otherwise specified
with by() and/or order(), records are determined with respect to the
current order of observations. No special action is required for
missing values, as internally record() uses either the max() or the
min() function, both of which return results of missing only if all
values are missing. (Stata 6 required.)

. egen hiwage = record(exp(lwage)), by(id) order(year)
. egen lowage = record(exp(lwage)), by(id) order(year) min

seconds(strvar) [ , maxhour(#) ] returns time in seconds given a string
variable containing a time in hours, minutes and seconds in the form
"[..h]hh:mm:ss".  In particular, minutes and seconds are each given
as two digits between 00 and 59 and hours by default are given as two
digits between 00 and 23. The maxhour() option may be used to change
the (unreachable) limit: its default is 24.  (Stata 8 required.)
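
A sketch of the two string conversions minutes() and seconds(), assuming
string variables hhmm holding values like "01:30" and hhmmss holding
values like "01:30:15" (names illustrative only):

. egen tmin = minutes(hhmm)
. egen tsec = seconds(hhmmss)

On the descriptions given, "01:30" corresponds to 1 * 60 + 30 = 90
minutes and "01:30:15" to 1 * 3600 + 30 * 60 + 15 = 5415 seconds.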

tod(time) [ , format(format) ] creates a string variable which contains
the number of hours, minutes and seconds associated with an integer
in the range 0 to 86399, one less than the number of seconds in a
day. Such a variable is produced by hms(), which see above. Leading
zeroes are included in the hours, minutes, and seconds fields. Colons
are used as separators.  Optionally a format can be specified.
(Stata 6 required.)

Summaries and estimates

adjl(varname) [ , by(byvarlist) factor(#) ] calculates adjacent lower
values. These are the smallest values within factor() times the
interquartile range of the lower quartile.  By default factor() is
1.5, defining the default lower value of a so-called whisker on a
Stata box plot. (Stata 8 required.)

adju(varname) [ , by(byvarlist) factor(#) ] calculates adjacent upper
values. These are the largest values within factor() times the
interquartile range of the upper quartile.  By default factor() is
1.5, defining the default upper value of a so-called whisker on a
Stata box plot. (Stata 8 required.)

corr(varname1 varname2) [ , covariance spearman taua taub by(byvarlist) ]
returns the correlation of varname1 with varname2.  By default, this
returns the Pearson correlation coefficient.  covariance indicates
that covariances should be calculated; spearman indicates that
Spearman's rank correlation coefficient should be calculated; taua
and taub return Kendall's tau-A and tau-B, respectively. (Stata 8
required.)
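
A sketch, assuming the auto data are in memory (the new variable names
are illustrative only):

. egen r = corr(mpg weight), by(foreign)
. egen rs = corr(mpg weight), spearman by(foreign)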

density(varname) [ , width(#) start(#) frequency percent fraction
by(byvarlist) ] calculates the density (or optionally the frequency,
fraction or percent) of values in bins of width width() (default 1)
starting at start() (default minimum of the data). Note that each
value produced will be identical for all observations in the same
bin. Commonly for further use it will be desired to select one value
from each bin, say by using egen's tag() function. (Stata 8
required.)
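
A sketch of one way to pick one value per bin, assuming the auto data
are in memory. The variable bin is a hypothetical helper, not something
density() creates, and it assumes that each bin includes its lower limit:

. egen dens = density(mpg), width(5) start(10)
. gen bin = 10 + 5 * floor((mpg - 10) / 5)
. egen pick = tag(bin)
. list bin dens if pick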

gmean(exp) [ , by(byvarlist) ] returns the geometric mean of exp. (Stata
6 required.)

. egen gmean = gmean(mpg), by(rep78)

hmean(exp) [ , by(byvarlist) ] returns the harmonic mean of exp. (Stata 6
required.)

. egen hmean = hmean(mpg), by(rep78)

nmiss(exp) [ , by(byvarlist) ] returns the number of missing values in
exp. (Stata 6 required.) Remark: Why this was written is a mystery.
The one-line command egen nmiss = sum(missing(exp)) (in Stata 9, egen
nmiss = total(missing(exp))) shows that it is unnecessary.

. egen nmiss = nmiss(rep78), by(foreign)

nvals(varname) [ , by(byvarlist) missing ] returns the number of distinct
values in varname. Missing values are ignored unless missing is
specified.  Remark: Much can be done by using egen function tag() and
then summing values as desired. (Stata 6 required.)

outside(varname) [ , by(byvarlist) factor(#) ] calculates outside values.
These are any values more than factor() times the interquartile range
from the nearer quartile, that is above the upper quartile or below
the lower quartile.  By default factor() is 1.5, defining the default
outside values, those plotted separately, on a Stata box plot.
Values not outside are returned as missing.  (Stata 8 required.)
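
A sketch, assuming the auto data are in memory (the new variable name is
illustrative only). As values not outside are returned as missing, the
outside values can be listed directly:

. egen out = outside(mpg), by(foreign)
. list make mpg foreign if out < .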

ridit(varname) [ , by(byvarlist) missing percent reverse ] calculates the
ridit for varname, which is

(1/2) count at this value + SUM counts in values below
------------------------------------------------------
SUM counts of all values

With terminology from Tukey (1977, pp.496-497), this could be called
a `split fraction below'. The name `ridit' was used by Bross (1958):
see also Fleiss (1981, pp.150-7) or Flora (1988). The numerator is a
`split count'.

missing specifies that observations for which values of byvarlist are
missing will be included in calculations if by() is specified. The
default is to exclude them. percent scales the numbers to percents by
multiplying by 100.  reverse specifies the use of reverse cumulative
probabilities (1 - fraction above). (Stata 6 required.)
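
As a worked illustration of the formula, suppose that varname takes the
values 1, 2, 2 and 3 in four observations. The ridit for 1 is (0.5 * 1 +
0) / 4 = 0.125, that for 2 is (0.5 * 2 + 1) / 4 = 0.5 and that for 3 is
(0.5 * 1 + 3) / 4 = 0.875. A sketch of the call, assuming the auto data
(the new variable name is illustrative only):

. egen ridit78 = ridit(rep78)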

semean(exp) [ , by(byvarlist) ] calculates the standard error of the mean
of exp. (Stata 6 required.)

sumoth(exp) [ , by(byvarlist) ] returns the sum of the other values of
exp in the same group. If by() is specified, distinct combinations of
byvarlist define groups; otherwise all observations define one group.
(Stata 6 required.)
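
A sketch of what "the other values" means, assuming the auto data and
Stata 9 or later for total(); tot, oth and oth2 are hypothetical names:

. sysuse auto, clear
. egen tot = total(mpg), by(foreign)
. egen oth = sumoth(mpg), by(foreign)
. gen oth2 = tot - mpg
(oth and oth2 should agree; mpg has no missing values in these data)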

var(exp) [ , by(byvarlist) ] creates a constant (within byvarlist)
containing the variance of exp.  Note also the egen function sd().
(Stata 6 required.)

wpctile(varname) [ , p(#) weights(varname) altdef by(byvarlist) ] is a
hack on official Stata's egen function pctile() allowing
specification of weights in the calculation of percentiles. By
default, the function creates a constant (within byvarlist)
containing the #th percentile of varname. If p() is not specified, 50
is assumed, meaning medians. weights() requests weighted calculation
of percentiles. altdef uses an alternative formula for calculating
percentiles, which is not applicable with weights present. by()
requests calculation by groups.  You may also use the by: construct.
(Stata 8.2 required.)
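
A possible use, assuming the auto data. The variable weight (car weight)
is used as a weight purely to illustrate the syntax; wmed and wq1 are
hypothetical names:

. sysuse auto, clear
. egen wmed = wpctile(price), weights(weight) by(foreign)
. egen wq1 = wpctile(price), p(25) weights(weight) by(foreign)
(wmed is a weighted median and wq1 a weighted lower quartile, by group)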

wtfreq(exp) [ , by(byvarlist) ] creates a constant (within byvarlist)
containing the weighted frequency using exp as weights. (Such
frequencies sum to _N.) (Stata 6 required.)

xtile(varname) [ , percentiles(numlist) nquantiles(#) weights(varname)
altdef by(byvarlist) ] categorizes varname by specific percentiles.
The function works like xtile. By default varname is dichotomized at
the median. percentiles() requests percentiles corresponding to
numlist: for example, p(25(25)75) is used to create a variable
according to quartiles. Alternatively, you may specify n(4) to create
a variable according to quartiles.  weights()
requests weighted calculation of percentiles.  altdef uses an
alternative formula for calculating percentiles.  See xtile. by()
requests calculation by groups.  You may also use the by: construct.
(Stata 8.2 required.)

. egen mpg4 = xtile(mpg), by(foreign) p(25(25)75)
. egen mpg10 = xtile(mpg), by(foreign) nq(10)

First and last

first(varname) [ , by(byvarlist) ] returns the first non-missing value of
varname. `First' depends on the existing order of observations.
varname may be numeric or string.  (Stata 6 required.)
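
Because the result depends on the order of observations, it is usually
wise to sort first. A sketch assuming the auto data, with rep1 as a
hypothetical name:

. sysuse auto, clear
. sort foreign price
. egen rep1 = first(rep78), by(foreign)
(rep1 is intended to hold, within each group, the value of rep78 for
the cheapest car for which rep78 is not missing)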

ifirst(numvar) , value(#) [ { before | after } by(byvarlist) ] indicates
the first occurrence of integer # within numvar by 1 and other
observations by 0.

before indicates observations before the first occurrence by 1 and
other observations by 0.  after indicates observations after the
first occurrence by 1 and other observations by 0.  For each
observation analysed, the values returned by default, with before and
with after always sum to 1.

First occurrence is determined as follows: (1) if if or in is
specified, any observations excluded are ignored; (2) if by() is
specified, first is determined separately for each distinct group of
observations; (3) first is first in current sort order.  If # does
not occur, all observations are before the first occurrence. (Stata 6
required.)

. gen warm = celstemp > 20
. egen fwarm = ifirst(warm), v(1) by(year)

ilast(numvar) , value(#) [ { before | after } by(byvarlist) ] indicates
the last occurrence of integer # within numvar by 1 and other
observations by 0.

before indicates observations before the last occurrence by 1 and
other observations by 0.  after indicates observations after the last
occurrence by 1 and other observations by 0.  For each observation
analysed, the values returned by default, with before and with after
always sum to 1.

Last occurrence is determined as follows: (1) if if or in is
specified, any observations excluded are ignored; (2) if by() is
specified, last is determined separately for each distinct group of
observations; (3) last is last in current sort order.  If # does not
occur, all observations are before the last occurrence. (Stata 6
required.)
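
A small constructed illustration. The data and the names t, x, lastx,
bef and aft are all hypothetical:

. clear
. set obs 6
. gen t = _n
. gen x = inlist(t, 2, 4, 5)
. egen lastx = ilast(x), value(1)
. egen bef = ilast(x), value(1) before
. egen aft = ilast(x), value(1) after
. list t x lastx bef aft
(x is 0 1 0 1 1 0, so the last occurrence of 1 is at t = 5; going by
the description above, lastx should be 1 only at t = 5, bef should be
1 for t = 1 to 4 and aft should be 1 only at t = 6)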

lastnm(varname) [ , by(byvarlist) ] returns the last non-missing value of
varname. `Last' depends on the existing order of observations.
varname may be numeric or string. Remark: lastnm() would have been
better called last(), except that an egen program with that name for
selecting the last `word' in a string was published in STB-50.
(Stata 6 required.)

Random numbers

mixnorm() [ , frac(#) mu1(#) mu2(#) var1(#) var2(#) ] generates a new
variable of specified type as a mixture of two Normal distributions,
with the fraction frac(#) of the observations defined by the first
distribution.  Both options for means mu1(#) and mu2(#) default to 0;
both options for variances var1(#) and var2(#) default to 1, while
frac(#) defaults to 0.5. Only non-default parameters of the desired
mixture need be specified. (Stata 8 required.)

. egen mixture = mixnorm(), frac(0.9) mu2(10) var2(4)

rndint() , max(#) [ min(#) ] generates random integers from a uniform
distribution on min() to max(), inclusive. min(1) is the default.
Remark: Note that ceil(uniform() * #) is a direct way to get random
integers from 1 to #. (Stata 6 required.)

. egen integ = rndint(), min(100) max(199)

rndsub() [ , ngroup(#) { frac(#) | percent(#) } by(byvarlist) ] randomly
splits observations into groups or subsamples. The result is a
categorical variable taking values from 1 upward labelling distinct
groups.

ngroup(#) (default 2) defines the number of groups.

frac(#), which is only allowed with ngroup(2), specifies that the
first group should contain 1 / # of the observations and thus that
the second group should contain the remaining observations.

percent(#), which is only allowed with ngroup(2), specifies that the
first group should contain #% of the observations and thus that the
second group should contain the remaining observations.

frac() and percent() may not be specified together.  (Stata 6
required.)

. egen group = rndsub(), by(foreign)

. egen group = rndsub(), by(foreign) f(3)
(first group contains 1/3 of observations, second group contains 2/3)

. egen group = rndsub(), by(foreign) p(25)
(first group contains 25% of observations, second group contains 75%)

For reproducible results, set the seed of the random number generator
beforehand with set seed.

Note that to generate # random numbers the number of observations must be
at least #. If there are no data in memory and you want 100 random
numbers, type set obs 100 before using these functions.
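
Putting the two notes together, a minimal sketch starting from an empty
dataset; the seed value 2718 is arbitrary and integ is a hypothetical
name:

. clear
. set obs 100
. set seed 2718
. egen integ = rndint(), min(100) max(199)
(rerunning all four lines should reproduce the same 100 integers)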

Row operations

rall(varlist) , cond(condition) [ symbol(symbol) ] returns 1 for
observations for which the condition specified is true for all
variables in varlist and 0 otherwise. The condition should be
specified using symbol(), by default @, as a placeholder for each
variable.  Thus, for example, rall(varlist), c(@ > 0 & @ < .) tests
whether all variables in varlist are positive and non-missing. Note
that conditions typically make sense only if variables are either all
numeric or all string: one exception is missing(@).  (Stata 6
required.)

rany(varlist) , cond(condition) [ symbol(symbol) ] returns 1 for
observations for which the condition specified is true for any
variable in varlist and 0 otherwise. The condition should be
specified using symbol(), by default @, as a placeholder for each
variable.  Thus, for example, rany(varlist), c(@ > 0 & @ < .) tests
whether any variable in varlist is positive and non-missing.  Note
that conditions typically make sense only if variables are either all
numeric or all string: one exception is missing(@).  (Stata 6
required.)

rcount(varlist) , cond(condition) [ symbol(symbol) ] returns the number
of variables in varlist for which the condition specified is true.
The condition should be specified using symbol(), by default @, as a
placeholder for each variable. Thus, for example, rcount(varlist),
c(@ > 0 & @ < .) counts for each observation how many variables in
varlist are positive and non-missing. Note that conditions typically
make sense only if variables are either all numeric or all string:
one exception is missing(@).  More precisely, rcount() gives the sum
across varlist of condition, evaluated in turn for each variable.
(Stata 6 required.)

For rall(), rany(), and rcount(), the symbol() option may be used to set
an alternative to @ whenever the latter is inappropriate. For example, if
string variables were being searched for literal occurrences of "@", some
other symbol not appearing in text or in variable names should be used.

. egen any = rany(b c d e f) , c(@ == a)
. egen all = rall(b c d e f) , c(@ == a)
. egen count = rcount(b c d e f) , c(@ == a)
(values of b c d e f matched by (equal to) those of a?)

. egen anyw1 = rany(b c d e f) , c(abs(@ - a) <= 1)
. egen allw1 = rall(b c d e f) , c(abs(@ - a) <= 1)
. egen countw1 = rcount(b c d e f) , c(abs(@ - a) <= 1)
(values of b c d e f within 1 of those of a?)

From Stata 7, foreach provides an alternative that would now be
considered better style:

. gen any = 0
. gen all = 1
. gen count = 0
. foreach v of var a b c d e f {
.         replace any = max(any, inrange(`v', 0, .))
.         replace all = min(all, inrange(`v', 0, .))
.         replace count = count + inrange(`v', 0, .)
. }

rowmedian(varlist) returns, for each observation, the median of the
values of the variables in varlist.  (Stata 9 required.)
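
A small constructed illustration; the data and the names a, b, c and
med are hypothetical:

. clear
. set obs 3
. gen a = 1
. gen b = 5
. gen c = _n
. egen med = rowmedian(a b c)
. list
(the rows are 1 5 1, 1 5 2 and 1 5 3, so med should be 1, 2 and 3)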

rsum2(varlist) [ , allmiss anymiss ] is a generalisation of egen's
rsum() (from Stata 9: rowtotal()) function with the extra options
allmiss and anymiss.  As with rsum(), it creates the (row) sum of the
variables in varlist, treating missing as 0.  However, if the option
allmiss is selected, the (row) sum for any observation for which all
variables in varlist are missing is set equal to missing. Similarly,
if the option anymiss is selected, the (row) sum for any observation
for which any variable in varlist is missing is set equal to missing.
(Stata 6 required.)
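
A small constructed illustration of the two options; the data and the
names x, y, s0, s1 and s2 are hypothetical:

. clear
. set obs 3
. gen x = cond(_n == 1, ., 1)
. gen y = cond(_n <= 2, ., 2)
. egen s0 = rsum2(x y)
. egen s1 = rsum2(x y), allmiss
. egen s2 = rsum2(x y), anymiss
. list
(x is . 1 1 and y is . . 2; going by the description above, s0 should
be 0 1 3, s1 should be . 1 3 and s2 should be . . 3)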

References

Bross, I.D.J. 1958. How to use ridit analysis. Biometrics 14: 38-58.

Fleiss, J.L. 1981. Statistical Methods for Rates and Proportions.  New
York: John Wiley.

Flora, J.D. 1988. Ridit analysis. In Kotz, S. and Johnson, N.L. (eds)
Encyclopedia of Statistical Sciences. New York: John Wiley. 8:
136-139.

Tukey, J.W. 1977. Exploratory Data Analysis. Reading, MA:
Addison-Wesley.

Maintainer

Nicholas J. Cox, Durham University, U.K.
n.j.cox@durham.ac.uk

Acknowledgements

Kit Baum (baum@bc.edu) is the first author of record() and the author of
dhms(), elap(), elap2(), hms(), tod() and mixnorm().

Ulrich Kohler (kohler@wzb.eu) is the author of xtile(), mlabvpos(),
iso3166() and wpctile().

Steven Stillman (s.stillman@verizon.net) is the author of rsum2().

Nick Winter (njw3x@virginia.edu) is the author of corr() and noccur().

Kit Baum, Sascha Becker, Ronán Conroy, William Gould, Syed Islam, John
Moran, Stephen Soldz, Richard Williams, Fred Wolfe and Gerald Wright