set more off
clear all
/* OVERVIEW
The purpose of this example is to show how to set up and run a new-user nested
cohort analysis of the kind used by Hernán et al. to analyze the Nurses' Health
Study data on hormone therapy. This uses the same data as the MSM example, but
in a slightly modified format: specifically, we only see semi-annual values of
the confounder-mediator, as in a cohort study, rather than the monthly values
we had then. The major advantage of this set-up is that IP weights for treatment
are not needed: instead, we control for the value of the confounder-mediator
at the start of treatment for new users, and at a comparable time origin for
appropriate controls. This gets around the need for developing the IP models
for treatment, which must be correctly specified for the IP-weighted models to
give good estimates. However, we still do need IPC weights to minimize
informative censoring by the confounder-mediator. And the nested-cohort
set-up requires some slightly complicated data management. The steps are
spelled out below.
As in the MSM example, we also have the potential outcomes dataset in unchanged
form, so that we can see what the real treatment effect is.
First, take a look at the main dataset. */
use nested_new_user_cohort_example, clear
describe
list in 1/20, noobs clean
/* The first step is to define multiple nested new-user cohorts. In each interval
between semi-annual visits, we identify new users plus a set of controls who
start treatment after the current interval, if at all. Prevalent users at the
beginning of the interval are omitted. This set-up means that subjects as
well as events may be counted in multiple cohorts, in some cases first as a
control and later as a new user. The set-up begins with initializing a few
variables that we will then replace for each nested cohort. Then we loop over
the eight 6-month intervals before the 6-, 12-, 18-, ... 48-month visits. Each
cohort is nested within the last. Note that you have to run the entire loop,
down to the closing }, all at once. */
qui gen byte cohort = .
label variable cohort "Nested new-user cohort number"
qui gen coh_cm = .
label variable coh_cm "Confounder mediator at nested cohort baseline"
format coh_cm %3.2f
qui gen byte new_user = .
label variable new_user "New user of treatment"
qui gen origin = .
label variable origin "New time origin on original monthly scale"
qui gen coh_month = .
label variable coh_month "Months from id/cohort-specific origin"
qui gen coh_mcc = .
label variable coh_mcc "Month of cross-over for controls"
forvalues i = 1/8 {
/* Cohort number is needed as a stratification variable in the IPC and outcome
models.*/
qui replace cohort = `i'
* Macro variable giving month of previous visit, needed below.
local last = 6*(`i'-1)
/* Omit subjects who began treatment or failed before the beginning of the
interval; variable fmtx gives the month of treatment initiation, and month
gives the number of months from baseline to the outcome event or censoring. */
qui drop if (fmtx <= `last' | month <= `last')
/* Evaluate the confounder-mediator measured at the last visit -- that is, at
the beginning of the current interval. This will be used to control for
confounding by indication, in place of IP weights. */
qui replace coh_cm = cm_`last'
* Identify new users who initiate treatment in the interval.
qui replace new_user = (fmtx <= 6*`i')
label values new_user noyes
/* Define new time origin for new users. We subtract 1 so that the first
month of treatment counts as month 1. This is done so that events in the
first month will be retained by stset in the processing below. This assumes
that treatment begins before the event when both occur in the same month,
and that treatment effects are immediate. In practice it would be important
to ascertain which happened first when the event occurs very near the start
of treatment, and whether immediate effects are plausible; it might make
more sense in some cases to handle early events differently. */
qui replace origin = fmtx-1 if new_user==1
/* Use mean starting time among new users as origin for controls. We make
an exception for controls with an event before the new origin but after
the start of the interval, so that no events are lost. Otherwise we would lose
13 events. Doing the analysis without this exception gives almost identical
results; you can check this by using the commented out line of code in place
of the one now being used. */
qui sum fmtx if new_user==1
qui replace origin = min(month-1, round(r(mean))) if new_user==0
*qui replace origin = round(r(mean)) if new_user==0
/* Recalculate follow-up from new time origin. */
qui replace coh_month = month-origin if month>=origin
* Omit observations with outcome month before the origin
qui drop if coh_month==.
/* Recalculate first month on treatment; this will be used in censoring
controls when they begin treatment. */
qui replace coh_mcc = fmtx-origin if new_user==0
* Save the data for this cohort
qui save cohort`i', replace
}
* Pool the data for the 8 cohorts and clean up.
use cohort1, clear
forvalues i = 2/8 {
qui append using cohort`i'
erase cohort`i'.dta
}
erase cohort1.dta
/* Each subject can appear in several cohorts, so id alone does not uniquely
identify the pooled records; create an id/cohort identifier for the by-group
commands below (newid is used there but was never created), and put the
follow-up clock on the cohort time scale, which the censoring and outcome
code that follows assumes. */
egen long newid = group(id cohort)
label variable newid "Unique identifier for each id/cohort combination"
qui replace month = coh_month
/* Check where the events are. The numbers are total N, N of events, and
proportion of events: **/
table new_user cohort , c(count event sum event mean event) format(%5.2g)
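/* An optional extra check, not part of the original analysis: count the
distinct subjects contributing to each nested cohort. Because each cohort is
nested within the last, these counts should shrink as the cohort number
grows. The temporary flag variable _first is ours, not from the original. */
bysort cohort id: gen byte _first = (_n==1)
tab cohort if _first
drop _first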
save preliminary, replace
/**TN: Now would be a good time to stop and take a look at the dataset using
the viewer. You can drag over, right-click, and hide columns you don't
want to see. Start with cohort 1. Do values of coh_cm, origin, coh_month,
coh_mcc make sense? Seek help if you can't figure it out.
Now look at the data for several subjects with relatively late events. One
thing to note is that the cohort only counts new users once, in their last
cohort; that is because treatment in these data is once-on, always-on. If
participants could stop treatment, we would need to do something more complicated.
Another thing to notice is that these late events are included in earlier
cohorts as controls, but censored at the time they cross-over to treatment. */
keep if month>36 & event==1
list id cohort month fmtx origin coh_mcc new_user coh_cm event, noobs clean
use preliminary, clear
format cm coh_cm %5.2f
/* Here we have to add an extra record for the censoring models to be used in
calculating inverse probability of censoring (IPC) weights. The code is the
same as what we used in the MSM example for this purpose. Again, censoring by
LTFU is taken to happen the month after the last month in follow-up for
subjects who remain event-free and do not continue in the study through the
48-month visit. We first calculate the largest follow-up time for each subject,
then keep those last records if they do not represent an event (event==0) and
occur before the end of follow-up. For the additional
record, we advance the time by one month, and retain the treatment and
confounder-mediator values, which will be needed for evaluating the inverse
probability of censoring weights later. */
gen byte censored = 0
label values censored noyes
sort newid month
save preliminary, replace
by newid: egen maxt = max(month)
keep if month==maxt & maxt<48-origin-1 & event==0
replace event = .
replace month = month+1
replace censored = 1
append using preliminary
/* But in this case we also have censoring by going on treatment among the
non-using controls in each nested cohort. These subjects are counted as censored
in the first month of treatment. So we need to change the event for that last
record (when they went on treatment) to missing and we need to delete all of
the records corresponding to person time after they went on medication. Note
that their person time after treatment will still be counted because in the
cohort in which they begin treatment they will have new_user==1 and their
person time will not be deleted by the commands below. **/
sort newid month
replace censored = 1 if month==coh_mcc & event==0 & new_user==0
/* The next line ensures that events occurring in the first month of treatment
among controls are not counted as control events in any given cohort. */
replace event = . if censored==1
* omit observations after going on treatment for controls
drop if new_user==0 & month>coh_mcc
/* This tabulation shows the proportions of new users and controls who are censored
either by LTFU or controls crossing over to active treatment. */
tab new_user censored, row
* Restricted cubic spline in time since cohort entry, needed for PLRs as usual.
mkspline month_sp = month, cubic
save preliminary, replace
/* Now look at some data to see what all this has done. ID 3 is an example of a
subject who begins treatment late, so is included in the first six cohorts as
a control and so is censored, then in the seventh cohort as a new user, and
remains event-free and in follow-up up to the 48-month visit. */
keep if id == 3
sort id fmtx cohort origin coh_mcc new_user coh_cm month
list id cohort month fmtx origin coh_mcc new_user coh_cm cm censored event, ///
noobs clean
use preliminary, clear
/* As with treatment in the Cook article, the predictors of censoring may depend
on treatment status, since new users are censored for LTFU only, while controls
are censored either for LTFU or going on treatment. Hence we will estimate
separate IPC models for the two groups. The basic methods for calculating the
weights are otherwise unchanged. As before, we use the complement of the
fitted probabilities from the logistic models, because weights are actually
defined in terms of the probability of remaining in follow-up. The censored
records are not used in the outcome models (because we set the event indicator
to missing). */
gen ipcw_d = .
gen ipcw_n = .
forvalues i = 0/1 {
* denominator model including the current value of the confounder-mediator
logistic censored cm coh_cm i.cohort month_sp* if new_user==`i'
capture drop p
predict p
replace ipcw_d = 1-p if new_user==`i'
* numerator model EXcluding the current value of the confounder-mediator
logistic censored coh_cm i.cohort month_sp* if new_user==`i'
cap drop p
predict p
replace ipcw_n = 1-p if new_user==`i'
}
* Calculate cumulative probabilities of remaining in follow-up
sort newid month
by newid: replace ipcw_d = ipcw_d*ipcw_d[_n-1] if _n!=1
by newid: replace ipcw_n = ipcw_n*ipcw_n[_n-1] if _n!=1
* Stabilized IPCW
gen ipcw = ipcw_n/ipcw_d
sum ipcw
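/* An optional extra check, not in the original code: look at the full
distribution of the stabilized weights. They should be centered near 1, and a
long upper tail can signal misspecification of the censoring models. */
sum ipcw, detail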
save final, replace
/* Estimate the true treatment effect using the potential outcomes data. As
in the MSM example, we see about a 75% reduction in the odds of an event due
to treatment. (It is the same simulated dataset, just renamed to keep things
straight.) */
use nested_new_user_cohort_potential_outcomes, clear
describe
logistic event tx month_sp*
/* Now we fit the PLR model using the IPC weights, adjusting for the confounder-
mediator evaluated at the baseline time for each nested cohort. As usual
we include a spline in month to allow for smooth variation in the baseline
failure rate, and also add a cohort-specific intercept. */
use final, clear
logistic event i.new_user coh_cm i.cohort month_sp* [pweight=ipcw], ///
vce(cluster id)
/* Since we have used an adjusted model, we need to use potential outcomes
estimation (POE) to get the marginal OR. The code uses the el() matrix function
to extract scalars from the matrix of results returned by margins. The conditional
and marginal estimates are very similar. In interpreting the probabilities in
the margins result, remember that these are monthly event rates. */
margins new_user
scalar EY0 = el(r(b), 1, 1)
scalar EY1 = el(r(b), 1, 2)
dis _newline as result "Marginal OR: " round(EY1/(1-EY1)/(EY0/(1-EY0)), .001)
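/* The same margins scalars also give a marginal risk difference on the monthly
probability scale; this is an optional addition, not in the original code. */
dis as result "Marginal RD: " round(EY1-EY0, .0001)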
/* The analysis seems to work pretty well -- in this dataset at any rate. But
we need to do some checking. Following Hernán et al., we first check for
treatment/cohort interaction. */
logistic event i.new_user##i.cohort month_sp* coh_cm [pweight=ipcw], ///
vce(cluster id)
testparm 1.new_user#i.cohort
/* Those results look OK, though the test is weakened by having to drop cohorts
5, 6, and 8. We also check for interaction between treatment and the time origin
used. */
logistic event i.new_user##c.origin i.cohort month_sp* coh_cm [pweight=ipcw], ///
vce(cluster id)
testparm 1.new_user#c.origin
/* Very weak evidence for some variation here, although the data generating
model did not include this feature.
Finally we check for time-varying treatment effect, using an interaction between
the treatment indicator and the linear term in months since the cohort-specific
baseline. Note that the first month spline variable gets dropped, because it
is just the linear term. */
logistic event i.new_user##c.month month_sp* coh_cm i.cohort [pweight=ipcw], ///
vce(cluster id)
/* So we find weak evidence for a decreasing treatment benefit (p=.1). This is
a type-I error, since the data were generated with a constant treatment effect.
Additional analysis checking for interaction of treatment with the other spline
components also gave weak evidence for time-varying effects, but the non-linear
pattern was pretty implausible -- and a false positive, as we know from the
data-generating model with constant treatment effect. */
/**TN: Finally, just for fun, let's see what happens if we do this very
naively. These analyses use baseline variables on the original time scale, so
we reload the unprocessed dataset first. **/
use nested_new_user_cohort_example, clear
tab event
gen tx_ever = fmtx<.
logistic event tx_ever
logistic event tx_ever cm_0
gen early_tx = fmtx<15
gen late_tx = fmtx>15 & fmtx<.
logistic event early_tx late_tx cm_0
/**TN: So the analysis above suggests that after adjusting for the confounder at
baseline, people who were EVER treated were about 80% less likely to have an
event, and that this was true whether they were treated early or late in the
study. It seems like this is the kind of result you would get if you naively
analyzed the study as a nested case-control study, where anyone with an event
was a case, everyone else was a control, and you asked everyone if they had
ever been treated.
If we naively analyzed it as a case control study, we'd lose all of the people
who were lost to f/u before having an event:**/
logistic event early_tx late_tx cm_0 if event==1 | month==48
/**TN: alternatively, if we started the study at month 18 and compared prevalent
users then with those who were non-users then, we'd get: **/
keep if month >=18
gen tx_by_18mos = fmtx<=18
tab tx_by_18mos
**Now we just see whether being on tx at 18 mos predicts subsequent events
logistic event tx_by_18mos cm_18
**This effect will initially be attenuated by people who go on tx later.
** Now exclude if they go on tx later
logistic event tx_by_18mos cm_18 late_tx if (fmtx<=18 | fmtx==.)
* clean up
erase preliminary.dta
erase final.dta