set more off
clear all
/* Section I:
OVERVIEW
The purpose of this example is to show you how to run a pooled logistic regression
(PLR) model with inverse probability (IP) and inverse probability of censoring
(IPC) weights to estimate the marginal hazard ratio for the effect of a
time-dependent treatment on a survival endpoint, in the presence of a
time-dependent confounder that is affected by treatment, and thus mediates at
least part of its effect. A Cox model adjusting for the confounder-mediator
as a time-dependent covariate (TDC) would at best estimate a conditional
relative hazard for the effect of treatment via other pathways ("at best"
because that model would have to meet conditional independence assumptions
for both treatment and the confounder-mediator, which may be unrealistic
in practice).
The gist is that we have to adjust for current values of the confounder-mediator
to meet the sequential randomization assumption (SRA), the extension of
conditional independence to the longitudinal setting. That is, if the
confounder-mediator is changing, we need to adjust for its history up to the
time treatment is initiated to control for confounding by indication. This
motivates including it as a TDC.
The trouble is that the TDC for the confounder-mediator would keep on getting
updated after treatment is initiated, and so subsequent values would reflect
the effects of treatment. Thus adjusting for it would adjust away whatever
part of the treatment effect is mediated via this pathway.
In contrast, the IPW-PLR model incorporates the time-dependent confounder-
mediator as weights. These make the treated and untreated sub-samples at
any time point representative of the population that would still be at risk
if we had both potential outcomes for each subject. Thus we can use an
unadjusted model for treatment, which avoids adjusting away the indirect
effect, but still gets rid of the confounding by indication.
You will see that a big part of this example is taken up with transforming the
dataset for analysis, involving a lot of relatively low-level data management
tasks. If you decide to fit IPW-PLR models, it is very unlikely that the data
you begin with will be in a form suitable for running the final analysis, and
you will need to make these moves (or be able to tell your programmer analyst
how to do them). Stata actually makes this about as easy as it can possibly
be, but there is still a lot of it. In the second section of the example, most of
these tasks will be taken care of already.
OBSERVED OUTCOMES DATASET
This do-file runs an analysis of simulated longitudinal data with a time-dependent
treatment, which begins at a specified first month on treatment, and then
continues (in short, once on treatment, always on treatment); a time-dependent
confounder-mediator; and a survival outcome, which occurs, or not, by
the end of follow-up for each subject. Think of the data as arising from
a study of four years duration. Subjects who remain event-free at four years
are censored. A subgroup of them are also lost to follow-up (LTFU) earlier.
POTENTIAL OUTCOMES DATASET
Because this is simulated, not real data, we were also able to simulate the true
potential outcomes, free of informative censoring by early losses to follow-up.
That is, we simulated an outcome assuming the subject was on treatment from the
start, and another as if he or she were off treatment the entire period. Those
are the two treatment histories we would like to compare.
We can analyze those potential outcomes data to get an estimate of the true
marginal hazard ratio, which is hard to calculate analytically from the parameter
values used in generating the data. Of course we can ONLY do this in simulated
data. We will do this just before getting estimates from the observed data.
Comparing the true value with the estimate allows us to evaluate the performance
of the estimate.
PREPARING THE OBSERVED OUTCOMES DATASET FOR PLR ANALYSIS
The dataset, msm.dta, is in wide format, with one record per subject, and includes
the following variables:
id: subject id
fmtx: first month on treatment
month: end of follow-up, in months
event: event indicator (0 = censored, 1 = event)
cm_0-cm_48: values of the confounder-mediator measured in each month, with cm_0
giving the baseline value. At the end of the example we will look at what happens
when we only get to see its values every 6 months, but treatment and events
depend on the monthly values.
*/
use msm, clear
describe
list id month event fmtx cm_0 cm_1 cm_2 cm_3 cm_4 cm_5 cm_48 in 1/5, noobs clean
/*
We must first transform msm.dta into the long format we need for the PLR models.
The first step in the procedure is to stset the data, as if we were going to
run a conventional survival analysis, then use the stsplit command to obtain
a single record for each month. The variable start specified in the stsplit
command gives the starting time of each interval; it is not used later, but
stsplit won't run unless we specify a name for it.
*/
stset month, failure(event) id(id) // need to specify id to run stsplit next
stsplit start, at(0/48) // start variable required, not used later
drop start _t0 _st month event // some made by stset, not needed for analyses
rename _t month // month of current record is what we need
rename _d event // event indicator specific to month, only 1 in last month
label variable month "Month of follow-up"
label variable event "Event indicator"
sort id month
save preliminary, replace
* check what this has done
keep if _n <= 100
by id: list month event cm_0 cm_1 cm_2 cm_48, noobs clean
use preliminary, clear
/*
Next we evaluate the time-dependent treatment indicator, by comparing the month of the
current record, month, to the first month on treatment (fmtx), which was on the
incoming dataset. The variable fmtx is missing if the subject never began treatment
during the 48 months of follow-up.
*/
capture drop tx
gen byte tx = (month >= fmtx)
/* The expression in parentheses evaluates to 1 if true, otherwise to zero. Because
missing values sort above all numbers in Stata, tx is 0 whenever fmtx is missing,
i.e., for subjects who never start treatment. */
label variable tx "Currently on treatment"
gen tx_lagged = tx[_n-1] if id==id[_n-1]
label variable tx_lagged "On treatment in previous month"
/*
Next we evaluate the confounder-mediator as a time-dependent covariate, using the
current value measured monthly; this will be used in estimating the censoring
weights. We also need the lagged value from the previous month, because that
is what we assume drives current treatment decisions, and so it will be used in
estimating the treatment weights. Note that the ">=" sign is used in the
definitions of cm_current and cm_lagged to ensure that non-missing values are
assigned at each time point: the last non-missing value is carried forward to
the current observation.
*/
cap drop cm_current cm_lagged
gen cm_current = cm_0
gen cm_lagged = cm_0
label variable cm_current "Confounder-mediator: current value"
label variable cm_lagged "Confounder-mediator: lagged value"
forvalues i = 1/48 {
* next line chooses confounder-mediator value for month `i', if non-missing
qui replace cm_current = cm_`i' if month>=`i' & cm_`i'~= .
local lagged = `i'-1
qui replace cm_lagged = cm_`lagged' if month>=`i' & cm_`lagged' ~= .
}
format cm_current cm_lagged %4.2f
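/*
A quick added spot-check that the carry-forward loop did what we wanted: list
the current and lagged values for one subject over the first few months. The
choice of id==1 is arbitrary -- any subject will do.
*/
list id month cm_current cm_lagged if id==1 & month<=6, noobs clean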
/*
Here we have to add an extra record for the loss-to-follow-up (LTFU) models to be
used in calculating inverse probability of censoring (IPC) weights. LTFU is
taken to happen the month after the last month in follow-up for subjects who remain
event-free and do not continue in the study through month 48. We first calculate
the largest follow-up time for each subject, then keep those last records if they
do not represent an event (event==0) and occur before month 48. For the additional
record, we advance the time by one month, and retain the treatment and
confounder-mediator values, which will be needed for evaluating the inverse
probability of censoring weights later.
*/
gen byte LTFU = 0 // set the LTFU indicator to zero on all existing records
label variable LTFU "Lost to follow-up indicator"
save preliminary, replace // save the main dataset
by id: egen maxt = max(month) // figure out the last month in follow-up
keep if month==maxt & month<48 & event==0 // pick out last record if LTFU
qui replace month = month+1 // reset LTFU event time to the following month
qui replace LTFU = 1 // reset LTFU indicator to 1
keep id month tx cm_0 cm_current LTFU // needed in the LTFU models
append using preliminary // join the new record to the analysis dataset
sort id month
save preliminary, replace
/*
Now print some records for LTFU subjects to see what this has done. You'll see
that we have an extra record on which event is missing, so those records will
not be used in outcome analyses; rather they will only be used in models for
the IPC weights.
*/
by id: egen everLTFU = max(LTFU)
by id: egen lastmonth = max(month)
keep if everLTFU==1 & lastmonth<=15
keep if _n<=100
sort id month
by id: list month LTFU event, noobs clean
use preliminary, clear
/*
Finally we make a restricted cubic spline in months since study entry. Then we
compress and save the almost analysis-ready dataset.
*/
mkspline month_sp = month, cubic
keep id month event fmtx tx tx_lagged month_sp* cm_current cm_0 cm_lagged LTFU
compress // stores variables more efficiently when possible
save preliminary, replace
/**TN: It's good to take a look at the dataset to keep track of what's going on:*/
keep if _n <= 100
sort id fmtx cm_0
by id fmtx cm_0: list month tx cm_current cm_lagged LTFU event, noobs clean
use preliminary, clear
/* Now we will verify that the confounder-mediator is affected by treatment --
in these data, by the value of treatment in the previous month. So we use the
lagged value of treatment, evaluated above, in a repeated-measures model to
test its association with the confounder-mediator.
*/
xtmixed cm_current month_sp* tx_lagged || id: month
/*
This analysis verifies that lagged treatment is a strong predictor of the
confounder-mediator, so we do need the IPW-PLR approach. Accordingly, the
last preparatory step is to make the combined stabilized weight for the
confounder-mediator and censoring.
First we calculate the denominator of the inverse probability of treatment/exposure
weight, which is allowed to depend on the time-dependent value of the confounder-
mediator. This is a PLR model for initiation of treatment, since we assume in
this example that once treatment is initiated, it continues without break;
records after the first month on treatment (month>fmtx) are not used; we also
exclude the extra records for subjects who are LTFU before going on treatment.
*/
logit tx month_sp* cm_lagged if month<=fmtx & LTFU~=1
/*
The next line calculates the conditional probability of being on treatment; this
is essentially a time-dependent propensity score.
*/
predict cp_tx_d
label variable cp_tx_d "Conditional prob starting tx"
/*
The conditional probability is 1 after initiation, under our assumption that
once on treatment, always on treatment.
*/
replace cp_tx_d = 1 if month>fmtx
/*
The weight should depend on the probability of the treatment actually received,
so we calculate the complement for each month in which the subject is not yet
on treatment.
*/
replace cp_tx_d = 1-cp_tx_d if tx==0
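/*
An added range check: each conditional probability of the observed treatment
(or non-treatment) should lie in (0, 1]. The if qualifier skips the extra
LTFU records, on which cm_lagged is missing and so cp_tx_d is missing.
*/
assert cp_tx_d>0 & cp_tx_d<=1 if cp_tx_d<.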
/*
Finally, we calculate the marginal probability of the treatment history. We don't get
this directly from the PLR model. Because it is restricted to subjects still off
treatment (except for the first month of treatment), it estimates the conditional
probability of starting treatment, given having remained off treatment so far.
To get the marginal probability of remaining off treatment in month 10, say,
we have to take the product of the probabilities of remaining off in month 1,
the conditional probability of remaining off in month 2, given that you remained
off through month 1, and so on, through month 10. If you started on treatment at
month 13, and are at risk in month 16, the marginal probability of your treatment
history then is the product of the probabilities of not starting in months 1-12, the
probability of starting in month 13, given off in 12, and 1 for months 14-16.
That's what the next lines of code do: after sorting by id and month, the
replace command builds a running product of the conditional probabilities
within each subject. The if qualifier at the end prevents the probabilities
from cumulating across subjects; id[_n-1] is the id number of the previous
record in the dataset (which is sorted by id and month).
*/
sort id month
gen mp_tx_d = cp_tx_d // marginal = conditional on each subject's first record
label variable mp_tx_d "Marginal prob of treatment history"
replace mp_tx_d = mp_tx_d * mp_tx_d[_n-1] if id==id[_n-1]
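/*
An added sanity check on the cumulation: since every factor in the running
product is a probability, the marginal probability can never exceed the
single-month conditional probability (small tolerance for rounding).
*/
assert mp_tx_d <= cp_tx_d + 1e-8 if mp_tx_d<. & cp_tx_d<.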
/*
The numerator of the weight is calculated using the same algorithm, but
the time-dependent values of the confounder-mediator are not used in the model.
*/
logit tx month_sp* if month<=fmtx & LTFU~=1
predict cp_tx_n
label variable cp_tx_n "Conditional prob of starting treatment"
replace cp_tx_n = 1 if month>fmtx
replace cp_tx_n = 1-cp_tx_n if tx==0
sort id month
gen mp_tx_n = cp_tx_n
label variable mp_tx_n "Marginal probability of treatment history"
replace mp_tx_n = mp_tx_n * mp_tx_n[_n-1] if id==id[_n-1]
/*
In a final step, we calculate the stabilized IP weight as the ratio of the
numerator to the denominator marginal probability of the observed treatment history.
*/
gen ip_wt = mp_tx_n/mp_tx_d
label variable ip_wt "Stabilized inverse probability of treatment weight"
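/*
An added check: stabilized IP weights should average close to 1 in large
samples; a mean far from 1 usually signals a misspecified weight model or a
coding error in the cumulation step.
*/
sum ip_wt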
/*
The censoring weights are now calculated using essentially the same
strategy, but now LTFU is the outcome, and we are allowed to use
current treatment in both weights; however, the time-dependent confounder-
mediator can only be used in the denominator of the weight. Since we are
trying to weight up the sample remaining in follow-up to the overall
surviving sample, we need to weight by the inverse probability of
remaining in follow-up. Hence what the course notes call inverse
probability of censoring (IPC) weights are called inverse probability of
remaining in follow-up (infu) weights here.
*/
logit LTFU month_sp* tx cm_current // denominator
predict cp_LTFU_d
label variable cp_LTFU_d "Conditional probability of LTFU"
gen cp_infu_d = 1-cp_LTFU_d
label variable cp_infu_d "Conditional probability in FU"
sort id month
gen mp_infu_d = cp_infu_d
label variable mp_infu_d "Marginal probability in FU"
replace mp_infu_d = mp_infu_d * mp_infu_d[_n-1] if id==id[_n-1]
logit LTFU month_sp* tx // numerator
predict cp_LTFU_n
label variable cp_LTFU_n "Conditional probability of LTFU"
gen cp_infu_n = 1-cp_LTFU_n
label variable cp_infu_n "Conditional probability in FU"
sort id month
gen mp_infu_n = cp_infu_n
label variable mp_infu_n "Marginal probability in FU"
replace mp_infu_n = mp_infu_n * mp_infu_n[_n-1] if id==id[_n-1]
gen ip_infu_wt = mp_infu_n/mp_infu_d
label variable ip_infu_wt "Stabilized IP in FU weight"
/*
In a final step, we calculate the combined stabilized weight as the product of
the treatment and censoring weights.
*/
gen sc_wt = ip_wt*ip_infu_wt
label variable sc_wt "Stabilized combined weight"
format cp* mp* ip* sc_wt %7.3f
save final, replace
/*
Check how the marginal probabilities change over time.
*/
keep if id==48
sort id month
by id: list month tx cp_tx_d mp_tx_d ip_wt, noobs clean
/*
Have a look at the resulting data for one subject.
*/
use final, clear
keep if id==48 // keep the record for one subject; feel free to look at someone else
sort id fmtx cm_0
by id fmtx cm_0: list month event tx cm_lagged ip_wt ip_infu_wt sc_wt, noobs clean
/*
Make sure you understand all the variables you see.
ESTIMATING THE TRUE TREATMENT EFFECT USING THE POTENTIAL OUTCOMES DATASET
We can use this dataset and fit a PLR model to estimate the parameters of the
true marginal structural model. The dataset has already been transformed into
the format needed.
*/
use msm_potential_outcomes, clear
describe
logistic event tx month_sp*
/*
In this analysis, the odds ratio of 0.25 is interpretable as a hazard ratio,
because the data for each monthly outcome are restricted to those still at risk
(so like the hazard ratio, the OR is conditional on having survived to the
current month), and the incidence is very low in any given month (so the OR
and HR don't differ much).
RUNNING THE ANALYSES USING THE OBSERVED DATA
First we will run an unadjusted PLR model, which would be expected to give an
estimate of the treatment effect that is badly confounded by indication.
*/
use final, clear
logistic event tx month_sp*
/*
Comparing the result to the true marginal RH of about 0.25, we see that this is the case.
Now we run a PLR model adjusting for the current value of confounder-mediator.
This is the naive model that inverse weighting is supposed to improve upon.
Because we are adjusting away the effects of treatment mediated by its effects
on the confounder-mediator, we would also expect the treatment to appear less
beneficial in this model. We need to turn on the robust SEs because we are
including a TDC in the model.
*/
logistic event tx month_sp* cm_current, vce(cluster id)
/*
So the bias here is very considerable. (In many applications it is not so bad; we
turned up the heat to make the point).
Now we are almost ready to run the IPW-PLR model.
But we should check the weights before proceeding -- really big ones are a
common problem and represent potential violations of the positivity assumption.
Big weights can be trimmed or those observations can be dropped, but no well-
established methods exist for doing this. Note that because of the stabilization,
a weight >=20 no longer corresponds to a probability of the observed treatment
history of less than 5%.
*/
sum sc_wt, detail
/*
Since there are essentially no big weights, we can run the model using the combined
stabilized weights as estimated. The weights are specified as pweights, as in
the IPTW propensity score models, and we use robust SEs, also required in this
application.
*/
logistic event tx month_sp* [pweight=sc_wt], vce(cluster id)
/*
Finally, the estimate is reasonably close to the relative hazard of 0.25 that we
obtained from the potential outcomes dataset. So at least with nice clean
simulated data like this, IPW-PLR can work well.
Section II
In this section of the example, we will consider four extensions of the IPW-PLR
model from Section I. That model used a failure time outcome, assumed that
once treatment began, it would continue without interruption until failure or
censoring, and that the treatment effect would be constant across time. In this
section, we will consider four extensions:
1. Treatment that can start and stop repeatedly. The models for the IP
weights have to be modified to accommodate this, by including previous
treatment as a predictor, and by including observations after treatment
is initiated. The line of code setting the conditional probability of
treatment to 1 after initiation is not needed, because continuation on
treatment is not a sure thing after initiation.
The causal effect estimated in this model is still the effect of being on
treatment. So this model posits an immediate effect of being on treatment,
which will not always hold. In the next part, we consider how to estimate
a treatment effect in cases where there may be no immediate effect, but
with time the effect increases.
2. Treatment effects that vary with time. We just need to include the
effect of time on treatment in the IPW-PLR model to accommodate this. In
this part of the example, we stick with once-on-treatment, always on treatment,
and failure time outcomes, but accommodating time-dependent treatment
effects could be done the same way in other contexts. Time-on-treatment
would admittedly be tricky to define in the on-off case: for example,
would it make more sense just to include the current episode of use in
calculating time on treatment, or should we include all observed episodes,
perhaps discounting more distant episodes? This is a substantive issue
depending on how the treatment works and the length of the gaps between
episodes of use. In either case, this type of variable treatment effect
is estimated by specifying a model that allows the effect to vary with
time on treatment. The protective effect of statins against CHD might
more or less follow this pattern; its effects through reducing LDL would
be expected to take time, though there might be a more immediate effect
through the inflammation pathway.
3. Repeated rather than failure time outcomes. Rather than IPW-PLR models
for the outcome we have to use IP-weighted repeated measures models. In
the example, treatment can stop and start, the repeated outcome is binary,
and is measured at 6, 12, 18, ..., 48 months. For example, a repeated
outcome might be hospitalizations for asthma, and treatments might be
use of controller medications.
4. A baseline covariate modifies the effect of treatment. The treatment
pattern is once-treated, always-treated, as in part 2 and Section 1, and
the outcome is a failure time, as in parts 1 and 2 and Section 1. For
example, the effect of bisphosphonates on hip fracture may differ between
men and women, or by baseline BMD.
In all four cases, informative censoring is allowed, so IPC weights are again
needed, but nothing about them changes. For each of the four extensions, a
different version of the simulated data is provided, for which the modified
analysis would be appropriate. This does simplify the problem, to provide a
gentler introduction to these models. With real data it's easy enough to know
or figure out in advance whether treatment follows the once-on, always-on
pattern, and whether your outcome is a repeated measure or a failure time.
But it would be up to you to determine whether the effect of treatment changes
with time or is modified by a baseline covariate.
We also need to make some minor modifications to the data management steps,
detailed below. I simplified the variable names, to make the code more compact
and easier to write without making typos, but they are systematic and easy to
learn; use describe to get the variable labels.
To clarify temporality, the order of events is assumed to be as follows: the
confounder-mediator is measured, and on that basis, treatment may be modified.
By the next month or visit, treatment has affected the confounder-mediator,
and risk of the outcome depends on the new value of both treatment and the
confounder-mediator. Accordingly, at any given time point:
- treatment depends on the "lagged" (i.e., most recent) value of the
confounder-mediator, as well as the lagged value of treatment in cases
where it can stop and start
- the confounder-mediator depends on the current value of treatment
- the outcome depends on the current values of treatment and the
confounder-mediator.
This new set-up requires a variable giving the value of the confounder-mediator
at the most recent time point, which we make using code similar to that used
for LVCF values.
To keep things within bounds, the analyses using LVCF values of the confounder-
mediator are omitted. Also, the transformations to long format for the potential
outcomes datasets have been done in the data generation step -- you can just
have a look at the data and run the model. Finally, we suggest doing at least
2 parts of this section, the ones that seem most likely to be useful for you,
only going on to the third and fourth extensions if they also seem worthwhile.
Feel free to take a look at the data in additional places; use methods from last
week to do that.
PART 1: ON-OFF TREATMENT
In this set-up, we assume that the probability of remaining on treatment depends
strongly on treatment status in the previous month, with subjects who are
already on treatment much more likely to remain on it, despite improvements
in the confounder-mediator. So we have to evaluate a variable giving treatment
status in the previous month.
PREPARING THE OBSERVED OUTCOMES DATASET FOR ANALYSIS
The dataset, msm_onoff.dta, is in wide format and includes the following variables:
id: subject id
month: end of follow-up, in months
event: event indicator (0 = censored, 1 = event)
cm_0-cm_48: values of the confounder-mediator measured in each month
tx_0-tx_48: indicators for current treatment in each month
tx_months: a variable giving the total months on treatment -- not needed in the
analysis, but good for picking out subjects with more time on treatment
*/
use msm_onoff, clear
describe
list id month event tx_months tx_0-tx_9 in 1/10, noobs clean
* The transformation into long format is unchanged.
stset month, failure(event) id(id)
stsplit start, at(0/48)
drop start _t0 _st month event
rename _t month
rename _d event
label variable month "Month of follow-up"
label variable event "Event indicator"
label values event noyes
sort id month
/**TN: I added next 2 commands to take a look at the data for some subjects
with events**/
list id if event==1 & id <30
list id month cm_0 cm_1 cm_2 tx_0 tx_1 tx_months event if id==7|id==14
/* Next we evaluate TDCs for current and lagged treatment, which will be used
in the models for current treatment and censoring. */
cap drop tx
cap drop tx_lagged
gen byte tx = 0 // no one on treatment at baseline -- new users only
gen byte tx_lagged = 0
forvalues i = 1/48 {
qui replace tx = tx_`i' if month>=`i' & tx_`i'~= . //use most recent if missing
local lagged = `i'-1 // defines "macro" variable analogous to `i'
qui replace tx_lagged = tx_`lagged' if month>=`i' & tx_`lagged' ~=.
}
label variable tx "Currently on treatment"
label variable tx_lagged "On treatment in previous month"
label values tx tx_lagged noyes
* current and lagged values of confounder-mediator; same code as above
cap drop cm
gen cm_current = cm_0
gen cm_lagged = cm_0
label variable cm_current "Confounder-mediator: current value"
label variable cm_lagged "Confounder-mediator: lagged value"
forvalues i = 1/48 {
qui replace cm_current = cm_`i' if month>=`i' & cm_`i'~= .
local lagged = `i'-1
qui replace cm_lagged = cm_`lagged' if month>=`i' & cm_`lagged'~= .
}
format cm_current cm_lagged %5.3f
/**TN: I added next command to make sure these commands did what we wanted**/
list id month cm_lagged cm_current tx_lagged tx tx_months event if id==7
* extra record for LTFU, unchanged except for the keep list
gen byte LTFU = 0
label variable LTFU "Lost to follow-up indicator"
save preliminary, replace
by id: egen maxt = max(month)
keep if month==maxt & month<48 & event==0
qui replace month = month+1
qui replace LTFU = 1
keep id month tx tx_lagged cm_0 cm_current cm_lagged tx_months LTFU
append using preliminary
sort id month
*spline in month number, unchanged, for fitting baseline event rate
mkspline month_sp = month, cubic
keep id month event tx tx_lagged tx_months month_sp* cm_current cm_lagged LTFU
save preliminary, replace
* Take a look at some records for observations with time on treatment
keep if tx_months>0
keep if _n <= 100
sort id tx_months
by id tx_months: list month event tx tx_lagged cm_current cm_lagged, noobs clean
use preliminary, clear
/* The last preparatory step is to make the combined stabilized weight for the
confounder-mediator and censoring. The differences are that we adjust for
treatment history (using insider information that the lagged value is all that
matters). Also, records after the first month of treatment are included in the
models for treatment, because treatment can switch off and then on again.
Finally, we do not impute cp_tx=1 for all months after the first month of
treatment, for the same reason; it is no longer a done deal at that point.
In accord with the clarified temporality, we use the lagged value of the
confounder-mediator in the models for treatment. Note that we now have to
exclude the extra LTFU records in the IP weight models. */
* IP weights
logit tx cm_lagged tx_lagged month_sp* if event~= . // exclude extra LTFU records
/**TN: Recall cp is for conditional probability and mp for marginal probability;
_d and _n refer to the denominator and numerator of the stabilized weights.**/
predict cp_tx_d
replace cp_tx_d = 1-cp_tx_d if tx==0
sort id month
gen mp_tx_d = cp_tx_d
replace mp_tx_d = mp_tx_d*mp_tx_d[_n-1] if id==id[_n-1]
logit tx tx_lagged month_sp* if event~=.
predict cp_tx_n
replace cp_tx_n = 1-cp_tx_n if tx==0
sort id month
gen mp_tx_n = cp_tx_n
replace mp_tx_n = mp_tx_n*mp_tx_n[_n-1] if id==id[_n-1]
gen ip_tx_wt = mp_tx_n/mp_tx_d
* IPC weights
logit LTFU cm_current tx month_sp*
predict cp_LTFU_d
gen cp_infu_d = 1-cp_LTFU_d
sort id month
gen mp_infu_d = cp_infu_d
replace mp_infu_d = mp_infu_d*mp_infu_d[_n-1] if id==id[_n-1]
logit LTFU tx month_sp*
predict cp_LTFU_n
gen cp_infu_n = 1-cp_LTFU_n
sort id month
gen mp_infu_n = cp_infu_n
replace mp_infu_n = mp_infu_n*mp_infu_n[_n-1] if id==id[_n-1]
gen ip_infu_wt = mp_infu_n/mp_infu_d
* stabilized combined weight
gen sc_wt = ip_tx_wt*ip_infu_wt
label variable sc_wt "Stabilized combined weight"
format cp* mp* ip* sc_wt %7.3f
* check positivity and trim weights
recode cp_tx_d 0/0.05=1 0.05/0.95=0 0.95/1=1, gen(ppv)
label variable ppv "Potential positivity violation"
label values ppv noyes
tab ppv
sum sc_wt if ppv==1, detail
/* In this case, there are a lot of PPV observations, but only a few have big
weights. The next step shows that there are additional big weights among the
rest. We will use sensitivity analyses to see if the big weights make a
difference. To conduct a proper analysis omitting the PPV observations, we
would need to recalculate the weights after omitting the PPV observations;
we will do that in Part 4 of the example. */
sum sc_wt, detail
recode sc_wt 0/20=0 20/max=1, gen(big_wt)
label variable big_wt "Stabilized combined weight > 20"
label values big_wt noyes
tab big_wt
recode sc_wt 20/max=20, gen(tr_sc_wt)
label variable tr_sc_wt "Trimmed stabilized combined weight"
save final, replace
* Have a look at the resulting data for the first few subjects
use final, clear
keep if id<=5
sort id tx_months
by id tx_months: list month event ip_tx_wt ip_infu_wt sc_wt, noobs clean
/*
RUNNING THE ANALYSIS USING THE POTENTIAL OUTCOMES DATASET
We again use the potential outcomes dataset, but this time it is already in long
format, so we can directly proceed to estimating the marginal RH.
*/
use msm_onoff_potential_outcomes, clear
sort id newid
list id newid tx month event in 1/100
logistic event tx month_sp*
/* This suggests that the effect of treatment is to reduce the event rate by >70%
(the OR is approximately interpretable as a hazard ratio in the PLR model). */
* RUNNING THE ANALYSES USING THE OBSERVED DATA
use final, clear
* unadjusted PLR model; robust SEs not needed because there are no TDCs
logistic event tx month_sp*
/* The estimate is somewhat attenuated by confounding by indication. Now adjusting
for the confounder mediator. */
logistic event tx cm_current month_sp*, vce(cluster id)
/* The estimate is worse when we naively adjust for the confounder-mediator. Finally,
the IPW-PLR model. */
logistic event tx month_sp* [pw=sc_wt], vce(cluster id)
* sensitivity analysis using trimmed weights
logistic event tx month_sp* [pw=tr_sc_wt], vce(cluster id)
* sensitivity analysis omitting big weights
logistic event tx month_sp* [pw=sc_wt] if big_wt==0, vce(cluster id)
/* So all IPW models give results that are consistent with the estimate based on
the potential outcomes dataset, and show little sensitivity to big weights. A
caution: in our experience with big datasets, weights can sometimes be much
larger, and quite influential. */
/* PART 2: MODELING TRENDS IN THE TREATMENT EFFECT
This extension requires that we estimate the effect of time on treatment, which we
have to calculate as a TDC.
PREPARING THE OBSERVED OUTCOMES DATASET FOR PLR ANALYSIS
The dataset, msm_txtrend.dta is in wide format and includes the following variables:
id: subject id
month: end of follow-up, in months
event: event indicator (0 = censored, 1 = event)
fmtx: first month on treatment
cm_0-cm_48: values of the confounder-mediator measured in each month
*/
use msm_txtrend, clear
describe
list id month event fmtx in 1/5, noobs clean
* The transformation into long format for PLR is unchanged from Section 1.
stset month, failure(event) id(id)
stsplit start, at(0/48)
drop start _t0 _st month event
rename _t month
rename _d event
label variable month "Month of follow-up"
label variable event "Event indicator"
label values event noyes
sort id month
save preliminary, replace
/* In this analysis we will assess the effect of treatment duration, which we
calculate in each month counting from the first month on treatment. This is
set to zero before treatment starts. */
cap drop tx
cap drop tx_months
gen tx = month>=fmtx
gen tx_months = max(0, month-(fmtx-1)) // max(0, x) = 0 if x<0 or x==.
gen tx_years = tx_months/12 // rescale to get more interpretable point estimate
label variable tx "Currently on treatment"
label variable tx_months "Current treatment duration (months)"
label values tx noyes
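/* Optional sanity checks on the derived treatment variables (not in the
original code): duration should be zero off treatment, and should count up
from 1 starting in the first treated month. Uncomment to verify. */
* assert tx_months==0 if tx==0
* assert tx_months==month-fmtx+1 if tx==1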
* current and lagged values of confounder-mediator (unchanged from Section 1)
cap drop cm
gen cm_current = cm_0
gen cm_lagged = cm_0
label variable cm_current "Confounder-mediator: current value"
label variable cm_lagged "Confounder-mediator: lagged value"
forvalues i = 1/48 {
qui replace cm_current = cm_`i' if month>=`i' & cm_`i'~= .
local lagged = `i'-1
qui replace cm_lagged = cm_`lagged' if month>=`i' & cm_`lagged'~= .
}
format cm_current cm_lagged %5.3f
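/* Optional check that the carry-forward loop behaved as intended (not in the
original code): within subject, this month's lagged value should equal the
previous month's current value. Uncomment to verify. */
* sort id month
* assert cm_lagged==cm_current[_n-1] if id==id[_n-1]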
* extra record for LTFU, unchanged from Section 1
gen byte LTFU = 0
label variable LTFU "Lost to follow-up indicator"
label values LTFU noyes
save preliminary, replace
by id: egen maxt = max(month)
keep if month==maxt & month<48 & event==0
qui replace month = month+1
qui replace LTFU = 1
keep id month tx tx_months cm_current cm_lagged LTFU
append using preliminary
sort id month
*spline in month number, also unchanged
mkspline month_sp = month, cubic
keep id month event fmtx tx tx_months tx_years month_sp* cm_0 ///
cm_current cm_lagged LTFU
save preliminary, replace
sort id fmtx
keep if _n <= 50
by id fmtx: list month cm_current tx tx_months event, noobs clean
use preliminary, clear
* IP weights
logit tx cm_lagged month_sp* if month<=fmtx // month<=. is true, so never-treated subjects (fmtx==.) are included
predict cp_tx_d
replace cp_tx_d = 1 if month>fmtx
replace cp_tx_d = 1-cp_tx_d if tx==0
sort id month
gen mp_tx_d = cp_tx_d
replace mp_tx_d = mp_tx_d*mp_tx_d[_n-1] if id==id[_n-1]
logit tx month_sp* if month<=fmtx
predict cp_tx_n
replace cp_tx_n = 1 if month>fmtx
replace cp_tx_n = 1-cp_tx_n if tx==0
sort id month
gen mp_tx_n = cp_tx_n
replace mp_tx_n = mp_tx_n*mp_tx_n[_n-1] if id==id[_n-1]
gen ip_tx_wt = mp_tx_n/mp_tx_d
* IPC weights
logit LTFU cm_current tx month_sp*
predict cp_LTFU_d
gen cp_infu_d = 1-cp_LTFU_d
sort id month
gen mp_infu_d = cp_infu_d
replace mp_infu_d = mp_infu_d*mp_infu_d[_n-1] if id==id[_n-1]
logit LTFU tx month_sp*
predict cp_LTFU_n
gen cp_infu_n = 1-cp_LTFU_n
sort id month
gen mp_infu_n = cp_infu_n
replace mp_infu_n = mp_infu_n*mp_infu_n[_n-1] if id==id[_n-1]
gen ip_infu_wt = mp_infu_n/mp_infu_d
gen sc_wt = ip_tx_wt*ip_infu_wt
label variable sc_wt "Stabilized combined weight"
format cp* mp* ip* sc_wt %7.3f
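/* A quick plausibility check on the stabilized weights (optional, not in the
original code): their means should be close to 1; a mean far from 1 suggests
a misspecified weight model. Uncomment to verify. */
* summarize ip_tx_wt ip_infu_wt sc_wt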
* Now we will check for potential positivity violations.
recode cp_tx_d 0/0.05=1 0.05/0.95=0 0.95/1=1, gen(ppv)
replace ppv = 0 if month>fmtx // cp_tx_d==1 after treatment begun
label variable ppv "Potential positivity violation"
label values ppv noyes
tab tx ppv, row
sum sc_wt if ppv==1, detail
/* So we have a lot of potential positivity violations, but no large weights
among those observations, or overall. The PPVs are almost all among subjects
who are off treatment, meaning that there was little indication for them to
start. So the probability of their observed treatment status was nearly 1,
and they got small weights. Restriction of the dataset to exclude these
subjects might be appropriate; they are like the infants in the phototherapy
study who are almost never treated. Again, the weights would need to be
re-estimated after restricting the dataset to exclude them, which we do in
the Part 4 example. The next check shows that there are also no big weights
overall. */
sum sc_wt, detail
save final, replace
* look at some data
keep if id<=4
sort id fmtx
by id fmtx: list month event tx tx_months sc_wt, noobs clean
/* We again use the potential outcomes dataset, already in long format, to get an
estimate of the true marginal treatment effects. */
use msm_txtrend_potential_outcomes, clear
logistic event tx tx_years month_sp*
* RH at annual intervals
forvalues i = 0/4 {
dis _newline as result "Estimated RH after `i' years of treatment:"
lincom tx + `i'*tx_years
}
/* In the potential outcomes data, we see almost no immediate effect of
treatment, but increasing effects over time, reaching a reduction of about
86% after 4 years of treatment. */
/* NOW USING THE OBSERVED DATA */
* Unadjusted PLR model
use final, clear
logistic event tx tx_years month_sp*
/* The trend estimate is OK, but the main effect is badly confounded by indication.
Now the naive PLR model adjusting for the current value of the confounder-mediator. */
logistic event tx tx_years month_sp* cm_current, vce(cluster id)
/* Now both estimates are off. That is because a large part of the treatment
effect is mediated by its gradually increasing effect on the confounder-mediator.
Finally we will run the IPW-PLR model.*/
logistic event tx tx_years month_sp* [pw=sc_wt], vce(cluster id)
* RH at annual intervals
forvalues i = 0/4 {
dis _newline as result "Estimated RH after `i' years of treatment:"
lincom tx + `i'*tx_years
}
/* These estimates are OK, at least qualitatively.*/
/* PART 3: REPEATED BINARY OUTCOME
IP weighting applies to repeated measures as well as survival outcomes. In
the PLR model, we have a repeated binary outcome every month, until the outcome
occurs or censoring. In this context, we have a repeated outcome at every
visit, until censoring, but the outcome can be continuous or a count as well
as binary; and even if it is binary, it can happen repeatedly, so subjects
are seen again after a first positive outcome.
In this example, treatment is allowed to stop as well as start, so the IP
weights are calculated as in PART 1. The other required change is substituting
a repeated measures model for the outcome in place of the PLR model used with
failure times. In Stata, we can't use the xt commands, which do not allow
time-varying weights -- the same problem that forces us to use PLR in place of
the Cox model with failure times. Instead we use standard regression commands
with the option vce(cluster id). This is tantamount to fitting GEE models with
so-called independence working correlation and robust SEs. Use of IPC weights
makes this procedure consistent if the data are missing at random, given
covariates and the outcomes we observe (MAR), provided the IPC weight models
are right and there are no unmeasured confounders of dropout. Without the IPC
weights, GEE is only consistent if the data are missing completely at random
(MCAR), given covariates, with no dependence of dropout on observed outcomes.
PREPARING THE OBSERVED OUTCOMES DATASET FOR ANALYSIS
The dataset, msm_repeated.dta is in wide format, with one record per subject, and
includes the following variables:
id: subject id
cm_bl: confounder-mediator at baseline
cm_6, cm_12, ..., cm_48: the confounder-mediator at each 6 month visit
tx_6, tx_12, ..., tx_48: treatment indicator at follow-up visits
event_6, event_12, ..., event_48: outcome event indicator at each visit.
The only thing we keep from baseline is the value of the confounder mediator.
We assume that some subjects may have been started on treatment after their
baseline visit, on the basis of the confounder-mediator value observed there.
So by the 6-month visit, the confounder-mediator and the outcome depend on
treatment, but treatment status depends on the baseline confounder-mediator
value, not the current value. */
use msm_repeated, clear
describe
list id tx_* in 1/10, noobs clean
list id event_* in 1/10, noobs clean
/* The transformation is simpler here, aided by the variable names, which were
chosen to be compatible with the reshape command. In particular the baseline
value of the confounder-mediator is called cm_bl, to avoid making records for
everyone at baseline; these are not needed, because there are no outcomes at
baseline. */
reshape long cm_ tx_ event_, i(id) j(month)
drop if event_==.
* for convenience get rid of the trailing underscores in the variable names
foreach x in cm tx event {
rename `x'_ `x'
}
label variable tx "Currently on treatment"
label variable month "Month of follow-up"
label variable event "Event indicator"
label values tx event noyes
sort id month
tab month tx
tab month event
save preliminary, replace
* check what this has done
keep if _n <= 100
by id: list month event, noobs clean
use preliminary, clear
* get lagged values of treatment and confounder-mediator
sort id month
gen tx_lagged = 0 if month==6
replace tx_lagged = tx[_n-1] if month>6 & id==id[_n-1]
label values tx_lagged noyes
gen cm_lagged = cm_bl if month==6
replace cm_lagged = cm[_n-1] if month>6 & id==id[_n-1]
rename cm cm_current
format cm_current cm_lagged %4.2f
* extra LTFU record
gen byte LTFU = 0 // set the LTFU indicator to zero on all existing records
label variable LTFU "Lost to follow-up indicator"
label values LTFU noyes
save preliminary, replace // save the main dataset
by id: egen maxt = max(month) // identify last month in follow-up for all subjects
keep if month==maxt & month<48 // pick out last record for LTFU subjects
qui replace month = month+6 // reset LTFU event time to the following visit (6 months later)
qui replace LTFU = 1 // reset LTFU indicator to 1
keep id month tx tx_lagged cm_current cm_lagged LTFU // needed in the LTFU models
append using preliminary // join the new record to the analysis dataset
sort id month
save preliminary, replace
keep id month event tx tx_lagged cm_current cm_lagged LTFU
save preliminary, replace
* Check this
keep if _n <= 100
sort id
by id: list month tx tx_lagged cm_current cm_lagged LTFU event, noobs clean
use preliminary, clear
* IP weights
logit tx cm_lagged tx_lagged i.month if event~= . // omit extra LTFU records
predict cp_tx_d
replace cp_tx_d = 1-cp_tx_d if tx==0
sort id month
gen mp_tx_d = cp_tx_d
replace mp_tx_d = mp_tx_d * mp_tx_d[_n-1] if id==id[_n-1]
logit tx tx_lagged i.month if event~=.
predict cp_tx_n
replace cp_tx_n = 1-cp_tx_n if tx==0
sort id month
gen mp_tx_n = cp_tx_n
replace mp_tx_n = mp_tx_n * mp_tx_n[_n-1] if id==id[_n-1]
gen ip_tx_wt = mp_tx_n/mp_tx_d
* IPC weights
logit LTFU tx cm_current i.month
predict cp_LTFU_d
gen cp_infu_d = 1-cp_LTFU_d
sort id month
gen mp_infu_d = cp_infu_d
replace mp_infu_d = mp_infu_d * mp_infu_d[_n-1] if id==id[_n-1]
logit LTFU tx i.month
predict cp_LTFU_n
gen cp_infu_n = 1-cp_LTFU_n
sort id month
gen mp_infu_n = cp_infu_n
replace mp_infu_n = mp_infu_n * mp_infu_n[_n-1] if id==id[_n-1]
gen ip_infu_wt = mp_infu_n/mp_infu_d
* stabilized combined weight
gen sc_wt = ip_tx_wt*ip_infu_wt
label variable sc_wt "Stabilized combined weight"
format ip* sc*_* %7.3f
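/* Optional plausibility check (not in the original code): the stabilized
weights should average close to 1, and because they are cumulative products,
any drift should show up at the later visits. Uncomment to inspect. */
* tabstat sc_wt, by(month) statistics(n mean max)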
* check positivity and trim weights
recode cp_tx_d 0/0.05=1 0.05/0.95=0 0.95/1=1, gen(ppv)
label variable ppv "Potential positivity violation"
label values ppv noyes
tab ppv
* combined stabilized weights in potential positivity violations
sum sc_wt if ppv==1, detail
* overall combined stabilized weight distribution
sum sc_wt, detail
recode sc_wt 0/20=0 20/max=1, gen(big_wt)
label variable big_wt "Stabilized combined weight > 20"
label values big_wt noyes
tab big_wt
recode sc_wt 20/max=20, gen(tr_sc_wt)
label variable tr_sc_wt "Trimmed stabilized combined weight"
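/* Optional (not in the original code): count how many records the trimming
at 20 actually changes, to gauge how informative the trimmed-weight
sensitivity analysis can be; the count should match big_wt==1. Uncomment to
check. */
* count if sc_wt>20 & sc_wt<.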
save final, replace
* Have a look at the resulting data
use final, clear
keep if id<=10
sort id
by id: ///
list month event tx tx_lagged cm_current cm_lagged ip_tx_wt ip_infu_wt sc_wt, ///
noobs clean
* RUNNING THE ANALYSES USING THE POTENTIAL OUTCOMES DATASET
use msm_repeated_potential_outcomes, clear
describe
*list in 1/5, noobs clean
logistic event tx i.month, vce(cluster id)
* RUNNING THE ANALYSES USING THE OBSERVED DATA
use final, clear
* unadjusted repeated measures model
logistic event tx i.month, vce(cluster id)
* Naive repeated measures model adjusting for confounder-mediator
logistic event tx cm_current i.month, vce(cluster id)
/* Using a non-independence working correlation or a random effects model does
not help. Uncomment the next two lines if you want to verify this. */
* xtlogit event tx cm_current i.month, pa i(id) vce(robust)
* xtlogit event tx cm_current i.month, i(id)
* IPW repeated measures model using stabilized combined weights
logistic event tx i.month [pw=sc_wt], vce(cluster id)
* sensitivity analysis omitting potential positivity violations
logistic event tx i.month [pw=sc_wt] if ppv==0, vce(cluster id)
* sensitivity analysis using trimmed weights
logistic event tx i.month [pw=tr_sc_wt], vce(cluster id)
* sensitivity analysis omitting big weights
logistic event tx i.month [pw=sc_wt] if big_wt==0, vce(cluster id)
* What do you find?
/* PART 4: MODELING INTERACTIONS BETWEEN BASELINE COVARIATE AND TREATMENT
This extension requires that we calculate the interaction, which is a TDC
because treatment is a TDC. In this part of the example, we go back to a once-on,
always-on treatment, but the approach would be essentially the same with an
on/off treatment.
PREPARING THE OBSERVED OUTCOMES DATASET FOR PLR ANALYSIS
The dataset, msm_blia.dta is in wide format and includes the following variables:
id: subject id
bl_em: fixed binary baseline effect modifier
month: end of follow-up, in months
event: event indicator (0 = censored, 1 = event)
fmtx: first month on treatment
cm_0-cm_48: values of the confounder-mediator measured in each month
*/
use msm_blia, clear
describe
label values event noyes
list id bl_em month event fmtx in 1/5, noobs clean
* The transformation into long format for PLR is unchanged from Section 1.
stset month, failure(event) id(id)
stsplit start, at(0/48)
drop start _t0 _st month event
rename _t month
rename _d event
label variable month "Month of follow-up"
label variable event "Event indicator"
label values event noyes
sort id month
save preliminary, replace
/* In this analysis we will assess the effect of the interaction between treatment
and the baseline effect modifier. We could calculate a product term in this simple
case, or use Stata factor coding to do it for us; the latter would be preferred in
cases where the baseline variable has more levels, or the treatment variables are
complicated. */
cap drop tx
cap drop tx_months
gen tx = month>=fmtx
gen tx_blia = tx * bl_em
label variable tx "Currently on treatment"
label variable tx_blia "Interaction of tx with baseline effect modifier"
label values bl_em tx tx_blia noyes
* current and lagged values of confounder-mediator (unchanged from Section 1)
cap drop cm
gen cm_current = cm_0
gen cm_lagged = cm_0
label variable cm_current "Confounder-mediator: current value"
label variable cm_lagged "Confounder-mediator: lagged value"
forvalues i = 1/48 {
qui replace cm_current = cm_`i' if month>=`i' & cm_`i'~= .
local lagged = `i'-1
qui replace cm_lagged = cm_`lagged' if month>=`i' & cm_`lagged'~= .
}
format cm_current cm_lagged %5.3f
* extra record for LTFU, unchanged from Section 1
gen byte LTFU = 0
label variable LTFU "Lost to follow-up indicator"
label values LTFU noyes
save preliminary, replace
by id: egen maxt = max(month)
keep if month==maxt & month<48 & event==0
qui replace month = month+1
qui replace LTFU = 1
keep id month tx tx_blia bl_em cm_current cm_lagged LTFU
append using preliminary
sort id month
*spline in month number, also unchanged
mkspline month_sp = month, cubic
keep id month event fmtx tx tx_blia bl_em month_sp* cm_0 cm_current cm_lagged LTFU
save preliminary, replace
keep if _n <= 50
sort id fmtx bl_em
by id fmtx bl_em: list month cm_current tx tx_blia event, noobs clean
use preliminary, clear
* IP weights
logit tx cm_lagged bl_em month_sp* if month<=fmtx
predict cp_tx_d
replace cp_tx_d = 1 if month>fmtx
replace cp_tx_d = 1-cp_tx_d if tx==0
sort id month
gen mp_tx_d = cp_tx_d
replace mp_tx_d = mp_tx_d*mp_tx_d[_n-1] if id==id[_n-1]
logit tx bl_em month_sp* if month<=fmtx
predict cp_tx_n
replace cp_tx_n = 1 if month>fmtx
replace cp_tx_n = 1-cp_tx_n if tx==0
sort id month
gen mp_tx_n = cp_tx_n
replace mp_tx_n = mp_tx_n*mp_tx_n[_n-1] if id==id[_n-1]
gen ip_tx_wt = mp_tx_n/mp_tx_d
* IPC weights
logit LTFU cm_current tx bl_em month_sp*
predict cp_LTFU_d
gen cp_infu_d = 1-cp_LTFU_d
sort id month
gen mp_infu_d = cp_infu_d
replace mp_infu_d = mp_infu_d*mp_infu_d[_n-1] if id==id[_n-1]
logit LTFU tx bl_em month_sp*
predict cp_LTFU_n
gen cp_infu_n = 1-cp_LTFU_n
sort id month
gen mp_infu_n = cp_infu_n
replace mp_infu_n = mp_infu_n*mp_infu_n[_n-1] if id==id[_n-1]
gen ip_infu_wt = mp_infu_n/mp_infu_d
gen sc_wt = ip_tx_wt*ip_infu_wt
label variable sc_wt "Stabilized combined weight"
format cp* mp* ip* sc_wt %7.3f
* Now we will check for potential positivity violations.
recode cp_tx_d 0/0.05=1 0.05/0.95=0 0.95/1=1, gen(ppv)
replace ppv = 0 if month>fmtx // cp_tx_d==1 after treatment begun
label variable ppv "Potential positivity violation"
label values ppv noyes
tab tx ppv, row
sum sc_wt if ppv==1, detail
/* Again, we have a lot of potential positivity violations, but no large
weights among those observations, or overall. The PPVs are almost all among
subjects who are off treatment, meaning that there was little indication for
them to start. So the probability of their observed treatment status was
nearly 1, and they got small weights. At the end of this section, we omit
these observations, re-evaluate the weights, and re-run the analysis. */
sum sc_wt, detail
save final, replace
/* We again use the potential outcomes dataset to get an estimate of the true
marginal treatment effects. This week it is already in long format, so we can
proceed directly to estimating the marginal RH. */
use msm_blia_potential_outcomes, clear
logistic event tx tx_blia bl_em month_sp*
* treatment effect in subjects with effect modifier
lincom tx + tx_blia
* rerun using factor coding
logistic event i.tx##i.bl_em month_sp*
* treatment effect in subjects with effect modifier
lincom 1.tx + 1.tx#1.bl_em
/* The effect modifier is associated with increased baseline risk, but
effectively doubles the protective effect of treatment. */
/* NOW USING THE OBSERVED DATA */
* Unadjusted PLR model
use final, clear
logistic event tx tx_blia bl_em month_sp*
* treatment effect in subjects with effect modifier
lincom tx + tx_blia
/* The effect modifier is well estimated, as is its interaction with
treatment, but the main effect for treatment is badly confounded; this also
biases the estimated effect in the subjects with the effect modifier. */
logistic event tx tx_blia bl_em month_sp* cm_current, vce(cluster id)
* treatment effect in subjects with effect modifier
lincom tx + tx_blia
/* Adjusting for the time-dependent confounder-mediator biases all the
estimates except the main effect of the effect modifier. Now the IPW model. */
logistic event tx tx_blia bl_em month_sp* [pw=sc_wt], vce(cluster id)
* treatment effect in subjects with effect modifier
lincom tx + tx_blia
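/* For reference (not in the original code): the same weighted model can be
fit with factor coding, as we did for the potential outcomes data above; it
should reproduce the estimates from the product-term version. */
* logistic event i.tx##i.bl_em month_sp* [pw=sc_wt], vce(cluster id)
* lincom 1.tx + 1.tx#1.bl_em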
/* Both estimates are pretty good, but we have lost considerable precision in the
interaction estimate, which we might be tempted to omit from the model.
Finally, a sensitivity analysis omitting PPV observations. We first
recalculate the weights. */
drop if ppv == 1
drop cp_* mp_* ip_* sc_*
logit tx cm_lagged bl_em month_sp* if month<=fmtx
predict cp_tx_d
replace cp_tx_d = 1 if month>fmtx
replace cp_tx_d = 1-cp_tx_d if tx==0
sort id month
gen mp_tx_d = cp_tx_d
replace mp_tx_d = mp_tx_d*mp_tx_d[_n-1] if id==id[_n-1]
logit tx bl_em month_sp* if month<=fmtx
predict cp_tx_n
replace cp_tx_n = 1 if month>fmtx
replace cp_tx_n = 1-cp_tx_n if tx==0
sort id month
gen mp_tx_n = cp_tx_n
replace mp_tx_n = mp_tx_n*mp_tx_n[_n-1] if id==id[_n-1]
gen ip_tx_wt = mp_tx_n/mp_tx_d
* IPC weights
logit LTFU cm_current tx bl_em month_sp*
predict cp_LTFU_d
gen cp_infu_d = 1-cp_LTFU_d
sort id month
gen mp_infu_d = cp_infu_d
replace mp_infu_d = mp_infu_d*mp_infu_d[_n-1] if id==id[_n-1]
logit LTFU tx bl_em month_sp*
predict cp_LTFU_n
gen cp_infu_n = 1-cp_LTFU_n
sort id month
gen mp_infu_n = cp_infu_n
replace mp_infu_n = mp_infu_n*mp_infu_n[_n-1] if id==id[_n-1]
gen ip_infu_wt = mp_infu_n/mp_infu_d
gen sc_wt = ip_tx_wt*ip_infu_wt
label variable sc_wt "Stabilized combined weight"
format cp* mp* ip* sc_wt %7.3f
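/* Optional (not in the original code): re-check the distribution of the
re-estimated weights before refitting the model. */
* sum sc_wt, detail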
logistic event tx tx_blia bl_em month_sp* [pw=sc_wt], vce(cluster id)
* treatment effect in subjects with effect modifier
lincom tx + tx_blia
/* These are reasonably good estimates */
* clean up
erase preliminary.dta
erase final.dta