set more off
clear all
/* Section I:
OVERVIEW
The purpose of this example is to show you how to run a pooled logistic regression
(PLR) model with inverse probability (IP) and inverse probability of censoring
(IPC) weights to estimate the marginal hazard ratio for the effect of a
time-dependent treatment on a survival endpoint, in the presence of a
time-dependent confounder that is affected by treatment, and thus mediates at
least part of its effect. A Cox model adjusting for the confounder-mediator
as a time-dependent covariate (TDC) would at best estimate a conditional
relative hazard for the effect of treatment via other pathways ("at best"
because that model would have to meet conditional independence assumptions
for both treatment and the confounder-mediator, which may be unrealistic
in practice).
The gist is that we have to adjust for current values of the confounder-mediator
to meet the sequential randomization assumption (SRA), the extension of
conditional independence to the longitudinal setting. That is, if the
confounder-mediator is changing, we need to adjust for its history up to the
time treatment is initiated to control for confounding by indication. This
motivates including it as a TDC.
The trouble is that the TDC for the confounder-mediator would keep on getting
updated after treatment is initiated, and so subsequent values would reflect
the effects of treatment. Thus adjusting for it would adjust away whatever
part of the treatment effect is mediated via this pathway.
In contrast, the IPW-PLR model incorporates the time-dependent confounder-
mediator as weights. These make the treated and untreated sub-samples at
any time point representative of the population that would still be at risk
if we had both potential outcomes for each subject. Thus we can use an
unadjusted model for treatment, which avoids adjusting away the indirect
effect, but still gets rid of the confounding by indication.
You will see that a big part of this example is taken up with transforming the
dataset for analysis, involving a lot of relatively low-level data management
tasks. If you decide to fit IPW-PLR models, it is very unlikely that the data
you begin with will be in a form suitable for running the final analysis, and
you will need to make these moves (or be able to tell your programmer analyst
how to do them). Stata actually makes this about as easy as it can possibly
be, but there is still a lot of it. In the second section of the example, most of
these tasks will be taken care of already.
OBSERVED OUTCOMES DATASET
This do-file runs an analysis of simulated longitudinal data with a time-dependent
treatment, which begins at a specified first month on treatment, and then
continues (in short, once on treatment, always on treatment); a time-dependent
confounder-mediator; and a survival outcome, which occurs, or not, by
the end of follow-up for each subject. Think of the data as arising from
a study of four years duration. Subjects who remain event-free at four years
are censored. A subgroup of them are also lost to follow-up (LTFU) earlier.
POTENTIAL OUTCOMES DATASET
Because this is simulated, not real data, we were also able to simulate the true
potential outcomes, free of informative censoring by early losses to follow-up.
That is, we simulated an outcome assuming the subject was on treatment from the
start, and another as if he or she were off treatment the entire period. Those
are the two treatment histories we would like to compare.
We can analyze those potential outcomes data to get an estimate of the true
marginal hazard ratio, which is hard to calculate analytically from the parameter
values used in generating the data. Of course we can ONLY do this in simulated
data. We will do this just before getting estimates from the observed data.
Comparing the true value with the estimate allows us to evaluate the performance
of the estimate.
PREPARING THE OBSERVED OUTCOMES DATASET FOR PLR ANALYSIS
The dataset, msm.dta, is in wide format, with one record per subject, and includes
the following variables:
id: subject id
fmtx: first month on treatment
month: end of follow-up, in months
event: event indicator (0 = censored, 1 = event)
cm_0-cm_48: values of the confounder-mediator measured in each month, with cm_0
giving the baseline value. At the end of the example we will look at what happens
when we only get to see its values every 6 months, but treatment and events
depend on the monthly values.
*/
use msm, clear
describe
list id month event fmtx cm_0 cm_1 cm_2 cm_3 cm_4 cm_5 cm_48 in 1/5, noobs clean
/*
We must first transform msm.dta into the long format we need for the PLR models.
The first step in the procedure is to stset the data, as if we were going to
run a conventional survival analysis, then use the stsplit command to obtain
a single record for each month. The variable start specified in the stsplit
command gives the starting time of each interval; it is not used later, but
stsplit won't run unless we specify a name for it.
*/
stset month, failure(event) id(id) // need to specify id to run stsplit next
stsplit start, at(0/48) // start variable required, not used later
drop start _t0 _st month event // some made by stset, not needed for analyses
rename _t month // month of current record is what we need
rename _d event // event indicator specific to month, only 1 in last month
label variable month "Month of follow-up"
label variable event "Event indicator"
sort id month
save preliminary, replace
* check what this has done
keep if _n <= 100
by id: list month event cm_0 cm_1 cm_2 cm_48, noobs clean
use preliminary, clear
/*
Next we evaluate the time-dependent treatment indicator, by comparing the month of the
current record, month, to the first month on treatment (fmtx), which was on the
incoming dataset. The variable fmtx is missing if the subject never began treatment
during the 48 months of follow-up.
*/
capture drop tx
gen byte tx = (month >= fmtx)
/* The expression in parentheses evaluates to 1 if true, otherwise to zero. Because
missing values sort above all numbers in Stata, tx is 0 whenever fmtx is missing,
i.e., for subjects who never start treatment. */
label variable tx "Currently on treatment"
gen tx_lagged = tx[_n-1] if id==id[_n-1]
label variable tx_lagged "On treatment in previous month"
/*
Next we evaluate the confounder-mediator as a time-dependent covariate, using the
current value measured monthly; this will be used in estimating the censoring
weights. We also need the lagged value from the previous month, because that
is what we assume drives current treatment decisions, and so it will be used in
estimating the treatment weights. Note that the ">=" sign is used in the
definitions of cm_current and cm_lagged to ensure that non-missing values are
assigned at each time point: the last non-missing value is carried forward to
the current observation.
*/
cap drop cm_current cm_lagged
gen cm_current = cm_0
gen cm_lagged = cm_0
label variable cm_current "Confounder-mediator: current value"
label variable cm_lagged "Confounder-mediator: lagged value"
forvalues i = 1/48 {
* next line chooses confounder-mediator value for month `i', if non-missing
qui replace cm_current = cm_`i' if month>=`i' & cm_`i'~= .
local lagged = `i'-1
qui replace cm_lagged = cm_`lagged' if month>=`i' & cm_`lagged' ~= .
}
format cm_current cm_lagged %4.2f
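/*
A quick added spot-check that the carry-forward loop did what we wanted: list
the current and lagged values for one subject over the first few months. The
choice of id==1 is arbitrary -- any subject will do.
*/
list id month cm_current cm_lagged if id==1 & month<=6, noobs clean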
/*
Here we have to add an extra record for the loss-to-follow-up (LTFU) models to be
used in calculating inverse probability of censoring (IPC) weights. LTFU is
taken to happen the month after the last month in follow-up for subjects who remain
event-free and do not continue in the study through month 48. We first calculate
the largest follow-up time for each subject, then keep those last records if they
do not represent an event (event==0) and occur before month 48. For the additional
record, we advance the time by one month, and retain the treatment and
confounder-mediator values, which will be needed for evaluating the inverse
probability of censoring weights later.
*/
gen byte LTFU = 0 // set the LTFU indicator to zero on all existing records
label variable LTFU "Lost to follow-up indicator"
save preliminary, replace // save the main dataset
by id: egen maxt = max(month) // figure out the last month in follow-up
keep if month==maxt & month<48 & event==0 // pick out last record if LTFU
qui replace month = month+1 // reset LTFU event time to the following month
qui replace LTFU = 1 // reset LTFU indicator to 1
keep id month tx cm_0 cm_current LTFU // needed in the LTFU models
append using preliminary // join the new record to the analysis dataset
sort id month
save preliminary, replace
/*
Now print some records for LTFU subjects to see what this has done. You'll see
that we have an extra record on which event is missing, so those records will
not be used in outcome analyses; rather they will only be used in models for
the IPC weights.
*/
by id: egen everLTFU = max(LTFU)
by id: egen lastmonth = max(month)
keep if everLTFU==1 & lastmonth<=15
keep if _n<=100
sort id month
by id: list month LTFU event, noobs clean
use preliminary, clear
/*
Finally we make a restricted cubic spline in months since study entry. Then we
compress and save the almost analysis-ready dataset.
*/
mkspline month_sp = month, cubic
keep id month event fmtx tx tx_lagged month_sp* cm_current cm_0 cm_lagged LTFU
compress // stores variables more efficiently when possible
save preliminary, replace
/**TN: It's good to take a look at the dataset to keep track of what's going on:*/
keep if _n <= 100
sort id fmtx cm_0
by id fmtx cm_0: list month tx cm_current cm_lagged LTFU event, noobs clean
use preliminary, clear
/* Now we will verify that the confounder-mediator is affected by treatment --
in these data, by the value of treatment in the previous month. So we use the
lagged value of treatment, evaluated above, in a repeated-measures model to
test its association with the confounder-mediator.
*/
xtmixed cm_current month_sp* tx_lagged || id: month
/*
This analysis verifies that lagged treatment is a strong predictor of the
confounder-mediator, so we do need the IPW-PLR approach. Accordingly, the
last preparatory step is to make the combined stabilized weight for the
confounder-mediator and censoring.
First we calculate the denominator of the inverse probability of treatment/exposure
weight, which is allowed to depend on the time-dependent value of the confounder-
mediator. This is a PLR model for initiation of treatment, since we assume in
this example that once treatment is initiated, it continues without break;
records after the first month on treatment (month>fmtx) are not used; we also
exclude the extra records for subjects who are LTFU before going on treatment.
*/
logit tx month_sp* cm_lagged if month<=fmtx & LTFU~=1
/*
The next line calculates the conditional probability of being on treatment; this
is essentially a time-dependent propensity score.
*/
predict cp_tx_d
label variable cp_tx_d "Conditional prob starting tx"
/*
The conditional probability is 1 after initiation, under our assumption that
once on treatment, always on treatment.
*/
replace cp_tx_d = 1 if month>fmtx
/*
The weight should depend on the probability of the treatment actually received,
so we calculate the complement for each month in which the subject is not yet
on treatment.
*/
replace cp_tx_d = 1-cp_tx_d if tx==0
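/*
An added range check: each conditional probability of the observed treatment
(or non-treatment) should lie in (0, 1]. The if qualifier skips the extra
LTFU records, on which cm_lagged is missing and so cp_tx_d is missing.
*/
assert cp_tx_d>0 & cp_tx_d<=1 if cp_tx_d<.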
/*
Finally, we calculate the marginal probability of the treatment history. We don't get
this directly from the PLR model. Because it is restricted to subjects still off
treatment (except for the first month of treatment), it estimates the conditional
probability of starting treatment, given having remained off treatment so far.
To get the marginal probability of remaining off treatment in month 10, say,
we have to take the product of the probabilities of remaining off in month 1,
the conditional probability of remaining off in month 2, given that you remained
off through month 1, and so on, through month 10. If you started on treatment at
month 13, and are at risk in month 16, the marginal probability of your treatment
history then is the product of the probabilities of not starting in months 1-12, the
probability of starting in month 13, given off in 12, and 1 for months 14-16.
That's what the next lines of code do: after sorting by id and month, the
replace command builds a running product of the conditional probabilities
within each subject. The if qualifier at the end prevents the probabilities
from cumulating across subjects; id[_n-1] is the id number of the previous
record in the dataset (which is sorted by id and month).
*/
sort id month
gen mp_tx_d = cp_tx_d // marginal = conditional on each subject's first record
label variable mp_tx_d "Marginal prob of treatment history"
replace mp_tx_d = mp_tx_d * mp_tx_d[_n-1] if id==id[_n-1]
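/*
An added sanity check on the cumulation: since every factor in the running
product is a probability, the marginal probability can never exceed the
single-month conditional probability (small tolerance for rounding).
*/
assert mp_tx_d <= cp_tx_d + 1e-8 if mp_tx_d<. & cp_tx_d<.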
/*
The numerator of the weight is calculated using the same algorithm, but
the time-dependent values of the confounder-mediator are not used in the model.
*/
logit tx month_sp* if month<=fmtx & LTFU~=1
predict cp_tx_n
label variable cp_tx_n "Conditional prob of starting treatment"
replace cp_tx_n = 1 if month>fmtx
replace cp_tx_n = 1-cp_tx_n if tx==0
sort id month
gen mp_tx_n = cp_tx_n
label variable mp_tx_n "Marginal probability of treatment history"
replace mp_tx_n = mp_tx_n * mp_tx_n[_n-1] if id==id[_n-1]
/*
In a final step, we calculate the stabilized IP weight as the ratio of the
numerator to the denominator marginal probability of the observed treatment history.
*/
gen ip_wt = mp_tx_n/mp_tx_d
label variable ip_wt "Stabilized inverse probability of treatment weight"
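/*
An added check: stabilized IP weights should average close to 1 in large
samples; a mean far from 1 usually signals a misspecified weight model or a
coding error in the cumulation step.
*/
sum ip_wt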
/*
The censoring weights are now calculated using essentially the same
strategy, but now LTFU is the outcome, and we are allowed to use
current treatment in both weights; however, the time-dependent confounder-
mediator can only be used in the denominator of the weight. Since we are
trying to weight up the sample remaining in follow-up to the overall
surviving sample, we need to weight by the inverse probability of
remaining in follow-up. Hence what the course notes call inverse
probability of censoring (IPC) weights are called inverse probability of
remaining in follow-up (infu) weights here.
*/
logit LTFU month_sp* tx cm_current // denominator
predict cp_LTFU_d
label variable cp_LTFU_d "Conditional probability of LTFU"
gen cp_infu_d = 1-cp_LTFU_d
label variable cp_infu_d "Conditional probability in FU"
sort id month
gen mp_infu_d = cp_infu_d
label variable mp_infu_d "Marginal probability in FU"
replace mp_infu_d = mp_infu_d * mp_infu_d[_n-1] if id==id[_n-1]
logit LTFU month_sp* tx // numerator
predict cp_LTFU_n
label variable cp_LTFU_n "Conditional probability of LTFU"
gen cp_infu_n = 1-cp_LTFU_n
label variable cp_infu_n "Conditional probability in FU"
sort id month
gen mp_infu_n = cp_infu_n
label variable mp_infu_n "Marginal probability in FU"
replace mp_infu_n = mp_infu_n * mp_infu_n[_n-1] if id==id[_n-1]
gen ip_infu_wt = mp_infu_n/mp_infu_d
label variable ip_infu_wt "Stabilized IP in FU weight"
/*
In a final step, we calculate the combined stabilized weight as the product of
the treatment and censoring weights.
*/
gen sc_wt = ip_wt*ip_infu_wt
label variable sc_wt "Stabilized combined weight"
format cp* mp* ip* sc_wt %7.3f
save final, replace
/*
Check how the marginal probabilities change over time.
*/
keep if id==48
sort id month
by id: list month tx cp_tx_d mp_tx_d ip_wt, noobs clean
/*
Have a look at the resulting data for one subject.
*/
use final, clear
keep if id==48 // keep the record for one subject; feel free to look at someone else
sort id fmtx cm_0
by id fmtx cm_0: list month event tx cm_lagged ip_wt ip_infu_wt sc_wt, noobs clean
/*
Make sure you understand all the variables you see.
ESTIMATING THE TRUE TREATMENT EFFECT USING THE POTENTIAL OUTCOMES DATASET
We can use this dataset and fit a PLR model to estimate the parameters of the
true marginal structural model. The dataset has already been transformed into
the format needed.
*/
use msm_potential_outcomes, clear
describe
logistic event tx month_sp*
/*
In this analysis, the odds ratio of 0.25 is interpretable as a hazard ratio,
because the data for each monthly outcome are restricted to those still at risk
(so like the hazard ratio, the OR is conditional on having survived to the
current month), and the incidence is very low in any given month (so the OR
and HR don't differ much).
RUNNING THE ANALYSES USING THE OBSERVED DATA
First we will run an unadjusted PLR model, which would be expected to give an
estimate of the treatment effect that is badly confounded by indication.
*/
use final, clear
logistic event tx month_sp*
/*
Comparing the result to the true marginal RH of about 0.25, we see that this is the case.
Now we run a PLR model adjusting for the current value of confounder-mediator.
This is the naive model that inverse weighting is supposed to improve upon.
Because we are adjusting away the effects of treatment mediated by its effects
on the confounder-mediator, we would also expect the treatment to appear less
beneficial in this model. We need to turn on the robust SEs because we are
including a TDC in the model.
*/
logistic event tx month_sp* cm_current, vce(cluster id)
/*
So the bias here is very considerable. (In many applications it is not so bad; we
turned up the heat to make the point).
Now we are almost ready to run the IPW-PLR model.
But we should check the weights before proceeding -- really big ones are a
common problem and represent potential violations of the positivity assumption.
Big weights can be trimmed or those observations can be dropped, but no well-
established methods exist for doing this. Note that because of the stabilization,
a weight >=20 no longer corresponds to a probability of the observed treatment
history of less than 5%.
*/
sum sc_wt, detail
/*
Since there are essentially no big weights, we can run the model using the combined
stabilized weights as estimated. The weights are specified as pweights, as in
the IPTW propensity score models, and we use robust SEs, also required in this
application.
*/
logistic event tx month_sp* [pweight=sc_wt], vce(cluster id)
/*
Finally, the estimate is reasonably close to the relative hazard of 0.25 that we
obtained from the potential outcomes dataset. So at least with nice clean
simulated data like this, IPW-PLR can work well.
Section II
In this section of the example, we will consider four extensions of the IPW-PLR
model from Section I. That model used a failure time outcome, assumed that
once treatment began, it would continue without interruption until failure or
censoring, and that the treatment effect would be constant across time. In this
section, we will consider four extensions:
1. Treatment that can start and stop repeatedly. The models for the IP
weights have to be modified to accommodate this, by including previous
treatment as a predictor, and by including observations after treatment
is initiated. The line of code setting the conditional probability of
treatment to 1 after initiation is not needed, because continuation on
treatment is not a sure thing after initiation.
The causal effect estimated in this model is still the effect of being on
treatment. So this model posits an immediate effect of being on treatment,
which will not always hold. In the next part, we consider how to estimate
a treatment effect in cases where there may be no immediate effect, but
with time the effect increases.
2. Treatment effects that vary with time. We just need to include the
effect of time on treatment in the IPW-PLR model to accommodate this. In
this part of the example, we stick with once-on-treatment, always on treatment,
and failure time outcomes, but accommodating time-dependent treatment
effects could be done the same way in other contexts. Time-on-treatment
would admittedly be tricky to define in the on-off case: for example,
would it make more sense just to include the current episode of use in
calculating time on treatment, or should we include all observed episodes,
perhaps discounting more distant episodes? This is a substantive issue
depending on how the treatment works and the length of the gaps between
episodes of use. In either case, this type of variable treatment effect
is estimated by specifying a model that allows the effect to vary with
time on treatment. The protective effect of statins against CHD might
more or less follow this pattern; its effects through reducing LDL would
be expected to take time, though there might be a more immediate effect
through the inflammation pathway.
3. Repeated rather than failure time outcomes. Rather than IPW-PLR models
for the outcome we have to use IP-weighted repeated measures models. In
the example, treatment can stop and start, the repeated outcome is binary,
and is measured at 6, 12, 18, ..., 48 months. For example, a repeated
outcome might be hospitalizations for asthma, and treatments might be
use of controller medications.
4. A baseline covariate modifies the effect of treatment. The treatment
pattern is once-treated, always-treated, as in part 2 and Section 1, and
the outcome is a failure time, as in parts 1 and 2 and Section 1. For
example, the effect of bisphosphonates on hip fracture may differ between
men and women, or by baseline BMD.
In all four cases, informative censoring is allowed, so IPC weights are again
needed, but nothing about them changes. For each of the four extensions, a
different version of the simulated data is provided, for which the modified
analysis would be appropriate. This does simplify the problem, to provide a
gentler introduction to these models. With real data it's easy enough to know
or figure out in advance whether treatment follows the once-on, always-on
pattern, and whether your outcome is a repeated measure or a failure time.
But it would be up to you to determine whether the effect of treatment changes
with time or is modified by a baseline covariate.
We also need to make some minor modifications to the data management steps,
detailed below. I simplified the variable names, to make the code more compact
and easier to write without making typos, but they are systematic and easy to
learn; use describe to get the variable labels.
To clarify temporality, the order of events is assumed to be as follows: the
confounder-mediator is measured, and on that basis, treatment may be modified.
By the next month or visit, treatment has affected the confounder-mediator,
and risk of the outcome depends on the new value of both treatment and the
confounder-mediator. Accordingly, at any given time point:
- treatment depends on the "lagged" (i.e., most recent) value of the
confounder-mediator, as well as the lagged value of treatment in cases
where it can stop and start
- the confounder-mediator depends on the current value of treatment
- the outcome depends on the current values of treatment and the
confounder-mediator.
This new set-up requires a variable giving the value of the confounder-mediator
at the most recent time point, which we make using code similar to that used
for LVCF values.
To keep things within bounds, the analyses using LVCF values of the confounder-
mediator are omitted. Also, the transformations to long format for the potential
outcomes datasets have been done in the data generation step -- you can just
have a look at the data and run the model. Finally, we suggest doing at least
2 parts of this section, the ones that seem most likely to be useful for you,
only going on to the third and fourth extensions if they also seem worthwhile.
Feel free to take a look at the data in additional places; use methods from last
week to do that.
PART 1: ON-OFF TREATMENT
In this set-up, we assume that the probability of remaining on treatment depends
strongly on treatment status in the previous month, with subjects who are
already on treatment much more likely to remain on it, despite improvements
in the confounder-mediator. So we have to evaluate a variable giving treatment
status in the previous month.
PREPARING THE OBSERVED OUTCOMES DATASET FOR ANALYSIS
The dataset, msm_onoff.dta, is in wide format and includes the following variables:
id: subject id
month: end of follow-up, in months
event: event indicator (0 = censored, 1 = event)
cm_0-cm_48: values of the confounder-mediator measured in each month
tx_0-tx_48: indicators for current treatment in each month
tx_months: a variable giving the total months on treatment -- not needed in the
analysis, but good for picking out subjects with more time on treatment
*/
use msm_onoff, clear
describe
list id month event tx_months tx_0-tx_9 in 1/10, noobs clean
* The transformation into long format is unchanged.
stset month, failure(event) id(id)
stsplit start, at(0/48)
drop start _t0 _st month event
rename _t month
rename _d event
label variable month "Month of follow-up"
label variable event "Event indicator"
label values event noyes
sort id month
/**TN: I added next 2 commands to take a look at the data for some subjects
with events**/
list id if event==1 & id <30
list id month cm_0 cm_1 cm_2 tx_0 tx_1 tx_months event if id==7|id==14
/* Next we evaluate TDCs for current and lagged treatment, which will be used
in the models for current treatment and censoring. */
cap drop tx
cap drop tx_lagged
gen byte tx = 0 // no one on treatment at baseline -- new users only
gen byte tx_lagged = 0
forvalues i = 1/48 {
qui replace tx = tx_`i' if month>=`i' & tx_`i'~= . //use most recent if missing
local lagged = `i'-1 // defines "macro" variable analogous to `i'
qui replace tx_lagged = tx_`lagged' if month>=`i' & tx_`lagged' ~=.
}
label variable tx "Currently on treatment"
label variable tx_lagged "On treatment in previous month"
label values tx tx_lagged noyes
* current and lagged values of confounder-mediator; same code as above
cap drop cm
gen cm_current = cm_0
gen cm_lagged = cm_0
label variable cm_current "Confounder-mediator: current value"
label variable cm_lagged "Confounder-mediator: lagged value"
forvalues i = 1/48 {
qui replace cm_current = cm_`i' if month>=`i' & cm_`i'~= .
local lagged = `i'-1
qui replace cm_lagged = cm_`lagged' if month>=`i' & cm_`lagged'~= .
}
format cm_current cm_lagged %5.3f
/**TN: I added next command to make sure these commands did what we wanted**/
list id month cm_lagged cm_current tx_lagged tx tx_months event if id==7
* extra record for LTFU, unchanged except for the keep list
gen byte LTFU = 0
label variable LTFU "Lost to follow-up indicator"
save preliminary, replace
by id: egen maxt = max(month)
keep if month==maxt & month<48 & event==0
qui replace month = month+1
qui replace LTFU = 1
keep id month tx tx_lagged cm_0 cm_current cm_lagged tx_months LTFU
append using preliminary
sort id month
*spline in month number, unchanged, for fitting baseline event rate
mkspline month_sp = month, cubic
keep id month event tx tx_lagged tx_months month_sp* cm_current cm_lagged LTFU
save preliminary, replace
* Take a look at some records for observations with time on treatment
keep if tx_months>0
keep if _n <= 100
sort id tx_months
by id tx_months: list month event tx tx_lagged cm_current cm_lagged, noobs clean
use preliminary, clear
/* The last preparatory step is to make the combined stabilized weight for the
confounder-mediator and censoring. The differences are that we adjust for
treatment history (using insider information that the lagged value is all that
matters). Also, records after the first month of treatment are included in the
models for treatment, because treatment can switch off and then on again.
Finally, we do not impute cp_tx=1 for all months after the first month of
treatment, for the same reason; it is no longer a done deal at that point.
In accord with the clarified temporality, we use the lagged value of the
confounder-mediator in the models for treatment. Note that we now have to
exclude the extra LTFU records in the IP weight models. */
* IP weights
logit tx cm_lagged tx_lagged month_sp* if event~= . // exclude extra LTFU records
/**TN: Recall cp is for conditional probability and mp for marginal probability;
_d and _n refer to the denominator and numerator of the stabilized weights.**/
predict cp_tx_d
replace cp_tx_d = 1-cp_tx_d if tx==0
sort id month
gen mp_tx_d = cp_tx_d
replace mp_tx_d = mp_tx_d*mp_tx_d[_n-1] if id==id[_n-1]
logit tx tx_lagged month_sp* if event~=.
predict cp_tx_n
replace cp_tx_n = 1-cp_tx_n if tx==0
sort id month
gen mp_tx_n = cp_tx_n
replace mp_tx_n = mp_tx_n*mp_tx_n[_n-1] if id==id[_n-1]
gen ip_tx_wt = mp_tx_n/mp_tx_d
* IPC weights
logit LTFU cm_current tx month_sp*
predict cp_LTFU_d
gen cp_infu_d = 1-cp_LTFU_d
sort id month
gen mp_infu_d = cp_infu_d
replace mp_infu_d = mp_infu_d*mp_infu_d[_n-1] if id==id[_n-1]
logit LTFU tx month_sp*
predict cp_LTFU_n
gen cp_infu_n = 1-cp_LTFU_n
sort id month
gen mp_infu_n = cp_infu_n
replace mp_infu_n = mp_infu_n*mp_infu_n[_n-1] if id==id[_n-1]
gen ip_infu_wt = mp_infu_n/mp_infu_d
* stabilized combined weight
gen sc_wt = ip_tx_wt*ip_infu_wt
label variable sc_wt "Stabilized combined weight"
format cp* mp* ip* sc_wt %7.3f
* check positivity and trim weights
recode cp_tx_d 0/0.05=1 0.05/0.95=0 0.95/1=1, gen(ppv)
label variable ppv "Potential positivity violation"
label values ppv noyes
tab ppv
sum sc_wt if ppv==1, detail
/* In this case, there are a lot of PPV observations, but only a few have big
weights. The next step shows that there are additional big weights among the
rest. We will use sensitivity analyses to see if the big weights make a
difference. To conduct a proper analysis omitting the PPV observations, we
would need to recalculate the weights after omitting the PPV observations;
we will do that in Part 4 of the example. */
sum sc_wt, detail
recode sc_wt 0/20=0 20/max=1, gen(big_wt)
label variable big_wt "Stabilized combined weight > 20"
label values big_wt noyes
tab big_wt
recode sc_wt 20/max=20, gen(tr_sc_wt)
label variable tr_sc_wt "Trimmed stabilized combined weight"
save final, replace
* Have a look at the resulting data for the first few subjects
use final, clear
keep if id<=5
sort id tx_months
by id tx_months: list month event ip_tx_wt ip_infu_wt sc_wt, noobs clean
/*
RUNNING THE ANALYSIS USING THE POTENTIAL OUTCOMES DATASET
We again use the potential outcomes dataset, but this time it is already in long
format, so we can directly proceed to estimating the marginal RH.
*/
use msm_onoff_potential_outcomes, clear
sort id newid
list id newid tx month event in 1/100
logistic event tx month_sp*
/* This suggests that the effect of treatment is to reduce the event rate by >70%
(the OR is approximately interpretable as a hazard ratio in the PLR model). */
* RUNNING THE ANALYSES USING THE OBSERVED DATA
use final, clear
* unadjusted PLR model; robust SEs not needed because there are no TDCs
logistic event tx month_sp*
/* The estimate is somewhat attenuated by confounding by indication. Now adjusting
for the confounder mediator. */
logistic event tx cm_current month_sp*, vce(cluster id)
/* The estimate is worse when we naively adjust for the confounder-mediator. Finally,
the IPW-PLR model. */
logistic event tx month_sp* [pw=sc_wt], vce(cluster id)
* sensitivity analysis using trimmed weights
logistic event tx month_sp* [pw=tr_sc_wt], vce(cluster id)
* sensitivity analysis omitting big weights
logistic event tx month_sp* [pw=sc_wt] if big_wt==0, vce(cluster id)
/* So all IPW models give results that are consistent with the estimate based on
the potential outcomes dataset, and show little sensitivity to big weights. A
caution: in our experience with big datasets, weights can sometimes be much
larger, and quite influential. */
/* PART 2: MODELING TRENDS IN THE TREATMENT EFFECT
This extension requires that we estimate the effect of time on treatment, which we
have to calculate as a TDC.
PREPARING THE OBSERVED OUTCOMES DATASET FOR PLR ANALYSIS
The dataset, msm_txtrend.dta is in wide format and includes the following variables:
id: subject id
month: end of follow-up, in months
event: event indicator (0 = censored, 1 = event)
fmtx: first month on treatment
cm_0-cm_48: values of the confounder-mediator measured in each month
*/
use msm_txtrend, clear
describe
list id month event fmtx in 1/5, noobs clean
* The transformation into long format for PLR is unchanged from Section 1.
stset month, failure(event) id(id)
stsplit start, at(0/48)
drop start _t0 _st month event
rename _t month
rename _d event
label variable month "Month of follow-up"
label variable event "Event indicator"
label values event noyes
sort id month
save preliminary, replace
/* In this analysis we will assess the effect of treatment duration, which we
calculate in each month counting from the first month on treatment. This is
set to zero before treatment starts. */
cap drop tx
cap drop tx_months
gen tx = month>=fmtx
gen tx_months = max(0, month-(fmtx-1)) // max(0, x) = 0 if x<0 or x==.
gen tx_years = tx_months/12 // rescale to get more interpretable point estimate
label variable tx "Currently on treatment"
label variable tx_months "Current treatment duration (months)"
label values tx noyes
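/* Optional sanity checks on the derived treatment variables (not in the
original code): duration should be zero off treatment, and should count up
from 1 starting in the first treated month. Uncomment to verify. */
* assert tx_months==0 if tx==0
* assert tx_months==month-fmtx+1 if tx==1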
* current and lagged values of confounder-mediator (unchanged from Section 1)
cap drop cm
gen cm_current = cm_0
gen cm_lagged = cm_0
label variable cm_current "Confounder-mediator: current value"
label variable cm_lagged "Confounder-mediator: lagged value"
forvalues i = 1/48 {
qui replace cm_current = cm_`i' if month>=`i' & cm_`i'~= .
local lagged = `i'-1
qui replace cm_lagged = cm_`lagged' if month>=`i' & cm_`lagged'~= .
}
format cm_current cm_lagged %5.3f
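/* Optional check that the carry-forward loop behaved as intended (not in the
original code): within subject, this month's lagged value should equal the
previous month's current value. Uncomment to verify. */
* sort id month
* assert cm_lagged==cm_current[_n-1] if id==id[_n-1]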
* extra record for LTFU, unchanged from Section 1
gen byte LTFU = 0
label variable LTFU "Lost to follow-up indicator"
label values LTFU noyes
save preliminary, replace
by id: egen maxt = max(month)
keep if month==maxt & month<48 & event==0
qui replace month = month+1
qui replace LTFU = 1
keep id month tx tx_months cm_current cm_lagged LTFU
append using preliminary
sort id month
*spline in month number, also unchanged
mkspline month_sp = month, cubic
keep id month event fmtx tx tx_months tx_years month_sp* cm_0 ///
cm_current cm_lagged LTFU
save preliminary, replace
sort id fmtx
keep if _n <= 50
by id fmtx: list month cm_current tx tx_months event, noobs clean
use preliminary, clear
* IP weights
logit tx cm_lagged month_sp* if month<=fmtx // month<=. is true, so never-treated subjects (fmtx==.) are included
predict cp_tx_d
replace cp_tx_d = 1 if month>fmtx
replace cp_tx_d = 1-cp_tx_d if tx==0
sort id month
gen mp_tx_d = cp_tx_d
replace mp_tx_d = mp_tx_d*mp_tx_d[_n-1] if id==id[_n-1]
logit tx month_sp* if month<=fmtx
predict cp_tx_n
replace cp_tx_n = 1 if month>fmtx
replace cp_tx_n = 1-cp_tx_n if tx==0
sort id month
gen mp_tx_n = cp_tx_n
replace mp_tx_n = mp_tx_n*mp_tx_n[_n-1] if id==id[_n-1]
gen ip_tx_wt = mp_tx_n/mp_tx_d
* IPC weights
logit LTFU cm_current tx month_sp*
predict cp_LTFU_d
gen cp_infu_d = 1-cp_LTFU_d
sort id month
gen mp_infu_d = cp_infu_d
replace mp_infu_d = mp_infu_d*mp_infu_d[_n-1] if id==id[_n-1]
logit LTFU tx month_sp*
predict cp_LTFU_n
gen cp_infu_n = 1-cp_LTFU_n
sort id month
gen mp_infu_n = cp_infu_n
replace mp_infu_n = mp_infu_n*mp_infu_n[_n-1] if id==id[_n-1]
gen ip_infu_wt = mp_infu_n/mp_infu_d
gen sc_wt = ip_tx_wt*ip_infu_wt
label variable sc_wt "Stabilized combined weight"
format cp* mp* ip* sc_wt %7.3f
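/* A quick plausibility check on the stabilized weights (optional, not in the
original code): their means should be close to 1; a mean far from 1 suggests
a misspecified weight model. Uncomment to verify. */
* summarize ip_tx_wt ip_infu_wt sc_wt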
* Now we will check for potential positivity violations.
recode cp_tx_d 0/0.05=1 0.05/0.95=0 0.95/1=1, gen(ppv)
replace ppv = 0 if month>fmtx // cp_tx_d==1 after treatment begun
label variable ppv "Potential positivity violation"
label values ppv noyes
tab tx ppv, row
sum sc_wt if ppv==1, detail
/* So we have a lot of potential positivity violations, but no large weights
among those observations, or overall. The PPVs are almost all among subjects
who are off treatment, meaning that there was little indication for them to
start. So the probability of their observed treatment status was nearly 1,
and they got small weights. Restriction of the dataset to exclude these
subjects might be appropriate; they are like the infants in the phototherapy
study who are almost never treated. Again, the weights would need to be
re-estimated after restricting the dataset to exclude them, which we do in
the Part 4 example. The next check shows that there are also no big weights
overall. */
sum sc_wt, detail
save final, replace
* look at some data
keep if id<=4
sort id fmtx
by id fmtx: list month event tx tx_months sc_wt, noobs clean
/* We again use the potential outcomes dataset, already in long format, to get an
estimate of the true marginal treatment effects. */
use msm_txtrend_potential_outcomes, clear
logistic event tx tx_years month_sp*
* RH at annual intervals
forvalues i = 0/4 {
dis _newline as result "Estimated RH after `i' years of treatment:"
lincom tx + `i'*tx_years
}
/* In the potential outcomes data, we see almost no immediate effect of
treatment, but increasing effects over time, reaching a reduction of about
86% after 4 years of treatment. */
/* NOW USING THE OBSERVED DATA */
* Unadjusted PLR model
use final, clear
logistic event tx tx_years month_sp*
/* The trend estimate is OK, but the main effect is badly confounded by indication.
Now the naive PLR model adjusting for the current value of the confounder-mediator. */
logistic event tx tx_years month_sp* cm_current, vce(cluster id)
/* Now both estimates are off. That is because a large part of the treatment
effect is mediated by its gradually increasing effect on the confounder-mediator.
Finally we will run the IPW-PLR model.*/
logistic event tx tx_years month_sp* [pw=sc_wt], vce(cluster id)
* RH at annual intervals
forvalues i = 0/4 {
dis _newline as result "Estimated RH after `i' years of treatment:"
lincom tx + `i'*tx_years
}
/* These estimates are OK, at least qualitatively.*/
/* PART 3: REPEATED BINARY OUTCOME
IP weighting applies to repeated measures as well as survival outcomes. In
the PLR model, we have a repeated binary outcome every month, until the outcome
occurs or censoring. In this context, we have a repeated outcome at every
visit, until censoring, but the outcome can be continuous or a count as well
as binary; and even if it is binary, it can happen repeatedly, so subjects
are seen again after a first positive outcome.
In this example, treatment is allowed to stop as well as start, so the IP
weights are calculated as in PART 1. The other required change is substituting
a repeated measures model for the outcome in place of the PLR model used with
failure times. In Stata, we can't use the xt commands, which do not allow
time-varying weights -- the same problem that forces us to use PLR in place of
the Cox model with failure times. Instead we use standard regression commands
with the option vce(cluster id). This is tantamount to fitting GEE models with
so-called independence working correlation and robust SEs. Use of IPC weights
makes this procedure consistent if the data are missing at random, given
covariates and the outcomes we observe (MAR), provided the IPC weight models
are right and there are no unmeasured confounders of dropout. Without the IPC
weights, GEE is only consistent if the data are missing completely at random
(MCAR), given covariates, with no dependence of dropout on observed outcomes.
PREPARING THE OBSERVED OUTCOMES DATASET FOR ANALYSIS
The dataset, msm_repeated.dta is in wide format, with one record per subject, and
includes the following variables:
id: subject id
cm_bl: confounder-mediator at baseline
cm_6, cm_12, ..., cm_48: the confounder-mediator at each 6 month visit
tx_6, tx_12, ..., tx_48: treatment indicator at follow-up visits
event_6, event_12, ..., event_48: outcome event indicator at each visit.
The only thing we keep from baseline is the value of the confounder mediator.
We assume that some subjects may have been started on treatment after their
baseline visit, on the basis of the confounder-mediator value observed there.
So by the 6-month visit, the confounder-mediator and the outcome depend on
treatment, but treatment status depends on the baseline confounder-mediator
value, not the current value. */
use msm_repeated, clear
describe
list id tx_* in 1/10, noobs clean
list id event_* in 1/10, noobs clean
/* The transformation is simpler here, aided by the variable names, which were
chosen to be compatible with the reshape command. In particular the baseline
value of the confounder-mediator is called cm_bl, to avoid making records for
everyone at baseline; these are not needed, because there are no outcomes at
baseline. */
reshape long cm_ tx_ event_, i(id) j(month)
drop if event_==.
* for convenience get rid of the trailing underscores in the variable names
foreach x in cm tx event {
rename `x'_ `x'
}
label variable tx "Currently on treatment"
label variable month "Month of follow-up"
label variable event "Event indicator"
label values tx event noyes
sort id month
tab month tx
tab month event
save preliminary, replace
* check what this has done
keep if _n <= 100
by id: list month event, noobs clean
use preliminary, clear
* get lagged values of treatment and confounder-mediator
sort id month
gen tx_lagged = 0 if month==6
replace tx_lagged = tx[_n-1] if month>6 & id==id[_n-1]
label values tx_lagged noyes
gen cm_lagged = cm_bl if month==6
replace cm_lagged = cm[_n-1] if month>6 & id==id[_n-1]
rename cm cm_current
format cm_current cm_lagged %4.2f
* extra LTFU record
gen byte LTFU = 0 // set the LTFU indicator to zero on all existing records
label variable LTFU "Lost to follow-up indicator"
label values LTFU noyes
save preliminary, replace // save the main dataset
by id: egen maxt = max(month) // identify last month in follow-up for all subjects
keep if month==maxt & month<48 // pick out last record for LTFU subjects
qui replace month = month+6 // reset LTFU event time to the following visit (6 months later)
qui replace LTFU = 1 // reset LTFU indicator to 1
keep id month tx tx_lagged cm_current cm_lagged LTFU // needed in the LTFU models
append using preliminary // join the new record to the analysis dataset
sort id month
save preliminary, replace
keep id month event tx tx_lagged cm_current cm_lagged LTFU
save preliminary, replace
* Check this
keep if _n <= 100
sort id
by id: list month tx tx_lagged cm_current cm_lagged LTFU event, noobs clean
use preliminary, clear
* IP weights
logit tx cm_lagged tx_lagged i.month if event~= . // omit extra LTFU records
predict cp_tx_d
replace cp_tx_d = 1-cp_tx_d if tx==0
sort id month
gen mp_tx_d = cp_tx_d
replace mp_tx_d = mp_tx_d * mp_tx_d[_n-1] if id==id[_n-1]
logit tx tx_lagged i.month if event~=.
predict cp_tx_n
replace cp_tx_n = 1-cp_tx_n if tx==0
sort id month
gen mp_tx_n = cp_tx_n
replace mp_tx_n = mp_tx_n * mp_tx_n[_n-1] if id==id[_n-1]
gen ip_tx_wt = mp_tx_n/mp_tx_d
* IPC weights
logit LTFU tx cm_current i.month
predict cp_LTFU_d
gen cp_infu_d = 1-cp_LTFU_d
sort id month
gen mp_infu_d = cp_infu_d
replace mp_infu_d = mp_infu_d * mp_infu_d[_n-1] if id==id[_n-1]
logit LTFU tx i.month
predict cp_LTFU_n
gen cp_infu_n = 1-cp_LTFU_n
sort id month
gen mp_infu_n = cp_infu_n
replace mp_infu_n = mp_infu_n * mp_infu_n[_n-1] if id==id[_n-1]
gen ip_infu_wt = mp_infu_n/mp_infu_d
* stabilized combined weight
gen sc_wt = ip_tx_wt*ip_infu_wt
label variable sc_wt "Stabilized combined weight"
format ip* sc*_* %7.3f
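/* Optional plausibility check (not in the original code): the stabilized
weights should average close to 1, and because they are cumulative products,
any drift should show up at the later visits. Uncomment to inspect. */
* tabstat sc_wt, by(month) statistics(n mean max)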
* check positivity and trim weights
recode cp_tx_d 0/0.05=1 0.05/0.95=0 0.95/1=1, gen(ppv)
label variable ppv "Potential positivity violation"
label values ppv noyes
tab ppv
* combined stabilized weights in potential positivity violations
sum sc_wt if ppv==1, detail
* overall combined stabilized weight distribution
sum sc_wt, detail
recode sc_wt 0/20=0 20/max=1, gen(big_wt)
label variable big_wt "Stabilized combined weight > 20"
label values big_wt noyes
tab big_wt
recode sc_wt 20/max=20, gen(tr_sc_wt)
label variable tr_sc_wt "Trimmed stabilized combined weight"
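/* Optional (not in the original code): count how many records the trimming
at 20 actually changes, to gauge how informative the trimmed-weight
sensitivity analysis can be; the count should match big_wt==1. Uncomment to
check. */
* count if sc_wt>20 & sc_wt<.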
save final, replace
* Have a look at the resulting data
use final, clear
keep if id<=10
sort id
by id: ///
list month event tx tx_lagged cm_current cm_lagged ip_tx_wt ip_infu_wt sc_wt, ///
noobs clean
* RUNNING THE ANALYSES USING THE POTENTIAL OUTCOMES DATASET
use msm_repeated_potential_outcomes, clear
describe
*list in 1/5, noobs clean
logistic event tx i.month, vce(cluster id)
* RUNNING THE ANALYSES USING THE OBSERVED DATA
use final, clear
* unadjusted repeated measures model
logistic event tx i.month, vce(cluster id)
* Naive repeated measures model adjusting for confounder-mediator
logistic event tx cm_current i.month, vce(cluster id)
/* Using a non-independence working correlation or a random effects model does
not help. Uncomment the next two lines if you want to verify this. */
* xtlogit event tx cm_current i.month, pa i(id) vce(robust)
* xtlogit event tx cm_current i.month, i(id)
* IPW repeated measures model using stabilized combined weights
logistic event tx i.month [pw=sc_wt], vce(cluster id)
* sensitivity analysis omitting potential positivity violations
logistic event tx i.month [pw=sc_wt] if ppv==0, vce(cluster id)
* sensitivity analysis using trimmed weights
logistic event tx i.month [pw=tr_sc_wt], vce(cluster id)
* sensitivity analysis omitting big weights
logistic event tx i.month [pw=sc_wt] if big_wt==0, vce(cluster id)
* What do you find?
/* PART 4: MODELING INTERACTIONS BETWEEN BASELINE COVARIATE AND TREATMENT
This extension requires that we calculate the interaction, which is a TDC
because treatment is a TDC. In this part of the example, we go back to a once-on,
always-on treatment, but the approach would be essentially the same with an
on/off treatment.
PREPARING THE OBSERVED OUTCOMES DATASET FOR PLR ANALYSIS
The dataset, msm_blia.dta is in wide format and includes the following variables:
id: subject id
bl_em: fixed binary baseline effect modifier
month: end of follow-up, in months
event: event indicator (0 = censored, 1 = event)
fmtx: first month on treatment
cm_0-cm_48: values of the confounder-mediator measured in each month
*/
use msm_blia, clear
describe
label values event noyes
list id bl_em month event fmtx in 1/5, noobs clean
* The transformation into long format for PLR is unchanged from Section 1.
stset month, failure(event) id(id)
stsplit start, at(0/48)
drop start _t0 _st month event
rename _t month
rename _d event
label variable month "Month of follow-up"
label variable event "Event indicator"
label values event noyes
sort id month
save preliminary, replace
/* In this analysis we will assess the effect of the interaction between treatment
and the baseline effect modifier. We could calculate a product term in this simple
case, or use Stata factor coding to do it for us; the latter would be preferred in
cases where the baseline variable has more levels, or the treatment variables are
complicated. */
cap drop tx
cap drop tx_months
gen tx = month>=fmtx
gen tx_blia = tx * bl_em
label variable tx "Currently on treatment"
label variable tx_blia "Interaction of tx with baseline effect modifier"
label values bl_em tx tx_blia noyes
* current and lagged values of confounder-mediator (unchanged from Section 1)
cap drop cm
gen cm_current = cm_0
gen cm_lagged = cm_0
label variable cm_current "Confounder-mediator: current value"
label variable cm_lagged "Confounder-mediator: lagged value"
forvalues i = 1/48 {
qui replace cm_current = cm_`i' if month>=`i' & cm_`i'~= .
local lagged = `i'-1
qui replace cm_lagged = cm_`lagged' if month>=`i' & cm_`lagged'~= .
}
format cm_current cm_lagged %5.3f
* extra record for LTFU, unchanged from Section 1
gen byte LTFU = 0
label variable LTFU "Lost to follow-up indicator"
label values LTFU noyes
save preliminary, replace
by id: egen maxt = max(month)
keep if month==maxt & month<48 & event==0
qui replace month = month+1
qui replace LTFU = 1
keep id month tx tx_blia bl_em cm_current cm_lagged LTFU
append using preliminary
sort id month
*spline in month number, also unchanged
mkspline month_sp = month, cubic
keep id month event fmtx tx tx_blia bl_em month_sp* cm_0 cm_current cm_lagged LTFU
save preliminary, replace
keep if _n <= 50
sort id fmtx bl_em
by id fmtx bl_em: list month cm_current tx tx_blia event, noobs clean
use preliminary, clear
* IP weights
logit tx cm_lagged bl_em month_sp* if month<=fmtx
predict cp_tx_d
replace cp_tx_d = 1 if month>fmtx
replace cp_tx_d = 1-cp_tx_d if tx==0
sort id month
gen mp_tx_d = cp_tx_d
replace mp_tx_d = mp_tx_d*mp_tx_d[_n-1] if id==id[_n-1]
logit tx bl_em month_sp* if month<=fmtx
predict cp_tx_n
replace cp_tx_n = 1 if month>fmtx
replace cp_tx_n = 1-cp_tx_n if tx==0
sort id month
gen mp_tx_n = cp_tx_n
replace mp_tx_n = mp_tx_n*mp_tx_n[_n-1] if id==id[_n-1]
gen ip_tx_wt = mp_tx_n/mp_tx_d
* IPC weights
logit LTFU cm_current tx bl_em month_sp*
predict cp_LTFU_d
gen cp_infu_d = 1-cp_LTFU_d
sort id month
gen mp_infu_d = cp_infu_d
replace mp_infu_d = mp_infu_d*mp_infu_d[_n-1] if id==id[_n-1]
logit LTFU tx bl_em month_sp*
predict cp_LTFU_n
gen cp_infu_n = 1-cp_LTFU_n
sort id month
gen mp_infu_n = cp_infu_n
replace mp_infu_n = mp_infu_n*mp_infu_n[_n-1] if id==id[_n-1]
gen ip_infu_wt = mp_infu_n/mp_infu_d
gen sc_wt = ip_tx_wt*ip_infu_wt
label variable sc_wt "Stabilized combined weight"
format cp* mp* ip* sc_wt %7.3f
* Now we will check for potential positivity violations.
recode cp_tx_d 0/0.05=1 0.05/0.95=0 0.95/1=1, gen(ppv)
replace ppv = 0 if month>fmtx // cp_tx_d==1 after treatment begun
label variable ppv "Potential positivity violation"
label values ppv noyes
tab tx ppv, row
sum sc_wt if ppv==1, detail
/* Again, we have a lot of potential positivity violations, but no large
weights among those observations, or overall. The PPVs are almost all among
subjects who are off treatment, meaning that there was little indication for
them to start. So the probability of their observed treatment status was
nearly 1, and they got small weights. At the end of this section, we omit
these observations, re-evaluate the weights, and re-run the analysis. */
sum sc_wt, detail
save final, replace
/* We again use the potential outcomes dataset to get an estimate of the true
marginal treatment effects. This week it is already in long format, so we can
proceed directly to estimating the marginal RH. */
use msm_blia_potential_outcomes, clear
logistic event tx tx_blia bl_em month_sp*
* treatment effect in subjects with effect modifier
lincom tx + tx_blia
* rerun using factor coding
logistic event i.tx##i.bl_em month_sp*
* treatment effect in subjects with effect modifier
lincom 1.tx + 1.tx#1.bl_em
/* The effect modifier is associated with increased baseline risk, but
effectively doubles the protective effect of treatment. */
/* NOW USING THE OBSERVED DATA */
* Unadjusted PLR model
use final, clear
logistic event tx tx_blia bl_em month_sp*
* treatment effect in subjects with effect modifier
lincom tx + tx_blia
/* The effect modifier is well estimated, as is its interaction with
treatment, but the main effect for treatment is badly confounded; this also
biases the estimated effect in the subjects with the effect modifier. */
logistic event tx tx_blia bl_em month_sp* cm_current, vce(cluster id)
* treatment effect in subjects with effect modifier
lincom tx + tx_blia
/* Adjusting for the time-dependent confounder-mediator biases all the
estimates except the main effect of the effect modifier. Now the IPW model. */
logistic event tx tx_blia bl_em month_sp* [pw=sc_wt], vce(cluster id)
* treatment effect in subjects with effect modifier
lincom tx + tx_blia
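/* For reference (not in the original code): the same weighted model can be
fit with factor coding, as we did for the potential outcomes data above; it
should reproduce the estimates from the product-term version. */
* logistic event i.tx##i.bl_em month_sp* [pw=sc_wt], vce(cluster id)
* lincom 1.tx + 1.tx#1.bl_em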
/* Both estimates are pretty good, but we have lost considerable precision in the
interaction estimate, which we might be tempted to omit from the model.
Finally, a sensitivity analysis omitting PPV observations. We first
recalculate the weights. */
drop if ppv == 1
drop cp_* mp_* ip_* sc_*
logit tx cm_lagged bl_em month_sp* if month<=fmtx
predict cp_tx_d
replace cp_tx_d = 1 if month>fmtx
replace cp_tx_d = 1-cp_tx_d if tx==0
sort id month
gen mp_tx_d = cp_tx_d
replace mp_tx_d = mp_tx_d*mp_tx_d[_n-1] if id==id[_n-1]
logit tx bl_em month_sp* if month<=fmtx
predict cp_tx_n
replace cp_tx_n = 1 if month>fmtx
replace cp_tx_n = 1-cp_tx_n if tx==0
sort id month
gen mp_tx_n = cp_tx_n
replace mp_tx_n = mp_tx_n*mp_tx_n[_n-1] if id==id[_n-1]
gen ip_tx_wt = mp_tx_n/mp_tx_d
* IPC weights
logit LTFU cm_current tx bl_em month_sp*
predict cp_LTFU_d
gen cp_infu_d = 1-cp_LTFU_d
sort id month
gen mp_infu_d = cp_infu_d
replace mp_infu_d = mp_infu_d*mp_infu_d[_n-1] if id==id[_n-1]
logit LTFU tx bl_em month_sp*
predict cp_LTFU_n
gen cp_infu_n = 1-cp_LTFU_n
sort id month
gen mp_infu_n = cp_infu_n
replace mp_infu_n = mp_infu_n*mp_infu_n[_n-1] if id==id[_n-1]
gen ip_infu_wt = mp_infu_n/mp_infu_d
gen sc_wt = ip_tx_wt*ip_infu_wt
label variable sc_wt "Stabilized combined weight"
format cp* mp* ip* sc_wt %7.3f
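/* Optional (not in the original code): re-check the distribution of the
re-estimated weights before refitting the model. */
* sum sc_wt, detail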
logistic event tx tx_blia bl_em month_sp* [pw=sc_wt], vce(cluster id)
* treatment effect in subjects with effect modifier
lincom tx + tx_blia
/* These are reasonably good estimates */
* clean up
erase preliminary.dta
erase final.dta