R tutorial on data preparation for World Trade Center Health Registry (WTCHR)

Published: 26 January 2018

Recently I received a question from a researcher at the World Trade Center Health Registry (WTCHR). The WTCHR is a prospective cohort study of the physical and psychological effects of the 9/11 terrorist attacks. I was asked how to prepare a dataset for inverse probability of censoring weighting (IPCW) with the R package ipw. In response I wrote a tutorial with R code and simulated example data.

The data and R code can be downloaded here:

R code and data for IPCW example

Below I give some background information.

Like many longitudinal studies, the WTCHR suffers from drop-out or attrition over time. The registry's researchers are investigating methods to compensate or adjust for informative drop-out. One possible method is inverse probability weighting (IPW), see e.g. Cole & Hernán (2008). That is why the WTCHR is interested in the R package ipw, which I co-authored. It can be used to adjust for informative drop-out using inverse probability of censoring weighting (IPCW).

The researcher had read my article in the Journal of Statistical Software, 2011. The article lists two types of applications: 1) a point exposure/treatment with an outcome at a single time point, and 2) a longitudinal study with time-varying exposure and time-varying covariates, plus drop-out. He wanted to know whether the ipw package can also be applied to a longitudinal study with a point exposure (e.g. at baseline), with or without time-varying covariates, but with drop-out over time. In essence, the exposure is not time-varying in this example. In particular, he was interested in how to prepare the dataset for this analysis, and how to use the ipw package to perform it. I gave a sketch of how to do it. Basically, you will need the same method as used for a longitudinal study with a time-varying exposure.

I am assuming you will have a point exposure A, which could indicate the manner in which respondents were affected, e.g. whether they were exposed to the dust cloud, or the number of injuries they sustained. Just as an example, I will assume A indicates dust cloud exposure (0 = no, 1 = yes). A relevant outcome Y could then be the onset of COPD. I would model the effect of A on Y with a Cox proportional hazards model, with Y as a time-to-event outcome. Normally, you would build a dataset with the following columns for each respondent:

Exposure A (0 or 1).
The “endtime”, corresponding to the end of follow-up for each respondent. This is either censoring or the onset of the event, developing COPD.
Outcome Y, which indicates if the endtime corresponds to either censoring (value 0) or the onset of the event (value 1).

These are right censored data. You could then fit a Cox model in R e.g. using the coxph() function.
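As a minimal sketch of this unweighted starting point, the code below simulates a one-row-per-respondent dataset and fits a Cox model with coxph() from the survival package. The object name basdat_demo, the sample size and all simulated values are assumptions made purely for illustration; they are not the WTCHR data or the example files of this tutorial.

```r
library(survival)  # recommended package that ships with R

# Hypothetical one-row-per-respondent data; all values are simulated
set.seed(1)
n <- 200
basdat_demo <- data.frame(
  id = seq_len(n),
  A  = rbinom(n, 1, 0.5)   # dust cloud exposure (0 = no, 1 = yes)
)

# Simulated latent event and censoring times
event_time  <- rexp(n, rate = 0.10 * exp(0.5 * basdat_demo$A))
censor_time <- runif(n, min = 1, max = 15)

# endtime is whichever comes first; Y = 1 for the event, 0 for censoring
basdat_demo$endtime <- pmin(event_time, censor_time)
basdat_demo$Y       <- as.integer(event_time <= censor_time)

# Unweighted Cox proportional hazards model for the effect of A on Y
fit <- coxph(Surv(endtime, Y) ~ A, data = basdat_demo)
summary(fit)
```

This model is only valid if censoring is non-informative, which is the assumption that the rest of this post relaxes.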

However, it could be the case that censoring is due to dropout that is affected by covariates that also affect the outcome. Then you have informative censoring. To correct for this using IPW you will need a dataset built up from interval data. So you will need to split the follow-up of each individual into pieces of follow-up. Each piece of follow-up will have a start and stop time, as shown in dataset “haartdat” in the ipw package. The time points at which the intervals are split need to correspond to all observed end times. So for each individual, you will have multiple rows in the dataset, with each row corresponding to an observed end time, up to the endtime of the individual (but not beyond). Note that end times could be at either irregular or regular intervals, depending on how your data were measured. You could also choose to round the end times so that there are fewer unique values, perhaps to decrease the size of your dataset. You could test the sensitivity of your results to the amount of rounding.
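One possible shortcut for this splitting step is survSplit() from the survival package. The sketch below is not the approach taken in my R script; it assumes a small invented dataset bas and uses all observed end times as split points. The names bas, cuts and long are hypothetical.

```r
library(survival)

# Hypothetical one-row-per-person data (values invented for illustration)
bas <- data.frame(
  id      = 1:3,
  A       = c(0, 1, 1),
  endtime = c(5, 3, 4),
  Y       = c(1, 0, 1)   # 1 = event at endtime, 0 = censored at endtime
)

# Split points: all observed end times (round endtime first to get fewer)
cuts <- sort(unique(bas$endtime))

# Split each person's follow-up at the cut points; follow-up starts at 0
long <- survSplit(
  Surv(endtime, Y) ~ .,
  data  = bas,
  cut   = cuts,
  start = "tstart",
  end   = "tstop",
  event = "Y"
)
long
```

Each person gets one row per interval up to, but not beyond, their own endtime, and Y is 1 only in the interval in which the event occurs. A separate dropout indicator would still have to be constructed for the censoring model.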

Then, for each interval of follow-up, for each individual, you would need to have measurements of the covariates that predict dropout. These could be both baseline and longitudinal. I could imagine that all sorts of health and psychological indicators could be used. The problem is often that longitudinal covariates are measured at different time points than are needed in the dataset for fitting the model. You could obtain values for these covariates either by using last value carried forward imputation, or from imputation using a longitudinal model. I have often used a mixed effects model for this, when it was desired to also smooth out the measurement error.
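Last value carried forward can be sketched in base R as follows. The data frames meas (irregularly timed measurements of L) and long (the interval dataset), and the helper locf_one(), are hypothetical names with invented values; the sketch assumes that the value to use in an interval is the last measurement taken at or before the start of that interval.

```r
# Hypothetical measurements of a longitudinal covariate L at irregular times
meas <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  time = c(0, 1.5, 3.2, 0, 2.1),
  L    = c(10.1, 11.3, 14.5, 8.1, 7.7)
)

# Hypothetical interval dataset with yearly intervals
long <- data.frame(
  id     = c(1, 1, 1, 1, 1, 2, 2, 2),
  tstart = c(0, 1, 2, 3, 4, 0, 1, 2)
)
long$tstop <- long$tstart + 1

# For one person: look up the last measurement at or before each tstart
locf_one <- function(d) {
  m   <- meas[meas$id == d$id[1], ]
  m   <- m[order(m$time), ]
  idx <- findInterval(d$tstart, m$time)  # 0 if no measurement is available yet
  d$L <- ifelse(idx == 0, NA, m$L[pmax(idx, 1)])
  d
}

long <- do.call(rbind, lapply(split(long, long$id), locf_one))
rownames(long) <- NULL
long
```

Intervals that start before the first measurement get NA, so that the gap stays visible rather than being filled silently.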

When you have built your dataset in the above described manner, you can estimate the inverse probability of censoring weights (IPCW) as described in our article (the “temp2” object). Subsequently you can fit your main model, using the estimated weights to weight your sample. Note that it is also possible to use more than one weighting. For instance, in the above described analysis, death could also be a form of informative censoring, which you want to correct for.
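The mechanics can be sketched with the haartdat example data that come with the ipw package, following the call that produces the “temp2” object in our article. The weighted model at the end is only an illustration: sex and age stand in for baseline variables, whereas in the WTCHR setting the point exposure A would take that place. This assumes the ipw package is installed.

```r
library(ipw)       # provides ipwtvm() and the haartdat example data
library(survival)

data(haartdat)

# Stabilized inverse probability of censoring weights.
# Dropout is modelled with a Cox model (family = "survival"); the
# denominator model adds the time-varying covariate cd4.sqrt.
temp2 <- ipwtvm(
  exposure    = dropout,
  family      = "survival",
  numerator   = ~ sex + age,
  denominator = ~ cd4.sqrt + sex + age,
  id          = patient,
  tstart      = tstart,
  timevar     = fuptime,
  type        = "first",
  data        = haartdat
)
summary(temp2$ipw.weights)

# Weighted Cox model; cluster(patient) requests a robust (sandwich) variance.
# sex and age are stand-ins for baseline variables, to show the mechanics only.
fit <- coxph(
  Surv(tstart, fuptime, event) ~ sex + age + cluster(patient),
  data    = haartdat,
  weights = temp2$ipw.weights
)
summary(fit)
```

Because the numerator model contains sex and age, the weighted model conditions on the same variables. With a second set of weights, e.g. for death, the two weight vectors would be multiplied before being passed to the weights argument.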

Finally, a couple of important points about your weighted model. It is necessary to use a robust standard error estimator (e.g. sandwich or bootstrap), since the weighting introduces clustering in the data. It is also wise to check for “positivity”, i.e. whether dropout can occur across the whole range of your covariates, for instance using scatter plots. And to increase the statistical efficiency of the IPW estimator, you can truncate the weights. This is illustrated in our article. All of this is further explained in Cole SR, Hernán MA (2008) Constructing Inverse Probability Weights for Marginal Structural Models. Am J Epidemiol 168:656–664. Also of interest as background is another article in which inverse probability weighting was used to correct for informative censoring: Hernán MA, Brumback B, Robins JM (2000) Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 11:561–570.
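Truncation itself is a small operation, sketched below in base R. The weights w are simulated only to have something to truncate, and truncate_weights() is a hypothetical helper, not a function from the ipw package. As far as I recall, the ipw functions also accept a trunc argument that does this for you; check the package help before relying on it.

```r
# Hypothetical weights, simulated only to have something to truncate
set.seed(42)
w <- exp(rnorm(1000, mean = 0, sd = 0.6))

# Set weights below the p-th quantile or above the (1 - p)-th quantile
# to those quantiles (p = 0.01: 1st and 99th percentiles)
truncate_weights <- function(w, p = 0.01) {
  bounds <- quantile(w, probs = c(p, 1 - p), names = FALSE)
  pmin(pmax(w, bounds[1]), bounds[2])
}

w_trunc <- truncate_weights(w)

# Compare the distributions before and after truncation
summary(w)
summary(w_trunc)
```

Truncation trades some bias for a smaller variance, so it is worth repeating the analysis with a few values of p to see how sensitive the estimate is.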

The above is a sketch of what to do. In response, the researcher remarked that the procedure described creates a larger version of the original dataset, with several or many intervals per person, for the purpose of calculating the IPCW weights. Using these weights corrects for the effect of drop-out. He wanted clarification on how the point exposure A is treated in this process. For example, if all persons in the study were exposed to the dust cloud (to a greater or lesser extent) on 9/11/2001, which would presumably be the first time interval for each person, how does one represent this in the process described?

Regarding the point exposure A (e.g. lesser dust cloud exposure = 0, greater dust cloud exposure = 1), in the start-stop interval notation this exposure would be treated just like a baseline covariate. It needs to have a value for each separate interval for each individual, but this value would be constant within individuals. For instance, suppose follow-up consists of discrete measurement times, at the end of each year. Time origin 0 corresponds to 9/11. A person with lesser dust cloud exposure who develops COPD in the fifth year would look like this:

ID	tstart	tstop	A	L	V	COPD	dropout
1	0	1	0	10.1	male	0	0
1	1	2	0	11.3	male	0	0
1	2	3	0	9.6	male	0	0
1	3	4	0	14.5	male	0	0
1	4	5	0	15.7	male	1	0

As an illustration, I have also included the baseline covariate sex V, which is also constant within this individual, and a longitudinal covariate L. A female with greater dust cloud exposure who drops out after three years could look like this:

ID	tstart	tstop	A	L	V	COPD	dropout
2	0	1	1	8.1	female	0	0
2	1	2	1	9.0	female	0	0
2	2	3	1	7.7	female	0	1

As a sidenote, if IPW is used to correct for confounding of a time varying exposure, and the value of this exposure is stochastic at time origin 0, it would be necessary to also include the interval tstart = -1 (or any other negative value) and tstop = 0, just like in our article about the ipw package.

The next ingredient in this process before performing IPW is to create a dataset with interval data, from an initial dataset with a single line per person/subject/unit. It is really a combination of a few commands like split, unsplit, lapply and merge. To illustrate how to do it I wrote the following:

analysis_data_IPCW_V01.R - an extensively commented R script, explaining how to prepare the dataset for IPCW in this setting. I also show how to perform IPCW, and fit the final weighted model.
longdat.complete.rda - simulated example data with a longitudinal covariate L.
longdat.rda - simulated example data with a longitudinal covariate L, with missing measurements.
basdat.rda - simulated example time-fixed data, one row per individual, including a baseline covariate, endtime and indicator for the exposure, event and dropout.
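Independently of the script, the general idea of combining split, lapply and merge can be sketched as below. This is not the code from analysis_data_IPCW_V01.R. The data frame bas, the yearly grid cuts and the helper expand_one() are hypothetical, and the sketch assumes that every endtime lies on the yearly grid. It reproduces the rows for the two example persons shown earlier, apart from the longitudinal covariate L.

```r
# Hypothetical one-row-per-person data for the two example persons
bas <- data.frame(
  id      = c(1, 2),
  A       = c(0, 1),
  V       = c("male", "female"),
  endtime = c(5, 3),
  COPD    = c(1, 0),
  dropout = c(0, 1)
)

# Yearly measurement times; time origin 0 corresponds to 9/11
cuts <- 1:5

# For one person: one row per interval, up to that person's endtime.
# COPD and dropout can only be 1 in the last interval.
expand_one <- function(d) {
  stops <- cuts[cuts <= d$endtime]
  n     <- length(stops)
  data.frame(
    id      = d$id,
    tstart  = c(0, stops[-n]),
    tstop   = stops,
    COPD    = c(rep(0, n - 1), d$COPD),
    dropout = c(rep(0, n - 1), d$dropout)
  )
}

# Split by person, expand each person, and stack the results
intervals <- do.call(rbind, lapply(split(bas, bas$id), expand_one))

# Merge the time-fixed variables back in; they are constant within persons
long <- merge(intervals, bas[, c("id", "A", "V")], by = "id")
long <- long[order(long$id, long$tstart), ]
long
```

The exposure A and the baseline covariate V are simply repeated on every row of a person, which is how a point exposure is represented in the start-stop format.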

How to use the example data is indicated in the R script. I suggest opening it in an editor with syntax highlighting, such as RStudio or Notepad++. I think it is worthwhile, especially for analysts new to R, to study the script in detail. It contains some useful tricks. The script and data files can be downloaded here:

R code and data for IPCW example