Vignette overview

The washb R package was developed to help standardize and facilitate intention-to-treat analyses in the WASH Benefits trial. This vignette provides an overview of the main functions included in the package. As an illustrative example, the vignette uses the Bangladesh trial primary outcomes (diarrhea, length-for-age Z-scores) to calculate several unadjusted and adjusted estimates of intervention effects. It illustrates a range of different estimators, from unadjusted approaches such as the paired t-test (see washb_ttest) to targeted maximum likelihood estimation with ensemble machine learning (see washb_tmle).

For unadjusted analyses in a paired design, the paired t-test (washb_ttest) for continuous outcomes and the Mantel-Haenszel pooled risk difference or ratio (washb_mh) for binary outcomes are “oldies but goodies”: they provide a simple and robust way to compare outcomes between arms. However, for unpaired analyses and/or any type of adjusted analyses, you should consider regression (washb_glm) or semi-parametric, doubly robust (washb_tmle) estimators.

Brief description of functions documented in the vignette

  • washb_mean Calculates means for a variable along with robust sandwich SEs and 95% confidence intervals that account for clustering within id.

  • washb_ttest Implements a paired t-test to compare continuous outcomes between two arms of the study.

  • washb_mh Implements the pooled Mantel-Haenszel estimator of the ITT prevalence ratio or difference between two arms of the study.

  • washb_glm Estimates ITT effects using generalized linear models (GLM) with robust standard errors for unadjusted and adjusted estimates. Includes conversions of coefficients to relative risks (for binary outcomes), calculates confidence intervals, and prescreens possible adjustment covariates using the internal washb_prescreen function. Can be used for subgroup analyses (tests of effect modification).

  • washb_tmle Estimates ITT effects using targeted maximum likelihood estimation, which is particularly useful for adjusted analyses. Like washb_glm, this function provides measures of effect on absolute and relative scales (for binary outcomes), and pre-screens possible adjustment covariates using the internal washb_prescreen function. Unlike other estimation approaches in the package, washb_tmle can account for outcomes that are missing systematically by treatment arm or baseline covariates.

  • washb_permute Conducts a permutation test of the independence of an outcome by treatment arm, conditional on randomization block using the Wilcoxon rank-sum test statistic.

Internal package functions (only used internally by the main package functions):

  • washb_prescreen Pre-screens adjustment covariates, restricting the set to those with a likelihood ratio test P<0.2 (or a user-set p-value). Called internally by the washb_glm and washb_tmle functions, or can be called on its own.

  • sandwichSE Calculates the Huber Sandwich Estimator for robust standard errors (Freedman 2006).

  • washb_lincom Estimates SEs, CIs, and P-values for Pr>|Z| from a linear combination of GLM regression coefficients.

In addition to these functions, the package includes data from the WASH Benefits trial completed in Bangladesh by the UC Berkeley team. These datasets include:

  • washb_bang_tr Bangladesh treatment assignments dataset

  • washb_bangladesh_anthro Bangladesh anthropometry analysis dataset

  • washb_bangladesh_diar Bangladesh diarrhea dataset

  • washb_bangladesh_enrol Bangladesh enrollment analysis dataset

  • washb_bangladesh_track_compound Bangladesh compound tracking dataset

  • washb_bangladesh_uptake Bangladesh uptake analysis dataset

To load data for analysis, use the command data(washb_bang_tr), replacing washb_bang_tr with the appropriate dataset name.
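For example (data(package="washb") is base R's standard way to enumerate a package's bundled datasets):

# load the Bangladesh treatment assignment dataset
data(washb_bang_tr)

# list all datasets bundled with the washb package
data(package="washb")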

Getting started

R resources

For new users of R, the following sources are helpful learning resources:
1. Calendar of UC Berkeley D-Lab workshops (Look for R bootcamps)
2. Cookbook for R
3. Datacamp: R for SAS, SPSS, and Stata users (This course is for those familiar with statistical programming who only need to learn R-specific syntax. Great GUI learning interface; the introductory course is free, and academic discounts are available for paid courses.)
4. Try R codeschool (An interactive introductory course)
5. UCLA Institute For Digital Research and Education: R resources (Great source for topic-specific tutorials)
6. R-tutor introduction
7. Johns Hopkins Coursera course in R programming

Installation from GitHub

The WASH Benefits R package is developed on GitHub: https://ben-arnold.github.io/washb. To install the R package from GitHub, you will need the devtools package. From within R, you can install devtools with the code: install.packages("devtools"). This only needs to occur once (per computer). After installation, load the devtools package (type library(devtools)), and then install the washb package by typing:

devtools::install_github("ben-arnold/washb")

We will periodically update the washb package with improvements and bug fixes. When a new package version is available, the UCB team will send an email to notify the WASHB community of the update. You can always ensure you have the latest version by re-installing the package. If you encounter a bug, find something frustrating or confusing, or have suggestions for improvements, email Andrew and Ben to let them know!

Installing required packages

The washb package relies on several functions found in a few different packages that extend the basic R installation. These packages need to be installed onto each computer once, and loaded into the R workspace each time R is opened. If you are installing the packages for the first time, use the following code:

install.packages("sandwich")
install.packages("lmtest")
install.packages("coin")
install.packages("plyr")
install.packages("metafor")

If you plan to use targeted maximum likelihood estimation and machine learning through the washb_tmle function you will additionally need to install:

install.packages("tmle")
install.packages("SuperLearner")

Once packages are installed for the first time, they will be automatically loaded along with the washb package when you load the package:

library(washb)

The library() command can be used to load any installed package. It is not needed for any of the required packages listed above, but it is for suggested packages. For example, to fit a negative binomial GLM model, the MASS package is required; it can be installed once with install.packages("MASS") and loaded each time R is opened with library("MASS").
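For example, to set up for negative binomial models:

install.packages("MASS")  # one-time installation, per computer
library("MASS")           # load in each new R session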

Suggested packages include: nnls, zoo, arm, MASS, Matrix, lme4, gam, splines, foreach, and glmnet.

Package documentation

Use ??washb to see the full list of package documentation, or use ?function_name to see the help documentation for any specific function. For example, ?washb_glm returns the help page for the washb_glm function.

A note on clustered SEs

IMPORTANT Read this at least once if you are a WASH B investigator.

The WASH Benefits trials were designed as pair-matched, cluster-randomized trials. The pair-matching is sometimes not immediately obvious because the trial has so many arms. But, we enrolled clusters in groups of 8 geographically contiguous clusters (9 in Kenya, due to the passive control arm) and then allocated the clusters to either one of six intervention arms or the double-sized control (Arnold et al. 2013). It is a geographically pair-matched design, because any comparison between two arms is pair-matched within the randomization block. For example, within a randomization block there will be one cluster allocated to the Nutrition arm “matched” to 2 clusters allocated to the control arm.

Our inference assumes that clusters are independent units. That is important for estimating effects, because if clusters were not independent then we could have contamination (spillover) between arms and it would be impossible to use randomization alone to identify the intervention effects. (We formally tested this assumption in the Bangladesh trial and found no evidence for between-cluster spillover effects; tests for Kenya will be complete by late October.)

To correctly estimate the variance in the pair-matched design, we need to treat the independent unit as the pair (randomization block), because there is some dependence in the outcome induced by matching.

When is it reasonable to treat clusters as the independent unit? First, when estimating the marginal mean of a variable, perhaps by trial arm with washb_mean, there will be little or no difference between treating clusters versus pairs as the independent unit, since (with the exception of the control arm) there are no replicated units within a pair. In designs that are not pair-matched, such as the NIH R21 substudy (where sanitation and control clusters were not enrolled within matched pairs), it is reasonable to treat the cluster as the independent unit. Treating the pairs/blocks as the independent unit is never wrong; it will just be slightly conservative in analyses that are not pair-matched.

Diarrhea data processing

Below we will use some actual analyses from the Bangladesh trial to illustrate the different functions. To do that, we also need to read in the final analysis datasets, merge them to baseline characteristics, and merge on the actual (unblinded) treatment assignments. We also do a small amount of data processing to help facilitate the analyses.

This is an example of how to load and merge the enrollment and diarrhea data, with the actual treatment assignments. It creates a final analysis data frame called dd (for diarrhea data). At this time, these datasets are not included with the package – if you need access to them, please email the UCB team.

# Load Data from package
data(washb_bang_tr)
data(washb_bangladesh_enrol)
data(washb_bangladesh_diar)
# load and merge the final analysis files
# treatment assignments, enrollment characteristics, and diarrhea measurements
# note: these are unblinded treatment assignments in the encrypted volume 0-treatment-assignments
# if you do not have access to these data through Dropbox, but need it, please contact UCB
washb_bd_enrol <- washb_bd_anthro <- NULL

washb_bd_tr    <- washb_bang_tr
washb_bd_enrol <- washb_bangladesh_enrol
washb_bd_diar  <- washb_bangladesh_diar

# drop svydate and month because they are superseded in the child level diarrhea data
washb_bd_enrol$svydate <- NULL
washb_bd_enrol$month <- NULL

# merge the treatment assignments to the baseline dataset
dd <- merge(washb_bd_enrol,washb_bd_tr,by=c("clusterid","block"),all.x=T,all.y=T)

# merge the baseline data with diarrhea measurements (keep only compounds with follow-up data)
dd <- merge(dd,washb_bd_diar,by=c("dataid","clusterid","block"),all.x=F,all.y=T)

# subset to post-intervention measurements: Year 1 or Year 2
dd <- subset(dd,svy==1|svy==2)

# exclude new births that are not index children
dd <- subset(dd,sibnewbirth==0)

# exclude children with missing data
dd <- subset(dd,!is.na(dd$diar7d))

# re-order the tr factor for convenience
dd$tr <- factor(dd$tr,levels=c("Control","Water","Sanitation","Handwashing","WSH","Nutrition","Nutrition + WSH"))

# ensure that month is coded as a factor
dd$month <- factor(dd$month)

# sort the data for perfect replication when using random splits for V-fold cross-validation (washb_tmle and adj permutation tests)
dd <- dd[order(dd$block,dd$clusterid,dd$dataid,dd$childid),]

LAZ data processing

This is an example of how to load and merge the anthropometry and enrollment datasets. It pretty much parallels the processing steps of the diarrhea data (above). It creates a final analysis data frame called lazd. At this time, the anthropometry dataset is not included with the package – if you need access to it, please email the UCB team.

# Load data from package
data(washb_bangladesh_anthro)
# load the anthropometry dataset
washb_bd_anthro <- washb_bangladesh_anthro

#  merge to final analysis files, loaded above (keep only compounds with follow-up data)
lazd <- merge(washb_bd_enrol,washb_bd_tr,by=c("clusterid","block"),all.x=T,all.y=T)
lazd <- merge(lazd,washb_bd_anthro,by=c("dataid","clusterid","block"),all.x=F,all.y=T)

# subset to the index (target) children measured in Year 2 (primary outcome)
lazd <- subset(lazd,svy==2)
lazd <- subset(lazd,tchild=="Target child")

# drop children with extreme LAZ values
lazd <- subset(lazd,laz_x!=1)

# re-order the tr factor for convenience
lazd$tr <- factor(lazd$tr,levels=c("Control","Water","Sanitation","Handwashing","WSH","Nutrition","Nutrition + WSH"))

# ensure that month is coded as a factor
lazd$month <- factor(lazd$month)

# rename aged to agedays (for consistency with the diarrhea file)
lazd$agedays <- lazd$aged

# sort the data for perfect replication when using random splits for V-fold cross-validation (washb_tmle and adj permutation tests)
lazd <- lazd[order(lazd$block,lazd$clusterid,lazd$dataid,lazd$childid),]

Now that the dataset is cleaned and merged, the washb package functions can be applied and the results will match the WASH Benefits Bangladesh primary analysis.

washb_mean

function overview

The washb_mean function is most useful for calculating variable means and confidence intervals – for example, calculating average compliance (uptake) within a given intervention arm, or calculating the average LAZ by arm or measurement round. In the WASH Benefits trials, the independent unit is typically the cluster, so the id argument should identify the cluster ID.

Arguments:
Y= Outcome variable.
id= ID variable for independent units (e.g., cluster ID).
print= Logical. If TRUE (the default), the function will print the results.

Usage:

washb_mean(Y,id,print=TRUE)

example

Using the WASH Benefits enrollment survey data and the washb_mean function, the means and 95% confidence intervals of several maternal characteristics are calculated below:

MomAge<-washb_mean(Y=washb_bd_enrol$momage,id=washb_bd_enrol$clusterid)
MomEduY<-washb_mean(Y=washb_bd_enrol$momeduy,id=washb_bd_enrol$clusterid)

The <- command assigns the output of washb_mean to the new objects called MomAge and MomEduY. Notice how there are 20 missing observations in the momage variable, but none in momeduy.

If you want to compare means between groups using a difference, prevalence ratio, or incidence ratio (depending on the outcome), use washb_ttest (for a t-test of unadjusted continuous outcomes), washb_mh (for a Mantel-Haenszel test of unadjusted binary outcomes), washb_glm (for unadjusted or adjusted estimates using generalized linear models), or washb_tmle (for unadjusted or adjusted estimates using targeted maximum likelihood estimation).

washb_ttest

function overview

washb_ttest conducts a paired t-test for the difference in a continuous outcome between two treatment arms, estimating differences in means within matched pair (randomization block). This function can be used for unadjusted analyses of continuous outcomes, where the parameter of interest is the difference in means.

Arguments:

Y=continuous outcome (in this example, Y= child length-for-age z-score lazd$laz)

tr=binary treatment group variable, comparison group first

strat=stratification variable (In WASH Benefits: the pair-matched block variable)

contrast= character vector of the format c("contrast1", "contrast2") of the two factor levels within the treatment variable (“tr”) to compare. "contrast1" is the reference group and "contrast2" is the active comparison.

Usage:

washb_ttest(Y,tr,strat,contrast)

paired t-test for mean differences

Below, washb_ttest is used to conduct a paired t-test of the difference in LAZ between the nutrition arm and the control arm in the Bangladesh trial. The code chunks below also provide an example of how to “vectorize” the call across arms, which works because washb_ttest returns a simple vector of numbers (this type of efficient programming does not work well with functions that return more complex output, such as washb_glm or washb_tmle).

# Run washb_ttest on Nutrition vs. control arm comparison
ttest.C.N <- washb_ttest(Y=lazd$laz,tr=lazd$tr,strat=lazd$block, contrast=c("Control", "Nutrition"))
round(ttest.C.N,4)
  diff  ci.lb  ci.ub t-stat      p 
0.2526 0.1470 0.3581 4.7545 0.0000 

Advanced R tip: for a function like washb_ttest that returns a simple vector or list of single parameter estimates, you can use the sapply command to efficiently apply the function to all of the treatment arm contrasts.

# Create vector of contrasts for the pre-specified hypothesis 1
# (each intervention vs. control) to use sapply
h1.contrasts <- list(c("Control","Water"),
                     c("Control","Sanitation"),
                     c("Control","Handwashing"),
                     c("Control","WSH"),
                     c("Control","Nutrition"),
                     c("Control","Nutrition + WSH"))

# Use sapply to apply washb_ttest() across all contrasts
diff.h1LAZ <- t(sapply(h1.contrasts,washb_ttest,Y=lazd$laz,tr=lazd$tr,strat=lazd$block))
rownames(diff.h1LAZ) <- c("Water v C","Sanitation v C","Handwashing v C","WSH v C","Nutrition v C","Nutrition + WSH v C")
round(diff.h1LAZ,4)
                       diff   ci.lb  ci.ub  t-stat      p
Water v C           -0.0646 -0.1812 0.0520 -1.1015 0.2737
Sanitation v C      -0.0227 -0.1383 0.0928 -0.3913 0.6965
Handwashing v C     -0.0679 -0.1763 0.0404 -1.2458 0.2161
WSH v C              0.0174 -0.0930 0.1278  0.3135 0.7546
Nutrition v C        0.2526  0.1470 0.3581  4.7545 0.0000
Nutrition + WSH v C  0.1292  0.0166 0.2418  2.2797 0.0250

washb_mh

function overview

washb_mh estimates a treatment effect (either difference or ratio) for a binary outcome pooled across strata. In the WASH Benefits trials, this function can estimate the unadjusted prevalence difference or ratio accounting for pair-matching in the design (where pair is the stratifying variable).

Arguments:

Y=binary outcome (in this example, Y= 7-day diarrheal disease recall dd$diar7d)
tr=binary treatment group variable, comparison group first
strat=stratification variable (In WASH Benefits: the pair-matched block variable)
contrast= character vector of the format c("contrast1", "contrast2") of the two factor levels within the treatment variable (“tr”) to compare. "contrast1" is the reference group and "contrast2" is the active comparison.

measure=measure of effect. RR = prev ratio, RD = prev difference

Usage:

washb_mh(Y,tr,strat,contrast,measure="RR")

estimate a pooled prevalence (or risk) ratio

Apply washb_mh to the sanitation vs. control arm contrast (with the round() function added to clean up the output and keep it from spilling across multiple lines).

mh.C.S <- washb_mh(Y=dd$diar7d,tr=dd$tr, contrast=c("Control","Sanitation"), strat=dd$block,measure="RR")
round(mh.C.S,4)
     PR,    ci.lb    ci.ub    logPR se.logPR        Z        p 
  0.6085   0.4569   0.8102  -0.4968   0.1461  -3.4001   0.0007 

estimate a pooled prevalence (or risk) difference

The function washb_mh can also be used to calculate an unadjusted risk difference:

round(washb_mh(Y=dd$diar7d,tr=dd$tr, contrast=c("Control","Sanitation"), strat=dd$block,measure="RD"),4)
     RD   se.RD   ci.lb   ci.ub       Z       p 
-0.0220  0.0059 -0.0335 -0.0105 -3.7396  0.0002 

Advanced R tip: for a function like washb_mh that returns a simple vector or list of single parameter estimates, you can use the sapply command to efficiently apply the function to all of the treatment arm contrasts.

# Create vector of contrasts for the pre-specified hypothesis 1
# (each intervention vs. control) to use sapply
h1.contrasts <- list(c("Control","Water"),
                     c("Control","Sanitation"),
                     c("Control","Handwashing"),
                     c("Control","WSH"),
                     c("Control","Nutrition"),
                     c("Control","Nutrition + WSH"))

# Apply sapply to run the function on each comparison in the h1.contrasts list
diff.h1 <- t(sapply(h1.contrasts,washb_mh,Y=dd$diar7d,tr=dd$tr,strat=dd$block,measure="RR"))
rownames(diff.h1) <- c("Water v C","Sanitation v C","Handwashing v C","WSH v C","Nutrition v C","Nutrition + WSH v C")
print(round(diff.h1,4))
                       PR,  ci.lb  ci.ub   logPR se.logPR       Z      p
Water v C           0.8883 0.6966 1.1326 -0.1185   0.1240 -0.9556 0.3393
Sanitation v C      0.6085 0.4569 0.8102 -0.4968   0.1461 -3.4001 0.0007
Handwashing v C     0.6002 0.4526 0.7958 -0.5105   0.1440 -3.5463 0.0004
WSH v C             0.6917 0.5310 0.9010 -0.3686   0.1349 -2.7324 0.0063
Nutrition v C       0.6438 0.4872 0.8505 -0.4404   0.1421 -3.0990 0.0019
Nutrition + WSH v C 0.6193 0.4709 0.8146 -0.4791   0.1399 -3.4258 0.0006

washb_glm

function overview

washb_glm fits a generalized linear model to estimate intention-to-treat (ITT) effects in a trial. The contrast argument enables you to specify the arms that you wish to compare (reference group as the first element, comparison group as the second). To estimate adjusted effects, you can provide a data frame of adjustment covariates using the W argument; by default, covariates are pre-screened to include only those that are associated with the outcome based on a likelihood ratio test. To override the pre-screening algorithm for some or all covariates, use the forcedW argument.

If the design is pair-matched (all primary outcome analyses in WASH B should be pair matched), use the pair argument to specify an id variable for pairs, and specify the same variable in the id argument to get correct SEs. If the design is not pair-matched, then the id argument should identify the smallest independent unit in the trial (e.g., cluster). The function computes robust standard errors.

Note that this function automatically drops observations from the analysis if they are from a pair with no treatment contrast – this can happen with incomplete randomization blocks in the Kenya trial.

Note that for binary outcomes such as diarrhea, you can estimate the prevalence ratio in a GLM model using a log link (family=binomial(link='log')); the canonical logit link (family='binomial') instead estimates the odds ratio, which we do not recommend because ORs are harder to interpret. Occasionally, a GLM model with a non-canonical link will fail to converge, particularly if the data are sparse. If this occurs for a binomial family with the log link, we recommend modified Poisson regression to estimate the prevalence ratio, using the argument family=poisson(link='log'). See Zou 2004 and Yelland et al. 2011 for details.
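For example, here is a sketch of the modified Poisson fallback for the sanitation vs. control diarrhea comparison used later in this vignette (only needed if the log-binomial fit fails to converge; the robust SEs computed by washb_glm make the Poisson model valid for a binary outcome):

# modified Poisson regression of the prevalence ratio
Diar.pois.C.S <- washb_glm(Y=dd$diar7d,tr=dd$tr,pair=dd$block, id=dd$block, contrast=c("Control","Sanitation"), family=poisson(link='log'))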

Finally, washb_glm also makes it straightforward to estimate conditional (i.e., subgroup) effects using the optional V argument (example below).

Arguments:
Y= Binary, count, or continuous outcome
tr= binary treatment group variable, comparison group first
pair=stratification variable (In WASH Benefits: the pair-matched block variable)
W= Optional data frame that includes adjustment covariates (for adjusted estimates). W will be prescreened, and only those covariates associated with the outcome will be adjusted for. See washb_prescreen for more details.
forcedW= Optional vector of variable names to force as adjustment covariates (no screening)
V= Optional variable name for subgroup analyses, which is interacted with tr. Continuous variables can be used and an interaction term will be fit, but the calculated point estimates and confidence intervals for the treatment effect between subgroups will only be returned if V is a factor.
id= id of cluster used in calculating robust standard errors.
contrast= character vector of the format c("contrast1", "contrast2") of the two factor levels within the treatment variable ("tr") to compare. "contrast1" is the reference group and "contrast2" is the active comparison.
family= GLM model family (gaussian, binomial, poisson, or negative binomial). Use family=binomial(link='log') to return prevalence ratios instead of odds ratios when the outcome is binary. Use family="neg.binom" for a negative binomial model.
pval= p-value threshold to use in prescreening the set of potential adjustment variables W. Any variable in W with a likelihood ratio test p-value below this threshold will be included in the final adjustment set. Defaults to 0.2.
print= logical (TRUE or FALSE, defaults to TRUE) for whether to print output or just store it in the assigned object.
verbose= logical (TRUE or FALSE, defaults to FALSE) for whether to print a description of objects to the output.

Usage:

washb_glm(Y, tr, pair, W=NULL, forcedW=NULL, V=NULL, id, contrast, family='gaussian', pval=0.2, print=TRUE)

Returned objects:
* objectname$TR to return the treatment effect and inference.
* objectname$fit to return full glm model estimates.
* objectname$vcv to return the variance-covariance matrix.
* objectname$rowdropped to return the vector of observations included in the model fit.
* objectname$lincom to return subgroup-specific conditional relative risk estimates if a subgroup V is specified.
* objectname$glmModel to return the glm model fit to be used with predict() to return model predictions of the outcome. Note that this is the glm model fit without adjusting the standard errors for repeated measures, so you should not use it for inference if the data include repeated observations and/or if you have fit a modified Poisson model. It is safest to use the returned $TR and $fit objects for inference, which include robust SEs.
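For example, here is a minimal sketch of pulling predicted prevalences from the returned glm fit (predictions only; per the note above, use $TR or $fit for inference):

# fit the model quietly, then predict the outcome from the underlying glm object
fit.C.S <- washb_glm(Y=dd$diar7d,tr=dd$tr,pair=dd$block, id=dd$block, contrast=c("Control","Sanitation"), family=binomial(link='log'), print=FALSE)
phat <- predict(fit.C.S$glmModel, type="response")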

unadjusted analysis (binary outcome ratio)

As an example, the following code applies the washb_glm function to compare diarrhea prevalence between the sanitation and control arms. To estimate the prevalence ratio (PR), we fit the GLM model with a log link, family=binomial(link='log'). Occasionally, a GLM model with a non-canonical link function like family=binomial(link='log') will fail to converge. If this occurs, use modified Poisson regression to estimate the prevalence ratio with the argument family=poisson(link='log') (see details above for references).

Diar.glm.C.S <- washb_glm(Y=dd$diar7d,tr=dd$tr,pair=dd$block, id=dd$block, contrast=c("Control","Sanitation"), family=binomial(link='log'))


-----------------------------------------
 GLM Fit: Sanitation vs. Control 
-----------------------------------------
                    PR      2.5%     97.5%   Estimate Std. Error   z value    Pr(>|z|)
trSanitation 0.6109162 0.4510202 0.8274985 -0.4927955  0.1548202 -3.183019 0.001457482

In addition to the function's auto-printed output, the object returned by washb_glm contains a number of components. For example, 'objectname'$TR returns just the treatment effect of the intervention arm, which is useful when saving to a row of an R object containing only the treatment effects across all of the desired arm contrasts.

round(Diar.glm.C.S$TR,4)
                 PR  2.5%  97.5% Estimate Std. Error z value Pr(>|z|)
trSanitation 0.6109 0.451 0.8275  -0.4928     0.1548  -3.183   0.0015

This result shows that the sanitation intervention led to a (1-0.61) = 39 percent relative reduction in diarrhea compared with the control arm.

unadjusted analysis (binary outcome difference)

To estimate a difference between arms for a binary outcome (the prevalence difference or risk difference, depending on the outcome definition), the syntax is the same as the previous example, except you need to specify a linear (rather than log-linear) model with family='gaussian'. This is sometimes called a linear probability model.

RD.Diar.glm.C.S <- washb_glm(Y=dd$diar7d,tr=dd$tr,pair=dd$block, id=dd$block, contrast=c("Control","Sanitation"), family='gaussian')


-----------------------------------------
 GLM Fit: Sanitation vs. Control 
-----------------------------------------
                   Coef.        2.5%      97.5% Std. Error   z value     Pr(>|z|)
trSanitation -0.02200495 -0.03458109 -0.0094288  0.0064164 -3.429485 0.0006047275

This result shows that the sanitation intervention led to a 2.2 percentage point (absolute) reduction in diarrhea compared with the control arm.

unadjusted analysis (continuous outcome)

As an example, the following code applies the washb_glm function to compare LAZ scores between the sanitation and control arms of the trial. Since it is a pair-matched design, we specify pair=lazd$block and id=lazd$block, where block is the randomization block (which identifies the matched pairs). For a continuous outcome, the family should be family='gaussian'. Finally, the contrast argument needs to correspond to levels of the tr variable. In this example, tr is a factor with many levels, but we want to compare sanitation to control; the reference group should be listed first: contrast=c("Control","Sanitation"). The full command and results are:

LAZ.glm.C.S <- washb_glm(Y=lazd$laz,tr=lazd$tr,pair=lazd$block, id=lazd$block, contrast=c("Control","Sanitation"), family="gaussian", verbose=FALSE)


-----------------------------------------
 GLM Fit: Sanitation vs. Control 
-----------------------------------------
                   Coef.       2.5%      97.5% Std. Error    z value  Pr(>|z|)
trSanitation -0.01862361 -0.1356985 0.09845124 0.05973207 -0.3117859 0.7552033

This result shows there was no evidence for a difference in LAZ between the sanitation and control arms.

adjusted analyses

The W= argument in washb_glm enables you to adjust for covariates in the GLM; W simply requires a data frame of potential adjustment covariates.

If you embark on adjusted analyses it can be convenient to make a vector that includes the adjustment variable names to screen for inclusion in adjusted GLM models. The convenience of this approach will be apparent in the examples below, where we repeatedly refer to this list of variables when subsetting the data to specify adjustment covariates.

# list of potential adjustment covariates
Wadj <- c("month","agedays","sex","momage","momedu","momheight","hfiacat","Nlt18","Ncomp","watmin","elec","floor","walls","roof","asset_wardrobe","asset_table","asset_chair","asset_khat","asset_chouki","asset_tv","asset_refrig","asset_bike","asset_moto","asset_sewmach","asset_mobile")

For the diarrhea analysis data frame created above named dd, you can now subset it to the adjustment covariates above using the syntax: dd[Wadj]. Similarly, for the LAZ analysis data frame named lazd, you can subset it to adjustment covariates: lazd[Wadj].

If W is specified, washb_glm will pre-screen the covariates in the data frame provided in the W= argument. Variables with a p-value < 0.2 from a likelihood ratio test are included as adjustment covariates in the final model. The pval= argument can be used to change the p-value threshold used to screen the covariates.
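For example, here is a sketch of the same adjusted model with a stricter P<0.1 screening threshold (the example below keeps the default 0.2):

adj.strict.C.S <- washb_glm(Y=dd$diar7d,tr=dd$tr,pair=dd$block, W=dd[Wadj], id=dd$block, contrast=c("Control","Sanitation"), family=binomial(link='log'), pval=0.1)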

adj.Diar.glm.C.S <- washb_glm(Y=dd$diar7d,tr=dd$tr,pair=dd$block, W=dd[Wadj], id=dd$block, contrast=c("Control","Sanitation"), family=binomial(link='log'))

-----------------------------------------
Dropping 67 observations due to missing values in 1 or more variables
 Final sample size: 5210 
-----------------------------------------

-----------------------------------------
Pre-screening the adjustment covariates:
-----------------------------------------

Likelihood Ratio Test P-values:
               P-value
month          0.00000
agedays        0.00139
sex            0.69302
momage         0.95834
momedu         0.00023
momheight      0.47061
hfiacat        0.02598
Nlt18          0.41379
Ncomp          0.72487
watmin         0.07485
elec           0.00614
floor          0.05576
walls          0.63743
roof           0.94900
asset_wardrobe 0.10699
asset_table    0.07399
asset_chair    0.43212
asset_khat     0.00174
asset_chouki   0.60563
asset_tv       0.42975
asset_refrig   0.02540
asset_bike     0.48036
asset_moto     0.28840
asset_sewmach  0.13751
asset_mobile   0.59945


Covariates selected (P<0.2):
                       P-value
month          0.0000000664917
agedays        0.0013854051931
momedu         0.0002294082996
hfiacat        0.0259780896649
watmin         0.0748457908310
elec           0.0061424259087
floor          0.0557563342543
asset_wardrobe 0.1069918101893
asset_table    0.0739884683794
asset_khat     0.0017386974934
asset_refrig   0.0254037503987
asset_sewmach  0.1375079553959


-----------------------------------------
 GLM Fit: Sanitation vs. Control 
-----------------------------------------
                    PR      2.5%     97.5%   Estimate Std. Error   z value    Pr(>|z|)
trSanitation 0.5847469 0.4317727 0.7919187 -0.5365762  0.1547345 -3.467721 0.000524891

 RR of covariates
                                       PR      2.5%     97.5%
trSanitation                    0.5847469 0.4317727 0.7919187
month2                          0.4849529 0.1581594 1.4869765
month3                          0.4044081 0.1471559 1.1113787
month4                          1.1956219 0.6122722 2.3347652
month5                          1.0343119 0.4435036 2.4121587
month6                          2.0379062 0.8100683 5.1268043
month7                          1.1737215 0.4495119 3.0647067
month8                          1.6673212 0.6254334 4.4448537
month9                          0.3258038 0.1182398 0.8977361
month10                         0.2834488 0.1395233 0.5758405
month11                         0.5983482 0.1371178 2.6110440
month12                         0.6922535 0.1773029 2.7028034
agedays                         0.9993945 0.9990421 0.9997470
momeduPrimary (1-5y)            1.2507804 0.8158203 1.9176423
momeduSecondary (>5y)           0.9339511 0.5751086 1.5166957
hfiacatMildly Food Insecure     1.1719294 0.6807637 2.0174674
hfiacatModerately Food Insecure 1.1108363 0.7919408 1.5581435
hfiacatSeverely Food Insecure   1.3250600 0.6893238 2.5471107
watmin                          0.9383602 0.8299307 1.0609560
elec                            0.8751570 0.6510430 1.1764198
floor                           1.2400760 0.6799353 2.2616689
asset_wardrobe                  1.0590677 0.6619020 1.6945475
asset_table                     0.9294539 0.7099895 1.2167569
asset_khat                      0.8254681 0.6264260 1.0877544
asset_refrig                    0.8144093 0.4206895 1.5766080
asset_sewmach                   0.8804924 0.4872256 1.5911867

Forced adjustment covariates

If there are specific adjustment covariates you want to include in the GLM model, regardless of prescreening results, you can use the forcedW option. This argument takes a variable name or a vector of variable names. These variables must already be in the W dataframe of potential adjustment covariates. For example, if adjustment by age and sex is standard, force them into the GLM model.

glm.C.S <- washb_glm(Y=dd$diar7d,tr=dd$tr,pair=dd$block, W=dd[Wadj], forcedW=c("agedays","sex"), id=dd$block, contrast=c("Control","Sanitation"), family=binomial(link='log'))

-----------------------------------------
Dropping 67 observations due to missing values in 1 or more variables
 Final sample size: 5210 
-----------------------------------------

-----------------------------------------
Include the following adjustment covariates without screening:
-----------------------------------------
[1] "agedays" "sex"    

-----------------------------------------
Pre-screening the adjustment covariates:
-----------------------------------------

Likelihood Ratio Test P-values:
               P-value
month          0.00000
momage         0.95834
momedu         0.00023
momheight      0.47061
hfiacat        0.02598
Nlt18          0.41379
Ncomp          0.72487
watmin         0.07485
elec           0.00614
floor          0.05576
walls          0.63743
roof           0.94900
asset_wardrobe 0.10699
asset_table    0.07399
asset_chair    0.43212
asset_khat     0.00174
asset_chouki   0.60563
asset_tv       0.42975
asset_refrig   0.02540
asset_bike     0.48036
asset_moto     0.28840
asset_sewmach  0.13751
asset_mobile   0.59945


Covariates selected (P<0.2):
                       P-value
month          0.0000000664917
momedu         0.0002294082996
hfiacat        0.0259780896649
watmin         0.0748457908310
elec           0.0061424259087
floor          0.0557563342543
asset_wardrobe 0.1069918101893
asset_table    0.0739884683794
asset_khat     0.0017386974934
asset_refrig   0.0254037503987
asset_sewmach  0.1375079553959


-----------------------------------------
 GLM Fit: Sanitation vs. Control 
-----------------------------------------
                    PR      2.5%     97.5%   Estimate Std. Error   z value     Pr(>|z|)
trSanitation 0.5847484 0.4317425 0.7919783 -0.5365736  0.1547716 -3.466874 0.0005265495

 RR of covariates
                                       PR      2.5%     97.5%
trSanitation                    0.5847484 0.4317425 0.7919783
agedays                         0.9993945 0.9990414 0.9997478
sexmale                         0.9991030 0.7684642 1.2989634
month2                          0.4849484 0.1580972 1.4875339
month3                          0.4043968 0.1471298 1.1115136
month4                          1.1955974 0.6122445 2.3347747
month5                          1.0342857 0.4441253 2.4086600
month6                          2.0377791 0.8108934 5.1209488
month7                          1.1736033 0.4500471 3.0604461
month8                          1.6672885 0.6255902 4.4435652
month9                          0.3258044 0.1182710 0.8975029
month10                         0.2834577 0.1395702 0.5756837
month11                         0.5983104 0.1370407 2.6121825
month12                         0.6922687 0.1772347 2.7039619
momeduPrimary (1-5y)            1.2508399 0.8150730 1.9195833
momeduSecondary (>5y)           0.9340191 0.5760518 1.5144325
hfiacatMildly Food Insecure     1.1718797 0.6794290 2.0212589
hfiacatModerately Food Insecure 1.1108493 0.7923887 1.5572990
hfiacatSeverely Food Insecure   1.3249358 0.6838226 2.5671203
watmin                          0.9383588 0.8301018 1.0607340
elec                            0.8751449 0.6511680 1.1761613
floor                           1.2401087 0.6798679 2.2620125
asset_wardrobe                  1.0590718 0.6618194 1.6947722
asset_table                     0.9294398 0.7093481 1.2178200
asset_khat                      0.8254342 0.6261901 1.0880746
asset_refrig                    0.8143569 0.4216935 1.5726522
asset_sewmach                   0.8805112 0.4874324 1.5905795

subgroup analyses

The V argument can be used to create an interaction term with tr, the treatment variable. You do not pass an actual vector to V; instead, you specify the string name of a variable that is included in the W data frame. Below is an example for a subgroup analysis of the effect of the nutrition intervention on diarrheal disease, stratified by index child status (index vs. non-index). Note how we include "tchild" in both the W and V arguments; this ensures that the stratification variable isn't dropped if it is not marginally significantly associated with the outcome.

# Estimate subgroup analysis glm with washb_glm
# stratified by index child status "tchild"
glm.C.N.byChildType <- washb_glm(Y=dd$diar7d,tr=dd$tr,pair=dd$block, W=dd["tchild"], V="tchild", id=dd$block, contrast=c("Control","Nutrition"), family=binomial(link='log'), verbose=FALSE)

-----------------------------------------
Include the following adjustment covariates without screening:
-----------------------------------------
[1] "tchild"


-----------------------------------------
 GLM Fit: Nutrition vs. Control  by Subgroup: ' tchild '
-----------------------------------------
  Tr vs. C by Subgroup    est se.est est.lb est.ub       Z      P
1              Sibling 0.6255 0.3300 0.3276 1.1945 -1.4215 0.1552
2         Target child 0.6421 0.1742 0.4563 0.9035 -2.5425 0.0110


-----------------------------------------
 Significance of effect modification variables 
-----------------------------------------
                                PR  Pr(>|z|)
trNutrition:VTarget child 1.026499 0.9360759
# Examine the treatment effect across subgroups with `objectname'$lincom
glm.C.N.byChildType$lincom
  Tr vs. C by Subgroup    est se.est est.lb est.ub       Z      P
1              Sibling 0.6255 0.3300 0.3276 1.1945 -1.4215 0.1552
2         Target child 0.6421 0.1742 0.4563 0.9035 -2.5425 0.0110
# Repeat the subgroup analysis on the difference scale with family="gaussian"
glm.C.N.byChildType <- washb_glm(Y=dd$diar7d,tr=dd$tr,pair=dd$block, W=dd["tchild"], V="tchild", id=dd$block, contrast=c("Control","Nutrition"), family="gaussian", verbose=FALSE)

-----------------------------------------
Include the following adjustment covariates without screening:
-----------------------------------------
[1] "tchild"


-----------------------------------------
 GLM Fit: Nutrition vs. Control  by Subgroup: ' tchild '
-----------------------------------------
  Tr vs. C by Subgroup     est se.est  est.lb  est.ub       Z      P
1              Sibling -0.0144 0.0103 -0.0346  0.0058 -1.3978 0.1622
2         Target child -0.0226 0.0080 -0.0383 -0.0068 -2.8116 0.0049


-----------------------------------------
 Significance of effect modification variables 
-----------------------------------------
                                Coef.  Pr(>|z|)
trNutrition:VTarget child -0.00816172 0.4778686
# Examine the treatment effect across subgroups with `objectname'$lincom
glm.C.N.byChildType$lincom
  Tr vs. C by Subgroup     est se.est  est.lb  est.ub       Z      P
1              Sibling -0.0144 0.0103 -0.0346  0.0058 -1.3978 0.1622
2         Target child -0.0226 0.0080 -0.0383 -0.0068 -2.8116 0.0049

washb_tmle

Estimate intention-to-treat parameters using targeted maximum likelihood estimation (TMLE), potentially adjusted for covariates and missing outcomes

function overview

The washb_tmle function is mainly a convenience wrapper for the tmle package. It estimates ITT effects in a trial using targeted maximum likelihood estimation (TMLE). In brief, the function does the following: it restricts the data to complete observations in the two arms listed in the contrast argument, it pre-screens covariates (W), if specified, to select those that have a univariate association with the outcome, and then it estimates the intention-to-treat effect using TMLE. If family='binomial', then the function returns effects on the absolute, relative, and odds ratio scale. If Delta is specified (i.e., observations with missing outcomes are included), then the function will adjust the effects for missing values using inverse probability of censoring weights, with the weights estimated using data-adaptive super learning for Pr(Delta|A,W).

some technical details

If the analysis is pair-matched (as for primary outcomes), be sure to specify the pair ID in the id argument. Do not include pair IDs in the adjustment covariate set.

If you specify adjustment covariates (W), then by default washb_tmle pre-screens them, and the subset associated with the outcome based on a likelihood ratio test is used in the estimation. There are some other important defaults to be aware of. First, washb_tmle estimates the treatment mechanism even though it is a randomized trial. There are two reasons for this, one theoretical and one practical. The theoretical reason is that estimating the treatment mechanism gains efficiency (see Balzer et al. 2016); the practical reason is that unless the analysis is conducted at the cluster level (i.e., providing cluster means to the washb_tmle function), the empirical treatment probabilities differ slightly due to varying cluster sizes. Estimating the treatment mechanism ensures that the variance calculation correctly accounts for the empirical treatment probabilities in the data.

Another default is that washb_tmle uses the SuperLearner algorithm to adjust for covariates and to predict the treatment mechanism and censoring mechanism (if adjusting for missing outcomes). The default algorithm library includes the simple mean, main terms GLM, main terms Bayes GLM with non-informative priors, generalized additive models (degree 2), and lasso (glmnet). These are the pre-specified algorithms from the original trial statistical analysis plan. You can type listWrappers() to see the full set of algorithms implemented in the SuperLearner package. If you just want to use a main terms GLM model to adjust for the covariates with TMLE, then you can specify Q.SL.library="SL.glm". If you are dealing with very small sample sizes (e.g., in a substudy), then you may wish to use even simpler libraries, such as a set of univariate regressions (as in Balzer et al. 2016, but it is probably best to check with the UCB team before making big changes to the recommended algorithm library).

Finally, by default the function uses the same algorithm library to predict the outcome (Q.SL.library) and the treatment and censoring mechanisms. You can specify a different library for the treatment and censoring mechanisms using the g.SL.library argument.
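For example, here is a sketch of the adjusted LAZ analysis (shown in full below) that uses only a main terms GLM for the outcome model while keeping the full default library for the treatment and censoring mechanisms; note that g.SL.library defaults to Q.SL.library, so it must be set explicitly here:

# main terms GLM for Q, default SuperLearner library for g
glmQ.LAZ.C.N <- washb_tmle(Y=lazd$laz,tr=lazd$tr,pair=lazd$block,id=lazd$block, W=lazd[Wadj], contrast=c("Control","Nutrition"), family="gaussian", Q.SL.library="SL.glm", g.SL.library=c("SL.mean","SL.glm","SL.bayesglm","SL.gam","SL.glmnet"), seed=12345)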

If you want to adjust for missing outcomes in the analysis, then you need to include observations that have a missing outcome (Y) with Delta=0 for those observations. Observations with missing outcomes should have treatment (tr) and covariate (W) information, which are used to create weights for Pr(Delta|A,W).

Arguments:
Y= Outcome variable (continuous, such as LAZ, or binary, such as diarrhea)
tr= Treatment group variable (binary or factor)
W= Data frame that includes adjustment covariates
id= ID variable for independent units. For pair-matched designs, this is the matched pair and should be the same as the pair argument. For analyses that are not pair-matched, it should typically be the cluster.
pair= An optional ID variable to identify the matched pair unit (in WASH Benefits, blocks) if conducting a matched-pair analysis. This argument is used to drop pairs that are missing one or more treatment groups. Incomplete pairs are not an issue in the overall Bangladesh trial (there were no incomplete blocks), but they are an issue in the Kenya trial, where there were some incomplete blocks.
Delta= Indicator of a missing outcome: 1 = observed, 0 = missing.
family= Outcome family: gaussian (continuous outcomes, like LAZ) or binomial (binary outcomes like diarrhea or stunting)

contrast= Vector of length 2 that includes the treatment groups to contrast (e.g., contrast=c('Control','Nutrition'))

Q.SL.library= Library of algorithms to include in the SuperLearner for the outcome model

g.SL.library= Library of algorithms to include in the SuperLearner for the treatment model Pr(A|W), and for the missingness model Pr(Delta|A,W) (if Delta is specified)

pval= The p-value threshold used to pre-screen covariates (W) based on a likelihood ratio test in a univariate regression with the outcome (Y). Variables with a univariate association p-value below this threshold will be used in the final model. Defaults to 0.2.

seed= A seed for the pseudo-random cross-validation split used in model selection (use for perfectly reproducible results).

print= Logical for printed output, defaults to TRUE. If FALSE, no output will be printed to the console when the returned object is saved to an R object.

Usage:

washb_tmle(Y,tr,W=NULL,id,pair=NULL, Delta=rep(1,length(Y)), family="gaussian", contrast, Q.SL.library=c("SL.mean","SL.glm","SL.bayesglm","SL.gam","SL.glmnet"), g.SL.library=Q.SL.library, pval=0.2, seed=NULL, print=TRUE)

References:
1. van der Laan MJ, Rose S. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Series in Statistics; 2011.
2. Gruber S, van der Laan M. tmle: An R Package for Targeted Maximum Likelihood Estimation. J Stat Softw. 2012;51: 1–35.
3. Balzer LB, van der Laan MJ, Petersen ML, SEARCH Collaboration. Adaptive pre-specification in randomized trials with and without pair-matching. Stat Med. 2016; doi:10.1002/sim.7023
4. Schuler MS, Rose S. Targeted Maximum Likelihood Estimation for Causal Inference in Observational Studies. Am J Epidemiol. 2017;185: 65–73.

TMLE for a continuous outcome

washb_tmle has similar arguments to washb_glm (by design), with the addition of the SuperLearner libraries. The default library (c("SL.mean","SL.glm","SL.bayesglm","SL.gam","SL.glmnet")) is pretty good in this context, so we don’t actually specify it below (instead relying on the default value).

Here is an example of estimating an adjusted difference in LAZ between the nutrition and control arms in the trial:

adj.LAZ.tmle.C.N <- washb_tmle(Y=lazd$laz,tr=lazd$tr,pair=lazd$block,id=lazd$block, W=lazd[Wadj], contrast=c("Control","Nutrition"), family="gaussian")

----------------------------------------- 
By specifying the pair argument,
you have indicated that this is a matched pair analysis.

Note: the analysis will only include pairs
that have a contrast in the treatment variable. 
-----------------------------------------
-----------------------------------------
Total of 10 observations dropped due to missing
values in one or more variables
  Final sample size: 1660 
-----------------------------------------

-----------------------------------------
Pre-screening the adjustment covariates
using a univariate liklihood ratio test:
-----------------------------------------

Likelihood Ratio Test P-values:
               P-value
month          0.00000
agedays        0.96208
sex            0.86270
momage         0.20990
momedu         0.00000
momheight      0.00000
hfiacat        0.00000
Nlt18          0.00055
Ncomp          0.07214
watmin         0.86821
elec           0.00000
floor          0.00000
walls          0.16072
roof           0.09029
asset_wardrobe 0.00000
asset_table    0.00027
asset_chair    0.00000
asset_khat     0.00000
asset_chouki   0.00111
asset_tv       0.00000
asset_refrig   0.00000
asset_bike     0.00904
asset_moto     0.00000
asset_sewmach  0.00006
asset_mobile   0.00000


Covariates selected (P<0.2):
                    P-value
month          4.846005e-06
momedu         3.261193e-15
momheight      4.845440e-54
hfiacat        5.621381e-07
Nlt18          5.455514e-04
Ncomp          7.213989e-02
elec           1.398539e-10
floor          9.067623e-10
walls          1.607179e-01
roof           9.028871e-02
asset_wardrobe 8.376452e-14
asset_table    2.660365e-04
asset_chair    1.059169e-08
asset_khat     1.259119e-13
asset_chouki   1.109571e-03
asset_tv       4.039419e-16
asset_refrig   3.848861e-15
asset_bike     9.038690e-03
asset_moto     6.298749e-07
asset_sewmach  5.775318e-05
asset_mobile   3.976215e-06

-----------------------------------------

-----------------------------------------
Estimation Results:
-----------------------------------------
 Initial estimation of Q
     Procedure: SuperLearner
     Model:
         Y ~  SL.mean_All + SL.glm_All + SL.bayesglm_All + SL.gam_All + SL.glmnet_All

     Coefficients: 
         SL.mean_All    0.02291811 
          SL.glm_All    0 
     SL.bayesglm_All    0 
          SL.gam_All    0.1276187 
       SL.glmnet_All    0.8494632 

 Estimation of g (treatment mechanism)
     Procedure: SuperLearner 
     Model:
         A ~  SL.mean_All + SL.glm_All + SL.bayesglm_All + SL.gam_All + SL.glmnet_All 

     Coefficients: 
         SL.mean_All    1 
          SL.glm_All    0 
     SL.bayesglm_All    0 
          SL.gam_All    0 
       SL.glmnet_All    0 

 Estimation of g.Z (intermediate variable assignment mechanism)
     Procedure: No intermediate variable 

 Estimation of g.Delta (missingness mechanism)
     Procedure: No missingness 

 Bounds on g: ( 0.025 0.975 )

 Additive Effect
   Parameter Estimate:  0.27377
   Estimated Variance:  0.002313
              p-value:  0.00000001252
    95% Conf Interval: (0.17951, 0.36803) 

 Additive Effect among the Treated
   Parameter Estimate:  0.27377
   Estimated Variance:  0.0023124
              p-value:  0.000000012469
    95% Conf Interval: (0.17952, 0.36802) 

 Additive Effect among the Controls
   Parameter Estimate:  0.27377
   Estimated Variance:  0.0023132
              p-value:  0.000000012546
    95% Conf Interval: (0.1795, 0.36804) 

-----------------------------------------

The default output notifies you if any observations were dropped due to missing values. It also summarizes the weighting of the models or algorithms used to estimate the outcome model (estimation of Q), the treatment mechanism (estimation of g), and the missingness mechanism (estimation of g.Delta).

The function returns an object of class tmle (see the tmle package documentation for details). The effect estimate results are stored in the $estimates$ATE list:

unlist(adj.LAZ.tmle.C.N$estimates$ATE)
             psi          var.psi              CI1              CI2           pvalue 
0.27377026630744 0.00231295492552 0.17950761388421 0.36803291873068 0.00000001251959 

TMLE for a binary outcome

For binary outcomes such as diarrhea, the washb_tmle syntax is almost unchanged, but the family should be specified as family="binomial". Unlike GLM, you don’t (and shouldn’t) specify the link function; TMLE is a non-parametric estimation approach, so it can estimate any function of the means in the treatment groups. The software currently estimates the difference, ratio, and odds ratio. You can access the different parameter estimates through the $estimates lists:

adj.Diar.tmle.C.S <- washb_tmle(Y=dd$diar7d,tr=dd$tr,pair=dd$block,id=dd$block, W=dd[Wadj], contrast=c("Control","Sanitation"), family="binomial",Q.SL.library=c("SL.mean","SL.glm","SL.bayesglm","SL.gam","SL.glmnet"))

----------------------------------------- 
By specifying the pair argument,
you have indicated that this is a matched pair analysis.

Note: the analysis will only include pairs
that have a contrast in the treatment variable. 
-----------------------------------------
-----------------------------------------
Total of 67 observations dropped due to missing
values in one or more variables
  Final sample size: 5210 
-----------------------------------------

-----------------------------------------
Pre-screening the adjustment covariates
using a univariate liklihood ratio test:
-----------------------------------------

Likelihood Ratio Test P-values:
               P-value
month          0.00000
agedays        0.00138
sex            0.69302
momage         0.95823
momedu         0.00023
momheight      0.47178
hfiacat        0.02598
Nlt18          0.41119
Ncomp          0.72382
watmin         0.07505
elec           0.00614
floor          0.05576
walls          0.63743
roof           0.94900
asset_wardrobe 0.10699
asset_table    0.07399
asset_chair    0.43212
asset_khat     0.00174
asset_chouki   0.60563
asset_tv       0.42975
asset_refrig   0.02540
asset_bike     0.48036
asset_moto     0.28840
asset_sewmach  0.13751
asset_mobile   0.59945


Covariates selected (P<0.2):
                       P-value
month          0.0000000664917
agedays        0.0013774954196
momedu         0.0002294082996
hfiacat        0.0259780896649
watmin         0.0750477584382
elec           0.0061424259087
floor          0.0557563342543
asset_wardrobe 0.1069918101894
asset_table    0.0739884683794
asset_khat     0.0017386974934
asset_refrig   0.0254037503987
asset_sewmach  0.1375079553959

-----------------------------------------

-----------------------------------------
Estimation Results:
-----------------------------------------
 Initial estimation of Q
     Procedure: SuperLearner
     Model:
         Y ~  SL.mean_All + SL.glm_All + SL.bayesglm_All + SL.gam_All + SL.glmnet_All

     Coefficients: 
         SL.mean_All    0.2733102 
          SL.glm_All    0.3744479 
     SL.bayesglm_All    0.3522419 
          SL.gam_All    0 
       SL.glmnet_All    0 

 Estimation of g (treatment mechanism)
     Procedure: SuperLearner 
     Model:
         A ~  SL.mean_All + SL.glm_All + SL.bayesglm_All + SL.gam_All + SL.glmnet_All 

     Coefficients: 
         SL.mean_All    1 
          SL.glm_All    0 
     SL.bayesglm_All    0 
          SL.gam_All    0 
       SL.glmnet_All    0 

 Estimation of g.Z (intermediate variable assignment mechanism)
     Procedure: No intermediate variable 

 Estimation of g.Delta (missingness mechanism)
     Procedure: No missingness 

 Bounds on g: ( 0.025 0.975 )

 Additive Effect
   Parameter Estimate:  -0.022805
   Estimated Variance:  0.000038653
              p-value:  0.00024436
    95% Conf Interval: (-0.034991, -0.01062) 

 Additive Effect among the Treated
   Parameter Estimate:  -0.022866
   Estimated Variance:  0.000038987
              p-value:  0.00025022
    95% Conf Interval: (-0.035104, -0.010627) 

 Additive Effect among the Controls
   Parameter Estimate:  -0.022805
   Estimated Variance:  0.000038541
              p-value:  0.00023931
    95% Conf Interval: (-0.034973, -0.010637) 

 Relative Risk
   Parameter Estimate:  0.60282
              p-value:  0.00077352
    95% Conf Interval: (0.44879, 0.80971) 

              log(RR):  -0.50614
    variance(log(RR)):  0.022663 

 Odds Ratio
   Parameter Estimate:  0.58858
              p-value:  0.00072346
    95% Conf Interval: (0.43285, 0.80033) 

              log(OR):  -0.53005
    variance(log(OR)):  0.024584 

-----------------------------------------
# difference
round(unlist(adj.Diar.tmle.C.S$estimates$ATE),4)
    psi var.psi     CI1     CI2  pvalue 
-0.0228  0.0000 -0.0350 -0.0106  0.0002 
# prevalence ratio
round(unlist(adj.Diar.tmle.C.S$estimates$RR),4)
        psi         CI1         CI2      pvalue     log.psi var.log.psi 
     0.6028      0.4488      0.8097      0.0008     -0.5061      0.0227 
# odds ratio
round(unlist(adj.Diar.tmle.C.S$estimates$OR),4)
        psi         CI1         CI2      pvalue     log.psi var.log.psi 
     0.5886      0.4328      0.8003      0.0007     -0.5300      0.0246 

TMLE with inverse probability of censoring weights (IPCW)

The washb_tmle function enables you to account for missing outcome data by adding inverse probability of censoring weights (IPCW) to the estimator. You can do this using the Delta= argument to specify an indicator of whether the outcome was observed (1) or missing (0). A key step for an IPCW analysis, which differs from the other analyses in this vignette, is that you need to create the “full” data, with an observation for every potential measurement (i.e., including those that were missing in the trial). To do this, use the enrolled study population, limited to live births (since the trial’s inference is limited to live births). Below, we provide an example using the control and nutrition arms in the Bangladesh trial.

# Load data from package
data(washb_bangladesh_track_compound)
#---------------------------------------
# create a shell of the full data (dfull)
# as if every index child with a live birth
# were measured at year 1 and year 2
#---------------------------------------

washb_bd_track <- washb_bangladesh_track_compound

table(washb_bd_track$miss1reason=="No live birth") # total number of unique compounds w/ live birth = 5190

FALSE  TRUE 
 5190   361 
# create the full data (i.e., a dataset with rows for every potential measurement)
dfull <- subset(washb_bd_track,miss1reason!="No live birth")
dfull$svy <- 1
dfull2 <- dfull
dfull2$svy <- 2
dfull <- rbind(dfull,dfull2)

# merge treatment and enrollment data onto this shell of the full data
dfull <- merge(dfull,washb_bd_tr,by=c("clusterid","block", "tr"),all.x=T,all.y=F)
dfull <- merge(dfull,washb_bd_enrol,by=c("dataid","clusterid","block", "tr"),all.x=T,all.y=F)

# Since we are only analyzing the primary outcome at year 2 in this example,
# we need to re-subset the full data to year 2,
# just before merging on the observed anthropometry data
dfull <- subset(dfull,svy==2)

# now merge the observed anthropometry outcomes onto this full dataset
lazdfull <- merge(dfull,washb_bd_anthro,by=c("dataid","clusterid","block","svy", "tr"),all.x=T,all.y=T)
lazdfull$agedays <- lazdfull$aged # naming consistency across files

# create an indicator equal to 1 if the outcome is observed, 0 otherwise
lazdfull$Delta <- ifelse(is.na(lazdfull$laz),0,1)
table(lazdfull$Delta[lazdfull$tr=="Control"|lazdfull$tr=="Nutrition"])

   0    1 
 295 3372 
# set missing outcomes to an arbitrary, non-missing value; in this case, use 9
lazdfull$Ydelta <- lazdfull$laz
lazdfull$Ydelta[lazdfull$Delta==0] <- 9

# estimate an IPCW-TMLE parameter
psi_ipcw <- washb_tmle(Delta=lazdfull$Delta,tr=lazdfull$tr, id=lazdfull$block, pair=lazdfull$block,Y=lazdfull$Ydelta, family="gaussian", contrast=c("Control","Nutrition"), W=lazdfull[Wadj], seed=12345)

----------------------------------------- 
By specifying the pair argument,
you have indicated that this is a matched pair analysis.

Note: the analysis will only include pairs
that have a contrast in the treatment variable. 
-----------------------------------------
-----------------------------------------
Total of 1985 observations dropped due to missing
values in one or more variables
  Final sample size: 1682 
-----------------------------------------

-----------------------------------------
Pre-screening the adjustment covariates
using a univariate likelihood ratio test:
-----------------------------------------

Likelihood Ratio Test P-values:
               P-value
month          0.06760
agedays        0.96208
sex            0.86270
momage         0.20990
momedu         0.00000
momheight      0.00000
hfiacat        0.00000
Nlt18          0.00055
Ncomp          0.07214
watmin         0.86821
elec           0.00000
floor          0.00000
walls          0.16072
roof           0.09029
asset_wardrobe 0.00000
asset_table    0.00027
asset_chair    0.00000
asset_khat     0.00000
asset_chouki   0.00111
asset_tv       0.00000
asset_refrig   0.00000
asset_bike     0.00904
asset_moto     0.00000
asset_sewmach  0.00006
asset_mobile   0.00000


Covariates selected (P<0.2):
                    P-value
month          6.759682e-02
momedu         3.261193e-15
momheight      4.845440e-54
hfiacat        5.621381e-07
Nlt18          5.455514e-04
Ncomp          7.213989e-02
elec           1.398539e-10
floor          9.067623e-10
walls          1.607179e-01
roof           9.028871e-02
asset_wardrobe 8.376452e-14
asset_table    2.660365e-04
asset_chair    1.059169e-08
asset_khat     1.259119e-13
asset_chouki   1.109571e-03
asset_tv       4.039419e-16
asset_refrig   3.848861e-15
asset_bike     9.038690e-03
asset_moto     6.298749e-07
asset_sewmach  5.775318e-05
asset_mobile   3.976215e-06

-----------------------------------------

-----------------------------------------
Estimation Results:
-----------------------------------------
 Initial estimation of Q
     Procedure: SuperLearner
     Model:
         Y ~  SL.mean_All + SL.glm_All + SL.bayesglm_All + SL.gam_All + SL.glmnet_All

     Coefficients: 
         SL.mean_All    0 
          SL.glm_All    0 
     SL.bayesglm_All    0 
          SL.gam_All    0 
       SL.glmnet_All    1 

 Estimation of g (treatment mechanism)
     Procedure: SuperLearner 
     Model:
         A ~  SL.mean_All + SL.glm_All + SL.bayesglm_All + SL.gam_All + SL.glmnet_All 

     Coefficients: 
         SL.mean_All    0.5907074 
          SL.glm_All    0 
     SL.bayesglm_All    0.03792591 
          SL.gam_All    0 
       SL.glmnet_All    0.3713667 

 Estimation of g.Z (intermediate variable assignment mechanism)
     Procedure: No intermediate variable 

 Estimation of g.Delta (missingness mechanism)
     Procedure: SuperLearner 
     Model:
         Delta ~ SL.mean_All + SL.glm_All + SL.bayesglm_All + SL.gam_All + SL.glmnet_All 

     Coefficients: 
         SL.mean_All    1 
          SL.glm_All    0 
     SL.bayesglm_All    0 
          SL.gam_All    0 
       SL.glmnet_All    0 

 Bounds on g: ( 0.025 0.975 )

 Additive Effect
   Parameter Estimate:  0.2768
   Estimated Variance:  0.0023262
              p-value:  0.0000000095249
    95% Conf Interval: (0.18226, 0.37133) 

 Additive Effect among the Treated
   Parameter Estimate:  0.2768
   Estimated Variance:  0.0025443
              p-value:  0.000000040753
    95% Conf Interval: (0.17793, 0.37566) 

 Additive Effect among the Controls
   Parameter Estimate:  0.2768
   Estimated Variance:  0.0026254
              p-value:  0.000000065866
    95% Conf Interval: (0.17637, 0.37722) 

-----------------------------------------

In this particular example, accounting for missing outcomes makes very little difference after adjusting for covariates: the adjusted and adjusted+IPCW estimates are nearly identical.

#---------------------------------------
# compare the estimates
#---------------------------------------
round(unlist(adj.LAZ.tmle.C.N$estimates$ATE),4)
    psi var.psi     CI1     CI2  pvalue 
 0.2738  0.0023  0.1795  0.3680  0.0000 
round(unlist(psi_ipcw$estimates$ATE),4)
    psi var.psi     CI1     CI2  pvalue 
 0.2768  0.0023  0.1823  0.3713  0.0000 

Comparison of ITT estimators

In unadjusted ITT analyses, all the estimators we have discussed above are asymptotically equivalent: as the sample size increases to infinity, the estimates converge to the same value. So, for unadjusted analyses, the choice of estimator is not very consequential. The primary analyses in Kenya and Bangladesh used paired t-tests (continuous) and stratified Mantel-Haenszel (binary) estimators for unadjusted analyses, and TMLE for adjusted analyses. GLM is a perfectly sensible approach as well, and one advantage of GLM is that you can use the same estimator throughout all analyses (unadjusted and adjusted).

TMLE with super learning has some advantages in adjusted analyses: it will likely be more powerful than GLM because of the data-adaptive estimation used to predict the outcome. Note that a main effects GLM with covariates is itself a TMLE estimator as long as there are no interactions between covariates and the treatment arm, just assuming a parametric submodel (confused yet? See Balzer et al. 2016 and/or talk to us at UCB). One final advantage of TMLE is that it enables you to account for missing outcomes, which are common for some secondary specimen outcomes.

paired t-test, GLM, and TMLE

unadjusted analyses

As seen below, the estimates and 95 percent confidence intervals are virtually the same across estimators in an unadjusted analysis, even in this finite sample example (Nutrition vs. Control LAZ).

# t-test
ttest.C.N <- washb_ttest(Y=lazd$laz,tr=lazd$tr,strat=lazd$block, contrast=c("Control","Nutrition"))
round(ttest.C.N,4)
  diff  ci.lb  ci.ub t-stat      p 
0.2526 0.1470 0.3581 4.7545 0.0000 
# GLM, unadjusted
LAZ.glm.C.N <- washb_glm(Y=lazd$laz,tr=lazd$tr,pair=lazd$block, id=lazd$block, contrast=c("Control","Nutrition"), family="gaussian", print=FALSE)
round(LAZ.glm.C.N$TR,4)
             Coef.   2.5%  97.5% Std. Error z value Pr(>|z|)
trNutrition 0.2427 0.1403 0.3451     0.0522  4.6467        0
# TMLE, unadjusted
LAZ.tmle.C.N <- washb_tmle(Y=lazd$laz,tr=lazd$tr,pair=lazd$block,id=lazd$block, W=NULL, contrast=c("Control","Nutrition"), family="gaussian",print=FALSE,seed=12345)

----------------------------------------- 
By specifying the pair argument,
you have indicated that this is a matched pair analysis.

Note: the analysis will only include pairs
that have a contrast in the treatment variable. 
-----------------------------------------
round(unlist(LAZ.tmle.C.N$estimates$ATE),4)
    psi var.psi     CI1     CI2  pvalue 
 0.2549  0.0029  0.1495  0.3602  0.0000 

adjusted analyses

The GLM and TMLE estimators are also similar in adjusted analyses. Below, we repeat the contrast above, also adjusting for pre-specified covariates by specifying W=lazd[Wadj]. In this particular example, the estimates and confidence intervals are very similar between the two approaches. We have found over a large number of comparisons that the TMLE estimator is as efficient as or more efficient than the GLM estimator.

# GLM, adjusted
adj.LAZ.glm.C.N <- washb_glm(Y=lazd$laz,tr=lazd$tr,pair=lazd$block, id=lazd$block, W=lazd[Wadj], contrast=c("Control","Nutrition"), family="gaussian",  print=FALSE)
round(adj.LAZ.glm.C.N$TR,4)
            Coef. 2.5%  97.5% Std. Error z value Pr(>|z|)
trNutrition 0.272 0.18 0.3641      0.047  5.7939        0
# TMLE, adjusted
adj.LAZ.tmle.C.N <- washb_tmle(Y=lazd$laz,tr=lazd$tr,pair=lazd$block,id=lazd$block, W=lazd[Wadj], contrast=c("Control","Nutrition"), family="gaussian",print=FALSE,seed=12345)

----------------------------------------- 
By specifying the pair argument,
you have indicated that this is a matched pair analysis.

Note: the analysis will only include pairs
that have a contrast in the treatment variable. 
-----------------------------------------
round(unlist(adj.LAZ.tmle.C.N$estimates$ATE),4)
    psi var.psi     CI1     CI2  pvalue 
 0.2749  0.0023  0.1814  0.3683  0.0000 

Mantel-Haenszel, GLM, and TMLE

unadjusted analysis

As seen below, the estimates and 95 percent confidence intervals for the prevalence ratio of diarrhea (binary outcome) are virtually the same, even in this finite sample example (Sanitation vs. Control).

# Mantel-Haenszel (estimated above)
mh.C.S <- washb_mh(Y=dd$diar7d,tr=dd$tr, contrast=c("Control","Sanitation"), strat=dd$block,measure="RR")
round(mh.C.S,4)
      PR    ci.lb    ci.ub    logPR se.logPR        Z        p 
  0.6085   0.4569   0.8102  -0.4968   0.1461  -3.4001   0.0007 
# GLM, unadjusted
Diar.glm.C.S <- washb_glm(Y=dd$diar7d,tr=dd$tr,pair=dd$block, id=dd$block, contrast=c("Control","Sanitation"), family=binomial(link='log'),print=FALSE)
round(Diar.glm.C.S$TR,4)
                 PR  2.5%  97.5% Estimate Std. Error z value Pr(>|z|)
trSanitation 0.6109 0.451 0.8275  -0.4928     0.1548  -3.183   0.0015
# TMLE, unadjusted
Diar.tmle.C.S <- washb_tmle(Y=dd$diar7d,tr=dd$tr,pair=dd$block,id=dd$block, W=NULL, contrast=c("Control","Sanitation"), family="binomial",print=FALSE,seed=12345)

----------------------------------------- 
By specifying the pair argument,
you have indicated that this is a matched pair analysis.

Note: the analysis will only include pairs
that have a contrast in the treatment variable. 
-----------------------------------------
round(unlist(Diar.tmle.C.S$estimates$RR),4)
        psi         CI1         CI2      pvalue     log.psi var.log.psi 
     0.6096      0.4529      0.8205      0.0011     -0.4950      0.0230 

adjusted analyses

The GLM and TMLE estimators are also similar in adjusted analyses. Below, we repeat the contrast above, also adjusting for pre-specified covariates by specifying W=dd[Wadj]. In this particular example, the estimates and confidence intervals are very similar between the two approaches. We have found over a large number of comparisons that the TMLE estimator is as efficient as or more efficient than the GLM estimator.

# GLM, adjusted
adj.Diar.glm.C.S <- washb_glm(Y=dd$diar7d,tr=dd$tr,pair=dd$block, id=dd$block, W=dd[Wadj],contrast=c("Control","Sanitation"), family=binomial(link='log'),print=FALSE)
round(adj.Diar.glm.C.S$TR,4)
                 PR   2.5%  97.5% Estimate Std. Error z value Pr(>|z|)
trSanitation 0.5847 0.4318 0.7919  -0.5366     0.1547 -3.4677   0.0005
# TMLE, adjusted
adj.Diar.tmle.C.S <- washb_tmle(Y=dd$diar7d,tr=dd$tr,pair=dd$block,id=dd$block, W=dd[Wadj], contrast=c("Control","Sanitation"), family="binomial",print=FALSE,seed=12345)

----------------------------------------- 
By specifying the pair argument,
you have indicated that this is a matched pair analysis.

Note: the analysis will only include pairs
that have a contrast in the treatment variable. 
-----------------------------------------
round(unlist(adj.Diar.tmle.C.S$estimates$RR),4)
        psi         CI1         CI2      pvalue     log.psi var.log.psi 
     0.6014      0.4473      0.8087      0.0008     -0.5084      0.0228 

washb_permute

Non-parametric test of complete independence (i.e., the strong null hypothesis) using the Wilcoxon Signed Rank permutation test.

function overview

washb_permute is the WASH Benefits Wilcoxon signed-rank permutation test function for two treatment arms, conditional on randomization block. It conducts a permutation test of the independence of Y and tr, conditional on matched pair (randomization block), using the Wilcoxon signed-rank test statistic.

The washb_glm and related functions (above) enable us to test hypotheses about whether the average outcome differs between trial arms – that is, whether the difference in means is equal to zero. The mean is just one function of the outcome distribution. A sharper hypothesis to test is the “sharp null” first proposed by Ronald Fisher: whether there is any difference at all in the outcome distributions between intervention arms. Under the null hypothesis of no treatment effect, differences between the outcome distributions should be indistinguishable from random sampling variation. A way to test the sharp null hypothesis is with a permutation test (also called a Fisher randomization test). The intuition behind the test is that there is a single source of random variation in the trial: the random allocation of treatment. If we re-randomize treatment in every possible combination (or a very large number of combinations) and compute a test statistic comparing the arms under each combination, we obtain the test statistic’s distribution under the null hypothesis of no effect (since treatment is re-shuffled in each iteration, it is completely uninformative). We can then compare the test statistic observed under the real group assignments to this null distribution to see how likely it would be to occur by chance. For more details on randomization tests, see the references below.
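
To make the intuition concrete, below is a minimal sketch of the permutation logic in base R. This is illustrative only and is not the washb_permute implementation (which uses a Wilcoxon signed-rank statistic via the coin package); the function name permute_sketch, its arguments, and the choice of test statistic (mean within-block difference in arm means) are all assumptions made for illustration.

# Minimal permutation-test sketch (illustrative only; pass data already
# restricted to the two arms being contrasted)
permute_sketch <- function(Y, tr, block, nreps = 10000, seed = 1) {
  set.seed(seed)
  d <- data.frame(Y = Y, tr = as.character(tr), block = block)
  d <- d[complete.cases(d), ]
  # test statistic: mean of the within-block differences in arm means
  stat <- function(df) {
    m <- tapply(df$Y, list(df$block, df$tr), mean)
    mean(m[, 2] - m[, 1], na.rm = TRUE)  # blocks without a contrast drop out as NA
  }
  obs <- stat(d)
  # re-shuffle treatment labels within each block and recompute the statistic
  null <- replicate(nreps, {
    ds <- d
    ds$tr <- ave(ds$tr, ds$block, FUN = sample)
    stat(ds)
  })
  mean(abs(null) >= abs(obs))  # two-sided permutation p-value
}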

In the WASH Benefits primary analysis, we pre-specified that we would test this sharp null using a Wilcoxon signed rank test, a non-parametric test statistic that has been shown to have good power against alternatives for outcomes with potentially skewed distributions (Imbens GW, Rubin DB. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press; 2015).

Note, however, that in every test we have performed, p-values obtained under this strong null hypothesis have been entirely consistent with p-values obtained from paired t-tests, Mantel-Haenszel chi-squared tests, and even adjusted estimates. Therefore, going the extra mile to run these tests may be unnecessary for many analyses in the WASH Benefits trials. Nevertheless, we have included the function in the package because it might be useful.

Arguments:
Y= Outcome variable (continuous, such as LAZ, or binary, such as diarrhea).
tr= Binary treatment group variable (ideally a factor), comparison group first.
pair= Pair-matched randomization ID variable (in WASH Benefits: block).
contrast= Vector of length 2 that includes the groups to contrast, e.g., c("Control","Water").
nreps= Number of permutations to run.
seed= Number for pseudo-random number generation in R.

Usage:

washb_permute(Y,tr,pair,contrast,nreps=100000,seed=NULL)

References:
1. Gail, M. H., Mark, S. D., Carroll, R. J., Green, S. B. & Pee, D. On Design Considerations and Randomization-Based Inference for Community Intervention Trials. Statist. Med. 15, 1069–1092 (1996). Link

2. Braun, T. M. & Feng, Z. Optimal Permutation Tests for the Analysis of Group Randomized Trials. Journal of the American Statistical Association 96, 1424–1432 (2001). Link

3. Rosenbaum, P. R. Covariance Adjustment in Randomized Experiments and Observational Studies. Statist. Sci. 17, 286–327 (2002). Link

unadjusted permutation test

Apply the washb_permute function to the Sanitation-Control arm comparison for diarrhea:

permute.C.S <- washb_permute(Y = dd$diar7d, tr = dd$tr, pair = dd$block, contrast = c("Control", "Sanitation"), nreps = 100000, seed = 242524)

    Approximative Wilcoxon-Pratt Signed-Rank Test

data:  y by x (pos, neg) 
     stratified by block
Z = 3.551, p-value = 0.00032
alternative hypothesis: true mu is not equal to 0

adjusted permutation test

The permutation test can also test for independence of Y and tr, conditional on W, by applying the washb_permute function to the residuals of an adjusted model (adjusted for W, but not tr). The permutation test can have even greater power against alternatives if it is conducted on residuals from an algorithmic fit that removes variability of the outcome due to characteristics other than treatment. The intuition is that by removing outcome variability that is not associated with the treatment of interest, we can more narrowly focus our inference on differences due to treatment. In the example below, we show how to use a simple linear regression to predict LAZ, and then conduct the permutation test on the residuals from the regression. Alternatively, a more flexible, machine-learning algorithm approach, such as super learner, can be used in the initial prediction step.

Here is an example of an adjusted permutation test, using linear regression to adjust for covariates:

# Use the washb_prescreen function to select adjustment covariates associated with the outcome
adj.W<-washb_prescreen(Y=lazd$laz,lazd[Wadj],family="gaussian", pval=0.2)

#Subset the LAZ dataset to control and nutrition arms:
laz.subset <- lazd[which(lazd$tr=="Control"|lazd$tr=="Nutrition"),]

#Subset the LAZ dataset to the selected adjustment covariates and LAZ, as well as tr and block, which will be needed in the permutation test
perm.adj.data <- subset(laz.subset,select=c(adj.W, "laz", "tr", "block"))

#Subset to complete cases
perm.adj.data<-perm.adj.data[complete.cases(perm.adj.data),]

Wselect <- subset(perm.adj.data,select=c(adj.W, "laz"))
#fit the glm model
fit <- glm(laz~., family="gaussian", data=Wselect)

#Use predict() to return predicted LAZ from the adjusted glm model, and subtract it from the observed LAZ outcome
residuals <- Wselect$laz - predict(fit)

#run the permutation test function on the residuals, contrasting the two arms retained above
permute.adj.C.N <- washb_permute(Y=residuals,tr=perm.adj.data$tr,pair=perm.adj.data$block,contrast=c("Control","Nutrition"), nreps=100000,seed=12345)

And here is an example of using an adaptive algorithm, super learner, to estimate the residuals in advance of the permutation test:

# pre-screen the covariates for those associated with the outcome (LR test P<0.2)
# uses internal functions washb_prescreen and design_matrix
# subset to complete observations (no missing values)
adjpdata <- subset(lazd,select=c("laz","block","tr",Wadj))
adjpdata <- adjpdata[complete.cases(adjpdata),]
Wscreen <- washb_prescreen(Y=adjpdata$laz,Ws=adjpdata[Wadj],family="gaussian")

Likelihood Ratio Test P-values:
               P-value
month          0.00000
agedays        0.69656
sex            0.41464
momage         0.26153
momedu         0.00000
momheight      0.00000
hfiacat        0.00000
Nlt18          0.00005
Ncomp          0.00135
watmin         0.74293
elec           0.00000
floor          0.00000
walls          0.63752
roof           0.02256
asset_wardrobe 0.00000
asset_table    0.00000
asset_chair    0.00000
asset_khat     0.00000
asset_chouki   0.00000
asset_tv       0.00000
asset_refrig   0.00000
asset_bike     0.00000
asset_moto     0.00000
asset_sewmach  0.00000
asset_mobile   0.00000


Covariates selected (P<0.2):
                     P-value
month           3.071077e-11
momedu          3.376024e-37
momheight      5.954595e-136
hfiacat         7.311145e-17
Nlt18           4.959634e-05
Ncomp           1.347806e-03
elec            1.117490e-19
floor           1.936064e-24
roof            2.256134e-02
asset_wardrobe  8.744450e-29
asset_table     3.372179e-19
asset_chair     6.383480e-26
asset_khat      8.252063e-33
asset_chouki    8.884088e-07
asset_tv        1.896792e-33
asset_refrig    1.229590e-35
asset_bike      8.952416e-07
asset_moto      2.026057e-13
asset_sewmach   2.865749e-14
asset_mobile    5.007999e-21
# convert adjustment covariates into a full design matrix
Wselect <- subset(adjpdata,select=Wscreen)
Wselect <- design_matrix(Wselect)

# algorithmic fit of the outcome as a function of selected covariates
set.seed(12345)
SLfit1 <- SuperLearner(Y=adjpdata$laz,X=Wselect,id=adjpdata$block,
                       family="gaussian",
                       SL.library=c("SL.mean","SL.glm","SL.bayesglm","SL.gam","SL.glmnet"))
SLfit1

Call:  SuperLearner(Y = adjpdata$laz, X = Wselect, family = "gaussian", SL.library = c("SL.mean", "SL.glm", "SL.bayesglm", "SL.gam", "SL.glmnet"), id = adjpdata$block) 

                     Risk      Coef
SL.mean_All     1.0541483 0.0000000
SL.glm_All      0.8665921 0.0000000
SL.bayesglm_All 0.8665833 0.0000000
SL.gam_All      0.8666606 0.3591644
SL.glmnet_All   0.8656952 0.6408356
adjpdata$pY <- as.vector(predict(SLfit1)$pred)
adjpdata$r <- adjpdata$laz-adjpdata$pY

# Test of the strong-null hypothesis for differences between sanitation and control arms in LAZ distributions
permute.sl.C.S <- washb_permute(Y=adjpdata$r,tr=adjpdata$tr,pair=adjpdata$block,contrast=c("Control","Sanitation"),nreps=100000,seed=12345)

    Approximative Wilcoxon-Pratt Signed-Rank Test

data:  y by x (pos, neg) 
     stratified by block
Z = -0.46071, p-value = 0.648
alternative hypothesis: true mu is not equal to 0

washb_prescreen (internal)

Note: The washb_prescreen function was written as an internal function (called by washb_glm and washb_tmle), but we have provided some additional details for how it works in case investigators wish to use it separately.

The washb_prescreen function selects covariates with univariate associations with the outcome of P<0.2 (default) based on a likelihood ratio test. It is called as a part of the washb_glm function to select adjustment covariates, but can also be run as an independent function. The pval= argument can be used to alter the p-value threshold for inclusion.

Arguments:
Y Outcome variable (continuous, such as LAZ, or binary, such as diarrhea).
Ws data frame that includes candidate adjustment covariates to screen.
family GLM model family ("gaussian", "binomial", "poisson", binomial(link='log'), or "neg.binom" for negative binomial).
pval The p-value threshold: any variables with a p-value from the likelihood ratio test below this threshold will be returned. Defaults to 0.2.
print Logical for whether to print function output, defaults to TRUE.

Usage:

washb_prescreen(Y=dd$diar7d,Ws=W,family=binomial(link='log'), pval=0.2, print=TRUE)

run the washb_prescreen function

The washb_prescreen function performs a likelihood ratio test on a set of potential covariates and returns all covariates with an associated p-value<0.2 (unless a custom p-value is specified). It can be called as a function on its own, and is also called internally within the washb_glm function. Unless otherwise specified by the user, the washb_glm function will therefore only include covariates with a LR-test p-value <0.2 in the model. See the Function: washb_glm section for more details.

prescreened_varnames<-washb_prescreen(Y=dd$diar7d,dd[Wadj],family=binomial(link='log'), pval=0.2)

Likelihood Ratio Test P-values:
               P-value
month          0.00000
agedays        0.00001
sex            0.15910
momage         0.85843
momedu         0.00113
momheight      0.83679
hfiacat        0.00044
Nlt18          0.14607
Ncomp          0.85849
watmin         0.01745
elec           0.00166
floor          0.00882
walls          0.17286
roof           0.44633
asset_wardrobe 0.00334
asset_table    0.27762
asset_chair    0.26366
asset_khat     0.05397
asset_chouki   0.88290
asset_tv       0.10924
asset_refrig   0.01527
asset_bike     0.00498
asset_moto     0.23256
asset_sewmach  0.00352
asset_mobile   0.71326


Covariates selected (P<0.2):
                      P-value
month          0.000001277665
agedays        0.000008578316
sex            0.159101014584
momedu         0.001131482118
hfiacat        0.000436393481
Nlt18          0.146074279888
watmin         0.017447529756
elec           0.001659255089
floor          0.008816329210
walls          0.172858463464
asset_wardrobe 0.003338351150
asset_khat     0.053968452008
asset_tv       0.109235025959
asset_refrig   0.015267527279
asset_bike     0.004977085684
asset_sewmach  0.003515782703

description of output

The function runs a likelihood ratio test between a generalized linear model fit to each screened variable and a null model. The first section of the output lists all screened variables and their p-values from the likelihood ratio test. The second section outputs only those variables with a p-value below the threshold set with the pval argument (<0.2 by default).
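
For intuition, the core of this screening step can be sketched in a few lines of base R. This is a simplified illustration, not the package’s exact code; prescreen_sketch and its arguments are hypothetical names.

# Simplified prescreening sketch (illustrative only): for each candidate
# covariate, compare a one-covariate GLM against the intercept-only model
# with a likelihood ratio test, and keep covariates below the threshold.
prescreen_sketch <- function(Y, W, family = "gaussian", pval = 0.2) {
  pvals <- sapply(names(W), function(v) {
    ok   <- complete.cases(Y, W[[v]])
    fit0 <- glm(Y[ok] ~ 1, family = family)            # null model
    fit1 <- glm(Y[ok] ~ W[[v]][ok], family = family)   # single covariate
    anova(fit0, fit1, test = "Chisq")$`Pr(>Chi)`[2]    # LR test p-value
  })
  names(W)[which(pvals < pval)]
}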

Use the following code to return the saved list of selected variable names. Then, subset the covariate data frame to only those variables:

prescreened_varnames
 [1] "month"          "agedays"        "sex"            "momedu"         "hfiacat"        "Nlt18"          "watmin"         "elec"           "floor"          "walls"          "asset_wardrobe" "asset_khat"     "asset_tv"       "asset_refrig"   "asset_bike"     "asset_sewmach" 
prescreened_vars <- subset(dd[Wadj],select=prescreened_varnames)
# Examine the first five observations of the second selected variable, child age in days:
prescreened_vars[1:5,2]
[1]  346  810 1289 1753 1780

washb_lincom (internal)

Note: The washb_lincom function was written as an internal function (called by washb_glm, above), but we have provided some additional details for how it works in case investigators wish to use it separately.

The subgroup analysis option in washb_glm internally calls the washb_lincom function to calculate estimates, SEs, CIs, and P-values from a linear combination of regression coefficients. The washb_lincom function can be called alone to calculate a custom linear combination of coefficients.

Arguments:
lc= Index vector of coefficients from the GLM model to include in the linear combination.
varlist= Character vector of variables to include. Alternative to lc; if lc is specified, this argument is ignored.
fit= Model object returned from the coeftest (class=coeftest) command within the washb_glm function. Accessed with $fit from washb_glm output.
vcv= Variance-covariance matrix of the coefficients. Accessed with $vcv from washb_glm output.
measure= Measure of effect. RR = risk ratio, RD = risk difference.
flag= Internal argument used to flag and suppress printing if the washb_lincom function is called within another function.

Usage:

washb_lincom(lc=NULL,varlist=NULL,fit,vcv, measure="RR", flag=NULL)
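
Conceptually, the calculation is standard delta-method algebra for a linear combination of coefficients: the point estimate is lc'beta and its variance is lc'Vlc, where V is the variance-covariance matrix. Below is a minimal sketch of that algebra (illustrative only; lincom_sketch is a hypothetical name, not the package’s code).

# Delta-method sketch for a linear combination of GLM coefficients
# (illustrative only). beta = coefficient vector, vcv = its
# variance-covariance matrix, lc = index vector of weights.
lincom_sketch <- function(lc, beta, vcv, measure = "RR") {
  est <- sum(lc * beta)                    # lc' beta
  se  <- drop(sqrt(t(lc) %*% vcv %*% lc))  # sqrt(lc' V lc)
  Z   <- est / se
  P   <- 2 * pnorm(-abs(Z))
  ci  <- est + c(-1, 1) * qnorm(0.975) * se
  if (measure == "RR") {
    # log-link coefficients: exponentiate back to the ratio scale
    c(RR = exp(est), RR.lb = exp(ci[1]), RR.ub = exp(ci[2]), Z = Z, P = P)
  } else {
    c(RD = est, RD.lb = ci[1], RD.ub = ci[2], Z = Z, P = P)
  }
}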

Index child v. sibling subgroup analysis

Below is an example of using the washb_lincom() function to externally replicate the index child/sibling subgroup analysis of the diarrheal disease outcome. First, recall the GLM model fitted to diarrheal disease, contrasting the prevalence ratio between arms by index child and sibling subgroups:

glm.C.N.byChildType$lincom
  Tr vs. C by Subgroup     est se.est  est.lb  est.ub       Z      P
1              Sibling -0.0144 0.0103 -0.0346  0.0058 -1.3978 0.1622
2         Target child -0.0226 0.0080 -0.0383 -0.0068 -2.8116 0.0049

Next, use the washb_lincom function to verify the subgroup analysis calculated within the washb_glm() function. washb_lincom() takes as arguments the fit and variance-covariance matrix returned by washb_glm, and an index vector lc indicating the coefficients to be linearly combined. If a single coefficient is chosen, it will return the same coefficient and confidence interval as the GLM. Example:

#Create an lc vector of 0's equal in length to the number of coefficients from the glm model.
lc <- rep(0,nrow(glm.C.N.byChildType$fit))
#Examine model coefficients (minus the pair-matched block estimates) to determine the position of coefficients to combine.
glm.C.N.byChildType$fit[1:3,]
                    Coef.        2.5%      97.5%  Std. Error    z value    Pr(>|z|)
(Intercept)    0.00644170 -0.00641030 0.01929369 0.006557140  0.9823943 0.325905625
trNutrition   -0.01440157 -0.03459600 0.00579286 0.010303280 -1.3977654 0.162183529
VTarget child  0.02491546  0.00632508 0.04350584 0.009484888  2.6268585 0.008617717
#Replace the second position in the vector with 1 (the position of the treatment coefficient in the model)
lc[2] <- 1
#Run the washb_lincom function and compare output to the treatment effect from the GLM model.
washb_lincom(lc=lc,fit=glm.C.N.byChildType$fit,vcv=glm.C.N.byChildType$vcv, measure="RR")

Linear combination of coefficients:
[1] "trNutrition"
         RR  se.RR  RR.lb RR.ub Z      P
[1,] 1.0104 0.0103 0.9902 1.031 1 0.3173

Because sibling is the reference level of the VTarget child term, running the washb_lincom() function with only the treatment contrast term marked in the lc index vector returns both the prevalence ratio of the trNutrition term and the prevalence ratio of the sibling subgroup from the glm.C.N.byChildType$lincom object. To estimate the PR for the target child subgroup, the interaction term trNutrition:VTarget child also needs to be marked in the lc index.

#Add a 1 at the 4th position in the lc vector to include the 4th coefficient, the interaction term, in the linear combination.
lc[4]<-1
washb_lincom(lc=lc,fit=glm.C.N.byChildType$fit,vcv=glm.C.N.byChildType$vcv, measure="RR")

Linear combination of coefficients:
[1] "trNutrition + pair2"
         RR  se.RR  RR.lb  RR.ub      Z      P
[1,] 1.0116 0.0091 0.9936 1.0299 1.2624 0.2068

This estimate equals the target child PR from the washb_glm estimate with the V argument specified. Note that the coefficient for the VTarget child main effect is not included in this linear combination. Including that term would estimate the prevalence ratio between nutrition arm target children and control arm siblings, whereas a subgroup analysis requires the prevalence ratio between treatments within, rather than across, the subgroups.
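
As a quick check of this algebra, the two subgroup PRs can be recovered directly from the log-scale coefficients printed above (assuming, as shown, that position 2 holds the treatment term, position 4 holds the interaction term, and the coefficient column is labeled Coef.):

# Recover the subgroup PRs directly from the log-scale coefficients
b <- glm.C.N.byChildType$fit[,"Coef."]
exp(b[2])        # PR among siblings: exp(beta for trNutrition)
exp(b[2] + b[4]) # PR among target children: adds the interaction coefficient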