Title: | Regression Standardization for Causal Inference |
---|---|
Description: | Contains more modern tools for causal inference using regression standardization. Four general classes of models are implemented; generalized linear models, conditional generalized estimating equation models, Cox proportional hazards models, and shared frailty gamma-Weibull models. Methodological details are described in Sjölander, A. (2016) <doi:10.1007/s10654-016-0157-3>. Also includes functionality for doubly robust estimation for generalized linear models in some special cases, and the ability to implement custom models. |
Authors: | Michael C Sachs [aut, cre], Arvid Sjölander [aut], Erin E Gabriel [aut], Johan Sebastian Ohlendorff [aut], Adam Brand [aut] |
Maintainer: | Michael C Sachs <[email protected]> |
License: | AGPL (>= 3) |
Version: | 1.0.2 |
Built: | 2025-01-07 06:00:55 UTC |
Source: | https://github.com/sachsmc/stdReg2 |
Contains more modern tools for causal inference using regression standardization. Four general classes of models are implemented; generalized linear models, conditional generalized estimating equation models, Cox proportional hazards models, and shared frailty gamma-Weibull models. Methodological details are described in Sjölander, A. (2016) doi:10.1007/s10654-016-0157-3. Also includes functionality for doubly robust estimation for generalized linear models in some special cases, and the ability to implement custom models.
Maintainer: Michael C Sachs [email protected]
Authors:
Arvid Sjölander
Erin E Gabriel
Johan Sebastian Ohlendorff
Adam Brand
Useful links:
parfrailty
fits shared frailty gamma-Weibull models. It is
specifically designed to work with the function standardize_parfrailty
, which
performs regression standardization in shared frailty gamma-Weibull models.
parfrailty(formula, data, clusterid, init)
parfrailty(formula, data, clusterid, init)
formula |
an object of class " |
data |
a data frame containing the variables in the model. |
clusterid |
a string containing the name of a cluster identification variable. |
init |
an optional vector of initial values for the model parameters. |
parfrailty
fits the shared frailty gamma-Weibull model
where and
are the survival time and covariate
vector for subject
in cluster
, respectively.
is the Weibull baseline hazard function
where is the shape
parameter and
is the scale parameter.
is the
unobserved frailty term for cluster
, which is assumed to have a
gamma distribution with scale = 1/shape =
.
is
the regression function as specified by the
formula
argument,
parameterized by a vector . The ML estimates
are
obtained by maximizing the marginal (over
) likelihood.
An object of class "parfrailty"
which is a list containing:
est |
the Maximum Likelihood (ML) estimates |
vcov |
the variance-covariance vector of the ML estimates. |
score |
a matrix containing the cluster-specific contributions to the ML score equations. |
If left truncation is present, it is assumed that it is strong left truncation. This means that even if the truncation time may be subject-specific, the whole cluster is unobserved if at least one subject in the cluster dies before his/her truncation time. If all subjects in the cluster survive beyond their subject-specific truncation times, then the whole cluster is observed (Van den Berg and Drepper, 2016).
Arvid Sjölander and Elisabeth Dahlqwist.
Dahlqwist E., Pawitan Y., Sjölander A. (2019). Regression standardization and attributable fraction estimation with between-within frailty models for clustered survival data. Statistical Methods in Medical Research 28(2), 462-485.
Van den Berg G.J., Drepper B. (2016). Inference for shared frailty survival models with left-truncated data. Econometric Reviews, 35(6), 1075-1098.
require(survival) # simulate data set.seed(5) n <- 200 m <- 3 alpha <- 1.5 eta <- 1 phi <- 0.5 beta <- 1 id <- rep(1:n, each = m) U <- rep(rgamma(n, shape = 1 / phi, scale = phi), each = m) X <- rnorm(n * m) # reparameterize scale as in rweibull function weibull.scale <- alpha / (U * exp(beta * X))^(1 / eta) T <- rweibull(n * m, shape = eta, scale = weibull.scale) # right censoring C <- runif(n * m, 0, 10) D <- as.numeric(T < C) T <- pmin(T, C) # strong left-truncation L <- runif(n * m, 0, 2) incl <- T > L incl <- ave(x = incl, id, FUN = sum) == m dd <- data.frame(L, T, D, X, id) dd <- dd[incl, ] fit <- parfrailty(formula = Surv(L, T, D) ~ X, data = dd, clusterid = "id") print(fit)
require(survival) # simulate data set.seed(5) n <- 200 m <- 3 alpha <- 1.5 eta <- 1 phi <- 0.5 beta <- 1 id <- rep(1:n, each = m) U <- rep(rgamma(n, shape = 1 / phi, scale = phi), each = m) X <- rnorm(n * m) # reparameterize scale as in rweibull function weibull.scale <- alpha / (U * exp(beta * X))^(1 / eta) T <- rweibull(n * m, shape = eta, scale = weibull.scale) # right censoring C <- runif(n * m, 0, 10) D <- as.numeric(T < C) T <- pmin(T, C) # strong left-truncation L <- runif(n * m, 0, 2) incl <- T > L incl <- ave(x = incl, id, FUN = sum) == m dd <- data.frame(L, T, D, X, id) dd <- dd[incl, ] fit <- parfrailty(formula = Surv(L, T, D) ~ X, data = dd, clusterid = "id") print(fit)
This is a plot
method for class "std_glm"
.
## S3 method for class 'std_glm' plot( x, plot_ci = TRUE, ci_type = "plain", ci_level = 0.95, transform = NULL, contrast = NULL, reference = NULL, summary_fun = "summary_std_glm", ... )
## S3 method for class 'std_glm' plot( x, plot_ci = TRUE, ci_type = "plain", ci_level = 0.95, transform = NULL, contrast = NULL, reference = NULL, summary_fun = "summary_std_glm", ... )
x |
An object of class |
plot_ci |
if |
ci_type |
A string, indicating the type of confidence intervals. Either "plain", which gives untransformed intervals, or "log", which gives log-transformed intervals. |
ci_level |
Coverage probability of confidence intervals. |
transform |
If set to |
contrast |
If set to |
reference |
If |
summary_fun |
For internal use only. Do not change. |
... |
Unused. |
None. Creates a plot as a side effect
# see standardize_glm
# see standardize_glm
This is a plot
method for class "std_surv"
.
## S3 method for class 'std_surv' plot( x, plot_ci = TRUE, ci_type = "plain", ci_level = 0.95, transform = NULL, contrast = NULL, reference = NULL, legendpos = "bottomleft", summary_fun = "summary_std_coxph", ... )
## S3 method for class 'std_surv' plot( x, plot_ci = TRUE, ci_type = "plain", ci_level = 0.95, transform = NULL, contrast = NULL, reference = NULL, legendpos = "bottomleft", summary_fun = "summary_std_coxph", ... )
x |
An object of class |
plot_ci |
if |
ci_type |
A string, indicating the type of confidence intervals. Either "plain", which gives untransformed intervals, or "log", which gives log-transformed intervals. |
ci_level |
Coverage probability of confidence intervals. |
transform |
If set to |
contrast |
If set to |
reference |
If |
legendpos |
position of the legend; see legend. |
summary_fun |
For internal use only. Do not change. |
... |
Unused. |
None. Creates a plot as a side effect
Prints summary of regression standardization fit
## S3 method for class 'std_surv' print(x, ...) ## S3 method for class 'std_glm' print(x, ...) ## S3 method for class 'std_custom' print(x, ...)
## S3 method for class 'std_surv' print(x, ...) ## S3 method for class 'std_glm' print(x, ...) ## S3 method for class 'std_custom' print(x, ...)
x |
an object of class |
... |
unused |
The object being printed, invisibly.
Print method for parametric frailty fits
## S3 method for class 'summary.parfrailty' print(x, digits = max(3L, getOption("digits") - 3L), ...)
## S3 method for class 'summary.parfrailty' print(x, digits = max(3L, getOption("digits") - 3L), ...)
x |
An object of class "parfrailty" |
digits |
Number of digits to print |
... |
Not used |
The object being printed, invisibly
Compute the sandwich variance components from a model fit
sandwich(fit, data, weights, t, fit.detail)
sandwich(fit, data, weights, t, fit.detail)
fit |
|
data |
The data used to fit the model |
weights |
Optional weights |
t |
Optional fixed time point for survival objects |
fit.detail |
For Cox models, the result of running coxph.detail on the model fit |
A list consisting of the Fisher information matrix (I) and the Score equations (U)
Get standardized estimates using the g-formula with a custom model
standardize( fitter, arguments, predict_fun, data, values, B = NULL, ci_level = 0.95, contrasts = NULL, reference = NULL, seed = NULL, times = NULL, transforms = NULL, progressbar = TRUE )
standardize( fitter, arguments, predict_fun, data, values, B = NULL, ci_level = 0.95, contrasts = NULL, reference = NULL, seed = NULL, times = NULL, transforms = NULL, progressbar = TRUE )
fitter |
The function to call to fit the data. |
arguments |
The arguments to be used in the fitter function as a |
predict_fun |
The function used to predict the means/probabilities for a new data set on the response level. For survival data, this should be a matrix where each column is the time, and each row the data. |
data |
The data. |
values |
A named list or data.frame specifying the variables and values at which marginal means of the outcome will be estimated. |
B |
Number of nonparametric bootstrap resamples. Default is |
ci_level |
Coverage probability of confidence intervals. |
contrasts |
A vector of contrasts in the following format:
If set to |
reference |
A vector of reference levels in the following format:
If |
seed |
The seed to use with the nonparametric bootstrap. |
times |
For use with survival data. Set to |
transforms |
A vector of transforms in the following format:
If set to |
progressbar |
Logical, if TRUE will print bootstrapping progress to the console |
Let ,
, and
be the outcome, the exposure, and a
vector of covariates, respectively.
standardize
uses a
model to estimate the standardized
mean ,
where
is a specific value of
,
and the outer expectation is over the marginal distribution of
.
With survival data,
,
and a vector of different time points
times
() can be given,
where
is the uncensored survival time.
An object of class std_custom
.
This is a list with components estimates and fit for the outcome model.
Rothman K.J., Greenland S., Lash T.L. (2008). Modern Epidemiology, 3rd edition. Lippincott, Williams & Wilkins.
Sjölander A. (2016). Regression standardization with the R-package stdReg. European Journal of Epidemiology 31(6), 563-574.
Sjölander A. (2016). Estimation of causal effect measures with the R-package stdReg. European Journal of Epidemiology 33(9), 847-858.
set.seed(6) n <- 100 Z <- rnorm(n) X <- rnorm(n, mean = Z) Y <- rbinom(n, 1, prob = (1 + exp(X + Z))^(-1)) dd <- data.frame(Z, X, Y) prob_predict.glm <- function(...) predict.glm(..., type = "response") x <- standardize( fitter = "glm", arguments = list( formula = Y ~ X * Z, family = "binomial" ), predict_fun = prob_predict.glm, data = dd, values = list(X = seq(-1, 1, 0.1)), B = 100, reference = 0, contrasts = "difference" ) x require(survival) prob_predict.coxph <- function(object, newdata, times) { fit.detail <- suppressWarnings(basehaz(object)) cum.haz <- fit.detail$hazard[sapply(times, function(x) max(which(fit.detail$time <= x)))] predX <- predict(object = object, newdata = newdata, type = "risk") res <- matrix(NA, ncol = length(times), nrow = length(predX)) for (ti in seq_len(length(times))) { res[, ti] <- exp(-predX * cum.haz[ti]) } res } set.seed(68) n <- 500 Z <- rnorm(n) X <- rnorm(n, mean = Z) T <- rexp(n, rate = exp(X + Z + X * Z)) # survival time C <- rexp(n, rate = exp(X + Z + X * Z)) # censoring time U <- pmin(T, C) # time at risk D <- as.numeric(T < C) # event indicator dd <- data.frame(Z, X, U, D) x <- standardize( fitter = "coxph", arguments = list( formula = Surv(U, D) ~ X + Z + X * Z, method = "breslow", x = TRUE, y = TRUE ), predict_fun = prob_predict.coxph, data = dd, times = 1:5, values = list(X = c(-1, 0, 1)), B = 100, reference = 0, contrasts = "difference" ) x
set.seed(6) n <- 100 Z <- rnorm(n) X <- rnorm(n, mean = Z) Y <- rbinom(n, 1, prob = (1 + exp(X + Z))^(-1)) dd <- data.frame(Z, X, Y) prob_predict.glm <- function(...) predict.glm(..., type = "response") x <- standardize( fitter = "glm", arguments = list( formula = Y ~ X * Z, family = "binomial" ), predict_fun = prob_predict.glm, data = dd, values = list(X = seq(-1, 1, 0.1)), B = 100, reference = 0, contrasts = "difference" ) x require(survival) prob_predict.coxph <- function(object, newdata, times) { fit.detail <- suppressWarnings(basehaz(object)) cum.haz <- fit.detail$hazard[sapply(times, function(x) max(which(fit.detail$time <= x)))] predX <- predict(object = object, newdata = newdata, type = "risk") res <- matrix(NA, ncol = length(times), nrow = length(predX)) for (ti in seq_len(length(times))) { res[, ti] <- exp(-predX * cum.haz[ti]) } res } set.seed(68) n <- 500 Z <- rnorm(n) X <- rnorm(n, mean = Z) T <- rexp(n, rate = exp(X + Z + X * Z)) # survival time C <- rexp(n, rate = exp(X + Z + X * Z)) # censoring time U <- pmin(T, C) # time at risk D <- as.numeric(T < C) # event indicator dd <- data.frame(Z, X, U, D) x <- standardize( fitter = "coxph", arguments = list( formula = Surv(U, D) ~ X + Z + X * Z, method = "breslow", x = TRUE, y = TRUE ), predict_fun = prob_predict.coxph, data = dd, times = 1:5, values = list(X = c(-1, 0, 1)), B = 100, reference = 0, contrasts = "difference" ) x
standardize_coxph
performs regression standardization in Cox proportional
hazards models at specified values of the exposure over the sample
covariate distribution. Let ,
, and
be the survival
outcome, the exposure, and a vector of covariates, respectively.
standardize_coxph
fits a Cox proportional hazards model and the Breslow estimator
of the baseline hazard in order to estimate the
standardized survival function when
measure = "survival"
or the standardized restricted mean survival up to time t when
measure = "rmean"
, where
is a specific value of
,
is a specific value of
, and the expectation is over the marginal distribution of
.
standardize_coxph( formula, data, values, times, measure = c("survival", "rmean"), clusterid, ci_level = 0.95, ci_type = "plain", contrasts = NULL, family = "gaussian", reference = NULL, transforms = NULL )
standardize_coxph( formula, data, values, times, measure = c("survival", "rmean"), clusterid, ci_level = 0.95, ci_type = "plain", contrasts = NULL, family = "gaussian", reference = NULL, transforms = NULL )
formula |
The formula which is used to fit the model for the outcome. |
data |
The data. |
values |
A named list or data.frame specifying the variables and values at which marginal means of the outcome will be estimated. |
times |
A vector containing the specific values of |
measure |
Either "survival" to estimate the survival function at times or "rmean" for the restricted mean survival up to the largest of times. |
clusterid |
An optional string containing the name of a cluster identification variable when data are clustered. |
ci_level |
Coverage probability of confidence intervals. |
ci_type |
A string, indicating the type of confidence intervals. Either "plain", which gives untransformed intervals, or "log", which gives log-transformed intervals. |
contrasts |
A vector of contrasts in the following format:
If set to |
family |
The family argument which is used to fit the glm model for the outcome. |
reference |
A vector of reference levels in the following format:
If |
transforms |
A vector of transforms in the following format:
If set to |
standardize_coxph
fits the Cox proportional hazards model
Breslow's estimator of the cumulative baseline hazard
is used together with the partial
likelihood estimate of
to obtain estimates of the survival
function
if
measure = "survival"
:
For each in the
t
argument and for each in the
x
argument, these estimates are averaged across all subjects (i.e.
all observed values of ) to produce estimates
where
is the value of
for subject
,
. The variance
for
is obtained by the sandwich formula.
If measure = "rmean"
, then
is used together with the partial
likelihood estimate of
to obtain estimates of the restricted mean survival
up to time t:
for each element of
times
. The estimation
and inference is done using the method described in Chen and Tsiatis 2001.
Currently, we can only estimate the difference in RMST for a single binary
exposure. Two separate Cox models are fit for each level of the exposure,
which is expected to be coded as 0/1.
An object of class std_surv
.
This is basically a list with components estimates and covariance estimates in res
Results for transformations, contrasts, references are stored in res_contrasts
.
The output contains estimates for contrasts and confidence intervals for all
combinations of transforms and reference levels.
Obtain numeric results in a data frame with the tidy function.
Standardized survival functions are sometimes referred to as (direct) adjusted survival functions in the literature.
standardize_coxph/standardize_parfrailty
does not currently handle time-varying exposures or
covariates.
standardize_coxph/standardize_parfrailty
internally loops over all values in the t
argument.
Therefore, the function will usually be considerably faster if
length(t)
is small.
The variance calculation performed by standardize_coxph
does not condition on
the observed covariates . To see how this
matters, note that
The usual parameter in a Cox proportional hazards model does not
depend on
. Thus,
is independent
of
as well (since
), so that
the term
in the corresponding variance
decomposition for
becomes equal to 0. However,
depends on
through the average over the
sample distribution for
, and thus the term
is not 0, unless one conditions on
. The variance calculation by Gail and Byar (1986) ignores this
term, and thus effectively conditions on
.
Arvid Sjölander, Adam Brand, Michael Sachs
Chang I.M., Gelman G., Pagano M. (1982). Corrected group prognostic curves and summary statistics. Journal of Chronic Diseases 35, 669-674.
Gail M.H. and Byar D.P. (1986). Variance calculations for direct adjusted survival curves, with applications to testing for no treatment effect. Biometrical Journal 28(5), 587-599.
Makuch R.W. (1982). Adjusted survival curve estimation using covariates. Journal of Chronic Diseases 35, 437-443.
Sjölander A. (2016). Regression standardization with the R-package stdReg. European Journal of Epidemiology 31(6), 563-574.
Sjölander A. (2018). Estimation of causal effect measures with the R-package stdReg. European Journal of Epidemiology 33(9), 847-858.
Chen, P. Y., Tsiatis, A. A. (2001). Causal inference on the difference of the restricted mean lifetime between two groups. Biometrics, 57(4), 1030-1038.
require(survival) set.seed(7) n <- 300 Z <- rnorm(n) Zbin <- rbinom(n, 1, .3) X <- rnorm(n, mean = Z) T <- rexp(n, rate = exp(X + Z + X * Z)) # survival time C <- rexp(n, rate = exp(X + Z + X * Z)) # censoring time fact <- factor(sample(letters[1:3], n, replace = TRUE)) U <- pmin(T, C) # time at risk D <- as.numeric(T < C) # event indicator dd <- data.frame(Z, Zbin, X, U, D, fact) fit.std.surv <- standardize_coxph( formula = Surv(U, D) ~ X + Z + X * Z, data = dd, values = list(X = seq(-1, 1, 0.5)), times = 1:5 ) print(fit.std.surv) plot(fit.std.surv) tidy(fit.std.surv) fit.std <- standardize_coxph( formula = Surv(U, D) ~ X + Zbin + X * Zbin + fact, data = dd, values = list(Zbin = 0:1), times = 1.5, measure = "rmean", contrast = "difference", reference = 0 ) print(fit.std) tidy(fit.std)
require(survival) set.seed(7) n <- 300 Z <- rnorm(n) Zbin <- rbinom(n, 1, .3) X <- rnorm(n, mean = Z) T <- rexp(n, rate = exp(X + Z + X * Z)) # survival time C <- rexp(n, rate = exp(X + Z + X * Z)) # censoring time fact <- factor(sample(letters[1:3], n, replace = TRUE)) U <- pmin(T, C) # time at risk D <- as.numeric(T < C) # event indicator dd <- data.frame(Z, Zbin, X, U, D, fact) fit.std.surv <- standardize_coxph( formula = Surv(U, D) ~ X + Z + X * Z, data = dd, values = list(X = seq(-1, 1, 0.5)), times = 1:5 ) print(fit.std.surv) plot(fit.std.surv) tidy(fit.std.surv) fit.std <- standardize_coxph( formula = Surv(U, D) ~ X + Zbin + X * Zbin + fact, data = dd, values = list(Zbin = 0:1), times = 1.5, measure = "rmean", contrast = "difference", reference = 0 ) print(fit.std) tidy(fit.std)
standardize_gee
performs regression standardization in linear and log-linear
fixed effects models, at specified values of the exposure, over the sample
covariate distribution. Let ,
, and
be the outcome,
the exposure, and a vector of covariates, respectively. It is assumed that
data are clustered with a cluster indicator
.
standardize_gee
uses
fitted fixed effects model, with cluster-specific intercept (see
details
), to estimate the standardized mean
, where
is a specific value of
, and the outer expectation is over the marginal distribution of
.
standardize_gee( formula, link = "identity", data, values, clusterid, case_control = FALSE, ci_level = 0.95, ci_type = "plain", contrasts = NULL, family = "gaussian", reference = NULL, transforms = NULL )
standardize_gee( formula, link = "identity", data, values, clusterid, case_control = FALSE, ci_level = 0.95, ci_type = "plain", contrasts = NULL, family = "gaussian", reference = NULL, transforms = NULL )
formula |
A formula to be used with |
link |
The link function to be used with |
data |
The data. |
values |
A named list or data.frame specifying the variables and values at which marginal means of the outcome will be estimated. |
clusterid |
An optional string containing the name of a cluster identification variable when data are clustered. |
case_control |
Whether the data comes from a case-control study. |
ci_level |
Coverage probability of confidence intervals. |
ci_type |
A string, indicating the type of confidence intervals. Either "plain", which gives untransformed intervals, or "log", which gives log-transformed intervals. |
contrasts |
A vector of contrasts in the following format:
If set to |
family |
The family argument which is used to fit the glm model for the outcome. |
reference |
A vector of reference levels in the following format:
If |
transforms |
A vector of transforms in the following format:
If set to |
standardize_gee
assumes that a fixed effects model
has been fitted. The link
function is assumed to be the identity link or the log link. The
conditional generalized estimating equation (CGEE) estimate of
is used to obtain estimates of the cluster-specific means:
where
if is the
identity link, and
if is the log link, and
is the value of
for subject
in cluster
,
,
. The CGEE estimate of
and the estimate of
are used to estimate the mean
:
For
each in the
x
argument, these estimates are averaged across
all subjects (i.e. all observed values of and all estimated values
of
) to produce estimates
where .
The variance for
is obtained by the sandwich formula.
An object of class std_glm
.
This is basically a list with components estimates and covariance estimates in res
.
Results for transformations, contrasts, references are stored in res_contrasts
.
Obtain numeric results in a data frame with the tidy function.
The variance calculation performed by standardize_gee
does not condition
on the observed covariates . To see how
this matters, note that
The usual parameter in a generalized linear model does not
depend on
. Thus,
is independent
of
as well (since
), so that
the term
in the corresponding variance
decomposition for
becomes equal to 0. However,
depends on
through the average over the sample
distribution for
, and thus the term
is not 0, unless one conditions on
.
Arvid Sjölander.
Goetgeluk S. and Vansteelandt S. (2008). Conditional generalized estimating equations for the analysis of clustered and longitudinal data. Biometrics 64(3), 772-780.
Martin R.S. (2017). Estimation of average marginal effects in multiplicative unobserved effects panel models. Economics Letters 160, 16-19.
Sjölander A. (2019). Estimation of marginal causal effects in the presence of confounding by cluster. Biostatistics doi: 10.1093/biostatistics/kxz054
require(drgee) set.seed(4) n <- 300 ni <- 2 id <- rep(1:n, each = ni) ai <- rep(rnorm(n), each = ni) Z <- rnorm(n * ni) X <- rnorm(n * ni, mean = ai + Z) Y <- rnorm(n * ni, mean = ai + X + Z + 0.1 * X^2) dd <- data.frame(id, Z, X, Y) fit.std <- standardize_gee( formula = Y ~ X + Z + I(X^2), link = "identity", data = dd, values = list(X = seq(-3, 3, 0.5)), clusterid = "id" ) print(fit.std) plot(fit.std)
require(drgee) set.seed(4) n <- 300 ni <- 2 id <- rep(1:n, each = ni) ai <- rep(rnorm(n), each = ni) Z <- rnorm(n * ni) X <- rnorm(n * ni, mean = ai + Z) Y <- rnorm(n * ni, mean = ai + X + Z + 0.1 * X^2) dd <- data.frame(id, Z, X, Y) fit.std <- standardize_gee( formula = Y ~ X + Z + I(X^2), link = "identity", data = dd, values = list(X = seq(-3, 3, 0.5)), clusterid = "id" ) print(fit.std) plot(fit.std)
Get regression standardized estimates from a glm
standardize_glm( formula, data, values, clusterid, matched_density_cases, matched_density_controls, matching_variable, p_population, case_control = FALSE, ci_level = 0.95, ci_type = "plain", contrasts = NULL, family = "gaussian", reference = NULL, transforms = NULL )
standardize_glm( formula, data, values, clusterid, matched_density_cases, matched_density_controls, matching_variable, p_population, case_control = FALSE, ci_level = 0.95, ci_type = "plain", contrasts = NULL, family = "gaussian", reference = NULL, transforms = NULL )
formula |
The formula which is used to fit the model for the outcome. |
data |
The data. |
values |
A named list or data.frame specifying the variables and values at which marginal means of the outcome will be estimated. |
clusterid |
An optional string containing the name of a cluster identification variable when data are clustered. |
matched_density_cases |
A function of the matching variable. The probability (or density) of the matched variable among the cases. |
matched_density_controls |
A function of the matching variable. The probability (or density) of the matched variable among the controls. |
matching_variable |
The matching variable extracted from the data set. |
p_population |
Specifies the incidence in the population when |
case_control |
Whether the data comes from a case-control study. |
ci_level |
Coverage probability of confidence intervals. |
ci_type |
A string, indicating the type of confidence intervals. Either "plain", which gives untransformed intervals, or "log", which gives log-transformed intervals. |
contrasts |
A vector of contrasts in the following format:
If set to |
family |
The family argument which is used to fit the glm model for the outcome. |
reference |
A vector of reference levels in the following format:
If |
transforms |
A vector of transforms in the following format:
If set to |
standardize_glm
performs regression standardization
in generalized linear models,
at specified values of the exposure, over the sample covariate distribution.
Let ,
, and
be the outcome, the exposure, and a
vector of covariates, respectively.
standardize_glm
uses a fitted generalized linear
model to estimate the standardized
mean ,
where
is a specific value of
,
and the outer expectation is over the marginal distribution of
.
An object of class std_glm
.
This is basically a list with components estimates and covariance estimates in res
.
Results for transformations, contrasts, references are stored in res_contrasts
.
Obtain numeric results in a data frame with the tidy function.
Rothman K.J., Greenland S., Lash T.L. (2008). Modern Epidemiology, 3rd edition. Lippincott, Williams & Wilkins.
Sjölander A. (2016). Regression standardization with the R-package stdReg. European Journal of Epidemiology 31(6), 563-574.
Sjölander A. (2016). Estimation of causal effect measures with the R-package stdReg. European Journal of Epidemiology 33(9), 847-858.
# basic example # needs to correctly specify the outcome model and no unmeasered confounders # (+ standard causal assunmptions) set.seed(6) n <- 100 Z <- rnorm(n) X <- rnorm(n, mean = Z) Y <- rbinom(n, 1, prob = (1 + exp(X + Z))^(-1)) dd <- data.frame(Z, X, Y) x <- standardize_glm( formula = Y ~ X * Z, family = "binomial", data = dd, values = list(X = 0:1), contrasts = c("difference", "ratio"), reference = 0 ) x # different transformations of causal effects # example from Sjölander (2016) with case-control data # here the matching variable needs to be passed as an argument singapore <- AF::singapore Mi <- singapore$Age m <- mean(Mi) s <- sd(Mi) d <- 5 standardize_glm( formula = Oesophagealcancer ~ (Everhotbev + Age + Dial + Samsu + Cigs)^2, family = binomial, data = singapore, values = list(Everhotbev = 0:1), clusterid = "Set", case_control = TRUE, matched_density_cases = function(x) dnorm(x, m, s), matched_density_controls = function(x) dnorm(x, m - d, s), matching_variable = Mi, p_population = 19.3 / 100000 ) # multiple exposures set.seed(7) n <- 100 Z <- rnorm(n) X1 <- rnorm(n, mean = Z) X2 <- rnorm(n) Y <- rbinom(n, 1, prob = (1 + exp(X1 + X2 + Z))^(-1)) dd <- data.frame(Z, X1, X2, Y) x <- standardize_glm( formula = Y ~ X1 + X2 + Z, family = "binomial", data = dd, values = list(X1 = 0:1, X2 = 0:1), contrasts = c("difference", "ratio"), reference = c(X1 = 0, X2 = 0) ) x tidy(x) # continuous exposure set.seed(2) n <- 100 Z <- rnorm(n) X <- rnorm(n, mean = Z) Y <- rnorm(n, mean = X + Z + 0.1 * X^2) dd <- data.frame(Z, X, Y) x <- standardize_glm( formula = Y ~ X * Z, family = "gaussian", data = dd, values = list(X = seq(-1, 1, 0.1)) ) # plot standardized mean as a function of x plot(x) # plot standardized mean - standardized mean at x = 0 as a function of x plot(x, contrast = "difference", reference = 0)
# basic example # needs to correctly specify the outcome model and no unmeasered confounders # (+ standard causal assunmptions) set.seed(6) n <- 100 Z <- rnorm(n) X <- rnorm(n, mean = Z) Y <- rbinom(n, 1, prob = (1 + exp(X + Z))^(-1)) dd <- data.frame(Z, X, Y) x <- standardize_glm( formula = Y ~ X * Z, family = "binomial", data = dd, values = list(X = 0:1), contrasts = c("difference", "ratio"), reference = 0 ) x # different transformations of causal effects # example from Sjölander (2016) with case-control data # here the matching variable needs to be passed as an argument singapore <- AF::singapore Mi <- singapore$Age m <- mean(Mi) s <- sd(Mi) d <- 5 standardize_glm( formula = Oesophagealcancer ~ (Everhotbev + Age + Dial + Samsu + Cigs)^2, family = binomial, data = singapore, values = list(Everhotbev = 0:1), clusterid = "Set", case_control = TRUE, matched_density_cases = function(x) dnorm(x, m, s), matched_density_controls = function(x) dnorm(x, m - d, s), matching_variable = Mi, p_population = 19.3 / 100000 ) # multiple exposures set.seed(7) n <- 100 Z <- rnorm(n) X1 <- rnorm(n, mean = Z) X2 <- rnorm(n) Y <- rbinom(n, 1, prob = (1 + exp(X1 + X2 + Z))^(-1)) dd <- data.frame(Z, X1, X2, Y) x <- standardize_glm( formula = Y ~ X1 + X2 + Z, family = "binomial", data = dd, values = list(X1 = 0:1, X2 = 0:1), contrasts = c("difference", "ratio"), reference = c(X1 = 0, X2 = 0) ) x tidy(x) # continuous exposure set.seed(2) n <- 100 Z <- rnorm(n) X <- rnorm(n, mean = Z) Y <- rnorm(n, mean = X + Z + 0.1 * X^2) dd <- data.frame(Z, X, Y) x <- standardize_glm( formula = Y ~ X * Z, family = "gaussian", data = dd, values = list(X = seq(-1, 1, 0.1)) ) # plot standardized mean as a function of x plot(x) # plot standardized mean - standardized mean at x = 0 as a function of x plot(x, contrast = "difference", reference = 0)
Get regression standardized doubly-robust estimates from a glm
standardize_glm_dr( formula_outcome, formula_exposure, data, values, ci_level = 0.95, ci_type = "plain", contrasts = NULL, family_outcome = "gaussian", family_exposure = "binomial", reference = NULL, transforms = NULL )
standardize_glm_dr( formula_outcome, formula_exposure, data, values, ci_level = 0.95, ci_type = "plain", contrasts = NULL, family_outcome = "gaussian", family_exposure = "binomial", reference = NULL, transforms = NULL )
formula_outcome |
The formula which is used to fit the glm model for the outcome. |
formula_exposure |
The formula which is used to fit the glm model for the exposure.
If not |
data |
The data. |
values |
A named list or data.frame specifying the variables and values at which marginal means of the outcome will be estimated. |
ci_level |
Coverage probability of confidence intervals. |
ci_type |
A string, indicating the type of confidence intervals. Either "plain", which gives untransformed intervals, or "log", which gives log-transformed intervals. |
contrasts |
A vector of contrasts in the following format:
If set to |
family_outcome |
The family argument which is used to fit the glm model for the outcome. |
family_exposure |
The family argument which is used to fit the glm model for the exposure. |
reference |
A vector of reference levels in the following format:
If |
transforms |
A vector of transforms in the following format:
If set to |
standardize_glm_dr
performs regression standardization
in generalized linear models, see e.g., documentation for standardize_glm_dr
. Specifically,
this version uses a doubly robust estimator for standardization, meaning inference is valid
when either the outcome regression or the exposure model is correctly specified
and there is no unmeasured confounding.
An object of class std_glm
.
This is basically a list with components estimates and covariance estimates in res
.
Results for transformations, contrasts, references are stored in res_contrasts
.
Obtain numeric results in a data frame with the tidy function.
Gabriel E.E., Sachs, M.C., Martinussen T., Waernbaum I., Goetghebeur E., Vansteelandt S., Sjölander A. (2024), Inverse probability of treatment weighting with generalized linear outcome models for doubly robust estimation. Statistics in Medicine, 43(3):534–547.
# doubly robust estimator # needs to correctly specify either the outcome model or the exposure model # for confounding # NOTE: only works with binary exposures data <- AF::clslowbwt x <- standardize_glm_dr( formula_outcome = bwt ~ smoker * (race + age + lwt) + I(age^2) + I(lwt^2), formula_exposure = smoker ~ race * age * lwt + I(age^2) + I(lwt^2), family_outcome = "gaussian", family_exposure = "binomial", data = data, values = list(smoker = c(0, 1)), contrasts = "difference", reference = 0 ) set.seed(6) n <- 100 Z <- rnorm(n) X <- rbinom(n, 1, prob = (1 + exp(Z))^(-1)) Y <- rbinom(n, 1, prob = (1 + exp(X + Z))^(-1)) dd <- data.frame(Z, X, Y) x <- standardize_glm_dr( formula_outcome = Y ~ X * Z, formula_exposure = X ~ Z, family_outcome = "binomial", data = dd, values = list(X = 0:1), reference = 0, contrasts = c("difference"), transforms = c("odds") )
# doubly robust estimator # needs to correctly specify either the outcome model or the exposure model # for confounding # NOTE: only works with binary exposures data <- AF::clslowbwt x <- standardize_glm_dr( formula_outcome = bwt ~ smoker * (race + age + lwt) + I(age^2) + I(lwt^2), formula_exposure = smoker ~ race * age * lwt + I(age^2) + I(lwt^2), family_outcome = "gaussian", family_exposure = "binomial", data = data, values = list(smoker = c(0, 1)), contrasts = "difference", reference = 0 ) set.seed(6) n <- 100 Z <- rnorm(n) X <- rbinom(n, 1, prob = (1 + exp(Z))^(-1)) Y <- rbinom(n, 1, prob = (1 + exp(X + Z))^(-1)) dd <- data.frame(Z, X, Y) x <- standardize_glm_dr( formula_outcome = Y ~ X * Z, formula_exposure = X ~ Z, family_outcome = "binomial", data = dd, values = list(X = 0:1), reference = 0, contrasts = c("difference"), transforms = c("odds") )
Get standardized estimates using the g-formula with and separate models for each exposure level in the data
standardize_level( fitter_list, arguments, predict_fun_list, data, values, B = NULL, ci_level = 0.95, contrasts = NULL, reference = NULL, seed = NULL, times = NULL, transforms = NULL, progressbar = TRUE )
standardize_level( fitter_list, arguments, predict_fun_list, data, values, B = NULL, ci_level = 0.95, contrasts = NULL, reference = NULL, seed = NULL, times = NULL, transforms = NULL, progressbar = TRUE )
fitter_list |
The function to call to fit the data (as a list). |
arguments |
The arguments to be used in the fitter function as a |
predict_fun_list |
The function used to predict the means/probabilities for a new data set on the response level. For survival data, this should be a matrix where each column is the time, and each row the data (as a list). |
data |
The data. |
values |
A named list or data.frame specifying the variables and values at which marginal means of the outcome will be estimated. |
B |
Number of nonparametric bootstrap resamples. Default is |
ci_level |
Coverage probability of confidence intervals. |
contrasts |
A vector of contrasts in the following format:
If set to |
reference |
A vector of reference levels in the following format:
If |
seed |
The seed to use with the nonparametric bootstrap. |
times |
For use with survival data. Set to |
transforms |
A vector of transforms in the following format:
If set to |
progressbar |
Logical, if TRUE will print bootstrapping progress to the console |
See standardize
. The difference is here that different models
can be fitted for each value of x
in values
.
An object of class std_custom
.
This is a list with components estimates and fit for the outcome model.
Rothman K.J., Greenland S., Lash T.L. (2008). Modern Epidemiology, 3rd edition. Lippincott, Williams & Wilkins.
Sjölander A. (2016). Regression standardization with the R-package stdReg. European Journal of Epidemiology 31(6), 563-574.
Sjölander A. (2016). Estimation of causal effect measures with the R-package stdReg. European Journal of Epidemiology 33(9), 847-858.
require(survival) prob_predict.coxph <- function(object, newdata, times) { fit.detail <- suppressWarnings(basehaz(object)) cum.haz <- fit.detail$hazard[sapply(times, function(x) max(which(fit.detail$time <= x)))] predX <- predict(object = object, newdata = newdata, type = "risk") res <- matrix(NA, ncol = length(times), nrow = length(predX)) for (ti in seq_len(length(times))) { res[, ti] <- exp(-predX * cum.haz[ti]) } res } set.seed(68) n <- 500 Z <- rnorm(n) X <- rbinom(n, 1, prob = 0.5) T <- rexp(n, rate = exp(X + Z + X * Z)) # survival time C <- rexp(n, rate = exp(X + Z + X * Z)) # censoring time U <- pmin(T, C) # time at risk D <- as.numeric(T < C) # event indicator dd <- data.frame(Z, X, U, D) x <- standardize_level( fitter_list = list("coxph", "coxph"), arguments = list( list( formula = Surv(U, D) ~ X + Z + X * Z, method = "breslow", x = TRUE, y = TRUE ), list( formula = Surv(U, D) ~ X, method = "breslow", x = TRUE, y = TRUE ) ), predict_fun_list = list(prob_predict.coxph, prob_predict.coxph), data = dd, times = seq(1, 5, 0.1), values = list(X = c(0, 1)), B = 100, reference = 0, contrasts = "difference" ) print(x)
require(survival) prob_predict.coxph <- function(object, newdata, times) { fit.detail <- suppressWarnings(basehaz(object)) cum.haz <- fit.detail$hazard[sapply(times, function(x) max(which(fit.detail$time <= x)))] predX <- predict(object = object, newdata = newdata, type = "risk") res <- matrix(NA, ncol = length(times), nrow = length(predX)) for (ti in seq_len(length(times))) { res[, ti] <- exp(-predX * cum.haz[ti]) } res } set.seed(68) n <- 500 Z <- rnorm(n) X <- rbinom(n, 1, prob = 0.5) T <- rexp(n, rate = exp(X + Z + X * Z)) # survival time C <- rexp(n, rate = exp(X + Z + X * Z)) # censoring time U <- pmin(T, C) # time at risk D <- as.numeric(T < C) # event indicator dd <- data.frame(Z, X, U, D) x <- standardize_level( fitter_list = list("coxph", "coxph"), arguments = list( list( formula = Surv(U, D) ~ X + Z + X * Z, method = "breslow", x = TRUE, y = TRUE ), list( formula = Surv(U, D) ~ X, method = "breslow", x = TRUE, y = TRUE ) ), predict_fun_list = list(prob_predict.coxph, prob_predict.coxph), data = dd, times = seq(1, 5, 0.1), values = list(X = c(0, 1)), B = 100, reference = 0, contrasts = "difference" ) print(x)
standardize_parfrailty
performs regression standardization in shared frailty
gamma-Weibull models, at specified values of the exposure, over the sample
covariate distribution. Let ,
, and
be the survival
outcome, the exposure, and a vector of covariates, respectively.
standardize_parfrailty
fits a parametric frailty model to
estimate the standardized survival function
, where
is a specific value of
,
is a specific value of
, and the expectation is over
the marginal distribution of
.
standardize_parfrailty( formula, data, values, times, clusterid, ci_level = 0.95, ci_type = "plain", contrasts = NULL, family = "gaussian", reference = NULL, transforms = NULL )
standardize_parfrailty( formula, data, values, times, clusterid, ci_level = 0.95, ci_type = "plain", contrasts = NULL, family = "gaussian", reference = NULL, transforms = NULL )
formula |
The formula which is used to fit the model for the outcome. |
data |
The data. |
values |
A named list or data.frame specifying the variables and values at which marginal means of the outcome will be estimated. |
times |
A vector containing the specific values of |
clusterid |
An optional string containing the name of a cluster identification variable when data are clustered. |
ci_level |
Coverage probability of confidence intervals. |
ci_type |
A string, indicating the type of confidence intervals. Either "plain", which gives untransformed intervals, or "log", which gives log-transformed intervals. |
contrasts |
A vector of contrasts in the following format:
If set to |
family |
The family argument which is used to fit the glm model for the outcome. |
reference |
A vector of reference levels in the following format:
If |
transforms |
A vector of transforms in the following format:
If set to |
standardize_parfrailty
fits a shared frailty gamma-Weibull model
, with parameterization as described in the help section for parfrailty. Integrating out the gamma frailty gives the survival function
where is the cumulative baseline hazard
The ML estimates of
are used to obtain estimates of the survival function
:
For each in the
t
argument and for each in the
x
argument, these estimates are averaged across all subjects (i.e.
all observed values of ) to produce estimates
The variance for
is obtained by the sandwich formula.
An object of class std_surv
.
This is basically a list with components estimates and covariance estimates in res
Results for transformations, contrasts, references are stored in res_contrasts
.
The output contains estimates for contrasts and confidence intervals for all
combinations of transforms and reference levels.
Obtain numeric results in a data frame with the tidy function.
Standardized survival functions are sometimes referred to as (direct) adjusted survival functions in the literature.
standardize_coxph/standardize_parfrailty
does not currently handle time-varying exposures or
covariates.
standardize_coxph/standardize_parfrailty
internally loops over all values in the t
argument.
Therefore, the function will usually be considerably faster if
length(t)
is small.
The variance calculation performed by standardize_coxph
does not condition on
the observed covariates . To see how this
matters, note that
The usual parameter in a Cox proportional hazards model does not
depend on
. Thus,
is independent
of
as well (since
), so that
the term
in the corresponding variance
decomposition for
becomes equal to 0. However,
depends on
through the average over the
sample distribution for
, and thus the term
is not 0, unless one conditions on
. The variance calculation by Gail and Byar (1986) ignores this
term, and thus effectively conditions on
.
Arvid Sjölander
Chang I.M., Gelman G., Pagano M. (1982). Corrected group prognostic curves and summary statistics. Journal of Chronic Diseases 35, 669-674.
Dahlqwist E., Pawitan Y., Sjölander A. (2019). Regression standardization and attributable fraction estimation with between-within frailty models for clustered survival data. Statistical Methods in Medical Research 28(2), 462-485.
Gail M.H. and Byar D.P. (1986). Variance calculations for direct adjusted survival curves, with applications to testing for no treatment effect. Biometrical Journal 28(5), 587-599.
Makuch R.W. (1982). Adjusted survival curve estimation using covariates. Journal of Chronic Diseases 35, 437-443.
require(survival) # simulate data set.seed(6) n <- 300 m <- 3 alpha <- 1.5 eta <- 1 phi <- 0.5 beta <- 1 id <- rep(1:n, each = m) U <- rep(rgamma(n, shape = 1 / phi, scale = phi), each = m) X <- rnorm(n * m) # reparameterize scale as in rweibull function weibull.scale <- alpha / (U * exp(beta * X))^(1 / eta) T <- rweibull(n * m, shape = eta, scale = weibull.scale) # right censoring C <- runif(n * m, 0, 10) D <- as.numeric(T < C) T <- pmin(T, C) # strong left-truncation L <- runif(n * m, 0, 2) incl <- T > L incl <- ave(x = incl, id, FUN = sum) == m dd <- data.frame(L, T, D, X, id) dd <- dd[incl, ] fit.std <- standardize_parfrailty( formula = Surv(L, T, D) ~ X, data = dd, values = list(X = seq(-1, 1, 0.5)), times = 1:5, clusterid = "id" ) print(fit.std) plot(fit.std)
require(survival) # simulate data set.seed(6) n <- 300 m <- 3 alpha <- 1.5 eta <- 1 phi <- 0.5 beta <- 1 id <- rep(1:n, each = m) U <- rep(rgamma(n, shape = 1 / phi, scale = phi), each = m) X <- rnorm(n * m) # reparameterize scale as in rweibull function weibull.scale <- alpha / (U * exp(beta * X))^(1 / eta) T <- rweibull(n * m, shape = eta, scale = weibull.scale) # right censoring C <- runif(n * m, 0, 10) D <- as.numeric(T < C) T <- pmin(T, C) # strong left-truncation L <- runif(n * m, 0, 2) incl <- T > L incl <- ave(x = incl, id, FUN = sum) == m dd <- data.frame(L, T, D, X, id) dd <- dd[incl, ] fit.std <- standardize_parfrailty( formula = Surv(L, T, D) ~ X, data = dd, values = list(X = seq(-1, 1, 0.5)), times = 1:5, clusterid = "id" ) print(fit.std) plot(fit.std)
This is a summary
method for class "parfrailty"
.
## S3 method for class 'parfrailty' summary( object, ci_type = "plain", ci_level = 0.95, digits = max(3L, getOption("digits") - 3L), ... )
## S3 method for class 'parfrailty' summary( object, ci_type = "plain", ci_level = 0.95, digits = max(3L, getOption("digits") - 3L), ... )
object |
an object of class |
ci_type |
string, indicating the type of confidence intervals. Either "plain", which gives untransformed intervals, or "log", which gives log-transformed intervals. |
ci_level |
desired coverage probability of confidence intervals, in decimal form. |
digits |
the number of significant digits to use when printing.. |
... |
not used. |
An object of class "summary.parfrailty", which is a list that contains relevant summary statistics about the fitted model
Arvid Sjölander and Elisabeth Dahlqwist.
## See documentation for parfrailty
## See documentation for parfrailty
Tidy summarizes information about the components of the standardized regression fit.
## S3 method for class 'std_glm' tidy(x, ...)
## S3 method for class 'std_glm' tidy(x, ...)
x |
An object of class std_glm |
... |
Not currently used |
A data.frame
set.seed(6) n <- 100 Z <- rnorm(n) X <- rnorm(n, mean = Z) Y <- rbinom(n, 1, prob = (1 + exp(X + Z))^(-1)) dd <- data.frame(Z, X, Y) x <- standardize_glm( formula = Y ~ X * Z, family = "binomial", data = dd, values = list(X = 0:1), contrasts = c("difference", "ratio"), reference = 0 ) tidy(x)
set.seed(6) n <- 100 Z <- rnorm(n) X <- rnorm(n, mean = Z) Y <- rbinom(n, 1, prob = (1 + exp(X + Z))^(-1)) dd <- data.frame(Z, X, Y) x <- standardize_glm( formula = Y ~ X * Z, family = "binomial", data = dd, values = list(X = 0:1), contrasts = c("difference", "ratio"), reference = 0 ) tidy(x)
Tidy summarizes information about the components of the standardized model fit.
## S3 method for class 'std_surv' tidy(x, ...)
## S3 method for class 'std_surv' tidy(x, ...)
x |
An object of class std_surv |
... |
Not currently used |
A data.frame
require(survival) set.seed(8) n <- 300 Z <- rnorm(n) X <- rnorm(n, mean = Z) time <- rexp(n, rate = exp(X + Z + X * Z)) # survival time C <- rexp(n, rate = exp(X + Z + X * Z)) # censoring time U <- pmin(time, C) # time at risk D <- as.numeric(time < C) # event indicator dd <- data.frame(Z, X, U, D) x <- standardize_coxph( formula = Surv(U, D) ~ X + Z + X * Z, data = dd, values = list(X = seq(-1, 1, 0.5)), times = c(2,3,4) ) tidy(x)
require(survival) set.seed(8) n <- 300 Z <- rnorm(n) X <- rnorm(n, mean = Z) time <- rexp(n, rate = exp(X + Z + X * Z)) # survival time C <- rexp(n, rate = exp(X + Z + X * Z)) # censoring time U <- pmin(time, C) # time at risk D <- as.numeric(time < C) # event indicator dd <- data.frame(Z, X, U, D) x <- standardize_coxph( formula = Surv(U, D) ~ X + Z + X * Z, data = dd, values = list(X = seq(-1, 1, 0.5)), times = c(2,3,4) ) tidy(x)