causaloptim
?In summary, causaloptim
computes symbolic tight
bounds on causal effects. Suppose we have a causal graph G in which we want to estimate an
effect ACE(X → Y). If
there is unmeasured confounding of this effect, then it cannot be
determined. However, we can still try to find good bounds on it!
Let’s pretend for a while that the confounding was in fact measured, so we have an estimate of its distribution q. Then, given q, the effect ACEq(X → Y) is identifiable. Now, even though we don’t actually know q, we surely know something about it; it needs to be consistent with the structure of G, so it satisfies some constraint, say q ∈ 𝒜. With no further assumptions, ACEq(X → Y) is valid w.r.t. G if and only if we have q ∈ 𝒜, so finding tight bounds on ACE(X → Y) becomes a constrained optimization problem; simply find the extrema of {ACEq(X → Y) : q ∈ 𝒜}.
For a given estimate of the distribution p of our measured
variables, we could try a large number of simulations and numerically
approximate the bounds, but can we actually solve the problem
analytically to get closed form expressions of the bounds in terms of
p? For a certain class of
problems, it turns out we can! This is precisely what
causaloptim
does, and this post will highlight some of its
inner workings and recent efficiency improvements. For a tutorial on how
to get started using causaloptim
, have a look at this
post.
Back in 1995, [Balke & Pearl]1 showed how to do this in a classic example of an experiment with potential non-compliance. More on that shortly, but in summary they showed how to translate their causal problem into a linear programming one and leverage tools from linear optimization to compute the bounds. Balke even wrote a command line program in C++ that computes them, given such a linear program (LP) as input. His program has been used successfully by researchers in causal inference since then, but has had a few issues preventing it from wider adoption among applied researchers.
causaloptim
was written to solve these problems. It can
handle the generalized class of problems defined in [Sachs & Jonzon
& Gabriel & Sjölander]2. It removes the burden of translating the
causal problem into an LP and has an intuitive graphical user interface
that guides the user through specifying their DAG and causal query. It
has until version 0.7.1
used Balke’s code for the
optimization. However, for non-trivial instances of the generalized
class that causaloptim
handles, the original optimization
algorithm has not been up to the task.
Let’s time the following simple enough looking multiple instrument problem.
library(causaloptim)
library(igraph)
b <- graph_from_literal(Z1 -+ X, Z2 -+ X, Z2 -+ Z1, Ul -+ Z1, Ul -+ Z2, X -+ Y, Ur -+ X, Ur -+ Y)
V(b)$leftside <- c(1, 0, 1, 1, 0, 0)
V(b)$latent <- c(0, 0, 0, 1, 0, 1)
V(b)$nvals <- c(2, 2, 2, 2, 2, 2)
E(b)$rlconnect <- c(0, 0, 0, 0, 0, 0, 0, 0)
E(b)$edge.monotone <- c(0, 0, 0, 0, 0, 0, 0, 0)
obj <- analyze_graph(b, constraints = NULL, effectt = "p{Y(X = 1) = 1} - p{Y(X = 0) = 1}")
Using causaloptim
version 0.7.1
to bound
ACE(X → Y) we got
(with a 3.3 GHz Quad-Core Intel Core i5; your mileage may vary)
Note the times (in seconds)! 😱 Our CPU spent almost 9 hours working on this! 🤖 And we spent more
than 17 hours waiting! 😴 Correctness
is not enough; having to wait until the next day is a bad user
experience. Using causaloptim
version 0.8.0
however, we get
Now this is acceptable! 😊 Of course, we need to verify that the results are the same. It is difficult to see upon visual inspection that they are the same, because the vertices are returned in a different order (and there are lots of them). Instead we can randomly generate some probability distributions that satisfy the graph, and compute the bounds and compare the results. First, we create the R functions that compute the bounds:
eval_newbnds <- interpret_bounds(newbnds$bounds, obj$parameters)
eval_oldbnds <- interpret_bounds(oldbnds$bounds, obj$parameters)
Then, we create a distribution by first randomly generating the
counterfactual probabilities, and then calculating the observable
probabilities by using the constraints implied by the DAG (which live in
the constraints
element of obj
).
sim.qs <- rbeta(length(obj$variables), .05, 1)
sim.qs <- sim.qs / sum(sim.qs)
names(sim.qs) <- obj$variables
inenv <- new.env()
for(j in 1:length(sim.qs)) {
assign(names(sim.qs)[j], sim.qs[j], inenv)
}
res <- lapply(as.list(obj$constraints[-1]), function(x){
x1 <- strsplit(x, " = ")[[1]]
x0 <- paste(x1[1], " = ", x1[2])
eval(parse(text = x0), envir = inenv)
})
params <- lapply(obj$parameters, function(x) get(x, envir = inenv))
names(params) <- obj$parameters
Then we pass the probabilities to the bounds functions and compare:
do.call(eval_newbnds, params)
#> lower upper
#> q0_0 -0.2221728 0.2347853
do.call(eval_oldbnds, params)
#> lower upper
#> q0_0 -0.2221728 0.2347853
Success!
This improvement was achieved by modernizing a key back-end
optimization algorithm and implementation. To see how it fits into the
bigger picture, let’s first see how causaloptim
sets up the
optimization problem.
causaloptim
work?Given a DAG D, causally relating a set 𝒱 of measured binary variables, suppose we can bipart 𝒱 so that all association between the two parts is causal and directed from one part (which we call ℒ) to the other (called ℛ), and we want to determine the effect ACE(X → Y) of some variable X on a variable Y ∈ ℛ. As a guiding example, let’s follow along the original one given by Balke & Pearl.
Since we assume nothing apart from the macro-structure ℒ → ℛ ∋ Y, we must augment D with a “worst case” scenario of
confounding among each variable within the parts ℒ and ℛ to
get a causal graph G.
The specific causal graph below can be interpreted as an experiment with
potential non-compliance. We have three binary measured variables;
treatment assigned Z,
treatment taken X and
outcome Y. We want to estimate
ACE(X → Y), which is
confounded by the unmeasured U
and has a single instrument Z.
This yields a functional causal model {FV : pa(V) × UV → V ∣ V ∈ 𝒱}, although not in a computationally convenient form. However, by the assumption of finite-valued measured variables V ∈ 𝒱, we can in fact partition the ranges of the unmeasured ones UV ∈ 𝒰 into finitely many sets; they simply enumerate the finitely many distinct functions pa(V) → V for each V ∈ 𝒱. We call such a canonical partition RV of UV the response function variable of V.
(The dashed line signifies $R_X\not\!\perp\!\!\!\perp R_Y$.) In the context of our specific example, where e.g. X has a single parent only so RX takes exactly 4 distinct values, say {0, 1, 2, 3} where ∀z ∈ {0, 1}, we have
$$\begin{matrix} \textbf{Response Pattern}&\textbf{Function}&\textbf{Interpretation}\\ r=0&f_X(z,0)=0&\text{never takes $X$, regardless of $Z$}\\ r=1&f_X(z,1)=z&\text{full compliance with assignment $Z$}\\ r=2&f_X(z,2)=1-z&\text{total defiance of assignment $Z$}\\ r=3&f_X(z,3)=1&\text{always takes $X$, regardless of $Z$}\\ \end{matrix}$$
We can observe the outcomes of the variables in 𝒱, so we can estimate their joint distribution, but in fact we will only need their conditional distribution given ℒ, so let’s establish some notation for this. For our specific example we define p(y, x; z) := P(y, x|z), ∀x, y, x ∈ {0, 1} Although unknown, we need notation for the distribution of the response function variables as well. Our specific example requires only q(rX, rY) := P(rX, rY), ∀rX, rY ∈ {0, 1, 2, 3} By the Markovian property, our example yields p(y, x; z) = ∑rX, rYP(y|x, rY)P(x|z, rX)q(rX, rY) Note that all coefficients are deterministic binary, since e.g. $P(y|x,r_Y)=\begin{cases}1&\text{, if }y=f_Y(x,r_Y)\\0&\text{, otherwise}\end{cases}$, and by recursion the same applies to more complex examples. Thus, in our particular example, we have p = Rq for some matrix R ∈ {0, 1}8 × 16 Further, we can express potential outcomes of Y under intervention on X in terms of the parameters q by the adjustment formula; P(y|do(x)) = ∑rX, rYP(y|x, rY)q(rX, rY) Hence, we have $$\begin{align*} \text{ACE}_q(X\rightarrow Y) &=P(Y=1|do(X=1))-P(Y=1|do(X=0))\\ &=c^Tq\text{ for some vector }c\in\{0,1,-1\}^{16} \end{align*}$$
Now we have our constraints on q as well as our effect of interest
in terms of q and we are ready
to optimize! By adding the probabilistic constraints on q we have arrived at e.g. the
following LP giving a tight lower bound on ACE(X → Y): $$\begin{matrix}
\min&c^Tq\\
st&\Sigma q=1\\
\&&Rq=p\\
\&&q\geq0
\end{matrix}$$ By the Strong Duality Theorem3 of convex
optimization, the optimal value of this primal problem equals that of
its dual. Furthermore, its constraint space is a convex polytope and by
the Fundamental Theorem of Linear Programming4, this optimum is
attained at one of its vertices. Consequently, we have $$\begin{align*}
\min_q ACE_q(X\rightarrow Y)
&=\min\{c^Tq\mid
q\in\mathbb{R}^{16},q\geq0_{16\times1},1_{1\times16} q=1,Rq=p\}\\
&=\max\{\begin{pmatrix}1&p^T\end{pmatrix}y\mid
y\in\mathbb{R}^{9},y\geq0_{9\times1},\begin{pmatrix}1_{16\times1}&R^T\end{pmatrix}\leq
c\}\\
&=\max\{\begin{pmatrix}1&p^T\end{pmatrix}\bar{y}\mid\bar{y}\text{
is a vertex of }\{y\in\mathbb{R}^{9}\mid
y\geq0_{9\times1},\begin{pmatrix}1_{16\times1}&R^T\end{pmatrix}\leq
c\}\}
\end{align*}$$ Since we allow the user to provide additional
linear inequality constraints (e.g. it may be quite reasonable to assume
the proportion of “defiers” in the study population of our example to be
quite low), the actual primal and dual LP’s look slightly more
complicated, but this small example still captures the essentials.
In general, with $A_e:=\begin{pmatrix}1_{n\times1}^T\\R\end{pmatrix}$,
$b_e:=\begin{pmatrix}1\\p\end{pmatrix}$
and user provided matrix inequality Aℓq ≤ bℓ,
we have $$\begin{align*}\min_q(ACE(A\to
Y))&=\min(\{c^Tq\mid q\in\mathbb{R}^n,A_{\ell}q\leq
b_{\ell},A_eq=b_e,q\geq0_{n\times1}\})\\
&=\max(\{\begin{pmatrix}b_{\ell}^T&b_e^T\end{pmatrix}y\mid
y\in\mathbb{R}^{m_{\ell}+m_e},\begin{pmatrix}A_{\ell}^T&A_e^T\end{pmatrix}y\leq
c,\begin{pmatrix}I_{m_l\times m_l}&0_{m_l\times
m_e}\end{pmatrix}y\leq0_{m_l\times1}\}\\
&=\max(\{\begin{pmatrix}b_{\ell}^T&b_e^T\end{pmatrix}\bar{y}\mid\bar{y}\text{
is a vertex of
}\{y\in\mathbb{R}^{m_{\ell}+m_e}\mid\begin{pmatrix}A_{\ell}^T&A_e^T\\I_{m_l\times
m_l}&0_{m_l\times
m_e}\end{pmatrix}y\leq\begin{pmatrix}c\\0_{m_l\times1}\end{pmatrix}\}\})\end{align*}$$
The coefficient matrix and right hand side vector of the dual polytope
are constructed in the few lines of code below (where c0 := c).
The last step of vertex enumeration has previously been the major
computational bottleneck in causaloptim
. It has now been
replaced by cddlib
5, which has an implementation of the
Double Description Method (dd). Any convex polytope can be dually
described as either an intersection of half-planes (which is the form we
get our dual constraint space in) or as a minimal set of
vertices of which it is the convex hull (which is the form we
want it in) and the dd algorithm efficiently converts between
these two descriptions. cddlib
also uses exact rational
arithmetic, so we don’t have to worry about any numerical instability
issues, e.g. Also, it already interfaces with R6. The following lines
of code extract the vertices of the dual polytope and store them as rows
of a matrix.
library(rcdd)
hrep <- makeH(a1 = a1, b1 = b1)
vrep <- scdd(input = hrep)
matrix_of_vrep <- vrep$output
indices_of_vertices <- matrix_of_vrep[ , 1] == 0 & matrix_of_vrep[ , 2] == 1
vertices <- matrix_of_vrep[indices_of_vertices, -c(1, 2), drop = FALSE]
The rest is simply a matter of plugging them into the dual objective
function, evaluating the expression and presenting the results!
The first part of this is done by the following line of code, where the
utility function evaluate_objective
pretty much does what
you expect it to do (here (c1_num,p)
$=(\begin{pmatrix}b_{\ell}^T&1\end{pmatrix},p)$
separates the dual objective gradient into its numeric and symbolic
parts).
There are still a few places in causaloptim
that are
worthwhile to optimize, but with this major one resolved we looking
forward to researchers taking advantage of the generalized class of
problems that it can handle!
Happy bounding!