# Environments and formulae

In my two previous posts on non-standard evaluation, *Scoping rules and NSE* and *Overscoping and eval*, I explained:

- How R finds the bindings of variables by looking through chains of environments.
- How you can manipulate these environments to evaluate expressions in a non-standard way.
- How you can evaluate expressions where lists or data-frames overscope variables that would otherwise be found in environments.

I finished the last post with a model-fitting example, where the `lm`

function fits data to a formula, by saying that *this* particular example had more going on than you might expect. When we write something like

` lm(y ~ x, data = df)`

it looks as if the `lm`

function uses over-scoping to search for `x`

and `y`

in the `df`

data frame first, and if that fails, it looks in the calling scope. This is something we might implement using `eval`

and `parent.frame`

(or the more aptly named `rlang::caller_env`

). This is, in fact, not what is happening. It almost is, yes, but not quite. The `lm`

function does use overscoping to put `df`

in front of environments, but it does not look in the calling environment; it looks in the *formulas* environment.^{1}

```
x <- rnorm(5); y <- rnorm(5)
lm(y ~ x)
```

```
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 0.2791 0.2614
```

We will use this function to contrast an environment from a closure with one we create at the global scope:

```
closure_formula <- function() {
x <- rnorm(5); y <- rnorm(5)
y ~ x
}
```

The graph of environments looks like this:

Functions and formulae have associated environments. They do not have their own environments the way that function *calls* have, like the one that exists when we call `lm(y ~ x)`

, they are just references to other environments. The environment associated with a function is the one that will be the parent of the environment in a function call. It will be the one the function is defined in (unless you change it, which you can). The environment associated with a formula is also the one where the formula is created (unless you change it), but it is just used to find variables that are not over-scoped.

When we call `lm(y ~ x)`

we create the formula `y ~ x`

. Its environment is the global environment. When we call `lm`

it puts the formula in its argument, `formula`

, and then it fits a linear model to this formula, based on variables in `formula`

’s environment. Not its own and not its caller’s environment. Well, it is its caller’s environment since it is called from the global environment, but that is a coincidence. It doesn’t use the caller’s environment to look up variables.

The formula does not store the variables it contains. If we changed `x`

or `y`

between creating the formula and fitting the model, we would use the new values of the variables. We only use the formula environment to get a starting point for the search for variables.

Now, contrast it with this:

```
f <- closure_formula()
lm(f)
```

```
##
## Call:
## lm(formula = f)
##
## Coefficients:
## (Intercept) x
## 0.2621 -0.7383
```

The graph for this call looks like this, and I apologise that it is very busy. But there really is all this going on (and I haven’t even included edges for caller environments).

In the call to `closure_formula`

, we create a formula; the environment associated with this formula is the one we create in the function call. When we call `lm(f)`

the situation is similar to before. In `lm`

we will get variables from `formula`

. This time that will be the formula created in the function-call scope.

If we tried fitting the model in a function that has local variables `x`

and `y`

, we would see that these do not overrule the ones associated with the formula:

```
nested_fit <- function(f) {
x <- rnorm(5); y <- rnorm(5)
lm(f)
}
nested_fit(y ~ x)
```

```
##
## Call:
## lm(formula = f)
##
## Coefficients:
## (Intercept) x
## 0.2791 0.2614
```

`nested_fit(f)`

```
##
## Call:
## lm(formula = f)
##
## Coefficients:
## (Intercept) x
## 0.2621 -0.7383
```

If we take a formula defined in the global scope, and one defined in the closure

```
outer_f <- y ~ x
closure_f <- closure_formula()
```

we can see that the environment of the first is the global environment

`environment(outer_f)`

```
## <environment: R_GlobalEnv>
```

and the other is different. It is the environment of the function call, trust me.

`environment(closure_f)`

```
## <environment: 0x7f9f3dc6f590>
```

`ls(environment(closure_f))`

```
## [1] "x" "y"
```

This environment contains the variables `x`

and `y`

because we define them inside the function call.

If you don’t believe that the environment for the formula is the one from the function call—maybe you think that I am a liar—then we can check that explicitly by printing the environment inside the formula-generating function.

That would be a boring example compared to what we have already done, so I will add a twist to it. I will create the formula inside the function call but using a non-standard evaluation.

```
closure_formula_nse <- function(f) {
a <- rnorm(5); b <- rnorm(5)
print(environment())
eval(substitute(f))
}
```

The `eval(substitute(f))`

makes the argument `f`

into a quoted expression and then evaluates it. We lose the environment associated with the arguent^{2} but create a formula based on it inside the function call scope.

`closure_f <- closure_formula_nse(y ~ x)`

```
## <environment: 0x7f9f40634588>
```

`environment(closure_f)`

```
## <environment: 0x7f9f40634588>
```

`ls(environment(closure_f))`

```
## [1] "a" "b" "f"
```

## Exploiting the closure-ness of formulae

You can exploit that formulae carry environments with them. They capture the closure where they are defined in the same way that functions do, and you can get that environment using the `environment`

function.^{3}

The `rlang`

package exploits this to implement quasi-quotations for tidyeval.

You can exploit the formula+environment combination in other ways. For example, `purrr`

uses formulas to make lambda expressions. This functionality involves translating a formula with an associated environment, into a function plus that environment (i.e. a closure).

For example, take the formula `~ .x + .y + z`

. If you are familiar with `purrr`

, you know that this can be a lambda expression that takes two arguments, `.x`

and `.y`

, and it expects the variable `z`

to either be overscoped from a data frame or found in the caller environment. While `purrr`

handles how this works, you can go a level below that and make your own closure with this behaviour like this:

```
g <- rlang::as_closure(~ .x + .y + z)
z <- 3
g(1, 2)
```

```
## [1] 6
```

Violá.

It is when you combine creating functions in this way by non-standard evaluation with quosures you can do brilliant stuff. I don’t have time to write about that here, but my *Domain-Specific Languages in R* book has some examples.

I do not know if you can do all that `rlang`

does with formulae in pure R. The `rlang`

package does call into C several places. I haven’t played enough with this in base R to know how it would be done if it can. I use the `rlang`

package, and if you do that, it is relatively easy to manipulate formulae.

## Fitting models

It is one thing to exploit that formulas have scope to do all kinds of craziness, but what about using them for what they are designed for? Building statistical models. When we build a model with `lm(y ~ x)`

the formula specifies what the response variable is and what the explanatory variables are. With most models, you have a *model matrix* under the hood, and formulas are just ways of specifying how that model should look like.

`model.matrix(y ~ x - 1)`

```
## x
## 1 -0.1716845
## 2 -0.2635496
## 3 -0.3580057
## 4 1.1788116
## 5 -1.2989013
## attr(,"assign")
## [1] 1
```

`model.matrix(y ~ x)`

```
## (Intercept) x
## 1 1 -0.1716845
## 2 1 -0.2635496
## 3 1 -0.3580057
## 4 1 1.1788116
## 5 1 -1.2989013
## attr(,"assign")
## [1] 0 1
```

`model.matrix(y ~ I(x^2) + x)`

```
## (Intercept) I(x^2) x
## 1 1 0.02947556 -0.1716845
## 2 1 0.06945840 -0.2635496
## 3 1 0.12816806 -0.3580057
## 4 1 1.38959677 1.1788116
## 5 1 1.68714467 -1.2989013
## attr(,"assign")
## [1] 0 1 2
```

Different statistical models do different things with this model (or feature) matrix, but as long as you have a formula and some associated data, you can get it.

There are related functions that give you the data for the variables in a formula; not all the features, just the raw variables used:

`get_all_vars(y ~ I(x^2) + x)`

```
## y x
## 1 -0.4266105 -0.1716845
## 2 -1.4139657 -0.2635496
## 3 0.4878617 -0.3580057
## 4 1.4517492 1.1788116
## 5 1.0577282 -1.2989013
```

Or that includes the response variable, but otherwise looks like the model matrix:

`model.frame(y ~ I(x^2) + x)`

```
## y I(x^2) x
## 1 -0.4266105 0.029475.... -0.1716845
## 2 -1.4139657 0.069458.... -0.2635496
## 3 0.4878617 0.128168.... -0.3580057
## 4 1.4517492 1.389596.... 1.1788116
## 5 1.0577282 1.687144.... -1.2989013
```

The model frame is a more complex object than the model frame, but we use it to capture the full data in a formula.

To see how these functions work with scopes of formula, we can make some data in the global scope and in a closure and build corresponding formulae:

```
n <- 5
x <- rnorm(n) ; y <- rnorm(n)
make_closure_model <- function(x, y) y ~ I(x^2) + x
global_model <- y ~ I(x^2) + x
closure_model <- make_closure_model(rnorm(n), rnorm(n))
```

We can see that the variables in the two formulas are different (the `x`

and `y`

variables are from different scope):

`get_all_vars(global_model)`

```
## y x
## 1 -0.06023024 -0.4438439
## 2 1.71683148 -0.9133148
## 3 -0.55538683 0.5194221
## 4 1.03774602 0.3141842
## 5 -0.67897927 2.9846080
```

`get_all_vars(closure_model)`

```
## y x
## 1 0.37669580 -2.2014804
## 2 0.09373048 1.4216218
## 3 0.30818092 -0.4767920
## 4 -0.85408128 -0.1594740
## 5 -1.19739503 0.2522737
```

If we build a data frame

```
alpha <- 5 ; beta <- 0.5 ; gamma <- 2
d <- data.frame(y = alpha * x^2 + beta * x + gamma + rnorm(n, sd = 0.1),
x = x)
d
```

```
## y x
## 1 2.773563 -0.4438439
## 2 5.727659 -0.9133148
## 3 3.726645 0.5194221
## 4 2.699850 0.3141842
## 5 48.122075 2.9846080
```

we can overscope with that.

`get_all_vars(global_model, data = d)`

```
## y x
## 1 2.773563 -0.4438439
## 2 5.727659 -0.9133148
## 3 3.726645 0.5194221
## 4 2.699850 0.3141842
## 5 48.122075 2.9846080
```

`get_all_vars(closure_model, data = d)`

```
## y x
## 1 2.773563 -0.4438439
## 2 5.727659 -0.9133148
## 3 3.726645 0.5194221
## 4 2.699850 0.3141842
## 5 48.122075 2.9846080
```

Here, the `x`

variable is the global `x`

(that is how we constructed `d`

), but the `y`

column is the one we created for `d`

.

You can use these functions to create models from a formula. That way, the user can specify how the model matrix is constructed and then your model-code is more general. I have a more extended example in *Advance Object-Oriented Programming in R*, but here I will merely build a simple version of `lm`

.

When we fit a linear model, we construct the model matrix $X$, and then we fit it to the response variable $y$ using the expression `solve(t(X) %*% X, t(X) %*% y`

. I will refer you to a statistics textbook if you want to know why. We take the formula we get as an argument, construct the model frame to get the response variable and the model matrix for the model fit. When we have the fitted parameters, we wrap them up in an object that carries the information we might want to access later.

```
linmo <- function(form, data = NULL) {
mf <- model.frame(form, data)
y <- model.response(mf, "numeric")
X <- model.matrix(form, data)
weights <- solve(t(X) %*% X, t(X) %*% y)
structure(list(weights = weights,
formula = form,
data = data),
class = "linear_model")
}
```

I have given my fitted objects the class `"linear_model"`

, so I can write a print function for them like this:

```
print.linear_model <- function(x, ...) {
cat("linear model formula: ", deparse(x$formula), "\n\n")
print(x$weights)
cat("\n")
}
```

I will make a little more data for the models to work with. I kept the vectors short above so I could print them; I won’t do that now, but I will make them long enough to be interesting.

```
# Give the model a bit to work with...
n <- 100 ; x <- rnorm(n) ; y <- rnorm(n)
```

There is no relationship between `x`

and `y`

, so we expect the weights in the model to be around zero, which they are:

```
m1 <- linmo(global_model)
m1
```

```
## linear model formula: y ~ I(x^2) + x
##
## [,1]
## (Intercept) -0.14018596
## I(x^2) 0.13248999
## x -0.02285167
```

I had to use the `global_model`

formula here. If I had used the `closure_model`

formula, I would still use the vectors from the closure and not the new ones I just made.

We can try and put some structured data into a data frame and fit to that:

```
alpha <- 5 ; beta <- 0.5 ; gamma <- 2
d2 <- data.frame(y = alpha * x^2 + beta * x + gamma + rnorm(n, sd = 0.1),
x = x)
m2 <- linmo(closure_model, data = d2)
m2
```

```
## linear model formula: y ~ I(x^2) + x
##
## [,1]
## (Intercept) 1.9943900
## I(x^2) 5.0037470
## x 0.5143645
```

The correct intercept is 2, the linear component is 0.5, and the squared component is 5. Our estimated parameters are not far off. When we fitted this model, it doesn’t matter if we used the closure or the global formula. The `data`

parameter overscopes them anyway.

If we want to predict the response variable on new data, we need to make a new model matrix from the new data. R does *not* enjoy making a model matrix from data that doesn’t have all the variables we have in the formula. Not even if is just the response variable, which we don’t need for the model matrix anyway. So we need to remove the response variable from the formula first. Other than that, getting the model matrix from new data is simple and so is using the fitted weights to predict new targets.

```
predict.linear_model <- function(object, data, ...) {
if (missing(data)) {
data <- object$data
}
# This is just some magic needed to get an updated
# formula that does not include the response.
# If we do include the response, we also need it in the new data.
updated_terms <- delete.response(terms(object$formula))
X <- model.matrix(updated_terms, data)
X %*% object$weights
}
```

If we call `predict`

without new data, we get predictions from the data we fitted the model to:

`plot(y, predict(m1), xlab="Target", ylab="Prediction")`

`plot(d2$y, predict(m2), xlab="Target", ylab="Prediction")`

When we provide new data, we get the predictions from that.

```
new_data <- data.frame(x = rnorm(10))
new_target <- with(new_data, alpha * x^2 + beta * x + gamma)
new_data
```

```
## x
## 1 0.05232838
## 2 -0.35736623
## 3 -0.65235756
## 4 -0.15527569
## 5 -0.26683393
## 6 0.21692039
## 7 2.21927106
## 8 -0.73158219
## 9 0.59464670
## 10 1.26469203
```

`new_target`

```
## [1] 2.039855 2.459870 3.801673 2.042915 2.222585 2.343732 27.735456
## [8] 4.310271 4.065347 10.629576
```

```
new_prediction <- predict(m2, data = new_data)
plot(new_target, new_prediction, xlab="Target", ylab="Prediction")
```

Neat, right?

If you liked what you read, and want more like it, consider supporting me at Patreon.

- If you look at the code for
`lm`

you will see that it*does*evaluate am expression in the calling environment. This is to get a function to build a model frame. This is a detail and is not essential for the discussion here. I briefly discuss model matrices and model frames at the end of the post. Here, you will see that we can fit models just fine without ever looking in the caller environment.^{[return]} - Yes, arguments have a scope. I will return to that in a different post.
^{[return]} - You can also create formulae where you get to pick the environment if you use the
`as.formula`

function. It allows you to provide an environment as an argument.^{[return]}