On Model Parsimony

Here, in this blog post, I will illustrate how using a log-log transformation can sometimes save you a lot of time by turning a complex non-linear relationship into a very simple linear model. To illustrate this, I used the pressure dataset to model the impact of temperature in degrees Celsius (°C) on vapor pressure in millimeters of Mercury (mm Hg or mmm).

pressure %>%
  as_tibble() %>%
  rename(Tdeg = temperature, 
         Pmmm = pressure) -> pressure

For the sake of clarity and to avoid potential confusion with the name of the pressure dataset, I renamed the original temperature and pressure variables into Tdeg and Pmmm, respectively. Below I displayed the first five rows of the dataset to get a better idea about the nature of both variables.

kable(head(pressure, 5))
Tdeg Pmmm
0 0.0002
20 0.0012
40 0.0060
60 0.0300
80 0.0900

To vizualise the impact of temperature on vapor pressure, I plotted below the temperature-pressure relationship.

pressure %>%
  ggplot() +
  aes(x = Tdeg,
      y = Pmmm) +
  geom_point(shape = 19, 
             size = 4,
             color = rgb(0, 0, 1, 0.5)) +
  labs(x = "Temperature (°C)", 
       y = "Vapor pressure (mm Hg)") +
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 16, face = "bold")) -> p1

plot of chunk unnamed-chunk-2

The response variable here is vapor pressure Pmmm and the explanatory variable is temperature Tdeg. Because the response variable is quantitative and continuous, fitting a simple ordinary least square regression with the lm() function set to family = "Gaussian" sounds like the obvious thing to do.

fit1 <- lm(Pmmm ~ Tdeg, data = pressure) 
term estimate std.error statistic p.value
(Intercept) -147.89887 66.552888 -2.222276 0.0401241
Tdeg 1.51242 0.315846 4.788472 0.0001710

The simple linear model fit1 explains 54.92% of the total variance with both the intercept and slope parameter being significantly different from zero. However, plotting the predictions of this simple linear model leads to obvious errors, such as negative vapor pressure values!

p1 +
  geom_line(aes(x = Tdeg,
                y = predict(fit1, type = "response"))) -> p2

plot of chunk unnamed-chunk-4

We can complexify the model further, by using higher-order polynomial terms, to improve the fit and make predictions more reliable and closer to the observations.

fit2 <-  lm(Pmmm ~ poly(Tdeg, 2), data = pressure) 
fit3 <-  lm(Pmmm ~ poly(Tdeg, 3), data = pressure) 
fit4 <-  lm(Pmmm ~ poly(Tdeg, 4), data = pressure) 
term estimate std.error statistic p.value
(Intercept) 124.33671 1.234226 100.74059 0
poly(Tdeg, 4)1 722.17059 5.379868 134.23573 0
poly(Tdeg, 4)2 545.94688 5.379868 101.47959 0
poly(Tdeg, 4)3 280.65281 5.379868 52.16722 0
poly(Tdeg, 4)4 97.13691 5.379868 18.05563 0

For instance, the last model formula fit4 explains 99.94% of the total variance and seems to fit the data pretty well.

p1 +
  geom_line(aes(x = Tdeg,
                y = predict(fit2, type = "response")), 
            linetype = "dashed") +
  geom_line(aes(x = Tdeg,
                y = predict(fit3, type = "response")), 
            linetype = "longdash") +
  geom_line(aes(x = Tdeg,
                y = predict(fit4, type = "response")), 
            linetype = "solid") -> p3

plot of chunk unnamed-chunk-6

However this comes at the price of less degrees of freedom left in the model’s residuals.

term df sumsq meansq statistic p.value
poly(Tdeg, 4) 4 907789.9359 226947.48398 7841.191 0
Residuals 14 405.2018 28.94299 NA NA

There is only 14 degrees of freedom left in the most complex model formula fit4 as we spent 5 degrees of freedom: one for the intercept parameter and four for the coefficient estimates of the four terms of the fourth-oder polynomial relationship we used in model formula fit4. Yet, the exponential shape of the temperature-pressure relationship clearly suggests to use a log-log transformation as an alternative solution to improve model parsimony. Mind the zero value in the temperature data, which requires to add one before using the log transformation on temperature.

pressure %>%
  ggplot() +
  aes(x = log(Tdeg + 1),
      y = log(Pmmm)) +
  geom_point(shape = 19, 
             size = 4,
             color = rgb(0, 0, 1, 0.5)) +
  labs(x = "Log(T + 1)", 
       y = "Log(P)") +
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 16, face = "bold")) -> p4

plot of chunk unnamed-chunk-8

Almost there. The outlier point that deviates from a perfectly linear relationship is due to the original temperature value being 0°C. We can give our linear relationship some help by adding 50 to the log transformation instead of adding just one.

pressure %>%
  ggplot() +
  aes(x = log(Tdeg + 50),
      y = log(Pmmm)) +
  geom_point(shape = 19, 
             size = 4,
             color = rgb(0, 0, 1, 0.5)) +
  labs(x = "Log(T + 50)", 
       y = "Log(P)") +
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 16, face = "bold")) -> p5

plot of chunk unnamed-chunk-9

Now, we can fit a much simpler linear model by relating the log-transformed response variable to the log-transformed predictor variable and then extract coefficient estimates of the intercept and slope parameters for generating the predictions pon the log-log scale.

fit5 <-  lm(log(Pmmm) ~ log(Tdeg + 50), data = pressure) 
term estimate std.error statistic p.value
(Intercept) -38.725124 0.5145398 -75.26166 0
log(Tdeg + 50) 7.529105 0.0966811 77.87563 0
beta_0 <- tidy(fit5)[[1, 2]]
beta_1 <- tidy(fit5)[[2, 2]]

This times, the model formula used in fit5 explains 99.7% of the total variance while spending only 5 degrees of freedom: the so-called model parsimony.

term df sumsq meansq statistic p.value
log(Tdeg + 50) 1 380.383136 380.3831360 6064.614 0
Residuals 17 1.066269 0.0627217 NA NA

So, now a bit of maths to get the back transformation right. First, we can write the linear relationship as it looks like in the fit5 model formula.

\log(P) = \beta_0 + \beta_1 * log(T) + \varepsilon

Multiplying by the e^x function on both side of the equation and making some trivial simplifications allows to get back to the orginial scale linking vapor pressure to temperature.

P = e^{\beta_0 + \beta_1 * log(T) + \varepsilon}

P = e^{\beta_0}e^{log(T)^{\beta_1}}e^{\varepsilon}

P = e^{\beta_0}T^{\beta_1}e^{\varepsilon}

Below, the two plots displaying the predictions of the most parsimonious model: onto the log-log scale first and then onto the natural scale.

p5 +
  geom_line(aes(x = log(Tdeg + 50),
                y = predict(fit5, type = "response")), 
            linetype = "solid") -> p6
p1 +
  geom_line(aes(x = Tdeg,
                y = exp(beta_0)*(Tdeg + 50)^beta_1)) -> p7
grid.arrange(p6, p7, nrow = 2)

plot of chunk unnamed-chunk-14

2 thoughts on “On Model Parsimony

  1. Thanks, Jonathan! I always found it counter-intuitive to add these values to the log-transformation to avoid zero-issues, as this feels a bit arbitrary: adding either 1 or 50 can make a huge difference for the model, and it’s not clear to me if there are any rules to follow in that regard

    • Hi Jonas. Thanks a lot for your comment. I know, there are no clear rules or guidelines regarding these transformations. I’d say that most important is to look at the data first (always look at the raw data and explore the raw data without limitations to understand its true nature and its peculiarities) and sometimes it becomes very clear (like in this example) what transformation will really help. But in the end, it can make a huge difference and improve model fit drastically while keeping the model simple and parsimonious, which is the most important 🙂

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s