Here, in this blog post, I will illustrate how using a log-log transformation can sometimes save you a lot of time by turning a complex non-linear relationship into a very simple linear model. To illustrate this, I used the pressure
dataset to model the impact of temperature in degrees Celsius (°C) on vapor pressure in millimeters of Mercury (mm Hg or mmm).
library(tidyverse)
data(pressure)
pressure %>%
as_tibble() %>%
rename(Tdeg = temperature,
Pmmm = pressure) -> pressure
For the sake of clarity and to avoid potential confusion with the name of the pressure
dataset, I renamed the original temperature
and pressure variables
into Tdeg
and Pmmm
, respectively. Below I displayed the first five rows of the dataset to get a better idea about the nature of both variables.
library(knitr)
kable(head(pressure, 5))
Tdeg | Pmmm |
---|---|
0 | 0.0002 |
20 | 0.0012 |
40 | 0.0060 |
60 | 0.0300 |
80 | 0.0900 |
To vizualise the impact of temperature on vapor pressure, I plotted below the temperature-pressure relationship.
pressure %>%
ggplot() +
aes(x = Tdeg,
y = Pmmm) +
geom_point(shape = 19,
size = 4,
color = rgb(0, 0, 1, 0.5)) +
labs(x = "Temperature (°C)",
y = "Vapor pressure (mm Hg)") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 16, face = "bold")) -> p1
p1
The response variable here is vapor pressure Pmmm
and the explanatory variable is temperature Tdeg
. Because the response variable is quantitative and continuous, fitting a simple ordinary least square regression with the lm()
function set to family = "Gaussian"
sounds like the obvious thing to do.
library(broom)
fit1 <- lm(Pmmm ~ Tdeg, data = pressure)
kable(tidy(fit1))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -147.89887 | 66.552888 | -2.222276 | 0.0401241 |
Tdeg | 1.51242 | 0.315846 | 4.788472 | 0.0001710 |
The simple linear model fit1
explains 54.92% of the total variance with both the intercept and slope parameter being significantly different from zero. However, plotting the predictions of this simple linear model leads to obvious errors, such as negative vapor pressure values!
p1 +
geom_line(aes(x = Tdeg,
y = predict(fit1, type = "response"))) -> p2
p2
We can complexify the model further, by using higher-order polynomial terms, to improve the fit and make predictions more reliable and closer to the observations.
fit2 <- lm(Pmmm ~ poly(Tdeg, 2), data = pressure)
fit3 <- lm(Pmmm ~ poly(Tdeg, 3), data = pressure)
fit4 <- lm(Pmmm ~ poly(Tdeg, 4), data = pressure)
kable(tidy(fit4))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 124.33671 | 1.234226 | 100.74059 | 0 |
poly(Tdeg, 4)1 | 722.17059 | 5.379868 | 134.23573 | 0 |
poly(Tdeg, 4)2 | 545.94688 | 5.379868 | 101.47959 | 0 |
poly(Tdeg, 4)3 | 280.65281 | 5.379868 | 52.16722 | 0 |
poly(Tdeg, 4)4 | 97.13691 | 5.379868 | 18.05563 | 0 |
For instance, the last model formula fit4
explains 99.94% of the total variance and seems to fit the data pretty well.
p1 +
geom_line(aes(x = Tdeg,
y = predict(fit2, type = "response")),
linetype = "dashed") +
geom_line(aes(x = Tdeg,
y = predict(fit3, type = "response")),
linetype = "longdash") +
geom_line(aes(x = Tdeg,
y = predict(fit4, type = "response")),
linetype = "solid") -> p3
p3
However this comes at the price of less degrees of freedom left in the model’s residuals.
kable(tidy(anova(fit4)))
term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|
poly(Tdeg, 4) | 4 | 907789.9359 | 226947.48398 | 7841.191 | 0 |
Residuals | 14 | 405.2018 | 28.94299 | NA | NA |
There is only 14 degrees of freedom left in the most complex model formula fit4
as we spent 5 degrees of freedom: one for the intercept parameter and four for the coefficient estimates of the four terms of the fourth-oder polynomial relationship we used in model formula fit4
. Yet, the exponential shape of the temperature-pressure relationship clearly suggests to use a log-log transformation as an alternative solution to improve model parsimony. Mind the zero value in the temperature data, which requires to add one before using the log transformation on temperature.
pressure %>%
ggplot() +
aes(x = log(Tdeg + 1),
y = log(Pmmm)) +
geom_point(shape = 19,
size = 4,
color = rgb(0, 0, 1, 0.5)) +
labs(x = "Log(T + 1)",
y = "Log(P)") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 16, face = "bold")) -> p4
p4
Almost there. The outlier point that deviates from a perfectly linear relationship is due to the original temperature value being 0°C. We can give our linear relationship some help by adding 50 to the log transformation instead of adding just one.
pressure %>%
ggplot() +
aes(x = log(Tdeg + 50),
y = log(Pmmm)) +
geom_point(shape = 19,
size = 4,
color = rgb(0, 0, 1, 0.5)) +
labs(x = "Log(T + 50)",
y = "Log(P)") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 16, face = "bold")) -> p5
p5
Now, we can fit a much simpler linear model by relating the log-transformed response variable to the log-transformed predictor variable and then extract coefficient estimates of the intercept and slope parameters for generating the predictions pon the log-log scale.
fit5 <- lm(log(Pmmm) ~ log(Tdeg + 50), data = pressure)
kable(tidy(fit5))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -38.725124 | 0.5145398 | -75.26166 | 0 |
log(Tdeg + 50) | 7.529105 | 0.0966811 | 77.87563 | 0 |
beta_0 <- tidy(fit5)[[1, 2]]
beta_1 <- tidy(fit5)[[2, 2]]
This times, the model formula used in fit5
explains 99.7% of the total variance while spending only 5 degrees of freedom: the so-called model parsimony.
kable(tidy(anova(fit5)))
term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|
log(Tdeg + 50) | 1 | 380.383136 | 380.3831360 | 6064.614 | 0 |
Residuals | 17 | 1.066269 | 0.0627217 | NA | NA |
So, now a bit of maths to get the back transformation right. First, we can write the linear relationship as it looks like in the fit5
model formula.
Multiplying by the function on both side of the equation and making some trivial simplifications allows to get back to the orginial scale linking vapor pressure to temperature.
Below, the two plots displaying the predictions of the most parsimonious model: onto the log-log scale first and then onto the natural scale.
p5 +
geom_line(aes(x = log(Tdeg + 50),
y = predict(fit5, type = "response")),
linetype = "solid") -> p6
p1 +
geom_line(aes(x = Tdeg,
y = exp(beta_0)*(Tdeg + 50)^beta_1)) -> p7
library(gridExtra)
grid.arrange(p6, p7, nrow = 2)