Here, in this blog post, I will illustrate how using a log-log transformation can sometimes save you a lot of time by turning a complex non-linear relationship into a very simple linear model. To illustrate this, I used the `pressure`

dataset to model the impact of temperature in degrees Celsius (°C) on vapor pressure in millimeters of Mercury (mm Hg or mmm).

```
library(tidyverse)
data(pressure)
pressure %>%
as_tibble() %>%
rename(Tdeg = temperature,
Pmmm = pressure) -> pressure
```

For the sake of clarity and to avoid potential confusion with the name of the `pressure`

dataset, I renamed the original `temperature`

and `pressure variables`

into `Tdeg`

and `Pmmm`

, respectively. Below I displayed the first five rows of the dataset to get a better idea about the nature of both variables.

```
library(knitr)
kable(head(pressure, 5))
```

Tdeg | Pmmm |
---|---|

0 | 0.0002 |

20 | 0.0012 |

40 | 0.0060 |

60 | 0.0300 |

80 | 0.0900 |

To vizualise the impact of temperature on vapor pressure, I plotted below the temperature-pressure relationship.

```
pressure %>%
ggplot() +
aes(x = Tdeg,
y = Pmmm) +
geom_point(shape = 19,
size = 4,
color = rgb(0, 0, 1, 0.5)) +
labs(x = "Temperature (°C)",
y = "Vapor pressure (mm Hg)") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 16, face = "bold")) -> p1
p1
```

The response variable here is vapor pressure `Pmmm`

and the explanatory variable is temperature `Tdeg`

. Because the response variable is quantitative and continuous, fitting a simple ordinary least square regression with the `lm()`

function set to `family = "Gaussian"`

sounds like the obvious thing to do.

```
library(broom)
fit1 <- lm(Pmmm ~ Tdeg, data = pressure)
kable(tidy(fit1))
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | -147.89887 | 66.552888 | -2.222276 | 0.0401241 |

Tdeg | 1.51242 | 0.315846 | 4.788472 | 0.0001710 |

The simple linear model `fit1`

explains **54.92**% of the total variance with both the intercept and slope parameter being significantly different from zero. However, plotting the predictions of this simple linear model leads to obvious errors, such as negative vapor pressure values!

```
p1 +
geom_line(aes(x = Tdeg,
y = predict(fit1, type = "response"))) -> p2
p2
```

We can complexify the model further, by using higher-order polynomial terms, to improve the fit and make predictions more reliable and closer to the observations.

```
fit2 <- lm(Pmmm ~ poly(Tdeg, 2), data = pressure)
fit3 <- lm(Pmmm ~ poly(Tdeg, 3), data = pressure)
fit4 <- lm(Pmmm ~ poly(Tdeg, 4), data = pressure)
kable(tidy(fit4))
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | 124.33671 | 1.234226 | 100.74059 | 0 |

poly(Tdeg, 4)1 | 722.17059 | 5.379868 | 134.23573 | 0 |

poly(Tdeg, 4)2 | 545.94688 | 5.379868 | 101.47959 | 0 |

poly(Tdeg, 4)3 | 280.65281 | 5.379868 | 52.16722 | 0 |

poly(Tdeg, 4)4 | 97.13691 | 5.379868 | 18.05563 | 0 |

For instance, the last model formula `fit4`

explains **99.94**% of the total variance and seems to fit the data pretty well.

```
p1 +
geom_line(aes(x = Tdeg,
y = predict(fit2, type = "response")),
linetype = "dashed") +
geom_line(aes(x = Tdeg,
y = predict(fit3, type = "response")),
linetype = "longdash") +
geom_line(aes(x = Tdeg,
y = predict(fit4, type = "response")),
linetype = "solid") -> p3
p3
```

However this comes at the price of less degrees of freedom left in the model’s residuals.

```
kable(tidy(anova(fit4)))
```

term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|

poly(Tdeg, 4) | 4 | 907789.9359 | 226947.48398 | 7841.191 | 0 |

Residuals | 14 | 405.2018 | 28.94299 | NA | NA |

There is only **14** degrees of freedom left in the most complex model formula `fit4`

as we spent **5** degrees of freedom: one for the intercept parameter and four for the coefficient estimates of the four terms of the fourth-oder polynomial relationship we used in model formula `fit4`

. Yet, the exponential shape of the temperature-pressure relationship clearly suggests to use a log-log transformation as an alternative solution to improve model parsimony. Mind the zero value in the temperature data, which requires to add one before using the log transformation on temperature.

```
pressure %>%
ggplot() +
aes(x = log(Tdeg + 1),
y = log(Pmmm)) +
geom_point(shape = 19,
size = 4,
color = rgb(0, 0, 1, 0.5)) +
labs(x = "Log(T + 1)",
y = "Log(P)") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 16, face = "bold")) -> p4
p4
```

Almost there. The outlier point that deviates from a perfectly linear relationship is due to the original temperature value being 0°C. We can give our linear relationship some help by adding 50 to the log transformation instead of adding just one.

```
pressure %>%
ggplot() +
aes(x = log(Tdeg + 50),
y = log(Pmmm)) +
geom_point(shape = 19,
size = 4,
color = rgb(0, 0, 1, 0.5)) +
labs(x = "Log(T + 50)",
y = "Log(P)") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 16, face = "bold")) -> p5
p5
```

Now, we can fit a much simpler linear model by relating the log-transformed response variable to the log-transformed predictor variable and then extract coefficient estimates of the intercept and slope parameters for generating the predictions pon the log-log scale.

```
fit5 <- lm(log(Pmmm) ~ log(Tdeg + 50), data = pressure)
kable(tidy(fit5))
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | -38.725124 | 0.5145398 | -75.26166 | 0 |

log(Tdeg + 50) | 7.529105 | 0.0966811 | 77.87563 | 0 |

```
beta_0 <- tidy(fit5)[[1, 2]]
beta_1 <- tidy(fit5)[[2, 2]]
```

This times, the model formula used in `fit5`

explains **99.7**% of the total variance while spending only **5** degrees of freedom: the so-called model parsimony.

```
kable(tidy(anova(fit5)))
```

term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|

log(Tdeg + 50) | 1 | 380.383136 | 380.3831360 | 6064.614 | 0 |

Residuals | 17 | 1.066269 | 0.0627217 | NA | NA |

So, now a bit of maths to get the back transformation right. First, we can write the linear relationship as it looks like in the `fit5`

model formula.

Multiplying by the function on both side of the equation and making some trivial simplifications allows to get back to the orginial scale linking vapor pressure to temperature.

Below, the two plots displaying the predictions of the most parsimonious model: onto the log-log scale first and then onto the natural scale.

```
p5 +
geom_line(aes(x = log(Tdeg + 50),
y = predict(fit5, type = "response")),
linetype = "solid") -> p6
```

```
p1 +
geom_line(aes(x = Tdeg,
y = exp(beta_0)*(Tdeg + 50)^beta_1)) -> p7
```

```
library(gridExtra)
grid.arrange(p6, p7, nrow = 2)
```

Thanks, Jonathan! I always found it counter-intuitive to add these values to the log-transformation to avoid zero-issues, as this feels a bit arbitrary: adding either 1 or 50 can make a huge difference for the model, and it’s not clear to me if there are any rules to follow in that regard

Hi Jonas. Thanks a lot for your comment. I know, there are no clear rules or guidelines regarding these transformations. I’d say that most important is to look at the data first (always look at the raw data and explore the raw data without limitations to understand its true nature and its peculiarities) and sometimes it becomes very clear (like in this example) what transformation will really help. But in the end, it can make a huge difference and improve model fit drastically while keeping the model simple and parsimonious, which is the most important 🙂