LINEA is an open-source R library aimed at simplifying and accelerating the development of linear models to understand the relationship between two or more variables.
Linear models are commonly used in a variety of contexts including natural and social sciences, and various business applications (e.g. marketing, finance).
This page covers the basic setup of the linea library to analyse a time-series and a quick tour of what linea can do.
The library can be installed from CRAN using
install.packages('linea')
or from GitHub using
devtools::install_github('paladinic/linea')
. Once installed, you can check the installation:
print(packageVersion("linea"))
## [1] '0.1.1'
The linea library works well with pipes. Used with dplyr and plotly, it can perform data analysis and visualization with elegant code. Let's build a quick model to illustrate what linea can do.
We start by importing linea, some other useful libraries, and some data.
# libraries
library(linea) # modelling
library(tidyverse) # data manipulation
library(plotly) # visualization
library(DT) # visualization
# fictitious ecommerce data
data_path = 'https://raw.githubusercontent.com/paladinic/data/main/ecomm_data.csv'
# importing flat file
data = read_xcsv(file = data_path)
# adding seasonality and Google trends variables
data = data %>%
  get_seasonality(date_col_name = 'date', date_type = 'weekly starting') %>%
  gt_f(kw = 'prime day', append = T)
# visualize data
data %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))
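Since plotly is already loaded, a quick line chart of the dependent variable can also be useful before modelling. The snippet below is a minimal sketch, assuming the date and ecommerce column names used in the run_model() call further down.
# quick time-series view of the dependent variable
# (column names assumed from the model call below)
data %>%
  plot_ly(x = ~date, y = ~ecommerce, type = 'scatter', mode = 'lines')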
Now let's build a model to understand what drives changes in the ecommerce variable. We can start by selecting a few initial independent variables (i.e. christmas, black.friday, trend, gtrends_prime day).
model = run_model(data = data,
                  dv = 'ecommerce',
                  ivs = c('christmas','black.friday','trend','gtrends_prime day'),
                  id_var = 'date')
summary(model)
##
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -20604 -4502 -405 2982 54637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43679.24 948.66 46.043 < 2e-16 ***
## christmas 300.86 26.38 11.405 < 2e-16 ***
## black.friday 320.44 39.03 8.209 1.10e-14 ***
## trend 129.16 6.11 21.139 < 2e-16 ***
## gtrends_prime day 182.86 42.42 4.311 2.32e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7417 on 256 degrees of freedom
## Multiple R-squared: 0.7504, Adjusted R-squared: 0.7465
## F-statistic: 192.4 on 4 and 256 DF, p-value: < 2.2e-16
Our next steps can be guided by functions like what_next(), which tests all other variables in our data. From the output below, it seems that the variables covid and offline_media would improve the model most.
model %>%
what_next()
## # A tibble: 81 × 5
## variable adj_R2 t_stat coef adj_R2_diff
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 offline_media 0.837 12.0 6.44 0.121
## 2 covid 0.815 9.78 191. 0.0917
## 3 year_2020 0.814 9.69 12076. 0.0904
## 4 year_2019 0.781 -6.40 -7115. 0.0458
## 5 christmas_eve 0.777 -6.05 -170926. 0.0414
## 6 week_48 0.771 5.30 21444. 0.0325
## 7 christmas_day 0.768 -5.02 -137025. 0.0293
## 8 week_52 0.766 -4.69 -21223. 0.0258
## 9 promo 0.758 3.66 5.59 0.0157
## 10 year_2017 0.754 2.93 3683. 0.00974
## # … with 71 more rows
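Since what_next() returns a tibble, standard dplyr verbs can be used to shortlist candidates. The snippet below is a minimal sketch; the cut-off of 2 on the t-statistic is an arbitrary illustration, not a linea default.
# shortlist candidate variables by improvement in adjusted R-squared
model %>%
  what_next() %>%
  filter(abs(t_stat) > 2) %>%   # arbitrary, illustrative threshold
  arrange(desc(adj_R2_diff))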
Adding these variables to the model brings the adjusted R-squared to ~88%.
model = run_model(data = data,
                  dv = 'ecommerce',
                  ivs = c('christmas','black.friday','trend','gtrends_prime day','covid','offline_media'),
                  id_var = 'date')
summary(model)
##
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -21541.6 -2909.5 -718.2 2661.9 16287.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.781e+04 7.247e+02 65.977 < 2e-16 ***
## christmas 2.812e+02 1.849e+01 15.208 < 2e-16 ***
## black.friday 2.668e+02 2.770e+01 9.629 < 2e-16 ***
## trend 7.930e+01 5.959e+00 13.309 < 2e-16 ***
## gtrends_prime day 1.840e+02 2.940e+01 6.257 1.66e-09 ***
## covid 1.522e+02 1.621e+01 9.392 < 2e-16 ***
## offline_media 5.507e+00 4.752e-01 11.588 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5135 on 254 degrees of freedom
## Multiple R-squared: 0.8813, Adjusted R-squared: 0.8785
## F-statistic: 314.2 on 6 and 254 DF, p-value: < 2.2e-16
Now that we have a decent model, we can start extracting insights from it, beginning with the contribution of each independent variable over time.
model %>%
decomp_chart()
We can also visualize the relationships between our independent and dependent variables using response curves. From these we can see that, for example, when offline_media is 10, ecommerce increases by ~55. To capture non-linear relationships (i.e. response curves that aren't straight lines), see the Advanced Features page.
model %>%
response_curves(x_min = 0)
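As a back-of-the-envelope check of that straight-line relationship, the contribution of a variable at a given value is simply its coefficient times that value, using the coefficient from the summary() output above.
# straight-line contribution of offline_media at a value of 10
offline_media_coef = 5.507   # taken from the model summary above
offline_media_coef * 10
## [1] 55.07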
The Getting Started page is a good place to start learning how to build linear models with linea.
The Advanced Features page shows how to implement the features of linea that allow users to capture non-linear relationships.
The Additional Features page illustrates all other functions of the library.
LINEA is being continuously maintained and improved with several features and products under development.