An open-source solution by Linea Analytics



Linea Analytics provides a measurement platform as well as full-service measurement.

To provide an open-access forum where brands, agencies, publishers, and students can understand and test how the relationship between two or more variables works, we created Linea's open-source (frequentist) OLS library: linea.




This Page

This page covers the basics of how to set up the linea library to analyse a time series. We'll cover the prerequisites, installation, and a quick start: importing data, running models, and generating insights.




Prerequisites

To use this library, an understanding of the following is assumed:




Installation

The library can be installed from GitHub using devtools::install_github('linea-analytics/linea'). It will soon be available on CRAN as well. Once installed, you can check the installed version:

# devtools::install_github('linea-analytics/linea')
print(packageVersion("linea"))
## [1] '0.1.2'



Quick Start

The linea library works well with pipes. Used with dplyr and plotly, it can perform data analysis and visualization with elegant code. Let’s build a quick model to illustrate what linea can do.
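
As a preview, the whole workflow below can be written as a single pipe chain. The sketch below is illustrative only: it uses the ecommerce dataset imported in the next section and assumes that run_model() accepts the data as its first argument (as the data = argument in the calls further down suggests).

# end-to-end sketch: import the data, fit a model, and plot its decomposition
# (assumes library(linea) is loaded and run_model() takes the data as its first argument)
read_xcsv('https://raw.githubusercontent.com/paladinic/data/refs/heads/main/ecomm_data.csv') |>
  run_model(dv = 'ecommerce',
            ivs = c('christmas', 'black.friday', 'trend'),
            id_var = 'date') |>
  decomp_chart()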

Import Data

We start by importing linea, some other useful libraries, and some data.

# libraries
library(linea) # modelling
library(tidyverse) # data manipulation
library(plotly) # visualization
library(DT) # visualization

# fictitious ecommerce data
data_path = 'https://raw.githubusercontent.com/paladinic/data/refs/heads/main/ecomm_data.csv'

# importing flat file
data = read_xcsv(file = data_path)

# adding seasonality variables
data = data |> 
  get_seasonality(date_col_name = 'date',date_type = 'weekly starting')

# visualize data
data |> 
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))
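
The exact columns created by get_seasonality() aren't listed here, but the what_next() output further down shows dummies such as year_2020 and week_num_48. A quick, illustrative way to inspect them with dplyr (the year_ and week_num_ column-name prefixes are an assumption based on that output):

# peek at the seasonality dummies added by get_seasonality()
# (the year_ and week_num_ prefixes are inferred from the what_next() output below)
data |>
  select(date, starts_with('year_'), starts_with('week_num_')) |>
  head()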

Run Models

Now let's build a model to understand what drives changes in the ecommerce variable. We can start by selecting a few initial independent variables (i.e. christmas, black.friday, and trend).

model = run_model(data = data,
                  dv = 'ecommerce',
                  ivs = c('christmas','black.friday',"trend"),
                  id_var = 'date')
## [1] "actual:"
## [1] 261
## [1] "pred:"
## [1] 261
## [1] "resid:"
## [1] 261
## [1] "id_var_values:"
## [1] 261
## [1] "pool_var_values:"
## [1] 261
## [1] "pool_var_values:"
## [1] 261
## [1] "id_var_values_2:"
## [1] 261
## [1] "variable_decomp:"
## [1] 261
## [1] "pool_var_values:"
## [1] 261
## [1] "id_var_values_3:"
## [1] 261
## [1] "variable_decomp:"
## [1] 261
summary(model)
## 
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20462  -4664   -741   2988  54502 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44108.048    975.172  45.231  < 2e-16 ***
## christmas      294.052     27.219  10.803  < 2e-16 ***
## black.friday   317.203     40.339   7.863 1.03e-13 ***
## trend          130.445      6.308  20.680  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7666 on 257 degrees of freedom
## Multiple R-squared:  0.7323, Adjusted R-squared:  0.7291 
## F-statistic: 234.3 on 3 and 257 DF,  p-value: < 2.2e-16

Our next steps can be guided by functions like what_next(), which tests adding each of the remaining variables in our data to the current model. From the output below, it seems the variables offline_media and covid would improve the model most.

model |> 
  what_next()
## # A tibble: 84 × 6
##    variable      adj_R2 t_stat       coef   vif adj_R2_diff
##    <chr>          <dbl>  <dbl>      <dbl> <dbl>       <dbl>
##  1 offline_media  0.821  11.5        6.50  1.12      0.126 
##  2 year_2020      0.796   9.26   12081.    1.59      0.0921
##  3 covid          0.795   9.12     188.    1.98      0.0901
##  4 year_2019      0.762  -6.09   -7043.    1.07      0.0457
##  5 christmas_eve  0.759  -5.75 -168934.    1.65      0.0412
##  6 week_num_48    0.753   5.09   21389.    1.21      0.0328
##  7 christmas_day  0.750  -4.79 -135781.    1.48      0.0292
##  8 week_num_52    0.748  -4.51  -21135.    1.48      0.0260
##  9 promo          0.740   3.48       5.50  1.07      0.0154
## 10 year_2021      0.738  -3.11   -7264.    1.19      0.0121
## # ℹ 74 more rows
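
Since what_next() returns a regular tibble, standard dplyr verbs can be used to shortlist candidates. As an illustrative sketch, keep only variables with low collinearity and a clearly significant t-statistic, sorted by the gain in adjusted R-squared:

# shortlist candidates: low VIF, |t| above ~2, sorted by adjusted R-squared gain
model |>
  what_next() |>
  filter(vif < 2, abs(t_stat) > 2) |>
  arrange(desc(adj_R2_diff))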

Adding these variables to the model lifts the adjusted R-squared from roughly 73% to 86%.

model = run_model(data = data,
                  dv = 'ecommerce',
                  ivs = c('christmas','black.friday','trend','covid','offline_media'),
                  id_var = 'date')
## [1] "actual:"
## [1] 261
## [1] "pred:"
## [1] 261
## [1] "resid:"
## [1] 261
## [1] "id_var_values:"
## [1] 261
## [1] "pool_var_values:"
## [1] 261
## [1] "pool_var_values:"
## [1] 261
## [1] "id_var_values_2:"
## [1] 261
## [1] "variable_decomp:"
## [1] 261
## [1] "pool_var_values:"
## [1] 261
## [1] "id_var_values_3:"
## [1] 261
## [1] "variable_decomp:"
## [1] 261
summary(model)
## 
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21204.4  -3193.5   -874.6   2639.9  20486.7 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.819e+04  7.743e+02  62.228  < 2e-16 ***
## christmas     2.736e+02  1.978e+01  13.831  < 2e-16 ***
## black.friday  2.620e+02  2.969e+01   8.825  < 2e-16 ***
## trend         8.150e+01  6.378e+00  12.778  < 2e-16 ***
## covid         1.482e+02  1.737e+01   8.534 1.28e-15 ***
## offline_media 5.602e+00  5.093e-01  11.000  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5506 on 255 degrees of freedom
## Multiple R-squared:  0.863,  Adjusted R-squared:  0.8603 
## F-statistic: 321.2 on 5 and 255 DF,  p-value: < 2.2e-16

Generate Insights

Now that we have a decent model, we can start extracting insights from it, beginning with the contribution of each independent variable over time.

model |> 
  decomp_chart()
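
Assuming decomp_chart() returns a plotly object (plotly is loaded above for visualization), the chart can be tweaked with standard plotly functions, for example:

# illustrative only: add a title, assuming the decomposition chart is a plotly object
model |>
  decomp_chart() |>
  plotly::layout(title = 'ecommerce - variable decomposition')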

We can also visualize the relationships between our independent and dependent variables using response curves. From this we can see that, for example, when offline_media is 10, ecommerce increases by roughly 56 (in line with the coefficient of ~5.6 above). To capture non-linear relationships (i.e. response curves that aren't straight lines), see the Advanced Features page.

model |> 
  response_curves(x_min = 0)
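
Because this is a linear model, that reading can be checked directly against the offline_media coefficient from the summary above:

# contribution of offline_media at a value of 10, using its fitted coefficient (~5.6)
5.602 * 10
## [1] 56.02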



Next Steps

  1. The Getting Started page is a good place to start learning how to build linear models with linea.

  2. The Advanced Features page shows how to implement the features of linea that allow users to capture non-linear relationships.

  3. The Additional Features page illustrates all other functions of the library.



