linea offers a few useful features to make modelling quicker, simpler and more accurate. This page covers a basic implementation of the features below:
- grouped decompositions, via a categories data.frame
- generated seasonality variables, via get_seasonality()
- next-variable suggestions, via what_next()
- Google-trends variables, via gt_f()
- pooled (panel) models, via run_model()'s pool_var argument
We will run simple models on some fictitious data sourced from Google trends. The aim of this exercise is to demonstrate the use of these features.
We start by importing linea and some other useful libraries.
library(linea) # modelling
library(tidyverse) # data manipulation
library(plotly) # visualization
library(DT) # visualization
The output of the linea::decomp_chart() function can be grouped based on a data.frame mapping variables to categories and, optionally, to specific operations (e.g. max and min). This helps simplify the visualization and focus on specific groups of variables. Let's start by looking at a non-aggregated variable decomposition.
First, we import some data…
data_path = 'https://raw.githubusercontent.com/paladinic/data/main/ecomm_data.csv'
data = read_xcsv(file = data_path)
data %>%
datatable(rownames = NULL,
options = list(scrollX = TRUE))
…and run a model.
dv = 'ecommerce'
ivs = c('christmas','covid','black.friday','offline_media')
model = data %>%
run_model(dv = dv,
ivs = ivs)
summary(model)
##
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -22738.0 -4713.4 -4.6 4550.7 21995.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.642e+04 5.486e+02 102.849 < 2e-16 ***
## christmas 2.913e+02 2.523e+01 11.546 < 2e-16 ***
## covid 3.014e+02 1.606e+01 18.775 < 2e-16 ***
## black.friday 2.796e+02 3.791e+01 7.374 2.29e-12 ***
## offline_media 5.538e+00 6.509e-01 8.507 1.51e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7038 on 256 degrees of freedom
## Multiple R-squared: 0.7752, Adjusted R-squared: 0.7717
## F-statistic: 220.8 on 4 and 256 DF, p-value: < 2.2e-16
Now we can plot our variable decomposition.
model %>%
decomp_chart(variable_decomp = T)
Now let's create a categories data.frame to group the 'christmas' and 'black.friday' variables together.
categories = data.frame(
variable = ivs, # variables from the model
category = c('seasonality','covid','seasonality','media')
)
model = run_model(
data = data,
dv = dv,
ivs = ivs,
categories = categories,
id_var = 'date' # specify horizontal axis
)
model %>%
decomp_chart(variable_decomp = F)
The 'christmas' and 'black.friday' variables are derived from Google trends and capture the impact of these events over time. As there is always some level of search for these keywords throughout the year, the series never reach zero. Using the calc column of the categories data.frame, we can tell linea to add this minimum level of search to the intercept, isolating the impact of the variable's variation.
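To make the calc behaviour concrete, here is a minimal sketch of the arithmetic (an assumption about the mechanics, illustrative rather than linea's internal code):
# illustrative sketch (assumption): how a 'min' calc re-allocates a
# variable's decomposed contribution between the variable and the base
coef_x <- 291.3      # e.g. the 'christmas' coefficient from the summary above
x <- data$christmas  # the raw Google-trends series

contribution <- coef_x * x            # standard decomposed contribution
base_shift <- coef_x * min(x)         # the constant floor of search
adjusted <- coef_x * (x - min(x))     # contribution under calc = 'min'
# base_shift is re-allocated to the intercept's (base) contribution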
categories = data.frame(
variable = ivs, # variables from the model
category = c('seasonality','covid','seasonality','media'),
calc = c('min','none','min','none')
)
model = run_model(
data = data,
dv = dv,
ivs = ivs,
categories = categories,
id_var = 'date' # specify horizontal axis
)
model %>%
decomp_chart(variable_decomp = F)
While the model above captures some of the variation of our ecommerce variable, there is still a lot left unexplained. Using a date column, of data-type date, we can generate seasonality variables with linea::get_seasonality(). Several columns will be added to the original data.frame. These are mainly dummy variables that capture some basic holidays as well as year, month, and week number. A trend variable is also added: a column that runs from 1 to n, where n is the number of rows.
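For intuition, two of the generated columns amount to something like the following hand-built versions (a sketch; the exact week-numbering convention linea uses is an assumption here, and the date column is assumed to be parsed as a Date):
# illustrative sketch (assumption): hand-built equivalents of two
# generated columns
n <- nrow(data)
trend <- 1:n                                # runs from 1 to n, one step per row
week_num <- lubridate::isoweek(data$date)   # ISO week number of each date
week_26 <- as.numeric(week_num == 26)       # dummy: 1 in week 26, else 0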
data = data %>%
get_seasonality(
date_col_name = 'date',
date_type = 'weekly ending')
data %>%
datatable(rownames = NULL,
options = list(scrollX = TRUE))
plot_ly(data) %>%
add_bars(y = ~ week_26,
x = ~ date,
name = 'week_26',
color = color_palette()[1]) %>%
add_bars(y = ~ new_years_eve,
x = ~ date,
name = 'new_years_eve',
color = color_palette()[2]) %>%
add_bars(y = ~ year_2019,
x = ~ date,
name = 'year_2019',
color = color_palette()[3]) %>%
layout(yaxis = list(title = 'value'),
title = 'Seasonality Variables',
plot_bgcolor = "rgba(0, 0, 0, 0)",
paper_bgcolor = "rgba(0, 0, 0, 0)")
These variables can be used in the model to capture the seasonal component of the dependent variable, as well as other effects such as trend.
ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec')
model = run_model(data = data,
dv = dv,
ivs = ivs,
id_var = 'date')
summary(model)
##
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -20899.1 -3149.9 -871.3 2667.1 20500.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.822e+04 7.763e+02 62.115 < 2e-16 ***
## christmas 2.546e+02 3.201e+01 7.955 5.93e-14 ***
## covid 1.482e+02 1.738e+01 8.525 1.38e-15 ***
## black.friday 2.713e+02 3.215e+01 8.438 2.47e-15 ***
## offline_media 5.609e+00 5.098e-01 11.003 < 2e-16 ***
## trend 8.142e+01 6.384e+00 12.753 < 2e-16 ***
## month_Dec 1.573e+03 2.083e+03 0.755 0.451
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5510 on 254 degrees of freedom
## Multiple R-squared: 0.8633, Adjusted R-squared: 0.8601
## F-statistic: 267.3 on 6 and 254 DF, p-value: < 2.2e-16
Thanks to the new variables, this model has a better R-squared (~86%) than the previous one. The impact of these variables can be seen clearly using the linea::decomp_chart() function.
model %>%
decomp_chart()
To simplify this visualization, it is worth using categories, as demonstrated previously.
categories = data.frame(
variable = ivs, # variables from the model
category = c('seasonality','covid','seasonality','media','Base','seasonality'),
calc = c('min','none','min','none','none','none')
)
model = run_model(data = data,
categories = categories,
dv = dv,
ivs = ivs,
id_var = 'date')
model %>% decomp_chart()
While the model is improving thanks to the seasonal variables introduced, selecting which variable would be a good fit for the model can be tricky and tedious. The linea::what_next() function helps with this.
df = model %>% what_next()
df %>%
datatable(rownames = NULL,
options = list(scrollX = TRUE))
As shown above, the linea::what_next() function generates a data.frame where each row represents a variable in our data, together with the impact that adding it would have on the model's fit.
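Conceptually, this is a brute-force search. A minimal sketch of the idea in base R (not linea's implementation) could look like this:
# illustrative sketch (assumption): refit the model with each remaining
# column and rank candidates by the fit achieved
candidates <- setdiff(colnames(data), c(dv, ivs, 'date'))
adj_r2 <- sapply(candidates, function(v) {
  fit <- lm(reformulate(c(ivs, v), response = dv), data = data)
  summary(fit)$adj.r.squared   # adjusted R-squared when adding variable v
})
head(sort(adj_r2, decreasing = TRUE))  # most promising candidates first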
We can now quickly see which variables are more likely to benefit the model.
ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec','year_2021','week_51')
categories = data.frame(
variable = ivs, # variables from the model
category = c('seasonality','covid','seasonality','media','Base','seasonality','covid','seasonality'),
calc = c('min','none','min','none','none','none','none','none')
)
model = run_model(data = data,
categories = categories,
dv = dv,
ivs = ivs,
id_var = 'date')
summary(model)
##
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -14353.5 -2856.5 -891.7 2910.0 20611.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.765e+04 6.658e+02 71.572 < 2e-16 ***
## christmas 3.112e+02 2.947e+01 10.560 < 2e-16 ***
## covid 1.910e+02 1.575e+01 12.132 < 2e-16 ***
## black.friday 2.483e+02 2.777e+01 8.940 < 2e-16 ***
## offline_media 4.756e+00 4.441e-01 10.709 < 2e-16 ***
## trend 8.429e+01 5.472e+00 15.404 < 2e-16 ***
## month_Dec 2.243e+03 1.782e+03 1.259 0.209
## year_2021 -1.210e+04 1.599e+03 -7.567 7.19e-13 ***
## week_51 -1.625e+04 2.612e+03 -6.219 2.07e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4707 on 252 degrees of freedom
## Multiple R-squared: 0.901, Adjusted R-squared: 0.8979
## F-statistic: 286.7 on 8 and 252 DF, p-value: < 2.2e-16
model %>% decomp_chart()
The model is getting better and better, with an adjusted R-squared almost reaching 90%. This doesn't mean it can't be improved further! Google Trends can be a very useful source of data, as Google search volumes are often correlated with events and can be used as a proxy for a missing variable. The linea::gt_f() function will return the original data.frame with the added Google-trends variable.
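For reference, a raw Google-trends series can also be fetched directly, for example with the gtrendsR package (a sketch under that assumption; linea::gt_f() may source and align the data differently):
# illustrative sketch (assumption): fetching an interest-over-time series
# with gtrendsR rather than linea::gt_f()
library(gtrendsR)

gt <- gtrends(keyword = 'prime day', time = 'all')
head(gt$interest_over_time[, c('date', 'hits', 'keyword')])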
data = data %>%
gt_f(kw = 'ramadan',append = T) %>%
gt_f(kw = 'trump',append = T) %>%
gt_f(kw = 'prime day',append = T) %>%
gt_f(kw = 'amazon workers',append = T)
data %>%
datatable(options = list(scrollX = T),rownames = NULL)
plot_ly(data) %>%
add_lines(y = ~ gtrends_ramadan,
x = ~ date,
name = 'gtrends_ramadan',
color = color_palette()[1]) %>%
add_lines(y = ~ gtrends_trump,
x = ~ date,
name = 'gtrends_trump',
color = color_palette()[2]) %>%
add_lines(y = ~ `gtrends_prime day`,
x = ~ date,
name = 'gtrends_prime day',
color = color_palette()[3]) %>%
layout(yaxis = list(title = 'value'),
title = 'Google Trend Variables',
plot_bgcolor = "rgba(0, 0, 0, 0)",
paper_bgcolor = "rgba(0, 0, 0, 0)")
Now that these variables are part of our data, we can use the linea::what_next() function to see if they can be added to the model.
df = model %>% what_next(data = data)
df %>%
datatable(rownames = NULL,
options = list(scrollX = TRUE))
As shown in the table above, the new variable, gtrends_prime day, seems like a sensible addition to the model.
ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec','year_2021','week_51','gtrends_prime day')
model = run_model(data = data,
categories = categories,
dv = dv,
ivs = ivs,
id_var = 'date')
summary(model)
##
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -14685.4 -2728.5 -665.1 2782.3 14956.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.731e+04 6.102e+02 77.528 < 2e-16 ***
## christmas 3.171e+02 2.694e+01 11.771 < 2e-16 ***
## covid 1.930e+02 1.439e+01 13.409 < 2e-16 ***
## black.friday 2.537e+02 2.539e+01 9.994 < 2e-16 ***
## offline_media 4.688e+00 4.059e-01 11.549 < 2e-16 ***
## trend 8.201e+01 5.010e+00 16.368 < 2e-16 ***
## month_Dec 2.501e+03 1.628e+03 1.536 0.126
## year_2021 -1.149e+04 1.463e+03 -7.854 1.18e-13 ***
## week_51 -1.651e+04 2.387e+03 -6.917 3.82e-11 ***
## gtrends_prime day 1.760e+02 2.468e+01 7.131 1.06e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4301 on 251 degrees of freedom
## Multiple R-squared: 0.9177, Adjusted R-squared: 0.9147
## F-statistic: 310.9 on 9 and 251 DF, p-value: < 2.2e-16
Using the variable decomposition, we can see the new variable fits that July peak nicely.
model %>% decomp_chart(variable_decomp = T)
The model now has an R-squared greater than 90% and can be presented in a more polished way using categories and other charting functions.
ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec','year_2021','week_51','gtrends_prime day')
categories = data.frame(
variable = ivs, # variables from the model
category = c('seasonality','covid','retail events','media','Base','seasonality','covid','seasonality','retail events'),
calc = c('min','none','min','none','none','none','none','none','none')
)
model = run_model(data = data,
categories = categories,
dv = dv,
ivs = ivs,
id_var = 'date')
model %>%
decomp_chart()
model %>% fit_chart()
Another available feature relates to panel data and pooled models. linea's pooling functionality divides the dependent variable by the mean of each group (pool, panel, region, etc.). When the coefficients are then multiplied by that same mean, we get a scaled coefficient for each group.
Let's start by looking at some pooled data. As we can see, the data below, generated again through Google trends, has a non-numeric variable, country.
data_path = 'https://raw.githubusercontent.com/paladinic/data/main/pooled%20data.csv'
data = read_xcsv(file = data_path)
data %>%
datatable(rownames = NULL,
options = list(scrollX = TRUE))
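To make the normalisation described above concrete, here is a minimal dplyr sketch of the mechanics (an assumption about what happens under the hood, not linea's actual code):
# illustrative sketch (assumption): divide the dependent variable by its
# pool mean, then re-scale a fitted coefficient back per pool
library(dplyr)

pool_means <- data %>%
  group_by(country) %>%
  summarise(pool_mean = mean(amazon))   # one mean per pool

normalised <- data %>%
  group_by(country) %>%
  mutate(amazon_norm = amazon / mean(amazon)) %>%  # divide DV by pool mean
  ungroup()
# a coefficient fitted on amazon_norm, multiplied by a pool's pool_mean,
# gives that pool's re-scaled coefficient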
To run a pooled model we must pass a pool_var, a character string of the pool variable name (i.e. country), to linea::run_model(). To enforce the normalisation, the normalise_by_pool parameter of the linea::run_model() function must be set to TRUE.
dv = 'amazon'
ivs = c('christmas','rakhi','diwali')
id_var = 'Week'
pool_var = 'country'
model = run_model(data = data,
dv = dv,
ivs = ivs,
id_var = id_var,
pool_var = pool_var,
normalise_by_pool = TRUE)
model %>%
decomp_chart()
In the chart above, the model's decomposition is simply aggregated across pools, while still using the re-scaled coefficients. The visualization functions, such as linea::decomp_chart(), allow you to filter the visualization based on the pool, as shown below.
model %>%
decomp_chart(pool = 'UK')
model %>%
decomp_chart(pool = 'India')
The Getting Started page is a good place to start learning how to build basic linear models with linea. The Advanced Features page shows how to implement the features of linea that allow users to capture non-linear relationships.