The Optimizer Design

Note: We will not cover the optimizers themselves; here, we recommend reading the following articles:

GLMNET:

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–20. https://doi.org/10.18637/jss.v033.i01
Yuan, G.-X., Ho, C.-H., & Lin, C.-J. (2012). An improved GLMNET for l1-regularized logistic regression. The Journal of Machine Learning Research, 13, 1999–2030. https://doi.org/10.1145/2020408.2020421

ISTA:

Beck, A., & Teboulle, M. (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1), 183–202. https://doi.org/10.1137/080716542
Gong, P., Zhang, C., Lu, Z., Huang, J., & Ye, J. (2013). A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. Proceedings of the 30th International Conference on Machine Learning, 28(2)(2), 37–45.
Parikh, N., & Boyd, S. (2013). Proximal Algorithms. Foundations and Trends in Optimization, 1(3), 123–231.

The Fitting Function

The objective of the optimizers is to minimize the fitting function. In both, glmnet and ista we assume that this fitting function is given by a differentiable part and a non-differentiable part. To be more specific, the fitting function is given by:

$$f(\pmb{\theta}) = l(\pmb\theta) + s(\pmb\theta,\pmb{t}_s) + p(\pmb\theta,\pmb{t}_p)$$

where $l(\pmb\theta)$ is the unregularized fitting function (e.g., the -2log-likelihood) in the SEMs implemented in lessSEM. $s(\pmb\theta,\pmb{t}_s)$ is a differentiable (smooth) penalty function (e.g., a ridge penalty) and $p(\pmb\theta,\pmb{t}_p)$ is a non-differentiable penalty function (e.g., a lasso penalty). All three functions take the parameter estimates $\pmb\theta$ as input and return a single value as output. The penalty functions $s(\pmb\theta,\pmb{t}_s)$ and $p(\pmb\theta,\pmb{t}_p)$ additionally expect vectors with tuning parameters--$\pmb{t}_s$ in case of the smooth penalty and $\pmb{t}_p$ in case of the non-differentiable penalty. Thus, in theory both penalty functions can use different tuning parameters.

A prototypical example for fitting functions of the form defined above is the elastic net. Here,

$$f(\pmb{\theta}) = l(\pmb\theta) + (1-\alpha)\lambda \sum_j\theta_j^2 + \alpha\lambda\sum_j| \theta_j|$$

The elastic net is a combination of a ridge penalty $\lambda \sum_j\theta_j^2$ and a lasso penalty $\lambda\sum_j| \theta_j|$. Note that in this case, both penalties take in the same tuning parameters ($\lambda$ and $\alpha$).

With this in mind, we can now have a closer look at the optimization functions. We will start with glmnet (see function glmnet in the file inst/include/glmnet_class.hpp). This function is called as follows:

inline less::fitResults glmnet(model& model_, 
                                  numericVector startingValuesRcpp,
                                  penaltyLASSOGlmnet& penalty_,
                                  penaltyRidgeGlmnet& smoothPenalty_, 
                                  const tuningParametersEnetGlmnet tuningParameters,
                                  const controlGLMNET& control_ = controlGlmnetDefault())
{...}

The first argument is a model. This model has to be created by you and must inherit from the less::model class (see lessLM for an example). Most importantly, this model must provide means to get the value of the first part of the fitting function: $l(\pmb\theta)$. It must also provide means to compute the gradients of your fitting function.

The next argument are the starting values which are given as an numericVector. If you are using R, this is an Rcpp::NumericVector, in C++ a less::numericVector. This type is chosen because it can have labels and in its current implementation lesstimate expects that you give all your startingValues names.

Now come the actual penalty functions. The first one is the non-differentiable penalty: the lasso $p(\pmb\theta)$. This must be an object of class penaltyLASSOGlmnet which can be created with less::penaltyLASSOGlmnet()'. Next comes the differentiable ridge penalty which must be of class penaltyRidgeGlmnet and can be created with less::penaltyRidgeGlmnet.

Now, the tuning parameters deviate a bit from what we discussed before. We said that the differentiable and the non-differentiable penalty functions will each take their own vector of tuning parameters ($\pmb t_s$ and $\pmb t_p$ respectively). Note, however, that the elastic net uses the same tuning parameters for both, ridge and lasso penalty. In the glmnet optimizer we therefore decided to combine both of these into one: the tuningParametersEnetGlmnet struct. tuningParametersEnetGlmnet has three slots: alpha (to set $\alpha$), lambda (to set $\lambda$) and a slot called weights. Now, the weights allow us to switch off the penalty for selected parameters. For instance, in a linear regression we would not want to penalize the intercept. To this end, the fitting function that is actually implemented internally is given by

$$f(\pmb{\theta}) = l(\pmb\theta) + (1-\alpha)\lambda \sum_j\omega_j \theta_j^2 + \alpha\lambda\sum_j\omega_j| \theta_j|$$

If we set $\omega_j = 0$ for a specific parameter, this parameter is unregularized. Setting $\omega_j = 1$ means that parameter $j$ is penalized. $\omega_j$ can also take any other value (e.g., $\omega_j = .4123$) which allows for penalties such as the adaptive lasso. Importantly, the weights vector must be of the same length as startingValuesRcpp. That is, each parameter must have a weight associated with it in the weights vector.

Finally, there is the controlGLMNET argument. This argument lets us fine tune the optimizer. To use the control argument, create a new control object in C++ as follows:

less::controlGLMNET myControlObject = less::controlGlmnetDefault();

Now, you can tweak each element of myControlObject; e.g.,

myControlObject.maxIterOut = 100 // only 100 outer iterations

If you take a closer look at how the two penalty functions are handled within glmnet, you will realize that we basically absorb the differentiable penalty in the normal fitting function. That is, only the non-differentiable part gets a special treatment, while the differentiable part is simply added to the differntiable $l(\pmb\theta)$. To give an example:

gradients_kMinus1 = model_.gradients(parameters_kMinus1, parameterLabels) +
      smoothPenalty_.getGradients(parameters_kMinus1, parameterLabels, tuningParameters); // ridge part

Note how the gradients of $l(\pmb\theta)$ and $s(\pmb\theta,\pmb{t}_s)$ are combined into one.

The ista variants

Besides the glmnet optimizer, we also implemented variants of ista. These are based on the publications mentioned above. The fitting function is again given by

$$f(\pmb{\theta}) = l(\pmb\theta) + s(\pmb\theta,\pmb{t}_s) + p(\pmb\theta,\pmb{t}_p)$$

In the following, we will build a lot on what we've already discussed regarding the glmnet optimizer above.

First, let's have a look at the ista function;

template<typename T, typename U> // T is the type of the tuning parameters
inline less::fitResults ista(
    model& model_, 
    numericVector startingValuesRcpp,
    proximalOperator<T>& proximalOperator_, // proximalOperator takes the tuning parameters
    // as input -> <T>
    penalty<T>& penalty_, // penalty takes the tuning parameters
    smoothPenalty<U>& smoothPenalty_, // smoothPenalty takes the smooth tuning parameters
    // as input -> <U>
    const T& tuningParameters, // tuning parameters are of type T
    const U& smoothTuningParameters, // tuning parameters are of type U
    const control& control_ = controlDefault()
)
{...}

This function is more complicated that the glmnet function discussed above. But, let's start with the part that stays the same: First, we still have to pass our model to the function. This model must have a fit and a gradients function which return the fit and the gradient respectively. Next, the function again expects us to provide starting values as an numericVector. We will skip the proximalOperator and the penalty for the moment (these relate to $p(\pmb\theta,\pmb t_p)$) and concentrate on the smoothPenalty first. This is the function $s(\pmb\theta,\pmb t_s)$. In our previous example, we looked at the elastic net penalty, where the smooth penalty is a ridge penalty function. Now, in the ista optimizer, we can also pass in the ridge penalty as a smooth penalty. In fact, this is exactly what we do when we use ista to fit the elastic net. This smooth penalty has the tuning parameters $\pmb t_s$ which are called smoothTuningParameters in the function call. In case of the elastic net, these would again be $\alpha$ and $\lambda$ (and the weights vector). Similar to the glmnet procedure outlined above, the differentiable penalty $s(\pmb\theta,\pmb t_s)$ is simply absorbed in the unregularized fitting function $l(\pmb\theta)$.

Now, for the non-differentiable part $p(\pmb\theta,\pmb p_s)$, the ista optimizer uses so-called proximal operators. The details are beyond the scope here, but Parikh et al. (2013) provide a very good overview of these algorithms. To make things work with ista, we must pass such a proximal operator to the optimizer. Within lesstimate, we have prepared a few of these proximal operators for well-known penalty functions. Additionally, we need a function which returns the actual penalty value. This is the penalty object in the function call. Finally, the penalty $p(\pmb\theta,\pmb t_p)$ gets its tuning parameters $\pmb t_p$. This is the tuningParameters object above. To make things more concrete, let's look at the elastic net again. In this case, penalty would be of class less::penaltyLASSO and the proximal operator of type less::proximalOperatorLasso. The tuning parameters would again be $\alpha$ and $\lambda$ (and the weights vector).

Note that many of the penalty function implemented in lesstimate are typically not combined with a smooth penalty (e.g., scad, mcp, ...). In this case, you must still pass a smoothPenalty object to ista. To this end, we created the less::noSmoothPenalty class which can be used instead of a smooth penalty like ridge.

Finally, there is the control argument. This argument lets us fine tune the optimizer. To use the control argument, create a new control object as follows:

less::control myControlObject = less::controlDefault();

Now, you can tweak each element of myControlObject; e.g.,

myControlObject.L0 = .9