Logistic Regression Using PyTorch with L-BFGS – Visual Studio Magazine



The data science lab

Logistic regression using PyTorch with L-BFGS

Dr. James McCaffrey of Microsoft Research demonstrates the application of the L-BFGS optimization algorithm to the ML logistic regression technique for binary classification – by predicting one of two possible discrete values.

Logistic regression is one of many machine learning techniques for binary classification – predicting one of two possible discrete values. One example is predicting whether an inpatient is male or female based on variables such as age, blood pressure, etc. There are dozens of code libraries and tools that can build a logistic regression prediction model, including Keras, scikit-learn, Weka, and PyTorch. When training a logistic regression model, many optimization algorithms can be used, such as stochastic gradient descent (SGD), iterated Newton-Raphson, Nelder-Mead, and L-BFGS.

This article explains how to create a logistic regression binary classification model using the PyTorch code library with L-BFGS optimization. A good way to see where this article is going is to take a look at the screenshot of a demo program in Figure 1.

Figure 1: Predicting Patient Sex Using Logistic Regression
[Click on image for larger view.] Figure 1: Predict patient sex using logistic regression

The objective of the demo is to predict the sex (0 = male, 1 = female) of an inpatient based on age, country of residence, blood monocyte count and history of hospitalization. The demo reads a 200-item training dataset and 40-item test data set in memory, then uses the training data to build a logistic regression model using the L- algorithm. BFGS.

After training, the demo calculates the model prediction accuracy on training data (84.50% = 169 out of 200 correct) and test data (72.50% = 29 out of 40 correct). The demonstration ends by making a prediction for a new, never-seen-before patient data item (age = 30, county = “carson”, monocytes = 0.4000, hospital history = “moderate”). The calculated pseudo-probability output is 0.0765 and since this value is less than 0.5, the prediction is class 0 = man.

This article assumes that you have intermediate or above knowledge of a C family programming language, preferably Python, and a basic knowledge of the PyTorch code library. The full source code for the demo program is presented in this article and is also available in the accompanying download file. Training and test data is included as comments in the source program file. All normal error checking has been removed to keep the main ideas as clear as possible.

To run the demo program, you must have Python and PyTorch installed on your machine. The demo programs were developed on Windows 10 using the 64-bit Anaconda 2020.02 distribution (which contains Python 3.7.6) and PyTorch version 1.8.0 for the processor installed via pip. The installation is not trivial. You can find detailed step-by-step installation instructions in my blog post.

Patient dataset
The data used by the demo program is artificial. There is a 200-item data set for training and a 40-item data set for testing. The data looks like:

1   0.58   0   1   0   0.6540   0   0   1
0   0.36   1   0   0   0.4450   0   1   0
1   0.55   0   0   1   0.6460   1   0   0
0   0.31   0   1   0   0.4640   1   0   0
. . .

Each tab delimited line represents an inpatient. The first column is the variable to predict, the sex (0 = male, 1 = female). The other eight columns are the predictor variables: age (normalized by dividing by 100), county of residence (1 0 0 = austin, 0 1 0 = bailey, 0 0 1 = carson), number of blood monocytes (a type of white blood cell) and hospital history (1 0 0 = minor, 0 1 0 = moderate, 0 0 1 = major).

The demo program defines a PyTorch Dataset class to load training or test data into memory. See List 1. Although you can load data from a file directly into a NumPy array and then convert it to a PyTorch tensor, using a dataset is the de facto technique used for most PyTorch programs.

List 1: A dataset class for patient data

import torch as T
import numpy as np

class PatientDataset(T.utils.data.Dataset):
  def __init__(self, src_file, num_rows=None):
    all_data = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(0,9), delimiter="t", skiprows=0,
      comments="https://visualstudiomagazine.com/articles/2021/06/23/#", dtype=np.float32)  # read all 9 columns

    self.x_data = T.tensor(all_data[:,1:9],
    self.y_data = T.tensor(all_data[:,0],

    self.y_data = self.y_data.reshape(-1,1)  # 2D

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx,:]  # idx rows, all 8 cols
    sex = self.y_data[idx,:]    # idx rows, the only col
    sample = { 'predictors' : preds, 'sex' : sex }
    return sample

The class loads a patient data file into memory as a two-dimensional array using the NumPy loadtxt () function. Class labels and predictors are separated into two arrays and then converted to PyTorch tensors. Even though class labels (0 or 1) are conceptually integers, the demo program uses a binary cross-entropy error that requires a float type.

The dataset can be called up for use by L-BFGS training like this:

fn = ".Datapatient_train.txt "
my_ds = PatientDataset(fn)
my_ldr = T.utils.data.DataLoader(my_ds, 
  batch_size=len(my_ds), shuffle=False)
for (_, batch) in enumerate(my_ldr):
  # _ is the batch index, not used
  # batch has all training items
. . .

When using L-BFGS for training, all data must be passed to the optimizer object. Therefore, the notions of data batch and batch learning do not apply. This means that the DataLoader shuffle parameter can be set to False.

Understanding logistic regression
Logistic regression is best explained by example. Suppose instead of the Patient dataset you have a simpler dataset where the goal is to predict gender from x0 = age, x1 = income, and x2 = duration of employment. A logistic regression model will have a weight value for each predictor variable and a bias constant. Suppose the weights are w0 = 13.5, w1 = -12.2, w2 = 1.08 and the bias is b = 1.12. Now suppose you have a data item where age = x0 = 0.32, income = x1 = 0.62, seniority = x2 = 0.77. The predicted sex is calculated as follows:

z = (x0 * wo) + (x1 * w1) + (x2 * w2) + b
  = (0.32 * 13.5) + (0.62 * -12.2) + (0.77 * 1.08) + 1.12
  = 4.32 + -7.56 + 0.83 + 1.12
  = -1.29

p = 1 / (1 + exp(-z))
  = 1 / (1 + exp(1.29))
  = 1 / (1 + 3.63)
  = 0.22

As the pseudo-probability value p is less than 0.5, the prediction is of class 0 = man.

In words, you calculate the z-value, which is the sum of the weights multiplied by the inputs, plus the bias. Then you calculate the ap value which is 1 over 1 plus exp () applied to -z. The equation for p is called the sigmoid logistic function. When calculating logistic regression, the az value can be between minus infinity and plus infinity, but the ap value will always be between 0 and 1.

Figure 2: Logistic regression explained with a graph
[Click on image for larger view.] Figure 2: Logistic regression explained with a graph

When represented graphically, the sigmoid logistic function has an “S” shape centered around z = 0. The amplitude of the graph is between 0 and 1. Each data item will correspond to the az value and each z value will correspond to a point p on the sigmoid function. P values ​​less than 0.5 are found in the lower part of the sigmoid curve and correspond to class 0 predictions; values ​​greater than 0.5 are on the top and correspond to class 1. You can use threshold values ​​other than 0.5 to tune a logistic regression model.

Note that the internet is littered with incorrect logistic regression charts where data points are displayed both above and below the sigmoid curve. If you ever see a chart like this, you would be well advised to look for better resources.

OK, all is well, but where do the weights and bias values ​​come from? The process of finding good values ​​for the model weights and biases is called model training. There is no closed form solution to finding the optimal values ​​of the weights and bias, and therefore the values ​​must be estimated using an iterative technique. There are many optimization algorithms for logistic regression training. The demo uses the L-BFGS (“limited memory Broyden Fletcher Goldfarb Shanno”) algorithm. The L-BFGS algorithm estimates a first derivative of the calculation (gradient) and also a second derivative (Hessian). This requires all data to be in memory but produces a very fast workout.

Definition of the logistic regression model
The class that defines the logistic regression model is:

import torch as T
class LogisticReg(T.nn.Module):
  def __init__(self):
    super(LogisticReg, self).__init__()
    self.fc = T.nn.Linear(8, 1)
    T.nn.init.uniform_(self.fc.weight, -0.01, 0.01) 

  def forward(self, x):
    z = self.fc(x)
    p = T.sigmoid(z) 
    return p

The linear layer calculates the sum of the weights multiplied by the inputs, plus the bias. The sigmoid () function applies the logistic sigmoid to the sum. The forward () method is implicitly called, for example:

log_reg = LogisticReg().to(device)
X = T.tensor(np.array([0.32, 1,0,0, 0.4567, 0,1,0],
pp = log_reg(X)  # not pp = log_reg.forward(X)

The demo uses an explicit uniform initialization () for model weights, which often works better than PyTorch’s default xavier () initialization algorithm for logistic regression.

Model formation
The demo program defines a train () function as shown in List 2. The train () function defines an LBFGS () optimizer object using the default parameter values, except for max_iter (maximum iterations). The LBFGS () class has seven parameters which have default values:

lr: default = 1.0
max_iter: default = 20
max_eval: default = max_iter * 1.25
tolerance_grad: default = 1.0e-5
tolerance_change: default = 1.0e-9
history_size: default = 100
line_search_fn: default = None

In most situations, the default parameter values ​​work quite well, but you should consult the PyTorch documentation to understand what these parameters are for so that you can change them if necessary when training fails.

List 2: The train () function

def train(log_reg, ds, mi):
  loss_func = T.nn.BCELoss()  # binary cross entropy
  opt = T.optim.LBFGS(log_reg.parameters(), max_iter=mi)
  train_ldr = T.utils.data.DataLoader(ds,
    batch_size=len(ds), shuffle=False)

  print("Starting L-BFGS training")

  for itr in range(0, mi):
    itr_loss = 0.0            # for one iteration
    for (_, all_data) in enumerate(train_ldr):
      X = all_data['predictors'] 
      Y = all_data['sex'] 

      # -------------------------------------------
      def loss_closure():
        oupt = log_reg(X)
        loss_val = loss_func(oupt, Y)
        return loss_val
      # -------------------------------------------

      opt.step(loss_closure)  # get loss, use to update wts
      oupt = log_reg(X)  # monitor loss
      loss_val = loss_closure() 
      itr_loss += loss_val.item()  
    print("iteration = %4d   loss = %0.4f" % (itr, itr_loss))
  print("Done ")

It is not necessary to explicitly set the model in train () mode because no normalization or batch abort is used. However, in my opinion, it is good practice to set the mode even when it is not technically necessary.

When using L-BFGS optimization, you must use a closure to calculate the loss (error) during training. A Python closure is a programming mechanism where the closure function is defined inside another function. The closure has access to all the parameters and local variables of the external container function. This allows the close function to be passed by name, without parameters, to any instruction within the containing function. In the case of L-BFGS training, the loss closure calculates the loss when the name of the loss function is passed to the step () method of the optimizer. The technique looks a bit odd if you’ve never seen it before, but it makes sense if you think about it long enough.

Overall program structure
The overall structure of the demo program, with some minor modifications to save time, is shown in List 3. The demo starts by importing the required NumPy and Torch core libraries. It is a good idea to document the versions of the libraries used as PyTorch is in continuous development.

List 3:
Overall structure of the logistic regression program

# patients_lbfgs.py
# Logistic Regression using PyTorch with L-BFGS optimization
# predict sex from age, county, monocyte, history
# PyTorch 1.8.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10 

import numpy as np
import torch as T
device = T.device("cpu")

# ----------------------------------------------------------

class PatientDataset(T.utils.data.Dataset):
  # see Listing 1

class LogisticReg(T.nn.Module):
  def __init__(self):
    super(LogisticReg, self).__init__()
    self.fc = T.nn.Linear(8, 1)
    T.nn.init.uniform_(self.fc.weight, -0.01, 0.01) 

  def forward(self, x):
    z = self.fc(x)
    p = T.sigmoid(z) 
    return p

def train(log_reg, ds, mi):
  # see Listing 2

def accuracy(model, ds, verbose=False):
  # see Listing 4 

# ----------------------------------------------------------

def main():
  # 0. get started
  print("Gender logistic regression L-BFGS PyTorch ")
  print("Gender from age, county, monocyte, history")

  # 1. create Dataset and DataLoader objects
  print("Creating Patient train and test Datasets ")

  train_file = ".Datapatients_train.txt"
  test_file = ".Datapatients_test.txt"

  train_ds = PatientDataset(train_file)  # read all rows
  test_ds = PatientDataset(test_file)

  # 2. create model
  print("Creating 8-1 logistic regression model ")
  log_reg = LogisticReg().to(device)

  # 3. train network
  print("Preparing L-BFGS training")
  max_iterations = 4
  print("Loss function: BCELoss ")
  print("Optimizer: L-BFGS ")
  train(log_reg, train_ds, max_iterations)

# ----------------------------------------------------------

  # 4. evaluate model
  acc_train = accuracy(log_reg, train_ds)
  print("Accuracy on train data = %0.2f%%" % 
    (acc_train * 100))
  acc_test = accuracy(log_reg, test_ds, verbose=False)
  print("Accuracy on test data = %0.2f%%" % 
    (acc_test * 100))

  # 5. examine model
  wts = log_reg.fc.weight.data
  print("Model weights: ")
  bias = log_reg.fc.bias.data
  print("Model bias: ")

  # 6. save model
  # print("Saving trained model state_dict n")
  # path = ".Modelspatients_LR_model.pth"
  # T.save(log_reg.state_dict(), path)

  # 7. make a prediction 
  print("Predicting sex for age = 30, county = carson, ")
  print("monocyte count = 0.4000, ")
  print("hospitization history = moderate ")
  inpt = np.array([[0.30, 0,0,1, 0.40, 0,1,0]],
  inpt = T.tensor(inpt, dtype=T.float32).to(device)

  with T.no_grad():
    oupt = log_reg(inpt)    # a Tensor
  pred_prob = oupt.item()   # scalar, [0.0, 1.0]
  print("Computed output pp: ", end="")
  print("%0.4f" % pred_prob)

  if pred_prob < 0.5:
    print("Prediction = male")
    print("Prediction = female")

  print("End Patient gender demo")

if __name__== "__main__":


Leave A Reply

Your email address will not be published.