Logistic Regression Train (LR Train)¶

Introduction¶

Logistic Regression¶

Logistic Regression (LR) is actually a classification problem, although it names itself as "regression". It is mainly used for the binary classification, by utilizing the Logistic Function - Sigmoid. Obviously, its value locates within (0, 1). Usually, it takes 0.5 as the threshold, below which it classifies the observed set of inputs as class "0", and vice versa. Sigmoid function is presented as a pretty S-curve.

Logistic regression (LR) is a statistical learning method widely used for classification problems, especially in binary classification scenarios. Despite the word "regression" in its name, it is essentially a classification model that makes classification decisions by modeling the probability of events. The core of logistic regression is to predict the probability of an event occurring. Using a probability threshold (usually 0.5), samples can be classified into two categories: those with probability ≥ 0.5 are classified as class "1", otherwise as class "0".

Sigmoid Function¶

The primary role of Sigmoid function is to map any real-valued input to a value between 0 and 1, making it ideal for modeling probabilities. It is presented as a S-shaped curve.

sigmoid

The formula of Sigmoid function is as follows: $$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

Logistic Regression Classifier aims at learning a binary classification model from the features of the input training data. The model takes the linear combination of the input feature as the variable, $\theta_0+\theta_1x_1+,\ldots,+\theta_nx_n= \sum_{i=1}^n \theta_ix_i$ ，where $x_0$ is always 1. We can also present it as $\theta^Tx$ . Putting it into the sigmoid function, we get a prediction model as: $$ h_\theta(x)=g(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}} $$ LR finally maps the combination variable to (0, 1) to determine the class of the input features.

In order to train such a logistic model, we need to build a reasonable loss function. But before that, we firstly define the probability of classes (0 or 1) as follows:

$$ \begin{equation} \begin{aligned}&P(y=1\mid x;\theta)=h_\theta(x)\&P(y=1\mid 0;\theta)=1-h_\theta(x)\end{aligned}\end{equation} $$ It means for a set of input features denoted as $x_i$ and their correspondent labels $y_i$ , the probability of $y_i = 1$ under $x_i$ is $p_i$ and hence the probability of $y_i = 0$ is $1-p_i$ . Therefore, we can combine the two cases and get a unified probability function as:

$$ P(y)\mid x;\theta)=(h_\theta(x))^y(1-h_\theta(x))^{1-y} $$ Further, for m number of input features $x_1, x_2, ...... , x_m$ , the joint probability is simply the multiplication of the probability of each $y_i$ under $x_i$ , as follows:

$$ L(\theta)=\prod_{i=1}^{m}{P=(y_i\mid x_i;\theta)}=\prod_{i=1}^{m}(h_\theta(x_i))^{y_i}(1-h_\theta(x_i))^{1-y_i} $$ The purpose of LR is to obtain an optimal set of parameter set $\theta$ that could lead to the maximum $L(\theta)$ . Taking its logarithm form, we get $l(\theta)$ as:

$$ l(\theta)=\log{L(\theta)}=\sum_{i=1}^m \Big(y_i\log h_\theta(x_i)+(1-y_i)\log \big(1-h_\theta(x_i)\big)\Big) $$ If we regard $l(\theta)$ as the loss function, training the LR problem should use gradient ascent manner, because $l(\theta)$ is the larger the better. If we want to minimize the loss function ( to coincide with the mainstream machine learning training method), we can simply use its minus as: $J(\theta)=-\frac{1}{m}*l(\theta)$ , and we could use the gradient descent to issue the parameter optimization: $\theta_j:=\theta_j-\alpha\frac{\delta}{\delta_{\theta_j}}J(\theta)$

where:

$\begin{aligned}&\frac{\delta}{\delta_{\theta_j}}J(\theta)=-\frac{1}{m}\sum_{i=1}^m \Big(y_i\frac{1}{h_\theta(x_i)}\frac{\delta} {\delta_{\theta_j}}h_\theta(x_i)-(1-y_i)\frac{1}{1-h_\theta(x_i)}\frac{\delta}{\delta_{\theta_j}}h_\theta(x_i)\Big)\\&=-\frac{1}{m}\sum_{i=1}^m \Big(y_i\frac{1}{g(\theta^Tx_i)})-(1-y_i)\frac{1}{1-g(\theta^Tx_i)}\Big) \frac{\delta}{\delta_{\theta_j}}g(\theta^Tx_i)\\&=-\frac{1}{m}\sum_{i=1}^m \Big(y_i\frac{1}{g(\theta^Tx_i)})-(1-y_i)\frac{1}{1-g(\theta^Tx_i)}\Big) g(\theta^Tx_i)\big(1-g(\theta^Tx_i)\big)\frac{\delta}{\delta_{\theta_j}}\theta^Tx_i\\ &=-\frac{1}{m}\sum_{i=1}^m \Big(y_i\big(1-g(\theta^Tx_i)\big)-(1-y_i)g(\theta^Tx_i)\Big)x_i^j\\&=-\frac{1}{m}\sum_{i=1}^m\Big(y_i-g(\theta^Tx_i)\Big)x_i^j\\&=-\frac{1}{m}\sum_{i=1}^m\Big(h_\theta(x_i)-y_i)\Big)x_i^j \end{aligned}$

Vectorization¶

The parameter optimization process could be vectorized, which is of great importance in FHE. We could use the following process for the vectorization:

Firstly, we reform the $m$ number input vector $x$ as more fine-grained feature matrix for each observed feature, similar for the output class $y$ and the parameter set $\theta$ :

$x= \left[\matrix{ x_1\\ \cdots\\ x_m}\right]=\left[\matrix{ x_{11} & \cdots & x_{1n}\\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn}}\right],y=\left[\matrix{ y_1\\ \cdots\\ y_m}\right],\theta=\left[\matrix{ \theta_0\\ \cdots\\ \theta_n}\right]\\$

The linear combination of each feature $x_i$ and the parameter set $\theta$ could then be presented as matrix-vector multiplication. The resulting matrix $A$ is used as input of the sigmoid function $g$ :

$A=x \cdot \theta^T=\left[\matrix{ x_{10} &\cdots & x_{1n}\\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn}}\right]\cdot \left[\matrix{ \theta_0\\ \cdots\\ \theta_n}\right]=\left[\matrix{ \theta_0x_{10}+\theta_1x_{11}+\ldots+\theta_0x_{1n}\\ \cdots\\ \theta_0x_{m0}+\theta_1x_{m1}+\ldots+\theta_0x_{mn}}\right]\\$

$E=h_\theta(x)-y=\left[\matrix{ g(A_1)-y_1\\ \cdots\\ g(A_m)-y_m}\right]=\left[\matrix{ e_1\\ \cdots\\ e_m}\right]=g(A)-y$

$E$ is the error (or loss) between the observed label $y$ (0 or 1) and the predicted probability obtained by the sigmoid function with $x$ . Therefore, the final optimization is presented as follows: $$ \theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^m\big(h_\theta(x_i)-y_i)\big)x_i^j=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^me_ix_i^j=\theta_j-\alpha\frac{1}{m}x^{jT} E $$

Implementation¶

In this section, we will illustrate the implementation of logistic regression in privacy.

Preprocess the input data
Compute Sigmoid function ciphertext
Compute gradient ciphertext
Update the weight ciphertext
Repeat step(2)~(4) util the result becomes convergence

Step 1¶

In step(1), the input matrix is divided into several blocks of $2^n * 2^n$ for the convenience of the next calculation. For example, the default size of our example input is $780 * 9$ , and the output size is $9*1$ . The $780 * 9$ input matrix is resized into $784 * 16$ (49 partitioned matrix of size $2^4 * 2^4$ ) where the newly added elements are fulfilled with zeros.

input_matrix := resize(input_matrix)
input_matrix_diagonal := diag(input_matrix)

for partitioned_matrix in input_matrix:
    size = partitioned_matrix.row
    for i in size
        for j in size
            matrix_diagonal[i][j] = matrix[j][(i+j) % size]

First, the training data matrix (the blue blocks in the picture) is extended to a square matrix fulfilling with zeros.

extend plaintext

Then the extended training data matrix is packed into a vector. On the other hand, the classification result (the red blocks in the picture) is rotated and extended into a vector where the element corresponds to the extended training data element.

extend plaintext

Step 2¶

In step(2), it computes the Sigmoid function over the input data ciphertext.

ct_tmp = ciph_x_diag * ciph_weight

// accumulate the slots
for i = 0 : slot_size / (block_size)
    for j = 0 : block_size
        for k = 0 : block_size
            ct_res[i * block_size * block_size + j] += ct_tmp[i * block_size * block_size + k * block_size + j]

ct_sigmoid = evaluate_poly_vector(ct_res)

Every block_size * block_size slots is accumulated to the beginning block_size * block_size slots.

extend plaintext

Step 3¶

In step(3), it computes the gradient of $\theta$ . The formula of gradient in plaintext is ${ grad(\theta_j) = \frac{1}{m}\sum_{i=1}^m\Big(y_i - h_\theta(x_i))\Big)x_{i,j}}$ .

// computing the gradient ciphertext
for i = 0 : x_tranpose.size()
    ct_tmp = ct_sigmoid * ct_x_transpose[i]
    ct_tmp = accumulate(ct_tmp)
    ct_tmp = ct_tmp * plt(1 0 0 0 0 0 0 ...)    // all the slots in plaintext is 0 except the first slot to be 1
    ct_tmp = rotate(ct_tmp, -i)
    ct_grad += ct_tmp

ct_grad *= (learning_rate / m)

// rotating the gradient ciphertext
ct_grad_shift = ct_grad + rotate(ct_grad, -16)
for i = 1 : block_size
    // the i-th to the (i+block_size)-th slot to be 1
    mask = [0, 0, ... 1, 1, ..., 1, 0, ..., 0]
    ct_tmp = rotate(ct_grad_shift, -i % 16) * mask
    ct_grad += rotate(ct_tmp, 16*i)

for i = 1 : slot_size / block_size
    ct_grad += rotate(ct_grad, i * block_size)
    i <<= 1

Step 4¶

In step(4), it updates the weight.

ct_weight := ct_weight - ct_gradient * plt_learning_rate

Source Code¶

void read_file(std::vector<std::complex<double>> &matrix, const std::string& file);

matrix (std::vector<std::complex<double>> &) : output of classification result
file (const std::string &) : the file which stores the output result

Usage : read the input of training data from the file

void read_file(std::vector<std::vector<std::complex<double>>> &matrix, const std::string& file);

matrix (std::vector<std::vector<std::complex<double>>> &) : input of training data
file (const std::string&) : the file which stores the training data

Usage : read the input of training data from the file

void preprocess(int block_size,
                int block_num,
                std::vector<std::vector<std::complex<double>>> &x,
                std::vector<std::vector<std::complex<double>>> &x_transpose,
                std::vector<std::vector<std::complex<double>>> &x_diag);

block_size (int) : the size of block matrix (partitioned matrix)
block_num (int) : the number of block matrix (partitioned matrix)
x (std::vector<std::vector<std::complex<double>>> &) : the input matrix of training data
x_transpose (std::vector<std::vector<std::complex<double>>> &) : the transposed matrix of training data matrix
x_diag (std::vector<std::vector<std::complex<double>>> &) : the diagonal matrix of training data matrix which is used for computing the multiplication of the matrix with the vector

Usage : Preprocessing the training data for further computation.

Ciphertext accumulate_top_n(const Ciphertext &ciph, int n, const CKKSEncoder &encoder,
                            const Encryptor &enc, std::shared_ptr<EvaluatorCkksBase> ckks_eva,
                            const GaloisKeys &rot_keys);

ciph (const Ciphertext &) : the input ciphertext
n (int) : the range of accumulation
encoder (const CKKSEncoder &) : the encoder
enc (const Encryptor &) : the encryptor
ckks_eva (std::shared_ptr<EvaluatorCkksBase>) : the evaluator for homomorphic computations
rot_keys (const GaloisKeys &) : the rotation key

Usage : It accumulates the top n slots and stores the result into the first slot. It can be expressed by the equation $slot[0] = \sum\limits_{i=0}^{n-1}slot[i]$ . Pay attention that the origin slots in the ciphertext will be changed!

double sigmoid(double x);

x (double) : input value

Usage: It computes the Sigmoid function $\frac{e^x}{1+e^x}$ of input value x.

Ciphertext sigmoid_approx(const Ciphertext &ciph, const PolynomialVector &polys,
                          const CKKSEncoder &encoder, std::shared_ptr<EvaluatorCkksBase> eva,
                          const RelinKeys &relin_keys);

ciph (const Ciphertext &) : the input ciphertext
polys (const PolynomialVector &) : the Sigmoid function expressed by approximate polynomial
encoder (const CKKSEncoder &) : the encoder
eva (std::shared_ptr<EvaluatorCkksBase>) : the evaluator for homomorphic computations
relin_keys (const RelinKeys &) : the relinearization key

Usage : It computes the sery expansion of the Sigmoid function $\frac{e^x}{1+e^x}$ in ciphertext.

int get_size(int min, int max);

min (int) : the minimum value
max (int) : the maximum value

Usage : It computes the logarithm of 2 which satisfies to $min \le 2^{ret \ value} \le max$ .

Ciphertext accumulate_block_matrix(const std::shared_ptr<EvaluatorCkksBase> eva, const GaloisKeys &rot_key, const Ciphertext &ciph, int block_size);

eva (const std::shared_ptr<EvaluatorCkksBase>) : the evaluator for homomorphic computations
rot_key (const GaloisKeys &) : the rotation keys
ciph (const Ciphertext) : the ciphertext
block_size (int) : the size of block matrix (partitioned matrix)

Usage : It accumulates every block_size slots into the beginning block_size slots. For $i \in [0, block\_size - 1]$ , there exists $slot[i] = \sum\limits_{j = 0}^{block\_size - 1}slot[i + j * block\_size]$ .

Ciphertext accumulate_slot_matrix(const std::shared_ptr<EvaluatorCkksBase> eva, const GaloisKeys &rot_key, const Ciphertext &ciph, int block_size, int block_num);

eva (const std::shared_ptr<EvaluatorCkksBase>) : the evaluator for homomorphic computations
rot_key (const GaloisKeys &) : the rotation keys
ciph (const Ciphertext &) : the ciphertext
block_size (int) : the size of block matrix (partitioned matrix)
block_num (int) : the number of block matrix (partitioned matrix)

Usage : It accumulates every block_size * block_size slots into the beginning block_size * block_size slots. For $i \in [0, block\_size * block\_size -1]$ , there exists $slot[i] = \sum\limits_{j = 0}^{block\_num - 1}slot[i + j * block\_size * block\_size]$ .

std::vector<std::complex<double>> vector_to_block_message(const std::vector<std::complex<double>> &vec, int cnt, int block_size);

vec (const std::vector<std::complex<double>> &) : the input vector
cnt (int) : the first cnt items of the vec
block_size (int) : the size of block matrix (partitioned matrix)

Usage : It packs the first cnt items of the vec into a vector which looks like:

$\begin{bmatrix} & \underbrace{{slot_{0} \cdots slot_{block\_size-1}}}_{block\_size \ slots} & | & \underbrace{\cdots}_{(block\_size - 1) * block\_size} & | & \underbrace{slot_{block\_size} \cdots slot_{2 * block\_size - 1}}_{block\_size} & | & \underbrace{\cdots}_{(block\_size - 1) * block\_size} & | & \cdots \end{bmatrix}$

Performance (TBD)¶

The environment is as follows:

System: Ubuntu 20.04.6 LTS
CPU: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
RAM: 512G

perf