This project is a C++ Object-Oriented Programming (OOP) version of my earlier SimpleCNN implementation in Python (PyTorch). While the original project served as a practical introduction to coding a simple CNN, this C++ version takes a lower-level approach: the entire network is built from scratch, without predefined ML frameworks or utilities.
By implementing each layer, operation, and optimization step manually, this project highlights the inner workings of convolutional networks such as convolution, activation functions, pooling, normalization, and backpropagation. The OOP design ensures modularity, making it easier to extend, experiment with new components, or adapt the network to different tasks.
By handling every component explicitly, from convolutions and bias application to activation functions and optimization, the project exposes the mechanics that higher-level libraries often abstract away.
Before diving into this project, I recommend having a look at SimpleCNN.
The network is trained on the MNIST dataset, a collection of simple grayscale images of handwritten single digits (0-9). Given an image, the network's target is to classify it as the correct digit (class).
The MNIST dataset contains 70,000 images, split into a training set of 60,000 images and a test set of 10,000 images, and is commonly used for training various image processing systems. It is a subset of a larger set available from NIST; the digits have been size-normalized and centered in a fixed-size image. The binary format of the dataset is available for download from Yann LeCun - THE MNIST DATABASE, and also on Kaggle.
For more information, see MNIST Dataset.
Our network consists of 6 layers:
- Convolution Layer with a kernel size of 5x5, and ReLU activation function.
- Max-pool Layer with a kernel size of 2x2.
- Convolution Layer with a kernel size of 5x5 and ReLU activation function.
- Max-pool Layer with a kernel size of 2x2.
- Fully-connected Layer with an input size of 1024, an output size of 512, and a ReLU activation function.
- Fully-connected Layer with an input size of 512, an output size of 10 (the number of classes), and a Softmax activation function.
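Assuming stride 1 and no padding in the convolutions, the feature-map shapes for a $28\times28$ MNIST image evolve as follows (with $C_1$, $C_2$ denoting the channel counts of the two convolutions; $C_2$ must be 64 for the flattened output to match the first fully-connected layer's input of 1024):

$$1\times28\times28 \xrightarrow{\text{conv }5\times5} C_1\times24\times24 \xrightarrow{\text{pool }2\times2} C_1\times12\times12 \xrightarrow{\text{conv }5\times5} C_2\times8\times8 \xrightarrow{\text{pool }2\times2} C_2\times4\times4 \xrightarrow{\text{flatten}} C_2\cdot4\cdot4 = 1024$$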
The Simple CNN is implemented in C++ using an object-oriented design. The network layers and methods are implemented with the Eigen library, an open-source C++ linear algebra library.
Every layer apart from the fully-connected ones takes a 4-dimensional input (N, C, H, W), where N is the batch size, C is the number of channels, and H, W are the height and width (the resolution of the images), respectively.
Also see SimpleCNN.hpp.
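One simple way to represent such a 4-dimensional (N, C, H, W) input with Eigen's dense matrices is a nested container of H x W matrices indexed by batch and channel. This is only an illustrative sketch; the project's actual tensor type in SimpleCNN.hpp may differ.

```cpp
#include <Eigen/Dense>
#include <vector>

// Illustrative (N, C, H, W) representation: tensor[n][c] is an H x W feature map.
using Tensor4D = std::vector<std::vector<Eigen::MatrixXf>>;

// Allocate a zero-initialized batch of N samples with C channels of size H x W.
Tensor4D makeTensor(int N, int C, int H, int W) {
    return Tensor4D(N, std::vector<Eigen::MatrixXf>(C, Eigen::MatrixXf::Zero(H, W)));
}

int main() {
    Tensor4D batch = makeTensor(32, 1, 28, 28);  // a batch of 32 MNIST images
    batch[0][0](14, 14) = 1.0f;                  // pixel access: sample 0, channel 0
}
```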
Applies a 2D convolution over an input signal composed of one or several input planes. See Convolution2D.hpp.
The Convolutional Layer is the core building block of a Convolutional Neural Network (CNN), commonly used in image processing and computer vision. A convolutional layer applies a small filter (or kernel) across the input to extract features such as edges, textures, and shapes.
Given an input of size $(N, C_{in}, H_{in}, W_{in})$ and a square kernel of size $k \times k$, the output dimensions are:

$$H_{out} = \left\lfloor \frac{H_{in} + 2p - k}{s} \right\rfloor + 1 \quad;\quad W_{out} = \left\lfloor \frac{W_{in} + 2p - k}{s} \right\rfloor + 1$$

Where:

- $k$ - kernel size, the height/width of the square filter.
- $s$ - stride, how far the filter moves at each step. A stride of 1 means one pixel at a time.
- $p$ - padding, adds zeros around the input to control the size of the output.
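As a concrete illustration of these formulas, here is a minimal single-channel 2D convolution (cross-correlation) written with Eigen. The function name and signature are illustrative only and do not reproduce the project's Convolution2D.hpp.

```cpp
#include <Eigen/Dense>
#include <iostream>

// Single-channel 2D convolution (cross-correlation) with stride s and zero padding p.
Eigen::MatrixXf conv2d(const Eigen::MatrixXf& input,
                       const Eigen::MatrixXf& kernel,
                       int s = 1, int p = 0) {
    const int k = static_cast<int>(kernel.rows());  // square k x k kernel
    Eigen::MatrixXf padded = Eigen::MatrixXf::Zero(input.rows() + 2 * p,
                                                   input.cols() + 2 * p);
    padded.block(p, p, input.rows(), input.cols()) = input;

    const int hOut = (static_cast<int>(padded.rows()) - k) / s + 1;
    const int wOut = (static_cast<int>(padded.cols()) - k) / s + 1;
    Eigen::MatrixXf out(hOut, wOut);

    for (int i = 0; i < hOut; ++i)
        for (int j = 0; j < wOut; ++j)
            // sum of the element-wise product between the k x k window and the kernel
            out(i, j) = (padded.block(i * s, j * s, k, k).array() * kernel.array()).sum();
    return out;
}

int main() {
    Eigen::MatrixXf img    = Eigen::MatrixXf::Random(28, 28);
    Eigen::MatrixXf kernel = Eigen::MatrixXf::Random(5, 5);
    Eigen::MatrixXf fmap   = conv2d(img, kernel);   // 24 x 24 for a 5x5 kernel, s=1, p=0
    std::cout << fmap.rows() << " x " << fmap.cols() << std::endl;
}
```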
Applies a 2D max pooling over an input signal composed of one or several input planes. See MaxPooling.hpp.
A pooling layer is used in Convolutional Neural Networks (CNNs) to reduce the spatial dimensions (height and width) of feature maps while preserving the most important information.
Max pooling is the most common type of pooling. It works by sliding a small window (like 2×2 or 3×3) over the input and taking the maximum value in each region.
For an input of size $(N, C, H_{in}, W_{in})$ and a pooling window of size $k \times k$, the output dimensions, like the convolutional layer, are:

$$H_{out} = \left\lfloor \frac{H_{in} + 2p - k}{s} \right\rfloor + 1 \quad;\quad W_{out} = \left\lfloor \frac{W_{in} + 2p - k}{s} \right\rfloor + 1$$
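A minimal sketch of max pooling on a single channel with Eigen, assuming a non-overlapping window (stride equal to the window size), as used in this network; this is not the project's MaxPooling.hpp implementation.

```cpp
#include <Eigen/Dense>

// Max pooling on a single channel with a k x k window and stride k (non-overlapping).
Eigen::MatrixXf maxPool2d(const Eigen::MatrixXf& input, int k = 2) {
    const int hOut = static_cast<int>(input.rows()) / k;
    const int wOut = static_cast<int>(input.cols()) / k;
    Eigen::MatrixXf out(hOut, wOut);
    for (int i = 0; i < hOut; ++i)
        for (int j = 0; j < wOut; ++j)
            // maximum value inside the k x k window
            out(i, j) = input.block(i * k, j * k, k, k).maxCoeff();
    return out;
}
```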
Applies a linear transformation to the layer's 2-dimensional input of shape (N, features).
A fully-connected layer is a type of neural network layer where every neuron is connected to every neuron in the previous layer. It is usually placed at the end of convolutional or recurrent layers to make final predictions.
Each output neuron computes a weighted sum of all inputs, adds a bias, and then applies an activation function.
$$y = f(Wx + b)$$

Where:

- $x \in \mathbb{R}^{n_{\text{in}}}$ is the input vector.
- $W \in \mathbb{R}^{n_{\text{out}} \times n_{\text{in}}}$ is the weight matrix.
- $b \in \mathbb{R}^{n_{\text{out}}}$ is the bias vector.
- $f(\cdot)$ is an activation function (e.g., ReLU, sigmoid).
- $y \in \mathbb{R}^{n_{\text{out}}}$ is the output vector.
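A minimal Eigen sketch of this forward pass for a batch stored row-wise (one sample per row), with ReLU as the activation; the names and data layout are illustrative assumptions, not the project's fully-connected class.

```cpp
#include <Eigen/Dense>

// Fully-connected forward pass y = f(Wx + b) for a batch X of shape N x n_in,
// with W of shape n_out x n_in, bias b of length n_out, and ReLU as the activation f.
Eigen::MatrixXf fcForward(const Eigen::MatrixXf& X,
                          const Eigen::MatrixXf& W,
                          const Eigen::VectorXf& b) {
    Eigen::MatrixXf Z = X * W.transpose();   // N x n_out
    Z.rowwise() += b.transpose();            // add the bias to every sample
    return Z.cwiseMax(0.0f);                 // ReLU
}
```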
Weight initialization helps neural networks train efficiently by keeping activations and gradients stable across layers. Xavier initialization works well with sigmoid or tanh activations, using a variance of $\mathrm{Var}(W) = \frac{2}{n_{in} + n_{out}}$.
He Initialization (also called Kaiming Initialization) is designed for neural networks that use ReLU or Leaky ReLU activations. Since ReLU sets all negative inputs to zero, it effectively reduces the number of active neurons by half, which can shrink the variance of outputs layer by layer if not handled properly.
To compensate, He Initialization sets the weights to have a higher variance:
- For a normal distribution:
  $$W \sim \mathcal{N}\Bigg(0,\quad \frac{2}{n_{in}}\Bigg)$$
- For a uniform distribution:
  $$W \sim \mathcal{U}\Bigg(-\sqrt{\frac{6}{n_{in}}},\quad \sqrt{\frac{6}{n_{in}}}\Bigg)$$
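A minimal sketch of He (normal) initialization with Eigen and the standard library; the function name and the use of std::mt19937 are illustrative choices, not the project's initializer.

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <random>

// He (Kaiming) normal initialization: W ~ N(0, 2 / n_in), i.e. std dev = sqrt(2 / n_in),
// for a weight matrix of shape n_out x n_in.
Eigen::MatrixXf heInit(int nOut, int nIn, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::normal_distribution<float> dist(0.0f, std::sqrt(2.0f / static_cast<float>(nIn)));
    Eigen::MatrixXf W(nOut, nIn);
    for (int i = 0; i < nOut; ++i)
        for (int j = 0; j < nIn; ++j)
            W(i, j) = dist(gen);
    return W;
}
```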
An activation function is a mathematical function applied to the output of each neuron in a neural network. It introduces non-linearity to the model, allowing it to learn complex patterns beyond just linear relationships.
Without activation functions, a neural network—no matter how deep—would behave like a simple linear model.
- ReLU (Rectified Linear Unit) is one of the most popular activation functions in deep learning.
- Softmax is usually used in the output layer for classification, especially multi-class problems.
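A minimal Eigen sketch of these two activations; the softmax subtracts the maximum logit before exponentiating for numerical stability. This is illustrative and not necessarily how the project implements them.

```cpp
#include <Eigen/Dense>

// ReLU: max(x, 0) applied element-wise.
Eigen::VectorXf relu(const Eigen::VectorXf& x) {
    return x.cwiseMax(0.0f);
}

// Softmax: exponentiate after subtracting the maximum logit (numerical stability),
// then normalize so the outputs sum to 1.
Eigen::VectorXf softmax(const Eigen::VectorXf& x) {
    Eigen::ArrayXf shifted = (x.array() - x.maxCoeff()).exp();
    return (shifted / shifted.sum()).matrix();
}
```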
Regularization refers to a set of techniques used to prevent a machine learning model from overfitting the training data, improving its generalization to unseen data. It works by constraining or penalizing the model’s complexity, encouraging simpler solutions that are less sensitive to noise in the data.
In our model, Simple CNN, we use Dropout and Batch Normalization methods.
Dropout is a regularization technique where, during training, a fixed percentage of neurons (e.g. 50%) are randomly set to zero in each forward pass, preventing co-adaptation of neurons. This prevents over-reliance on specific neurons and encourages redundancy and robustness.
At inference time, all neurons are active. To keep the expected activations consistent between training and inference, the kept activations are typically scaled by $\frac{1}{1-p}$ during training (inverted dropout), so that during inference all units can be used as-is.
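A minimal sketch of inverted dropout during training with Eigen and the standard library; the function signature and the choice of inverted dropout are illustrative assumptions, not the project's Dropout implementation.

```cpp
#include <Eigen/Dense>
#include <random>

// Inverted dropout (training only): zero each activation with probability p and
// scale the survivors by 1/(1-p), so no rescaling is needed at inference time.
Eigen::MatrixXf dropout(const Eigen::MatrixXf& x, float p, std::mt19937& gen) {
    std::bernoulli_distribution keep(1.0 - p);
    Eigen::MatrixXf out = x;
    for (Eigen::Index i = 0; i < out.rows(); ++i)
        for (Eigen::Index j = 0; j < out.cols(); ++j)
            out(i, j) = keep(gen) ? out(i, j) / (1.0f - p) : 0.0f;
    return out;
}
```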
Batch Normalization aims to stabilize and accelerate training by ensuring each channel’s activations have consistent statistics across mini-batches. This method normalizes each feature channel’s activations to zero mean and unit variance over a mini-batch, thereby reducing internal covariate shift; it can also have a slight regularizing effect (due to batch noise).
For a layer’s inputs $x_1, \dots, x_m$ over a mini-batch $B$, we first compute the batch statistics and normalize:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i \quad;\quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2 \quad;\quad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

Then we scale ($\gamma$) and shift ($\beta$) the normalized values:

$$y_i = \gamma \hat{x}_i + \beta$$

where $\gamma$ and $\beta$ are learnable parameters and $\epsilon$ is a small constant added for numerical stability.
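A minimal sketch of these equations for a mini-batch stored as an N x D matrix (one feature per column); the per-column loop and names are illustrative, not the project's Batch Normalization layer, and the running statistics needed for inference are omitted.

```cpp
#include <Eigen/Dense>
#include <cmath>

// Batch normalization over a mini-batch X of shape N x D (one feature per column):
// normalize each column with the batch mean/variance, then scale by gamma and shift by beta.
Eigen::MatrixXf batchNorm(const Eigen::MatrixXf& X,
                          const Eigen::VectorXf& gamma,
                          const Eigen::VectorXf& beta,
                          float eps = 1e-5f) {
    Eigen::MatrixXf Y(X.rows(), X.cols());
    for (Eigen::Index j = 0; j < X.cols(); ++j) {
        const float mean = X.col(j).mean();
        const float var  = (X.col(j).array() - mean).square().mean();
        Y.col(j).array() =
            gamma(j) * (X.col(j).array() - mean) / std::sqrt(var + eps) + beta(j);
    }
    return Y;
}
```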
The Cross Entropy Loss function is widely used for classification tasks, as it measures the difference between the predicted probability distribution and the true distribution. Given a predicted probability vector $\hat{y}$ and a one-hot target vector $y$ over $C$ classes, the loss is:

$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$
This loss penalizes confident incorrect predictions more heavily than less certain ones, encouraging the model to assign higher probabilities to the correct classes. Minimizing cross-entropy effectively maximizes the likelihood of the correct labels under the model’s predicted distribution.
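A minimal sketch of this loss for a batch of softmax outputs, using integer class labels instead of one-hot vectors; the names are illustrative, not the project's loss class.

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean cross-entropy for a batch of predicted probabilities P (N x C, e.g. softmax
// outputs) and integer class labels; a small epsilon guards against log(0).
float crossEntropy(const Eigen::MatrixXf& P, const std::vector<int>& labels) {
    const float eps = 1e-12f;
    float loss = 0.0f;
    for (std::size_t i = 0; i < labels.size(); ++i)
        loss -= std::log(P(static_cast<Eigen::Index>(i), labels[i]) + eps);
    return loss / static_cast<float>(labels.size());
}
```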
Adam (Adaptive Moment Estimation) is a widely used optimization algorithm in machine learning. It combines the benefits of Momentum and RMSProp, maintaining running estimates of both the mean and the uncentered variance of gradients to adaptively adjust the learning rate for each parameter. By using these adaptive estimates, Adam can converge faster and more reliably on complex models, handle noisy gradients, and often requires less manual tuning of the learning rate compared to standard stochastic gradient descent. Its adaptive nature makes Adam particularly effective for large-scale problems and deep neural networks, where gradients can vary significantly across parameters.
- $\theta_t$ : parameters at time step t.
- $\beta_1, \beta_2$ : exponential decay rates for the moment estimates.
- $\alpha$ : learning rate.
- $\epsilon$ : small constant to prevent division by zero.
- $\lambda$ : weight decay coefficient.
- Compute gradients:
  $$g_t = \nabla_\theta J(\theta_t)$$
- Update moment estimates:
  $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad;\quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
- Bias correction:
  $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \quad;\quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
- Parameter update:
  $$\theta_{t+1} = \theta_t - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
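A minimal sketch of one Adam step on a flattened parameter vector, following the update rules above (without the weight decay term $\lambda$); the function name and default hyperparameters are illustrative, not the project's optimizer.

```cpp
#include <Eigen/Dense>
#include <cmath>

// One Adam update step on a flat parameter vector theta.
// m and v hold the running first/second moment estimates; t is the (1-based) step count.
void adamStep(Eigen::VectorXf& theta, const Eigen::VectorXf& grad,
              Eigen::VectorXf& m, Eigen::VectorXf& v, int t,
              float alpha = 1e-3f, float beta1 = 0.9f,
              float beta2 = 0.999f, float eps = 1e-8f) {
    m = beta1 * m + (1.0f - beta1) * grad;                                   // first moment
    v.array() = beta2 * v.array() + (1.0f - beta2) * grad.array().square();  // second moment
    const float bc1 = 1.0f - std::pow(beta1, static_cast<float>(t));         // bias-correction terms
    const float bc2 = 1.0f - std::pow(beta2, static_cast<float>(t));
    const Eigen::ArrayXf mHat = m.array() / bc1;
    const Eigen::ArrayXf vHat = v.array() / bc2;
    theta.array() -= alpha * mHat / (vHat.sqrt() + eps);                     // parameter update
}
```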
For more information, see Stochastic gradient descent, extensions and variants.
The Back Propagation Method for CNN
Adam: A Method for Stochastic Optimization
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

