This project is a C++ Object-Oriented Programming (OOP) version of my earlier SimpleCNN implementation in Python (PyTorch). While the original project served as a practical introduction to coding a simple CNN, this C++ version takes a lower-level approach: the entire network is built from scratch, without predefined ML frameworks or utilities.
By implementing each layer, operation, and optimization step manually, this project highlights the inner workings of convolutional networks such as convolution, activation functions, pooling, normalization, and backpropagation. The OOP design ensures modularity, making it easier to extend, experiment with new components, or adapt the network to different tasks.
By handling every component explicitly, from convolutions and bias application to activation functions and optimization, the project exposes the mechanics that higher-level libraries often abstract away.
Before diving into this project, I recommend having a look at SimpleCNN.
The network is trained on the MNIST dataset, a collection of simple grayscale images of handwritten single digits (0-9). Given an image, the network's target is to classify it as the correct digit (class).
The MNIST dataset contains 70,000 images, split into a training set of 60,000 images and a test set of 10,000 images, and is commonly used for training various image processing systems. It is a subset of a larger set available from NIST; the digits have been size-normalized and centered in a fixed-size image. The binary format of the dataset is available for download from Yann LeCun - THE MNIST DATABASE, and also on Kaggle.
For more information, see MNIST Dataset.
Our network consists of 6 layers:
- Convolution Layer with a kernel size of 5x5, and ReLU activation function.
- Max-pool Layer with a kernel size of 2x2.
- Convolution Layer with a kernel size of 5x5 and ReLU activation function.
- Max-pool Layer with a kernel size of 2x2.
- Fully-connected Layer with an input size of 1024, an output size of 512, and a ReLU activation function.
- Fully-connected Layer with an input size of 512, an output size of 10 (the number of classes), and a Softmax activation function.
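Assuming stride 1 and no padding in the convolutions, the feature-map shapes for a $28\times28$ MNIST image evolve as follows (with $C_1$, $C_2$ denoting the channel counts of the two convolutions; $C_2$ must be 64 for the flattened output to match the first fully-connected layer's input of 1024):

$$1\times28\times28 \xrightarrow{\text{conv }5\times5} C_1\times24\times24 \xrightarrow{\text{pool }2\times2} C_1\times12\times12 \xrightarrow{\text{conv }5\times5} C_2\times8\times8 \xrightarrow{\text{pool }2\times2} C_2\times4\times4 \xrightarrow{\text{flatten}} C_2\cdot4\cdot4 = 1024$$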
The Simple CNN is implemented in C++ using an object-oriented design. The network layers and methods are implemented with the Eigen library, an open-source C++ linear algebra library.
Every layer apart from the fully-connected ones takes a 4-dimensional input (N, C, H, W), where N is the batch size, C is the number of channels, and H, W are the height and width (the resolution of the images), respectively.
Also see SimpleCNN.hpp.
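One simple way to represent such a 4-dimensional (N, C, H, W) input with Eigen's dense matrices is a nested container of H x W matrices indexed by batch and channel. This is only an illustrative sketch; the project's actual tensor type in SimpleCNN.hpp may differ.

```cpp
#include <Eigen/Dense>
#include <vector>

// Illustrative (N, C, H, W) representation: tensor[n][c] is an H x W feature map.
using Tensor4D = std::vector<std::vector<Eigen::MatrixXf>>;

// Allocate a zero-initialized batch of N samples with C channels of size H x W.
Tensor4D makeTensor(int N, int C, int H, int W) {
    return Tensor4D(N, std::vector<Eigen::MatrixXf>(C, Eigen::MatrixXf::Zero(H, W)));
}

int main() {
    Tensor4D batch = makeTensor(32, 1, 28, 28);  // a batch of 32 MNIST images
    batch[0][0](14, 14) = 1.0f;                  // pixel access: sample 0, channel 0
}
```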
Applies a 2D convolution over an input signal composed of one or several input planes. See Convolution2D.hpp.
The Convolutional Layer is the core building block of a Convolutional Neural Network (CNN), commonly used in image processing and computer vision. A convolutional layer applies a small filter (or kernel) across the input to extract features such as edges, textures, and shapes.
Given an input of size $(N, C_{in}, H_{in}, W_{in})$ and a square kernel of size $k \times k$, the output dimensions are:

$$H_{out} = \left\lfloor \frac{H_{in} + 2p - k}{s} \right\rfloor + 1 \quad;\quad W_{out} = \left\lfloor \frac{W_{in} + 2p - k}{s} \right\rfloor + 1$$

Where:

- $k$ - kernel size, the height/width of the square filter.
- $s$ - stride, how far the filter moves at each step. A stride of 1 means one pixel at a time.
- $p$ - padding, adds zeros around the input to control the size of the output.
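As a concrete illustration of these formulas, here is a minimal single-channel 2D convolution (cross-correlation) written with Eigen. The function name and signature are illustrative only and do not reproduce the project's Convolution2D.hpp.

```cpp
#include <Eigen/Dense>
#include <iostream>

// Single-channel 2D convolution (cross-correlation) with stride s and zero padding p.
Eigen::MatrixXf conv2d(const Eigen::MatrixXf& input,
                       const Eigen::MatrixXf& kernel,
                       int s = 1, int p = 0) {
    const int k = static_cast<int>(kernel.rows());  // square k x k kernel
    Eigen::MatrixXf padded = Eigen::MatrixXf::Zero(input.rows() + 2 * p,
                                                   input.cols() + 2 * p);
    padded.block(p, p, input.rows(), input.cols()) = input;

    const int hOut = (static_cast<int>(padded.rows()) - k) / s + 1;
    const int wOut = (static_cast<int>(padded.cols()) - k) / s + 1;
    Eigen::MatrixXf out(hOut, wOut);

    for (int i = 0; i < hOut; ++i)
        for (int j = 0; j < wOut; ++j)
            // sum of the element-wise product between the k x k window and the kernel
            out(i, j) = (padded.block(i * s, j * s, k, k).array() * kernel.array()).sum();
    return out;
}

int main() {
    Eigen::MatrixXf img    = Eigen::MatrixXf::Random(28, 28);
    Eigen::MatrixXf kernel = Eigen::MatrixXf::Random(5, 5);
    Eigen::MatrixXf fmap   = conv2d(img, kernel);   // 24 x 24 for a 5x5 kernel, s=1, p=0
    std::cout << fmap.rows() << " x " << fmap.cols() << std::endl;
}
```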
Applies a 2D max pooling over an input signal composed of one or several input planes. See MaxPooling.hpp.
A pooling layer is used in Convolutional Neural Networks (CNNs) to reduce the spatial dimensions (height and width) of feature maps while preserving the most important information.
Max pooling is the most common type of pooling. It works by sliding a small window (like 2×2 or 3×3) over the input and taking the maximum value in each region.
For an input of size $(N, C, H_{in}, W_{in})$ and a pooling window of size $k \times k$, the output dimensions, like the convolutional layer, are:

$$H_{out} = \left\lfloor \frac{H_{in} + 2p - k}{s} \right\rfloor + 1 \quad;\quad W_{out} = \left\lfloor \frac{W_{in} + 2p - k}{s} \right\rfloor + 1$$
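A minimal sketch of max pooling on a single channel with Eigen, assuming a non-overlapping window (stride equal to the window size), as used in this network; this is not the project's MaxPooling.hpp implementation.

```cpp
#include <Eigen/Dense>

// Max pooling on a single channel with a k x k window and stride k (non-overlapping).
Eigen::MatrixXf maxPool2d(const Eigen::MatrixXf& input, int k = 2) {
    const int hOut = static_cast<int>(input.rows()) / k;
    const int wOut = static_cast<int>(input.cols()) / k;
    Eigen::MatrixXf out(hOut, wOut);
    for (int i = 0; i < hOut; ++i)
        for (int j = 0; j < wOut; ++j)
            // maximum value inside the k x k window
            out(i, j) = input.block(i * k, j * k, k, k).maxCoeff();
    return out;
}
```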
Applies a linear transformation to the layer's 2-dimensional input of shape (N, features).
A fully-connected layer is a type of neural network layer where every neuron is connected to every neuron in the previous layer. It is usually placed at the end of convolutional or recurrent layers to make final predictions.
Each output neuron computes a weighted sum of all inputs, adds a bias, and then applies an activation function.
$$y = f(Wx + b)$$

Where:

- $x \in \mathbb{R}^{n_{\text{in}}}$ is the input vector.
- $W \in \mathbb{R}^{n_{\text{out}} \times n_{\text{in}}}$ is the weight matrix.
- $b \in \mathbb{R}^{n_{\text{out}}}$ is the bias vector.
- $f(\cdot)$ is an activation function (e.g., ReLU, sigmoid).
- $y \in \mathbb{R}^{n_{\text{out}}}$ is the output vector.
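A minimal Eigen sketch of this forward pass for a batch stored row-wise (one sample per row), with ReLU as the activation; the names and data layout are illustrative assumptions, not the project's fully-connected class.

```cpp
#include <Eigen/Dense>

// Fully-connected forward pass y = f(Wx + b) for a batch X of shape N x n_in,
// with W of shape n_out x n_in, bias b of length n_out, and ReLU as the activation f.
Eigen::MatrixXf fcForward(const Eigen::MatrixXf& X,
                          const Eigen::MatrixXf& W,
                          const Eigen::VectorXf& b) {
    Eigen::MatrixXf Z = X * W.transpose();   // N x n_out
    Z.rowwise() += b.transpose();            // add the bias to every sample
    return Z.cwiseMax(0.0f);                 // ReLU
}
```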
Weight initialization helps neural networks train efficiently by keeping activations and gradients stable across layers. Xavier initialization works well with sigmoid or tanh activations, using a variance of $\mathrm{Var}(W) = \frac{2}{n_{in} + n_{out}}$.
He Initialization (also called Kaiming Initialization) is designed for neural networks that use ReLU or Leaky ReLU activations. Since ReLU sets all negative inputs to zero, it effectively reduces the number of active neurons by half, which can shrink the variance of outputs layer by layer if not handled properly.
To compensate, He Initialization sets the weights to have a higher variance:
- For a normal distribution:
  $$W \sim \mathcal{N}\Bigg(0,\quad \frac{2}{n_{in}}\Bigg)$$
- For a uniform distribution:
  $$W \sim \mathcal{U}\Bigg(-\sqrt{\frac{6}{n_{in}}},\quad \sqrt{\frac{6}{n_{in}}}\Bigg)$$
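A minimal sketch of He (normal) initialization with Eigen and the standard library; the function name and the use of std::mt19937 are illustrative choices, not the project's initializer.

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <random>

// He (Kaiming) normal initialization: W ~ N(0, 2 / n_in), i.e. std dev = sqrt(2 / n_in),
// for a weight matrix of shape n_out x n_in.
Eigen::MatrixXf heInit(int nOut, int nIn, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::normal_distribution<float> dist(0.0f, std::sqrt(2.0f / static_cast<float>(nIn)));
    Eigen::MatrixXf W(nOut, nIn);
    for (int i = 0; i < nOut; ++i)
        for (int j = 0; j < nIn; ++j)
            W(i, j) = dist(gen);
    return W;
}
```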
An activation function is a mathematical function applied to the output of each neuron in a neural network. It introduces non-linearity to the model, allowing it to learn complex patterns beyond just linear relationships.
Without activation functions, a neural network—no matter how deep—would behave like a simple linear model.
- ReLU (Rectified Linear Unit) is one of the most popular activation functions in deep learning.
- Softmax is usually used in the output layer for classification, especially multi-class problems.
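A minimal Eigen sketch of these two activations; the softmax subtracts the maximum logit before exponentiating for numerical stability. This is illustrative and not necessarily how the project implements them.

```cpp
#include <Eigen/Dense>

// ReLU: max(x, 0) applied element-wise.
Eigen::VectorXf relu(const Eigen::VectorXf& x) {
    return x.cwiseMax(0.0f);
}

// Softmax: exponentiate after subtracting the maximum logit (numerical stability),
// then normalize so the outputs sum to 1.
Eigen::VectorXf softmax(const Eigen::VectorXf& x) {
    Eigen::ArrayXf shifted = (x.array() - x.maxCoeff()).exp();
    return (shifted / shifted.sum()).matrix();
}
```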
Regularization refers to a set of techniques used to prevent a machine learning model from overfitting the training data, improving its generalization to unseen data. It works by constraining or penalizing the model’s complexity, encouraging simpler solutions that are less sensitive to noise in the data.
In our model, Simple CNN, we use Dropout and Batch Normalization methods.
Dropout is a regularization technique where, during training, a fixed percentage of neurons (e.g. 50%) are randomly set to zero in each forward pass, preventing co-adaptation of neurons. This prevents over-reliance on specific neurons and encourages redundancy and robustness.
At inference time, all neurons are active. To keep the expected activations consistent between training and inference, the kept activations are typically scaled by $\frac{1}{1-p}$ during training (inverted dropout), so that during inference all units can be used as-is.
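A minimal sketch of inverted dropout during training with Eigen and the standard library; the function signature and the choice of inverted dropout are illustrative assumptions, not the project's Dropout implementation.

```cpp
#include <Eigen/Dense>
#include <random>

// Inverted dropout (training only): zero each activation with probability p and
// scale the survivors by 1/(1-p), so no rescaling is needed at inference time.
Eigen::MatrixXf dropout(const Eigen::MatrixXf& x, float p, std::mt19937& gen) {
    std::bernoulli_distribution keep(1.0 - p);
    Eigen::MatrixXf out = x;
    for (Eigen::Index i = 0; i < out.rows(); ++i)
        for (Eigen::Index j = 0; j < out.cols(); ++j)
            out(i, j) = keep(gen) ? out(i, j) / (1.0f - p) : 0.0f;
    return out;
}
```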
Batch Normalization aims to stabilize and accelerate training by ensuring each channel’s activations have consistent statistics across mini-batches. This method normalizes each feature channel’s activations to zero mean and unit variance over a mini-batch, thereby reducing internal covariate shift; it can also have a slight regularizing effect (due to batch noise).
For a layer’s inputs $x_1, \dots, x_m$ over a mini-batch $B$, we first compute the batch statistics and normalize:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i \quad;\quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2 \quad;\quad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

Then we scale ($\gamma$) and shift ($\beta$) the normalized values:

$$y_i = \gamma \hat{x}_i + \beta$$

where $\gamma$ and $\beta$ are learnable parameters and $\epsilon$ is a small constant added for numerical stability.
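A minimal sketch of these equations for a mini-batch stored as an N x D matrix (one feature per column); the per-column loop and names are illustrative, not the project's Batch Normalization layer, and the running statistics needed for inference are omitted.

```cpp
#include <Eigen/Dense>
#include <cmath>

// Batch normalization over a mini-batch X of shape N x D (one feature per column):
// normalize each column with the batch mean/variance, then scale by gamma and shift by beta.
Eigen::MatrixXf batchNorm(const Eigen::MatrixXf& X,
                          const Eigen::VectorXf& gamma,
                          const Eigen::VectorXf& beta,
                          float eps = 1e-5f) {
    Eigen::MatrixXf Y(X.rows(), X.cols());
    for (Eigen::Index j = 0; j < X.cols(); ++j) {
        const float mean = X.col(j).mean();
        const float var  = (X.col(j).array() - mean).square().mean();
        Y.col(j).array() =
            gamma(j) * (X.col(j).array() - mean) / std::sqrt(var + eps) + beta(j);
    }
    return Y;
}
```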
The Cross Entropy Loss function is widely used for classification tasks, as it measures the difference between the predicted probability distribution and the true distribution. Given a predicted probability vector $\hat{y}$ and a one-hot target vector $y$ over $C$ classes, the loss is:

$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$
This loss penalizes confident incorrect predictions more heavily than less certain ones, encouraging the model to assign higher probabilities to the correct classes. Minimizing cross-entropy effectively maximizes the likelihood of the correct labels under the model’s predicted distribution.
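A minimal sketch of this loss for a batch of softmax outputs, using integer class labels instead of one-hot vectors; the names are illustrative, not the project's loss class.

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean cross-entropy for a batch of predicted probabilities P (N x C, e.g. softmax
// outputs) and integer class labels; a small epsilon guards against log(0).
float crossEntropy(const Eigen::MatrixXf& P, const std::vector<int>& labels) {
    const float eps = 1e-12f;
    float loss = 0.0f;
    for (std::size_t i = 0; i < labels.size(); ++i)
        loss -= std::log(P(static_cast<Eigen::Index>(i), labels[i]) + eps);
    return loss / static_cast<float>(labels.size());
}
```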
Adam (Adaptive Moment Estimation) is a widely used optimization algorithm in machine learning. It combines the benefits of Momentum and RMSProp, maintaining running estimates of both the mean and the uncentered variance of gradients to adaptively adjust the learning rate for each parameter. By using these adaptive estimates, Adam can converge faster and more reliably on complex models, handle noisy gradients, and often requires less manual tuning of the learning rate compared to standard stochastic gradient descent. Its adaptive nature makes Adam particularly effective for large-scale problems and deep neural networks, where gradients can vary significantly across parameters.
- $\theta_t$ : parameters at time step t.
- $\beta_1, \beta_2$ : exponential decay rates for the moment estimates.
- $\alpha$ : learning rate.
- $\epsilon$ : small constant to prevent division by zero.
- $\lambda$ : weight decay coefficient.
- Compute gradients:
  $$g_t = \nabla_\theta J(\theta_t)$$
- Update moment estimates:
  $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad;\quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
- Bias correction:
  $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \quad;\quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
- Parameter update:
  $$\theta_{t+1} = \theta_t - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
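A minimal sketch of one Adam step on a flattened parameter vector, following the update rules above (without the weight decay term $\lambda$); the function name and default hyperparameters are illustrative, not the project's optimizer.

```cpp
#include <Eigen/Dense>
#include <cmath>

// One Adam update step on a flat parameter vector theta.
// m and v hold the running first/second moment estimates; t is the (1-based) step count.
void adamStep(Eigen::VectorXf& theta, const Eigen::VectorXf& grad,
              Eigen::VectorXf& m, Eigen::VectorXf& v, int t,
              float alpha = 1e-3f, float beta1 = 0.9f,
              float beta2 = 0.999f, float eps = 1e-8f) {
    m = beta1 * m + (1.0f - beta1) * grad;                                   // first moment
    v.array() = beta2 * v.array() + (1.0f - beta2) * grad.array().square();  // second moment
    const float bc1 = 1.0f - std::pow(beta1, static_cast<float>(t));         // bias-correction terms
    const float bc2 = 1.0f - std::pow(beta2, static_cast<float>(t));
    const Eigen::ArrayXf mHat = m.array() / bc1;
    const Eigen::ArrayXf vHat = v.array() / bc2;
    theta.array() -= alpha * mHat / (vHat.sqrt() + eps);                     // parameter update
}
```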
For more information, see Stochastic gradient descent, extensions and variants.
The Back Propagation Method for CNN
Adam: A Method for Stochastic Optimization
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

