Building an image classification model from scratch to recognize my own terrible handwriting
In an earlier post, I walked through the Word2Vec algorithm and a reasonably bare-bones neural network. In this post, I’m going to walk through the implementation of another kind of neural network: the convolutional neural network (CNN), which is effective for image recognition and classification tasks, among other things. Again, the Rust code referenced below does not leverage any ML libraries and implements this network from scratch. The code referenced in this post can be found here on GitHub.
Convolutional Neural Networks are named after the mathematical concept of convolution, which, given two functions f and g, produces a third function that describes how the shape of one is modified by the other, using the following integral:
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) \, g(t - \tau) \, d\tau
In the context of a CNN, the convolutional operation refers to the sliding of small kernels of weights across an image’s pixels, multiplying the two together and storing the result in an output feature map.
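In discrete, two-dimensional form, which is what the convolutional layer below actually computes (strictly speaking a cross-correlation, since the kernel isn’t flipped), each output cell is a weighted sum of the input pixels under the kernel:
out(i, j) = \sum_{m=0}^{R-1} \sum_{n=0}^{C-1} image(i + m, j + n) \cdot K(m, n)
where K is an R × C kernel of learned weights.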
Typical CNNs are variants of feed-forward neural nets, so most of the rest of the algorithm is what you might expect (I walk through this with examples here). I found great resources for walking through the model architecture at a high level in this Towards Data Science post and this one (and, in more depth, this paper).
The network I’ll describe below was built to learn from the MNIST Database, which contains training and test data composed of handwritten digits collected from United States Census Bureau employees and high-school students. These digits are packed into 70 thousand 28×28-pixel grayscale images. After around five training epochs, this network can achieve 98%+ predictive accuracy on handwritten digits from the MNIST dataset, though I initially couldn’t get better than 80% accuracy on my own handwriting. More on that later!
What you’ll find below
- CNN Layers
- Patches, Kernels and Convolution
- Max Pooling
- Fully-Connected Layer
- Evaluation of the Network
CNN Layers
Generally speaking, there are three kinds of layers in a Convolutional Neural Network, though there may be multiple layers of each type or additional layer types, depending on the application:
- Convolutional layer: applies a series of kernels (also called filters) to create feature maps for the input data, which capture patterns or textures for use in other layers
- Pooling (usually “max” pooling) layer: extracts the most significant features from input regions and reduces the dimensionality of the processing space
- Fully connected layer: learns the likely combinations of its input features and, in conjunction with a softmax activation function, solves multi-class classification problems (a sketch of how these three layers compose follows this list)
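To make that composition concrete, here is a minimal sketch of how the three stages chain together in a forward pass. The trait and function names are illustrative only, not the ones used in the repository; the array shapes mirror the code shown later in this post (a 2-D image in, 3-D feature maps in between, and a 1-D probability vector out).
use ndarray::{Array1, Array2, Array3};

// Illustrative signatures only; the repository organizes its layers differently.
trait Forward<In, Out> {
  fn forward(&mut self, input: &In) -> Out;
}

// A forward pass is just the three stages applied in order:
// image -> feature maps -> pooled feature maps -> class probabilities.
fn classify<C, P, F>(conv: &mut C, pool: &mut P, fc: &mut F, image: &Array2<f64>) -> Array1<f64>
where
  C: Forward<Array2<f64>, Array3<f64>>, // convolutional layer
  P: Forward<Array3<f64>, Array3<f64>>, // max pooling layer
  F: Forward<Array3<f64>, Array1<f64>>, // fully connected layer + softmax
{
  let features = conv.forward(image);
  let pooled = pool.forward(&features);
  fc.forward(&pooled)
}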
Patches, Kernels and Convolution
This neural network extracts patches from the images it’s processing. The patch captures spatial relationships (meaning, it preserves the two-dimensional layout) between a subset of pixels within the image and uses this pixel subset as a feature in the model.
/// A patch is a concept from image processing which captures subregions of the image pixels,
/// used to capture groups of spatial features from the underlying image.
pub fn
patches (&mut self, image: &Array2<f64>) -> Vec<Patch>
{
  let mut data: Vec<Patch> = Vec::new();
  for x in 0..(image.shape()[0] - self.rows + 1) {
    for y in 0..(image.shape()[1] - self.cols + 1) {
      let p = Patch {
        x: x,
        y: y,
        data: image.slice(
            // Collect a 2-D slice of the image from
            // [(x .. x + self.rows), (y .. y + self.cols)]
            s![
              x..(x + self.rows),
              y..(y + self.cols)
            ])
          .to_owned()
          .insert_axis(Axis(2))
      };
      data.push(p)
    }
  }
  data
}
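Because the window slides one pixel at a time with no padding, an H × W image and an R × C kernel yield (H − R + 1) × (W − C + 1) patches. For a 28×28 MNIST image and, say, a 3×3 kernel, that’s 26 × 26 = 676 overlapping patches per image.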
The convolutional layer also defines kernels, which are learned weight matrices multiplied against patches to capture features from the underlying pixels.
In this image, you can see a 2D window (the patch) sliding over the (blue) input image; at each position, the patch is multiplied against the kernel and the result is stored in the (green) output feature map.
Visualized differently, a kernel (whose weights are learned during back propagation) is multiplied against a patch at [0, 0] in blue. This results in an intermediate matrix, which is summed, and the resulting value is stored in the final feature matrix at the corresponding top-left coordinates.
This feature map is the output of this layer.
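Putting patches and kernels together, the convolutional layer’s forward pass looks roughly like the sketch below. The field names and the exact output layout here are assumptions for illustration (the output is arranged height × width × kernel, matching the shapes used by the pooling code later); the repository’s implementation differs in its details.
use ndarray::{s, Array2, Array3};

/// Illustrative convolutional layer: `kernels` has shape (num_kernels, rows, cols).
struct ConvLayer {
  rows: usize,
  cols: usize,
  kernels: Array3<f64>,
}

impl ConvLayer {
  /// Slide every kernel across the image; each output cell is the sum of an
  /// element-wise product between a kernel and the patch beneath it.
  fn forward(&self, image: &Array2<f64>) -> Array3<f64> {
    let (h, w) = (image.shape()[0], image.shape()[1]);
    let num_kernels = self.kernels.shape()[0];
    let mut out = Array3::<f64>::zeros((h - self.rows + 1, w - self.cols + 1, num_kernels));
    for x in 0..(h - self.rows + 1) {
      for y in 0..(w - self.cols + 1) {
        let patch = image.slice(s![x..(x + self.rows), y..(y + self.cols)]);
        for k in 0..num_kernels {
          let kernel = self.kernels.slice(s![k, .., ..]);
          // Element-wise multiply and sum -> one entry of the feature map
          out[[x, y, k]] = (&patch * &kernel).sum();
        }
      }
    }
    out
  }
}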
Max Pooling
The pooling layer, and the max pooling technique in particular, is a dimensionality and noise reduction step. It extracts the dominant features within the input representation and ignores the rest. Similar to the earlier convolutional layer, this pooling step extracts patches over its own input, which is the feature-map output of the convolutional layer.
Max pooling is implemented by simply extracting the maximum value (in our case, the convolved pixel magnitude) within each of these patches, further condensing the representation of the input image.
let patches = self.patches(input);
for p in patches.iter() {
  let depth = p.data.dim().2;
  // For each feature-map channel, take the maximum value within this patch
  let v: Vec<f64> = (0..depth).map(|i| {
    p.data
      .slice(s![.., .., i])
      .fold(f64::NEG_INFINITY, |acc, &v| acc.max(v))
  }).collect();
  // `a` is the layer's output array, indexed by patch position and channel
  a.slice_mut(s![p.x, p.y, ..]).assign(&Array1::from(v));
}
Visually, it looks like this:
It can then be flattened into a column vector for either training or inference within the feed-forward neural network captured by the next fully connected layer.
Fully-Connected Layer
The fully-connected layer uses the densified input features from the earlier layers and maps them into the output layer, using a softmax activation function. This translates those inputs into a probability distribution over the classes represented by the N dimensions of the output layer — in our case, ten neurons — representing the digits from 0–9.
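Concretely, softmax maps the vector of raw output scores z (one score per digit) to a probability for each class:
softmax(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{10} e^{z_j}}
so the ten outputs are non-negative and sum to one.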
This layer represents the feed-forward neural network, or multi-layer perceptron, stage of this CNN. It’s called the fully-connected layer because every neuron in this layer is connected to every neuron in the previous layer, unlike the convolutional and pooling layers described above, which focus on manipulating spatial segments of their input matrices.
/// Runs forward propagation in this layer of the network. Flattens input, computes probabilities of
/// outputs based on the dot product with this layer's weights and softmax outputs.
pub fn
forward_propagation<'a> (&mut self, input: &'a Array3<f64>, ctx: &mut SoftmaxContext<'a>) -> Array1<f64>
{
  let flattened: Array1<f64> = input.to_owned().into_shape((input.len(),)).unwrap();
  let dot_result = flattened.dot(&self.weights).add(&self.bias);
  let probabilities = softmax(&dot_result);

  ctx.input = Some(input);
  ctx.flattened = Some(flattened);
  ctx.dot_result = Some(dot_result);
  ctx.output = Some(probabilities.clone());

  probabilities
}
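The forward pass above relies on a softmax helper. A minimal sketch of such a function is below, assuming the usual trick of subtracting the maximum score before exponentiating for numerical stability; the repository’s version may differ.
use ndarray::Array1;

/// Softmax over a vector of raw scores. Subtracting the max before
/// exponentiating avoids overflow without changing the result.
fn softmax(z: &Array1<f64>) -> Array1<f64> {
  let max = z.fold(f64::NEG_INFINITY, |acc, &v| acc.max(v));
  let exps = z.mapv(|v| (v - max).exp());
  let sum = exps.sum();
  exps / sum
}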
This relationship between features through each layer is visualized here:
Evaluation of the Network
The MNIST dataset comes with sixty thousand training images and labels, and ten thousand test images and labels. Of course you could use all seventy thousand pairs for training and test on whatever you like — this is just how the files are partitioned.
With a learning rate of 0.01, the model accuracy improves pretty quickly:
1000 examples, avg loss 1.3944 accuracy 68.8% avg batch time 783ms
2000 examples, avg loss 0.7603 accuracy 83.8% avg batch time 776ms
3000 examples, avg loss 0.7397 accuracy 84.2% avg batch time 774ms
4000 examples, avg loss 0.6618 accuracy 85.5% avg batch time 775ms
5000 examples, avg loss 0.6253 accuracy 86.5% avg batch time 774ms
6000 examples, avg loss 0.6403 accuracy 86.3% avg batch time 776ms
7000 examples, avg loss 0.5936 accuracy 86.0% avg batch time 775ms
8000 examples, avg loss 0.5467 accuracy 88.9% avg batch time 775ms
9000 examples, avg loss 0.6172 accuracy 87.1% avg batch time 777ms
And converges after a few epochs:
57000 examples, avg loss 0.0630 accuracy 98.9% avg batch time 788ms
58000 examples, avg loss 0.0845 accuracy 98.3% avg batch time 783ms
59000 examples, avg loss 0.0568 accuracy 99.0% avg batch time 784ms
60000 examples, avg loss 0.0609 accuracy 99.1% avg batch time 797ms
I wanted to test this network on a classification task of my own handwritten digits. I took a photo of numbers I wrote by hand, inverted the colors, and scaled them to the same 28×28 format as the images in the MNIST dataset.
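For reference, those preprocessing steps (grayscale, invert, resize to 28×28, normalize to [0, 1]) could be scripted roughly as below with the image crate. This helper is illustrative and not part of the repository.
use image::imageops::FilterType;

/// Illustrative helper: convert a photo of a digit into the 28×28,
/// white-on-black, [0, 1]-normalized format the MNIST-trained model expects.
fn load_custom_digit(path: &str) -> image::ImageResult<Vec<f64>> {
  let mut img = image::open(path)?.grayscale();
  img.invert(); // MNIST digits are light strokes on a dark background
  let img = img.resize_exact(28, 28, FilterType::Lanczos3);
  Ok(img.to_luma8().pixels().map(|p| p.0[0] as f64 / 255.0).collect())
}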
The model accuracy is pretty good against the dataset, so I assumed it would do pretty well classifying my own, famously bad handwriting:
cnn-rs ~>> ./target/release/cnn-rs --load ./cnn.model --input_dir=./custom/
loaded model data from ./cnn.model
predict that the digit in 0.jpeg is 0
predict that the digit in 1.jpeg is 1
predict that the digit in 6.jpeg is 5
predict that the digit in 7.jpeg is 8
predict that the digit in 8.jpeg is 8
predict that the digit in 4.jpeg is 4
predict that the digit in 5.jpeg is 5
predict that the digit in 9.jpeg is 4
predict that the digit in 2.jpeg is 2
predict that the digit in 3.jpeg is 3
Just 7/10.
In the bigger picture, that’s pretty good for a hand-written image classifier, but I thought it would do better. My written 4s and 9s are often confused by humans (especially former math teachers). I’d previously learned to write my 7s with a crossbar to distinguish from my 1s — it’s uncommon in the US and it makes sense that it would confuse the model, given its US-based training data. Getting my 6 confused with a 5 is odd.
Since I have this sample of my own handwriting, I can just fine-tune the model based on this small sample set — running back-propagation against these ten images and their filename labels. This is kind of like transfer learning. I configured the code to run backprop just three times for each digit, which seemed insignificant compared to the three hundred thousand training examples used to train this model so far.
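Conceptually, that fine-tuning step is just a tiny extra training loop over the labeled custom images. The sketch below is illustrative; train_step stands in for whatever the model uses to run backprop on a single (image, label) pair, and is not the repository’s actual API.
use ndarray::Array2;

/// Illustrative fine-tuning loop: a few extra backprop passes over the
/// hand-labeled custom images before predicting on them.
fn fine_tune<F>(custom: &[(Array2<f64>, u8)], passes: usize, mut train_step: F)
where
  F: FnMut(&Array2<f64>, u8),
{
  for _ in 0..passes {
    for (image, label) in custom {
      train_step(image, *label);
    }
  }
}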
I enabled this behavior behind the --cheat flag.
cnn-rs ~>> ./target/release/cnn-rs --cheat --load ./cnn.model --input_dir=./custom/
loaded model data from ./cnn.model
predict that the digit in 0.jpeg is 0
predict that the digit in 1.jpeg is 1
predict that the digit in 6.jpeg is 6
predict that the digit in 7.jpeg is 7
predict that the digit in 8.jpeg is 8
predict that the digit in 4.jpeg is 4
predict that the digit in 5.jpeg is 5
predict that the digit in 9.jpeg is 9
predict that the digit in 2.jpeg is 2
predict that the digit in 3.jpeg is 3
10/10 😎
Further reading:
- https://towardsdatascience.com/building-convolutional-neural-network-using-numpy-from-scratch-b30aac50e50a
- https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
- https://towardsdatascience.com/building-a-convolutional-neural-network-from-scratch-using-numpy-a22808a00a40
- https://arxiv.org/pdf/1603.07285.pdf