How neural networks learn: from initial chaos to pattern recognition

Imagine a musician who has just received an instrument they’ve never played. Their first attempts will sound bad, almost random. But with each note, they adjust the pressure of their fingers, the position of their hands, the force of their breath. Little by little, the sound approaches the desired melody. This is how a neural network learns: not with magic or explicit instructions, but through an iterative process of trial, error, and adjustment. At first, its predictions are as bad as guessing at random. But through a learning loop — the training cycle — the network transforms a pile of random numbers into a system capable of recognizing faces, translating languages, or diagnosing diseases.

To understand how this transformation happens, we need to look inside the black box. A neural network is nothing more than a series of layers of artificial neurons connected to one another. Each connection has a weight, a number that determines how important that signal is. When the network receives an input (say, the pixels of an image), those values travel layer by layer toward the output. Each neuron sums its weighted inputs, usually adds a bias term, and applies an activation function that decides how strongly it should “fire”, if at all. That’s the forward pass: the moment when the network produces a prediction.
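The forward pass can be sketched in a few lines of plain Python. This is only an illustration: the tiny network shape, the weights, the bias values, and the choice of sigmoid as the activation are all assumptions made for the example, not a reference implementation.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of its inputs, then an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

def forward(inputs, layers):
    """Pass the inputs through each layer in turn.

    `layers` is a list of layers; each layer is a list of (weights, bias)
    pairs, one pair per neuron.
    """
    activations = inputs
    for layer in layers:
        activations = [neuron(activations, w, b) for w, b in layer]
    return activations

# Hypothetical network: 2 inputs, a hidden layer of 2 neurons, 1 output neuron.
layers = [
    [([0.5, -0.5], 0.0), ([1.0, 1.0], -1.0)],  # hidden layer
    [([1.0, -1.0], 0.0)],                      # output layer
]
prediction = forward([1.0, 1.0], layers)
```

The output of one layer simply becomes the input of the next, which is all that “traveling layer by layer” means in practice.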

But that prediction is almost always wrong at first. How do you measure the error? That’s where the loss function comes in. If the network is trying to predict a number (like the price of a house), a common choice is mean squared error (MSE), which simply averages the squared differences between what it predicted and what it should have said. If it’s classifying images (is it a cat or a dog?), the usual choice is cross-entropy, which penalizes wrong predictions made with high confidence more heavily. The higher the loss, the worse the network is doing.
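Both losses fit in a few lines. The sketch below uses made-up numbers, and the cross-entropy shown is the two-class (binary) form for a single example; the function names and the small `eps` clamp that avoids log(0) are choices made for this illustration.

```python
import math

def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    n = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n

def binary_cross_entropy(predicted_prob, label, eps=1e-12):
    """Loss for one example; predicted_prob is the predicted P(label == 1)."""
    p = min(max(predicted_prob, eps), 1.0 - eps)  # keep log() finite
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

# Regression: house prices in thousands (invented values).
mse = mean_squared_error([310.0, 190.0], [300.0, 200.0])

# Classification: the true label is 1 ("cat").
mildly_wrong = binary_cross_entropy(0.4, 1)        # unsure, and wrong
confidently_wrong = binary_cross_entropy(0.01, 1)  # very sure, and wrong
```

Comparing `mildly_wrong` with `confidently_wrong` shows the asymmetry described above: being wrong with 99% confidence costs several times more than being wrong while unsure.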

Having a measure of error is useful, but not enough. The network needs to know in which direction to adjust each of its thousands or millions of weights to reduce that error. And here we arrive at the heart of training: backpropagation.

Backpropagation, popularized by Rumelhart, Hinton, and Williams in their seminal 1986 paper, is an elegantly efficient algorithm that calculates how much each weight contributed to the final error. It works backward, from the output layer toward the input, applying the chain rule of differential calculus. It’s like an investigation of accountability: if the final output is wrong, what share of the blame falls on each neuron in the previous layer? Backpropagation answers that question with mathematical precision, and it does so cheaply: the blame already computed for one layer is reused to compute the blame for the layer before it, instead of recalculating each weight’s effect from scratch. Without it, training deep networks would be computationally infeasible.
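To make the chain rule concrete, here is a minimal sketch for a deliberately tiny network: one input, one sigmoid hidden neuron, one linear output, and a squared-error loss. The architecture, the variable names, and the numbers are assumptions chosen for illustration. The hand-derived gradients are compared against a numerical estimate, which is a common sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w1, w2, x, target):
    """Forward pass and loss: h = sigmoid(w1 * x), y = w2 * h, loss = (y - target)^2."""
    h = sigmoid(w1 * x)
    y = w2 * h
    return (y - target) ** 2

def backprop(w1, w2, x, target):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    # Forward pass, keeping the intermediate values for reuse.
    h = sigmoid(w1 * x)
    y = w2 * h
    # Backward pass: start at the output and move toward the input.
    dloss_dy = 2.0 * (y - target)
    dloss_dw2 = dloss_dy * h           # y = w2 * h, so dy/dw2 = h
    dloss_dh = dloss_dy * w2           # blame passed back to the hidden neuron
    dh_dz = h * (1.0 - h)              # derivative of the sigmoid
    dloss_dw1 = dloss_dh * dh_dz * x   # z = w1 * x, so dz/dw1 = x
    return dloss_dw1, dloss_dw2

def numerical_gradient(w1, w2, x, target, eps=1e-6):
    """Finite-difference estimate, used only to check the result."""
    g1 = (loss_fn(w1 + eps, w2, x, target) - loss_fn(w1 - eps, w2, x, target)) / (2 * eps)
    g2 = (loss_fn(w1, w2 + eps, x, target) - loss_fn(w1, w2 - eps, x, target)) / (2 * eps)
    return g1, g2

analytic = backprop(0.3, -0.8, 1.5, 1.0)
numeric = numerical_gradient(0.3, -0.8, 1.5, 1.0)
```

Note how `dloss_dy` is computed once and reused for both weights. In a deep network that reuse happens at every layer, which is where the efficiency comes from.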

But backpropagation only calculates the gradient: for every weight, how steeply the error changes when that weight changes. Real learning happens when the network uses that gradient to update its weights, and that’s where gradient descent comes in. The classic analogy is a blindfolded person walking through mountainous terrain, where elevation represents the error. To reach the valley (minimum error), the person feels the slope underfoot and steps in the direction that descends most steeply. The gradient provides that slope information: it points in the direction of steepest ascent, so the network moves its weights in the opposite direction.
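The update rule itself is one line: subtract the gradient, scaled by a step size, from each weight. The sketch below applies it to an invented two-weight “terrain” whose lowest point is at (1, -2); the loss function, the starting point, and the step size of 0.1 are assumptions for the example.

```python
def loss(w):
    """A bowl-shaped error surface with its minimum at w = [1, -2]."""
    return (w[0] - 1.0) ** 2 + 4.0 * (w[1] + 2.0) ** 2

def gradient(w):
    """Partial derivatives of the loss; points toward steepest ascent."""
    return [2.0 * (w[0] - 1.0), 8.0 * (w[1] + 2.0)]

def gradient_descent(w, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient."""
    w = list(w)
    for _ in range(steps):
        g = gradient(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w

start = [0.0, 0.0]
end = gradient_descent(start)
```

After the loop, `end` sits very close to the bottom of the bowl. In a real network the gradient comes from backpropagation rather than from a formula written by hand.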

The size of those steps is controlled by a hyperparameter called the learning rate. If it’s too large, the network overshoots the minimum with every leap, bouncing around it or even moving farther away, and never converges. If it’s too small, it advances so slowly that training takes forever. Finding the sweet spot is part of the art of training networks.
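The effect is easy to see on a one-weight toy problem. In this sketch the error is (w - 3)^2, so the ideal weight is 3; the three learning rates and the 20-step budget are arbitrary values picked to show each regime.

```python
def run_gradient_descent(learning_rate, steps=20, start=0.0):
    """Minimize (w - 3)^2 from `start` and return the final weight."""
    w = start
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
        w -= learning_rate * grad
    return w

too_large = run_gradient_descent(1.1)    # each step overshoots farther than the last
too_small = run_gradient_descent(0.001)  # barely moves away from the start
reasonable = run_gradient_descent(0.1)   # ends close to 3
```

With the same number of steps, only the middle setting ends up near the target: the large rate leaves the weight far from 3 and getting worse, while the small rate has hardly left the starting point.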

Basic gradient descent works, but it has limitations. It can oscillate in narrow canyons of the error landscape or get stuck on plateaus. That’s where more sophisticated optimizers come in. Momentum, inspired by physics, accumulates “inertia” from previous steps: like a ball rolling downhill, if it has been going in a consistent direction, it keeps moving forward, smoothing out oscillations and accelerating convergence. The Adam optimizer, proposed by Kingma and Ba in 2014, goes a step further: it combines momentum with adaptive learning rates, adjusting the step size individually for each weight. Adam and its variants are today a common default choice in deep learning projects, from language models to recommendation systems.
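Both update rules are short enough to write out. The sketch below runs classical momentum and Adam on an invented “narrow canyon” (steep along one weight, shallow along the other). The Adam update follows the formulas in the Kingma and Ba paper, with their suggested values for `beta1`, `beta2`, and `eps`; the test surface, the step counts, and the learning rates are assumptions chosen so this toy problem converges quickly.

```python
import math

def loss(w):
    """Elongated bowl: much steeper along w[1] than along w[0]."""
    return w[0] ** 2 + 25.0 * w[1] ** 2

def gradient(w):
    return [2.0 * w[0], 50.0 * w[1]]

def train_momentum(w, steps=200, learning_rate=0.01, mu=0.9):
    """Gradient descent with classical momentum."""
    w = list(w)
    velocity = [0.0 for _ in w]
    for _ in range(steps):
        g = gradient(w)
        for i in range(len(w)):
            # Keep a fraction of the previous step, then add the new push.
            velocity[i] = mu * velocity[i] - learning_rate * g[i]
            w[i] += velocity[i]
    return w

def train_adam(w, steps=500, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum plus a per-weight adaptive step size."""
    w = list(w)
    m = [0.0 for _ in w]  # running average of gradients
    v = [0.0 for _ in w]  # running average of squared gradients
    for t in range(1, steps + 1):
        g = gradient(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            w[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w

start = [5.0, 1.0]
loss_start = loss(start)
loss_momentum = loss(train_momentum(start))
loss_adam = loss(train_adam(start))
```

The division by `sqrt(v_hat)` is what makes the step size individual: a weight whose gradients have been large takes proportionally smaller steps, and vice versa.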

Another key ingredient is activation functions. For years, the sigmoid and hyperbolic tangent functions were the standard choices, but both saturate: for large positive or negative inputs their derivatives approach zero, and the sigmoid’s derivative never exceeds 0.25. This caused the dreaded vanishing gradient problem: in deep networks, the gradient was multiplied by these small factors at every layer on its way back, so by the time it reached the earliest layers it was so tiny that their weights practically stopped updating. The ReLU function (Rectified Linear Unit, f(x) = max(0, x)) largely fixed this in a surprisingly simple way: for positive values, its derivative is 1, which allows the gradient to flow without shrinking. Additionally, ReLU produces sparse representations (many neurons output exactly zero, which is computationally efficient) and doesn’t require expensive operations like exponentials. After AlexNet used it in 2012 to win the ImageNet competition, ReLU became the default activation.
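A back-of-the-envelope calculation shows the difference. The sketch below multiplies the activation derivative across 20 layers, deliberately ignoring the weights (a simplifying assumption) and giving the sigmoid its best case, an input of 0 where its derivative peaks. The depth and the inputs are arbitrary choices for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z == 0

def relu(z):
    return max(0.0, z)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

def gradient_scale(derivative, pre_activation, depth):
    """Factor by which a gradient shrinks after passing through `depth`
    identical activations (weights are ignored in this simplification)."""
    return derivative(pre_activation) ** depth

sigmoid_scale = gradient_scale(sigmoid_derivative, 0.0, 20)  # 0.25 ** 20
relu_scale = gradient_scale(relu_derivative, 1.0, 20)        # 1.0 ** 20
```

Even in its best case the sigmoid shrinks the signal by roughly twelve orders of magnitude over 20 layers, while an active ReLU passes it through unchanged. The flip side, visible in `relu_derivative`, is that a ReLU with a negative input passes no gradient at all.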

Behind this entire process there’s a silent but indispensable actor: the GPU. Every step of the training cycle (forward pass, backpropagation, weight update) involves multiplying enormous matrices. GPUs, originally designed to render graphics in parallel, turned out to be perfect for these operations. AlexNet itself demonstrated in 2012 that training on GPUs could cut the time from weeks to days. Today, entire clusters of GPUs train models with hundreds of billions of parameters, but the fundamental principle remains the same.

Understanding how neural networks are trained matters because it’s the mechanism underlying virtually every modern application of artificial intelligence. Every time ChatGPT generates a response, an autonomous car detects a pedestrian, or Spotify recommends a song, the model behind it went through the same cycle: forward pass, loss calculation, backpropagation, and gradient descent. It’s not magic, but an iterative process of adjustment that turns initial randomness into useful knowledge. And the better we understand that process, the better we can use, and question, the tools it builds.
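Put together, the whole cycle fits in a short loop. The sketch below trains the smallest possible “network”, a single linear neuron with one weight and one bias, to recover the line y = 2x + 1 from four invented data points. The data, the seed, the learning rate, and the number of epochs are all assumptions for illustration, and the gradients are written out by hand because the model is so small.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Invented training data that follows y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Start from random numbers, as every network does.
w = random.uniform(-1.0, 1.0)
b = random.uniform(-1.0, 1.0)
learning_rate = 0.05
n = len(xs)
loss_history = []

for epoch in range(2000):
    # 1. Forward pass: produce a prediction for every example.
    predictions = [w * x + b for x in xs]
    # 2. Loss calculation: mean squared error.
    loss = sum((p - y) ** 2 for p, y in zip(predictions, ys)) / n
    loss_history.append(loss)
    # 3. Backpropagation: gradient of the loss for each parameter.
    grad_w = sum(2.0 * (p - y) * x for p, y, x in zip(predictions, ys, xs)) / n
    grad_b = sum(2.0 * (p - y) for p, y in zip(predictions, ys)) / n
    # 4. Gradient descent: step against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```

By the end of the loop `w` and `b` have moved from their random starting values to approximately 2 and 1, and `loss_history` records the error falling along the way. Real systems run the same four steps, only with far more parameters and data.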

Main source: Adam: A Method for Stochastic Optimization — Kingma and Ba (2014), the paper that introduced the Adam optimizer and synthesizes the modern principles of neural network training.