
Stephen's Blog

A Complete Guide to Neural Networks

Stephen Cheng

 

Intro

Artificial Intelligence is a term for machines that can interpret data, learn from it, and use what they learn to carry out tasks that would otherwise be performed by humans. Deep Learning is a branch of Artificial Intelligence that focuses on training machines to learn from data on their own, without much human guidance. Deep Learning has grown tremendously in the last decade. It has been applied to image classification, speech recognition, text-to-speech conversion, self-driving cars, and many other problems. To appreciate these advances, it helps to understand the basic structure and working of Neural Networks.

What is a Neural Network?

A neural network is a computing system, implemented in software or hardware, that is loosely modeled on the human brain. It can perform tasks such as the following:

  • Translate text
  • Identify faces
  • Recognize speech
  • Read handwritten text
  • Control robots
  • And a lot more

A neural network is usually described as having different layers. The first layer is the input layer, which picks up the input signals and passes them to the next layer. The next layer, called the hidden layer, does all kinds of calculations and feature extraction. Often, there will be more than one hidden layer. Finally, there is an output layer, which delivers the final result.

How Does a Neural Network Work?

Let’s take the real-life example of how traffic cameras identify license plates and speeding vehicles on the road. Suppose the image is 28 by 28 pixels and is fed as input to identify the license plate. Each input neuron holds a number, called its activation, which represents the grayscale value of the corresponding pixel, ranging from 0 to 1 (1 for a white pixel and 0 for a black pixel). A neuron is said to be lit up when its activation is close to 1. The pixels are fed into the input layer as arrays. If your image is bigger than 28 by 28 pixels, you must shrink it down, because the size of the input layer is fixed. In our example, we’ll name the inputs X1, X2, and X3; each of them represents one of the pixels coming in. The input layer then passes the input to the hidden layer. The interconnections are initially assigned weights at random. Each input signal is multiplied by the weight of its connection, the products are summed, and a bias is added.

The weighted sum of the inputs is fed as input to the activation function, to decide which nodes to fire for feature extraction. As the signal flows within the hidden layers, the weighted sum of inputs is calculated and is fed to the activation function in each layer to decide which nodes to fire.

Finally, the model predicts the outcome by applying a suitable activation function to the output layer. In our example with the car image, optical character recognition (OCR) is used to convert the image into text and identify what’s written on the license plate. The neural network example shows only three inputs, eight hidden layer nodes, and one output, but a real network has far more inputs and outputs. The error in the output, measured by a cost function, is back-propagated through the network, and the weights are adjusted to minimize it. We keep adjusting the weights until they fit all the different training examples we put in.

The output is then compared with the original result, and multiple iterations are done for maximum accuracy. With every iteration, the weight at every interconnection is adjusted based on the error.
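The forward pass described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the layer sizes (three inputs, eight hidden nodes, one output) come from the example, while the input values, the random seed, and the choice of sigmoid as the activation are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Three inputs (X1, X2, X3), eight hidden nodes, one output, as in the example.
x = np.array([0.0, 0.5, 1.0])        # hypothetical pixel activations in [0, 1]
W_hidden = rng.normal(size=(3, 8))   # weights assigned at random
b_hidden = np.zeros(8)               # one bias per hidden node
W_out = rng.normal(size=(8, 1))
b_out = np.zeros(1)

# Weighted sum plus bias, then the activation function, layer by layer.
hidden = sigmoid(x @ W_hidden + b_hidden)
output = sigmoid(hidden @ W_out + b_out)

print(hidden.shape, output.shape)    # (8,) (1,)
```

With random weights the output is meaningless at first; training adjusts the weights so that the output moves toward the expected result.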

Essential Components of Neural Networks

A neural network is a computational learning system that maps input variables to an output variable using an underlying mapping function that is nonlinear in nature. The architecture of a neural network comprises five essential components:

  1. Layers
  2. Nodes
  3. Activation Function
  4. Loss Function
  5. Optimizer

We will learn about each of these components in detail.

1. Layers

Simply put, a Neural Network is a stack of interconnected layers. There are three types of layers in a Neural Network: the Input Layer takes the input data, the Hidden Layers transform the input data, and the Output Layer generates the prediction for the given inputs after the transformations have been applied. The layers close to the Input Layer are called the lower layers, and the layers close to the Output Layer are called the upper layers.

2. Nodes

Each layer consists of multiple neurons, also called Nodes. In a fully connected network, each node in a given layer is connected to each node in the next layer. Each node takes the weighted sum of the inputs from the previous layer, applies a nonlinear activation function to it, and generates an output, which then becomes an input to the nodes in the next layer.

The number of nodes in the input layer corresponds to the number of independent variables in the data. The number of hidden layers and the number of nodes in each of them are hyperparameters, usually chosen according to the complexity of the problem and the amount of data available. The number of nodes in the output layer depends on the task: for a regression problem it is one; for a binary classification problem it is also one; for a multiclass classification problem it is equal to the number of labels or categories.

Each connection between neurons carries a weight that determines the strength of its influence on the data’s transformation. For any continuous function f, there exists a neural network that can approximate it to any desired accuracy on a bounded domain, given enough hidden nodes. The goal of training is to find the parameters 𝜃 (weights and biases) that result in the best decision boundary. A neuron can thus be defined as an operation with two parts, a linear component and an activation component, i.e. Neuron = Linear + Activation.

How Nodes Work in Layers

Let’s consider the illustration below. There is a dataset on the left. Typically, the dataset consists of some features denoted as X. In this case, we have two features, X1 and X2, for each sample. Additionally, there is a label Y, also referred to as the target or class, associated with each sample.

To learn the relationship between features X1 and X2 and their corresponding label, we utilize a neural network consisting of 2 input nodes (owing to the two features), one hidden layer with 3 neurons (the number of hidden layers and neurons can be adjusted as hyperparameters), and one output neuron. A weight matrix is associated with each layer. In this instance, there exists a hidden layer and an output layer, resulting in two weight matrices. These weights are initialized randomly, and throughout the training process, they are iteratively updated until the loss converges.

A weight matrix always has the dimension n x m:

  • n neurons in the previous layer (input layer or a previous hidden layer).
  • m neurons in the current layer (a hidden layer or the output layer).
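The shapes of the two weight matrices can be checked with a short NumPy sketch. The layer sizes (2 inputs, 3 hidden neurons, 1 output) come from the example; the sample values, the seed, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n_inputs, n_hidden, n_outputs = 2, 3, 1       # sizes from the example

# One weight matrix per layer, each of shape (previous layer, current layer).
W1 = rng.normal(size=(n_inputs, n_hidden))    # input -> hidden, shape (2, 3)
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_outputs))   # hidden -> output, shape (3, 1)
b2 = np.zeros(n_outputs)

X = np.array([[0.2, 0.7],                     # hypothetical samples (X1, X2),
              [0.9, 0.1]])                    # one row per sample

hidden = np.maximum(0, X @ W1 + b1)           # ReLU activation, shape (2, 3)
y_hat = hidden @ W2 + b2                      # one prediction per sample, shape (2, 1)

print(W1.shape, W2.shape, y_hat.shape)        # (2, 3) (3, 1) (2, 1)
```

Storing the weights as an n x m matrix lets a whole layer be computed with a single matrix multiplication.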

The illustration below shows how a neural network results in a specific function.

Each node in a hidden layer computes the following function: a = ReLU(weights * input + bias), where a is the node's activation (its output) and ReLU is the activation function. The last node, a7, is a combination of all previous functions, resulting in one single nonlinear function. To understand this combination of functions, take node a4 as an example. Node a4 depends on the functions of nodes a1 to a3, which in turn depend on the input x. In particular, the value of node a4 is calculated as ReLU(weights * input + bias). In this case the bias is -1, the weights are 0.3, 0.2 and 0.1, and the input is the output of the previous three nodes a1, a2 and a3.

In this illustration we use ReLU as the activation function, which is simply max(0, z).
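As a rough sketch, node a4 can be written out directly. The weights (0.3, 0.2, 0.1) and the bias (-1) come from the example; the values used for a1, a2 and a3 are hypothetical, since the text does not give them.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

weights = np.array([0.3, 0.2, 0.1])   # weights of node a4, from the example
bias = -1.0                           # bias of node a4, from the example

def node_a4(prev):
    """Activation of node a4 given the outputs of nodes a1, a2 and a3."""
    return relu(weights @ prev + bias)

# Hypothetical outputs of a1, a2, a3.
active = node_a4(np.array([2.0, 3.0, 1.0]))   # 0.6 + 0.6 + 0.1 - 1 = 0.3
silent = node_a4(np.array([1.0, 1.0, 1.0]))   # 0.6 - 1 = -0.4, clipped to 0

print(round(float(active), 2), float(silent))
```

The second call shows the effect of the negative bias: the weighted sum has to exceed 1 before the node produces any output.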

3. Activation Function

An activation function transforms the input of a node into an output value that is fed to the nodes in the next layer. In technical terms, an activation function, also known as a transfer function, defines how the weighted sum of the inputs and the bias is transformed into the output of a node in a given layer. Many activation functions map the output into a fixed range, e.g. 0 to 1 or -1 to +1, depending on the type of function used. Generally, one activation function is used across all hidden layers, and the output layer uses a function suited to the task. Activation functions fall into two types: linear and non-linear.

  • Linear Activation Function: The range of this function is -infinity to +infinity. A linear activation function is used in the output layer of the neural network when solving regression problems. It is not a good idea to use it in the hidden layers, because a stack of linear layers is itself just a linear function, so the network will not be able to capture the complex relationships in the underlying data.

  • Non-Linear Activation Function: Non-linear activation functions are by far the most used activation functions in Deep Learning. They include the Sigmoid or Logistic function, the Rectified Linear Unit (ReLU), and the Hyperbolic Tangent (Tanh). Next, let’s look at each of them in more detail.

  • Sigmoid Function: The Sigmoid activation function, also called the Logistic function, takes any real value as input and compresses it into the range between 0 and 1, which can be interpreted as the probability that the input belongs to a specific class. It is given as y = 1 / (1 + e^-z) and has an S-shaped curve. Here z = b + Σ(xi * wi), summed over the i input variables. For a very large positive z, e^-z approaches 0 and the output of the function approaches 1. For a very large negative z, e^-z becomes very large and the output of the function approaches 0. The Sigmoid function is frequently used as the activation function of the output node in binary classification problems. However, its gradient is small everywhere (at most 0.25) and vanishes as the output saturates toward 0 or 1, which can cause training to stagnate.

Implementation in Python:

import numpy as np

# Sigmoid activation function
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Derivative of the sigmoid function
def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)
  • Hyperbolic Tangent Function: The hyperbolic tangent function is similar to the sigmoid function but has a range of -1 to 1. It is given as f(z) = (e^z - e^-z) / (e^z + e^-z). Here z = b + Σ(xi * wi), summed over the i input variables. The Tanh function is also S-shaped, but its output is centered on zero.

The derivative of Tanh, 1 - tanh(z)^2, has a bell shape similar to the derivative of the sigmoid, but it peaks at 1 rather than at 0.25.

Implementation in Python:

import numpy as np

# Tanh activation function (np.tanh(z) gives the same result and is safer for large |z|)
def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

# Derivative of the tanh activation function
def tanh_prime(z):
    return 1 - np.power(tanh(z), 2)
  • ReLU (Rectified Linear Unit) Function: ReLU is today the most used activation function. It is piecewise linear: it passes every input greater than 0 through unchanged and outputs 0 otherwise, and the kink at 0 is what makes it non-linear. It is computationally efficient, because it uses only a simple thresholding operation. It is given as f(x) = max(0, x). It is less susceptible to the vanishing gradient problem, because the gradient is 1 whenever x > 0. However, every negative input results in a gradient of zero, so a neuron whose input is negative for all training examples never has its weights updated. Such a neuron is called a dead neuron.

Implementation in Python:

import numpy as np

# ReLU activation function (np.maximum works on numbers and NumPy arrays)
def relu(z):
    return np.maximum(0, z)

# Derivative of the ReLU activation function (taken as 0 at z = 0)
def relu_prime(z):
    return np.where(z > 0, 1.0, 0.0)
  • Leaky ReLU: Leaky ReLU addresses the dying ReLU problem. This variation of ReLU has a small positive slope in the negative region, so back-propagation still works for negative input values. A dying neuron typically arises when the learning rate is set very high: a weight update overshoots and pushes the neuron into the region where plain ReLU always outputs zero. The small slope of Leaky ReLU keeps a gradient flowing in that case. A drawback is that Leaky ReLU does not always provide consistent predictions for negative input values. The idea can be extended even further: instead of multiplying x by a fixed constant, we can multiply it by a parameter that is learned during training, which often seems to work better than Leaky ReLU. This extension is known as Parametric ReLU.

Comparing Leaky ReLU with ReLU makes the difference clear: the two agree for positive inputs and differ only for negative inputs, where Leaky ReLU returns alpha * z instead of 0.

Implementation in Python:

import numpy as np

# Leaky ReLU activation function; alpha is the small slope for negative inputs (e.g. 0.01)
def leakyrelu(z, alpha):
    return np.where(z > 0, z, alpha * z)

# Derivative of the Leaky ReLU activation function
def leakyrelu_prime(z, alpha):
    return np.where(z > 0, 1.0, alpha)
  • Softmax: Softmax is generally used in the last layer of a neural network, where it calculates a probability distribution over n different classes. Its main advantage is that it can handle multiple classes.

When we compare the sigmoid and softmax activation functions on the same inputs, they produce different results:

  • Input values: -0.5, 1.2, -0.1, 2.4
  • Sigmoid output values: 0.38, 0.77, 0.48, 0.92
  • Softmax output values: 0.04, 0.21, 0.06, 0.70

The probabilities produced by a Sigmoid are independent. Furthermore, they are not constrained to sum to one: 0.38 + 0.77 + 0.48 + 0.92 = 2.55. This is because the Sigmoid looks at each raw output value separately. With Softmax, in contrast, the outputs are interrelated. The Softmax probabilities always sum to one by design (the rounded values above add up to 1.01 only because of rounding). In this case, if we want to increase the likelihood of one class, the others have to decrease by the same total amount.
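The comparison can be reproduced with a short NumPy sketch. The input values are the ones listed above; the helper names and the max-subtraction trick in `softmax` (a common way to avoid overflow that does not change the result) are choices made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    shifted = z - np.max(z)       # avoids overflow, leaves the result unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

z = np.array([-0.5, 1.2, -0.1, 2.4])   # input values from the comparison

sig = sigmoid(z)     # each value is computed independently of the others
soft = softmax(z)    # each value depends on all inputs through the shared sum

print(np.round(sig, 2), round(float(sig.sum()), 2))
print(np.round(soft, 2), round(float(soft.sum()), 2))
```

The sigmoid outputs add up to well over one, while the softmax outputs form a proper probability distribution.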

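  • Threshold Function: The threshold (step) function outputs 1 if its input is at or above a threshold and 0 otherwise. It is used when you want a hard yes-or-no decision and don’t want to worry about the uncertainty in the middle.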

4. Loss Function

The predicted value is compared with the actual value, and the loss function gives the magnitude of the error. More precisely, it estimates how closely the distribution of the predictions matches the distribution of the target variable in the training data. The Maximum Likelihood Estimation (MLE) framework is used to compute this error over the entire training data. Under the MLE framework, the loss function for classification problems is Cross Entropy, and for regression problems it is Mean Squared Error.

  • Cross Entropy Loss: Cross Entropy gives the measure of the difference between two probability distributions of a random variable. In the context of the Neural Networks, it gives the difference between the predicted probability distribution and the distribution of the target variable in the training data set for a given set of weights or parameters. For a binary classification problem, the loss function used is binary cross entropy and for a multiclass classification problem, the loss function used is categorical cross entropy.

  • Binary Cross Entropy Loss: In case of a binary classification problem, where the target variable only has two options, class 1 or class 0, we use binary cross entropy loss to understand how bad the prediction was by measuring the dissimilarity between predicted probabilities and the actual target values. It is important to note that we only have one output node in binary classification tasks. This output node uses the sigmoid activation function, which squeezes values between 0 and 1. We use sigmoid, because the output value close to 1 can be interpreted as a high probability of the input belonging to one class, while an output value close to 0 indicates a high probability of belonging to the other class.

    • Binary Cross-Entropy Loss: L = - (y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred)), where log is the natural logarithm.
    • Suppose the prediction for the first sample is 0.7 and the target is 1. If we pass this into the binary cross-entropy loss function we get a loss of about 0.36: L = - (1 * log(0.7) + (1 - 1) * log(1 - 0.7)) = - log(0.7) ≈ 0.36.
    • Similarly, if the prediction for the second sample is 0.4 and the target is 0, the loss is about 0.51: L = - (0 * log(0.4) + (1 - 0) * log(1 - 0.4)) = - log(0.6) ≈ 0.51.
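The two worked examples can be checked with a small function. This sketch assumes the natural logarithm, which is the usual convention for cross-entropy; the function name and the `eps` clamp that guards against log(0) are additions made for illustration.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy for a single sample, using the natural logarithm."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)   # clamp to avoid log(0)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

loss_1 = binary_cross_entropy(1, 0.7)   # target 1, prediction 0.7
loss_2 = binary_cross_entropy(0, 0.4)   # target 0, prediction 0.4

print(round(loss_1, 2), round(loss_2, 2))
```

With the natural logarithm this gives about 0.36 and 0.51. The second prediction is penalized more because it is further from its target.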

  • Categorical Cross Entropy Loss: In case of a multiclass classification problem, where the target variable takes one of n categories (for example, encoded as the integers 0 to n-1), the categorical cross entropy calculates a score that summarizes the average difference between the actual and predicted probability distributions over all the classes. The loss is averaged over all samples. To classify an input that has more than two target classes, we use an architecture with one output neuron for each class. First, we have to one-hot encode all labels; with three classes, for example, the label of each sample is an array of size three. The last layer uses softmax instead of sigmoid. The softmax function is typically used in multiclass classification problems and is applied to the outputs of all nodes in the output layer of a neural network. Its output is a vector of values between 0 and 1 that sum to 1.
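A minimal sketch of one-hot encoding, softmax, and the categorical cross-entropy for three classes might look as follows. The labels, the raw output values (logits), and the function names are hypothetical, and the natural logarithm is assumed.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax; subtracting the row maximum avoids overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    # Per-sample loss is -log of the probability given to the true class;
    # the result is the average over all samples.
    per_sample = -np.sum(y_onehot * np.log(probs + eps), axis=1)
    return float(np.mean(per_sample))

labels = np.array([0, 2])                 # hypothetical class indices for two samples
y_onehot = np.eye(3)[labels]              # one-hot labels: [1, 0, 0] and [0, 0, 1]

logits = np.array([[2.0, 1.0, 0.1],       # hypothetical raw outputs of the last layer
                   [0.5, 0.5, 3.0]])
probs = softmax(logits)                   # each row sums to 1
loss = categorical_cross_entropy(y_onehot, probs)

print(np.round(probs, 2), round(loss, 3))
```

Because the labels are one-hot, only the predicted probability of the true class contributes to each sample's loss.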

  • Mean Squared Error: In regression, we have only one output node with a linear (identity) activation, i.e. no non-linear activation function. As the loss function we use Mean Squared Error (MSE), the most commonly used loss function for regression problems. MSE is calculated as the average of the squared differences between the predicted and actual values of the target variable. The result is never negative, because each error is squared. MSE penalizes larger prediction errors more heavily due to the squaring operation, so outliers or instances with larger errors contribute more to the overall loss. Minimizing MSE during training encourages the model to adjust its parameters so that its predictions closely match the actual target values. Alternatives to MSE include the Mean Squared Logarithmic Error (MSLE) and the Mean Absolute Error (MAE). The choice depends on a number of factors, such as the presence of outliers and the distribution of the target variable.

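    • Mean-squared error: L = (1/N) * Σ (y_true_i - y_pred_i)², averaged over the N samples; for a single sample this reduces to L = (y_true - y_pred)².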
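The effect of squaring can be seen in a small sketch. The target and prediction values below are hypothetical and chosen so that the arithmetic is easy to follow.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 2.0]

# Errors of 0.5, 0.5 and 0: (0.25 + 0.25 + 0) / 3
small = mean_squared_error(y_true, [2.5, 5.5, 2.0])

# Same as above, but the last prediction is off by 4: (0.25 + 0.25 + 16) / 3 = 5.5
outlier = mean_squared_error(y_true, [2.5, 5.5, 6.0])

print(round(small, 3), outlier)
```

A single prediction that is off by 4 dominates the loss, which is why MSE is sensitive to outliers.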

5. Optimizer

The output generated by the network in the first forward pass is a result of the weights that were initialized to some random values. The loss function compares the actual and predicted values and computes the error. The next step is to minimize the error by changing the weights. How does the network achieve this? This is achieved by using an optimizer together with backpropagation (The Backpropagation algorithm involves two main steps: the Forward Pass and the Backward Pass).

Optimizers are algorithms used to minimize an error function (the loss function) or, more generally, to maximize some objective. They operate on the model’s learnable parameters, i.e. the weights and biases, and determine how those parameters should be changed to reduce the loss. Some optimizers also adapt the learning rate during training.

There are different types of optimizers, such as Gradient Descent algorithm, Stochastic Gradient Descent (SGD), Mini-Batch Gradient Descent, SGD with Momentum, Adaptive Gradient Descent (AdaGrad), Root Mean Square Propagation (RMS-Prop), AdaDelta, Adaptive Moment Estimation (Adam), etc.
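To make the idea concrete, here is a minimal sketch of plain gradient descent on the simplest possible model, a single linear neuron trained with MSE. The data (points on the line y = 2x + 1), the learning rate, and the number of steps are assumptions chosen for illustration. In a real network, backpropagation supplies the gradients for every layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data lying exactly on the line y = 2x + 1.
X = np.linspace(-1.0, 1.0, 50)
y = 2.0 * X + 1.0

w, b = rng.normal(), 0.0      # weight initialized at random, bias at zero
learning_rate = 0.1

def mse(w, b):
    return float(np.mean((w * X + b - y) ** 2))

initial_loss = mse(w, b)

for _ in range(500):
    error = w * X + b - y                 # forward pass: prediction minus target
    grad_w = 2.0 * np.mean(error * X)     # gradient of the MSE with respect to w
    grad_b = 2.0 * np.mean(error)         # gradient of the MSE with respect to b
    w -= learning_rate * grad_w           # step against the gradient
    b -= learning_rate * grad_b

final_loss = mse(w, b)
print(round(w, 3), round(b, 3), final_loss < initial_loss)
```

The other optimizers listed above use the same basic update but change how the step is computed, for example by using small batches of data, adding momentum, or adapting the learning rate per parameter.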

Types of Neural Networks

There are different types of neural networks.

Feed-forward Neural Network

This is the simplest form of ANN (artificial neural network); data travels only in one direction (input to output). This is the example we just looked at. When you actually use it, it’s fast. When you’re training it, it takes a while. Almost all vision and speech recognition applications use some form of this type of neural network.

Recurrent Neural Network

In this type, the hidden layer saves its output to be used for future prediction. The output becomes part of its new input. Applications include text-to-speech conversion.
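The feedback loop can be sketched with a single recurrent layer. This is a simplified illustration rather than a complete model: the sizes, the random weights, the input sequence, and the use of tanh are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size = 4, 3            # hypothetical sizes

W_x = rng.normal(scale=0.5, size=(input_size, hidden_size))    # input -> hidden
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))   # hidden -> hidden (feedback)
b = np.zeros(hidden_size)

sequence = rng.normal(size=(5, input_size))   # hypothetical sequence of 5 time steps

h = np.zeros(hidden_size)                 # hidden state before the first step
states = []
for x_t in sequence:
    # The previous hidden output h becomes part of the input at this step.
    h = np.tanh(x_t @ W_x + h @ W_h + b)
    states.append(h)
states = np.array(states)

print(states.shape)                       # (5, 3)
```

Because each state depends on the previous one, the network carries information from earlier steps of the sequence forward.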

Convolutional Neural Network

In a Convolutional Neural Network, the input is processed in small patches, as if a filter were slid across it. This allows the network to analyze an image in parts. Applications include signal and image processing, such as facial recognition.
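The filtering idea can be sketched by sliding a small kernel over an image by hand. The image, the kernel, and the function name are hypothetical. As in most deep learning libraries, the operation shown is technically a cross-correlation (the kernel is not flipped).

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]    # the part of the image under the filter
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.zeros((5, 5))
image[:, 2:] = 1.0                    # left columns dark, right columns bright

kernel = np.array([[-1.0, 1.0]])      # hypothetical filter: responds to dark-to-bright steps
feature_map = convolve2d_valid(image, kernel)

print(feature_map[0])                 # [0. 1. 0. 0.]
```

The feature map is non-zero only where the brightness changes, so this filter picks out the vertical edge. In a trained network the kernel values are learned rather than chosen by hand.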

Radial Basis Functions Neural Network

This model classifies a data point based on its distance from a center point. If you don’t have labeled training data, for example, you can group similar data points and create a center point for each group. The network looks for data points that are similar to each other and groups them. One of the applications for this is power restoration systems.

Kohonen Self-organizing Neural Network

Input vectors of arbitrary dimension are fed to a discrete map made up of neurons. The map is usually laid out in one or two dimensions, i.e. as a line or a plane. Applications include recognizing patterns in data, for example in medical analysis.

Modular Neural Network

This is composed of a collection of different neural networks working together to get the output. This is cutting-edge and is still in the research phase.

Sep 17, 2024
