Artificial Intelligence is a term used for machines that can interpret data, learn from it, and use it to perform tasks that would otherwise require humans. Deep Learning is a branch of Artificial Intelligence that focuses on training machines to learn on their own, with little supervision. Deep Learning has witnessed tremendous growth in the last decade. With applications in image classification, speech recognition, text-to-speech conversion, self-driving cars, and more, the list of problems Deep Learning has addressed is substantial. It is therefore worth understanding the basic structure and workings of Neural Networks to appreciate these advancements.
A neural network is a system, implemented in software or hardware, that is designed to operate in a manner inspired by the human brain.
A neural network is usually described as having different layers. The first layer, the input layer, picks up the input signals and passes them to the next layer. The next layer performs calculations and feature extraction; it is called the hidden layer. Often, there is more than one hidden layer. Finally, there is an output layer, which delivers the final result.
Let’s take the real-life example of how traffic cameras identify license plates and speeding vehicles on the road. Suppose the camera image is 28 by 28 pixels and is fed as input to the network to identify the license plate. Each input neuron holds a number, called its activation, which represents the grayscale value of the corresponding pixel, ranging from 0 to 1 (1 for a white pixel, 0 for a black pixel). A neuron is said to light up when its activation is close to 1. The pixels, arranged as an array, are fed into the input layer (if your image is bigger than 28 by 28 pixels, you must shrink it down, because you can’t change the size of the input layer). In our example, we’ll name the inputs X1, X2, and X3; each represents one of the incoming pixels. The input layer then passes the input to the hidden layer. The interconnections are assigned weights at random. Each weight is multiplied by the corresponding input signal, and a bias is added to the sum.
This weighted sum of the inputs, plus the bias, is fed into an activation function, which decides which nodes fire for feature extraction. As the signal flows through the hidden layers, the same computation repeats: each layer calculates a weighted sum of its inputs and feeds it to an activation function to decide which of its nodes fire.
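To make this concrete, here is a minimal sketch of that computation for a single hidden-layer node, using NumPy; the input values, random seed, and bias are invented purely for illustration.

```python
import numpy as np

# Three illustrative inputs X1, X2, X3 (e.g. pixel activations between 0 and 1)
x = np.array([0.5, 0.9, 0.1])

# Weights assigned at random, plus a bias, as described above
rng = np.random.default_rng(seed=0)
weights = rng.normal(size=3)
bias = 0.1

# Weighted sum of the inputs plus the bias...
z = np.dot(weights, x) + bias

# ...fed into an activation function (ReLU here) to decide whether the node fires
a = max(0.0, z)
print(a)
```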
Finally, the model predicts the outcome by applying a suitable activation function at the output layer. In our example with the car image, optical character recognition (OCR) is then used to convert the result into text and identify what’s written on the license plate. In the neural network diagram we show only three inputs, eight hidden-layer nodes, and one output, but in practice there are far more inputs and outputs. The error in the output, measured by a cost function, is back-propagated through the network, and the weights are adjusted to minimize it. We keep adjusting the weights until they fit all the training examples we put in.
The output is then compared with the actual result, and multiple iterations are run to maximize accuracy. With every iteration, the weight on every interconnection is adjusted based on the error.
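As a hedged sketch of this iterate-and-adjust cycle, consider a single linear neuron trained with gradient descent on a toy dataset; the data, learning rate, and iteration count below are all made up for illustration.

```python
import numpy as np

# Toy data: the network should learn y = 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = 0.0    # initial weight
lr = 0.01  # learning rate

for epoch in range(100):
    y_pred = w * x                 # forward pass
    error = y_pred - y             # prediction error
    grad = np.mean(2 * error * x)  # gradient of MSE with respect to w
    w -= lr * grad                 # adjust the weight to reduce the error

print(w)  # approaches 2.0 after enough iterations
```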
A neural network is a computational learning system that maps input variables to the output variable using an underlying mapping function that is non-linear in nature. The architecture of a neural network comprises five essential components: layers, weights and biases, activation functions, the loss function, and the optimizer.
We will learn about each of these components in detail.
Simply put, a Neural Network is a stack of interconnected layers. There are three types of layers in a Neural Network: the Input Layer takes in the input data, the Hidden Layers transform the input data, and the Output Layer generates predictions for the given inputs after applying those transformations. The layers close to the Input Layer are called the lower layers; the layers close to the Output Layer are called the upper layers.
Each layer consists of multiple neurons, also called nodes. Each node in a given layer is connected to every node in the next layer. A node takes the weighted sum of the inputs from the previous layer, applies a non-linear activation function to it, and generates an output, which then becomes an input to the nodes in the next layer.
The number of nodes in the input layer corresponds to the number of independent variables in the data. The number of hidden layers, and of nodes within them, is a hyperparameter and is usually a function of the complexity of the problem and the data available. For a regression problem, the number of nodes in the output layer is one; for a multiclass classification problem, it is equal to the number of labels or categories; for a binary classification problem, it is one.
Each connection between neurons carries a weight that determines the strength of its influence on the data’s transformation. For any arbitrary function f, there exists a neural network that can approximate it. The goal is to find the parameters 𝜃 (the weights) that result in the best decision boundary. A neuron can thus be defined as an operation with two parts, a linear component and an activation component: Neuron = Linear + Activation.
Let’s consider the illustration below. There is a dataset on the left. Typically, the dataset consists of some features denoted as X. In this case, we have two features, X1 and X2, for each sample. Additionally, there is a label Y, also referred to as the target or class, associated with each sample.
To learn the relationship between features X1 and X2 and their corresponding label, we utilize a neural network consisting of 2 input nodes (owing to the two features), one hidden layer with 3 neurons (the number of hidden layers and neurons can be adjusted as hyperparameters), and one output neuron. A weight matrix is associated with each layer. In this instance, there exists a hidden layer and an output layer, resulting in two weight matrices. These weights are initialized randomly, and throughout the training process, they are iteratively updated until the loss converges.
A weight matrix always has dimension n x m, where n is the number of neurons in the layer and m is the number of inputs that layer receives. In our example, the hidden layer’s weight matrix is therefore 3 x 2 and the output layer’s is 1 x 3.
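Under this convention, a quick sketch of the example network (2 inputs, 3 hidden neurons, 1 output) might look as follows; the random seed and input values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# 2 input features per sample
x = np.array([0.4, 0.7])

W_hidden = rng.normal(size=(3, 2))  # 3 hidden neurons x 2 inputs
b_hidden = np.zeros(3)
W_output = rng.normal(size=(1, 3))  # 1 output neuron x 3 hidden activations
b_output = np.zeros(1)

h = np.maximum(0, W_hidden @ x + b_hidden)  # hidden layer: ReLU(Wx + b)
y_hat = W_output @ h + b_output             # output layer
print(y_hat)
```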
The illustration below shows how a neural network results in a specific function.
Each node in a hidden layer computes a = f(weights * input + bias), where f is an activation function such as ReLU. The last node, a7, is a combination of all the previous functions, resulting in one single non-linear function. To understand this composition of functions, take node a4 as an example. Node a4 depends on the outputs of nodes a1 to a3, which in turn depend on the input x. Specifically, the value of node a4 is calculated as ReLU(weights * input + bias), where the bias is -1, the weights are 0.3, 0.2, and 0.1, and the inputs are the outputs of the previous three nodes a1, a2, and a3.
In this illustration, we use ReLU as the activation function, which is simply max(0, z).
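Plugging the numbers from the illustration into this formula, and assuming for the sake of the example that nodes a1, a2, and a3 output 2, 3, and 4 (these activation values are invented), node a4 would compute:

```python
def relu(z):
    return max(0, z)

# Outputs of the previous nodes a1, a2, a3 (illustrative values)
a1, a2, a3 = 2.0, 3.0, 4.0

# Weights 0.3, 0.2, 0.1 and bias -1, as in the illustration
a4 = relu(0.3 * a1 + 0.2 * a2 + 0.1 * a3 - 1)
print(a4)  # ReLU(0.6 + 0.6 + 0.4 - 1) = ReLU(0.6) = 0.6
```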
An activation function transforms the input to a node into an output value that is fed to the nodes in the next layer. In technical terms, an activation function, also known as a transfer function, defines how the weighted sum of the inputs plus the bias is transformed into the node’s output. It maps the output into a given range, for example 0 to 1 or -1 to +1, depending on the function used. Generally, one activation function is used across all layers, the exception being the output layer. There are many activation functions used in Neural Networks, and they fall into two types: linear and non-linear.
Implementation in Python:
```python
# sigmoid function
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
```
The derivative of tanh gives us almost the same shape as the sigmoid’s derivative.
Implementation in Python:
```python
# tanh activation function
import numpy as np

def tanh(z):
    return np.tanh(z)
```
Implementation in Python:
```python
# ReLU activation function
import numpy as np

def relu(z):
    return np.maximum(0, z)
```
Comparing Leaky ReLU with ReLU makes the difference between them clear: where ReLU outputs zero for all negative inputs, Leaky ReLU allows a small, non-zero slope.
Implementation in Python:
```python
# Leaky_ReLU activation function
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)
```
When we compare the sigmoid and softmax activation functions, we see that they produce different results:
The probabilities produced by a sigmoid are independent, and they are not constrained to sum to one: 0.37 + 0.77 + 0.48 + 0.91 = 2.53. This is because the sigmoid looks at each raw output value separately. The softmax’s outputs, by contrast, are interrelated: they always sum to one by design (0.04 + 0.21 + 0.05 + 0.70 = 1.00). So if we want to increase the likelihood of one class, the others have to decrease by the same total amount.
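The following sketch reproduces this behavior; the raw output values are hypothetical, chosen so the results roughly match the numbers quoted above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

logits = np.array([-0.5, 1.2, -0.1, 2.3])  # hypothetical raw outputs

print(sigmoid(logits), sigmoid(logits).sum())  # each in (0, 1); sum != 1
print(softmax(logits), softmax(logits).sum())  # always sums to exactly 1
```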
The predicted value is compared with the actual value, and the error is computed; the magnitude of the error is given by the loss function. The loss function estimates how close the distribution of the predicted values is to the distribution of the actual target variable in the training data. The Maximum Likelihood Estimation (MLE) framework is used to compute the error over the entire training data set. Under the MLE framework, the loss function for a classification problem is Cross Entropy, and for a regression problem it is Mean Squared Error.
Cross Entropy Loss: Cross Entropy measures the difference between two probability distributions of a random variable. In the context of Neural Networks, it gives the difference between the predicted probability distribution and the distribution of the target variable in the training data for a given set of weights or parameters. For a binary classification problem the loss function used is binary cross entropy, and for a multiclass classification problem it is categorical cross entropy.
Binary Cross Entropy Loss: For a binary classification problem, where the target variable has only two options, class 1 or class 0, we use binary cross entropy loss to quantify how bad a prediction is by measuring the dissimilarity between the predicted probabilities and the actual target values. Note that we have only one output node in binary classification tasks. This output node uses the sigmoid activation function, which squeezes values between 0 and 1. We use the sigmoid because an output value close to 1 can be interpreted as a high probability of the input belonging to one class, while an output value close to 0 indicates a high probability of it belonging to the other class.
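A minimal sketch of binary cross entropy under these assumptions (one sigmoid output per sample; the labels and predicted probabilities below are invented):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # clip predictions to avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 1])              # actual 0/1 labels
y_hat = np.array([0.9, 0.2, 0.7, 0.6])  # sigmoid outputs
print(binary_cross_entropy(y, y_hat))
```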
Mean Squared Error: In regression, we have only one output node and no activation function on it. As the loss function we use Mean Squared Error (MSE), the most commonly used loss function for regression problems. MSE is calculated as the average of the squared differences between the predicted and actual values of the target variable. The result is always positive, since the errors are squared, and larger prediction errors are penalized more heavily because of the squaring; outliers, or instances with larger errors, therefore contribute more to the overall loss. Minimizing MSE during training encourages the model to adjust its parameters so that predictions closely match the actual target values, resulting in a regression model that provides accurate estimations. There are variants of MSE, such as the Mean Squared Logarithmic Error (MSLE) and the Mean Absolute Error (MAE); the choice depends on factors such as the presence of outliers and the distribution of the target variable.
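A corresponding sketch of MSE; the target and predicted values are invented for illustration.

```python
import numpy as np

def mean_squared_error(y, y_hat):
    # average of the squared differences between actual and predicted values
    return np.mean((y - y_hat) ** 2)

y = np.array([3.0, -0.5, 2.0, 7.0])     # actual values
y_hat = np.array([2.5, 0.0, 2.0, 8.0])  # predictions
print(mean_squared_error(y, y_hat))     # 0.375
```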
The output generated by the network in the first forward pass is the result of weights that were initialized to random values. The loss function compares the actual and predicted values and computes the error. The next step is to minimize the error by changing the weights. How does the network achieve this? By using an optimizer together with backpropagation (the backpropagation algorithm involves two main steps: the forward pass and the backward pass).
Optimizers are algorithms or methods used to minimize an error function (the loss function) or to maximize the efficiency of training. Optimizers are mathematical functions that depend on the model’s learnable parameters, i.e. the weights and biases. They determine how to change the weights and the learning rate of the neural network in order to reduce the loss.
There are different types of optimizers, such as Gradient Descent algorithm, Stochastic Gradient Descent (SGD), Mini-Batch Gradient Descent, SGD with Momentum, Adaptive Gradient Descent (AdaGrad), Root Mean Square Propagation (RMS-Prop), AdaDelta, Adaptive Moment Estimation (Adam), etc.
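To give a flavor of how these differ, here is a hedged sketch contrasting the plain SGD update with SGD with Momentum; grad stands for the gradient of the loss on the current mini-batch, and all values are invented.

```python
import numpy as np

lr = 0.01    # learning rate
beta = 0.9   # momentum coefficient

w = np.zeros(3)         # parameters
velocity = np.zeros(3)  # running average of past gradients

def momentum_step(w, velocity, grad):
    # Plain SGD would simply be: w = w - lr * grad
    # SGD with Momentum keeps a velocity term that smooths the updates:
    velocity = beta * velocity + grad
    w = w - lr * velocity
    return w, velocity

grad = np.array([0.5, -1.0, 0.2])  # hypothetical gradient
w, velocity = momentum_step(w, velocity, grad)
print(w)
```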
There are different types of neural networks.
The feedforward neural network is the simplest form of ANN (artificial neural network): data travels in one direction only, from input to output. This is the kind of network we just looked at. Once trained it is fast to use, although training itself takes a while. Almost all vision and speech recognition applications use some form of this type of neural network.
In a recurrent neural network, the hidden layer saves its output to be used for future predictions: the output becomes part of the next input. Applications include text-to-speech conversion.
In a convolutional neural network, the input features are taken in batches, as if passed through a filter. This allows the network to remember an image in parts. Applications include signal and image processing, such as facial recognition.
A radial basis function network classifies a data point based on its distance from a center point. If you don’t have labeled training data, for example, you’ll want to group things and create center points; the network looks for data points that are similar to each other and groups them. One of the applications for this is power restoration systems.
In a Kohonen self-organizing map, vectors of random input are fed to a discrete map composed of neurons. The vectors are also called dimensions or planes. Applications include recognizing patterns in data, such as in medical analysis.
A modular neural network is composed of a collection of different neural networks working together to produce the output. This approach is cutting-edge and still in the research phase.