Neural Networks

Models

class sealion.neural_networks.models.NeuralNetwork(layers=None)

This class is very rich and packed with methods, so I figured it would be best to have this tutorial guide you through the “hello world” of machine learning.

There are two ways to initialize this model with its layers, through the __init__() or through the add() methods.

The first way:

>>> from sealion import neural_networks as nn
>>> layers = [nn.layers.Flatten(), nn.layers.Dense(784, 64, activation=nn.layers.ReLU()),
...     nn.layers.Dense(64, 32, activation=nn.layers.ReLU()),
...     nn.layers.Dense(32, 10, activation=nn.layers.Softmax())]
>>> model = nn.models.NeuralNetwork(layers)

Or you can go through the second way:

>>> from sealion import neural_networks as nn
>>> model = nn.models.NeuralNetwork()
>>> model.add(nn.layers.Flatten())
>>> model.add(nn.layers.Dense(784, 64, activation=nn.layers.ReLU()))
>>> model.add(nn.layers.Dense(64, 32, activation=nn.layers.ReLU()))
>>> model.add(nn.layers.Dense(32, 10, activation=nn.layers.Softmax()))

Either way works just fine.

Next, you may want to see how complex the model is (having too many parameters means a model can easily overfit). For that you can just do:

>>> num_params = model.num_parameters() # returns an integer
>>> assert num_params == 52650
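
That number checks out by hand - each Dense layer holds input_size * output_size weights plus output_size biases:

>>> (784 * 64 + 64) + (64 * 32 + 32) + (32 * 10 + 10) # weights + biases for each Dense layer
52650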

Looks like our model will be pretty complex. Next up is finalizing and training.

>>> model.finalize(loss=nn.loss.CrossEntropy(), optimizer=nn.optimizers.Adam())

Here we use cross-entropy loss for this classification problem and the Adam optimizer.

Onto training. Assuming our data is 60k * 28 * 28 (sounds a lot like MNIST) in the variable X_train, and y_train is a one-hot encoded matrix of size 60k * 10 (10 classes), we can do:

>>> model.train(X_train, y_train, epochs=20) # train for 20 epochs

which will work fine. Here we have a batch_size of 32 (the default), and the way this runs fast is by splitting the data into batch_size datasets and training them in parallel via multithreading.

If you want to change batch_size to 18 you could do:

>>> model.train(X_train, y_train, epochs=20, batch_size=18) # train for 20 epochs, batch_size 18

If you want the gradients to be calculated over the entire dataset (this will be much longer) you can do:

>>> model.full_batch_train(X_train, y_train, epochs=20)

Lastly, there's also something known as mini-batch gradient descent, which just does gradient descent but randomly chooses a percentage of the dataset to calculate the gradients on. This cannot be parallelized, but it still runs fast:

>>> model.mini_batch_train(X_train, y_train, N=0.1) # here we take 10% of X_train or 6000 data-points
...                                                 # randomly selected for calculating gradients at a time

All of these methods have a show_loop parameter that controls whether the tqdm progress bar is shown; it is set to True by default.
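
For example, to train without the progress bar:

>>> model.train(X_train, y_train, epochs=20, show_loop=False) # no tqdm progress bar printed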

Now that we have trained our model, time to test and use it. To evaluate it, given X_test (shape 10k * 28 * 28) and y_test (shape 10k * 10), we can feed this into our evaluate() function:

>>> model.evaluate(X_test, y_test)

This gives us the loss, which isn't always so interpretable - so we have some other options.

If you are doing a classification problem, as we are here, you may instead do:

>>> model.categorical_evaluate(X_test, y_test)

which just gives what percent of X_test was classified correctly.

For regression just do:

>>> model.regression_evaluate(X_test, y_test)

which gives the r^2 value of the predictions on X_test. If you are doing this make sure that y_test is not one-hot encoded.

To predict on data we can just do:

>>> predictions = model.predict(X_test)

and if we want this in reverted one-hot-encoded form all we need to do is this:

>>> from sealion.utils import revert_softmax
>>> predictions = revert_softmax(predictions)
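
If you would rather have plain integer class labels, an argmax over the prediction rows works too (this is plain NumPy, not a SeaLion helper):

>>> import numpy as np
>>> class_labels = np.argmax(predictions, axis=1) # index of the largest value in each row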

Storing around 53,000 parameters in matrices isn’t much, but for good practice let’s save them. Using the give_parameters() method we can get our weights and biases as a list:

>>> parameters = model.give_parameters()

Then we can pickle this into a file called “MNIST_pickled_params” using the pickle_params() method:

>>> file_name = "MNIST_pickled_params"
>>> model.pickle_params(FILE_NAME=file_name)

Now this is the beauty - let’s say a few weeks from now we come back and realize we should have trained for more epochs. BUT - we don’t want to restart training from scratch (imagine this was a really big network).

So we can just load the weights using the pickle module, build the same model architecture, insert the params, and hit train()!!

>>> import pickle
>>> with open('MNIST_pickled_params.pickle', 'rb') as f:
...     params = pickle.load(f)

Now we can create the model structure (must be the EXACT same):

>>> model = nn.models.NeuralNetwork()
>>> model.add(nn.layers.Flatten())
>>> model.add(nn.layers.Dense(784, 64, activation = nn.layers.ReLU()))
>>> model.add(nn.layers.Dense(64, 32, activation = nn.layers.ReLU()))
>>> model.add(nn.layers.Dense(32, 10, activation = nn.layers.Softmax()))

Obviously we want to finalize our model for good practice:

>>> model.finalize(loss = nn.loss.CrossEntropy(), optimizer = nn.optimizers.Adam())

and we can enter the weights & biases using the enter_parameters() method:

>>> model.enter_parameters(params) #must be params given from the give_parameters() method

and train!

>>> model.train(X_train, y_train, epochs = 100) #let's try 100 epochs this time

The reason this is such a big deal is because we don’t have to start training from scratch all over again. The model will not have to start from 0% accuracy, but may start with 80% given the params belonging to our partially-trained model we loaded from the pickle file. This is of course not a big deal for something like MNIST but will be for bigger model architectures, or datasets.

Of course there’s a lot more things we could’ve changed, but I think that’s pretty good for now!

__init__(layers=None)
finalize(loss, optimizer)

Both arguments to finalize() must be instances of the loss and optimizer classes described below.

Layers

class sealion.neural_networks.layers.Layer

Base Layer class. All layers inherit from this.

backward(grad)

This method takes in an upstream gradient (e.g. ∂L/∂Z2) and returns the gradient with respect to its inputs (e.g. ∂L/∂A1).

forward(inputs)

This method saves the inputs and returns the layer’s outputs.
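
As a rough sketch of that contract, a hypothetical, illustration-only layer that doubles its inputs could look something like this (whether a custom layer plugs straight into NeuralNetwork this way is an assumption):

>>> from sealion import neural_networks as nn
>>> class Double(nn.layers.Layer):
...     def forward(self, inputs):
...         self.inputs = inputs  # save the inputs
...         return inputs * 2     # return the outputs
...     def backward(self, grad):
...         return grad * 2       # chain rule: dL/d(input) = dL/d(output) * 2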

class sealion.neural_networks.layers.Flatten

This would be better explained as an image-data flattener. Let’s say you’re dealing with MNIST (check out the examples on GitHub) - you will have data that is 60000 by 28 by 28. Neural networks can’t work with data like that - it has to be a matrix. What you can do is take that 60000 * 28 * 28 data and apply this layer to make it 60000 * 784 - 784 because 28 * 28 = 784, so we just “squished” each image into one row. If you have color data, e.g. 60000 * 28 * 28 * 3, applying this layer turns it into a 60000 * 2352 matrix (2352 = 28 * 28 * 3).
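
Conceptually, it is just a reshape that keeps the first (sample) axis and collapses the rest:

>>> import numpy as np
>>> X = np.zeros((60000, 28, 28))  # same shape as the MNIST example
>>> X.reshape(len(X), -1).shape    # what Flatten effectively produces
(60000, 784)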

An example for how this would work with something like MNIST (the grayscale 60k * 28 * 28 dataset) is shown below.

>>> from sealion import neural_networks as nn
>>> from sealion.neural_networks.models import NeuralNetwork
>>> model = NeuralNetwork()
>>> model.add(nn.layers.Flatten()) # always, always add this as the first layer for image data
>>> model.add(nn.layers.Dense(784, whatever_output_size, more_args)) # input_size is 784 because 28*28 = 784
>>> # Do more cool stuff...

class sealion.neural_networks.layers.Dropout(dropout_rate)

Dropout is one of the most well-known regularization techniques there is. Imagine you were working on a coding project with about 200 people. If we just relied on one person to know how to compile, another person to know how to debug, then what happens when those special people leave for a day, or worse leave forever?

Now I know that seems to have no connection to dropout, but here’s how it does: dropout prevents this sort of scenario. “Dropping out” (setting to 0) some of the outputs of some of the neurons means the model has to learn that it can’t just depend on one neuron for the most important features. Each neuron has to learn some features of its own rather than the network relying on a single node. The model becomes more robust and generalizes better with Dropout, as every neuron now has a better set of weights. Dropout is applied during training and then undone (“reverse-applied”) at test time. Dropout will make the training accuracy go down a bit, but remember, in the end it’s performance on real-world data that matters.

There’s a dropout_rate parameter (from 0 to 1 here) for the probability that each neuron in the current layer is set to 0. This is essentially the chance of dropping out any given neuron, or roughly what percentage of neurons will be dropped out. Typical values range from 0.1 to 0.5. Example below.
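
For intuition, the usual training-time trick looks roughly like this in NumPy (a sketch of the general technique, not SeaLion's exact code):

>>> import numpy as np
>>> dropout_rate = 0.2
>>> outputs = np.random.randn(4, 8)                       # pretend these are one layer's outputs
>>> mask = np.random.rand(*outputs.shape) > dropout_rate  # each entry survives with probability 0.8
>>> dropped = outputs * mask                              # dropped-out neurons output 0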

Let’s say you’ve gotten your models up so far:

>>> model.add(nn.layers.Flatten())
>>> model.add(nn.layers.Dense(128, 64, activation = nn.layers.ReLU()))

And now you want to add dropout. Well just add that layer like such:

>>> dropout_probability = 0.2
>>> model.add(nn.layers.Dropout(dropout_probability))

This just means that about 20% of the outputs coming from the layer before it will be dropped out. A higher dropout rate may not always lead to better generalization, but it usually will decrease training accuracy.

With dropout, remember that two things matter, not just one: the probability, and the layers it’s applied at. Experimentation is key.

class sealion.neural_networks.layers.Dense(input_size: int, output_size: int, activation=None, weight_init='xavier')

This is the class that you will depend on the most. This class is for creating the fully-connected layers that make up Deep Neural Networks - where each neuron in the previous layer is connected to each neuron in the next. Feel free to watch a couple of YouTube tutorials from 3Blue1Brown (if you like calculus :) or others to get a better understanding of these parameters, or just look at the examples on my GitHub.

The main method of course is the init. You will have to define the input_size and the output_size; the activation and weight initialization are optional parameters.

In a neural network, the number of nodes going into a layer is known as the input_size here (fan_in in papers), and the number of nodes in the output of a layer is the output_size here (fan_out in papers). The activation parameter is for the activation/non-linearity function you want your layer’s outputs to go through. If you don’t understand what I just meant, sorry - but another way to think about it is that it’s a way to change the outputs of your network a bit so it fits your dataset a little better (looking at a graph will help). The default is no activation (some people call that Linear), so you’ll need to add that yourself. I’ll get to weight init in a bit. Examples below.

To add a layer, here’s how it’s done:

>>> from sealion import neural_networks as nn
>>> model = nn.models.NeuralNetwork()
>>> model.add(nn.layers.Dense(128, 64)) # input_size = 128, output_size = 64

This sets up a neural network with 128 incoming nodes (that means we have 128 numeric features), and then 64 output nodes.

Let’s say we wanted an activation function, like a ReLU or sigmoid (there are a bunch to choose from in this API). You could add that layer like so:

>>> model = nn.models.NeuralNetwork() # clear all existing layers
>>> model.add(nn.layers.Dense(128, 64, activation=nn.layers.Sigmoid())) # all outputs go through the sigmoid function

Onto weight initialization! The jargon starts …. now. What weight init does is make the weights come from a special distribution (typically Gaussian) with a given standard deviation based on the input and output size. The reason this is done is because you don’t want the weights to be initialized too big, or else the gradients in backpropagation may cause the model to go way off and become NaNs or infinity (this is known as exploding gradients). For small neural networks that solve datasets like XOR or even breast cancer this isn’t a problem, but for deep neural networks on problems like MNIST it is a huge concern. The weight_init you will want to use also depends on what activation you are using. Most activation functions will do well with Xavier Glorot, so that is set as the default. You can choose to use He for ReLU, LeakyReLU, ELU, or other ReLU variants. For SELU, you may choose to use LeCun weight initialization. A rough sketch of the standard deviations each scheme uses follows the list of choices below.

The possible choices are:

>>> "xavier" # no activation, logistic/sigmoid, softmax, and tanh
>>> "he"     # relu + similar variants
>>> "lecun"  # selu
>>> "none"   # if you want to do this

To set this you can just do:

>>> model = nn.models.NeuralNetwork() # clear all existing layers
>>> model.add(nn.layers.Dense(128, 64, activation=nn.layers.ReLU(), weight_init="he"))

Sorry for so much documentation, but this really is the class you will call the most.

class sealion.neural_networks.layers.BatchNormalization(input_size: int, momentum=0.9, epsilon=0.001, lr=0.01)

Batch Normalization is a frequently used technique in today’s models (especially in CNNs).

Oftentimes you are told to normalize your data (make it bell-curve shaped) before you feed it to your model - that’s because normalization makes it easier for the model to learn, since the loss curve is smoother (so getting to the minimum is easier and faster). So why don’t we apply normalization to the inputs of all of the other layers too (not just the input layer)? That’s what batch normalization does.

I got this explanation from CodeEmporium

In order for a given hidden layer to normalize its inputs, it needs to know the mean and variance (just the standard deviation squared) of the inputs it usually gets. Note that these do change (because the parameters of prior layers change), so the mean and variance are not necessarily the same throughout training. To handle this, a moving average is used to approximate the mean and variance of the inputs fed to the B.N. layer we would like to normalize.

The B.N. layer first uses this mean and variance to normalize its inputs to a standard normal distribution, where the mean is 0 and the standard deviation is 1, but it is not guaranteed that this distribution is optimal as input for the succeeding layer. So the B.N. layer also learns a shift and a scale, called beta and gamma respectively, to adjust the normalized distribution (the one it got by normalizing its inputs) before feeding it to the next layer. A rough sketch of the whole procedure follows the parameter list below.

Onto the parameters:

input_size: number of features the B.N. layer needs to normalize (just output_size of the Dense layer above)

momentum: how slowly to change the mean and variance that the B.N. layer uses for normalization. A higher momentum means that the mean and variance the B.N. layer uses change very slowly, whereas a lower momentum means they change very quickly in response to changes in the B.N. layer’s inputs (and their mean and variance). Default 0.9.

epsilon: tiny value needed in normalization if the variance ever becomes 0 to protect against 0 division errors. Default 0.001.

lr: learning rate for the B.N. layer’s learning of beta and gamma (the shift and scale above). Default 0.01.
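
Putting those pieces together, a training-time B.N. step looks roughly like this in NumPy (the general recipe, not SeaLion's exact code):

>>> import numpy as np
>>> momentum, epsilon, input_size = 0.9, 0.001, 5
>>> gamma, beta = np.ones(input_size), np.zeros(input_size)               # learned scale and shift
>>> running_mean, running_var = np.zeros(input_size), np.ones(input_size)
>>> x = np.random.randn(32, input_size)                                   # a batch of 32 inputs
>>> batch_mean, batch_var = x.mean(axis=0), x.var(axis=0)
>>> running_mean = momentum * running_mean + (1 - momentum) * batch_mean  # moving average
>>> running_var = momentum * running_var + (1 - momentum) * batch_var
>>> x_hat = (x - batch_mean) / np.sqrt(batch_var + epsilon)               # mean 0, std 1
>>> out = gamma * x_hat + beta                                            # adjust before the next layer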

To add Batch Normalization to your model simply do:

>>> import sealion as sl 
>>> model = sl.neural_networks.models.NeuralNetwork()
>>> model.add(...) # add whatever Dense Layers 
>>> model.add(sl.neural_networks.layers.BatchNormalization(input_size = 5, momentum = 0.9, lr = 0.01))

Note that you may want to experiment with whether to place a Batch Normalization layer before or after an activation, etc. I don’t know too much about this, but handsonml mentions it, so I might as well too.

Activations

class sealion.neural_networks.layers.Tanh

Uses the tanh activation, which squishes values from -1 to 1.

class sealion.neural_networks.layers.Sigmoid

Uses the sigmoid activation, which squishes values from 0 to 1.

class sealion.neural_networks.layers.Swish

Uses the swish activation, which is sort of like ReLU and sigmoid combined. It’s really just f(x) = x * sigmoid(x). Not as widely used as other activation functions, but give it a try!

class sealion.neural_networks.layers.ReLU

The most well-known activation function and almost the pseudo-default. All it does is turn negative values to 0 and keep the rest: basically f(x) = max(x, 0).

class sealion.neural_networks.layers.LeakyReLU(leak=0.01)

Variant of the ReLU activation that allows negative values to become something small like 0.01 * x instead of 0. With plain ReLU a neuron can become “dead” (since 0 * anything is 0); the leak prevents that.

The leak is the slope applied to negative values. It is usually set from 0.001 to 0.2, but the default of 0.01 usually works quite well.
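
Side by side, ReLU and LeakyReLU look like this in NumPy:

>>> import numpy as np
>>> x = np.array([-2.0, -0.5, 0.0, 3.0])
>>> relu = np.maximum(x, 0)               # negatives become exactly 0
>>> leaky = np.where(x > 0, x, 0.01 * x)  # negatives keep a small slope (leak=0.01)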

class sealion.neural_networks.layers.ELU(alpha=1)

Solves a similar dying-neuron problem. The default of 1 for alpha works quite well in practice, so you won’t need to change it much.

class sealion.neural_networks.layers.SELU

A special type of activation function that will “self-normalize” its outputs (give them a mean of 0 and a standard deviation of 1). This self-normalization typically leads to faster convergence.

If you are using this activation function, make sure weight_init is set to “lecun” in whichever layer it is applied. It also needs its inputs (at the beginning and all throughout) to be standardized (mu = 0, sigma = 1) for it to work, so make sure to take care of that. You can do that by standardizing your inputs and then always using the SELU activation function (remember the “lecun” part!).

class sealion.neural_networks.layers.PReLU(lr=0.0001, momentum=0.9)

The PReLU activation function is essentially the same thing as the LeakyReLU activation, except that the “leak” parameter is learnt during training. To learn this parameter, gradient descent is used - so you have a learning rate and momentum parameter. Both are on a scale of 0 to 1. We initialize this “leak” parameter to 0.25 at the first iteration.

class sealion.neural_networks.layers.Softmax

Softmax activation function, used for multi-class (2+) classification problems. Make sure to use cross-entropy with softmax, and it is only meant for the last layer!
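
The formula itself is just exponentiate-and-normalize (real implementations usually subtract the max first for numerical stability):

>>> import numpy as np
>>> z = np.array([2.0, 1.0, 0.1])          # one sample's raw outputs
>>> probs = np.exp(z) / np.sum(np.exp(z))  # positive values that sum to 1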

Losses

class sealion.neural_networks.loss.Loss

Base loss class.

class sealion.neural_networks.loss.MSE

MSE stands for mean squared error, and it’s the loss you’ll want to use for regression. To set it in the model.finalize() method, just do:

>>> from sealion import neural_networks as nn
>>> model = nn.models.NeuralNetwork(layers_list)
>>> model.finalize(loss=nn.loss.MSE(), optimizer=...)

and you’re all set!

class sealion.neural_networks.loss.CrossEntropy

This loss function is for classification problems. I know there’s a binary log loss and then a multi-category cross entropy loss function for classification, but they’re essentially the same thing so I thought using one class would make it easier. Remember to use one-hot encoded data for this to work (check out utils).

If you are using this loss function, make sure your last layer is Softmax and vice versa. Otherwise, annoying error messages will occur.

To set this in the model.finalize() method do:

>>> from sealion import neural_networks as nn
>>> model = nn.models.NeuralNetwork()
>>> # ... add the layers ...
>>> model.add(nn.layers.Softmax()) # last layer has to be softmax
>>> model.finalize(loss=nn.loss.CrossEntropy(), optimizer=...)

and that’s all there is to it.

Optimizers

class sealion.neural_networks.optimizers.Optimizer

Base optimizer class. All optimizers extend from this.

class sealion.neural_networks.optimizers.GD(lr=0.001, clip_threshold=inf)

The simplest optimizer - you will quickly outgrow it. All you need to understand here is that the learning rate is just how fast you want the model to learn (default 0.001), typically set from 1e-6 to 0.1. A learning rate that is too high may mean the model struggles to learn, whereas a lower learning rate may mean the model learns but takes more time. It is probably the most important hyperparameter in all of today’s machine learning.

The clip_threshold is simply a cap: if a gradient is higher than this value, it gets set to this value. This prevents gradients from becoming too big, which makes training harder. The default for all these optimizers is infinity, which just means no clipping - but feel free to change that. You’ll have to experiment quite a bit to find a good value. This method is known as gradient clipping.
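
In other words, something along these lines (the general idea, not necessarily SeaLion's exact code):

>>> import numpy as np
>>> clip_threshold = 5
>>> grad = np.array([-12.0, 0.3, 7.5])
>>> clipped = np.clip(grad, -clip_threshold, clip_threshold)  # large entries get capped at +/- 5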

In the model.finalize() method:

>>> model.finalize(loss=..., optimizer=nn.optimizers.GD(lr=0.5, clip_threshold=5)) # here the learning
>>> # rate is 0.5 and the threshold for clipping is 5.

class sealion.neural_networks.optimizers.Momentum(lr=0.001, momentum=0.9, nesterov=False, clip_threshold=inf)

If you are unfamiliar with gradient descent, please read the docs on that in GD class for this to hopefully make more sense.

Momentum optimization is the exact same thing except with a few changes. All it does is accumulate the past gradients and go in that direction. This means that as it makes updates it gains momentum, and the gradient updates become bigger and bigger (hopefully in the right direction). Of course, this would be uncontrolled on its own, so a momentum parameter (default 0.9) exists so the sum of previous gradients doesn’t become too big. There is also a nesterov parameter (default False, but set it to True!) which looks at how the loss landscape will be a little ahead, and makes its decisions based off of that.
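
The textbook update, for intuition (not necessarily SeaLion's exact implementation):

>>> import numpy as np
>>> lr, beta = 0.001, 0.9
>>> weights = np.random.randn(3)
>>> velocity = np.zeros_like(weights)
>>> grad = np.random.randn(3)          # gradient from backpropagation
>>> velocity = beta * velocity + grad  # accumulate past gradients
>>> weights = weights - lr * velocity  # step in the accumulated direction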

An example:

>>> momentum = nn.optimizers.Momentum(lr=0.02, momentum=0.3, nesterov=True) # learning rate is 0.02, momentum at 0.3, and we have nesterov!
>>> model.finalize(loss=..., optimizer=momentum)

There’s also a clip_threshold argument which implements gradient clipping; an explanation can be found in the GD() class’s documentation.

Usually, though, this works really well with SGD…

class sealion.neural_networks.optimizers.SGD(lr=0.001, momentum=0.0, nesterov=False, clip_threshold=inf)

SGD stands for stochastic gradient descent, which means it calculates its gradients on randomly (stochastically) picked samples and their predictions. The reason it does this is that calculating the gradients on the whole dataset can take a really long time. However, ML is a tradeoff, and the one here is that calculating gradients on just a few samples means that if those samples are all outliers it can respond poorly - so SGD will train faster but not reach as high an accuracy as gradient descent on its own.

Fortunately, though, there are workarounds. Using momentum and nesterov with SGD means you get faster training, and the convergence is great, as now the model can go in the right direction and generalize instead of overreacting to hyperspecific training outliers. By default nesterov is set to False and there is no momentum (set to 0.0), so change those as you please.

To use this optimizer, just do:

>>> model.finalize(loss=..., optimizer=nn.optimizers.SGD(lr=0.2, momentum=0.5, nesterov=True, clip_threshold=50))

Here we implemented SGD optimization with a learning rate of 0.2, a momentum of 0.5 with nesterov’s accelerated gradient, and also gradient clipping at 50.

class sealion.neural_networks.optimizers.AdaGrad(lr=0.001, nesterov=False, clip_threshold=inf, e=1e-10)

A slightly more advanced optimizer; an understanding of momentum will be invaluable here. AdaGrad and a whole plethora of optimizers use adaptive gradients, or an adaptive learning rate. This just means that it will assess the landscape of the cost function: if it is steep it will slow down, and if it is flatter it will accelerate. This is a huge deal for preventing gradient descent from just sliding down a steep slope into a local minimum and getting stuck, or getting stuck on a saddle point.

The only new parameter is e, or this incredibly small value that is meant to prevent division by zero. It’s set to 1e-10 by default, and you probably won’t ever need to think about it.

As an example:

>>> model.finalize(loss=..., optimizer=nn.optimizers.AdaGrad(lr=0.5, nesterov=True, clip_threshold=5))

AdaGrad is not used much in practice, as oftentimes it stops before reaching the global minimum due to the gradients becoming too small to make a difference, but we have it anyway for your enjoyment. Better optimizers await!

class sealion.neural_networks.optimizers.RMSProp(lr=0.001, beta=0.9, nesterov=False, clip_threshold=inf, e=1e-10)

RMSprop is a widely known and used algorithm for deep neural networks. All it does is solve AdaGrad’s problem of stopping too early by not scaling down the gradients as much. It does this through a beta parameter, which is set to 0.9 by default (and does quite well in practice). A higher beta means that past gradients are more important, and a lower one means current gradients are valued more.
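
Roughly, the update looks like this (standard RMSprop, not necessarily SeaLion's exact code):

>>> import numpy as np
>>> lr, beta, e = 0.001, 0.9, 1e-10
>>> weights, s = np.random.randn(3), np.zeros(3)    # s = running average of squared gradients
>>> grad = np.random.randn(3)
>>> s = beta * s + (1 - beta) * grad ** 2           # higher beta -> past gradients matter more
>>> weights = weights - lr * grad / np.sqrt(s + e)  # steep directions get scaled down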

An example:

>>> model.finalize(loss=..., optimizer=nn.optimizers.RMSProp(nesterov=True, beta=0.9))

Of course there is the nesterov, clipping threshold, and e parameter all for you to tune.

class sealion.neural_networks.optimizers.Adam(lr=0.001, beta1=0.9, beta2=0.999, nesterov=False, clip_threshold=inf, e=1e-10)

The most popularly used optimizer, typically considered the default just like ReLU for activation functions. It combines the ideas of RMSprop and momentum, meaning that it will adapt to different landscapes but also move faster. The beta1 parameter (default 0.9) controls the momentum part, and the beta2 parameter (default 0.999) controls the adaptive-learning-rate part. Once again, higher betas mean the past gradients are more important.
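
As a sketch of the idea (standard Adam with bias correction omitted, not necessarily SeaLion's exact code):

>>> import numpy as np
>>> lr, beta1, beta2, e = 0.001, 0.9, 0.999, 1e-10
>>> weights, m, v = np.random.randn(3), np.zeros(3), np.zeros(3)
>>> grad = np.random.randn(3)
>>> m = beta1 * m + (1 - beta1) * grad             # momentum-style running mean of gradients
>>> v = beta2 * v + (1 - beta2) * grad ** 2        # RMSprop-style running mean of squared gradients
>>> weights = weights - lr * m / (np.sqrt(v) + e)  # fast direction, adapted per parameter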

Often times you won’t know what works best - so hyperparameter tune.

As an example:

>>> model.finalize(loss=..., optimizer=nn.optimizers.Adam(lr=0.1, beta1=0.5, beta2=0.5))

Adaptive gradients may not always work as well as SGD or Nesterov + Momentum optimization. For MNIST, I have tried both and there’s barely a difference. If you are using Adam optimization and it isn’t working, maybe try nesterov with momentum instead.

class sealion.neural_networks.optimizers.Nadam(lr=0.001, beta1=0.9, beta2=0.999, clip_threshold=inf, e=1e-10)

Nadam optimization is the same thing as Adam, except with nesterov updating built in. Basically this class is the same as the Adam class, except there is no nesterov parameter (it is effectively always on).

As an example:

>>> model.finalize(loss=..., optimizer=nn.optimizers.Nadam(lr=0.1, beta1=0.5, beta2=0.5))

class sealion.neural_networks.optimizers.AdaBelief(lr=0.001, beta1=0.9, beta2=0.999, nesterov=False, clip_threshold=inf, e=1e-10)

AdaBelief is a recently popular optimizer that I thought to implement after checking the original paper. The way it works is by establishing a general “belief” (prediction) about where the gradient will go. If the belief and the actual gradient are very similar, there will be a large change in the weights (larger steps), whereas if they are very different, there will be a small change in the weights. This has worked well in practice, and its results are comparable to Adam.
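
The key difference from Adam is a single line: the second moment tracks how far each gradient strays from the running "belief" m, rather than the raw squared gradient (a sketch following the paper, not necessarily SeaLion's exact code):

>>> import numpy as np
>>> lr, beta1, beta2, e = 0.001, 0.9, 0.999, 1e-10
>>> weights, m, s = np.random.randn(3), np.zeros(3), np.zeros(3)
>>> grad = np.random.randn(3)
>>> m = beta1 * m + (1 - beta1) * grad             # the "belief" about where the gradient will go
>>> s = beta2 * s + (1 - beta2) * (grad - m) ** 2  # small when the gradient matches the belief
>>> weights = weights - lr * m / (np.sqrt(s) + e)  # closer match -> bigger effective step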

For those interested in the original paper, go here: https://arxiv.org/pdf/2010.07468.pdf