Ensemble Learning

class sealion.ensemble_learning.RandomForest(num_classifiers=20, max_branches=inf, min_samples=1, replacement=True, min_data=None)

Random Forests may seem intimidating, but they're quite simple: just a bunch of Decision Trees, each trained on a different subset of the data. You give us the data, and we create those subsets; whether we sample with replacement or without is up to you. Keep in mind that because this is a bunch of Decision Trees, only classification is supported (avoid using decision trees for regression - their range of predictions is limited). To predict, the random forest has each of its decision trees predict on the data and chooses the most common prediction (not the average).

Enjoy this module - it’s one of our best.
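To make the workflow concrete, here's a minimal usage sketch (the x_train/y_train/x_test/y_test arrays are placeholders you supply, and the parameter values are just illustrative):

>>> from sealion.ensemble_learning import RandomForest
>>> rf = RandomForest(num_classifiers=20, replacement=True)  # 20 trees, sampling with replacement
>>> rf.fit(x_train, y_train)     # x_train is 2D, y_train is 1D
>>> y_pred = rf.predict(x_test)  # each tree votes; the most common prediction wins
>>> rf.evaluate(x_test, y_test)  # accuracy score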

__init__(num_classifiers=20, max_branches=inf, min_samples=1, replacement=True, min_data=None)
Parameters
  • num_classifiers – Number of decision trees you want created.

  • max_branches – Maximum number of branches each Decision Tree can have.

  • min_samples – Minimum number of samples for a branch in any decision tree (in the forest) to split.

  • replacement – Whether to sample with replacement, i.e. whether the same data point may show up in more than one of the chunks/sets of data.

  • min_data – Minimum number of data points in any given chunk. Each classifier is trained on one chunk of the data, so if you want to make sure each chunk has at least 3 points, for example, you can set min_data=3 (see the sketch after this list). The default is 50% of the amount of data; None is just a placeholder for that default.
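For example, a forest whose chunks each get at least 3 data points, sampled without overlap, might be built like this (a sketch; the other parameter values are illustrative, not recommendations):

>>> rf = RandomForest(num_classifiers=10, max_branches=50, min_samples=2,
...     replacement=False, min_data=3)  # every chunk gets >= 3 points, no shared points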

evaluate(x_test, y_test)
Parameters
  • x_test – testing data (2D)

  • y_test – testing labels (1D)

Returns

accuracy score

fit(x_train, y_train)
Parameters
  • x_train – 2D training data

  • y_train – 1D training labels

Returns

None

give_best_tree(x_test, y_test)

You give it the data and the labels, and it will find the tree in the forest that performs best and return it. You can then load that tree into the DecisionTree class using its give_tree() method; see the sketch below.

Parameters
  • x_test – testing data (2D)

  • y_test – testing labels (1D)

Returns

tree that performs the best (dictionary data type)
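A sketch of that handoff, with rf being the fitted forest from the earlier example (this assumes DecisionTree is importable from sealion's decision trees module, so adjust the import to match your install):

>>> best_tree = rf.give_best_tree(x_test, y_test)  # dictionary representing the best tree
>>> from sealion.decision_trees import DecisionTree
>>> dt = DecisionTree()
>>> dt.give_tree(best_tree)      # load the forest's best tree into a standalone DecisionTree
>>> y_pred = dt.predict(x_test)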

predict(x_test)
Parameters

x_test – testing data (2D)

Returns

Predictions in 1D vector/list.

visualize_evaluation(y_pred, y_test)
Parameters
  • y_pred – predictions from the predict() method

  • y_test – labels for the data

Returns

a matplotlib plot of the predictions and the labels (“correct answers”) so you can see how well the model did.

class sealion.ensemble_learning.EnsembleClassifier(predictors, classification=True)

Aside from random forests, voting/ensemble classifiers are another popular form of ensemble learning. They work by training multiple different classifiers (you choose which!) and predicting the most common class (or the average for regression - more on that later). Pretty simple, actually, and it works quite effectively. This module can also tell you the best classifier in a group via its get_best_predictor() method, which could be useful. Similar to give_best_tree() in the random forest module, it gives you the class of the algorithm that did best on the data you gave it. This can also be used for rapid hyperparameter tuning of the exact same module (pass the same class several times, each with different parameters in the __init__).

Example:

>>> from sealion.regression import SoftmaxRegression
>>> from sealion.naive_bayes import GaussianNaiveBayes
>>> from sealion.nearest_neighbors import KNearestNeighbors
>>> ec = EnsembleClassifier({'algo1': SoftmaxRegression(num_classes=3), 'algo2': GaussianNaiveBayes(), 'algo3': KNearestNeighbors()},
...     classification=True)
>>> ec.fit(X_train, y_train)
>>> y_pred = ec.predict(X_test) # predict
>>> ec.evaluate_all_predictors(X_test, y_test)
algo1 : 95%
algo2 : 90%
algo3 : 75%
>>> best_predictor = ec.get_best_predictor(X_test, y_test) # get the best predictor
>>> print(best_predictor) # is it Softmax Regression, Gaussian Naive Bayes, or KNearestNeighbors that did the best?
<regression.SoftmaxRegression object at 0xsomethingsomething>
>>> y_pred = best_predictor.predict(X_test) # looks like softmax regression, let's use it

Here we first import all the algorithms we are going to use from their respective modules. Then we create an ensemble classifier by passing in a dictionary where each key stores a name and each value stores the algorithm. classification=True is the default, so we didn't need to pass it (if you want regression, set it to False; a good way to remember that classification=True is the default is that this is an EnsembleCLASSIFIER).
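Here's roughly what the two modes do with the individual predictors' outputs for a single sample (a hand-rolled sketch of the idea, not the library's internals):

>>> from collections import Counter
>>> votes = [0, 1, 1]                    # one class prediction per predictor
>>> Counter(votes).most_common(1)[0][0]  # classification=True: most common class wins
1
>>> outputs = [2.0, 3.0, 4.0]            # one regression output per predictor
>>> sum(outputs) / len(outputs)          # classification=False: average the outputs
3.0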

We then fit it and got its predictions. We saw how well each predictor did (that's where the names come in) through the evaluate_all_predictors() method. We could then grab the best predictor and use that class. Note that this class works ONLY with algorithms other than neural networks, which should be plenty. This is because neural networks have a different evaluate() method and typically vary more in performance than other algorithms.

I hope that example cleared things up. The fit() method trains the predictors in parallel (thanks, joblib!), so it's pretty fast. As usual, enjoy this algorithm!
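The rapid hyperparameter tuning mentioned earlier is just the same pattern with one algorithm registered under several names (a sketch; the k parameter name here is hypothetical, so check KNearestNeighbors's actual signature):

>>> ec = EnsembleClassifier({'knn_3': KNearestNeighbors(k=3),
...     'knn_9': KNearestNeighbors(k=9)}, classification=True)
>>> ec.fit(X_train, y_train)
>>> best_knn = ec.get_best_predictor(X_test, y_test)  # whichever k scored higher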

__init__(predictors, classification=True)
Parameters
  • predictors – dict of {name (string): algorithm (class)}. See example above.

  • classification – whether this is a classification or regression task. Defaults to classification; for regression, set this to False.

evaluate(x_test, y_test)
Parameters
  • x_test – testing data (2D)

  • y_test – testing labels (1D)

Returns

accuracy score

evaluate_all_predictors(x_test, y_test)
Parameters
  • x_test – testing data (2D)

  • y_test – testing labels (1D)

Returns

None; just prints out the name of each algorithm in the predictors dict fed to __init__, along with its score on the given data.

fit(x_train, y_train)
Parameters
  • x_train – 2D training data

  • y_train – 1D training labels

Returns

None

get_best_predictor(x_test, y_test)
Parameters
  • x_test – testing data (2D)

  • y_test – testing labels (1D)

Returns

the class of the algorithm that did best on the given data. Look at the example above if this doesn't make sense.

predict(x_test)
Parameters

x_test – testing data (2D)

Returns

Predictions in 1D vector/list.

visualize_evaluation(y_pred, y_test)
Parameters
  • y_pred – predictions from the predict() method

  • y_test – labels for the data

Returns

a matplotlib plot of the predictions and the labels (“correct answers”) so you can see how well the model did.