Random Forest Developer Guide: 5 ways to implement in Python

Random Forest, an ensemble of decision trees, is one of the most popular methods in machine learning and is widely used for classification and regression. In this article, we first build a Random Forest from scratch and then look at Python packages that you can use to implement one.

Let's start with how you can implement a random forest in Python without using a dedicated Random Forest package.

Implementation of a random forest in Python

A Random Forest consists of a collection of decision trees. The algorithm randomly samples the training data and the features and builds a decision tree from each sample. The final prediction aggregates the predictions of the individual trees: by averaging for regression, or by majority vote for classification.
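The two ways of aggregating tree outputs can be sketched in a few lines. This is a minimal illustration with made-up per-tree outputs; the helper names majority_vote and average are hypothetical and not part of any library.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_predictions):
    """Classification: return the label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average(tree_predictions):
    """Regression: return the mean of the per-tree predictions."""
    return mean(tree_predictions)


# Made-up outputs of five trees for a single sample
class_votes = [1, 0, 1, 1, 0]
regression_outputs = [2.0, 3.0, 4.0, 3.0, 3.0]

print(majority_vote(class_votes))   # 1 (three of five trees vote for class 1)
print(average(regression_outputs))  # 3.0
```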

This code sample implements the Random Forest algorithm in Python. The goal of the algorithm is to generate a set of decision trees and increase the prediction accuracy by aggregating their predictions.

Pure Python implementation of a random forest

The code starts by importing the required libraries, namely numpy and collections. In addition, the Scikit-Learn dataset module is imported in order to use the function make_classification() which generates an artificial data set on which the algorithm can be trained and tested.

Next, a class Node is defined to represent the nodes of a decision tree. Each node has the attributes feature_idx, threshold, left, right, and value.

The class RandomForest is the main class that implements the algorithm. It takes the parameters n_trees, max_depth and min_samples_split. The fit()-function takes the training data as input and builds each decision tree on a bootstrap sample, that is, on rows drawn at random with replacement.
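Bootstrap sampling can be illustrated on its own. The sketch below uses only the standard library; bootstrap_indices is a hypothetical helper name, and the seed and sample size are arbitrary choices for illustration.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
n_samples = 1000
idxs = bootstrap_indices(n_samples, rng)

# Rows can repeat, so only a fraction of the original rows appears in a sample
unique_fraction = len(set(idxs)) / n_samples
print(len(idxs))                  # 1000
print(round(unique_fraction, 2))  # typically close to 0.63
```

Because every tree sees a different subset of rows, the trees make different errors, which is what the aggregation step relies on.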

The function build_tree() creates a decision tree by recursively splitting the data. When the maximum depth is reached, only one class remains, or fewer than min_samples_split samples are left, a leaf node is created with the most frequently occurring label. Otherwise, a random subset of the features is drawn, the best feature and threshold for splitting are calculated, and the data is split accordingly. This process is then applied recursively to the two subtrees until the termination conditions are met.

The predict()-function generates predictions for new data by aggregating the predictions of all decision trees. The predict_tree()-function makes a decision based on the data and the current decision tree.
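How a single tree reaches a decision can be shown with a hand-built tree. This is a minimal sketch: the namedtuple Node below is a stand-in with the same attribute names as the Node class described above, and the tree and samples are made up.

```python
from collections import namedtuple

# Stand-in node: leaves set `value`, inner nodes set the split fields
Node = namedtuple(
    "Node",
    "feature_idx threshold left right value",
    defaults=(None, None, None, None, None),
)


def predict_tree(x, node):
    """Follow the splits for one sample x until a leaf is reached."""
    while node.value is None:
        if x[node.feature_idx] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value


# Hand-built tree: split on feature 0, then on feature 1 in the right branch
tree = Node(
    feature_idx=0,
    threshold=0.5,
    left=Node(value=0),
    right=Node(feature_idx=1, threshold=2.0, left=Node(value=0), right=Node(value=1)),
)

samples = [[0.2, 5.0], [1.0, 1.0], [1.0, 3.0]]
print([predict_tree(x, tree) for x in samples])  # [0, 0, 1]
```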

The function best_split() selects the best feature and threshold for splitting a decision node by calculating the information gain. The information_gain()-function calculates the information gain for the current threshold value, while the function entropy() calculates the entropy of a node. The split()-function splits the data into two subsets, one for each branch of the decision tree. The function most_common_label() returns the most frequently occurring label.
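A small worked example makes entropy and information gain concrete. The sketch below uses only the standard library and a made-up four-sample feature; the function names mirror the ones described above but are written as standalone functions for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(labels, values, threshold):
    """Entropy reduction from splitting the samples at values <= threshold."""
    left = [y for y, v in zip(labels, values) if v <= threshold]
    right = [y for y, v in zip(labels, values) if v > threshold]
    if not left or not right:
        return 0.0  # a split with an empty side separates nothing
    n = len(labels)
    child = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - child


labels = [0, 0, 1, 1]
values = [1.0, 2.0, 3.0, 4.0]

print(entropy(labels))                        # 1.0 (two equally likely classes)
print(information_gain(labels, values, 2.0))  # 1.0 (both children are pure)
print(information_gain(labels, values, 4.0))  # 0.0 (right child is empty)
```

The split at 2.0 removes all uncertainty, so it would be preferred over any threshold with a lower gain.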

Overall, the code example is a compact implementation of the Random Forest algorithm in Python. It combines bootstrapping, feature sampling, and entropy-based splits to make its predictions.

Sample code Random Forest

Here is sample code to implement a random forest in Python:

import numpy as np
from collections import Counter
from sklearn.datasets import make_classification


class Node:
    # Leaves carry a value; inner nodes carry a split (feature_idx, threshold)
    def __init__(self, feature_idx=None, threshold=None, left=None, right=None, value=None):
        self.feature_idx = feature_idx
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value


class RandomForest:
    def __init__(self, n_trees=10, max_depth=5, min_samples_split=5):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.trees = []

    def fit(self, X, y):
        self.trees = []
        n_samples = X.shape[0]
        for _ in range(self.n_trees):
            # Bootstrap: draw n_samples rows with replacement
            sample_idxs = np.random.choice(n_samples, n_samples, replace=True)
            tree = self.build_tree(X[sample_idxs], y[sample_idxs], 0)
            self.trees.append(tree)

    def build_tree(self, X, y, depth):
        n_samples, n_features = X.shape
        n_labels = len(np.unique(y))
        # Stop at max depth, on a pure node, or when too few samples remain
        if depth >= self.max_depth or n_labels == 1 or n_samples < self.min_samples_split:
            return Node(value=self.most_common_label(y))
        # Feature sampling: consider a random subset of sqrt(n_features) features
        n_sub = max(1, int(np.sqrt(n_features)))
        feature_idxs = np.random.choice(n_features, n_sub, replace=False)
        best_feature_idx, best_threshold = self.best_split(X, y, feature_idxs)
        if best_feature_idx is None:
            # No candidate split improves purity, so create a leaf
            return Node(value=self.most_common_label(y))
        left_idxs, right_idxs = self.split(X[:, best_feature_idx], best_threshold)
        left = self.build_tree(X[left_idxs, :], y[left_idxs], depth + 1)
        right = self.build_tree(X[right_idxs, :], y[right_idxs], depth + 1)
        return Node(best_feature_idx, best_threshold, left, right)

    def predict(self, X):
        predictions = np.zeros((X.shape[0], len(self.trees)))
        for i, tree in enumerate(self.trees):
            predictions[:, i] = [self.predict_tree(x, tree) for x in X]
        # Rounding the mean vote acts as a majority vote for binary 0/1 labels only
        return np.round(np.mean(predictions, axis=1))

    def predict_tree(self, x, tree):
        # Walk down the tree for a single sample x until a leaf is reached
        if tree.value is not None:
            return tree.value
        if x[tree.feature_idx] <= tree.threshold:
            return self.predict_tree(x, tree.left)
        return self.predict_tree(x, tree.right)

    def best_split(self, X, y, feature_idxs):
        # Try every observed value of every candidate feature as a threshold
        best_gain = 0
        split_idx, split_threshold = None, None
        for feature_idx in feature_idxs:
            X_column = X[:, feature_idx]
            for threshold in np.unique(X_column):
                gain = self.information_gain(y, X_column, threshold)
                if gain > best_gain:
                    best_gain = gain
                    split_idx = feature_idx
                    split_threshold = threshold
        return split_idx, split_threshold

    def information_gain(self, y, X_column, split_threshold):
        parent_entropy = self.entropy(y)
        left_idxs, right_idxs = self.split(X_column, split_threshold)
        if len(left_idxs) == 0 or len(right_idxs) == 0:
            return 0
        n = len(y)
        n_l, n_r = len(left_idxs), len(right_idxs)
        e_l, e_r = self.entropy(y[left_idxs]), self.entropy(y[right_idxs])
        child_entropy = (n_l / n) * e_l + (n_r / n) * e_r
        return parent_entropy - child_entropy

    def entropy(self, y):
        _, counts = np.unique(y, return_counts=True)
        probabilities = counts / len(y)
        return np.sum(probabilities * -np.log2(probabilities))

    def split(self, X_column, threshold):
        left_idxs = np.argwhere(X_column <= threshold).flatten()
        right_idxs = np.argwhere(X_column > threshold).flatten()
        return left_idxs, right_idxs

    def most_common_label(self, y):
        counter = Counter(y)
        most_common = counter.most_common(1)
        return most_common[0][0]

# Generate data and fit RandomForest model
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
rf = RandomForest(n_trees=10, max_depth=5, min_samples_split=5)
rf.fit(X, y)

# Make predictions
X_test = np.random.randn(10, 10)
predictions = rf.predict(X_test)

This is a very simple implementation of the Random Forest algorithm in Python. However, as you can see, it is cumbersome and you have to take care of many details yourself, for example checking whether the maximum tree depth has been reached, building the decision trees, and aggregating their votes.

Therefore, in practice, packages are often used to implement a Random Forest. Below are several Python packages that you can use for this.

Random Forest with Scikit-Learn

Scikit-Learn is a popular package in the machine learning world and offers many algorithms and features. With Scikit-Learn you can implement random forests with only a few lines of code.

Here is sample code:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Create data and fit RandomForest model
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
rf = RandomForestClassifier(n_estimators=10, max_depth=5, min_samples_split=5, random_state=42)
rf.fit(X, y)
# Make predictions
X_test = np.random.randn(10, 10)
predictions = rf.predict(X_test)
print(predictions)
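The sample above predicts on random noise, so there is nothing to compare the output against. A minimal sketch of a held-out evaluation follows, assuming the same synthetic data; the 80/20 split ratio is an illustrative choice, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
# Hold out 20% of the rows so the model is scored on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=10, max_depth=5, min_samples_split=5, random_state=42)
rf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, rf.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```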

Random Forest in XGBoost

XGBoost is a package that was developed specifically for decision trees and gradient boosting algorithms. It is fast and offers many options for customizing models. XGBoost also supports Random Forest models.

Here is sample code:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
# Create data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
# Create XGBoost data structure
dtrain = xgb.DMatrix(X, label=y)
# Set hyperparameters for a random forest rather than boosting:
# several parallel trees, no shrinkage, row and column subsampling
params = {
    "objective": "multi:softprob",
    "eval_metric": "mlogloss",
    "num_class": len(np.unique(y)),
    "max_depth": 5,
    "subsample": 0.8,
    "colsample_bynode": 0.8,
    "num_parallel_tree": 10,
    "learning_rate": 1,
    "seed": 42
}
# Fit model: a single round, so all trees belong to one forest
num_round = 1
bst = xgb.train(params, dtrain, num_round)
# Make predictions (one row of class probabilities per sample)
X_test = np.random.randn(10, 10)
dtest = xgb.DMatrix(X_test)
predictions = bst.predict(dtest)
print(predictions)

Random Forest in LightGBM

LightGBM is another fast package for decision trees and gradient boosting algorithms. It also provides an option for random forest models.

Here is sample code:

import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_classification
# Create data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
# Create LightGBM data structure
lgb_train = lgb.Dataset(X, y)
# Set hyperparameters; random forest mode needs row subsampling
# to be switched on via bagging_freq
params = {
    "objective": "multiclass",
    "num_class": len(np.unique(y)),
    "boosting_type": "rf",
    "max_depth": 5,
    "subsample": 0.8,
    "bagging_freq": 1,
    "colsample_bytree": 0.8,
    "seed": 42
}
# Fit model
num_round = 10
bst = lgb.train(params, lgb_train, num_round)
# Make predictions (one row of class probabilities per sample)
X_test = np.random.randn(10, 10)
predictions = bst.predict(X_test)
print(predictions)

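Linear baselines in Statsmodels

Statsmodels is a package that specializes in statistical modeling and analysis. It does not include a Random Forest estimator, so it cannot train a forest on its own. It is still useful next to one, because its linear models give an interpretable baseline to compare a Random Forest against.

Here is sample code for a regression baseline (ordinary least squares, not a Random Forest):

import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_regression
# Generate data
X, y = make_regression(n_samples=100, n_features=10, random_state=42)
# Fit an ordinary least squares model (no intercept column is added)
ols = sm.OLS(y, X)
result = ols.fit()
# Make predictions
X_test = np.random.randn(10, 10)
predictions = result.predict(X_test)
print(predictions)

Here is sample code for a classification baseline (logistic regression, not a Random Forest):

import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_classification
# Generate data; n_redundant=0 avoids collinear columns that would make the fit ill-conditioned
X, y = make_classification(n_samples=100, n_features=10, n_redundant=0, random_state=42)
# Fit a logistic regression (statsmodels has no RandomForest class)
logit = sm.Logit(y, X)
result = logit.fit()
# Make predictions (one probability of class 1 per sample)
X_test = np.random.randn(10, 10)
predictions = result.predict(X_test)
print(predictions)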

Conclusion

Random Forest is a popular algorithm in machine learning and is widely used to make predictions. Several Python packages make it easy to implement Random Forest models, such as Scikit-Learn, XGBoost and LightGBM. Depending on your requirements and the amount of data, you can choose the package that suits you best.

Florian Zyprian Avatar
