Random Forest, an ensemble of decision trees, is one of the most popular methods in the machine learning world and is often used to make predictions. In this article, we will discuss several Python packages that you can use to implement a Random Forest.
Let's start with how you can implement a random forest in Python without using packages.
Implementation of a random forest in Python
A Random Forest consists of a collection of decision trees. The algorithm draws random samples of the training data and the features and builds a decision tree from each sample. A prediction is made by averaging the predictions of the individual trees (for regression) or by majority vote (for classification).
This code sample implements the Random Forest algorithm in Python. The goal of the algorithm is to generate a set of decision trees and increase the prediction accuracy by aggregating their predictions.
Pure Python implementation of a random forest
The code starts by importing the required libraries, namely numpy and collections. In addition, the Scikit-Learn datasets module is imported in order to use the function make_classification(), which generates an artificial data set on which the algorithm can be trained and tested.
Next, a class Node is defined to represent the nodes of a decision tree. Each node has several attributes, namely feature_idx, threshold, left, right, and value.
The class RandomForest is the main class that implements the algorithm. It has several parameters, namely n_trees, max_depth, and min_samples_split. The fit() function takes the training data as input and builds the decision trees, each one on a random bootstrap sample of the data.
The function build_tree() creates a decision tree by recursively splitting the data. When the maximum depth is reached, only one class remains, or the sample falls below the minimum size, a leaf node is created with the most frequently occurring label. Otherwise, a random subset of features is selected, the best threshold for splitting is calculated, and the data is split accordingly. This process is then applied recursively to the two subtrees until the termination conditions are met.
The predict() function generates predictions for new data by aggregating the predictions of all decision trees. The predict_tree() function traverses a single decision tree and returns its prediction for one sample.
The function best_split() selects the best feature and threshold for splitting a decision node by calculating the information gain. The information_gain() function calculates the information gain for a given threshold value, while the function entropy() calculates the entropy of a node. The split() function splits the data into two subsets, one for each branch of the decision tree. The function most_common_label() returns the most frequently occurring label.
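To make the entropy and information gain calculations concrete, here is a small worked example (a minimal sketch, independent of the class below): for a node with two samples of each class, the entropy is exactly 1 bit, and a split that perfectly separates the classes yields an information gain of 1.
import numpy as np

y = np.array([0, 0, 1, 1])

# Entropy of the parent node: -(0.5*log2(0.5) + 0.5*log2(0.5)) = 1.0 bit
_, counts = np.unique(y, return_counts=True)
p = counts / len(y)
parent_entropy = np.sum(p * -np.log2(p))

# A perfect split puts both 0s left and both 1s right; each child is pure,
# so the child entropy is 0 and the information gain equals the parent entropy.
print(parent_entropy)  # 1.0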
Overall, this code example is a clear implementation of the Random Forest algorithm in Python. It uses techniques such as bootstrapping, feature sampling, and entropy calculations to make accurate predictions.
Sample code for a Random Forest
Here is a sample code to implement a random forest in Python:
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification

class Node:
    def __init__(self, feature_idx=None, threshold=None, left=None, right=None, value=None):
        self.feature_idx = feature_idx
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value

class RandomForest:
    def __init__(self, n_trees=10, max_depth=5, min_samples_split=5):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.trees = []

    def fit(self, X, y):
        n_samples, n_features = X.shape
        for i in range(self.n_trees):
            # Bootstrapping: draw n_samples rows with replacement for each tree
            sample_idxs = np.random.choice(n_samples, n_samples, replace=True)
            X_bootstrap = X[sample_idxs]
            y_bootstrap = y[sample_idxs]
            tree = self.build_tree(X_bootstrap, y_bootstrap, 0)
            self.trees.append(tree)

    def build_tree(self, X, y, depth):
        n_samples, n_features = X.shape
        n_labels = len(np.unique(y))
        # Stop when the maximum depth is reached, the node is pure,
        # or too few samples remain
        if depth == self.max_depth or n_labels == 1 or n_samples < self.min_samples_split:
            leaf_value = self.most_common_label(y)
            return Node(value=leaf_value)
        # Feature sampling: consider only sqrt(n_features) random features per split
        feature_idxs = np.random.choice(n_features, int(np.sqrt(n_features)), replace=False)
        best_feature_idx, best_threshold = self.best_split(X, y, feature_idxs)
        left_idxs, right_idxs = self.split(X[:, best_feature_idx], best_threshold)
        # Guard against degenerate splits (e.g. a constant feature column)
        if len(left_idxs) == 0 or len(right_idxs) == 0:
            return Node(value=self.most_common_label(y))
        left = self.build_tree(X[left_idxs, :], y[left_idxs], depth + 1)
        right = self.build_tree(X[right_idxs, :], y[right_idxs], depth + 1)
        return Node(best_feature_idx, best_threshold, left, right)

    def predict(self, X):
        predictions = np.zeros((X.shape[0], len(self.trees)))
        for i, tree in enumerate(self.trees):
            # Predict each sample individually with the current tree
            predictions[:, i] = np.array([self.predict_tree(x, tree) for x in X])
        # Majority vote: round the mean of the per-tree predictions
        return np.round(np.mean(predictions, axis=1))

    def predict_tree(self, x, tree):
        if tree.value is not None:
            return tree.value
        if x[tree.feature_idx] <= tree.threshold:
            return self.predict_tree(x, tree.left)
        else:
            return self.predict_tree(x, tree.right)

    def best_split(self, X, y, feature_idxs):
        best_gain = -np.inf
        split_idx, split_threshold = None, None
        for feature_idx in feature_idxs:
            X_column = X[:, feature_idx]
            thresholds = np.unique(X_column)
            for threshold in thresholds:
                gain = self.information_gain(y, X_column, threshold)
                if gain > best_gain:
                    best_gain = gain
                    split_idx = feature_idx
                    split_threshold = threshold
        return split_idx, split_threshold

    def information_gain(self, y, X_column, split_threshold):
        parent_entropy = self.entropy(y)
        left_idxs, right_idxs = self.split(X_column, split_threshold)
        if len(left_idxs) == 0 or len(right_idxs) == 0:
            return 0
        n = len(y)
        n_l, n_r = len(left_idxs), len(right_idxs)
        e_l, e_r = self.entropy(y[left_idxs]), self.entropy(y[right_idxs])
        # Weighted average entropy of the two children
        child_entropy = (n_l / n) * e_l + (n_r / n) * e_r
        return parent_entropy - child_entropy

    def entropy(self, y):
        _, counts = np.unique(y, return_counts=True)
        probabilities = counts / len(y)
        return np.sum(probabilities * -np.log2(probabilities))

    def split(self, X_column, threshold):
        left_idxs = np.argwhere(X_column <= threshold).flatten()
        right_idxs = np.argwhere(X_column > threshold).flatten()
        return left_idxs, right_idxs

    def most_common_label(self, y):
        counter = Counter(y)
        return counter.most_common(1)[0][0]

# Generate data and fit RandomForest model
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
rf = RandomForest(n_trees=10, max_depth=5, min_samples_split=5)
rf.fit(X, y)

# Make predictions
X_test = np.random.randn(10, 10)
predictions = rf.predict(X_test)
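To sanity-check the implementation, you can compare the model's predictions on the training data with the true labels. This is only a quick smoke test, not a proper evaluation:
# Training accuracy as a rough sanity check (use a held-out set for real evaluation)
train_accuracy = np.mean(rf.predict(X) == y)
print(train_accuracy)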
This is a deliberately simple implementation of the Random Forest algorithm in Python. As you can see, however, it is quite cumbersome and you have to take care of many details yourself, for example checking whether the maximum tree depth has been reached, building the individual decision trees, and aggregating their votes.
Therefore, in practice, packages are usually used to implement a Random Forest. Below are several Python packages that you can use for this.
Random Forest with Scikit-Learn
Scikit-Learn is a popular package in the machine learning world and offers many algorithms and features. With Scikit-Learn you can implement random forests with only a few lines of code.
Here is a sample code:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Generate data and fit RandomForest model
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
rf = RandomForestClassifier(n_estimators=10, max_depth=5, min_samples_split=5, random_state=42)
rf.fit(X, y)

# Make predictions
X_test = np.random.randn(10, 10)
predictions = rf.predict(X_test)
print(predictions)
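Scikit-Learn also provides useful extras around its Random Forest, for example an out-of-bag error estimate and feature importances. Here is a minimal sketch, reusing X and y from the sample above and refitting with oob_score enabled:
# Refit with out-of-bag scoring enabled (more trees give a stabler OOB estimate)
rf = RandomForestClassifier(n_estimators=100, max_depth=5, oob_score=True, random_state=42)
rf.fit(X, y)

print(rf.oob_score_)            # accuracy estimated on out-of-bag samples
print(rf.feature_importances_)  # impurity-based importance per feature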
Random Forest in XGBoost
XGBoost is a package that was developed specifically for decision trees and gradient boosting algorithms. It is fast and offers many options for customizing models. XGBoost also supports Random Forest models; note that the sample below trains a standard gradient-boosted ensemble, while a Random Forest configuration is sketched after it.
Here is a sample code:
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

# Generate data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)

# Create the XGBoost data structure
dtrain = xgb.DMatrix(X, label=y)

# Set hyperparameters
params = {
    "objective": "multi:softprob",
    "eval_metric": "mlogloss",
    "num_class": len(np.unique(y)),
    "max_depth": 5,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "seed": 42
}

# Fit the model
num_round = 10
bst = xgb.train(params, dtrain, num_round)

# Make predictions
X_test = np.random.randn(10, 10)
dtest = xgb.DMatrix(X_test)
predictions = bst.predict(dtest)
print(predictions)
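To get an actual Random Forest rather than boosted trees, XGBoost offers a scikit-learn-style wrapper, XGBRFClassifier, which grows all trees in parallel on bootstrap samples instead of boosting them. A minimal sketch, assuming a reasonably recent XGBoost version and reusing X, y, and X_test from above:
from xgboost import XGBRFClassifier

# XGBRFClassifier trains a bagged ensemble of parallel trees (a random forest)
rf_xgb = XGBRFClassifier(n_estimators=100, max_depth=5, subsample=0.8,
                         colsample_bynode=0.8, random_state=42)
rf_xgb.fit(X, y)
print(rf_xgb.predict(X_test))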
Random Forest in LightGBM
LightGBM is another fast package for decision trees and gradient boosting algorithms. It also provides a dedicated random forest mode; the sample below uses the default boosting mode, and the random forest configuration is sketched after it.
Here is a sample code:
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_classification

# Generate data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)

# Create the LightGBM data structure
lgb_train = lgb.Dataset(X, y)

# Set hyperparameters
params = {
    "objective": "multiclass",
    "num_class": len(np.unique(y)),
    "max_depth": 5,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "seed": 42
}

# Fit the model
num_round = 10
bst = lgb.train(params, lgb_train, num_round)

# Make predictions
X_test = np.random.randn(10, 10)
predictions = bst.predict(X_test)
print(predictions)
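For a true Random Forest in LightGBM, set the boosting type to "rf"; LightGBM then requires bagging to be enabled via bagging_fraction (below 1.0) and bagging_freq (above 0). A minimal sketch under those assumptions, reusing lgb_train, y, and X_test from above:
# Random forest mode: trees are bagged instead of boosted
rf_params = {
    "objective": "multiclass",
    "num_class": len(np.unique(y)),
    "boosting_type": "rf",
    "bagging_fraction": 0.8,   # fraction of rows sampled per tree (must be < 1.0)
    "bagging_freq": 1,         # perform bagging at every iteration (must be > 0)
    "feature_fraction": 0.8,   # fraction of features sampled per tree
    "max_depth": 5,
    "seed": 42
}
rf_bst = lgb.train(rf_params, lgb_train, num_boost_round=10)
print(rf_bst.predict(X_test))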
Random Forest and Statsmodels
Statsmodels is a package that specializes in statistical modeling and analysis. Unlike the packages above, it does not ship its own Random Forest implementation; its estimators are classical statistical models. If you work in statsmodels and need a comparable baseline, its regression and classification estimators are the closest fit.
Here is a sample code for a regression baseline with OLS (ordinary least squares):
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_regression

# Generate data
X, y = make_regression(n_samples=100, n_features=10, random_state=42)

# Fit the model
ols = sm.OLS(y, X)
result = ols.fit()

# Make predictions
X_test = np.random.randn(10, 10)
predictions = result.predict(X_test)
print(predictions)
Here is a sample code for a classification baseline with Logit (logistic regression):
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_classification

# Generate data
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# Fit the model
logit = sm.Logit(y, X)
result = logit.fit(disp=0)  # disp=0 suppresses the convergence output

# Make predictions (predicted class probabilities)
X_test = np.random.randn(10, 10)
predictions = result.predict(X_test)
print(predictions)
Conclusion
Random Forest is a popular algorithm in the world of machine learning and is often used to make predictions. Several Python packages facilitate the implementation of Random Forest models, such as Scikit-Learn, XGBoost, and LightGBM, while Statsmodels covers classical statistical models that can serve as baselines. Depending on your requirements and the amount of data, you can choose the package that suits you best.