DeepGBoost
DeepGBoost is a machine learning algorithm based on gradient boosting forests that combines tree ensembles with neural network architectures.
Algorithm
DeepGBoost implements the Distributed Gradient Boosting Forest (DGBF), introduced in:
Delgado-Panadero, A., Benitez-Andrades, J. A., & Garcia-Ordas, M. T. (2023). A generalized decision tree ensemble based on the NeuralNetworks architecture: Distributed Gradient Boosting Forest (DGBF). Applied Intelligence, 53, 22991-23003. https://doi.org/10.1007/s10489-023-04735-w
Classical tree ensemble methods such as RandomForest (bagging) and GradientBoosting (boosting) cannot perform hierarchical representation learning the way neural networks do. DGBF addresses this by combining both approaches into a unified formulation that defines a graph-structured tree ensemble with distributed representation learning, without requiring back-propagation or parametric models.
The ensemble prediction is:
$$F(x) = \sum_{l=1}^{L} RF_l(x) = \frac{1}{T} \sum_{l=1}^{L} \sum_{t=1}^{T} h_{l,t}(x)$$
where L is the number of boosting layers, T is the number of trees per layer, and h_{l,t} is the t-th tree of layer l. Each RandomForest layer is the analogue of a dense network layer, with distributed gradients replacing back-propagation.
RandomForest (L = 1) and GradientBoosting (T = 1) are recovered as special cases.
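To make the formula concrete, the following minimal sketch evaluates F(x) for a single input from a nested list of per-tree outputs. The function name `dgbf_predict` and the nested-list layout are illustrative assumptions and are not part of the deepgboost API; the sketch mirrors only the aggregation in the formula, not how the trees are trained.

```python
def dgbf_predict(tree_outputs):
    """Evaluate F(x) = sum over layers of the mean tree output in each layer.

    tree_outputs[l][t] holds h_{l,t}(x), the output of tree t in layer l
    for one input x (hypothetical layout, chosen for illustration only).
    """
    total = 0.0
    for layer in tree_outputs:
        # RF_l(x): average the T tree outputs of this layer
        total += sum(layer) / len(layer)
    return total


# L = 1 layer, T = 2 trees: reduces to a RandomForest-style average
print(dgbf_predict([[1.0, 3.0]]))                # 2.0

# L = 3 layers, T = 1 tree each: reduces to a GradientBoosting-style sum
print(dgbf_predict([[1.0], [0.5], [0.25]]))      # 1.75

# General case: L = 2 layers, T = 2 trees per layer
print(dgbf_predict([[1.0, 3.0], [0.5, 1.5]]))    # 3.0
```

The first two calls correspond to the two special cases named above: a single layer leaves only the average over trees, and a single tree per layer leaves only the sum over layers.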
Fig. 1 — NeuralNetwork vs DGBF architecture
Fig. 2 — RandomForest & GradientBoosting as DGBF special cases
Fig. 3 — Benchmark results across 9 UCI datasets
Installation
pip install deepgboost
Optional plotting support:
pip install deepgboost[plotting]
Install from source with development dependencies:
git clone https://github.com/delgadopanadero/deepgboost.git
cd deepgboost
pip install -e ".[dev]"
Quick Start
Regression
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from deepgboost import DeepGBoostRegressor
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DeepGBoostRegressor(n_trees=10, n_layers=10, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
Classification
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from deepgboost import DeepGBoostClassifier
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DeepGBoostClassifier(n_trees=10, n_layers=10, learning_rate=0.1)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
Early stopping with callbacks
from deepgboost import DeepGBoostRegressor, EarlyStoppingCallback, EvaluationMonitorCallback
model = DeepGBoostRegressor(n_trees=5, n_layers=20, learning_rate=0.05)
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    callbacks=[EvaluationMonitorCallback(period=5), EarlyStoppingCallback(rounds=10)],
)
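The exact stopping rule is not spelled out here. Assuming `rounds` follows the common patience convention (training stops once the evaluation metric has failed to improve for that many consecutive rounds), the logic can be sketched in plain Python. `early_stopping_round` is a hypothetical helper for illustration only and is not part of deepgboost.

```python
def early_stopping_round(eval_losses, rounds):
    """Return the 0-based round at which patience-based stopping triggers.

    eval_losses: per-round evaluation losses (lower is better).
    rounds: number of consecutive non-improving rounds tolerated.
    Returns None if the stopping condition is never met.
    """
    best_loss = float("inf")
    rounds_without_improvement = 0
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= rounds:
                return i
    return None


losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
print(early_stopping_round(losses, rounds=3))  # 5
```

In this example the best loss (0.7) is reached at round 2, and rounds 3 to 5 all fail to beat it, so stopping triggers at round 5.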
Citation
@article{delgado2023dgbf,
author = {Delgado-Panadero, {\'A}ngel and Ben{\'i}tez-Andrades, Jos{\'e} Alberto and Garc{\'i}a-Ord{\'a}s, Mar{\'i}a Teresa},
title = {A generalized decision tree ensemble based on the {NeuralNetworks} architecture: {Distributed Gradient Boosting Forest (DGBF)}},
journal = {Applied Intelligence},
volume = {53},
pages = {22991--23003},
year = {2023},
doi = {10.1007/s10489-023-04735-w}
}