Machine Learning: Decision Tree & Random Forest

Chetna Shahi
Oct 16, 2021


Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
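
As a quick illustration, here is a minimal sketch of fitting a decision tree with scikit-learn; the Iris dataset and the variable names are chosen purely for illustration:

```python
# Minimal sketch: fit a decision tree classifier on a toy dataset (Iris, for illustration)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                      # learn decision rules from the features
print("Test accuracy:", model.score(X_test, y_test))
```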

The Gini index is the loss function used by the decision tree to decide which column should be used for splitting the data, and at what point that column should be split. A lower Gini index indicates a better split; a perfect split (only one class on each side) has a Gini index of 0. The tree therefore splits the data on the column and value that separate the classes most cleanly.

It is calculated by subtracting the sum of squared probabilities of each class from one.
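
For example, a small helper (hypothetical, written just to illustrate the formula) could compute the Gini index of a node directly from this definition:

```python
# Sketch: Gini index = 1 - sum of squared class probabilities
import numpy as np

def gini_index(labels):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()        # probability of each class in the node
    return 1.0 - np.sum(p ** 2)      # subtract sum of squared probabilities from one

print(gini_index([0, 0, 0, 0]))      # 0.0 -> pure node (one side of a perfect split)
print(gini_index([0, 0, 1, 1]))      # 0.5 -> maximally mixed node for two classes
```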

The Gini index is also used in economics to analyze the distribution of income. A country in which every resident has the same income would have an income Gini coefficient of 0. A country in which one resident earned all the income, while everyone else earned nothing, would have an income Gini coefficient of 1.

Decision trees are prone to overfitting, where the model achieves very high accuracy on the training dataset but much lower accuracy on the test dataset. The process of reducing overfitting is known as regularization. You can tune the arguments (hyperparameters) of DecisionTreeClassifier, as the sketch after the list below shows:

  1. max_depth : By reducing the maximum depth of the decision tree, we can prevent the tree from memorizing all training examples, which may lead to better generalization (generalization describes how well a trained model classifies or forecasts unseen data).
  2. max_leaf_nodes : Controls the size and complexity of a decision tree by limiting the number of leaf nodes. Unlike max_depth, this allows branches of the tree to have varying depths.
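
As a sketch of the effect of these two arguments (reusing the train/test split from the earlier example), you can compare an unconstrained tree with a regularized one:

```python
# Sketch: compare an unconstrained tree with a regularized one
# (X_train, X_test, y_train, y_test come from the earlier example)
from sklearn.tree import DecisionTreeClassifier

# Unconstrained tree: free to memorize the training set
overfit_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Regularized tree: limited depth and leaf count for better generalization
pruned_tree = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=10,
                                     random_state=42).fit(X_train, y_train)

for name, tree in [("unconstrained", overfit_tree), ("regularized", pruned_tree)]:
    print(name,
          "train:", tree.score(X_train, y_train),
          "test:", tree.score(X_test, y_test))
```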

The performance of a tree can be further improved by pruning, which involves removing branches that rely on features of low importance. By tuning the hyperparameters of a decision tree model we can prune the tree and prevent overfitting. While tuning the hyperparameters of a single decision tree may lead to some improvement, a much more effective strategy is to combine the results of several decision trees trained with slightly different parameters. This is called a random forest model.

Model parameters are estimated from the data during training, while model hyperparameters are set manually and guide the process that estimates the model parameters.
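
A quick way to see this distinction in scikit-learn (again a sketch that reuses the earlier training data) is to compare what you pass into the constructor with what the model learns during fit():

```python
# Sketch: hyperparameters are chosen by hand, parameters are learned from data
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=3)    # max_depth is a hyperparameter we set
model.fit(X_train, y_train)

print(model.get_params()["max_depth"])         # 3 -- fixed before training
print(model.tree_.node_count)                  # learned tree structure (parameters)
print(model.feature_importances_)              # also estimated from the data
```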

The key idea here is that each decision tree in the forest will make different kinds of errors, and upon averaging, many of their errors will cancel out.

The technique of combining the results of many models is called “ensembling”; it works because most errors of individual models cancel out on averaging. You can tune the arguments (hyperparameters) of RandomForestClassifier, as the sketch after the list below shows:

  1. max_depth : By reducing the maximum depth of the decision tree, we can prevent the tree from memorizing all training examples, which may lead to better generalization
  2. max_leaf_nodes : Controls the size and complexity of a decision tree by limiting the number of leaf nodes. This allows branches of the tree to have varying depths.
  3. n_estimators: Controls the number of decision trees in the random forest. The default value is 100. For larger datasets, it helps to have a greater number of estimators. As a general rule, try to have as few estimators as needed.
  4. max_features: Instead of picking all features (columns) for every split, we can specify that only a fraction of features be chosen randomly to figure out a split.
  5. min_samples_split and min_samples_leaf : By default, a node can be split as long as it contains at least two samples, and a leaf may contain a single sample. Increase the values of these arguments to require more samples per split or per leaf and reduce overfitting, especially for very large datasets.
  6. min_impurity_decrease : Controls the threshold for splitting nodes. A node will be split only if the split induces a decrease in impurity (Gini index) greater than or equal to this value. Its default value is 0, and you can increase it to reduce overfitting.
  7. bootstrap : A random forest doesn’t use the entire dataset for training each decision tree. Instead, it applies a technique called bootstrapping: for each tree, rows are picked from the dataset one by one at random, with replacement, i.e. some rows may not show up at all, while others may show up multiple times.
  8. max_samples : The max_samples hyperparameter determines what fraction of the original dataset is given to any individual tree.
  9. class_weight : The different types of input to this parameter let you handle class imbalance in different ways. By default, when no value is passed, each class is assigned an equal weight of 1. Say there are two classes labelled 0 and 1: passing class_weight={0: 2, 1: 1} means class 0 has weight 2 and class 1 has weight 1.
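
A sketch that brings several of these hyperparameters together is shown below; the dataset and the specific values are purely illustrative, not recommendations:

```python
# Sketch: a random forest using several of the hyperparameters above
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A small synthetic binary-classification dataset, just for illustration
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=200,             # number of trees in the forest
    max_depth=10,                 # limit the depth of each tree
    max_features="sqrt",          # pick a random subset of features at each split
    min_samples_split=4,          # need at least 4 samples to split a node
    min_samples_leaf=2,           # need at least 2 samples in each leaf
    min_impurity_decrease=0.001,  # split only if impurity drops by at least this much
    bootstrap=True,               # sample rows with replacement for each tree
    max_samples=0.8,              # give each tree 80% of the rows
    class_weight={0: 2, 1: 1},    # class 0 weighted twice as heavily as class 1
    random_state=42,
)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
```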
