This is going to be a series of tutorials describing various ensemble techniques and approaches, along with a high-level idea of how to implement each of them in Python. A few experiments are performed on the famous Iris dataset, where the task is to classify the plant species from four key attributes: sepal length, sepal width, petal length, and petal width.
To begin the series, I am going to start with a simple stacking example. Here I will use a self-generated random dataset with two input variables, X1 and X2, and an output variable Y. To start the experiment, let's perform the basic steps to set up the notebook:
Import the required Python libraries for the experiment
import numpy as np
import pandas as pd
import os
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
Import the dataset, add the column names, and shuffle the rows
#Process on iris dataset
dataset = pd.read_csv("../input/iris-dataset/iris.data.csv")
dataset.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
dataset = dataset.sample(frac=1).reset_index(drop=True)
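If the Kaggle CSV is not available, the same dataset can be loaded directly from scikit-learn. This is an alternative sketch, not part of the original notebook, and it assumes scikit-learn 0.23 or later for the as_frame option:
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
dataset = iris.frame  # DataFrame with the four measurements plus a "target" column
dataset = dataset.sample(frac=1).reset_index(drop=True)  # shuffle the rows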
#Process on random dataset
random = pd.read_csv("../input/randomdata/random-data.csv")
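If you do not have this CSV, a similar random dataset can be generated with NumPy. This is a minimal sketch assuming two numeric features and a binary class label; the actual contents of random-data.csv are not shown in the post, so the column layout here is hypothetical:
rng = np.random.RandomState(1001)
X1 = rng.normal(size=200)
X2 = rng.normal(size=200)
Y = ((X1 + X2 + rng.normal(scale=0.5, size=200)) > 0).astype(int)  # noisy linear decision rule
random = pd.DataFrame({"X1": X1, "X2": X2, "Y": Y})
# note: this frame has three columns, so the features would be random.iloc[:, 0:2]
# and the target random.iloc[:, 2]; the slices below assume the original four-column CSV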
Create train and test split
data_x = random.iloc[:, 0:3]
data_y = random.iloc[:, 3]
train_x, test_x, train_y, test_y = train_test_split(data_x, data_y, test_size=0.20, random_state=1001)
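As a quick sanity check (an addition of mine, not in the original post), the split sizes can be printed:
print(train_x.shape, test_x.shape)  # an 80/20 split, e.g. (160, 3) and (40, 3) for 200 rows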
Implementing Stacking
Stacking is a technique in which we pick Model1 as a base model and create K folds from the training set; the base model then learns from K-1 of those folds, and a prediction is made on the K-th fold (the held-out split of the training data). This process is repeated K times, so that each fold serves once as the validation set while the remaining K-1 folds are used for training.
The approach is demonstrated in the experiment below. The stacking function is defined with five arguments: the base model, the training features, the training labels, the test features, and the number of folds. Stratified K-folds are used to create the K splits. In each iteration the model is fit on K-1 splits and predicts the results of the K-th split (which acts as a validation set), and this is repeated K times so that every split is predicted exactly once. In each iteration the fitted base model also predicts the test set, and the K per-fold test predictions are then combined by majority vote.
The entire process is then repeated for the next base model, Model2, resulting in an entirely new set of predictions for the train set and the test set.
Method Definition:
def stacking(model, train, y, test, n_fold):
    # stratified folds keep the class balance in every split
    folds = sklearn.model_selection.StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=1001)
    train_pred = np.empty(train.shape[0])
    test_pred = np.empty((test.shape[0], n_fold))
    for i, (train_indices, val_indices) in enumerate(folds.split(train, y.values)):
        x_train, x_val = train.iloc[train_indices], train.iloc[val_indices]
        y_train = y.iloc[train_indices]
        # fit the base model on K-1 folds
        model.fit(X=x_train, y=y_train)
        # store the out-of-fold predictions at their original row positions
        train_pred[val_indices] = model.predict(x_val)
        # predict the full test set with the model fitted on this fold
        test_pred[:, i] = model.predict(test)
    # combine the K per-fold test predictions by majority vote (one reasonable choice)
    test_pred = pd.DataFrame(test_pred).mode(axis=1)[0].values
    return test_pred, train_pred
Model 1
model1 = DecisionTreeClassifier(random_state=1)
test_pred1, train_pred1 = stacking(model=model1, n_fold=5, train=train_x, test=test_x, y=train_y)
train_pred1 = pd.DataFrame(train_pred1).astype(int)
test_pred1 = pd.DataFrame(test_pred1).astype(int)
Model 2
model2 = KNeighborsClassifier()  # model2 was never defined in the original post; KNN (imported above) is assumed here
test_pred2, train_pred2 = stacking(model=model2, n_fold=5, train=train_x, test=test_x, y=train_y)
train_pred2 = pd.DataFrame(train_pred2).astype(int)
test_pred2 = pd.DataFrame(test_pred2).astype(int)
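Before stacking, it is worth checking how each base model does on its own out-of-fold predictions. This quick check is an addition for illustration; accuracy_score comes from sklearn.metrics:
from sklearn.metrics import accuracy_score
# out-of-fold accuracy of each base model on the training data
print(accuracy_score(train_y, train_pred1[0]))  # column 0 holds the predictions
print(accuracy_score(train_y, train_pred2[0]))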
Once we have the out-of-fold predictions on the training set and the combined predictions on the test set, we use these predictions as a new set of features to train a Model3 (stacking the results from the two base models above). Finally, the third model predicts on the stacked test features to produce the final predictions. Below is the code to implement this.
df_final_train = pd.concat([train_pred1, train_pred2], axis=1, ignore_index=True)  # distinct column names
df_final_test = pd.concat([test_pred1, test_pred2], axis=1, ignore_index=True)
Model 3
model3 = DecisionTreeClassifier(random_state=1)
model3.fit(X=df_final_train, y=train_y)
pred = model3.predict(df_final_test.reset_index(drop=True))
model3.score(df_final_test.reset_index(drop=True), test_y.reset_index(drop=True))
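To see whether stacking actually helped, the stacked score can be compared with each base model trained directly on the raw features. This comparison is an addition for illustration, not part of the original post:
# fit each base model on the raw training features and score it on the test set
for m in (model1, model2):
    m.fit(train_x, train_y)
    print(type(m).__name__, m.score(test_x, test_y))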
Conclusion
This example shows a basic and simple way to implement ensemble stacking using simple base models. The approach combines the predictive power of several simple base models to produce better predictions. In the next tutorial we will look at another basic ensemble approach, very similar to stacking, called blending.