Travel Insurance Claim: A Classification Problem

Surya Vamsi Chenduluru
Mar 19, 2021 · 4 min read


It has become increasingly important for travel insurance companies to know whether an insured person will claim on a policy in the future, since accurate predictions can save them thousands of dollars.

(Image source: Steve Nix)

I got this dataset from Kaggle. It is an imbalanced dataset: roughly 15% of the records are claimed and roughly 85% are not claimed. This is because far more people take out a policy than ever claim on one, similar to fraud detection for credit card transactions.
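A quick way to verify that split (a sketch, assuming the Kaggle CSV is already loaded into a pandas DataFrame called df, as in the snippets below):

# Share of each Claim label: roughly 0.85 not claimed vs 0.15 claimed
print(df['Claim'].value_counts(normalize=True))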

My domain knowledge of travel insurance made the problem itself easier to solve, but selecting the best algorithm, feature extraction, and EDA took most of my time.

Coming to the dataset, we have 4 numerical and 7 categorical columns (counting the Claim target):
Agency 63326 non-null object
Agency Type 63326 non-null object
Distribution Channel 63326 non-null object
Product Name 63326 non-null object
Claim 63326 non-null object
Duration 63326 non-null int64
Destination 63326 non-null object
Net Sales 63326 non-null float64
Commision (in value) 63326 non-null float64
Gender 18219 non-null object
Age 63326 non-null int64

Here is some EDA I performed on this dataset, along with the observations it produced.

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
sns.pairplot(df, hue="Claim")
plt.show()

From the above pairplot we can draw some conclusions:

  1. We can clearly see that Commision and Net Sales are correlated, and their distributions look similar (roughly linearly related).
  2. So we can drop either 'Commision' or 'Net Sales', and it should not affect the model.
  3. Also, Net Sales should never be negative; the negative values are likely outliers and should be removed (a quick sketch of both checks follows this list).
  4. We observe that ~20% of the insured are in the 35–40 age range.
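Here is what those two checks might look like in pandas (my sketch, using the column names from the schema above, not the author's original code):

# Pearson correlation between the two overlapping features
print(df['Net Sales'].corr(df['Commision (in value)']))

# Outlier cleanup from point 3: keep only non-negative Net Sales
df = df[df['Net Sales'] >= 0]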

for i in range(len(df)):
    if df['Duration'][i] < 0:
        print(df['Duration'].iloc[i])

-1
-1
-1
-1
-1

We have negative values in the Duration column. But can time be negative? Yes, sometimes!

This is due to timezone differences. For example, you take a one-way flight at 12:10 AM on 18/3/2021 and land in another country where the local time is 11:50 PM on 17/3/2021.

In this case the duration stored in the database will be -1, and our model should handle it. So I take the absolute value, turning -1 into 1.
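In pandas, that fix is a one-liner (a sketch, assuming the same df used throughout):

# Convert timezone-induced negative durations to their absolute value
df['Duration'] = df['Duration'].abs()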

Coming to categorical variables:

Agency
Agency Type
Distribution Channel
Gender
Product Name

Let's look at the Destination pie chart. (Please excuse how cluttered the pie is; we could keep the top 20 or 25 destinations and group the rest as 'Others'.)

Here I performed one-hot encoding on all the other categorical features, trained my models, and looked at the feature importances; Destination turned out to matter the least, so I removed it from the dataset. The Gender feature is populated for only a minority of rows (18,219 of 63,326), so we can ignore that feature as well.
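One possible encoding step consistent with that description (my reconstruction, not the author's exact code; the column lists are assumptions, and df_numerical is the name used in the training snippet further down):

import pandas as pd

cat_cols = ['Agency', 'Agency Type', 'Distribution Channel', 'Product Name']

# Drop Destination (low importance), Gender (mostly missing), Commision
# (correlated with Net Sales), and the Claim target, then one-hot encode
df_numerical = pd.get_dummies(
    df.drop(columns=['Destination', 'Gender', 'Commision (in value)', 'Claim']),
    columns=cat_cols,
)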

I applied several techniques that are generally used for imbalanced datasets, plus a few algorithms that can classify them well. Below are the ones I tried:

Methods:

  1. Oversampling: oversample the minority-class records so the dataset becomes more balanced. (Didn't work on this dataset.)
  2. Undersampling: undersample the majority class; here we may lose some important patterns or data. (Didn't work on this dataset.)
  3. SMOTE: synthesizes new minority-class samples using a KNN-based approach to balance the dataset. (Improved my F1-score, but not the best fit for my dataset; a sketch follows this list.)
  4. Weighting: a few algorithms in sklearn accept a "class_weight" parameter, where passing 'balanced' (or explicit weights) balances the data by giving more weight to the minority-class labels.
  5. Algorithms: Logistic Regression, SVM, Decision Tree, Random Forest. (Random Forest with weighting worked for me on this dataset.)
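For reference, the SMOTE step might look like this with the imbalanced-learn package (my sketch; the original post doesn't show this code, and X_train/y_train come from the split shown below):

from imblearn.over_sampling import SMOTE

# Resample only the training split so the test set stays untouched
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)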

Code Sample:

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Y = df['Claim']
X = df_numerical
print(X.shape)
print(Y.shape)

# Splitting the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100, stratify=Y)

# Heavily weight the rare claimed class (label 1)
clf = RandomForestClassifier(n_estimators=100, random_state=0,
                             class_weight={0: 1, 1: 98.5})
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print('score on test set:', clf.score(X_test, y_test))
print(metrics.classification_report(y_true=y_test, y_pred=pred))

Here is the output:

score on test set: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     17953
           1       1.00      1.00      1.00       277

    accuracy                           1.00     18230
   macro avg       1.00      1.00      1.00     18230
weighted avg       1.00      1.00      1.00     18230
Someone: "Predicting 100% is not possible." Me: "Hey!! Have you seen my model?"

That was a joke!! But even I was surprised by my model's outcome, lol!

One important point to note: accuracy should not be your metric when you are working with an imbalanced dataset, because if 99% of the data points are 1s, you can get 99% accuracy by simply predicting 1 for every query point.
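You can see this concretely with scikit-learn's DummyClassifier (my illustration, not from the original post):

from sklearn.dummy import DummyClassifier

# Always predicts the majority class (0), yet scores ~98.5% accuracy here
# because 17,953 of the 18,230 test rows are class 0
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
print('baseline accuracy:', baseline.score(X_test, y_test))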

The important metrics for an imbalanced dataset are the confusion matrix, the F1-score, and the ROC AUC curve.

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

mat = confusion_matrix(y_test, pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=True, cmap=plt.cm.Greens)
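The snippet above covers the confusion matrix; the F1-score and ROC AUC can be computed along the same lines (a sketch, assuming the fitted clf from above and the 0/1 Claim encoding seen in the report):

from sklearn.metrics import f1_score, roc_auc_score

# Probability of the positive (claimed) class drives the ROC AUC
proba = clf.predict_proba(X_test)[:, 1]
print('F1-score:', f1_score(y_test, pred))
print('ROC AUC :', roc_auc_score(y_test, proba))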

This is my first blog post; please show it some appreciation, as that will help me write more.

Surya Vamsi Chenduluru
Certified Google Cloud Professional Data Engineer and an aspiring Data Scientist. "Anyone who stops learning is old, whether at twenty or eighty." Happy learning!