This project addresses a binary classification problem: predicting whether a credit card client will default on their payment in the next month. It uses the Default of Credit Card Clients Dataset, which contains 30,000 records and 24 features, including demographic information, payment history, and billing amounts from April to September 2005.
The target variable is default.payment.next.month, indicating whether the client defaulted (1=yes, 0=no). The remaining features provide monthly financial and repayment data that help inform this prediction task.
This dataset was originally featured in a research study comparing classification models for default prediction. Results in this project may be compared against findings from the paper.
The columns ID, LIMIT_BAL, SEX, EDUCATION, MARRIAGE, and AGE are self-explanatory. Columns PAY_0 and PAY_2 to PAY_6 (the dataset has no PAY_1) record the repayment status for each month on an ordinal scale (-1 = paid duly, 1 = payment delayed by one month, 2 = payment delayed by two months, …, 8 = payment delayed by eight months, and 9 = payment delayed by nine months or more). Columns BILL_AMT1 to BILL_AMT6 represent the amount on the bill statement for each month. Columns PAY_AMT1 to PAY_AMT6 represent the amount of the previous payment for each month.
Preliminary analysis
The PAY, BILL_AMT, and PAY_AMT columns pose an interesting feature-engineering problem: should they be treated as individual months, or combined into some kind of time-series feature that captures trends within the 6-month period?
Each month has a different number of days. Do we need to look at the data on a per day basis? Does the granularity matter?
Should we look at relative proportions instead? For example, PAY_AMT/BILL_AMT instead of the absolute value of PAY_AMT.
EDUCATION has one value for “others” and two different values for “unknown”. Is there a meaningful difference between them?
PAY records delays of up to 9 months, but our data spans only 6 months. For customers who are 7 to 9 months late, we therefore have some insight into their behaviour before the 6-month window, whereas for other customers we do not.
Should AGE be binned into age groups instead?
Drop SEX because it’s not appropriate to use gender as a basis to determine whether someone would default.
The dataset only covers data for each month from April 2005 to September 2005, which is somewhat limited.
All features are numeric.
No missing values. Therefore we assume all customers have been at the bank for at least 6 months.
There is a class imbalance in the TARGET column, which we will address in the following sections.
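The questions above about relative proportions and trends could be explored with a small feature-engineering sketch like the one below. It is illustrative only: the function name `add_ratio_and_trend`, the new column names, and the toy data are hypothetical, and pairing PAY_AMT with the BILL_AMT of the same index is an assumption (a payment may in fact settle the previous month's bill).

```python
import numpy as np
import pandas as pd


def add_ratio_and_trend(df: pd.DataFrame) -> pd.DataFrame:
    """Add pay-to-bill ratios and a linear bill trend (illustrative only)."""
    out = df.copy()
    for i in range(1, 7):
        bill = out[f"BILL_AMT{i}"].astype(float)
        pay = out[f"PAY_AMT{i}"].astype(float)
        # A ratio is only meaningful for a positive bill; otherwise keep NaN.
        out[f"PAY_RATIO{i}"] = (pay / bill).where(bill > 0)

    # Index 1 is September and index 6 is April, so reverse the columns
    # to get chronological order before fitting a straight line.
    bills = out[[f"BILL_AMT{i}" for i in range(6, 0, -1)]].to_numpy(dtype=float)
    months = np.arange(6)
    # np.polyfit fits one line per column of y, hence the transpose;
    # row 0 of the result holds the slopes.
    out["BILL_TREND"] = np.polyfit(months, bills.T, deg=1)[0]
    return out


# Toy data: client 0 has a bill growing by 100 per month, client 1 has no bill.
toy = pd.DataFrame({
    **{f"BILL_AMT{i}": [600 - 100 * (i - 1), 0] for i in range(1, 7)},
    **{f"PAY_AMT{i}": [50, 0] for i in range(1, 7)},
})
features = add_ratio_and_trend(toy)
print(features[["PAY_RATIO1", "PAY_RATIO6", "BILL_TREND"]])
```

Leaving the ratio as NaN for zero or negative bills keeps the decision about those clients explicit; they would need separate handling (for example an indicator column) before modelling.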
import pandas as pd

df = pd.read_csv('data/UCI_Credit_Card.csv')

# Drop ID because it's a unique identifier. It is not useful as a predictor
df.drop(columns=['ID'], inplace=True)

# df.rename(columns={'default.payment.next.month' : 'DEFAULT', 'PAY_0' : 'PAY_1'}, inplace = True)

# Rename columns
df.rename(
    columns={
        "default.payment.next.month": "TARGET",
        "PAY_0": "REPAY_STATUS_SEP",
        "PAY_2": "REPAY_STATUS_AUG",
        "PAY_3": "REPAY_STATUS_JUL",
        "PAY_4": "REPAY_STATUS_JUN",
        "PAY_5": "REPAY_STATUS_MAY",
        "PAY_6": "REPAY_STATUS_APR",
        "BILL_AMT1": "BILL_AMT_SEP",
        "BILL_AMT2": "BILL_AMT_AUG",
        "BILL_AMT3": "BILL_AMT_JUL",
        "BILL_AMT4": "BILL_AMT_JUN",
        "BILL_AMT5": "BILL_AMT_MAY",
        "BILL_AMT6": "BILL_AMT_APR",
        "PAY_AMT1": "PAY_AMT_SEP",
        "PAY_AMT2": "PAY_AMT_AUG",
        "PAY_AMT3": "PAY_AMT_JUL",
        "PAY_AMT4": "PAY_AMT_JUN",
        "PAY_AMT5": "PAY_AMT_MAY",
        "PAY_AMT6": "PAY_AMT_APR",
    },
    inplace=True,
)

df.info()
Perform exploratory data analysis on the train set.
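The step that creates train_df is not shown here. A minimal sketch of one way to produce it with plain pandas follows, assuming a 70/30 split (consistent with the 21,000 training rows reported in the summary statistics). The seed value and the toy frame `toy_df` are hypothetical; scikit-learn's `train_test_split` would serve equally well.

```python
import pandas as pd

# Tiny stand-in for the renamed DataFrame; the real one has 30,000 rows.
toy_df = pd.DataFrame({
    "LIMIT_BAL": range(10_000, 110_000, 10_000),
    "TARGET": [0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
})

# frac=0.7 mirrors the 21,000 / 30,000 ratio; random_state=123 is an
# arbitrary seed chosen only to make the sketch reproducible.
train_df = toy_df.sample(frac=0.7, random_state=123)
# Everything not sampled into the training set becomes the test set.
test_df = toy_df.drop(train_df.index)

print(len(train_df), len(test_df))
```

Exploring only train_df keeps observations made during EDA from leaking information about the held-out rows.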
Summary Statistics
REPAY_STATUS has a minimum of -2 and a median of 0. Neither value appears in the data description, which lists only -1 and 1 to 9.
train_df.iloc[:,0:11].describe()
       LIMIT_BAL       SEX           EDUCATION     MARRIAGE      AGE           REPAY_STATUS_SEP  REPAY_STATUS_AUG  REPAY_STATUS_JUL  REPAY_STATUS_JUN  REPAY_STATUS_MAY  REPAY_STATUS_APR
count  21000.000000    21000.000000  21000.000000  21000.000000  21000.000000  21000.000000      21000.000000      21000.000000      21000.000000      21000.000000      21000.000000
mean   167880.651429   1.600762      1.852143      1.554000      35.500810     -0.015429         -0.137095         -0.171619         -0.225238         -0.264429         -0.295095
std    130202.682167   0.489753      0.792961      0.521675      9.212644      1.120465          1.194506          1.196123          1.168556          1.137205          1.147992
min    10000.000000    1.000000      0.000000      0.000000      21.000000     -2.000000         -2.000000         -2.000000         -2.000000         -2.000000         -2.000000
25%    50000.000000    1.000000      1.000000      1.000000      28.000000     -1.000000         -1.000000         -1.000000         -1.000000         -1.000000         -1.000000
50%    140000.000000   2.000000      2.000000      2.000000      34.000000     0.000000          0.000000          0.000000          0.000000          0.000000          0.000000
75%    240000.000000   2.000000      2.000000      2.000000      41.000000     0.000000          0.000000          0.000000          0.000000          0.000000          0.000000
max    1000000.000000  2.000000      6.000000      3.000000      79.000000     8.000000          8.000000          8.000000          8.000000          8.000000          8.000000
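The undocumented REPAY_STATUS codes noted above can be surfaced by counting how often each code occurs per month. The sketch below runs on a made-up frame (`toy_train` and its values are hypothetical stand-ins for train_df); the set of documented codes is taken from the data description.

```python
import pandas as pd

# Toy stand-in for train_df; the values are invented for illustration.
toy_train = pd.DataFrame({
    "REPAY_STATUS_SEP": [-2, -1, 0, 0, 2, 1],
    "REPAY_STATUS_AUG": [-2, -2, 0, -1, 2, 0],
})

# Codes listed in the data description: -1 and 1 through 9.
documented = {-1, *range(1, 10)}

repay_cols = [c for c in toy_train.columns if c.startswith("REPAY_STATUS_")]
# One column of counts per month, one row per observed code.
counts = toy_train[repay_cols].apply(pd.Series.value_counts).fillna(0).astype(int)
# Codes that occur in the data but are missing from the description.
undocumented = sorted(set(counts.index) - documented)

print(counts)
print("Undocumented codes:", undocumented)
```

On the real training set, the row counts for the undocumented codes show whether they are rare oddities or common values that need an interpretation before modelling.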
train_df.iloc[:,11:].describe()
       BILL_AMT_SEP    BILL_AMT_AUG    BILL_AMT_JUL    BILL_AMT_JUN    BILL_AMT_MAY    BILL_AMT_APR    PAY_AMT_SEP    PAY_AMT_AUG    PAY_AMT_JUL    PAY_AMT_JUN    PAY_AMT_MAY    PAY_AMT_APR    TARGET
count  21000.000000    21000.000000    21000.000000    21000.000000    21000.000000    21000.000000    21000.000000   2.100000e+04   21000.000000   21000.000000   21000.000000   21000.000000   21000.000000
mean   51107.566762    49126.824810    47010.414095    43486.610905    40428.518333    38767.202667    5673.585143    5.895027e+03   5311.432286    4774.021381    4751.850095    5237.762190    0.223238
std    73444.143025    71400.032096    69035.759516    64843.303993    61187.200817    59587.689549    17033.241454   2.180143e+04   18377.997079   15434.136142   15228.193125   18116.846563   0.416427
min    -15308.000000   -67526.000000   -157264.000000  -50616.000000   -61372.000000   -339603.000000  0.000000       0.000000e+00   0.000000       0.000000       0.000000       0.000000       0.000000
25%    3649.250000     2925.750000     2663.750000     2293.750000     1739.500000     1215.750000     1000.000000    8.200000e+02   390.000000     266.000000     234.000000     110.750000     0.000000
50%    22284.000000    21002.500000    20088.500000    19102.500000    18083.000000    16854.500000    2100.000000    2.007000e+03   1809.500000    1500.000000    1500.000000    1500.000000    0.000000
75%    66979.750000    63795.250000    59895.000000    54763.250000    50491.000000    49253.750000    5007.250000    5.000000e+03   4628.500000    4021.250000    4016.000000    4000.000000    0.000000
max    964511.000000   983931.000000   855086.000000   891586.000000   927171.000000   961664.000000   873552.000000  1.227082e+06   896040.000000  621000.000000  426529.000000  528666.000000  1.000000
Correlation
Each BILL_AMT column is strongly correlated with the BILL_AMT columns of the preceding months, and the same holds for REPAY_STATUS. We may need to find a way to address this collinearity.
aly.corr(train_df)
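As a complement to the correlation plot, the strongly correlated pairs can be listed explicitly. This is a sketch using only pandas and NumPy on synthetic data; `toy_train`, the 0.8 threshold, and the pair labels are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: two bill columns share a common base, payments do not.
rng = np.random.default_rng(0)
base = rng.normal(50_000, 20_000, size=200)
toy_train = pd.DataFrame({
    "BILL_AMT_SEP": base + rng.normal(0, 2_000, size=200),
    "BILL_AMT_AUG": base + rng.normal(0, 2_000, size=200),
    "PAY_AMT_SEP": rng.normal(5_000, 1_000, size=200),
})

corr = toy_train.corr()  # Pearson correlation by default

# Visit each pair once: upper triangle, excluding the diagonal.
cols = corr.columns
i_idx, j_idx = np.triu_indices(len(cols), k=1)
pairs = pd.Series(
    corr.to_numpy()[i_idx, j_idx],
    index=[f"{cols[i]} ~ {cols[j]}" for i, j in zip(i_idx, j_idx)],
)

# 0.8 is an arbitrary illustrative cut-off for "strong" correlation.
high = pairs[pairs.abs() > 0.8].sort_values(ascending=False)
print(high)
```

A list like this makes it easier to decide which columns to aggregate or drop than reading values off a heatmap.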
Distribution Visualization
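Class imbalance in TARGET: about 22% of the training rows are defaults (the mean of TARGET is 0.223).
The proportion of defaults is higher in the 50+ age group.
The default rate is lower for married clients than for single clients.
Although REPAY_STATUS can range from -2 to 9, most values fall between -2 and 2.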
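The observations above can also be checked numerically instead of being read off the plots. The following sketch computes the class share and the default rate per group on a made-up frame; `toy_train`, its values, and the age band edges are hypothetical choices for illustration.

```python
import pandas as pd

# Toy stand-in for train_df; the values are invented for illustration.
toy_train = pd.DataFrame({
    "AGE": [23, 35, 47, 52, 61, 29, 58, 44],
    "MARRIAGE": [2, 1, 1, 1, 2, 2, 1, 2],
    "TARGET": [0, 0, 1, 1, 1, 0, 0, 0],
})

# Share of each class in the target column.
class_share = toy_train["TARGET"].value_counts(normalize=True)

# Hypothetical age bands; right=False makes each bin include its lower edge,
# so the last band covers ages 50 to 79.
age_band = pd.cut(
    toy_train["AGE"],
    bins=[20, 30, 40, 50, 80],
    right=False,
    labels=["20-29", "30-39", "40-49", "50+"],
)

# The mean of a 0/1 column is the default rate within each group.
rate_by_age = toy_train.groupby(age_band, observed=False)["TARGET"].mean()
rate_by_marriage = toy_train.groupby("MARRIAGE")["TARGET"].mean()

print(class_share)
print(rate_by_age)
print(rate_by_marriage)
```

The same age bands could double as the binned AGE feature if grouping turns out to help the model.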