Chapter 3 Finance

3.1 Credit Risk Analysis

3.1.1 Background

Credit scoring is a system used by banks and other financial institutions to decide whether a customer is eligible for a loan. It requires a range of profile data on the prospective borrower so that the level of risk can be calculated properly. The more accurate and complete the data provided, the more accurate the resulting calculation.

This process is sound in itself, but prospective borrowers find it cumbersome and slow. Because competition in the financial industry is intense, customers also have many alternatives. The faster the process a lender offers, the better its chance of winning the borrower.

The challenge, then, is how to acquire borrowers through an efficient process while keeping credit scoring accuracy high. This is where machine learning can help: by analysing borrower profile data and repayment records, it can identify the borrower profiles with a high chance of repaying a loan smoothly.

The expectation is that, once a machine learning model with good performance is available, bank staff can easily identify the characteristics of customers who are likely to repay their loans smoothly. Such a model should also reduce the cost and time of the process.

3.1.2 Modelling Analysis

3.1.2.1 Cleaning data

library(tidyverse)  # read_csv(), the dplyr verbs and %>% used throughout

credit <- read_csv("assets/02-finance/credit_record.csv")
application <- read_csv("assets/02-finance/application_record.csv")

Data Description:

Credit

  • ID : Client number
  • MONTHS_BALANCE : Record month. The month of the extracted data is the starting point, counting backwards: 0 is the current month, -1 is the previous month, and so on
  • STATUS : Status
    • 0: 1-29 days past due
    • 1: 30-59 days past due
    • 2: 60-89 days overdue
    • 3: 90-119 days overdue
    • 4: 120-149 days overdue
    • 5: Overdue or bad debts, write-offs for more than 150 days
    • C: paid off that month
    • X: No loan for the month

Application

  • ID : Client number
  • CODE_GENDER : Gender
  • FLAG_OWN_CAR : Is there a car
  • FLAG_OWN_REALTY : Is there a property
  • CNT_CHILDREN : Number of children
  • AMT_INCOME_TOTAL : Annual income
  • NAME_INCOME_TYPE : Income category
  • NAME_EDUCATION_TYPE : Education level
  • NAME_FAMILY_STATUS : Marital status
  • NAME_HOUSING_TYPE : Way of living
  • DAYS_BIRTH : Birthday. Counted backwards from the current day (0); -1 means yesterday
  • DAYS_EMPLOYED : Start date of employment. Counted backwards from the current day (0). If positive, the person is currently unemployed.
  • FLAG_MOBIL : Is there a mobile phone
  • FLAG_WORK_PHONE : Is there a work phone
  • FLAG_PHONE : Is there a phone
  • FLAG_EMAIL : Is there an email
  • OCCUPATION_TYPE : Occupation
  • CNT_FAM_MEMBERS : Family size

Check missing values

The credit data contains no missing values.

colSums(is.na(credit))
#>             ID MONTHS_BALANCE         STATUS 
#>              0              0              0
colSums(is.na(application))
#>                  ID         CODE_GENDER        FLAG_OWN_CAR     FLAG_OWN_REALTY 
#>                   0                   0                   0                   0 
#>        CNT_CHILDREN    AMT_INCOME_TOTAL    NAME_INCOME_TYPE NAME_EDUCATION_TYPE 
#>                   0                   0                   0                   0 
#>  NAME_FAMILY_STATUS   NAME_HOUSING_TYPE          DAYS_BIRTH       DAYS_EMPLOYED 
#>                   0                   0                   0                   0 
#>          FLAG_MOBIL     FLAG_WORK_PHONE          FLAG_PHONE          FLAG_EMAIL 
#>                   0                   0                   0                   0 
#>     OCCUPATION_TYPE     CNT_FAM_MEMBERS 
#>              134203                   0

In the application data, the variable OCCUPATION_TYPE has a large number of missing values (134203), so we can drop it. We will also drop DAYS_BIRTH and DAYS_EMPLOYED, which are not used in the model.

application <- application %>% 
               select(-c(OCCUPATION_TYPE, DAYS_BIRTH, DAYS_EMPLOYED))
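The missing-value check above can also be expressed as a share per column, which makes it easier to decide which variables to drop. The following is a minimal sketch on a made-up data frame; the values and the 0.3 cutoff are assumptions for illustration, not taken from the application data.

```r
# Toy data frame (hypothetical values) with one sparsely filled column
toy <- data.frame(
  ID = 1:5,
  AMT_INCOME_TOTAL = c(100, 200, 150, 300, 250),
  OCCUPATION_TYPE = c("Laborers", NA, NA, "Managers", NA),
  stringsAsFactors = FALSE
)

# Share of missing values per column (0 = complete, 1 = entirely missing)
missing_share <- colMeans(is.na(toy))
missing_share

# Columns whose missing share exceeds a chosen cutoff (0.3 is an arbitrary example)
cols_to_drop <- names(missing_share)[missing_share > 0.3]
toy_reduced <- toy[, setdiff(names(toy), cols_to_drop), drop = FALSE]
names(toy_reduced)
```

A share is often easier to judge than a raw count, because it does not depend on how many rows the table has.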

Adjusting data types

The next step is to join the credit and application data and to convert the categorical variables that are still read as character into factors.

data_clean <- credit %>% 
              left_join(application, by = "ID") %>% 
              na.omit() %>% 
              select(-ID) %>% 
              filter(STATUS != "X") %>% 
              mutate(STATUS = as.factor(ifelse(STATUS == "C", "good credit", "bad credit"))) %>% 
              mutate_at(.vars = c("FLAG_MOBIL", "FLAG_WORK_PHONE",
                                  "FLAG_PHONE", "FLAG_EMAIL"), as.factor) %>% 
              mutate_if(is.character, as.factor) %>% 
              data.frame()
str(data_clean)
#> 'data.frame':	631765 obs. of  16 variables:
#>  $ MONTHS_BALANCE     : num  0 -1 -2 -3 -4 -5 -6 -7 -8 -9 ...
#>  $ STATUS             : Factor w/ 2 levels "bad credit","good credit": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ CODE_GENDER        : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ FLAG_OWN_CAR       : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ FLAG_OWN_REALTY    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ CNT_CHILDREN       : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ AMT_INCOME_TOTAL   : num  427500 427500 427500 427500 427500 ...
#>  $ NAME_INCOME_TYPE   : Factor w/ 5 levels "Commercial associate",..: 5 5 5 5 5 5 5 5 5 5 ...
#>  $ NAME_EDUCATION_TYPE: Factor w/ 5 levels "Academic degree",..: 2 2 2 2 2 2 2 2 2 2 ...
#>  $ NAME_FAMILY_STATUS : Factor w/ 5 levels "Civil marriage",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ NAME_HOUSING_TYPE  : Factor w/ 6 levels "Co-op apartment",..: 5 5 5 5 5 5 5 5 5 5 ...
#>  $ FLAG_MOBIL         : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ FLAG_WORK_PHONE    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ FLAG_PHONE         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ FLAG_EMAIL         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ CNT_FAM_MEMBERS    : num  2 2 2 2 2 2 2 2 2 2 ...
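The recoding of STATUS in the pipeline above can be easier to follow on a handful of values. This sketch uses a hypothetical vector of STATUS codes and base R only; it mirrors the two steps `filter(STATUS != "X")` and the `ifelse()` inside `mutate()`.

```r
# Hypothetical STATUS codes covering each kind of value in the data description
status_raw <- c("C", "0", "X", "1", "C", "5", "X")

# Drop months without a loan ("X"), as filter(STATUS != "X") does
status_kept <- status_raw[status_raw != "X"]

# "C" (paid off that month) becomes "good credit";
# every overdue code, including "0", becomes "bad credit"
status_label <- factor(ifelse(status_kept == "C", "good credit", "bad credit"))

table(status_label)
```

Note that under this rule even code 0 (1-29 days past due) counts as bad credit, so the target describes a single month of a loan rather than a customer as a whole.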

3.1.2.2 Exploratory Data Analysis (EDA)

In the EDA stage we want to see how the categorical and numeric variables are distributed.

data_clean %>% inspect_cat() %>% show_plot()

The visualization above shows whether any variable carries little information. One example is FLAG_MOBIL, where every observation has the value 1, meaning that all of our borrowers have a mobile phone. A variable with no variance like this is not included in the model. The code below removes FLAG_MOBIL together with FLAG_EMAIL.

data_clean <- data_clean %>% 
              select(-c(FLAG_MOBIL,FLAG_EMAIL))
data_clean %>% inspect_num() %>% show_plot()
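Constant columns such as FLAG_MOBIL can also be found programmatically instead of by reading the plot. A minimal base R sketch on a made-up data frame (the values are hypothetical):

```r
# Toy data frame: FLAG_MOBIL is constant, the other columns vary
toy <- data.frame(
  FLAG_MOBIL = factor(c("1", "1", "1", "1")),
  FLAG_PHONE = factor(c("0", "1", "0", "1")),
  CNT_CHILDREN = c(0, 2, 1, 0)
)

# Number of distinct values in each column
n_distinct_values <- sapply(toy, function(col) length(unique(col)))

# Columns with a single distinct value carry no information for a model
constant_cols <- names(n_distinct_values)[n_distinct_values == 1]
constant_cols
```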

3.1.2.3 Modelling Random Forest

Split the data into training and test sets in an 80:20 proportion. The training data is used for modelling, and the test data is used for evaluation.

set.seed(100)
index <- initial_split(data = data_clean, prop = 0.8, strata = "STATUS")
train <- training(index)
test <- testing(index)

Check the proportion of the target variable

prop.table(table(train$STATUS))
#> 
#>  bad credit good credit 
#>   0.4960938   0.5039062
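Stratifying the split on STATUS keeps the class proportions roughly equal in the training and test sets. The idea can be sketched in base R by sampling 80% of the rows within each class separately. This illustrates the principle on a hypothetical target vector and is not necessarily how `initial_split()` is implemented.

```r
set.seed(100)

# Hypothetical target with a 40:60 class mix
status <- factor(rep(c("bad credit", "good credit"), times = c(40, 60)))

# Sample 80% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_along(status), status), function(rows) {
  rows[sample.int(length(rows), size = round(0.8 * length(rows)))]
}))

train_status <- status[train_idx]
test_status  <- status[-train_idx]

# Both parts keep the original 40:60 mix
prop.table(table(train_status))
prop.table(table(test_status))
```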

Build a random forest model with 3-fold cross-validation repeated 2 times

# set.seed(100)
# 
# ctrl <- trainControl(method = "repeatedcv",
#                      number = 3, 
#                      repeats = 2,
#                      allowParallel=FALSE)
# 
# model_forest <- caret::train(STATUS ~.,
#                              data = train, 
#                              method = "rf", 
#                              trControl = ctrl)

#saveRDS(model_forest, "model_forest.RDS")

model_forest <- readRDS("assets/02-finance/model_forest.RDS")
model_forest
#> Random Forest 
#> 
#> 80001 samples
#>    13 predictor
#>     2 classes: 'bad credit', 'good credit' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (3 fold, repeated 2 times) 
#> Summary of sample sizes: 53335, 53334, 53333, 53334, 53334, 53334, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>    2    0.6432232  0.2846367
#>   14    0.7487656  0.4973803
#>   26    0.7114411  0.4230518
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 14.

Cross-validation compared three candidate values of mtry. The second candidate, mtry = 14, gives the highest accuracy (0.749) and is used for the final model.

Next, we predict on the test data and compute the confusion matrix of the predictions.
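The resampling scheme reported in the model output (3 folds, repeated 2 times) means that each candidate mtry is fitted six times, each time on about two thirds of the training rows. The sketch below builds such resamples by hand for a hypothetical 12-row data set. The resample names are illustrative, and caret additionally stratifies the folds by class, which is omitted here.

```r
set.seed(100)

n_rows    <- 12   # hypothetical number of training rows
n_folds   <- 3
n_repeats <- 2

# For each repeat, shuffle the rows and deal them into folds of equal size
resamples <- list()
for (r in seq_len(n_repeats)) {
  fold_id <- sample(rep(seq_len(n_folds), length.out = n_rows))
  for (k in seq_len(n_folds)) {
    name <- sprintf("Fold%d.Rep%d", k, r)
    # Rows used for fitting; the rows in fold k are held out for assessment
    resamples[[name]] <- which(fold_id != k)
  }
}

length(resamples)          # number of model fits per candidate mtry
sapply(resamples, length)  # training rows in each resample
```

The same arithmetic explains the sample sizes in the output: two thirds of 80001 rows is about 53334.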

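# Classify as "bad credit" when its predicted probability exceeds the 0.45 cutoff;
# fixing the factor levels keeps them aligned with test$STATUS
pred_rf <- predict(model_forest, newdata = test, type = "prob") %>% 
          mutate(result = factor(ifelse(`bad credit` > 0.45, "bad credit", "good credit"),
                                 levels = levels(test$STATUS)),
                 actual = ifelse(test$STATUS == 'good credit', 0, 1))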
confmat_rf <- confusionMatrix(pred_rf$result, 
                                 test$STATUS,
                                 mode = "prec_recall",
                                 positive = "bad credit")

eval_rf <- tidy(confmat_rf) %>% 
  mutate(model = "Random Forest") %>% 
  select(model, term, estimate) %>% 
  filter(term %in% c("accuracy", "precision", "recall", "specificity"))

eval_rf
#> # A tibble: 4 x 3
#>   model         term        estimate
#>   <chr>         <chr>          <dbl>
#> 1 Random Forest accuracy       0.810
#> 2 Random Forest specificity    0.814
#> 3 Random Forest precision      0.810
#> 4 Random Forest recall         0.806

3.1.2.4 Modelling XGBoost

Next we fit an XGBoost model to the same data. The data must first be prepared in the format that XGBoost expects.

data_xgb <- data_clean %>% 
            mutate(STATUS = ifelse(STATUS == "good credit", 0, 1)) %>% 
            data.frame()
set.seed(100)
index <- initial_split(data = data_xgb, prop = 0.8, strata = "STATUS")
train_xgb <- training(index)
test_xgb <- testing(index)
label_train <- as.numeric(train_xgb$STATUS)
label_test <- as.numeric(test_xgb$STATUS)
train_matrix <- data.matrix(train_xgb[,-2])
test_matrix <- data.matrix(test_xgb[,-2])
# convert data to Dmatrix
dtrain <- xgb.DMatrix(data = train_matrix, label = label_train)
dtest <- xgb.DMatrix(data = test_matrix, label = label_test)
params <- list(booster = "gbtree",
               objective = "binary:logistic",
               eta=0.7, 
               gamma=10, 
               max_depth=10, 
               min_child_weight=3, 
               subsample=1, 
               colsample_bytree=0.5)
xgbcv <- xgb.cv( params = params, 
                 data = dtrain,
                 nrounds = 1000, 
                 showsd = T, 
                 nfold = 10,
                 stratified = T, 
                 print_every_n = 50, 
                 early_stopping_rounds = 20, 
                 maximize = F)
#> [15:34:08] WARNING: amalgamation/../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
#> [1]	train-logloss:0.657835+0.008359	test-logloss:0.659014+0.008508 
#> Multiple eval metrics are present. Will use test_logloss for early stopping.
#> Will train until test_logloss hasn't improved in 20 rounds.
#> 
#> [51]	train-logloss:0.502362+0.002141	test-logloss:0.523349+0.004859 
#> [101]	train-logloss:0.487041+0.003130	test-logloss:0.511456+0.005289 
#> [151]	train-logloss:0.480246+0.002629	test-logloss:0.506027+0.005960 
#> [201]	train-logloss:0.476529+0.001743	test-logloss:0.503253+0.005054 
#> [251]	train-logloss:0.473638+0.001135	test-logloss:0.500887+0.004853 
#> [301]	train-logloss:0.470976+0.000924	test-logloss:0.499179+0.004563 
#> [351]	train-logloss:0.468947+0.001220	test-logloss:0.497893+0.004723 
#> [401]	train-logloss:0.467192+0.001040	test-logloss:0.496554+0.004479 
#> [451]	train-logloss:0.465561+0.001008	test-logloss:0.495503+0.004603 
#> [501]	train-logloss:0.464243+0.001280	test-logloss:0.494370+0.004283 
#> [551]	train-logloss:0.463161+0.001288	test-logloss:0.493679+0.004757 
#> [601]	train-logloss:0.462078+0.001320	test-logloss:0.492902+0.005234 
#> [651]	train-logloss:0.461227+0.001556	test-logloss:0.492334+0.005148 
#> [701]	train-logloss:0.460708+0.001542	test-logloss:0.491907+0.005247 
#> [751]	train-logloss:0.460228+0.001541	test-logloss:0.491557+0.005346 
#> [801]	train-logloss:0.459816+0.001526	test-logloss:0.491173+0.005345 
#> [851]	train-logloss:0.459306+0.001623	test-logloss:0.491024+0.005399 
#> [901]	train-logloss:0.459079+0.001518	test-logloss:0.490845+0.005412 
#> [951]	train-logloss:0.458793+0.001343	test-logloss:0.490583+0.005397 
#> [1000]	train-logloss:0.458523+0.001349	test-logloss:0.490372+0.005506
print(xgbcv)
#> ##### xgb.cv 10-folds
#>     iter train_logloss_mean train_logloss_std test_logloss_mean test_logloss_std
#>        1          0.6578350       0.008358933         0.6590142      0.008508178
#>        2          0.6328996       0.009530030         0.6361721      0.009246701
#>        3          0.6146676       0.008940854         0.6200005      0.008948886
#>        4          0.6026426       0.011825006         0.6092987      0.010625445
#>        5          0.5889047       0.009177284         0.5968755      0.007583324
#> ---                                                                             
#>      996          0.4585228       0.001349444         0.4903715      0.005506051
#>      997          0.4585228       0.001349444         0.4903715      0.005506051
#>      998          0.4585228       0.001349444         0.4903716      0.005506139
#>      999          0.4585228       0.001349444         0.4903715      0.005506051
#>     1000          0.4585228       0.001349444         0.4903715      0.005506051
#> Best iteration:
#>  iter train_logloss_mean train_logloss_std test_logloss_mean test_logloss_std
#>   996          0.4585228       0.001349444         0.4903715      0.005506051
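The metric tracked during cross-validation is the binary log loss, which penalizes confident wrong probabilities more heavily than hesitant ones. A minimal sketch of the calculation, using hypothetical labels and probabilities (the helper name `log_loss` is not part of xgboost):

```r
# Binary log loss: mean of -[y * log(p) + (1 - y) * log(1 - p)]
log_loss <- function(actual, predicted_prob, eps = 1e-15) {
  # Clip probabilities away from 0 and 1 so that log() stays finite
  p <- pmin(pmax(predicted_prob, eps), 1 - eps)
  -mean(actual * log(p) + (1 - actual) * log(1 - p))
}

actual <- c(1, 0, 1, 0)

# An uninformative model that always predicts 0.5 scores log(2)
baseline <- log_loss(actual, c(0.5, 0.5, 0.5, 0.5))

# Probabilities that lean the right way give a lower (better) value
informed <- log_loss(actual, c(0.9, 0.1, 0.8, 0.2))

c(baseline = baseline, informed = informed)
```

The baseline of log(2), about 0.693, gives a reference point for the log above: the first round starts slightly below it and the test log loss settles near 0.49.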
xgb1 <- xgb.train (params = params, 
                   data = dtrain, 
                   nrounds = xgbcv$best_iteration, 
                   watchlist = list(val=dtest,train=dtrain),
                   print_every_n = 100, 
                   # no early stopping here: training runs for all best_iteration rounds found by xgb.cv
                   maximize = F , 
                   eval_metric = "error",
                   verbosity = 0)
#> [1]	val-error:0.408470	train-error:0.404582 
#> [101]	val-error:0.259563	train-error:0.238085 
#> [201]	val-error:0.253613	train-error:0.233447 
#> [301]	val-error:0.251413	train-error:0.228810 
#> [401]	val-error:0.248062	train-error:0.226485 
#> [501]	val-error:0.246562	train-error:0.224297 
#> [601]	val-error:0.246562	train-error:0.223735 
#> [701]	val-error:0.246512	train-error:0.222985 
#> [801]	val-error:0.246312	train-error:0.223010 
#> [901]	val-error:0.245112	train-error:0.222872 
#> [996]	val-error:0.243512	train-error:0.222472
xgbpred_prob <-predict(object = xgb1, newdata = dtest)
xgbpred <- ifelse (xgbpred_prob > 0.45,1,0)
confmat_xgb <- confusionMatrix(as.factor(xgbpred), as.factor(label_test), positive = "1")
confmat_xgb
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 7338 2288
#>          1 2740 7633
#>                                                
#>                Accuracy : 0.7486               
#>                  95% CI : (0.7425, 0.7546)     
#>     No Information Rate : 0.5039               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.4973               
#>                                                
#>  Mcnemar's Test P-Value : 0.0000000002013      
#>                                                
#>             Sensitivity : 0.7694               
#>             Specificity : 0.7281               
#>          Pos Pred Value : 0.7359               
#>          Neg Pred Value : 0.7623               
#>              Prevalence : 0.4961               
#>          Detection Rate : 0.3817               
#>    Detection Prevalence : 0.5187               
#>       Balanced Accuracy : 0.7487               
#>                                                
#>        'Positive' Class : 1                    
#> 

eval_xgb <- tidy(confmat_xgb) %>% 
  mutate(model = "XGBoost") %>% 
  select(model, term, estimate) %>% 
  filter(term %in% c("accuracy", "precision", "recall", "specificity"))

With the XGBoost performance in hand, we compare it with the performance of the random forest model.

eval_result <- rbind(eval_rf, eval_xgb)
eval_result
#> # A tibble: 8 x 3
#>   model         term        estimate
#>   <chr>         <chr>          <dbl>
#> 1 Random Forest accuracy       0.810
#> 2 Random Forest specificity    0.814
#> 3 Random Forest precision      0.810
#> 4 Random Forest recall         0.806
#> 5 XGBoost       accuracy       0.749
#> 6 XGBoost       specificity    0.728
#> 7 XGBoost       precision      0.736
#> 8 XGBoost       recall         0.769

The evaluation metric we prioritize is recall, because we want to minimize the cases where a customer who is actually bad credit is predicted as good credit. On this metric the random forest model scores higher (0.806) than the XGBoost model (0.769). The remaining steps inspect the XGBoost model.
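To make the four reported metrics concrete, they can be recomputed by hand from the cell counts of the XGBoost confusion matrix printed earlier, where the positive class 1 stands for bad credit. The variable names below are chosen for illustration.

```r
# Counts copied from the XGBoost confusion matrix (positive class = 1 = bad credit)
tp <- 7633  # actual bad credit, predicted bad credit
fn <- 2288  # actual bad credit, predicted good credit
fp <- 2740  # actual good credit, predicted bad credit
tn <- 7338  # actual good credit, predicted good credit

recall      <- tp / (tp + fn)   # share of bad-credit cases that are caught
precision   <- tp / (tp + fp)   # share of bad-credit predictions that are right
specificity <- tn / (tn + fp)   # share of good-credit cases that are recognized
accuracy    <- (tp + tn) / (tp + fn + fp + tn)

round(c(accuracy = accuracy, specificity = specificity,
        precision = precision, recall = recall), 3)
```

The false negatives (fn) are exactly the costly case described above, a bad-credit customer predicted as good credit, which is why recall is the metric to watch.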

var_imp <- xgb.importance(model = xgb1,
                          feature_names = dimnames(dtrain)[[2]])
xgb.ggplot.importance(var_imp,top_n = 10) + 
  theme_minimal()+
  theme(legend.position = "none")

The plot above shows the 10 most influential variables in the model. Annual income and months balance are the two most important variables in this model.

xgb_result <- data.frame(class1 = xgbpred_prob, actual = as.factor(label_test))

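# class1 holds the probability of level "1" (bad credit), the second factor level
auc_xgb <- roc_auc(data = xgb_result, truth = actual, class1, event_level = "second") 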
value_roc_xgb <- prediction(predictions = xgbpred_prob,
                        labels = label_test)

# ROC curve
plot(performance(value_roc_xgb, "tpr", "fpr"))

value_auc_xgb <- performance(value_roc_xgb, measure = "auc")
value_auc_xgb@y.values
#> [[1]]
#> [1] 0.8427956
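AUC can be read as the probability that a randomly chosen bad-credit case receives a higher score than a randomly chosen good-credit case. The sketch below computes it from ranks (the Mann-Whitney identity) on hypothetical scores; `auc_rank` is an illustrative helper, not a function from ROCR or yardstick.

```r
# AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(scores, labels) {
  pos   <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  ranks <- rank(scores)  # ties receive average ranks
  (sum(ranks[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities and true labels (1 = bad credit)
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c(1,   1,   1,    0,   0,   0)

# 8 of the 9 positive-negative pairs are ordered correctly
auc_rank(scores, labels)
```

Unlike accuracy or recall, this value does not depend on the 0.45 cutoff, because it uses only the ordering of the scores.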

The AUC of this model is 0.84, which means the model separates the two target classes, good credit and bad credit, well. The hope is that the bank can use this model for credit scoring by entering a customer's profile data; the result can then be visualized as follows:

explainer <- lime(train_matrix %>% as.data.frame(), xgb1)
explanation <- explain(test_matrix[11:12,] %>% as.data.frame(),
                             explainer, 
                             labels = "1",
                             n_features = 3,
                             n_permutations = 5000,
                             dist_fun = "manhattan",
                             kernel_width = 0.75,
                             feature_select = "highest_weights")

plot_features(explanation)

In the visualization, customers 1 and 2 have bad-credit probabilities of 0.22 and 0.17, both below the 0.45 cutoff, so both are categorized as good credit. The two customers have similar characteristics, because their predictions are supported by the same features: an ownership variable and total income.

3.2 Evaluating Customer Financial Complaints

3.2.1 Background

Handling customer complaints has become one of the keys to a company's continued growth. Customers who feel poorly served when they raise a complaint will readily switch to another company that they believe handles complaints better. Customers who feel they have not been served well will typically file a complaint with the Consumer Financial Protection Bureau (CFPB), the agency responsible for consumer protection in the financial sector. The CFPB publishes data on complaints from financial customers. Companies can analyse this data to identify the areas that need improvement in order to raise service quality.

3.2.2 Exploratory Data Analysis

customer <- read_csv("assets/02-finance/data_complaint.csv")%>% 
              mutate_if(is.character, as.factor) %>% 
              data.frame()

The data comes from the Consumer Financial Protection Bureau (CFPB), which regulates the offering and provision of consumer financial products and services. The CFPB answers common questions and helps consumers connect with the financial companies involved. The data contains customer complaints about various banks in the United States.

top_company <- customer %>% 
  filter(!is.na(Consumer.complaint.narrative)) %>% 
  group_by(Company) %>% 
  summarise(total = n()) %>% 
  arrange(desc(total)) %>% 
  head(1)

Of the 4504 companies in the data, the one with the most complaints is Transunion Intermediate Holdings. Note that the complaint count does not account for company size: a company with more customers will naturally tend to receive more complaints than one with fewer customers. This analysis also counts only complaints that include a narrative from the customer.

Next we focus on the complaints about Transunion Intermediate Holdings, which has the most complaint narratives in the data.

After obtaining these observations, we clean the text data:
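The caveat about company size can be illustrated with a small example in which the company with the most complaints is not the one with the highest complaint rate. All company names and customer counts below are hypothetical; the CFPB data used here does not include customer counts.

```r
# Hypothetical complaint records and customer counts (not taken from the CFPB data)
complaints <- data.frame(
  Company = c("A", "A", "A", "B", "B", "C"),
  stringsAsFactors = FALSE
)
customers <- c(A = 3000, B = 500, C = 1000)

# Raw complaint counts, largest first
counts <- sort(table(complaints$Company), decreasing = TRUE)

# Complaints per 1,000 customers puts companies of different size on one scale
rate_per_1000 <- 1000 * as.numeric(counts) / customers[names(counts)]
names(rate_per_1000) <- names(counts)

counts
rate_per_1000
```

Here company A leads on raw counts, while the much smaller company B has the highest rate per customer.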

data_clean <- data_complaint %>% 
  select(Consumer.complaint.narrative) %>% 
  mutate(Consumer.complaint.narrative = Consumer.complaint.narrative %>% 
  tolower() %>% 
  str_trim() %>% 
  str_remove_all(pattern = "[[:punct:]]") %>% 
  str_remove_all(pattern = "[0-9]") %>% 
  str_remove_all(pattern = "xxxx") %>% 
  replace_contraction() %>% 
  replace_word_elongation() %>% 
  replace_white() %>% 
  str_squish())
head(data_clean)
#>   Consumer.complaint.narrative
#> 1 this legal notice being sent and delivered to you persuant to florida statutes notice of order to cease and desist from using personal and private information fl statute violation title xlvi crimes chapter fraudulent practices view entire chapter criminal use of personal identification information as used in this section the term a access device means any card plate code account number electronic serial number mobile identification number personal identification number or other telecommunications service equipment or instrument identifier or other means of account access that can be used alone or in conjunction with another access device to obtain money goods services or any other thing of value or that can be used to initiate a transfer of funds other than a transfer originated solely by paper instrument b authorization means empowerment permission or competence to act c harass means to engage in conduct directed at a specific person that is intended to cause substantial emotional distress to such person and serves no legitimate purpose harass does not mean to use personal identification information for accepted commercial purposes the term does not include constitutionally protected conduct such as organized protests or the use of personal identification information for accepted commercial purposes d individual means a single human being and does not mean a firm association of individuals corporation partnership joint venture sole proprietorship or any other entity e person means a person as defined in s f personal identification information means any name or number that may be used alone or in conjunction with any other information to identify a specific individual including any name postal or electronic mail address telephone number social security number date of birth mother s maiden name official stateissued or united statesissued driver license or identification number alien registration number government passport number employer or taxpayer 
identification number or food assistance account number bank account number credit or debit card number or personal identification number or code assigned to the holder of a debit card by the issuer to permit authorized electronic use of such card unique biometric data such as fingerprint voice print retina or iris image or other unique physical representation unique electronic identification number address or routing code medical records telecommunication identifying information or access device or other number or information that can be used to access a person s financial resources g counterfeit or fictitious personal identification information means any counterfeit fictitious or fabricated information in the similitude of the data outlined in paragraph f that although not truthful or accurate would in context lead a reasonably prudent person to credit its truthfulness and accuracy a any person who willfully and without authorization fraudulently uses or possesses with intent to fraudulently use personal identification information concerning an individual without first obtaining that individual s consent commits the offense of fraudulent use of personal identification information which is a felony of the third degree punishable as provided in s s or s b any person who willfully and without authorization fraudulently uses personal identification information concerning an individual without first obtaining that individual s consent commits a felony of the second degree punishable as provided in s s or s if the pecuniary benefit the value of the services
#> 2 transunion continues to report inaccurate negative items on my report that i have previously dispu ted ti methey continuos to report account with no account number inaccurate information that is hurting my credit
#> 3 my credit report has an incorrect address the address on the report is nv other than when i was in the i have lived in my entire life associated with this lv address are bills from and and both amounts are less than $ every time i dispute these bills and ask for source documents to prove i owe the bills are sold to collection agencies when these receivables are sold as a package the collection agencies reopen the outstanding debt and it reappears on my report i have a substantial credit history with mortgage car loans and credit cards i have never been late on any amount i owe there should be some protection against collection companies windmilling these bogus receivables thank you for all you do
#> 4 i recently discovered that reported my account as a charge off on my credit report when i was denied a loan by due to the charge off i immediately contacted by phone in an attempt to rectify the error i was informed by the customer service representative that my last payment was received on and that my account was reported as a charge off on we both agreed that it did not make sense that my account would be reported as a charge off days after my payment was received at the time reported my account as a charge off it was paid in full with a i have spent hours on the phone with numerous customer service representatives the phone calls have been extremely frustrating and have not produced a positive outcome every representative has told me the same thing due to the age of this account they do not have account records and can not provide me with the information that i am requesting yet they refuse to remove the inaccurate information from my credit report i was told that a supervisors would call me back on separate occasions to this day i have not received a call back from a supervisor i have sent numerous certified letters disputing the charge off that is currently being reported on my credit report and negatively effecting my credit i have asked for validation of my account including details explaining my why my account is being reported as a charge off with a balance i have asked for them to send me a copy of my payment history along with account notes the letters that i received in response to my letters have been a one paragraph response stating that they only report accurate information to the credit bureaus
#> 5 sent letter to credit reporting agencies telling them the inquiries on my report are unjustified the actions fail to comply with fcra section the letters are attached below credit agencies replied stating they did not need to investigate verify or remove inquiries credit agencies failed to prove requirements and fa iled to remove inquiries from my credit reports
#> 6 student loans have been discharged and i have a letter from the stating all loans have been discharged the credit bureaus refuse to accept the letter

After cleaning the text data, the next step is tokenization, which splits each sentence into individual terms. The same step also yields the frequency of each term that appears.

text.dat <- data_clean %>% 
  rowid_to_column("id") %>% 
  unnest_tokens(word, Consumer.complaint.narrative) %>%  # one row per word
  anti_join(stop_words, by = "word") %>%                 # drop English stop words
  count(word, sort = TRUE) %>% 
  rename(words = word,
         freq = n) %>% 
  filter(!is.na(words),                                  # drop missing tokens
         freq > 50)                                      # keep frequent words only
head(text.dat)
#>         words freq
#> 1      credit 1572
#> 2      report  774
#> 3     account  613
#> 4  transunion  529
#> 5 information  503
#> 6   reporting  358
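
To make the tokenization step concrete, the following is a minimal sketch on a made-up two-row data set. The object names `toy` and `toy_freq` and the example sentences are assumptions for illustration only; the functions are the same `tidytext` and `dplyr` calls used above.

```r
library(dplyr)
library(tidytext)

# Hypothetical complaints, used only for illustration
toy <- tibble::tibble(
  id   = 1:2,
  text = c("my credit report is not accurate",
           "the credit report shows the wrong account")
)

toy_freq <- toy %>%
  unnest_tokens(word, text) %>%            # one lowercased word per row
  anti_join(stop_words, by = "word") %>%   # remove common English stop words
  count(word, sort = TRUE, name = "freq")  # frequency of each remaining term

toy_freq
```

Words such as "my" and "the" are dropped by the stop-word list, while "credit" and "report" each remain with a frequency of 2.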

The resulting words are visualized with a word cloud. The more often a word is used, the larger it appears in the word cloud, so we can see which words customers of Transunion Intermediate Holdings use most often. The words credit, report, and account are the ones customers use most frequently when filing a complaint.

wordcloud2(data = text.dat, size = 1, color = 'random-dark', shuffle = TRUE)

3.2.3 Comparing Sentiment Dictionaries

The more information is presented, the better the marketing team can develop effective strategies for improving service. The next step therefore shows not only the most frequent words but also whether each word customers use when filing a complaint is positive or negative.

text_dat <- data_clean %>% 
  rowid_to_column("id") %>% 
  unnest_tokens(word, Consumer.complaint.narrative) %>% 
  anti_join(stop_words, by = "word") %>% 
  mutate(word = wordStem(word)) %>%   # reduce each word to its stem
  count(word, sort = TRUE) %>% 
  filter(!is.na(word))                # drop missing tokens
head(text_dat, 20)
#>          word    n
#> 1      credit 1573
#> 2      report 1462
#> 3     account  869
#> 4      inform  584
#> 5  transunion  533
#> 6       remov  423
#> 7     inquiri  420
#> 8      disput  415
#> 9        file  345
#> 10    request  329
#> 11     letter  309
#> 12       loan  287
#> 13    payment  269
#> 14     bureau  263
#> 15     verifi  242
#> 16       call  228
#> 17       time  227
#> 18    compani  223
#> 19     receiv  218
#> 20      agenc  213
# Keep only the words that appear in the Bing sentiment lexicon
bing_word <- text_dat %>% 
  inner_join(get_sentiments("bing"), by = "word") 
head(bing_word)
#>        word   n sentiment
#> 1      debt 206  negative
#> 2   correct 150  positive
#> 3 complaint 115  negative
#> 4     fraud  96  negative
#> 5 incorrect  89  negative
#> 6      hard  82  negative
library(reshape2)
library(wordcloud)
bing_word %>% 
  acast(word ~ sentiment, value.var = "n", fill = 0) %>% 
  comparison.cloud(colors = c("gray70","gray20"), max.words = 200)

The sentiment analysis above counts occurrences of positive and negative words. One weakness of that approach is that a word can be misinterpreted out of context: correct and support, for example, are treated as positive words, but their meaning is reversed when they are preceded by not. The following analysis uses n-grams to see how often word1 is followed by word2. Tokenizing by n-gram is useful for exploring words that are related to each other. Setting n = 2 yields pairs of consecutive words, commonly called bigrams. The visualization below shows the words associated with the word not.

dat_bigrams <- data_clean %>% 
  unnest_tokens(bigram, Consumer.complaint.narrative, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>% 
  filter(word1 == "not") %>%                   # keep only negated pairs
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>% 
  count(word1, word2, value, sort = TRUE) %>% 
  mutate(contribution = n * value) %>%         # frequency times AFINN score
  arrange(desc(abs(contribution))) %>% 
  group_by(word1) %>% 
  dplyr::slice(seq_len(20)) %>%                # top 20 pairs by absolute contribution
  arrange(word1, desc(contribution)) %>% 
  ungroup() 
# Build a directed graph with edges word1 -> word2
graph_bigram <- dat_bigrams %>% 
  graph_from_data_frame()

set.seed(123)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(graph_bigram, layout = "fr") +
  geom_edge_link(alpha = .25) +
  geom_edge_density(aes(fill = value)) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name),  repel = TRUE) +
  theme_void() + theme(legend.position = "none",
                       plot.title = element_text(hjust = 0.5)) +
  ggtitle("Negation Bigram Network")
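
The contribution column in the bigram table can be illustrated on a small scale. The sketch below is an assumption-laden example, not the analysis above: the three sentences are made up, and `toy_lexicon` is a hand-made stand-in with invented scores rather than the real AFINN lexicon (which requires a separate download).

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical complaints, used only for illustration
toy <- tibble::tibble(text = c("this is not correct",
                               "the balance is not correct",
                               "they did not help"))

# Hand-made lexicon with invented scores (NOT the real AFINN values)
toy_lexicon <- tibble::tibble(word = c("correct", "help"),
                              value = c(2, 1))

toy_negated <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # consecutive word pairs
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 == "not") %>%                                # pairs starting with "not"
  inner_join(toy_lexicon, by = c(word2 = "word")) %>%       # score the second word
  count(word1, word2, value, sort = TRUE) %>%
  mutate(contribution = n * value)                          # how much sentiment is misread

toy_negated
```

Here "not correct" occurs twice with a score of 2, so a word-level count would wrongly add 4 points of positive sentiment, while "not help" adds 1. Pairs with a large absolute contribution are the ones that distort a word-by-word sentiment count the most.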

3.2.4 Correlation Pairs

The next analysis explores which words tend to appear together in customer complaints by computing the correlation between words.
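
The text at this point does not say which correlation measure is used. One common choice for word co-occurrence is the phi coefficient, which is the Pearson correlation between two 0/1 indicators of whether each word appears in a document. The base R sketch below assumes that measure and uses four made-up complaints; `docs`, `presence`, and `word_cor` are hypothetical names.

```r
# Hypothetical complaints, each reduced to its set of distinct words
docs <- list(
  c("credit", "report", "error"),
  c("credit", "report"),
  c("loan", "payment"),
  c("credit", "loan")
)

vocab <- sort(unique(unlist(docs)))

# Binary document-term matrix: 1 if the word occurs in the complaint, else 0
presence <- t(sapply(docs, function(d) as.integer(vocab %in% d)))
colnames(presence) <- vocab

# Pearson correlation of 0/1 indicators equals the phi coefficient
word_cor <- cor(presence)
round(word_cor["credit", "report"], 3)
#> [1] 0.577
```

A positive value means the two words tend to occur in the same complaints; a negative value, such as the one between credit and loan in this toy data, means they tend to occur in different ones.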

data_clean_cor <- data_complaint %>% 
  select(Consumer.complaint.narrative, Issue, Product) %>% 
  mutate(Consumer.complaint.narrative = Consumer.complaint.narrative %>% 
           tolower() %>% 
           str_trim() %>% 
           str_remove_all(pattern = "[[:punct:]]") %>% 
           str_remove_all(pattern = "[0-9]") %>% 
           str_remove_all(pattern = "xxxx") %>% 
           replace_contraction() %>% 
           replace_word_elongation() %>% 
           replace_white() %>% 
           str_squish())
head(data_clean_cor)
#>   Consumer.complaint.narrative
#> 1 this legal notice being sent and delivered to you persuant to florida statutes notice of order to cease and desist from using personal and private information fl statute violation title xlvi crimes chapter fraudulent practices view entire chapter criminal use of personal identification information as used in this section the term a access device means any card plate code account number electronic serial number mobile identification number personal identification number or other telecommunications service equipment or instrument identifier or other means of account access that can be used alone or in conjunction with another access device to obtain money goods services or any other thing of value or that can be used to initiate a transfer of funds other than a transfer originated solely by paper instrument b authorization means empowerment permission or competence to act c harass means to engage in conduct directed at a specific person that is intended to cause substantial emotional distress to such person and serves no legitimate purpose harass does not mean to use personal identification information for accepted commercial purposes the term does not include constitutionally protected conduct such as organized protests or the use of personal identification information for accepted commercial purposes d individual means a single human being and does not mean a firm association of individuals corporation partnership joint venture sole proprietorship or any other entity e person means a person as defined in s f personal identification information means any name or number that may be used alone or in conjunction with any other information to identify a specific individual including any name postal or electronic mail address telephone number social security number date of birth mother s maiden name official stateissued or united statesissued driver license or identification number alien registration number government passport number employer or taxpayer 
identification number or food assistance account number bank account number credit or debit card number or personal identification number or code assigned to the holder of a debit card by the issuer to permit authorized electronic use of such card unique biometric data such as fingerprint voice print retina or iris image or other unique physical representation unique electronic identification number address or routing code medical records telecommunication identifying information or access device or other number or information that can be used to access a person s financial resources g counterfeit or fictitious personal identification information means any counterfeit fictitious or fabricated information in the similitude of the data outlined in paragraph f that although not truthful or accurate would in context lead a reasonably prudent person to credit its truthfulness and accuracy a any person who willfully and without authorization fraudulently uses or possesses with intent to fraudulently use personal identification information concerning an individual without first obtaining that individual s consent commits the offense of fraudulent use of personal identification information which is a felony of the third degree punishable as provided in s s or s b any person who willfully and without authorization fraudulently uses personal identification information concerning an individual without first obtaining that individual s consent commits a felony of the second degree punishable as provided in s s or s if the pecuniary benefit the value of the services
#> 2 transunion continues to report inaccurate negative items on my report that i have previously dispu ted ti methey continuos to report account with no account number inaccurate information that is hurting my credit
#> 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              my credit report has an incorrect address the address on the report is nv other than when i was in the i have lived in my entire life associated with this lv address are bills from and and both amounts are less than $ every time i dispute these bills and ask for source documents to prove i owe the bills are sold to collection agencies when these receivables are sold as a package the collection agencies reopen the outstanding debt and it reappears on my report i have a substantial credit history with mortgage car loans and credit cards i have never been late on any amount i owe there should be some protection against collection companies windmilling these bogus receivables thank you for all you do
#> 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      i recently discovered that reported my account as a charge off on my 
credit report when i was denied a loan by due to the charge off i immediately contacted by phone in an attempt to rectify the error i was informed by the customer service representative that my last payment was received on and that my account was reported as a charge off on we both agreed that it did not make sense that my account would be reported as a charge off days after my payment was received at the time reported my account as a charge off it was paid in full with a i have spent hours on the phone with numerous customer service representatives the phone calls have been extremely frustrating and have not produced a positive outcome every representative has told me the same thing due to the age of this account they do not have account records and can not provide me with the information that i am requesting yet they refuse to remove the inaccurate information from my credit report i was told that a supervisors would call me back on separate occasions to this day i have not received a call back from a supervisor i have sent numerous certified letters disputing the charge off that is currently being reported on my credit report and negatively effecting my credit i have asked for validation of my account including details explaining my why my account is being reported as a charge off with a balance i have asked for them to send me a copy of my payment history along with account notes the letters that i received in response to my letters have been a one paragraph response stating that they only report accurate information to the credit bureaus
#> 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   sent letter to credit reporting agencies telling them the inquiries on my report are unjustified the actions fail to comply with fcra section the letters are attached below credit agencies replied stating they did not need to investigate verify or remove inquiries credit agencies failed to prove requirements and fa iled to remove inquiries from my credit reports
#> 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          student loans have been discharged and i have a letter from the stating all loans have been discharged the credit bureaus refuse to accept the letter
#>                                                                              Issue
#> 1                                                  Disclosure verification of debt
#> 2 Problem with a credit reporting company's investigation into an existing problem
#> 3                                             Incorrect information on your report
#> 4 Problem with a credit reporting company's investigation into an existing problem
#> 5 Problem with a credit reporting company's investigation into an existing problem
#> 6                                             Incorrect information on your report
#>                                                                        Product
#> 1                                                              Debt collection
#> 2 Credit reporting, credit repair services, or other personal consumer reports
#> 3 Credit reporting, credit repair services, or other personal consumer reports
#> 4 Credit reporting, credit repair services, or other personal consumer reports
#> 5 Credit reporting, credit repair services, or other personal consumer reports
#> 6 Credit reporting, credit repair services, or other personal consumer reports
# Tokenize each complaint into one word per row, then drop English stop words
text_dat_cor <- data_clean_cor %>% 
  rowid_to_column("id") %>% 
  unnest_tokens(word, Consumer.complaint.narrative) %>% 
  anti_join(stop_words, by = "word")

To obtain the correlation between words, we can use the pairwise_cor() function from the widyr package.

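# Keep only words that occur more than 100 times, then correlate word pairs
# based on how often they appear under the same Issue
words_cors <- text_dat_cor %>% 
  group_by(word) %>% 
  filter(n() > 100) %>%
  pairwise_cor(word, Issue, sort = TRUE) 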
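
To build some intuition for what a pairwise word correlation measures, here is a minimal base R sketch on made-up data. It assumes the correlation of interest is the phi coefficient, that is, the Pearson correlation between two 0/1 indicators of whether each word occurs in a given unit (a document here, an Issue in the code above). The `toy` data frame and its values are hypothetical, and this is only an illustration of the idea, not widyr's actual implementation.

```r
# Toy data: one row per (document, word) occurrence. The values are made up.
toy <- data.frame(
  doc  = c(1, 1, 2, 2, 3, 3, 4),
  word = c("credit", "report", "credit", "report", "loan", "debt", "loan"),
  stringsAsFactors = FALSE
)

# Binary indicator matrix: rows are documents, columns are words,
# 1 if the word occurs in the document and 0 otherwise.
indicator <- unclass(table(toy$doc, toy$word) > 0) * 1

# The Pearson correlation between two 0/1 columns is the phi coefficient.
phi <- cor(indicator)

phi["credit", "report"]  # both words always occur in the same documents
phi["credit", "loan"]    # the two words never occur in the same document
```

In this toy example "credit" and "report" get a correlation of 1 because they always occur together, while "credit" and "loan" get -1 because they never do. Real complaint data will mostly fall between these two extremes.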

We can display the correlation between words visually using the ggraph package. In the following visualization we only show word pairs with a correlation greater than 0.9. In other words, the words connected in this graph tend to appear together, under the same Issue, when customers file a complaint.

# Fix the seed so the force-directed ("fr") layout is reproducible
set.seed(100)

words_cors %>%
  filter(correlation > .9) %>%   # keep only strongly correlated word pairs
  graph_from_data_frame() %>%    # igraph: words become nodes, pairs become edges
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation)) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() +
  ggtitle("Correlation between Words") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold"))

The benefit of the Sentiment Analysis performed above is that we can identify the main message in customers' opinions and thoughts about a company or product. In addition, the output of sentiment analysis can indicate which services or products do not yet meet expectations. This can help the marketing team study the trends that customers need more effectively. As service quality improves and products are developed further, the customer churn rate should also decrease.