999_YT_01_FinancialFraudDetectionUsingDecisionTreeMachineLearningModels(Python)

Loading...

ScaDaMaLe Course site and book

Financial Fraud Detection using Decision Tree Machine Learning Models

This notebooks is the accompaniment to the following databricks blog:

This is an exercise in self-learning. Here you will learn to use mlflow in pyspark to keep track of your experiments.

Also you may have to adapt it for Spark 3.x if needed.

In this notebook, we will showcase the use of decision tree ML models to perform financial fraud detection.


Source Data

PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company, who is the provider of the mobile financial service which is currently running in more than 14 countries all around the world.

This synthetic dataset is scaled down 1/4 of the original dataset and it is created just for Kaggle. To load the dataset yourself, please download it to your local machine from Kaggle and then import the data via Import Data: Azure | AWS.


Dictionary

This is the column definition of the referenced sythentic dataset.

Column Name Description
step maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation).
type CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
amount amount of the transaction in local currency.
nameOrig customer who started the transaction
oldbalanceOrg initial balance before the transaction
newbalanceOrig new balance after the transaction
nameDest customer who is the recipient of the transaction
oldbalanceDest initial balance recipient before the transaction. Note that there is not information for customers that start with M (Merchants).
newbalanceDest new balance recipient after the transaction. Note that there is not information for customers that start with M (Merchants).


Show code
# Configure MLflow Experiment
mlflow_experiment_id = 866112
 
# Including MLflow
import mlflow
import mlflow.spark
 
import os
print("MLflow Version: %s" % mlflow.__version__)
MLflow Version: 0.8.2
# Create df DataFrame which contains our simulated financial fraud detection dataset
df = spark.sql("select step, type, amount, nameOrig, oldbalanceOrg, newbalanceOrig, nameDest, oldbalanceDest, newbalanceDest from sim_fin_fraud_detection")
# Review the schema of your data 
df.printSchema()
root |-- step: integer (nullable = true) |-- type: string (nullable = true) |-- amount: double (nullable = true) |-- nameOrig: string (nullable = true) |-- oldbalanceOrg: double (nullable = true) |-- newbalanceOrig: double (nullable = true) |-- nameDest: string (nullable = true) |-- oldbalanceDest: double (nullable = true) |-- newbalanceDest: double (nullable = true)

Calculate Differences between Originating and Destination Balanaces

With the following PySpark DataFrame query, we will calculate the following columns:

New Column Definition
orgDiff Difference between the originating balance
destDiff Difference between the destination balance
# Calculate the differences between originating and destination balances
df = df.withColumn("orgDiff", df.newbalanceOrig - df.oldbalanceOrg).withColumn("destDiff", df.newbalanceDest - df.oldbalanceDest)
 
# Create temporary view
df.createOrReplaceTempView("financials")

Let's review the data and ask some simple questions

# Review the new table (including the origination and destiation differences)
display(df)
 
step
type
amount
nameOrig
oldbalanceOrg
newbalanceOrig
nameDest
oldbalanceDest
newbalanceDest
orgDiff
destDiff
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
1
PAYMENT
9839.64
C1231006815
170136
160296.36
M1979787155
0
0
-9839.640000000014
0
1
PAYMENT
1864.28
C1666544295
21249
19384.72
M2044282225
0
0
-1864.2799999999988
0
1
TRANSFER
181
C1305486145
181
0
C553264065
0
0
-181
0
1
CASH_OUT
181
C840083671
181
0
C38997010
21182
0
-181
-21182
1
PAYMENT
11668.14
C2048537720
41554
29885.86
M1230701703
0
0
-11668.14
0
1
PAYMENT
7817.71
C90045638
53860
46042.29
M573487274
0
0
-7817.709999999999
0
1
PAYMENT
7107.77
C154988899
183195
176087.23
M408069119
0
0
-7107.7699999999895
0
1
PAYMENT
7861.64
C1912850431
176087.23
168225.59
M633326333
0
0
-7861.640000000014
0
1
PAYMENT
4024.36
C1265012928
2671
0
M1176932104
0
0
-2671
0
1
DEBIT
5337.77
C712410124
41720
36382.23
C195600860
41898
40348.79
-5337.769999999997
-1549.2099999999991
1
DEBIT
9644.94
C1900366749
4465
0
C997608398
10845
157982.12
-4465
147137.12
1
PAYMENT
3099.97
C249177573
20771
17671.03
M2096539129
0
0
-3099.970000000001
0
1
PAYMENT
2560.74
C1648232591
5070
2509.26
M972865270
0
0
-2560.74
0
1
PAYMENT
11633.76
C1716932897
10127
0
M801569151
0
0
-10127
0
1
PAYMENT
4098.78
C1026483832
503264
499165.22
M1635378213
0
0
-4098.780000000028
0
1
CASH_OUT
229133.94
C905080434
15325
0
C476402209
5083
51513.44
-15325
46430.44
1
PAYMENT
1563.82
C761750706
450
0
M1731217984
0
0
-450
0

Truncated results, showing first 1000 rows.

What are the type of transactions?

%sql
-- Organize by Type
select type, count(1) from financials group by type
TRANSFERCASH_INCASH_OUTPAYMENTDEBIT8%22%35%34%1%typeTRANSFERTRANSFERCASH_INCASH_INCASH_OUTCASH_OUTPAYMENTPAYMENTDEBITDEBIT

How much money are we talking about (synthetically)?

%sql
select type, sum(amount) from financials group by type
0.0050G100G150G200G250G300G350G400G450G500GTRANSFERCASH_INCASH_OUTPAYMENTDEBITTOOLTIPtypesum(amount)

Rules-based Model: Create a set of rules to identify fraud based on known cases

The following where clause are a set of rules to identify know fraud-based cases using SQL; i.e. rules-based model.

  • Often, financial fraud analytics start with with clauses like the where clause below
  • Note, in reality, rules are often much larger and more complicated
from pyspark.sql import functions as F
 
# Rules to Identify Known Fraud-based
df = df.withColumn("label", 
                   F.when(
                     (
                       (df.oldbalanceOrg <= 56900) & (df.type == "TRANSFER") & (df.newbalanceDest <= 105)) | 
                       (
                         (df.oldbalanceOrg > 56900) & (df.newbalanceOrig <= 12)) | 
                           (
                             (df.oldbalanceOrg > 56900) & (df.newbalanceOrig > 12) & (df.amount > 1160000)
                           ), 1
                   ).otherwise(0))
 
# Calculate proportions
fraud_cases = df.filter(df.label == 1).count()
total_cases = df.count()
fraud_pct = 1.*fraud_cases/total_cases
 
# Provide quick statistics
print("Based on these rules, we have flagged %s (%s) fraud cases out of a total of %s cases." % (fraud_cases, fraud_pct, total_cases))
 
# Create temporary view to review data
df.createOrReplaceTempView("financials_labeled")
Based on these rules, we have flagged 255640 (0.04017841706718302) fraud cases out of a total of 6362620 cases.

How much fraud are we talking about?

Based on the existing rules, while 4% of the transactions are fraudulent, it takes into account of the 11% of the total amount.

%sql
select label, count(1) as `Transactions`, sum(amount) as `Total Amount` from financials_labeled group by label
104%96%11%89%label1100TransactionsTotal Amount

Top Origination / Destination Difference Pairs (>$1M TotalDestDiff)

Each bar represents a pair of entities performing a transaction

%sql
-- where sum(destDiff) >= 10000000.00
select nameOrig, nameDest, label, TotalOrgDiff, TotalDestDiff
  from (
     select nameOrig, nameDest, label, sum(OrgDiff) as TotalOrgDiff, sum(destDiff) as TotalDestDiff 
       from financials_labeled 
      group by nameOrig, nameDest, label 
     ) a
 where TotalDestDiff >= 1000000
 limit 100
-2.0M-1.0M0.001.0M2.0M3.0M4.0M5.0M6.0M7.0M8.0M9.0M10M11M12MC1816757085, C752686443C2115178036, C1809115549C1018394275, C423919427C805814824, C137669742C516516576, C418409319C311573688, C235819446-2.0M-1.0M0.001.0M2.0M3.0M4.0M5.0M6.0M7.0M8.0M9.0M10M11M12MC1816757085, C752686443C2115178036, C1809115549C1018394275, C423919427C805814824, C137669742C516516576, C418409319C311573688, C235819446TOOLTIPTotalOrgDiffTotalDestDiffnameDest, nameOrigTotalOrgDiff, TotalDestDiffTotalOrgDiffTotalOrgDiffTotalDestDiffTotalDestDiff

What type of transactions are associated with fraud?

Reviewing the rules-based model, it appears that most fraudulent transactions are in the category of Transfer and Cash_Out.

%sql
select type, label, count(1) as `Transactions` from financials_labeled group by type, label
PAYMENTCASH_OUTDEBITTRANSFERCASH_IN34%35%1%8%22%typePAYMENTPAYMENTCASH_OUTCASH_OUTDEBITDEBITTRANSFERTRANSFERCASH_INCASH_IN

Rules vs. ML model

Instead of creating specific rules that will change over time, can we be more precise and go to production faster by creating a ML model?

Decision Trees

Decision trees and their ensembles are popular methods for the machine learning tasks of classification and regression. Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions. Tree ensemble algorithms such as random forests and boosting are among the top performers for classification and regression tasks.

Because of these facets, decision trees often perform well on top of rules-based models and are often a good starting point for fraud detection.

Source: The Wise Old Tree

Create training and test datasets

To build and validate our generalized fraud ML model, we will initially split the data using randomSplit to create our training and test datasets.