
Data and Data Cleaning


Introduction

            Many peer-to-peer lending groups have trouble when it comes to loan disbursement: one is never sure whether the variables taken into account are enough. In searching for data, I found that a wide array of variables exists; some datasets available online contain up to 150 trainable variables. In the modern world, data is changing everything, and it is now often described as more valuable than gold.

In fact, the best-known source of loan data, The Lending Club, recently stopped making its data open-source once it realised just how valuable the data was. Up to 2018 these datasets were easy to find; now you have to pay for access. Still, the model we want to build can learn from what is available. The main objective of this project is to find a better way to assess a person's loan default risk. Traditional methods look at a person's income or the assets they own, but in the modern world this is no longer enough to prove that someone is not a risk to the institution, hence the introduction of newer methods. For example, some peer-to-peer lending groups make loan limits dependent on a person's deposits: with $100 in deposits, you can get a loan of $300. This makes sense, but again it is not very accurate.

With this implementation, we will compare loans against the interest they generated. The fraction we find will be an assessment of whether the borrower is a risk or not, and the lending group will be responsible for deciding what fraction of predicted paid interest is acceptable. If our model predicts that ⅔ of the interest will be paid and that fraction is acceptable, then the borrower is not a credit risk; the converse also applies.
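As a small illustration of this decision rule, the sketch below turns a predicted amount of interest paid into a fraction and checks it against a lender-chosen threshold. The ⅔ threshold and the dollar figures are only the example numbers used above, not values from the data.

```python
def is_acceptable_risk(predicted_interest_paid, interest_due, threshold=2 / 3):
    """Return True when the predicted fraction of the interest due
    that will be paid meets the lender's chosen threshold."""
    fraction = predicted_interest_paid / interest_due
    return fraction >= threshold

# Example: the model predicts that $200 of the $300 interest due will be paid.
print(is_acceptable_risk(200.0, 300.0))  # True, since 200/300 >= 2/3
```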

Data and Data Cleaning

            The data available, as stated earlier, is sourced from The Lending Club. It holds records from 2007 to 2018, a huge dataset: the download is more than a gigabyte and comes as an ordinary CSV file, so there is plenty of data to work with. The data is split into loans that were accepted and loans that were rejected. Only the accepted-loan data applies in our case, since we need to know how much interest each loan earned. As is often the case, the data is not arranged in a way that makes it easy to train a neural network, which means it first has to be cleaned. Using it in its raw form is not impossible, but the model's accuracy will suffer on account of the dirt.
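Loading the accepted-loans file is a one-liner with pandas; the sketch below assumes a hypothetical file name, so substitute the name of your own copy of the Lending Club download.

```python
import pandas as pd

# The file name is an assumption; use the name of your own copy of the
# Lending Club accepted-loans file. low_memory=False avoids mixed-dtype
# warnings on a file this large.
accepted = pd.read_csv("accepted_loans_2007_2018.csv", low_memory=False)

print(accepted.shape)         # number of rows and columns in the raw data
print(accepted.columns[:10])  # a first look at the available variables
```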

Describing data as dirty seems rather vague; the term refers to a few impurities that need to be removed, and tools such as Python make this possible. For this project we will use Google Colab so that we do not need to install all the packages ourselves; working offline is only necessary if you have no internet connection. A project of this size needs many tools, which Python calls packages, and downloading them takes time, especially on a slow connection. Without any internet at all, the task becomes impossible.

Dirt in data includes empty columns, repeated values, irrelevant columns, and so on, and Python is what we will use to get rid of all this trouble. The first order of business is to remove the empty columns, since they only make the model waste time running through empty fields during training. After this is done, we need to decide what the output variable will be, that is, the value the model produces after it has made a decision. As described in the introduction, the output variable will be a fraction: it predicts what fraction of the interest due on the loan will be paid. To compute it, we combine two columns in the loan data, the interest defaulted and the interest due, as sketched below.
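A minimal cleaning sketch, continuing from the loading example above. The column names 'interest_defaulted' and 'interest_due' are hypothetical stand-ins for the two columns described in the text; replace them with whatever your copy of the data actually uses.

```python
# 1. Remove columns that are entirely empty.
accepted = accepted.dropna(axis=1, how="all")

# 2. Remove columns that repeat a single value for every row.
constant_cols = [c for c in accepted.columns
                 if accepted[c].nunique(dropna=False) <= 1]
accepted = accepted.drop(columns=constant_cols)

# 3. Build the output variable: the fraction of the interest due that was
#    actually paid, combining the interest-defaulted and interest-due columns.
accepted["interest_fraction"] = (
    (accepted["interest_due"] - accepted["interest_defaulted"])
    / accepted["interest_due"]
)
```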

However, before combining the columns, something has to be done about the loan status column. The column has eight unique values, but for the model to tell the difference this has to change: any loan that is not fully paid is changed to '0', meaning false, and the rest are assumed to be fully paid and changed to '1'. If you are familiar with boolean values, it should be obvious that true and false can also be represented by the numbers 1 and 0, respectively.
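In pandas this collapse of the eight status values into a binary flag is a single comparison; the sketch below assumes the accepted-loans file labels completed loans as 'Fully Paid'.

```python
# Collapse the eight loan_status categories into a boolean-style target:
# 1 for fully paid loans, 0 for everything else.
accepted["loan_status"] = (accepted["loan_status"] == "Fully Paid").astype(int)

print(accepted["loan_status"].value_counts())
```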

Neural Network

            After the data is ready, it is time to build the neural network. When they were first imagined, neural networks were designed to mimic biological neurons; it is amazing how something that seemed impossible a century ago is now becoming commonplace, and one can only imagine what wonders technology will hold in another century. The network will be composed of three activation layers, and these layers need an activation function; ReLU is often the method of choice, and studies have shown how effective it is. The network also needs a loss function so that it actually learns something: during training, the loss function is responsible for slowly making the model more and more accurate. After much deliberation, we settled on a logarithmic error loss function. Even though it is not the most praised method, this specific implementation needs it, simply because the data is large and it performs better than most of its counterparts in these conditions.
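The essay does not give the exact architecture, so the following is only a minimal Keras sketch of a network with three ReLU-activated layers and a single 0-to-1 output matching the interest-fraction target. The layer widths, the assumed number of input features, and the reading of the "logarithmic error loss" as mean squared logarithmic error are all assumptions.

```python
from tensorflow.keras import layers, models

n_features = 20  # assumed number of cleaned input columns

# Three ReLU-activated hidden layers feeding one sigmoid output that
# represents the predicted fraction of interest paid.
model = models.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer="adam",
    loss="mean_squared_logarithmic_error",  # one reading of a "logarithmic" loss
    metrics=["accuracy"],
)
model.summary()
```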

The number of epochs we chose was 100, and we trained two models, making a total of 200 epochs. Although it took some time, the accuracy came out upwards of 0.954454, the biggest contributing factor being the size of the data. After the model is trained, it has to be saved, since deploying it into a live system expects it in a certain format; joblib is the tool we use for this exact task. Even though there are plenty of other ways to deploy an AI model, we elected to deploy it as a REST service, with Flask instrumental in this endeavour. Once live, all that is needed is a request carrying the input parameters; most importantly, the request has to be validated before it is fed into the model.
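A minimal sketch of that deployment step, assuming the trained model was saved with joblib as the essay states and that the loaded object exposes a predict method. The endpoint name, file name, feature count, and JSON field name are illustrative, not taken from the project.

```python
import joblib
import numpy as np
from flask import Flask, jsonify, request

# Training and saving would have happened beforehand, roughly:
#     model.fit(X_train, y_train, epochs=100, validation_split=0.1)
#     joblib.dump(model, "interest_model.joblib")
N_FEATURES = 20
model = joblib.load("interest_model.joblib")

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)

    # Validate the request before it reaches the model.
    if payload is None or "features" not in payload:
        return jsonify(error="JSON body with a 'features' list is required"), 400
    features = payload["features"]
    if len(features) != N_FEATURES:
        return jsonify(error=f"expected {N_FEATURES} features"), 400

    x = np.asarray(features, dtype=float).reshape(1, -1)
    fraction = float(np.ravel(model.predict(x))[0])
    return jsonify(predicted_interest_fraction=fraction)

if __name__ == "__main__":
    app.run(port=5000)
```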

Conclusion

            Once we get to production, the necessity of such technologies will become clear. The need for humans to make all the decisions is slowly diminishing; in many fields, computers are proving to be better and faster than humans. Take maths and chess, for example: it is near impossible to beat a computer in these two realms unless it was programmed poorly. Why not apply the same in loan risk assessment? Food for thought.

