Using data for financial inclusion: Doing it right
Smart and ethical usage of data can potentially end financial exclusion. The case for data sharing and standards for responsible algorithm design.
The financially excluded have entered the digital universe. This creates opportunities as well as risks. To achieve a bigger impact and to ensure that the end client benefits, the financial inclusion sector needs to act collectively.
High interest rates and costs mean that financial products are too expensive for many. Distribution of services through digital channels has lowered operational costs for loan providers. Responsible use of artificial intelligence and data analytics can reduce risk premiums and operational costs, and thus interest rates, even further. Financial Service Providers (FSPs) can use data to better understand the needs of their clients and improve their value proposition. Gradually, more stakeholders in the industry are taking note. However, to take this development to the next level, the sector needs to solve two major problems: data silos and a lack of standards for responsible algorithm design.
Use case: Data and risk
A key thing an FSP wants to know when a client applies for a loan is the probability of default. How likely is it that the client will fail to repay the loan according to the agreed terms? In an ideal world, the FSP would have a model that perfectly predicts this, which would remove all uncertainty. The client could then borrow at the risk-free rate and the FSP would not need to worry about non-paying customers. However, the world is not ideal and such a model does not exist (and never will, until we can look into the future). With no credit history available to rely on, the FSP closely monitors these ‘risky’ clients and demands a risk premium to compensate for potential losses. And even though it makes weekly visits and keeps a close eye on all risky clients, the FSP still incurs losses from clients who do not pay back their debt. As its portfolio at risk (PAR) and its number of non-performing loans increase, its financing costs rise and, as a consequence, so does the interest rate it charges its clients.
Obviously, the FSP needs to solve this problem of uncertainty. No model can perfectly predict whether a client will repay a loan, but there is an alternative. What the provider needs is a sample consisting of ‘good’ clients, who always pay on time, and ‘bad’ clients, who pay their debts late or not at all. With this information, we could enrich the dataset with everything we know about these clients: gender, age, family size, profession, address, marital status, size of the house, you name it. If the client uses mobile money, we could add transactional data, phone usage and other electronic data.
Now this is something we can work with. We could figure out which of these features best predict the client’s probability of default. We might find that women are more likely to repay than men, or that clients in certain professions always pay back their debt on time. With this information we could build a model that predicts the probability of default of new clients more accurately, and keep training it on new records so that it gets better and better.
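As a minimal sketch of what such a model could look like, the snippet below fits a logistic regression by plain gradient descent on synthetic clients. Everything in it is an assumption made for illustration: the two features (late_share, the share of bills a client paid late, and tenure, a normalized length of the client relationship), the coefficients used to generate the data, and all function names. A real FSP would use its own records and most likely an established statistics library.

```python
import math
import random


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def make_synthetic_clients(n, seed=0):
    """Generate hypothetical clients: (late_share, tenure) and a default flag.

    The 'true' relationship below is invented purely to have data to fit.
    """
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        late_share = rng.random()  # share of bills paid late, 0..1
        tenure = rng.random()      # normalized relationship length, 0..1
        true_pd = sigmoid(-1.0 + 3.0 * late_share - 2.0 * tenure)
        rows.append((late_share, tenure))
        labels.append(1 if rng.random() < true_pd else 0)  # 1 = defaulted
    return rows, labels


def fit_logistic(rows, labels, lr=1.0, epochs=1000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (x1, x2), y in zip(rows, labels):
            err = sigmoid(b + w[0] * x1 + w[1] * x2) - y
            grad_w[0] += err * x1
            grad_w[1] += err * x2
            grad_b += err
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b


def predict_pd(w, b, late_share, tenure):
    """Estimated probability of default for a new client."""
    return sigmoid(b + w[0] * late_share + w[1] * tenure)


rows, labels = make_synthetic_clients(500)
w, b = fit_logistic(rows, labels)
print("PD, frequent late payer, new client:", predict_pd(w, b, 0.9, 0.1))
print("PD, rarely late, long-standing client:", predict_pd(w, b, 0.1, 0.9))
```

The output is an estimated probability rather than a yes/no verdict, which is exactly the kind of estimate the FSP can act on.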
Even then, we still can’t predict if a client will default or not, but we can at least make a reasonable estimate of the probability of default. What are the practical implications of this?
The FSP could charge clients with a low probability of default a lower interest rate. It could reduce its monitoring efforts for these clients, thereby lowering operational costs and the burden of monitoring for clients. It could make a better estimate of what is a reasonable risk provision, making more efficient use of the funds available. It could provide more certainty to outside investors, thereby reducing the cost of capital. There are many more, but all these implications have one thing in common: lower costs for the client and more accessible financial services.
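To make the link between a probability of default and pricing concrete, here is a small sketch based on the standard expected-loss decomposition (expected loss = probability of default × loss given default × exposure). The pricing rule itself, a rate built up from funding cost, operating cost and expected loss, is a deliberate simplification, and every number except the 12.6% funding rate is a made-up placeholder.

```python
def expected_loss_rate(prob_default, loss_given_default):
    """Expected loss per unit lent: PD times the share lost when a default occurs."""
    return prob_default * loss_given_default


def risk_based_rate(funding_rate, operating_cost_rate, prob_default, loss_given_default):
    """Simplified break-even interest rate: funding + operations + expected loss."""
    return funding_rate + operating_cost_rate + expected_loss_rate(
        prob_default, loss_given_default
    )


def loss_provision(exposure, prob_default, loss_given_default):
    """Amount to set aside for a loan of the given size."""
    return exposure * expected_loss_rate(prob_default, loss_given_default)


# Hypothetical clients: same loan, different estimated probability of default.
low_risk = risk_based_rate(0.126, 0.08, prob_default=0.02, loss_given_default=0.6)
high_risk = risk_based_rate(0.126, 0.08, prob_default=0.20, loss_given_default=0.6)
print(f"low-risk rate:  {low_risk:.1%}")
print(f"high-risk rate: {high_risk:.1%}")
print("provision on a 50,000 loan, low risk:", loss_provision(50_000, 0.02, 0.6))
```

A better estimate of the probability of default feeds directly into both the rate and the provision. Lower monitoring effort would show up as a lower operating cost rate.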
Quantifying the impact of reducing risk for the end client
The average interest rate on a loan of up to 500,000 Kenyan shilling for a smallholder farmer is 30 to 40 percent. The risk-free rate, defined as the yield on a Kenyan 10-year bond, is 12.6%. A Kenyan farmer borrowing 50,000 Kenyan shilling for a year to pay for seeds and fertilizer would pay 6,300 Kenyan shilling in interest if she could borrow at the risk-free rate. This is a decrease in borrowing costs of roughly 68% compared to the 20,000 shilling she would pay at a rate of 40 percent, the upper end of the range charged by MFIs in Kenya. Given that the current average income of a Kenyan smallholder farmer is 256,000 shilling, the saving would effectively increase her yearly income by about 5.4%.
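The arithmetic behind these figures can be checked in a few lines. The loan size, risk-free rate and average income come from the text. The 40 percent comparison rate is an assumption: it is the upper end of the quoted 30 to 40 percent range and the rate that reproduces the 68% and 5.4% figures.

```python
loan = 50_000            # Kenyan shilling, borrowed for one year
risk_free_rate = 0.126   # yield on a Kenyan 10-year bond (from the text)
mfi_rate = 0.40          # assumption: upper end of the 30-40% range
average_income = 256_000 # yearly income of a smallholder farmer (from the text)

interest_risk_free = loan * risk_free_rate  # about 6,300 shilling
interest_mfi = loan * mfi_rate              # about 20,000 shilling
saving = interest_mfi - interest_risk_free  # about 13,700 shilling

cost_decrease = saving / interest_mfi       # about 0.685
income_gain = saving / average_income       # about 0.054

print(f"interest at risk-free rate: {interest_risk_free:,.0f}")
print(f"interest at MFI rate:       {interest_mfi:,.0f}")
print(f"decrease in borrowing cost: {cost_decrease:.1%}")
print(f"gain relative to income:    {income_gain:.1%}")
```

At a 30 percent rate instead, the same calculation gives a smaller saving, so the quoted figures should be read as the upper end of what is possible.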
Credit scoring models based on alternative data (i.e. all data other than credit history) are increasingly used by FSPs in the industry. Over time, these models are getting more sophisticated and producing better predictions. However, these models can only massively improve their performance by adding data, and lots of it. A thousand records are not nearly enough. The graph below comes from the famous paper of Banko and Brill (2001). They discovered that very different algorithms can reach the same performance, provided they are fed with loads of data: more is better.
We’d like tens of thousands, or even millions, of records to make accurate prediction possible. No single FSP can produce the amount of data needed for high-performance models, especially not the small ones. But what if we could aggregate the data of all microfinance institutions worldwide, and train our models on that? And what if we added the data of other organizations serving the target group, such as providers of PayGo energy, governments and telecom providers? And what if we made the models trained on this data available to every FSP in the industry? That would truly make a difference. But how do we get there?
In the last decade the world has changed radically due to a mix of factors that can be summarized in one term: digitalization, or the expansion of the digital universe relative to the physical world. The facts are mind-blowing: the share of the world population that uses the internet rose from 20% in 2007 to 53% in 2017, which means that over 4 billion people are now online. A quarter of a billion people came online for the first time in 2017, a large share of them from the African continent. The growth is driven by the spread of affordable smartphones and data plans. All these new internet users add to the gargantuan amount of data produced each year, which is expected to grow exponentially in the coming years. Mobile money usage grew from 136 million accounts in 2012 to 690 million accounts in 2017.
At the moment, data on individual clients is rarely, if ever, shared among FSPs because of infeasibility, privacy laws and competitive reasons. These are valid reasons not to share data, but creating isolated data silos also means that even though a lot of data is produced, only a fraction can be used for analytics and AI applications. The issue could be solved by ensuring privacy and ownership of the data. A promising field is the research on decentralized, distributed databases. Encrypting the data would protect the privacy of clients and add security. The owner of the data would remain in control of what happens with the data and could be compensated for sharing it with the network. Various organizations are currently working on such a solution, most notably the Ocean Protocol. The microfinance industry could use this or similar technologies to make data sharing between FSPs feasible. But in order to get there, the sector first needs to set standards both for data collection and for responsible algorithm design. This brings us to the second issue that needs solving.
Lack of standards for data collection and responsible algorithm design
Many FinTech companies already use algorithms to determine whether someone is eligible for a loan. A known risk of algorithms is that human bias gets built into the model. A notorious example is the algorithm Amazon created for its HR department and later abandoned. Data analytics used by police departments to predict criminal events have been accused of building in racial bias.
There is an acute risk that predictive models used in the financial inclusion sector build in racial or gender bias, or bias against minorities. Responsible algorithm design is therefore crucial from an ethical angle. Technology companies such as IBM and Accenture are developing tools to detect such biases in algorithms. To help the sector distinguish good algorithms (i.e. those with minimal bias) from bad ones, certification could be a solution. Certificates should then be handed out only to algorithms that have been shown to be unbiased against a set of criteria and to make responsible use of the data available.
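One simple check such a certification could include is a comparison of approval rates across groups. The sketch below computes the ratio between the lowest and the highest approval rate, often called the disparate impact ratio. The sample decisions and group labels are invented, and the 0.8 threshold is borrowed from the "four-fifths" rule of thumb used in US employment guidelines. Whether that threshold is appropriate for credit scoring is an open assumption, and a real audit would look at more than one metric.

```python
from collections import defaultdict


def approval_rates(decisions):
    """Share of approved applications per group.

    `decisions` is an iterable of (group_label, approved) pairs.
    """
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        if ok:
            approved[group] += 1
    return {group: approved[group] / totals[group] for group in totals}


def disparate_impact_ratio(decisions):
    """Lowest group approval rate divided by the highest (1.0 means parity)."""
    rates = approval_rates(decisions)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was approved, so the rates do not differ
    return min(rates.values()) / highest


# Hypothetical model output: 8 of 10 women approved, 5 of 10 men approved.
decisions = (
    [("women", True)] * 8 + [("women", False)] * 2
    + [("men", True)] * 5 + [("men", False)] * 5
)

ratio = disparate_impact_ratio(decisions)
print("approval rates:", approval_rates(decisions))
print("disparate impact ratio:", ratio)
print("flag for review:", ratio < 0.8)  # assumed rule-of-thumb threshold
```

A low ratio does not prove that a model is unfair, since groups can differ in actual risk, but it marks the model as one that needs a closer look before it is certified.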
A second challenge the sector needs to solve is to make sure that only relevant data is collected, and that the input is accurate. In data modeling there is a well-known rule: garbage in, garbage out. Uncertainty about the validity of data renders the input effectively useless, or even harmful when it is added to the model nonetheless. A possible solution may be to allow only trustworthy FSPs with a proven record on data collection to add data to a distributed database. Once again certification can be part of the solution, with certified partners subject to periodic due diligence.
Certainty about the validity of the data is a first step, but the sector should also set standards on what data to collect. Ideally you would collect as much data as possible, but this is unfeasible both for practical reasons and because of privacy concerns. A mapping of features based on their (expected) predictive power, feasibility of collection and privacy risk could be a first step towards setting standards on what data to collect.
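Such a mapping could start as something as simple as a scored list. The sketch below is only an illustration of the idea: the three features are taken from the examples mentioned earlier, but the scores are made-up placeholders and the scoring rule (predictive power times feasibility, discounted by privacy risk) is one arbitrary choice among many.

```python
from dataclasses import dataclass


@dataclass
class FeatureCandidate:
    name: str
    predictive_power: float  # expected predictive power, 0 (none) to 1 (high)
    feasibility: float       # how easy the data is to collect, 0 to 1
    privacy_risk: float      # sensitivity of the data, 0 (harmless) to 1 (high)


def priority(feature):
    """Hypothetical scoring rule: useful, collectable and low-risk ranks first."""
    return feature.predictive_power * feature.feasibility * (1.0 - feature.privacy_risk)


# Placeholder scores for illustration only; real values would need research.
candidates = [
    FeatureCandidate("mobile money transactions", 0.8, 0.6, 0.5),
    FeatureCandidate("profession", 0.5, 0.9, 0.2),
    FeatureCandidate("marital status", 0.2, 0.9, 0.6),
]

ranked = sorted(candidates, key=priority, reverse=True)
for feature in ranked:
    print(f"{feature.name}: {priority(feature):.2f}")
```

The value of the exercise lies less in the exact scores than in forcing an explicit trade-off between usefulness, effort and privacy for every field an FSP considers collecting.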
Now that the un(der)banked are entering the digital universe, many opportunities open up, possibly ending financial exclusion. But for that to happen, the financial inclusion industry has some hurdles to clear. Many risks are involved, and biased algorithms or even the outright misuse of personal data can have harmful consequences. Data is great, but doing it right is crucial.
Thank you for reading. Any comments or suggestions? Please let me know by sending an e-mail to firstname.lastname@example.org or a PM on LinkedIn.