Data is the new oil

by Dominik Matula, Data Scientist

Have you ever tried digging deeper into your company data warehouse? Perhaps you found a skeleton or two in the cupboard, but maybe also a hidden vein of gold. If you haven’t, then it’s about time you rolled up your sleeves and started mining your data!

We live in the age of information, and it has been said often enough that data are the oil of the new millennium. Moreover, in contrast to oil, data as a commodity are practically inexhaustible and constantly multiplying, they can be stored easily, and they do not cause global warming. They simply hold so many advantages! Nowadays, the key ingredient that separates successful organisations from the rest is the ability to have the right piece of information at the right time. Or rather the ability to find it, because with the development of big data technologies it has never been easier to save every thinkable and even unthinkable log.

The internet giants of today, such as Amazon, Google, and Facebook, show us every day how essential it is to find, mine, and use information in a sophisticated way. All of these companies manage to process information about their clients and turn it into more profit and better services. After all, who has never clicked on a “recommended video”? Or added another “others also bought” book to the shopping cart?

CLIENT TRANSACTION PICTURE

Some of the greatest potential lies hidden in relational data. Notice that all the corporations mentioned above mine exactly this category of data! Thankfully, you don’t have to run the world’s biggest e-shop to make use of it. Relational data are stored by the majority of our clients (banks, telcos, government institutions, etc.), and it’s entirely up to you to gain the maximum advantage from them.

Let’s have a brief look at our use case, the so-called ‘Instalment Detector’, which we have developed for a major Czech bank. Its goal is simple: there is a client with an account at your bank, and you would like to find out whether any (and which!) of their transactions represent instalment payments.

Surely, you can imagine what this kind of knowledge can be used for:

  • From the instalment amount and the (ir)regularity of the payments, one can better estimate a client’s risk score.
  • It suddenly becomes possible to create a tailored offer and give a specific client better conditions. This ultimately results in higher satisfaction of both the client and your management.

Fortunately, one doesn’t have to call all customers in order to access this type of information. All it takes is a careful look at the data, which are already stored in your company’s data warehouses.

MY NAME IS HOLMES. SHERLOCK HOLMES.

What exactly does such a data treasure hunt look like? Rather than actual data mining, which is what this job is usually called, it reminds me of meticulous detective work. You don’t believe me?

Let’s have a look:

Initially, the detective has to carefully examine the crime scene, talk to witnesses, and form several working hypotheses. Similarly, a data scientist studies the problem domain as a whole (in the case of the Instalment Detector, it’s the world of credits, loan companies, typical client behaviour in each segment, transaction anomalies, etc.). Moreover, a debate with our client’s domain experts can be extremely productive, as they are the witnesses who know the data and how they are created best.

With every new hint and fact, the detective is forced to adjust his crime hypothesis. For example, if somebody discovers a bloody pitchfork in the nearby garden shed, suspicion of the gardener immediately rises. Unfortunately, clues are not always this straightforward. Often it is necessary to untangle the sequence of events with Holmes-like thoroughness and attention to detail. That makes the final revelation at the end of every Doyle novel all the more fascinating.

Equally, in the case of the Instalment Detector, we have taught our digital detective to notice more than just the most obvious hints (e.g. an outgoing payment to the account of a known loan company labelled “instalment TV”). To prove its usefulness, the algorithm had to pay attention to payment patterns (such as frequency, amount variability, and the stability and content of the payment symbols used), client characteristics (indeed, older clients borrow money less often than younger generations), time correlations, and transaction notes.

At the end of his effort, this digital detective was finally able to utter: “The data speak clearly, my dear Watson!”
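To make the payment-pattern clues more tangible, here is a minimal sketch of how frequency and amount variability could be summarised for the payments a client sends to a single counterparty. It is an illustration only, not the Instalment Detector itself: the `Payment` record, the `pattern_features` function, the feature names, and the sample payments are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, pstdev


@dataclass
class Payment:
    day: date
    amount: float      # assumed to be a positive outgoing amount
    counterparty: str  # hypothetical identifier of the receiving account


def pattern_features(payments):
    """Summarise the payments a client sent to one counterparty."""
    ordered = sorted(payments, key=lambda p: p.day)
    # Days elapsed between consecutive payments.
    gaps = [(b.day - a.day).days for a, b in zip(ordered, ordered[1:])]
    amounts = [p.amount for p in ordered]
    return {
        "count": len(ordered),
        # Frequency: average number of days between payments.
        "mean_gap_days": mean(gaps) if gaps else None,
        # Regularity: how much the gaps fluctuate (0 means perfectly regular).
        "gap_spread_days": pstdev(gaps) if gaps else None,
        # Amount variability: coefficient of variation of the amounts.
        "amount_cv": pstdev(amounts) / mean(amounts) if amounts else None,
    }


# Made-up example: four monthly payments of the same amount.
payments = [
    Payment(date(2023, 1, 15), 2500.0, "loan-co"),
    Payment(date(2023, 2, 15), 2500.0, "loan-co"),
    Payment(date(2023, 3, 15), 2500.0, "loan-co"),
    Payment(date(2023, 4, 14), 2500.0, "loan-co"),
]
features = pattern_features(payments)
print(features)
```

Roughly monthly gaps combined with an amount variability close to zero are the kind of pattern one would expect from instalments, whereas grocery shopping would typically show irregular gaps and widely varying amounts.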

STRENGTH GROWS WITH EXPERIENCE

The algorithms we have used for instalment detection can be divided into two groups: unsupervised and supervised learning algorithms. The first category reminds one of the observant eyes of the detective, who immediately notices if something is out of the ordinary. However, this can be quite tricky. For example, how would you fancy being incarcerated for behaving oddly well?

The second category, supervised learning algorithms, is close to the actual training of a detective, who gets to familiarize himself with a number of solved cases in school and is then allowed to work independently. Unfortunately, this approach is not a panacea either: the world is constantly changing, and criminals invent newer and newer ways of tricking the defenders of the law. One can already guess that the able detective needs both methods to do his job well.
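The difference between the two groups can be shown with a toy sketch. The unsupervised part flags values that deviate strongly from the rest without ever seeing a label. The supervised part learns from a handful of labelled "solved cases" using a nearest-centroid rule. Both techniques are stand-ins chosen for brevity, not the models used in the Instalment Detector, and all function names, feature values, and labels below are invented for the example.

```python
from statistics import mean, pstdev


# Unsupervised: no labels, just flag whatever is out of the ordinary.
def unusual(values, threshold=2.0):
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [False] * len(values)
    # A value is unusual if it lies more than `threshold` standard
    # deviations away from the mean.
    return [abs(v - mu) / sigma > threshold for v in values]


# Supervised: learn from solved cases (labelled examples).
def fit_centroids(samples, labels):
    centroids = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        # Average each feature over all samples carrying this label.
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, sample):
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(sample, centroid))
    # Pick the label whose centroid is closest to the sample.
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Made-up payment amounts; the last one is clearly out of line.
amounts = [120.0, 95.0, 110.0, 105.0, 100.0, 5000.0]
flags = unusual(amounts)

# Made-up training data: (gap spread in days, amount variability).
samples = [(1.0, 0.0), (2.0, 0.05), (12.0, 0.8), (9.0, 0.6)]
labels = ["instalment", "instalment", "other", "other"]
centroids = fit_centroids(samples, labels)
guess = predict(centroids, (1.5, 0.02))
print(flags, guess)
```

Note the weakness of each approach in this sketch: the unsupervised check knows only that the last amount is odd, not why, while the supervised rule can only recognise the kinds of cases it has been shown. The features are also left unscaled here, which a real model would need to address.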

The dynamic world of instalment payments is not too different in this regard. Companies are founded and fall apart, client habits change, and so on. For example, if one wanted to rely on a list of bank accounts of loan companies, it would be out of date within a very short time. Similarly, P2P credits and loans between friends could not be tracked.

Hence, while working on our Instalment Detector, we decided to listen to the actual clients and implemented a meta-model which reflects these dynamics. This is not the only case in which the digital detective can capitalize on previously gained knowledge. Bayesian networks, which can be found at his core, are perfect for iterative improvements.
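The idea of iterative improvement can be illustrated with the simplest possible Bayesian update. The sketch below tracks a belief about what share of payments to a newly observed account are instalments and refines it batch by batch. It is plain Beta-Binomial updating of a single probability, far simpler than a Bayesian network, and the `BetaBelief` class, the prior, and the counts are assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief about the share of instalment payments."""
    alpha: float = 1.0  # prior pseudo-count of instalment payments
    beta: float = 1.0   # prior pseudo-count of other payments

    def update(self, instalments, others):
        # Conjugate update: observed counts are simply added to the prior,
        # so yesterday's posterior becomes today's prior.
        return BetaBelief(self.alpha + instalments, self.beta + others)

    @property
    def mean(self):
        # Expected share of instalment payments under the current belief.
        return self.alpha / (self.alpha + self.beta)


belief = BetaBelief()         # uniform prior, mean 0.5
belief = belief.update(8, 2)  # first batch: 8 instalments, 2 other payments
belief = belief.update(9, 1)  # second batch: 9 instalments, 1 other payment
print(round(belief.mean, 3))
```

Nothing has to be retrained from scratch when new evidence arrives: each batch only nudges the existing belief, which is what makes this family of models convenient in a world where loan companies and client habits keep changing.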

From the topic of consumer loans, it’s only a small step towards credits and leasing, and another one to other types of transactions. The detective can be taught to notice other types of clues and hints – he can detect income payments (see our Salary Detector project) or segment clients based on their consumer behaviour.

Turn to us for a comprehensive service concerning Big Data & Data Science solutions for your business!