Why we need synthetic data

Fresh on the job, the two Winterfell Bank data scientists were asked to use machine learning algorithms to better understand what differentiates High Net Worth Individuals (HNWI) from customers considered Mass Affluent and Emerging Affluent. Unfortunately, HNWI are very few: less than 1% of the total. By headcount, Mass Affluent and Emerging Affluent customers vastly outnumber the HNWI and therefore, from the point of view of a machine learning algorithm, they drown them out numerically, making them very difficult to study.


Jurassic Bank commissioned a small fintech company, highly specialized in data analytics, to identify the best cross-selling opportunities among its clients by developing a “custom” machine learning model on the bank’s database. Unfortunately, Jurassic Bank’s corporate policy firmly forbids any data transfer outside the bank, as well as remote access for external personnel via VPN: no way, you have to work physically on site, in basements crammed with Paleoproterozoic-era mainframes. As chance would have it, the fintech company is located 350 km away from Jurassic Bank’s headquarters, so the cost and time of the cross-selling project are likely to rise enormously. This is not good for cross-selling and the revenue that should result from it.


As part of its business plan for the next three years, Gondor Insurance is considering expanding its product range. The customer population will change under the influence of two main forces: (i) the strong generational shift already under way and (ii) the acquisition of another insurance company with a quite different customer base. How can we numerically assess the impact on premium income and margins of the various scenarios in which a transformation of the product range is combined with simultaneous changes in the type of customers and their preferences?


These three cases, involving three different financial intermediaries grappling with three different data analytics problems, certainly have one factor in common: they would all benefit from the use of synthetic data.


What are synthetic data?

Synthetic data are “fake” data that have the same statistical properties as real data. In short: data similar to the originals, but different. Different enough, for example, to prevent a synthetic data point from being uniquely associated with a data point in the original database. This is good news for compliance with GDPR and similar regulations.

To generate a synthetic dataset you need a statistical simulation model, a Synthetic Data Generator (SDG), i.e. a machine learning model that has been trained on real data and is then used to generate new data via Monte Carlo simulation. Let’s take a closer look at the process of creating synthetic data.


How does the Synthetic Data Generator work?

The SDG learns the fundamental characteristics of the data related to a given problem, identifying the underlying multivariate probabilistic laws that drive the system as a whole, interrelationships included.

It is not a trivial problem: the data often live in a vast sample space with many dimensions, defined by variables of heterogeneous types: for example nominal variables (such as gender or place of residence), ordinal variables (such as the level of education), discrete numerical variables (such as the number of children), continuous numerical variables (such as assets under management or the current account balance), time series (such as the share value of mutual funds), and so on. Then add the interdependence relationships between the variables, which of course change over time. And, believe me, if the word “interdependence” makes linear correlation flash through your mind, know that for this kind of problem relying on such a metric is almost always a bad idea that leads to poor synthetic data: richer definitions of interdependence are needed.
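
To see why linear correlation can be a misleading measure of interdependence, here is a minimal sketch in Python. The data are hypothetical and chosen only for illustration: y is completely determined by x, yet the Pearson coefficient does not detect the relationship because it is not linear.

```python
import math

def pearson(xs, ys):
    """Plain Pearson linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical example: y depends entirely on x, but not linearly.
xs = [float(x) for x in range(-5, 6)]   # symmetric around zero
ys = [x ** 2 for x in xs]

r = pearson(xs, ys)
print(f"Pearson correlation between x and x^2: {r:.6f}")
```

The coefficient comes out essentially zero even though knowing x tells you y exactly. A generator that reproduced only linear correlations would treat these two variables as unrelated.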

For those who are familiar with financial data, a simulation system for market risk analysis is to all intents and purposes an SDG. Except that it is a “simple” case: many variables, but all continuous numerical, at most discrete, and all similar in their macroscopic statistical traits (fat-tailed, heteroskedastic, autocorrelated, etc.). When one mixes customer demographic data, behaviour and relationship data, product and market data, and so on, it is easy to see that the situation becomes quite complicated.

After training, the SDG can generate an arbitrary number of synthetic data points for the problem whose original data it has digested: technically, the SDG generates an artificial sample by sampling from the multivariate probability distribution that describes the system or phenomenon under analysis. If the SDG is of good quality, the simulated data will retain most of the statistical and informative properties of the original data and can be used for a variety of purposes.
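
The train-then-sample cycle described above can be sketched in a few lines of Python. This is a deliberately tiny toy, not the generator discussed in this post: the records, the segment names and the modelling choice (segment frequencies plus a log-normal balance per segment) are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical toy records: (segment, account balance in EUR). Not real data.
REAL = [
    ("mass", 12_000.0), ("mass", 8_500.0), ("mass", 15_200.0), ("mass", 9_900.0),
    ("emerging", 55_000.0), ("emerging", 72_000.0), ("emerging", 61_500.0),
    ("hnwi", 1_400_000.0), ("hnwi", 2_100_000.0),
]

def fit(records):
    """Training step: estimate segment frequencies and, for each segment,
    the mean and standard deviation of the log-balance."""
    counts = Counter(seg for seg, _ in records)
    logs = defaultdict(list)
    for seg, balance in records:
        logs[seg].append(math.log(balance))
    params = {}
    for seg, values in logs.items():
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)
        params[seg] = (mu, math.sqrt(var))
    total = sum(counts.values())
    weights = {seg: n / total for seg, n in counts.items()}
    return weights, params

def generate(model, n, rng):
    """Generation step: Monte Carlo sampling from the fitted distribution."""
    weights, params = model
    segments = list(weights)
    out = []
    for _ in range(n):
        # First draw the nominal variable, then the balance given the segment.
        seg = rng.choices(segments, weights=[weights[s] for s in segments])[0]
        mu, sigma = params[seg]
        out.append((seg, rng.lognormvariate(mu, sigma)))
    return out

model = fit(REAL)
synthetic = generate(model, 1000, random.Random(42))
```

Sampling the segment first and the balance second is the simplest way to keep a nominal and a continuous variable dependent on each other. A real SDG has to do the same across many more variables and types.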


The use of synthetic data: three key applications

Generally speaking, synthetic data have three macro-applications: let’s take a look.


Create datasets for training machine learning algorithms

This is the case of Winterfell Bank: the bank’s original data are a poor sample for learning, since HNWI are too few. Generating additional artificial HNWI (similar to the real ones, yet different) lets the model learn from a wider and more representative sample, and the models repay this with better output quality. Synthetic data are thus a tool for developing and testing models under various conditions (avoiding, among other things, overuse of the original databases, with its serious danger of overfitting).
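
As a rough illustration of how a rare class can be enlarged with artificial members, the sketch below uses a simplified SMOTE-style interpolation between pairs of real records. This is a stand-in technique chosen for brevity, not the SDG described in this post, and the HNWI records are invented for the example.

```python
import random

# Hypothetical HNWI records: (age, assets under management in EUR millions).
real_hnwi = [(52.0, 1.8), (61.0, 3.2), (47.0, 1.2), (68.0, 5.5)]

def interpolate_minority(records, n_new, rng):
    """Create new points on the segment between two randomly chosen real records.

    Simplification: classic SMOTE interpolates towards nearest neighbours,
    while this sketch picks any two distinct records.
    """
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(records, 2)   # two distinct real records
        t = rng.random()                # position along the segment, in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

new_hnwi = interpolate_minority(real_hnwi, 20, random.Random(0))
augmented = real_hnwi + new_hnwi        # 4 real + 20 artificial HNWI
```

Interpolation can only produce points between existing records, so it adds volume rather than new information. A generator that samples from a learned distribution is not limited in this way.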


Protect privacy-sensitive data

This is the case of Jurassic Bank. While maintaining the basic statistical characteristics of the original data, the synthetic data do not contain the individual records of the original sample and so strongly protect privacy. Therefore, if Jurassic Bank has an SDG, it can generate a synthetic sample that does not carry the privacy issues of the original data and safely hand it to the consultants, who can develop the model in their own offices rather than in the bank’s basement, reducing project time and costs.
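
Before a synthetic sample leaves the building, one simple sanity check is to verify that no synthetic record sits on top of, or very close to, a real one. The sketch below is only an illustrative check under stated assumptions, not a formal privacy guarantee: the records are hypothetical, the features are assumed to be already rescaled to comparable units, and the threshold is arbitrary.

```python
import math

def nearest_real_distance(candidate, real_records):
    """Euclidean distance from one synthetic record to its closest real record."""
    return min(math.dist(candidate, r) for r in real_records)

def too_close(synthetic_records, real_records, threshold):
    """Return the synthetic records that lie within `threshold` of a real one."""
    return [s for s in synthetic_records
            if nearest_real_distance(s, real_records) < threshold]

# Hypothetical, already-rescaled records (two features each).
real = [(0.10, 0.80), (0.45, 0.30), (0.90, 0.55)]
synthetic = [(0.12, 0.79), (0.60, 0.60), (0.45, 0.30)]

# The first synthetic record is a near-copy and the third an exact copy
# of a real record, so both are flagged; the second is kept.
flagged = too_close(synthetic, real, threshold=0.05)
```

Flagged records would be dropped or regenerated before the sample is shared with external consultants.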


Simulate completely new situations

This is the case of Gondor Insurance: the aim is to explore new, complex situations that have never occurred. Here generating synthetic data is essentially the goal, not the means. We are in “What if?” territory. But the data must not be conjured out of thin air; they must be rooted in reality: in the case of Gondor Insurance, the scenario to simulate is not science fiction, but a set of phenomena that start from today’s reality and could reasonably happen. The process is broadly as follows:

  • a whole population of clients will be simulated; many will be new, while others will leave the scene;
  • every client will have preferences and needs and will interact with the environment, making decisions to buy and sell products and services;
  • these decisions will be reflected in the insurance company’s KPIs and will affect its value;
  • by analysing various product range hypotheses, it will be possible to make informed and numerically well-founded managerial choices;
  • all with low costs (testing, reputational, “intrusion” into the insurance agent-customer relationship, etc.), and a business understanding benefit for the entire organization.
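
The steps above can be sketched as a toy simulation in Python. Everything in it is a hypothetical assumption made for illustration: the client attributes, the budget ranges, the decision rule and the prices. In particular, the population is drawn from hand-picked distributions rather than from a trained SDG.

```python
import random
from dataclasses import dataclass

@dataclass
class Client:
    age: int
    budget: float   # maximum annual premium the client accepts (EUR)

def simulate_population(n, young_share, rng):
    """Draw an artificial client population; young clients get smaller budgets."""
    clients = []
    for _ in range(n):
        young = rng.random() < young_share
        age = rng.randint(25, 39) if young else rng.randint(40, 75)
        budget = rng.uniform(200, 800) if young else rng.uniform(500, 2000)
        clients.append(Client(age, budget))
    return clients

def premium_income(clients, product_range):
    """Decision rule: each client buys the most expensive product
    that fits the budget, or nothing if none does."""
    total = 0.0
    for c in clients:
        affordable = [price for price in product_range if price <= c.budget]
        if affordable:
            total += max(affordable)
    return total

# Scenario: after the generational shift, 60% of clients are young.
population = simulate_population(10_000, young_share=0.6, rng=random.Random(7))

current_range = [900.0, 1500.0]
extended_range = [300.0, 900.0, 1500.0]   # adds a low-cost product

income_current = premium_income(population, current_range)
income_extended = premium_income(population, extended_range)
print(f"current range:  {income_current:,.0f} EUR")
print(f"extended range: {income_extended:,.0f} EUR")
```

Changing `young_share` or the price lists and rerunning gives a quick numerical comparison of scenarios, which is the “What if?” exercise in miniature.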


The last case is the most profound and innovative application of synthetic data: generating artificial populations is in fact a fundamental step in Agent Based Modeling (ABM), a technique increasingly used to test situations and behaviors without getting caught up in privacy and organizational problems. ABM goes as far as creating entire artificial markets: it may seem bizarre, but it is already a reality, as illustrated by the Gondor Insurance case outlined above. And it is the future of marketing. We will discuss this in detail in a future post.


Virtual B’s solution

Virtual B has been working for years in the financial sector, in close contact with data and their analysis. Our experience has resulted in numerous solutions that generate value and solve problems for financial and insurance intermediaries.

Are you interested in expanding your corporate database with the implementation of synthetic data? Contact us to request a demo and receive our White Paper describing the principles behind Virtual B’s Synthetic Data Generator.
