Data Enrichment: no data? No problem

“We have no data.”

How many times have we heard these words?

Machine learning, artificial intelligence, data science: wonderful, extremely powerful tools. But completely useless if you cannot feed them with data. In fact, it is not uncommon for banks, insurance companies and asset managers to complain about a lack of data. A lack that may concern the quantity or the quality of the data, or even their very existence in the case of more "unusual" variables.

Often this scarcity of data, in quantity or in quality, is not just a source of frustration for management: it undermines the launch or the outcome of AI projects in the broad sense, because data sit at the core of that kind of process, as the term "data science" itself suggests.

The good news is that there are solutions to this problem. They fall under the label of Data Enrichment or Data Augmentation: expanding the current sample with additional data, which in turn increases the amount of information available. At the end of the day, this means expanding the number of valid, informative data points. If you think of data in the classic rows/columns format (records/fields), it means adding more rows, more fields, or both.
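To make the rows-versus-fields idea concrete, here is a minimal sketch assuming pandas DataFrames; the tables, column names and figures are invented purely for illustration.

```python
# Two directions of enrichment: more fields (join) and more records (append).
import pandas as pd

clients = pd.DataFrame({
    "client_id": [1, 2, 3],
    "age": [34, 51, 47],
})

# More fields (columns): join an external table on a shared key.
income = pd.DataFrame({
    "client_id": [1, 2, 3],
    "estimated_income": [42_000, 61_000, 38_500],
})
wider = clients.merge(income, on="client_id", how="left")

# More records (rows): append additional observations, e.g. synthetic ones.
extra_rows = pd.DataFrame({
    "client_id": [4, 5],
    "age": [29, 63],
})
longer = pd.concat([clients, extra_rows], ignore_index=True)

print(wider)
print(longer)
```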

It sounds like trivial stuff. It is not. There are two main approaches to doing it.

 

External data

An idea that would occur even to a lichen: to enrich the database, you fetch the data you need from elsewhere. After all, there is no shortage of interesting public databases (ECB, Eurostat, OECD, World Bank Open Data, IMF Data, just to name a few), plus dedicated private data providers. So you just have to access, download and add columns (fields), right?

Ummh.

Even if the basic idea is intuitive and apparently trivial, putting it into practice is quite complex and full of tedious technical details. To name one: the absence of a unique identifier to rely on for a "join". So the matching has to be done another way, following a different logic: a probabilistic one.
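As a toy illustration of that probabilistic logic, the sketch below links records on name similarity instead of an exact key. The names and the 0.8 threshold are made up for the example; real pipelines combine richer features (address, date of birth, legal form, ...) and properly calibrated scores.

```python
# Probabilistic record linkage in miniature: no shared key, so we match
# on a string-similarity score and accept only confident matches.
from difflib import SequenceMatcher

internal = [{"id": 101, "name": "Mario Rossi S.p.A."},
            {"id": 102, "name": "Bianchi & Figli SRL"}]
external = [{"name": "MARIO ROSSI SPA", "region": "Lombardia"},
            {"name": "Bianchi e Figli s.r.l.", "region": "Veneto"}]

def similarity(a: str, b: str) -> float:
    """Crude string-similarity score in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.8  # illustrative cut-off, not a recommendation

for rec in internal:
    # Pick the most similar external record, keep it only above the threshold.
    best = max(external, key=lambda ext: similarity(rec["name"], ext["name"]))
    score = similarity(rec["name"], best["name"])
    if score >= THRESHOLD:
        print(rec["id"], "->", best["region"], f"(score={score:.2f})")
    else:
        print(rec["id"], "-> no confident match")
```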

Then there are differences in the granularity of the data: one source may report age classes 0-20, 20-30, 30-40 and so on up to 70+, while another may be based on classes 0-35, 35-50, 50-65 and so forth. We thus face similar but different taxonomies across sources, with plenty of duplicated information on top. You have to compare, choose and define ontologies, as sketched below. Not to mention the differences in the timing and frequency of data updates. Things like that.
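Here is a hedged sketch of what aligning those taxonomies can look like: two incompatible age-class schemes are re-mapped onto a common, coarser one. The common classes and the midpoint rule are modelling choices invented for the example, not a prescription.

```python
# Map source-specific age classes onto a shared taxonomy before fusion.
# Where a source class straddles a boundary (e.g. "30-40" vs a 35 cut-off),
# a rule must be chosen explicitly: here, assignment by class midpoint.

COMMON_CLASSES = [(0, 35, "0-34"), (35, 65, "35-64"), (65, 200, "65+")]

def map_class(label: str) -> str:
    """Assign a source age class (e.g. '30-40' or '70+') by its midpoint."""
    if label.endswith("+"):
        midpoint = float(label[:-1]) + 5  # crude convention for open-ended classes
    else:
        low, high = (float(x) for x in label.split("-"))
        midpoint = (low + high) / 2
    for low, high, name in COMMON_CLASSES:
        if low <= midpoint < high:
            return name
    raise ValueError(f"unmapped class: {label}")

source_a = ["0-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70+"]
source_b = ["0-35", "35-50", "50-65", "65+"]
print({c: map_class(c) for c in source_a})
print({c: map_class(c) for c in source_b})
```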

Anyway: it can be done. With crawling techniques, data fusion and the right architectural and data-modelling choices, you do end up with relevant information. For example, you can identify specific risks tied to the physical and social environment in which a client lives, income and financial capacity, investments and indebtedness, and other information that carries real weight for banks and insurance companies. We will come back to this in more detail another time.

 

Synthetic (or artificial) data

In other cases, the data needed to train a machine learning algorithm can be… created. I admit this sounds bizarre, but it is a very effective technique, one that is increasingly used and that, according to many, will play an ever more significant role in the future (for example, Andrew Ng, former Google AI manager and professor at Stanford, and Anima Anandkumar, who teaches at Caltech and leads the machine learning research team at NVIDIA).

In practice, you create "fake" data with the same statistical properties as the real data. They will be different from the original, yet very similar. So similar that they are almost indistinguishable to the human eye; see the graph below.

To generate a synthetic dataset you need a statistical simulation model, a Synthetic Data Generator, which is trained on real data (or on theoretical models, in some areas of the natural sciences) and then used to generate new data via Monte Carlo simulation. Beyond the modelling knowledge, business-domain know-how is crucial to build and use these frontier algorithms.
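As a deliberately simple illustration of the train-then-simulate logic, the sketch below fits a multivariate normal to an invented "real" sample and then draws new records by Monte Carlo sampling. Production generators rely on far richer models (copulas, GANs and the like), and every number here is a stand-in.

```python
# Minimal Synthetic Data Generator: estimate the statistical properties of a
# real sample, then simulate as many new records as needed.
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for real data: 500 clients, columns = (age, income, portfolio value).
real = rng.multivariate_normal(
    mean=[45, 50_000, 120_000],
    cov=[[100, 20_000, 60_000],
         [20_000, 1.0e8, 2.0e8],
         [60_000, 2.0e8, 1.0e9]],
    size=500,
)

# "Training": estimate mean vector and covariance matrix from the real sample.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# Monte Carlo generation: here, four times as many synthetic records.
synthetic = rng.multivariate_normal(mean=mu, cov=sigma, size=2_000)

# The synthetic sample reproduces the means and correlations of the original.
print(np.round(mu), np.round(synthetic.mean(axis=0)))
print(np.round(np.corrcoef(real, rowvar=False), 2))
print(np.round(np.corrcoef(synthetic, rowvar=False), 2))
```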

Synthetic data can increase and densify the rows (records), the columns (fields), or both. So when data are scarce in quantity or quality, synthetic data are an effective solution. Not to mention the advantages when it comes to sensitive data and privacy, which is certainly a concern in the financial sector.

The use of synthetic data is not limited to Data Enrichment for training AI/machine learning algorithms: it also extends to what-if analyses of all kinds, including simulations of completely new market situations (e.g. the medium-term impact of generational change on intermediaries' revenues). We will cover this topic in another post as well.

 

Moral of the story

The paradox of living in a society flooded with data that companies often fail to use can be partially solved with Data Enrichment techniques, which let you keep working, instead of stopping relevant projects, when first-hand data collection falls short.

Are you interested in expanding your database with synthetic data? Contact us for a demo of our Data Enrichment process.

Request a demo