Big Data or Small Data? The value is not about quantity but about quality

Data are said to be the fuel of the digital economy. Indeed, the speed, variety and volume of data of all kinds keep exploding: phone interactions, payment flows, personal activity on social media and on the web in general, real-time financial and economic data, geospatial data, personal data in various forms, company accounts data, online documents and so on. It seems an endless, ever-growing tidal wave.

Orders of magnitude of Big Data

Unlike in the past, companies today can exploit this huge amount of data through Data Science. This is possible thanks to three major trends:

  • the advent of the cloud and other technologies that can effectively handle large amounts of data;
  • the broad availability of open-source analytical languages and libraries (for example, Python and R) for Data Mining, Machine Learning, NLP and AI, and the development of lively communities (Kaggle, for example) that have spread the culture of Data Science;
  • the low cost of hardware, along with the steady increase in its processing power (see the following graph).

The relationship between the low cost of hardware and its processing power

This is it. We hear about Big Data almost everywhere. That’s because the idea behind it is so appealing that people puff out their chests when they declare that they are “doing Big Data” (yes, a phrase heard in the field).

All well and good; however, there is a pretty big misunderstanding about the idea of Big Data: many people believe that any Machine Learning application, or any other advanced analytical methodology used to extract value from data, is “Big Data Analysis”. It’s a common misconception among banks, insurance companies and asset management companies alike.

Often that is not the case: no offence intended, but not everything related to extracting value from data can be referred to as “Big Data”. Let’s see why.

What Big Data really are

According to the classic Gartner interpretation, dating back to 2001, Big Data can be described by the “Three Vs”:

  • Volume – The amount of data determines whether we can talk about Big Data or not: a real Big Data dataset is generally measured in zettabytes (10^21 bytes), although this is only an indicative threshold, since it keeps rising with the availability of data and computing power;
  • Variety – Here the nature of the data comes into play: Big Data include both structured data (such as numbers, labels or categories) and unstructured data (such as images, audio and video files, and text documents);
  • Velocity – Big Data are typically generated and processed in real time.

Then, there are those who also include the V of Veracity (linked to the quality of the information contained in the data, which is often variable and discontinuous) and that of Variability (i.e. the intensity at which data arrive, which can vary greatly over time and space). Having said that, I repeat that the concept of Big Data varies according to the context, and it is an idea that changes over time.

The point is that, according to the “Three Vs” (or, if you prefer, the “Five Vs”), Big Data require dedicated architectural and technological choices to collect, store, handle, process and visualize the data, in order to extract useful information, which is what really matters. We’re talking about cloud, clustering, parallel processing and MPP, virtualization, high connectivity, and so on. Open-source frameworks, such as Spark and Hadoop, are crucial tools for working effectively with Big Data. So, I repeat: Big Data involve dedicated architectural and technological choices, with related consequences on budgets and costs (especially for companies, such as financial ones, which are often reluctant to use cloud systems and remain attached to on-premise solutions).
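To make that concrete, here is a minimal PySpark sketch of the kind of distributed processing these frameworks enable; the cluster, the file path and the column names (customer_id, timestamp, amount) are hypothetical, not taken from any real system:

```python
# Minimal PySpark sketch: aggregating card transactions that are too large
# for a single machine. File path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("card-transactions-aggregation")
    .getOrCreate()
)

# Spark reads the dataset in partitions spread across the cluster's nodes.
transactions = spark.read.parquet("hdfs:///data/transactions.parquet")

# The aggregation runs in parallel on each partition,
# then the partial results are combined.
monthly_spend = (
    transactions
    .withColumn("month", F.date_trunc("month", F.col("timestamp")))
    .groupBy("customer_id", "month")
    .agg(
        F.sum("amount").alias("total_spend"),
        F.count("*").alias("n_transactions"),
    )
)

monthly_spend.write.mode("overwrite").parquet("hdfs:///data/monthly_spend.parquet")
```

The same few lines of logic could run in pandas on a laptop; what makes it “Big Data” is the cluster, the distributed storage and the operational machinery around it.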

Currently, many financial applications of Data Science simply do not involve Big Data. It’s just a lot of data. Lots of data, typically crunched by Machine Learning algorithms. But it’s not Big Data. And that is nothing to be disappointed about. On the contrary, it may even be better: remember that a Big Data architecture has cost and organizational implications. Let’s give a couple of concrete examples of what Big Data is and what it isn’t.

  • not Big Data – the MiFID/IVASS questionnaire data of 500,000 clients, their socio-demographic data, their investment positions over the last 5 years, their biographical data, the historical price series, risk data and economics of the 2,500 financial instruments they use, and the data on who/when/where each client is met.
  • Big Data – the detailed web and app navigation data of a bank with 500,000 customers, their account transactions, real-time details of the purchases made with credit cards, ATMs and payment apps, socio-demographic data, the complete 5-year history of the clients’ investments and their associated economics, the data on who/when/where each customer has been met, the recordings of conversations with the bank’s call centre and the history of interactions with the weekly newsletter.

The point is that the value does not lie in accumulating disproportionate amounts of data, but in extracting useful information from it and using it for concrete business actions: offering a more tailored service to customers, effective cross-selling and upselling, better customer retention through targeted communication, avoiding compliance issues, and so on. To do this, you need to collect the most relevant business data from all available sources, clean it up and store it in the most sensible way.
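Purely as an illustration (the file names and columns below are hypothetical), the “collect, clean and store” step often boils down to joining a few internal sources and fixing the obvious defects before any modelling:

```python
# Illustrative sketch of "collect, clean, store"; file and column names are hypothetical.
import pandas as pd

# Collect: a few internal sources, each keyed by customer_id.
clients = pd.read_csv("crm_clients.csv")            # socio-demographic data
positions = pd.read_csv("portfolio_positions.csv")  # investment positions
meetings = pd.read_csv("advisor_meetings.csv")      # who/when/where each client was met

# Clean: drop duplicates, normalise types, handle missing values.
clients = clients.drop_duplicates(subset="customer_id")
positions["valuation_date"] = pd.to_datetime(positions["valuation_date"])
positions["market_value"] = positions["market_value"].fillna(0.0)

# Combine into a single analysis-ready table.
dataset = (
    clients
    .merge(positions, on="customer_id", how="left")
    .merge(meetings, on="customer_id", how="left")
)

# Store it in a compact, typed format for later analysis.
dataset.to_parquet("clients_dataset.parquet", index=False)
```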

In short: relevance beats quantity 4-0. Scraping kitten photos from Facebook or acquiring data on tequila consumption will not help you boost the sales of your mutual funds or insurance policies.

Moreover, the “law of parsimony” always applies: in order to reduce the risk of overfitting and data snooping, it is crucial to keep the size of a dataset close to what is strictly necessary, which also contains storage and maintenance costs. These risks are especially high when “black-box” Machine Learning models are put in place.
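As an illustrative sketch of parsimony in practice (the dataset, the feature columns and the churn target are hypothetical, and the features are assumed to be numeric), one can let an L1-penalized model, validated out of sample, discard the variables that carry no signal:

```python
# Illustrative sketch: keeping only the features that actually carry signal.
# Dataset, column names and target are hypothetical; features assumed numeric.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("clients.csv")           # hypothetical client dataset
X = data.drop(columns=["churned"])          # candidate features
y = data["churned"]                         # what we actually want to predict

# The L1 (lasso) penalty pushes the coefficients of irrelevant features to zero.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)

# Out-of-sample validation guards against overfitting and data snooping.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {scores.mean():.3f}")

# Inspect which features survived the penalty.
model.fit(X, y)
coefs = pd.Series(model[-1].coef_[0], index=X.columns)
print(coefs[coefs != 0].sort_values(key=abs, ascending=False))
```

The point is not the specific model, but the discipline: fewer, more relevant variables, checked out of sample, rather than everything that can be hoarded.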

 

Last but not least, data must be understood. It is therefore a matter of combining two sides: on one hand, the technical capability to analyse data; on the other, knowledge of the business and of its processes. Otherwise, it is very easy to end up throwing tons of data into fascinating algorithms that in reality just implement a sophisticated form of overfitting, producing results that the people who should use them cannot understand. After the initial pseudoscientific infatuation, they will find it all useless and consign those data to oblivion. Money thrown away. But hey, cool: “we have Big Data”. That’s not how you do it.

To sum up, the value of data does not derive from its quantity, nor from its crude use, but from the information that results from processing it with both algorithms and business know-how: information that can be used for concrete professional decisions, which reinvent products, services and the relationship with customers.

Virtual B Fintech solutions

Virtual B has been working for years in the financial sector, with a close focus on data and data analysis. Our experience has led to numerous solutions that generate value and solve issues for financial and insurance intermediaries.

SideKYC® is an advanced data analytics software created by Virtual B for banks and insurance companies. SideKYC® can profile customers, identify individual needs and match them with the best product.

If you want to find out more, visit the page and contact us.

More on SideKYC®


Download our white paper “Wealth Management and Financial Data Science: a short guide”