Bytes and Diamonds

ARTICLE by Ignacio Chechile

Sep 09, 2022

As we speak, petabytes of data are stored only to be ignored forever. It may be time-series telemetry from a machine, sales data from an e-commerce platform, or video footage from a camera pointed at a bucolic street. Neglecting data is a growing problem: we generate more and more data, but humans are still needed to perform all sorts of data cleansing, wrangling, tidying [1], and feature engineering so that algorithms perform better. No matter how sophisticated or complicated a machine learning algorithm might be, it still requires a human brain equipped with good domain knowledge to make it aware of the nuances it cannot parse by itself. In the words of Robert Monarch: no algorithm survives bad data [2].
But the number of data wranglers and feature engineering experts is not growing at the same rate as the data. Therefore, just like an overflowing sink, raw data is filling hard disk drives everywhere. Ironically, such data “waste” is actively backed up, that is, ignored not once but several times, just in case. You never know.
This data surplus appears to be a ‘good problem’ to have: surely it is better to have more data than less. But is it? Data seems to exhibit what economics calls diminishing marginal utility. The idea is illustrated by the famous diamond-water paradox, in which water, an element essential for life, is far cheaper than diamonds, an object of much less practical use. One way to look at this paradox is to apply the simple principles of supply and demand. Water is available at almost no marginal cost [3] (although many would argue that this is already changing), so relative to demand its equilibrium price [4] is low or negligible. Diamonds, on the other hand, are in high demand and are expensive to find and produce, so supply is limited and the supply and demand curves intersect at a high price. Hence water is cheap and diamonds are dear.
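To make the supply-and-demand reading of the paradox concrete, here is a minimal sketch in Python. It assumes straight-line demand and supply curves, which is a simplification, and every number in it is hypothetical, chosen only so that "water" has abundant supply and "diamonds" scarce supply. The function name `equilibrium_price` is likewise made up for this illustration.

```python
def equilibrium_price(demand_intercept, demand_slope, supply_intercept, supply_slope):
    """Price at which linear demand meets linear supply.

    Assumed (hypothetical) curves:
        demand:  Qd = demand_intercept - demand_slope * price
        supply:  Qs = supply_intercept + supply_slope * price
    Setting Qd == Qs and solving for price gives the formula below.
    """
    price = (demand_intercept - supply_intercept) / (demand_slope + supply_slope)
    # A negative result would mean supply exceeds demand even when free.
    return max(price, 0.0)


# Same demand curve for both goods; only the supply side differs.
# Water: plenty is supplied even at a price near zero.
water_price = equilibrium_price(100.0, 2.0, 90.0, 8.0)
# Diamonds: nothing is supplied until the price is high, and supply grows slowly.
diamond_price = equilibrium_price(100.0, 2.0, -20.0, 0.5)

if __name__ == "__main__":
    print("water:", water_price)
    print("diamonds:", diamond_price)
```

With identical demand, the scarce good ends up with the much higher equilibrium price, which is the whole point of the paradox.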

Data follows a similar pattern. The more data there is, the lower the value of each extra ‘unit’ of data generated. Yet during a data drought (typically during failures, or a crime in the case of video footage) every single bit of information becomes a matter of life or death.
You would think data is *always* valuable, as everyone makes it out to be. But data can show the same effects that other intangible assets show, such as sunkenness [5]. Certain types of data are difficult to liquidate or sell to others, because such data has value, if any, only to its creator: think of temperature data from an IoT-equipped fridge stored in the cloud. For free, any data scientist would accept practically any data set as a gift. For a price, that’s a different story.
Also, research has shown that, as data sets grow larger, they must contain arbitrary correlations. These correlations appear simply because of the size of the data, which means that many of them will be spurious. Too much information tends to behave like very little information.
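A quick way to get a feel for this is to generate pure noise and count how many "strong" correlations show up. The sketch below is not the research result mentioned above; it is a simpler, related illustration (more variables means more pairs to compare, so more chance matches). The column counts, the 30 rows, and the 0.4 threshold are arbitrary assumptions, and `count_spurious` is a hypothetical helper written for this example.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_columns, n_rows=30, threshold=0.4, seed=0):
    """Count column pairs of independent noise whose |r| exceeds threshold.

    Every column is independent Gaussian noise, so any correlation
    found here is spurious by construction.
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0.0, 1.0) for _ in range(n_rows)] for _ in range(n_columns)
    ]
    return sum(
        1
        for a, b in itertools.combinations(columns, 2)
        if abs(pearson(a, b)) > threshold
    )


if __name__ == "__main__":
    for n_columns in (5, 20, 80):
        n_pairs = n_columns * (n_columns - 1) // 2
        print(n_columns, "columns,", n_pairs, "pairs,",
              count_spurious(n_columns), "above threshold")
```

None of the columns has anything to do with any other, yet the number of pairs that clear the threshold keeps climbing as columns are added. Someone mining such a data set without domain knowledge would have no way to tell these from real relationships.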
And, last but not least, data gets old. For instance, Earth observation data loses value as time passes and as the scenery under observation changes. Try using Google Street View when the imagery available is 10 or more years old (as is the case in some parts of Helsinki) and the buildings and features of the landscape have all changed [6].

At a time when Machine Learning and Artificial Intelligence are the buzzwords everyone wants to spout in their marketing materials, it is good to think about the practical limits of data hoarding and about the effort, human effort, required to improve data quality so that algorithms perform better. My thoughts are with those lonely data bytes sitting in forgotten drives, collecting dust while waiting for a data wrangler who will most probably never show up.


[1] https://www.jstatsoft.org/article/view/v059i10

[2] Robert Monarch is the author of “Human In the Loop Machine Learning”: https://www.manning.com/books/human-in-the-loop-machine-learning

[3] In economics, the marginal cost is the change in the total cost that arises when the quantity produced is incremented; i.e., the cost of producing additional quantity.

[4] The equilibrium price is the price at which the supply of goods matches demand.

[5] For more details about intangible assets, see Capitalism without Capital: The Rise of the Intangible Economy by Jonathan Haskel and Stian Westlake.

[6] Old data of almost any kind should, as long as ethics allows, be given away for free.
