Solving An Age-old Problem 

If you’ve ever worked with data, you’ll be aware of the problem. Slow, ad-hoc, expensive and risky supply chains for data are the norm and they simply don’t work for consumers or suppliers.

Despite years of technological investment (ETL, wrangling, warehouses, lakes, and marketplaces), we still spend ‘80% of our time just getting the data’. As our need to supply and consume data increases, so does the scale of the problem.

A typical data supply chain can be broken down into various phases, each of which varies in value, repeatability and predictability:

[Figure: the phases of a typical data supply chain and the ‘80% of the time’ spent acquiring data]

With significant time and effort invested across each of these phases, the true cost of data acquisition is greater than most realize. Scarce and expensive technical specialists such as data scientists and data engineers spend most of their time acquiring data rather than using it, which is a poor use of highly talented people. Meanwhile, those in non-technical roles (e.g. legal, compliance, procurement, sales and marketing) interact via slow, manual processes with multiple hand-offs. These processes provide few opportunities to measure, manage or optimize the supply chain, and they contribute to operational risk. Business projects take months or years to deliver. Opportunities to gain unique insight are missed. The time taken to acquire data hinders everything from building and managing operational processes to taking strategic, data-driven decisions.

All this occurs against a backdrop of organizations generating more data than ever before, which dramatically increases the data opportunities within and between organizations. But taking advantage of these opportunities often proves difficult, especially when the data required sits outside your organization. In our hyper-connected world, a single organization’s data footprint is rarely sufficient. Mass data sharing has become a social and economic imperative, yet few organizations have achieved it effectively. New technologies continue to emerge in line with the exponential improvements described by Moore’s Law, and techniques such as AI and machine learning have become feasible. To be effective, many of these technologies and approaches require huge quantities of diverse data. Yet the archaic acquisition and distribution processes within and between organizations fail to satisfy them now, let alone tomorrow.

Time For Change

Pervasive for decades, this age-old problem needs to be solved in a sustainable way. Not only will this reduce cost, time and risk, it will also establish a foundation for a competency that’s increasingly critical to every organization’s competitiveness and commercial success. Achieving this requires a strong understanding of the unique characteristics of data that make it especially challenging to acquire and distribute. We need to shift from viewing data as ‘technical’ to treating it as the heart of an organization’s knowledge, capabilities and intellectual property.

There are many ways to attack this problem, but it is so vast and complex that no single solution is conceivable. Solving it will require a combination of many technologies and strong interoperability; any data silo, no matter how big, is still a data silo and will fail to satisfy the diverse needs of consumers and suppliers. The advent of the cloud creates a compelling foundation: with so much data now stored in just a handful of locations, technical access is easier than it has ever been. Alas, interoperability between clouds is hampered by export charges that make data movement expensive if it is not well managed. Even then, the cloud solves only part of the technical challenge, and data acquisition is primarily a business challenge. The inefficiencies around the speed, reliability and cost of a given transfer mechanism pale into insignificance compared to the business issues of governance, compliance, security and fair value transfer.

Focus on Trust

When acquiring data, the biggest point of friction is trust. Yet existing processes do little to support the establishment and maintenance of trust, instead favouring administrative risk management. While this approach works for many commodities, it is ineffective for data.

Data suppliers need to trust that consumers won’t misuse the data: it can be easily copied and distributed, and it can be adapted to the point where it bears no resemblance to the original product. This is a far higher standard than would be needed for other commodities. Meanwhile, data consumers need to trust that the data is fit for purpose, reliable and cost effective, which is more difficult to prove than for other commodities. Navigating these requirements currently takes both sides a long time, especially with the advent of legislation designed to protect data and data subjects (e.g. GDPR, CCPA, HIPAA) that invariably adds risk and complexity.

Changing this requires a mechanism that offers assurance to both suppliers and consumers. One way to achieve it is to enable unfettered access to data in a secure environment without the supplier losing ownership. This significantly accelerates the acquisition and distribution process and can be extended into an ongoing ‘access only’ agreement, enabling use cases that would not be possible if they meant giving up ownership of the data. Another is the ability to rapidly create different data products for different audiences that meet the needs of both parties. This also helps avoid row- and column-level access management, which can quickly become a data management nightmare and ultimately undermine the very trust it was designed to foster.
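As a purely illustrative sketch of the ‘different products for different audiences’ idea (the dataset, column names and audience definitions below are invented for the example, not taken from any particular platform), the following Python/pandas snippet derives two tailored data products from a single source dataset instead of layering row- and column-level access rules onto one shared copy:

```python
import pandas as pd

# Hypothetical source dataset held by the supplier (columns invented for illustration).
source = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "region": ["EU", "EU", "US", "US"],
    "email": ["a@example.com", "b@example.com", "c@example.com", "d@example.com"],
    "monthly_spend": [120.0, 95.5, 210.0, 60.0],
})

def build_product(df, regions, columns):
    """Create a tailored data product: keep only the rows (regions) and columns
    agreed with a specific audience, rather than enforcing per-row / per-column
    rules on a single shared copy."""
    return df[df["region"].isin(regions)][columns].reset_index(drop=True)

# Product for an external analytics consumer: EU rows only, no direct identifiers.
eu_analytics = build_product(source, ["EU"], ["region", "monthly_spend"])

# Product for an internal marketing team: US rows only, identifiers retained.
us_marketing = build_product(source, ["US"], ["customer_id", "email", "monthly_spend"])

print(eu_analytics)
print(us_marketing)
```

Each audience receives only the product agreed with the supplier, so access rules stay simple and the supplier retains the original data.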

Put Data First

Arguably, there has been too much focus on technology and not enough on data. Technologies are useless without data, but even the most basic technologies can achieve a surprising amount with data: the continued use of Excel for data analysis is a great example.

Making real progress requires a combination of great technology and great data; the two need to be in balance. Yet we’ve spent so much time and effort on technologies that solve technical problems that we have failed to use technology to solve the business problem of acquiring data. As a result, many technologies fail to live up to the hype for lack of data, and the business continues to struggle with bureaucratic, manual processes.

Consider the many categories of technology designed to aggregate data (ETL, wrangling, warehouses and lakes) and those designed to move it (SFTP, APIs and intra-platform sharing). None of these tackles the underlying business problems that prevent the rapid acquisition of data. What’s needed are technologies that support and optimize both the technical and non-technical interactions between people: technologies that support those with a critical role in ensuring the rapid acquisition, distribution and monetization of data.

Cloud platforms coupled with modern software approaches provide an opportunity to solve this; however, some platforms are simply adding data marketplaces, a paradigm with a poor track record of solving this problem. Others are adding intra-platform data sharing mechanisms that seem helpful but further exacerbate the problem, adding yet more technical diversity and complexity without solving the underlying business problem.

Reject Short-Termism

Data acquisition happens at a point in time, when building an application or trying to answer a pertinent question, so individual stakeholders often encounter the problem infrequently. This makes it appear ad hoc and, consequently, it is treated as such, with manual processes, bespoke agreements and little or no repeatability. However, when viewed across an organization or over time, the need for data is persistent and predictable, and it is critical to address it in a sustainable way. Looking at historic trends, this need is only likely to increase:

  • Technology evolves and is increasingly capable of managing growing volumes of data
  • The digitization of our lives creates more data and, in parallel, an increased need for data to empower those digital journeys and deliver high-value customer experiences  
  • Competition is constant and data and models are a crucial source of competitive advantage
  • Increasing interconnectedness creates a need for ever-greater context to navigate complexity and deliver real-world outcomes, which can only come from more data and better data, often from outside the organization

There are few, if any, organizations with sufficient data to achieve all their objectives. This will not change unless they take a long-term view and invest in a fundamental transformation of how data is acquired and distributed. When it comes to data, it’s important to invest in the future by ensuring there is an efficient and effective supply of this critical resource.

Embrace Customization

While optimization and standardization should be built into all parts of the data acquisition process, some aspects will remain custom: data adaptation, use cases, destinations and approaches to monetization. Yet all of these can be managed in such a way as to enable reuse if required.

For example, on initial access, data consumers typically spend much of their time preparing the data for use: profiling, cleansing, restructuring, joining with other data, reformatting, labelling for AI/ML use cases, and so on. This customization is an inevitable and important part of the process because what the supplier provides is rarely what’s consumed: the supplier knows the data, but the consumer knows the use case.
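As a rough, hypothetical illustration of those preparation steps (the file names, column names and labelling rule below are invented for the example), a consumer might work through a supplier’s extract like this in Python/pandas:

```python
import pandas as pd

# Hypothetical inputs: a supplier's extract plus the consumer's own reference data.
supplier = pd.read_csv("supplier_extract.csv")       # invented file name
customers = pd.read_csv("internal_customers.csv")    # invented file name

# 1. Profile: understand shape, types and missing values before changing anything.
supplier.info()
print(supplier.describe(include="all"))

# 2. Restructure / reformat: normalize column names and parse dates.
supplier.columns = [c.strip().lower() for c in supplier.columns]
supplier["transaction_date"] = pd.to_datetime(supplier["transaction_date"])

# 3. Cleanse: remove duplicates and rows missing the join key.
supplier = supplier.drop_duplicates().dropna(subset=["customer_id"])

# 4. Join with internal data to add the context the use case needs.
prepared = supplier.merge(customers, on="customer_id", how="left")

# 5. Label for a hypothetical ML use case: flag high-value transactions.
prepared["high_value"] = (prepared["amount"] > 1000).astype(int)

prepared.to_csv("prepared_dataset.csv", index=False)  # the reusable, consumable version
```

Captured once and shared, steps like these are exactly the customizations that the next consumer of the same data should not have to repeat.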

To avoid wasting time on the repetitive parts of this process, it is crucial to be able to share these customizations with others. Not only will this reduce the future burden of acquiring data and support standardization, it will also reduce cost by avoiding duplicated effort and a proliferation of versions of the same data. This approach builds repeatability into the supply chain while embracing the necessary customization. It also enables consistency and governance in a convenient way and helps avoid industrial-scale data remediation exercises in the future. By embracing customization and building in optional repeatability, the entire data supply chain becomes as efficient as possible.

A Solution

Based on the above, we think a software platform that solves this problem should have the following features:

  • Provides complete platform ownership to address trust issues and ensure organizations own their data, models and the network effects of their data ecosystem
  • Enables platform owners to decide who participates in exchanging and collaborating on data and models with their organization
  • Recognizes the crucial roles played by non-technical users to help solve the core business problems within a strong entitlements framework
  • Combines the intuitive interface of a catalog or marketplace with the ability to provide secure access to data and models without loss of ownership
  • Enables the creation and sharing of customized data versions that bring suppliers and consumers closer together, not further apart

Authored by Anthony Cosgrove (Co-Founder) at Harbr