As early as 2012, IBM declared that 90% of the world’s data had been created in the previous two years.

And here we are past the dawn of the Information Age and standing in the bright, blinding light of a new day.  

We’re swimming in oceans of information, inundated by the proliferation, commoditization and public distribution of data. Our daily tasks now involve a burgeoning amount of data and information, creating a daunting and overwhelming experience.

Data, by its very nature, can be complex and problematic. Many of us are ill-equipped to understand it. We’re data novices turned data nerds, struggling to understand the issues that were once solely the problem of traditional analysts and data scientists.  

With the advent of the Covid-19 pandemic and the injection of politics into how data is interpreted, even lay people, who previously didn’t need to ingest and understand data, find themselves tracking information, numbers and trends in ways they never used to. We can now all appreciate the pain that professional data users experience on a regular basis.  

What are the issues with using data?

There are many fundamental issues that can make data difficult to use and understand. Some of the most common are:

  • Accessibility
  • Trustworthiness 
  • Validity
  • Source Management
  • Presentation

Access

Data accessibility and the so-called “data divide” are hot topics, with Microsoft recently announcing its Open Data Campaign. But how are we affected daily?

We’re all trying to use data that we know has to be out there somewhere, in order to make data-driven decisions. Finding and retrieving that data is only the first step, but proves difficult for two main reasons:

  • Too few sources or insufficient data – sometimes you can only find one or two sources for the data, which means there’s nothing to compare it to. For example, if you’re interested in acquiring a vacation property and want to know if it’s a sound investment, there are only one or two freely available sources; other data lies behind a paywall or requires a real estate license.  With only limited sources, you have no comparisons or benchmarks.
  • Too many sources or too much data – there is just too much information to digest, much of it conflicting, or in multiple disparate formats. Many of us suffer from data overload, which affects consumers of big data as well as the everyday folk trying to make sense of the noise.

Trust

Once you’ve discovered data, can you trust it? Is it an infographic shared on social media by a well-meaning friend, or from a trusted media provider, or published from a scientific study, or a well-designed survey or investigation? Does the data have a marketing slant or political aim?

Entire papers have been written to address the issue of data provenance and trustworthiness. Even government-sponsored data sources can’t be taken at face value; they need to be evaluated in context.

Validity

Validity is a measure of the data’s integrity; however, I’m using the term somewhat differently from its classic data science sense. What I mean by validity can be summarized as:

“Validity in data collection means that your findings truly represent the phenomenon you are claiming to measure. Valid claims are solid claims.”

Although related to trust, this is a separate issue; even a trustworthy source must still provide citations and explain its methods. Sometimes you can rely on a third-party source or service for information: an organization that has already done the legwork, such as Snopes, or the US Census for population data.

Recently, scientific research and experimental results have been more widely disseminated; however, the provenance and reliability of such data should be investigated rather than taken at face value. Is it a recognized and reviewed source? Does the cited media article present unbiased results?

Sources

Available information is often collected from multiple sources and collaborators. Many sites presenting data during the current pandemic consolidate it from multiple sources, some of which can be erroneous. Marketing data may also have come from other sources without citation.

Are the underlying sources that feed a dataset trustworthy, and are they reliably updated? Who are the contributors to this collection of data? For example, on the commonly referenced site Worldmeters there was briefly a blip in the count of cases in Nevada.

Historical screenshot of results from Worldmeters site, taken on April 8, 2020.

The blip appeared because the site scraped its figures from the Nevada Health Response site and so reflected an error in that source. The source was quickly corrected, and a short time later the count was corrected too.

Data Visualization and Presentation

Is the data presentation biased? Can I easily interpret the results? What do these patterns mean? 

I’ve been engaged online quite a bit lately in helping people understand what a log axis means versus a linear one, and what confidence intervals mean. Lots of people are looking at charts carefully for perhaps the first time.
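
For readers who like to see the arithmetic, here is a minimal sketch of the difference between the two axes. The case counts are made-up numbers chosen to double each day, not real data. On a linear axis, the distance between points reflects the difference between values; on a log axis, it reflects their ratio, so steady doubling shows up as evenly spaced points.

```python
import math

# Hypothetical case counts that double each day (illustrative numbers only).
counts = [100, 200, 400, 800, 1600]

# On a linear axis, the gap between consecutive points is their difference.
linear_steps = [b - a for a, b in zip(counts, counts[1:])]

# On a log axis, a point sits at log10(value), so the gap reflects the ratio.
log_positions = [math.log10(c) for c in counts]
log_steps = [round(b - a, 3) for a, b in zip(log_positions, log_positions[1:])]

print(linear_steps)  # [100, 200, 400, 800] -> gaps keep growing
print(log_steps)     # [0.301, 0.301, 0.301, 0.301] -> gaps stay constant
```

A straight line on a log axis therefore means growth by a constant factor, not by a constant amount.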

Here’s an information graphic released by a government agency, clearly intended to show a trend on a date axis, but the dates were out of order. To some it was an intentional action to support a controversial position and mislead viewers.  Others saw it as a poor, but not ill-intentioned, presentation made out of ignorance.

Note the dates on the X axis… they are not sorted in chronological order. Source: Georgia Department of Public Health, May 17, 2020. The visualization has since been updated.
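
As a rough illustration of how a date axis can end up out of order, the sketch below sorts a few date labels two ways. The records, the `chronological` helper and the date format are all hypothetical; this is not a reconstruction of how the chart above was produced. It only shows that sorting dates as plain text can disagree with sorting them as dates.

```python
from datetime import datetime

# Hypothetical (date label, case count) pairs, listed out of order.
records = [("5/10/2020", 120), ("4/28/2020", 95), ("5/9/2020", 110)]

def chronological(rows, fmt="%m/%d/%Y"):
    """Return rows sorted by their parsed date, not by label text or value."""
    return sorted(rows, key=lambda row: datetime.strptime(row[0], fmt))

# Sorting the labels as text puts 5/10 before 5/9.
text_order = sorted(label for label, _ in records)

# Parsing the labels as dates first gives true chronological order.
date_order = [label for label, _ in chronological(records)]

print(text_order)  # ['4/28/2020', '5/10/2020', '5/9/2020']
print(date_order)  # ['4/28/2020', '5/9/2020', '5/10/2020']
```

Checking that the axis labels appear in `date_order` is a quick sanity test before reading a trend into a chart.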

The value of data needs to apply to everyone

Ultimately, data has to have relevance, meaning, importance, and application, even for us everyday data nerds. Data is about generating information and knowledge – it is a means to an end, not an end in itself. The acquisition, transformation, and realization of this value is always a group exercise, whether it’s social or professional.

Data platforms, apps and visualizations need to acknowledge the diversity of roles, skills and experiences required to be successful with data, and address the issues of access, trust, validation, collaboration, and presentation. By getting that right we can all unleash our inner data nerds.

Authored by John Kuo (Head of Product) at Harbr