We all know the basic challenges of sharing business data: departmental silos, incompatible software, legal and regulatory hurdles, colleagues who sincerely don’t have time to make it a priority, and sadly, colleagues who passive-aggressively won’t make it a priority. On top of all that, we can be tripped up by each other’s mental models.
As a simple example of this complex issue: How do you define “close of business”? When someone promises to send feedback by close of business today, does it mean 5 p.m., 6 p.m., midnight? In whose time zone? Imagine what happens if you don’t realize until it’s too late that you’re thinking 6 p.m. while your boss is thinking 5 p.m.
Worse than one term with multiple definitions is the reverse: we also create dozens of terms that essentially mean the same thing. That problem is especially vexing in the data-swamped world of academic science. Fortunately, scientists have developed successful ways to navigate the deluge, and those approaches can be of great benefit to enterprise organizations.
A FAIR choice
University labs are surprisingly independent realms of purpose-built devices and localized vocabularies, with results entered into private databases (or Excel spreadsheets or even paper notebooks). This worked fine in its jerry-rigged way when scientists were merely expected to publish methods and results. But recently, there has been a push from both funders and journals for researchers to share their raw data.
In response, and very much in that spirit, a working group of scientists, scholarly publishers, funding agencies, and corporate representatives developed the FAIR Guiding Principles for scientific data management and stewardship. The principles promote data management as “the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process.”
The FAIR acronym stands for findable, accessible, interoperable, and reusable. Arguably, the third principle, interoperability, is key to the other three. There are now several data repositories where researchers in specific disciplines can upload files, allowing a broader and deeper understanding of research questions.
Unfortunately, there’s such a tremendous amount of data that no human can search it. And the semantic confusion isn’t a simple one-to-one correspondence. Rather, it’s a multidimensional mess that includes spatial, temporal, and methodological differences, along with functional definitions. The obvious answer is machine learning, but it’s an enormous semantic challenge to parse hundreds of bespoke terms.
Is there an answer for scientists—and other data-challenged professionals—that's more productive than giving up in despair? FAIR co-author Maryann Martone, a neuroscience professor at the University of California, San Diego, is a specialist in ontologies. She is optimistic, remarkably so for someone looking at what might at first appear to be an intractable problem.
According to Martone, “Ontologies are a critical tool for creating and managing knowledge bases that help us communicate and verify the knowledge we think we have. If someone says X is in a motor region, how does a machine know what a motor region is? There’s an ontologic structure that says some features obligatorily appear together.”
In an ontology, individual terms are tagged to central concepts, each identified by a Uniform Resource Identifier (URI), with no weight given to any one term over another. Dog, Canis lupus familiaris, and Mr. Fluffy all map to the same URI. Because ideas can overlay each other, explains Martone, “ontologies allow you to construct a reasonable theory. Previously, we couldn’t even bring all the data together because we first had to navigate all the terms.”
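To make that mapping concrete, here is a minimal Python sketch. The dictionary, the URI, and the concept_for function are all invented for illustration; they don't correspond to any real ontology or toolkit.

```python
# A minimal sketch of term-to-URI tagging. The dictionary and URI below
# are invented for illustration; they are not entries in any real ontology.
from typing import Optional

TERM_TO_URI = {
    "dog": "http://example.org/ontology/DomesticDog",
    "Canis lupus familiaris": "http://example.org/ontology/DomesticDog",
    "Mr. Fluffy": "http://example.org/ontology/DomesticDog",
}

def concept_for(term: str) -> Optional[str]:
    """Return the shared concept URI for a term, or None if it is unmapped."""
    return TERM_TO_URI.get(term)

# Records that use different words now resolve to the same concept,
# so they can be pooled before any analysis begins.
print(concept_for("dog") == concept_for("Mr. Fluffy"))  # True
```

The point is simply that once every synonym resolves to one identifier, data recorded under different vocabularies can finally be brought together.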
Beyond buckets of data
Finding those co-occurrences leaves room for humans to sort out the bigger questions: Yes, you can now see that X apparently relates to Y, but is that relationship meaningful in any causal or other sense? For example, a medical researcher can search a data repository and recognize that dozens of terms that once seemed unrelated are all evidence of a specific condition. Equally important, URIs can help researchers detect gaps and outliers in data. If everything else says “rose” equates to pink or light red, what to make of the study that maps it to blue?
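As a rough illustration of how shared URIs surface an outlier like that blue rose, here is a hedged sketch. The records, the color attribute, and the URI are all made up; real repositories carry far richer metadata, and this is only a toy counting exercise.

```python
# A toy example of flagging an outlier once terms share a concept URI.
# Everything here (records, URI, the "color" attribute) is hypothetical.
from collections import Counter

ROSE_URI = "http://example.org/ontology/Rose"  # invented identifier

records = [
    {"term": "rose", "uri": ROSE_URI, "color": "pink"},
    {"term": "rose", "uri": ROSE_URI, "color": "pink"},
    {"term": "rosa", "uri": ROSE_URI, "color": "light red"},
    {"term": "rose", "uri": ROSE_URI, "color": "light red"},
    {"term": "rose", "uri": ROSE_URI, "color": "blue"},   # the odd one out
]

# Count how often each value is reported for the same concept and flag
# anything that appears only once; a human still decides whether the
# outlier is an error or an interesting finding.
counts = Counter(r["color"] for r in records if r["uri"] == ROSE_URI)
outliers = [value for value, n in counts.items() if n == 1]
print(outliers)  # ['blue']
```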
“Ontologies aren’t hierarchies. They don’t force you into categories; they just put some structure around the experimental edge of science,” says Martone. “It’s just a data pattern. You can compare the patterns and analyze them—and maybe learn there’s nothing fundamentally different or maybe the distinction is important.”