Managing Unstructured Data

Text is free-form, rarely contains explicit keys, is highly non-repetitive, and derives its meaning from context within sentences and documents rather than from a predefined schema.

The world of unstructured data and text:

News reports
Newspapers
Government reports
Comic books
The Internet

Structured data is record-based, stored in uniform tables where all rows share the same schema; context is fixed at design time via the data model and DDL.

Data Model

A classical data model exists at several levels of abstraction. The highest level is the ER (entity relationship) model. Each entity described in the ER model is expanded at a lower level of detail. That mid level, the data item set, holds the keys, attributes, and relationships.

At the lowest level of the structured data model, the model is described to the DBMS (database management system). This level is called the DDL (data definition language). The levels of the data model are related to each other in much the same way as a globe of the world relates to Texas, and Texas relates to the city of Dallas.
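As a minimal sketch of that lowest level, the snippet below hands a small DDL script to SQLite through Python's built-in sqlite3 module. The customer and account entities, and all of their column names, are hypothetical and chosen only to show where keys, attributes, and relationships end up.

```python
import sqlite3

# Hypothetical entities; table and column names are illustrative only.
DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,   -- key
    name        TEXT NOT NULL,         -- attribute
    state       TEXT                   -- attribute
);
CREATE TABLE account (
    account_id  INTEGER PRIMARY KEY,   -- key
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),  -- relationship
    balance     REAL                   -- attribute
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# The DDL fixes the schema at design time: every row of "customer"
# will carry exactly these columns.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]
print(columns)
```

The context of every value is decided here, before any data is loaded, which is the point the data model section makes about structured data.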

Text Model

The world of unstructured data has its own model – a text model.

There are at least two levels of granularity in the text model: the chunk level and the word level. A chunk is a collection of words that together express a thought. The word level is the lowest level of granularity of the unstructured environment. Each of these levels of granularity has its own advantages and disadvantages.

Unstructured data at the chunk level is good for understanding the reliability and the veracity of the thought that is being expressed.
Unstructured data at the word level expresses text at the lowest level of granularity. Words are good for simple contextualization, but not for the complete expression of a thought.

Both chunks and words require context in order to be meaningful.

Take the word “fire”. Fire can mean a flame or a conflagration. Fire can mean pulling the trigger of a loaded gun. Fire can mean the involuntary termination of employment.
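The three senses of “fire” can be told apart by looking at the surrounding words. The sketch below is a deliberately naive illustration, not the method used by any particular tool: the cue-word sets and the function name sense_of_fire are assumptions made up for this example.

```python
import re

# Hypothetical cue words for each sense; real taxonomies would be far richer.
SENSE_CUES = {
    "flame": {"smoke", "burned", "flames", "forest"},
    "discharge a weapon": {"gun", "trigger", "rifle", "shot"},
    "dismiss from a job": {"employee", "manager", "job", "hired"},
}

def sense_of_fire(sentence):
    """Pick the sense whose cue words overlap most with the sentence.

    Returns None when no cue word is present, i.e. when the sentence
    gives no context to decide with.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(sense_of_fire("The manager decided to fire the employee."))
print(sense_of_fire("He pulled the trigger to fire the gun."))
print(sense_of_fire("Fire"))
```

The last call returns None: without surrounding words, the bare term cannot be assigned a meaning.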

The text model is built on an ontology, a domain vocabulary. The ontology consists of one or more taxonomies. A taxonomy is a vocabulary that relates to a single subject. The taxonomies in an ontology may or may not be related to one another.

The ontology for medicine might contain taxonomies for medications, diagnoses, oncology, dermatology, and the ICU.
The ontology for a bank might include such taxonomies as account management, loans, credit cards, savings, and so forth.
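One simple way to picture an ontology is as a mapping from taxonomy names to sets of terms. The sketch below uses a made-up fragment of a banking ontology; the terms and the helper taxonomies_for are assumptions for illustration only.

```python
# Hypothetical, tiny ontology: taxonomy name -> set of terms.
banking_ontology = {
    "loans": {"mortgage", "auto loan", "interest rate"},
    "credit cards": {"credit limit", "interest rate", "annual fee"},
    "savings": {"savings account", "certificate of deposit"},
}

def taxonomies_for(term, ontology):
    """Return the sorted names of every taxonomy that lists the term."""
    return sorted(name for name, terms in ontology.items() if term in terms)

# A term shared by two taxonomies shows one way taxonomies can be interrelated.
print(taxonomies_for("interest rate", banking_ontology))
# A term that belongs to a single taxonomy.
print(taxonomies_for("annual fee", banking_ontology))
```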

Context

Both structured data and text require context, but structured data gets explicit context at schema-definition time, while text’s context is inferred from surrounding words (e.g., different meanings of “fire”).

Structured data uses vertical structuring (columns define context, values are instances), whereas textual databases use horizontal structuring: each row pairs a term instance with its context, allowing arbitrary relationships to emerge from text.
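The contrast between the two layouts can be sketched by turning one vertical record into horizontal rows. The field names doc_id, term, and context, and the sample values, are assumptions chosen for this illustration.

```python
# Vertical structuring: the column names carry the context.
patient_record = {"patient_id": 17, "state": "Colorado", "medication": "Multaq"}

def to_horizontal(record, key="patient_id"):
    """Rewrite one vertical record as term/context rows."""
    return [
        {"doc_id": record[key], "term": value, "context": column}
        for column, value in record.items()
        if column != key
    ]

rows = to_horizontal(patient_record)

# Horizontal structuring: a new kind of context is just another row,
# with no change to the layout of the existing rows.
rows.append({"doc_id": 17, "term": "smoker", "context": "habit"})

for row in rows:
    print(row)
```

In the vertical layout, recording a habit would have required a new column; in the horizontal layout every row has the same three fields regardless of what the text mentions.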

TETL, ontologies, and dynamic text structuring

Textual ETL (TETL) ingests raw text from many sources (internet, voice, print, email), matches it against domain taxonomies, and emits a structured text database of term–context pairs.[williaminmon.substack]
Unlike static DBMS schemas, this horizontal text structure is dynamic: any relationship mentioned in text can be represented, supporting flexible queries like “patients from Colorado taking Multaq” without prior schema design.[williaminmon.substack]
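The idea can be sketched in a few lines, with the caveat that this is a toy stand-in and not how any actual textual ETL product works. The taxonomies, the sample notes, and the function names textual_etl and docs_with are all hypothetical.

```python
import re

# Hypothetical taxonomies: context name -> terms to look for.
TAXONOMIES = {
    "state": ["Colorado", "Texas"],
    "medication": ["Multaq", "aspirin"],
}

def textual_etl(documents):
    """Match raw text against the taxonomies.

    documents maps a document id to its raw text. The result is a list of
    (doc_id, term, context) rows, one per taxonomy term found in a document.
    """
    rows = []
    for doc_id, text in documents.items():
        for context, terms in TAXONOMIES.items():
            for term in terms:
                pattern = rf"\b{re.escape(term)}\b"
                if re.search(pattern, text, flags=re.IGNORECASE):
                    rows.append((doc_id, term, context))
    return rows

def docs_with(rows, term, context):
    """Ids of the documents in which the term appears with the given context."""
    return {doc_id for doc_id, t, c in rows if t == term and c == context}

# Invented sample notes, for illustration only.
notes = {
    1: "Patient from Colorado, currently taking Multaq for atrial fibrillation.",
    2: "Patient from Texas, taking aspirin daily.",
    3: "Colorado resident, no medications reported.",
}

rows = textual_etl(notes)

# "Patients from Colorado taking Multaq", answered without designing a schema first.
matches = docs_with(rows, "Colorado", "state") & docs_with(rows, "Multaq", "medication")
print(sorted(matches))
```

The query is just a set intersection over term/context rows, so any other combination mentioned in the text can be asked the same way.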

Analytical power and COVID example

The article describes loading a TETL-generated medical text database (from 10,000 COVID-related records) into a Pearson correlation matrix to find correlations between COVID and factors like smoking, age, sex, and medications.[williaminmon.substack]
Because the pipeline from raw text to matrix takes around ten minutes, analysts can iteratively adjust taxonomies (add/remove words, tweak contexts) and rerun analyses, enabling exploratory, heuristic use of unstructured data for business value.[williaminmon.substack]
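A Pearson correlation matrix over such data can be sketched with the standard library alone. The 0/1 indicator columns below are invented toy values, not figures from the article, and the sketch assumes every column varies (a constant column would make the denominator zero).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Pairwise Pearson correlations between all named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Invented toy data: one entry per record, 1 if the term was found in it.
columns = {
    "covid":  [1, 1, 1, 0, 0, 0],
    "smoker": [1, 1, 0, 0, 0, 1],
    "multaq": [0, 1, 0, 0, 1, 0],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Changing a taxonomy changes which indicator columns exist, so rerunning this step after each adjustment is what makes the iterative exploration described above possible.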
