Datamining Intelligence.
Structured extraction of named entities, attributes, and relationships across large source corpora. Produces datasets, rosters, and entity graphs.
- TYPICAL CADENCE: One-shot; refresh on request
- TYPICAL SOURCE COUNT: Variable; schema-driven
- TYPICAL OUTPUT LENGTH: Structured data, not prose
- TYPICAL CONFIDENCE REGISTER: Coverage-weighted; per-field reliability
// QUESTION CLASS
Produce the list, the roster, the dataset.
Datamining is how The Watch turns unstructured source material into structured data. When the question is who the people in a category are, what events happened in a period, or what dataset to build and with which fields, Datamining is the engine that answers it. This is not a report product in the narrative sense; the output is a table, a roster, a timeline, or a graph file. It is consumed programmatically as often as it is read.
Datamining also feeds the other engines. Baseline reports rely on it to extract structural facts. Forecast runs rely on it for indicator data. Decision targeting relies on it for node inventories. When the output of another engine lists specific entities, dates, or numbers, Datamining produced that inventory.
// ANATOMY
A Datamining output contains:
A Datamining output is a dataset with a cover document. The dataset itself is the deliverable — a table, a graph, a timeline — in whatever format the customer consumes. The cover document explains how it was built: what the schema was, what queries ran, what sources were scanned, what was rejected by quality gates, and what the analyst's confidence is in each field of the schema. A reader who wants to use the dataset for downstream analysis needs to know what they can trust in it; the cover document is that instrument panel.
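As an illustration only, the cover document described above could be modeled as a small record. Every name below (`CoverDocument`, its fields, the `HIGH`/`MODERATE` labels) is a hypothetical stand-in, not The Watch's actual format; only the LOW label and the list of what the cover document explains come from this page.

```python
from dataclasses import dataclass


@dataclass
class CoverDocument:
    """Hypothetical shape of a cover document; all names are illustrative."""
    schema: dict[str, str]                # field name -> field definition
    queries_run: list[str]
    sources_scanned: list[str]
    rejected_by_quality_gates: list[str]  # items that failed a quality gate
    field_confidence: dict[str, str]      # field name -> HIGH / MODERATE / LOW

    def trusted_fields(self, minimum: str = "MODERATE") -> list[str]:
        """Return schema fields whose confidence meets the given floor."""
        order = {"LOW": 0, "MODERATE": 1, "HIGH": 2}
        return [
            name for name in self.schema
            # A field with no stated confidence is treated as LOW (assumption).
            if order[self.field_confidence.get(name, "LOW")] >= order[minimum]
        ]


cover = CoverDocument(
    schema={"name": "Full legal name", "role": "Current position"},
    queries_run=["example query"],
    sources_scanned=["source-a", "source-b"],
    rejected_by_quality_gates=["source-c"],
    field_confidence={"name": "HIGH", "role": "LOW"},
)
print(cover.trusted_fields())
```

A downstream reader could use a check like `trusted_fields` to decide which columns are safe to build on before touching the dataset itself.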
// TRADECRAFT
Structured data inherits the problems of its sources. The cover document is how we show our work.
Schema first.
The fields are declared before the extraction runs. A Datamining product is only as rigorous as its schema — an ill-defined field produces unusable data. The Intake step pressure-tests the schema against the source universe before the run: do the sources actually support these field definitions, at this coverage target, within this time bound? If not, the schema is revised before anything runs.
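One way to picture the coverage part of that pressure test is a rough sketch like the following. It assumes a sample of source records is available as dictionaries and that "support" means a non-empty value; the function names and the 0.75 target are hypothetical, and the time-bound check is not modeled here.

```python
def field_coverage(sample: list[dict], fields: list[str]) -> dict[str, float]:
    """Share of sampled records that supply a non-empty value for each field."""
    total = len(sample)
    coverage = {}
    for name in fields:
        filled = sum(1 for record in sample if record.get(name) not in (None, ""))
        coverage[name] = filled / total if total else 0.0
    return coverage


def fields_below_target(sample: list[dict], fields: list[str], target: float) -> list[str]:
    """Fields the sample cannot support at the coverage target (candidates for revision)."""
    coverage = field_coverage(sample, fields)
    return sorted(name for name, share in coverage.items() if share < target)


# Hypothetical sample drawn from the source universe before the run.
sample = [
    {"name": "A", "founded": "2001"},
    {"name": "B", "founded": ""},
    {"name": "C"},
    {"name": "D", "founded": "1999"},
]
weak = fields_below_target(sample, ["name", "founded"], target=0.75)
print(weak)  # fields to redefine or drop before anything runs
```

In this toy sample, `founded` is present in only half of the records, so it would be flagged for revision before the extraction starts.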
Structured Extraction · Schema-first methodology
Per-cell citation.
Every cell of the output table carries the source it came from. Not per-row, not per-entity — per-cell. This is how a downstream analyst knows they can use the data: they can verify any individual datum against the specific source that produced it. Structured data without cell-level sourcing is not auditable data, and we don't produce it.
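A minimal sketch of what cell-level sourcing could look like as a data structure, assuming each cell is stored as a value paired with a source identifier. The `Cell` type, the field names, and the source identifiers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    source: str  # identifier of the specific source that produced this value


def unsourced_cells(rows: list[dict[str, Cell]]) -> list[tuple[int, str]]:
    """Return (row index, field name) for every cell with no source attached."""
    return [
        (index, name)
        for index, row in enumerate(rows)
        for name, cell in row.items()
        if not cell.source.strip()
    ]


rows = [
    {"name": Cell("Example Org", "source-a"), "founded": Cell("2001", "source-b")},
    {"name": Cell("Other Org", "source-a"), "founded": Cell("1999", "")},
]
print(unsourced_cells(rows))
```

Because the source travels with the cell rather than the row, two fields of the same entity can cite different sources, and an audit can point at the exact datum that lacks one.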
ICD 206 · Cell-level sourcing
Known gaps are part of the output.
A Datamining product names the gaps in its own coverage. Rows that could not be filled confidently are flagged, not omitted. A "complete-looking" dataset that quietly dropped the 3% of entities with thin sourcing is a less honest dataset than one that includes those rows with LOW-confidence flags. The customer can filter; they cannot unfilter what was never shown.
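The "flag, don't omit" rule can be sketched as follows, assuming each row carries a confidence label. The row layout, the entity names, and the `HIGH`/`MODERATE` labels are hypothetical; the LOW flag is the one named above.

```python
# Every row ships, including the thinly sourced one; nothing is silently dropped.
delivered = [
    {"entity": "Example Org", "confidence": "HIGH"},
    {"entity": "Thin Org", "confidence": "LOW"},
    {"entity": "Other Org", "confidence": "MODERATE"},
]


def filter_by_confidence(rows: list[dict], exclude: tuple[str, ...] = ("LOW",)) -> list[dict]:
    """Customer-side filter: drop rows whose confidence label is in `exclude`."""
    return [row for row in rows if row["confidence"] not in exclude]


customer_view = filter_by_confidence(delivered)
flagged = [row["entity"] for row in delivered if row["confidence"] == "LOW"]

print(len(delivered), len(customer_view), flagged)
```

The filtering happens on the consumer's side of the handoff: the delivered set stays complete, and the narrower view is a choice the customer can reverse.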
ICD 203 · Standard 2: Properly expresses uncertainties
Feeds all other engines. Often invoked as the first step before a Baseline or Forecast run.
// OTHER ENGINES