The Specification

The EDXML specification describes how EDXML data is structured, validated and interpreted.

Current version: 3.0.0

Introduction

The EDXML specification describes how EDXML data is generated, interpreted and processed. Some key elements from the specification text are summarized below. It is recommended to read the introduction to EDXML first, in case you have not done so yet.

XML

EDXML uses an XML-based data format that is designed so that schemas can be used to offload data validation to efficient C libraries. The EDXML SDK demonstrates this in both its parser and generator.

Events

In EDXML all data is represented by means of events. An EDXML event is like a data row in a database table or a spreadsheet. It has a simple, flat structure. An example event is shown below:


<event event-type="crm.customer.details"
       source-uri="/com/acme/crm/">
  <properties>
    <phone>0034656286219</phone>
    <email>[email protected]</email>
  </properties>
</event>

The above example shows an event having two properties. Properties can have zero, one or multiple object values. The example shows both properties having one object value.
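As an illustration of this flat structure, the example event can be read with nothing more than a generic XML parser. The sketch below uses Python's standard library rather than the EDXML SDK, and it assumes the simplified, namespace-free markup shown above. The function name `parse_event` and the e-mail address `alice@example.com` are placeholders introduced for this example only.

```python
import xml.etree.ElementTree as ET

# The example event from above. The e-mail address is a placeholder value.
EVENT_XML = """
<event event-type="crm.customer.details"
       source-uri="/com/acme/crm/">
  <properties>
    <phone>0034656286219</phone>
    <email>alice@example.com</email>
  </properties>
</event>
"""


def parse_event(xml_text):
    """Return (event type, source URI, {property name: [object values]})."""
    event = ET.fromstring(xml_text.strip())
    properties = {}
    # Each child of <properties> is one object value of the property
    # named by its tag, so repeated tags yield multi-valued properties.
    for element in event.find("properties"):
        properties.setdefault(element.tag, []).append(element.text)
    return event.get("event-type"), event.get("source-uri"), properties


event_type, source_uri, properties = parse_event(EVENT_XML)
print(event_type, source_uri, properties)
```

Storing the values of each property in a list reflects the fact that a property can have zero, one or multiple object values.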

Each event has an event type and a source URI. The event type defines the structure of the events, which properties it has, what these properties mean, how properties are mutually related, and so on. Event type definitions are part of the ontology, which we will describe later.

The simple event representation suggests that EDXML is only useful for simple data structures. This is not the case though. In EDXML, the true structure of the data is for the most part the result of interpretation of the data according to its meaning. The meaning of the events is encoded in their event type definition. Using the event type definitions in the EDXML ontology, one or more events can be expanded into more advanced data structures like hierarchical structures or graphs.

Ontologies

The structure of an event is determined by its event type. Event types are defined in EDXML ontologies and describe the exact structure and meaning of all events of a particular type. They describe which properties their events can have, how these properties are related to one another, what the value space of their values is, and which concepts they are associated with. They describe how the time stamps contained in the events must be interpreted, how the ordering of the events is defined and many other things.

The specification does not include any predefined event types. Each EDXML data source can generate its own ontology, which contains the definitions of the types of whatever events it wishes to produce. As explained in the story telling analogy, the ontology is like the introduction of a novel, which describes the various characters and their role in the story that the data is telling us.

Ontologies are an integral part of the EDXML document format. In fact, the specification requires that the ontology is never separated from the events that it describes. This guarantees that the meaning of the data flows along with the data itself: the information required to process the data is wherever the data is.

Ontologies describe the meaning of the data. Combined with the events, this yields knowledge. Consider the example event shown above once more. The event type may contain an intra-concept relation between the phone number and e-mail address, which looks like this (abbreviated):


<intra source="phone"
       target="email"
       source-concept="person"
       target-concept="person"/>

This relation tells machines that both the phone number and the e-mail address are identifiers of a person and that both are associated with the same person. Computers can use this relation to make the reasoning steps needed to perform concept mining.
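The kind of reasoning step that such a relation enables can be sketched in a few lines of Python. The sketch below is not the concept mining algorithm of EDXML or its SDK. It is a minimal union-find illustration under simplifying assumptions: events are plain dictionaries, relations are (source, target, concept) tuples, and identifiers are keyed by property name instead of object type. All names and values are hypothetical.

```python
# Hypothetical, simplified stand-ins for ontology and event data.
INTRA_RELATIONS = {
    "crm.customer.details": [("phone", "email", "person")],
}

EVENTS = [
    {"type": "crm.customer.details",
     "properties": {"phone": ["0034656286219"],
                    "email": ["alice@example.com"]}},
    {"type": "crm.customer.details",
     "properties": {"phone": ["0034656286219"],
                    "email": ["a.johnson@example.org"]}},
]


def mine_concepts(events, relations):
    """Group identifiers that the relations tie to the same concept instance."""
    parent = {}

    def find(node):
        # Find the representative of a node, compressing the path on the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        parent[find(a)] = find(b)

    for event in events:
        for source, target, concept in relations.get(event["type"], []):
            for source_value in event["properties"].get(source, []):
                for target_value in event["properties"].get(target, []):
                    # Both identifiers belong to one and the same instance.
                    union((concept, source, source_value),
                          (concept, target, target_value))

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


groups = mine_concepts(EVENTS, INTRA_RELATIONS)
print(groups)
```

Because both events share the same phone number, the two e-mail addresses end up in a single group, which suggests that they identify the same person.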

The ontologies of multiple data sources can be merged automatically. This means that software components that output EDXML data can be shared and combined. Combining the outputs yields a single consistent body of correlated knowledge. The use of shared ontology bricks helps to achieve this in practice.

Event types and other ontology components are versioned. An update can be sent as part of a regular ontology element in an EDXML data stream. This ensures that the update is automatically distributed to all system components that consume event data. The specification provides backward compatibility guarantees and describes how ontology updates are processed.

Object Types

Events can be correlated by determining which events have a particular object value in common. This requires matching both the object value and the object type. Each event property defines the type of objects it can contain by specifying its object type. While properties are specific to a single event type, many event types can use the same object type.
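A minimal Python sketch of this kind of correlation is given below. It assumes that events are plain dictionaries and that the object type of each property is available from a lookup table. The event types, property names and object type names (such as `telecom.phone`) are hypothetical and do not come from the specification.

```python
from collections import defaultdict

# Hypothetical mapping of (event type, property name) to object type.
PROPERTY_OBJECT_TYPES = {
    ("crm.customer.details", "phone"): "telecom.phone",
    ("billing.invoice", "contact-phone"): "telecom.phone",
    ("billing.invoice", "amount"): "money.amount",
}

EVENTS = [
    {"type": "crm.customer.details",
     "properties": {"phone": ["0034656286219"]}},
    {"type": "billing.invoice",
     "properties": {"contact-phone": ["0034656286219"], "amount": ["120.00"]}},
    {"type": "billing.invoice",
     "properties": {"contact-phone": ["0034600000000"], "amount": ["75.50"]}},
]


def shared_objects(events, property_object_types):
    """Map (object type, value) to the positions of the events sharing it."""
    index = defaultdict(set)
    for position, event in enumerate(events):
        for name, values in event["properties"].items():
            object_type = property_object_types[(event["type"], name)]
            for value in values:
                # Match on object type and value, not on the property name.
                index[(object_type, value)].add(position)
    return {key: sorted(positions)
            for key, positions in index.items() if len(positions) > 1}


shared = shared_objects(EVENTS, PROPERTY_OBJECT_TYPES)
print(shared)
```

The first two events are correlated even though their properties have different names, because both properties use the same object type.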

The object type of an event property also determines the value space of the object values. For example, an object type defining a date only allows dates as values.

Concepts

As explained in the story telling analogy, EDXML concepts are like the characters in the stories told by the events in the data set. Concepts can be persons, places, organizations, or whatever makes sense to describe the data. In EDXML, concepts are essentially just names that act as identifiers, like person.customer.

Which attributes a concept may have (name, address, ...) is not explicitly defined. Instead, event properties can define which concept they are associated with, marking themselves as a concept attribute. The full set of available attributes of a concept is obtained by querying all event types that refer to it. This is how multiple EDXML data sources can join forces to provide a more complete description of a concept, simply by combining their event type definitions and their event data.
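The query described above can be sketched as follows. The sketch assumes a heavily simplified ontology in which each event type maps its property names to the name of the associated concept, or to `None` when there is no association. The event types and properties are hypothetical, and matching is done on the exact concept name only.

```python
# Hypothetical event type definitions: property name -> associated concept.
EVENT_TYPES = {
    "crm.customer.details": {
        "phone": "person.customer",
        "email": "person.customer",
    },
    "billing.invoice": {
        "customer-name": "person.customer",
        "amount": None,  # not associated with any concept
    },
}


def concept_attributes(event_types, concept):
    """List the (event type, property) pairs that act as concept attributes."""
    return sorted(
        (event_type, name)
        for event_type, properties in event_types.items()
        for name, associated in properties.items()
        if associated == concept
    )


attributes = concept_attributes(EVENT_TYPES, "person.customer")
print(attributes)
```

Adding a data source with its own event types extends the attribute list without changing any existing definition.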

Last but not least, concepts are the very thing that makes concept mining possible.

Hashing

Each EDXML event has a unique persistent identifier which can be represented as a hash value. These hashes are not explicitly stored in the EDXML data. Rather, they can be computed on demand from the event data. The specification text describes the hash computation method in detail.

The event hashes are called sticky hashes because they remain attached to a given event, even when the event evolves over time. The hashes provide a persistent identifier which can be safely used to refer to any event, static or dynamic. This is achieved by allowing an event type to customize the hash computation.

Event types can describe which combination of event properties constitutes a unique event. Only the object values from these properties are included in the hash. This means that a data source can generate multiple physical events that are in fact different instances of the same logical event. As such, an EDXML event can be used to track the state of a dynamic data record at the data source.
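The idea can be illustrated with a short Python sketch. It is not the hash computation defined in the specification, which should be consulted for the actual method. The sketch assumes that the hash covers the event type, the source URI and the sorted object values of the properties that define uniqueness, and it uses SHA-1 merely as an example digest. The event type `crm.ticket` and its properties are hypothetical.

```python
import hashlib


def sticky_hash(event_type, source_uri, properties, unique_properties):
    """Hash only the object values of the properties that define uniqueness."""
    parts = [event_type, source_uri]
    for name in sorted(unique_properties):
        for value in sorted(properties.get(name, [])):
            parts.append(f"{name}:{value}")
    return hashlib.sha1("\n".join(parts).encode("utf-8")).hexdigest()


# Two physical events describing the same ticket in different states.
opened = {"ticket-id": ["T-1001"], "status": ["open"]}
closed = {"ticket-id": ["T-1001"], "status": ["closed"]}

hash_opened = sticky_hash("crm.ticket", "/com/acme/crm/", opened, ["ticket-id"])
hash_closed = sticky_hash("crm.ticket", "/com/acme/crm/", closed, ["ticket-id"])
print(hash_opened == hash_closed)
```

Since the status is left out of the hash, both physical events map to the same logical event, while a different ticket identifier would yield a different hash.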

Events that are instances of the same logical event are said to collide, analogous to a hash collision in cryptography. Colliding events can be merged into one by means of a merging procedure that is also described in the specification. The event type definition can provide directives for merging its events. Various merge strategies are supported, such as taking the largest value, adding the value to the set, and so on. Merge operations are idempotent, which allows for implementing event merges in distributed databases.
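A merge of two colliding events might look roughly like the Python sketch below. It is a simplified illustration and not the merging procedure from the specification. The strategy labels `max` and `add` are names chosen for this example, properties without a strategy are also treated as sets, and the property names are hypothetical.

```python
def merge_events(first, second, strategies):
    """Merge the properties of two colliding events into one."""
    merged = {}
    for name in set(first) | set(second):
        values = set(first.get(name, [])) | set(second.get(name, []))
        if strategies.get(name) == "max" and values:
            # Keep only the largest value.
            values = {max(values)}
        # With "add", or without a strategy, keep the union of all values.
        merged[name] = sorted(values)
    return merged


STRATEGIES = {"last-login": "max", "alias": "add"}

older = {"user": ["alice"], "last-login": ["2018-05-01"], "alias": ["al"]}
newer = {"user": ["alice"], "last-login": ["2018-06-12"], "alias": ["ally"]}

merged = merge_events(older, newer, STRATEGIES)
print(merged)

# Merging the same event again does not change the result.
print(merge_events(merged, newer, STRATEGIES) == merged)
```

Both strategies are built on set union and maximum, which is why repeating a merge leaves the result unchanged.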

Templates

While the primary goal of EDXML is to make data equally meaningful to both humans and machines, its expressiveness is and will always be limited. Ultimately, the exact meaning of data is best conveyed to humans by means of human language.

EDXML includes a simple template language that enables converting EDXML events into plain English. A short example template is shown below:

On [[date-created]] bank account [[account]] was created for [[customer-name]]{, who can be contacted at e-mail address [[customer-email]]}.

The double square brackets mark placeholders for property values. The part of the template enclosed in curly brackets is omitted in case customer-email has no value. This feature allows event descriptions to degrade gracefully for events that contain less information. The above template may evaluate to:

On May 1st 2018 bank account 12345 was created for Alice Johnson, who can be contacted at e-mail address [email protected].

Machines really can tell stories.
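How such a template might be evaluated is sketched below in Python. The sketch covers only the two features described above, placeholders and optional parts in curly brackets, and it is not the template engine of the EDXML SDK. It assumes that each property holds a single, already formatted value, and that an optional part is dropped when any placeholder inside it has no value. The function name `evaluate` and the e-mail address are placeholders introduced for this example.

```python
import re

PLACEHOLDER = re.compile(r"\[\[([^\]]+)\]\]")
OPTIONAL = re.compile(r"\{([^{}]*)\}")

TEMPLATE = (
    "On [[date-created]] bank account [[account]] was created for "
    "[[customer-name]]{, who can be contacted at e-mail address "
    "[[customer-email]]}."
)


def evaluate(template, properties):
    """Fill in placeholders, dropping optional parts that lack a value."""

    def resolve_optional(match):
        names = PLACEHOLDER.findall(match.group(1))
        if all(properties.get(name) for name in names):
            return match.group(1)  # keep the text, without the brackets
        return ""

    text = OPTIONAL.sub(resolve_optional, template)
    return PLACEHOLDER.sub(lambda m: properties.get(m.group(1)) or "", text)


event = {
    "date-created": "May 1st 2018",
    "account": "12345",
    "customer-name": "Alice Johnson",
    "customer-email": "alice@example.com",
}

print(evaluate(TEMPLATE, event))

# The same event without an e-mail address degrades gracefully.
del event["customer-email"]
print(evaluate(TEMPLATE, event))
```

Resolving the optional parts before filling in the placeholders keeps the two steps independent, so a property value is never scanned for template syntax.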

The templates are also used to describe relationships between events. This enables machines to do things like explaining their own reasoning that led to a particular concept mining result, quite similar to how a human analyst would.
