Temporal Integrity in Immutable Key Modeling Techniques
This post was updated on .
What follows is the definitions according to Wikipedia:
There are three types of integrity that needs to be upheld in order to claim database integrity: domain, referential and entity integrity.
Domain integrity Restricts data to conform to certain data types and certain constraints, such as being within a predefined interval, in order be considered valid.
Referential integrity Requires all foreign key values to be present as primary keys in the referenced tables. In other words, there can not be any dangling references, and joining over references will not produce duplicates.
Entity integrity Requires every table to have a defined primary key. No duplicate rows can be created on the primary key.
Each of the integrity conditions above become inherently more complicated with the level of temporalization introduced. I will therefore discuss them in the context of immutable key modeling techniques (of which Anchor Modeling and Data Vault are two types) and how and why we need to look at them a bit differently, especially in the case of bi-temporal and concurrent-temporal modeling.
I am cross posting this from the LinkedIn group 'Temporal Data':
I do not see joining as much of a problem as long as you can resolve temporality first. Say you have two bitemporal tables, Employee and Department, with a relationship describing to which department and employee belongs. Then, it is possible to create two parametrized views ppEmployee and ppDepartment, both which take two points in time as parameters. One for the valid/effective/changing time you want to travel to and one for the transaction/assertion/recording time you want to travel to. Most queries can now be written on the form:
ppEmployee(2001, 9999) e
ppDepartment(2001, 2001) d
d.DepartmentIdentifier = e.DepartmentIdentifier
With different parameters depending on what you want to see.
Just to give an example of when this does not work, is when you need to do what I call a 'join across timelines'. Say that you instead want to see the Departments as they looked on the year that the Employee was hired. You would want to do something like:
ppEmployee(2001, 9999) e
ppDepartment(e.HireDate, e.HireDate) d
d.DepartmentIdentifier = e.DepartmentIdentifier
But I have yet to find a database that allows that type of notation, so you will have to do some more work by hand instead to get the desired join.
I still haven't been able to get a hold of the SQL:2011 standard, but from what I could gather from other resources it looks like constraints will not apply to some of the information stored in a system versioned table. We do, however, know that before the information was updated or deleted it did satisfy the constraints that were present at the time. Referential integrity is a constraint. So, now the tough question is, if you join across times, ie using different parameters in the pp-views as in the first select above, how can you still guarantee referential integrity?
The way referential integrity is ensured in Anchor Modeling, Data Vault, and related techniques is by introducing the idea of immutability. In AM we make the assumption that identities are immutable, and then only the synthetic/surrogate identity, whereas in DV the assumption is extended to also include the natural/business identity. The model is then separated, such that immutable and mutable information never live together in the same tables. AM has anchors, which hold only synthetic identities, and assumes everything else to be mutable. DV has hubs, which hold a combination of synthetic and natural identities.
In AM we let the database enforce referential integrity by declaring foreign keys. Such foreign keys must always reference an immutable identity. In AM attributes (mutable) have a foreign key to the anchor (immutable). Relationships, or ties (mutable) as we call them in AM reference a number of anchors (immutable) using foreign keys. In DV I understand that foreign keys are not declared, and RI must be enforced by your loading process instead. Nevertheless, assuming that no mistakes have been made, DV can be seen as having an implied RI.
The benefit of this is that temporal referential integrity can be maintained, even though no temporal constraints have been declared. For any given two fixed points in bitemporal time, the model behaves as if it was wholly immutable and with complete RI. Given two by two fixed and different points in bitemporal time, the result is two models, each wholly immutable with complete RI. Let's call them 'past' and 'present'. Thanks to immutability, we know that one model must contain more identities than the other. We do not expect all present identities to exist in the past, but all past identities will exist in the present. Therefore, any RI that existed in the past will also satisfy RI in the present.
What remains is that some past attribute values or tie instances may have been deleted in the present. The immutable identities are still there, enforcing RI, but the mutable parts are gone. This is the desired behavior though. If you for example query for the Employees we had at a certain time in the past and want to know in which Departments they are today, then there will be no results for Employees that are no longer employed. You can do an outer join to include these, but then we've left the realm of RI.
The distinction between immutable/mutable information and its disjunction in the model helps a lot when dealing with the operations described in my first post. I would be very interested to hear about other approaches, if there are any.
Let's introduce some better terminology. What I was getting at was that there are two classes of joins:
Temporally independent joins
I referred to this as "resolving temporality first". In this case, which bi-temporal version of a table you are interested in does not depend on information in any of the adjoined tables.
Temporally dependent joins
I referred to this as "joining across timelines". In this case, which bi-temporal version of a table you are interested in depends on information found in the adjoined tables.
From my experience it is apparent that the first class is the predominant one. Perhaps it is even more so for 6NF. If you have made parametrized views in order to resolve temporality, the first class would correspond to passing constants as arguments to the view. Such as ppProduct('2012-01-01', infinity) in order to find what the prices of products were on the 1st of January 2012 given the best information I currently have.
In the second class, given the same views, this would roughly correspond to passing a variable as argument. Such as ppProduct('time-of-purchase', 'time-of-purchase') where 'time-of-purchase' would be found in an adjoined table. It may look doable, but it is not, and I have always ended up writing these queries by hand. Basically, in the end you will at some point need conditions stating Product.ValidTime <= PointOfSale.PurchaseTime and Product.TransactionTime <= PointOfSale.PurchaseTime, which is why I called it "joining across timelines".
I managed to find a way to do temporally dependent joins and still utilize the parametrized views. In order to show this pattern I updated the script and added one such query at the very end. It uses a combination of cross apply instead of join, which makes it possible to pass subselects as parameters, and a where-condition. I am not all too familiar with cross apply, so there may be performance implications, but the readability of the query is quite good:
-- a pattern for a temporally dependent join
ppCU_Customer('2011-03-16', '2011-03-19') cu
cupo.CU_ID_is = cu.CU_ID
PO_QUA.PO_ID = cupo.PO_ID_holding
poco.PO_ID_for = po.PO_ID
co.CO_ID = poco.CO_ID_the
po.PO_ID = cupo.PO_ID_holding;
Cross posting this from the Temporal Data forum on LinkedIn for reference:
An identity is an immutable property given to something at some point in time. The least restrictive assumption is that everything else about that something may change over time. If for some reason that something changes so much you start to think of it as something else, then you have created a new identity.
Identities are abstract properties, that from a philosophical standpoint only exist in the eye of the beholder. However, within a database they can be materialized as an identifier unique with respect to the classification of that something. Given this, I believe that surrogate keys are the perfect candidates to represent identities. They are a construct that helps the database keep track of what is what, and should not incur any other meaning whatsoever, and probably best never to be shown to an end user.
Natural (business) keys tend to change over time, both in content and composition. They also tend to carry meaning. Both suggest that they should be represented as properties of that something which they also identify, rather than used as identifiers in the database.
If the database is structured such that mutable content is separated from eachother by an immutable component, you can even achieve a model in which all foreign keys are non-composed (single column), even if the primary keys in the temporal tables have several temporal components. Furthermore, the different tables will enjoy temporal independence, greatly simplifying joins, as long as the joins themselves are not temporally dependent.
I believe that smart handling of identities and corresponding keys in the database is of great importance when you are dealing with temporal data.