How to handle data corrections

BasVanDenBerg
If you know that an attribute can change, you make it a historized attribute; otherwise you don't.

Suppose you have a customer's first name, which is not historized because it should never change.

Now it turns out that the user who entered the data made a typing error, and the mistake is corrected after a few weeks. How would this be stored in the anchor model?

Since the first name is not historized, the only way to handle this case is to update the first name attribute and its metadata, but then you lose the previous (incorrect) first name. So you lose the guarantee that data is never lost in the anchor model.

From a modelling perspective this sounds right, because you defined that a first name cannot change, but I don't like the fact that you lose data in this case.
So can I conclude that if you want to disable UPDATE functionality at the database level, and/or you never want to lose any values, then you need to make every attribute historized? Otherwise you cannot handle data corrections.
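
To make this concrete, here is a rough sketch with made-up table and column names (not the code the tool actually generates):

    -- Static attribute: one first name per customer, no history.
    CREATE TABLE CU_FNM_Customer_FirstName (
        CU_ID  int         NOT NULL PRIMARY KEY,  -- reference to the Customer anchor
        CU_FNM varchar(42) NOT NULL               -- the first name value
    );

    -- Correcting a typo can only be done with an UPDATE,
    -- which destroys the previous (incorrect) value:
    UPDATE CU_FNM_Customer_FirstName SET CU_FNM = 'Anna' WHERE CU_ID = 1;

    -- Historized attribute: the changing time is part of the key,
    -- so the correction could be stored as an additional row instead.
    CREATE TABLE CU_FNM_Customer_FirstName_Historized (
        CU_ID            int         NOT NULL,
        CU_FNM           varchar(42) NOT NULL,
        CU_FNM_ValidFrom date        NOT NULL,
        PRIMARY KEY (CU_ID, CU_FNM_ValidFrom)
    );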
Bas

Re: How to handle data corrections

Olle
Hi,
I have previously solved this kind of problem with an extra attribute indicating whether a value is valid or not (a kind of logical delete), but I'm not too happy with this. I cannot include the "this is a valid row" attribute in the key, because users may sometimes make the same error twice, and if they do I don't know in which order the errors were made…
So to make referential integrity work in the database, you are right: you have to historize the attribute or delete the row...
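
Roughly what I mean, sketched with made-up names; neither choice of key works out:

    -- The idea: mark erroneous rows instead of deleting them.
    CREATE TABLE CU_FNM_Customer_FirstName (
        CU_ID   int         NOT NULL,
        CU_FNM  varchar(42) NOT NULL,
        IsValid smallint    NOT NULL   -- 0 = logically deleted
        -- PRIMARY KEY (CU_ID) forbids keeping the invalidated rows at all;
        -- PRIMARY KEY (CU_ID, IsValid) allows at most one invalid row per
        -- customer, so a second error collides:
        --   (1, 'Anan', 0)  -- first typo, invalidated
        --   (1, 'Anaa', 0)  -- second typo: duplicate key (1, 0), and even
        --                   -- without the key there is no order between them.
    );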

Lars is working on a bi-temporal extension to AM. I think your problem is a typical bi-temporal case. I hope Lars will soon come up with a smooth solution to this problem.

Olle

Re: How to handle data corrections

Ivo Clemens
Bas,

You suggest using historization to deal with corrections in your source data. I would argue against this.

In the restaurant example:
You could historize "dish-name" to deal with the eventuality that the name will be corrected. In that case the [validFrom] field means: the date on which the attribute was corrected to this value.
But for the dish-price attribute the [validFrom] field already has a meaning: the date on which the attribute took this particular value. That is not the same as a correction; it is a different use of the [validFrom] field. It would be confusing to assign two different meanings to the same field in one model.
Besides, how would you deal with corrections to "dish-price"? When a new dish price is entered into the data warehouse, a mistake may have been made in either the price or the [validFrom] date. Such mistakes cannot be corrected using historization, since the [validFrom] field has already been assigned a different meaning.
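
To make the ambiguity concrete, a sketch with made-up names:

    -- Historized attribute with changing time in the key:
    --   DI_PRC_Dish_Price (DI_ID, DI_PRC, DI_PRC_ValidFrom)
    --   PRIMARY KEY (DI_ID, DI_PRC_ValidFrom)
    --
    -- Loaded row, in which either column may be the mistake:
    --   (7, 100, '2001-01-01')

    -- If the price should have been 10, then this
    INSERT INTO DI_PRC_Dish_Price VALUES (7, 10, '2001-01-01');
    -- violates the primary key, while this
    INSERT INTO DI_PRC_Dish_Price VALUES (7, 10, '2001-01-02');
    -- falsely states that the price changed in the restaurant on that date.
    -- And if the [validFrom] date itself was the mistake, historization
    -- offers no way to express the correction at all: the wrong date is
    -- part of the key.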

I agree with Olle that an extra temporality is needed to properly deal with data corrections without updates.

Ivo

Re: How to handle data corrections

roenbaeck
Administrator
The code generated by the tool (and the way we describe Anchor Modeling in our papers) is mono-temporal and it handles value changes in the domain being modeled.

For example, if your domain contains products, one can assume that their prices may change over time. These changes are captured by "changing time" in Anchor Modeling. We normally also store what we call "recording time" in the metadata: the time at which the database recorded the price value. Even though two different times are used, the resulting model is not bi-temporal, since "recording time" is not part of the primary key.

This means that erroneous data cannot be stored in such a model: only correct data, or data correct to the best of our knowledge. If errors are found, that data has to be deleted or moved someplace else to make room for the corrections.

For example, if someone mistyped a price as 100 euro, in effect between 2001-01-01 and 2001-12-31, when it in fact should have been 10 euro, the old row will be removed and replaced by a new one.
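
In SQL the correction could look something like this sketch (made-up names, not exactly what the tool generates):

    -- Remove the erroneous row...
    DELETE FROM PR_PRC_Product_Price
    WHERE PR_ID = 42 AND PR_PRC_ValidFrom = '2001-01-01';

    -- ...and replace it with the corrected one.
    INSERT INTO PR_PRC_Product_Price (PR_ID, PR_PRC, PR_PRC_ValidFrom)
    VALUES (42, 10, '2001-01-01');

    -- If the old value must not be lost entirely, it can first be copied
    -- to a table outside the model, for example:
    --   INSERT INTO PR_PRC_Product_Price_Corrections
    --   SELECT PR_ID, PR_PRC, PR_PRC_ValidFrom FROM PR_PRC_Product_Price
    --   WHERE PR_ID = 42 AND PR_PRC_ValidFrom = '2001-01-01';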

I am working on a bi-temporal extension of Anchor Modeling, in which "recording time" is also part of the primary key. In such a model the erroneous data can be kept alongside the correct data. That enables you to rewind time along two dimensions: either backwards looking at correct information, or backwards looking at exactly what you knew at the time, even if it was wrong.
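
Sketched as a table, with column names chosen just for illustration:

    CREATE TABLE PR_PRC_Product_Price (
        PR_ID             int           NOT NULL,
        PR_PRC            decimal(10,2) NOT NULL,
        PR_PRC_ValidFrom  date          NOT NULL,  -- changing time
        PR_PRC_RecordedAt date          NOT NULL,  -- recording time, now in the key
        PRIMARY KEY (PR_ID, PR_PRC_ValidFrom, PR_PRC_RecordedAt)
    );
    -- The typo and its correction now coexist:
    --   (42, 100, '2001-01-01', '2001-01-05')  -- what we knew then
    --   (42,  10, '2001-01-01', '2001-03-02')  -- what we know now
    -- "As we knew it at time T": keep rows with RecordedAt <= T and, per
    -- (PR_ID, PR_PRC_ValidFrom), the row with the latest such RecordedAt.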

Going from mono-temporal to bi-temporal results in a level of complexity much higher than that of going from no temporality to mono-temporal. From the research I am doing right now, it seems that few (if any) have really understood the complexities of bi-temporal modeling.

Re: How to handle data corrections

Jorg Jansen
Hi Lars,

For me, the fact that AM concentrates on changing (functional/valid) times instead of recording (technical/transaction) times is one of the strong points of AM. And indeed, to satisfy auditability requirements and to track all historic corrections, one needs a second time dimension besides the ValidFrom date. Of course, this is the whole idea of bitemporal data: both time dimensions are important in their own right.

As you say, one could add a second date/time field such as RecordedFrom to an attribute table, and as a consequence this field would also have to be added to the primary key of the attribute table. And as you also say, this can make things pretty complex; consider, for example, the case where the changing time itself is changed in the source...

I would argue that a better option for the second time dimension is to keep a separate audit table for each attribute table, containing the complete audit history of the records in the attribute table, including changes to the ValidFrom date. The attribute table itself would then represent the "latest view" with respect to recording time, and from the audit table the point-in-time and interval views with respect to recording time could be constructed. In this way one would have complete auditability, while other (data mart) layers on top of the AM model would not be bothered with the large and fast-growing audit tables. Furthermore, one could use a union of the attribute table and its audit table to construct a complete bitemporal view of the data!
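
A rough sketch of what I mean (names made up; I assume the attribute table carries its recording time as a RecordedFrom metadata column):

    -- Attribute table: the latest view with respect to recording time.
    -- Audit table: every superseded version of every attribute row.
    CREATE TABLE PR_PRC_Product_Price_Audit (
        PR_ID            int           NOT NULL,
        PR_PRC           decimal(10,2) NOT NULL,
        PR_PRC_ValidFrom date          NOT NULL,
        RecordedFrom     date          NOT NULL,  -- when this version was stored
        RecordedTo       date          NOT NULL,  -- when it was superseded
        PRIMARY KEY (PR_ID, PR_PRC_ValidFrom, RecordedFrom)
    );

    -- Complete bitemporal view: audited versions plus the current rows.
    CREATE VIEW PR_PRC_Product_Price_Bitemporal AS
    SELECT PR_ID, PR_PRC, PR_PRC_ValidFrom, RecordedFrom, RecordedTo
    FROM PR_PRC_Product_Price_Audit
    UNION ALL
    SELECT PR_ID, PR_PRC, PR_PRC_ValidFrom, RecordedFrom, NULL AS RecordedTo
    FROM PR_PRC_Product_Price;  -- current rows are still in effect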

Looking forward to your ideas about bitemporal extensions of AM...

Re: How to handle data corrections

roenbaeck
Administrator
I'm cross-posting this from another forum, but it fits here too:

As I see it, you model a domain, and in that domain the information itself cannot be destroyed. Within the concept of an attribute lies the fact that it can always be evaluated; within the concept of an entity lies the fact that it can always be identified. I therefore believe in modeling changing (valid) time using a gapless timeline. Using only a single timestamp (when the change happened) ensures gapless timelines: each timestamp defines an implicit interval that can only be closed by another row in the database.
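
For example, the value in effect at any point in time can be picked out with a query like this sketch (made-up names, SQL Server-style variable):

    -- Product price in effect at @pointInTime: the row with the latest
    -- ValidFrom not later than that point; its interval is implicit.
    SELECT PR_PRC
    FROM PR_PRC_Product_Price
    WHERE PR_ID = 42
      AND PR_PRC_ValidFrom = (
          SELECT MAX(PR_PRC_ValidFrom)
          FROM PR_PRC_Product_Price
          WHERE PR_ID = 42
            AND PR_PRC_ValidFrom <= @pointInTime
      );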

This approach also forces you to model events in the domain explicitly. I have seen objections like the following: "Let's say I have a subscription to a magazine that started in 2005 and ended in 2010. Surely that should be modeled using valid time?" I say it should not. Either the subscription is viewed as having a state {active, inactive} which changes over time, or it has two attributes containing the start and end dates. I believe that when using a gapless timeline this mistake is much harder to make, which is another advantage.
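
In the state variant this could be stored as two ordinary historized rows (made-up names):

    -- SU_STA_Subscription_State: PRIMARY KEY (SU_ID, SU_STA_ValidFrom)
    INSERT INTO SU_STA_Subscription_State VALUES (17, 'active',   '2005-01-01');
    INSERT INTO SU_STA_Subscription_State VALUES (17, 'inactive', '2010-01-01');
    -- The 2005-2010 interval is implicit: the timeline stays gapless, and
    -- the "end" is just another change of state.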

The way I see recording (transaction) time is in connection with some 'memory' in which information can be recorded and erased. The information about the domain can surely be erased or forgotten by whatever agent is holding this memory. Therefore, for someone observing this memory over time, there can be gaps in the timeline. For example, we may first have believed that there was a magazine subscription when in fact there never was one. In other words, it was never information in the domain, but it was stored in our memory during some period.

The amount of memory used may increase and decrease over time, but the information in the domain can only increase over time.

Re: How to handle data corrections

BasVanDenBerg
What is the forum you are referring to?

Re: How to handle data corrections

roenbaeck
Administrator
A LinkedIn forum named "Temporal Data".