CRT under fuzzy record linking

Eric Biggs
Is there a recommended way to address CRT under fuzzy record linking, where the process by which the positor finds an anchor upon which to make posits can vary based on new information or approaches (i.e. probabilistic record linking)?

Let's say, for example, we have two Persons for John Smith from two data sources (positors) and we want to say with some degree of error that they are the same person, so we would only have a single Person anchor instance for the two John Smiths. Let's say that later, data source B updates a Social Security number, and we now know that they were not the same person after all. How might we approach this under CRT? I can think of two ways.

We could have an isValidLink knotted attribute for a Person anchor, set the old Person instance to invalid, and create two new Person instances for the two different John Smiths. Alternatively, we could pick one positor to separate, restate all of the anchor's attributes and ties as invalid for that positor, and create a new Person anchor instance for that single positor.

A third option might be to model links as a tie, in which case we'd only need to restate the ties, but then we lose the concurrency under Person and the benefits it brings (posit, annex normalization, etc.).

Has this concern been addressed in the past?

Re: CRT under fuzzy record linking

roenbaeck
Administrator
It looks like there are three scenarios:

One person assumption
Two sources refer to John Smith on Monday. On Monday we are sure they are the same person. If we were not sure, we would create two John Smiths (move to the next scenario). On Tuesday we link a bunch of transactions to John Smith (#1), coming from the first source. On Wednesday we link a bunch of transactions to John Smith (#1), coming from the second source. On Thursday we suspect that there are in fact two John Smiths. The transactions on Wednesday appear to be linked to the wrong John Smith. These transactions are retracted (0 reliability) and relinked to the newly created John Smith (#2) with some degree of reliability. There is only one possible view of the world.

Two distinct persons assumption
Two sources refer to John Smith on Monday. On Monday we cannot be sure they are the same person, so we create two John Smiths. On Tuesday we link a bunch of transactions to John Smith (#1), coming from the first source. On Wednesday we link a bunch of transactions to John Smith (#2), coming from the second source. On Thursday we suspect that there is in fact only one John Smith. The transactions on Wednesday appear to be linked to a duplicate of John Smith. These transactions are retracted (0 reliability) and relinked to the first John Smith (#1) with some degree of reliability. All attributes of the second John Smith are retracted as well. There is only one possible view of the world.
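
The retract-and-relink step shared by the two scenarios above could be sketched roughly as follows, using the first scenario's timeline. This is only an illustration in Python, not the schema an Anchor Modeling tool would generate: the names Posit, Assertion, relink and current_links, the positor name "us", the day numbers and the 0.8 reliability are all assumptions made up for the example. The point it tries to show is that nothing is deleted; a retraction is just a newer assertion with reliability 0.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Posit:
    # Hypothetical shape: "this transaction belongs to this person".
    transaction_id: str
    person_id: int

@dataclass(frozen=True)
class Assertion:
    posit: Posit
    positor: str
    reliability: float  # 0.0 means retracted, 1.0 means certain
    asserted_at: int    # illustrative day number (1 = Monday)

def relink(assertions, positor, transaction_id, old_person, new_person,
           reliability, day):
    """Retract the old link and assert a new one; earlier rows are kept."""
    return assertions + [
        Assertion(Posit(transaction_id, old_person), positor, 0.0, day),
        Assertion(Posit(transaction_id, new_person), positor, reliability, day),
    ]

def current_links(assertions, positor):
    """Latest assertion per posit for one positor, skipping retracted posits."""
    latest = {}
    for a in sorted(assertions, key=lambda a: a.asserted_at):
        if a.positor == positor:
            latest[a.posit] = a
    return {p: a.reliability for p, a in latest.items() if a.reliability > 0}

log = [
    Assertion(Posit("t1", 1), "us", 1.0, 2),  # Tuesday, from the first source
    Assertion(Posit("t2", 1), "us", 1.0, 3),  # Wednesday, from the second source
]
# Thursday: t2 appears to belong to a second John Smith (#2).
log = relink(log, "us", "t2", old_person=1, new_person=2,
             reliability=0.8, day=4)
links = current_links(log, "us")
```

Because the Wednesday assertion stays in the log, the earlier view of the world remains reconstructible by filtering on asserted_at.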

Possible worlds assumption
Two sources refer to John Smith on Monday. On Monday we cannot be sure they are the same person, so we create two John Smiths. On Tuesday we link a bunch of transactions to both John Smiths (#1, #2) with some reliabilities, coming from the first source. On Wednesday we link a bunch of transactions to both John Smiths (#1, #2) with some reliabilities, coming from the second source. On Thursday we confirm that there is in fact only one John Smith. Some transactions on Tuesday and Wednesday are linked to the duplicate John Smith. These transactions are retracted (0 reliability). All attributes of the duplicate John Smith are retracted as well. Until better knowledge is available, there are three possible views of the world: only #1 exists in reality, only #2 exists in reality, or both #1 and #2 exist in reality. Which view of the world you want must be resolved at query time.
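
Resolving the view at query time might look something like the sketch below. Everything in it is assumed for illustration: the reliabilities, the WORLDS names, and in particular the rule of picking the most reliable surviving link per transaction, which is just one possible resolution policy and not something CRT prescribes.

```python
# (transaction, person) -> reliability of the link, illustrative values only
links = {
    ("t1", 1): 0.7, ("t1", 2): 0.3,
    ("t2", 1): 0.4, ("t2", 2): 0.6,
}

# The three possible worlds, named here only for the example.
WORLDS = {"only_1": {1}, "only_2": {2}, "both": {1, 2}}

def view(links, existing_persons):
    """For each transaction, keep the most reliable link to a person that
    exists in the chosen world; retracted links (reliability 0) are ignored."""
    best = {}
    for (txn, person), reliability in links.items():
        if person not in existing_persons or reliability <= 0:
            continue
        if txn not in best or reliability > best[txn][1]:
            best[txn] = (person, reliability)
    return {txn: person for txn, (person, _) in best.items()}

views = {name: view(links, persons) for name, persons in WORLDS.items()}

# Thursday: #2 is confirmed to be a duplicate, so its links are retracted.
confirmed = {key: (0.0 if key[1] == 2 else r) for key, r in links.items()}
after_confirmation = view(confirmed, WORLDS["both"])
```

Before the confirmation the three worlds give three different answers; once the links to the duplicate carry reliability 0, even the "both" world resolves every transaction to #1.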

Unfortunately, these scenarios will require manual labor. I see no way to automate the discovery and recovery. Furthermore, you will need to identify the historical transactions that are connected to the wrong entity, which may involve detective work on the source data.

Does this answer your question?

Re: CRT under fuzzy record linking

Eric Biggs
This is helpful, Lars, thank you.