I am not certain when to use 3-way ties. I have implemented a 3-way tie in my model and would like to confirm this is correct.
In email campaign data, a customer receives a message sent by an email campaign. Each customer is addressed at most once by any single campaign.
I have created campaign, message, and customer anchors, each with several attributes. (The message anchor was created to accommodate attributes such as senddatetime and bounce-reason.)
Initially I modeled a 2-way tie between customer and message, and a 2-way tie between message and campaign (so there was no direct tie between campaign and customer). When loading new data, I want my ETL to check whether a message already exists before creating a new message anchor ID. I found I had to join the two tie tables to see if a relation between the campaign and the customer already existed. This method seemed cumbersome and performed poorly as the tables grew large.
A 3-way tie seemed a better representation of reality: the relation between a campaign and a customer exists if and only if a message was sent; the relation between a message and a customer exists if and only if there is a campaign that sent this message; and the relation between a campaign and a message exists only for each customer addressed.
This 3-way solution certainly appears more elegant, and it performs much better.
Is my reasoning correct? Do you agree that this situation calls for a 3-way tie?
The message ID is the primary key. I am considering a secondary index on the customer ID in the 3-way tie table.
Do you have any observations?
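For concreteness, here is a minimal sketch of the 3-way tie and the existence check, using SQLite via Python (all table and column names are invented for illustration; the real model uses its own generated anchor and tie names):

```python
import sqlite3

# Hypothetical 3-way tie between campaign (CA), message (ME) and customer (CU).
# The message ID is the primary key; the customer ID gets a secondary index.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tie_CA_ME_CU (
    ME_ID INTEGER PRIMARY KEY,  -- message anchor identity (one row per message)
    CA_ID INTEGER NOT NULL,     -- campaign anchor identity
    CU_ID INTEGER NOT NULL,     -- customer anchor identity
    UNIQUE (CA_ID, CU_ID)       -- a customer is addressed at most once per campaign
);
CREATE INDEX ix_tie_CU ON tie_CA_ME_CU (CU_ID);
""")

# Existence check before creating a new message anchor ID:
conn.execute("INSERT INTO tie_CA_ME_CU (ME_ID, CA_ID, CU_ID) VALUES (1, 10, 100)")
row = conn.execute(
    "SELECT ME_ID FROM tie_CA_ME_CU WHERE CA_ID = ? AND CU_ID = ?", (10, 100)
).fetchone()
print(row)  # (1,) -> the message exists, so no new anchor ID is created
```

The UNIQUE constraint on (CA_ID, CU_ID) encodes the business rule that a customer is addressed at most once per campaign, so the duplicate check becomes a single-table probe.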
I think a 3-way tie will work for you unless you can end up in a situation where you know the campaign and the message but not the customer. What I am saying is that you will have a problem with late-arriving data (if you can end up in that situation), because you need all three anchors to make your tie.
It is interesting that you have performance problems in your two-way-tie setup. Can you describe that solution in more detail?
Thanks for your reply. Let me try to describe the performance problem:
Before inserting a new message, I want to make sure it has not already been inserted. The natural key for messages is the combination of the customer and the campaign. However, in the model with only 2-way ties, there is no table that combines the customer anchor and the campaign anchor.
I can join the customer-message tie table with the message-campaign tie table, and I end up with all the existing customer-campaign combinations. Now I can check whether the combination of customer and campaign that I want to insert is new or already in the database.
But this combining of the two tie tables is cumbersome and slow. This is the performance problem I mentioned. Please note: both individual tie tables perform just fine, but having to combine the two tables took up time and felt inelegant.
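As a sketch, the two-tie lookup looks like this (SQLite via Python, with made-up table names standing in for the generated tie tables):

```python
import sqlite3

# Hypothetical 2-way ties: customer-message and message-campaign.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tie_CU_ME (CU_ID INTEGER, ME_ID INTEGER PRIMARY KEY);
CREATE TABLE tie_ME_CA (ME_ID INTEGER PRIMARY KEY, CA_ID INTEGER);
INSERT INTO tie_CU_ME VALUES (100, 1), (101, 2);
INSERT INTO tie_ME_CA VALUES (1, 10), (2, 10);
""")

# Recovering all existing customer-campaign combinations requires joining
# the two ties through the message anchor on every load:
pairs = conn.execute("""
    SELECT cm.CU_ID, mc.CA_ID
    FROM tie_CU_ME cm
    JOIN tie_ME_CA mc ON mc.ME_ID = cm.ME_ID
""").fetchall()
print(sorted(pairs))  # [(100, 10), (101, 10)]
```

With a 3-way tie, the same customer-campaign pair can be probed directly in one table, which is the source of the difference I observed.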
I was considering creating my own additional table outside of the anchor model tables, in which all combinations of customers and campaigns would be stored. But then i realized that a 3-way tie would do that without me having to create extra tables.
What kind of volumes are we talking about here? The "worst" scenario I have seen was for a retail chain where the natural key for a receipt consisted of: date, receipt number, store number, cash register number. This was modeled using three anchors and two ties, one receipt anchor with date and receipt number attributes, a store anchor with store number attribute, and a cash register anchor with a cash register number attribute.
There were about 70 million receipts in the receipt anchor, 50 stores and a few hundred cash registers. Doing the joins "live" to assemble the natural key when new data was inserted was never a problem.
I normally use a natural key view in which a join is defined that assembles the natural key and associates it with its surrogate anchor identity. Should you run into a situation where the performance of doing lookups in this view is not good enough I would recommend that you materialize the view using the 'WITH SCHEMABINDING' option. We always had this as a backup plan should the views be too slow, but never had to put it into action.
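A sketch of such a natural key view, using the retail example (SQLite via Python; table and column names are invented, and the materialization with 'WITH SCHEMABINDING' is a SQL Server feature not shown here):

```python
import sqlite3

# Hypothetical anchors and ties: receipt (RE) with date and receipt number
# attributes, store (ST) with store number, cash register (CR) with register
# number, plus two ties. The view assembles the natural key and associates
# it with the receipt's surrogate identity.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RE_DAT (RE_ID INTEGER PRIMARY KEY, date TEXT, receipt_no INTEGER);
CREATE TABLE ST_NUM (ST_ID INTEGER PRIMARY KEY, store_no INTEGER);
CREATE TABLE CR_NUM (CR_ID INTEGER PRIMARY KEY, register_no INTEGER);
CREATE TABLE tie_RE_ST (RE_ID INTEGER PRIMARY KEY, ST_ID INTEGER);
CREATE TABLE tie_RE_CR (RE_ID INTEGER PRIMARY KEY, CR_ID INTEGER);

CREATE VIEW receipt_natural_key AS
SELECT r.RE_ID, r.date, r.receipt_no, s.store_no, c.register_no
FROM RE_DAT r
JOIN tie_RE_ST ts ON ts.RE_ID = r.RE_ID
JOIN ST_NUM s     ON s.ST_ID  = ts.ST_ID
JOIN tie_RE_CR tc ON tc.RE_ID = r.RE_ID
JOIN CR_NUM c     ON c.CR_ID  = tc.CR_ID;

INSERT INTO RE_DAT VALUES (1, '2011-05-01', 4711);
INSERT INTO ST_NUM VALUES (1, 42);
INSERT INTO CR_NUM VALUES (1, 3);
INSERT INTO tie_RE_ST VALUES (1, 1);
INSERT INTO tie_RE_CR VALUES (1, 1);
""")

# Lookup during loading: does a receipt with this natural key already exist?
row = conn.execute("""
    SELECT RE_ID FROM receipt_natural_key
    WHERE date = ? AND receipt_no = ? AND store_no = ? AND register_no = ?
""", ('2011-05-01', 4711, 42, 3)).fetchone()
print(row)  # (1,) -> existing receipt, reuse its surrogate identity
```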
The daily load batch contains data from the last 7 days, to make sure all data reaches us even if there is an error that persists for up to 6 days. Batches often contain some 3 or 4 million records. Currently there are 135 million messages in the database, from 190 thousand unique campaigns to 4 million unique customers.
The performance problem crept in when my script "live joined" the two tie tables in a subscript before comparing the result to the new natural keys. I suspect this temporary table occupied all available memory. This step of the batch load took longer than all other steps combined.
If I remember correctly, the 3-way tie solution performs the "match and anchor creation" step in under 1/10th the time of the two-tie solution. So even if the loading time of the two-tie solution was not (yet) an insurmountable performance problem, the 3-way tie solution strikes me as superior.
For comparison, processing a batch now takes less than 5 minutes when the system is calm, 30 minutes when the system is busy (when all refresh-jobs are running). What kind of refresh times did you experience in your example?
But maybe I'm simply missing the downside of using the 3-way tie? Could you tell me why I should avoid using them?
If by "subscript" you mean a subselect resulting in a derived table, then yes, that could be the performance hog. You need to join the tables in optimal succession without using subselects in order to get maximum performance. This is due to the 'reduced intermediate result set' effect. From my example: if the condition that matches the store's natural key (i.e. store number) is evaluated first, and there are 50 stores, then it will have to scan 1/50th of the data in the cash register table to find all matches. If the next condition then matches a cash register's natural key (i.e. cash register number) and there are 10 distinct such numbers, it will have to scan only 1/(50*10) = 1/500th of the data in the receipt table to find all matches.
So, when the cardinalities of the tables are very different, extra joins may actually be quite beneficial!
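The arithmetic behind this, spelled out with the numbers from my example (assuming roughly uniform distributions):

```python
# 'Reduced intermediate result set' effect: each equality condition divides
# the fraction of rows the next join has to scan by the number of distinct
# values in the matched natural key.
receipts = 70_000_000   # rows in the receipt anchor
stores = 50             # distinct store numbers
registers = 10          # distinct cash register numbers

after_store = 1 / stores                  # store condition first: 1/50
after_register = after_store / registers  # then register: 1/(50*10) = 1/500
rows_scanned = int(receipts * after_register)
print(rows_scanned)  # 140000 receipt rows instead of 70 million
```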
I could load about 1/2 million new rows in the scenario I described in less than 30 seconds (when there was no concurrent activity on the server). This was on an Intel Core 2 Quad, 8GB memory, and two local mirrored SATA disks.
Two more things. I had quite poor performance to begin with after doing a historical load of one year worth of data. This was due to two things:
1) index fragmentation in the clustered indexes. I had high fragmentation causing partial scans to be slow. Reorganizing the index solved this.
2) stale statistics. I had not updated statistics after the loading, so the optimizer could not figure out the optimal join path with respect to the 'reduced intermediate result set' effect described above.
In AM you need to have updated statistics and unfragmented indexes to have good performance. This is of course true for any technique, but even more so in AM since you often have many joins.
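As an illustration of those two maintenance steps after a bulk load, here is the SQLite analogue (the discussion above is about SQL Server, where the corresponding commands would be ALTER INDEX ... REORGANIZE and UPDATE STATISTICS; table names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tie (a INTEGER, b INTEGER)")
conn.execute("CREATE INDEX ix_tie_a ON tie (a)")
conn.executemany("INSERT INTO tie VALUES (?, ?)",
                 [(i % 100, i) for i in range(10_000)])

conn.execute("REINDEX ix_tie_a")  # rebuild the index (defragmentation analogue)
conn.execute("ANALYZE")           # refresh the optimizer's statistics

# After ANALYZE the optimizer has per-index statistics available:
row = conn.execute(
    "SELECT idx FROM sqlite_stat1 WHERE idx = 'ix_tie_a'"
).fetchone()
print(row)
```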