Why joins are a good thing

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Why joins are a good thing

roenbaeck
Administrator
There seems to be a very common misconception about joins being bad for performance, but joining is not necessarily always a bad thing, as can be seen in the performance of Anchor Modeling. Let me illustrate it this way. Say we have wide table with 50 columns in 3NF but almost all queries use only up to 5 of those columns. That means that at least 90% of the columns you need to scan during querying are a waste of resources. Now imagine that you explode the single table into 50 tables in 6NF. Almost all queries now instead require up to 5 joins, however, the narrow tables that are being scanned contain exactly the information you are asking for. The previously wasted resources can in this scenario be used to perform the joins, with some break-even point in performance.

This break-even point is pushed further to the benefit of 6NF when changes are captured and stored. In the wide table, if a single column changes value and history should be kept, the values in the other 49 columns are duplicated. If many columns change values and do so asynchronously, your single table will grow fast. In comparison, with 50 tables, no data is duplicated, since only a single narrow table is affected if a single column changes value. The same reasoning can be applied to null values, which must be represented in the single table, but result in the absence of rows in the 6NF model.

Last, but not least, if your queries contain conditions joins can mimic indexes through column statistics. Using statistics the query optimizer can determine the best join order, which is taking the most selective condition first, producing a small intermediate result set, followed by the second most selective condition, and so on. Statistics take up much less space than an index, and secondary indexes rarely are needed in 6NF models. For the single wide table, to achieve the same general performance benefits, every column would have to be indexed, and that would take up a lot of extra space.

So, to conclude, highly normalized models are certainly not unsuitable for ad-hoc querying. On the contrary, we have shown that in Anchor Modeling we surpass the performance of dimensional models for many types of queries. Particularly so when we need to ask a query that the dimensional model was not prepared for.