Anchor Modelling Performance on Flat Historical Table

Anchor Modelling Performance on Flat Historical Table


We have multiple flat "data models" exposed through a query-able analytics API in our application. Each of these is stored in the database as a flattened table consisting of fields brought together from multiple fact and dimension tables in the main data warehouse. The joins across the galaxy schema are done at ETL time so they don't have to be done at query time, which was very slow.

The flat tables also store historical information, using a scheme that emulates slowly changing dimensions (with start and end dates on all records - all fields being type 2). So for each change to a "business entity" a new snapshot row is created (limited to one a day). For the larger tables (0.5MM business entities, 0.5GB storage space), we envision about 4 snapshots being created for each entity, so a table would grow to, say, 2.5MM rows and 2.5GB of space. The smallest data model may have < 10k entities, but around 150 snapshots created for each.
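To make the setup concrete, here is a minimal sketch of what one such flat historical table might look like. All table and column names here are illustrative assumptions, not taken from our actual schema:

```sql
-- Hypothetical flat "data model" table with type-2 style history:
-- one row per entity per snapshot, bounded by start/end dates.
CREATE TABLE dm_customer_flat (
    EntityId      INT           NOT NULL,  -- business entity key
    ValidFrom     DATE          NOT NULL,  -- snapshot start date
    ValidTo       DATE          NULL,      -- NULL = current snapshot
    -- ...many more fields pre-joined from facts and dimensions...
    FirstName     VARCHAR(50)   NULL,
    LastName      VARCHAR(50)   NULL,
    LifetimeSpend DECIMAL(12,2) NULL,
    CONSTRAINT PK_dm_customer_flat PRIMARY KEY (EntityId, ValidFrom)
);
```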

When a snapshot happens, it's likely that only a few fields actually changed.

One other note - we typically have some time-based fields we like to add to the data models, e.g. days since the entity was created. We do this by having a view over the table, joining it to a calendar table and doing the calculation. All incoming queries hit this view.
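One possible shape of that view, sketched with assumed names (`dm_customer_flat`, `EntityCreatedAt`, `dbo.Calendar`) and assuming the calculation is relative to the current date:

```sql
-- Hypothetical view: derived time-based fields are computed at query
-- time by joining the flat table to a calendar table on today's row.
CREATE VIEW dm_customer AS
SELECT f.*,
       DATEDIFF(DAY, f.EntityCreatedAt, c.CalendarDate) AS DaysSinceCreated
FROM   dm_customer_flat AS f
JOIN   dbo.Calendar     AS c
  ON   c.CalendarDate = CAST(GETDATE() AS DATE);
```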

Ultimately we want fast responses to the queries issued via the API calls. A query currently only hits one data model. Do you think using anchor modelling for these tables would be likely to improve performance?

Re: Anchor Modelling Performance on Flat Historical Table

I have never done a data warehousing implementation with Anchor Modeling so take this answer with a grain of salt.

Your question seems to mostly be concerned with query performance, so I'll ignore the performance of populating the tables in question, but it should be noted that improving the performance of one usually negatively impacts the other.

First off, the amount of data being discussed here is extremely small. 500MB is almost nothing in the world of modern databases. Keep in mind, almost any operation happening is going to occur entirely in memory, which means that the performance difference between an index existing or not can be quite minimal. I would be very surprised if anchor modeling improved the speed of queries at these data sizes.

A common misconception about Anchor Modeling concerns space efficiency. Anchor Modeling is not space efficient compared to flat tables, even when recording history, unless you make a large number of revisions to a small subset of the attributes comprising an entity. Here is a way to prove this to yourself:

SQL Server uses 7 bytes to store the metadata of a single row. Every row also has an entry in a page index that costs another 2 bytes. So a table consisting of zero columns would take 9 bytes per row. Imagine you have a logical entity with 5 columns [ ID, FirstName, LastName, DateOfBirth, RowCreatedAt]. We'll say that ID is an int (4 bytes), FirstName and LastName are VARCHAR (bytes vary based on value), and DateOfBirth and RowCreatedAt are SmallDateTime (4 bytes).

That means inserting this row: [ 1, John, Kenedy, 1917-MAY-28, 2017-JAN-02 ] would cost 9 (metadata) + 4 (int) + 6 ('John') + 8 ('Kenedy') + 4 (smalldatetime) + 4 (smalldatetime). The total is 35 bytes. (Varchars take the number of characters + 2 bytes to store.) Say you discovered that the LastName attribute was wrong and someone changed it to 'Kennedy'. Another row would be entered, and it would take 36 bytes this time since 'Kennedy' is one character longer. Finally someone realizes the DateOfBirth is also wrong and changes it to '1917-MAY-29'. That would cost another 36 bytes. So all three rows would come to 107 bytes.

Now imagine you had an anchor model of the same information with the same data types and logical operations. You would have an anchor table of the primary key + the row metadata. So 3 rows of 9 bytes + 4 bytes (int), that's 39 bytes. Then there would be additional tables to model the attributes (FirstName, LastName, DateOfBirth). Those would have the same 9-byte metadata per row, a foreign key back to the anchor table on each row, and the values themselves. You're also probably going to have fields like 'RowCreatedAt' on each respective table. This is great if the attributes can change at varying times, but it also means you're paying that 4-byte date many times over. In fact, chances are you're using a DATETIME2, which takes 6-8 bytes depending on precision.
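A rough sketch of what that anchor-model schema might look like, loosely following Anchor Modeling naming conventions (all names here are illustrative):

```sql
-- The anchor table holds just the identity plus row metadata...
CREATE TABLE PE_Person (
    PE_ID INT NOT NULL PRIMARY KEY
);

-- ...and each attribute gets its own historized table, each paying for
-- row metadata, a foreign key, and its own change timestamp.
CREATE TABLE PE_LNA_Person_LastName (
    PE_ID     INT          NOT NULL REFERENCES PE_Person (PE_ID),
    LastName  VARCHAR(50)  NOT NULL,
    ChangedAt DATETIME2(0) NOT NULL,  -- 6 bytes at this precision
    PRIMARY KEY (PE_ID, ChangedAt)
);

-- FirstName and DateOfBirth would each get a similar attribute table.
```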

If you're skimming, here is the key takeaway. Every 'logical' row in an anchor model takes many more bytes than it would ordinarily take in a regular schema (all the attributes in the same table). Someone paying attention may raise the point, "But what about revisions to data? In your example, the flat schema restates the unchanged fields multiple times, whereas the anchor model would only have to insert a single narrow row!" You are completely correct, and if small updates to a few attributes of a large entity happen rapidly, Anchor Modeling becomes much more efficient.

Unfortunately I can burst that bubble pretty quickly. Think back to the earlier schema: say that after analysis we determine that, of all the fields, the LastName field gets changed much more often than the others. We could simply move just that column to another table, and pay a much smaller update cost in exchange for a reasonable bit of upfront extra storage. This also ignores the fact that if several attributes update at once, Anchor Modeling requires duplicate dates and additional row metadata, while the flat version does not.
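That middle-ground option could look something like this (a sketch with assumed names, using the example entity from above):

```sql
-- Stable fields stay flat...
CREATE TABLE Person (
    ID           INT           NOT NULL PRIMARY KEY,
    FirstName    VARCHAR(50)   NOT NULL,
    DateOfBirth  SMALLDATETIME NOT NULL,
    RowCreatedAt SMALLDATETIME NOT NULL
);

-- ...and only the frequently-changing column moves to its own
-- historized side table.
CREATE TABLE Person_LastName (
    ID           INT           NOT NULL REFERENCES Person (ID),
    LastName     VARCHAR(50)   NOT NULL,
    RowCreatedAt SMALLDATETIME NOT NULL,
    PRIMARY KEY (ID, RowCreatedAt)
);

-- A last-name correction now inserts one narrow row here instead of
-- restating every column of the wide table.
```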

Now, when SQL Server retrieves information from disk or memory, it does so in 'pages', 8KB a pop (technically it uses extents, which are 8 pages together). So when data is stored in a single table and you only care about one column, it still has to retrieve all the other information to get all the values for that one column. The exception to this would be nonclustered indexes, which store redundant copies of portions of the table off to the side. In essence, an anchor model schema is like implicitly turning each attribute and tie into a nonclustered index, since it allows SQL Server to retrieve JUST the information about that attribute or relationship without the other pieces of information coming along for the ride. If you added a nonclustered index to each column of a table, you'd have a logical equivalent in some ways (ps, don't do that).
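For illustration only (again, don't actually index every column), here is that analogy in code, using the hypothetical Person table from the earlier example:

```sql
-- A nonclustered index gives SQL Server a narrow, column-only copy of
-- the data, similar in effect to an anchor model's attribute table.
CREATE NONCLUSTERED INDEX IX_Person_LastName ON Person (LastName);

-- A query touching only that column can be answered from the narrow
-- index pages, without reading the full-width rows.
SELECT LastName FROM Person WHERE LastName = 'Kennedy';
```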

At this point you might be asking, "Wow, this seems like a lot of baggage, why introduce an Anchor Model at all?" For one, anchor modeling creates a pattern for the schema of a database, and a way to evolve the schema with minimal disruption. Second, it handles volatile information gracefully so rapid updates won't overwhelm a database with redundant data. Basically, you're accepting that it isn't very space efficient up front, in exchange for the fact that it has a reliable structure and decent worst case scenarios. Third, it provides a pattern for modeling temporal information. This is highly important, because most ad-hoc implementations of temporal concepts are simply wrong.

All this being stated, I don't believe Anchor Modeling will dramatically improve the speed of your scenario. It sounds like your best bet would be tuning indexes for specific queries, and making sure those queries have sargable criteria (for instance, no date math in the query criteria).
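A quick illustration of the sargability point, using the assumed table and column names from earlier:

```sql
-- Non-sargable: date math wraps the column, so an index on ValidFrom
-- can't be used for a seek.
SELECT * FROM dm_customer_flat
WHERE  DATEADD(DAY, 30, ValidFrom) >= GETDATE();

-- Sargable rewrite: move the math to the constant side, leaving the
-- column bare so the predicate can seek on an index over ValidFrom.
SELECT * FROM dm_customer_flat
WHERE  ValidFrom >= DATEADD(DAY, -30, GETDATE());
```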

Sorry for any typos or weird phrases, I wrote this late at night :/