Indices in Anchor modeling

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Indices in Anchor modeling

Ivo Clemens
Hi,

In a presentation on your site I read that "Secondary indexes are very rarely needed" in Anchor modeling. Why is that? Why would I need fewer attribute indices in Anchor Modeling than in traditional tables? Attribute tables are not sorted or indexed on the attribute field. Does an index on the attribute field not increase performance? (For example of queries where this attribute is used as a filter in the where-clause)
Reply | Threaded
Open this post in threaded view
|

Re: Indices in Anchor modeling

roenbaeck
Administrator
The only times we use secondary indexes are in 3-way or larger ties, where you want to join over the tie through partial paths. Although, I would not recommend this approach, since if you can do such joins you have probably modeled more than one relationship in a single tie, and it should instead be broken apart into several ties, each modeling a single relationship.

Regarding the attributes. In order for an index to be useful there has to be some kind of restrictive condition on the attribute values, for example in a where clause. There are some different scenarios:

A) There are few rows in the attribute.
In this case, scanning the whole attribute is approximately as fast as scanning an index to find the matches and then looking those up in the attribute table.

B) There are many rows in the attribute.
In this case it depends on how restrictive the condition is:

i) The condition is not very restrictive.
Since the condition is matching many rows in the attribute, going through the index and then doing the lookups may actually be slower than scanning the attribute directly.

ii) The condition is very restrictive.
Here finding the match in the index and then doing the lookup is faster than scanning the whole table. However, since tables in AM are narrow, with few columns, scanning a whole table is still much less expensive than in a less normalized modeling technique.

Furthermore, after scanning the initial table you will get a very small intermediate result set. Since in almost all queries joins are present, this propagates to the other tables when executing the join and greatly improves performance compared to having those attributes stored in the same table. So, when using very restrictive conditions and joins, performance will be better than say 3NF anyway, without using a secondary index.

Looking at ii) if the number of distinct values in low compared to the total number of values in the attribute, then you should use a knotted attribute, and the logic of the propagating small result sets will apply. Left are the cases when the number of distinct values are almost as large as the number of values in the attribute. For example, a surname.

It would be interesting to compare some realistic queries over such an attribute, where you have lots of surnames stored, with and without a secondary index. There probably are situations in which an index would improve performance, but then that would have to be weighed against the additional maintenance and storage costs.


Reply | Threaded
Open this post in threaded view
|

Re: Indices in Anchor modeling

Ivo Clemens
Thank you for your extensive reply!

The question arose as I was implementing an Anchor Model for a large(ish) amount of contact data. The attribute under consideration is the contact datetime. Reports and analysts will often select a subset of transaction attributes for a given period.
Contact-datetime is there for every anchor. The datetime is of course very restrictive (hardly any 2 contacts will have the exact same datetime), so situation B ii applies.
And indeed, (realistic) queries of transaction-attributes for a given period do appear to be quite a bit faster with an index on the contact-datetime field.

I also have some questions on 3way ties, but I'll ask them in a separate post.
Reply | Threaded
Open this post in threaded view
|

Re: Indices in Anchor modeling

roenbaeck
Administrator
Just a comment. We decided in one installation to split the datetime of a purchase into two attributes 'date of purchase' and 'time of purchase' using the 'date' and 'time' data types in SQL Server 2008. People querying the data were often interested in one or the other, but rarely both in the same query. So in that case having two attributes instead of one boosted performance.