Re: statistics and density

It's all about trade-offs really. Maintaining statistics and processing them during optimisation can be costly, so it's a case of picking a model that will suffice given the constraints you have to work with. It will still have its limitations, but those limitations can be minimised in some areas.

 

The relevancy/usefulness of the range cell density may depend on:

 

  • The data
  • The number of steps in the histogram
  • The sampling rate
  • The method used to gather the stats (hash-based or sort-based)

 

We'll ignore the data bit. Yes, that can be controlled, but let's not go redesigning data models (much as they might need it!).
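
The other three are all knobs on update statistics. A rough sketch (the table and column names are made up; sampling and hash-based gathering assume a server version that supports them, hashing being a 15.7-era addition):

-- ask for a specific histogram step count
update statistics big_table (order_date) using 500 values

-- gather from a sample rather than scanning the whole column
update statistics big_table (order_date) using 500 values with sampling = 10 percent

-- hash-based gathering instead of the default sort-based method
update statistics big_table (order_date) with hashing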

 

When you search on a single fixed search argument, the optimiser will use the lower of the range cell density and the individual range cell's weight (the upper limit) if that value falls within a range cell.

 

The idea here is that it picks the more selective of the two. It's not foolproof in any way: the bigger the table and the less uniform the distribution of values, the greater the chance of individual value skew within a range cell and of large discrepancies between the weights of the various range cells.

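A toy illustration with made-up numbers: say the range cell density is 0.0002 and the range cell the search value falls into has a weight of 0.0150. The optimiser takes the lower of the two, 0.0002, so on a 1,000,000 row table it expects roughly 200 qualifying rows. If that one value actually accounts for most of the cell, the real count could be closer to 15,000 rows, which is exactly the kind of skew being described here.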
 

You minimise the chance of this by increasing the number of steps.

 

By specifying a higher number of steps you increase the chances of getting frequency cells for values that previously skewed the weights within range cells. It follows that with more steps you will most likely have fewer range cells. Increasing the histogram step count on large tables can have a dramatic effect on the range cell density (and rightly so).

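Roughly speaking (this is an approximation, not the exact rule the server applies): with s steps each step covers in the order of rowcount/s rows, so on a 10,000,000 row column a value needs somewhere near 500,000 duplicates to be likely to earn its own frequency cell at 20 steps, but only around 10,000 at 1,000 steps. Every skewed value that graduates to a frequency cell no longer contributes to the range cell density, which is why the density can drop so sharply.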
 

Total density is the overall measure of the average run of duplicates, and in versions prior to 15 it was used to cost joins.

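The density figure is usually described as the sum, over every distinct value, of (rows holding that value / total rows) squared. Toy numbers: a 1,000 row column holding one value 901 times and 99 other values once each gives (901/1000)^2 + 99 x (1/1000)^2, roughly 0.812, so a join on it is costed as if each outer row matched about 812 rows; a fully unique column gives 1000 x (1/1000)^2 = 0.001, i.e. about one matching row.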
 

In versions 15 and 16, it'll still get used in the following cases (there may be others as well):

 

  • For joins when one side of the join does not have a histogram available (if there is a histogram on both sides, it'll merge the histograms)
  • Unknowns such as: select blah where column in (select column2 from table) (see the sketch after this list)
  • It'll also be used if you have compatibility_mode enabled.
  • It'll also be used to cost joins in any query with 7 or more tables during the alternative greedy costing (used to prime the search engine)

 
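To make the second bullet concrete, a sketch with made-up table and column names. At optimisation time the server has no idea which values the subquery will return, so it cannot probe the outer column's histogram and falls back on total density:

select o.order_id
from   orders o
where  o.customer_id in (select c.customer_id
                         from   customers c
                         where  c.region = 'EMEA')

Both the total density and the range cell density for a column show up in the optdiag output for the table, so it's worth checking them after changing the step count.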

In an ideal world you have statistics that tell you the weight of every value in every column and every combination of every value of every column. It doesn't take long for the storage required to be in the terabyte range and for maintenance to take weeks (it takes long enough already!).

There are other ways of representing data but at this point, ASE is what it is.

