Tuesday, March 27, 2012

Computing Frequency Distributions

What's the best way to calculate frequency distributions in SQL Server 2000? What I want to do is something akin to the Excel FREQUENCY() function, which takes two arguments: a range of data, and a sorted range of "bins" that define the intervals that the input range is compared against.

I can't use GROUP BY for this, as the data I want to analyze isn't amentable to that approach.

This seems like a common thing, and searching for "Frequency Distribution" in the SQL Server docs doesn't yield any results. Nor does it look like Analysis Services does this either.

I see some commerical products that provide this capability, but I'm looking for something using the tools I am already using, namely MSDE or SQL 2k.I really should have titled this "Computing Histograms". I wound up answering my own question when I reformulated my question a bit and used the correct terminology.

If I was using Oracle I could use the handy, fancy WIDTH_BUCKET predicate to do this. Unfortunately these SQL extensions are not supported by MSSQL Server.

Instead you can do something like this:

DECLARE @.numsteps int, @.start float, @.end float
SELECT @.numsteps=50, @.start='1', @.end='51'

SELECT step, count(*) AS cnt
FROM ( SELECT floor((icud-mn) / (1.0*range/@.numsteps) ) + 1 AS step
FROM ( SELECT min(icud) AS mn, max(icud)-min(icud)+1 AS range
FROM MPARDATA
WHERE icud >= @.start AND icud < @.end
) AS R
CROSS JOIN
( SELECT * FROM MPARDATA
WHERE icud >= @.start AND icud < @.end
)
AS S)
AS RS
GROUP BY step
ORDER BY step

This will return the number of rows in my table MPARDATA where 'icud' (a positive integer value) is 1, 2, 3, 4, 5, and so on.

This article was my inspiration:

http://www.sqlmag.com/Articles/Print.cfm?ArticleID=38251

Hope others find it useful.sqlsql

No comments:

Post a Comment