Archives
Categories
Blog Administration
Powered by
Thursday, March 3. 2016
Calculating Linear Regression (i.e. Best Fit Line) using SQL
How I wish SSRS had a best fit line capability built into its charts... but at least up to SQL 2014, alas such is not the case.
So I cobbled together a SQL function to calculate it for me. The idea was I could pass in a table of data to the function and get the M and X values back.
To do this we need a table we can pass in as a parameter.
CREATE TYPE [dbo].[LineRegressionSource] AS TABLE(
[Period] [NUMERIC](18, 4) NULL,
[PartitionKey] [INT] NULL,
[Value] [NUMERIC](18, 4) NULL
)
I have a schema in my Datawarehouse called Tools, I put things there to be used by report writers and the like
CREATE SCHEMA Tools AUTHORIZATION dbo
And the function, this makes heavy use of a windowing function.. what I figure best is to build a table variable and populate it and then feed it in.
/************************************************************************
FUNCTION: LinearRegression
AUTHOR: mtassin
DATE: 20160303
PURPOSE: To calculate the slope and Y Intercept of a best fit line
given a table of data. Based on the following equations
y =mx + b
where m = n(SUM(xy))  SUM(x)*SUM(y)

n(SUM(x^2))(SUM(x))^2
and b = (SUM(y)  m(SUMX))/COUNT(X)
Parameters:
@Table LineRegressionSource (Period int,PartitionKey int ,Value Numeric(19,3))
Note, that this Linear Regression function expects to receive data in presummarized form
i.e. Only one row per value that would be plotted on a graph.
Period corresponds to a coordinate point on the X Axis
PartitionKey allows for multiple sets of data to be simultaneously linearly regressed, partitioned by said key
Value corresponds to a coordinate point on the Y axis.
Value


 x

+ Period



Modifications:
**************************************************************************/
CREATE FUNCTION Tools.LinearRegression(@Table AS LineRegressionSource READONLY)
RETURNS @SUMX TABLE( PartitionKey int,
NSUMXY numeric(38,5),
SUMXSUMY numeric(38,5),
NSUMX2 numeric(38,5),
SUMX2 numeric(38,5),
SUMX numeric(38,5),
SUMY numeric(38,5),
COUNTX numeric(38,5),
M numeric(38,5),
B numeric(38,5))
AS
BEGIN
INSERT INTO @Sumx(PartitionKey,NSUMXY,SUMXSUMY,NSUMX2,SUMX2,SUMX,SUMY,COUNTX,M,B)
SELECT
DISTINCT
PartitionKey,
NSUMXY = COUNT(Period) OVER (PARTITION BY PartitionKey) SUM(Period Value) OVER (PARTITION BY PartitionKey),
SUMXSUMY = SUM(Period)OVER (PARTITION BY PartitionKey) SUM(Value)OVER (PARTITION BY PartitionKey),
NSUMX2 = COUNT(period)OVER (PARTITION BY PartitionKey) SUM(Period*Period) OVER (PARTITION BY PartitionKey),
SUMX2 = SUM(Period) OVER (PARTITION BY PartitionKey) SUM(Period) OVER (PARTITION BY PartitionKey),
SUMX = SUM(Period) OVER (PARTITION BY PartitionKey),
SUMY = SUM(Value) OVER (PARTITION BY PartitionKey),
COUNTX = COUNT(Period) OVER (PARTITION BY PartitionKey),
NULL,
NULL
FROM
@Table
UPDATE @SUMX
SET M = (NSUMXY  SUMXSUMY)/(NSUMX2  SUMX2),
B = (SUMY  (NSUMXY  SUMXSUMY)/(NSUMX2  SUMX2) SUMX)/COUNTX
RETURN
END
So let's test it.
DECLARE @T AS LineRegressionSource INSERT INTO @T(Period,PartitionKey,Value) VALUES(1,1,2),(2,1,4),(3,1,6),(4,1,8) SELECT
FROM Tools.LinearRegression(@T)
Excel shows that the formula is Y=2X, which is Y=2X+0
(Page 1 of 1, totaling 1 entries)