Archives
Categories
Blog Administration
Powered by

Calculating Linear Regression (i.e. Best Fit Line) using SQL
How I wish SSRS had a best fit line capability built into its charts... but at least up to SQL 2014, alas such is not the case.
So I cobbled together a SQL function to calculate it for me. The idea was I could pass in a table of data to the function and get the M and X values back.
To do this we need a table we can pass in as a parameter.
CREATE TYPE [dbo].[LineRegressionSource] AS TABLE(
[Period] [NUMERIC](18, 4) NULL,
[PartitionKey] [INT] NULL,
[Value] [NUMERIC](18, 4) NULL
)
I have a schema in my Datawarehouse called Tools, I put things there to be used by report writers and the like
CREATE SCHEMA Tools AUTHORIZATION dbo
And the function, this makes heavy use of a windowing function.. what I figure best is to build a table variable and populate it and then feed it in.
/************************************************************************
FUNCTION: LinearRegression
AUTHOR: mtassin
DATE: 2016-03-03
PURPOSE: To calculate the slope and Y Intercept of a best fit line
given a table of data. Based on the following equations
y =mx + b
where m = n(SUM(xy)) - SUM(x)*SUM(y)
-----------------
n(SUM(x^2))-(SUM(x))^2
and b = (SUM(y) - m(SUMX))/COUNT(X)
Parameters:
@Table LineRegressionSource (Period int,PartitionKey int ,Value Numeric(19,3))
Note, that this Linear Regression function expects to receive data in pre-summarized form
i.e. Only one row per value that would be plotted on a graph.
Period corresponds to a coordinate point on the X Axis
PartitionKey allows for multiple sets of data to be simultaneously linearly regressed, partitioned by said key
Value corresponds to a coordinate point on the Y axis.
Value
|
|
| x
|
-------------+-------------- Period
|
|
|
Modifications:
**************************************************************************/
CREATE FUNCTION Tools.LinearRegression(@Table AS LineRegressionSource READONLY)
RETURNS @SUMX TABLE( PartitionKey int,
NSUMXY numeric(38,5),
SUMXSUMY numeric(38,5),
NSUMX2 numeric(38,5),
SUMX2 numeric(38,5),
SUMX numeric(38,5),
SUMY numeric(38,5),
COUNTX numeric(38,5),
M numeric(38,5),
B numeric(38,5))
AS
BEGIN
INSERT INTO @Sumx(PartitionKey,NSUMXY,SUMXSUMY,NSUMX2,SUMX2,SUMX,SUMY,COUNTX,M,B)
SELECT
DISTINCT
PartitionKey,
NSUMXY = COUNT(Period) OVER (PARTITION BY PartitionKey) SUM(Period Value) OVER (PARTITION BY PartitionKey),
SUMXSUMY = SUM(Period)OVER (PARTITION BY PartitionKey) SUM(Value)OVER (PARTITION BY PartitionKey),
NSUMX2 = COUNT(period)OVER (PARTITION BY PartitionKey) SUM(Period*Period) OVER (PARTITION BY PartitionKey),
SUMX2 = SUM(Period) OVER (PARTITION BY PartitionKey) SUM(Period) OVER (PARTITION BY PartitionKey),
SUMX = SUM(Period) OVER (PARTITION BY PartitionKey),
SUMY = SUM(Value) OVER (PARTITION BY PartitionKey),
COUNTX = COUNT(Period) OVER (PARTITION BY PartitionKey),
NULL,
NULL
FROM
@Table
UPDATE @SUMX
SET M = (NSUMXY - SUMXSUMY)/(NSUMX2 - SUMX2),
B = (SUMY - (NSUMXY - SUMXSUMY)/(NSUMX2 - SUMX2) SUMX)/COUNTX
RETURN
END
So let's test it.
DECLARE @T AS LineRegressionSource
INSERT INTO @T(Period,PartitionKey,Value)
VALUES(1,1,2),(2,1,4),(3,1,6),(4,1,8)
SELECT FROM Tools.LinearRegression(@T)
Excel shows that the formula is Y=2X, which is Y=2X+0
(Page 1 of 1, totaling 1 entries)