Pragmatic SQL Server: while loop

I just cannot say this enough - "Cursors are eeeeeeeeeeeeeeeviiiiilllllllllllllllllllllllll!!!!!!!!!!!!!!!!!!!"

I recently encountered this post in Spiceworks and joined in the conversation -

POST:
I believe I already know the answer to this, but I'd love to hear from some people in the community.

Let's say that we're writing a procedure that needs to perform a calculation based on each row of a table - the example I have below is impractical but shows my point. Just how bad is the performance of using a cursor to loop through the rows, versus a method such as below? Is the way I have outlined below the general best way to do a loop like this?

3 COLUMN EXAMPLE TABLE: rownumber identity PK, intData1, intData2

WHILE(@currentRowNumber <= @maxRowNumber)
BEGIN
    INSERT INTO #SomeTable(myCalculation)
    SELECT SUM(intData1 + intData2) FROM myTable WHERE PK = @currentRowNumber

    SET @currentRowNumber = @currentRowNumber + 1
END

Interestingly, the first 'responder' posted this...
"I have recently been asked to remove cursors from existing stored procedures"

Now why would someone have instructed this person to remove cursors from all their existing stored procedures?

As I've previously posted, there are absolutely tons of opinions - some for, some against - on the use of cursors; however, nothing speaks better than quantitative data. To resolve this quantitatively, just google 'Itzik Ben-Gan' and 'cursor'. Once you've reviewed his articles, you should be clear on just why this responder would have been told to take cursors out of their stored procedure code.

To net it all out - there is hardly any situation you can craft where a cursor will outperform well-crafted T-SQL.
(There is one interesting exception: declaring a cursor as static gives considerably better performance than a plain old vanilla cursor.)
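
As a rough illustration, the loop in the POST can probably be collapsed into a single statement. This is only a sketch: the temp tables and sample values are stand-ins I made up to match the three-column table described in the post, and because the post's SUM is taken over exactly one row at a time, a plain addition gives the same per-row result.

```sql
-- Assumed setup: stand-ins for the tables described in the post.
CREATE TABLE #myTable (
    rownumber int IDENTITY(1,1) PRIMARY KEY,
    intData1  int NOT NULL,
    intData2  int NOT NULL
);
CREATE TABLE #SomeTable (myCalculation int);

INSERT INTO #myTable (intData1, intData2)
VALUES (1, 2), (10, 20), (100, 200);

-- One statement replaces the whole WHILE loop:
-- every row is calculated and inserted in a single pass.
INSERT INTO #SomeTable (myCalculation)
SELECT intData1 + intData2
FROM   #myTable;

SELECT myCalculation FROM #SomeTable;

DROP TABLE #SomeTable;
DROP TABLE #myTable;
```

No counter, no maximum row number and no per-row lookup are needed, which is the whole point of thinking about the table as a set.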

That said, assuming you are operating on a table of, say, 2,000,000 rows, or even a more modest table of 200,000 rows, figuring out a relational answer to the processing objective will almost always result in faster processing than 'fetching and operating' on one row at a time until you reach the end of the table. On more than one occasion I have found myself starting to write a loop similar to the one above and, while writing the various operations to perform inside the loop, suddenly seen that I could just join tables and do a 'mass upsert' type of operation.

In short order, I take my 'input batch' table and write two queries. The first joins the batch table to the target and inserts every batch row whose key is not already in the target; then I delete those rows from the batch. What is left are batch rows that already exist in the target, so there must be something different about the data (or maybe not, depending on who wrote the code that generated the 'input batch' table). The second query is therefore an update that joins on the key column(s) and sets every non-key column in the target to the value in the batch's corresponding row.
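
The insert-then-update pattern just described might look something like the sketch below. The table names, columns and sample rows are hypothetical, and capturing the inserted keys with an OUTPUT clause is my own assumption about how to know which batch rows to delete; other approaches would work as well.

```sql
-- Hypothetical tables for illustration only.
CREATE TABLE #target (id int PRIMARY KEY, amount decimal(10,2) NOT NULL);
CREATE TABLE #batch  (id int PRIMARY KEY, amount decimal(10,2) NOT NULL);

INSERT INTO #target (id, amount) VALUES (1, 10.00), (2, 20.00);
INSERT INTO #batch  (id, amount) VALUES (2, 25.00), (3, 30.00);

-- Remember which keys step 1 inserts so they can be removed from the batch.
DECLARE @inserted TABLE (id int PRIMARY KEY);

-- Step 1: insert batch rows whose key is not yet in the target.
INSERT INTO #target (id, amount)
OUTPUT inserted.id INTO @inserted (id)
SELECT b.id, b.amount
FROM   #batch AS b
LEFT JOIN #target AS t ON t.id = b.id
WHERE  t.id IS NULL;

-- Step 2: delete the rows just inserted from the batch.
DELETE b
FROM   #batch AS b
INNER JOIN @inserted AS i ON i.id = b.id;

-- Step 3: whatever is left already exists in the target, so update it.
UPDATE t
SET    amount = b.amount
FROM   #target AS t
INNER JOIN #batch AS b ON b.id = t.id;

-- With the sample rows above, #target ends up with ids 1, 2 and 3,
-- and id 2 carries the batch value 25.00.
SELECT id, amount FROM #target ORDER BY id;
```

Three set-based statements handle the whole batch, however many rows it holds.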

And this is the point of my statement that cursors are evil. It isn't just the endless arguing about the performance of the cursor itself; it is the nature of the beast.

The first time I ever saw a cursor, I thought to myself - "this is just a crutch that has been provided to people who are not accustomed to thinking relationally about data in an RDBMS." My reason for reaching that conclusion is the 'loop' model in the POST. Just about every cursor operation I've ever seen is the equivalent of some kind of loop; the loop is there to process each row (usually in a batch construct) and it just reeks of sequential thinking - 'first I'll do this to this piece of data in the row, then I'll do that to this piece of data, then blah, blah, blah ... and when I'm done I'll commit the change to this row and go get the next row until I've reached the 'eof' (end of table)'.

If you must do things this way, then by all means use a loop or a cursor. However, be prepared for the day when someone you report to looks at your code, considers all the overhead you're stuffing onto the servers and what that means for annual budgets, and tells you to strip all those cursors out (hint: start writing the code now that will find every place where you've used a cursor in a stored procedure or job).
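
A starting point for that hunt could be something like the following sketch. It searches the source text of modules in the current database and the command text of SQL Agent job steps. The search pattern is my assumption about how the cursors were declared, and it is a plain text match, so expect some false positives (comments, for instance) and check each hit by hand.

```sql
-- Stored procedures, functions, triggers and views in the current database
-- whose source text appears to declare a cursor.
SELECT  OBJECT_SCHEMA_NAME(m.object_id) AS SchemaName,
        OBJECT_NAME(m.object_id)        AS ObjectName,
        o.type_desc                     AS ObjectType
FROM    sys.sql_modules AS m
INNER JOIN sys.objects  AS o ON o.object_id = m.object_id
WHERE   m.definition LIKE '%DECLARE%CURSOR%'
ORDER BY SchemaName, ObjectName;

-- SQL Agent job steps whose command text appears to declare a cursor.
SELECT  j.name AS JobName,
        s.step_id,
        s.step_name
FROM    msdb.dbo.sysjobs AS j
INNER JOIN msdb.dbo.sysjobsteps AS s ON s.job_id = j.job_id
WHERE   s.command LIKE '%DECLARE%CURSOR%'
ORDER BY j.name, s.step_id;
```

The first query has to be run once in each database you care about; the second needs read access to msdb.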

Why? Because crutches become the preferred way of doing things, and pretty soon you are missing the whole point of a relational database construct. In effect you have taken a super-charged race car and tied 30 tons of dead weight to it, and then you'll be pondering why things take so long: "Where are the performance bottlenecks?" If the person you report to is any good at relational thinking, they'll know where the bottlenecks are - and they'll tell you to dump the cursors.

Lesson here - think hard about why you think you need that loop... because you probably don't.

Although I don't find the spatial maps in SSRS all that useful (I'll post another article in the future on generating KML files for Google Earth), one thing that some folks have found tedious is establishing the coordinates once the latitude and longitude have been updated in a table.

This can be easily done with a while loop, as demonstrated below. Suppose you want to create a bubble map of sales data.

First you need some invoices. To be pragmatic here, I'll assume you are familiar with your own invoice tables, so instead of creating them here, we'll just show the code I used to extract invoice information from my employer's system:

use DBA

declare @divcode nvarchar(10) -- divisional code
declare @yr int -- year to extract
set @divcode = '00'
set @yr=2012

-- drop the work table if it already exists from an earlier run
if exists(select * from sysobjects where type = 'u' and [name]='tbl_mapdinvoices')
    drop table tbl_mapdinvoices
;

-- invoice totals grouped by division, charge customer and customer,
-- joined to the customer's address and latitude/longitude
select IDENTITY(int, 1,1) rowid,
    cb.[div-code], cb.[Cust-no], c.[address],c.[city],c.[st],c.[zip-code],c.[map-grid],c.[num-lat],c.[num-lng], cb.[Total]
into tbl_mapdinvoices
from
(
    SELECT
        case when GROUPING([div-code])=1 then 'All' else [div-code] end [div-code],
        case when grouping([Charge-cust])=1 then 'All' else [Charge-cust] end [Charge-cust],
        case when grouping([Cust-no])=1 then 'All' else [Cust-no] end [Cust-no],
        sum([Gross-amt])+sum([Tax-amt__1])+sum([Tax-amt__2])+sum([Tax-amt__3])+sum([Extra-Charge__1])+sum([Extra-Charge__2])+sum([Extra-Charge__3])
        +sum([Extra-Charge__4])+sum([Extra-Charge__5])+sum([Extra-Charge__6])+sum([Extra-Charge__7])+sum([Extra-Charge__8])+sum([Extra-Charge__9])
        +sum([Extra-Charge__10]) [Total]
    FROM [dbo].[invoice]
    where DATEPART(yyyy,[invoice-date]) IN (@yr)
    group by [div-code], [Charge-cust], [Cust-no]
    with CUBE
    having [div-code] IN ( @divcode )
        and [Charge-cust] <> 'All'
        and [Cust-no] <> 'All'
) cb
inner join customer c on cb.[Cust-no]=c.[Cust-no]
    and c.[num-lat]+c.[num-lng]<>0 -- presumably skips customers whose coordinates were never populated
;

Notice the IDENTITY column above? The column rowid will serve as a means to 'loop through' the entire table. If one has, say, 10 or 20 invoices in the table, manually updating the table to get coordinates isn't a big deal. But if you have a table of 1,000 or 10,000 or 100,000 invoices to map, well, you need a better way.

Here's a redacted listing of the first two rows of my invoice table to give you an idea of what we have so far:
rowid  div-code  city       st  num-lat    num-lng     total
1      00        MARYVILLE  TN  35.754724  -83.971210  524.10
2      00        IRVING     TX  32.917917  -97.002038  511.31

Now we're going to add a column to our table to hold the coordinates as spatial data; we will call it 'spatialdata'.

alter table tbl_mapdinvoices add spatialdata geography;

Next, we set up the WHILE LOOP to create the coordinates for us. We're going to do a number of things here, so let's look at the code:

declare @counter int
set @counter = 1
declare @sql nvarchar(max)
declare @lat decimal(10,6)
declare @lng decimal(10,6)

while @counter < (select MAX(rowid)+1 from tbl_mapdinvoices)
begin
    -- fetch the latitude and longitude for the current row
    set @lat = (select ([num-lat]) from tbl_mapdinvoices where rowid = @counter)
    set @lng = (select ([num-lng]) from tbl_mapdinvoices where rowid = @counter)
    -- build the UPDATE statement; the POINT text takes longitude first, then latitude
    set @sql = 'update tbl_mapdinvoices set spatialdata ='+''''+'POINT('+cast(@lng as varchar(16))+' '+cast(@lat as varchar(16))+')'+''''
    set @sql = @sql+' where rowid ='+cast(@counter as nvarchar(10))+';'
    --print @sql
    exec sp_executesql @sql
    set @counter = @counter+1
end

First we declare our counter as an integer and set its value to 1.
While declaring "stuff", we also declare a string to hold our UPDATE SQL, along with our latitude and longitude variables.

We initiate our loop with a WHILE statement that essentially says: keep looping while the counter variable is less than the maximum rowid in the invoice table plus 1.
(You may be asking, 'Why not just use a cursor?' The answer is that there are multiple posts and myriad books and other analyses that demonstrate cursors are rarely more efficient than other T-SQL tactics. I have therefore abandoned using cursors completely, as I have never found a need for them.)

Next we set our latitude variable to the value of the num-lat column in our invoice table where the rowid in the table is equal to the counter. Likewise we set the value of longitude variable to our num-lng column value.

Now we employ the DYNAMIC SQL technique to build our UPDATE string.
If you are not familiar with DYNAMIC SQL, once you have an invoice table built with a rowid as described in the code, comment out the exec sp_executesql @sql statement, uncomment the print @sql statement, and run the WHILE LOOP. You should see a series of statements in your output that look something like this:

update tbl_mapdinvoices set spatialdata ='POINT(-83.971210 35.754724)' where rowid =1;
update tbl_mapdinvoices set spatialdata ='POINT(-97.002038 32.917917)' where rowid =2;
update tbl_mapdinvoices set spatialdata ='POINT(-80.949954 34.071876)' where rowid =3;
update tbl_mapdinvoices set spatialdata ='POINT(-79.221837 35.431453)' where rowid =4;

(DYNAMIC SQL is a good way to programmatically build statements for subsequent execution.)
Don't forget to uncomment the exec sp_executesql @sql statement and comment out the print @sql statement before you run the code though.
The exec sp_executesql @sql statement is the last thing in the WHILE LOOP before incrementing the counter and returning to the top of the loop.
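
As an aside, sp_executesql can also take the values as parameters instead of having them concatenated into the string. The sketch below is only an illustration of that form: the temp table is a hypothetical stand-in for the invoice table, and building the point with geography::Point rather than a 'POINT(...)' string is my substitution, not the method used in the loop.

```sql
-- Hypothetical stand-in for the invoice table.
CREATE TABLE #invoices (rowid int PRIMARY KEY, spatialdata geography NULL);
INSERT INTO #invoices (rowid) VALUES (1);

-- The statement text is fixed; only the parameter values change per call.
DECLARE @sql nvarchar(max);
SET @sql = N'update #invoices '
         + N'set spatialdata = geography::Point(@lat, @lng, 4326) '
         + N'where rowid = @id;';

-- Values travel as typed parameters, so no quoting or casting to strings is needed.
EXEC sp_executesql
     @sql,
     N'@lat decimal(10,6), @lng decimal(10,6), @id int',
     @lat = 35.754724,
     @lng = -83.971210,
     @id  = 1;

-- Show the stored point as well-known text.
SELECT i.rowid, i.spatialdata.STAsText() AS wkt
FROM   #invoices AS i;

DROP TABLE #invoices;
```

Note that geography::Point takes latitude first and longitude second, the reverse of the order used inside the POINT text.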

Once this WHILE loop is finished, you will have populated the spatialdata column of your table.
Below is a redacted listing of the first two rows of my invoice table:

rowid  city       st  num-lat    num-lng     spatialdata
1      MARYVILLE  TN  35.754724  -83.971210  0xE6100000010CCFC0C8CB9AE0414015E3FC4D28FE54C0
2      IRVING     TX  32.917917  -97.002038  0xE6100000010CA2B8E34D7E7540405AD6FD63214058C0
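
For comparison, the same column can probably be populated without a loop at all. This is a sketch under assumptions: the temp table is a hypothetical cut-down copy of the invoice table holding the two sample rows, and geography::Point with SRID 4326 is my choice of method rather than the one used in the WHILE loop.

```sql
-- Hypothetical cut-down copy of the invoice table with the two sample rows.
CREATE TABLE #invoices (
    rowid       int PRIMARY KEY,
    [num-lat]   decimal(10,6) NOT NULL,
    [num-lng]   decimal(10,6) NOT NULL,
    spatialdata geography NULL
);

INSERT INTO #invoices (rowid, [num-lat], [num-lng])
VALUES (1, 35.754724, -83.971210),
       (2, 32.917917, -97.002038);

-- One UPDATE populates every row.
-- geography::Point takes latitude first, then longitude, then the SRID.
UPDATE #invoices
SET    spatialdata = geography::Point([num-lat], [num-lng], 4326);

SELECT rowid, [num-lat], [num-lng], spatialdata
FROM   #invoices
ORDER BY rowid;

DROP TABLE #invoices;
```

On a large invoice table a single statement like this avoids building and executing one UPDATE per row.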

Now we want to convert the spatialdata into coordinates.

alter table tbl_mapdinvoices add coordinates nvarchar(255);

update tbl_mapdinvoices
set coordinates = convert(nvarchar(255),[spatialdata])
;

Below is a redacted listing of the first two rows of my invoice table with coordinates.

rowid  city       st  num-lat    num-lng     coordinates
1      MARYVILLE  TN  35.754724  -83.971210  POINT (-83.97121 35.754724)
2      IRVING     TX  32.917917  -97.002038  POINT (-97.002038 32.917917)
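
If you want to double-check that latitude and longitude went in the right way round, the geography type exposes both values directly. A minimal sketch, using the first sample row as a hard-coded value (the variable name is just for illustration):

```sql
-- Build a point from well-known text: longitude first, then latitude.
DECLARE @p geography;
SET @p = geography::STGeomFromText('POINT(-83.971210 35.754724)', 4326);

SELECT @p.STAsText() AS coordinates, -- well-known text form of the point
       @p.Lat        AS latitude,    -- should match num-lat
       @p.Long       AS longitude;   -- should match num-lng
```

If latitude and longitude come back swapped, the POINT text was built in the wrong order.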

Now I have the table ready to map the data.
To see an excellent example of how to spatially map data in SSRS, see the posting at this link:
http://www.mssqltips.com/sqlservertip/2174/maps-in-sql-server-2008-r2-reporting-services/

Pragmatic SQL Server

Tuesday, November 20, 2012

LOOPING VS. CURSORS

Friday, August 3, 2012

Setting coordinates set once you have spatial data