Tuesday, November 20, 2012

LOOPING VS. CURSORS

I just cannot say this enough - "Cursors are eeeeeeeeeeeeeeeviiiiilllllllllllllllllllllllll!!!!!!!!!!!!!!!!!!!"

I recently encountered this post in Spiceworks and joined in the conversation -

POST:
I believe I already know the answer to this, but I'd love to hear from some people in the community.

Lets say that we're writing a procedure that needs to perform a calculation based on each row of a table - the example I have below is impractical but shows my point. Just how bad is the performance of using a cursor to loop through the rows, versus a method such as below? Is the way I have outlined below the general best way to do a loop like this?

3 COLUMN EXAMPLE TABLE: rownumber identity PK, intData1, intData2

WHILE(@currentRowNumber <= @maxRowNumber)

BEGIN

INSERT INTO #SomeTable(myCalculation)

SELECT SUM(intData1 + intData2) FROM myTable WHERE PK = @currentRowNumber

SET @currentRowNumber = @currentRowNumber + 1

END

Interestingly, the first 'responder' posted this...
"I have recently been asked to remove cursors from existing stored procedures"

Now why would someone have instructed this person to remove cursors from all their existing stored procedures?

As I've previously posted, there are absolutely tons of opinions - some for, some against - the use of cursors; however, nothing speaks better than quantitative data.  And to resolve this quantitatively just google 'Itzak Ben Gan' and 'cursor'.  Once you've reviewed his articles, you should be clear on just why this responder would have been told to take cursors out of their stored procedure code.

To net it all out - there is hardly any situation you can craft where cursor will out-perform well crafted t-sql.
(There is an interesting exception, when declaring a static cursor one gets considerably improved performance over a plain old vanilla cursor).

That said, assuming you are operating on a table of say, 2,000,000 rows, or even a more modest table of 200,000 rows, figuring out a relational answer to the processing objective will almost always result in faster processing than 'fetching and operating' on one row at a time until you reach the end of the table. I have, on more than one occasion found myself starting to write a loop similar to the one above and while doing it, and writing the various operations to perform inside the loop, I suddenly see that I could just join tables and do a 'mass upsert' type of operation.  In short order, I've taken my 'input batch table' and written two different queries - one joins the batch table to the target and inserts all the rows from the batch where the key in the batch is not in the target.  Then I delete those rows from the batch.  What is left are rows in the batch that have rows already existing in the target - therefore there must be something different about the data (or maybe not depending on who wrote the code that generated the 'input batch' table); so I write an update changing the value for everything in the target to what is in the batch's corresponding row (except for the key of course); joining on the key column(s).



And this is the point of my statement that cursor are evil.  It isn't just the endless arguing about performance of the cursor itself, it is the nature of the beast.

The first time I ever saw a cursor, I thought to myself - "this is just a crutch that has been provided to people who are not accustomed to thinking relationally about data in an RDBMS."  My reason for reaching that conclusion is the 'loop' model in the POST.  Just about every cursor operation I've ever seen is the equivalent of some kind of loop operation; the loop is there to process each row (usually in a batch construct) and it just reeks of sequential thinking - 'first I'll do this to this piece of data in the row, then I'll do that to this piece of data, then blah, blah, blah ... and when I'm done I'll commit the change to this row and go get the next row until I've reach the 'eof' (end of table)'.

If you must do things this way, then by all means use a loop or a cursor. However, be prepared for the day when someone you report to looks at your code, considers all the overhead you're stuffing onto the servers and the implications of same to annual budgets and then be ready to strip all those cursors out (hint, start writing your code now that will go and find all the places where you've used a cursor in a stored procedure or job).

Why?  Because crutches become the preferred way of doing things and pretty soon you are missing the whole point of a relational database construct.  In effect you have taken a super-charged race car and tied 30 tons of dead weight to it, and then you'll be pondering why things take so long.  "Where's the performance bottelnecks"?  If the person you report to is any good at relational thinking, they'll know where the bottlenecks are - and they'll tell you to dump the cursors.


Lesson here - think hard about why you think you need that loop... because you probably don't.

No comments:

Post a Comment