Tuesday, November 20, 2012

LOOPING VS. CURSORS

I just cannot say this enough - "Cursors are eeeeeeeeeeeeeeeviiiiilllllllllllllllllllllllll!!!!!!!!!!!!!!!!!!!"

I recently encountered this post in Spiceworks and joined in the conversation -

POST:
I believe I already know the answer to this, but I'd love to hear from some people in the community.

Lets say that we're writing a procedure that needs to perform a calculation based on each row of a table - the example I have below is impractical but shows my point. Just how bad is the performance of using a cursor to loop through the rows, versus a method such as below? Is the way I have outlined below the general best way to do a loop like this?

3 COLUMN EXAMPLE TABLE: rownumber identity PK, intData1, intData2

WHILE(@currentRowNumber <= @maxRowNumber)

BEGIN

INSERT INTO #SomeTable(myCalculation)

SELECT SUM(intData1 + intData2) FROM myTable WHERE PK = @currentRowNumber

SET @currentRowNumber = @currentRowNumber + 1

END

Interestingly, the first 'responder' posted this...
"I have recently been asked to remove cursors from existing stored procedures"

Now why would someone have instructed this person to remove cursors from all their existing stored procedures?

As I've previously posted, there are absolutely tons of opinions - some for, some against - the use of cursors; however, nothing speaks better than quantitative data.  And to resolve this quantitatively just google 'Itzak Ben Gan' and 'cursor'.  Once you've reviewed his articles, you should be clear on just why this responder would have been told to take cursors out of their stored procedure code.

To net it all out - there is hardly any situation you can craft where cursor will out-perform well crafted t-sql.
(There is an interesting exception, when declaring a static cursor one gets considerably improved performance over a plain old vanilla cursor).

That said, assuming you are operating on a table of say, 2,000,000 rows, or even a more modest table of 200,000 rows, figuring out a relational answer to the processing objective will almost always result in faster processing than 'fetching and operating' on one row at a time until you reach the end of the table. I have, on more than one occasion found myself starting to write a loop similar to the one above and while doing it, and writing the various operations to perform inside the loop, I suddenly see that I could just join tables and do a 'mass upsert' type of operation.  In short order, I've taken my 'input batch table' and written two different queries - one joins the batch table to the target and inserts all the rows from the batch where the key in the batch is not in the target.  Then I delete those rows from the batch.  What is left are rows in the batch that have rows already existing in the target - therefore there must be something different about the data (or maybe not depending on who wrote the code that generated the 'input batch' table); so I write an update changing the value for everything in the target to what is in the batch's corresponding row (except for the key of course); joining on the key column(s).



And this is the point of my statement that cursor are evil.  It isn't just the endless arguing about performance of the cursor itself, it is the nature of the beast.

The first time I ever saw a cursor, I thought to myself - "this is just a crutch that has been provided to people who are not accustomed to thinking relationally about data in an RDBMS."  My reason for reaching that conclusion is the 'loop' model in the POST.  Just about every cursor operation I've ever seen is the equivalent of some kind of loop operation; the loop is there to process each row (usually in a batch construct) and it just reeks of sequential thinking - 'first I'll do this to this piece of data in the row, then I'll do that to this piece of data, then blah, blah, blah ... and when I'm done I'll commit the change to this row and go get the next row until I've reach the 'eof' (end of table)'.

If you must do things this way, then by all means use a loop or a cursor. However, be prepared for the day when someone you report to looks at your code, considers all the overhead you're stuffing onto the servers and the implications of same to annual budgets and then be ready to strip all those cursors out (hint, start writing your code now that will go and find all the places where you've used a cursor in a stored procedure or job).

Why?  Because crutches become the preferred way of doing things and pretty soon you are missing the whole point of a relational database construct.  In effect you have taken a super-charged race car and tied 30 tons of dead weight to it, and then you'll be pondering why things take so long.  "Where's the performance bottelnecks"?  If the person you report to is any good at relational thinking, they'll know where the bottlenecks are - and they'll tell you to dump the cursors.


Lesson here - think hard about why you think you need that loop... because you probably don't.

Monday, November 19, 2012

when pivot isn't the answer...

Recently I helped someone out on Spiceworks, so I thought I might as well post the problem and the resolution here as well.

The requestor wrote: "Right now I get a separate row for each Diagnosis_Code with the same service_ID, I need them to appear as separate columns in one row for each service_ID. There are other columns in the output not listed here, so it cannot be a crosstab, unless that can be nested, I don't know."

The problem description went something like this:

I have a table Diagnosis_Codes with a list of codes as Diagnosis_Code, PK Diagnosis_Code_ID
I have a table Service_Diagnoses which has FK Service_IDs and FK Diagnosis_Code_IDs and Service_IDs
I have a table Services with Service_IDs
For one Service_ID, there can be multiple entries in Service_Diagnoses pointing to that Service_ID. Service_Diagnoses.Diagnosis_Code_ID is used to get a human usable definition for each entry in Service_Diagnoses. It is possible for a single Service_ID to have multiple (not more than 4) entries in Service_Diagnoses.
(Hope that is understandable)
Here is what I need to do:
For each service ID I must produce 1 row. In this row, I need the Service_ID, and the (up to 4) Diagnosis_Code's.
Service_ID | Diagnosis_Code1 | Diagnosis_Code2 | Diagnosis_Code3 | Diagnosis_Code4
Or, it can even be the Service_ID and the (up to 4) Service_Diagnoses.Diagnosis_Code_ID 
------------------------------------------------------------------ end ---------------------------------------------------------------------------------

Now I admit, usually when someone posts something like this, I try to get a little more information.  But after I looked at this one for a bit, it was clear enough that I could see what the requestor wanted, but the problem is a pivot isn't well-suited for his 'free-form' kind of result.

So I returned to a trusted approach that served me well before PIVOT came along, and turned out to be the solution to the requestor's needs (and got me the 'best answer' cookie)!

Here is most of that reponse (and all that you need to understand how to craft the solution to this problem!)

If I understand your tables, they would look something like these:

SELECT * FROM serviceid
serviceid
456
789
123

SELECT * FROM diagnosis
diagid diagcode
78 froze
12 burnt
23 dead
56 broke

SELECT * FROM servicediag
serviceid diagid
123 78
123 12
123 23
456 78
456 12
456 23
789 56
789 23
123 56


The first thing I like to do is a query that joins all the tables together and lets me see how the data looks when combined.
You can put it into a real table or a temp table, but the query below does the joins.

SELECT s.serviceid, sd.diagid, d.diagcode
INTO serviceidlist
FROM serviceid AS s
LEFT OUTER JOIN servicediag AS sd ON s.serviceid=sd.serviceid
LEFT OUTER JOIN diagnosis AS d ON sd.diagid=d.diagid
;

The resulting table contents can be seen using the following:
SELECT * FROM serviceidlist;

serviceid diagid diagcode
456 78 froze
456 12 burnt
456 23 dead
789 56 broke
789 23 dead
123 78 froze
123 12 burnt
123 23 dead
123 56 broke


But you want one line for service id, with a max of four columns of diagnostic code ids, so we need to apply a running sequence,
(1 - 4) for each serviceid.

SELECT * ,
rn=ROW_NUMBER() OVER (PARTITION BY serviceid ORDER BY diagid)
INTO serviceidorderedlist
FROM serviceidlist
;

Now the list looks like the one from the following query on the resulting table, "serviceidorderedlist".

SELECT * FROM serviceidorderedlist
;


serviceid diagid diagcode rn
123 12 burnt 1
123 23 dead 2
123 56 broke 3
123 78 froze 4
456 12 burnt 1
456 23 dead 2
456 78 froze 3
789 23 dead 1
789 56 broke 2


Now, all we need is a technique to consolidate all the diag code ids for the service id.

SELECT serviceid, SUM([diagid1])[diagid1], SUM([diagid2])[diagid2], SUM([diagid3])[diagid3], SUM([diagid4])[diagid4]
FROM
(
SELECT serviceid, diagid[diagid1], 0[diagid2], 0[diagid3], 0[diagid4] FROM serviceidorderedlist WHERE rn=1
UNION
SELECT serviceid, 0[diagid1], diagid[diagid2], 0[diagid3], 0[diagid4] FROM serviceidorderedlist WHERE rn=2
UNION
SELECT serviceid, 0[diagid1], 0[diagid2], diagid[diagid3], 0[diagid4] FROM serviceidorderedlist WHERE rn=3
UNION
SELECT serviceid, 0[diagid1], 0[diagid2], 0[diagid3], diagid[diagid4] FROM serviceidorderedlist WHERE rn=4
)dt
GROUP BY serviceid
;

The query above generates a line for each instance of a service id, up to 4 and sticks the appropriate diag code id into the correct column and 0 into the other columns. The union of all the queries in the derived table (dt) gives you up to 4 lines for each service id where only one column has a diag code and the other 3 have 0's; 1 row for each intersection of a service id and a diag code id. But there can still be up to 4 rows of a service id. So we 'wrap' the derived table query (dt) with another select that groups on the service id and sums the diag code columns. Since only on row in the set for any column has a diag code, adding 0 in the sum function has no effect on altering the data, and you get the result below

serviceid diagid1 diagid2 diagid3 diagid4
123       12         23        56       78
456       12         23        78         0
789       23         56         0          0