Pragmatic SQL Server: 2012

Wednesday, December 5, 2012

The need for a better Excel to SQL Server connection...

Just a bit of a rant here...

If there were ever any 2 products from Microsoft that were just made for each other it is Excel and SQL Server... (well, that is, now that the newer xlsx format is available).

But sadly, these two products don't 'schmooze' as easily as they might.

Oh sure there are plenty of ways to 'integrate' Excel to SQL Server, but they are all sadly wanting - to prove that point all you have to do is Google (or Bing, or whatever) something like 'how do I submit parameter queries to SQL Server from Excel' and you'll suddenly see my point here.

All those GUI tools don't do the job well.
Writing .Net or DBO or any of the other alphabet soup of products... well, you can do it, but you really need to know those languages and some 'gotchas' in order to make those work.

If you've ever developed an SSRS report in BIDS, you know how simple and straightforward it is.
Build a data source. Set the connection for the data source. Write the query. Need input parameters? Define them in the query object. Done. Now run the query - boom, you have your answer set.

Here's the real question ... Why in the world isn't this kind of functionality provided as a means to integrate Excel to SQL Server?

Surely Microsoft can see that this capability ought to exist for Excel. If they cannot it is beyond me why they cannot... or maybe it isn't beyond me.

Because after all, if SQL developers had such a tool, they wouldn't need Sharepoint, or all the other middleware tools would they?

They wouldn't need anything at all except Excel and SQL Server...and maybe that's why it won't ever happen?

provisioning for the unanticipated...

I'm in the process of setting up an SSAS cube.

I'm early in the process... tweaking and perfecting the ETL for the fact and dimension tables.

I just ran into something that I should have thought of before, but never had encountered up to now (not because I'm brilliant, no, far from it, it is because up to now, I was just ... plain ... lucky).

Here is what happened. I was feeling pretty smug as the ETL had been running perfectly for about a week when one day I came in and noticed a failed job in SQL Monitor (Red Gate).

I began to investigate why the job failed. It failed trying to re-apply an FK create on a table after the table had been updated.

After figuring out why the FK create wouldn't happen, I discovered it was nothing about the syntax or mechanics of the alter table statement but rather a disconcerting relationship disconnect between a fact table and an associated dimension table. The fact table had exactly 1 row more than should have been there.
That fact table's one extra row had a key value of '0' (put there by a default constraint in the create table statement). That '0' value told me that for this row, there was no matching value in the dimension table I was trying to relate this table to! In 'perfect world' that never happens (luck); but in reality world it can and does happen (my luck ran out).

I have solved the origin of that mystery of why and how that extra row got there, but here's the point/suggestion I'm blogging about today.

When you are inserting values into your dimension tables, it may be a good move to create a row for a condition such as the one I describe above. My dim tables now all have row #1 with keys of '0', associated rows have 'N/A' (text data types) or '0' (number data types).

That way loads won't fail, and the failure won't be a mind-wracking journey into MSFT's error messages about why a key could not be created (ultimately in my case it wasn't about the key's alter table statement at all, but rather about the data in the 2 tables I was trying to create a relationship for).

You can then also code corrections in the ETL for when the Fact table gets a join to the '0' value data in the key column, but have it done AFTER the load job has completed (and this time it won't fail with an error that has you chasing your tail about why a FK could not be created).

Tuesday, November 20, 2012

LOOPING VS. CURSORS

I just cannot say this enough - "Cursors are eeeeeeeeeeeeeeeviiiiilllllllllllllllllllllllll!!!!!!!!!!!!!!!!!!!"

I recently encountered this post in Spiceworks and joined in the conversation -

POST:
I believe I already know the answer to this, but I'd love to hear from some people in the community.

Lets say that we're writing a procedure that needs to perform a calculation based on each row of a table - the example I have below is impractical but shows my point. Just how bad is the performance of using a cursor to loop through the rows, versus a method such as below? Is the way I have outlined below the general best way to do a loop like this?

3 COLUMN EXAMPLE TABLE: rownumber identity PK, intData1, intData2

WHILE(@currentRowNumber <= @maxRowNumber)

BEGIN

INSERT INTO #SomeTable(myCalculation)

SELECT SUM(intData1 + intData2) FROM myTable WHERE PK = @currentRowNumber

SET @currentRowNumber = @currentRowNumber + 1

END

Interestingly, the first 'responder' posted this...
"I have recently been asked to remove cursors from existing stored procedures"

Now why would someone have instructed this person to remove cursors from all their existing stored procedures?

As I've previously posted, there are absolutely tons of opinions - some for, some against - the use of cursors; however, nothing speaks better than quantitative data. And to resolve this quantitatively just google 'Itzak Ben Gan' and 'cursor'. Once you've reviewed his articles, you should be clear on just why this responder would have been told to take cursors out of their stored procedure code.

To net it all out - there is hardly any situation you can craft where cursor will out-perform well crafted t-sql.
(There is an interesting exception, when declaring a static cursor one gets considerably improved performance over a plain old vanilla cursor).

That said, assuming you are operating on a table of say, 2,000,000 rows, or even a more modest table of 200,000 rows, figuring out a relational answer to the processing objective will almost always result in faster processing than 'fetching and operating' on one row at a time until you reach the end of the table. I have, on more than one occasion found myself starting to write a loop similar to the one above and while doing it, and writing the various operations to perform inside the loop, I suddenly see that I could just join tables and do a 'mass upsert' type of operation. In short order, I've taken my 'input batch table' and written two different queries - one joins the batch table to the target and inserts all the rows from the batch where the key in the batch is not in the target. Then I delete those rows from the batch. What is left are rows in the batch that have rows already existing in the target - therefore there must be something different about the data (or maybe not depending on who wrote the code that generated the 'input batch' table); so I write an update changing the value for everything in the target to what is in the batch's corresponding row (except for the key of course); joining on the key column(s).

And this is the point of my statement that cursor are evil. It isn't just the endless arguing about performance of the cursor itself, it is the nature of the beast.

The first time I ever saw a cursor, I thought to myself - "this is just a crutch that has been provided to people who are not accustomed to thinking relationally about data in an RDBMS." My reason for reaching that conclusion is the 'loop' model in the POST. Just about every cursor operation I've ever seen is the equivalent of some kind of loop operation; the loop is there to process each row (usually in a batch construct) and it just reeks of sequential thinking - 'first I'll do this to this piece of data in the row, then I'll do that to this piece of data, then blah, blah, blah ... and when I'm done I'll commit the change to this row and go get the next row until I've reach the 'eof' (end of table)'.

If you must do things this way, then by all means use a loop or a cursor. However, be prepared for the day when someone you report to looks at your code, considers all the overhead you're stuffing onto the servers and the implications of same to annual budgets and then be ready to strip all those cursors out (hint, start writing your code now that will go and find all the places where you've used a cursor in a stored procedure or job).

Why? Because crutches become the preferred way of doing things and pretty soon you are missing the whole point of a relational database construct. In effect you have taken a super-charged race car and tied 30 tons of dead weight to it, and then you'll be pondering why things take so long. "Where's the performance bottelnecks"? If the person you report to is any good at relational thinking, they'll know where the bottlenecks are - and they'll tell you to dump the cursors.

Lesson here - think hard about why you think you need that loop... because you probably don't.

Monday, November 19, 2012

when pivot isn't the answer...

Recently I helped someone out on Spiceworks, so I thought I might as well post the problem and the resolution here as well.

The requestor wrote: "Right now I get a separate row for each Diagnosis_Code with the same service_ID, I need them to appear as separate columns in one row for each service_ID. There are other columns in the output not listed here, so it cannot be a crosstab, unless that can be nested, I don't know."

The problem description went something like this:

I have a table Diagnosis_Codes with a list of codes as Diagnosis_Code, PK Diagnosis_Code_ID

I have a table Service_Diagnoses which has FK Service_IDs and FK Diagnosis_Code_IDs and Service_IDs

I have a table Services with Service_IDs

For one Service_ID, there can be multiple entries in Service_Diagnoses pointing to that Service_ID. Service_Diagnoses.Diagnosis_Code_ID is used to get a human usable definition for each entry in Service_Diagnoses. It is possible for a single Service_ID to have multiple (not more than 4) entries in Service_Diagnoses.

(Hope that is understandable)

Here is what I need to do:

For each service ID I must produce 1 row. In this row, I need the Service_ID, and the (up to 4) Diagnosis_Code's.

Service_ID | Diagnosis_Code1 | Diagnosis_Code2 | Diagnosis_Code3 | Diagnosis_Code4

Or, it can even be the Service_ID and the (up to 4) Service_Diagnoses.Diagnosis_Code_ID

------------------------------------------------------------------ end ---------------------------------------------------------------------------------

Now I admit, usually when someone posts something like this, I try to get a little more information. But after I looked at this one for a bit, it was clear enough that I could see what the requestor wanted, but the problem is a pivot isn't well-suited for his 'free-form' kind of result.

So I returned to a trusted approach that served me well before PIVOT came along, and turned out to be the solution to the requestor's needs (and got me the 'best answer' cookie)!

Here is most of that reponse (and all that you need to understand how to craft the solution to this problem!)

If I understand your tables, they would look something like these:

SELECT * FROM serviceid

serviceid

456

789

123

SELECT * FROM diagnosis

diagid diagcode

78 froze

12 burnt

23 dead

56 broke

SELECT * FROM servicediag

serviceid diagid

123 78

123 12

123 23

456 78

456 12

456 23

789 56

789 23

123 56

The first thing I like to do is a query that joins all the tables together and lets me see how the data looks when combined.

You can put it into a real table or a temp table, but the query below does the joins.

SELECT s.serviceid, sd.diagid, d.diagcode

INTO serviceidlist

FROM serviceid AS s

LEFT OUTER JOIN servicediag AS sd ON s.serviceid=sd.serviceid

LEFT OUTER JOIN diagnosis AS d ON sd.diagid=d.diagid

;

The resulting table contents can be seen using the following:

SELECT * FROM serviceidlist;

serviceid diagid diagcode

456 78 froze

456 12 burnt

456 23 dead

789 56 broke

789 23 dead

123 78 froze

123 12 burnt

123 23 dead

123 56 broke

But you want one line for service id, with a max of four columns of diagnostic code ids, so we need to apply a running sequence,

(1 - 4) for each serviceid.

SELECT * ,

rn=ROW_NUMBER() OVER (PARTITION BY serviceid ORDER BY diagid)

INTO serviceidorderedlist

FROM serviceidlist

;

Now the list looks like the one from the following query on the resulting table, "serviceidorderedlist".

SELECT * FROM serviceidorderedlist

;

serviceid diagid diagcode rn

123 12 burnt 1

123 23 dead 2

123 56 broke 3

123 78 froze 4

456 12 burnt 1

456 23 dead 2

456 78 froze 3

789 23 dead 1

789 56 broke 2

Now, all we need is a technique to consolidate all the diag code ids for the service id.

SELECT serviceid, SUM([diagid1])[diagid1], SUM([diagid2])[diagid2], SUM([diagid3])[diagid3], SUM([diagid4])[diagid4]

FROM

(

SELECT serviceid, diagid[diagid1], 0[diagid2], 0[diagid3], 0[diagid4] FROM serviceidorderedlist WHERE rn=1

UNION

SELECT serviceid, 0[diagid1], diagid[diagid2], 0[diagid3], 0[diagid4] FROM serviceidorderedlist WHERE rn=2

UNION

SELECT serviceid, 0[diagid1], 0[diagid2], diagid[diagid3], 0[diagid4] FROM serviceidorderedlist WHERE rn=3

UNION

SELECT serviceid, 0[diagid1], 0[diagid2], 0[diagid3], diagid[diagid4] FROM serviceidorderedlist WHERE rn=4

)dt

GROUP BY serviceid

;

The query above generates a line for each instance of a service id, up to 4 and sticks the appropriate diag code id into the correct column and 0 into the other columns. The union of all the queries in the derived table (dt) gives you up to 4 lines for each service id where only one column has a diag code and the other 3 have 0's; 1 row for each intersection of a service id and a diag code id. But there can still be up to 4 rows of a service id. So we 'wrap' the derived table query (dt) with another select that groups on the service id and sums the diag code columns. Since only on row in the set for any column has a diag code, adding 0 in the sum function has no effect on altering the data, and you get the result below

serviceid diagid1 diagid2 diagid3 diagid4

123 12 23 56 78

456 12 23 78 0

789 23 56 0 0

Wednesday, October 3, 2012

Cursors are almost never the answer...

Over the years, I've been amazed at how many times I encounter the use of cursors, while loops, etc.

I confess I fall into the trap of using a while loop more often than I'd care to admit.

Stated bluntly, using cursors is a procedural crutch. Here we have a database engine written to perform operations efficiently on massive amounts of information (millions, maybe billions of rows) and what do people tend to do?

They try to perform a series of operations on one row at a time, using the same procedural thinking that they learned in computer science 101.

Yes, I admit it, sometimes it is easier to think that way, but that is not the point.

The point is to ask 'what is the most efficient way?'!

Numerous posts are sprinkled all over the internet by heavyweights such at Itzak Ben-Gan and Paul Nielsen that scientifically prove that the cursor and/or while loop procedural approach rarely if ever is justified.

There is almost no problem space these techniques can be applied to that cannot be solved more efficiently with a little careful thought and the proper application of table joins.

Think about it.

Monday, September 17, 2012

Track report usage

One of the best ways to ascertain how effective your portal is in reaching the users the reports were designed for is to track how many times a report is viewed and who is viewing it.

The report above is a no frills report I have in my 'administrator reports' section of the portal at our company. It is solely for my use.

It shows the report name and the number of times it has been viewed during the period specified in the input parameters.

The query is as follows:

USE [ReportServerSP]

DECLARE @start DATETIME -- < of course these are the input parameters on the actual report,
DECLARE @end DATETIME -- < I just show them here for clarity

SET @start = '08/01/2012' -- < of course these are the input parameters on the actual report,
SET @end = '08/31/2012' -- < I just show them here for clarity

SELECT [ReportPath][Report], COUNT(*)[TimesViewedForPeriod]
FROM [dbo].[ExecutionLog2] AS el
WHERE [TimeStart] BETWEEN @start AND @end
AND [ReportPath] <> 'unknown'
GROUP BY [ReportPath]
order by count(*) desc
;

The database, ReportServerSP is one we created when we integrated SSRS with WSS3.0, the SP at the end clues us that it is for the 'Sharepoint' based portal.

When a report name is clicked, the next report drilled to provides a breakdown as to the number of times a particular user on the DOMAIN viewed the report.

That linked report's query is:
USE [ReportServerSP]

SELECT [UserName], COUNT(*)[UserViewsForPeriod]
FROM [dbo].[ExecutionLog2] AS el
WHERE [TimeStart] BETWEEN @start AND @end
AND [ReportPath] IN ( @reportpath )
GROUP BY [UserName]
;

Using these two reports has come in very handy. There have been times when the President has wanted to know for a particular report, whether it was viewed during a range of dates, and if so, how many times and by who.

Using these reports came in very handy on those days.

Even if you don't have your SSRS reports integrated to WSS, by pointing these queries to your ReportServer database, you should be able to extract the same data.

Have fun with it!

Friday, September 14, 2012

Make Reports Easier to Use

One of the things about a successful SSRS install is that after a short time there are so many folders and reports, the general user population gets lost.

Try this. I have found it well received by the user community where I work.

Create a Word document as a directory of your reports in a folder.
Each report has its name as a hyperlink directly to the report on the portal.
I captured an image of the rendered report and embedded it in the document under the report title, this gives the user a 'picture' view of what the report looks like.
Below that I provide a short description of the input parameters the report takes and the data the report provides as output.

Each time a report is changed, deleted, or a new report added; I update the document.

I have provided the user community with an understanding of how they can be alerted by email when a change to the document has been made.

In this way, the user community is kept abreast of the reports on the portal and has an easier way of examining reports and getting a direct link to the report they want. Saves a lot of time rummaging around in folders that they may not be as familiar with as people that use a particular report every day.

Also, if you note the particular report that is highlighted above; this is a 'control panel' report.
My service manager wanted a set of analytic reports to examine service on a particular manufacturer's product lines - how many parts have we installed, what % was warranty, how much labor and travel, etc.

We built this control panel report to feed the exact same set of inputs to each report in the series.
In this way, he can run a report, examine it, export it to Excel should he wish; then press the backspace key to get to this control panel. Then all he need do is click the next hyper-linked report name or image and the same set of inputs is fed to the next report in the series.

The point here is that a portal may contain a great set of reports; they may be innovative and extremely useful. But if they are hard to locate, or the user community is frustrated in their attempts to use it, then it won't succeed, no matter how brilliant the reports are.

Take a bit of extra time to make the reports as easily accessible as you possibly can. Solicit feedback from the user community; don't be defensive when they call 'your baby' ugly though; listen to what they are telling you and do your best to incorporate their ideas. Follow up with them and ask them about your implementation of THEIR IDEAS; you'll find that in itself will get their attention. Once you have the lines of communication open on using the reports, you'll hopefully find they're being used more often!

Tuesday, September 4, 2012

Monitoring database size

One thing every DBA ought to have (and many do!) is a way to monitor and routinely report on the size of databases and their growth.

This is important when it comes time to discuss the purchase or expansion of a SAN, NAS or other storage subsystem.

Frequently management will make decisions that significantly accelerate database growth, yet the nature of the change makes it almost impossible to predict with any accuracy how fast the growth will be.

Having a good means of tracking, measuring and reporting on this growth is vital to our success when wearing our DBA hat.

The script below is one I use. Every SQL Server install I do, the first thing I do after getting the server up on its legs is I create the database 'DBA'. This holds all my management and measurement 'stuff'. This script creates the table tbl_dbsize (in the comment). Then, whenever I run it, the script gets information on the database files for each database on the server and the size of the file in MB. The "inserted on" timestamp column gives me a way to measure the growth over time.

We recently made the decision to start scanning copies of important documents at the company where I am employed; these document scans can be added to a 'file room' tab in our ERP system. This is great for our ability to attach document copies to transactions in the system. However, when all our locations start doing this and we have no way of knowing with certainty the size of the scanned files, close monitoring is prudent.

So, I'm ratcheting up the frequency of this job from monthly to weekly right now in order to see what impact these scans will have on growth of the database.

If you don't have a script or job like this, give this one a try.

USE [DBA]
/*
Report: Database Size
Contents: size of database files

CREATE TABLE tbl_dbsize (
[inserted on] DATETIME,
[DB Name] NVARCHAR(255),
[Logical file] NVARCHAR(255),
[Physical file] NVARCHAR(1000),
[MB] INTEGER
)
;
*/

INSERT INTO tbl_dbsize
SELECT GETDATE()[inserted on],
DB_NAME(database_id) AS [DB Name],
Name AS [Logical file],
Physical_Name [Physical file], (size*8)/1024 SizeMB
FROM sys.master_files

SELECT * FROM dbsize; -- <-- right now I don't have the need for an SSRS report, I dump the answer set into an Excel spreadsheet that I can use to track the growth of files over time.

Friday, August 10, 2012

As I indicated in my previous post, I'm not enamored of the geospatial mapping in SQL Server reports. But what if you don't have a brilliant person who can write code to interface to Google maps or Bing? What if your executives would rather use Google Earth? What can we do with geospatial data under those circumstances?

Well this is what we can do. We can create a script that will write our kml file data file for use in the Google Earth application.

Here is the exercise. Suppose the President of your company comes to you and asks if it is possible to show the total sales a customer did with your company for the year, and then to plot that data in Google Earth.

Suppose further that the request is that when you hover over the marker, particular data is displayed such as the total number of orders, the type of order, the sale amount and some address information identifying the customer (number, name, city, etc.).

Of course this exercise pre-supposes that you: a) have a version of SQL Server that supports geospatial data, and b) that you have coordinates for the customer so that their location can be plotted in Google Earth.

From that point forward it is merely a straightforward matter of:

a) understanding Google Earth's kml file structure

b) a bit about manipulating markers

c) stringing kml syntax and customer data together into strings that can be inserted between the 'header' and 'footer' sections of your kml file.

So here is the first part of the script:

USE [<database>] -- set the location to the database holding your customer info.

/* -- here is a google red dot marker URL
GOOGLE MARKERS
http://maps.google.com/mapfiles/ms/icons/red-dot.png
*/

DECLARE @kmlhdr NVARCHAR(MAX) -- your kml header text
DECLARE @kmlftr NVARCHAR(MAX) -- your kml footer text
DECLARE @red NVARCHAR(255) -- your marker text
DECLARE @divcode NVARCHAR(4) -- an operating division to report sales on
DECLARE @yr INT -- the year to report on

SET @divcode = '03'
SET @yr = 2012
SET @red = 'http://maps.google.com/mapfiles/ms/icons/red-dot.png'

-- first the easy part; set the footer
SET @kmlftr = '</Document></kml>'

-- more involved, but once you have it down, it is pretty simply and static
SET @kmlhdr =
(
-- use a select statement to string kml 'snippets' together with descriptive info
SELECT
'<?xml version="1.0"?><kml xmlns="http://earth.google.com/kml/2.0">
<Document>

-- below is description data that shows up in the Places section of Google Earth.
-- careful consideration of how you want to group and report sales can pay off
-- by using the tags below against data that is grouped by the tags.
-- then when multiple kml files are loaded for different sale types for
-- different divisions, Places can be used to toggle each kml file on/off to
-- view sales concentrations.

<name>DISTRIBUTION Sales SC Division</name>
<description>Division Sales | My Company, Inc.</description>

-- next comes 3 style sections;
-- A1 is the normal size of the marker

<Style id="A1"><IconStyle><scale>1</scale><Icon><href>'+@red+'</href></Icon><hotSpot x=".3" y=".8" xunits="fraction" yunits="fraction" /></IconStyle><LabelStyle><scale>0</scale></LabelStyle><BalloonStyle><text><![CDATA[ <b>$[name]</b><br />$[address]<br />$[description]<br/><br/>$[geDirections] ]]></text></BalloonStyle></Style>\n

-- ok a bit about the above kml code; CDATA is going to format some info in

-- each <Placemark> row we create in the next T-SQL script section.

-- A2 is going to show the marker 'inflated' when the President mouses over it
<Style id="A2"><IconStyle><scale>1.5</scale><Icon><href>'+@red+'</href></Icon><hotSpot x=".3" y=".8" xunits="fraction" yunits="fraction" /></IconStyle><LabelStyle><scale>1.0</scale></LabelStyle><BalloonStyle><text><![CDATA[ <b>$[name]</b><br />$[address]<br />$[description]<br/><br/>$[geDirections] ]]></text></BalloonStyle></Style>\n

-- A0 is the style map
<StyleMap id="A0"><Pair><key>normal</key><styleUrl>#A1</styleUrl></Pair><Pair><key>highlight</key><styleUrl>#A2</styleUrl></Pair></StyleMap>'

)

-- I won't go into the details of the above style sections, you can read up on

-- them with the Google Earth documents. Just trust me, a lot of time when into

-- figuring it out, and it works!!

-- Again, the point of this exercise is to show how data can be pulled from

-- invoicing information and customer information, and mapped in Google Earth.

-- so this part of the code is simple mechanics to create some sales data...

-- I'll assume you, the reader can pull your own data.

-- Gets basic sales information...

SELECT ISNULL(wo.[serv-code],'')[serv-code],
i.[disp-no], i.[order-no],
CASE WHEN i.[disp-no]<> '' THEN 'service' ELSE 'distribution' END [ordtype],
i.[Document], i.[Invoice-date],
i.[Charge-cust], i.[Cust-no], i.[Ship-name], i.[Ship-address__1], i.[Ship-city], i.[Ship-st], i.[Ship-zip],
i.[Gross-amt]+i.[Extra-Charge__9]+i.[Extra-Charge__10][Gross-amt]
INTO #stg_invoicedata
FROM [dbo].[invoice] AS i
LEFT OUTER JOIN [dbo].[wo] AS wo ON i.[disp-no]=wo.[wo-no]
WHERE DATEPART(yyyy,i.[Invoice-date]) = @yr
AND i.[div-code]=@divcode
;

-- cubes the sales data to get a sales total ytd for each customer
SELECT
CASE WHEN GROUPING([serv-code])= 1 THEN 'All' ELSE [serv-code] END [serv-code],
CASE WHEN GROUPING([ordtype])= 1 THEN 'All' ELSE [ordtype] END [ordtype],
-- CASE WHEN GROUPING([Document])= 1 THEN 'All' ELSE [Document] END [Document],
CASE WHEN GROUPING([Charge-cust])= 1 THEN 'All' ELSE [Charge-cust] END [Charge-cust],
CASE WHEN GROUPING([Cust-no])= 1 THEN 'All' ELSE [Cust-no] END [Cust-no],
CASE WHEN GROUPING([Ship-name])= 1 THEN 'All' ELSE [Ship-name] END [Ship-name],
CASE WHEN GROUPING([Ship-address__1])= 1 THEN 'All' ELSE [Ship-address__1] END [Ship-address__1],
CASE WHEN GROUPING([Ship-city])= 1 THEN 'All' ELSE [Ship-city] END [Ship-city],
CASE WHEN GROUPING([Ship-st])= 1 THEN 'All' ELSE [Ship-st] END [Ship-st],
SUM([Document])[Orders],
SUM([Gross-amt])[Dollars]
INTO #stg_invoicecube2
FROM #stg_invoicecube
GROUP BY [serv-code],[ordtype],[Document],[Charge-cust],[Cust-no],[Ship-name],[Ship-address__1],[Ship-city],[Ship-st]
WITH cube
HAVING [serv-code]<> 'All'
AND [ordtype]<> 'All'
AND [Charge-cust]<> 'All'
AND [Cust-no]<> 'All'
AND [Ship-name]<> 'All'
AND [Ship-address__1]<> 'All'
AND [Ship-city]<> 'All'
AND [Ship-st]<> 'All'
;

-- joins the sales data with the customer data to get the coordinates
SELECT c.*, c2.[Zip-code], c2.[num-lat], c2.[num-lng],
CASE WHEN c2.[num-lat]='0' THEN 'geocode' END [num-latflag],
CASE WHEN c2.[num-lng]='0' THEN 'geocode' END [num-lngflag]
INTO #tmp_invoicecube
FROM #stg_invoicecube2 c
INNER JOIN [dbo].[customer] AS c2 ON c.[Cust-no]=[c2].[Cust-no]
;

So below is a row from the table just to make clear what we want...

Cust-no Ship-name Ship-address__1 Ship-city Ship-st

10170 MY SAMPLE CUSTOMER 123 Anywhere Street Anytown SC

Orders Dollars Zip-code num-lat num-lng

1 3066.58 29071 35.43802 -80.522617

-- Now we want to create <Placemark> rows for each row in the table.

-- In the example above I plot this row from this data.

-- Here is the script to do that...

SELECT '<Placemark><name>ORDERS: '+cast([orders] as nchar(1))+' DOLLARS: '+cast(cast([dollars] as decimal(10,2))as nvarchar(255))+' TYPE: '+Upper([ordtype])+' - BILL TO: '+[cust-no]+' '+[ship-name]+' - SHIP TO: '+
[cust-no]+' '+[ship-name]+' '+[ship-address__1]+'</name><description>'+
[serv-code]+'</description><Point><coordinates>'+cast([num-lng]as nvarchar(255))+','+cast([num-lat] as nvarchar(255))+',0</coordinates></Point><address>'+[ship-address__1]+' '+[ship-city]+', '+cast([ship-st] as nvarchar(255))+' '+cast([zip-code] AS nvarchar(255))+'</address><styleUrl>#A0</styleUrl></Placemark>'

FROM #tmp_invoicecube

If you know a bit about strings, you should be able to follow the SELECT statement above. Simply stated, the name element is going to hold descriptive data about the total number of orders 'ORDERS: ' and the [orders] data value; total sales ' DOLLARS: ' and the [dollars] data value; plus various other data you and your President decide to have shown when a marker is moused over. Next we provide the geospatial plot between the <Point><coordinates> tags. Lastly we provide the full customer address data between the <address> tags.

Substitute your own column names from your own invoiced data table. To help you I have encased my column names in square brackets in the SELECT statement.

The resulting text string(s) your script generates should be in the form of the one below.

<Placemark><name>ORDERS: 1 DOLLARS: 3066.58 TYPE: DISTRIBUTION - BILL TO: 10170 MY SAMPLE CUSTOMER - SHIP TO: 10170 MY SAMPLE CUSTOMER 123 Anywhere Street</name><description>DIST</description><Point><coordinates>-80.522617,35.438020,0</coordinates></Point><address>123 Anywhere Street Anytown, SC 29071</address><styleUrl>#A0</styleUrl></Placemark>

Now take the resulting answer sets and paste into a simple text editor. Put the output for your @kmlhdr variable into the text editor first. Next place the data set for the script that generates the Placemark strings into the text editor. Last, do not forget to place the text for your @kmlftr variable into the text editor. Then save the file and replace the .txt with .kml.

After you have Google Earth installed, double click your .kml file. When your file is opened, you will see a red dot marker designating the location of sales for a Placemark row. When you hover over it with your mouse, you'll see a label that describes the order - total orders and sales dollars - and customer information.

Friday, August 3, 2012

Setting coordinates set once you have spatial data

Although I don't find the spatial maps in SSRS all that useful (I'll post another article in the future on generating kml files for Google Earth); one thing that some folks have found tedious is establishing the coordinates once the latitude and longitude have been updated in a table.

This can be easily done with a while loop, as demonstrated below. Suppose you want to create a bubble map of sales data.

First you need some invoices. To be pragmatic here, I'll assume you are familiar with your own invoice tables, so instead of creating them here, we'll just show the code I used to extract invoice information from my employer's system:

use DBA

declare @divcode nvarchar(10) -- divisional code
declare @yr int -- year to extract
set @divcode = '00'
set @yr=2012

if exists(select * from sysobjects where type = 'u' and [name]='tbl_mapdinvoices')
drop table tbl_mapdinvoices
;

select IDENTITY(int, 1,1) rowid,
cb.[div-code], cb.[Cust-no], c.[address],c.[city],c.[st],c.[zip-code],c.[map-grid],c.[num-lat],c.[num-lng], cb.[Total]
into tbl_mapdinvoices
from
(
SELECT
case when GROUPING([div-code])=1 then 'All' else [div-code] end [div-code],
case when grouping([Charge-cust])=1 then 'All' else [Charge-cust] end [Charge-cust],
case when grouping([Cust-no])=1 then 'All' else [Cust-no] end [Cust-no],
sum([Gross-amt])+sum([Tax-amt__1])+sum([Tax-amt__2])+sum([Tax-amt__3])+sum([Extra-Charge__1])+sum([Extra-Charge__2])+sum([Extra-Charge__3])
+sum([Extra-Charge__4])+sum([Extra-Charge__5])+sum([Extra-Charge__6])+sum([Extra-Charge__7])+sum([Extra-Charge__8])+sum([Extra-Charge__9])
+sum([Extra-Charge__10])[Total]
FROM [dbo].[invoice]
where DATEPART(yyyy,[invoice-date]) IN (@yr)
group by [div-code], [Charge-cust], [Cust-no]
with CUBE
having [div-code] IN ( @divcode )
and [Charge-cust] <> 'All'
and [Cust-no] <> 'All'
) cb
inner join customer c on cb.[Cust-no]=c.[Cust-no]
and c.[num-lat]+c.[num-lng]<>0
;

Notice the IDENTITY column above? The column rowid will serve as a means to 'loop through' the entire table. If one has say, 10 or 20 invoices in the table, manually updating the table to get coordinates isn't a big deal. But if you have a table of 1,000 or 10,000 or 100,000 invoices to map, well, you need a better way.

Here's a redacted listing of the first two rows of my invoice table to give you an idea of what we have so far:
rowid div-code city st num-lat num-lng total
1 00 MARYVILLE TN 35.754724    -83.971210 524.10
2 00 IRVING TX 32.917917    -97.002038 511.31

Now we're going to add a column to our table for coordinates, we will call it 'spatialdata'

alter table tbl_mapdinvoices add spatialdata geography;

Next, we set up the WHILE LOOP to create the coordinates for us. We're going to do a number of things here, so let's look at the code:

declare @counter int
set @counter = 1
declare @sql nvarchar(max)
declare @lat decimal(10,6)
declare @lng decimal(10,6)

while @counter < (select MAX(rowid)+1 from tbl_mapdinvoices)
begin
set @lat = (select ([num-lat]) from tbl_mapdinvoices where rowid = @counter)
set @lng = (select ([num-lng]) from tbl_mapdinvoices where rowid = @counter)
set @sql = 'update tbl_mapdinvoices set spatialdata ='+''''+'POINT('+(select cast(@lng as varchar(16)) )+' '+(select cast(@lat as varchar(16)) )+')'+''''
set @sql = @sql+' where rowid ='+(select cast(@counter as nvarchar(10)) )+';'
--print @sql
exec sp_executesql @sql
set @counter = @counter+1
end

First we declare our counter as an integer and set it's value to '1'.
While declaring "stuff", we're going to create a string for our UPDATE sql, as well as declaring our latitude and longitude variables.

We initiate our loop with a WHILE statement that essentially says, while the counter variable is less than the value of the maximum row id plus 1 from the table of invoices continue looping.
(You may be asking 'Why not just use a cursor'? The answer is that there are multiple posts and myriad books and other analyses that demonstrate cursors are rarely more efficient than using other T-SQL tactics. I have therefore abandoned using cursors completely as I have never found a need for them.)

Next we set our latitude variable to the value of the num-lat column in our invoice table where the rowid in the table is equal to the counter. Likewise we set the value of longitude variable to our num-lng column value.

Now we employ the DYNAMIC SQL technique to build our UPDATE string.
If you are not familiar with DYNAMIC SQL, once you have an invoice table built with a rowid as described in the code, comment out the exec sp_executesql @sql statement and uncomment out the print @sql statement and run the WHILE LOOP. You should see a series of statements in your answer set that look something like this:

update tbl_mapdinvoices set spatialdata ='POINT(-83.971210 35.754724)' where rowid =1;
update tbl_mapdinvoices set spatialdata ='POINT(-97.002038 32.917917)' where rowid =2;
update tbl_mapdinvoices set spatialdata ='POINT(-80.949954 34.071876)' where rowid =3;
update tbl_mapdinvoices set spatialdata ='POINT(-79.221837 35.431453)' where rowid =4;

(DYNAMIC SQL is a good way to programmatically build statements for subsequent execution.)
Don't forget to uncomment the exec sp_executesql @sql statement and comment out the print @sql statement before you run the code though.
The exec sp_executesql @sql statement is the last thing in the WHILE LOOP before incrementing the counter and returning to the top of the loop.

Once this WHILE loop is finished, you will have populated the spatialdata column of your table.
Below is a redacted listing of the first two rows of my invoice table:

rowid city st num-lat num-lng spatialdata
1   MARYVILLE TN 35.754724 -83.971210 0xE6100000010CCFC0C8CB9AE0414015E3FC4D28FE54C0
2   IRVING   TX 32.917917 -97.002038 0xE6100000010CA2B8E34D7E7540405AD6FD63214058C0

Now we want to convert the spatialdata into coordinates.

alter table tbl_mapdinvoices add coordinates nvarchar(255);

update tbl_mapdinvoices
set coordinates = convert(nvarchar(255),[spatialdata])
;

Below is a redacted listing of the first two rows of my invoice table with coordinates.

rowid city st num-lat num-lng   coordinates
1 MARYVILLE TN 35.754724 -83.971210 POINT (-83.97121 35.754724)
2   IRVING   TX 32.917917 -97.002038 POINT (-97.002038 32.917917)

Now I have the table ready to map the data.
To see an excellent example of how to spatially map data in SSRS, see the posting at this link:
http://www.mssqltips.com/sqlservertip/2174/maps-in-sql-server-2008-r2-reporting-services/

Monday, July 30, 2012

Those pesky aggregate reports - USING CUBE

Over the course of using SQL Server for the years, I confess I have come up with myriad ways of producing those pesky aggregate reports that management so loves; such a report being the YTD sales of customers.

Some of the methods I have created have been to UNION multiple tables holding the data for different groups, building derived tables, doing the equivalent with CTE, and probably many others. I've tried using stored procedures as well, but invariably each request is an ad-hoc one, and the stored procedure approach led nowhere because of it.

Then came 'CUBE'. I have settled off on using this because it very easily and readily creates the information I need. (Now if Microsoft would only improve the PIVOT operation so that multiple aggregates could be handled... but I digress).

Here is an example of how to use a CUBE operation to simplify the summation of sales data for a report of customers' sales YTD.

First we need to build a customer table....

-- create some customers
if exists(select * from sysobjects where type = 'U' and name = 'customer')
drop table customer
;

SELECT '28549'[billing #], '28549'[customer #], cast('ABC Corporation' as nvarchar(255))[name], cast('123 Anywhere Place' as nvarchar(255))[Address],
cast('Irmo' as nvarchar(255))[City], 'SC'[St]
into customer;

insert into customer SELECT '32568'[billing #], '32568'[customer #], 'ABC Corporation'[name], '123 Anywhere Place'[Address], 'Irmo'[City], 'SC'[St];
insert into customer SELECT '46598'[billing #], '46598'[customer #], 'Hannibal Cartons'[name], '1235 Courtney Blvd.'[Address], 'Walterboro'[City], 'SC'[St];
insert into customer SELECT '84956'[billing #], '84956'[customer #], 'Quattle Container'[name], '3857 Cedar Road'[Address], 'Lexington'[City], 'SC'[St];
insert into customer SELECT '51324'[billing #], '51324'[customer #], 'Simmons Freightlines'[name], '456 Anywhere Place'[Address], 'Irmo'[City], 'SC'[St];

select * from customer
;

billing #	customer #	name	Address	City	St
28549	28549	ABC Corporation	123 Anywhere Place	Irmo	SC
32568	32568	ABC Corporation	123 Anywhere Place	Irmo	SC
46598	46598	Hannibal Cartons	1235 Courtney Blvd.	Walterboro	SC
84956	84956	Quattle Container	3857 Cedar Road	Lexington	SC
51324	51324	Simmons Freightlines	456 Anywhere Place	Irmo	SC

Next we need to build some invoice data. Being the pragmatic sort, since I don't need to provide breakdowns by items sold, I don't need the line data, so I'll be using data from the invoice header table only.

-- create some invoice headers
if exists(select * from sysobjects where type = 'U' and name = 'invoicehdr')
drop table invoicehdr
;

select '46598'[billing #], '46598'[customer #], cast('05-11-2012' as datetime)[invoice-date], 1256.55[gross], cast(1256.55*.05 as decimal(10,2))[tax] into invoicehdr
insert into invoicehdr select '28549'[billing #], '28549'[customer #], cast('06-30-2012' as datetime)[invoice-date], 125.36[gross], cast(125.36*.05 as decimal(10,2))[tax]
insert into invoicehdr select '28549'[billing #], '28549'[customer #], cast('07-12-2012' as datetime)[invoice-date], 326.10[gross], cast(326.10*.05 as decimal(10,2))[tax]
insert into invoicehdr select '28549'[billing #], '28549'[customer #], cast('07-28-2012' as datetime)[invoice-date], 42.10[gross], cast(42.10*.05 as decimal(10,2))[tax]
insert into invoicehdr select '32568'[billing #], '32568'[customer #], cast('01-21-2012' as datetime)[invoice-date], 442.30[gross], cast(442.30*.05 as decimal(10,2))[tax]
insert into invoicehdr select '32568'[billing #], '32568'[customer #], cast('03-19-2012' as datetime)[invoice-date], 442.30[gross], cast(442.30*.05 as decimal(10,2))[tax]
insert into invoicehdr select '32568'[billing #], '32568'[customer #], cast('04-19-2012' as datetime)[invoice-date], 442.30[gross], cast(442.30*.05 as decimal(10,2))[tax]
insert into invoicehdr select '32568'[billing #], '32568'[customer #], cast('05-19-2012' as datetime)[invoice-date], 442.30[gross], cast(442.30*.05 as decimal(10,2))[tax]
insert into invoicehdr select '32568'[billing #], '32568'[customer #], cast('06-19-2012' as datetime)[invoice-date], 442.30[gross], cast(442.30*.05 as decimal(10,2))[tax]
insert into invoicehdr select '32568'[billing #], '32568'[customer #], cast('07-19-2012' as datetime)[invoice-date], 442.30[gross], cast(442.30*.05 as decimal(10,2))[tax]
insert into invoicehdr select '46598'[billing #], '46598'[customer #], cast('01-19-2012' as datetime)[invoice-date], 546.30[gross], cast(546.30*.05 as decimal(10,2))[tax]
insert into invoicehdr select '46598'[billing #], '46598'[customer #], cast('03-10-2012' as datetime)[invoice-date], 342.20[gross], cast(342.20*.05 as decimal(10,2))[tax]
insert into invoicehdr select '51324'[billing #], '51324'[customer #], cast('03-14-2012' as datetime)[invoice-date], 156.89[gross], cast(156.89*.05 as decimal(10,2))[tax]
insert into invoicehdr select '51324'[billing #], '51324'[customer #], cast('04-23-2012' as datetime)[invoice-date], 423.65[gross], cast(423.65*.05 as decimal(10,2))[tax]

select * from invoicehdr
;

billing #	customer #	invoice-date	gross	tax
46598	46598	5/11/2012	1256.55	62.83
28549	28549	6/30/2012	125.36	6.27
28549	28549	7/12/2012	326.1	16.31
28549	28549	7/28/2012	42.1	2.11
32568	32568	1/21/2012	442.3	22.12
32568	32568	3/19/2012	442.3	22.12
32568	32568	4/19/2012	442.3	22.12
32568	32568	5/19/2012	442.3	22.12
32568	32568	6/19/2012	442.3	22.12
32568	32568	7/19/2012	442.3	22.12
46598	46598	1/19/2012	546.3	27.32
46598	46598	3/10/2012	342.2	17.11
51324	51324	3/14/2012	156.89	7.84
51324	51324	4/23/2012	423.65	21.18

Now we have a customer table and we have some invoice data for those customers; we are ready to create a sales ytd report.

Using the CUBE operation, we can easily and rapidly get our summary data as shown below:

select
case when grouping(i.[billing #])=1 then 'All' else [billing #] end [billing #],
case when grouping(i.[customer #])=1 then 'All' else [customer #] end [customer #],
sum(i.gross)gross,
sum(i.tax)tax
from invoicehdr as i with (nolock)
group by i.[billing #], i.[customer #]
with cube
order by [billing #], [customer #]
;

To use the 'with cube' operation, you need to use the GROUPING() statement on the columns you want to group by. The GROUPING statement will return a value of 1 for the group and a 0 for each element in the group. However, I like to case my GROUPING() statements to return the word, 'All' for the total and then the data for each element in the group. It just makes more sense to me. You can use whatever you choose, that's one of the points of being pragmatic.

The query returns the following answer set:

billing #	customer #	gross	tax
28549	28549	493.56	24.69
28549	All	493.56	24.69
32568	32568	2653.8	132.72
32568	All	2653.8	132.72
46598	46598	2145.05	107.26
46598	All	2145.05	107.26
51324	51324	580.54	29.02
51324	All	580.54	29.02
All	28549	493.56	24.69
All	32568	2653.8	132.72
All	46598	2145.05	107.26
All	51324	580.54	29.02
All	All	5872.95	293.69

Notice that for the billing # column I get a row back for each billing # in the invoice table, and I get an "All" row back for each data in the customer # column; plus 1 row where the customer # value is "All". Additionally, I get a row of data for each customer # and an "All" row for each customer #. These rows are basically the group total rows. I don't find them of much value, because I usually create our company's reports in SSRS. SSRS provides a grouping function in the tablix control to render group and sub-group totals so these rows only get in the way.

Here is how to filter them out easily... A good YTD report is going to show more data about the customer than their billing and customer numbers, so we want to perform a join of the CUBE query answer set and the customer table as shown below:

select cb.[billing #], cb.[customer #], c.[name], c.[Address], c.City, c.St, cb.gross, cb.tax
from
(
select
case when grouping(i.[billing #])=1 then 'All' else [billing #] end [billing #],
case when grouping(i.[customer #])=1 then 'All' else [customer #] end [customer #],
sum(i.gross)gross,
sum(i.tax)tax
from invoicehdr as i with (nolock)
group by i.[billing #], i.[customer #]
with cube
) cb
inner join customer as c with (nolock) on cb.[billing #]=c.[billing #] and cb.[customer #]=c.[customer #]
order by gross desc
;

Since I know there is no billing # or customer # data with the value "All" the inner join is going to filter out the group and subgroup total rows out of my cube. Since the billing and customer numbers are identical in this data set, and in the interest of having space on my post page for a tidy report, I'll eliminate the billing # from the answer set:

customer #	name	Address	City	St	gross	tax
32568	ABC Corporation	123 Anywhere Place	Irmo	SC	$ 2,653.80	$ 132.72
46598	Hannibal Cartons	1235 Courtney Blvd.	Walterboro	SC	$ 2,145.05	$ 107.26
51324	Simmons Freightlines	456 Anywhere Place	Irmo	SC	$ 580.54	$ 29.02
28549	ABC Corporation	123 Anywhere Place	Irmo	SC	$ 493.56	$ 24.69

Since no billing # matches the 'All' value in the cube data set, and likewise for the customer #, as I wrote above, these rows are eliminated from the report. All we have are the total sales year-to-date for the customers in our customer table.

This is so much simpler than any other method I've ever used. If you don't know about the CUBE and ROLLUP clauses you need to check them out. They're tremendous time savers (and head ache savers)!!

Tuesday, July 24, 2012

Use INFORMATION_SCHEMA.COLUMNS to help build your INSERT, SELECT or other query

Last month I posted on a 'pragmatic' way to use internal tables to help build query statements dealing with large tables.

However, as INFORMATION_SCHEMA is the "new, ANSI standard" way to go, I thought I'd show that it can be used in a similar manner.

Also, I've often read posts where it is alluded that INFORMATION_SCHEMA should be used over sysinternals such as sysobjects and syscolumns.

I recently read this post from SQLServerCentral.com that shows this assertion is false at this time.
http://www.sqlservercentral.com/Forums/Topic1272370-1526-1.aspx#bm1277421

In the case of listing column names, either way is OK. However, you should be aware of when and why you choose to use either. There are some risks involved in always relying on INFORMATION_SCHEMA as Gail Shaw points out.

So, in the interest of preventing this site from going stale, even as I try to get on top of my project pile, I offer this short post to show that you can indeed continue to use this technique with the newer INFORMATION_SCHEMA.COLUMNS view to get a list of table column names in order of their position in the table.

Try this comparison to help you get comfortable with using the new "IS" views.

use dba -- < pick your own database containing large tables

-- pick a large table and substitute its name for 'nat'.

-- THIS IS THE OLD WAY
select name from syscolumns where id in (select id from sysobjects where name = 'nat') order by colorder
-- ANSWER SET
/*EventLog, RecordNumber, TimeGenerated, TimeWritten, EventID, EventType, EventTypeName, EventCategory, EventCategoryName, SourceName, Strings, ComputerName, SID, Message, Data*/

-- THIS IS THE NEW WAY
-- ANSWER SET
select column_name from information_schema.columns where table_name = 'nat' order by ordinal_position
/*EventLog, RecordNumber, TimeGenerated, TimeWritten, EventID, EventType, EventTypeName, EventCategory, EventCategoryName, SourceName, Strings, ComputerName, SID, Message, Data*/

Personally, I'm starting to like the new way better as I don't have to reference the INFORMATION_SCHEMA.tables in order to pull the data from INFORMATION_SCHEMA.columns.
The old way I have to use sysobjects to get the table id for the syscolumns to use.

You can use the INFORMATION_SCHEMA views in a number of interesting ways to help you reduce the time it takes to work with larger objects in your databases.

The following views apply ...

INFORMATION_SCHEMA.CHECK_CONSTRAINTS
INFORMATION_SCHEMA.COLUMN_DOMAIN_USAGE
INFORMATION_SCHEMA.COLUMN_PRIVILEGES
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE
INFORMATION_SCHEMA.CONSTRAINT_TABLE_USAGE
INFORMATION_SCHEMA.DOMAIN_CONSTRAINTS
INFORMATION_SCHEMA.DOMAINS
INFORMATION_SCHEMA.KEY_COLUMN_USAGE
INFORMATION_SCHEMA.PARAMETERS
INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS
INFORMATION_SCHEMA.ROUTINE_COLUMNS
INFORMATION_SCHEMA.ROUTINES
INFORMATION_SCHEMA.SCHEMATA
INFORMATION_SCHEMA.TABLE_CONSTRAINTS
INFORMATION_SCHEMA.TABLE_PRIVILEGES
INFORMATION_SCHEMA.TABLES
INFORMATION_SCHEMA.VIEW_COLUMN_USAGE
INFORMATION_SCHEMA.VIEW_TABLE_USAGE
INFORMATION_SCHEMA.VIEWS

Take time to become familiar with the information these views provide. It can really help you increase your productivity writing T-SQL.