블로그 이미지
LifeisSimple

calendar

1 2 3 4
5 6 7 8 9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29 30 31

Notice

2012. 1. 6. 16:17 Brain Trainning/DataBase

출처 :  http://www.sqlmag.com/article/tsql/sql-server-2012-window-functions-141022?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+sqlmag%2Ftsqlpowershell+%28SQL%3EArticles%3ET-SQL+and+PowerShell+Scripting+%29&utm_content=Google+International 

Microsoft SQL Server 2012: How To Write T-SQL Window Functions, Part 2

Using offset and distribution functions
SQL Server Pro
InstantDoc ID #141022
Last month, I started a series of articles about the profound window functions and their support in SQL Server 2012. I explained the concept of SQL windowing, I described the elements involved in window specifications (partitioning, ordering, and framing), and I discussed the difference between the two window frame unit options ROWS and RANGE. I showed examples using window aggregate functions. This month, I cover two other types of window functions: offset and distribution. If you missed part 1 of this article series, see "SQL Server 2012's Window Functions, Part 1."

As a reminder, you need to use SQL Server Denali CTP3 or later to run the sample code for this series of articles.  You also need to install the sample database TSQL2012. You also need to download the source code file that creates and populates the sample database.

WINDOW OFFSET FUNCTIONS

Window offset functions let you return a value from a row that's in a certain offset from the current row (LAG and LEAD) or from the first or last row in the window frame (FIRST_VALUE and LAST_VALUE). Let's start with the LAG and LEAD functions.

Window offset functions LAG and LEAD. The LAG function returns a value from a row in the window partition that, based on the window ordering, is the specified number of rows before the current row. Similarly, the LEAD function returns a value from a row in the window partition that, based on the window ordering, is the specified number of rows after the current row. By default, the LAG and LEAD functions assume an offset of one row if an explicit offset wasn't specified.

You indicate the value you want to return from the row in question as the first argument to LAG and LEAD. If you want to indicate an explicit offset, you indicate it as the second argument to the function. If a row isn't found in the requested offset, the functions return a NULL. If you want to return a different value in case a row isn't found, you can indicate such a value as the third argument to the function.

As an example, the following query returns for each customer order the value of the customer's previous order (LAG), as well as the value of the customer's next order (LEAD):

USE TSQL2012;

SELECT custid, orderdate, orderid, val,

LAG(val) OVER(PARTITION BY custid

ORDER BY orderdate, orderid) AS prevval,

LEAD(val) OVER(PARTITION BY custid

ORDER BY orderdate, orderid) AS nextval

FROM Sales.OrderValues; 

Figure 1 shows the output of this query.

Because the calculations are supposed to be performed for each customer independently, the functions partition the window by custid. As for window ordering, it's based on orderdate and orderid as a tiebreaker. Observe that the functions rely on the default offset 1 and return NULL when a row isn't found in the applicable offset.

You can freely mix in the same expression detail elements from the row as well as calls to window functions. For example, the following query computes the difference between the customer's current and previous order values, as well as the difference between the customer's current and next order values:

SELECT custid, orderdate, orderid, val,

val - LAG(val) OVER(PARTITION BY custid

ORDER BY orderdate, orderid) AS diffprev,

val - LEAD(val) OVER(PARTITION BY custid

ORDER BY orderdate, orderid) AS diffnext

FROM Sales.OrderValues; 

Figure 2 shows the output of this query.

As I mentioned, the default when an explicit offset isn't specified is 1 -- but you can indicate your own value as a second argument to the function. You can also indicate as a third argument what to return instead of a NULL when a row isn't found in the requested offset. Here's an example that specifies 2 as the offset and 0 as the value to return when a row isn't found:

SELECT custid, orderdate, orderid, val,

LAG(val, 2, 0) OVER(PARTITION BY custid

ORDER BY orderdate, orderid) AS prev2val,

LEAD(val, 2, 0) OVER(PARTITION BY custid

ORDER BY orderdate, orderid) AS next2val

FROM Sales.OrderValues; 

Figure 3 shows the output of this query.

Window offset functions FIRST_VALUE and LAST_VALUE. The functions FIRST_VALUE and LAST_VALUE return the requested value from the first and last rows, respectively, from the applicable window frame. In "SQL Server 2012's Window Functions, Part 1," I described the concept of a window frame in detail. Quite often, you just want to return the first and last values from the window partition in general and not necessarily from a more restricted window frame. However, this can be a bit tricky to achieve. First, let me provide a query that correctly returns along with each customer's order the values of the customer's first and last orders using the FIRST_VALUE and LAST_VALUE functions, respectively:

SELECT custid, orderdate, orderid, val,

FIRST_VALUE(val) OVER(PARTITION BY custid

ORDER BY orderdate, orderid

ROWS UNBOUNDED PRECEDING)

AS val_firstorder,

LAST_VALUE(val) OVER(PARTITION BY custid

ORDER BY orderdate, orderid

ROWS BETWEEN CURRENT ROW

AND UNBOUNDED FOLLOWING)

AS val_lastorder

FROM Sales.OrderValues; 

Figure 4 shows the output of this query.

You're probably wondering: Why the lengthy window descriptors? More specifically, why the need for an explicit ROWS clause if the request isn't concerned with a more restricted window frame, but rather the partition in general?

This has to do with the fact that the window order clause is in essence only part of the framing specification, and if an explicit window frame isn't specified, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. For the FIRST_VALUE function, relying on the default frame would still yield the correct result because the first row in the default window frame is the first row in the window partition. Still, last month I gave a recommendation to stick to ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW instead of the default RANGE option when possible.

As for the LAST_VALUE function, try to think what it would mean to rely on the default framing option. It means that you will basically always get the value from the current row, because that's the last row in the default window frame. So for LAST_VALUE, you really need to be explicit about the frame specification and indicate the ROWS unit, CURRENT ROW as the lower bound and UNBOUNDED FOLLOWING as the upper bound (assuming you want the value from the last row in the window partition).

Just as I showed with LAG and LEAD, FIRST_VALUE and LAST_VALUE can likewise be mixed in expressions that also involve detail elements from the row. For example, the following query returns the difference between the customer's current and first order values, as well as the difference between the customer's current and last order values:

SELECT custid, orderdate, orderid, val,

val - FIRST_VALUE(val) OVER(PARTITION BY custid

ORDER BY orderdate, orderid

ROWS UNBOUNDED PRECEDING) AS difffirst,

val - LAST_VALUE(val) OVER(PARTITION BY custid

ORDER BY orderdate, orderid

ROWS BETWEEN CURRENT ROW

AND UNBOUNDED FOLLOWING) AS difflast

FROM Sales.OrderValues; 

Figure 5 shows the output of this query.

As food for thought, try to think why a grouped version (as opposed to the existing windowed version) of the FIRST_VALUE and LAST_VALUE functions would have made perfect sense, although -- alas -- there's no support for such a version. Also, can you think of a workaround that would achieve the same result?

WINDOW DISTRIBUTION FUNCTIONS

Window distribution functions provide statistical computations. SQL Server 2012 implements two window rank distribution functions called PERCENT_RANK and CUME_DIST and two window inverse distribution functions called PERCENTILE_DISC and PERCENTILE_CONT. I'll start with window rank distribution functions.

To explain what the PERCENT_RANK and CUME_DIST functions calculate, I need to first provide a definition of the elements involved in their internal computation. Let rk be the rank of the row (using the same partitioning and ordering as the window function's partitioning and ordering). Let nr be the number of rows in the partition. Let np be the number of rows that precede or peer with the current row.

Then, the PERCENT_RANK computes a percentile rank as (rk - 1) / (nr - 1), and CUME_DIST computes a cumulative distribution as np / nr. As an example, the following query computes the percentile rank and cumulative distribution of student test scores, partitioned by testid and ordered by score:

SELECT testid, studentid, score,

PERCENT_RANK() OVER(PARTITION BY testid

ORDER BY score) AS percentrank,

CUME_DIST() OVER(PARTITION BY testid

ORDER BY score) AS cumedist

FROM Stats.Scores; 

Figure 6 shows the output of this query.

As an exercise, try to write SQL Server 2008-compatible solutions that compute percentile rank and cumulative distribution.

SQL Server 2012 also implements two window inverse distribution functions called PERCENTILE_DISC and PERCENTILE_CONT. A percentile p, loosely speaking, is the value v from the population, such that p percent of the values are less than v. For example, if 50 percent of the values in the population are less than some value v,then v is the 50th percentile, also known as the median.

The two function variations implement two different distribution models. The PERCENTILE_DISC function implements a discrete distribution model in which the returned value must be one of the values from the population. The PERCENTILE_CONT function implements a continuous distribution model in which the returned value is interpolated from the existing values, assuming continuous distribution.

As a simple example, in case of an even number of values, PERCENTILE_CONT will compute the median as the average of the two middle points. It gets far trickier to understand the interpolation when the requested percentile isn't the median, but fortunately the function has this complexity embedded into it.

For example, the following code computes the median of student test scores per test using both PERCENTILE_DISC and PERCENTILE_CONT:

DECLARE @pct AS FLOAT = 0.5;

SELECT testid, score,

PERCENTILE_DISC(@pct) WITHIN GROUP(ORDER BY score)

OVER(PARTITION BY testid) AS percentiledisc,

PERCENTILE_CONT(@pct) WITHIN GROUP(ORDER BY score)

OVER(PARTITION BY testid) AS percentilecont

FROM Stats.Scores; 

Figure 7 shows the output of this query.

As an exercise, see if you can figure out how to implement a SQL Server 2008-compatible solution for both functions, given any percent as input.

STILL MORE TO COME

This article is the second in a series of articles about window functions. Last month, I introduced the concept of windowing. This month, I covered window offset functions and window distribution functions. Next month, I'll cover optimization of window functions.

posted by LifeisSimple
2012. 1. 6. 15:17 Brain Trainning/DataBase

출처 :  http://www.sqlmag.com/article/tsql/sql-server-2012-window-functions-140228?utm_source=feedburner 


How to Use Microsoft SQL Server 2012's Window Functions, Part 1

An introduction to the concept of windowing
SQL Server Pro
InstantDoc ID #140228

SQL Server 2012 (formerly code-named SQL Server Denali) introduces several important T-SQL programmability features; this article focuses on one of those features-window functions. SQL Server 2005 was the first milestone in supporting window functions; it introduced window ranking functions (ROW_NUMBER, RANK, DENSE_RANK, and NTILE), as well as limited support for window aggregate functions-only with a window partition clause. SQL Server 2012 enhances support for window aggregate functions by introducing window order and frame clauses, support for offset functions (LAG, LEAD, FIRST_VALUE, and LAST_VALUE), and support for window distribution functions (PERCENT_RANK, CUME_DIST, PERCENTILE_DISC, and PERCENTILE_CONT).

Window functions are the best thing to happen since sliced bread; therefore, I'm going to spend more than one article on the topic. In fact, I just finished writing an entire book on the topic-it's that big! This month I'll introduce the concept of windowing, describe the elements involved in window specifications, and cover window aggregate functions. In later articles I'll describe window offset functions, window distribution functions, and optimization of window functions.

To be able to run the examples from this series, you need to use SQL Server Denali CTP3 or later, as well as a sample database called TSQL2012. You can download SQL Server Denali CTP3 from Microsoft's website. www.microsoft.com/betaexperience/pd/SQLDCTP3CTA/enus/default.aspx. You can click the link to download the sample database TSQL2012.
 

DEFINITION

A window function is a function that's applied to a set of rows defined by a window descriptor and returns a single value for each row from the underlying query. The purpose of the window descriptor is to define the set of rows that the function should apply to. You provide the window specification using a clause called OVER that you're probably familiar with from SQL Server 2005's window ranking and aggregate functions. Here's an example from SQL Server 2012, relying on new capabilities:

USE TSQL2012;

SELECT empid, ordermonth, qty,

SUM(qty) OVER(PARTITION BY empid

ORDER BY ordermonth

ROWS BETWEEN UNBOUNDED PRECEDING

AND CURRENT ROW) AS runqty

FROM Sales.EmpOrders; 

Figure 1 shows an abbreviated form of the query output.

Don't worry about not understanding the full window function's specification yet; we'll get to that later. I just wanted you to have the code in front of you as I discuss the concepts. For now, suffice it to say that the query returns detail elements from rows in the EmpOrders view, such as the employee ID, the order month, and the current month's quantity; the query also uses a window function to compute a running total quantity from the beginning of the employee's activity and until the current month. Again, more details are coming shortly.

As you know, there are other forms of set functions that T-SQL supports besides window functions-but window functions have several important advantages. Aggregate functions are a good example of a kind of function that needs to be applied to a set of rows-not the only kind, but a good starting point to explain the concepts. The relevance of the window concept is in how you define the set of rows for the function to apply to, and where you can use the function in the language.

Try to think of the more traditional T-SQL language constructs that let you define a set of rows for an aggregate function to work with. The most obvious one that comes to mind is grouped queries. Grouped queries define groups, or sets, of rows to which aggregate functions can be applied. Each group is represented by one result row. A big limitation of grouped queries is that after you group the data, you have to apply all computations in the context of the groups. You gain insights to new information in the form of aggregates, but you lose the detail. This means that you can't directly involve in the same expression a detail element from a source row as well as an aggregate for the group.

Think of even simple things such as computing a percent of an order value out of the customer total. Using grouped queries, you basically must have one query that groups the data and computes aggregates and another that joins the result of the grouped query and the detail rows. Now think of how things get even more complicated if you need to compute the percent of the current order value out of the customer total, as well as out of the grand total.

Another way to define a set of rows for an aggregate function to work with is using subqueries. Each subquery can be applied to a different set of rows, and you can mix into the same expressions a detail element and the result of an aggregate applied to a subquery. However, a subquery starts with a fresh view of the data.

What if you need to apply the aggregate function to a subset of rows from the underlying query, and the underlying query has a lot going on (table operators such as joins, additional filters, and so on)? You need to repeat in the subquery a lot of the logic that appears in the underlying query. This can lead to verbose and complex code. Also, each subquery is optimized in SQL Server with separate access to the data; the optimizer doesn't have the logic today to internally collapse multiple subquery calculations that need to be applied to the same set of rows to perform only one visit to the data.

Window functions are designed in such an interesting way that they circumvent a lot of the limitations that apply to grouped queries and subqueries. In addition, their design lends itself to good optimization and fairly simple indexing guidelines.

WINDOW AGGREGATE FUNCTION SPECIFICATION

To understand what's involved in the specification of a window function, examine again the query I provided earlier:

SELECT empid, ordermonth, qty,

SUM(qty) OVER(PARTITION BY empid

ORDER BY ordermonth

ROWS BETWEEN UNBOUNDED PRECEDING

AND CURRENT ROW) AS runqty

FROM Sales.EmpOrders; 

The first thing that's important to understand is that the OVER clause defines the set of rows for the function to work with. You can think of it as SUM(qty) OVER(<the following set of rows>). The initial set of rows that the function has to work with is the underlying query's result set. So the form SUM(qty) OVER() basically means the sum of all quantities from the underlying query's result set. Anything you add to the window specification will basically let you further restrict the window. Because the underlying query's result set is achieved only after you get to the SELECT phase of the logical query processing (after FROM, WHERE, GROUP BY, and HAVING), you're allowed to use window functions only in the SELECT and ORDER BY clauses of a query. If you need to refer to the result of a window function in other query clauses, you have to do so indirectly by using a table expression, such as a CTE.

Before I discuss how you can further restrict the window, consider the advantages of defining the set of rows for the function using a windowing concept versus grouping or subqueries. Unlike with grouped queries, the window function defines its own set of rows to work with. There are no limitations imposed on the rest of the expressions in the query. You can mix detail elements and results of window functions in the same expression. You can even have different window functions work with different windows of rows. Unlike subqueries, window functions don't start from a fresh view of the data but rather with the underlying query's result set-hence there's no need to repeat the logic from the underlying query. I also mentioned that window functions lend themselves to good optimization, but I'll cover the optimization part in a future article.

As to the specification of window aggregate functions, the elements you add to the window specification basically let you restrict the set of rows the function will apply to (remember, the starting point is all rows from the underlying query's result set). The three parts that you can indicate for window aggregate functions in SQL Server 2012 are <window partition clause>, <window order clause>, and <window frame clause>.

WINDOW PARTITION CLAUSE

The window partition clause was already available to window aggregate functions in SQL Server 2005. It restricts the window to only those rows that have the same values in the partitioning columns as in the current row. If SUM(qty) OVER() means "the sum of all quantities," SUM(qty) OVER(PARTITION BY empid) means "the sum of all quantities from the rows that have the same employee ID as in the current row." In simpler terms: "the sum of all quantities for the current employee." And as I mentioned, you can mix detail elements and window functions. For example, the expression 100.0 * qty / SUM(qty) OVER() gives you the percent of the current quantity out of the grand total. The expression 100.0 * qty / SUM(qty) OVER(PARTITION BY empid) gives you the percent of the current quantity out of the employee total.

WINDOW ORDERING AND FRAMING

The window order clause is there to give meaning to framing options. Framing means that you define ordering within the window partition, and then based on that ordering you place two boundary points that frame the subset of rows that you want to restrict. For example, our query that computes running totals partitions the window by empid (PARTITION BY empid), orders the window by ordermonth (ORDER BY ordermonth), and frames it with the option ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This means that based on order month ordering, the applicable frame of rows is the subset of rows within the current employee's partition with no low boundary point and until the current row. In our case, this means all months from the beginning of the employee's activity and until the current month. The ROWS option isn't the only supported frame unit; there's another option called RANGE that I'll discuss later.

As for defining the window frame extent (the frame bounds), you can refer to the start or end of the window partition with the options UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING, respectively. You can also indicate a bound as an offset from the current row: <n> PRECEDING means <n> rows before the current row, and <n> FOLLOWING means <n> rows after the current row. You can also refer to the current row simply as CURRENT ROW.

You already saw the form ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This form will likely be the one you use most often for practical reasons. Conveniently, there's an abbreviated form that means the same thing: ROWS UNBOUNDED PRECEDING. Note, though, that if you indicate a window order clause but not a window frame clause, the default isn't exactly the same-but I'll get to it in the next section when describing the RANGE option.

Here's a query showing bounds expressed as offset from the current row:

SELECT empid, ordermonth,

MAX(qty) OVER(PARTITION BY empid

ORDER BY ordermonth

ROWS BETWEEN 1 PRECEDING

AND 1 PRECEDING) AS prvqty,

qty AS curqty,

MAX(qty) OVER(PARTITION BY empid

ORDER BY ordermonth

ROWS BETWEEN 1 FOLLOWING

AND 1 FOLLOWING) AS nxtqty,

AVG(qty) OVER(PARTITION BY empid

ORDER BY ordermonth

ROWS BETWEEN 1 PRECEDING

AND 1 FOLLOWING) AS avg3months

FROM Sales.EmpOrders; 

The first function operates on a frame with just one row-the previous row. The second function operates on a frame with also just one row-the next row. The third function operates on a frame with three rows-two before the current row until the current row. Figure 2 shows the output of this query.

Note that when only one row is involved in the frame, you should use other new functions, called LAG and LEAD-which I'll discuss next month. The offset window frame extent options are mainly interesting when there can be more than one row in the frame.

ROWS VS. RANGE

The ROWS option relies on a very basic concept of an offset in terms of number of rows with respect to the current row. But considering our sample data, you realize that there's no assurance that an employee will handle orders every month. In other words, there could be gaps between months of activity for an employee. Suppose that you want to compute the average quantity for the past 3 months of activity. Using the form ROWS BETWEEN 2 PRECEDING AND CURRENT ROW will give you what you want only if there's an assurance for no gaps between months of activity. But in our case there can be gaps, so the frame will include three rows that could represent a period of more than 3 months. That's what the RANGE option is there to resolve by defining offsets from the current row's value, as opposed to offset in terms of number of rows.

Note, though, that the current implementation of the RANGE option in SQL Server 2012 is very basic. To achieve filtering options such as "the past 3 months" you need window frame extent options that weren't implemented in SQL Server 2012, as well as support for a temporal interval data type that isn't available in SQL Server. But just to give you an idea, according to standard SQL, you'd achieve the task as follows (don't run this code):

SELECT empid, ordermonth, qty,

SUM(qty) OVER(PARTITION BY empid

ORDER BY ordermonth

RANGE BETWEEN INTERVAL '2' MONTH PRECEDING

AND CURRENT ROW) AS sum3month

FROM Sales.EmpOrders; 

The RANGE frame extent options that SQL Server 2012 does support are only with UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING, and CURRENT ROW as delimiters. There's a subtle, yet very important difference between ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW and RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW when the ordering isn't unique (when there's a possibility for ties). The former truly uses the current row as the upper bound; namely, it doesn't include ties, whereas the latter does.

What's even more important to note is that if you indicate a window order clause but not a window frame clause, the default frame will be RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW and not ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The reason it's important to emphasize this fact is that even when the ordering is unique and the two are logically equivalent, SQL Server's optimizer might not realize this, and then the form with RANGE (the default) can end up being more expensive to compute. As I mentioned, I'll discuss optimization of window functions in a future article, but for now you can already make it a best practice: Stick to the ROWS form whenever possible and try to avoid indicating ordering without framing.

MORE TO COME

This article is the first in a series discussing window functions support in SQL Server 2012. I described the windowing concept, the advantages of window functions compared with more traditional T-SQL constructs, and the elements in the specification of window aggregate functions. Next month I'll cover other window functions that SQL Server 2012 supports.

1


posted by LifeisSimple
prev 1 next