Tuesday, February 22, 2005

Understanding SQL/MX EXPLAIN

Users of HP NonStop™ SQL/MP are accustomed to the listing that explains the query plans. The listing describes the steps involved and provides details in a textual form.

When moving to SQL/MX, most users are at first surprised, overwhelmed, and sometimes annoyed by the results of the EXPLAIN stored function, the source of the equivalent information in SQL/MX. This paper is intended to take a closer look at SQL/MX EXPLAIN output and to provide an approach to its interpretation.

EXPLAIN Is Simple

All query plans in SQL/MX are managed as trees of operators (such as joins, unions, scans). This becomes visible to users at two places, the EXPLAIN table-valued stored function, and the use of CONTROL QUERY SHAPE commands to influence plans.

While the initial reaction to EXPLAIN is completely understandable, it is really simpler and in many ways easier to read than its SQL/MP counterpart. All the information is returned in normalized form as a single tree, with no exceptions (no subqueries, sorts, initialization steps, etc.). Unlike SQL/MP EXPLAIN, the MX version can be retrieved at any time, for any program, and it is the actual plan, not a view of what the optimizer would choose today. Since the interface is SQL, the returned information can be customized in many ways. We’ll see examples of this below.

Which Tools to Use

Unfortunately, the normalized form of EXPLAIN is also what makes the raw output hard to read. Like HTML text, raw EXPLAIN output is really meant to be the input for some tool that presents the information in a more human-readable format. In most cases that tool is Visual Query Planner (or VQP, for short), but there are a few more. See the NonStop SQL/MX Installation and Management Guide on how to install VQP; this tool is included with SQL/MX. In case an ODBC connection is not available or VQP cannot be installed on the PC, a way to explain a query through MXCI is more desirable. Let’s talk about some such ways:

Here is a simple example query (How many employees do we have in each department?) that we’ll use in the next few sections:

PREPARE XX FROM
SELECT d_name, count(*)
FROM emp, dept
WHERE emp.d_num = dept.d_num
GROUP BY dept.d_num, d_name;

First, using the SET SHOWSHAPE command in MXCI can help to get an idea on the query plan. For our sample query above, MXCI might generate the following output, which you may want to simplify for better readability, like it is done below:

control query shape
esp_exchange (
hybrid_hash_join (
sort_groupby (
split_top_pa (
sort_groupby (
scan(path ‘CAT.SCH.EMPX’)))),
esp_exchange (
partition_access (
scan(path ‘CAT.SCH.DEPT’)))));

Alternatively, we could look at the output of the EXPLAIN function. We could print only selected columns, like with this query:


SELECT seq_num, left_child_seq_num AS l,
right_child_seq_num AS r, operator, tname
FROM TABLE (EXPLAIN (NULL, ’XX’));

An undocumented option of SQL/MX does something similar, but with some more helpful information:


DISPLAY_EXPLAIN OPTIONS ’f’ XX;

This could produce the following result for our query:

LC RC OP OPERATOR OPT DESCRIPTION CARD
11 . 12 root 2.39E3
10 . 11 esp_exchange 1:8 (range) 2.39E3
9 3 10 hybrid_hash_join 2.39E3
8 . 9 map_value_ids 2.40E3
7 . 8 sort_partial_groupby 2.39E3
6 . 7 split_top 8 (range):16 (logph) 2.40E3
5 . 6 partition_access 2.40E3
4 . 5 sort_partial_groupby 2.40E3
. . 4 index_scan fs EMPX (s) 9.99E5
2 . 3 esp_exchange 8 (range):4 (range) 2.50E3
1 . 2 partition_access 2.50E3
. . 1 file_scan fs fr DEPT(s) 2.50E3

Again, please note that the above command is currently not supported by HP, has some limitations, and may change in the future. This command internally uses a complex SQL query to produce the output shown. The DESCRIPTION field contains information that depends on the operator type. For scan operators the table name is shown, while exchange operators show the number of parallel instances for the corresponding fragments.

Finally, if we decide to retrieve the full EXPLAIN information, another unsupported tool called FixExplain is available to format the result in a human-readable form.

Getting an Overview


When we look at the output of any of the commands or tools mentioned above we can see the general shape of the query. We can see that a hash join is used, with EMPX (an index on the employee table) on the left and with the department table on the right. This gives us some general idea about how tables are joined together and which indexes are selected.

Some of the displays above show the estimated cardinality of the intermediate results. Another important parameter to look at is cost. VQP is the best way to examine this information. Two numbers are given for each node in the tree, the cost of the operator itself and the cost of the operator and all of its children. SQL/MX considers overlapping operations, and, therefore, the total cost of an operator may be less than the sum of the children’s cost and the operator’s own cost.

Understanding Parallelism

Now that we know the basics of the query plan, it is time to look at how the optimizer has distributed the work over the CPUs. The first step to answering this question is to find the fragments of the query, determining the distribution of the work among different processes.

Operators, such as joins and scans, typically pass their result rows in memory to their parent operators. This means that they have to execute in the same process. The only way a row in SQL/MX can move from one process (such as DP2 or HP NonStop Server Disk Process, an ESP or Executor Server Process, or the application process) to another is by passing through an EXCHANGE operator. EXCHANGE operators consist of two (top and bottom) parts, one in each of the participating processes, and the rows get passed inside the operator. But, there is more to an EXCHANGE operator. It can sometimes also let multiple processes talk with each other, and can repartition data coming from N sources to M destinations.

SQL/MX has three flavors of EXCHANGE operators. Two flavors of the PARTITION_ACCESS (PA in short) operator communicate with DP2 and an ESP_EXCHANGE operator talks to ESPs and provides repartitioning. The two PA operator flavors can be distinguished by the node above the PA. If it’s a SPLIT_TOP node then we are dealing with a parallel flavor that can talk to multiple partitions at the same time. In all other cases we are dealing with an operator that talks to DP2s one at a time. Note that CONTROL QUERY SHAPE commands combine split top and PA node to a single operator called SPLIT_TOP_PA. Either way, SPLIT_TOP and PA form a unit whenever they appear together.

It is often useful to draw the query tree on a piece of paper and to indicate the process boundaries by drawing bubbles around those parts of the tree that execute in the same process. Those parts are called fragments. Here is how to find them. Each fragment has a root operator or an EXCHANGE operator at its top, and, conversely, each root node and EXCHANGE operator is at the top of a fragment. The bottom of a fragment is determined by a “leaf” node such as a scan operator, or by the start of another fragment. Here is the tree of our earlier query drawn in such a way: We see five fragments, one root fragment, two ESP fragments and two DP2 fragments. As you can see, we type the fragments based on where they execute. The top nodes of the fragments (ROOT, ESP_EXCHANGE, PARTITION_ACCESS) indicate this. These top nodes are shown as hollow circles in the diagram. SQL/MX is inherently parallel, and fragments have everything to do with this parallelism. While there is only one root fragment, representing the application program, ESP and DP2 fragments can have multiple parallel instances, running in multiple ESPs or DP2s. The SQL/MX optimizer has only limited influence on how many parallel DP2 fragments there are, since that is determined by the partitioning of the data. However, it has some freedom on the choice of the number of ESPs. The tree above indicates that the ESP fragment performing the hash join has eight parallel instances while the ESP fragment below it has only four parallel instances.

No comments: