On July 10th 2013, I gave a presentation with the Professional Association for SQL Server (PASS) DBA Fundamentals Virtual Chapter. The topic was Indexing Fundamentals and there were some questions that we did not have time to answer. I have combined similar questions and given one answer to cover the specific topic.
Indexing is an integral part of SQL Performance and Architecture. In this presentation we will cover the basics of Clustered and Non-Clustered Index structure and function. Additionally, we will discuss the B-Tree structure, index density (fill factor), page splits, and performance considerations. Finally, there will be a brief discussion of index maintenance.
Now to the Questions!
Q1. Does creating a primary index create a clustered index?
A1. Yes and No. We covered this briefly in the demo. If you already have a clustered index on your table, then you cannot create a second one, thus creating a PK will not create another clustered Index. If you specify when creating your PK that you want it to be non-Clustered, then it will be non-Clustered. By default, if you create a primary key on a heap, then it will be a clustered index if you do not make any changes (i.e. you just create the PK without specifying it as being a Non-Clustered Index).
Q2. Is the clustered key automatically entered when creating the NC index, with existing Clustered index?
Q3. So clustered indexes are stored in the DB right? So how does that affect our disk selection, regarding random verses sequential reads?
A3. All indexes are stored in the database. Clustered indexes include the actual data in the leaf level of the index, Non-Clustered indexes have pointers in the leaf level to the data pages. On a Heap, Non-Clustered indexes have Row Identifiers in the leaf level as a pointer. Non-Clustered Index on a Clustered Index has the Clustered Key in the leaf level as a pointer. As for the second question, this was briefly discussed in the presentation.
Q4. Could a clustered & non clustered index be a covering index meaning including more than one column?
A4. A covering index basically means an index that fulfills the needs of the query. If you have a clustered Index on Column1 and your query does something like SELECT Column1 From Table1 Where Column1=’Index’ then that could be considered a covering index since the index covers the need of the query. The same would be true if Column1 was in a Non-Clustered Index.
Q5. Where is the data stored when using Include (col Name)?
A5. The included column is stored on the leaf level page of a Non-Clustered Index. They are nonkey columns in the Non-Clustered Index. The next Question/Answer combination will elaborate on how that is important to performance.
Q6a. Please elaborate how using INCLUDE in an NON-CLUSTERED INDEX allows a Covered Query.
Q6b. This may be out of scope, but would it be possible to touch on using included columns vs. composite indexes ( eg. whether one is generally better than the other or if it depends – and why optimizer would pass over one vs. the other).
A6. So now we know what a Covering Index is (not Covering Query). Include Columns are used to satisfy the columns being pulled back in a SELECT statement. Let’s take a quick example of a Non-Clustered Index with three columns in it (Column1,Column2,Column3). Column1 and Column2 are both key columns and Column3 is an included column.
SELECT Column1, Column2, Column3
Where Column1 >10 and Column2 = ‘Red’
The Non-Clustered Index is covering for this query because the index has all the necessary columns to satisfy the query. The Leaf Level page of this index will have Column1, Column2, and Column3 stored in it. However, since Column3 is an included column and not a key column it will not help if you use Column3 in your where clause. Key columns help optimize queries when they are included in the where clause and in the SELECT clause (since the key column data is stored in the leaf page). The included column maybe stored in the leaf page, but since it is not a key column, it does not affect the organization of the index and thus is not useful in the WHERE clause.
Tons more information on included columns: http://msdn.microsoft.com/en-us/library/ms190806(v=sql.90).aspx
Q7. when row data is larger than 8k, what happens to the leaf data? i.e how is it managed across the leaf pages?
A7. When row data is larger than 8K it goes into a page called Row Overflow. You cannot use a column that is too large to fit on one page for your clustered index key:
“The index key of a clustered index cannot contain varchar columns that have existing data in the ROW_OVERFLOW_DATA allocation unit. If a clustered index is created on a varchar column and the existing data is in the IN_ROW_DATA allocation unit, subsequent insert or update actions on the column that would push the data off-row will fail. For more information about allocation units, see Table and Index Organization.”
I know this does not fully answer your question, but honestly that question is very tough to answer.
This article goes into a little more detail about the structure of an index and this quote should give you a better understanding of where off-row data is stored.
“A 24-byte pointer to this off-row data is maintained on the original page.” http://msdn.microsoft.com/en-us/library/ms189051(v=sql.105).aspx
Q8. I’ve heard some DBAs say if you rebuild the clustered index, it will also rebuild the nonclustered indexes – as a shortcut to rebuild all the indexes for a table). Do you know if this is accurate ( I am hesitant to accept that, since the non-clustered indexes store the actual data and a RID, which, in my mind, shouldn’t rely on physical storage of the row )?
A8. It Depends! I would recommend you read this fantastic post by Paul Randal on the matter:
Q9a. Do you run the update stats after you reorganize the index and/or after you rebuild the index?
Q9b. Does simply updating statistics help with index fragmentation at all?
A9. If you REBUILD the index, by default the statistics are automatically updated. That is because the default for the STATISTICS_NORECOMPUTE option is OFF when rebuilding. If you create an index with STATISTICS_NORECOMPUTE option ON, then that will over-ride the default behavior. This is a lengthy topic, and I would suggest looking at the section “When to Update Statistics” in this article:
Updating statistics will not help with index fragmentation but may help improve performance of queries. Indexes, need maintenance to prevent fragmentation.
Q10. Would you recommend padding & fill factors on heavy OLTP type tables/databases?
A10; Not sure what the word heavy means, if it is heavy on inserts then yes I would recommend using some fill factor less than 100%. Don’t just take my answer and fly with it, please read these excellent articles by Kendra Little and Pinal Dave on the topic. They have tons more knowledge and experience than I do in the SQL Server space and have great write ups on this topic.
Q11. are those leaf level pages also 8K?
A11. Data and index pages are 8K in SQL Server.
Q12. Will the whole page be brought to the memory on accessing a data row from a page?
A12. SQL Server pulls pages into memory not individual rows.
Even more about pages and extents in SQL Server to help give a better understanding for questions 11 and 12: http://msdn.microsoft.com/en-us/library/ms190969(v=sql.105).aspx
Q13a. With a clustered index, do we always do a table scan? Since the data is ordered based on the clustered index can’t it go directly to the relevant page?
Q13b. How do you decide what type of index to use?
Q13c. Say I have 2 search criteria that I know of in my application, how would I decide whether to create clustered index or non-clustered index and for which search criteria?
A13. You don’t always do a table scan with a clustered index. The SQL Server query optimizer decides what the least costly way of getting the data back to the user is. If that means it has to do a table scan, then it does a table scan. If you design your indexes and queries properly then you can get great performance by doing seeks instead of scans. Again, this is an “it depends” answer. If your query is going to return 75% of the records on a table, then SQL Server will probably do a scan not a seek and filter out what you do not need. The concept of selectivity is very important in creating indexes. You want to create indexes on data that is highly selective (“unique”) so that when query optimizer searches for data it can eliminate large sets of data and zero in on just the rows it needs.
Finally, you have to understand your data, and you have to understand how it is queried and updated. There is no silver-bullet answer for this question. It takes experience and understanding of your system. That is why in my own humble opinion, DBAs need to be good developers as well. Indexes are extremely useful when you align them with your querying needs.
Q14. I understand an index can speed up data reads, but can an index be used to speed up insert/update operations?
A14. Another question where the answer is “it depends.” Let’s take the UPDATE first. If you have a WHERE clause on an UPDATE statement, the query optimizer needs to search for the data before it can update it and that is a read operation which could definitely benefit from an index. But at the same time updating data may cause indexes to have to be updated! As for Inserts, again it depends. I can comfortably say that indexes are fantastic for reading data and that is what they are designed for.
As opposed to re-inventing the wheel I thought I would lead to you David Durrant’s excellent post on this with SQL Server Central:
Some Additional resources:
Articles that have details about Clustered and Non-Clustered Indexes:
This article talks about SQL Server Optimization and covers many topics regarding indexes: