How to Optimize SugarCRM by Choosing MS SQL Server Collations [Part 2]

How to Optimize SugarCRM by Choosing MS SQL Server Collations [Part 2] Author:Taras Denysenko, Developer

Hi again! Here is the second part of SugarCRM tutorial on the topic “How to Optimize SugarCRM by Choosing MS SQL Server Collations” (to read PART 1, click here). In the first part of the article we provided a brief review of MS SQL Server collations and today we’re going to dwell on the practical aspects of the matter and present the results of our experiment with SQL Server collations. Let’s roll!

Unicode and non-Unicode strings under an SQL Server collation

With a little refresher on collations out of the way, the time has come for the most important sentence of this article. Cue the drums! Under an SQL Server collation a comparison between a non-Unicode and a Unicode string entails a conversion of the former to Unicode. It means that if a Unicode literal is checked against a non-Unicode column in a query, every value in said column has to be converted! This conversion cannot be performed the other way around, since a demotion from Unicode is considered to be lossy, regardless of the actual content of the string.

If this doesn’t sound like much, allow me to illustrate the implications with an example. Imagine a table, called jedi_knights, described by the following CREATE statement:

A run-of-the-mill CREATE TABLE statement, you probably see every day:

CREATE TABLE dbo.jedi_knights
(
	id char(36) NOT NULL,
	first_name nvarchar(50) NOT NULL,
	last_name nvarchar(50) NOT NULL,
	corrupted_by_the_dark_side bit NULL

	CONSTRAINT PK_jedi_knights PRIMARY KEY (id ASC)
)

CREATE TABLE dbo.jedi_knights

(

id char(36) NOT NULL,

first_name nvarchar(50) NOT NULL,

last_name nvarchar(50) NOT NULL,

corrupted_by_the_dark_side bit NULL

CONSTRAINT PK_jedi_knights PRIMARY KEY (id ASC)

)

The id column is given the CHAR(36) type, since we identify the Jedi by GUIDs in hexadecimal representation, like SugarCRM would do. Now, let’s run a couple of SELECT queries against this table and look at their execution plans.

This means “the chosen one” in SQL:

SELECT *
FROM jedi_knights
WHERE id = '27af289a-333e-4b14-b090-829eeebd8e52'

SELECT *

FROM jedi_knights

WHERE id = '27af289a-333e-4b14-b090-829eeebd8e52'

Execution plan:

Query 1: Query cost (relative to the batch): 100%
SELECT * FROM [jedi_knights] WHERE [id] = @1

1 2	Query 1: Query cost (relative to the batch): 100% SELECT * FROM [jedi_knights] WHERE [id] = @1

No surprises there. All we needed was a fast index lookup (Index Seek) by primary key and that’s exactly what we’ve got.

For a complete picture, let’s take a look at IO stats, enabled by issuing SET STATISTICS IO ON before the query.

IO stats for query E1 (10000 records):

Table 'jedi_knights'. Scan count 0, logical reads 2, physical reads 2, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

1	Table 'jedi_knights'. Scan count 0, logical reads 2, physical reads 2, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

In-memory buffer was cleared before collecting these. As we can see, MS SQL Server had to read 2 memory pages in total (2 logical reads), both of which had to be retrieved from the disk (2 physical reads), thanks to our unhelpful clearing of the cache. Of course, this number doesn’t tell us much by itself, without a point of comparison, so let’s proceed with our experiment and see how it stacks up against future results.

For the next act, we’ll need to ensure that our id column has a Windows collation set, let’s say Latin1_General_CI_AS.

A query with a Unicode literal against a non-Unicode column:

SELECT *
FROM jedi_knights
WHERE id = N'27af289a-333e-4b14-b090-829eeebd8e52'

SELECT *

FROM jedi_knights

WHERE id = N'27af289a-333e-4b14-b090-829eeebd8e52'

Execution plan under a Windows collation:

Query 1: Query cost (relative to the batch): 100%

SELECT * FROM [jedi_knights] WHERE [id] = @1

1	SELECT * FROM [jedi_knights] WHERE [id] = @1

Okay, that’s a bit more interesting. We still have an index seek, but this time things are complicated by MS SQL Server’s creating an empty row (Constant Scan), populating it with our literal (Compute Scalar) and performing a join. Our tests show that the performance effects of this complication are nothing to write home about: here’s a progression of response times for different number of records under Windows collation – it looks logarithmic to me.

# of records	10000	100000	1000000
non-Unicode literal (E1)	24.3	20.3	19.2
Unicode literal (F1)	27.6	29.8	31.7

The response times are in milliseconds and represent an average value over 10 tests. MS SQL Server’s in-memory buffer was cleared after every test.

What do we have in terms of IO this time?

IO stats for query F1 (10000 records, Windows collation):

Table 'jedi_knights'. Scan count 1, logical reads 2, physical reads 2, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

1	Table 'jedi_knights'. Scan count 1, logical reads 2, physical reads 2, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Can you spot the difference? Scan count is now 1 instead of 0, while the number of reads has remained the same.

And now, let’s change column’s collation to an SQL one, such as SQL_Latin1_General_CP1_CI_AS. Now, when we try the same query one more time, the execution plan will look something like this.

Execution plan under an SQL collation:

Query 1: Query cost (relative to the batch): 100%
SELECT * FROM [jedi_knights] WHERE [id] = @1

1 2	Query 1: Query cost (relative to the batch): 100% SELECT * FROM [jedi_knights] WHERE [id] = @1

Ta-daa! Even though we requested a single record by its primary key, MS SQL Server had to scan all of the records from our table, promote them individually to Unicode and compare with our literal. Needless to say, any index we might have over a column, against which the literal is checked, goes out the window. Here is the progression of response times under SQL collation, which paints a very different (and much gloomier) picture.

# of records	10000	100000	1000000
non-Unicode literal (E1)	19.4	25.9	22.0
Unicode literal (F1)	31.6	63.9	570.5

In real life, with a wide table and a huge number of records, this translates to a difference from several seconds to several minutes to execute a query.

And now, back to our good friend SET STATISTICS IO ON.

IO stats for query F1 (10000 records, SQL collation):

Table 'jedi_knights'. Scan count 1, logical reads 89, physical reads 2, read-ahead reads 87, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

1	Table 'jedi_knights'. Scan count 1, logical reads 89, physical reads 2, read-ahead reads 87, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

89 memory pages were read in total. At first glance it might seem that only 2 of them were read from the disk, which would be strange, given that the cache has been cleared before running this query, but make no mistake: all of them were read from disk.

87 read-ahead reads is just MS SQL Server’s way of saying “Hey! It looks like you’re scanning the table for some reason, I guess I’ll preload some memory pages further down the table from disk into the cache, in case you need them.”

When calculating IO stats, MS SQL Server discounts those read-ahead reads, which turned out to be useful, from the number of physical reads, thus actually reporting the meager 2.

The result of the experiment with MS SQL Server

The main takeaway from this little experiment is the following:

String literal types in a T-SQL query should match the type of columns they will be checked against. In particular, one has to ensure that a Unicode literal is not compared to a non-Unicode column in a WHERE clause, especially under an SQL collation.
When there are no backwards compatibility concerns, a Windows collation is the way to go, while SQL collations are to be steered clear of.
We are left to hope that our table only contains data on the Jedi who survived the contrived plot twist in Episode III, otherwise we might be stuck here for a while, waiting for this query to be executed.

With all of this in mind, let’s take another look at the way SugarCRM uses string literals in its queries.

A typical query sent by SugarCRM to MS SQL Server:

SELECT *
FROM sugarcrm.dbo.accounts
WHERE id = N'1a793a5d-516e-432d-b67d-e812c36adaef'

SELECT *

FROM sugarcrm.dbo.accounts

WHERE id = N'1a793a5d-516e-432d-b67d-e812c36adaef'

Yes, unfortunately, current implementation of SugarCRM database manager leaves it vulnerable to the problem, we’ve just discussed, as one of our customers had the displeasure of discovering. The solution we chose to implement, as well as other options we considered will be described in the next article.

SugarCRM performance optimization

Fortunately, the negative performance effects of this issue may be alleviated by ensuring at installation time that SugarCRM database uses one of Windows collations since, as we discussed, an increase in query execution time due to string type mismatch is much less significant under a Windows collation. It’s not by accident that “at installation time” is in bold type. As long as every new column, created by SugarCRM, is assigned database collation by default, changing database collation later down the line will not affect already existing columns.

If you come across some problems with SugarCRM performance tuning, you are welcome to discuss it with our experts. Leave your details here.

How to Optimize SugarCRM by Choosing MS SQL Server Collations [Part 2]

Unicode and non-Unicode strings under an SQL Server collation

What do we have in terms of IO this time?

The result of the experiment with MS SQL Server

SugarCRM performance optimization

Other instructions on SugarCRM performance tuning:

Leave a Reply Cancel reply