Author: Taras Denysenko, Developer
SugarCRM performance optimization often comes down to subtleties, as one little, seemingly innocuous change may sometimes lead to a dramatic increase or just as dramatic a dip in system productivity and responsiveness. This doubly applies to cases when user experience and overall efficiency of a complex software system depend on seamless interaction between its components, as is the case with SugarCRM integration with MS SQL Server (or any other RDBMS it may rely on).
An easy-to-miss detail in the implementation of either of those components might mean the difference between the user’s work firing on all cylinders and the user sitting around, waiting for the annoying loading indicator to finally disappear. This article will focus on one such detail, namely a combination of collation and string literal type, which may be responsible for an increase of several orders of magnitude in MS SQL Server query execution time.
MS SQL data types for storing strings
Here we have the usual suspects – the ubiquitous CHAR, VARCHAR and TEXT data types, used for storing fixed-length, variable-length and large variable-length strings respectively. Yet there are also NCHAR, NVARCHAR and NTEXT. The reason these N-prefixed types are needed is that, unlike some other RDBMS, MS SQL Server does not allow specifying an arbitrary encoding for strings in regular CHAR, VARCHAR and TEXT columns. Instead, it always uses a legacy fixed-size Windows code page encoding, the code page being inferred from the field’s collation. (By the way, Windows 95 just called, it wants its encoding back.)
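The distinction shows up directly in the DDL; here is a minimal sketch (the table and column names are purely illustrative):

```sql
CREATE TABLE demo_strings (
    code  VARCHAR(10),   -- code page encoded: 1 byte per character (up to 2 on DBCS pages)
    title NVARCHAR(100)  -- Unicode: 2 bytes per character (4 for surrogate pairs)
);
```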
In most cases, each character in fields of this kind is encoded using 1 byte. This essentially means that, beyond the 128 core ASCII characters, only another 128 code-page-specific characters can be represented. The exceptions are simplified and traditional Chinese, Japanese and Korean, whose code pages use double-byte character sets (DBCS), where a character is encoded by 1 or 2 bytes.
In practice, if all the values in a column are guaranteed to contain only digits, Latin letters and other ASCII characters, this limitation is of no consequence. With more variety, however, come problems. Let’s say a certain column uses a collation based on code page 1252 (Western European), which provides all of those accented letters that Romance languages are so fond of.
As long as all of the values are in such a language (or stick to plain ASCII), everything is perfect, but should a value arrive in an entirely unrelated language (e.g. Ukrainian), it becomes completely illegible, as there are no Cyrillic characters on code page 1252. In this case every Cyrillic letter gets replaced by a placeholder character, such as ‘?’ (and let’s be thankful for that, because in older days it might have been replaced by an unrelated CP-1252 character that just happened to share its code).
If, on the other hand, the code page for the column is 1251 (Cyrillic), Ukrainian letters will do okay, but a value such as “Gruyère de Comté”, the name of a French cheese flavor, will lose all of its accents along with its pretentious air of sophistication and become a pedestrian “Gruyere de Comte”, ugh. (Note how MS SQL Server is clever enough to just demote ‘é’ and ‘è’ to the ASCII character ‘e’ in this case, instead of replacing them with question marks or Cyrillic letters.)
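This data loss is easy to reproduce with an explicit CAST, since the collation of the source expression determines the target code page (the collations below are standard ones shipped with SQL Server):

```sql
-- CP-1251 (Cyrillic) has no accented Latin letters: best-fit mapping strips the accents.
SELECT CAST(N'Gruyère de Comté' COLLATE Cyrillic_General_CI_AS AS VARCHAR(50));
-- → 'Gruyere de Comte'

-- CP-1252 (Western European) has no Cyrillic letters at all: they become '?'.
SELECT CAST(N'Привіт' COLLATE Latin1_General_CI_AS AS VARCHAR(50));
-- → '??????'
```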
So, as you can see, regular string types in MS SQL Server effectively lock you into using English plus a select few languages that share the same alphabet for all of the values in a column. Depending on the concrete scenario, this may be either no big deal or a total disaster.
NCHAR, NVARCHAR and NTEXT, collectively referred to as Unicode types, meanwhile, allow storing Unicode strings, thus permitting one column to contain values with characters from virtually any of the world’s alphabets and much more (chess piece symbols, card suits etc.). Sweet! Why don’t we forget about regular types altogether and use Unicode ones all the time? In fact, some people do just that and it works reasonably well for small to medium-sized databases. When it comes to databases with a huge number of records, however, this approach is less than ideal. Why?
The problem with MS SQL Server Unicode and non-Unicode string column types
The problem with Unicode types is that MS SQL Server does not use the de facto standard UTF‑8 encoding for Unicode strings – a variable-width encoding in which character representations take up between 1 and 4 bytes, most alphabets fall into the 2-byte category, and ASCII characters are encoded using only 1 byte. Instead, MS SQL Server generally relies on the obsolete fixed-width encoding UCS-2, where every character, including ASCII ones, is represented by 2 bytes. (By the way, Windows 95 called… Oh wait; I’ve used this one already! Damn.) Consequently, a string stored in a Unicode column will generally take up twice as much space as the same string in a regular column, regardless of its content.
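The doubled footprint is easy to see with the built-in DATALENGTH function, which returns size in bytes rather than characters:

```sql
-- The same ASCII-only string in both literal flavors:
SELECT DATALENGTH('Balzac')  AS varchar_bytes,   -- 6:  1 byte per character
       DATALENGTH(N'Balzac') AS nvarchar_bytes;  -- 12: 2 bytes per character
```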
Let me give you the reasons for my abuse of the italicized “generally” in the previous paragraph. Firstly, as you may recall, some Asian languages use DBCS code pages for non-Unicode columns, and therefore take up to 2 bytes per character either way. Secondly, as it turns out, not everything can be encoded in 2 bytes. That’s why collations of version 90 and newer support, for sorting purposes, so-called surrogate pairs – duos of 16-bit codes which together represent one character encoded by 4 bytes. The documentation refers to such characters as “supplementary characters”. Built-in functions such as LEN and REPLACE were broken with respect to supplementary characters up until MS SQL Server 2012, because they treated surrogate pairs as 2 distinct (and possibly invalid) characters.
As of MS SQL Server 2012, I’m happy to say that they are… still broken, except when used on strings in the special collations with the _SC flag, which were introduced in that version. By the way, UCS-2 with surrogate pairs is, effectively, UTF-16 – a variable-width encoding where each character is encoded by either 16 or 32 bits. Therefore, we can say that Unicode strings under collations with the _SC flag are encoded in UTF-16.
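The difference the _SC flag makes can be demonstrated with LEN on a character outside the Basic Multiplane – the emoji below is stored as a surrogate pair:

```sql
-- Without _SC, the surrogate pair is counted as two characters:
SELECT LEN(N'😀' COLLATE Latin1_General_100_CI_AS);    -- 2

-- With an _SC collation, it is correctly counted as one:
SELECT LEN(N'😀' COLLATE Latin1_General_100_CI_AS_SC); -- 1
```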
The assertion that disk space is “cheap” is commonly used as an argument for storing strings exclusively in Unicode types, despite the overhead they incur for ASCII-only strings compared to regular types. This is arguably true in the sense that disk space will be the least of one’s concerns should one go down this route. What’s far more important, particularly in the case of a large database, is the performance hit caused by increased disk read and data transmission times. Moreover, because NCHAR/NVARCHAR values are larger than their regular counterparts, the in-memory cache can hold fewer records, which increases the frequency of disk reads – and those are much slower than RAM reads.
That’s why, for large-scale databases, a balanced approach is preferable: use non-Unicode types with code page encoding for columns whose values are guaranteed to contain only characters from one code page, and Unicode types wherever there is no such guarantee.
SugarCRM database issues
SugarCRM builds the database schema itself, thereby lifting this burden from the shoulders of CRM solution providers and the engineering teams tasked with SugarCRM integration. This means, however, that design decisions regarding database structure, including the choice between Unicode and non-Unicode string column types, are also largely not ours to make.
The policy SugarCRM uses to determine the string column types in MS SQL is as follows:
- VARCHAR for IDs,
- NVARCHAR for other strings.
As you can see, SugarCRM’s designers understandably decided to approach this issue in the most general way possible. The solution guarantees correctness for stored values in all cases, as the only strings stored in non-Unicode columns are IDs, which in SugarCRM are, as you may recall, GUIDs in hexadecimal notation, consisting entirely of ASCII characters.
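A minimal sketch of what this policy produces (illustrative only – the actual SugarCRM DDL has far more columns and different lengths):

```sql
CREATE TABLE accounts (
    id   VARCHAR(36) NOT NULL PRIMARY KEY, -- hexadecimal GUID: ASCII-only, so non-Unicode is safe
    name NVARCHAR(150) NULL                -- free-form text: Unicode, just in case
);
```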
What the solution does not guarantee is optimal performance and memory utilization. Why is that? Because even if the entire text stored in your SugarCRM database is in English, which could get away with an 8-bit encoding, you are still left with 16 bits per character and a lot of memory wasted “just in case”. In all fairness, though, this decision is justified.
What’s a bit less justified is the way SugarCRM formulates SELECT queries to MS SQL Server. Let me explain what I mean.
Here’s a typical predicate from a query sent by SugarCRM to MS SQL Server:
WHERE id = N'1a793a5d-516e-432d-b67d-e812c36adaef'
Notice the conspicuous N prefix before the literal. As you may know, string literals in Transact-SQL come in two flavours: character string constants, a.k.a. non-Unicode string literals, and (wait for it…) Unicode string literals.
A non-Unicode string literal:
'Honoré de Balzac'
A Unicode string literal:
N'Honoré de Balzac'
As you may have deduced, quoted literals prefixed with N are treated as Unicode strings by MS SQL Server, while the ones undeserving of such a privilege are treated as the common code-page-encoded rabble they are. SugarCRM, despite using non-Unicode types for IDs, does not distinguish between different string column types. Thus, it uses Unicode literals exclusively in its queries, including those where such literals are checked against ID columns. What could possibly go wrong? Quite a bit, actually.
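The danger lies in data type precedence: when a VARCHAR column meets an NVARCHAR literal, it is the column that gets implicitly converted. A sketch, using a hypothetical accounts table with a VARCHAR(36) id column:

```sql
-- The N'...' literal is NVARCHAR, so the VARCHAR id column is implicitly
-- converted to NVARCHAR before the comparison. Under a SQL collation this
-- typically turns an index seek into a full index scan.
SELECT name
FROM accounts
WHERE id = N'1a793a5d-516e-432d-b67d-e812c36adaef';

-- Matching the literal's type to the column's avoids the conversion
-- and lets the optimizer seek the index on id.
SELECT name
FROM accounts
WHERE id = '1a793a5d-516e-432d-b67d-e812c36adaef';
```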
A brief glance at MS SQL Server collations
Before we get to the crux of the matter, let’s briefly touch upon collations in MS SQL Server, because they are very much relevant to the problem this article is dedicated to, and it would help to ensure that we are on the same page. A single look at the dropdown menu for collation selection in MS SQL Server Management Studio can paralyze an unprepared user with its sheer variety: different locales, case sensitive and case insensitive, accent sensitive and accent insensitive. The only distinction that matters for our purposes, though, is the one between SQL collations and Windows collations.
SQL collations use a different, simpler set of rules for comparing non-Unicode strings than the one used for Unicode, whereas Windows collations, introduced as “recently” (tee-hee…) as MS SQL Server 2000, use the same set of rules regardless of string type. It’s easy to determine by name which of these two clubs a given collation belongs to, since SQL collations flaunt their SQL_ prefix in the face of their Windows counterparts, which have no special prefix to speak of.
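You can list all available collations, and tell the two groups apart by prefix, with the built-in fn_helpcollations function:

```sql
-- SQL collations: the legacy, SQL_-prefixed club.
SELECT name, description
FROM sys.fn_helpcollations()
WHERE name LIKE 'SQL\_%' ESCAPE '\';

-- Windows collations: everything else.
SELECT name, description
FROM sys.fn_helpcollations()
WHERE name NOT LIKE 'SQL\_%' ESCAPE '\';
```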
The joke’s on them, though, because Microsoft stated that Windows collations are to be preferred for new applications, while SQL collations are obsolete and kept around only for backwards compatibility reasons and “specific performance issues”, where their reliance on simpler rules for non-Unicode strings might give them an edge. In practice, though, our tests did not show any advantage of SQL collations in the most common scenarios of interaction with SugarCRM, and Microsoft themselves admit that “the design of most applications does not lead to a situation where the performance difference is significant.”
It’s also helpful to remember that collations can be set at the server, database, column and expression levels. If no collation is specified when a database is created, it defaults to the server collation. Likewise, a column created with no collation specified defaults to its database’s collation.
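All four levels can be inspected or set with ordinary T-SQL (the table and column names below are just for illustration):

```sql
-- Server-level collation:
SELECT SERVERPROPERTY('Collation');

-- Database-level collation:
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation');

-- Column-level collation, overriding the database default:
CREATE TABLE collation_demo (
    val VARCHAR(50) COLLATE Cyrillic_General_CI_AS
);

-- Expression-level collation, overriding the column's for one operation:
SELECT val
FROM collation_demo
ORDER BY val COLLATE Latin1_General_CS_AS;
```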
This time we’ve covered the necessary basics of collation types and the problems you may come across while working with them. In the second part of the SugarCRM tutorial, we’re going to focus on the practical side of the matter and present the results of our experiment with SQL Server collations.
If you have any further questions on the matter, you are welcome to leave your contact details below. Our experts will be glad to help you sort out your issue.