Introduction

Every database designer faces a decision regarding a selection of a database key type. The common practice, textbooks, and intuition suggest that numeric keys are better than character ones. And the strongest argument for numeric keys is performance - it is generally perceived that numeric keys provide for faster comparison operations inside the database thus improving search and access time.

For us humans it is immediately obvious which number is greater 2 or 3, and we can even compare 2,346,678 to 2,345,678. Trying to compare strings “Y9JyXi0Tk4dTo” and “H5ZYZCYQJxC” may actually be hazardous to your health. But since we have invented so many wonderful devices to help us with tasks at which we are no good, this alone can be a reason to question our intuition as far as computer’s inner working is concerned.

Besides enjoying the questioning of common wisdom, there are other pretty good reasons why character keys might be considered instead of traditional numeric ones:

  • Most useful intelligent keys are character, not numeric.
  • There is no built-in limit on a number of possible character keys.
  • Denormalization decisions can be made easier, number of tables reduced, overall design simplified, performance improved.

Generally speaking, the keys can be surrogate and intelligent, simple and compound, auto-increment and random, etc. An ability to make an informed choice regarding the physical key type opens up more possibilities for a database designer as far as logical key type is concerned.

In this article, I intend to compare the performance characteristics of string and numeric keys of various sizes in Oracle and SQL Server and then discuss the applicability of the results.

Performance Comparison

While realizing what kind of a minefield the database performance research is, I conducted a series of direct, but non-scientific tests in order to get a feel of performance characteristics of string and numeric keys in a couple of the typical database usage scenarios.

The following data table was used:

create table doc
(
        id KEY_TYPE,
        body varchar(1024)
)
create index docind on doc(id)

where KEY_TYPE was varied for the comparison between INT, NUMERIC(10), CHAR(5), CHAR(10), CHAR(20), and CHAR(50). All test programs were written in Java. Details of the database set up are listed in the Table 1.

Parameter

Oracle

SQL Server

Box Dell PowerEdge 2450, Pentium III, 2×800 MHz, 512 MB RAM, Windows NT 4 SP5
Version Oracle8i Enterprise Edition Release 8.1.6.0.0 SQL Server Standard v. 7.00.623
JDBC driver Oracle v. 8.1.6 i-net SPRINTA v. 3.05

Table 1 Database Setup

In order to avoid any vendor bias the results were normalized to the NUMERIC(10) key type, so the results below provide the comparison between data types only, not between the database products.

Four data sets were prepared. Each of them contained 900,000 records where each record had one randomly generated key (as appropriate for the key type, see above) and a random length (0..1024) body that consisted of alphanumeric garbage. For each data set and for each database (Oracle and SQL Server) the following operations were performed:

  1. The data set was loaded into the database with the bulk load utility.
  2. A Java test program inserted 3 sets of 50,000 randomly generated records into the table.
  3. A set of about 26,000 keys was extracted from the database at random into a text file.
  4. Another Java program read the file with the keys and performed 3 sets of retrieval operations for each key array.

The timing statistics were recorded and averaged. The results are presented on Figure 1. Please note, that Oracle does not have a 4 byte INT type.

image003.gif

Figure 1

In order to estimate the database’s “internal” sensitivity to the key type, the following query was timed three times for each data set:

select distinct id from doc group by id having count(id)>1

which had to compare keys to each another. The results are on Figure 2.

image0011.gif

Figure 2

As can be seen on Figure 1, for simple insert and retrieve operations the key type and size does not matter until keys get relatively big, i.e., there is a noticeable increase around 50 bytes. For this type of a problem, the database is busy with marshalling data and transmitting them over the network — the changes in keys do not cause perceivable change in insert and retrieve speed.

The situation changes for operations performed entirely inside the database engine (Figure 2. For the SQL Server the native integer type is faster than a numeric one, but small character keys (up to 20 bytes) are comparably fast. For Oracle, which does not have a native integer type, there is no performance difference between numeric and character types. Both databases slow down when the character key size grows over 50 bytes.

The most important practical implication of the results is that there is no performance penalty if the key is smaller than 50 bytes (at least, for Oracle and SQL Server deployed in specially constructed tests and on a computer described above, other possible disclaimers do apply).

A Case for Random Surrogate Keys

A very interesting case can be made for surrogate keys, especially, a random variety. Ad a rule, a random surrogate key is stored as a number [3]. While one of the important requirements to such keys is their size (smaller is better), the character key type might be a better choice.

The type choices for numeric keys are limited to integer and numeric (where supported). The integer type of SQL Server stores up to 2·109 numbers in 4 bytes of storage. For the random keys purposes it is not sufficient — conflicts become too frequent when the number of records reaches 1,000,000 or so. Generally, the empirical rule of thumb is that the key space should be 103..104 times larger than the total number of records; a bigger key will eat up storage space, a smaller one will cause frequent conflicts.

A numeric type is stored in a variable length format (see [1]) that maxes out at 1038 numbers while taking about 20 bytes of storage for Oracle and 17 bytes for SQL Server.

Let’s compare this to a character type. For a character key there is no limit on a key space size (i.e., total number of available keys) whatsoever. It is especially important if a conflict resolution is expensive and a “true” key uniqueness is desired on a global scope. If the random key is alpha-numeric (i.e., contains any characters from the set [0..9, a..z, A..Z]), the total number of keys that can be stored in a 20-character string is 6220≈1036 or, roughly, a number comparable to maximum value that can be stored in the numeric type. Generally, the total number of keys is ak, where ‘a’ is the alphabet size (i.e., the number of possible 1 byte characters that can be present in the key) and ‘k’ is the key size. This relationship is plotted on Figure 3, which shows that the key space grows very rapidly with the increase of the key size and the number of characters in the alphabet.

image004.gif

Figure 3. Key space

It has to be noted that character keys can be of two kinds. One kind is the one that the human being does not ever have to see. This allows to increase the alphabet size with all the special symbols or even full 255 characters that fit into a byte (as long as the database can run a comparison operation on them, i.e., they can be used in a key column of the database of choice). The other kind is a sort of ID number that has to be printed on forms, reports, or read over the telephone. In this case, the alphabet reduction is in order to make keys easier to handle: it is convenient to eliminate all lower case letters and ambiguous looking characters. For example, out of ‘O’ (as in “oven”), ‘0′ (zero), ‘D’ (as in “dog”), and ‘Q’ (as in queen) only 0 (zero) should be retained to avoid confusion.

Conclusions

No performance penalty usually associated with character keys was discovered for SQL Server and Oracle, at least, for simple database operations when the key size is smaller than 50 bytes. This allows to achieve a huge key space for random surrogate keys and to implement flexible denormalization strategies.

Of course, this conclusion should not be generalized — this particular study is not a comprehensive TPC benchmark test, it only covers two simplest database usage scenarios. I can speculate why these databases behave the way they do, but since I do not posses any direct knowledge of their respective implementation details, I have chosen not to. I would like to encourage readers to perform similar tests on databases available to them, especially on open source ones and report the results.

References

  1. Vieira, Robert. Professional SQL Server 7.0 Programming. Wrox Press Ltd., 1999, 1138 p.
  2. Oracle 8 Concepts, Volume 1. Release 8.0, Oracle, 1997.
  3. Celko, Joe. Joe Celko’s data and databases: concepts in practice. Morgan Kaufmann Publishers, 1999, 382 p.