Here is another small content piece on Char v/s Varchar from my Advanced SQL Developer Master Class. You can see all the blog posts in the Developer Series here.
Datatypes is one of the most under rated thing which we ignore when designing tables. In most of the consulting assignments that we do, we have found that when it comes to data type selection for an alphanumeric or character column, designers opt for Varchar or in worst of the cases Varchar(MAX). The reason they have for opting VARCHAR blindly is that CHAR takes more of storage as the empty characters are padded with spaces. This is correct, so when should we use CHAR datatype?
In this blog, I’ll share a scenario where in opting for CHAR datatype makes sense and saves storage too
Let’s first look at a scenario when CHAR is a bad choice.
-- Name is of char datatype CREATE TABLE dbo.Employee_v1 ( EmployeeID INT IDENTITY PRIMARY KEY, eName CHAR(100), ) GO -- Name is of varchar datatype CREATE TABLE dbo.Employee_v2 ( EmployeeID INT IDENTITY PRIMARY KEY, eName VARCHAR(100), ) GO
The above query creates two tables, Employee_v1 and Employee_v2. The Employee_v1 has a name column of CHAR datatype and Employee_v2 table has a name column of VARCHAR datatype. I’ll now populate the tables with 1 million rows.
-- Insert 1 million rows in Employee_v1 INSERT INTO dbo.Employee_v1 ( eName) SELECT 'SomeName' FROM hol.fn_numbers(10000000); -- Insert 1 Million rows into dbo.Employee_v2 INSERT INTO dbo.Employee_v2 ( eName ) SELECT 'SomeName' FROM hol.fn_numbers(10000000);
The function fn_numbers returns number from 1 to n where n is the value passed as argument to the function.
Let’s now compare the space occupied by the two tables.
The output is as expected. The Employee_v2 occupies less storage compared to that of Employee_v1. Let’s look at the number of data pages used to store the data for the two tables
The Employee_v1 used only 30960 pages compared to 140846 pages used by Employee_v1.
VARCHAR is certainly the best choice for this scenario.
Let’s say we now have a requirement to add a new column that will record Personal Identification Number (PAN) of an employee. A PAN is a 10-digit unique number assigned to citizen of India.
-- eName and PAN are Varchar(MAX) CREATE TABLE dbo.Employee_v3 ( EmployeeID INT IDENTITY PRIMARY KEY, eName VARCHAR(MAX), ePan VARCHAR(MAX) ) GO -- eName and ePan are VARCHAR. CREATE TABLE dbo.Employee_v4 ( EmployeeID INT IDENTITY PRIMARY KEY, eName VARCHAR(30), ePan VARCHAR(10) ) GO -- ePan is of CHAR(10) CREATE TABLE dbo.Employee_v5 ( EmployeeID INT IDENTITY PRIMARY KEY, eName VARCHAR(30), ePan CHAR(10) ) GO
I have created three tables, Employee_v3 which has eName and ePan as VARCHAR(MAX), Employee_v4 with eName and ePan as VARCHAR and Employee_v5 with ePan as CHAR.
I’ll now insert 1 million records in each of these three tables.
-- Insert 1 Million rows into dbo.Employee_v3 INSERT INTO dbo.Employee_v3 ( eName, ePan ) SELECT 'SomeName','AAMPO8759B' FROM SQLMaestros.hol.fn_numbers(10000000); -- Insert 1 Million rows into dbo.Employee_v4 INSERT INTO dbo.Employee_v4 ( eName, ePan ) SELECT 'SomeName','AAMPO8759B' FROM SQLMaestros.hol.fn_numbers(10000000); -- Insert 1 Million rows into dbo.Employee_v5 INSERT INTO dbo.Employee_v5 ( eName, ePan ) SELECT 'SomeName','AAMPO8759B' FROM SQLMaestros.hol.fn_numbers(10000000);
Let’s now observer the space occupied by the three tables.
The space occupied by Employee_v3 and Employee_v4 is almost the same, which is 366976 KB or 358 MB, however the space occupied by Employee_v3 is 346328 KB or 338 MB which is 20 MB lesser than the space occupied by Employee_v3 and Employee_v4. Remember, Employee_v5 has ePan column of CHAR datatype. Surprised!!!
Let’s now check the page count for the three tables.
The Employee_v5 table has (45872 – 43291 = 2581) lesser pages than Employee_v3 and Employee_v4.
This difference is because of the fact that the storage size for VARCHAR is calculated as actual length of data + 2 bytes. This 2 byte is to store the variable length details for the VARCHAR datatype.
In our first scenario, this overhead didn’t come into consideration as we were actually storing variable length information. However, In the second scenario, ePan is a 10-digit fixed length string, which means that ePan will always be of 10 digit and therefore CHAR is better choice. There’s no point of adding 2 bytes for storing variable length information when we are storing fixed length string.
The savings are not only on table storage. We’ll even save storage when creating index on ePan column.
-- this will fail as indexes aren't allowed on VARCHAR(MAX) CREATE INDEX ix_pan1 ON dbo.Employee_v3(ePan) GO CREATE INDEX ix_pan2 ON dbo.Employee_v4(ePan) GO CREATE INDEX ix_pan3 ON dbo.Employee_v5(ePan)
The first index query will fail as indexes aren’t allowed on a VARCHAR(MAX) column. An important point to consider when opting for a VARCHAR(MAX) data type.
The above query will create two indexes, ix_pan1 on Employee_v4 and ix_pan2 on Employee_v5 on ePan column.
Let’s check the storage cost for the two indexes.
Observe that the index ix_pan2 on Employee_v5 has 29674-24753=4921 pages and occupies (4921*8)/1024 = 40 MB of less space than Employee_v4.
To summarize, don’t blindly opt for datatypes, choose CHAR when you are certain that a particular column will have fixed length values.
Hope you enjoyed the blog post. There is more learning content in the Developer Series. Also take a look at SQLMaestros Master Classes.