POSTED : June 7, 2016
BY : Jim Egan

Some days ETL developers can’t catch a break due to design flaws that have been lurking for years just waiting to go off like a ticking time bomb. Sometimes these are self-inflicted due to a poor coding approach.  Other times the problem is within the database and won’t show itself until just the right data comes through.  The problem explained below falls into the latter category.

Double Byte Characters

Architects and Developers that have English language blinders on will create many time bombs.  The English language typically uses simple characters that require only a single byte of storage in an Oracle database.  As soon as an application steps outside storing English language data that’s when the fun starts.

The character Ö is used as an example here.  The ASCII value for it is decimal 214 or hexadecimal D6.  For our purposes we’re going to turn this into a Unicode character that requires two bytes instead of a single byte.

The issue that prompted this blog entry is that some years back when the system I’m working on was designed someone decided to declare columns as VARCHAR2 in the Oracle database using the BYTE syntax instead of the CHAR syntax.

CREATE TABLE test_double_byte (five_byte VARCHAR2(5 BYTE));

It may have been the Data Modeler’s fault for explicitly using the BYTE syntax but the DBA could also be at fault.  Oracle will use a default setting for the BYTE/CHAR syntax if it is left off.  VARCHAR2(5) can be either VARCHAR2(5 CHAR) or VARCHAR2(5 BYTE) depending on how the Oracle database parameter NLS_LENGTH_SEMANTICS is set.  If the DBA left the parameter set to BYTE (the install default) and the data modeler wasn’t aware of that then it may explain the problem.

You can verify the current setting for NLS_LENGTH_SEMANTICS using this query:

SELECT value

FROM nls_database_parameters

WHERE parameter = ‘NLS_LENGTH_SEMANTICS’;

In my case the setting was BYTE.

 

Using the table created above let’s create some data.  Remember that we’re forcing the character Ö to a Unicode string that requires two bytes.

> INSERT INTO test_double_byte(five_byte)

> VALUES (UNISTR(‘0d6’);

1 row inserted.

> INSERT INTO test_double_byte(five_byte)

> VALUES (UNISTR(‘0d6’) || UNISTR(‘0d6’);

1 row inserted.

 

Everything is fine so far.  The column five_byte in the second row has two characters that take up four bytes.  The column will hold four bytes so there aren’t any issues – so far.

> INSERT INTO test_double_byte(five_byte)

> VALUES (UNISTR(‘0d6’) || UNISTR(‘0d6’) || UNISTR(‘0d6’);

SQL Error: ORA-12899: value too large for column “TEST_DOUBLE_BYTE”.”FIVE_BYTE” (actual: 6, maximum: 5)

Now we get an error that tells us that we’re trying to put too much data into too small of a column.  We tried to stuff 6 bytes of data into a 5 byte data type.  BOOM!

Solutions

Quick-n-Dirty

In this specific case, since we’re dealing with a legacy system, the easiest solution is to change the data type from VARCHAR2(5 BYTE) to VARCHAR2(5 CHAR).  The Dataflow wasn’t affected and the table metadata stored in the Datastore didn’t need to be refreshed.  It’s a good idea to refresh the metadata but in my case it wasn’t necessary and we wanted a solution that minimized the number of changes.

 

Design Stage

If I was the Data Modeler for this project and we were still in the design stage, I would likely choose to use the NVARCHAR2 data type instead of VARCHAR2.  While VARCHAR2 allows for BYTE and CHAR syntax the NVARCHAR2 data type does not.  NVARCHAR2 is always CHAR.  That means there is never any question as to how many bytes the column will use for storage.  It will use as many bytes as necessary to store that many characters – within the limitations of the data type (see below).

Please note that I selectively use the NVARCHAR2 data type only for those columns that could contain multi byte characters.  Columns that are not free form text, such as indicator columns that will never contain anything other than “Y” or “N”, should still be declared as VARCHAR2.  That data type will use half the storage space as NVARCHAR2.

 

Some things to watch out for:

  1. The VARCHAR2 data type in 11g has a limit of 4,000 bytes.Even if you declare a column as VARCHAR2(4000 CHAR) you can still only stuff 4,000 bytes of data into the column.If your data has 3,999 single byte characters and just one double byte character, then it will exceed what the column can store.
  2. NVARCHAR2 is also limited to 4,000 bytes in 11g.But here Oracle will not allow you to declare NVARCHAR2(4000).It will limit you to NVARCHAR2(2000) and every character (multi byte or single byte) still uses two bytes.
  3. Some functions will truncate data without warning.The RPAD() function is one of these.

> SELECT LENGTH(RPAD(UNISTR(‘0d6’), 2001, UNISTR(‘0d6’) AS STR

> FROM dual;

STR

——

2000

We tried to create a string of 2,001 characters that required 4,002 bytes.  But the RPAD() function only returned 2,000 characters instead.