Sunday, May 15, 2011

KILL THE NULL, KILL'EN ALL

The magical "NULL" pointer has been with us for many years (decades, actually). Some developers will argue that null values are useful and even necessary when writing software. Nulls let us declare a variable without a defined value, and many people will think that is needed when we don't know in advance the real value of an entity.

Imagine for example a database whose schema is upgraded by adding a new column "second name" to a table. Since we don't know in advance what value "second name" should take for each row, and since many new rows will probably have no "second name" at all, it's quite easy to let it default to NULL.

In fact, I will try to show that using NULL values is always a bad idea. The first trouble many people will have when trying to get rid of nulls is solving the problem in the previous paragraph. If I don't know the value of a variable, or if a variable has no value at all (maybe because performance reasons forced us to denormalize a database), there is no option other than assigning NULL.

Right? Wrong. It's always possible to assign a default value to tag a field as NULL. In the previous example we could simply assign the string "NULL" to indicate a non-existent second name. It looks like an arbitrary decision that doesn't improve our code quality at all, and we are wasting extra bytes. In fact, it makes our application safer. Let's see why.
Imagine a typical web app in which a form sends the "second name" to our server in a GET/POST request, something like second_name=Armstrong.
Then our PHP server code will look something like:

$second_name = $_REQUEST["second_name"];

Look now at what happens if we accidentally misspell second_name in the HTML form or in the PHP code:
$second_name will be assigned a NULL value, not the "NULL" string we want when "second_name" is not used.
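Java has the same silent-null trap: Map.get returns null for a missing key, so a misspelled key is indistinguishable from a genuinely absent value. A minimal sketch of the same failure and of the "NULL" sentinel fix (the request map simulating $_REQUEST is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class MisspelledKey {
    public static void main(String[] args) {
        // Simulated request parameters, like PHP's $_REQUEST
        Map<String, String> request = new HashMap<>();
        request.put("second_name", "Armstrong");

        // Misspelled key: get() silently hands back null instead of failing
        String secondName = request.get("secondname");
        System.out.println(secondName); // prints "null"

        // With a sentinel default, the variable always carries a comparable value
        String safe = request.getOrDefault("secondname", "NULL");
        System.out.println(safe); // prints "NULL"
    }
}
```

The misspelling still happens, but downstream code can now compare against the sentinel instead of blowing up on a null it never expected.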

The previous example is just one of the many possible ways in which a variable can silently be set to a wrong NULL value. As you can see, NULL values are quite risky, since languages tend to leave variables in a "null" state when something goes wrong. For example Java, probably the most widely used programming language around, has a big design mistake:
try-catch exception demarcation bounds are also used as scope demarcation bounds, so a variable declared inside a try block is not visible after it. That means we cannot write code like:

   try {
       MyClass myObject = new MyClass();
       ...
   } catch (Exception e) {
       ...
   }
   myPreparedStatement.setObject(1, myObject);
   myPreparedStatement.execute();

The Java language syntax forces us to do something similar to:

   MyClass myObject = null;
   try {
       myObject = new MyClass();
       ...
   } catch (Exception e) {
       ...
   }
   ...
   myPreparedStatement.setObject(1, myObject);
   myPreparedStatement.execute();

As we can see in the previous code, if an exception is thrown during object initialization and we are not cautious enough, we will insert a wrong NULL into the database.

(Note: in the previous example it's possible to move the last two lines inside the try-catch, and that will avoid propagating a null to the database. In larger code it could be impossible, and even when it is possible, it's up to the person writing the code to do so, and we are searching for a way of writing code that doesn't depend on the programmer's skills and experience.)
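One way out of that forced null initialization, sketched here under the assumption that MyClass can define its own defaults (the class and its state field are illustrative), is the Null Object pattern: initialize the variable to a distinguishable default instance, so even on failure the variable carries meaning instead of nothing:

```java
public class NullObjectDemo {
    static class MyClass {
        final String state;
        MyClass(String state) { this.state = state; }
        // Shared default instance used instead of null
        static final MyClass NOT_AVAILABLE = new MyClass("Not Available");
        boolean isNotAvailable() { return this == NOT_AVAILABLE; }
    }

    static MyClass load(boolean fail) {
        MyClass myObject = MyClass.NOT_AVAILABLE; // default carries state, not nothing
        try {
            if (fail) throw new RuntimeException("initialization failed");
            myObject = new MyClass("Loaded");
        } catch (Exception e) {
            // myObject keeps its meaningful default
        }
        return myObject;
    }

    public static void main(String[] args) {
        System.out.println(load(true).state);  // prints "Not Available"
        System.out.println(load(false).state); // prints "Loaded"
    }
}
```

Code that later writes myObject to the database can test isNotAvailable() and react, instead of silently persisting a NULL.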

With C#, the situation is even worse, since the C# designers were drunk enough not only to copy the same Java pitfalls, but to encourage the use of nulls "everywhere".

In fact, NULL values are the biggest mistake ever made in software history, and as time passes, the damage grows.

The problem with NULL is that NULL actually is/means/maps to *nothing*. Null is not "unknown", null is not "not applicable", not even "not available", not even "right" or "wrong". Null bears no information at all. If we were physicists, NULL would compare to a singularity in our equations, a clear symptom that something is really, really wrong.

In the example of the database we can assign the string "NULL" to indicate an unknown value. In fact we can assign the string "Not Known" to indicate we don't know its value, even if it exists, and we can also assign the string "Not Applicable" when the row, due to denormalization, doesn't use that value. Our code will now be able to compare the string and apply one business logic or another, since now it carries information and state.

Convention lets us replace "null" values with a given value or, even better, with a given set of values ("Not Known", "Not Applicable", "" (empty), ...). That's probably what we really want. It means a little bit of extra coding, since we will need to create default rows/objects for each entity in our UML diagram, but the extra work is always worth the effort once our code grows beyond a thousand lines. Magically, null pointers will vanish. Errors will be detected much earlier and the correct exceptions thrown properly.
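The "set of default values" idea maps naturally to a small enum of states plus the real value, so business logic can branch on information instead of on nothing. A sketch under illustrative names (SecondName and its states are not from the original):

```java
public class SecondNameDemo {
    enum State { KNOWN, NOT_KNOWN, NOT_APPLICABLE }

    static class SecondName {
        final State state;
        final String value;
        private SecondName(State state, String value) { this.state = state; this.value = value; }
        static SecondName of(String value) { return new SecondName(State.KNOWN, value); }
        // Shared default objects replacing the would-be NULLs
        static final SecondName NOT_KNOWN      = new SecondName(State.NOT_KNOWN, "Not Known");
        static final SecondName NOT_APPLICABLE = new SecondName(State.NOT_APPLICABLE, "Not Applicable");
    }

    static String describe(SecondName n) {
        // Every branch carries business meaning; none of them is "nothing"
        switch (n.state) {
            case KNOWN:     return "Second name is " + n.value;
            case NOT_KNOWN: return "Ask the user for a second name";
            default:        return "This entity has no second name";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(SecondName.of("Armstrong")));
        System.out.println(describe(SecondName.NOT_KNOWN));
        System.out.println(describe(SecondName.NOT_APPLICABLE));
    }
}
```

A null can only crash; each default object above selects a different, deliberate behavior.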

And last but not least, here comes a trick to avoid NULLable values in a database in a safe way:

  • Imagine you have a set of "core" tables in your schema, for example "client" and "product".

  • Imagine now that, due to performance issues or whatever the reason, you have another non-normalized table "notNormalizedTable" with foreign key columns "fk_client_id" and "fk_product_id" that often don't have a set value for client, for product, or for both.

  • If you want to avoid the risky NULLs, the first step is to create default client and product rows. For example:

    client:
    =======
    id | name
    01 | "Not Known"
    02 | "Not Applicable"
    ...
    product:
    =======
    id | name
    01 | "Not Known"
    02 | "Not Applicable"
    ...


  • A new problem arises now. It's possible to accidentally delete the client or product "meta-rows" that represent the "Not Applicable"/"Not Known"/... values, and then our strategy for NULL replacement comes to an end.
    Nonetheless, the solution is quite easy. Just create a new meta-table "nullKiller" with a foreign key for each table for which we want "meta-rows", using the "ON DELETE NO ACTION ON UPDATE NO ACTION" clause, like:

    CREATE TABLE `nullKiller` (
      `fk_client_id` INT NOT NULL,
      `fk_product_id` INT NOT NULL,
      ...
      `fk_entityN_id` INT NOT NULL,
      CONSTRAINT `nullKiller_const1`
        FOREIGN KEY (`fk_client_id`)
        REFERENCES `client` (`id`)
        ON DELETE NO ACTION ON UPDATE NO ACTION,
      CONSTRAINT `nullKiller_const2`
        FOREIGN KEY (`fk_product_id`)
        REFERENCES `product` (`id`)
        ON DELETE NO ACTION ON UPDATE NO ACTION,
      ...
      CONSTRAINT `nullKiller_constN`
        FOREIGN KEY (`fk_entityN_id`)
        REFERENCES `entityN` (`id`)
        ON DELETE NO ACTION ON UPDATE NO ACTION
    );

    then just insert the new rows:

    INSERT INTO `nullKiller`
      (`fk_client_id`, `fk_product_id`, ..., `fk_entityN_id`)
      VALUES (1, 1, ..., 1), (2, 2, ..., 2);


    Once the `nullKiller` rows are inserted, we are sure that a reserved set of ids for each entity is kept for default values thanks to the "ON UPDATE NO ACTION" clause, and we are also sure those rows will not be accidentally deleted thanks to the "ON DELETE NO ACTION" clause.


So, you know what to do now: "KILL THE NULL, KILL'EN ALL"
