How to solve special situations of an unnatural value for a domain?


The question has a slightly more conceptual bias, so I'm going to use a simple generic example.

There are cases where some column may be in special situation, this is because an information:

  • is not available,
  • is invalid at some point,
  • is unknown,
  • is undetermined,
  • has no relevance,
  • may be in a special situation that only makes sense for the specific domain .

One way to indicate the special situation is to use of null , which is already questionable whether it is good or bad, although everything has its merit under certain circumstances. The use of null is a clear case of choosing unnatural special values in the column domain to indicate a special situation.

But you may need to indicate more precisely what the special situation is, the reason for the current state. At ask about null there is a definition of how to solve a special situation. But there may be more than one motive, and worse, new motives may be included in the future.

For example, in an academic control there is a column with students' grades in subjects it may not be enough to have a grade of zero. A zero note is not the same as no note. And a null may not be sufficient since it only indicates that there is no note. You need to know why there is no note, since each circumstance may require different treatment. You need to know if the note:

  • not yet given,
  • does not exist because the student is exempt from that assessment,
  • he did not do the evaluation (which in this case is different from him having zero rating).

I know that in some cases it is possible to avoid certain reasons for improving standardization. That's not the point here.

There are basically two solutions to this:

  • Documents that certain values are considered as these reasons (eg values greater than 10 or negative in a note are special situations broken down into each of these values outside the normal range, in example -1 would indicate the reason a %% of% motif -2 and b to motif -3 ).
  • An auxiliary column is used to indicate the reason for the special situation (eg, there will be an indication of the reasons a, b or c cited above and of course the indication that the note is valid and should be considered as normal, even if it is zero).
  • In a domain with a special situation should we use values or auxiliary columns to indicate the situation? That is, is there more advantages in one than another? I particularly prefer the second.

    How to deal with these cases? After all, queries can give wrong results without considering these cases. And this goes for either option.

    A typical example of the problem is the calculation of the average values of selected rows. But all problems involving c apply equally in this case. Database systems often have a solution to deal with null .

    Bonus point: Is there a way to abstract (make transparent) this? Or the user (perhaps a developer) will have to know this need and properly handle each query?

    asked by anonymous 11.09.2014 / 15:42

    2 answers


    All situations, like almost everything that gives rise to this type of discussion, has advantages and disadvantages. Here is an outline of some of the most common that I come across. Note that I am using fairly basic references and using pseudocode for the sake of clarity only:

    The use of null

    For the obvious example of school grades, null is a good choice, in order to differentiate a zero note from a note not yet posted in the system, if the special condition is just the absence of note.

    The care taken in these cases is that in order to use this data, the queries will require special conditions, such as WHERE not_null( nota ) to calculate a mean, for example. >

    A very common error is when the user does one of these: WHERE coluna = null . This is a very common problem for those who are not accustomed to the null concept. Null can not be used as a value in normal operations without the result being null as well.

    Always remembering that semantically null is treated as an unknown value, which is not always the same as missing value.

    The use of "magic values"

    In the example of the note, we could do differently: say that note -1 (or note 100, whatever is agreed) means an unfilled note.

    In this case we would still have to worry about a WHERE to do all the operations, and in addition we would have to have one of these values as default of the field when creating the table, otherwise make a big mess when developing the application.

    nota INT() DEFAULT 100
    Since some DB connectors have problems returning null , a possible advantage of using magic values is the ease of handling these conditions on the application side.

    Another advantage of this solution is that magic values can be used to indicate more than one situation, not just a special value (what has been said in the edit question yet, but remember ). For example, 100 is "waiting," 101 is "dispensed," 102 is "substitute work," and so on.

    Is it not a case of enum ?

    While taking the example of the notes, an output very similar to the "magic values" is in "name" for these values, defining the field this way:

    nota ENUM( '0','1','2','3','4','5','6','7','8','9','10','aguardando','dispensado')
    The advantages are the readability of data when analyzing and editing the DB with native tools, and the fact that enum values can still be used for averages and sums as long as they are created in the correct order .

    The disadvantage is the same as magic values, the need for conditions to obtain aggregate values, so that only values from 0 to 10 are considered. Perhaps some type conversion is necessary in the aggregators, and there may also be a offset between the zero note and the enum value, depending on the DBMS and the internal value of the initial item.

    The use of extra columns

    This case should probably only be used when separating special and "conventional" values can actually help queries or the application in some way. More rarely, in cases where some of the possibilities mentioned above are not acceptable, either by performance or limitation of the technologies being used.

    If you have many exceptions, the extra indexed column can accelerate the query, returning only what is needed, speeding up, for example the use of aggregator functions such as SUM() and AVG() .

    Data collections beyond the table structure

    Let's forget school grades now, this is for completely different data. If you face situations where the need for special values is large, and could occur in several columns in the previously proposed models, you might even think instead of using a special column.

    For example, a metadata column, where you could store collections textually, such as "collation of degree: 12/03/1960, marriage: 05/21/1967, death: 03/08/2011" in the same field. Remembering that this case makes more sense in relational SGDBs.

    The problem here is that these data are almost unprocessable outside the main application.

    Always remembering that this case is for very specific situations only, I just mentioned it as an add-on. Probably the developer will realize when you need an output of this type.

    11.09.2014 / 18:55

    The response describing the ideal procedure decision will depend on two factors:

    • How resilient your model is in terms of changes, and
    • How resistant to change are dependent processes.

    Let's go to a simulation. In the case mentioned, we can assume a model like this:
















    If your model is flexible, but processes do not

    You can implement the new domain, maintaining the interface expected by non-refactored processes.

    One possibility is to change the name of the original table (which will be consumed by new processes that know of the presence of a new domain) and create a view with the same name as the old table (which will feed the legacy processes):

    Thisisaprocessthattakesintoaccountanymethodofreading;multipleDBMSsalsoallowinferredmanipulationofdatabyaview(suchas MS SQL Server and , for example), thus allowing methods UPDATE , INSERT and DELETE are executed by reference.

    If both the model and the processes are rigid

    The option is to create the domain in a separate repository (another model, or directly in the code via enumeration, static collections or similar).

    Always keep in mind that the closer your data model is to the domains it seeks to express, the better; this translates into shorter periods for maintenance, process mapping and model expansion.

    Solutions involving out-of-scope domains ('tables' in code or similar) overshadow aspects of the model and tend to Spaghettizar solution.

    11.09.2014 / 21:53