Monday, June 9, 2008

Using a Data Dictionary

Database applications can be made much simpler if you maintain a body of data that describes your tables. Every single programming task in a database application needs to know something about the tables it is working with, so every program in a framework and many programs in an application can benefit from a central store of information about the database.

Introducing the Data Dictionary

The term "data dictionary" is used by many, including myself, to denote a separate set of tables that describes the application tables. The Data Dictionary contains such information as column names, types, and sizes, but also descriptive information such as titles, captions, primary keys, foreign keys, and hints to the user interface about how to display the field.

A super-simple beginning Data Dictionary might look like this:

TABLE    | COLUMN  | TYPE | PREC* | SCALE | PK | AUTO** |DESCRIPTION
---------+---------+------+-------+-------+----+--------+-------------------
students | student | int  |  -    |  -    | Y  | IDENT  | Student ID
students | firstnm | char | 20    |  -    |    |        | First Name
students | lastnm  | char | 20    |  -    |    |        | Last Name
students | city    | char | 20    |  -    |    |        | City
students | gpa     | num  |  2    |  1    |    |        | Grade Point Avg

* Precision: Needed for chars, varchars, and numerics
** automation: to be described more below

The First Question: Where to Put It

It might seem that the first question about a data dictionary would be "What do we put in it?" We will get to this question in a moment, but the most important question is usually, "Where do we put it?" By this I mean do you put into an XML file, or do you encode it into classes in program code? Or is it somehow directly entered into database tables? There are of course many opinions on this.

The first opinion is that you do not really need a data dictionary per se because you can get this information from the database server (if this is news to you, google "information_schema" and your favorite database). There are two major drawbacks when you depend on the server. First, you cannot build a database out of your data dictionary, because of course you don't have one until the database is built. The second drawback is much worse, which is that you cannot put any extended properties into the dictionary, and without those the dictionary is of little use.

Another method is to put the dictionary into classes, using the typical one-class-per-table approach and storing the data as properties of the class. There are multiple drawbacks to this approach:

  • Data that is "trapped" in code is very difficult to deal with efficiently, so many operations with the dictionary will be much harder to code and will be slower.
  • The dictionary is spread out into many files.
  • Ironically, a good data dictionary allows you to generate most CRUD forms, and so you don't need those one-class-per-table files filling up your directory. It seems silly to make a class that contains data that makes the class unnecessary.

The option I prefer is a plaintext file, which can be generated by a GUI or typed in by hand. The only requirement for the file is that it be easy to type and read, and easy to parse by a program. These requirements are well met by two formats: YAML and JSON (XML is a bletcherous horror to work with manually, so it is disqualified before the race starts). Both YAML and JSON enjoy parsers and writers in nearly all popular languages, so if you create your data dictionary in one of those you have a file that is human readable and writable, machine readable and writable, useful in nearly every language on every platform, easily placed in source control, and very flexible in what it can represent.

First Basic Use: Building and Upgrading Databases

A data dictionary is a wonderful way to build and upgrade databases. It is a real thrill to write your first SQL generator that scans a table of column definitions, writes out a CREATE TABLE statement, and executes it. The next step in the evolution of this process is to have the program query the INFORMATION_SCHEMA of the database, then work out which columns have been added to your data dictionary that are not in the table, and then upgrade the table with an ALTER TABLE statement.

This basic approach can easily be extended to include indexes and keys.

Sidebar: Data Dictionary Versus Upgrade Scripts

Many programmers use upgrade scripts to alter their schemas. The idea is that programmer Sax Russell adds a feature. Along with his code he writes a script that makes the necessary alterations to the database. Then comes Ann Clayborne who does the same thing, followed by Hiroko Ai, and then Sax again. When a customer upgrades their system, they run all four scripts in order. This can lead to horrible upgrade experiences in cases where multiple scripts are upgrading a large table several times. A data dictionary is far superior because it can simply examine the tables as they are, examine the data dictionary, work out a diff, and execute the smallest set of commands to bring the database current. This approach does require of course that the data dictionary be in source control like any other application file.

Second Basic Use: HTML Code Generation

A rich data dictionary can provide you with everything you need to provide "free" CRUD screens for most of your tables. Now to be sure, there are always those tasks that require special screens for the users, but there is just no reason to sit down and code up a screen to managae a table of customer types, zip codes, or even a general ledger chart of accounts. Here is an example of an expanded data dictionary in YAML format that contains enough information to generate screens with zero application code:

table customer_types:
    # Use this to generate menus!
    module: sales
    # Use this for the menu and the page title
    description: Customer Types
    
    column customer_type:
        type_id: char
        column_precision: 10
        # This is crucial for code generation
        description: Customer Type
        primary_key: "Y"
    column description:
        type_id: char
        column_precision: 40
        description: Description

The use of most of those properties should be pretty obvious, but I would like to point out one particular clever trick you can do. The "primary_key" flag can be used to enable a column during insert mode (if it is a user-entered key), and then to gray out the column in edit mode. When you consider this basic example it starts to become clear that nearly all of the code in most CRUD screens can be reduced to a few routines that read the data dictionary and generate HTML.

I would like to repeat that I do not mean to say that every single CRUD form in an application will work this way. In my experience 9 out of 10 tables can use "free" generated forms, but about 1 in 10 are used so often by users that you end up making special forms with shortcuts and wizards to speed up their work. The dictionary can help you there if you use it to generate the inputs, but it is no use trying to expand the dictionary to cover every conceivable case, there is always one more that the next customer needs that would just plain go faster if you coded it up by hand.

Third Basic Use: Trigger Generation

Generating triggers is a very powerful thing you can do with a data dictionary. We saw last week that the most complete encapsulation of code and data occurs when a trigger is placed on a table. Imagine you had a data dictionary that looked like this:

table orderlines:
    column extended_price:
        type_id: numeric
        column_precision: 10
        column_scale: 2
        calculate: @price * @qty

That magic little "calculate" value can be used to generate a trigger and put it onto the table. The trigger code might look something like this (The SQL version is PostgreSQL):

CREATE OR REPLACE FUNCTION example RETURNS TRIGGER AS
$$
BODY
    -- direct assignments are an error
    if new.extended_price is not null then
        raise error 'Direct assignment forbidden: extended_price';
    else
        new.extended_price = new.price * new.qty;
    end if;
END
$$
SECURITY DEFINER LANGUAGE PLPGSQL;

It is not my purpose here to explain how to generate that trigger, but simply to suggest that it is a very doable thing. Because I am not tracing out the steps, I do need to explain that I slipped those "@" signs into the "calculate" value so that when the trigger is built the builder program could detect column names and put the appropriate "old." and "new." prefixes in front of them.

Fourth Basic Use: SQL Access

A database application is full of code that reads and writes to tables. All frameworks and even very modest websites end up with a handful of routines that handle the mundane tasks of sending inserts and updates to the tables.

The Data Dictionary is the perfect resource for these routines. It can be used to properly format data, put quotes where they belong, clip overlong values (or throw an error), prevent the user from changing a primary key, and many other things.

Fifth Basic Use: Documentation

If you have a "rich" data dictionary, one that contains lots of extended properties that describe everything about columns and tables, and if you build your database and generate your forms out of that dictionary, then the next and final natural step is to generate technical documentation out of it as well.

In a first pass, technical documentation is limited to simply displaying the columns that go into a table, which admittedly does not mean much even to technical users. But once you get past that basic task you can begin to layer on lists of parent and child tables (as links of course), descriptions of formulas, and so on.

In a similar vein, Tooltip descriptions go well in a data dictionary.

Conclusion: True Synchronization

The holy grail of database programming is synchronization of code and data. Sychronization means a lot more than simply making sure you delivered the right code and upgraded the tables. Complete synchronization means that the framework is intrinsically incapable of mistakenly accessing missing columns or failing to consider important columns. At the framework level, if you make use of a data dictionary as described above then your framework will be organically synchronized, which is to say that the synchronization is built into the code itself.

Next Essay: Why I Do Not Use ORM

18 comments:

jezemine said...

Good post. There are also several tools on the market that will generate data dictionaries for you. I wrote one called SqlSpec that works against every major DBMS on the market today. You can check it out here: www.elsasoft.org

Barry Mavin said...

I was the developer of the Recital database and 4GL product way back in 1988. This was the first product (that i'm aware of) that sported a data dictionary. In fact i believe i coined this term. It is good to see that this functionality is still deemed as being important in modern day databases. Thank you for this article.

KenDowns said...

Barry: small world, the first database I used was Recital on a MicroVax. I found the company very helpful and they had a nice product. I have also heard the term data dictionary used by A. Neil Pappalardo of Meditech as early as 93 or so, and by a Foxpro framework called Visual Promatrix in the 90s.

amihay gonen said...

Hi , very nice article.

Here is some of my comments :

1. I think sql translator "http://search.cpan.org/~jrobinson/SQL-Translator-0.09000/" can be use as base code for thi
s builder.

I saw "androdma" os based on PHP (i think perl is more server sided)

2. The builder cannot cover all the senrio , for example :
delete from some table
or add FK while there need to so some primary key values.

KenDowns said...

Amihay:

Your link brought me to a page not found, are there any missing letters?

Perl is more server-side than PHP? Sorry no, they are both server-side languages. I like many find Perl incomprehensible after it is written and prefer the readability of PHP.

The builder cannot handle a delete? What do you mean?

If I understand your point about foreign keys, Andromeda handles that by setting the "auto_insert" flag on a foreign key. This causes the database to insert a value in a parent table if it is not present. Extremely useful when used properly, I make great use of it in my own projects.

KenDowns said...

Amihay: I got the link to work. The project you refer to appears to be about schemas only: tables and columns. While the features look very robust for that sphere, my OP (and my Andromeda framework) are a for more advanced approach to the entire picture of security and automations as well.

Lemming said...

Interesting article. Incidently, I used data dictionaries some while ago in one of my PHP projects. The difference was that I used PHP arrays to describe them, and that these arrays where defined inside the classes that operated on these tables - not in an ORM way, but because of modularization, i.e. the main class of a module defined one or more tables and on instantiation made sure that they exist and are up to date. This is essentially object orientation: the data is defined where it is used.

A problem with this approach (data dictionaries in general) vs. update scripts is that it is impossible to distinguish a renamed column from deleting one column and creating a number. This means that during the update, data is lost. Do you have any good solution to that? I think the most comfortable way would be to auto-generate some none-semantic IDs for each table and column (so that you have no intent to rename them, just like you would only rename a username, but not a numeric ID) and storing this ID somewhere in the database as meta information.

Example:
table foo:
id: 1
column one:
id: 2
column two:
id: 3

And later:
table bar:
id: 1
column three:
id: 2
column four:
id: 4
In this example it is clear that table foo was renamed to bar and column one was renamed to three, but column two was removed and column four was added.

What do you think of this? Have you run into other problems with the dictionary approach? How do you handle the potential need to update the actual data along with a schema update (for example when changing a default value)?

KenDowns said...

Lemming: you ask several significant questions, let me try to answer in no particular order.

When it comes to moving data around during an upgrade, such as populating a new table from an old one, I still use scripts. The reason is that this operation only happens when I discover a design flaw in the tables. Each case is always an exception and there is nothing to be gained by trying to work out general-purpose tricks to handle them. But that being said, the Andromeda framework itself is 4 years old and has exactly 3 scripts. A medical office program I wrote is 2 years old and has 4, though I make changes to it every week or two.

My dictionary does support specifying content. For instance, there is a set of tables that is common to all my apps, which I ignore when I don't need them. One of them is a table of timezones. The spec also includes the values for the table. The first time a database is built the table is populated.

As for renaming columns, all I can say is that I do not do that. Once I create a column 'email' as varchar it stays varchar, though somebody always seems to need it to be 5 more characters, so widening columns is a necessary feature. My dictionary defines columns outside of tables, and then *places* them in tables. This means the same column in two tables will always have the same type, precision, and scale. Andromeda will let me put prefixes and suffixes onto them, so I can have "phone_home" and "phone_cell", which are both "instances" of the "phone" column.

With respect to keeping an ID for the columns, I do not do that. I am a heretic. The unique identifier for the "address" column is "address." Adding an ID in my experience is more work with no benefit. The best example I can offer is a query I had to help a customer write that compared values across years. I innocently wrote "WHERE table1.year > table2.year" and he said, "oh, watch out, year is not a year, its an ID to a table that lists.... you got it, years! WTF!!! For Pete's sake, an ID column to identify an integer value? When will this madness end? Why not friggin put in the G**D**M year! Well, I seem to have gotten a little worked up...

I am not sure what you mean by "data is lost." My own builder is non-destructive, if you remove a column from the spec it does not remove it from the table, it just ignores it from then on.

Here is something you may find interesting. My framework defines appx 40 columns that can be used in any database (http://www.andromeda-project.org/pages/cms/Predefined+Columns.html),
and a typical project will use about a dozen of them over and over with prefixes and suffixes. So an order has "amt_tax", "amt_lines", "amt_final" and so forth. After doing this for four years now it feels so natural I really don't know how to talk about dropping and renaming columns any more.

To another of your points, I do not define the data "where it is used" because its definition is independent of whatever use I have in mind when I first see it. This leads to another statement you make...

...you say "on instantiation made sure that they exist and are up to date". This I most definitely do not do. I write complete applications of dozens of tables and one of the most basic operation is to validate the entire spec as itself being valid before making any changes to a customer's database. This is why the spec itself is a self-contained complete description of the entire database. The builder runs through it and valiates it, then validates that the transition from current to new is itself valid. Only when both of requirements are met will it begin to make structure changes.

Another reason not to build or upgrade a table on instantiation is that the entire build process is fragmented and depends on user interaction. You could improve it to ripple through to things like updating or creating parent tables also, but then why not just upgrade the database in one shot and be done with it?

Nilesh said...

I have begun putting all SQL server metadata on my website. It lists all the information_schema views and the underlying code behind it.

sql documentor said...

This is going to be a free document generator check this sample http://www.sqldocumentor.com/sqlserverhtml/sql_server_index.html

Few more days for the release

David said...

Barry -

I'd guess "data dictionary" was coined at least in the UK in the mid-late 1960s.

By early 1970s in US the debate was over "active" vs "passive."

Keeping track of data structures (an SQL table is just another type of data structure, no different than a flat file) is just one of the many functions of a full blown data dictionary.

There are hundreds of data structure types, of which SQL is simply one.

The term "data dictionary" went out of fashion in the late 1980s & "metadata repository" was swapped into the same space.

The genre has essentially ceased to exist in the marketplace other than in handwaving PowerPoint presentations.

Anonymous said...

Regarding your comment that "you cannot put any extended properties" into a data dictionary when building it from the server, I think it's important to point out that on Microsoft SQL, this isn't entirely correct. In MS SQL 2000 and later, you can use sp_addExtendedProperty to add properties to a database. Once this data is in the database, accessing it to create a data dictionary is pretty simple.

Your first point is entirely correct, however - this data doesn't help you unless you already have the database. The other, unstated, advantage of using the database to create a data dictionary is in the situation of taking over a poorly documented database. While not all properties may be filled out, the use of a script-generated data dictionary may allow the DBA to quickly gather information about a database.

On SpaghettiThoughts.com, there is a script available that creates one additional table to store data dictionary information, as well as multiple stored procedures used to populate it. The script is specifically for Microsoft SQL servers, and can be seen and downloaded at http://spaghettithoughts.com/SQL/SQLCode.cfm?blogid=1009 .

Jay said...

Many dittos. Two quibbles:

You say that using the database engine as a data dictionary won't work because then the data dictionary can't exist until you create the database. True, but so what? I can create tables with SQL as easily as I can type up a data dictionary. I don't have to actually populate the database: I can just define the tables. I've done this on a number of projects.

Now when you point out that there are limits to what metadata you can specify -- that's a more telling point. Many DB engines give me no way to specify a description or domain information.

Also, I'm surprised by your statement that 9/10 of screens can be automatically generated from a dictionary. Perhaps you mean 9/10 of the plain vanilla "update a record" screens. Any app I've worked on, perhaps half the screens can be generated. The rest combine data from multiple tables or are otherwise complex.

(The app I'm working on these days has zero generated screens, probably because the original developers never heard of that idea. Nor had they heard of other new-fangled ideas, like normalized databases. But that's another story.)

Probal DasGupta said...

I still use a data dictionary as a knowledge base that is a super-set of all databases. It is a list of unique data items in the company, but a sub-set of what might be called a Business Terminology document for the company.

Steve Ginal said...

Check out dbmaster (dbmaster. It is a great tool what guys built to pull information schema from a database and you can any data dictionary fields for tables and columns and then quickly search and generate triggers, documentation in any format.

Steve Ginal said...

Correct url for the dbmaster is
http://branegy.com/dbmaster. You should check it out if you're working with data dictionary.

Tekk said...

We were using the term data dictionary back in about 1982 - when developing databases using SSADM - which started up in about 1980... So I think the term was coined well before 1988

John Maxx said...

im just amateur about this.. where could i find data dictionary? is it on a xampp that i created a databaser?