Shipyard
logo
THIS SITE IS NO LONGER MAINTAINED. MOST CONTENT HAS BEEN MIGRATED TO ANCHOR HOSTING WEBSITE.
     
     
Advertising
.au domain names
free transfers, registrations and renewals from $69

Australian web hosting PHP, MySQL, Java
from $198/year

Dedicated servers
Australian, Linux and Windows, $175/month
 

The Anchor Website: .py in the Sky: PART 2

Written by: James Gregory on 05 February 2004

But wait! There's more!

I've got to say, the error handling is probably Python's least sexy feature. It has proven immensely practical, but it's not much fun. The boring stuff ends here, from this point forward I'm planning to give you a whirlwind tour of the 10,000 lines of library code that drives every section of our website along with our internal administration systems.

The Anchor web library is broken up into roughly three sections:

  • Database access code
  • Web rendering code
  • "Glue"

So, what does that mean? When I was first figuring out how to attack the task of replacing the existing TCL code, I came to the conclusion that there's a bunch of static content which I can deal with in whatever way seems most obvious at the time, and there's some database stuff. At that point, we only used this code to keep records of tasks we performed manually, so if I were just replacing this code, that's really about it. Of course I wasn't just replacing existing code -- that's pointless. The reason I was brought in on the project at all was that we needed to start getting the computers to do more work on their own. Thus the "glue" section, which refers generally to the various technologies we use for automatically "doing stuff".

Let's take a look at each of these items in detail.

The Database

I think databases are cool, not because they support ACID transactions or indexing or anything like that, just because I think of some Terminator-esque movie scene every time I say the word. "Your Tranzaktion is committed, baby". As cool as that would be, databases these days are more constrained in their abilities. Basically they only do a few things:

  • Read from records.
  • Update records.
  • Make new records.

They also delete stuff and join stuff and so forth, but let's ignore that for now. Now, the common method of working with databases in web applications is to design your database, probably on the live server, and write a heap of specialised code to deal with it every time you work with it. I've worked like this in the past and found it to be inefficient, and more importantly error prone. The real problem is that if you change your database schema to split out some piece of information to a separate table, it's a lot of work to change every reference to that information. In anchor-lib I've come up with a partial solution. In short:

  • Define the relationships between all tables in a form that Python can read and manipulate (ie, using Python data structures).
  • Write code that uses this information to write the SQL queries for me. Note that if the definition of these relationships changes, then this code will automatically write new SQL that follows this change.
  • Write code to communicate with this library rather than directly to the database

There's a lot of information there, let me give you an example. One of the most common operations this code handles is the problem of recording orders placed from the website along with credit card details to allow payment processing. There's really two ways to do this, you can either lump all the order information along with the credit card details into a single table, or you can split it into two tables -- one for credit card info and one for order info. I'll write another article some time about why the latter is better, but for now it suffices to say that that's the approach we took. So, there's a simple relationship there:

CREATE TABLE ccard (
    id serial primary key,
    number text,
    -- other ccard info.
);

CREATE TABLE "order" (
    id serial primary key,
    ccard_id integer references ccard,
    customer_name text,
    address1 text,
    -- other order info
);

That is, for every row in our "order" table, there is an id that points us to a related row in the "ccard" table. For every row in "order" there will be exactly one row in ccard where order.ccard_id = ccard.id. So now all I have to do to record that order is say "Computer, here is some information about an order and the credit card the customer wants to pay for it with. Put it in the database please.". And honestly, that's all we do.

Well, almost.

This is actually what we say:

order_data = {
    'order.customer_name' : 'Arnold',
    'order.address1'      : 'Austria',
    'ccard.number'        : '4000000000000002',
    'ccard.expiry'        : '04/05'
}
db.handle.multiInsert(order_data)

No hideous SQL, no careful splitting of credit card data from order data, no worries! And there's only one possible function to debug if there's ever any problems with a database insertion.

Well, almost.

Under the covers it's doing a lot of work. Here's what happens:

  • First, split up all the input into tables. That is, group all the credit card information together and all the order specific information together.
  • Then, take this list of tables and identify what refers to what. In this case it would find that there is a field called 'ccard_id' in the 'order' table that refers to the 'id' field in the 'ccard' table.
  • Follow all these joins until we find a table that depends on nothing. In this case it's the 'ccard' table -- we don't need any information other than that given to make a row in the ccard table.
  • Do that insertion, and use the id that gets returned to update any the remaining data to contain this reference, thereby meaning that there will be another table with enough information for an insertion.
  • Repeat until there's no data left.

Which is all pretty complicated, but the nice part is that it's now the computer that does that, rather than me. I never need to think about how orders relate to credit cards, or how domains relate to hosting accounts; as long as I can get all the data together in a big bundle the code will make sure it all gets filed away in the right place.

For the Technically minded

Which might well be all of you. I wanted to explain in slightly more detail the information we're storing about the database, since I've been waving my hands a little up to here. We have a Python module called dbmap that is used to store all this information. For the example I've given you, this is the data we'd use from dbmap

prikeys = {
    'ccards' : 'ccards.ccard_id',
    'order'  : 'order.id'
}

sequence = {
    'ccards' : 'ccards_ccard_id_seq',
    'order'  : 'order_id_seq'
}

depends = {
    'order.ccard_id' : [ 'ccards.ccard_id' ]
}

The first dictionary prikeys lists the name of the primary key for all the tables we're using. It is used to figure out which field contains the 'id' that we need to extract when attempting to build joined rows. The more observant of you will observe that this is information you can fetch from Postgresql's system tables, and that recording it here is bound to cause errors and waste time. There's a few reasons we aren't doing that. The first one is that it was only in Postgres 7.4 that this interface was standardised. This code has been running on a bunch of different Postgres versions, and it is my hope that porting it to other databases will be trivial. Aside from this, querying the database for this kind of information is just unnecessary -- it would slow the library down to a crawl. Finally, there are assertions in the library to prevent this being a significant source of errors. If ever a value is missing, the code will throw an exception with an instruction, in English telling the programmer what needs to happen to fix the problem.

The sequence map is equally mundane. Postgres provides unique values for use in primary keys by associating a sequence of numbers with each numeric primary key. Every time you insert a row, another number is popped off the sequence. However, if you need to use this number as part of another insert (or indeed an update or anything else), then there's a race condition if you perform the insert and then try to fetch the largest value of the primary key. Postgres provides a workaround by allowing you to fetch the value from the sequence before doing the insertion, and to specify this id (which has been reserved for your use) at insert time. That's the only purpose this dictionary serves -- to let us reserve new primary key values.

The last part: the depends map is by far the most exciting part. It lists foreign keys, and the primary keys that they refer to. The primary keys are in a list since we make extensive use of Postgres' table inheritance, so a primary key could refer to any one of a number of tables. In this particular example it refers to only one table though, so the list is of length 1. This is the heart of the system: it's the mechanism that allows the library to figure out how to connect rows in different tables.

But Why?

I want to conclude here with a brief explanation of why this elaborate system is a good thing. To be fair, it is a lot of code, and there were a bunch of problems to be worked out in its infancy. The reasons for its existence why go back to my first few weeks at Anchor. Our database has something in the order of 115 tables in it, with a bunch of different naming conventions and very little documentation. One of the first things I tried to do was to map out the database in a diagram drawing tool. I quickly realised that doing so was almost entirely fruitless. Drawing tools just aren't meant for that kind of thing (it's seriously hard work trying to fit everything in and maintain such a document). Even if I did get this information into Dia (the tool I was using), I'd still have to manually inspect this diagram to write any new queries or check old ones.

The problem got a lot worse when I started thinking about some relatively minor changes to the database design. Normalisation of a database is really, really hard when you have to hand check every query after you do it. I eventually figured that if the database structure could be expressed to the computer then I'd be immune from these problems -- I'd never need to think about how things were connected (that is, the intermediate steps weren't important to me, only that two things were able to be connected somehow), and if those connections changed there would only be one file to update.

This isn't the only solution. Another system I've seen involves putting all the SQL JOIN statements into functions in a single file, and referring to that list of joins. Such a system works ok, and is less complicated to get running initially. I found that these systems eventually had errors creep in. Even though the information was centralised, it was still a lot of work to maintain.

The system I've implemented is also not without its limitations. I'm finding that I'm having to extend the initially quite simple implementation to deal with more and more complex joins. Thankfully we've had surprisingly few bugs creep in. I attribute this to the fact that all our code uses the database library, so it's probably the most rigorously tested code-path of all. It's also got quite slow on some operations. These operations generally don't affect customer facing code, but they've become a minor concern for internal systems.

This solution isn't for everybody, but the next time you're dealing with a 100 table database, give some thought as to whether you want to go writing all that SQL again. Give some thought to how much you want to debug that SQL. There's probably some SQL gurus out there who can do this without elaborate libraries. More power to you. Most importantly: use the right tool for the job.

x