|
|
|
The Anchor Website: .py in the Sky: PART 2
Written by: James Gregory on 05 February 2004
But wait! There's more!
I've got to say, the error handling is probably Python's least sexy
feature. It has proven immensely practical, but it's not much fun. The boring
stuff ends here, from this point forward I'm planning to give you a
whirlwind tour of the 10,000 lines of library code that drives every section
of our website along with our internal administration systems.
The Anchor web library is broken up into roughly three sections:
- Database access code
- Web rendering code
- "Glue"
So, what does that mean? When I was first figuring out how to attack the
task of replacing the existing TCL code, I came to the conclusion that there's
a bunch of static content which I can deal with in whatever way seems most
obvious at the time, and there's some database stuff. At that point, we only
used this code to keep records of tasks we performed manually, so if I were
just replacing this code, that's really about it. Of course I wasn't just
replacing existing code -- that's pointless. The reason I was brought in on the
project at all was that we needed to start getting the computers to do more
work on their own. Thus the "glue" section, which refers generally to the
various technologies we use for automatically "doing stuff".
Let's take a look at each of these items in detail.
The Database
I think databases are cool, not because they support ACID transactions or
indexing or anything like that, just because I think of some Terminator-esque
movie scene every time I say the word. "Your Tranzaktion is committed, baby".
As cool as that would be, databases these days are more constrained in their
abilities. Basically they only do a few things:
- Read from records.
- Update records.
- Make new records.
They also delete stuff and join stuff and so forth, but let's ignore that
for now. Now, the common method of working with databases in web applications
is to design your database, probably on the live server, and write a heap of
specialised code to deal with it every time you work with it. I've worked like
this in the past and found it to be inefficient, and more importantly error
prone. The real problem is that if you change your database schema to split
out some piece of information to a separate table, it's a lot of work to
change every reference to that information. In anchor-lib I've come up with a
partial solution. In short:
- Define the relationships between all tables in a form that Python can
read and manipulate (ie, using Python data structures).
- Write code that uses this information to write the SQL queries for me.
Note that if the definition of these relationships changes, then this code
will automatically write new SQL that follows this change.
- Write code to communicate with this library rather than directly to the
database
There's a lot of information there, let me give you an example. One of the
most common operations this code handles is the problem of recording orders
placed from the website along with credit card details to allow payment
processing. There's really two ways to do this, you can either lump all the
order information along with the credit card details into a single table, or
you can split it into two tables -- one for credit card info and one for order
info. I'll write another article some time about why the latter is better, but
for now it suffices to say that that's the approach we took. So, there's a
simple relationship there:
CREATE TABLE ccard (
id serial primary key,
number text,
-- other ccard info.
);
CREATE TABLE "order" (
id serial primary key,
ccard_id integer references ccard,
customer_name text,
address1 text,
-- other order info
);
That is, for every row in our "order" table, there is an id that points us
to a related row in the "ccard" table. For every row in "order" there will
be exactly one row in ccard where order.ccard_id = ccard.id. So now all I
have to do to record that order is say "Computer, here is some information
about an order and the credit card the customer wants to pay for it with. Put
it in the database please.". And honestly, that's all we do.
Well, almost.
This is actually what we say:
order_data = {
'order.customer_name' : 'Arnold',
'order.address1' : 'Austria',
'ccard.number' : '4000000000000002',
'ccard.expiry' : '04/05'
}
db.handle.multiInsert(order_data)
No hideous SQL, no careful splitting of credit card data from order
data, no worries! And there's only one possible function to debug if
there's ever any problems with a database insertion.
Well, almost.
Under the covers it's doing a lot of work. Here's what happens:
- First, split up all the input into tables. That is, group all the credit
card information together and all the order specific information
together.
- Then, take this list of tables and identify what refers to what. In this
case it would find that there is a field called 'ccard_id' in the 'order'
table that refers to the 'id' field in the 'ccard' table.
- Follow all these joins until we find a table that depends on nothing. In
this case it's the 'ccard' table -- we don't need any information other
than that given to make a row in the ccard table.
- Do that insertion, and use the id that gets returned to update any the
remaining data to contain this reference, thereby meaning that there will
be another table with enough information for an insertion.
- Repeat until there's no data left.
Which is all pretty complicated, but the nice part is that it's now the
computer that does that, rather than me. I never need to think about how
orders relate to credit cards, or how domains relate to hosting accounts; as
long as I can get all the data together in a big bundle the code will make
sure it all gets filed away in the right place.
For the Technically minded
Which might well be all of you. I wanted to explain in slightly more detail
the information we're storing about the database, since I've been waving my
hands a little up to here. We have a Python module called dbmap that
is used to store all this information. For the example I've given you, this is
the data we'd use from dbmap
prikeys = {
'ccards' : 'ccards.ccard_id',
'order' : 'order.id'
}
sequence = {
'ccards' : 'ccards_ccard_id_seq',
'order' : 'order_id_seq'
}
depends = {
'order.ccard_id' : [ 'ccards.ccard_id' ]
}
The first dictionary prikeys lists the name of the primary key for
all the tables we're using. It is used to figure out which field contains the
'id' that we need to extract when attempting to build joined rows. The more
observant of you will observe that this is information you can fetch from
Postgresql's system tables, and that recording it here is bound to
cause errors and waste time. There's a few reasons we aren't doing that. The
first one is that it was only in Postgres 7.4 that this interface was
standardised. This code has been running on a bunch of different Postgres
versions, and it is my hope that porting it to other databases will be
trivial. Aside from this, querying the database for this kind of information
is just unnecessary -- it would slow the library down to a crawl. Finally,
there are assertions in the library to prevent this being a significant source
of errors. If ever a value is missing, the code will throw an exception with
an instruction, in English telling the programmer what needs to happen
to fix the problem.
The sequence map is equally mundane. Postgres provides unique
values for use in primary keys by associating a sequence of numbers with each
numeric primary key. Every time you insert a row, another number is popped off
the sequence. However, if you need to use this number as part of another
insert (or indeed an update or anything else), then there's a race condition
if you perform the insert and then try to fetch the largest value of the
primary key. Postgres provides a workaround by allowing you to fetch the value
from the sequence before doing the insertion, and to specify this id (which
has been reserved for your use) at insert time. That's the only purpose this
dictionary serves -- to let us reserve new primary key values.
The last part: the depends map is by far the most exciting part.
It lists foreign keys, and the primary keys that they refer to. The primary
keys are in a list since we make extensive use of Postgres' table inheritance,
so a primary key could refer to any one of a number of tables. In this
particular example it refers to only one table though, so the list is of
length 1. This is the heart of the system: it's the mechanism that allows the
library to figure out how to connect rows in different tables.
But Why?
I want to conclude here with a brief explanation of why this elaborate
system is a good thing. To be fair, it is a lot of code, and there were a
bunch of problems to be worked out in its infancy. The reasons for its
existence why go back to my first few weeks at Anchor. Our database has
something in the order of 115 tables in it, with a bunch of different naming
conventions and very little documentation. One of the first things I tried to
do was to map out the database in a diagram drawing tool. I quickly realised
that doing so was almost entirely fruitless. Drawing tools just aren't
meant for that kind of thing (it's seriously hard work trying to
fit everything in and maintain such a document). Even if I did get this
information into Dia (the tool I was using), I'd still have to manually
inspect this diagram to write any new queries or check old ones.
The problem got a lot worse when I started thinking about some relatively
minor changes to the database design. Normalisation of a database is really,
really hard when you have to hand check every query after you do it. I
eventually figured that if the database structure could be expressed to the
computer then I'd be immune from these problems -- I'd never need to think
about how things were connected (that is, the intermediate steps weren't
important to me, only that two things were able to be connected
somehow), and if those connections changed there would only be one file
to update.
This isn't the only solution. Another system I've seen involves putting all
the SQL JOIN statements into functions in a single file, and
referring to that list of joins. Such a system works ok, and is less
complicated to get running initially. I found that these systems eventually
had errors creep in. Even though the information was centralised, it was still
a lot of work to maintain.
The system I've implemented is also not without its limitations. I'm
finding that I'm having to extend the initially quite simple implementation to
deal with more and more complex joins. Thankfully we've had surprisingly few
bugs creep in. I attribute this to the fact that all our code uses the
database library, so it's probably the most rigorously tested code-path of
all. It's also got quite slow on some operations. These operations generally
don't affect customer facing code, but they've become a minor concern for
internal systems.
This solution isn't for everybody, but the next time you're dealing with a
100 table database, give some thought as to whether you want to go writing all
that SQL again. Give some thought to how much you want to debug that SQL.
There's probably some SQL gurus out there who can do this without elaborate
libraries. More power to you. Most importantly: use the right tool for the
job.
|