Thursday, December 9, 2010

Django and database migrations using South


Currently I am working on a Django project that uses multiple databases. To migrate the databases automatically when the associated models change, we use South. Unfortunately, South cannot handle ForeignKeys between models from different databases, see this thread on the South Users Google Group. When South tries to execute the migration step that introduces such a relationship, it aborts the migration and leaves me with an incomplete database. This also causes a problem when running unit tests, as South's test runner integration will try to build the database and fail. Consequently, no tests are run.

It is very easy to work around this:

  1. Disable South's test runner integration: set the option SOUTH_TESTS_MIGRATE in settings.py to False

  2. Disable South in your Django project: remove 'south' from the tuple of installed apps (in settings.py)

  3. Create the database using Django's syncdb: bin/django syncdb

  4. Enable South: add 'south' to the tuple of installed apps (in settings.py)

  5. Fake the migration up to the last step: bin/django migrate <app> --fake

South uses a special table in your database, south_migrationhistory, to register which migrations have been performed. After the last step this table contains all the migration steps, even though South did not actually execute them.
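
The two settings.py changes from the steps above could be sketched as follows. This is only a sketch: USE_SOUTH and installed_apps() are hypothetical helpers I made up to toggle South, and the app names other than 'south' are placeholders.

```python
# settings.py (fragment)

def installed_apps(use_south):
    """Return the tuple of installed apps, with or without South."""
    apps = (
        'django.contrib.contenttypes',
        'myapp',  # placeholder for your own apps
    )
    if use_south:
        apps += ('south',)
    return apps

# Step 1: do not let South's test runner integration run the migrations.
SOUTH_TESTS_MIGRATE = False

# Steps 2 and 4: False while running syncdb, True again before faking
# the migrations. USE_SOUTH is a made-up name, not a South setting.
USE_SOUTH = False

INSTALLED_APPS = installed_apps(USE_SOUTH)
```

With USE_SOUTH set to False you run syncdb; afterwards you flip it to True and fake the migrations.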

Wednesday, April 28, 2010

Debugging Python code from within Emacs on Windows

It is very easy to debug Python code from within Emacs. However, it does not work as advertised on my Windows XP development machine - I do not know whether it works as advertised on Linux. The problem is that the Emacs buffer to interact with the Python debugger (Pdb) only displays the directory of the source file I want to debug:

Current directory is c:/projects/customer X/scripts/converter/converter

In case the problem is related to the Emacs version, I am using version 23.1.1.

After some investigation, I found out that there were two things I had to take care of to get the interaction with Pdb working.

1. To start Pdb from within Emacs, you M-x the command "pdb" (without the quotes). You then have to enter the command-line required to start Pdb. On my Windows XP machine that is

python c:/Python26/Lib/pdb.py updateweights_tests.py

Emacs parses the output that Pdb writes to stdout to let the user interact with Pdb. By default, this output is buffered on Windows and Emacs can only parse the part of the output that has been flushed. This can result in a deadlock where Emacs is waiting for more Pdb output and Pdb is waiting for user input.

To turn off buffering, you have to supply the Python interpreter with the command-line option -u:

-u     : unbuffered binary stdout and stderr; also PYTHONUNBUFFERED=x
         see man page for details on internal buffering relating to '-u'

The command-line to start Pdb becomes

python -u c:/Python26/Lib/pdb.py updateweights_tests.py
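
The effect of -u can be reproduced without Emacs. The sketch below is my own illustration, not what gud.el does: a parent process plays the role of Emacs and a small child script plays the role of Pdb, writing a prompt without a newline and then waiting for input. CHILD and prompt_then_reply are made-up names, and I assume a Python 3 interpreter.

```python
import subprocess
import sys

# Stand-in for Pdb: print a prompt without a newline, then wait for input.
CHILD = (
    "import sys\n"
    "sys.stdout.write('(Pdb) ')\n"
    "command = sys.stdin.readline()\n"
    "sys.stdout.write('got ' + command)\n"
)


def prompt_then_reply(unbuffered=True):
    """Read the child's prompt through a pipe, then send it one command."""
    cmd = [sys.executable]
    if unbuffered:
        cmd.append("-u")
    cmd += ["-c", CHILD]
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    # Blocks until the six bytes of the prompt have arrived. Without -u the
    # prompt is expected to sit in the child's buffer while the child waits
    # for input, which is the deadlock described above.
    prompt = proc.stdout.read(6)
    output, _ = proc.communicate(b"next\n")
    return prompt, output
```

Calling prompt_then_reply() returns as soon as the prompt arrives. I would not call prompt_then_reply(unbuffered=False) outside an experiment, as I expect it to hang.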

2. Function gud-pdb-marker-filter (gud.el) uses regular expression gud-pdb-marker-regexp to identify specific parts of the Pdb output. This regular expression is defined as follows:

(defvar gud-pdb-marker-regexp
  "^> \\([-a-zA-Z0-9_/.:\\]*\\|<string>\\)(\\([0-9]+\\))\\([a-zA-Z0-9_]*\\|\\?\\|<module>\\)()\\(->[^\n]*\\)?\n")

The first group of the regexp should match any file path. However, it fails to recognize paths that contain a space, such as the path that contains my sources. This was easily fixed by adding a space to the character class of the first group:

(defvar gud-pdb-marker-regexp
  "^> \\([-a-zA-Z0-9_ /.:\\]*\\|<string>\\)(\\([0-9]+\\))\\([a-zA-Z0-9_]*\\|\\?\\|<module>\\)()\\(->[^\n]*\\)?\n")
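
To check the effect of the extra space outside Emacs, the pattern can be tried in Python. The sketch below uses a simplified Python translation of the regexp (Emacs and Python regexp syntax differ, so this is not the gud.el code), and the sample line is made up: the line number and function name are hypothetical.

```python
import re


def marker_regexp(path_chars):
    """Simplified Python translation of the Pdb marker pattern.

    Only the character class of the file path group varies.
    """
    return re.compile(
        r"^> ([" + path_chars + r"]*)"  # group 1: file path
        r"\(([0-9]+)\)"                 # group 2: line number
        r"([a-zA-Z0-9_]*|\?)"           # group 3: function name
        r"\(\)(->[^\n]*)?\n",
        re.MULTILINE,
    )


ORIGINAL_CHARS = r"-a-zA-Z0-9_/.:\\"
PATCHED_CHARS = r"-a-zA-Z0-9_ /.:\\"  # same class plus a space

# Made-up Pdb marker line for a source file below a directory with a space.
SAMPLE = ("> c:/projects/customer X/scripts/converter/converter/"
          "updateweights_tests.py(12)test_update()\n")

original_match = marker_regexp(ORIGINAL_CHARS).search(SAMPLE)
patched_match = marker_regexp(PATCHED_CHARS).search(SAMPLE)
```

Here original_match is None because the path group stops at the space, whereas patched_match captures the complete path, the line number and the function name.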

I use nosetests to automatically collect my unit tests and execute them. It is very easy to run nosetests from Emacs and automatically invoke Pdb as soon as a unittest.TestCase assertion fails. First, create a Python module that invokes nose.run():

import nose; nose.run()

Then M-x the command pdb and enter the following command-line in the minibuffer:

python -u nosetests.py --pdb-failures --nocapture --quiet

The options "--pdb-failures --nocapture --quiet" are intended for the call to nose.run(), which picks them up automatically from the command line. The nosetests "Usage" message has the following to say about these options:

--pdb-failures        Drop into debugger on failures
-s, --nocapture       Don't capture stdout (any stdout output will be
                      printed immediately) [NOSE_NOCAPTURE]
-q, --quiet           Be less verbose

The "--quiet" option is required so nose does not output messages that make it impossible for Emacs to parse the Pdb output.
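
If you do not want to retype the options each time, they could be placed in the runner module itself. This is a sketch of a variant, not the module I actually use: DEBUG_OPTIONS, runner_argv and run_tests are names I made up, and it relies on nose.run() accepting an argv keyword argument.

```python
import sys

# The options discussed above, in the order they were given on the command line.
DEBUG_OPTIONS = ["--pdb-failures", "--nocapture", "--quiet"]


def runner_argv(program="nosetests.py", extra=()):
    """Build the argument list that is handed to nose."""
    return [program] + DEBUG_OPTIONS + list(extra)


def run_tests():
    """Run nose with the debugging options baked in."""
    import nose  # imported here so the module loads without nose installed
    return nose.run(argv=runner_argv(sys.argv[0]))
```

With a call to run_tests() at the bottom of the module, the command line to enter in the minibuffer shrinks to python -u nosetests.py.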

Thursday, February 4, 2010

Bazaar

What better way to evaluate a Distributed Version Control System than by actually using it on a real project? Even if the candidate project is maintained in Subversion...

Why do I want to check out a Distributed Version Control System?

At the place I work we use Subversion as our version control system (VCS). We do not host it ourselves but rely on an external provider for that. In general, we are very pleased both with Subversion and our provider.

However, one problem with this setup is that I need to be online for each VCS action that requires access to the repository. This is not always possible, or only through a low-bandwidth connection, especially when I am on the road. So no commit, no creation or retrieval of a branch, no history, etc.

But even when I am at the office and online through a broadband connection, some actions can take a lot of time due to the sheer amount of data that has to be transferred. The primary example of this is the retrieval of a branch of the project I spend most of my time on. The retrieval of a branch amounts to a 158 MB download. At the office, we have a download speed that maxes out at 200 KB/s. In theory the retrieval takes about 13 minutes, but in practice it takes more than 20 minutes.
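
The theoretical figure follows from a simple division. A small sketch of the arithmetic, in which transfer_minutes is a made-up helper and I assume 1 MB equals 1024 KB:

```python
def transfer_minutes(size_mb, rate_kb_per_s, kb_per_mb=1024):
    """Minutes needed to download size_mb at a constant rate_kb_per_s."""
    seconds = size_mb * kb_per_mb / float(rate_kb_per_s)
    return seconds / 60.0


# 158 MB at the maximum office download speed of 200 KB/s
theoretical = transfer_minutes(158, 200)
```

This comes to roughly thirteen and a half minutes, and it assumes the maximum speed is sustained for the whole download, which is rarely the case.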

Usually my colleagues and I "work around" this problem by working directly on the trunk. This is not a problem when the changes can be applied and tested in a short timeframe. However, changes that appear minor can turn out to be anything but minor. They take longer to develop than expected, and all that time you cannot commit your changes for fear of polluting the trunk. To make things worse, chances are that other developers have committed changes to the trunk in the meantime, which you have to pull into your working copy before you can commit at all.

Distributed version control systems (DVCSs) can help me with these issues as they allow me to store the complete repository locally. So I decided to test drive one of these DVCSs, namely Bazaar (Bzr), on the aforementioned project. There are two reasons why I chose Bzr. First, there exists an extension to Bzr that allows one to interface to a Subversion repository, see this link for more information. But this functionality is not exclusive to Bzr, for example Git has it too. This brings me to the second reason. At the time of writing this blog entry, Bzr was the only DVCS I had used and it had been a good experience.

Way of working

Just as the documentation of the Bzr Subversion extension advises, I created a shared repository and, within that repository, a Bzr checkout of the trunk of the Subversion repository:

bzr init-repo --default-rich-root app-repo

cd app-repo
bzr checkout https:<path-to-trunk> app-trunk

The last command gave me some headaches, but that is another story.

The reason I created a checkout and not an ordinary branch is that when you commit to a checkout, Bzr also passes your commit through to the original branch. In my case, this is the trunk in the Subversion repository. This immediately makes clear the way of working I have in mind:

  1. When I want to work on a branch, I branch my local checkout and work on the new branch. I commit all my changes to the local branch.

  2. When I want to merge my changes with the trunk, I push them from my branch to the checkout of the trunk.

First results

Checking out a branch from the Subversion repository with the TortoiseSVN client took 11 minutes and 17 seconds. I measured this at home, where I have a higher download speed than at the office. In comparison, branching the local Bzr checkout takes 1 minute and 9 seconds. This is almost 10 times as fast as the Subversion checkout at home, and more than 17 times as fast as the Subversion checkout at the office. This is a great speed-up, although the 1 minute and 9 seconds itself left me a bit underwhelmed. Why does it still take more than a minute to branch? One cause could be the actual size of the branch. Once branched, the directory tree of the new branch takes up 1 GB and it definitely takes time to write that amount of data.
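
The speed-up factors follow directly from the measured times. A small sketch of the arithmetic, in which seconds is a made-up helper and the office time is the lower bound of 20 minutes mentioned earlier:

```python
def seconds(minutes, secs=0):
    """Convert a duration in minutes and seconds to seconds."""
    return minutes * 60 + secs


svn_home = seconds(11, 17)   # Subversion checkout at home
svn_office = seconds(20)     # lower bound for the checkout at the office
bzr_branch = seconds(1, 9)   # branching the local Bzr checkout

speedup_home = svn_home / float(bzr_branch)
speedup_office = svn_office / float(bzr_branch)
```

The first factor ends up just below 10, the second just above 17.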

Unanswered questions

With the repository layout described above, can I recreate a branch after I have deleted its working tree, even if the trunk has evolved in the meantime? The real question is how Bzr identifies branches inside a shared repository. By their path in the repository directory? What would happen when I create a branch whose path coincides with that of a previously created branch whose working tree I have deleted? For now it is not really an issue, as I will name my branches uniquely and probably, once deleted, never need to restore them. But the answer to this question would increase my understanding of the inner workings of Bzr.

What about collaborating with my colleagues who only use Subversion? Can I push my local branch directly to a Subversion branch? Does a shared repository present the best setup for me? Maybe I should use stacked branches? For now, let's see how it all works out in practice.

Kind regards.