The PGXN infrastructure is currently made up of four parts:
The network of mirrors, derived from the master mirror, and synchronized on various schedules via rsync (host a mirror).
PGXN Manager (code) is the core of the network. It provides the interface for users to upload releases, processes those releases, indexes them, and puts them on the master mirror. Details below.
PGXN API (code) is a mirror server with benefits. Once an hour, it rsyncs from the master mirror, and does extra processing of new and modified files, notably full-text indexing. I’ll write up some details next week.
PGXN Site (code) powers the main site. It’s a thin wrapper around the API server, using WWW::PGXN to fetch JSON files and convert them to HTML. I’ll write more about this bit next week, too.
Some details on Manager. This is by far the most complicated part of the system. Which is funny, because I hadn’t anticipated that when I started work on PGXN (I’d estimated half as many hours as for implementing the site). But as I worked through design issues and wrote the code, the need for that complexity became apparent — and not just because it’s the only part that offers authentication. The main reason it’s complex is so that no other part needs to be.
Allow me to explain. People can upload almost anything as a distribution. So long as the META.json adheres to the spec, the rest can be just about anything. But the upload doesn’t necessarily end up on the network unmodified. Sure, if you follow the guidelines of the HOWTO, what you uploaded will be exactly what ends up on the network. But I didn’t want to be that strict about what PGXN Manager would accept. So in addition to verifying the structure of your META.json file, PGXN::Manager::Distribution also:
Extracts the archive. If it’s not a zip file, a zip archive is created from the contents of the uploaded file. You can upload any kind of archive readable by Archive::Extract. The currently supported formats are: .tar, .tar.gz, .gz, .Z, .tar.bz2, .tbz, .bz2, .zip, .xz, .txz, .tar.xz, and .lzma. By always converting to a zip file, PGXN client apps can be quite simple, not having to worry about the archive format.
Validates the META.json. This part isn’t perfect, yet. I fixed a bug today where it would die and return a 500 on a missing version number. It’d probably be worthwhile to adapt CPAN::Meta::Validator to validate PGXN META.json files at some point, both for Manager and for developers wanting to validate before uploading.
Normalizes all version numbers in the META.json into semantic versions. You can specify the distribution version, prerequisite versions, and extension versions as simple numbers and they’ll be converted to semantic versions. A version like “1.20”, for example, becomes “1.2.0”. See the SemVer documentation for details on how versions are normalized via the declare() method. This normalization is done so that client applications will get known valid semantic versions to compare when determining dependencies. However, it’s best that they be semantic versions to begin with. Normalized versions will be written back to the archive META.json file (with the “generated_by” key updated to reflect that PGXN Manager regenerated the file). If no versions need normalizing, the archive META.json will be left alone.
Makes sure the zip archive has a directory prefix named "$dist-$version/". If the archive has no directory prefix, or if the prefix is not "$dist-$version/", the archive is rewritten with that prefix. This ensures that the archive will always extract into a directory with the same name as the archive and not spray files all over your desktop when you unzip it.
Copies or writes out a new zip file named "$dist-$version.pgz". Think of .pgz as “PostgreSQL Zip” or, if you’d rather, “PGXN Zip”. Either way, it’s just a zip archive.
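For the curious, the repackaging steps above can be sketched in a few lines. This is only an illustration in Python, not the actual PGXN::Manager::Distribution code (which is Perl and uses Archive::Extract). It handles tar-based uploads only, and the function names `repack_as_pgz` and `_common_top_dir` are made up for this example.

```python
import io
import tarfile
import zipfile


def _common_top_dir(names):
    """Return the single top-level directory shared by all names, or None."""
    tops = set()
    for name in names:
        head, sep, _ = name.partition("/")
        if not sep:
            return None  # a file sits at the top level, so there is no prefix
        tops.add(head)
    return tops.pop() if len(tops) == 1 else None


def repack_as_pgz(tar_bytes, dist, version):
    """Convert a (possibly compressed) tar archive into zip bytes.

    Every file ends up under "<dist>-<version>/", whatever the uploaded
    archive used as its directory prefix. Returns (file_name, zip_bytes).
    """
    prefix = f"{dist}-{version}/"
    out = io.BytesIO()
    # "r:*" lets tarfile detect gzip, bzip2, or xz compression on its own.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        members = [m for m in tar.getmembers() if m.isfile()]
        names = [m.name[2:] if m.name.startswith("./") else m.name
                 for m in members]
        top = _common_top_dir(names)
        with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
            for member, name in zip(members, names):
                # Drop the old prefix, if any, then add the expected one.
                rel = name[len(top) + 1:] if top is not None else name
                zf.writestr(prefix + rel, tar.extractfile(member).read())
    return f"{dist}-{version}.pgz", out.getvalue()
```

Because the output is always a zip archive with a predictable top-level directory, a client only needs a zip reader and never has to inspect the upload format.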
That processing done, with a good META.json and zip archive, the JSON, username, and SHA1 of the zip archive are handed off to the database for more processing. The add_distribution() database function does all the heavy lifting here. It:
Parses the JSON string, validates that all required keys are present, and normalizes version numbers. Yes, this is redundant, but I don’t think I need to lecture the readers of this blog about database integrity. :-)
Creates a new metadata structure and stores all the required and many of the optional meta spec keys, as well as the SHA1 of the distribution file, the date, and the user’s nickname.
Sets the “release_status” to “stable” if there was no status in the original JSON.
Adds a “provides” section to the metadata if none was included in the original JSON. In such a case, it assumes that the distribution contains one extension and that it has the same name and version as the distribution itself.
Validates that the uploading user is owner or co-owner of all provided extensions. If no one is listed as owner of one or more included extensions, the user will be assigned ownership. If the user is not owner or co-owner of any included extensions, an exception will be thrown.
Records the distribution, extensions, and tags in the database.
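To make those rules a bit more concrete, here is a rough Python sketch of the defaulting and ownership checks. It is not the real add_distribution(), which lives in the database. The names `prepare_metadata` and `OwnershipError`, the `owners` mapping, and the `user` and `sha1` keys are assumptions for illustration, and only two required keys are checked here, where the spec requires more.

```python
import json


class OwnershipError(Exception):
    """Raised when the uploader may not release one of the extensions."""


def prepare_metadata(meta_json, nickname, sha1, owners):
    """Apply the defaulting and ownership rules to uploaded metadata.

    `owners` maps an extension name to the set of nicknames that own or
    co-own it; unowned extensions are assigned to the uploader.
    """
    meta = json.loads(meta_json)
    for key in ("name", "version"):  # stand-in for the full required list
        if key not in meta:
            raise ValueError(f"missing required key: {key}")

    # No status in the original JSON means the release is stable.
    meta.setdefault("release_status", "stable")
    # No "provides" means one extension, named and versioned like the dist.
    meta.setdefault("provides", {meta["name"]: {"version": meta["version"]}})

    # Check every extension before changing anything, so that a failure
    # leaves `owners` untouched (much as a rolled-back transaction would).
    for ext in meta["provides"]:
        ext_owners = owners.get(ext)
        if ext_owners and nickname not in ext_owners:
            raise OwnershipError(f"{nickname} does not own {ext}")
    for ext in meta["provides"]:
        if not owners.get(ext):
            owners[ext] = {nickname}

    meta["user"] = nickname
    meta["sha1"] = sha1
    return meta
```

Calling it with `'{"name": "pair", "version": "0.1.0"}'` and an empty `owners` mapping fills in both defaults and records the uploader as owner of `pair`.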
Once all this work is done, add_distribution() returns all the JSON that needs to be written to the mirror. These files make up the “index” on the network, and include:
META.json file (example). This file is derived from the META.json included in the archive (example), but reflects all the normalization changes and added keys outlined above.
It also returns JSON for network statistics files. These are updated every time a new release is uploaded:
dist.json lists the 56 most recent releases and has a count of all distributions and of releases of those distributions.
extension.json lists the 56 most recent extension releases and a count of all extensions on the network.
user.json lists the 56 most prolific users (based on the number of distributions) and a count of distributions and releases for each.
tag.json lists the 56 most popular tags (measured by the number of distributions they’re associated with) and a count of all tags on the network.
summary.json has basic summary information about the network, which is just counts of distributions, releases, extensions, users, tags, and mirrors.
If you think that’s a lot of data to be updated, you’re right! But since releases are relatively infrequent (a couple a day at the moment), it’s best to generate all this stuff as static files that are rsynced to all mirrors. In this way, any mirror can function as a very simple, lightweight REST API. And indeed, that’s just how the planned PGXN client will behave. WWW::PGXN already provides the interface it will use.
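As a small illustration of why static files are enough here, this Python sketch computes something like the tag statistics: the most popular tags by number of associated distributions, plus a count of all tags. PGXN Manager generates these files in the database, not in Python, and the function name `tag_stats` and the field names `count`, `popular`, `tag`, and `dists` are invented for the example rather than taken from the real tag.json.

```python
import json
from collections import Counter


def tag_stats(dist_tags, limit=56):
    """Summarize tag popularity across distributions.

    `dist_tags` maps a distribution name to an iterable of its tags.
    """
    counts = Counter()
    for tags in dist_tags.values():
        # A tag counts once per distribution, even if it is listed twice.
        counts.update(set(tags))
    # Most popular first; ties are broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {
        "count": len(counts),
        "popular": [{"tag": tag, "dists": n} for tag, n in ranked[:limit]],
    }


if __name__ == "__main__":
    example = {
        "pair": ["ordered pair", "key value"],
        "semver": ["semantic version", "key value"],
    }
    # The result is plain JSON, ready to be written out and rsynced.
    print(json.dumps(tag_stats(example), indent=2))
```

Since the output only changes when a release is uploaded, regenerating it at upload time and letting rsync distribute it keeps every mirror free of any dynamic code.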
I guess that was a lot of information. Let this be a reference document for interested hackers, then. The core functionality is all there, but there’s a lot more to be done:
rrr (#10).
$nickname@pgxn.org will be forwarded to a user’s actual address. This would allow us to remove literal email addresses from the JSON files and the site (the site obfuscates them, but still…). Anyone got some good postfix chops for this? The users table is quite simple (#13).
Want to help out? Fork PGXN Manager and have at it. Hell, at this point I’d really appreciate a code review, as I’m pretty sure there’s only been one set of eyes on this code so far.
Next week, I plan to blog about
rsync
But given how these things go, and how I need to start writing mirror API and API server documentation, it might take me a while longer to get to them all.
Yesterday was a busy day. In addition to making the first PGXN release, I updated the fundraising spreadsheet and then the thermometer displayed on the main site. The good news is that things are coming along nicely. Thanks to recent contributions from Command Prompt, Marchex, Hitoshi Harada, and 25th-floor, we are now just $2500 short of our goal of $25,000. Thank you all!
Can you help us get to our goal in time for PgWest 2010, which is November 2-4 in San Francisco? I’ll be giving a talk there, “Building and Distributing PostgreSQL Extensions Without Learning C”, in which PGXN will of course be featured. Would be great to announce that the fundraising was successful.
As for the time I’ve put in so far, I’m happy to have PGXN Manager up and working, but of course it has taken more hours than I expected. 76.5 so far. I’d estimated 40. Meanwhile, the database design is up to 43 hours from the estimated 24. And that doesn’t count the hours I spent chasing shiny yaks and shaving them, like SemVer and Router::Resource. Those of you who estimate development projects, take heed! I clearly need to double all my estimates before I submit them.
Still, with the fundraising nearly done, I’m committed to finishing this project. I view it as a project budget, and so that’s what it will be, whether or not it takes me twice or four times as many hours as I’d estimated.
That said, you could help. Right? PGXN Manager is in good shape, but it’s not done. If you’d like to roll up your sleeves and contribute some code, please fork it, build it, and hack what you can. A few things on the to-do list:
The database API is there for these bits already; the code would mainly be in Perl. Hit me on #pgxn on Freenode if you want to help.
If documentation is your thing, contributions there would be appreciated, as well. In particular, the About PGXN page is a bit thin. Other interfaces will need help, too. More on that as we add users.
Thanks everyone for your support!