The PGXN infrastructure is currently made up of four parts:
The network of mirrors, derived from the master mirror, and synchronized on various schedules via rsync (host a mirror).
PGXN Manager (code) is the core of the network. It provides the interface for users to upload releases, processes those releases, indexes them, and puts them on the master mirror. Details below.
PGXN API (code) is a mirror server with benefits. Once an hour, it rsyncs from the master mirror, and does extra processing of new and modified files, notably full-text indexing. I’ll write up some details next week.
PGXN Site (code) powers the main site. It’s a thin wrapper around the API server, using WWW::PGXN to fetch JSON files and convert them to HTML. I’ll write more about this bit next week, too.
Some details on Manager. This is by far the most complicated part of the system. Which is funny, because I hadn’t anticipated that when I started work on PGXN (I’d estimated half as many hours as for implementing the site). But as I worked through design issues and wrote the code, the need for that complexity became apparent — and not just because it’s the only part that offers authentication. The main reason it’s complex is so that no other part needs to be.
Allow me to explain. People can upload almost anything as a distribution. So long as the META.json adheres to the spec, the rest can be just about anything. But the upload doesn’t necessarily end up on the network unmodified. Sure, if you follow the guidelines of the HOWTO, what you uploaded will be exactly what ends up on the network. But I didn’t want to be that strict about what PGXN Manager would accept. So in addition to verifying the structure of your META.json file, PGXN::Manager::Distribution also:
Extracts the archive. If it’s not a zip file, a zip archive is created from the contents of the uploaded file. You can upload any kind of archive readable by Archive::Extract. The currently supported formats are: .tar, .tar.gz, .gz, .Z, .tar.bz2, .tbz, .bz2, .zip, .xz, .txz, .tar.xz and .lzma. By always converting to a zip file, PGXN client apps can be quite simple, not having to worry about the archive format.
Validates the META.json. This part isn’t perfect, yet. I fixed a bug today where it would die and return a 500 on a missing version number. It’d probably be worthwhile to adapt CPAN::Meta::Validator to validate PGXN META.json files at some point, both for Manager and for developers wanting to validate before uploading.
Normalizes all version numbers in the META.json into semantic versions. You can specify the distribution version, prerequisite versions, and extension versions as simple numbers and they’ll be converted to semantic versions. A version like “1.20”, for example, becomes “1.2.0”. See the SemVer documentation for details on how versions are normalized via the declare() method. This normalization is done so that client applications will get known valid semantic versions to compare when determining dependencies. However, it’s best that they be semantic versions to begin with. Normalized versions will be written back to the archive META.json file (with the “generated_by” key updated to reflect that PGXN Manager regenerated the file). If no versions need normalizing, the archive META.json will be left alone.
Makes sure the zip archive has a directory prefix named "$dist-$version/". If the archive has no directory prefix, or if the prefix is not "$dist-$version/", the archive is rewritten with that prefix. This ensures that the archive will always extract into a directory with the same name as the archive and not spray files all over your desktop when you unzip it.
Copies or writes out a new zip file named "$dist-$version.pgz". Think of .pgz as “PostgreSQL Zip” or, if you’d rather, “PGXN Zip”. Either way, it’s just a zip archive.
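The directory-prefix step can be sketched in a few lines. This is only an illustration in Python, not the PGXN::Manager::Distribution code (which is Perl); the function names normalize_zip and pgz_name are made up for the example, and it assumes the upload has already been converted to a zip archive held in memory.

```python
import io
import zipfile


def pgz_name(dist: str, version: str) -> str:
    """File name for the normalized archive, e.g. "pair-0.1.0.pgz"."""
    return f"{dist}-{version}.pgz"


def normalize_zip(data: bytes, dist: str, version: str) -> bytes:
    """Return zip bytes whose files all live under "<dist>-<version>/".

    Hypothetical helper: a sketch of the prefix rule, not PGXN's code.
    """
    prefix = f"{dist}-{version}/"
    src = zipfile.ZipFile(io.BytesIO(data))
    files = [info for info in src.infolist() if not info.is_dir()]
    names = [info.filename for info in files]

    # Already laid out correctly: hand back the upload untouched.
    if names and all(name.startswith(prefix) for name in names):
        return data

    # If everything sits in one differently named directory, strip it first.
    tops = {name.split("/", 1)[0] for name in names}
    strip_top = len(tops) == 1 and all("/" in name for name in names)

    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in files:
            name = info.filename
            if strip_top:
                name = name.split("/", 1)[1]
            dst.writestr(prefix + name, src.read(info))
    return out.getvalue()
```

Calling normalize_zip(upload_bytes, "pair", "0.1.0") and saving the result under pgz_name("pair", "0.1.0") gives an archive that always unpacks into pair-0.1.0/, whatever layout the uploader chose.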
That processing done, with a good META.json and zip archive, the JSON, username, and SHA1 of the zip archive are handed off to the database for more processing. The add_distribution() database function does all the heavy lifting here. It:
Parses the JSON string, validates that all required keys are present, and normalizes version numbers. Yes, this is redundant, but I don’t think I need to lecture the readers of this blog about database integrity. :-)
Creates a new metadata structure and stores all the required and many of the optional meta spec keys, as well as the SHA1 of the distribution file, the date, and the user’s nickname.
Sets the “release_status” to “stable” if there was no status in the original JSON.
Adds a “provides” section to the metadata if none was included in the original JSON. In such a case, it assumes that the distribution contains one extension and that it has the same name and version as the distribution itself.
Validates that the uploading user is owner or co-owner of all provided extensions. If no one is listed as owner of one or more included extensions, the user will be assigned ownership. If the user is not owner or co-owner of any included extensions, an exception will be thrown.
Records the distribution, extensions, and tags in the database.
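To make the defaulting and ownership rules above concrete, here is a rough Python sketch. The real logic lives in the add_distribution() database function; the names apply_defaults and check_ownership, the in-memory owners mapping, and the exact shape of the “provides” entry are assumptions made for illustration.

```python
import copy


def apply_defaults(meta: dict) -> dict:
    """Return a copy of the metadata with the defaults described above."""
    out = copy.deepcopy(meta)
    # No status in the original JSON: treat the release as stable.
    out.setdefault("release_status", "stable")
    # No "provides": assume one extension named and versioned
    # like the distribution itself (entry shape is an assumption).
    if not out.get("provides"):
        out["provides"] = {out["name"]: {"version": out["version"]}}
    return out


def check_ownership(user: str, extensions: list, owners: dict) -> None:
    """owners maps extension name -> set of owner/co-owner nicknames.

    Hypothetical stand-in for the ownership tables in the database.
    """
    # First pass: refuse the upload if someone else owns any extension.
    for ext in extensions:
        current = owners.get(ext)
        if current and user not in current:
            raise PermissionError(f"{user} may not release {ext}")
    # Second pass: the uploader claims extensions nobody owns yet.
    for ext in extensions:
        if not owners.get(ext):
            owners[ext] = {user}
```

Checking every extension before assigning any ownership means a rejected upload leaves the owners mapping untouched, which is roughly what a database transaction would give you for free.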
Once all this work is done, add_distribution() returns all the JSON that needs to be written to the mirror. These files make up the “index” on the network, and include:
META.json file (example). This file is derived from the META.json included in the archive (example), but reflects all the normalization changes and added keys outlined above.
It also returns JSON for network statistics files. These are updated every time a new release is uploaded:
dist.json lists the 56 most recent releases and has a count of all distributions and of releases of those distributions.
extension.json lists the 56 most recent extension releases and a count of all extensions on the network.
user.json lists the 56 most prolific users (based on the number of distributions) and a count of distributions and releases for each.
tag.json lists the 56 most popular tags (measured by the number of distributions they’re associated with) and a count of all tags on the network.
summary.json has basic summary information about the network, which is just counts of distributions, releases, extensions, users, tags, and mirrors.
If you think that’s a lot of data to be updated, you’re right! But since releases are relatively infrequent (a couple a day at the moment), it’s best to generate all this stuff as static files that are rsynced to all mirrors. In this way, any mirror can function as a very simple, lightweight REST API. And indeed, that’s just how the planned PGXN client will behave. WWW::PGXN already provides the interface it will use.
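As a rough illustration of how a statistics file like tag.json could be derived, the Python sketch below counts the distributions each tag is attached to and keeps the most popular ones. The function name tag_stats, the input mapping, and the output keys (“count”, “popular”, “tag”, “dists”) are invented for the example and are not the actual file format.

```python
from collections import Counter


def tag_stats(dist_tags: dict, limit: int = 56) -> dict:
    """Summarize tag popularity from a mapping of distribution -> tags.

    Hypothetical sketch: the output keys are not the real tag.json layout.
    """
    counts = Counter()
    for tags in dist_tags.values():
        # Count a tag once per distribution, even if it is listed twice.
        counts.update(set(tags))
    # Most distributions first; break ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {
        "count": len(counts),
        "popular": [{"tag": tag, "dists": n} for tag, n in ranked[:limit]],
    }
```

Because the result is plain data, it can be dumped to a static JSON file on each upload and rsynced out with everything else.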
I guess that was a lot of information. Let this be a reference document for interested hackers, then. The core functionality is all there, but there’s a lot more to be done:
rrr (#10).
$nickname@pgxn.org will be forwarded to a user’s actual address. This would allow us to remove literal email addresses from the JSON files and the site (the site obfuscates them, but still…). Anyone got some good postfix chops for this? The users table is quite simple (#13).
Want to help out? Fork PGXN Manager and have at it. Hell, at this point I’d really appreciate a code review, as I’m pretty sure there’s only been one set of eyes on this code so far.
Next week, I plan to blog about
rsync
But given how these things go, and how I need to start writing mirror API and API server documentation, it might take me a bit longer to get to them all.
I’m making good progress on PGXN Manager. Hopefully I can start alpha testing it next week. As I mentioned previously, I had estimated 40 hours of work to create it, but was hoping to get it done in around 30 (because I spent 10 extra hours on the database design). So far I’m at 23 hours, so it’s looking pretty good.
Architecturally, I’ve gone for a very minimal Plack-based app. No Catalyst, Jifty, or even Dancer. Just a very simple Plack app that uses Router::Simple::Sinatraish to route URIs to the appropriate controller actions (which are just class methods). The controller just dispatches to Template::Declare-based templates for the HTML rendering. I guess I’ve kind of created my own framework here, but really, there ain’t much to it. This app is simple enough that I couldn’t see the use of adding all the overhead of a framework.
Meanwhile, I’ve been hacking on the distribution class. This will be the core class of the app. It takes a Plack upload object and a username and does all the rest, analyzing an uploaded archive, normalizing it if necessary, registering it with the database, and indexing it by updating all the appropriate JSON files on the mirrors. It’s nearly finished, but I have one other thing to do in the database, first.
In my last update, I asked for advice on whether or not PGXN should allow an extension with the same version number to appear in multiple distributions. And thanks to a comment from Aristotle, I’m changing it to allow that. But it also means that my original extension json spec needs to change.
Here’s an example of what I’m thinking. Say that there are three versions of an extension named “trip”, and that they appear in distributions as follows:
trip 0.2.6
    pair-0.3.0
trip 0.2.5
    trip-0.2.2
    pair-0.2.2rc
    pair-0.2.1
    trip-0.1.1
trip 0.2.4
    pair-0.1.1rc
    pair-0.1.0
So sometimes it’s in the “trip” distribution and other times it’s in the “pair” distribution. My thought is that, for a given version, it would list the distributions it’s in in reverse chronological order (by upload date). So the format would be:
{
    "latest": "stable",
    "stable": { "dist": "pair", "version": "0.3.0" },
    "testing": { "dist": "pair", "version": "0.2.2rc" },
    "distributions": {
        "0.2.6": [
            { "dist": "pair", "version": "0.3.0" }
        ],
        "0.2.5": [
            { "dist": "trip", "version": "0.2.2" },
            { "dist": "pair", "version": "0.2.2rc", "status": "testing" },
            { "dist": "pair", "version": "0.2.1" },
            { "dist": "trip", "version": "0.1.1" }
        ],
        "0.2.4": [
            { "dist": "pair", "version": "0.1.1rc", "status": "testing" },
            { "dist": "pair", "version": "0.1.0" }
        ]
    }
}
This way, every distribution it’s included in is listed, and clients can quickly tell where to find the latest stable, testing, and unstable versions, and which of those is the most recent. This is a bit more convoluted than the original, but I think it’s a good choice, in that it’s comprehensive but also easy to figure out what’s the latest.
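To show how a client might read this format, here is a small Python sketch that works on an abbreviated copy of the example above. The helper names latest_release and releases_containing are hypothetical, and treating a missing “status” key as “stable” is my assumption, based on the example only marking the testing releases.

```python
import json

# Abbreviated copy of the proposed extension JSON for "trip".
EXTENSION_JSON = """
{
    "latest": "stable",
    "stable": { "dist": "pair", "version": "0.3.0" },
    "testing": { "dist": "pair", "version": "0.2.2rc" },
    "distributions": {
        "0.2.6": [
            { "dist": "pair", "version": "0.3.0" }
        ],
        "0.2.4": [
            { "dist": "pair", "version": "0.1.1rc", "status": "testing" },
            { "dist": "pair", "version": "0.1.0" }
        ]
    }
}
"""


def latest_release(doc: dict, status: str = "") -> dict:
    """Distribution release to fetch for a status (default: the latest one)."""
    return doc[status or doc["latest"]]


def releases_containing(doc: dict, ext_version: str) -> list:
    """(dist, version, status) for every release shipping ext_version."""
    return [
        # Assumption: an entry without "status" is a stable release.
        (rel["dist"], rel["version"], rel.get("status", "stable"))
        for rel in doc["distributions"].get(ext_version, [])
    ]


doc = json.loads(EXTENSION_JSON)
print(latest_release(doc))
print(releases_containing(doc, "0.2.4"))
```

A client that only wants the newest stable release needs two key lookups, while one resolving a specific extension version can walk the list for that version in upload order.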
Unless you can think of a better format, this is what I’m going with. Comments?
Look for a post next week announcing an alpha program!