The second piece of the PGXN infrastructure, after PGXN Manager, is the PGXN API Server. I’ve just finished the API documentation, which covers both the lightweight static file API provided by mirrors and the superset provided by the API server. So now seems like a good time to talk about the design of the API server and how it works.
At its core, the PGXN API server is just another mirror. It has an hourly
cron job that rsyncs from the master mirror, updating its local copy. But then it
iterates over the rsync log and transforms some things. Here’s what it does:
Parses the README file and any files recognized by Text::Markup and converts them to sanitized HTML with a table of contents. Such files can then be used to display the README on the distribution page and to display individual documentation files.
Merges the list of all releases of a distribution into the META.json generated by PGXN Manager. For example, as of this writing, the API server’s semver 0.2.1 META.json and the unversioned semver.json are identical. Effectively, this format has all the metadata from the META.json as well as a list of all releases of the distribution from the semver.json. This is useful for displaying all the data on the distribution page by fetching the data in a single API request.
Adds the same list of releases to each version’s META.json file. For example, if you look at the semver 0.2.0 META.json, you’ll see that it includes 0.2.1 in its list of releases, even though 0.2.1 was released after 0.2.0. This allows the semver 0.2.0 page on the main site to have a select list of versions to choose from, including versions released later, with a single API request.
Transforms the mirror semver.json into the API semver.json.
Adds abstracts to the user and tag files, turning the mirror theory.json into the API theory.json and the mirror data types.json into the API data types.json. This allows the user page and tag pages to include the abstract in the list of distributions released by the user or associated with a tag.
All of this merging stuff came out of my thinking following the discussion of the PGXN API RFC. The decision to use Lucy instead of PostgreSQL’s full-text search followed rather naturally from this, as I quickly realized that there was no other driving need for a relational database behind the API at all. The only dynamic API is the search API. Everything else is just static files. The performance issues of in-database search, as well as the desire to have fewer outside dependencies, made the decision a natural one.
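To make the merging step concrete, here is a minimal Python sketch of attaching a distribution’s release list to one release’s metadata. It is not the API server’s actual code. The function name `merge_releases` and the shape of the `releases` structure (a mapping from release status to a list of versions) are assumptions made for illustration.

```python
import copy

def merge_releases(meta, dist_index):
    """Return a copy of one release's META.json data with the distribution's
    full release list attached under a "releases" key (assumed layout)."""
    merged = copy.deepcopy(meta)
    merged["releases"] = copy.deepcopy(dist_index["releases"])
    return merged

# Hypothetical sample data: metadata for semver 0.2.0 plus the per-distribution
# file that lists every release, including the later 0.2.1.
meta_020 = {"name": "semver", "version": "0.2.0"}
dist_index = {
    "name": "semver",
    "releases": {"stable": [{"version": "0.2.1"}, {"version": "0.2.0"}]},
}

merged = merge_releases(meta_020, dist_index)
versions = [r["version"] for r in merged["releases"]["stable"]]
```

With data shaped like this, a page for 0.2.0 can build its version selector from `versions` without a second request.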
Beyond the syncing, there is a very simple web server providing the HTTP REST interface to the static JSON files and the full-text search. That’s it, really. The API server is really just another mirror on steroids. The nice thing is that it allows a client, such as WWW::PGXN or the new PGXN client, to work with either a plain mirror or the API server, just failing gracefully when API server APIs are unavailable.
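One way a client could fail gracefully is to look up each API by name and treat a missing entry as “not available here.” The Python sketch below assumes the client holds a table of URL templates for the server it is talking to. The `Client` class, the template names, and the paths are all hypothetical and are not taken from WWW::PGXN.

```python
class Client:
    """Resolves named APIs to URLs; returns None for APIs the server lacks."""

    def __init__(self, templates):
        # templates maps an API name to a URL template (assumed format).
        self.templates = templates

    def url_for(self, name, **variables):
        template = self.templates.get(name)
        if template is None:
            return None  # plain mirror: this API is not offered
        return template.format(**variables)

# Hypothetical template tables: the API server offers a superset of the mirror's.
mirror_templates = {"meta": "/dist/{dist}/{version}/META.json"}
api_templates = dict(mirror_templates, search="/search/{index}/")

mirror = Client(mirror_templates)
api = Client(api_templates)

meta_url = mirror.url_for("meta", dist="semver", version="0.2.1")
mirror_search = mirror.url_for("search", index="docs")
api_search = api.url_for("search", index="docs")
```

Callers check for `None` and skip the feature, so the same code path works against either kind of server.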
If you want to learn more about the specifics of the REST API, the API documentation has all the details. Really, it’s quite comprehensive!
I actually consider the API to be 1.0-complete at this point, unlike PGXN Manager. The only thing I want to add is JSONP support for static JSON files (right now it’s only for search results). I might tweak a few things here and there, but otherwise I think it’s in pretty good shape.
Longer term, though, it might be worthwhile to add some other features to enhance the value of PGXN overall. Some ideas:
But I think we need to build up some momentum on the foundation that’s in place. Have you submitted your extensions yet?
The PGXN infrastructure is currently made up of four parts:
The network of mirrors, derived from the master mirror, and synchronized on various schedules via rsync (host a mirror).
PGXN Manager (code) is the core of the network. It provides the interface for users to upload releases, processes those releases, indexes them, and puts them on the master mirror. Details below.
PGXN API (code) is a mirror server with benefits. Once an hour, it rsyncs from the master mirror, and does extra processing of new and modified files, notably full-text indexing. I’ll write up some details next week.
PGXN Site (code) powers the main site. It’s a thin wrapper around the API server, using WWW::PGXN to fetch JSON files and convert them to HTML. I’ll write more about this bit next week, too.
Some details on Manager. This is by far the most complicated part of the system. Which is funny, because I hadn’t anticipated that when I started work on PGXN (I’d estimated half as many hours as for implementing the site). But as I worked through design issues and wrote the code, the need for that complexity became apparent — and not just because it’s the only part that offers authentication. The main reason it’s complex is so that no other part needs to be.
Allow me to explain. People can upload almost anything as a distribution. So long as the META.json adheres to the spec, the rest can be just about anything. But the upload doesn’t necessarily end up on the network unmodified. Sure, if you follow the guidelines of the HOWTO, what you uploaded will be exactly what ends up on the network. But I didn’t want to be that strict about what PGXN Manager would accept. So in addition to verifying the structure of your META.json file, PGXN::Manager::Distribution also:
Extracts the archive. If it’s not a zip file, a zip archive is created from the contents of the uploaded file. You can upload any kind of archive readable by Archive::Extract. The currently supported formats are: .tar, .tar.gz, .gz, .Z, .tar.bz2, .tbz, .bz2, .zip, .xz, .txz, .tar.xz and .lzma. By always converting to a zip file, PGXN client apps can be quite simple, not having to worry about the archive format.
Validates the META.json. This part isn’t perfect, yet. I fixed a bug today where it would die and return a 500 on a missing version number. It’d probably be worthwhile to adapt CPAN::Meta::Validator to validate PGXN META.json files at some point, both for Manager and for developers wanting to validate before uploading.
Normalizes all version numbers in the META.json into semantic versions. You can specify the distribution version, prerequisite versions, and extension versions as simple numbers and they’ll be converted to semantic versions. A version like “1.20”, for example, becomes “1.2.0”. See the SemVer documentation for details on how versions are normalized via the declare() method. This normalization is done so that client applications will get known valid semantic versions to compare when determining dependencies. However, it’s best that they be semantic versions to begin with. Normalized versions will be written back to the archive META.json file (with the “generated_by” key updated to reflect that PGXN Manager regenerated the file). If no versions need normalizing, the archive META.json will be left alone.
Makes sure the zip archive has a directory prefix named "$dist-$version/". If the archive has no directory prefix, or if the prefix is not "$dist-$version/", the archive is rewritten with that prefix. This ensures that the archive will always extract into a directory with the same name as the archive and not spray files all over your desktop when you unzip it.
Copies or writes out a new zip file named "$dist-$version.pgz". Think of .pgz as “PostgreSQL Zip” or, if you’d rather, “PGXN Zip”. Either way, it’s just a zip archive.
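The directory-prefix step above can be sketched in Python with the standard zipfile module. This illustrates the idea only and is not the code in PGXN::Manager::Distribution. The function name `ensure_prefix` is hypothetical, and the rule for spotting an existing top-level directory (every entry shares one first path component) is an assumption.

```python
import io
import zipfile

def ensure_prefix(zip_bytes, dist, version):
    """Return zip data whose entries all live under "<dist>-<version>/"."""
    prefix = f"{dist}-{version}/"
    src = zipfile.ZipFile(io.BytesIO(zip_bytes))
    names = src.namelist()
    if names and all(n.startswith(prefix) for n in names):
        return zip_bytes  # already laid out correctly; keep as-is

    # If every entry sits under one top-level directory, replace that
    # directory; otherwise just prepend the prefix.
    tops = {n.split("/", 1)[0] for n in names}
    old = None
    if len(tops) == 1 and all("/" in n for n in names):
        old = next(iter(tops)) + "/"

    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.is_dir():
                continue  # directories are implied by the file paths
            rel = info.filename[len(old):] if old else info.filename
            dst.writestr(prefix + rel, src.read(info))
    return out.getvalue()

# Usage: an archive with no directory prefix gets one.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("META.json", "{}")
    z.writestr("sql/semver.sql", "-- sql")
fixed = ensure_prefix(buf.getvalue(), "semver", "0.2.1")
fixed_names = sorted(zipfile.ZipFile(io.BytesIO(fixed)).namelist())
```

Running the function a second time on its own output returns the data unchanged, which mirrors the “copies or writes out” distinction in the last step.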
That processing done, with a good META.json and zip archive, the JSON, username, and SHA1 of the zip archive are handed off to the database for more processing. The add_distribution() database function does all the heavy lifting here. It:
Parses the JSON string, validates that all required keys are present, and normalizes version numbers. Yes, this is redundant, but I don’t think I need to lecture the readers of this blog about database integrity. :-)
Creates a new metadata structure and stores all the required and many of the optional meta spec keys, as well as the SHA1 of the distribution file, the date, and the user’s nickname.
Sets the “release_status” to “stable” if there was no status in the original JSON.
Adds a “provides” section to the metadata if none was included in the original JSON. In such a case, it assumes that the distribution contains one extension and that it has the same name and version as the distribution itself.
Validates that the uploading user is owner or co-owner of all provided extensions. If no one is listed as owner of one or more included extensions, the user will be assigned ownership. If any included extension already has an owner and the user is neither its owner nor a co-owner, an exception will be thrown.
Records the distribution, extensions, and tags in the database.
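The defaulting and ownership rules in the list above can be sketched in a few lines of Python. The real work happens inside the add_distribution() database function, so this only illustrates the logic. The helper names, the `owners` mapping, and the exact shape of the “provides” entry are assumptions.

```python
import copy

def apply_defaults(meta):
    """Fill in release_status and provides when the upload omitted them."""
    out = copy.deepcopy(meta)
    out.setdefault("release_status", "stable")
    if "provides" not in out:
        # Assume one extension with the distribution's own name and version.
        out["provides"] = {out["name"]: {"version": out["version"]}}
    return out

def check_ownership(nick, provides, owners):
    """owners maps extension name -> set of owner/co-owner nicknames.

    Raises PermissionError if any extension belongs to someone else;
    otherwise assigns unowned extensions to nick.
    """
    for ext in provides:
        if owners.get(ext) and nick not in owners[ext]:
            raise PermissionError(f"{nick} does not own extension {ext}")
    for ext in provides:
        if not owners.get(ext):
            owners[ext] = {nick}

meta = apply_defaults({"name": "pair", "version": "0.1.0"})
owners = {"semver": {"theory"}}
check_ownership("theory", meta["provides"], owners)
```

Checking every extension before assigning any ownership keeps a rejected upload from leaving partial changes behind, which is the same guarantee a database transaction would give.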
Once all this work is done, add_distribution() returns all the JSON that needs to be written to the mirror. These files make up the “index” on the network, and include:
A META.json file (example). This file is derived from the META.json included in the archive (example), but reflects all the normalization changes and added keys outlined above.
It also returns JSON for network statistics files. These are updated every time a new release is uploaded:
dist.json lists the 56 most recent releases and has a count of all distributions and of releases of those distributions.
extension.json lists the 56 most recent extension releases and a count of all extensions on the network.
user.json lists the 56 most prolific users (based on the number of distributions) and a count of distributions and releases for each.
tag.json lists the 56 most popular tags (measured by the number of distributions they’re associated with) and a count of all tags on the network.
summary.json has basic summary information about the network, which is just counts of distributions, releases, extensions, users, tags, and mirrors.
If you think that’s a lot of data to be updated, you’re right! But since releases are relatively infrequent (a couple a day at the moment), it’s best to generate all this stuff as static files that are rsynced to all mirrors. In this way, any mirror can function as a very simple, lightweight REST API. And indeed, that’s just how the planned PGXN client will behave. WWW::PGXN already provides the interface it will use.
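As an illustration of what goes into one of the statistics files described above, here is a small Python sketch that computes tag popularity from a mapping of distributions to tags. The real numbers come from the database. The function name `tag_stats` and the output keys are hypothetical and do not reflect the actual tag.json layout.

```python
from collections import Counter

def tag_stats(dist_tags, limit=56):
    """Count distributions per tag and keep the `limit` most popular tags."""
    counts = Counter()
    for tags in dist_tags.values():
        counts.update(set(tags))  # a distribution counts once per tag
    # Most distributions first; ties broken alphabetically for stable output.
    popular = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
    return {
        "count": len(counts),
        "popular": [{"tag": tag, "dists": n} for tag, n in popular],
    }

# Hypothetical sample data.
dist_tags = {
    "semver": ["data types", "semver"],
    "pair": ["data types", "key value"],
    "tinyint": ["data types"],
}
stats = tag_stats(dist_tags, limit=2)
```

Because the result is plain data, it can be serialized once per upload and served as a static file by every mirror.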
I guess that was a lot of information. Let this be a reference document for interested hackers, then. The core functionality is all there, but there’s a lot more to be done:
rrr (#10).
Email forwarding, so that mail sent to $nickname@pgxn.org will be forwarded to a user’s actual address. This would allow us to remove literal email addresses from the JSON files and the site (the site obfuscates them, but still…). Anyone got some good postfix chops for this? The users table is quite simple (#13).
Want to help out? Fork PGXN Manager and have at it. Hell, at this point I’d really appreciate a code review, as I’m pretty sure there’s only been one set of eyes on this code so far.
Next week, I plan to blog about
rsync
But given how these things go, and how I need to start writing mirror API and API server documentation, it might take me a bit longer to get to them all.