The second piece of the PGXN infrastructure, after PGXN Manager, is the PGXN API Server. I’ve just finished the API documentation, which covers both the lightweight static file API provided by mirrors and the superset provided by the API server. So now seems like a good time to talk about the design of the API server and how it works.
At its core, the PGXN API server is just another mirror. It has an hourly
cron job that rsyncs to the master mirror, updating the mirror. But then it
iterates over the rsync log and transforms some things. Here’s what it does:
README file and any files recognized by Text::Markup and converts them to sanitized HTML with a table of contents. Such files can then be used to display the README on the distribution page and to display individual documentation files.
META.json generated by PGXN Manager. For example, as of this writing, the API server’s semver 0.2.1 META.json and the unversioned semver.json are identical. Effectively, this format has all the metadata from the META.json as well as a list of all releases of the distribution from the semver.json. This is useful for displaying all the data on the distribution page by fetching the data in a single API request.
META.json file. For example, if you look at the semver 0.2.0 META.json, you’ll see that it includes 0.2.1 in its list of releases, even though 0.2.1 was released after 0.2.0. This allows the semver 0.2.0 page on the main site to have a select list of versions to choose from, including versions released later, with a single API request.
semver.json to the API semver.json.
theory.json to the API theory.json and the mirror data types.json to the API data types.json. This allows the user page and tag pages to include the abstract in the list of distributions released by the user or associated with a tag.
All of this merging stuff came out of my thinking following the discussion of the PGXN API RFC. The decision to use Lucy instead of PostgreSQL’s full-text search followed rather naturally from this, as I quickly realized that there was no other driving need for a relational database behind the API at all. The only dynamic API is the search API. Everything else is just static files. The performance issues of in-database search, as well as the desire to have fewer outside dependencies, made the decision a natural one.
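To make the merging idea concrete, here is a minimal Python sketch of folding a distribution's release list into the metadata for one release. It is only an illustration, not the API server's actual code: the function name, the "releases" key, and the sample data are all assumptions made for the example.

```python
import copy

def merge_release_metadata(meta, dist_index):
    """Return a copy of one release's META.json data with the list of all
    releases from the per-distribution JSON file folded in.

    The "releases" key is an assumed name for this sketch.
    """
    merged = copy.deepcopy(meta)
    merged["releases"] = copy.deepcopy(dist_index.get("releases", {}))
    return merged

# Hypothetical inputs: metadata for one release, plus the distribution file.
meta_0_2_0 = {"name": "semver", "version": "0.2.0"}
dist_index = {"name": "semver", "releases": {"stable": ["0.2.1", "0.2.0"]}}

merged = merge_release_metadata(meta_0_2_0, dist_index)
print(merged["releases"]["stable"])
```

Because the merged file carries both the release's own metadata and the full release list, a page for an older release can offer later versions without a second request.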
Beyond the syncing, there is a very simple web server providing the HTTP REST interface to the static JSON files and the full-text search. That’s it, really. The API server is really just another mirror on steroids. The nice thing is that it allows an interface, such as WWW::PGXN or the new PGXN client, to work with either a plain mirror or the API server, just failing gracefully when API server APIs are unavailable.
If you want to learn more about the specifics of the REST API, the API documentation has all the details. Really, it’s quite comprehensive!
I actually consider the API to be 1.0-complete at this point, unlike PGXN Manager. The only thing I want to add is JSONP support for static JSON files (right now it’s only for search results), and I might tweak a few things here and there, but otherwise I think it’s in pretty good shape.
Longer term, though, it might be worthwhile to add some other features to enhance the value of PGXN overall. Some ideas:
But I think we need to build up some momentum on the foundation that’s in place. Have you submitted your extensions, yet?
Last night I deployed PGXN::Manager v0.2.1 and uploaded the first distribution, pair. If you follow that link you’ll see three files:
pair-0.1.0.json is the metadata file generated by PGXN Manager to describe the distribution. Most of its data is taken from the META.json included in the uploaded zip file, but a few keys, like “sha1”, are generated, and others, like “release_status”, are added if they’re not in the included META.json.
README file distributed with pair.
Following the spec I previously wrote up, there are a number of other files that get created when a new distribution is uploaded to PGXN. For the pair extension, we got:
by/dist/pair.json, which will be updated with information for every release of the “pair” distribution.
by/extension/pair.json, which will be updated for every upload containing the “pair” extension.
by/owner/theory.json, which will be updated every time I upload a distribution.
Files for every tag listed in the metadata are also created. In this case, that includes:
by/tag/key value pair.json
by/tag/key value.json
by/tag/ordered pair.json
by/tag/pair.json
by/tag/variadic function.json
Each of these files will be updated every time a distribution is uploaded containing the relevant tag.
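As a rough illustration of which index files an upload touches, here is a small Python sketch. It is not PGXN Manager's code: the function name, the "tags" and "provides" key names, and the fallback to the distribution name when no extensions are listed are assumptions for the example.

```python
def index_files_for(meta, owner):
    """List the by/* JSON files that an upload would create or update."""
    paths = [f"by/dist/{meta['name']}.json"]
    # One file per extension; fall back to the distribution name when no
    # "provides" section is present (an assumption of this sketch).
    extensions = meta.get("provides") or {meta["name"]: {}}
    paths += [f"by/extension/{ext}.json" for ext in sorted(extensions)]
    paths.append(f"by/owner/{owner}.json")
    paths += [f"by/tag/{tag}.json" for tag in sorted(meta.get("tags", []))]
    return paths

pair_meta = {
    "name": "pair",
    "version": "0.1.0",
    "tags": ["ordered pair", "pair", "key value", "key value pair", "variadic function"],
}

for path in index_files_for(pair_meta, "theory"):
    print(path)
```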
You’ll soon be able to upload your own extension distributions to PGXN. If you’re interested, please subscribe to the mail list, where I’ll soon be inviting folks to get an account and start uploading.
But first, a blog post on how to create a PGXN-friendly distribution archive. Coming up shortly.
I’m making good progress on PGXN Manager. Hopefully I can start alpha testing it next week. As I mentioned previously, I had estimated 40 hours of work to create it, but was hoping to get it done in around 30 (because I spent 10 extra hours on the database design). So far I’m at 23 hours, so it’s looking pretty good.
Architecturally, I’ve gone for a very minimal Plack-based app. No Catalyst, Jifty, or even Dancer. Just a very simple Plack app that uses Router::Simple::Sinatraish to route URIs to the appropriate controller actions (which are just class methods). The controller just dispatches to Template::Declare-based templates for the HTML rendering. I guess I’ve kind of created my own framework here, but really, there ain’t much to it. This app is simple enough that I couldn’t see the use of adding all the overhead of a framework.
Meanwhile, I’ve been hacking on the distribution class. This will be the core class of the app. It takes a Plack upload object and a username and does all the rest, analyzing an uploaded archive, normalizing it if necessary, registering it with the database, and indexing it by updating all the appropriate JSON files on the mirrors. It’s nearly finished, but I have one other thing to do in the database, first.
In my last update, I asked for advice on whether or not PGXN should allow an extension with the same version number to appear in multiple distributions. And thanks to a comment from Aristotle, I’m changing it to allow that. But it also means that my original extension json spec needs to change.
Here’s an example of what I’m thinking. Say that there are three versions of an extension named “trip”, and that they appear in distributions as follows:
trip 0.2.6
    pair-0.3.0
trip 0.2.5
    trip-0.2.2
    pair-0.2.2rc
    pair-0.2.1
    trip-0.1.1
trip 0.2.4
    pair-0.1.1rc
    pair-0.1.0
So sometimes it’s in the “trip” distribution and other times it’s in the “pair” distribution. My thought is that, for a given version, it would list the distributions it’s in in reverse chronological order (by upload date). So the format would be:
{
    "latest": "stable",
    "stable": { "dist": "pair", "version": "0.3.0" },
    "testing": { "dist": "pair", "version": "0.2.2rc" },
    "distributions": {
        "0.2.6": [
            { "dist": "pair", "version": "0.3.0" }
        ],
        "0.2.5": [
            { "dist": "trip", "version": "0.2.2" },
            { "dist": "pair", "version": "0.2.2rc", "status": "testing" },
            { "dist": "pair", "version": "0.2.1" },
            { "dist": "trip", "version": "0.1.1" }
        ],
        "0.2.4": [
            { "dist": "pair", "version": "0.1.1rc", "status": "testing" },
            { "dist": "pair", "version": "0.1.0" }
        ]
    }
}
This way, every distribution it’s included in is listed, and clients can quickly tell where to find the latest stable, testing, and unstable versions, and which of those is the most recent. This is a bit more convoluted than the original, but I think it’s a good choice, in that it’s comprehensive but also makes it easy to figure out what’s the latest.
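Here is a short Python sketch of how a client might read this format. The function names are hypothetical, and treating an entry without a “status” key as stable is an assumption drawn from the example rather than anything the spec states.

```python
import json

EXTENSION_JSON = """
{
    "latest": "stable",
    "stable": { "dist": "pair", "version": "0.3.0" },
    "testing": { "dist": "pair", "version": "0.2.2rc" },
    "distributions": {
        "0.2.6": [ { "dist": "pair", "version": "0.3.0" } ],
        "0.2.5": [
            { "dist": "trip", "version": "0.2.2" },
            { "dist": "pair", "version": "0.2.2rc", "status": "testing" },
            { "dist": "pair", "version": "0.2.1" },
            { "dist": "trip", "version": "0.1.1" }
        ],
        "0.2.4": [
            { "dist": "pair", "version": "0.1.1rc", "status": "testing" },
            { "dist": "pair", "version": "0.1.0" }
        ]
    }
}
"""

def latest_release(ext, status=None):
    """Return the {"dist", "version"} entry for a status, defaulting to
    whichever status the "latest" key names."""
    return ext[status or ext["latest"]]

def distributions_for(ext, ext_version):
    """Return (dist, version, status) tuples for one extension version,
    in the order listed (most recent upload first)."""
    return [
        # Assumption: an entry with no "status" key is a stable release.
        (d["dist"], d["version"], d.get("status", "stable"))
        for d in ext["distributions"].get(ext_version, [])
    ]

trip = json.loads(EXTENSION_JSON)
print(latest_release(trip))
print(latest_release(trip, "testing"))
print(distributions_for(trip, "0.2.5"))
```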
Unless you can think of a better format, this is what I’m going with. Comments?
Look for a post next week announcing an alpha program!
Following my post outlining a possible network directory structure, Aristotle Pagaltzis saw fit to bug me via email about a different approach. I couldn’t understand WTF he was talking about until today. Then it lit my brain on fire. As a result, I now think that there is a much better way to organize the metadata files for the PGXN — one that happens not to include any symbolic links (which is something that Andreas König has been flagging, via email, as a possible bottleneck).
First, the /dist directory will be the same as before. Releases of pgTAP would be in:
dist/p/pg/pgtap/pgtap-0.23.pgz
dist/p/pg/pgtap/pgtap-0.23.json
dist/p/pg/pgtap/pgtap-0.23.readme
dist/p/pg/pgtap/pgtap-0.24.pgz
dist/p/pg/pgtap/pgtap-0.24.json
dist/p/pg/pgtap/pgtap-0.24.readme
dist/p/pg/pgtap/pgtap-0.25.pgz
dist/p/pg/pgtap/pgtap-0.25.json
dist/p/pg/pgtap/pgtap-0.25.readme
The only change is that the pgtap.json symlink is gone.
Now, the new stuff. In the root directory will be a file, index.json, that contains templates for URIs. It will look something like this:
{
    "dist": "/dist/$a/$ab/$dist/$dist-$version.pgz",
    "readme": "/dist/$a/$ab/$dist/$dist-$version.readme",
    "meta": "/dist/$a/$ab/$dist/$dist-$version.json",
    "by-dist": "/by/dist/$a/$ab/$dist.json",
    "by-extension": "/by/extension/$a/$ab/$extension.json",
    "by-owner": "/by/owner/$a/$ab/$owner.json",
    "by-manager": "/by/manager/$a/$ab/$manager.json"
}
The PGXN client will always fetch this file before it does anything else, because the file tells it how to find stuff. The advantage here is that the client doesn’t have to know anything about how the directory is actually organized, just what the template variables might be. They are:
$dist: A distribution name
$version: A version number
$extension: An extension name
$owner: An owner’s name
$manager: A release manager’s name (managers are the people who upload distributions to PGXN)
$a: The first letter of a distribution, extension, owner, or manager name.
$ab: The first two letters of a distribution, extension, owner, or manager name.
I’m not thrilled about using prefix-staggering to avoid having too many files in a directory. But the truth is that this approach allows me to punt. I could also make sure the client supports, for example, $bc and $cd, so that one could stagger things differently. And then the nice thing is that I don’t have to use those at all. The templates will tell the client exactly how to construct the URIs for things, and the templates needn’t include those staggering variables if they’re not appropriate. The client won’t care because it will have no built-in knowledge of how things are organized. It will have to find out from index.json.
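Filling in such a template is simple. Here is a minimal Python sketch that happens to use the standard library’s string.Template, whose $name syntax matches the templates above. The expand function and its key argument (the name from which $a and $ab are derived) are hypothetical, not part of any PGXN client.

```python
from string import Template

def expand(template, key, **variables):
    """Fill in a URI template from index.json.

    `key` is the name used to derive the staggering variables $a and $ab.
    """
    values = dict(variables)
    values["a"] = key[:1]
    values["ab"] = key[:2]
    return Template(template).substitute(values)

index = {
    "dist": "/dist/$a/$ab/$dist/$dist-$version.pgz",
    "by-extension": "/by/extension/$a/$ab/$extension.json",
}

print(expand(index["dist"], "pgtap", dist="pgtap", version="0.25.0"))
print(expand(index["by-extension"], "schematap", extension="schematap"))
```

A template that leaves out $a and $ab expands just as well, which is the point: the client never needs to know whether staggering is in use.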
From the URI templates, you can now see where the other metadata will be stored. For extension names, a hypothetical pgTAP distribution with two extensions will have a JSON file for each extension:
/by/extension/p/pg/pgtap.json
/by/extension/s/sc/schematap.json
The pgtap.json file will look something like this:
{
    "stable": "0.25.0",
    "testing": "0.26.0b1",
    "unstable": "0.30.0u",
    "versions": {
        "0.26.0b1": { "dist": "pgtap", "version": "0.26.0b1", "status": "testing" },
        "0.30.0u": { "dist": "pgtap", "version": "0.30.0u", "status": "unstable" },
        "0.25.0": { "dist": "pgtap", "version": "0.25.0", "status": "stable" },
        "0.24.0": { "dist": "pgtap", "version": "0.24.0", "status": "stable" },
        "0.23.0": { "dist": "pgtap", "version": "0.23.0", "status": "stable" }
    }
}
Right at the top, it would always list the most recent stable, testing, and unstable version numbers, and then it would have a list of metadata for all versions. Said metadata would include the associated distribution name, version, and release status.
Here’s how it would work. Say I ask the client to install pgtap:
PGXN> install extension pgtap
The client would first fetch /index.json, then look for the URI template for “by-extension”, which is /by/extension/$a/$ab/$extension.json. Filling in the template, it would know to request /by/extension/p/pg/pgtap.json. With that file, it would see that the most recent stable version is in the “pgtap” distribution version 0.25.0. Using the dist URI template, which is /dist/$a/$ab/$dist/$dist-$version.pgz, it would then fetch /dist/p/pg/pgtap/pgtap-0.25.0.pgz.
The advantage here is that there are no symbolic links and no knowledge of the directory structure built into clients. The client just knows to fetch /index.json and then to use the templates in that file to fetch other information. That’s the whole interface. Very RESTful.
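The whole lookup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the MIRROR dictionary stands in for files a real client would fetch over HTTP, and the fill and resolve_extension names are hypothetical.

```python
from string import Template

# Stand-ins for files a mirror would serve; a real client would fetch
# these over HTTP rather than read them from a dictionary.
MIRROR = {
    "/index.json": {
        "dist": "/dist/$a/$ab/$dist/$dist-$version.pgz",
        "by-extension": "/by/extension/$a/$ab/$extension.json",
    },
    "/by/extension/p/pg/pgtap.json": {
        "stable": "0.25.0",
        "versions": {
            "0.25.0": {"dist": "pgtap", "version": "0.25.0", "status": "stable"},
        },
    },
}

def fill(template, key, **variables):
    """Expand a URI template, deriving $a and $ab from `key`."""
    return Template(template).substitute(a=key[:1], ab=key[:2], **variables)

def resolve_extension(extension, fetch=MIRROR.__getitem__):
    """Return the download URI for the latest stable release of an extension."""
    index = fetch("/index.json")
    ext = fetch(fill(index["by-extension"], extension, extension=extension))
    release = ext["versions"][ext["stable"]]
    return fill(
        index["dist"],
        release["dist"],
        dist=release["dist"],
        version=release["version"],
    )

print(resolve_extension("pgtap"))
```

Nothing in resolve_extension mentions a directory layout; every path comes from the templates in the index file.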
The structure of the other /by files would be similar. For
PGXN> install dist pgtap
the client would use the “by-dist” URI template to construct the URL /by/dist/p/pg/pgtap.json. That file would have something like:
{
    "stable": "0.25.0",
    "testing": "0.26.0b1",
    "unstable": "0.30.0u",
    "versions": {
        "0.26.0b1": "testing",
        "0.30.0u": "unstable",
        "0.25.0": "stable",
        "0.24.0": "stable",
        "0.23.0": "stable"
    }
}
So then the client would know that “0.25.0” was the most recent stable version, and use the dist URI template to request /dist/p/pg/pgtap/pgtap-0.25.0.pgz.
If the client command had been:
PGXN> readme dist pgtap
It would use the readme URI template. And the command:
PGXN> meta dist pgtap
Would use the meta URI template to fetch the metadata for the distribution.
If the client had requested a specific version:
PGXN> install dist pgtap 0.23.0
It could either use the by-dist URI template to download the list of all versions to see if 0.23.0 was valid, or just use the dist URI template to try to download the distribution itself.
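The first option, checking the by-dist file before downloading, might look like this minimal Python sketch. The pick_version name is hypothetical, and defaulting to the latest stable release when no version is requested is an assumption based on the earlier examples.

```python
BY_DIST = {
    "stable": "0.25.0",
    "testing": "0.26.0b1",
    "unstable": "0.30.0u",
    "versions": {
        "0.26.0b1": "testing",
        "0.30.0u": "unstable",
        "0.25.0": "stable",
        "0.24.0": "stable",
        "0.23.0": "stable",
    },
}

def pick_version(by_dist, requested=None):
    """Return the version to install: the latest stable release unless a
    specific, known version was requested."""
    if requested is None:
        return by_dist["stable"]
    if requested not in by_dist["versions"]:
        raise LookupError(f"no such release: {requested}")
    return requested

print(pick_version(BY_DIST))
print(pick_version(BY_DIST, "0.23.0"))
```

The second option skips this check and lets a failed download signal that the version does not exist.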
And finally, the owner and manager JSON files, such as
/by/owner/t/th/theory.json
Would look something like:
{
    "full_name": "David Wheeler",
    "email": "theory@pgxn.org",
    "uri": "http://justatheory.com",
    "distributions": {
        "pgtap": [ "0.25.0", "0.24.0", "0.23.0" ],
        "pair": [ "0.2.0", "0.1.0", "0.0.5" ]
    }
}
With that, the client can be asked to fetch metadata for a given owner name and use it to figure out what distributions and versions the owner, um, owns. One could then fetch the metadata, readme, or distribution file for any of those distributions and versions.
Overall, I think that this is a much better solution than I outlined before. If only I could figure out something more elegant than the prefix-staggering/hashing stuff, it would be just about perfect.
Thoughts?
I’ve posted a draft of the “PGXN distribution metadata specification,” or “PGXN Meta Spec.” This document specifies the structure and format of the META.json file that PGXN will require in every distribution. In fact, this is the only required file in a distribution. Its job is to describe the distribution, its extensions, and its dependencies, among other things. This file is key to the whole thing.
To create it, I’ve ported the CPAN Meta Spec, version 2, but with all deprecated fields removed, and some of the more complex stuff taken out. I also made a couple of the “required” fields “optional.” At its simplest, the file might look something like this:
{
    "name": "pgTAP",
    "abstract": "Unit testing for PostgreSQL",
    "version": "0.25.0",
    "owner": "David E. Wheeler <theory@pgxn.org>",
    "license": "postgresql",
    "meta-spec": {
        "version": "1.0.0",
        "url": "http://github.com/theory/pgxn/wiki/PGXN-Meta-Spec"
    }
}
Not too bad, eh? The URL for the spec might change (might move it to the main site and/or the mirrors), but otherwise, I think this is pretty solid. Not too much work to deal with, and reasonably easy to create by hand (which is likely how we’ll all start out).
An additional key that will be really important for the PGXN client is prereqs. This key allows you to identify the prerequisites (from PGXN or the PostgreSQL core contrib extensions) required to build, test, and/or use a distribution. For example, if I were to release an ordered pair extension, it of course would include tests written with pgTAP. So I’d have something like:
{
    "name": "pair",
    "abstract": "An ordered pair data type",
    "version": "0.1.0",
    "owner": "David E. Wheeler <theory@pgxn.org>",
    "license": "postgresql",
    "meta-spec": {
        "version": "1.0.0",
        "url": "http://github.com/theory/pgxn/wiki/PGXN-Meta-Spec"
    },
    "prereqs": {
        "runtime": {
            "requires": {
                "PostgreSQL": "8.0.0"
            },
            "recommends": {
                "PostgreSQL": "8.4.0"
            }
        },
        "test": {
            "requires": {
                "pgTAP": 0
            }
        }
    }
}
That’s saying that the “pair” distribution requires PostgreSQL 8.0.0 or higher and any version of pgTAP to run the test suite. I’ve also recommended PostgreSQL 8.4, as that’s where it will run best.
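For illustration, here is how a client might pull the prerequisites for a given phase out of that structure. This is a Python sketch, not the PGXN client; the prerequisites function and its defaults are hypothetical.

```python
import json

META = json.loads("""
{
    "name": "pair",
    "version": "0.1.0",
    "prereqs": {
        "runtime": {
            "requires": { "PostgreSQL": "8.0.0" },
            "recommends": { "PostgreSQL": "8.4.0" }
        },
        "test": {
            "requires": { "pgTAP": 0 }
        }
    }
}
""")

def prerequisites(meta, phases=("runtime",), relationships=("requires",)):
    """Collect {name: minimum_version} for the given phases and relationships."""
    needed = {}
    for phase in phases:
        for relationship in relationships:
            needed.update(meta.get("prereqs", {}).get(phase, {}).get(relationship, {}))
    return needed

print(prerequisites(META))
print(prerequisites(META, phases=("runtime", "test")))
```

A version of 0 simply means that any version will do.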
Of course, to get the real power of PGXN, you’ll also want to use the provides key, which allows you to identify the extensions included in your distribution. Say that I finally got around to breaking out the schema testing assertions from the logical testing assertions in pgTAP. I might call the second module “schematap.” So to spell it out, I’d add this to the first example above:
"provides": {
    "pgtap": {
        "file": "sql/pgtap.sql.in",
        "version": "0.25.0"
    },
    "schematap": {
        "file": "sql/schematap.sql.in"
    }
}
So now the indexer will know that the “pgtap” extension is in sql/pgtap.sql.in and the “schematap” extension is in sql/schematap.sql.in. This is important because it allows other distributions to specify “schematap” as a prerequisite. It also means that, in the PGXN client, you can type something like:
PGXN> install schematap
And, because “schematap” will have been indexed on the network, the client will be able to find the pgTAP distribution and install it, complete with the “schematap” extension.
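The indexing step that makes this possible can be sketched in Python as a simple mapping from extension names to the distributions that provide them. The index_extensions function is hypothetical and says nothing about how the real indexer will store its data.

```python
def index_extensions(metas):
    """Map each provided extension name to the (distribution, version)
    that ships it."""
    index = {}
    for meta in metas:
        for extension in meta.get("provides", {}):
            index[extension] = (meta["name"], meta["version"])
    return index

pgtap_meta = {
    "name": "pgTAP",
    "version": "0.25.0",
    "provides": {
        "pgtap": {"file": "sql/pgtap.sql.in", "version": "0.25.0"},
        "schematap": {"file": "sql/schematap.sql.in"},
    },
}

index = index_extensions([pgtap_meta])
print(index["schematap"])
```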
A final note. Version numbers in the Perl community are a disaster. Wanting to avoid that whole morass, I had originally intended to require numeric version numbers. But David Golden — the current maintainer of the CPAN Meta Spec — pointed me to Semantic Versioning, a version number specification by GitHub’s Tom Preston-Werner. This style of version numbering is great for PGXN for a few reasons:
So this is the standard that PGXN will require. Every version number will be dotted-integer with three integers (X.Y.Z) and an optional ASCII string at the end. That’s it. PGXN won’t invest any special meaning in the version string the way CPAN does. It will just compare version numbers.
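A minimal Python sketch of parsing and comparing such versions follows. The ordering of the trailing string is an assumption of this sketch, borrowed from Semantic Versioning: a version with a suffix sorts before the same X.Y.Z without one, and suffixes compare in ASCII order. The function names are hypothetical.

```python
import re

# Three dotted integers plus an optional ASCII suffix that starts with a letter.
_VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)([A-Za-z][0-9A-Za-z-]*)?")

def parse(version):
    """Split a version string into (major, minor, patch, suffix)."""
    match = _VERSION.fullmatch(version)
    if match is None:
        raise ValueError(f"not an X.Y.Z version: {version!r}")
    major, minor, patch, suffix = match.groups()
    return int(major), int(minor), int(patch), suffix or ""

def sort_key(version):
    """Key for sorting versions from oldest to newest."""
    major, minor, patch, suffix = parse(version)
    # Assumption: "0.26.0b1" comes before "0.26.0"; suffixes sort as ASCII.
    return (major, minor, patch, suffix == "", suffix)

versions = ["0.25.0", "0.3.0", "0.26.0b1", "0.26.0"]
print(sorted(versions, key=sort_key))
```

Comparing the integer parts numerically is what keeps 0.3.0 ahead of 0.25.0 in the sorted list, which a plain string comparison would get wrong.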
Anyway, please review the spec itself and leave any comments or questions below. I expect to start hacking on this stuff this week!
Update 2010-08-24: I’ve just updated the spec to change “owner” to “maintainer.” I think that the latter term is much better for this purpose. And then I can use “owner” in PGXN to identify the person who uploads a distribution.
I’ve been thinking about the arrangement of stuff to be distributed to the mirrors. In doing so, I’ve kept three goals in mind:
The first two are a bit mutually-contradicting. Thinking scalability mainly means minimizing the chances that too many files can be put into a single directory. CPAN started out with all author directories in a single directory. Given the sheer number of CPAN contributors, this quickly got to be a bottleneck. To correct for that, they started “hashing” the first two letters of author names to create subdirectories. My CPAN directory, for example, is D/DW/DWHEELER. That doesn’t quite make for an intuitive location, but it’s not bad, and the tradeoff seems sufficient to keep things sane.
The storage of metadata is important, too, as I plan to have the client send requests for JSON files to mirrors in order to find distributions. Basically, this came down to a naming convention, as well as a recipe for how to find metadata files for people and extensions. What I’ve come up with is three directories:
meta will contain PGXN metadata
dist will contain distributions
by will contain query-able JSON files
The meta directory will contain PGXN metadata (mirrors.json, timestamp, other detritus). If I ever create an index file that lists all distributions, it would go there, too. I’m not going to discuss it any further here, except to note that it already has mirrors.json, which clients will be able to use to present users with a list of mirrors to choose from.
The dist directory will be organized with directories named with “hashes” of the first two letters of distribution names. Let’s say that I’m releasing pgTAP 0.25 on PGXN. To distribute it, the management application will create the directory (if it doesn’t already exist):
dist/p/pg/pgtap/
Then, for the 0.25 release, it will add three files to that directory:
dist/p/pg/pgtap/pgtap-0.25.pgz
dist/p/pg/pgtap/pgtap-0.25.json
dist/p/pg/pgtap/pgtap-0.25.readme
The pgtap-0.25.pgz file will contain the zipped distribution ready for download. pgtap-0.25.json will contain metadata about the distribution, such as the owner’s name, the manager’s name (more on these folks below), list of included extensions, location of the .pgz and .readme, and its SHA1. pgtap-0.25.readme will of course contain the README for the distribution (if it has one).
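The path “hashing” is easy to express in code. Here is a small Python sketch; the dist_paths name is hypothetical, and lower-casing the distribution name is inferred from the pgTAP example rather than stated as a rule.

```python
def dist_paths(name, version):
    """Return the three mirror paths for a release, using the
    first-letter / first-two-letters directory "hashing"."""
    # Lower-casing is inferred from the example (pgTAP is stored as pgtap).
    name = name.lower()
    base = f"dist/{name[:1]}/{name[:2]}/{name}/{name}-{version}"
    return [f"{base}.pgz", f"{base}.json", f"{base}.readme"]

for path in dist_paths("pgTAP", "0.25"):
    print(path)
```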
Every release of pgTAP will have these three files, so after several releases, the pgtap directory might have these files:
dist/p/pg/pgtap/pgtap-0.23.pgz
dist/p/pg/pgtap/pgtap-0.23.json
dist/p/pg/pgtap/pgtap-0.23.readme
dist/p/pg/pgtap/pgtap-0.24.pgz
dist/p/pg/pgtap/pgtap-0.24.json
dist/p/pg/pgtap/pgtap-0.24.readme
dist/p/pg/pgtap/pgtap-0.25.pgz
dist/p/pg/pgtap/pgtap-0.25.json
dist/p/pg/pgtap/pgtap-0.25.readme
dist/p/pg/pgtap/pgtap.json
The last file there, pgtap.json, will actually be a symlink to the JSON file for the latest production release of pgTAP. In this case, it would link to pgtap-0.25.json. The nice thing about this is that, if a client wants to find the information about the latest release of the pgtap distribution, all it will have to do is send an HTTP GET request for dist/p/pg/pgtap/pgtap.json to any mirror.
The by directory will also contain JSON files for clients to request. The idea is that you want to find information “by” something. To start with, there will be three subdirectories:
by/extension/
by/manager/
by/owner/
The first directory, by/extension/, will contain links to JSON files for extensions. Say that the pgTAP distribution offers two extensions to PostgreSQL named “pgtap” and “schematap”. The links would be:
by/extension/p/pg/pgtap/pgtap-0.23.json
by/extension/p/pg/pgtap/pgtap-0.24.json
by/extension/p/pg/pgtap/pgtap-0.25.json
by/extension/p/pg/pgtap/pgtap.json
Each of these will simply be symlinks pointing to the appropriate distribution files:
dist/p/pg/pgtap/pgtap-0.23.json
dist/p/pg/pgtap/pgtap-0.24.json
dist/p/pg/pgtap/pgtap-0.25.json
dist/p/pg/pgtap/pgtap.json
Yes, the last one is a symlink to a symlink. Similarly, these files for “schematap”:
by/extension/s/sc/schematap/schematap-0.23.json
by/extension/s/sc/schematap/schematap-0.24.json
by/extension/s/sc/schematap/schematap-0.25.json
by/extension/s/sc/schematap/schematap.json
Point to exactly the same files. Essentially, this is a way for extensions to point to the distributions that contain them. by/extension/s/sc/schematap/schematap.json points to dist/p/pg/pgtap/pgtap.json, which contains the path to the .pgz file to download (and lots of other metadata, too).
The idea is that, whatever the name of the extension you want, the client will be able to easily find the metadata file that tells it where to find the distribution.
The by/manager and by/owner directories, on the other hand, contain JSON files with information about managers and owners and their distributions. Definitions:
An “owner” is someone who owns a distribution. This will often be the original author of an extension, but may be someone else if maintenance has been passed on. Basically, the “owner” is the person who should be contacted with bug reports and the like.
A “manager” is someone who manages the release process. This is the user who will log into the PGXN management application and upload a distribution for release.
These two people will often be the same person, but not always. I’ve avoided the term “author” (the term used by CPAN) because the author of an extension may no longer maintain it. “Owner” seemed like a better choice (individual distributions are free to describe their contributors however they wish).
As the owner of a few extensions on PGXN, I’d have this file:
by/owner/t/th/theory.json
This file would contain a list of my distributions and perhaps some other information (like my full name and blog URL).
As the manager of extensions on PGXN (that is, I actually uploaded them), I would also have:
by/manager/t/th/theory.json
This file would contain a list of the distributions I’ve released on PGXN. This might be exactly the same as the list in my owner file, but may not be. Perhaps for one release of pgTAP, say 0.26, Duke Leto uploaded a release. In that case, 0.26 would probably be in my owner file, but not in my manager file: it would be in Duke’s manager file, instead.
So these are the basics of the directory structure for the networked mirrors. Note that I haven’t thought much about everything that will go into the JSON files (or whether or not they’d be versioned). That will likely depend quite a lot on what the management database ends up looking like. I’ll be working on that next.
But other than that, comments? Questions? Criticisms? Recommendations? Leave a comment and let me know!