Things slowed up a bit over the last couple of months, I admit. There are any number of reasons for that, not the least of which were the intrusion of the holidays and a little side project I’ve been hacking on after-hours (and sometimes during-hours). But I’m ramping things up again now and need your feedback on my current plans. Here’s what I’m working on: the search site.
Well, sort of. First of all, I’ve decided that the “search site” should not be a separate thing. The main site will be the search site. This is following the example of JSAN, as well as feedback from Graham Barr, who created and maintains CPAN Search. Apparently people are often confused that search.cpan.org is separate from www.cpan.org. No point in adding in confusion from the beginning. And besides, now that the PGXN fund-raising is over, I don’t know what else would go on the home page.
The other thing that’s happened is, just as I was getting my butt in gear on this stuff, a new CPAN search site came to my attention, μετα CPAN. This is an interesting project. What they did instead of creating a monolithic HTML search site is to create a simple API that serves nothing but JSON. It has search and displays metadata for CPAN objects (distributions, maintainers, modules, etc.). The search site, then, is not really a site at all, but a pure JavaScript application. Once you load it, it just uses the API server to get all the data. There are a few tricks server-side to proxy the API server so as to avoid the browser’s cross-domain request restrictions. But otherwise it just works in the browser.
Now I’m not sure I’ll do the same thing, exactly, but there’s a lot of appeal in creating a RESTful API server that’s independent of the search site, and then building the search site to use it. It also has the advantage of being useful for other projects to just use. Want to create a PGXN search widget for your blog? Yeah, there’s an API for that.
Of course, thanks to the “RESTful Directory” design for the mirrors (described here and revised here), any mirror is a lightweight API already. There’s a lot of metadata one can get just from the static JSON files it generates. The design is flexible, but it was designed with a command-line client in mind. As such, many commands executed in a command-line client would likely require multiple requests to a mirror. For example:
> install pgtap
This would request /by/extension/pgtap.json from the server. It would then parse that file and see that the latest stable version of pgTAP is in the distribution “pgTAP” at version “0.25.0”. So it would then download /dist/pgTAP-0.25.0.pgz to install.
This is great for a command-line client, but wouldn’t be so great for a search site to be responsive. Ideally, a site should send a single request to get all the data it needs for a particular page.
So here’s what I’m thinking for a PGXN API server: It will offer a superset of the functionality of any other PGXN mirror. That is, all the JSON files in a mirror will be present, but many of them will have more information than they would on other mirrors. And then, of course, there will be other URIs to offer additional API calls.
So what does that look like? Let’s take the pgTAP distribution, which I released on PGXN earlier this week. To find the pgTAP distribution, one requests:
From that, one can see that the latest stable release is 0.25.0, and so one can then request
to get all the metadata for that particular release. What I propose, to avoid the two requests, is to include the contents of the second file in the first. That would then have all the data necessary to generate the pgTAP distribution page on the PGXN site.
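As a rough sketch of that merge, assuming both files have already been fetched and parsed: the function name and every field name below are hypothetical, since the actual layout of the two files is not settled here.

```python
def build_superset(dist_index: dict, release_meta: dict) -> dict:
    """Fold the latest release's metadata into the distribution document.

    Keys already present in the distribution document win; anything that
    only exists in the release metadata is added alongside them.
    """
    merged = dict(dist_index)
    for key, value in release_meta.items():
        merged.setdefault(key, value)
    return merged


# Hypothetical sample documents, for illustration only.
dist_index = {"name": "pgTAP", "releases": {"stable": ["0.25.0"]}}
release_meta = {"name": "pgTAP", "version": "0.25.0", "license": "postgresql"}

print(build_superset(dist_index, release_meta))
```

A page renderer could then work from the single merged document instead of issuing a second request.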
The API would offer similar supersets of data for the extension, owner, and tag metadata files, to have the data necessary for the design of the corresponding extension, owner, and tag layouts of the site.
In addition to adding metadata to the existing mirrored JSON files, there would be other resources available for request from the API server. They would include:
Extension Documentation. Each distribution may include documentation for included extensions in the doc subdirectory. These will go under the directory for a specific distribution such as /dist/pgTAP/pgTAP-0.35.0/doc/pgtap.html. The latest version of each document would also be available under /by/extension, as in /by/extension/pgtap.html. This requires that the documentation file have the same base name as the extension file itself.
Other documentation. I’d like to support arbitrary documentation, such as for included binary executables, HOWTOs, etc. The canonical copies will go under the versioned distribution URL, of course, but I’m not sure about permalinks. That might require an extension of the Meta Spec; I haven’t quite figured that out, yet.
Source code. There will be an interface to browse an unpacked copy of any distribution as plain text. This will be under /src, as in /src/pgTAP/pgTAP-0.35.0/.
Search, of course. This is the big one, really. I think it makes sense to have the /by URI respond to search requests. Thus, a request for
/by?q=testing
would search everything. If you only want to search a certain category of object, you’d hit the appropriate URI:
/by/dist?q=tap
/by/owner?q=clochard
/by/tag?q=test
/by/extension?q=gis
The nice thing about this is that it retains the existing entity URLs. The directory level determines which entities you get.
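A client-side helper for building these search URIs might look like the following. It is only a sketch: the function name and the fixed set of categories are assumptions drawn from the examples above, not a published interface.

```python
from urllib.parse import urlencode

# Categories taken from the example URIs above; the real API may differ.
SEARCHABLE = {"dist", "owner", "tag", "extension"}


def search_uri(query, category=None):
    """Build a search URI under /by, optionally limited to one category."""
    if category is not None and category not in SEARCHABLE:
        raise ValueError(f"unknown category: {category}")
    path = "/by" if category is None else f"/by/{category}"
    return f"{path}?{urlencode({'q': query})}"


print(search_uri("testing"))           # /by?q=testing
print(search_uri("gis", "extension"))  # /by/extension?q=gis
```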
So that’s my thinking on the search API. I’m going to start hacking on it in earnest tomorrow, and perhaps next week I can get a very early version out (basically just another mirror to start with).
But what do you think? Seem like a sane approach? Am I missing anything obvious or doing anything clearly stupid? Please let me know in the comments!
Following my post outlining a possible network directory structure, Aristotle Pagaltzis saw fit to bug me via email about a different approach. I couldn’t understand WTF he was talking about until today. Then it lit my brain on fire. As a result, I now think that there is a much better way to organize the metadata files for the PGXN — one that happens not to include any symbolic links (which is something that Andreas König has been flagging, via email, as a possible bottleneck).
First, the /dist directory will be the same as before. Releases of pgTAP would be in:
dist/p/pg/pgtap/pgtap-0.23.pgz
dist/p/pg/pgtap/pgtap-0.23.json
dist/p/pg/pgtap/pgtap-0.23.readme
dist/p/pg/pgtap/pgtap-0.24.pgz
dist/p/pg/pgtap/pgtap-0.24.json
dist/p/pg/pgtap/pgtap-0.24.readme
dist/p/pg/pgtap/pgtap-0.25.pgz
dist/p/pg/pgtap/pgtap-0.25.json
dist/p/pg/pgtap/pgtap-0.25.readme
The only change is that the pgtap.json symlink is gone.
Now, the new stuff. In the root directory will be a file, index.json, that contains templates for URIs. It will look something like this:
{
"dist": "/dist/$a/$ab/$dist/$dist-$version.pgz",
"readme": "/dist/$a/$ab/$dist/$dist-$version.readme",
"meta": "/dist/$a/$ab/$dist/$dist-$version.json",
"by-dist": "/by/dist/$a/$ab/$dist.json",
"by-extension": "/by/extension/$a/$ab/$extension.json",
"by-owner": "/by/owner/$a/$ab/$owner.json",
"by-manager": "/by/manager/$a/$ab/$manager.json"
}
The PGXN client will always fetch this file before it does anything else, because the file tells it how to find stuff. The advantage here is that the client doesn’t have to know anything about how the directory is actually organized, just what the template variables might be. They are:
$dist: A distribution name
$version: A version number
$extension: An extension name
$owner: An owner’s name
$manager: A release manager’s name (managers are the people who upload distributions to PGXN)
$a: The first letter of a distribution, extension, owner, or manager name.
$ab: The first two letters of a distribution, extension, owner, or manager name.
I’m not thrilled about using prefix-staggering to avoid having too many files in a directory. But the truth is that this approach allows me to punt. I could also make sure the client supports, for example, $bc and $cd, so that one could stagger things differently. And then the nice thing is that I don’t have to use those at all. The templates will tell the client exactly how to construct the URIs for things, and the templates needn’t include those staggering variables if they’re not appropriate. The client won’t care because it will have no built-in knowledge of how things are organized. It will have to find out from index.json.
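To make the template idea concrete, here is a minimal sketch of how a client might fill one in. It leans on Python's standard string.Template, whose $name syntax happens to match the variables above. The expand function and its rule for deriving $a and $ab (take them from whichever entity name is supplied) are my assumptions, not part of any spec.

```python
from string import Template

# Variables that name an entity; $a and $ab are derived from whichever is given.
ENTITY_KEYS = ("dist", "extension", "owner", "manager")


def expand(template, **values):
    """Fill in a URI template from index.json with the supplied variables."""
    entity = next((values[k] for k in ENTITY_KEYS if k in values), None)
    if entity is None:
        raise ValueError("need one of: " + ", ".join(ENTITY_KEYS))
    return Template(template).substitute(a=entity[:1], ab=entity[:2], **values)


print(expand("/dist/$a/$ab/$dist/$dist-$version.pgz", dist="pgtap", version="0.25.0"))
print(expand("/by/extension/$a/$ab/$extension.json", extension="schematap"))
```

A template that omits the staggering variables expands just as well, which is the point: the client never needs to know whether they are in use.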
From the URI templates, you can now see where the other metadata will be stored. For extension names, a hypothetical pgTAP distribution with two extensions will have a JSON file for each extension:
/by/extension/p/pg/pgtap.json
/by/extension/s/sc/schematap.json
The pgtap.json file will look something like this:
{
"stable": "0.25.0",
"testing": "0.26.0b1",
"unstable": "0.30.0u",
"versions": {
"0.26.0b1": { "dist": "pgtap", "version": "0.26.0b1", "status": "testing" },
"0.30.0u": { "dist": "pgtap", "version": "0.30.0u", "status": "unstable" },
"0.25.0": { "dist": "pgtap", "version": "0.25.0", "status": "stable" },
"0.24.0": { "dist": "pgtap", "version": "0.24.0", "status": "stable" },
"0.23.0": { "dist": "pgtap", "version": "0.23.0", "status": "stable" }
}
}
Right at the top, it would always list the most recent stable, testing, and unstable version numbers, and then it would have a list of metadata for all versions. Said metadata would include the associated distribution name, version, and release status.
Here’s how it would work. Say I ask the client to install pgtap:
PGXN> install extension pgtap
The client would first fetch /index.json, then look for the URI template for “by-extension”, which is /by/extension/$a/$ab/$extension.json. Filling in the template, it would know to request /by/extension/p/pg/pgtap.json. With that file, it would see that the most recent stable version is in the “pgtap” distribution version 0.25.0. Using the dist URI template, which is /dist/$a/$ab/$dist/$dist-$version.pgz, it would then fetch /dist/p/pg/pgtap/pgtap-0.25.0.pgz.
The advantage here is that there are no symbolic links and no knowledge of the directory structure built into clients. The client just knows to fetch /index.json and then to use the templates in that file to fetch other information. That’s the whole interface. Very RESTful.
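The whole lookup can be sketched in a few lines. This is an illustration under assumptions: fetch stands in for an HTTP GET against a mirror, FAKE_MIRROR is made-up sample data shaped like the files described above, and the function names are hypothetical.

```python
import json
from string import Template


def expand(template, name, **values):
    """Fill a URI template; $a and $ab come from the entity name."""
    return Template(template).substitute(a=name[:1], ab=name[:2], **values)


def resolve_extension(fetch, extension, status="stable"):
    """Return the download URI for an extension, using only index.json templates.

    `fetch` is any callable mapping a URI path to the response body (a str).
    """
    index = json.loads(fetch("/index.json"))
    by_ext = expand(index["by-extension"], extension, extension=extension)
    info = json.loads(fetch(by_ext))
    release = info["versions"][info[status]]
    return expand(index["dist"], release["dist"],
                  dist=release["dist"], version=release["version"])


# A stand-in for a mirror: URI path -> file contents (hypothetical sample data).
FAKE_MIRROR = {
    "/index.json": json.dumps({
        "dist": "/dist/$a/$ab/$dist/$dist-$version.pgz",
        "by-extension": "/by/extension/$a/$ab/$extension.json",
    }),
    "/by/extension/p/pg/pgtap.json": json.dumps({
        "stable": "0.25.0",
        "versions": {
            "0.25.0": {"dist": "pgtap", "version": "0.25.0", "status": "stable"},
        },
    }),
}

print(resolve_extension(FAKE_MIRROR.__getitem__, "pgtap"))
```

Note that the only path hard-coded in the client is /index.json; everything else comes from the templates.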
The structure of the other /by files would be similar. For
PGXN> install dist pgtap
the client would use the “by-dist” URI template to construct the URL /by/dist/p/pg/pgtap.json. That file would have something like:
{
"stable": "0.25.0",
"testing": "0.26.0b1",
"unstable": "0.30.0u",
"versions": {
"0.26.0b1": "testing",
"0.30.0u": "unstable",
"0.25.0": "stable",
"0.24.0": "stable",
"0.23.0": "stable"
}
}
So then the client would know that “0.25.0” was the most recent version, and use the dist URI template to request /dist/p/pg/pgtap/pgtap-0.25.0.pgz.
If the client command had been:
PGXN> readme dist pgtap
It would use the readme URI template. And the command:
PGXN> meta dist pgtap
would use the meta URI template to fetch the metadata for the distribution.
If the client had requested a specific version:
PGXN> install dist pgtap 0.23.0
It could either use the by-dist URI template to download the list of all versions to see if 0.23.0 was valid, or just use the dist URI template to try to download the distribution itself.
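The first option, checking the by-dist file before downloading, could be sketched like this. The function name is hypothetical, and by_dist is the parsed by-dist JSON shown earlier.

```python
def pick_version(by_dist, requested=None):
    """Choose a version from a by-dist document (parsed JSON as a dict).

    With no explicit request, fall back to the latest stable version;
    otherwise make sure the requested version was actually released.
    """
    if requested is None:
        return by_dist["stable"]
    if requested not in by_dist["versions"]:
        raise LookupError(f"no such version: {requested}")
    return requested


by_dist = {
    "stable": "0.25.0",
    "testing": "0.26.0b1",
    "unstable": "0.30.0u",
    "versions": {
        "0.26.0b1": "testing",
        "0.30.0u": "unstable",
        "0.25.0": "stable",
        "0.24.0": "stable",
        "0.23.0": "stable",
    },
}

print(pick_version(by_dist))            # 0.25.0
print(pick_version(by_dist, "0.23.0"))  # 0.23.0
```

The trade-off is one extra request in exchange for a clear error message instead of a bare 404 from the mirror.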
And finally, the owner and manager JSON files, such as
/by/owner/t/th/theory.json
would look something like:
{
"full_name": "David Wheeler",
"email": "theory@pgxn.org",
"uri": "http://justatheory.com",
"distributions": {
"pgtap": [ "0.25.0", "0.24.0", "0.23.0" ],
"pair": [ "0.2.0", "0.1.0", "0.0.5" ]
}
}
With that, the client can be asked to fetch metadata for a given owner name and use it to figure out what distributions and versions the owner, um, owns. One could then fetch the metadata, readme, or distribution file for any of those distributions and versions.
Overall, I think that this is a much better solution than I outlined before. If only I could figure out something more elegant than the prefix-staggering/hashing stuff, it would be just about perfect.
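For instance, a client could flatten an owner file into (distribution, version) pairs and feed each pair to the dist, readme, or meta template. A small sketch, with a hypothetical function name and the sample data from above:

```python
def owned_releases(owner_doc):
    """List every (distribution, version) pair found in an owner document."""
    return [
        (dist, version)
        for dist, versions in owner_doc["distributions"].items()
        for version in versions
    ]


owner_doc = {
    "full_name": "David Wheeler",
    "distributions": {
        "pgtap": ["0.25.0", "0.24.0", "0.23.0"],
        "pair": ["0.2.0", "0.1.0", "0.0.5"],
    },
}

for dist, version in owned_releases(owner_doc):
    print(dist, version)
```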
Thoughts?
I’ve been thinking about the arrangement of stuff to be distributed to the mirrors. In doing so, I’ve kept three goals in mind:
The first two are somewhat contradictory. Thinking about scalability mainly means minimizing the chances that too many files can be put into a single directory. CPAN started out with all author directories in a single directory. Given the sheer number of CPAN contributors, this quickly got to be a bottleneck. To correct for that, they started “hashing” the first two letters of author names to create subdirectories. My CPAN directory, for example, is D/DW/DWHEELER. That doesn’t quite make for an intuitive location, but it’s not bad, and the tradeoff seems sufficient to keep things sane.
The storage of metadata is important, too, as I plan to have the client send requests for JSON files to mirrors in order to find distributions. Basically, this came down to a naming convention, as well as a recipe for how to find metadata files for people and extensions. What I’ve come up with is three directories:
meta will contain PGXN metadata
dist will contain distributions
by will contain query-able JSON files
The meta directory will contain PGXN metadata (mirrors.json, timestamp, other detritus). If I ever create an index file that lists all distributions, it would go there, too. I’m not going to discuss it any further here, except to note that it already has mirrors.json, which clients will be able to use to present users with a list of mirrors to choose from.
The dist directory will be organized with directories named with “hashes” of the first two letters of distribution names. Let’s say that I’m releasing pgTAP 0.25 on PGXN. To distribute it, the management application will create the directory (if it doesn’t already exist):
dist/p/pg/pgtap/
Then, for the 0.25 release, it will add three files to that directory:
dist/p/pg/pgtap/pgtap-0.25.pgz
dist/p/pg/pgtap/pgtap-0.25.json
dist/p/pg/pgtap/pgtap-0.25.readme
The pgtap-0.25.pgz file will contain the zipped distribution ready for download. pgtap-0.25.json will contain metadata about the distribution, such as the owner’s name, the manager’s name (more on these folks below), list of included extensions, location of the .pgz and .readme, and its SHA1. pgtap-0.25.readme will of course contain the README for the distribution (if it has one).
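The path scheme is mechanical enough to sketch. Two assumptions here: the function name is made up, and I lowercase the distribution name because the example paths above use “pgtap” for pgTAP.

```python
from pathlib import PurePosixPath


def release_paths(dist, version):
    """Compute the three mirror paths for one release of a distribution."""
    name = dist.lower()  # assumption: paths use the lowercased name
    base = PurePosixPath("dist", name[0], name[:2], name)
    return {
        ext: str(base / f"{name}-{version}.{ext}")
        for ext in ("pgz", "json", "readme")
    }


for path in release_paths("pgTAP", "0.25").values():
    print(path)
```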
Every release of pgTAP will have these three files, so after several releases, the pgtap directory might have these files:
dist/p/pg/pgtap/pgtap-0.23.pgz
dist/p/pg/pgtap/pgtap-0.23.json
dist/p/pg/pgtap/pgtap-0.23.readme
dist/p/pg/pgtap/pgtap-0.24.pgz
dist/p/pg/pgtap/pgtap-0.24.json
dist/p/pg/pgtap/pgtap-0.24.readme
dist/p/pg/pgtap/pgtap-0.25.pgz
dist/p/pg/pgtap/pgtap-0.25.json
dist/p/pg/pgtap/pgtap-0.25.readme
dist/p/pg/pgtap/pgtap.json
The last file there, pgtap.json, will actually be a symlink to the JSON file for the latest production release of pgTAP. In this case, it would link to pgtap-0.25.json. The nice thing about this is that, if a client wants to find the information about the latest release of the pgtap distribution, all it will have to do is send an HTTP GET request for dist/p/pg/pgtap/pgtap.json to any mirror.
The by directory will also contain JSON files for clients to request. The idea is that you want to find information “by” something. To start with, there will be three subdirectories:
by/extension/
by/manager/
by/owner/
The first directory, by/extension/, will contain links to JSON files for extensions. Say that the pgTAP distribution offers two extensions to PostgreSQL named “pgtap” and “schematap”. The links would be:
by/extension/p/pg/pgtap/pgtap-0.23.json
by/extension/p/pg/pgtap/pgtap-0.24.json
by/extension/p/pg/pgtap/pgtap-0.25.json
by/extension/p/pg/pgtap/pgtap.json
Each of these will simply be a symlink pointing to the appropriate distribution file:
dist/p/pg/pgtap/pgtap-0.23.json
dist/p/pg/pgtap/pgtap-0.24.json
dist/p/pg/pgtap/pgtap-0.25.json
dist/p/pg/pgtap/pgtap.json
Yes, the last one is a symlink to a symlink. Similarly, these files for “schematap”:
by/extension/s/sc/schematap/schematap-0.23.json
by/extension/s/sc/schematap/schematap-0.24.json
by/extension/s/sc/schematap/schematap-0.25.json
by/extension/s/sc/schematap/schematap.json
point to exactly the same files. Essentially, this is a way for extensions to point to the distributions that contain them. by/extension/s/sc/schematap/schematap.json points to dist/p/pg/pgtap/pgtap.json, which contains the path to the .pgz file to download (and lots of other metadata, too).
The idea is that, whatever the name of the extension you want, the client will be able to easily find the metadata file that tells it where to find the distribution.
The by/manager and by/owner directories, on the other hand, contain JSON files with information about managers and owners and their distributions. Definitions:
An “owner” is someone who owns a distribution. This will often be the original author of an extension, but may be someone else if maintenance has been passed on. Basically, the “owner” is the person who should be contacted with bug reports and the like.
A “manager” is someone who manages the release process. This is the user who will log into the PGXN management application and upload a distribution for release.
These two people will often be the same person, but not always. I’ve avoided the term “author” (the term used by CPAN) because the author of an extension may no longer maintain it. “Owner” seemed like a better choice (individual distributions are free to describe their contributors however they wish).
As the owner of a few extensions on PGXN, I’d have this file:
by/owner/t/th/theory.json
This file would contain a list of my distributions and perhaps some other information (like my full name and blog URL).
As the manager of extensions on PGXN (that is, I actually uploaded them), I would also have:
by/manager/t/th/theory.json
This file would contain a list of the distributions I’ve released on PGXN. This might be exactly the same as the list in my owner file, but may not be. Perhaps for one release of pgTAP, say 0.26, Duke Leto uploaded a release. In that case, 0.26 would probably be in my owner file, but not in my manager file: it would be in Duke’s manager file, instead.
So these are the basics of the directory structure for the networked mirrors. Note that I haven’t thought much about everything that will go into the JSON files (or whether or not they’d be versioned). That will likely depend quite a lot on what the management database ends up looking like. I’ll be working on that next.
But other than that, comments? Questions? Criticisms? Recommendations? Leave a comment and let me know!
Work has finally started. We now have mirroring. This was the first task for the project, and it’s now checked off on the status page. Hurrah!
This turned out to be a pretty simple task, of course. For now I’m using the Kineticode (my other employer) server to host both the PGXN site and the master mirror. This is temporary, just a way to get things up and going as quickly as possible. The master mirror runs rsyncd with read-only access to the pgxn path. So all you have to do to mirror it is set up a cron job like:
rsync -az --delete rsync://master.pgxn.org/pgxn /path/to/pgxn
If the destination directory is under an HTTP root, you’re done. Otherwise, throw a web server over it and then you’re done.
Of course, the web server isn’t really necessary yet. Soon it will be the interface for downloading distributions and metadata from the mirrors. Right now, the mirrors just have two files in them (a README and an index.html). This is just the start of things. The next step is to figure out the directory structure for the mirrors. I’m working on that right now and will soon be asking for feedback. Watch this space for details!