So I’m designing the full text indexing for the PGXN search site. I’m modeling it on CPAN Search, which has been great. There are four search options:
The documentation search is the one I’m perhaps least sure about. It assumes that each extension in a distribution will have documentation. But so far that has not really been the practice for PostgreSQL extensions. Most folks seem to stick the documentation in the README. And even then it can be almost nothing. So a search for “count nulls” probably would not find the “countnulls” extension, because there is no documentation. What should I do about this? I’m thinking one of:
Encourage folks to write documentation. I’m going to do this anyway, because the docs will really help the visibility of an extension on the site. It looks like this. If you have no docs for an extension, your extension will not appear in the search results (or perhaps it might, but link to the distribution).
If there is no documentation for an extension in a distribution, index the README as the documentation. I’m not really keen on this idea, because the README should describe the distribution, how to install it, etc. I’m planning to use it in the distribution-specific index. Documentation of the extension should be more about how the extension works, what its interface is, etc. Or so it seems to me, at least (I’m admittedly biased to this practice among CPAN modules). But at least with this approach there would be a link to “documentation” for an extension on the search site.
Erm, not really thinking of any other options. I feel pretty strongly that folks should write docs for their extensions, as much as possible, and I’ve set things up so that, from PGXN’s point of view, at least, you can write documentation in whatever format you like (assuming the format is supported by or added to Text::Markup), as long as it’s in a doc/ or docs/ directory. I want it to be as easy as possible. But I also want there to be decent search results ASAP.
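To make the two options concrete, here is a minimal Python sketch of how an indexer might pick the file to index for an extension. It is an illustration only, not PGXN code: the function name `find_doc`, the `readme_fallback` flag, and the rule that a doc file shares the extension's base name are all assumptions made for the example.

```python
from pathlib import PurePosixPath

# Directories searched for documentation, per the doc/ or docs/ convention.
DOC_DIRS = {"doc", "docs"}


def find_doc(files, extension, readme_fallback=False):
    """Return the path to index as documentation for `extension`, or None.

    `files` is a list of paths relative to the distribution root.
    Assumption: a doc file has the same base name as the extension.
    """
    for name in files:
        path = PurePosixPath(name)
        in_doc_dir = len(path.parts) >= 2 and path.parts[0] in DOC_DIRS
        if in_doc_dir and path.stem == extension:
            return name
    if readme_fallback:
        # Option two: index a top-level README when no docs exist.
        for name in files:
            path = PurePosixPath(name)
            if len(path.parts) == 1 and path.stem.upper() == "README":
                return name
    return None


if __name__ == "__main__":
    undocumented = ["README.md", "sql/countnulls.sql"]
    print(find_doc(undocumented, "countnulls"))
    print(find_doc(undocumented, "countnulls", readme_fallback=True))
```

With the fallback off, an undocumented extension simply yields nothing to index, which is the first option; turning it on gives the second.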
Comments?
Things slowed up a bit over the last couple of months, I admit. There are any number of reasons for that, not the least of which were the intrusion of the holidays and a little side project I’ve been hacking on after-hours (and sometimes during-hours). But I’m ramping things up again now and need your feedback on my current plans. Here’s what I’m working on: the search site.
Well, sort of. First of all, I’ve decided that the “search site” should not be a separate thing. The main site will be the search site. This is following the example of JSAN, as well as feedback from Graham Barr, who created and maintains CPAN Search. Apparently people are often confused that search.cpan.org is separate from www.cpan.org. No point in adding in confusion from the beginning. And besides, now that the PGXN fund-raising is over, I don’t know what else would go on the home page.
The other thing that’s happened is, just as I was getting my butt in gear on this stuff, a new CPAN search site came to my attention, μετα CPAN. This is an interesting project. What they did instead of creating a monolithic HTML search site is to create a simple API that serves nothing but JSON. It has search and displays metadata for CPAN objects (distributions, maintainers, modules, etc.). The search site, then, is not really a site at all, but a pure JavaScript application. Once you load it, it just uses the API server to get all the data. There are a few tricks server-side to proxy the API server so as to get around the browser’s same-origin restrictions on cross-site requests. But otherwise it just works in the browser.
Now I’m not sure I’ll do the same thing, exactly, but there’s a lot of appeal in creating a RESTful API server that’s independent of the search site, and then building the search site to use it. It also has the advantage of being useful for other projects to just use. Want to create a PGXN search widget for your blog? Yeah, there’s an API for that.
Of course, thanks to the “RESTful Directory” design for the mirrors (described here and revised here), any mirror is a lightweight API already. There’s a lot of metadata one can get just from the static JSON files it generates. The design is flexible, but designed with a command-line client in mind. As such, many commands executed in a command-line client would likely require multiple requests to a mirror. For example:
> install pgtap
This would request /by/extension/pgtap.json from the server. It would then parse that file and see that the latest stable version of pgTAP is in the distribution “pgTAP” at version “0.25.0”. So it would then download /dist/pgTAP-0.25.0.pgz to install.
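A rough Python sketch of that two-step lookup, just to show the shape of it. The JSON layout in `SAMPLE_EXTENSION_JSON` and the function names are assumptions made up for the example; the real mirror file may use different keys.

```python
import json
from typing import List

# Hypothetical contents of /by/extension/pgtap.json (key names assumed).
SAMPLE_EXTENSION_JSON = """
{
  "extension": "pgtap",
  "stable": {"dist": "pgTAP", "version": "0.25.0"}
}
"""


def download_path(extension_json: str, status: str = "stable") -> str:
    """Return the archive path to fetch, given the extension metadata."""
    meta = json.loads(extension_json)
    latest = meta[status]
    return f"/dist/{latest['dist']}-{latest['version']}.pgz"


def install_requests(extension: str, extension_json: str) -> List[str]:
    """List, in order, the requests a client makes to install an extension."""
    return [
        f"/by/extension/{extension}.json",  # first request: find the release
        download_path(extension_json),      # second request: fetch the archive
    ]


if __name__ == "__main__":
    for path in install_requests("pgtap", SAMPLE_EXTENSION_JSON):
        print("GET", path)
```

The second path cannot be built until the first response has been parsed, which is why the client needs two round trips.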
This is great for a command-line client, but not so great for a search site that needs to be responsive. Ideally, a site should send a single request to get all the data it needs for a particular page.
So here’s what I’m thinking for a PGXN API server: It will offer a superset of the functionality of any other PGXN mirror. That is, all the JSON files in a mirror will be present, but many of them will have more information than they would on other mirrors. And then, of course, there will be other URIs to offer additional API calls.
So what does that look like? Let’s take the pgTAP distribution, which I released on PGXN earlier this week. To find the pgTAP distribution, one requests:
From that, one can see that the latest stable release is 0.25.0, and so one can then request
to get all the metadata for that particular release. What I propose, to avoid the two requests, is to include the contents of the second file in the first. That would then have all the data necessary to generate the pgTAP distribution page on the PGXN site.
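Here is a minimal Python sketch of that merge as the API server might do it when building its copy of the first file. The key names (`releases`, `stable`, `abstract`), the sample values, and the newest-first ordering of the release list are all assumptions for illustration, not the actual file formats.

```python
# Hypothetical contents of the distribution index (first request).
DIST_INDEX = {
    "name": "pgTAP",
    "releases": {"stable": ["0.25.0", "0.24.0"]},
}

# Hypothetical per-release metadata (second request), keyed by version.
RELEASE_META = {
    "0.25.0": {"name": "pgTAP", "version": "0.25.0", "abstract": "sample abstract"},
}


def latest_stable(index):
    """Return the newest stable version, assuming a newest-first list."""
    return index["releases"]["stable"][0]


def merged_dist(index, release_meta):
    """Fold the latest release's metadata into the distribution index."""
    merged = dict(release_meta[latest_stable(index)])  # copy, don't mutate
    merged["releases"] = index["releases"]             # keep the release list
    return merged


if __name__ == "__main__":
    print(merged_dist(DIST_INDEX, RELEASE_META))
```

A page that needs both the release list and the current release's details can then be built from one response.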
The API would offer similar supersets of data for the extension, owner, and tag metadata files, to have the data necessary for the design of the corresponding extension, owner and tag layouts of the site.
In addition to adding metadata to the existing mirrored JSON files, there would be other resources available for request from the API server. They would include:
Extension Documentation. Each distribution may include documentation for included extensions in the doc subdirectory. These will go under the directory for a specific distribution such as /dist/pgTAP/pgTAP-0.25.0/doc/pgtap.html. The latest version of each document would also be available under /by/extension, as in /by/extension/pgtap.html. This requires that the documentation file have the same base name as the extension file itself.
Other documentation. I’d like to support arbitrary documentation, such as for included binary executables, HOWTOs, etc. The canonical copies will go under the versioned distribution URL, of course, but I’m not sure about permalinks. That might require an extension of the Meta Spec; I haven’t quite figured that out, yet.
Source code. There will be an interface to browse an unpacked copy of any distribution as plain text. This will be under /src, as in /src/pgTAP/pgTAP-0.25.0/.
Search. Of course. This is the big one, really. I think it makes sense to have the /by URI respond to search requests. Thus, a request for
/by?q=testing
would search everything. If you only want to search a certain category of object, you’d hit the appropriate URI:
/by/dist?q=tap
/by/owner?q=clochard
/by/tag?q=test
/by/extension?q=gis
The nice thing about this is that it retains the existing entity URLs. The directory level determines which entities you get.
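As a sketch of how a server might interpret those search URIs, the Python below maps the directory level to a search scope. It only illustrates the idea: the function name `parse_search`, the `"all"` label for an unscoped search, and the error handling are assumptions, not part of any existing PGXN code.

```python
from urllib.parse import parse_qs, urlsplit

# Entity categories taken from the URIs listed above.
SCOPES = {"dist", "owner", "tag", "extension"}


def parse_search(uri):
    """Return (scope, query) for a /by search URI.

    "/by?q=..." searches everything; "/by/<scope>?q=..." searches one
    category. Anything else raises ValueError.
    """
    parts = urlsplit(uri)
    query = parse_qs(parts.query).get("q", [""])[0]
    segments = [s for s in parts.path.split("/") if s]
    if not segments or segments[0] != "by" or len(segments) > 2:
        raise ValueError(f"not a search URI: {uri}")
    if len(segments) == 1:
        return ("all", query)
    scope = segments[1]
    if scope not in SCOPES:
        raise ValueError(f"unknown search scope: {scope}")
    return (scope, query)


if __name__ == "__main__":
    for uri in ("/by?q=testing", "/by/dist?q=tap", "/by/tag?q=test"):
        print(uri, "->", parse_search(uri))
```

Because the scope comes from the path, a static file such as /by/extension/pgtap.json has three path segments and is not treated as a search by this sketch.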
So that’s my thinking on the search API. I’m going to start hacking on it in earnest tomorrow, and perhaps next week I can get a very early version out (basically just another mirror to start with).
But what do you think? Seem like a sane approach? Am I missing anything obvious or doing anything clearly stupid? Please let me know in the comments!
I’ve been working on my PostgreSQL Conference West presentation, which heavily features PGXN, of course. I think it’s looking good. If you’re at PGWest or are in the San Francisco area and free, come see the talk! Should be a good introduction to creating PostgreSQL extensions and distributing them on PGXN. The latest bit I added is a section on the modifications needed to support the forthcoming CREATE EXTENSION support slated for 9.1. Fortunately it’s dead simple, and will make dealing with extensions in the database a lot simpler, administratively. Really looking forward to that. Of course I’ll post slides once the talk is over.
As part of preparing for the talk, and because there isn’t currently much to actually see of PGXN, I’ve been mocking up the layout for the new search site, which as you know from the status page is the next part of the project I’m slated to work on. I’ve been committing the mockups to the gh-pages branch of the repository, which means you can see what it looks like live on the net right here. That’s the home page, including our sponsor links and tag cloud. Click the “PGXN Search” button to see a mockup of search results (or get them here). Click on any search result to see the mockup of a documentation page (or link it here). The design is based on the lazydays open-source Web design, and I’m quite happy with it. Your thoughts?
As this comes together, I’m gearing up to start hacking on the app to produce the search site. At this point, I’m thinking that it would become the new home page for PGXN, rather than a separate search.pgxn.org site. Thoughts?
I’ll post the slides tomorrow.