When we last looked at the organization of the mirror, I was pretty happy with the design except for one thing: the use of letter-hashing variables in the URI templates. It’s just ugly and, damnit, is it really necessary anymore? I mean, sure, I had to work around the problem of too many entries in one directory in Bricolage back in 2005, but on modern file systems like zfs and ext3, does it really matter anymore?
I was chatting about this with Schwern just now, and he thought it just didn’t matter anymore. So I asked my fellow PGX associate Jeff Frost about this, and he said, “Used to be anytime you went above 1000 entries in one directory, things would start to slow down. Certainly reiserfs, xfs, jfs, and zfs don’t suffer that issue.” But what about ext3?
I happen to have a box with ext3, so I tested it. Here’s what I found. To stat one file among 20,000,
Not bad. And one file among 200?
Well, you can’t get much closer than that. What about subdirectories? To stat a file inside one of 20,000 subdirectories,
time tells me:
And a file inside one of 200 subdirectories:
Well, I can live with that. I suppose there are some file systems out there that still have this problem, but you know what? I’m not going to worry about them. It’s thinking too far in advance anyway (if PGXN has this kind of scaling problem we’ll be lucky!), and the farther out we get, the less of a problem it is.
So you know what? I’m not going to use the hashing of extension, distribution, and owner names. Let the file systems worry about that performance, not me.
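If you want to repeat this kind of test on your own file system, here is a minimal Python sketch. It is not the set of commands I ran above; the function names, the temporary directories, and the repeat count are all made up for illustration. Because it stats the same path over and over, it mostly measures warm-cache lookups, so treat the numbers as a rough comparison rather than a proper benchmark.

```python
import os
import tempfile
import time


def make_files(parent, count):
    """Create `count` empty files named file_0 .. file_{count-1} in `parent`."""
    for i in range(count):
        with open(os.path.join(parent, f"file_{i}"), "w"):
            pass


def make_subdirs(parent, count):
    """Create `count` subdirectories, each holding one empty file named 'target'."""
    for i in range(count):
        sub = os.path.join(parent, f"dir_{i}")
        os.mkdir(sub)
        with open(os.path.join(sub, "target"), "w"):
            pass


def time_stat(path, repeats=1000):
    """Return the average wall-clock seconds for one os.stat() call on `path`."""
    start = time.perf_counter()
    for _ in range(repeats):
        os.stat(path)
    return (time.perf_counter() - start) / repeats


def run_benchmark(sizes=(200, 20000)):
    """Time a stat in a flat directory and in a tree of subdirectories, per size."""
    results = {}
    for n in sizes:
        # Case 1: one file among n files in a single directory.
        with tempfile.TemporaryDirectory() as flat:
            make_files(flat, n)
            results[("files", n)] = time_stat(os.path.join(flat, f"file_{n // 2}"))
        # Case 2: one file inside one of n subdirectories.
        with tempfile.TemporaryDirectory() as nested:
            make_subdirs(nested, n)
            target = os.path.join(nested, f"dir_{n // 2}", "target")
            results[("subdirs", n)] = time_stat(target)
    return results


if __name__ == "__main__":
    for (kind, n), secs in run_benchmark().items():
        print(f"{kind:8s} n={n:6d}  {secs * 1e6:.2f} us per stat")
```

Set the `TMPDIR` environment variable to a directory on the file system you care about before running it, since `tempfile` creates its directories wherever the system default points.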