djs.to

darrin's musings on software, linux, and anything else.

Hyper Estraier

This must be a pretty common scenario. Your company has a private intranet which is a cobbled-together group of web servers, wikis, maybe Bugzilla if you do software development, maybe some CRM systems, inventory systems, databases with web front-ends and so on.

After a while everyone starts complaining that they can't find the information they need. Nothing is structured, and there is no global search facility. What everyone craves is a Google-like search for the private data on the intranet.


This was becoming quite a problem at my place of work, until I discovered Hyper Estraier. This is one of those delightfully terse pieces of software that is not only incredibly fast, but flexible enough that I am confident I can use it to index any intranet data we will have.

The best feature, in my opinion, is the "Document Draft" file format. I can write scripts to take data directly from intranet databases and push it into the indexing engine. Crawling the intranet over HTTP can be difficult when authentication is in play, and it is rather slow as well. Instead, I can write a script to collect the required text from whichever system holds it and have it bulk-indexed.

For example, I currently pull comment text directly from MySQL for our ~30k Bugzilla entries, and get them indexed in about a minute.
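To give a feel for what such a script produces, here is a minimal Python sketch of turning one record into a document draft. It is not my actual script. It assumes the draft layout as I understand it: attribute lines such as "@uri=" and "@title=" at the top, then an empty line, then the body text one line at a time. The function name to_draft, the example URL and the sample text are made up for illustration, and the database query itself is left out.

```python
def to_draft(uri: str, title: str, body: str) -> str:
    """Render one record as Hyper Estraier document draft text.

    Assumed layout: "@name=value" attribute lines, an empty line,
    then the body text, one line per row.
    """
    def one_line(value: str) -> str:
        # Attribute values live on a single line, so collapse
        # newlines and runs of whitespace into single spaces.
        return " ".join(value.split())

    lines = [f"@uri={one_line(uri)}", f"@title={one_line(title)}", ""]
    for raw in body.splitlines():
        # Strip surrounding whitespace (including a leading tab, which
        # the draft format treats specially) and skip empty lines.
        text = raw.strip()
        if text:
            lines.append(text)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical bug record; in practice this would come from a query.
    print(to_draft(
        "http://bugzilla.example/show_bug.cgi?id=123",
        "Crash on startup",
        "Steps to reproduce:\nStart the app twice.\n",
    ))
```

Each draft would be written to its own file (conventionally with an .est suffix) in a directory, and that directory handed to the estcmd gather command. Check the estcmd manual for the exact options, as I am not quoting them here.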

Hyper Estraier handles incremental index updates too, and that would be a more correct way to handle something like Bugzilla: just re-index a bug whenever it changes. This way you get near real-time updates to the index. Perhaps one day I'll code that up, but for now full indexing is so fast that I just rebuild from scratch every couple of hours via cron.
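If I ever do code up the incremental version, I imagine it would look roughly like the sketch below. This is untested against a real index and rests on assumptions: that each bug carries a last-modified timestamp (Bugzilla keeps one), that drafts live in a hypothetical drafts/ directory named by bug id, and that estcmd's put subcommand takes the index directory followed by a draft file. The helper names are invented, and the commands are only built, not run.

```python
from datetime import datetime
from typing import Iterable, List, Tuple


def changed_since(bugs: Iterable[Tuple[int, datetime]],
                  last_run: datetime) -> List[int]:
    """Return ids of bugs modified after the previous indexing run."""
    return [bug_id for bug_id, modified_at in bugs if modified_at > last_run]


def put_commands(bug_ids: Iterable[int],
                 casket: str = "casket",
                 draft_dir: str = "drafts") -> List[List[str]]:
    """Build one 'estcmd put' command per changed bug (assumed syntax)."""
    return [["estcmd", "put", casket, f"{draft_dir}/{bug_id}.est"]
            for bug_id in bug_ids]


if __name__ == "__main__":
    # Hypothetical data: (bug id, last modified time).
    bugs = [(1, datetime(2024, 1, 1)), (2, datetime(2024, 1, 3))]
    for cmd in put_commands(changed_since(bugs, datetime(2024, 1, 2))):
        print(" ".join(cmd))
```

Run from cron every minute or so, something along these lines would only touch the handful of bugs that changed, rather than all ~30k.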

Once everything is indexed, just hook up the supplied CGI program in your intranet, and google-style searching is all yours! We have the search box embedded right at the top of the intranet home page, for easy access. It typically executes searches in under 0.1 seconds.

Not a full replacement for well managed, structured documentation, but our users love it.