Indexing Web pages: maybe books aren’t such a bad model after all!

by Geoffrey Hart

Previously published, in a different form, as: Hart, G.J. 1999. Index the Web. Intercom, June:26–28.

Challenging our assumptions

One of our favorite clichés is that you can’t use the printed book as a model for online information. Web-based information, which is following the same evolutionary progress as online help systems, has inherited this “books are bad” philosophy. However, any statement we’ve begun to take for granted bears some re-examination, because unquestioningly accepting dogma undermines our efforts to improve communication. Whatever their purported flaws, books represent the result of a few millennia of evolution, and though they’re manifestly imperfect, they’ve nonetheless become effective communications tools. Forgetting this has led us to overlook one traditional aspect of books that will prove equally helpful to our Web-based audience: the index. After all, as Lori Lathrop has observed, a book’s index is really just a hardwired form of hypertext. That being the case, why not use the same approach online?

Early online help systems generally lacked an index, and relied primarily on built-in hyperlinks in the form of a table of contents with no synonyms; searching was often limited to typing “help” at a command line, then picking from a list of displayed topics. Over time, help systems have grown more sophisticated and added a variety of search features, whether based on keywords embedded in the files or on search engines of varying power. Promoters of this technology argue that a main advantage of online information is the ability to automatically search for specific text, thereby eliminating the need for traditional indexes. Though this is true in principle, the actual results have been far less satisfying because search engines remain oblivious to the context and relevance of words and phrases, and despite some promising advances in the technology, will continue to be “blind” for some time to come. As a result, many users remain highly dissatisfied with text-based searches, and until the technology improves, help software now generally includes at least a provision for a traditional index.

Web-based information is still largely stuck in an earlier state of evolution, but is developing fast. So let’s challenge the assumption about indexes and see whether we can’t bring the Web up to parity with online help by using indexing to help make a Web site more usable. Before we begin, let me make one thing clear: I won’t attempt to cover the principles of indexing, since that would take a good book; Bonura (1994) and Mulvany (1994), among others, have done an excellent job of teaching the basics. Here, I intend solely to apply the book model to the Web environment. Actual usability testing has not yet been done, and this would be a fruitful area for others to explore; in fact, I hope to work on a case study some time in 1999. As with any other relatively new technique, it’s going to take some experimentation until we successfully adapt the principles of paper-based indexing to the new medium.

Principles of indexing

Before beginning, it helps to review the distinction between a table of contents and an index. A table of contents presents a high-level overview of the structure of a body of information, from which readers can create a mental model of what goes where. In effect, it provides access to broad areas of information, not specific facts within those broad areas. In this sense, the table of contents is functionally equivalent to the home page of a Web site. An index, on the other hand, provides no overview of structure, although a gifted, patient reader could conceivably create a sketchy map of the overall structure of a book by correlating page ranges with the topics covered in the index. Instead, an index points directly to the conceptual information that fleshes out the overall structure.

The principles of creating a high-level structure (chapters, sections, and subsections) include grouping like things, separating disparate things, and chunking information into progressively smaller, more focused units. Conversely, and no less simplistically, creating an index’s low-level structure requires labeling of each discrete concept, providing synonyms so that those who use different labels can find your label, and providing cross-references to related topics likely to be of interest.
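The three tasks of low-level indexing (labeling, providing synonyms, and cross-referencing) can be pictured as a small data structure. What follows is only a minimal sketch in Python; the class names, field names, and sample entries are my own illustrative assumptions, not part of any indexing tool or standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class IndexEntry:
    """One labeled concept in the index (all names here are illustrative)."""
    label: str
    locators: List[str] = field(default_factory=list)  # where the concept is discussed
    see_also: List[str] = field(default_factory=list)  # cross-references to related topics


class Index:
    def __init__(self) -> None:
        self.entries: Dict[str, IndexEntry] = {}
        self.synonyms: Dict[str, str] = {}  # alternative label -> preferred label

    def add(self, label, locators, see_also=()):
        self.entries[label] = IndexEntry(label, list(locators), list(see_also))

    def add_synonym(self, alternative, preferred):
        self.synonyms[alternative] = preferred

    def lookup(self, term) -> Optional[IndexEntry]:
        """Resolve a reader's term to an entry, following a synonym if one exists."""
        preferred = self.synonyms.get(term, term)
        return self.entries.get(preferred)


index = Index()
index.add("table of contents", ["page 3"], see_also=["index"])
index.add_synonym("TOC", "table of contents")

entry = index.lookup("TOC")
print(entry.label, entry.locators, entry.see_also)
```

A reader who arrives with the label “TOC” still lands on the indexer’s preferred label, and the cross-reference list points onward to related topics.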

In essence, the model provided by printed books remains robust for online information, since form matches function and the function is unchanged. So the same conceptual principles I’ve described transfer entirely intact to online indexing; the primary difference arises in the formatting and other presentation requirements. This much is true: the appearance of an index cannot be the same online as it is in print, and for the usual reasons: different screen dimensions, different resolution, the ability to use color online to enhance communication, and so on.

Creating an online index is no more difficult than—or perhaps just as difficult as—creating any other page full of hyperlinks: create the labels (index entries) for each link, then apply the URL for the destination of that link, whether a separate Web page or a cross-reference within the same document (the index document). The file size for an online index is inevitably larger than that of its printed equivalent, simply because (for example) “indexing: page 9” uses far fewer characters than “indexing: http://www.example.com/tools/indexing/overview”. Fortunately, file sizes are less of an issue than they once were, and even today, HTML has the virtue of being an extremely efficient file format compared with most word processor files.
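To make the “page full of hyperlinks” idea concrete, here is a hedged sketch in Python that turns a list of labels and destinations into an HTML list. The function name and the second URL are assumptions made purely for illustration; only the first URL comes from the example above. The last line compares the character counts of the print and Web forms of the same entry.

```python
from html import escape

# Label -> destination URL. The second URL is hypothetical.
entries = {
    "indexing": "http://www.example.com/tools/indexing/overview",
    "table of contents": "http://www.example.com/tools/contents",
}


def render_index(entries):
    """Return an HTML list with one hyperlink per index entry, sorted by label."""
    items = [
        f'<li><a href="{escape(url, quote=True)}">{escape(label)}</a></li>'
        for label, url in sorted(entries.items(), key=lambda pair: pair[0].lower())
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


html_index = render_index(entries)
print(html_index)

# The same entry as it would appear in print and on the Web.
print_form = "indexing: page 9"
web_form = "indexing: http://www.example.com/tools/indexing/overview"
print(len(print_form), len(web_form))
```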

Obstacles to indexing

Given this context, it would seem to be a no-brainer to include indexes in all our Web sites—yet it’s not being done. What are the obstacles? Again, the printed book provides a good model.

Indexes are expensive and time-consuming

We’ve inherited the first two obstacles, and by far the most important ones, unchanged from books. First, book indexes are often done poorly because indexing is a highly skilled profession and therefore expensive; thus, it’s one of the first things to bring in-house or skimp on when budgets are tight. (Many publishers now ask authors to index their own books.) Second, indexing is time-consuming, and because publishers usually finalize indexes only once a project is complete and ready to publish, there’s often little time to do a high-quality job. Acknowledging the importance of an index is one thing, but making indexing part of your corporate culture is quite another. Doing so solves both problems, whether in print or online, but first you must sell your managers on the idea.

It’s easy to demonstrate that full-text search engines simply aren’t an acceptable alternative—get the manager to try one! Solving the time problem relies on the same solution increasingly being applied with books: create the index as you write the book (or in the context of this article, incorporate index tags as you create the Web page). Unfortunately, there are currently no tools available for doing this for HTML, and this leaves the task of creating the index manually. Netscape is rumored to be working on a means of incorporating indexes in HTML files, and I’ve located a promising tool for HTML indexing that is currently undergoing beta testing <http://www.html-indexer.com/>, but neither is yet “ready for prime time”. [Author's note: Since this article was first published, HTML Indexer has become a respected, broadly used tool. Another tool, Deva Tools, is also seeing increased use: <http://www.devahelp.com/>.] Once indexes become a standard feature of Web pages, expect the authoring tools to support indexing; until then, we’ll have nonstandard, ad hoc solutions.

Dynamic sites are hard to index

Another problem is that Web sites are highly dynamic, whereas books are static. One of the “dogmas” for creating a good Web site is that “if you build it, they will come; if you don’t keep changing it, they will go”. Poppycock! It’s certainly true that your information must remain up to date, but there are many categories of information (and particularly reference material such as dictionaries, lists of physical constants for scientists, and atlases) for which the bulk of the information rarely changes, and sites that provide this information remain extremely useful to specific audiences. Nonetheless, many Web sites do change at a phenomenal rate, and maintaining an up-to-date index (i.e., adding new links and cross-references, removing obsolete entries and cross-references) for such sites may be next to impossible using solely manual indexing methods.

The solution to maintenance will almost certainly be an automated indexing tool of some sort, integrated within authoring software, but until such tools are available, I can only propose a partial solution based on the approach we use with books: include indexing information in each Web page as part of the header, whether as a comment line or as some form of meta tag. Before updating a Web page, have an indexer scan the document header, extract the appropriate tags, and manually insert them in the index document. Similarly, before deleting a page from a site, have the indexer locate the page’s tags and manually remove all references to the page from the master index. Using a tightly controlled list of indexing keywords would likely prove essential to the success of this approach, and would probably require at least one person to be in charge of maintaining the list and enforcing guidelines for its use.
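The extraction step could be scripted rather than done entirely by hand. The sketch below, in Python, uses the standard library’s HTML parser. It assumes a meta tag named “index-terms” holding a comma-separated list; that tag name, the keyword list, and the sample page are hypothetical conventions invented for this example, not an existing standard.

```python
from html.parser import HTMLParser

# Hypothetical controlled list of approved indexing keywords.
CONTROLLED_TERMS = {"indexing", "search engines", "online help"}


class IndexTermParser(HTMLParser):
    """Collect terms from <meta name="index-terms" content="a, b, c"> tags."""

    def __init__(self):
        super().__init__()
        self.terms = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "index-terms":
            content = attributes.get("content") or ""
            self.terms += [t.strip().lower() for t in content.split(",") if t.strip()]


def extract_terms(page_html):
    parser = IndexTermParser()
    parser.feed(page_html)
    return parser.terms


def remove_page(master, url):
    """Drop every reference to a page, and any term left with no targets."""
    for term in list(master):
        master[term].discard(url)
        if not master[term]:
            del master[term]


def update_master_index(master, url, page_html):
    """Replace a page's entries; return any terms that are not on the controlled list."""
    remove_page(master, url)
    rejected = []
    for term in extract_terms(page_html):
        if term in CONTROLLED_TERMS:
            master.setdefault(term, set()).add(url)
        else:
            rejected.append(term)
    return rejected


page = ('<html><head><meta name="index-terms" '
        'content="Indexing, online help, widgets"></head></html>')
master = {}
rejected = update_master_index(master, "/tools/indexing.html", page)
print(master, rejected)
```

Terms that fall outside the controlled list are returned rather than silently added, so that the person in charge of the list can decide whether to approve or replace them.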

Multiple-page targets are poorly supported in HTML

Another practical problem is that most indexes provide multiple page references or a page range for several index entries. However, an HTML index can only provide one jump for each entry, so on the surface, it appears that this aspect of book indexing won’t transfer well to the Web. A partial solution would be to create subentries for each main index entry, and make each subentry sufficiently specific to provide the necessary context. For example, if the main entry were “sales tax”, there might be subentries for each jurisdiction. Another solution might be to group information so that general references (the equivalent of “pages 11–17”) point to a “table of contents” page, and the table of contents itself provides the appropriate jumps to more refined topics. Nonetheless, neither solution is as easy to implement or as elegant as the multiple page references and page ranges that we can create for books.
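As a rough illustration of the subentry approach, the following Python sketch renders one main entry whose subentries each carry a single jump. The jurisdictions, file paths, and function name are hypothetical and chosen only to show the shape of the markup.

```python
from html import escape

# Hypothetical data: each subentry label maps to exactly one target URL.
index = {
    "sales tax": {
        "federal": "/tax/federal.html",
        "provincial": "/tax/provincial.html",
    },
}


def render_entry(main, subentries):
    """Render a main entry as plain text with a nested list of linked subentries."""
    lines = [f"<li>{escape(main)}", "  <ul>"]
    for label, url in sorted(subentries.items()):
        lines.append(
            f'    <li><a href="{escape(url, quote=True)}">{escape(label)}</a></li>'
        )
    lines += ["  </ul>", "</li>"]
    return "\n".join(lines)


html_entry = render_entry("sales tax", index["sales tax"])
print(html_entry)
```

The main entry itself is not a link; the reader chooses among the subentries, each of which supplies the context that a bare page number would lack.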

Usability is tough to quantify

The final obstacle is a bit unusual, given that “everyone knows” the value of an index. If everyone knows it, then surely we can quantify it? Well, not exactly. Although it seems perfectly logical that including indexes in your Web pages will increase usability, that assumption remains to be tested. My feeling is that in an absolute sense, no usability study will ever disprove the value of a skillfully created index, but in a relative sense, cost-justifying indexes by quantifying their value may prove considerably more challenging. I hope some of my readers will be the ones to come up with that cost-justification.

Conclusions

It pays to challenge our assumptions, because doing so often leads to interesting possibilities. Even so, it’s important to test those possibilities. I’ve made several explicit and implicit assumptions in writing this article that may prove incomplete, inadequate, or incorrect, and it’s going to take some testing to find out which is the case. I’m hoping that my colleagues will be able to begin this process and eventually prove me right—or prove me wrong and show us the right way to do things. Moreover, as I’ve noted in this article, it will take some time to resolve the difficulties I’ve identified. In the meantime, we’ll need to develop practical workarounds until the tools catch up with the need. Members of STC’s Indexing SIG, among others, should be able to provide these workarounds.

Literature

Bonura, L.S. 1994. The art of indexing. John Wiley & Sons, New York, NY.

Holbert, S. 1998. How to index Windows-based online help. Intercom, May:26–27.

Mauer, P. Embedded indexing. Intercom, April:8–10.

Mulvany, N.C. 1994. Indexing books. University of Chicago Press, Chicago, IL.

Acknowledgments

My thanks to Lori Lathrop for performing a reality check on an earlier version of this article.