We might as well come clean at the very beginning. We're both librarians. That admission is double-edged: while nobody really actively dislikes librarians, and most people in fact think that libraries are good things and that the people who work in them do good work, once you say the word librarian, the old stereotype clicks into place. You can almost see it happen: people see the bun, the sensible shoes, hear the shushing. . . .
In fact, our perspective as librarians is very helpful in many ways when it comes to looking for information specific to a topic or question, as opposed to just surfing your way through Cyberspace for fun. Librarians typically are taught to examine a question, understand the information needed and motivations that underlie it, then use one or more information-seeking tools to try to find particular items and sources in response. This involves a great deal of strategy, knowledge of many particular sources along with their organization and search protocols, an ability to work with people, and an understanding that the tools themselves are less important than their appropriate use. It also requires the ability to integrate what you find using these tools to help people find what they're looking for.
This is the perspective that we bring to the Net. When you hear discussions about searching for information on the Internet, you'll hear people talking about how to use Archie, how to find people's e-mail addresses with finger or Netfind, intriguing ways to make Veronica work more effectively, and so on. Those are, of course, all very important, and without these tools it would be completely impossible to find anything out there. The truth is, though, that all of the search tools and protocols currently available on the Internet are very primitive and permit only very simple searching.
This chapter discusses how searching for Net stuff is different from searching in other domains you might be familiar with, why it's been difficult to construct more sophisticated search mechanisms on the Net, and it proposes some ideas and strategies to take a more comprehensive approach to looking for information.
Before we discuss the searching tools themselves, it's important to understand some of the nature of the resources being searched. There are a wide variety of these, including:
This variety is part of the richness and thus the value of the Internet as a communication medium and information resource, but it makes comprehensive directed searching much more difficult than is the case in more traditional systems such as library catalogs, indexes, and abstracting sources. Why? Several reasons:
Different formats. A quick glance at the preceding list shows several different kinds of formats in which text could be embedded (Listserv, Usenet, Gopher, FTP, WAIS), and a number of nontextual formats (sound, image, moving image, directory). Several different search mechanisms have arisen to deal with many of these, but they have distinct interfaces, command structures, and access paths. There are as yet no commonly accepted ways to search through the nontextual sources, short of using textual tags or following hyperlinks, and that reduces to browsing.
These sources are dynamic. It's true. Anyone who's been on the Internet for any period of time has had this experience: a source you really like and have come to depend on all of a sudden isn't there any more. Perhaps the server is down, but it's also possible that the address has changed or the person who maintained it has graduated or moved or something, and so it's just no longer there. That's very frustrating for the experienced user, but imagine the reaction of a novice who runs across this situation. They might well imagine that it's their fault or that their individual computer isn't working right.
Sources are sometimes unreliable. The stories of the 95-element Periodic Table and the 1911 Roget's Thesaurus are perhaps only the best-known tip of the iceberg of information sources that are badly out of date or incomplete, or just plain wrong. This problem is exacerbated by the distributed nature of the Net: Once a "bad" source gets out there, it's a simple matter to point your Gopher at it or include a link for it on your home page, thus spreading its influence.
Some of these are unique sources. As such they are probably quite valuable, but uniqueness also means that they are perhaps unknown or foreign to a particular user, and thus less likely to be pursued. Suppose the perfect resource for your needs is available through WAIS, but you didn't know that and instead your search went through Veronica and some Web searching. You'll never find it, not because you couldn't but because you never thought to look there.
Many of these also arise in the "traditional" information searching world. There are many out-of-date or badly done print and nonnetworked digital resources, some of them are very highly specialized (or expensive) and thus not well known, and there are a large number of access mechanisms. However, they typically don't change overnight on the shelf; the access mechanisms are usually pretty well explained, and the work and reputations of authors and publishers help to allow users to decide whether or not what's there is any good. The special circumstances around Internet resources are, however, worth taking account of.
Now let's talk a bit about the tools that we use to find things on the Net, how they work, and how they compare to searching tools used in non-networked environments.
There is a strong match between searching tools and resource types on the Net. Archie can only be used to search for FTPable files, and nothing else really works for FTP. Similarly, Veronica goes with Gopher, Web tools are used to search the Web, and so on. It is true that there are overlaps (WAIS through Gopher, Gopher through Web searching tools, and so on), but the connections are very strong. This is often the case outside the Net, but there are important exceptions. Many library catalogs also provide access to journal and periodical indexing sources and other types of resources. Commercial services such as DIALOG provide a common command structure and interface to hundreds of different databases of different types. Thus, it is harder in general to learn to search on the Net because there are so many different tools, even though many of them are extremely simple to learn and use.
You might even say they're too easy to use. Many of these search tools permit only very simple matching of single words or character strings without the benefit of Boolean operators, adjacency searching (looking for two or more words in order), or field searching (looking for a word in the title or author fields, for example). They seem so easy to use, though, that a novice searcher might be fooled into believing that the results they get are complete and comprehensive when, in fact, they may be far from it. Archie is particularly notable here, since it provides virtually no capabilities beyond simple character matching.
But this is because there's very little else it can do. Most of these tools (WAIS a notable exception) have far less information about resources to work with than do traditional tools. In general, there is no indexing or abstracting of networked resources, and no controlled vocabulary or generally agreed-upon index terms used to organize and search. In FTP land, we have brief (often extremely cryptic) file names; in Gophers we have menu titles that may or may not reflect well the contents of documents; and much Web searching is limited to text between <title> and <h1> tags.
WAIS at least operates with the full text of documents, which gives it much more material to work with. WAIS has other difficulties as a search tool, though. Although it does allow for "natural language" searching (for example, it will accept anything you type at it, discarding stop words), it then proceeds to search on all remaining words in the search statement, implicitly ORing them together, and constructing a ranked list of retrieved items based on frequency of occurrence of those words, and a few other things. Since the WAIS search engine's algorithms have never been adequately described, it's hard to really know exactly what is going on in there; most people who use it have experienced frustration since searches often produce no useful results or the ranking mechanism somehow puts the really valuable stuff towards the bottom of the list. WAIS is a worthwhile attempt to provide access to full text items, and the ideas that underlie it are on the whole good, but the actual mechanism falls short in many important ways.
One further comment: These tools are also distributed, as are the resources. Thus any particular Archie or Veronica server or Web search site is only as good as the person who maintains or creates it, and it may be down or busy, as is the case with any networked resource. On the other hand, it's much easier to make one of these available than to mount a new commercial search service.
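To make the behavior just described concrete, here is a minimal sketch in Python of an implicit-OR, frequency-ranked search. It is not the WAIS algorithm, which as noted has never been adequately described; the stop-word list, the scoring rule (a plain sum of word counts), and the sample documents are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed stop-word list; the real WAIS list is not given in this chapter.
STOP_WORDS = {"a", "an", "and", "the", "of", "on", "in", "for", "about", "is"}

def rank_documents(query, documents):
    """Score each document by how often any query word occurs (implicit OR)."""
    terms = [w for w in re.findall(r"[a-z]+", query.lower())
             if w not in STOP_WORDS]
    scored = []
    for name, text in documents.items():
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        score = sum(counts[t] for t in terms)
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically by document name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]

# Hypothetical documents, invented for this example.
SAMPLE_DOCS = {
    "gardening.txt": "Pruning a bush in spring keeps the bush healthy.",
    "politics.txt": ("President Bush said that Barbara Bush and the Bush "
                     "staff would back the Bush plan."),
    "recipes.txt": "A recipe for Siberian ravioli.",
}

print(rank_documents("pruning a bush", SAMPLE_DOCS))
```

With these sample documents, the political item outranks the gardening one simply because one query word occurs in it more often, which is the kind of ranking surprise described above.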
The preceding chapters have given some good instruction on how to use searching tools; here we'd like to give some advice and tips not so much on specific strategy or the operation of individual tools, but rather strategy on when to use which tools, and how to decide which ones might be most effective.
In many ways, Archie is the worst tool available. Since it is able to operate with only extremely limited information about FTPable files, there's only so much it can do, but its lack of features beyond simple character-by-character matching often makes it frustrating to use. This frustration is made worse by its use of implicit OR: if you search for more than one word in a single Archie statement (for example, prog internet society), the results will be any entry containing either of those words or both, thus producing many more returns than probably desired. There's no easy or efficient way to do a Boolean AND in Archie.
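The difference between implicit OR and a Boolean AND can be sketched in a few lines of Python. This is an illustration of the matching logic only, not of Archie's actual implementation or command set; the function names and the file names are hypothetical, and the second function shows one possible workaround (filtering a saved result list yourself) rather than anything Archie provides.

```python
def implicit_or_search(query, filenames):
    """Return names containing ANY query word as a substring (implicit OR)."""
    words = query.lower().split()
    return [name for name in filenames
            if any(word in name.lower() for word in words)]

def local_and_filter(query, filenames):
    """Keep only names containing EVERY query word (an AND applied locally)."""
    words = query.lower().split()
    return [name for name in filenames
            if all(word in name.lower() for word in words)]

# Hypothetical file names, invented for this example.
SAMPLE_FILES = [
    "internet-society-charter.txt",
    "internet-tour.hqx",
    "society-of-mind.ps",
    "zen-1.0.tar.Z",
]

or_hits = implicit_or_search("internet society", SAMPLE_FILES)
and_hits = local_and_filter("internet society", or_hits)
print(len(or_hits), "names match either word;", len(and_hits), "match both")
```

Running the OR search first and then narrowing the saved results is clumsy, which is the point: the narrowing has to happen on your side.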
Of course, if you're looking for the location of an FTPable file, it's about all you have to work with. We suggest starting a search for information using Archie only if you know that the resource you're looking for is an FTPable text file (like, say, the early edition of Zen and the Art of the Internet) or software file (Mailstrom, for example).
As a secondary source, you might try Archie if the topic you're looking for is relatively specific, can be expressed using only one unambiguous word (thus avoiding the AND problem), and has some potential to be found in an FTPable file. This is not the place to do free-form, let's-see-what's-out-there searching unless you have a lot of time on your hands.
We've already discussed some of the limitations of the WAIS retrieval mechanism. Again, this is the only place (so far) that you'll be able to search the full texts of the resources, so you might well have more luck searching here than with other tools. Please bear in mind, though, that searching in full text is very tricky, especially with words that are ambiguous or have more than one meaning. Searching using WAIS on the word "bush" when doing a search on shrubbery will also get you items about George and Barbara Bush, among other things.
This problem is mitigated (somewhat) by the ranking mechanism that tries to take word frequency at least partially into account, but that doesn't always work.
We advise a two-step approach to searching in WAIS. The first step is useful when searching in the Directory of Servers, a sort of catalog of the individual information resources available via WAIS. In this source, you are only searching descriptions of resources, not the actual documents themselves. These descriptions range from the quite good to the poor to the nearly nonexistent, and so we suggest searching for broader subject areas here. The results you get from such a search will be descriptions of individual resources, and you can explore these individually (more about this in a second) to see which ones might be most useful. For example, if you were looking for articles or criticism of the Iliad and the Odyssey, don't search for those words in the Directory of Servers; they won't be there. Search instead for words like poetry, poem, classical, or even literature, although that one can be used in many different ways. Notice how we select terms that are conceptually broad but are not that ambiguous.
After you've identified potentially good resources, proceed to the second stage. (Experienced WAIS users may know resources well enough to skip the first stage.) Here, you are actually searching in the sources themselves, so you now can use the more specific words like Iliad, Odyssey, epic, and Homer; the likelihood of weird retrievals is lower here if you've carefully chosen resources to search in.
Veronica is probably the most sophisticated of the Internet searching tools; it supports Boolean searching, truncation (searching on tornad* to get tornado, tornadoes, tornadic, and so on), and enables you to control the number of results you'd like to see using the -m option. It still, however, only permits searching of a very limited amount of information about particular items: menu names or directory titles, at most a line each.
Nonetheless, it does allow you to do some quite effective searching. We advise searching on specific words if you have them and use them in sensible combinations, but don't get too specific right away. Take advantage of the Boolean search capabilities. If you're looking for information about the Internet Society, for example, try searching on isoc first, the Society's acronym, and a nice, unambiguous word. If it turns out that you get too few retrievals using this strategy, try broadening to a search like
(internet and society) or isoc
By the same token, a search on statistic* to pull up available statistical information will produce a lot of stuff, probably too much, in fact. Something more specific is called for, maybe
inferen* and statistic*
or
statistic* and software
or
statistic* and internet and (use or usage)
depending on what you're looking for. Veronica is really the only search tool that can support this kind of broadening and narrowing of strategies, such as is possible with many nonnetworked systems.
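The broadening and narrowing queries above can be mimicked with a small Boolean matcher over menu titles. The sketch below is written in Python and is not Veronica's implementation: it assumes that "and" binds more tightly than "or", supports only "and", "or", parentheses, and a trailing * for truncation, does no error handling for malformed queries, and runs against invented menu titles.

```python
import re

def tokenize(query):
    """Split a query into parentheses and words (a word may end in *)."""
    return re.findall(r"\(|\)|[^\s()]+", query.lower())

def term_matches(term, words):
    """Match a whole word, or a word prefix when the term ends in *."""
    if term.endswith("*"):
        stem = term[:-1]
        return any(word.startswith(stem) for word in words)
    return term in words

def parse_or(tokens, pos, words):
    value, pos = parse_and(tokens, pos, words)
    while pos < len(tokens) and tokens[pos] == "or":
        right, pos = parse_and(tokens, pos + 1, words)
        value = value or right
    return value, pos

def parse_and(tokens, pos, words):
    value, pos = parse_atom(tokens, pos, words)
    while pos < len(tokens) and tokens[pos] == "and":
        right, pos = parse_atom(tokens, pos + 1, words)
        value = value and right
    return value, pos

def parse_atom(tokens, pos, words):
    if tokens[pos] == "(":
        value, pos = parse_or(tokens, pos + 1, words)
        return value, pos + 1  # step past the closing ")"
    return term_matches(tokens[pos], words), pos + 1

def search_titles(query, titles):
    """Return the titles whose words satisfy the Boolean query."""
    tokens = tokenize(query)
    hits = []
    for title in titles:
        words = set(re.findall(r"[a-z0-9]+", title.lower()))
        matched, _ = parse_or(tokens, 0, words)
        if matched:
            hits.append(title)
    return hits

# Hypothetical Gopher menu titles, invented for this example.
SAMPLE_TITLES = [
    "Internet Society (ISOC) Information",
    "Statistical Software Archive",
    "Internet usage statistics 1993",
    "Society for Creative Anachronism",
]

print(search_titles("statistic* and internet and (use or usage)", SAMPLE_TITLES))
```

Because only a line of title text is available per item, even a well-formed Boolean query can match no more than what the person who named the menu chose to write.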
As we write this (late summer of 1994), Web searching is the area that is changing most quickly and dramatically. At first, there were no tools for directed searching of the World Wide Web, just starting points and gateway pages. Then the first few started to appear. Then, as more and more attention was being paid to the Web protocol because of the appearance of NCSA Mosaic, the floodgates started to open. Now, there are pages devoted to providing front ends to many different search engines. An example of one is the Meta-Index maintained by the Centre Universitaire d'Informatique (CUI) of the University of Geneva. Its URL is: http://cui_www.unige.ch/meta-index.html.
We'll discuss three of these engines in a bit more detail. More are being developed and released all the time; by the time you read this, there may well be many more available.
One of the earliest successful engines was JumpStation, developed by Jonathon Fletcher of the UK. Its URL is http://www.stir.ac.uk/jsbin/js, and it permits searching of text that appears in Web documents, but only small parts of it. These documents are created using a markup language called HTML (HyperText Markup Language). They have tags embedded in them to distinguish between major parts of the documents (lists, images, links to other documents, and so on). JumpStation takes advantage of this by permitting searching within two kinds of tags: the titles of documents and the first-level headers (roughly analogous to the major, roman-numeral divisions in an outline). This can make for effective searching, but again there are no conventions for what people write in these documents, so you're never really sure what you're getting.
The Web Crawler, from the University of Washington (URL: http://www.biotech.washington.edu/WebCrawler/WebQuery.html), is built by software robots (as are many Web searchers) that scan through the Web, starting with an initial group of documents and following links throughout Web space. The full text of documents is included in the index, which permits more thorough searching, and some simple Boolean searching is permitted, but users are still faced with the pitfalls of full-text searching. It does, however, have many attractive features. A side benefit of this discovery process is the Web Top 25 List, the most frequently referenced Web documents found by the robots.
The World Wide Web Worm (WWWW) is also built by programs that scan the Web identifying new documents. The Worm (URL: http://www.cs.colorado.edu/home/mcbryan/WWWW.html) permits searching of URLs of documents, so it is easy to find all the Web sites in particular countries, or of particular organizations: for example, a search for all Web documents containing the .se domain (from Sweden) or the syr.edu domain (from Syracuse University). Content-based searching is more problematic because there is no convention for indicating content within the URL.
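To see why searching only titles and first-level headers is limiting, consider this short Python sketch. It uses the standard library's html.parser module and is not JumpStation's own code; the class name, the helper functions, and the sample page are hypothetical.

```python
from html.parser import HTMLParser

class TitleAndH1Extractor(HTMLParser):
    """Collect only the text inside <title> and <h1> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "h1": []}
        self._current = None  # the tracked tag we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.fields:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self.fields[self._current].append(data.strip())

def indexed_text(html):
    """Return the lowercased title and h1 text, the only part searched."""
    parser = TitleAndH1Extractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.fields["title"] + parser.fields["h1"]).lower()

def matches(word, html):
    """True if the word appears in the page's title or first-level headers."""
    return word.lower() in indexed_text(html).split()

# Hypothetical page, invented for this example.
SAMPLE_PAGE = (
    "<html><head><title>Homer Resources</title></head>"
    "<body><h1>The Iliad</h1>"
    "<p>The full text of the Odyssey is also here.</p></body></html>"
)

print(matches("Iliad", SAMPLE_PAGE), matches("Odyssey", SAMPLE_PAGE))
```

The page mentions the Odyssey, but only in body text, so a search restricted to these two kinds of tags never sees it.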
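The robot approach (start from a few documents, follow links, index the full text, and count how often each document is referenced) can be sketched in Python as follows. This is an illustration of the general technique, not the Web Crawler's actual code. To stay self-contained it walks a small in-memory "web" rather than fetching anything over the network, and every URL and page text in it is invented.

```python
import re
from collections import deque

def crawl(web, start_urls):
    """Breadth-first walk of `web`, a dict of url -> (text, list of links).

    Returns (index, inbound): index maps each word to the set of URLs whose
    text contains it; inbound counts how many links point at each URL.
    """
    index = {}
    inbound = {}
    seen = set(start_urls)
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if url not in web:
            continue  # a dead link: the document is no longer there
        text, links = web[url]
        for word in set(re.findall(r"[a-z]+", text.lower())):
            index.setdefault(word, set()).add(url)
        for link in links:
            inbound[link] = inbound.get(link, 0) + 1
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index, inbound

# Hypothetical pages, invented for this example.
SAMPLE_WEB = {
    "http://example.org/start": (
        "Starting points for beer brewing",
        ["http://example.org/brew", "http://example.org/yeast"]),
    "http://example.org/brew": (
        "Homebrew recipes and brewing tips",
        ["http://example.org/yeast"]),
    "http://example.org/yeast": (
        "Yeast strains for brewing",
        ["http://example.org/gone"]),
}

index, inbound = crawl(SAMPLE_WEB, ["http://example.org/start"])
print(sorted(index["brewing"]))
print(max(inbound, key=inbound.get))
```

The inbound counts are the raw material for something like a most-referenced list, and the dead link shows how the dynamic nature of the Net surfaces even inside the indexing robot.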
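A domain-based URL search of this kind is easy to approximate in Python. The sketch below is not the Worm's own method: it matches on the host name's suffix using the standard library's urllib.parse, whereas the Worm is described above only as searching URLs. Apart from the Worm's own address, the URLs in the sample list are hypothetical.

```python
from urllib.parse import urlparse

def urls_in_domain(urls, domain):
    """Return the URLs whose host name is in, or equal to, the given domain."""
    domain = domain.lower().lstrip(".")
    hits = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if host == domain or host.endswith("." + domain):
            hits.append(url)
    return hits

# The first entry is the Worm's URL from the text; the rest are invented.
SAMPLE_URLS = [
    "http://www.cs.colorado.edu/home/mcbryan/WWWW.html",
    "http://www.example.se/index.html",
    "http://web.syr.edu/hypothetical/home.html",
    "http://www.rose.example.com/sweden.html",
]

print(urls_in_domain(SAMPLE_URLS, ".se"))
print(urls_in_domain(SAMPLE_URLS, "syr.edu"))
```

Note that the last sample URL mentions Sweden in its path but is not in the .se domain; a URL tells you where a document lives, not what it is about.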
In addition to these directed searching tools, there are a few other techniques we advise. We often find it useful to search through the text of the What's New page maintained by NCSA (URL: http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/whats-new.html); the descriptions are usually reasonable reflections of the content of individual Web sites, and they provide a convenient way to search for recent resources. It may well require repeated searches on several different pages because the old ones are divided up by month, but it is still a useful strategy.
There are also a number of subject-oriented pointer pages, such as the WWW Virtual Library:
http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html
the Whole Internet Catalog:
http://nearnet.gnn.com/wic/newrescat.toc.html
and others. They all suffer from a similar problem: Once the number of resources they point to grows beyond a certain point, it gets very difficult to maintain these lists in a reasonable way. They can get quite cumbersome and harder to use. Again, though, they do provide another, more general approach to searching for Web-based material. These two lists, along with several others also maintained manually, can be searched via the W3 Catalog maintained by CUI in Switzerland (URL: http://cui_www.unige.ch/cgi-bin/w3catalog).
Due to the nature of the searching tools available and the kinds of resources you find out there, we give you the following advice: if you're just starting out looking for information about a specific question, or on a particular topic, it's often best to start with Veronica or Web searching. There's a great deal of useful information available via both Gophers and the Web, and it's just plain easier to get at it. FTP and WAIS have their uses, but we see them as secondary approaches, in most cases.
A lot of this becomes clearer as you gain experience with the different types of resources typically found under different protocols. It does get easier as you spend more time on the Internet.
One of the other things you'll discover the more time you spend on the Net is that the most important resource that exists out there is the other several million people who coinhabit it. Sometimes they have the answer; sometimes they can point you to the person who does or the server where it sits. But once you've tried a bit of searching using these tools, it's often best to start asking around.
It's a natural leap to make. There's a global network enabling access to millions of computers. On those computers are located a lot of really interesting things. But there's no really good way to find the good stuff. What's your natural impulse? Same as ours: ask someone. Research has shown that that is the preferred information-seeking strategy of many people: ask your neighbor, ask your colleague, ask your friends. Well, it works on the Net, too. Ask anyone.
There are a couple of ways in which this takes place. Send your friends e-mail and ask them what they found that they like, or if they know of any keen stuff about Babylon 5. It might work, especially if they share similar interests. But, chances are, they might not know of specific things.
Your next best strategy is to go to Listservs or Usenet newsgroups. We haven't talked much about these because searching really doesn't come into play here. You can search within archives of old Listserv postings, and you can search within the subject lines of articles in Usenet (using many newsreader clients), but these are much more communication channels than information sources.
It's possible that your questions might be answered by an FAQ (Frequently Asked Questions) document maintained by a group. These are exactly what they sound like: questions that often arise in the context of a particular group and answers that people in the group have developed over time so that they don't clog up the group. Once you find one or more groups that appear to be in your area of interest, the best thing to do is to subscribe to the group or scan it and look first for the FAQ. Most high-traffic groups will have the FAQs posted every week or two, so it shouldn't be too hard to find. (See Chapter 17, "Discussion Forums," for more details on how to go about this.)
If the FAQ doesn't answer your question, or if it piques your curiosity enough to find out more, then just follow the flow of conversation of the group. You can learn a lot by just reading, and by participating in the discussion as appropriate. We have often found that just subscribing to a newsgroup or Listserv for a few days can be very enlightening.
At some point, though, you may find that the specific answer to your specific question (just what do you feed pet squirrels, anyway?) isn't being responded to on rec.pets, so you decide to ask the question yourself. Here's where a bit of caution and thought are required.
Most of the groups we've spent time with are composed of terrific people who will go to surprising lengths to help other people. We've both had the experience of asking a question and getting several responses within hours: recipes for Siberian ravioli (seriously) or charts showing microcomputer sales over the previous three years broken out by manufacturer. It's amazing!
But sometimes you have to be a bit careful how you ask these questions. You must always remember that many of these groups have become communities where participants have histories and know each other, and have developed mores and norms for discussion. You wouldn't barge into somebody's seminar or living room, and without so much as a hello, start asking them questions. It's rude, and probably won't get you very far towards getting help or an answer. The same sort of thing applies here.
We advise subscribing to a group and spending some time just reading and learning about the group and what goes on there, and how the discussion goes, before posting anything including a question. There are some groups where people can get attitudes and be very touchy about how things are said and, if some of these people decide to get on your case about being a newbie asking stupid questions in a place you don't understand, it can be an unpleasant experience for you. Whether this is a good or appropriate thing or not is a question for another day. The fact is that it can and does happen, and you should be forewarned and prepared. Reading a group for a few days will give you a much better feel for the sensitivity level of people and, most importantly, whether or not your question can be dealt with usefully here. Maybe there's a better group like rec.pets.squirrels or something that can be of more help.
We want to emphasize: don't be intimidated. Most of the groups that we participate in are filled with terrific people who are more than happy to see new folks participating and help them out. You've just got to be a bit careful at first.
This chapter and the preceding chapters in this section have discussed the individual searching tools, their usage, and their problems. While any individual tool might be good for quick and dirty searching, it can be quite a headache to exhaustively search for all of the Internet's information regarding a specific subject. After all, there are times when you will want to know everything that's out there on beer brewing. What techniques can you apply that go beyond asking around in a few Usenet newsgroups or doing a simple Veronica search?
Well, you might think about doing an exhaustive search as the equivalent of an all-out war: you have to use every tool in your arsenal. However, would it be effective to just throw each of your resources at the target all at once? You might find this approach to be frustratingly disorganized, hard to keep track of, and unfulfilling in all respects.
Instead, you need to plan an attack that integrates all the tools described above in an effective way. This means you'll want to use the appropriate tools at the appropriate times: certain approaches can be used to set up the search, others can harvest the bulk of the relevant information, and so on. This section describes a series of generic steps that you can use in most any exhaustive search for information.
You should begin by considering how the information you find is going to be used. If you are the only person who will use this information, your search will be much less complicated. If there is a larger audience involved, you will need to determine whether they share your perspective, vocabulary, and access to the Internet. Are they a scholarly crowd, corporate researchers, high school students? Do they know how to use the Internet's tools already? Are they likely to have the same level of connectivity to the Internet as you do? The answers to these questions should help you determine such things as how to "publish" your "product" (for example, in print or via some Internet tool), what language to use, how much detail to provide, whether or not you need to explain what a Gopher client is, and so forth.
The nature of the subject itself can dramatically affect how your search will go. Because the Internet's information is quite spotty in terms of coverage, you might find that there is too much information related to your subject, or perhaps not enough. Although the process of information searching is iterative, you'll want to avoid changing gears in the middle of this process: backtracking means you'll lose time and perhaps interest in completing your search.
To avoid having this happen, think about both the broader and narrower versions of your subject. For example, if you are interested in finding the Internet's offerings on beer, you might consider the broader category of beverages. In the back of your mind, your narrower interest may really be information on brewing beer. Then keep your eyes peeled for information fitting both the broader and narrower categories during your search. This way you'll be prepared if you have to narrow or broaden your search later on.
Record your current knowledge of the Internet's resources relevant to your subject in a single document. This document can serve as the starting point for your search. Ask yourself what you already know about each resource: how did you learn about it? Who maintains it? Is it valuable or not, and why? The answers will help you determine where else to look, whom to ask for assistance, and what characteristics to look for when evaluating a resource.
Later on you'll be asking for help on various mailing lists and newsgroups. Before you do this, you'll want to demonstrate that you've already made at least some effort to get to know the subject at hand to gain their respect, show that you know what you're doing, and show you're serious. You'll also want to avoid receiving responses that mention the same old well-known resources.
Building on what you already know, a quick-and-dirty search proves to other Netizens that you're someone worth helping out. It can give you an initial sense of how much is out there on your subject; you may find this moment a good time to adjust the broadness or narrowness of your search.
The best tools for quick-and-dirty searching are Veronica and JumpStation. They don't require much expertise and (usually) return usable results.
As described earlier, there are many techniques for identifying groups of people who might share your interest in a subject. What many people forget at this point is to consider issues besides subject, namely how appropriate a specific group's discussion is to your information need. For example, your investigation of artificial intelligence resources might find four or five relevant groups. As you lurk on these groups, you might find some to be high volume "general interest" groups full of unsophisticated ramblings, and worse, many questions. Remember, you're looking for a good forum for your question, and don't want to compete with so many others. So avoid these groups in favor of those that may be narrower or less on your topic but that, instead, provide a tighter and more responsive community.
Once you've spent some time lurking and feel sufficiently comfortable with a few relevant communities, send them a message that briefly describes: 1) who you are; 2) what kind of information you are looking for; 3) where you have already looked; and 4) what you already know (this can be a short listing of resources identified in the steps above). And of course, be sure to state that you will make the results of your search available to the Internet.
Remember that this process can be frustrating in that it won't provide you with the immediate results that a Veronica search might. Nor will your query always be received favorably. On the other hand, remember that people are the best source of information on and about the Internet, and that your query has at least planted some seeds that might take some time to nurture before you can harvest their results.
You've now completed all the necessary preliminaries for doing an exhaustive Internet search. You've thought about such issues as audience and subject, you've found some useful resources, contacted a few relevant communities, and ideally you've started getting some helpful pointers from those communities.
Now you can begin the heavy-duty searching you've been itching to do. Identify a set of keywords, plug them into the searching tools as described in the previous chapters, and see what happens. In fact, this is the best time to apply all the tricks you've now learned for Archie, WAIS, Veronica, and the various Web searchers.
Of course, keep in mind that, as mentioned before, some tools are less than useful for certain purposes, so you'll want to spend less time with them. For example, unless there is a lot of yeast-growth-tracking software out there, Archie probably won't be too useful for your search on beer brewing. Also, you know that certain tools are not your personal favorites. So don't spend too much time wrestling with WAIS at the expense of the rest of your search. You can always come back to it later.
You probably have found everything you're looking for, at least for the moment. You might want to record what you've found as an FAQ or Internet guide so you can come back to it later. Also, good netiquette dictates that you send copies to the kind folks from various mailing lists and newsgroups who helped you out.
You also might want to keep current on your subject. Maybe beer brewing technologies will be revolutionized in the next six months (one can only hope!), so you might want to do the search again in the future. And until the bugs are worked out of Knowbots, filters, and other intelligent agents, you will have to pretty much start your search all over. For this reason, consider briefly recording what you had to go through: which keywords you used, which groups and tools were the most helpful, what resources you found. A little work in this area now could save you a lot of time and hassle in the future.
We've tried, throughout this chapter, to present some ideas and ways of thinking about searching for information over the Internet, from our perspective as people who deal with information for a living. The approach we advocate here (searching where appropriate and efficient, asking people for help when possible and feasible, and pulling varied sources of information together to form a coherent whole) is, we think, at the moment the most sensible one to take. It's not perfect by any means: sometimes haphazard, certainly incomplete and idiosyncratic, but it does produce useful stuff.
It is also the case that, at least in the near future, "traditional" searching tools such as we're accustomed to in commercial, library-oriented systems are not going to arise in the Internet environment. Two reasons, primarily: (1) these are complicated and large projects; nobody's paying for their development and so they probably won't be developed in a big hurry, and (2) the highly distributed and dynamic nature of the Net militates against their development. This is because nobody has sufficient control over what's going on out there to hold it still long enough for it to be "indexed" or "classified," or even described at all well. And besides, the traditional tools don't always work in the print world, either.
The problems of representing and searching for information are large and difficult ones. Centuries of work in libraries have produced mixed results, but on the whole, we have systems that work tolerably well. Over the last few decades, a number of ideas have been proposed from the domains of computer science, notably artificial intelligence, towards solutions of these problems. In our view, these suggestions have oversimplified the problem and thus have, on the whole, failed. The heart of the matter is language (the medium in which information is often bound up) and its ambiguity and flexibility. Information in forms of audio, image, or moving image may present even greater challenges, as we seek to develop vocabularies and structures to represent them.
This is an old story. As new technologies have been developed that affect our ways of storing information (writing, printing, scholarly journals, magnetic storage, and so on), new methods have had to be devised to find it again. Writing led to titles; mass printing led to book catalogs; journals led to indexes; magnetic storage led to data structures. Now we have globally distributed high-speed networks, and a new response is required. The ultimate form (or forms) of that response are at present unclear, but for the moment, a few basics from the library world may suffice.