SEARCHING MULTIPLE DIGITAL LIBRARIES:
A DESIGN ANALYSIS

Prepared for the Digital Library Project

Revised November 2, 1995

Michael Buckland
School of Information Management & Systems
University of California, Berkeley
buckland@otlet.berkeley.edu
ABSTRACT

An analysis of the objectives, problems, and interactions in searching multiple digital libraries. No single library can satisfy all needs. Digitalizing libraries makes the task of extending a search to additional libraries easier. The key questions are: Where to look first?, Where to look next?, and When to stop looking? The task varies with the type of search (all instances, some instances, some "good" instances, etc.); the skewedness of the distribution relevant documents across libraries; duplication in holdings between libraries; differing objectives of service providers; the situational nature of deciding whther two documents should be considered the same, and the relationship between different retrieval techniques and the retrieval task.

  1. THE PROBLEM

    A fundamental problem in library services is that no single library collection will be complete for any given group of users. The distribution of demand over documents in any given domain is known to be highly skewed, a pattern known since 1934 as Bradford's Law of scattering. Tailoring the selection of documents in a collection to match the distribution of demand (one component of collection development) can greatly increase the proportion of demand satisfied by one library.

    With paper-based libraries the inconvenience of extending a search to other libraries encourages users to make do with what is conveniently at hand, thereby diminishing the appearance of inadequacy. Nevertheless, a motivated searcher must always expect to search in multiple libraries whenever completeness is desired or an inquiry leads outside the scope of the local library.

    The issues addressed here apply equally to collections of documents on paper and to digital stores. However, given sufficient standardization and interoperability, the digital library environment should drastically reduce the inconvenience and, thereby, the disincentive associated with extending a search to multiple sites. Extending searches to geographically-distant collections is becoming much easier. Given the rapid proliferation of digital library collections, support for searching multiple sites emerges as centrally important for cost-effective library service.

    The searcher. We assume that the searcher desires one or more documents, but does not know what, if any, suitable documents exist and, even if the existence and identity were known, the searcher does not know where copies can be found. The problem of deciding where to look can be reduced to three operational questions:

    (i) Which library to search first?
    (ii) Which library to search next? and
    (iii) When to stop searching?

    Where to search first (i) can be considered a special case of the more general question of where to search next (ii). In this analysis we concentrate on the sequential search of individual digital libraries, deferring, for another report, examination of simultaneous, parallel searching of multiple libraries and of mixed strategies combining sequential and simultaneous searches.

    The server. We start with the assumption that servers (the providers of library service) are indifferent as to who searches and/or how much searching is done. Later we will examine this assumption and consider the options available to servers that are not indifferent to the amount and nature of searching.

    The purpose of this design analysis is to identify the principal factors involved in the searching of multiple digital libraries, to clarify the relationships between them, and to formulate requirements for operational support. Of particular interest is the challenge posed by a searcher who wants "a few good documents" in a Boolean retrieval environment. Procedural issues of compatibility and interoperability are outside the scope of this report.

    The following sections review different kinds of search query, the notion of relevance, the question of when two documents are "the same", and basics of search theory. We then examine the impact on search effectiveness of three related variables: the distribution of relevant documents across sites; effectiveness in ordering the sequence of sites in the search; and the effects of duplication in document holdings across sites. Finally we will briefly consider the options open to the servers and note some technical consequences of the differing ordering power of Boolean and probabilistic retrieval techniques.

  2. SEARCH SPECIFICATION

    We categorize searches in the following ways:

    Types of search I: Instances

    (a) An "every instance" search is one in which a search is deemed satisfied when every different document with the specified characteristics have been found. Since any collection might contain an instance, an "every instance" search requires that every library be searched.

    In this discussion we are concerned with instances of different document (types) rather than with duplicate copies (tokens) of the same document. Ordinarily duplicate copies are not of interest, after the first, because redundant. However, seeking every copy (token) of each different document (type) may sometimes be needed. We call this a "census search". A traditional example is the compilation of a census of all known copies of particular kinds of early printed works. An example that becomes more feasible with digital libraries would be locating all known copies of document in order to make corrections or additions.

    (b) An "any single instance" search is a search that is deemed satisfied when any instance of a document with the specified characteristics has been found.

    (c) An "any N instances" search is a search that is deemed satisfied when any N instances of documents with the specified characteristics have been found.

    An "any single instance" search can be treated as a special case of "any N instances" searching where N equals 1. Although an "every instance" search is different in kind from "any N instances" searches, it, too, could also be treated as an example of an "any N instances" search if N is set high enough to ensure that all sources will be searched before N instances have been found.

    (d) An "extreme instance" search is a search for whatever document has an extreme value for one or more attributes, e.g. the most recent document of some type.

    Types of search II: "Good" (or "preferred") documents

    Conventional Boolean search systems in operational bibliographical and library systems ordinarily operate on a primitive, unambiguous binary distinctions: Documents categorized as being "about information retrieval" or "by Boccaccio". Of all possible values of all defined attributes, the presence of one or more is specified as a requirement for retrieval. The presence of other attribute values is, formally, a matter of indifference so far as the formal search specification is concerned. In real life a common form of search is for a one or a few "good" documents, rather than just any one or few instances of documents with the specified characteristics. In other words, although a few randomly selected documents that satisfy the search specifications might sometimes be satisfactory, it is likely that a few good books on information retrieval theory are desired. So support for searching for "good" documents is a desirable design requirement.

    We regard searches for "good" books as differing from searches for "any" one or more instances by being characterized by preferences, i.e. "I want some books about information retrieval and, given a choice, I would prefer them to be recent, in English, and by a well-reputed author." Of course, in the design of formal systems we can only be concerned with situations in which preferences can be operationally specified.

    Specification

    Mechanical information systems require that a search has to be expressed formally, but, in practice, searches more complex than simple "look-up" are unlikely to be completely specified for several reasons. Only some attributes are present in the data; only a few of those provided can be searched; and only a limited number of forms of search are supported (e.g., typically, right-hand truncation but not term adjacency). Complex specifications are inconvenient to formulate and searchers have a low tolerance for complexity in search specification. Empirical studies of online library catalog usage have consistently shown that functionality for specifying complex searches is little used (e.g. Norgard et al. 1993).

    The final selection is ordinarily not done by the information retrieval system, but by the human searcher, picking over, with criteria that are probably not only subjective, but also ill-defined and possibly unconscious, the set eventually generated by the interaction with the retrieval system. In this way limitations of mechanical retrieval systems are alleviated. Exploring how far support for this final human sorting can be incorporated into the formal retrieval system is itself a valid part of the agenda for research and development in selection systems. However, we conclude thatthe design task is not to try to automate this final choice, but, rather, to empower the human searcher to understand the options better.

    For these reasons, we believe that the design of retrieval systems should be include support searches for "some good" documents, whether searching in one or multiple libraries. Searches for "some good" documents are characterized by preferences, e.g. "Given a choice, prefer a document that is recent". Hence, in addition to the binary values of conventional Boolean searching (i.e. the specification of attribute values that are either mandatory or else a matter of indifference), we assume that systems designed to support searching for "good" documents need to be able to handle a three-valued approach to attribute values:

    Specifying a preference, a conditional requirement, means that the attribute itself is situationally significant, taking effect if and when prescribed circumstances are encountered. Of course, there may well be multiple preferences, e.g. "I'd prefer an English version and, if possible, the 1934 London edition."

    A conditional, adaptive approach appears to be particular useful:

    (i) When an absolute (or approximate) number of documents is desired and the population of (more or less) desirable documents is unknown and difficult to predict. If prediction were easy, then it would not be difficult to formulate a more or less restricted binary Boolean search that would yield a set of approximately the desired size of retrieval set. There is substantial empirical evidence that difficulty in predicting the size of retrieved sets is a major practical problem in searching bibliographies and online catalogs.

    (ii) When multiple sources are to be considered, since the larger the number of sources to be searched the less likely that the population of desirable documents can be predicted.

    (iii) When the contents of the libraries are somewhat duplicative. This makes the number of different documents in a set of libraries more difficult to estimate. (More on this below).

    Simply stated, preferences are conditional. They take effect situationally, adaptively. The more difficult it is to predict the outcome of searches, the more desirable it is to develop adaptive search strategies that are adaptive in the sense of incorporating conditional specifications as noted above. The larger the universe of digital libraries the greater the difficulty of predicting what would be found and the greater the need to support adaptive, heuristic searching able to incorporate preferences.

  3. RELEVANCE

    Traditional practice in retrieval research has been to use the term "relevant" for documents that should be retrieved. This convention has proven highly convenient for quantitative analysis even though studies of relevance judgements routinely find them to be highly inconsistent.

    The meaning, hence also the aboutness, significance, validity, and probable utility, of documents are subjectively and situationally constructed by the individual users, it follows that relevance judgements cannot be considered objective or expected to be consistent. (See, e.g., Schamber 1994). In this situation, allowing increased scope for accommodating searchers preferences can be expected to improve the potential effectiveness of a retrieval system.

  4. SAMENESS AND SUBSTITUTABILITY

    The question of whether two instances of a document are really "same" or not becomes important. A rigorous position is that, strictly speaking, there is no such thing as "the same" with respect to documents. Two copies of the same document are just that: two different instances and not, therefore, the same. In practice "the same" has a pragmatic interpretation. It means that two or more objects are equally acceptable for some purpose. A photocopy of a document using only one side of the paper is distinguishable from another one made using both sides of the paper. If we wish only to read the text, the different versions are "all the same to us" and one could be a substitute for the other. However, if we intended to cut and paste the paper text, the difference would matter. Similarly, we might find ascii and bit-mapped images of a text equally acceptable for some purposes, but not for others. A low-resolution version of a high-resolution image may be as acceptable in some circumstances but not in others. It follows that the determination of sameness, being situational, can only be done during the selection process, not at the time of storage. The most that can be done during storage and indexing (i.e. prior to retrieval) is to categorize the documents so as facilitate future decisions concerning substitutability and, also, to adopt some conventional threshold of sameness. For printed documents on paper the conventional technical threshold for "same" is the "edition", with the terms "impression", "variant" and "copy" (or "exemplar") being used for finer distinctions when needed.

  5. SEARCH THEORY

    What follows briefly summarizes the most relevant elements of search theory.

    Search sequence

    (a) Likelihood of finding. If the cost of searching different sources were identical or irrelevant, the optimal search strategy is to look first at the source most likely to contain what is sought, if we happened to have such a priori knowledge, then the second most likely, and so on. In other words the sources should be ranked and searched in decreasing likelihood of finding what is sought.

    (b) Cost of searching. If the expected likelihood of search success were the same for all sources, then the least expensive search strategy would be to start with the source that is the least expensive to search, then (if need be) move on to the least expensive remaining source, and so on. In other words, the sources should be ranked and searched in order of the increasing search cost.

    (c) Cost-effective searching. Where the sources vary with respect to both the likelihood of finding and to the cost of searching, the optimal search strategy is achieved by establishing the expected cost-benefit ratio for each source, expressed as the likelihood of finding what is sought divided by the expected cost of searching that source. One then starts with the source with the highest value for this ratio, then moves on that with the highest ratio among the remaining sources, and so on. In other words, the sources should be ranked and searched in decreasing order of the expected search success / search cost ratio.

    Stopping the search

    While these rules tell one where to look, they do not tell one whether or when to stop searching before all sources have been searched. For this it is necessary to compare the marginal cost-benefit with some other, alternative use of resources.

    One may hope to complete a search for "any instance" or "any N instances" with one or a few sources. Continuing after one (or N) have been found would be pointless even though more might be found. On contrast, "every instance" and "extreme instance" searches require every source to be searched because the very last source may be where the last or most extreme instance is to be found.

    Searches for "N good" instances are more complex. If N instances have been found that fully satisfy all the required and preferred attribute values, further searching becomes pointless and the same situation as a satisfied "any N instances" search has been reached. If the required attribute-values have been satisfied, but the preferred attribute values have been incompletely achieved, there remains the possibility of benefit in continuing the search: This implies a heavy weight is placed on the preferred attribute and, to the extent that that has been done, the search comes to resemble an "extreme value search".

    Search diseconomy

    If there is a cost in, for example, time, effort, or money, associated with searching, searching all sources may not be affordable or warranted. In particular, since libraries tend to have at least some degree of overlap in contents, diminishing returns in terms of search results are to be expected as the number of sources searched increases. The marginal benefit will tend to diminish but the marginal cost can be expected to remain stable or to increase. The combined effect is an increase in the marginal cost per relevant instance found as well as a steady increase in overall cost for the entire search and, therefore, an increase in average cost per relevant instance found. At some stage the cost of continuing the search is likely to become uneconomical, depending on the circumstances and assumed alternative uses of constrained resources. There is an issue of scalability: the larger the number of digital libraries, the less plausible the attempt to search them all.

    Searches for "good" instances involve preferences and the strength of the preference is a significant variable with respect to the stopping decision. If the preference is mild, close to indifference (e.g., "I'd prefer a version in WordPerfect 5.1 but ascii or any common wordprocessing format would do"), then this preference would be useful in discriminating if and when ample instances have already been found and there is a choice to be made from among them, but it provides weak justification for continuing the search. If the preference is strong (e.g., "I want the original 1934 version of the text if at all possible."), then the preference is closer to being mandatory and is a stronger argument for continuing the search.

  6. DUPLICATION

    We define duplication as the existence in two or more libraries of instances of the same document. ("Same", as noted above, is not a simple concept. We assume, in what follows, an arbitrary level of similarity, e.g. of textual equivalence, as in copies of the same edition.)

    Libraries, digital or otherwise, are, by their nature, characterized by high levels of duplication. Indeed, one of the principal expected benefits of the move from paper-based to digital libraries is in the massive cost-savings expected to result from an expected reduction in duplication (Lemberg 1995).

    High levels of duplication have important consequences for searching. The marginal cost-benefit of extending a search to another source is very sensitive to the degree of duplication because the greater the degree of duplication the less the marginal benefit of extending the search. At the limiting case in which all sources are entirely duplicative, i.e. containing exactly the same documents, it would be pointless to extend the search beyond the first source. If, however, there were no duplication between the sources (i.e. entirely disjoint sets), then one would expect to search multiple sources for all except easily satisfied searches.

    A sensible first approach to the selection of which of multiple libraries to search is to check or to estimate the overlap between the search specification and the contents of each source separately, and tools are emerging for this purpose (e.g. CHECK (Buckland, Butler, Kim, Norgard & Plaunt 1995); GlOSS (Gravano, Garcia-Molina & Thomas 1994). But CHECK and GlOSS, to the extent that they provide estimates of the existence of relevant documents in a digital libraries provide estimates of what relevant documents would be found if that source were the first or only) source searched. After that, such an estimate needs to be discounted by the degree of duplication between that source and all the sources that have already been searched. One could also say that CHECK and GlOSS estimate of the upper limit of the number of relevant instances.

    With significant and varied degrees of overlap, the optimal choice of sites to be searched can even depend on number of sites searched. Consider the hypothetical example in the following Venn diagram showing the duplication of relevant documents in three sources:


    Figure 1: Venn diagram of duplication among three sources.

    If only one source were searched the optimal choice would be source A which would yield 10 relevant documents, better than 9 for B or 8 for C. Having searched A, the best next choice would be to search B, increasing recall from 10 to 15. But if two sources were to be searched, searching B and C, and not searching A, would be the best strategy, yielding a combined total of 16 relevant documents, compared with 15 for A and B and 14 for A and C. Although such examples can be demonstrated, empirical studies might show that the frequency and magnitude of such non-optimality is insignificant in practice.

    The methodology for detailed analysis of duplication among library collection is described in Buckland, Hindle & Walker 1975. The underlying mechanism is the creation of a large data matrix composed of a row for each document and a column for each location. The number of copies of that document held at that location is entered in each cell as in Table 1.


    Table 1. Matrix of holdings of three documents in seven libraries.

    The effects of searching the libraries in any given order is shown by rearranging the columns into the specified search order, as in Table 2.


    Table 2. Matrix of holdings of documents in reordered libraries.

    The columns are then inspected and the recall (i.e. number of relevant retrieved documents found) is counted: A "1" indicates that a copy of a relevant document has been found, but it is duplicative unless it is the first 1 found in its row (as underlined). In this simple example, the first library searched (B) yielded one relevant document (a). The second library (G) yielded another relevant document (c). The next two libraries (F, A) add nothing: Each yields one relevant document but it is duplicative. The fifth library searched yields three instances of relevant documents (a, b, c), but only one of these (b) is non-duplicative, so the useful increment in recall is one. Extending the search to libraries C and E would not add any new documents. It can be seen, with hindsight, that starting (and ending) with library D would have been a better search order. The Recall row shows the marginal contribution of each library. The Cumulative recall row provides the data for drawing a Recall curve. This simple model is sufficient for basic Boolean searches. To study more complex searches involving preferences and adaptive search statements, some additional elaboration is needed. A convenient approach is to disaggregate the matrix with a dimension for each attribute intended to be used as a preference. (Examples can be found in RECALL PERFORMANCE WITHOUT DUPLICATION

    A graph of the cumulative increase in the number of relevant documents found ("Recall") as a search is expanded is standard practice in information retrieval. It can be readily adapted to the present analysis of multiple digital libraries if one treats the contents of the digital libraries as one large collection, segmented by library. Such as graph shows the increase in recall as the search is extended from one digital library to the next, as in Figure 2.

    No skewedness, no duplication

    As an illustration we imagine ten digital libraries, each with the same number of relevant documents, but with no duplication between them. We find that:

    (i) As the search is expanded from one source to the next, the number of documents retrieved (the marginal recall), stays constant. This is shown in the straight diagonal line D in Figure 2;

    (ii) Under these conditions different search orders make no difference to the number retrieved from each library. Ordering the sources randomly is as good a solution as any choice of ordering.

    (iii) An "every instance" search must always continue until all sources have been searched, but if, say, N instances are all that is wanted, they would, in this example, have been retrieved with the searching of the 6th source. Thereafter no additional instances are wanted, the marginal recall becomes zero and the slope of the recall curve becomes flat as indicated by the D'.

    The ordering of the sources is assumed to be a matter of indifference to the searcher, but it might not be if the costs of searching differed between sources. The ordering of the sources makes no difference to what they contribute in the "every instance" search, but it does in the "any 6 instances" case: sources 7 - 10 are not searched.


    Fig. 2. Recall curve: No skewedness, no duplication.

    Skewed distribution, no duplication. We continue the assumption of no duplication within the set of libraries, but relax the assumption about all libraries contributing the same number of instances. We start with the most extreme case in which one and only one library contains all the relevant items. We find that the recall curve is very sensitive to the search order as shown in Fig. 3.


    Fig. 3. Cumulative recall curves.

    With a optimal ordering the library containing all relevant items would be searched first and the recall curve would rise vertically, then remain flat (curve Pt in Fig. 3). Any given search order involves a vertical line where the productive library is reached. With the worst possible ordering all unproductive libraries are searched first and recall curve remains flat until the last source is found, whereupon it rises vertically (curve Pv). These two extreme cases define the limits, the region of possible recall results. Over many trials with random ordering the average will tend to the diagonal line D.

    Where "any N instances" are sought (N being less than "every"), the value of N determines where the horizontal line appears but, otherwise not the shape of the graph.

    We can generalize from this discussion that where ordering tends to be better than random, the recall curve will be in the upper left region, above the diagonal, and the better the ordering the more the recall curve will tend to the extreme, limiting case of curve Pt.

    Cases in which the distribution of relevant documents is less skewed, similar but less extreme results emerge. In brief, still assuming no duplication, whenever the distribution of documents is at all skewed and the search ordering is better than random, we find a convex cumulative recall curve rising monotonically from the origin and reaching the top right corner when the last source has been searched. (See line R in Fig. 3). One might say that the skewedness allows greater curvature and better ordering achieves greater curvature.

    Skewedness and ordering amplify each other's effect. With random ordering, the recall curve tends to a straight diagonal regardless of skewedness. With zero skewedness, the ordering is irrelevant. What really makes for effective recall is the combination of skewedness with good ordering.

  7. RECALL PERFORMANCE WITH DUPLICATION

    No skewedness, with duplication. In the limiting, extreme case of complete duplication, all libraries contain copies of the same documents. The effect is the same as the case of optimal ordering with extreme skewedness where one library contains all the relevant items and one inerringly picks that library first, described above and shown in curve Pt in Fig. 3. The difference is that while the result is the same, the search order has become irrelevant: With complete duplication, any ordering will have this same result.

    As the degree of duplication increases from none to complete duplication the cumulative recall curve shifts gradually from the straight, diagonal line D of Fig. 3 to the extreme skewedness of line Pt. Increasing duplication moves the recall curve toward the top left.

    Skewedness, with duplication

    In practice the situation ordinarily encountered is the fourth combination: a skewed (i.e. uneven) distribution of relevant documents across digital libraries with some duplication. The complexity of this combination resists tidy analysis, but two different approaches are possible.

    (i) Techniques such as CHECK and GlOSS can be used to estimate the expected recall from each digital library viewed in isolation. In a sequential search, these estimates should, after the first site, be considered to be estimates of the upper limit of marginal recall, since the estimates need to be discounted for duplication with documents already found at previously searched sites. The degree of duplication between the contents of any given library and the contents of the previously searched libraries can be estimated using the techniques described in Buckland, Hindle and Walker (1972).

    This approach has two disadvantages: (a) Estimates of duplication between collections or even between subsets of documents in the collections may not be a reliable predictor for any highly specific search query; and (b) This approach does not scale well. As the number of digital libraries increases, the discounting for duplication becomes progressively more important and progressively more complex with the very rapid increase in ordering possibilities as the number of searchable sites increases.

    (ii) Another approach is to use the search result at each site cumulatively to form progressively more reliable predictions of the likely marginal benefits of continuing the search. Stated very crudely, when, as the search progresses, the marginal recall becomes trivial, as evidenced by consistent zeros and the Cumulative Recall curve R in Fig. 3 becomes flat, then it probably time to discontinue the search. (A separate technical report on this aspect is intended).

  8. ADAPTIVE SEARCHES FOR "N GOOD" INSTANCES

    The search for "N good" instances of the form "given a choice, I would prefer" implies an adaptive retrieval strategy that can respond dynamically to the results encountered, either during the search or in post-search processing of retrieval results. A search for "N good" instances depends on an explicit specification of preferred values for one or more attribute-values as the basis for "better" to be determined.

    In terms of the present analysis this can be viewed as an elaboration of the data presented in Table 2. There would need to be an additional dimension to the table for a disaggregation of the values of each attribute (e.g. date of publication) upon which preferences would operate. Some empirical examples of such data can be found in Buckland, Butler, Norgard and Plaunt (1992).

  9. OPTIONS OF THE DIGITAL LIBRARY

    Discussion has so far been in terms of the interests of the searcher: How to achieve a cost-effective search of multiple sites. But what if the sites themselves are not indifferent to whether or not they are searched? We are not aware of any studies of motivations in this situation but we can easily surmise two different attitudes:

    (a) Commercial sources, such as DIALOG, can, in general, be expected to wish to increase the number of search received, as this would tend to increase revenue in a business dominated by fixed costs.

    (b) Public sector sources, typically motivated by political or public service motivations, may well wish to provide a search service yet hope that the search traffic from outside the clientele they are funded to serve will be minimal, at least during peak hours. Online catalogs of university libraries, made available as a public service, are likely feel this way.

    How, then, should a service position itself to increase (or decrease) the number of searches received? Other things, being equal, a service motivated to change the number of searches received should seek to change its position in the order in which services are searched:

    (a) The later its position in the search order, the greater the likelihood that the search will have been completed before reaching this source. With an "any N" instances search, N instances may already have been found and the search discontinued before the search reaches this source. Likewise, with an "any N good" instance search, a sufficiently satisfactory recall may have been achieved before the search reaches the source. Similarly, an earlier position increases that likelihood.

    (b) With any search, it may have been decided that, because of diminishing returns, the search was no longer worth continuing and so has been discontinued rather than completed, but with the same effect for the source.

    (c) Where there is duplication, the later a source is positioned, the greater the likelihood that any relevant instances are duplicative of those already found and so not of value. More precisely, it is not so much the position in the ordering of sources but the degree of overlap with the sources already searched, but, whatever the degree of overlap, it can only increase as a source's position in the order recedes.

    Any empirical data capable of generating Cumulative Recall curves of the form shown in Figures 3 could also be used to demonstrate the effects of any given change in the ordering of the sources. One would merely need to re-examine the empirical data with a different ordering.

    We note that if two services were to merge they would become a single larger service. The same effect is achieved if a union (i.e. combined) index to two or more services is created even though the services themselves were not merged. The effect is to enable the searcher to survey more services with fewer searches. At the limit, with just one comprehensive union index there would be only one place to search. Whether or not to merge or to participate in a union index is strategic decision for each digital library. The optimal strategy to minimize (maximize) the number of searches received would seem to be non-participation in a union catalog if it can also be arranged that the source would be searched (if at all) after (before) the union index.

    Further analysis would require clarification of what the objectives of the service might be. For example, a service may seek to increase (or decrease) the number of times it is searched or it may seek to increase (or decrease) the number of relevant items found. These are not the same, though the effects may be similar.

  10. EMPIRICAL ANALYSES

    Work is in progress with two different emphases:

    (i) The effects of any definable search strategy on any imaginable number of digital libraries and of skewedness and duplication in their collections can be handled by creating matrices of the form shown in Tables 2 with additional dimensions for each attribute of interest.

    Real data in this form pertaining to networks of up to 100 actual libraries can derived by postprocessing searches in the University of California's MELVYL online union catalog. MELVYL search results can be disaggregated by library so that what is held in any given library can be examined. The data from any given search can be fed into a matrix of the form shown in Table 1 and treated accordingly, i.e. re-arranged in any given ordering of the libraries with information on what is duplicative retained. From this rearranged data the effects that would resulted from searching the libraries in any given order can be reconstructed, recall graphs like those in Figure 2 constructed, and other analyses made.

    However, for the purpose of exploratory research and sensitivity analysis, the collection of empirical data is not necessary. All that is necessary is the creation of data in the form of Table 1 extended to any arbitrary number of sources.

  11. RETRIEVAL TECHNIQUES

    Retrieval systems generate a more or less weak ordering of documents. At an extreme, simple Boolean techniques yield a binary partitioning: Retrieved and Not-retrieved. The problem is that with such a weak ordering it is not practical to predict the absolute size of the retrieved set. Therefore, if, as is usual, one does not want every instance retrieved, it is not realistic to expect to be able to formulate a search statement that yields any specified number of documents. Hence, with retrieval techniques with weak ordering, one has to modify the search statement iteratively until a suitably-sized retrieved set results. The "FEWER" command of the OASIS experimental prototype provides computer-assisted support for this within a single database environment (Buckland, Butler, Norgard & Plaunt 1992). Generalizing this approach to multiple database environment would provide the basis in an environment of simple Boolean services for an intelligent "knowbotic" agent capable of adapting search statements adaptively in response to what is found.

    Retrieval techniques with stronger ordering capability lend themselves, such as Boolean systems supporting weighted search queries, have less need for adaptiveness during the search. Probabilistic retrieval systems ordinarily yield rather strictly ordered retrieved sets, but are ordinarily based on analysis of the relationship between an individual document and all other documents in the same database ("term frequency / inverse document frequency") and, therefore, do not lend themselves to searches of multiple databases.

    Research on retrieval techniques for searching multiple digital libraries is in progress and will be the subject of future reports.

    Acknowledgements

    This report was written for and partially supported by the NSF, ARPA, NASA Digital Libraries Initiative grant to the UC Berkeley Digital Libraries Project. It was also supported by the U.S. Department of Education through HEA IID grant R197D40008 "Online access in multiple database environments", part of the OASIS research program.

    References

    Buckland, M. K., M. H. Butler, Y. Kim, B. A. Norgard & C. Plaunt. 1995. Partnerships in navigation: An information retrieval research agenda. In: ASIS '95: Proceedings of the 58th American Society for Information Science Annual meeting, Chicago, Oct. 8-12, 1995. Medford, NJ: Information Today, 1995. 84-89

    Buckland, M. K., M. H. Butler, B. A. Norgard & C. Plaunt. 1992. OASIS: A front-end for prototyping catalog enhancements. Library Hi Tech Issue 40: 7-22.

    Buckland, M. K., A. Hindle & G. P. M. Walker. 1975. Methodological problems in assessing the overlap between bibliographical files and library holdings. Information Processing and Management 11: 89-105.

    Flater, D. W. & Y. Yesha. 1993. Towards flexible distributed information retrieval. In: Adam, N. R. & B. K. Bhargava, eds. Advanced database systems. Berlin: Springer, 259-276.

    Gravano, L., H. Garcia-Molina & A. Thomas. 1994. The effectiveness of GlOSS for the text database discovery problem. SIGMOD Record 23: 126-137.

    Lemberg, R. 1995. A life-cycle cost analysis of the creation, storage and dissemination of a digitized document collection. DLIS dissertation, University of California, Berkeley.

    Norgard, B. A., M. G. Berger, M. K. Buckland and C. Plaunt. 1993. The online catalog: From technical services to access service. Advances in librarianship 17: 111-148.

    Schamber, L. 1994. Relevance and information behavior. Annual Review of Information Science and Technology 29: 3-48.