An analysis of the objectives, problems, and interactions in searching multiple digital libraries. No single library can satisfy all needs. Digitalizing libraries makes the task of extending a search to additional libraries easier. The key questions are: Where to look first?, Where to look next?, and When to stop looking? The task varies with the type of search (all instances, some instances, some "good" instances, etc.); the skewedness of the distribution relevant documents across libraries; duplication in holdings between libraries; differing objectives of service providers; the situational nature of deciding whther two documents should be considered the same, and the relationship between different retrieval techniques and the retrieval task.
A fundamental problem in library services is that no single library collection will be complete for any given group of users. The distribution of demand over documents in any given domain is known to be highly skewed, a pattern known since 1934 as Bradford's Law of scattering. Tailoring the selection of documents in a collection to match the distribution of demand (one component of collection development) can greatly increase the proportion of demand satisfied by one library.
With paper-based libraries the inconvenience of extending a search to other libraries encourages users to make do with what is conveniently at hand, thereby diminishing the appearance of inadequacy. Nevertheless, a motivated searcher must always expect to search in multiple libraries whenever completeness is desired or an inquiry leads outside the scope of the local library.
The issues addressed here apply equally to collections of documents on paper and to digital stores. However, given sufficient standardization and interoperability, the digital library environment should drastically reduce the inconvenience and, thereby, the disincentive associated with extending a search to multiple sites. Extending searches to geographically-distant collections is becoming much easier. Given the rapid proliferation of digital library collections, support for searching multiple sites emerges as centrally important for cost-effective library service.
The searcher. We assume that the searcher
desires one or more documents, but does not know what, if any,
suitable documents exist and, even if the existence and identity were
known, the searcher does not know where copies can be found. The
problem of deciding where to look can be reduced to three operational
questions:
(i) Which library to search first?
Where to search first (i) can be considered a special case of
the more general question of where to search next (ii). In this
analysis we concentrate on the sequential search of individual digital
libraries, deferring, for another report, examination of simultaneous,
parallel searching of multiple libraries and of mixed strategies
combining sequential and simultaneous searches.
The server. We start with the assumption that
servers (the providers of library service) are indifferent as to who
searches and/or how much searching is done. Later we will examine
this assumption and consider the options available to servers that are
not indifferent to the amount and nature of searching.
The purpose of this design analysis is to identify the principal
factors involved in the searching of multiple digital libraries, to
clarify the relationships between them, and to formulate requirements
for operational support. Of particular interest is the challenge
posed by a searcher who wants "a few good documents" in a Boolean
retrieval environment. Procedural issues of compatibility and
interoperability are outside the scope of this report.
The following sections review different kinds of search query, the
notion of relevance, the question of when two documents are "the
same", and basics of search theory. We then examine the impact on
search effectiveness of three related variables: the distribution of
relevant documents across sites; effectiveness in ordering the
sequence of sites in the search; and the effects of duplication in
document holdings across sites. Finally we will briefly consider the
options open to the servers and note some technical consequences of
the differing ordering power of Boolean and probabilistic retrieval
techniques.
We categorize searches in the following ways:
Types of search I: Instances
(a) An "every instance" search is one in which a search is deemed
satisfied when every different document with the specified
characteristics have been found. Since any collection might contain
an instance, an "every instance" search requires that every library be
searched.
In this discussion we are concerned with instances of different
document (types) rather than with duplicate copies (tokens) of the
same document. Ordinarily duplicate copies are not of interest, after
the first, because redundant. However, seeking every copy (token) of
each different document (type) may sometimes be needed. We call this
a "census search". A traditional example is the compilation of a
census of all known copies of particular kinds of early printed works.
An example that becomes more feasible with digital libraries would be
locating all known copies of document in order to make corrections or
additions.
(b) An "any single instance" search is a search that is deemed
satisfied when any instance of a document with the specified
characteristics has been found.
(c) An "any N instances" search is a search that is deemed satisfied
when any N instances of documents with the specified characteristics
have been found.
An "any single instance" search can be treated as a special case of
"any N instances" searching where N equals 1. Although an "every
instance" search is different in kind from "any N instances" searches,
it, too, could also be treated as an example of an "any N instances"
search if N is set high enough to ensure that all sources will be
searched before N instances have been found.
(d) An "extreme instance" search is a search for whatever document has
an extreme value for one or more attributes, e.g. the most recent
document of some type.
Types of search II: "Good" (or "preferred") documents
Conventional Boolean search systems in operational bibliographical
and library systems ordinarily operate on a primitive, unambiguous
binary distinctions: Documents categorized as being "about information
retrieval" or "by Boccaccio". Of all possible values of all defined
attributes, the presence of one or more is specified as a requirement
for retrieval. The presence of other attribute values is, formally, a
matter of indifference so far as the formal search specification is
concerned. In real life a common form of search is for a one or a few
"good" documents, rather than just any one or few instances of
documents with the specified characteristics. In other words,
although a few randomly selected documents that satisfy the search
specifications might sometimes be satisfactory, it is likely that a
few good books on information retrieval theory are desired. So
support for searching for "good" documents is a desirable design
requirement.
We regard searches for "good" books as differing from searches for
"any" one or more instances by being characterized by preferences,
i.e. "I want some books about information retrieval and, given a
choice, I would prefer them to be recent, in English, and by a
well-reputed author." Of course, in the design of formal systems we
can only be concerned with situations in which preferences can be
operationally specified.
Specification
Mechanical information systems require that a search has to be
expressed formally, but, in practice, searches more complex than
simple "look-up" are unlikely to be completely specified for several
reasons. Only some attributes are present in the data; only a few of
those provided can be searched; and only a limited number of forms of
search are supported (e.g., typically, right-hand truncation but not
term adjacency). Complex specifications are inconvenient to formulate
and searchers have a low tolerance for complexity in search
specification. Empirical studies of online library catalog usage have
consistently shown that functionality for specifying complex searches
is little used (e.g. Norgard et al. 1993).
The final selection is ordinarily not done by the information
retrieval system, but by the human searcher, picking over, with
criteria that are probably not only subjective, but also ill-defined
and possibly unconscious, the set eventually generated by the
interaction with the retrieval system. In this way limitations of
mechanical retrieval systems are alleviated. Exploring how far
support for this final human sorting can be incorporated into the
formal retrieval system is itself a valid part of the agenda for
research and development in selection systems. However, we conclude
thatthe design task is not to try to automate this final choice, but,
rather, to empower the human searcher to understand the options
better.
For these reasons, we believe that the design of retrieval systems
should be include support searches for "some good" documents, whether
searching in one or multiple libraries. Searches for "some good"
documents are characterized by preferences, e.g. "Given a choice,
prefer a document that is recent". Hence, in addition to the binary
values of conventional Boolean searching (i.e. the specification of
attribute values that are either mandatory or else a matter of
indifference), we assume that systems designed to support searching
for "good" documents need to be able to handle a three-valued approach
to attribute values:
Specifying a preference, a conditional requirement, means that the
attribute itself is situationally significant, taking effect if and
when prescribed circumstances are encountered. Of course, there may
well be multiple preferences, e.g. "I'd prefer an English version and,
if possible, the 1934 London edition."
A conditional, adaptive approach appears to be particular useful:
(i) When an absolute (or approximate) number of documents is desired
and the population of (more or less) desirable documents is unknown
and difficult to predict. If prediction were easy, then it would not
be difficult to formulate a more or less restricted binary Boolean
search that would yield a set of approximately the desired size of
retrieval set. There is substantial empirical evidence that
difficulty in predicting the size of retrieved sets is a major
practical problem in searching bibliographies and online catalogs.
(ii) When multiple sources are to be considered, since the larger the
number of sources to be searched the less likely that the population
of desirable documents can be predicted.
(iii) When the contents of the libraries are somewhat duplicative.
This makes the number of different documents in a set of libraries
more difficult to estimate. (More on this below).
Simply stated, preferences are conditional. They take effect
situationally, adaptively. The more difficult it is to predict the
outcome of searches, the more desirable it is to develop adaptive
search strategies that are adaptive in the sense of incorporating
conditional specifications as noted above. The larger the universe of
digital libraries the greater the difficulty of predicting what would
be found and the greater the need to support adaptive, heuristic
searching able to incorporate preferences.
Traditional practice in retrieval research has been to use the term
"relevant" for documents that should be retrieved. This convention
has proven highly convenient for quantitative analysis even though
studies of relevance judgements routinely find them to be highly
inconsistent.
The meaning, hence also the aboutness, significance, validity, and
probable utility, of documents are subjectively and situationally
constructed by the individual users, it follows that relevance
judgements cannot be considered objective or expected to be
consistent. (See, e.g., Schamber 1994).
In this situation, allowing increased scope for accommodating
searchers preferences can be expected to improve the potential
effectiveness of a retrieval system.
The question of whether two instances of a document are really "same"
or not becomes important. A rigorous position is that, strictly
speaking, there is no such thing as "the same" with respect to
documents. Two copies of the same document are just that: two
different instances and not, therefore, the same.
In practice "the same" has a pragmatic interpretation. It means that
two or more objects are equally acceptable for some purpose. A
photocopy of a document using only one side of the paper is
distinguishable from another one made using both sides of the paper.
If we wish only to read the text, the different versions are "all the
same to us" and one could be a substitute for the other. However, if
we intended to cut and paste the paper text, the difference would
matter. Similarly, we might find ascii and bit-mapped images of a
text equally acceptable for some purposes, but not for others. A
low-resolution version of a high-resolution image may be as acceptable
in some circumstances but not in others.
It follows that the determination of sameness, being situational, can
only be done during the selection process, not at the time of storage.
The most that can be done during storage and indexing (i.e. prior to
retrieval) is to categorize the documents so as facilitate future
decisions concerning substitutability and, also, to adopt some
conventional threshold of sameness. For printed documents on paper
the conventional technical threshold for "same" is the "edition", with
the terms "impression", "variant" and "copy" (or "exemplar") being
used for finer distinctions when needed.
What follows briefly summarizes the most relevant elements of search theory.
Search sequence
(a) Likelihood of finding. If the cost of searching different sources
were identical or irrelevant, the optimal search strategy is to look
first at the source most likely to contain what is sought, if we
happened to have such a priori knowledge, then the second most likely,
and so on. In other words the sources should be ranked and searched
in decreasing likelihood of finding what is sought.
(b) Cost of searching. If the expected likelihood of search success
were the same for all sources, then the least expensive search
strategy would be to start with the source that is the least expensive
to search, then (if need be) move on to the least expensive remaining
source, and so on. In other words, the sources should be ranked and
searched in order of the increasing search cost.
(c) Cost-effective searching. Where the sources vary with respect to
both the likelihood of finding and to the cost of searching, the
optimal search strategy is achieved by establishing the expected
cost-benefit ratio for each source, expressed as the likelihood of
finding what is sought divided by the expected cost of searching that
source. One then starts with the source with the highest value for
this ratio, then moves on that with the highest ratio among the
remaining sources, and so on. In other words, the sources should be
ranked and searched in decreasing order of the expected search success
/ search cost ratio.
Stopping the search
While these rules tell one where to look, they do not tell one whether
or when to stop searching before all sources have been searched. For
this it is necessary to compare the marginal cost-benefit with some
other, alternative use of resources.
One may hope to complete a search for "any instance" or "any N
instances" with one or a few sources. Continuing after one (or N)
have been found would be pointless even though more might be found.
On contrast, "every instance" and "extreme instance" searches require
every source to be searched because the very last source may be where
the last or most extreme instance is to be found.
Searches for "N good" instances are more complex. If N instances have
been found that fully satisfy all the required and preferred attribute
values, further searching becomes pointless and the same situation as
a satisfied "any N instances" search has been reached. If the
required attribute-values have been satisfied, but the preferred
attribute values have been incompletely achieved, there remains the
possibility of benefit in continuing the search: This implies a heavy
weight is placed on the preferred attribute and, to the extent that
that has been done, the search comes to resemble an "extreme value
search".
Search diseconomy
If there is a cost in, for example, time, effort, or money, associated
with searching, searching all sources may not be affordable or
warranted. In particular, since libraries tend to have at least some
degree of overlap in contents, diminishing returns in terms of search
results are to be expected as the number of sources searched
increases. The marginal benefit will tend to diminish but the
marginal cost can be expected to remain stable or to increase. The
combined effect is an increase in the marginal cost per relevant
instance found as well as a steady increase in overall cost for the
entire search and, therefore, an increase in average cost per relevant
instance found. At some stage the cost of continuing the search is
likely to become uneconomical, depending on the circumstances and
assumed alternative uses of constrained resources. There is an issue
of scalability: the larger the number of digital libraries, the less
plausible the attempt to search them all.
Searches for "good" instances involve preferences and the strength of
the preference is a significant variable with respect to the stopping
decision. If the preference is mild, close to indifference (e.g.,
"I'd prefer a version in WordPerfect 5.1 but ascii or any common
wordprocessing format would do"), then this preference would be useful
in discriminating if and when ample instances have already been found
and there is a choice to be made from among them, but it provides weak
justification for continuing the search. If the preference is strong
(e.g., "I want the original 1934 version of the text if at all
possible."), then the preference is closer to being mandatory and is a
stronger argument for continuing the search.
We define duplication as the existence in two or more libraries of
instances of the same document. ("Same", as noted above, is not a
simple concept. We assume, in what follows, an arbitrary level of
similarity, e.g. of textual equivalence, as in copies of the same
edition.)
Libraries, digital or otherwise, are, by their nature, characterized
by high levels of duplication. Indeed, one of the principal expected
benefits of the move from paper-based to digital libraries is in the
massive cost-savings expected to result from an expected reduction in
duplication (Lemberg 1995).
High levels of duplication have important consequences for searching.
The marginal cost-benefit of extending a search to another source is
very sensitive to the degree of duplication because the greater the
degree of duplication the less the marginal benefit of extending the
search. At the limiting case in which all sources are entirely
duplicative, i.e. containing exactly the same documents, it would be
pointless to extend the search beyond the first source. If, however,
there were no duplication between the sources (i.e. entirely disjoint
sets), then one would expect to search multiple sources for all except
easily satisfied searches.
A sensible first approach to the selection of which of multiple
libraries to search is to check or to estimate the overlap between the
search specification and the contents of each source separately, and
tools are emerging for this purpose (e.g. CHECK (Buckland, Butler, Kim, Norgard & Plaunt 1995);
GlOSS (Gravano, Garcia-Molina & Thomas 1994).
But CHECK and GlOSS, to the extent that they provide estimates of the
existence of relevant documents in a digital libraries provide
estimates of what relevant documents would be found if that source
were the first or only) source searched. After that, such an estimate
needs to be discounted by the degree of duplication between that
source and all the sources that have already been searched. One could
also say that CHECK and GlOSS estimate of the upper limit of the
number of relevant instances.
With significant and varied degrees of overlap, the optimal choice of
sites to be searched can even depend on number of sites searched.
Consider the hypothetical example in the following Venn diagram
showing the duplication of relevant documents in three sources:
If only one source were searched the optimal choice would be
source A which would yield 10 relevant documents, better than 9 for B
or 8 for C. Having searched A, the best next choice would be to
search B, increasing recall from 10 to 15. But if two sources were to
be searched, searching B and C, and not searching A, would be the best
strategy, yielding a combined total of 16 relevant documents, compared
with 15 for A and B and 14 for A and C. Although such examples can be
demonstrated, empirical studies might show that the frequency and
magnitude of such non-optimality is insignificant in practice.
The methodology for detailed analysis of duplication among library
collection is described in Buckland, Hindle &
Walker 1975. The underlying mechanism is the creation of a large
data matrix composed of a row for each document and a column for each
location. The number of copies of that document held at that location
is entered in each cell as in Table 1.
The effects of searching the libraries in any given order is shown by
rearranging the columns into the specified search order, as in Table
2.
The columns are then inspected and the recall (i.e. number of relevant
retrieved documents found) is counted: A "1" indicates that a copy of
a relevant document has been found, but it is duplicative unless it is
the first 1 found in its row (as underlined). In this simple example,
the first library searched (B) yielded one relevant document (a). The
second library (G) yielded another relevant document (c). The next
two libraries (F, A) add nothing: Each yields one relevant document
but it is duplicative. The fifth library searched yields three
instances of relevant documents (a, b, c), but only one of these (b)
is non-duplicative, so the useful increment in recall is one.
Extending the search to libraries C and E would not add any new
documents. It can be seen, with hindsight, that starting (and ending)
with library D would have been a better search order. The Recall row
shows the marginal contribution of each library. The Cumulative
recall row provides the data for drawing a Recall curve.
This simple model is sufficient for basic Boolean searches. To study
more complex searches involving preferences and adaptive search
statements, some additional elaboration is needed. A convenient
approach is to disaggregate the matrix with a dimension for each
attribute intended to be used as a preference. (Examples can be found
in RECALL PERFORMANCE WITHOUT DUPLICATION
A graph of the cumulative increase in the number of relevant documents
found ("Recall") as a search is expanded is standard practice in
information retrieval. It can be readily adapted to the present
analysis of multiple digital libraries if one treats the contents of
the digital libraries as one large collection, segmented by library.
Such as graph shows the increase in recall as the search is extended
from one digital library to the next, as in Figure 2.
No skewedness, no duplication
As an illustration we imagine ten digital libraries, each with the
same number of relevant documents, but with no duplication between
them. We find that:
(i) As the search is expanded from one source to the next, the number
of documents retrieved (the marginal recall), stays constant. This is
shown in the straight diagonal line D in Figure 2;
(ii) Under these conditions different search orders make no difference
to the number retrieved from each library. Ordering the sources
randomly is as good a solution as any choice of ordering.
(iii) An "every instance" search must always continue until all
sources have been searched, but if, say, N instances are all that is
wanted, they would, in this example, have been retrieved with the
searching of the 6th source. Thereafter no additional instances are
wanted, the marginal recall becomes zero and the slope of the recall
curve becomes flat as indicated by the D'.
The ordering of the sources is assumed to be a matter of indifference
to the searcher, but it might not be if the costs of searching
differed between sources. The ordering of the sources makes no
difference to what they contribute in the "every instance" search, but
it does in the "any 6 instances" case: sources 7 - 10 are not
searched.
Skewed distribution, no duplication. We continue the assumption
of no duplication within the set of libraries, but relax the
assumption about all libraries contributing the same number of
instances. We start with the most extreme case in which one and only
one library contains all the relevant items. We find that the recall
curve is very sensitive to the search order as shown in Fig. 3.
With a optimal ordering the library containing all relevant items
would be searched first and the recall curve would rise vertically,
then remain flat (curve Pt in Fig. 3). Any given search order involves
a vertical line where the productive library is reached. With the
worst possible ordering all unproductive libraries are searched first
and recall curve remains flat until the last source is found,
whereupon it rises vertically (curve Pv). These two extreme cases
define the limits, the region of possible recall results. Over many
trials with random ordering the average will tend to the diagonal line
D.
Where "any N instances" are sought (N being less than "every"), the
value of N determines where the horizontal line appears but, otherwise
not the shape of the graph.
We can generalize from this discussion that where ordering tends to be
better than random, the recall curve will be in the upper left region,
above the diagonal, and the better the ordering the more the recall
curve will tend to the extreme, limiting case of curve Pt.
Cases in which the distribution of relevant documents is less skewed,
similar but less extreme results emerge. In brief, still assuming no
duplication, whenever the distribution of documents is at all skewed
and the search ordering is better than random, we find a convex
cumulative recall curve rising monotonically from the origin and
reaching the top right corner when the last source has been searched.
(See line R in Fig. 3). One might say that
the skewedness allows greater curvature and better ordering achieves
greater curvature.
Skewedness and ordering amplify each other's effect. With random
ordering, the recall curve tends to a straight diagonal regardless of
skewedness. With zero skewedness, the ordering is irrelevant. What
really makes for effective recall is the combination of skewedness
with good ordering.
No skewedness, with duplication. In the limiting, extreme case of
complete duplication, all libraries contain copies of the same
documents. The effect is the same as the case of optimal ordering
with extreme skewedness where one library contains all the relevant
items and one inerringly picks that library first, described above and
shown in curve Pt in Fig. 3. The difference
is that while the result is the same, the search order has become
irrelevant: With complete duplication, any ordering will have this
same result.
As the degree of duplication increases from none to complete
duplication the cumulative recall curve shifts gradually from the
straight, diagonal line D of Fig. 3 to the
extreme skewedness of line Pt. Increasing duplication moves the
recall curve toward the top left.
Skewedness, with duplication
In practice the situation ordinarily encountered is the fourth
combination: a skewed (i.e. uneven) distribution of relevant documents
across digital libraries with some duplication. The complexity of
this combination resists tidy analysis, but two different approaches
are possible.
(i) Techniques such as CHECK and GlOSS can be used to estimate the
expected recall from each digital library viewed in isolation. In a
sequential search, these estimates should, after the first site, be
considered to be estimates of the upper limit of marginal recall,
since the estimates need to be discounted for duplication with
documents already found at previously searched sites. The degree of
duplication between the contents of any given library and the contents
of the previously searched libraries can be estimated using the
techniques described in Buckland, Hindle and
Walker (1972).
This approach has two disadvantages: (a) Estimates of duplication
between collections or even between subsets of documents in the
collections may not be a reliable predictor for any highly specific
search query; and (b) This approach does not scale well. As the
number of digital libraries increases, the discounting for duplication
becomes progressively more important and progressively more complex
with the very rapid increase in ordering possibilities as the number
of searchable sites increases.
(ii) Another approach is to use the search result at each site
cumulatively to form progressively more reliable predictions of the
likely marginal benefits of continuing the search. Stated very
crudely, when, as the search progresses, the marginal recall becomes
trivial, as evidenced by consistent zeros and the Cumulative Recall
curve R in Fig. 3 becomes flat, then it probably
time to discontinue the search. (A separate technical report on this
aspect is intended).
The search for "N good" instances of the form "given a choice, I would
prefer" implies an adaptive retrieval strategy that can respond
dynamically to the results encountered, either during the search or in
post-search processing of retrieval results. A search for "N good"
instances depends on an explicit specification of preferred values for
one or more attribute-values as the basis for "better" to be
determined.
In terms of the present analysis this can be viewed as an elaboration
of the data presented in Table 2. There would need to be an
additional dimension to the table for a disaggregation of the values
of each attribute (e.g. date of publication) upon which preferences
would operate. Some empirical examples of such data can be found in
Buckland, Butler, Norgard and Plaunt (1992).
Discussion has so far been in terms of the interests of the searcher:
How to achieve a cost-effective search of multiple sites. But what if
the sites themselves are not indifferent to whether or not they are
searched? We are not aware of any studies of motivations in this
situation but we can easily surmise two different attitudes:
(a) Commercial sources, such as DIALOG, can, in general, be expected
to wish to increase the number of search received, as this would tend
to increase revenue in a business dominated by fixed costs.
(b) Public sector sources, typically motivated by political or public
service motivations, may well wish to provide a search service yet
hope that the search traffic from outside the clientele they are
funded to serve will be minimal, at least during peak hours. Online
catalogs of university libraries, made available as a public service,
are likely feel this way.
How, then, should a service position itself to increase (or decrease)
the number of searches received? Other things, being equal, a service
motivated to change the number of searches received should seek to
change its position in the order in which services are searched:
(a) The later its position in the search order, the greater the
likelihood that the search will have been completed before reaching
this source. With an "any N" instances search, N instances may
already have been found and the search discontinued before the search
reaches this source. Likewise, with an "any N good" instance search,
a sufficiently satisfactory recall may have been achieved before the
search reaches the source. Similarly, an earlier position increases
that likelihood.
(b) With any search, it may have been decided that, because of
diminishing returns, the search was no longer worth continuing and so
has been discontinued rather than completed, but with the same effect
for the source.
(c) Where there is duplication, the later a source is positioned, the
greater the likelihood that any relevant instances are duplicative of
those already found and so not of value. More precisely, it is not so
much the position in the ordering of sources but the degree of overlap
with the sources already searched, but, whatever the degree of
overlap, it can only increase as a source's position in the order
recedes.
Any empirical data capable of generating Cumulative Recall curves of
the form shown in Figures 3 could also be used to demonstrate the effects of
any given change in the ordering of the sources. One would merely
need to re-examine the empirical data with a different ordering.
We note that if two services were to merge they would become a single
larger service. The same effect is achieved if a union
(i.e. combined) index to two or more services is created even though
the services themselves were not merged. The effect is to enable the
searcher to survey more services with fewer searches. At the limit,
with just one comprehensive union index there would be only one place
to search. Whether or not to merge or to participate in a union index
is strategic decision for each digital library. The optimal strategy
to minimize (maximize) the number of searches received would seem to
be non-participation in a union catalog if it can also be arranged
that the source would be searched (if at all) after (before) the union
index.
Further analysis would require clarification of what the objectives of
the service might be. For example, a service may seek to increase (or
decrease) the number of times it is searched or it may seek to
increase (or decrease) the number of relevant items found. These are
not the same, though the effects may be similar.
Work is in progress with two different emphases:
(i) The effects of any definable search strategy on any imaginable
number of digital libraries and of skewedness and duplication in their
collections can be handled by creating matrices of the form shown in
Tables 2 with
additional dimensions for each attribute of interest.
Real data in this form pertaining to networks of up to 100 actual
libraries can derived by postprocessing searches in the University of
California's MELVYL online union catalog. MELVYL search results can
be disaggregated by library so that what is held in any given library
can be examined. The data from any given search can be fed into a
matrix of the form shown in Table 1 and treated
accordingly, i.e. re-arranged in any given ordering of the libraries
with information on what is duplicative retained. From this
rearranged data the effects that would resulted from searching the
libraries in any given order can be reconstructed, recall graphs like
those in Figure 2 constructed, and other analyses
made.
However, for the purpose of exploratory research and sensitivity
analysis, the collection of empirical data is not necessary. All that
is necessary is the creation of data in the form of Table 1 extended to any arbitrary number of
sources.
Retrieval systems generate a more or less weak ordering of documents.
At an extreme, simple Boolean techniques yield a binary partitioning:
Retrieved and Not-retrieved. The problem is that with such a weak
ordering it is not practical to predict the absolute size of the
retrieved set. Therefore, if, as is usual, one does not want every
instance retrieved, it is not realistic to expect to be able to
formulate a search statement that yields any specified number of
documents. Hence, with retrieval techniques with weak ordering, one
has to modify the search statement iteratively until a suitably-sized
retrieved set results. The "FEWER" command of the OASIS experimental
prototype provides computer-assisted support for this within a single
database environment (Buckland, Butler, Norgard &
Plaunt 1992). Generalizing this approach to multiple database
environment would provide the basis in an environment of simple
Boolean services for an intelligent "knowbotic" agent capable of
adapting search statements adaptively in response to what is found.
Retrieval techniques with stronger ordering capability lend
themselves, such as Boolean systems supporting weighted search
queries, have less need for adaptiveness during the search.
Probabilistic retrieval systems ordinarily yield rather strictly
ordered retrieved sets, but are ordinarily based on analysis of the
relationship between an individual document and all other documents in
the same database ("term frequency / inverse document frequency") and,
therefore, do not lend themselves to searches of multiple databases.
Research on retrieval techniques for searching multiple digital
libraries is in progress and will be the subject of future reports.
Buckland, M. K., M. H. Butler, Y. Kim,
B. A. Norgard & C. Plaunt. 1995. Partnerships in
navigation: An information retrieval research agenda. In: ASIS
'95: Proceedings of the 58th American Society for Information Science
Annual meeting, Chicago, Oct. 8-12, 1995. Medford, NJ:
Information Today, 1995. 84-89
Buckland, M. K., M. H. Butler, B. A. Norgard &
C. Plaunt. 1992.
OASIS: A
front-end for prototyping catalog enhancements. Library Hi
Tech Issue 40: 7-22.
Buckland, M. K., A. Hindle &
G. P. M. Walker. 1975. Methodological problems in assessing the
overlap between bibliographical files and library
holdings. Information Processing and Management 11: 89-105.
Flater, D. W. & Y. Yesha. 1993. Towards flexible
distributed information retrieval. In: Adam, N. R. & B. K. Bhargava,
eds. Advanced database systems. Berlin: Springer, 259-276.
Gravano, L., H. Garcia-Molina & A. Thomas. 1994. The
effectiveness of GlOSS for the text database discovery problem. SIGMOD
Record 23: 126-137.
Lemberg, R. 1995. A life-cycle cost analysis of the
creation, storage and dissemination of a digitized document
collection. DLIS dissertation, University of California,
Berkeley.
(ii) Which library to search next? and
(iii) When to stop searching?

Figure 1: Venn diagram of duplication among three sources.
Table 1. Matrix of holdings of three documents in seven libraries.

Table 2. Matrix of holdings of documents in reordered libraries.

Fig. 2. Recall curve: No skewedness, no duplication.

Fig. 3. Cumulative recall curves.
Acknowledgements
This report was written for and partially supported by the NSF, ARPA,
NASA Digital Libraries Initiative grant to the UC Berkeley Digital Libraries Project. It was also supported by the U.S. Department of Education through HEA IID grant R197D40008 "Online access in multiple database environments", part of the OASIS
research program.References