Interfaces to Scientific Data Archives

supported in part by the
National Science Foundation*
IIS-9803760

this report compiled by
Roy Williams
Center for Advanced Computing Research
roy@caltech.edu

and
Julian Bunn
CERN
julian.bunn@cern.ch

Reagan Moore
San Diego Supercomputer Center
moore@sdsc.edu

James C. T. Pool
Center for Advanced Computing Research
jpool@cacr.caltech.edu

This document and supporting material may be found at
http://www.cacr.caltech.edu/isda

1. Executive Summary

Many scientific endeavors produce large quantities of heterogeneous data that must be analyzed by loose, distributed collaborations. There is a call for federally funded data to be made useful by making it available to those who are not experts in its meaning. Scientific data, in contrast to text or image data, is often useless without sophisticated, customized data-mining and knowledge-extraction tools. Given these three conditions, there is an urgent need for software infrastructures to create, maintain, evolve, and federate these active digital libraries of scientific data; infrastructures that serve the newcomer learning to use the system as well as the seasoned professional. In the following we summarize some of the findings and recommendations from the rest of this report.

Metadata was an important topic: how to describe data objects, to make catalogs, to form relationships between data objects. An effective way to do this is to use structured documents to describe large binary objects; machines such as search engines and summarizers can then parse the document. Such structure can be provided with the XML language, a rationalized and extensible relative of HTML derived from SGML. The Dublin Core metadata standard is an effective and viable way to provide the semantics and structure of these metadata records. What is needed in addition is the development of languages for exchanging metadata, schemas and ontologies. For each of these levels of abstraction, a definition language and a manipulation language are needed. For more information, see sections 3.1, 4.2, 4.3 and 6.3.

Little was said about how to assure the long-term future of digital scientific archives: perhaps a partial solution to this would be to request such plans at the proposal stage of the archive. Many of those present felt themselves to be scientists and users of data archives, but few wanted to be librarians: maintaining, ingesting, abstracting, teaching, and attaching provenance to data. See sections 3.3 and 3.4.

The Web is universally seen as the lingua franca for the client who is new, or who does not have special software installed, or who wishes to access the archive from an arbitrary machine. In addition to this important role, the HTTP protocol increasingly provides communications between machines as well as between machines and people. Web servers are also being used as brokers, providing unified access to databases, legacy systems, supercomputers, and data archives. See section 3.6.

Discussions of architectures for scientific data archives converged on Web, distributed computing and databases, connected through standard protocols and brokers. In distributed computing, the major contenders are CORBA and Java RMI; few of this scientific group had experience with CORBA, but most were optimistic about Java. Most agreed that using databases is much more effective than working with flat files, but there is a complex and idiosyncratic choice between the simplicity of the relational database and the flexibility of the object database. It is most important in the design of distributed systems to always consider the error/diagnostic stream. See sections 3.5, 3.6, 3.7, 3.10.

Text-based interfaces should not be rejected in favor of point-and-click. Text can be stored, edited, documented, and quoted; the text history of an interactive session can be converted to a batch script; documented scripts can be used as didactic examples. Mature users prefer the speed and flexibility of text, but novices prefer a GUI; we should therefore concentrate attention on how a user can reuse what she has learned as a novice when she reaches the text-based world of the mature user. See section 4.1.

Authentication and security are important as soon as the archive moves beyond the prototype: besides protecting computing facilities, we wish to prevent immature data from being distributed, and to prevent loss of first-discovery rights. There is a pressing need for authentication packages that bridge the gap between Unix tools (e.g. ssh, PGP) and Web tools (e.g. SSL, certificates). Another need is to be able to hand over authentication from one machine to another. See section 4.4.

When users can produce derivative data products and put these back into the archive, then subsequent users need to know and trust the people or organizations that did the processing. Soon we shall feel the need to attach signatures, peer-review, publisher imprints, and other parentage information to data objects, not just to papers in journals. Such information should not get in the way; it should be obvious, difficult to forge, and ideally it should remain in place as the data is reinterpreted and filtered. See section 4.7.

* This workshop was sponsored by the National Science Foundation, under the grant IIS-9803760 (PI: Roy Williams) awarded by the Information and Data Management Program and the Special Projects Program of the Information and Intelligent Systems Division. All opinions, findings, conclusions and recommendations in any material resulting from this workshop are those of the participants, and do not necessarily reflect the views of the National Science Foundation.

2. Introduction

The High Performance Computing and Communications Program (HPCC) has been remarkably successful in focusing the attention of the Nation's scientists and engineers on the use of networked computers as tools to bring together large electronic data archives, and also to mine the data they contain. The Vice-President of the USA has endorsed the interoperation of geospatial and other scientific data archives as a national goal in the address ``The Digital Earth''.

The NSF has made a major thrust in the areas of Knowledge, Distributed Intelligence, and Digital Library technology, providing interfaces to such archives so that the public and specialists can search, browse, discover and retrieve the information that they need: such an interface converts the archive to a library. The tools from these projects allow meaningful, peer-reviewed data to be distributed over any network, from laboratory, to campus, to the world -- with suitable authentication. This is the impact for the professional scientist. The eventual aim is not simply an interface to a single data archive, but to federate such archives such that what is learned at one place works at another, and so that data can be effectively combined from multiple, independent archives.

If the archives contain raw scientific data, then by themselves they may be worth little to the end user: only when they can be filtered, queried, and mined can useful information be extracted from the archives. The workshop targeted the question of how high-performance computing resources, commercial software, agents, and brokers can be combined to produce powerful knowledge extraction tools that operate on the library, and thus make it an active digital library.

The objective of this workshop was to examine software infrastructures through case-studies: software for creating and maintaining active digital libraries of scientific data, for ensuring that the archive is flexible, extensible, and as easy as possible to learn, and for enabling different groups to use each other's work instead of repeating it.

``The Digital Earth'' by Al Gore, US Vice-President, http://www.opengis.org/info/pubaffairs/ALGORE.htm
Knowledge and Distributed Intelligence proposal solicitation, http://www.nsf.gov/kdi
Digital Libraries Initiative, phase 2: http://www.nsf.gov/pubs/1998/nsf9863/nsf9863.htm

2.1 Users of Scientific Data Archives

In the past, the users of a scientific digital archive would all work in the same building, perhaps a professor and students. Everyone would know the file format, the data would be on tapes, the metadata in log books, programs and scripts in a comfortable muddle.

But now the audience is widening. The Internet years have brought increasing diversity into the scientific community, as it becomes possible to collaborate with distant colleagues, perhaps colleagues who have never met. The result is an increasing emphasis on effective electronic collaboration, including more sophisticated approaches to shared data resources. Files will be replaced with dynamically created data objects that carry a description of themselves; data and metadata should be available in many ways: sorted, queried, filtered and transformed.

Another effect of the changes in computing technology is that the nature of scientific data is changing. Now much more data is produced because it is possible to store and manipulate it. In the past, the data would be immediately reduced and calibrated, but now the raw signals are stored in addition to the reduced data, in case a recalibration is needed. Furthermore, we are seeing a separation between the taking and the analysis of data. One group may set up the instrument, the simulation, the survey that creates the data, then produce a vast archive of data; then there may be a different group of people that extract scientific knowledge from the archive, and combine it with other archives.

The users of computers, including scientists of all ages, are changing. They will expect to learn by experimenting, not by reading a manual. They will expect software to install itself, and will have little patience with the gritty details of makefiles. At the same time, scientific software is working with objects that are much more abstract than those found in office software, so the challenge is to make systems with a graduated learning path. At the beginning, the system should do simple things simply, with a familiar look and feel, and graceful error recovery. As the user of the software advances, there will be many ways to do a single task, integration with other packages, scripting for repeatability and collaboration. Users would be well-served by systems that allow and encourage browsing and serendipity as well as directed search.

2.2 Using the Archive

The most useful aid to using the library effectively is contact information for a real human librarian who can answer questions, provide a walk-through, and teach people. Given the expense of such a service, we point out that users will be able to teach themselves with some of the following information. There should be an overview of the archive: the nature and meaning of the data objects, the reason why the archive has been set up, the file formats and why they have been chosen, who uses the archive and their activities, what tools they use. A map of the library should be available, with access control restrictions clearly stated, together with instructions about how to get access to restricted areas. A data pyramid shows raw data at the bottom, with the derived data and metadata in the higher levels. Other pages could include an index, glossary, and links to related projects.

2.3 Structure of this Report

In the workshop, we made an in-depth and broad assessment of approaches to creating such an active library, and the challenges that must be addressed. In particular, we hoped to identify useful common factors and to find or formulate effective standards. The workshop lasted two and a half days, and this report documents the findings. This introductory section, together with Section 3 (Findings and Questions) and Section 4 (Recommendations), constitutes a complete document that may be read independently of subsequent sections.

The workshop was structured around a number of case-studies (Section 5) and tools (Section 6) so that the interaction of these illustrates needs for standards and abstraction, identifies similarities, focuses on real-life problems, and thus curbs the excesses of theory. The objective of the workshop was to try to answer a set of discussion questions in the context of the case studies, using small group discussion, and the results from the five groups are reported in Section 7. Two sections report the results of surveys, conducted before the workshop, Section 8 covering the participants of the workshop and Section 9 covering a sample of scientific data archives. Section 10 is a list of participants.

3. Findings and Questions

3.1 Metadata

As in all workshops about digital libraries, there was a great deal of discussion about metadata. It is defined in terms of relationships between data objects, or as searchable data about data; for example cataloging information, call numbers, file names, object ID's, hyperlinks, ownership records, signatures and other parentage documents.

We point out that representations of data, summary data, and other kinds of derived data are not strictly considered to be metadata: for example thumbnail images, graphs and visualizations, power spectra, and instrument calibration information. Computers can create these kinds of data automatically, and they can also make records of low-level data descriptors such as file size, file name, location, timestamps, and access control.

The more interesting metadata issues concern the semantic content, the "meaning" of the data. Currently, this can only be created by a human mind, and the easiest way to do this is when the data objects are created in the first place. The key here is to create structured documents in a sophisticated markup language such as XML or SGML, or failing that, compliant HTML, with all the tags properly formed (see section 4.2). If this is done, machines will be able to parse the document for the purposes of abstracting, sorting, graphing, summarizing and archiving.

Metadata may also be added to the archive gradually. The archive might be processed to extract a new attribute for the catalogue database, for example an evaluation of whether the data is valid, in some sense. Larger quantities of data can be added to the archive by providing hyperlinks to other servers. If these additions are deemed valid by the administrators of the archive, they will make the transition from personal to public; from the cached data belonging to an individual scientist to an acknowledged part of the library. Other kinds of metadata that can be gradually added might include information on who has accessed or cited which parts of the library.

Scientific data archives are more often sophisticated query engines than collections of documents. We should consider a data object to consist of more than just a MIME-type and binary data: rather, a combination of the binary with a page of structured text which is a description of the object. This document, the metadata, is therefore generated in real time, as is the requested data object. We recommend that a markup language such as XML be used for the metadata description. As an example, scientific data contains numerical parameters; if the description contains tags like <param name=lambda>0.37</param> rather than the usual ad hoc script files, then (in the future) sophisticated XML software will be able to catalogue the results, and to sort, graph, and summarize them.
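As a sketch of what such a generated description might look like, the following hypothetical XML record wraps one binary data object. The element names, attribute names, and URL are illustrative only, not a proposed standard.

    <?xml version="1.0"?>
    <!-- Hypothetical metadata record for one data object; element names
         and values are invented for illustration only. -->
    <dataobject id="run-0042">
      <title>Shear-flow simulation, run 42</title>
      <creator>roy@caltech.edu</creator>
      <created>1998-03-26</created>
      <param name="lambda">0.37</param>
      <param name="gridsize">512</param>
      <data href="http://archive.example.org/runs/0042.bin"
            format="IEEE-754 binary, big-endian"/>
    </dataobject>

A search engine or summarizer can read the parameter values from such a record without ever touching the binary data it describes.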

3.2 Collaboration

Collaboration is the lifeblood of scientific investigation, especially collaboration between disparate fields of enquiry. The structure of the scientific data archive can assist and encourage collaboration in a number of ways.

Collaboration can be fostered by federating libraries. For example, the Digital Sky project (Section 5.4) is a federation of surveys at optical, infrared, and radio wavelengths; it is expected that there will be new astronomical knowledge when these archives are interoperating, knowledge that is created from joining existing archives, without any new observations. But another, social, effect of the library interoperation is the encouragement of collaboration. In forming metadata standards and agreeing upon semantics, subfield experts see a wider picture and work more closely with experts from related fields. See also section 4.3.

The library can provide a bulletin board and a mailing list. Useful contributed material includes documented scripts of existing sessions that can be used by others. Collaboration tools -- groupware -- can provide shared browsing, whiteboard, and conferencing facilities. Currently it is difficult to do collaborative scientific visualization, because the groupware copies the pixels of one screen rather than the much smaller model from which the pixels are constructed. To use commercial groupware for scientific purposes, we need published APIs that allow specialized scientific visualization software to be connected to the groupware so that the model, rather than the pixels, is shared.

When designing user interfaces, we should not think only about the isolated user, but also in terms of shared browsing, tutorial sessions, and exchange of scripts among geographically-separated users.

3.3 Who are the Librarians?

Many of those at the workshop considered themselves creators or users of a scientific data archive, but not librarians or administrators. So the question we are left with is: who are the librarians for this increasing number of ever more complex libraries? Who will be ingesting and cataloguing new data; maintaining the data and software; archiving, compressing and deleting the old? Somebody should be analyzing and summarizing the data content, assuring provenance and attaching peer-review; encouraging interaction from registered users and project collaborators. Perhaps the most important function of the librarian is to answer questions and teach new users.

3.4 How Long Will the Archive Last?

Scientific data archives contain valuable information that may be useful for a very long time, for example climate and remote-sensing data, long-term observations of astronomical phenomena, or protein sequences from extinct species. On the other hand, it may be that the archive is only interesting until the knowledge has been thoroughly extracted, or it may be that the archive contains results that turn out to be flawed. Thus a primary question about such archives is how long the data is intended to be available, followed by the secondary questions of who will manage it during its lifetime, and how that is to be achieved.

Data is useless unless it is accessible, unless it is catalogued and retrievable, unless the software that reads the binary files is available and there is a machine that can run that software. While we recognize the finite lifetime of hardware such as tapes and tape readers, we must also recognize that files written with specific software have a finite lifetime before they become incomprehensible binary streams. Simply copying the whole archive to newer media may solve the first problem, but to solve the second problem the archive must be kept ``alive'' with upgrades and other intelligent maintenance.

A third limit on the lifetime of the data archive may be set by the lifetime of the collaboration that maintains it. At the end of the funding cycle that created the archive, it must be transformed in several ways if it is to survive. Unless those that created the data are ready to take up the rather different task of long-term maintenance, the archive may need to be taken over by a different group of people with different interests; indeed it may pass from federal funding to commercial product. The archive should be compressed and cleaned out before this transformation.

3.5 Flexibility

The keys to flexibility in a complex, distributed software system are specification of interfaces, clear control channels, and modularity of components.

Interfaces should be layered, with the upper layers specified closely and documented by the system architect, while the lower layers should be a standard software interface, specified by a standards body or a well-accepted de facto standard; something that will be used and extended in the future. Unless there is really a new paradigm, it should not be necessary for the system architect to specify a protocol at the byte level, with all the error-prone drudgery of byte-swapping and packetization. It is also not a good idea to use a proprietary or non-standard protocol, unless documentation and source code are available and there are no property-rights issues.

Each process or thread that is involved in the computation should have a channel through which it receives control information, and it should be clear and well-defined when this channel opens, closes or changes. If there are separate control and data streams, it must be clear who is listening where. The Unix model of input, output, and error streams is a good one.

On the other hand, the components that communicate through these interfaces need not be carefully tested, committee-approved software; they can be inefficient prototypes, hacked code, or obsolete implementations by a bankrupt company. Once the interfaces and protocols are well-known and strong, different implementations of the components can be created and replaced easily.
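A minimal Java sketch of this principle, with hypothetical names: the system architect specifies a small, stable interface, and prototype or replacement implementations can be swapped behind it without touching the rest of the system.

    // Hypothetical archive-access interface specified by the system architect.
    // Callers depend only on this contract, so implementations can be replaced.
    public interface ArchiveStore {
        byte[] fetch(String objectId) throws java.io.IOException;
        void store(String objectId, byte[] data) throws java.io.IOException;
    }

    // An inefficient prototype implementation, good enough to exercise the
    // rest of the system while a production version is being written.
    class FlatFileStore implements ArchiveStore {
        private final java.io.File root;
        FlatFileStore(java.io.File root) { this.root = root; }

        public byte[] fetch(String objectId) throws java.io.IOException {
            java.io.File f = new java.io.File(root, objectId);
            byte[] buf = new byte[(int) f.length()];
            try (java.io.FileInputStream in = new java.io.FileInputStream(f)) {
                int off = 0;
                while (off < buf.length) {
                    int n = in.read(buf, off, buf.length - off);
                    if (n < 0) throw new java.io.EOFException(objectId);
                    off += n;
                }
            }
            return buf;
        }

        public void store(String objectId, byte[] data) throws java.io.IOException {
            try (java.io.FileOutputStream out =
                     new java.io.FileOutputStream(new java.io.File(root, objectId))) {
                out.write(data);
            }
        }
    }

A database-backed or hierarchical-storage implementation can later replace FlatFileStore with no change to the code that calls it.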

3.6 Brokers

A broker is a software process in a distributed computing system that connects clients with servers, translating, interpreting, redirecting or fusing queries and their results. Brokers provide the flexibility to unify servers with different implementation details and language variations, to translate between languages and to encapsulate legacy systems. A broker allows a user to create a complex query without exposing all the complexity; for example a broker can create queries in a language such as SQL using a simple wizard-based graphical interface. With a broker, the client and server protocols and services can be optimized independently, with the broker providing the translation. Brokers can also be used to translate a single language to the particular dialect that each server might want.

Brokers are also useful for system design, to provide modularity and portability. When considering the choice of a database product, it may be advantageous to separate the database from the rest of the system so that it can be extended or replaced with another product at a later time. This can be achieved by placing a broker between the database and the client system.
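As a sketch of this query-translation role, the following hypothetical Java fragment turns a simple client request (an attribute name and a range, such as a wizard interface might produce) into SQL and runs it against whatever database sits behind the broker, via JDBC. The table, column, and connection details are invented for illustration.

    import java.sql.*;

    // Hypothetical broker: translates a simple (attribute, min, max) request
    // into SQL, so the client never needs to know the query language or the
    // particular database product behind the broker.
    public class CatalogBroker {
        private final Connection db;

        public CatalogBroker(String jdbcUrl) throws SQLException {
            this.db = DriverManager.getConnection(jdbcUrl);
        }

        /** Return the identifiers of objects whose named attribute lies in [lo, hi]. */
        public java.util.List<String> rangeQuery(String attribute, double lo, double hi)
                throws SQLException {
            // In a real broker the attribute name would be validated against the
            // catalog's known columns before being interpolated into the SQL.
            String sql = "SELECT object_id FROM catalog WHERE " + attribute
                       + " BETWEEN ? AND ?";
            java.util.List<String> ids = new java.util.ArrayList<String>();
            try (PreparedStatement stmt = db.prepareStatement(sql)) {
                stmt.setDouble(1, lo);
                stmt.setDouble(2, hi);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) ids.add(rs.getString(1));
                }
            }
            return ids;
        }
    }

If the underlying database product is later replaced, only the JDBC URL and perhaps the SQL dialect inside the broker change; the client interface is untouched.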

A very popular kind of broker that is universally accessible is the Web server. Most of the archives in the survey (section 9) provide a web interface. Thanks to an enormous implementation effort in the business world, it is easy to use a web server as the protean broker. A web server can work with multiple, heterogeneous databases, with high-performance archival storage systems, and with text-based legacy systems; it can hide the underlying OS; it can also provide authentication, encryption, multithreading, and sessions; web servers and application servers can be made from software components. There is much research into parallel, high-performance web servers and other advanced servers.

3.7 Distributed Software Components

Modern software technology offers the promise of flexible, high-performance distributed systems where the code can be understood and modified without needing to know how everything works, but knowing only the semantics and methods of one object. Distribution is supposedly easy if all the code is written in Java, through Remote Method Invocation (RMI) and JavaSpaces. At a higher level of complexity and flexibility, CORBA provides portable distribution of objects whose methods are written in other languages such as C++. Of course the Java optimism may only be because it is newer than CORBA.
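A minimal RMI sketch, with hypothetical interface and registry names: the remote interface is declared once, and a client on another machine invokes it as if it were local, knowing only the method signatures.

    import java.rmi.Naming;
    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.server.UnicastRemoteObject;

    // Hypothetical remote interface to an archive catalog; only these
    // method signatures are shared between client and server.
    interface RemoteCatalog extends Remote {
        String[] search(String keyword) throws RemoteException;
    }

    // Server-side implementation, exported so that remote clients can call it.
    class CatalogServer extends UnicastRemoteObject implements RemoteCatalog {
        CatalogServer() throws RemoteException { super(); }
        public String[] search(String keyword) throws RemoteException {
            return new String[] { "object-1", "object-2" };  // placeholder result
        }
    }

    public class RmiSketch {
        public static void main(String[] args) throws Exception {
            if (args.length > 0 && args[0].equals("server")) {
                LocateRegistry.createRegistry(1099);          // start a local registry
                Naming.rebind("rmi://localhost/catalog", new CatalogServer());
            } else {
                // The client sees only the RemoteCatalog interface.
                RemoteCatalog cat = (RemoteCatalog) Naming.lookup("rmi://localhost/catalog");
                for (String id : cat.search("supernova")) System.out.println(id);
            }
        }
    }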

Components offer the idea of plug-and-play software. When a component is introduced to a system, it announces its purpose through a process known as introspection, and allows customization. A GUI or other application is made by connecting components through events, with the events produced by one component being sent to other components. The creation of an interface thus involves many kinds of people: besides the end-user of the interface, there are the authors of the component library and the person who connects the components together into a GUI. Components offer an advantage to the end-user as well as the creator of the GUI, because once the user has learned how to use a component, the knowledge can be reused; for example the file-chooser has the same look in all Microsoft Windows applications.

Java RMI: http://java.sun.com/products/jdk/rmi/index.html

3.8 Database Software

Most of the case-studies have embraced the idea that a relational or object database is essential for flexibility and modularity in design, for querying, for sorting, for generating new types of data object or document.

Object database practitioners are largely convinced that object databases are superior to relational databases for scientific data, but of course relational databases can also work for scientific data. Object databases provide features such as data abstraction and hiding, inheritance, and virtual functions. One might also argue that if programs are in an object-oriented language, then it is natural and proper that the data be in an object database.

A portability question arises between relational and object databases. Relational schemas are simpler than the rich structures possible with a full object model, so porting between relational products entails writing the tables in some reasonable form, perhaps ASCII, and reading them into the other product. With an object database, however, much of the implementation work is writing code that interfaces to the proprietary API, and porting to another ODBMS will only be easy if the code is designed for portability right from the start.

While most scientific data archives have embraced the idea that a database is essential for storing the metadata, another question is how to store the large binary objects that represent the data objects themselves. One point of view maintains that investing time, effort, and money into a DBMS implies that it should be used for managing all the data in a unified way. This is appropriate when there are many, relatively static, not-too-large data objects. The other point of view is that cataloguing functions are very different from the specialized operations on the large data objects, and when we write the code that does complex processing and mining, we want to work directly with the data, not through the DBMS API. Splitting data from metadata in this way reduces dependence on a particular DBMS product, making it much easier to port the archive to a different software platform, since only the metadata needs to be moved.

The Object-Oriented Database System Manifesto:
http://www.cs.cmu.edu/People/clamen/OODBMS/Manifesto/htManifesto/Manifesto.html
Why use an ODBMS (from Poet Inc.): http://www.poet.com/products/object_development_solutions/white_papers/relational_vs_object/relational_vs_object.html

3.9 Commercial Software

Using commercial software in a digital library can be difficult if the client must pay for the use of the software. Licensing agreements are often written from the point of view of a single user running the software on a single workstation. While advances such as floating licences are welcomed, new kinds of software and licences are still needed: short-term licences, free client software, licences for running the software as part of an application server, or licences based on measured usage.

Sometimes it is a good idea for the library implementor to insulate herself from uncontrolled changes in a commercial product by thinking of escape right from the start. It may be possible to encapsulate the commercial package into a few functions or classes, so that the package can be "swapped out'', if necessary, at some future date.

3.10 Exceptions and Diagnostics

One of the most difficult aspects of distributed systems in general is exception handling. For any distributed system, each component must have access to a "hotline" to the human user or log file, as well as diagnostics and error reporting at lower levels of urgency, which are not flushed as frequently. Only with a high quality of diagnostic can we expect to find and remove not only bugs in the individual modules, but the particularly difficult problems that depend on the distributed nature of the application.
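A sketch of these two levels of urgency, using the standard Java logging API; the component name and log-file layout are assumptions made for illustration.

    import java.io.IOException;
    import java.util.logging.*;

    // Each distributed component keeps its own logger: SEVERE messages reach
    // the human operator immediately (the "hotline"), while FINE diagnostics
    // accumulate in a per-component log file for later debugging of the
    // distributed failures that are hardest to reproduce.
    public class Diagnostics {
        static Logger openLog(String component) throws IOException {
            Logger log = Logger.getLogger(component);
            log.setUseParentHandlers(false);
            log.setLevel(Level.ALL);                    // keep everything
            FileHandler file = new FileHandler(component + ".log");
            file.setLevel(Level.ALL);                   // low-urgency detail on disk
            file.setFormatter(new SimpleFormatter());
            log.addHandler(file);
            ConsoleHandler hotline = new ConsoleHandler();
            hotline.setLevel(Level.SEVERE);             // hotline to the human user
            log.addHandler(hotline);
            return log;
        }

        public static void main(String[] args) throws IOException {
            Logger log = openLog("tape-mover");
            log.fine("requesting tape 0042 from robot");           // routine diagnostic
            log.severe("robot did not respond within 60 seconds"); // needs a human
        }
    }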

3.11 Deep Citation

In principle, the account of a scientific experiment includes detailed quantitative evidence that allows differentiation between competing theories. For a computational simulation or data mining investigation, the analogue is metadata that gives the reader full access to that evidence. Carrying this idea to its logical conclusion would allow readers access not only to the data that was used, but also to the programs that created and extracted the knowledge content of the data, so that they can verify results and examine variations by running the simulation or mining code themselves. This is deep citation. While this idea may seem excessive for a paper in a journal, it would be very desirable for a close collaboration or for educational purposes. Individual researchers would also benefit from keeping deep citations to their own work.

3.12 Data Driven Computing

We are used to the idea of writing a procedural program which reads files from some kind of data service. In some circumstances, this may not be the correct model: rather we write the code as a handler of data objects which are handed to the compute service in some arbitrary order by the data service. Suppose, for example, that we wish to apply an algorithm to all the data of a large data archive, and that the archive is stored on a tape robot, where the data objects are arbitrarily ordered on many tapes. If the compute process is in control, then the robot may thrash the tapes furiously, in order to deliver the data objects in the order demanded by the program. But in the data-driven model, all the data objects are extracted from a tape in the order in which they appear and delivered to the compute process (the handler), and the job is completed in much less time. Certain kinds of query, involving touching most of the data of the archive, may be scheduled for a ``data-driven run'', where many such queries, from different users, can be satisfied with a single run through the data, perhaps scheduled to run over a weekend.
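A sketch of the data-driven model, with hypothetical names: each pending query registers a handler, and a single pass over the archive, in whatever order the storage system delivers objects, feeds every handler at once.

    import java.util.ArrayList;
    import java.util.List;

    // Data-driven model: the data service is in control and hands objects to
    // registered handlers in whatever order is cheapest to read (e.g. tape order).
    interface DataHandler {
        void handle(byte[] dataObject);   // one archived object, in storage order
        void finished();                  // called once the single pass is complete
    }

    class DataDrivenRun {
        private final List<DataHandler> handlers = new ArrayList<DataHandler>();

        void register(DataHandler h) { handlers.add(h); }

        // One pass through the archive satisfies every registered query at once,
        // instead of each query forcing the tape robot to seek in its own order.
        void run(Iterable<byte[]> objectsInTapeOrder) {
            for (byte[] obj : objectsInTapeOrder) {
                for (DataHandler h : handlers) {
                    h.handle(obj);
                }
            }
            for (DataHandler h : handlers) {
                h.finished();
            }
        }
    }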

4. Recommendations

4.1 Text-based Interfaces

While the point-and-click interface is excellent for beginners, mature users prefer a text-based command stream. Such a stream provides a tangible record of how we got where we are in the library; it can be stored or mailed to colleagues; it can be edited, to change parameters and run again, to convert an interactive session to a batch job; the command stream can be merged with other command streams to make a more sophisticated result; a command stream can be used as a start-up script to personalize the library; a collection of documented scripts can be used as examples of how to use the library.

We should thus focus effort on the transition between beginner and mature user. The graphical interface should create text commands, which are displayed to the user before execution, so that the beginner can learn the text interface.
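One way to make the GUI teach the text interface is sketched below: every point-and-click action is converted to a command string, shown to the user, appended to a session script, and only then executed. The command syntax and class names are invented for illustration.

    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical session recorder: the GUI builds a text command for each
    // action, displays it, logs it, and then executes it, so the session can
    // later be replayed, edited, or mailed to a colleague as a script.
    public class SessionRecorder {
        private final List<String> history = new ArrayList<String>();

        public void submit(String command) {
            System.out.println("> " + command);   // show the text form to the user
            history.add(command);
            execute(command);
        }

        private void execute(String command) {
            // In a real interface this would be dispatched to the archive;
            // here it is only a placeholder.
        }

        public void saveScript(String filename) throws IOException {
            try (FileWriter out = new FileWriter(filename)) {
                for (String cmd : history) out.write(cmd + "\n");
            }
        }

        public static void main(String[] args) throws IOException {
            SessionRecorder session = new SessionRecorder();
            // A point-and-click selection might generate commands like these:
            session.submit("select survey=DPOSS region=12h30m,+20d radius=1deg");
            session.submit("plot magnitude histogram");
            session.saveScript("session-1998-03-26.txt");
        }
    }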

In a similar fashion, the library should produce results in a text stream in all but the most trivial cases. The stream would be a structured document containing information about how the results were achieved, with hyperlinks to the images and other large objects. Such an output would then be a self-contained document, not just an unlabeled graph or image. Because it is made from structured text, it can be searched, archived, summarized and cited. See also Section 3.1.

4.2 XML as a Document Standard

Extensible Markup Language (XML), which has been developed in a largely virtual W3C project, is the new 'extremely simple' dialect of SGML for use on the web. XML combines the robustness, richness and precision of SGML with the ease and ubiquity of HTML. Microsoft and other major vendors have already committed to XML in early releases of their products. Through style sheets, the structure of an XML document can be used for formatting, like HTML, but the structure can also be used for other purposes, such as automatic metadata extraction for the purposes of classification, cataloguing, discovery and archiving.

A less flexible choice for the documents produced by the archive is compliant HTML. The compliance means that certain syntax rules of HTML are followed: rules include closing all markup tags correctly, for example closing the paragraph tag <p> with a </p> and enclosing the body of the text with <body> ... </body> tags. More subtle rules should also be followed so that the HTML provides structure, not just formatting.

An introduction to Structured Documents: http://ala.vsms.nottingham.ac.uk/vsms/java/epub/xml.html
Extensible Markup Language: http://www.sil.org/sgml/xml.html

4.3 Metadata Standards and Federation

If metadata is to be useful to other libraries, to search and discovery services, or for federation of archives, then it must be standardized: there must be a consensus about the semantics, the structure, and the syntax of the metadata. A basis for the semantics is provided by the Dublin Core standard, which is gaining strong momentum in the library world, as well as among commercial information providers. There are fifteen key components (Title, Author/Creator, Subject/Keywords, Description, Publisher, Other Contributor, Date, Resource Type, Format, Resource Identifier, Source, Language, Relation, Coverage, Rights Management). The Dublin Core also includes ways to extend the specification, either by adding components or by hierarchically dividing existing components.
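A record for a scientific data object might then look something like the following. The element names follow the Dublin Core vocabulary, but this particular XML encoding and all of the values are invented here for illustration.

    <?xml version="1.0"?>
    <!-- Illustrative Dublin Core record; encoding and values are invented. -->
    <metadata>
      <title>DPOSS plate scan, field 447</title>
      <creator>Digital Sky Project</creator>
      <subject>astronomy; optical survey; galaxy counts</subject>
      <description>Calibrated pixel data for one survey field.</description>
      <date>1998-03-26</date>
      <type>image</type>
      <format>2-byte pixels, FITS</format>
      <identifier>http://archive.example.org/dposs/f447</identifier>
      <coverage>RA 12h30m, Dec +20d, 6.5 x 6.5 degrees</coverage>
      <rights>Survey collaboration; public after validation</rights>
    </metadata>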

Federation of archives grows in synergy with the creation of metadata standards. When a local effort tries to join two archives, a common vocabulary is created, leading to a metadata standard. Not only does this encourage collaboration, but also it leverages more knowledge from existing data assets. Metadata standards also encourage the beginning of collaborations because they allow and encourage discovery services.

We felt that the Dublin Core is an effective metadata standard that is appropriate (with extensions) to scientific data objects. What is needed in addition is the development of languages for exchanging metadata, schemas and ontologies. For each of these levels of abstraction, a definition language and a manipulation language are needed. For more information, see section 6.3.

The Dublin Core: http://purl.org/metadata/dublin_core

4.4 Authentication and Security

The workshop identified a need for an integration and consensus on authentication and security for access to scientific digital archives. Many in the scientific communities are experts in secure access to Unix hosts through X-windows and text interfaces. In the future we expect to be able to use any thin client (Java-enabled browser) to securely access data and computing facilities. The workshop felt that there was a distinct lack of consensus on bridges from these latter access methods to the Unix world that many scientists inhabit.

We identified access-control levels as follows, in addition to the usual public-access home-page for the project:

Low-security access: this area can be accessed with a clear-text password, control by domain name, HTTP authentication, a password known to several people, or even "security through obscurity". This kind of security emphasizes ease of access for authenticated users, and is not intended to keep out a serious break-in attempt. Appropriate types of data in this category might be prototype archives with data that is not yet scientifically sound, catalogs or other metadata, or data that is part of a collaborative effort, such as a partly-written paper.
High-security access: access to these data and computing resources should be available only to authorized users, to those with root permission on a server machine, or to those who can watch the keystrokes of an authorized user. Access at this level allows copying and deletion of files and long runs on powerful computing facilities. The data may be valuable intellectual property and/or the principal investigator may have first-discovery rights. Appropriate protocols include Secure Socket Layer (SSL), Pretty Good Privacy (PGP), secure shell (ssh), One-Time Passwords (OTP) and digital certificates.

Once a user is authenticated to one machine, we may wish to do distributed computing, so there should be a mechanism for passing authentication to other machines. One way to do this is to have trust between a group of machines, perhaps using ssh; another way would be to utilize a metacomputing framework such as Globus, that provides its own security. Once we can provide effective access control to one Globus server, it can do a secure handover of authentication to other Globus hosts. Just as Globus is intended for heterogeneous computing, the Storage Resource Broker provides authentication handover for heterogeneous data storage.

Globus: http://www.globus.org
Storage Resource Broker: http://www.npaci.edu/DICE/srb/srb.html
One-Time Passwords: http://www.ietf.org/html.charters/otp-charter.html

4.5 Standard Scientific Data Objects

We would like standard semantics and user interfaces for common objects that arise in scientific investigations. An example is the multi-dimensional point set, where several numerical attributes are chosen from a database relation (a table) as the "dimensions" of the space, and standard tools are used to create 2D or 3D scatterplots, principal component extractions, and other knowledge extraction methods.
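A hypothetical Java interface suggests the kind of standard semantics meant; the method names and example attributes are illustrative only.

    // Hypothetical standard interface for a multi-dimensional point set, so that
    // generic tools (scatterplots, principal component analysis, clustering)
    // can be written once and applied to any conforming data object.
    public interface PointSet {
        int size();                       // number of points
        int dimension();                  // number of numerical attributes
        String attributeName(int dim);    // e.g. "magnitude", "redshift"
        String attributeUnit(int dim);    // e.g. "mag", "dimensionless"
        double value(int point, int dim); // coordinate of one point in one dimension
    }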

Another example of a standard object is a trajectory in a high-dimensional phase space, which occurs when storing the results of a molecular dynamics or other N-body computation, or when a multi-channel time-series is recorded from a scientific instrument.

4.6 Request Estimation and Optimization

It is important that the user is given continuous feedback when using the library. When a non-trivial query is issued, there should be an estimate of the resources (time, cost, etc.) needed to satisfy it, and the user should accept these charges before continuing. Large queries may be scheduled to run later; smaller queries should provide a continuously-updated resource estimate.
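A sketch of this estimate-then-confirm contract, with hypothetical types: the archive quotes the cost of a query before running it, and the client (or a scheduler) decides whether to proceed or defer to a batch run.

    // Hypothetical contract for request estimation: the archive quotes the cost
    // of a query before executing it, and the caller accepts or declines.
    public interface QueryService {

        final class ResourceEstimate {
            public final double cpuSeconds;    // estimated compute time
            public final double gigabytesRead; // estimated data volume touched
            public final boolean deferToBatch; // too large for interactive use?
            public ResourceEstimate(double cpuSeconds, double gigabytesRead,
                                    boolean deferToBatch) {
                this.cpuSeconds = cpuSeconds;
                this.gigabytesRead = gigabytesRead;
                this.deferToBatch = deferToBatch;
            }
        }

        /** Quote the cost of a query without executing it. */
        ResourceEstimate estimate(String query);

        /** Execute only after the caller has accepted the quoted estimate;
            returns, for example, a URL where the results will appear. */
        String execute(String query, ResourceEstimate acceptedEstimate);
    }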

An area that needs particular attention is the estimation of resources in distributed systems, for example when a query joins data from geographically-separated sites and computes with it at a third.

4.7 Data Parentage, Peer-review, and Publisher Imprint

Information without provenance is not worth much. Information is valuable when we know who created it, who reduced it, who drew the conclusions, and the review process it has undergone. To make digital archives reach their full flower, there must be ways to attach these values in an obvious and unforgeable way, so that the signature, the imprint of the author or publisher, stays with the information as it is reinterpreted in different ways. When the information is copied and abstracted, there should also be a mechanism to prevent illegal copying and reproduction of intellectual property while allowing easy access to the data for those who are authorized.

4.8 Cheap Supercomputers for Archives

There is much interest in the High-Performance Computing community in the use of cheap off-the-shelf components to make powerful machines, and we recommend that this effort be leveraged from compute-intensive to data-intensive problems. In particular, fruitful research could ask how to create an active digital library with a cluster of personal computers, running free software, each with its own cheap disk. What kinds of database software can run on a system consisting of many Linux or NT personal computers? What types of file system are useful with such a distributed system?

5. Presentations on the Case Studies

5.1 Object Databases for High-Energy Physics

Julian Bunn, CERN <Julian.Bunn@cern.ch>

This talk described the GIOD project: Globally Interconnected Object Databases for High Energy Physics at CERN's Large Hadron Collider. This is a joint project between CERN, Caltech CACR, Caltech High-Energy Physics, and Hewlett-Packard, and it addresses the data storage and access requirements for particle collider experiments at CERN in the next 15-20 years. Data at ~100Mbyte/sec from the Collider experiments will be reconstructed to particle tracks, energy clusters etc. in near real time, and stored in an ODBMS (Object Data Base Management System), with significant fractions of the database replicated amongst world-wide ``regional centres''.

Goals of the project include investigating the scalability of commercial ODBMS, organizing the data to optimize access and analysis for the end-user physicist, and using leading-edge hardware and software such as the HP Exemplar supercomputer, the High-Performance Storage System (HPSS), Objectivity database, C++, Java, ATM, and new paradigms for data analysis tools.

The old data analysis solution is PAW: it deals with n-dimensional arrays called nTuples, and with histograms. It has been developed over many years, and a parallelization layer has been added. But it is inflexible, unscalable, hard to maintain, and hard to customize. New tools should be object-oriented, robust and modular; they should be convenient, flexible, and efficient for data access and manipulation, and support scripting. A new system, called LHC++, is under construction, with Iris Explorer as the base data and analysis tool. A serious problem, yet to be addressed, is the inertia of the end-users in moving from the Fortran-like PAW to the Object/Dataflow model of LHC++. Some examples of Java tools were presented. Java is easier to use than C++; it is powerful, safe, free, and interfaces cleanly to the ODBMS.

The GIOD Project: http://pcbunn.cithep.caltech.edu/
This presentation: http://pcbunn.cithep.caltech.edu/isda_mar98/
HPSS: http://www.sdsc.edu/hpss/
Objectivity database: http://www.objy.com
Iris Explorer: http://www.nag.co.uk/Welcome_IEC.html

5.2 The Genesis Neural Database Project

Jenny Forss, Caltech <jenny@bbb.caltech.edu>

This database project is an outgrowth of the popular Genesis (GEneral NEural SImulation System) package for computer simulation of neurons and neural systems. Developed at the Bower lab at Caltech, Genesis has several hundred users worldwide. This presentation is about two new developments, the linked Neural Database and Modelers Workspace, designed to provide access to neural models and associated data, including references, experiments, and simulation results. The Modelers Workspace is a web-accessible, Java front-end to the database, for generating queries and visualizing data, with interfaces for building models and running simulations.

Some benefits of the digital library approach are: easy access to the work of collaborators and others, continual access to background information, verification of models through simulation, fast access to data through the web interface, and use in education. The Modelers Workspace software runs on the client machine, with a local database and local Genesis simulator installation; other databases at remote locations contain published Genesis simulations, experimental data, and literature references.

The project has chosen an object database approach for the Neural Database not only because of the complex, nested data types, but also because the object-oriented Java language of the interface is a natural fit. The Objectstore product from Object Design has been chosen because it is cheap, lightweight, 100% Java, and it has an advantageous University program. A decision has been made to use CORBA for integrating heterogeneous databases.

Genesis Neural Database and Modelers Workspace: http://www.bbb.caltech.edu/GENESIS/genesis.html
Genesis: http://www.bbb.caltech.edu/GENESIS/genesis.html
Object Design: http://www.odi.com/
CORBA, the Common Object Request Broker Architecture: http://www.acl.lanl.gov/CORBA/

5.3 The LIGO Gravitational Wave Detector

Sam Finn, Northwestern University <L-Finn@Nwu.Edu>

LIGO (Laser Interferometric Gravitational Wave Observatory) is a large NSF-funded project dedicated to the detection of cosmic gravitational waves and the harnessing of these waves for scientific research. It will consist of two widely separated installations within the United States, operated in unison as a single observatory. Gravitational waves are ripples in the fabric of space and time produced by violent events in the distant universe, for example by the collision of two black holes or by the cores of supernova explosions; they are emitted by accelerating masses much as electromagnetic waves are produced by accelerating charges. These ripples in the space-time fabric travel to Earth, bringing with them information about their violent origins and about the nature of gravity. To detect the very weak waves that are predicted will require two installations, each with a vacuum pipe arranged in the shape of an L with four-kilometer arms. When the waves enter the LIGO detector they will decrease the distance between the ends of the arms by much less than the diameter of a hydrogen atom over the 4 kilometer length. These tiny changes can be detected by the interference patterns of high-power laser beams reflected back and forth along these arms.

The detector is broadband, operating from ~50 Hz to ~3 kHz, and signals may occur at any time. The continuous data stream consists of a gravity-wave channel (16 kbyte/sec) together with other channels with instrument control and status data, and physical environment monitoring. The data rate of these ``other'' channels is roughly 100 times the data rate from the gravity-wave channel. Metadata such as instrument state, calibration, operator logs, and summaries, is expected to be ~2 kbyte/sec. The data archive from the three detectors will be reduced by up to 90% in volume and stored for at least five years, creating 250 Tbyte.

Users of the LIGO data archive will be worldwide. Scientists will be looking for theoretically expected signals, using optimal filtering and very long FFT's; other necessary services will include convolution, correlation, power spectra, heterodyning, and principal component analysis. We expect different kinds of user access patterns: many simple requests for short segments of data with minimal preprocessing; some requests for large scale data crunching with periodic requests for large volumes; some users may want long stretches with one or two of the hundreds of available channels; some may want short stretches of all the channels.

LIGO home page: http://www.ligo.caltech.edu

5.4 NPACI Digital Sky Project

Tom Prince, Caltech <prince@caltech.edu>

If the pixel size is one arcsecond, and a sky survey uses 2 bytes per pixel, then the survey size is one terabyte; for multi-wavelength surveys the size is tens of terabytes. For the first time, these sizes can be handled by the available computing technology, thus allowing construction of the Digital Sky.

The project will integrate a catalog and image database. Requirements, including the handling of over 15 Tbyte, with a compute-intensive database engine and a compute-intensive image analysis facility, will be met with an object-relational or relational database architecture; the data-handling and compute services will be provided by NPACI.

The project is actually a federation of independent data archives, initially with an optical survey (DPOSS), an infrared survey (2MASS), and two surveys at radio wavelengths. The surveys will not be centralized; they will remain under the control of their producers. DPOSS will use the Objectivity object-oriented database, and 2MASS uses the Informix object-relational product; Digital Sky will produce a broker that can interface with each of these in order to communicate with the user interface layer and thence with the client. Emphasis is on modular components with well-defined interfaces, on extensibility, and on customizability.

The Digital Sky will allow ``virtual observing'' as well as computational and statistical analysis of large classes of sources. The data associated with Digital Sky itself will be in the form of relationships between sources in the federated surveys (``this source in DPOSS is probably the same as that source in 2MASS''). It will provide services that include: Query generation and optimization; Source catalog search, Image catalog search, Image archive access, System metadata access, and Relationship generation.

The Digital Sky: http://www.cacr.caltech.edu/SDA/digital_sky.html
NPACI, National Partnership for Advanced Computational Infrastructure: http://www.npaci.edu
Objectivity: http://www.objy.com
Informix: http://www.informix.com

5.5 ESO VLT Science Archive Research Environment

Miguel Albrecht, European Southern Observatory <malbrech@eso.org>

The Very Large Telescope, under construction in Chile, uses the Data Flow System to enable its components to work together in an efficient manner, and to realize its full scientific potential. The VLT can be scheduled for maximal utilization of available observing time, with continuous calibration and monitoring, and delivery to scientists of processed data products.

Some of the challenges for the archive facility have been answered. DVD jukeboxes provide economical long-term storage of large data volumes (1 Tbyte/year in 1999 to 90 Tbyte/year in 2004). There is a reliable data distribution to subscribers, an observatory-wide dictionary of keywords and semantics, a multi-site data warehouse of VLT operations, configurations, observatory logs, and ambient conditions.

The complete information life cycle for VLT observations is handled by the Data Flow System. Proposals are submitted electronically, Observation Blocks are created for each target, a scheduler system prepares a list of possible Observation Blocks to be executed over the next few nights, and a short-term scheduler selects the next Observation Block based on environment and configuration data. Raw data and calibration information are automatically collected for analysis and approved calibration procedures applied. Quality control is monitored and the finished data products are both stored in the archive and returned to the scientist.

The archive records the history of VLT observations: it is thought of as another research tool, another VLT instrument. It supports VLT observation preparation and analysis as well as disseminating the results to astronomers and the community.

The Very Large Telescope Project: http://www.eso.org/research/vlt-proj/
The ESO Data Interface Definition: http://archive.eso.org/dicb/
The VLT Science Archive Project: http://archive.eso.org/VLT-SciArch/

5.6 On-demand Geodetic Imaging

Mark Simons, Caltech <simons@gps.caltech.edu>

Synthetic Aperture Radar (SAR) is one of a number of remote-sensing products that form scientific data archives. The raw data stream requires careful and intensive processing to produce intelligible images; the resulting images are not affected by cloud cover and can see through moderate amounts of vegetation. Interferometric SAR (InSAR) uses even more careful and intensive processing. It can provide an extremely accurate digital elevation model of the terrain as well as measure centimeter-level surface displacements. InSAR provides good spatial resolution over a wide spatial extent. Since the data is taken by satellite, there are frequent repeat observations even of remote areas.

Applications of InSAR include estimation of volcanic and seismic hazard, uplift and subsidence due to mining or fluid migration, seismic damage assessment, as well as scientific questions concerning geologic strain budget and lithospheric rheology. The proposed Geodetic Imaging Facility will go beyond just archiving of raw radar data. Our primary effort will be to provide on demand processing of InSAR data on parallel supercomputers, and fusion of the InSAR products with other data streams including commercial GIS databases, other remote sensing data, and data from existing GPS and seismographic networks which are managed under separate archiving facilities. The prototype will allow different classes of users to efficiently access the raw data, the multiple levels of reduced data, "pre-packaged" data products, and the archive catalogs. We will be placing special emphasis on how users progress from discovery, to first interaction, to continuous and serious use of the archive. The prototype will pay careful attention to authentication, session-management, accounting, caching, database technology and fault-tolerance.

5.7 The Earth Observing System

Ramachandran Suresh, NASA/GSFC/Raytheon STX <suresh@rattler.gsfc.nasa.gov>

The Earth Observing System is part of NASA's Mission to Planet Earth, an extensive international and interagency collaboration providing long-term data for 24 key measurements required for the study of global climate change and the interaction between atmosphere, oceans, land and biota.

The EOSDIS distributed data system supports heterogeneous data archives, and uses standard data formats and metadata standards. Over 240 data products will be provided from the EOS AM mission to (eventually) 10,000 worldwide users.

EOSDIS data will be stored in HDF-EOS format, an extension of the Hierarchical Data Format developed by NCSA, and the metadata will be in ECS format, which is compliant with the Federal Geographic Data Committee metadata standard. Currently users can search, locate, browse and order satellite and other remotely observed data through the V0 IMS system.

Enormous amounts of data are expected from the many Earth Observing satellites, including the EOS series, in the next few years; the collective data volume from all these missions could be as high as 15 terabytes per day.

Some of the outstanding user-interface challenges for EOSDIS concern searching and locating data within this vast and complex archive. Some queries that cannot be answered within the current system might be of the form

``Show me (something) located (somewhere) during (time period)'', for example
``Show me sulfur dioxide concentrations in the Pacific for the last five years''.
``Show me (implication) of (event) to (ecosystem)'', for example
``Show me the flooding effects of increased volcanic activity on the Amazon watershed''.
``Show me (change) that usually precedes (other change)'', for example
``Show me cloud changes that precede an increase in wind over Indonesia''

To answer such knowledge-based queries, we need: Direct access to a large base of data and a large base of science knowledge; Real-time knowledge-based processing that really works; New tools for communication across disciplines.

In contrast, the current system has much data that is not online, slow processing and I/O, and data that is not stored in a relational or object database. While historical approaches to organizing data and metadata are still valid, the emphasis should be on organizing knowledge and navigating information. Digital library technologies and knowledge-based algorithm development are essential to accomplish this task.

For further information: http://www.earth.nasa.gov/

5.8 Molecular Dynamics Trajectory Database

Lennart Johnsson and Michael Feig, University of Houston <johnsson@cs.uh.edu>

Based on a prototype implementation, we present an outline of a high-level architecture for databases of molecular dynamics trajectories, as dynamical analogues of the widely used structural databases of chemical and biochemical molecules. The objective of the proposed databases is to facilitate sharing of available simulation data through convenient, secure and differentiated Web access. The envisioned shared molecular dynamics database would be queried predominantly for thermodynamical, mechanical and geometrical properties that are calculated from the particle trajectories, rather than for the trajectories themselves. Examples include time correlation and spatial distribution functions, transport coefficients, structure-specific angles and distances, and root mean square deviations from reference structures. Thus, molecular dynamics databases have to extend beyond the management of the trajectory data to the integration of flexible analysis programs that can perform common types of analysis but also allow the incorporation of user-defined routines for special requests.

To generate the final data requested by the user, additional manipulations such as averaging, window selection, Fourier transformation, and comparison with other quantities from the same or other trajectories may be necessary. These simple operations can be applied to the processed data on interactive timescales in a separate, second analysis step each time a request is made; caching of the final output is then unnecessary. The data organization of a molecular dynamics database is thereby described in three layers consisting of the raw trajectories, the processed data, and the transient final data presented to the user. In typical simulations the three-dimensional coordinates and velocities of the thousands to millions of simulated objects are stored at tens of thousands to tens of millions of timesteps over the course of the simulation. The resulting storage requirement is 0.1-100 GB for a single trajectory in the first data layer. In addition to the particle coordinates and velocities, structural and thermodynamical information is included with each trajectory. While the first layer is static, the second layer grows dynamically with each new quantity that has not been calculated before. This requires content descriptors at the second layer to distinguish the quantities and their different data structures.
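
As a sketch of this three-layer organization, the following Java types give one possible labeling of the layers; the names and fields are invented for illustration and are not the prototype's actual schema.

    import java.util.Map;

    // Illustrative labels for the three data layers described above.
    public class TrajectoryLayers {
        enum Layer { RAW_TRAJECTORY, PROCESSED_QUANTITY, TRANSIENT_RESULT }

        // Second layer: a precomputed quantity plus the content descriptor
        // needed to distinguish it from other stored quantities.
        record ProcessedQuantity(String descriptor,    // e.g. "backbone dihedral angles"
                                 String trajectoryId,  // the first-layer trajectory it came from
                                 double[] values) {}

        // Third layer: transient output (averaged, windowed, transformed)
        // assembled per request and not cached.
        record TransientResult(String request, Map<String, double[]> series) {}

        public static void main(String[] args) {
            ProcessedQuantity q = new ProcessedQuantity(
                "rms deviation from canonical B-DNA", "dna-sim-001",
                new double[] { 1.2, 1.4, 1.3 });
            System.out.println(Layer.PROCESSED_QUANTITY + ": " + q.descriptor());
        }
    }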

The large data objects and the computational requirements of most analysis calculations in a molecular dynamics database, combined with the inherent parallelism of the algorithms employed, suggest the use of parallel computers and (wide-area) distributed computing environments to provide sufficient performance.

An easy implementation of a standardized user interface with graphical capabilities is provided by a web browser. It guarantees platform independence and inherently supports distributed computing. By using browser-supported client-side applications, the final data preparation and graphical display can be done on the user's side to further distribute the computational load.

A database with a structure as described above for molecular dynamics data could be used not only for molecular dynamics trajectories, but also for other dynamical data of similar structure. An example would be astrophysical simulations, where interactions between stars rather than atoms are involved. Other possible applications are hydrodynamic and mixed particle-fluid simulations.

We have implemented a prototypical molecular dynamics database for DNA simulations. Because of disk space limitations on a typical workstation architecture, the trajectories of all simulations are stored primarily on magnetic tape. They are loaded onto the machine only once to perform a standard analysis run that includes calculation of the most commonly examined structural and mechanical parameters of DNA, such as helical parameters, backbone dihedral angles, and root mean square deviations from canonical A- and B-DNA structures over the course of the simulation. The results of the analysis are then stored in compact form on disk. The interface to the database through a web browser is controlled by a collection of scripts that allow database queries for the precalculated quantities as well as maintenance of the database itself. With each query, averages over time or over different structural entities of the DNA can be specified, as can a simulation time window. To fulfill a query the requested prestored analysis data is extracted, processed accordingly, and then converted into a graphical representation for display within the browser. Alternatively the final data can be sent to the user by electronic mail.

5.9 Overview of the Planetary Data System

Steve Hughes, JPL <Steve.Hughes@jpl.nasa.gov>

The NASA Planetary Data System (PDS) is an active archive that provides high quality, usable planetary science data products to the science community. This system evolved in response to scientists' requests for improved availability of planetary data from NASA missions, with increased scientific involvement and oversight. It is sponsored by the NASA Office of Space Science, and includes seven university/research center science teams, called discipline nodes, as well as a central node at the Jet Propulsion Laboratory. PDS today is a leader in archive technology, providing a basic resource for scientists and educators.

The goal of PDS is to enable science by providing high quality, usable planetary data products to the science community. We are committed to curating and providing science expertise for quality products, not just warehousing items. We distribute data to a wide audience while ensuring that the media and the system itself are long-lived, operating beyond the existence of any one planetary project. There are several objectives, discussed below.

The first objective is to publish quality, well-engineered data sets. All PDS-produced products are published on CD media and have been peer reviewed by a group of scientists to ensure that the data and the related descriptions are appropriate and usable. PDS provides easy access to these data products through a system of on-line catalogs organized by planetary disciplines. While PDS does not itself fund the production of archive data from active missions, we work closely with projects to help them design their data products. If projects deliver well-documented products, the investigators' expertise is still available and the data can be used immediately by the general science community.

Another objective is to maintain the archive data standards to ensure future usability. Over the years of PDS experience, a set of standards has evolved for describing and storing data so that future scientists unfamiliar with the original experiment can analyze the data on a variety of computer platforms with no additional support beyond the product. These standards address the structure of the data, description contents, media design, and a standard set of terms.

The final PDS objective is to provide expert scientific help to the user community. PDS is an active archive, rather than a storehouse, and is staffed by scientists and engineers familiar with the data. While most of the archive products may be accessed or ordered automatically by users, PDS provides discipline scientists to work with users to select and understand data. Special processing can be done to generate user-specific products. In addition, these same scientists form a network into the science community to hear requests and advice for PDS.

PDS Home Page: http://pds.jpl.nasa.gov/
PDS Standards Documents: http://pds.jpl.nasa.gov/prepare.html
PDS Standards Reference: http://pds.jpl.nasa.gov/stdref/stdref.htm
Planetary Science Data Dictionary: http://pds.jpl.nasa.gov/cgi-bin/ddsearch.pl?FIRST

5.10 The Digital Puglia Radar Atlas

Giovanni Aloisio, University of Lecce, and Roy Williams, Caltech <aloisio@sara.unile.it>

Digital Puglia is a prototype of a system to provide ``processing on demand'' for Synthetic Aperture Radar (SAR) data, where a user causes a computing job to run, perhaps on remote machines, to create the data she requests. The output of the processor may be further processed by procedures such as mosaicking, registration, interpolation, rotation, GIS integration, or tasks such as Principal Component Analysis, Singular Value Decomposition, or Maximum Likelihood Classification or fusion with other data, such as a digital elevation model.

The library will contain raw SAR data and processed SAR images from the Italian Space Agency. A major feature of the library is the utilization of high-performance computing in addition to data storage. When processed images are not available in the archive, they can be computed immediately by available computing resources from raw data that is in the library, and the result not only delivered to the client, but also stored back into the library, with the data catalogue automatically updated. As Digital Puglia is built, we will address access control, authentication, security, billing and encryption services. We are also interested in mechanisms by which users can publish derivative data at the library, with the opportunity to attach signature and peer-review certificates.

SARA (Synthetic Aperture Radar Atlas) is a web-based digital library, a prototype for Digital Puglia, that has been running successfully at Lecce, Caltech, and the San Diego Supercomputer Center for over a year. Data is replicated on multiple servers to provide fault tolerance and to minimize the distance between client and server. SARA already allows clients to download SAR images from the public-domain SIRC dataset. A client navigates web pages containing Java applets that implement a GUI showing a map of the world. Clicking on the map zooms in on a part of the world until the coverage of the atlas can be seen in terms of the individual SAR images, which are perhaps 50 km in size. Chosen subsets of an image can then be downloaded in any of a variety of formats.

The interface to the library is based on the `thin client' model, which assumes as little as possible of the client's terminal: specifically, that the client has a connection to the Internet and a Java-enabled web browser, and need have no other software or hardware. With this thin-client access, we hope to encourage wide, collaborative use of the library -- from the scientist at a conference, in a colleague's office, from a laptop on an airplane. The thin-client model enables a quick entrance to the digital library, without reading manuals or performing a meticulous software installation.

Authentication is through one-time passwords and telnet. When the telnet server issues a challenge, an applet at the client may be used to produce the one-time password. We use a modified version of the Java telnet client `WebTerm' to communicate with the server; the modifications allow this text-based protocol to be extended to a fully functional graphical user interface. The (suitably authenticated) client is connected to a metacomputer (a combination of geographically distributed computing and data services) by typing text commands, or by using Java applets that generate the text commands more easily. The metacomputer is controlled by the Globus metacomputing environment.
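
The general idea of a challenge-response one-time password can be sketched as follows; this is a generic illustration using iterated hashing, not the specific scheme used by SARA or WebTerm.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    // Generic challenge-response one-time passwords by iterated hashing.
    // The server stores hash^(n+1)(secret); the client answers challenge n
    // with hash^n(secret), and the server verifies by hashing once more.
    public class OneTimePassword {
        static byte[] iteratedHash(String secret, int count) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] value = secret.getBytes(StandardCharsets.UTF_8);
            for (int i = 0; i < count; i++) {
                value = md.digest(value);   // digest() also resets the MessageDigest
            }
            return value;
        }

        public static void main(String[] args) throws Exception {
            int challenge = 97;   // sequence number issued by the server
            byte[] response = iteratedHash("a long secret pass-phrase", challenge);
            byte[] stored   = iteratedHash("a long secret pass-phrase", challenge + 1);
            boolean ok = MessageDigest.isEqual(
                MessageDigest.getInstance("SHA-256").digest(response), stored);
            System.out.println("authenticated: " + ok);
        }
    }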

Synthetic Aperture Radar Atlas: http://sara.unile.it/sara, or http://www.cacr.caltech.edu/sara
Globus Metacomputing Environment: http://www.globus.org
WebTerm telnet client: http://www.nacse.org/web/webterm2.0/

6. Presentations on Tools

6.1 Experiences with Distributed Environmental Data Repositories at SDSC

Richard Marciano, San Diego Supercomputer Center <marciano@sdsc.edu>

SDSC is involved in NPACI (National Partnership for Advanced Computational Infrastructure). A major thrust of NPACI is the creation of wide-area, distributed computing and data infrastructures. The `Computational Grid' project provides a web-linked way to execute distributed applications across heterogeneous compute platforms; the InterLib project provides information discovery and data publication, also using the web as a substrate. Services provided by an information-based computing architecture should include:
Registration and publication, to add data to the archive,
Discovery support, so people can find the archive and what is in it,
A Storage broker, to present a unified view of heterogeneous storage,
Authentication, access control, and encryption services,
A scheduler for large queries that are compute or data intensive,
A library of compute methods to apply to the data, and
Metadata and cataloging services.

The SRB (Storage Resource Broker) is client-server middleware implemented at SDSC to provide a uniform access interface to different types of storage devices. SRB provides a uniform API that can be used to access data sets by connecting to distributed, heterogeneous resources that may be replicated. SRB, in conjunction with MCAT (the Metadata Catalog), provides a means of accessing data sets and resources by querying their attributes instead of knowing their physical names or locations.
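
The contrast between path-based and attribute-based access can be sketched as an interface; the following is a hypothetical illustration only and does not reproduce the actual SRB or MCAT API.

    import java.io.InputStream;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of an attribute-based storage broker.
    public interface StorageBrokerSketch {
        // Path-based access: the caller must know where the data physically lives.
        InputStream openByPath(String hostName, String physicalPath) throws Exception;

        // Attribute-based access: the caller describes the data by its catalog
        // attributes, e.g. {"variable": "salinity", "year": "1997"}, and the
        // broker resolves matching data sets wherever they are stored or replicated.
        List<InputStream> openByAttributes(Map<String, String> attributes) throws Exception;
    }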

SRB has been used in the Berkeley and Alexandria digital libraries, environmental and ecological repositories at SDSC, and for a prototype patent retrieval system at SDSC. SRB allows clients to use HPSS, Unix files, Oracle, DB2, and other data stores.

An application that uses the SRB is TIES, a distributed data atlas for estuarine studies of Chesapeake Bay. The TIES program conducts three cruises per year, one each in the spring, summer, and fall. Each cruise consists of 26 across-bay transects and one axial transect. On each transect, measurements are made of six hydrographic variables, including temperature, salinity, density, oxygen, backscatterance, and chlorophyll concentration. Additionally, four size classes of zooplankton density are measured, as is total fish biomass.

Storage Resource Broker: http://www.npaci.edu/DICE/srb/srb.html

6.2 The ASCI Scientific Data Management Tools

Celeste Matarazzo, Lawrence Livermore Lab <celestem@llnl.gov>

The five focus areas for SDM in the Accelerated Strategic Computing Initiative (ASCI) program at Livermore are:
Metadata tools, for extraction, browsing, editing,
Data modeling and formats, to provide common data representations,
Metadata management, to provide a distributed infrastructure using commercial databases and networking,
Metadata exploitation, to federate metadata resources and allow mining, and
Communication and collaboration between three major Government labs.

Because a central goal of the ASCI program is to push simulation and modeling for Science-Based Stockpile Stewardship to unprecedented levels, a substantial portion of the SDM effort is concerned with managing the data emerging from the ASCI simulation codes.

The ASCI SDM tools provide scientists with advanced capabilities for organizing, searching, and interacting with simulation results and the data used in support of simulations. Scientists require access to disparate types of information, including computed and experimental data, papers, reports, and notes, not only in archives, but throughout a distributed, heterogeneous computing environment. More than just improved speed in retrieving files, scientists need new data management techniques that assist in organizing and synthesizing the large amounts of information scattered across the network. The ASCI SDM tools are summarized below.

SimTracker summarizes computer simulation results by automatically generating textual and image metadata from calculation output. The scientist uses a web browser to view this metadata as the code is running, or after it has completed. SimTracker also simplifies access to analysis tools, and it provides a means for documenting calculations for future reference.

E-Notes is a Java-based client application for viewing and editing metadata. Several metadata elements may be associated with objects, including object type, creator, thumbnail graphics, keywords, and a summary.

E-Server is a broker between the E-Notes client and the heterogeneous metadata and data repositories. Communication with the client is via Java RMI (Remote Method Invocation); E-Server receives object names and returns metadata record objects. Communication with the databases is done either directly through JDBC (Java Database Connectivity), or, in a multi-tier architecture, via a combination of custom and proprietary software.
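
A minimal sketch of this broker pattern in Java RMI follows; the interface and class names are hypothetical, not the published E-Server API. A server-side implementation would typically extend UnicastRemoteObject and consult the underlying databases through JDBC.

    import java.io.Serializable;
    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.util.Map;

    // Hypothetical metadata broker in the style described above.
    interface MetadataBroker extends Remote {
        MetadataRecord lookup(String objectName) throws RemoteException;
    }

    // The record returned to the client; it must be serializable to cross RMI.
    class MetadataRecord implements Serializable {
        String objectType;
        String creator;
        String thumbnailUrl;
        String summary;
        Map<String, String> extraFields;
    }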

Other tools include E-Search, a Java tool for searching metadata databases, to be integrated with E-Notes; E-Miner, a tool for automatically generating metadata from files on local disks and in the HPSS archival storage system; and XDIR, a graphical FTP client for transferring and managing files. Metadata fields for data objects are based on the standard defined by the Nuclear Weapons Information Project.

ASCI Project: http://www.llnl.gov/ASCI/
LLNL's Intelligent Archive tools: http://www.llnl.gov/ia/

6.3 Dublin Core Metadata: Resource description in the Internet Commons

Stuart Weibel, Online Computer Library Center <weibel@oclc.org>

Let us define metadata as `structured data about data'. Examples of metadata are the cataloguing record in a library, the table of contents of a book, and a hyperlink to a web page. When the metadata is effectively structured, the data can be automatically scanned for the purposes of information discovery, archiving, cataloguing, or making a synopsis.

Clearly, metadata is most effective if it is created in a conventional way, so that digital archives can collaborate, interoperate, share software tools, and be federated. The Dublin Core is such a convention: it is a set of semantics that may be beginning to form the heart of a global metadata convention. We also describe the Resource Description Framework (RDF), for encoding and transporting metadata about web documents.

A resource description community is characterized by common semantic, structural, and syntactic information. The Internet Commons embraces many formal and informal resource description communities: libraries, museums, geospatial data, commerce, science, home pages, etc., and interoperability between these requires conventions about:
Semantics: the meaning of the elements,
Structure: metadata should be both human readable and machine parseable,
Syntax: the grammar should convey semantics and structure.

Semantics and the human-readable structure are the responsibility of the Resource Description Communities, such as the Dublin Core group; machine-parseable structure and syntax are encoded with HTML, SGML, XML and its subset, RDF.

The Dublin Core has evolved in a series of workshops over the last few years. The objective is to build an interdisciplinary consensus about a core element set for resource discovery, with simple and intuitive semantics, which is cross-disciplinary, international, and flexible. The Dublin Core element set is:

Title, Author/Creator, Subject/Keywords, Description, Publisher,
Other Contributor, Date, Resource Type, Format, Resource Identifier,
Source, Language, Relation, Coverage, Rights Management

There are 15 elements, each is optional, each is repeatable, each is extensible (a starting point for richer description). Each is designed to be interdisciplinary (semantic interoperability), and international (10 languages supported so far). Semantics may be extended by refinement: improving the sharpness of description with a controlled vocabulary of qualifiers that refine semantics. Semantics may also be improved by extension: additional elements, or complementary packages of metadata, for example for discipline-specific or rights-management purposes. In the case of scientific archives, extensions could be made with high-level descriptors to describe the datasets, or by the use of domain-specific schemes, refining semantics of the core elements.
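
As an illustration of the element set, here is a minimal, invented Dublin Core record (loosely modeled on the TIES data described in section 6.1); only the element names are standard, and all values are made up.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Prints an invented Dublin Core record as element: value pairs.
    public class DublinCoreExample {
        public static void main(String[] args) {
            Map<String, String> record = new LinkedHashMap<>();
            record.put("Title", "Chesapeake Bay hydrographic transects (example)");
            record.put("Creator", "An Estuarine Research Group");
            record.put("Subject", "estuarine hydrography; salinity; chlorophyll");
            record.put("Description", "Across-bay transect measurements of six hydrographic variables.");
            record.put("Type", "Dataset");
            record.put("Format", "HDF");
            record.put("Identifier", "http://example.org/data/transects-demo");
            record.forEach((element, value) ->
                System.out.println("DC." + element + ": " + value));
        }
    }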

The Dublin Core is being formalized as a standard by several bodies, and there are over 50 implementation projects in 10 countries. Archivists may consider the Dublin Core as a metadata standard if they have a rich standard already, but need a simple one; or to reveal the archive to other communities using common semantics; or to provide a unified access to databases with different underlying schemas; or simply to avoid inventing the semantics anew.

The Resource Description Framework (RDF) is a W3C initiative for an architecture for metadata on the web, a convention to support interoperability among applications that exchange metadata. The syntax is expressed in XML, but the semantics are defined by others.

The RDF data model is a directed graph, where the nodes are resources and the edges are named properties, so that an RDF description is a graph of arbitrary complexity. RDF is important because there is a market demand for its deployment, because it provides a model and syntactical framework for metadata, and because it will support independently developed metadata element sets, such as MARC, Dublin Core, TEI, EAD, CIMI, etc.
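
The directed-graph model can be made concrete with a few triples; the resource URI and values below are invented, and the property names borrow the Dublin Core elements listed earlier.

    import java.util.List;

    // RDF viewed as a directed graph: nodes are resources, edges are named properties.
    public class RdfGraphSketch {
        record Triple(String resource, String property, String value) {}

        public static void main(String[] args) {
            String doc = "http://example.org/data/sar-scene-42";   // invented resource
            List<Triple> description = List.of(
                new Triple(doc, "DC.Title",   "Synthetic aperture radar scene (example)"),
                new Triple(doc, "DC.Creator", "A. Scientist"),
                new Triple(doc, "DC.Type",    "Image"));
            description.forEach(t -> System.out.println(
                t.resource() + " --" + t.property() + "--> " + t.value()));
        }
    }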

Finally we will have the means to express highly structured data and metadata on the Web in an extensible fashion. Tools for managing metadata will be integrated into the Web infrastructure. The biggest challenge is to promote consistent deployment.

Dublin Core Homepage: http://purl.org/metadata/dublin_core
XML, Extensible Markup Language: http://developer.netscape.com/viewsource/bray_xml.html
RDF Working Group: http://www.w3c.org/RDF

6.4 Design Issues and Solutions at NCSA

Nancy Yeager, University of Illinois <nyeager@ncsa.uiuc.edu>

A scientific data repository should allow metadata searching for discovery, and once the correct dataset is found, it should be possible to search for and retrieve a piece of data within that dataset. The data may be in various multimedia formats; it may be abstracts and hyperlinks to other published sources. The repository should handle large amounts of data, classify and query datasets based on metadata, and serve large amounts of data that may reside in tertiary storage.

The Astronomical Digital Image Library at NCSA is such a repository. It ingests data, responds to search queries, and allows browsing and delivery of data to clients. The user community is encouraged to publish data, and the library is linked to other databases, allowing cross-database search. Gazebo is a search gateway that can execute a query against multiple target data sources, with queries expressed in a simple XML-based protocol.

Interacting with a scientific digital library is perhaps not just a query and a response, but a conversation: the user searches, browses the result, retrieves a subset, then returns to searching, and so on. The implementation should provide the right set of services to support this: chunking, subsetting, compression, and efficient I/O. This flexibility of delivery can be achieved with HDF (Hierarchical Data Format).

An HDF file is a multidimensional array of records, with an efficient I/O implementation, including thread-tolerant MPI-IO for parallel applications; chunking, tiling and compression are also supported. The interface is object-oriented; this is leveraged to produce an `abstract object model', a broker that provides interoperability. The broker can search and access data from the HDF library, or from netCDF, FITS or other file formats. In a similar fashion, interoperability can be achieved between HDF-EOS files and the OpenGIS framework for Geographical Information Systems. The HDF format is advancing in several new directions, with a Java-HDF library and an object exchange mechanism through serialized Java objects.
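
What such an `abstract object model' broker might look like as a programming interface is sketched below; the names are illustrative, not the actual NCSA interfaces.

    // Hypothetical format-neutral broker over HDF, netCDF, FITS and similar files.
    public interface ArrayBrokerSketch {
        // List the named multidimensional arrays in a file, whatever its format.
        String[] listArrays(String fileName) throws Exception;

        // Read a rectangular subset of one array; start and count give the
        // corner and extent along each dimension (supporting chunked access).
        double[] readSubset(String fileName, String arrayName,
                            long[] start, long[] count) throws Exception;
    }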

Astronomical Digital Image Library: http://imagelib.ncsa.uiuc.edu/imagelib.html
HDF Home Page: http://hdf.ncsa.uiuc.edu/

6.5 Collaborative Databases, Scientific Visualization, and Education

Geoffrey Fox, Syracuse University <gcf@npac.syr.edu>

The current incoherent but highly creative Web will merge with distributed object technology in a multi-tier client-server-service architecture with Java-based combined Web-ORBs. Several technologies are in collaboration and competition: Java (Beans and RMI), CORBA, XML, and DCOM. Even before the overall architecture is clear, we can use one or more middle server tiers to connect clients to backend archives. The middle layer may be a CGI or Java-servlet web server, a transaction server, or an ORB (object request broker). Most of the Java activity in the commercial world focuses here, at the server and the middleware, with Enterprise JavaBeans and servlets.

Java Grande is a project exploring Java for high performance scientific and engineering computing. We believe Java has the potential to be a better environment for this than Fortran and C++. Services (databases, MPP's, etc.) are encapsulated as distributed objects, with the technologies mentioned above. The architecture consists of three tiers: at the `big iron' tier, Globus connects MPI supercomputers and databases with parallel I/O channels. These machines communicate with the `distributed object' tier, where serialized objects are exchanged between servers (session management, metacomputing resource discovery, visualization, etc.). In the third tier are clients and user interfaces. Much of the necessary software infrastructure is provided by JWORB, a collection of protocol integration tools.
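
In plain Java, the exchange of serialized objects between tiers looks like the following generic sketch (independent of JWORB or any particular protocol).

    import java.io.*;

    // Generic illustration of exchanging a serialized object between servers.
    public class SerializedExchange {
        static class SessionEvent implements Serializable {
            String user, action;
            SessionEvent(String user, String action) { this.user = user; this.action = action; }
        }

        public static void main(String[] args) throws Exception {
            // Sender: serialize the event to bytes (in practice, onto a socket).
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(new SessionEvent("student17", "open-slide-42"));
            }
            // Receiver: reconstruct the object from the byte stream.
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                SessionEvent e = (SessionEvent) in.readObject();
                System.out.println(e.user + " -> " + e.action);
            }
        }
    }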

WebWisdom is a collaboration and education system that has been used successfully in teaching real courses to real students with the instructor and students divided between Syracuse, NY and Jackson, MS. Only low bandwidths (4-10 Kbytes/sec) are needed, because bulky data is mirrored at each site. The Tango collaboration environment is the collaboration substrate for WebWisdom, providing a synchronous shared event model, text chat rooms, shared browsing, slide shows, and white board services. Other services are being added, including shared Powerpoint, Excel, shared Java applets, shared VRML.

Scivis is a scientific visualization environment, which is collaborative and customizable, and written as a 100% Java application. Scivis3D is released, using the Java3D package for access to high-performance 3D rendering if the client workstation is capable.

This talk: http://www.npac.syr.edu/users/gcf/isdamar98
``Building Distributed Systems on the Pragmatic Object Web'': http://www.npac.syr.edu/users/shrideep/book
Java Grande Forum: http://www.npac.syr.edu/projects/javaforcse

7. Breakout Group Reports

7.1 The User Experience

Group Leader: Cherri Pancake <pancake@cs.orst.edu>

Characterize the users of an Active Digital Library

Scientific data archives have a very diverse audience, including everyone from research scientists to university and K-12 students to legislative administrators; there is thus a very wide range of expertise. The local users and the advanced users get the attention first, and others are added later. This may have something to do with the way digital libraries are being set up now, and with the kinds of interfaces that are available, but we certainly saw that in the future there is going to be growing pressure to include other audiences, perhaps even those who do not love computers the way we do.

How do people go about using the library?

Some have a very specific question in mind, and they want to use the library as a resource to answer it in the most efficient way. Some know about a specific collection of data and want to use it in the most effective way. But there are other kinds of users too, and these are the ones that aren't really being supported very well from the point of view of user interfaces. Some users want to extract and dissect a portion of the data, bring it down, and then merge it with data from other libraries. There is very poor support for another category of users: those who are ``shopping around'', who are browsing, looking for interesting things. But this is what educators, legislators and administrators want to do, and they get essentially no support right now. I think you'll see that some of our responses to the other questions reflect the fact that we think user interfaces don't support these types of users.

How can we help people learn to use the digital library?

The best approach, we decided, is by far a walk-through with an experienced user. This is the analogue of the human librarian in a physical library, but unfortunately it doesn't scale. So what else could we do? As part of the first experience, we would like to see excellent and exciting examples of what the digital library provides, turned into an enticing tutorial so that somebody can quickly see the value of the library. Another real plus would be to have the classification scheme clearly visible from the opening overviews of the library. Usually you go to the first screen and, if you haven't already seen what's there and how to use it, you don't have much of a clue how to proceed.

There needs to be a lot more attention to consistency across the library so that a user can leverage previous experience with other libraries or user-interfaces. We need multiple interface levels: progressively more features will support progressively more advanced users.

We must have a real help facility, help on the buttons and widgets of the GUI, help on organization of the library, help on mental models of how the library works, indexes to the help documents, help on how to get help. Help should be telling people how to use the library effectively, not answering the questions to which the implementors happen to know the answers.

And finally we need more navigation metaphors. For geospatial libraries, we can start at the level of the world and gradually zoom in to the areas you want to go to. That works for some types of information, but there are other possibilities: in time series, for example, you can have a mental model of an infinite future and an infinite past, rather than the implementor's view of files and directories. We need to get more creative in showing people at the outset how to navigate.

How can we entice new users to the library?

Given the pressure to map our libraries to new audiences, we might mimic the human library. When you walk into a physical library, what happens? The librarian is able to size you up, see if you are a 13-year-old rocket scientist or a senior citizen who is afraid of cyberspace, and gauge the level of interface accordingly. The human librarian finds a way to relate the information or knowledge that's in the library to familiar information and experiences.

The computer users of today use trial and error to learn how computers work. They do not read manuals; in fact they don't like to use the help facility until they are stuck. So we should make sure that this does not lead to system crashes, but instead leads to exciting new avenues to reach the data.

What poor user interfaces are in widespread use?

So what are these interfaces that do not work? Well, first of all SQL: it has arcane syntax, it is error-prone, and it is not from the user's view at all, it is based on the database administrator's view. Though we might point out that SQL can be used to build an interface: the user interacts with a GUI, a broker, which in turn generates the SQL and displays the results.
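
A minimal sketch of that broker idea, assuming a hypothetical table, column names, and JDBC connection: the GUI supplies values, and the broker builds a parameterized query rather than exposing SQL to the user.

    import java.sql.*;

    // Sketch of a broker that turns GUI selections into SQL; the table and
    // column names are hypothetical.
    public class QueryBroker {
        public static void printMatches(Connection conn, String region, int year)
                throws SQLException {
            String sql = "SELECT title, url FROM datasets WHERE region = ? AND year = ?";
            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setString(1, region);   // value chosen from a GUI menu
                stmt.setInt(2, year);        // value chosen from a GUI control
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("title") + "  " + rs.getString("url"));
                    }
                }
            }
        }
    }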

The web itself, we decided, is not really a very good user interface. It doesn't really provide enough structure by itself, but just like SQL, a broker can use the web as a substrate on which to build an interface. Another terrible thing is the blank form that says ``type in your query'' with no examples or instructions available. If you don't already know, you're lost from the beginning.

What good user interfaces are in widespread use?

The one that always gets panned is the command-line interface, but it does offer one really good feature: you can see a history, which becomes a script for another session. You know how you arrived where you are and what you could have done differently; you can save the history and use it again to get to the same place, or, with editing, to a slightly different place. Or you can use a program to make a thousand variations on the script and get to a thousand different places.

How can we achieve a unified interface that is graduated for all levels of users?

The answer is we can't, so we might as well forget that. You cannot have the same interface for a specialist in a scientific discipline and a schoolchild, so we should stop fooling ourselves. What we can have, though, are multiple interfaces that all access the same library but target different levels of understanding, and then provide ways to move back and forth between the levels. You may come in at a novice level, use the library, then decide you're more of an expert, that you'd like to see a more professional approach.

It's very important that there be certain consistencies. What you're going to show the novice is just a few features, the usual things, the ``wizard'' approach. However, those features must still work when you move to the expert level; this allows people to learn, to capitalize on the experience gained with the novice interface.

Also, we think it would be really nice for user interfaces to provide a little self-assessment. When I come in, how do I know which level I should try? There might be a self-assessment: Have you ever heard of the term? Have you heard of it but you don't really know what it's about?

How can we capitalize on the user's familiarity with analysis and visualization tools?

We should serve up data to the user in formats that are acceptable to the tools the user wants to use, so they can continue with familiar tools in a familiar environment. Transporting the data from the library in a familiar form should also make it really easy to extract subsets. This is the way, for example, that I've been told by science education people to reach the K-12 audience. We had climate data, but they didn't want the precision that the climatologists and meteorologists wanted; they wanted coarser resolution with no error bars. The ability to export a common format is also going to help entice new users to the library, because they can extract data and then merge it with content from other libraries.

What should the interface remember about what a user did in the past?

Currently, the user interface doesn't know what you like to do, it doesn't know what you did, it doesn't know your expertise, it doesn't know what you're interested in, and yet we ought to be able to record some of that information. At the very least, the user shouldn't have to memorize these operations that he or she did to get something. We ought to be able to recover that in order to provide reusability.

Can we analyze user actions in order to improve and optimize the library?

Several very good examples were talked about: forming FAQ's, hot items of the day, most-requested items, and so on. There can be a privacy issue here if you associate the query with the user that made it: there was a total division of opinion on this. Some said there is no privacy issue, some said there was, and some said not unless we publish or distribute the information.

Should the user be a provider as well as a consumer of library data?

Yes, it's important if the library is going to grow, but it needs to be done in a controlled fashion. The new content could be acceptable as separate documents: some sort of extensions to the existing content. Another example might be a bulletin board, or a collection of questions from users and the corresponding answers from the librarians.

There are problems here: if anybody can add things, what about the responsibility of the library to stand behind the information? We decided that there probably should be multiple levels of refereeing or control. Some data meets top quality-control standards; there is data that has been submitted, contributions that have been reviewed and edited by the librarians, publishers or referees; and then there are loosely moderated contributions. We thought these probably need to be moderated for appropriateness to the library.

Now the biggest problem that we saw is from the user contributions, because it raises a lot of sticky issues about over-simplification of information, and about intellectual property.

Should users be able to refine or add to the metadata that's in the library?

If the library is going to grow and expand to have new mechanisms for classifying the information and navigating through the data, we need to allow users to do this. So we adopted the analogy that users add `cards in a different color', meaning that the contribution is not part of the library proper, but it might give you some really good information. We also considered the idea that new ways of moving through the library could very well be the best contribution that users could offer. The problem is going to be how to organize these contributions so that they are classifiable and searchable.

What kinds of access control are needed?

There may be temporary or immature results that really shouldn't be released yet. There may be sensitive or personal data. It may be that the PI that collected or generated the data needs first rights to the scientific results. We think it would be good to try to protect subsets of the data; for example, protecting very high resolution images, but releasing lower resolution images that would be of use to the general public.

We might need some other kinds of control than just privilege of access, perhaps by priority. During emergencies or peak traffic hours we could guarantee access for legitimate researchers, or for those who pay. And then there's the issue of greedy users -- they could be limited by the number of uses per day or the amount of data transferred per day.

How can tools be transferable across libraries?

Obviously, if we want to leverage the effort that other people are putting in, we want to be able to reuse their tools. The key here is a well-defined interface to the library data, because we're not going to get true transferability unless it's cheaper to remap an existing tool to the new library than it is to start from scratch and redevelop the tool.

We think it's particularly desirable to have tool reuse at the novice level: somebody has gone to one library and learned how to use perhaps the visualization or mapping tool and they can reuse the tool at another library. However, we don't think it's necessarily the right way to go for experts. An expert is presumably trying to get access to data as quickly and efficiently as possible, but performance tends to decrease with generality and portability. We felt there might be a loss of specialized functionality if a library relies entirely on generalized and portable tools.

Can the user run his or her own program on library data?

The answer comes back to the idea of providing the data in common formats, like HDF, gif, pdf, IEEE numeric formats. Then the user can run their own analysis package or tool using data from a wide variety of different libraries. Coupled to this is another problem: often the data is not meaningful unless you have access to the metadata as well, raising the question of how the user is able to get the metadata as well as the data.

What is implied by collaborative use of the library?

We considered two types of collaboration. First, users want to know if the data in the library changes, for example if the quality-control techniques have changed or if the data has been updated. Right now there is no nice mechanism for informing interest groups, or people who should know, when certain things happen. This implies the need for an ability to register your interest, to be on a mailing list that concerns certain corners of the library.

For teaching or discussion of research results, we need two or more people to be able to see the same thing on their screens. They can be talking on the telephone or video conferencing, but if they can't look at the same thing they can't really collaborate effectively. There is also a need to capture a script or history of a session -- if I tell you I found a great piece of data and this is how I did it, I don't want to have to write everything down by hand and dictate it.

What's needed to support collaborative use of a library?

There are several collaboration technologies available, Tango from Syracuse University, Netmeeting from Microsoft, and many others. The question is how to interface the collaboration tool with the tool that is generating the content and images.

A library should be a matchmaker for identifying possible collaborators, providing mechanisms for seeing who's using the library to do what. Obviously there are going to be some problems with privacy here. This is a library function that is not often recognized.

Users should benefit more from the experience of other users. You don't find that in digital libraries today; you don't find things that leave a cookie trail, and this is where great stuff is lost. Somebody might have a wonderful way to mine the information that is lost to the larger community. We compared it to the early days of news groups, when people related their experiences on the news group and it was extremely helpful to other people. Moderation is needed so there isn't too large a volume of information, but this is a way of creating new collaborations, which is very important.

Tango collaboration software: http://trurl.npac.syr.edu/tango/

What are the chief obstacles to library usability?

We came up with three answers. The first is the reproducibility problem. Why is it so hard to go back and repeat a search or navigation strategy? We need an easy way to be able to log those actions and examine them to see what it was that you did and store the sequences for future use by yourself or by some other users.

Second, the serendipity problem. In a physical library you can see the things on the shelf above and below what you're looking for, and thereby find new things. We need to provide that facility for associative searches, for browsing, in digital libraries.

Finally, the virtual librarian problem. There is nothing that can beat a human librarian for prompting the user for information that will aid the search. But in digital libraries the burden is all on the user to ask the right question.

7.2 What Users Do at the Library

Group Leader: Celeste Matarazzo <celestem@llnl.gov>

The participants of this breakout group found the premise of this session difficult to work with. The premise presented a very library-centric view. We are not interested in "going" to the library, even when that library is a virtual one. Instead we take a view of the world as an immanent (everywhere) information environment. The environment would be seamless and transparent, where we acquire information as we need it in the context of problem solving, and do not know when we have "entered" the library. Given this perspective we answered some of the questions posed to us.

What are the services that can be used by many scientific archives?

We discussed services that are common to many archives and that, in a perfect world, would be useful to have available. These services include security authentication, access control, encryption, annotation, navigation assistance, quality-of-service measures, collaborative environments, authoring/submission services, reference history, customization capabilities and an integrated data analysis service. Annotation services were a lengthy topic of discussion -- by this we mean the creation of metadata, both library metadata and user metadata, and the ability for users to create and edit this metadata. Regarding navigation services, our discussion was about how to get to places, and how to get there again.

Given that we did not distinguish between classes of users, we believed a customization service would be a great benefit. Customization services would decide how to make an appropriate presentation. The same data may have a K-12 representation, a researcher representation, and a manager representation. If the person is visually impaired, for example, you may want to use an audio response instead of a visual response. There's a whole lot of customization that could be useful. Each user may want their own kind of navigation of the archive, with translations and filtering to achieve this. It is not appropriate to presume that one particular interface will be useful to all; maybe there is a way to provide a user interface with levels of hierarchy using a customization service. To make the customization services useful you would want the ability to assess the users first.

It should be possible to measure quality of service from the library: some metrics might include the relevance and quality of the data returned as well as its cost. It is also important to talk about quality of service in terms of the availability of the network and how the system is capable of performing on a particular day.

We explored the implications of having collaboration environments; we thought collaboration services would be very helpful. It would be helpful for the user of the library to be able to collaborate with a domain expert, to guide the navigation of the library. Working in a collaborative environment would help the user get relevant information. The collaborative environment could be synchronous or asynchronous and would be analogous to the user's interaction with a reference librarian in the traditional library sense.

We discussed several other services such as data comparison capabilities, various kinds of authoring tools, and data analysis. Integrating the data analysis environments with the data source is believed to be important.

Lastly we discussed the importance of a reference history: the history, or parentage, of a piece of information. Metadata could be included that describes what information was derived from the raw data. You could find a list of things that were derived from the same raw data to see what kinds of conclusions were made. We also mentioned a kind of citation service, where knowing who refers to a work gives some measure of the quality of the information.

How can we be assured that the library is usable?

This discussion reduces to providing methods of determining value. For example, by looking at the metadata I can say I know this author, and therefore I want anything by that author. We could also try to quantify that everybody in the world seems to think this information is hot; if everybody else thinks it's good, maybe it would be good for me. Also, correlation information may be useful in determining relevance: people who access dataset A usually also access dataset B.

It is important to have an assessment of the quality of the information including citations referencing this work. The user may want only those records that have a complete metadata description. If the relevant metadata is not available, I don't want the data. People should have that kind of choice.

We discussed using metrics to help understand the impact of a request. It may be possible that a user can form a query that makes a request to every data server in the country, and thereby bring down the entire system! I need to know the impact of my request on the system and whether it is reasonable. Give me options: I may be willing to wait three days, and in another case I want the results in the next two minutes or I don't want it at all.

How can we maintain data archives once the library funding is terminated?

If we make good decisions and provide effective services during the funded effort, we can increase the probability that the library will survive. The architecture should separate data storage from data presentation to allow us to embrace technology evolution. Even though the library funding may end, we believe it is possible that the library's raw data would continue to grow; there may be a submission service, as we discussed earlier, that easily allows contributions to the library. We can encourage national recognition of the value of the data and its research by larger institutions such as national laboratories, universities, or internationally funded bodies, which could be supported in some way to maintain the collections of libraries and the query servers that may have been developed in separate projects.

What we agreed on for scientific digital libraries

Summarizing our discussion these are the kinds of things we agreed on. We believe that the organization and the structure of the data archive provides value; this is in contrast to the current situation with the web. We believe that interoperability between libraries is critical. Common services could be defined that add value to the data providers, data maintainers, and data consumers, with a programming interface to these services.

Library archives are long lasting and persistent, and we hope that the infrastructure being suggested today might last longer than previous software and hardware approaches. Quality assurance of the information really does impact the usefulness of the information.

Finally, the breadth of users of the system is a big challenge. We had no clear consensus on who or what are the users and what methods and operations they want. We don't want to say you have to be a certain kind of person with certain kinds of capabilities in order to use the library.

7.3 Implementation and Architecture

Group Leader: Julian Bunn <julian.bunn@cern.ch>

How is the library implemented in the case studies?

Some people were using object databases, some were using relational databases, and others were simply using large binary files with indexes. In the latter case, the index may itself be implemented with a database. There did not seem to be any formal reasons for these different approaches.

We tried to contrast the differences between using simple files and using relational or object databases. We thought that the solution was domain specific, very often a trade-off between efficiency and functionality. Sometimes a decision may be forced by inertia: you want to use one form but you don't want to convert all the data. We all agree that databases have a big advantage over flat files because of their very good querying functionality, the ease of sorting, of generating custom documents, of making global changes, and of connecting to a web server.

Object database practitioners are largely convinced of their superiority over relational databases for scientific data. There was quite an animated discussion: this is now a religious war! Some of us, myself included, are absolutely convinced that object databases work well for scientific data. But relational databases can also work for scientific data. It's really a matter of whether you're interested in using features of objects such as data abstraction and hiding, inheritance, and reusability. One might also argue that if the applications are written in an object-oriented language, then it is natural and proper that the data they deal with be stored in an object database.

How can we achieve scalability of scientific databases at the PetaByte level?

Federations of databases have been distributed across networks in commercial systems, and this is also the correct approach for scientific archives: if you have all the data in one room, you're going to have some severe access problems. There are obvious strategic steps toward scalability, such as moving (now) to 64-bit filesystems. Multithreading and SMP are becoming essential, and it must be possible to replicate the database geographically to allow multiple points of access. It's important to consider the whole system: the network, the hardware, the users' workstations, the types of query the system must support, and the metadata scheme. A lot of people in the group felt that even with a lot of money, an infinite amount of money, you probably couldn't make a workable PetaByte database today.

What are the scalability plans for the case-studies?

There was a general need for middleware, or brokers: something that goes between the user and the database which allows connection of new (heterogeneous) repositories and integrates the data by offering services. To allow these brokers to be effective and modular, we must, as always, carefully specify interfaces: how a user interface talks to a broker, and how the broker talks to a database or to another broker.

What are the special services required to support scientific data?

We identified many such services, which should all be modular, distributable, and well-documented: visualization, comparison, pattern matching, metadata processing, browsing, mining, agent-based learning, quality evaluation, and others.

What is the perceived and real impact of component technologies in scientific data archives?

It's very important that the user or the implementer knows how each component functions and how to insert it into the system. We drew an analogy between components in software and plug-and-play devices: a card that you can plug into a PC slot and that the system recognizes is how we want software components to behave. In the working group, despite not yet having a good understanding of how components work in scientific data archives (with the notable exception of JavaBeans), we were confident we needed them!

What is the impact of distributed object technologies such as CORBA, Java RMI, and distributed object databases?

We were unable to agree on a good answer to this question, although we see expertise building very rapidly. One of us noted that CORBA had destroyed more projects than it had saved. We were unanimous in being optimistic about distributed Java: most people felt that it was useful, some had had success with it, and everybody seems to think it's working. We weren't really sure about CORBA. The optimism about Java is because it's simpler and it works; CORBA is 10 years old and we have yet to see real successes with it.

How can we implement authentication, security, accounting, encryption and billing?

Our initial reaction to this was "no thank you, we don't want any of that, it's just a nuisance"! But when we thought about it more responsibly, we decided that if you do need these capabilities for various reasons, then you can use existing technologies such as SSL or PGP. When you use these you need existing components; you don't want to start writing your own. You want to be able to plug a component into your data archive which then provides the authentication methods and so on to manage user access.

What sort of private storage will be available at the library? How and when do we cache results?

We combined the question on private storage with another question about caching; for us it was one and the same thing. We used a ``real'' library as an analogy: public, departmental, or domain specific. There is a need to have the equivalent of a library desk with a pile of books on it, and the equivalent of a notepad on which you are making jottings, that you take out of the library.

Putting new material into the library is a special topic that really addresses how to make private caches available to others, thus making them public. Having generated some of your own data which is synthesized from library holdings, you may feel that you need to add your findings back into the library so that other people can benefit from your research. However, this information is personal, without any refereeing or peer review. The question then becomes how to manage the transition from private to public.

What are the most significant maintenance tasks in a scientific digital library?

We thought there should be little difference between the maintenance tasks for a scientific digital library and those for a conventional book library. However, none of us were librarians! Our concept of what a real librarian does includes ingesting new material, cataloguing, maintaining, and archiving it, and expiring the old. The digital library also needs software and hardware upgrades, reaction to feedback from library users, and analyses of the holdings so users can understand what kind of material is in the library. Another important task is determining the quality and provenance of library material.

How do we estimate and minimize the wait time for a user query?

At a conventional library, when one user requests 3 books and another requests 300 books, the librarian will respond differently to the two users. This is the kind of service that we need at the digital library. It is very important that the user is given continuous feedback when using the library. Users need to understand resource demands, and whether existing data is available that might speed up their queries. In addition, users need resource estimates before submitting queries, and continuously updated estimates while the queries execute. Trivial queries should take trivial amounts of time to process, although defining a "trivial" query rigorously is notoriously hard!

We discussed how to schedule and prioritize queries that take a lot of resources. For example, if a user submits a query that requires a very large amount of data (a TeraByte or more) to pass through a computing process, a different scheduling mechanism may have to be invoked. It may be most efficient to collect several such queries from different users and run them together later on; in that way the robot loads the tapes sequentially, and each user process is given the data it has requested in the order in which it comes off the tape. We are seeing a reversal of the conventional relationship between computer and storage: the storage system is becoming the manager and the compute process the slave.
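
The batching idea can be sketched as follows: pending requests are collected and then served in the order their data lies on tape rather than in arrival order. The request structure here is invented for the example.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Illustration of batching large queries and serving them in tape order.
    public class TapeOrderScheduler {
        record Request(String user, String dataset, long tapeOffset) {}

        // Sort pending requests by position on tape so the robot reads each tape
        // sequentially and every user gets data as it streams past the heads.
        static List<Request> schedule(List<Request> pending) {
            List<Request> batch = new ArrayList<>(pending);
            batch.sort(Comparator.comparingLong(Request::tapeOffset));
            return batch;
        }

        public static void main(String[] args) {
            List<Request> pending = List.of(
                new Request("alice", "run-442", 9_000_000L),
                new Request("bob",   "run-017",   150_000L),
                new Request("carol", "run-233", 4_750_000L));
            schedule(pending).forEach(r -> System.out.println(
                r.user() + " gets " + r.dataset() + " at offset " + r.tapeOffset()));
        }
    }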

How can user-written applications use scientific data collections?

This really wasn't very contentious. We felt there should again be components, with well-documented APIs. Sometimes you have to impose access restrictions: perhaps not all the data in the library should be exposed to these `user-processes'.

What is the role of agents in scientific data archives?

There should be an API that allows creation of autonomous agents that can search offline on behalf of users, working while they sleep. Agents evaluating the quality of each other's results would be nice, but this implies a level of autonomous-agent sophistication that is not yet available.
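
As an illustration only, such an agent API might look something like the following sketch; the interface and method names are hypothetical and the search itself is a stub:

    import java.util.List;
    import java.util.concurrent.*;

    // Hypothetical sketch of an offline-search agent API: the user registers an
    // agent with a query, the archive runs it in the background, and the user
    // collects whatever the agent found on the next visit.
    public class AgentSketch {

        // The contract an autonomous agent would implement.
        interface SearchAgent {
            String owner();                       // whose behalf the agent works on
            List<String> search(String query);    // performed while the user is away
        }

        public static void main(String[] args) throws Exception {
            SearchAgent agent = new SearchAgent() {
                public String owner() { return "roy"; }
                public List<String> search(String query) {
                    // A real agent would crawl catalogs here; we just echo the query.
                    return List.of("hit for: " + query);
                }
            };

            // The archive would run registered agents offline, e.g. overnight.
            ExecutorService pool = Executors.newSingleThreadExecutor();
            Future<List<String>> results = pool.submit(() -> agent.search("quasar spectra"));
            System.out.println(agent.owner() + " finds on return: " + results.get());
            pool.shutdown();
        }
    }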

We felt that unless there was tight control on the use of agents we could rapidly become swamped: a system with a lot of these agents buzzing around trying to find things out and all getting in one another's way. Who is responsible for killing rogue agents? How will we deal with malicious agents? How do we tell the difference between agents and humans when they are operating in a software system?

What is ``Information based computing''?

The Scientific Information Boom will occur as new large scientific data repositories come online in the next few years. The major challenge will be to integrate the repositories successfully, lest we become swamped in a morass of data. It is the creativity of the human being that must be allowed full sway when we design our systems. If we don't allow human beings to be as creative as they can be then we will have absolutely failed.

7.4 Metadata and Federation

Group Leader: Eric Van de Velde <evdv@library.caltech.edu>

How does metadata differ from data?

Metadata is structured data about data. Metadata is data. The two different terms (data and metadata) only describe a particular relationship between two data items: if the relationship exists between two data elements, one element (the metadata) describes some aspect of the other element (the data). We should focus on this relationship rather than on the elements connected by it.

Metadata is distinguished from data only by the context of its use. One person's data is another's metadata. For this reason, the term "meta-metadata" is not really defined, and it should be banished from our vocabulary.

How is metadata created?

Metadata defines the structure of the library and is, therefore, intrinsically connected to the function of the library. It is absolutely crucial to manage the creation and addition of metadata and to subject it to rigorous quality control. We cannot leave the creation of metadata to chance and accept just whatever the previous owner of the data left behind.

Library users may contribute to metadata, but only in a very controlled fashion. Some metadata may start out as one person's view of the library. If that structure is identified as worthwhile to others (by a process to be identified), this metadata could be added to the formal definition of the library.

Similarly, some metadata may be created and maintained automatically. Computers may summarize items by computational reduction (thumbnail images, header extraction); computers may provide descriptive elements such as data size, various access dates (e.g., last modified, last read), hardware descriptions, locations, etc. However, these automatically generated items must fit into a well-crafted metadata structure.
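
A small sketch of such automatic generation, using standard Java file-system calls; the element names in the record are illustrative, not drawn from any standard:

    import java.io.IOException;
    import java.nio.file.*;
    import java.nio.file.attribute.BasicFileAttributes;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of automatic metadata generation: descriptive elements such as data
    // size and access dates can be harvested directly from the file system and
    // slotted into a pre-defined metadata structure.
    public class AutoMetadata {

        // Harvest the machine-derivable elements for one data file.
        static Map<String, String> describe(Path dataFile) throws IOException {
            BasicFileAttributes a = Files.readAttributes(dataFile, BasicFileAttributes.class);
            Map<String, String> record = new LinkedHashMap<>();
            record.put("location", dataFile.toAbsolutePath().toString());
            record.put("size_bytes", Long.toString(a.size()));
            record.put("last_modified", a.lastModifiedTime().toString());
            record.put("last_read", a.lastAccessTime().toString());
            return record;
        }

        public static void main(String[] args) throws IOException {
            Path sample = Files.createTempFile("observation", ".fits");
            describe(sample).forEach((k, v) -> System.out.println(k + " = " + v));
            Files.delete(sample);
        }
    }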

What is a metadata standard?

A metadata standard is a controlled vocabulary for the description of related information resources. A metadata standard is developed by forming an authoritative consensus about the semantics, structure, and syntax of the metadata. An authoritative consensus can be obtained only by political means: it is important that we develop ways to grow consensus.
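
As a toy illustration of what a controlled vocabulary means in practice, a record type can be restricted so that only element names from an agreed list may appear; the names below are invented for the example and do not represent any particular standard:

    import java.util.*;

    // Sketch of a controlled vocabulary: a metadata record may only use element
    // names that the community has agreed on; anything else is rejected.
    public class ControlledVocabulary {

        // The agreed element names (illustrative, not any real standard).
        enum Element { TITLE, CREATOR, SUBJECT, DATE, FORMAT, SOURCE }

        private final Map<Element, String> record = new EnumMap<>(Element.class);

        // Adding a value is only possible for an element in the vocabulary,
        // so the type system enforces the standard's structure.
        public void put(Element element, String value) {
            record.put(element, value);
        }

        public static void main(String[] args) {
            ControlledVocabulary md = new ControlledVocabulary();
            md.put(Element.TITLE, "DPOSS plate scan 451");
            md.put(Element.FORMAT, "image/fits");
            System.out.println(md.record);
            // md.put("color", "blue");  // would not compile: not in the vocabulary
        }
    }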

Metadata and Federation.

We define federation as "autonomous decentralized data repositories linked by common metadata standards". In an analogy: the US constitution is the common standard by which US states are federated. The autonomous entities (the states) surrender some of their authority to the federation by agreeing to the common standard (the constitution).

Metadata standards and federation work together in synergy: each supports and encourages the other. We do not believe it is possible to form some council of luminaries to create and enforce metadata standards. (Unless the council is formed by Microsoft.) Instead, we believe that we should create an environment in which standards grow and evolve.

In our discussions, federation emerged as the driving force for the evolution of metadata standards. Federating digital libraries is attractive because the whole is almost always greater than the sum of the parts. Creators and users of digital libraries will continue to search for ways to increase the functionality of their data, and federating with other, related libraries is the obvious way to achieve this (if the hardware can support it!). When the creators of two libraries meet to discuss federation, they must address the issue of metadata standards. They will focus on the particular issues of their particular federation rather than on "the big picture", but they will be making a global contribution in the process: eventually the federation will look to expand, at which point renewed negotiations start.

We expect that this evolutionary process will produce many strains of standards, and the most successful ones will survive...

However, there are legal barriers to federation that may stunt this evolution. Intellectual-property rights must be addressed. Hopefully, these problems can be resolved in the negotiations that precede each federation. Unfortunately, we know that many federations will not even be contemplated because of the legal issues involved.

What is the role of metadata in assessing data quality?

None. Organizational authority is far more important than any abstract descriptive element. We know how to assess the quality of a book by its author, institution, publisher, etc. Similarly, we can assess the quality of data. Some questions that need to be asked: Where does the data come from? Is it published by an organization with a history of quality? In federated libraries, we may lose some of this information, because the source of component data may not always be apparent or may be difficult to obtain.

7.5 Other People's Software

Group Leader: Paul Messina <messina@cacr.caltech.edu>

What software technologies have resulted in effective user interfaces?

HTML is a real success, as are a number of object-oriented technologies and scripting languages. Widgets are a particularly good example of reusable code. All of these are very broadly based, used by the entire computing industry, not just by scientific nerds; they are used by millions of people for lots of things. They also have some features in common: there is a specific structure that was created by the initial developers, and it happened to work. Each of these technologies is used for different purposes, and each successfully encodes a single good idea, an idea whose time was ripe, an idea that is easy to grasp.

The use of standard tools with open source is prevalent; the freely available source is an important feature. Commercial software, on the other hand, has its own advantages, such as vertical integration, competitive sharpening of the product, and a help desk. Unfortunately, licensing agreements are often a major problem, concentrating on the old model of the standalone workstation. We might think instead of a web-enabled server with thin clients: users try the lightweight, applet version of the package for a few days, then license the full-strength version for a month, before making an outright purchase or lease of the package.

How do open-source and commercial software compare?

One of the people in the group pointed us to an article called ``The Cathedral and the Bazaar'' which seems useful. The bazaar represents the open-source community: the free software that you fix up here and there. Successful examples are the X collection, Linux, and the GNU project: high quality, extremely widely used, and free. The cathedral analogy expresses the vertically integrated, Microsoft approach, with its obvious advantages; for example, all the presentations of this morning's session are in Powerpoint! We are going to be firmly straddling the fence on a lot of these things, because there are clear examples of success, as well as failures, on both sides.

What software standards have been used for visualization?

The X graphics library has been used in XtalView, a crystallography package. It is very portable across Unix machines because it uses lowest-common-denominator functionality, though it was obnoxious to maintain and develop. On the other hand, the users liked the interface; now the X backend is being phased out and replaced by OpenGL. The user interface lasts much longer than the computing engine: that is where our design skills should go.

Skycat has used the freely available Tcl/Tk, and the LIGO project, which we heard about earlier, will probably use it as a standard as well. It is an interpretive language, very general, and excellent for rapid prototyping. It is freely available as source code for Unix systems, and also available for a number of other platforms. It is used in, for example, the commercial chemistry package Quanta, and it can be interfaced with the OpenGL API.

On the commercial side, AVS has been around for a while and is available on a number of platforms. LIGO considered it but licensing difficulties, rapid changes in the foundation API, and difficulty of use led to it being abandoned.

Another product that has been used for building GUIs is XVT. It has been around for a while and is portable across many platforms. Perhaps it will be replaced by Java and JavaBeans; on the other hand, the XVT people are happy and are building a web-enabled version.

What platforms do users want for visualization?

We digress briefly into what people are using for visualization these days. Windows NT and Linux are so numerous that they are also being used by scientists and university people. Traditional Unix workstations are still holding their ground, especially for high-end visualization. Perhaps the web browser can be considered a first-level interface, which certainly reduces some of the support issues, but of course the client code will still be platform-dependent.

What data-exchange formats have been successful?

Successful exchange of text data has been achieved with ASCII, Postscript and its child PDF. The requirement that the text be, in some sense, machine-readable, generated SGML, which we might think of as a very large superset of HTML. People are expecting a lot from XML, a compromise between HTML and SGML that removes much of the arbitrariness of the former, without adding the complexity of the latter.

For exchange of numerical data, XDR is the only choice; it is used by many vendors and a lot of medical training packages. We know of HDF and many similar kinds of `self-defining' file formats, but we really didn't have enough people in the group who could speak authoritatively about them. In graphics, GIF and JPEG are very widely used for non-lossy and lossy compressed images, respectively. The PNG standard is beginning to replace the GIF format in web applications, partly because it is a more flexible and well-focused format, partly because GIF is proprietary.
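
As a small illustration of machine-independent numerical exchange in the spirit of XDR (this is a sketch, not a full XDR implementation), Java's DataOutputStream writes integers and doubles in a fixed big-endian, IEEE form that any platform can read back regardless of its native byte order:

    import java.io.*;

    // Sketch of portable numerical exchange: values are written in a fixed
    // big-endian, IEEE-754 form so that any platform can read them back.
    public class PortableNumbers {

        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream wire = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(wire)) {
                out.writeInt(42);               // 32-bit big-endian integer
                out.writeDouble(2.99792458e8);  // 64-bit IEEE double
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(wire.toByteArray()))) {
                System.out.println(in.readInt() + " " + in.readDouble());
            }
        }
    }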

What software has been used for databases?

The Caltech seismology lab has decided to use Oracle as a replacement for a home-grown solution. The Caltech astronomy department is moving from Sybase to Objectivity. Microsoft Access is easy for small-scale projects. The patent database project at SDSC used DB2 Parallel Edition. Our small group simply picked projects we knew about; we have not done a survey.

Observations on database management systems

While most scientific data archives have embraced the idea that a relational or object database is essential for storing the metadata (the catalog information), there is real controversy about how to store the large binary objects that represent the data objects themselves. One point of view maintains that investing time, effort, and money in a DBMS implies that we should use it for managing all the data in a unified way, with no awkward questions about whether some object should be considered data or metadata. This is appropriate when there are many, relatively static, not-too-large data objects. The other point of view is that cataloguing functions are very different from the specialized operations on the large data objects, and when we write the code that does complex processing and mining, we want to work directly with the data, not through the DBMS API. Splitting data from metadata in this way reduces dependence on a particular DBMS product, making it much easier to port the archive to a different software platform, since only the metadata needs to be moved.
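
A minimal sketch of this second point of view, with a hypothetical JDBC connection string and a hypothetical catalog table called dataset: the DBMS answers the catalog query, but the bulk data is then read directly from disk rather than through the DBMS API:

    import java.nio.file.*;
    import java.sql.*;

    // Sketch of the split design: the DBMS holds only catalog metadata, including
    // a pointer (file path) to the large binary object, which processing code
    // then reads directly from disk.
    public class SplitCatalog {

        public static void main(String[] args) throws Exception {
            // Hypothetical catalog table: dataset(id, instrument, obs_date, path).
            // The JDBC URL is a placeholder for whatever driver the archive uses.
            try (Connection db = DriverManager.getConnection("jdbc:yourdriver:catalog");
                 PreparedStatement find = db.prepareStatement(
                         "SELECT path FROM dataset WHERE instrument = ? AND obs_date = ?")) {

                find.setString(1, "DPOSS");
                find.setString(2, "1998-03-26");

                try (ResultSet rs = find.executeQuery()) {
                    while (rs.next()) {
                        // The heavy lifting bypasses the DBMS entirely.
                        byte[] rawImage = Files.readAllBytes(Path.of(rs.getString("path")));
                        System.out.println("Read " + rawImage.length + " bytes for processing");
                    }
                }
            }
        }
    }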

Scientists are understandably nervous about relying on a software product remaining viable for the 10-20 years over which their data archive will be useful. One strategy, employed by the European Southern Observatory and others, is to provide a broker interface between the database and a web browser. In this way the user interface can be designed independently of the particular database used to implement it, and the backend database can be `swapped out' for another, if necessary, without disturbing the user's trust in the stability of the system. Another benefit is that the broker can access heterogeneous databases to satisfy a query.
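
A skeletal illustration of the broker idea, with interface and class names of our own invention: the client (for example a web front end) sees only a broker interface, and the backends behind it can be replaced, or several consulted at once, without the client changing:

    import java.util.List;

    // Sketch of the broker pattern: clients talk only to the ArchiveBroker
    // interface; which database actually answers the query can change, or
    // several can be federated, without the client noticing.
    public class BrokerSketch {

        interface ArchiveBroker {
            List<String> query(String what);
        }

        // One possible backend; a real relational or object database would go here.
        static class SybaseBackend implements ArchiveBroker {
            public List<String> query(String what) {
                return List.of("sybase result for " + what);  // canned result for the sketch
            }
        }

        // A broker that fans a query out to several heterogeneous backends.
        static class FederatingBroker implements ArchiveBroker {
            private final List<ArchiveBroker> backends;
            FederatingBroker(List<ArchiveBroker> backends) { this.backends = backends; }
            public List<String> query(String what) {
                return backends.stream().flatMap(b -> b.query(what).stream()).toList();
            }
        }

        public static void main(String[] args) {
            ArchiveBroker broker = new FederatingBroker(List.of(new SybaseBackend()));
            System.out.println(broker.query("plate 451"));
        }
    }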

Another portability question arises between relational and object databases. Relational schemas are simpler than the rich structures possible with a full object model, so porting between relational products entails writing the tables out in some reasonable form, perhaps ASCII, and reading them into the other product. With an object database, however, much of the implementation work is writing code that interfaces to the proprietary API, and porting to another ODBMS will only be easy if the code is designed for portability right from the start.

JAVA or not?

The Java language seems very promising for many things, but of course new technologies always look very promising. The much-touted portability is not perfect, especially with the AWT graphics implementations and particularly on Mac platforms, but we note that a Java program is generally much easier to port than programs written in other languages, the strong typing being a plus. Java programs depend on how each vendor has implemented the virtual machine, which can cause incompatibilities with other programs. We note in passing that the reported slowness of Java applications comes from its virtual-machine model, and that there is no reason why Java cannot run as fast as C++ given a platform-specific compiler rather than a byte-code compiler. We note that Sun's grip on the language standard is being loosened by Microsoft, and now HP has introduced its own version, so there is a potential Tower of Babel on the horizon.

Many of us at this workshop are interested in Java RMI (Remote Method Invocation) to implement distributed objects. If different machines are running Java code with the same kinds of objects, it is relatively simple to bind the machines together into a distributed metacomputer using the RMI technology. The other way to do this uses CORBA, but this approach has extra complexity in exchange for the ability to work with heterogeneous languages as well as heterogeneous machines.
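
For readers unfamiliar with RMI, the following minimal sketch (names are hypothetical, and client and server are collapsed into one JVM for brevity) shows the flavor of defining a remote object and calling it as if it were local:

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.registry.Registry;
    import java.rmi.server.UnicastRemoteObject;
    import java.util.List;

    // Sketch of distributed objects with Java RMI: a client on one machine calls
    // query() on a remote archive through a shared interface.
    public class RmiSketch {

        // The shared contract; every remote method must declare RemoteException.
        public interface RemoteArchive extends Remote {
            List<String> query(String what) throws RemoteException;
        }

        // Server-side implementation, exported so other JVMs can call it.
        public static class ArchiveServer extends UnicastRemoteObject implements RemoteArchive {
            protected ArchiveServer() throws RemoteException { super(); }
            public List<String> query(String what) { return List.of("result for " + what); }
        }

        public static void main(String[] args) throws Exception {
            Registry registry = LocateRegistry.createRegistry(1099);
            ArchiveServer server = new ArchiveServer();
            registry.rebind("archive", server);

            // A client (here in the same JVM for brevity) looks the service up by name.
            RemoteArchive archive = (RemoteArchive) registry.lookup("archive");
            System.out.println(archive.query("supernova candidates"));

            // Tidy up so the example JVM can exit.
            UnicastRemoteObject.unexportObject(server, true);
            UnicastRemoteObject.unexportObject(registry, true);
        }
    }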

Should management recommend or require that clients use a particular package?

Yes, if the package is free or inexpensive, if it is easy to install, if it runs on most operating systems and hardware, and if long-term maintenance is feasible and planned. Specification of standard client software has the advantage that it avoids re-implementing many things: graphics from X-windows, for example, or a file-chooser or graphing component written to the JavaBeans interface. Commercial products such as Matlab or Iris Explorer provide a great deal of functionality, but at the expense of paying for each client and the consequent complex negotiations with the vendor to establish the licensing agreement. In many cases the required software can provide a high level of quality, for example scripting languages like Perl or Tcl, or numerical algorithms as with the BLAS libraries.

When should software be acquired and when should it be created?

When building your own software, take care to use the right tools and toolkits. Gluing together existing packages is much easier than writing new code, but only if the main functions of the packages match what we want them to do: for example, we would not use Excel for image display, even though it can do this, because its focus is elsewhere (spreadsheets).

Open-source software, for example Perl and the GNU tools, allows you to control your own destiny.

Acquire, learn and use packages that will be around for a long time, and use them in a way that provides an escape hatch. Develop a migration plan from the beginning, insulating the bulk of your code from the idiosyncrasies of the API.

How can the archive be protected against changes or disappearance of software packages?

Pragmatically, we must assume that any software external to the project may become unstable or unavailable, so we must take this into account from the beginning, designing the overall system so that swapping subsystems is feasible. The broker paradigm is a way to provide this kind of insulation, as we have discussed.

How can we provide a gradual learning path for using advanced software?

The system should be designed so that the effort to learn is proportional to the complexity of the task -- simple things should be simple to achieve. If it is not critical that the system work at peak performance, it becomes much easier to specify how a task is to be done. An analogy is that when using computers with complex architecture, we can ignore the architecture if performance is not an issue.

8. Survey of Participants

In this Appendix we summarize the results from a survey, conducted before the ISDA workshop. The workshop registrants were asked to fill in a form about themselves, with three questions. The 37 responses contained text and checkboxes; first we summarize the question about what they hope to gain from the workshop. The other two questions, about their professional interests and familiarity with various technologies, are summarized with histograms.

8.1 Question: ``What are you hoping to get from this workshop?''

The following is a compilation from the responses.

``Collaboration with other groups. Meeting those who know how to work with digital libraries. Other similar efforts that I may be able to leverage directly for my project. Solutions for managing a Scientific Digital Library that can be transferred to my Library.''

``What are the hard problems in this area, and what are the proposed solutions. I'd like to hear candid reviews of some of the technologies people are using/have used and when they have succeeded, and particularly when they have failed.''

``How to make scientific data readily accessible to our customers, user-oriented access to large-scale data repositories and digital libraries, how to interface visualization and analysis tools with data retrieval systems.''

``Interaction between databases, networks and users, distributed approaches to large digital archives. The file systems 'under' these archives will need to be adjusted to support fast access, and the types of data in the archives, as well as the typical accesses will need to be taken into account as the file systems are designed. How to work with very large data bases. How large and complex databases are managed and interfaced.''

``Information on interdisciplinary distributed databases. How the information found in one database can be completed with queries to other databases. Problems and solutions related to federating distributed databases, how to combine data from different database sources. Learning more about federated archives and digital library technology and its impact on scientific data visualization. Research directions for the merging of scientific archives and digital libraries. Applications of Web-linked Databases.''

``Standard or common interfaces, a sense of emerging standards in Scientific Digital Libraries.''

``Security, ownership and privacy issues in digital libraries.''

``Future evolution of digital libraries. Insight into new technology for data access.''

``Understanding of Data Mining. Fostering ideas on data mining strategies.''

``How to support collaborative access and use of data for knowledge discovery and education.''

``Initiate a working group for the specification of functionality, user interfaces, execution and storage environments.''

``My particular vision is for a student or researcher to "go" to the library, and ask a scientific question. The resulting search would return both the literature related to the topic and the scientific data and analysis. Both the full text of the literature and the digital data would be immediately available on-line. Achieving this grand vision requires cross database searching, and mechanisms for accessing extremely complex data objects.''

``The SkyView virtual telescope provides uniform access to astronomy data spanning radio though gamma-ray wavelengths and our high-energy astronomy archive provides astronomers access to data from more than a dozen present and past astronomy missions. Most recently our Astrobrowse initiative provides finder services which allow astronomers to rapidly discover information spread over the WWW. The unifying theme of all of these projects is an attempt to provide simple and uniform interfaces to heterogeneous data. My hopes for this conference are to be able to see what others are doing in this area and to share some of the techniques and systems that we have found to be popular. Ideally this could lead to agreements on standard interface protocols for digital archive libraries, though more realistically I expect to find specific groups with whom we can share insights and software.''

``As a specialist in the development and deployment of metadata standards and systems, I hope to communicate to this workshop some of the process and results of a metadata standard that may play an important role in improving the description and discovery of electronic resources of all types. The Scientific Data communities have special problems and requirements for resource description that are unlike those of many other communities, but there is still a common thread for description semantics that can be identified and used to good effect. Conventions for extending these common semantics in domain specific ways are under development. Finally, wide adoption of common syntactic and structural conventions for exchanging metadata is of great importance for promoting interoperability among applications and disciplines.''

8.2 Question: ``How would you describe your professional interests?''

Check all those that apply.

8.3 Question: ``What is your familiarity with the following technologies?''

The participant was asked to provide a number from 1 to 5, with the meanings of the numbers indicated by the key at the right. The questions are sorted by the average response, so that most respondents are most familiar with Object-Oriented Languages, least familiar with Clustering Algorithms.

9. Survey of Archives

9.1 Textual responses

The workshop registrants were asked to fill in a form about one or more scientific data archives with which they were familiar. The 19 responses contained text and checkboxes; first we summarize each project through the textual responses, and in the next section, the multiple-choice component is summarized through histograms. The name associated with each archive is simply a point of contact, and is not meant to be a listing of collaborators.

Aladin

Francois Bonnarel <bonnarel@astro.u-strasbg.fr>

The archive contains astronomical plate scans used as reference images, in FITS, HCOMPRESS and JPEG formats. The interface is through X, HTTP, or Java. Images can be dynamically recentered, compressed, and uncompressed.

COMPASS

Gretchen Greene <greene@stsci.edu>

The name is an acronym for ``Catalogs of Objects and Measured Parameters from All-Sky Surveys''; the archive contains astronomical data from image processing and object recognition applied to digitized photographic plates taken with the Palomar and UK Schmidt telescopes. A front-end software pipeline produces a series of flat files, binary and text, that are used to populate the archive. The archive uses the commercial database Objectivity/DB. The data is structured into C++ classes and stored hierarchically within the Objectivity file system. The front-end pipeline runs on DEC Alpha VMS. The archive is on a Windows NT networked system, with DEC RAID plus HSM DLT-jukebox storage. A condensed version of the archive will be exported to a separate database and interfaced to the ESO SkyCat. The archive currently holds 6 TByte of raw data (1 TByte processed), with 1-2 TByte per year of processed data expected in the future. We expect hundreds of users two years in the future.

DAO Data Archive

Ken Ekers <ekers@dao.gsfc.nasa.gov>

The DAO Data Archive contains assimilated and analyzed global earth science data. Currently the archive contains data in the "phoenix" format, which was developed by the DAO long ago for ease in using our products. By the AM-1 launch in June, the data will be archived in HDF-EOS; EOS has levied this requirement on all instrument teams associated with the AM-1 satellite. Currently the archive is 300 GByte, expanding by several TByte per year.

Digital Puglia Radar Archive

Giovanni Aloisio <aloisio@sara.unile.it>

This library will contain Synthetic Aperture Radar data covering the Puglia region of SE Italy, as both unprocessed, raw instrument data, and as processed, multichannel images. Metadata includes thumbnail images and georeferencing metadata. The archive will allow authenticated users to run SAR processing, interferometric SAR, classification, and machine intelligence algorithms, using geographically distributed data and computing resources.

Digital Sky

Roy Williams <roy@cacr.caltech.edu>

Digital Sky will provide access to catalogs and images of stellar sources from multiple, independent sky surveys at different wavelengths, together with the tools and data for multi-wavelength astronomical studies. Initial efforts will focus on federating the DPOSS (Digital Palomar Observatory Sky Survey), 2MASS (2 Micron All-Sky Survey), and two radio surveys.

GENESIS Simulator-based Neural Database

Jenny Forss <jenny@bbb.caltech.edu>

The library contains models of neurons, along with associated background data such as notes, references, experimental data, etc. The data will be stored in an object-oriented database (most likely ObjectStore). We are currently developing our own data models for storing neuroscience data, and are also working with several collaborators to access data in a distributed fashion. The types of data that might be stored in this library are enormously varied: anything from simple text (a researcher's private notes) to GIF images, movies, 3D representations of neuronal morphologies, and simulation data. It is hard for me to say anything specific about the formats for these data at this point, but in general we are making efforts to come up with data models that are not too GENESIS-specific. When it comes to storing images, movies, etc., we hope to take advantage of support provided by the DBMS that we choose.

The GENESIS simulator would receive data from the user and the library, and create simulation results that would be returned to the user and also stored in the library. Some data will be completely public, while some will be restricted to a small group of collaborators.

Geolib

Mark Simons <simons@gps.caltech.edu>

This archive and processing facility will contain geodetic information used to measure surface deformation associated with earthquakes, volcanoes, and fluid processes in the crust. These data streams include Interferometric Synthetic Aperture Radar (InSAR), real-time Global Positioning System measurements, and real-time seismological data. There is both a scientific and an applied use for this data; the latter is related to earthquake/volcanic/subsidence issues that can affect urban planning and disaster relief agencies. There are many potential derivative data types, including processed deformation maps, decorrelation images, and topographic maps.

GIOD prototype LHC Object Database

Julian Bunn <julian.bunn@cern.ch>

The ``Globally Interconnected Object Databases'' project contains simulated event data from the Large Hadron Collider, which is under construction at CERN, Switzerland. Object data allows complex relationships between the entities used in High Energy Physics research, for example, tracks, clusters, particles. The data are made persistent in an ODBMS implemented using Objectivity.

LIGO data archive

Roy Williams <roy@cacr.caltech.edu>

The archive will contain the output of three gravitational wave observatories, mainly in the IGWD-Frame format, with some data in other TBD formats. Some users are interested in detecting transient or continuous gravity-wave signals in deep noise, others are interested in using the archive for instrument diagnosis. Besides the raw and calibrated data stream from the instruments, we expect to store many derivatives and diagnostic data, such as heterodyned time signals, images representing instrument performance, average power spectra, and candidate events mined from the time series.

MDTD

Lennart Johnsson <johnsson@cs.uh.edu>

The archive consists of simulated molecular dynamics trajectories and analysis data for DNA molecules. From the trajectories, thermodynamical, mechanical, and geometric properties are computed, such as helical parameters, backbone dihedral angles, and deviations from canonical structure.

National High-performance Software Exchange

Shirley Browne <browne@cs.utk.edu>

The NHSE contains software, documents, and performance data of interest to the high performance computing community. All data are cataloged using the HTML binding of the Basic Interoperability Data Model (BIDM), an IEEE standard for software catalog records. An extension of the BIDM has been created to catalog performance data. The data themselves are in various formats, depending on the source.

NCSA Astronomy Digital Image Library

Ray Plante <rplante@ncsa.uiuc.edu>

The purpose of the Astronomy Digital Image Library (ADIL) is to collect research-ready images and make them available to the astronomical community and the general public. The library, at http://imagelib.ncsa.uiuc.edu/imagelib, contains fully processed astronomical images in FITS format, currently 19 GByte total. Authors can contribute images of the basic (calibrated) measurements as well as images derived from the basic measurements. For example, a contribution might include the basic data in the form of an image cube of spectral line emission; an example of a derived image might be a velocity image: an image of the Doppler velocity of the emission as a function of position.

Oregon Coalition of Interdisciplinary Databases

Cherri Pancake <pancake@cs.orst.edu>

This is a collection of interrelated databases containing information related to environmental inventories, systematics, and related digital and printed information across a wide variety of disciplines, including Earth Science, Biology, Ecology, Forestry, and Agriculture. OCID's role is to provide software gateways and interfaces that allow each database owner to retain data in the formats of his or her choice, while still enabling interoperability and access via common Web interfaces.

Planetary Data System

J. Steven Hughes <J.Steven.Hughes@jpl.nasa.gov>

The Planetary Data System (PDS) archives and distributes digital data from past and present NASA planetary missions, astronomical observations, and laboratory measurements and is sponsored by NASA's Office of Space Science to ensure the long-term usability of data, to stimulate research, to facilitate data access, and to support correlative analysis.

PDS has developed a standards architecture for the creation of science data archives. This architecture includes an archive format and the concept of product labeling that describes how a user is to access the data as well as how to use the data for meaningful scientific research. The PDS accepts and archives data in a variety of representations but strongly recommends those that are well understood, well supported, and in common use.

PDS supports a community that contains experts in a variety of disciplines including Geophysics, Geoscience, Astronomy, Atmospheres, Radio Science and Imaging. The Planetary Science community is the primary customer of the PDS. The educational community and the general public are welcome to access and use the data.

The current size is 5 TByte, with expected increase of 1 TByte per year. Much of the data is used to produce derived products which are often ingested back into the archive.

TOPBASE

Claudio Mendoza <claudio@pion.ivic.ve>
Francois Ochsenbein <francois@simbad.u-strasbg.fr>
Claude Zeippen <Claude.Zeippen@obspm.fr>

The archive contains massive sets of accurate radiative atomic data, in the context of the Opacity Project. Opacities are derived from quantum mechanics, and are used for star models. Most of the astronomically interesting parameters are computed and interpolated within the Topbase system rather than simply retrieved from file. A few hundred people from a few specialized disciplines are using it, for stellar evolution models.

VIRGO Data Archive

B. Mours <mours@lapp.in2p3.fr>

The contents are data produced by the online processing system of the VIRGO gravitational wave interferometer, expected to reach 50 TByte per year at full production. The IGWD-Frame format is used for efficiency and portability. The user interface is through TCP/IP-based messages: users post requests and the data are sent in binary. Derivative data include calibrated data, results of search algorithms, and data quality information.

VizieR

Francois Ochsenbein <francois@simbad.u-strasbg.fr>

This is the largest collection of catalogues of astronomical interest, organized in a database, at http://vizier.u-strasbg.fr (with clones in the near future). All ~1600 astronomical catalogues (~4000 tables) can be queried through a unified interface: originally ASCII or FITS tables with standardized documentation, they are pipelined into a relational database. The current data volume is ~10 GByte, and we expect ~1 GByte per year.

VLT Science Archive Facility

Miguel Albrecht <malbrech@eso.org>

The facility contains raw, engineering and operations data stream from the ESO Very Large Telescope, including FITS images and text documents in many formats. The archive is currently 3 Tbyte, with an expected growth of 40 Tbyte per year.

WebWisdom

Geoffrey Fox <gcf@npac.syr.edu>

The archive is curricula material: the main contents are "Educational Objects" which are used to generate Web Pages on the fly. It contains text and images and meta-information to define formatting, and it contains metadata to define bibliographic context of information. It is now 6 Gbytes, expanding at 2 Gbytes per year.

9.2 Histogrammed responses

This section uses histograms to summarize the responses to several questions about the archives.



10. Workshop Participants

Albrecht, Miguel European Southern Observatory

Aloisio, Giovanni Universita di Lecce, Italy

Aydt, Ruth University of Illinois

Blackburn, Kent Caltech/LIGO

Browne, Shirley University of Tennessee

Brunner, Robert Caltech/Astronomy

Bunn, Julian CERN

Cherniavsky, John National Science Foundation

Ekers, Ken NASA/GSFC/DAO

Elson, Lee Jet Propulsion Laboratory

Feig, Michael University of Houston

Finn, Lee Samuel Northwestern University

Forss, Jenny Caltech/Biology

Fox, Geoffrey Syracuse University

Frew, James UC Santa Barbara

Greene, Gretchen Space Telescope Science Institute

Helly, John San Diego Supercomputer Center

Hertzberger, Bob University of Amsterdam, Netherlands

Hughes, Steven Jet Propulsion Laboratory

Johnson, Anngienetta NASA

Johnsson, Lennart University of Houston

Karin, Sid UC San Diego

Koonin, Steven Caltech/Physics

Marciano, Richard San Diego Supercomputer Center

Matarazzo, Celeste Lawrence Livermore National Laboratory

McGlynn, Thomas NASA/Goddard

McGrath, Robert National Center for Supercomputing Applications

McMahon, Sue Jet Propulsion Laboratory

Messina, Paul Caltech/CACR

Mours, Benoit LAPP Annecy, France

Nichols, David Jet Propulsion Laboratory

Ochsenbein, Francois Observatoire Astronomique de Strasbourg, France

Pancake, Cherri Oregon State University

Plante, Raymond National Center for Supercomputing Applications

Pool, Jim Caltech/CACR

Prince, Tom Caltech/Astronomy

Quinn, Peter European Southern Observatory

Reynales, Tad Oregon State University

Salmon, John Caltech/CACR

Siegel, Herb Jet Propulsion Laboratory

Simons, Mark Caltech/Geology

Spaulding, Omar NASA Office of Earth Science

Stein, Chris Caltech CACR

Suresh, Ramachandran NASA/GSFC/RSTX

Ten Eyck, Lynn San Diego Supercomputer Center

Tierney, Brian LBNL/NERSC

Van de Velde, Eric Caltech/Libraries

Verkindt, Didier LAPP Annecy, France

Weibel, Stuart OCLC Online Computer Library Center

Weiss, Rainer MIT

Williams, Roy Caltech/CACR

Yeager, Nancy National Center for Supercomputing Applications

Zemankova, Maria National Science Foundation