What: A Structured Approach to Semi-Structured Information
Who: Professor Gary Perlman, The Ohio State University, Columbus, Ohio, USA
When: 11:00 am, Tuesday, August 13, 1991
Where: Institute of Systems Science, National University of Singapore

Semi-structured information is not as well defined as database records but is more structured than full text. Semi-structured records are similar to forms, attribute lists, and frames. They contain a series of fields with arbitrary names and (possibly typed) values, in which the order of fields is arbitrary and often irregular, except that multiple instances of the same field are ordered. Examples of semi-structured information include bibliographic records, electronic mail messages and news articles, survey questionnaires, schedules of events, and personal databases such as address books, household inventories, and grade rosters.

Semi-structured records can be extended by allowing field values to point to other records; textual identifiers of other records can effectively implement hierarchical and hypertext network structures. For example, bibliographic records can point to their references, news articles can point to the articles they discuss, and hierarchical structures can be created by having superordinates point to their subordinates.

Given the diversity of semi-structured information and its apparent suitability for many information sharing tasks, it is not surprising that there is much software for managing it. What is surprising is that the software used for one domain (e.g., mail) is usually not used for another domain (e.g., bibliographies), even if the domains are very similar (e.g., many people use different programs for mail and news). There are some notable exceptions (e.g., emacs mail/news readers, and the MH mail handler, which uses UNIX file manipulation programs to manipulate mail messages).
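The record model described above (arbitrarily named fields, repeated fields kept in order, and field values that name other records) can be illustrated with a minimal Python sketch. This is not the SST's own representation, which the abstract does not describe; the names `Record`, `values`, and `follow`, and the field names `id`, `author`, `title`, and `cites`, are assumptions made only for illustration.

```python
from typing import Dict, List, Tuple

# A record is an ordered list of (field name, value) pairs, so repeated
# fields keep their relative order while field names stay arbitrary.
Record = List[Tuple[str, str]]


def values(record: Record, name: str) -> List[str]:
    """Return every value of the named field, in the order it appeared."""
    return [value for field, value in record if field == name]


def follow(record: Record, name: str, index: Dict[str, Record]) -> List[Record]:
    """Treat the values of a field as textual identifiers of other records."""
    return [index[key] for key in values(record, name) if key in index]


# Hypothetical bibliographic records; note that the field order differs
# between the two records and that "author" repeats in the first one.
paper_a: Record = [("id", "A"), ("author", "First Author"),
                   ("author", "Second Author"), ("title", "An Example Paper"),
                   ("cites", "B")]
paper_b: Record = [("title", "A Cited Paper"), ("id", "B"),
                   ("author", "Third Author")]

# Index the records by their (assumed) "id" field so pointers can be resolved.
index = {values(r, "id")[0]: r for r in (paper_a, paper_b)}
```

Here `values(paper_a, "author")` returns both authors in their original order, and `follow(paper_a, "cites", index)` returns the cited record, which is how a reference list or a hierarchy could be traversed under these assumptions.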
One reason for the use of different software tools is that operations in one domain may not make sense in another domain, at least not at first glance. Another reason is the difference in storage formats; each domain has its own peculiar markup for the same basic structure, making software reuse difficult.

The SST (Semi-Structured Toolkit) attempts to provide an integrated environment for semi-structured information. At the heart of SST is a table-driven parser/generator that reads and writes a wide variety of semi-structured record (SSR) formats, representing them internally in a generic structure for manipulation. Because SSRs have a simple structure, dynamic input parsing and output generation can be done efficiently. Common manipulations on semi-structured records include matching, sorting, editing, selecting, formatting, and viewing.

By creating a cross-tabulation of operations by application domains, we have found many holes where an operation supplied in one application program is not supplied in another, even though the missing operation would be useful. For example, the Berkeley Mail program does not allow search operations (although many other mail programs do), and it does not let users reorder messages according to key fields such as date, sender, and subject. It may be the exception rather than the rule that functionality is domain specific; the number of functions in a domain-specific application can often be doubled and still include only intuitive functions. For example, the generation of tables of contents (based on a hierarchy of fields that are displayed only when their values change), common in bibliographic applications, is readily applied to sorted mail and news.

The SST is implemented in ANSI C and is running in various forms on UNIX (command line and X), Macintosh, and DOS.
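The two ideas above, a table-driven parser/generator and a table of contents that shows a field only when its value changes, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SST implementation (which is in ANSI C and is not shown in this abstract). The format table `FORMATS`, its keys, and the functions `parse`, `generate`, `first`, and `contents` are hypothetical names; the two formats are simplified stand-ins loosely modelled on mail headers and refer-style bibliographic entries.

```python
from typing import Dict, List, Optional, Tuple

Record = List[Tuple[str, str]]  # generic internal form: ordered (name, value) pairs

# Hypothetical format table: each entry describes only the markup of a format,
# so one parser and one generator serve every format listed here.
FORMATS: Dict[str, Dict[str, str]] = {
    "mail":  {"prefix": "",  "separator": ": "},  # e.g. "Subject: lunch"
    "refer": {"prefix": "%", "separator": " "},   # e.g. "%Subject lunch"
}


def parse(text: str, fmt: str) -> List[Record]:
    """Read blank-line-separated records into the generic structure."""
    spec = FORMATS[fmt]
    records: List[Record] = []
    current: Record = []
    for line in text.splitlines():
        if not line.strip():            # a blank line ends the current record
            if current:
                records.append(current)
                current = []
            continue
        body = line[len(spec["prefix"]):]
        name, _, value = body.partition(spec["separator"])
        current.append((name, value))
    if current:
        records.append(current)
    return records


def generate(records: List[Record], fmt: str) -> str:
    """Write records from the generic structure in the markup of fmt."""
    spec = FORMATS[fmt]
    return "\n\n".join(
        "\n".join(spec["prefix"] + name + spec["separator"] + value
                  for name, value in record)
        for record in records)


def first(record: Record, name: str) -> str:
    """Return the first value of a field, or an empty string if absent."""
    return next((v for f, v in record if f == name), "")


def contents(records: List[Record], levels: List[str]) -> List[str]:
    """Sort by the level fields and list each level only when its value changes."""
    lines: List[str] = []
    previous: List[Optional[str]] = [None] * len(levels)
    for record in sorted(records, key=lambda r: [first(r, n) for n in levels]):
        changed = False
        for depth, name in enumerate(levels):
            value = first(record, name)
            # A change at an outer level forces the inner levels to be shown again.
            if changed or value != previous[depth]:
                lines.append("  " * depth + value)
                changed = True
            previous[depth] = value
    return lines


MAIL_TEXT = """From: alice
Subject: lunch

From: bob
Subject: agenda

From: alice
Subject: budget
"""

messages = parse(MAIL_TEXT, "mail")
toc = contents(messages, ["From", "Subject"])
```

With these assumptions, `toc` lists each sender once, followed by that sender's subjects, and `generate(messages, "refer")` rewrites the same records in the other markup. Supporting a new format means adding a row to the table rather than writing another parser, which is the reuse argument the abstract makes.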
Work is under way on an interface, PipeFitter, that lets novices link the tools in UNIX-like pipelines to create applications for exploring bibliographies, mail and news, and a variety of other application domains (this work is being done by Lynn Snider, and will also be applicable to many UNIX filter programs). Other work is on the development of widgets for viewing SSRs in a variety of dynamically manipulable formats (this work is being done by J. Edward Swan, II). There is also an effort to "productize" the SST so that others can use it at three levels: C function libraries, UNIX filter programs, and a graphical connector.

The first release of the SST and PipeFitter will be in conjunction with the HCI Bibliography, a free-access extended bibliography on Human-Computer Interaction, being compiled at The Ohio State University. The SST is being used for manipulating bibliographic records, for handling mail requests for information about the project, and for managing information (including pictures) about people in HCI.