What:  A Structured Approach to Semi-Structured Information
Who:   Professor Gary Perlman
       The Ohio State University
       Columbus, Ohio, USA
When:  11:00 am, Tuesday, August 13, 1991
Where: Institute of Systems Science, National University of Singapore

Semi-structured information is not as well defined as database records
but more structured than full text.
Semi-structured records are similar to forms, attribute lists, and frames.
Semi-structured records contain a series of fields with arbitrary names
and (possibly typed) values, in which the order of fields is arbitrary and
often irregular, except that multiple instances of the same field are ordered.  
Examples of semi-structured information include bibliographic records,
electronic mail messages and news articles, survey questionnaires,
schedules of events, and personal databases such as address books,
household inventories, and grade rosters.
Semi-structured records can be extended by allowing pointers
in field values to other records; textual identifiers of other records
can effectively implement hierarchical and hypertext network structures.
For example, bibliographic records can point to their references,
news articles can point to the articles they discuss,
and hierarchical structures can be created by having superordinates
point to their subordinates.
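
As a rough illustration of the record model described above, the following
Python sketch stores a record as an ordered list of (field, value) pairs,
so that repeated fields keep their order, and resolves textual identifiers
to other records.  The names Record, values, and follow, and the sample
identifiers, are hypothetical and are not taken from the SST.

```python
class Record:
    """A semi-structured record: an ordered list of (field, value) pairs."""

    def __init__(self, ident, fields):
        self.ident = ident
        # A list, not a dict: the same field may appear several times,
        # and multiple instances of one field keep their relative order.
        self.fields = list(fields)

    def values(self, name):
        """Return every value of the named field, in record order."""
        return [value for field, value in self.fields if field == name]


def follow(records, record, field):
    """Treat the values of `field` as identifiers and return the records named."""
    return [records[ident] for ident in record.values(field) if ident in records]


# Made-up sample data: rec-a points to two other records through "cites".
records = {
    "rec-a": Record("rec-a", [("author", "First Author"),
                              ("author", "Second Author"),
                              ("title", "A sample title"),
                              ("cites", "rec-b"),
                              ("cites", "rec-c")]),
    "rec-b": Record("rec-b", [("title", "An earlier paper")]),
    "rec-c": Record("rec-c", [("title", "Another earlier paper")]),
}

cited = follow(records, records["rec-a"], "cites")
print([r.ident for r in cited])
```

The same follow step would serve for a hierarchy if a superordinate record
listed the identifiers of its subordinates in a field of its own.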

Given the diversity of semi-structured information and its apparent
suitability for many information sharing tasks,
it is not surprising that there is much software for managing it.
What is surprising is that the software used for one domain (e.g., mail)
is usually not used for another domain (e.g., bibliographies), even
if the domains are very similar (e.g., many people use different programs
for mail and news), although there are some notable exceptions
(e.g., emacs mail/news readers and the MH mail handler, which uses UNIX
file manipulation programs to manipulate mail messages).
One reason for the use of different software tools is that operations
in one domain may not make sense in another domain,
at least not at first glance.
Another reason for different tools is the difference of storage formats;
each domain has its own peculiar markup for the same basic structure,
making software reuse difficult.
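
To make the point about markup concrete, the sketch below reads two
markups, mail-style "Name: value" headers and refer-style "%X value"
lines, into the same list of (field, value) pairs by consulting a small
format table.  The table layout, the generic field names, and the
functions parse and generate are assumptions made for illustration; they
do not describe the actual formats or code of any particular program.

```python
# Hypothetical format table: how each markup spells one field line, and how
# its tags map onto generic field names shared by all formats.
FORMATS = {
    "mail":  {"prefix": "",  "separator": ": ",
              "names": {"From": "author", "Subject": "title"}},
    "refer": {"prefix": "%", "separator": " ",
              "names": {"A": "author", "T": "title"}},
}


def parse(text, fmt):
    """Read one record in the given markup into generic (field, value) pairs."""
    spec = FORMATS[fmt]
    fields = []
    for line in text.splitlines():
        if not line.startswith(spec["prefix"]):
            continue
        body = line[len(spec["prefix"]):]
        if spec["separator"] not in body:
            continue  # not a field line in this markup
        tag, value = body.split(spec["separator"], 1)
        # Unknown tags are kept under their own name.
        fields.append((spec["names"].get(tag, tag), value))
    return fields


def generate(fields, fmt):
    """Write generic (field, value) pairs back out in the given markup."""
    spec = FORMATS[fmt]
    tags = {generic: tag for tag, generic in spec["names"].items()}
    return "\n".join(spec["prefix"] + tags.get(name, name) + spec["separator"] + value
                     for name, value in fields)


mail_text = "From: someone@example.org\nSubject: Meeting notes"
fields = parse(mail_text, "mail")
print(generate(fields, "refer"))
```

Supporting another markup in this sketch means adding a table entry
rather than writing a new parser, which is the kind of reuse that
differing storage formats otherwise prevent.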

The SST (Semi-Structured Toolkit) attempts to provide an integrated
environment for semi-structured information.
At the heart of SST is a table-driven parser/generator that reads
and writes a wide variety of semi-structured record (SSR) formats,
representing them internally in a generic structure for manipulation.
Because SSRs have a simple structure, dynamic input parsing and output
generation can be done efficiently.
Common manipulations on the semi-structured records include matching, sorting,
editing, selecting, formatting, and viewing.
By creating a cross-tabulation of Operations by application Domains,
we have found many holes where an operation supplied in one application program
is not supplied in another, even though the missing operations would be useful.
For example, the Berkeley Mail program does not allow search operations
(although many other mail programs do), and it does not let users reorder
messages according to key fields such as date, sender, subject, and so on.
It may be the exception rather than the rule that functionality is domain
specific; the number of functions in a domain-specific application can
often be doubled and still include only intuitive functions.
For example, the generation of tables of contents (based on a hierarchy
of fields that are displayed only when the values change), common in
bibliographic applications, is readily applied to sorted mail and news.
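
The table-of-contents idea can be sketched in a few lines of Python:
records are sorted on a hierarchy of fields, and each field is printed
only when its value changes.  The function name table_of_contents, the
use of plain dictionaries for records, and the sample messages are
assumptions made for this example, not a description of how the SST
does it.

```python
def table_of_contents(records, hierarchy, leaf):
    """Return indented lines; a heading appears only when its value changes."""
    ordered = sorted(records, key=lambda r: [r.get(f, "") for f in hierarchy])
    lines = []
    previous = [None] * len(hierarchy)
    for rec in ordered:
        changed = False
        for depth, field in enumerate(hierarchy):
            value = rec.get(field, "")
            # A change at one level forces every deeper level to be shown again.
            if changed or value != previous[depth]:
                lines.append("  " * depth + value)
                previous[depth] = value
                changed = True
        lines.append("  " * len(hierarchy) + rec.get(leaf, ""))
    return lines


# Made-up mail messages, reduced to single-valued fields for brevity.
messages = [
    {"date": "1991-08-13", "sender": "alice", "subject": "Seminar room"},
    {"date": "1991-08-12", "sender": "bob", "subject": "Draft agenda"},
    {"date": "1991-08-13", "sender": "alice", "subject": "Slides"},
    {"date": "1991-08-13", "sender": "bob", "subject": "Re: Slides"},
]

toc = table_of_contents(messages, ["date", "sender"], "subject")
print("\n".join(toc))
```

Nothing in the function is specific to mail: given bibliographic records
and a hierarchy such as year and author, it would produce the
corresponding listing.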

The SST is implemented in ANSI C and is running in various forms on UNIX
(command line and X), Macintosh, and DOS.
Work is under way on an interface, PipeFitter, that lets
novices link the tools in UNIX-like pipelines to create
applications for exploring bibliographies, mail and news, and a variety
of other application domains (this work is being done by Lynn Snider,
and will also be applicable to many UNIX filter programs).
Other work is in the development of widgets for viewing SSRs in a variety of
dynamically manipulable formats (this work is being done by J. Edward Swan, II).
There is also an effort to "productize" the SST so that others can use it
at three levels: as C function libraries, as UNIX filter programs,
and through a graphical connector.  The first release of the SST and PipeFitter
will be in conjunction with the HCI Bibliography,
a free-access extended bibliography on Human-Computer Interaction,
being compiled at The Ohio State University.
The SST is being used for manipulating bibliographic records
and mail requests for information about the project, and for managing
information (including pictures) about people in HCI.