|STAT Statistical Data Analysis
Free Data Analysis Programs for UNIX and DOS
by Gary Perlman

History of |STAT

This page contains my recollections of the history of |STAT, particularly the early history. It's probably not of much interest to anyone but me, and it's mainly a place for me to gather information that might otherwise be lost. The order of events may not be completely accurate, and some facts may have been omitted, but those problems can be corrected over time. This page also serves as a FAQ (frequently asked questions) list; all the questions below have been asked and answered more than once.

Why would anyone write their own statistics package?

|STAT was an invention of necessity -- my necessity -- although I have tried to allow others to benefit from the effort. And it is highly skewed to the procedures used in experimental psychology. If I needed a statistical procedure or a feature, it would not mean much effort for me to add it, and then others could use it. I have never had the ambition of writing a complete package -- all that might mean is to have what other packages have -- although I have thought about more general models for design specification and analysis.

I was at UCSD in the late 1970s. Our lab ran UNIX (version 6) on a PDP 11/45 with 256K of memory (128K for UNIX). The UCSD Burroughs mainframe had the BMD and BMDP packages (among others). We used BMD-08V and BMD-P2V for most of our ANOVA needs, but it was inconvenient to print out data, punch it onto cards in another building, and wait for a job to finish and print its output. We were, after all, accustomed to an interactive time-sharing system!

BMD-P2V offered many new capabilities, but we found the program difficult to use. Data had to be placed in exact column positions and read with the equivalent of Fortran format statements. Errors were common. I tried to humanize BMD-P2V by rewriting its manual. It was my first real introduction to unusable software where errors could have career-ending effects.

Our lab obtained BMD-P for our PDP 11/45. The initial excitement turned to dismay when the programs proved far too large to fit in 128K of memory. They worked by overlaying -- loading some parts, unloading them, loading others -- which was time-consuming. Obtaining the mean of two numbers took several minutes of CPU time, and on a loaded time-sharing system that could stretch to 10-15 minutes and degrade the performance of all other users.

People wrote their own programs in C to do the most mundane analyses. We shared code, and some programs became used by more than their original authors.

How did you decide which programs to write?

In the fall of 1979, I got an idea about how to automatically choose bin sizes to make a good looking histogram, and desc was born (actually, it was first called hist). Over time, many stats were added, and desc was clearly a cut above the rest of the little programs.
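The bin-size idea can be illustrated with the classic "nice numbers" rule -- an assumption on my part, since the text does not say what algorithm desc actually uses: pick a bin width of 1, 2, or 5 times a power of ten so that roughly a target number of bins covers the data range.

```shell
# "Nice numbers" bin-width sketch (a common technique, not necessarily
# the one desc uses): for data spanning 0..87 and a target of about
# 10 bins, the raw width 8.7 rounds up to the "nice" width 10.
awk 'BEGIN {
  min = 0; max = 87; target = 10
  raw  = (max - min) / target          # raw width for target bins
  mag  = 10 ^ int(log(raw) / log(10))  # power of ten at or below raw
                                       # (int() truncates toward zero,
                                       #  so this sketch assumes raw >= 1)
  frac = raw / mag                     # in [1, 10)
  if (frac <= 1)      nice = 1
  else if (frac <= 2) nice = 2
  else if (frac <= 5) nice = 5
  else                nice = 10
  print nice * mag
}'
```

A rule like this keeps bin boundaries at round numbers, which is most of what makes a histogram "good looking."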

pair followed soon after. Don Gentner had written a paired data analysis program, but the output was inconsistent, especially in alignment. I rewrote it to align similar items and add more stats, and eventually added error handling (which was missing from most of these home-grown programs) and more options. I wrote a character-based bivariate plotting program, biplot, which eventually merged into pair as the -p option.

Next was dm, originally called the data massager but later the data manipulator for public distribution (interactive mode could have greeted analysts with "Welcome to the massage parlor," but I don't think it ever did). The major effort was in learning about building a parser with yacc (Yet Another Compiler-Compiler). Jay McClelland helped motivate me to add string handling to the numeric functions (he argued that it would be much easier for me to change the code than for him to, and I made most of the changes in our lab while he looked over my shoulder).

By the summer of 1980, I took on the task of writing anova, adapting as many of the methods from Keppel's Design and Analysis book as I could. Jay McClelland had written dt, the Data Tabulator, which was highly influential in the design of anova, particularly in how it simplified specifying the relationship of the data to the experimental design.

regress (initially called corr) followed later in 1980, as did many of the data validation facilities. Only a few programs, mostly for data manipulation, were added during my remaining time at UCSD.

While at UCSD, I started distributing the package (via magnetic tape). One request came from the Hospital for Sick Children in Toronto. "Yow!" I thought. "If there are bugs in the programs and these doctors are basing decisions on the results, they might say 'Well, these results from Gary Perlman's programs are clear. Off with Timmy's leg.'" I would wake up at night and think about doing more testing. It made me much more serious about testing, especially after making changes. It also made me reluctant to make changes.

Having been a professor who taught software engineering for 12 years, I've told that story in every class in which I discussed testing and the potential impact of poorly-tested software.

At the Wang Institute, I began adding to the package in 1985, particularly non-parametric/rank-based statistics.

Some non-statistical utilities were added around this time. I also put more effort into regression testing for the package (that is, testing to make sure that changes did not break things that were not supposed to change).
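The regression-testing idea can be sketched in a few lines of shell -- this is a generic illustration of the technique, not |STAT's actual test suite (the file names and the stand-in command are mine):

```shell
# Minimal regression test: run a program on fixed input, capture its
# output, and diff against a saved "golden" copy.  If a later change
# alters the output, diff exits nonzero and the test fails.
echo '3' > expected.txt                        # the golden output
printf '1\n2\n3\n' | tail -n 1 > actual.txt    # tail stands in for a |STAT program
diff expected.txt actual.txt && echo PASS
```

Rerunning a suite of such comparisons after every change is what catches the things "that were not supposed to change."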

The programs have remained remarkably stable over their first 20 years. Few features were added, and most changes came as part of package-wide updates (e.g., common help options across all programs). Fortunately, none of the programs ever had to be changed to fix computational errors.

Is there a detailed timeline?

No. Well, yes. It's not in a good format to include here. Maybe I'll add it?

How did the package make it to DOS?

While I was teaching at the Wang Institute, my wife-to-be was in the Psychology graduate program at Cornell. Fred Horan, a department programmer, worked on porting the package to the PC using the Lattice C compiler. After I helped with some of the stickier problems, the package ran without trouble.

What are the plans to port |STAT to new platforms?

There are no plans, so if you can't get |STAT to work on a platform, you might be out of luck: development is restricted to Linux. DOS/Windows users find that |STAT programs (like many others) won't run on their 64-bit machines; there is no plan to address this unfortunate turn of events. Mac users may find that some programs will not compile on some versions of the Mac OS; there are no plans to compile and test |STAT on the Mac. As time passes, new versions of C compilers point out new compatibility issues with |STAT; changes to the |STAT source code will be considered on a case-by-case basis.

What's the idea behind the conditions of use?

|STAT is distributed to people who agree not to distribute modified versions of the programs. This seems the opposite of the policies of other free software (e.g., the Free Software Foundation's policies), which promote (or require) the distribution of derivative works. But perhaps unlike other software, it can be difficult to tell whether a statistical computation is correct. Statistical software is used to make decisions, often by people who lack detailed knowledge of the theory behind the statistics or the practical limitations of the software. This creates a potentially dangerous situation, and one that is best avoided.

The policy for |STAT arose from experiences with well-meaning people sending in enhancements to various programs. Each enhancement was accompanied by an enthusiastic message that indicated a real sense of cooperation and pride. Unfortunately, they also included bugs -- computational errors -- that could result in an incorrect decision. After about ten contributions, none of which had ever been included because none had ever been correct, the non-redistribution policy was born.

I have often considered distributing the validation suite with the software, so that the software could be checked after compilation. Once, someone suggested using a new version of a compiler for a huge performance improvement; the results came back in one tenth the time -- wrong results, but fast! Usually, however, portability problems are obvious, and it would be a large additional effort to prepare the scripts for distribution, so that feature may have to wait until the need arises.

What's the deal with that name?

Early on, being on UNIX, using pipes (which are indicated with the pipe symbol |), I called the package UNIX|STAT. When I went to work at Bell Labs, their lawyers said not to use UNIX on anything (actually, they just crossed out the name UNIX), so I was left without a name.
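The pipe-based design behind the name can be sketched with standard UNIX tools -- awk stands in for a |STAT program here, and nothing below is actual |STAT syntax:

```shell
# Each tool reads a column of numbers on stdin and writes results to
# stdout, so analysis steps compose with | -- the symbol the package
# is named for.  Here a data source feeds a stand-in "descriptive
# statistics" step that prints the mean.
printf '1\n2\n3\n4\n' | awk '{ s += $1; n++ } END { print s / n }'
```

That composition style -- small filters chained by pipes rather than one monolithic package -- is the design the name commemorates.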

I ran a contest with some students, and although they had some good ideas (okay, I'm lying here, but you want to encourage students), all of them would have required some effort on my part to protect the name from infringers. Continuing the package's tradition of never doing more work than is necessary for my needs (or my wife's, depending on how forcefully she expresses her needs), I decided that a name like "STAT" would be ideal: it was generic and unprotectable, so I would not need to worry about protection, infringement, etc. I kept the pipe symbol, although that resulted in some people interpreting it as an uppercase "I" and citing |STAT as ISTAT.

Some people ask me how to pronounce "|STAT". The pipe symbol is silent. Now you know. Some people like to call it "pipe stat", which is probably better than a silent symbol, but it leans toward something that I might need to protect, so I have not considered it seriously.

Where can I read more?

  1. Perlman, G. (1980) Data analysis programs for the UNIX operating system. Behavior Research Methods and Instrumentation, 12:5, 554-558.
  2. Perlman, G. (1982) Experimental Control Programs for the UNIX Operating System, Behavior Research Methods and Instrumentation, 14:4, 417-421.
  3. Perlman, G. (1982) Data Analysis in the UNIX Environment: Techniques for Automated Experimental Design Specification. In K. W. Heiner, R. S. Sacher, & J. W. Wilkinson (Eds.), Computer Science and Statistics: Proceedings of the 14th Symposium on the Interface.
  4. Perlman, G., & Horan, F. L. (1986) Report on |STAT Release 5.1 Data Analysis Programs for UNIX and MSDOS. Behavior Research Methods, Instruments, & Computers, 18:2, 168-176.
  5. Perlman, G., & Horan, F. L. (1986) |STAT: Compact Data Manipulation and Analysis Programs for MSDOS and UNIX - A Tutorial Overview. Tyngsboro, MA: Wang Institute of Graduate Studies.
  6. Perlman, G. (1987) The |STAT Handbook
  7. Conlon, M. (1989) Review of |STAT 5.3, The American Statistician, 43:3, 171-174.

© 1986 Gary Perlman