Haupz Blog

... still a totally disordered mix

Array Programming on the Command Line

2024-06-15 — Michael Haupt

Having spent most of my adult computing life in Unix-derived operating systems, I’m quite fond of having a good command line interface (vulgo shell) available at all times. The Unix command line tools are small but powerful commands, each tailored to do one kind of thing well. It is a definite upside that they all work on one shared representation of data that is passed (or “piped”) between them: text.

Text typically comes in lines, and processing line-based content with tools like grep, less, head, sort, uniq, and so forth is great. Editing data streams on the fly is possible using tools like sed, with just a bit of regular expression knowledge.
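As a small illustration of that line-oriented style, here is a sketch of two pipelines. The sample words and the function names are made up for this example.

```shell
# Hypothetical sample input: one word per line.
words() { printf '%s\n' apple pear apple fig pear apple; }

# Group identical lines, count them, and keep the two most frequent.
top_words() { words | sort | uniq -c | sort -rn | head -n 2; }

# Edit the stream on the fly: turn a leading "a" into "A".
capitalize_a() { words | sed 's/^a/A/'; }

top_words
capitalize_a
```

The exact padding in front of the counts printed by uniq -c differs between implementations, so anything consuming that output should not rely on it.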

When data becomes slightly more two-dimensional in nature (lines are broken down into fields, e.g., in CSV files), awk quickly steps in. Its programming model is a bit awkward (dad-joke level pun intended), but it provides great support for handling those columns.
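A minimal sketch of that column handling, assuming a made-up CSV with a name, a quantity, and a price per line (the data and function names are hypothetical):

```shell
# Hypothetical CSV sample: name,quantity,price
orders() { printf '%s\n' 'pen,2,1.50' 'ink,1,4.00' 'pad,3,2.00'; }

# -F, splits every line into fields on commas; $1, $2, $3 address the columns.
# Print each name with its line total (quantity * price).
line_totals() { orders | awk -F, '{ printf "%s %.2f\n", $1, $2 * $3 }'; }

line_totals
# pen 3.00
# ink 4.00
# pad 6.00
```

Note that splitting on commas like this does not understand quoted CSV fields that contain commas themselves.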

Two-dimensional data like that can sometimes come in the form of matrices that may have to be pushed around and transformed a bit. Omit a column here, transpose the entire matrix there, flip the two columns yonder, oh, and sum the values in this column please. While this is possible using awk, its programming model is a bit low level for those jobs.
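To give an idea of what "low level" means here, this is roughly what a transpose and a column sum look like in plain awk. It is only a sketch over a made-up 2x3 matrix; the function names are hypothetical.

```shell
# Hypothetical 2x3 matrix, whitespace-separated.
matrix() { printf '%s\n' '1 2 3' '4 5 6'; }

# Transpose: store every cell, then print each column as a row at the end.
transpose_awk() {
  matrix | awk '
    { for (i = 1; i <= NF; i++) cell[NR, i] = $i
      if (NF > cols) cols = NF }
    END {
      for (i = 1; i <= cols; i++) {
        row = cell[1, i]
        for (j = 2; j <= NR; j++) row = row " " cell[j, i]
        print row
      }
    }'
}

# Sum the values in the second column.
sum_col2() { matrix | awk '{ s += $2 } END { print s }'; }

transpose_awk
sum_col2
```

It works, but the bookkeeping (index loops, a cell array, an END block) is all on the programmer.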

Enter rs and datamash.

The rs tool, according to its manual page, exists to “reshape a data array”. As mentioned, those Unix command line tools do one thing and do it very well, and rs is a case in point. It’s a wonderful little subset of APL for two-dimensional arrays.
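A brief sketch of what that looks like in practice, assuming rs is installed (the matrix and the function names are made up for illustration). The -T option asks for a pure transpose, and giving a row and column count reshapes the entries into that many rows and columns.

```shell
# Hypothetical 2x3 matrix, whitespace-separated.
matrix() { printf '%s\n' '1 2 3' '4 5 6'; }

# Swap rows and columns.
transpose_rs() { matrix | rs -T; }

# Reflow the same six entries into 3 rows of 2 columns.
reshape_rs() { matrix | rs 3 2; }

# Only run the examples where rs is actually available.
if command -v rs >/dev/null 2>&1; then
  transpose_rs
  reshape_rs
fi
```

rs aligns its output columns with padding, so the spacing between entries may differ from the input.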

Sometimes reshaping isn’t enough, and computations are needed. Like rs, datamash is a nice little subset of APL, only this time not focused on reshaping, but on computing. To be fair, the capabilities of datamash also cover those of rs, but while rs is often part of a standard installation, datamash requires an installation step. (This may change in the future.) With datamash, numerous kinds of column- and line-oriented operations are possible.
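Here is a short sketch of a few such operations, assuming GNU datamash is installed. The CSV sample and the function names are hypothetical; -t, tells datamash to use a comma as the field separator instead of the default tab.

```shell
# Hypothetical CSV sample: label,value
table() { printf '%s\n' 'a,1' 'b,2' 'c,3'; }

# Sum the values in column 2.
sum_values() { table | datamash -t, sum 2; }

# Transpose the whole table: rows become columns.
transpose_table() { table | datamash -t, transpose; }

# Reverse the order of the fields in every line, i.e. flip the two columns.
flip_columns() { table | datamash -t, reverse; }

# Only run the examples where datamash is actually available.
if command -v datamash >/dev/null 2>&1; then
  sum_values        # 6
  transpose_table   # a,b,c and 1,2,3 on two lines
  flip_columns      # 1,a then 2,b then 3,c
fi
```

Compared with the awk route, each of these jobs is a single word on the command line.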

These tools are two fewer reasons to fire up R or Excel and import that CSV file ...

Tags: hacking