Typing group by, Revisited

2020-04-29

In an aside to a previous post, I mentioned that I’d rather live in the alternate universe where data science tools were built on rich static type systems rather than evolved piecemeal from a variety of “dynamic” crud. Yesterday, we examined the semantics of SQL’s group by operation, and pieced together a minimalist version of a “data frame” in Python. I lamented that since Python didn’t provide a type system that let us check any invariants of interest, we’d have to do all these checks at runtime. Today, I’m going to put my money where my mouth is. Let’s peer into the alternate universe where data frames are well-typed, and answer the question left unanswered yesterday: what is the type of group by?

This will take us “through the looking glass” to a world of purely functional code, dependent types, type-driven programming, and “correct-by-construction” data. Let me show you how deep the rabbit hole goes.

Eats, Shoots, and Leaves

Data frames are an oddity: a data structure ubiquitous in data science but relatively unknown to ordinary programmers. I am not sure where they originated, but they go back at least to the early 90s via the S statistical programming language. I became familiar with them through S’s descendent R, which makes heavy use of them. Python programmers in data science will likely be familiar with data frames through the pandas library.

Either way, the concept is the same: a data frame is a helpful representation for a tabular data set. It can be represented in one of in one of two ways: as a heterogeneous collection of homogeneous “columns”, or as a homogeneous collection of heterogeneous “rows”. The representations are isomorphic: we can losslessly go from one to the other and back.

Data frames provide operations for accessing elements in various ways: by row, by column (either by ordinal number or by name), and by slicing along either of these dimensions.

Consider a typical table, perhaps from a relational database or a spreadsheet.

patientID age condition outcome
1 43 placebo no change
2 29 trial improved
3 35 trial no change
4 50 placebo worsened

We have four columns and four rows. The values within each column are all of the same type (that’s what we mean by homogeneous) but the types differ between columns (that’s what we mean by heterogeneous).

Most data frame implementations allow for access by column (and sometimes even row) names rather than indices, but since that is easy enough to add to an ordinal-based implementation by using a lookup structure from names to indices, we’ll work only with ordinal indices for our proof of concept.

In pandas, we could create this data frame with the following:

import pandas as pd

df = pd.DataFrame({
    'patientID': [1, 2, 3, 4],
    'age': [43, 29, 35, 50],
    'condition': ['placebo', 'trial', 'trial', 'placebo'],
    'outcome': ['no change', 'improved', 'no change', 'worsened'],
})

We can see from the way the data frame is constructed that pandas uses the collection-of-columns representation. It’s a good choice, especially for performance reasons. In the Python context, it lets us use numpy arrays as column representations and take advantage of relatively fast vectorized operations on them.

The tradeoff is that traversing a data frame by rows becomes slower and more difficult. In pandas we can convert a data frame into sequence-of-rows format using df.itertuples().

We can slice-and-dice a data frame in various ways:

# one row
print(df.iloc[1])

patientID 2
age 29
condition trial
outcome improved
Name: 1, dtype: object

# one column
print(df.iloc[:,1])

0 43
1 29
2 35
3 50
Name: age, dtype: int64

# a row slice
print(df.iloc[1:3])

patientID age condition outcome
1 2 29 trial improved
2 3 35 trial no change

# a column slice
print(df.iloc[:,1:3])

age condition
0 43 placebo
1 29 trial
2 35 trial
3 50 placebo

Both R and pandas also provide a suite of “SQL-like” operations on data frames. For instance, we might split a data frame into sub-frames based on unique values of a key column—or perform a group by style aggregation. pandas provides this with the groupby method:

import numpy as np
df.groupby('condition').agg([len, np.mean])

patientID age
len mean len mean
condition
placebo 2 2.5 2 46.5
trial 2 2.5 2 32.0

That’s… something, alright. The documentation for groupby is literally bewildering. I’m sure there’s a way to get something more sensible out, but it’s opaque enough to me that I’m not tempted to try.

This brings me to the reasons I am a notorious pandas hater. I think pandas exposes way too large an API surface. It tries to do too many things; I’ve seen programmers use it just to read a small CSV file, or fetch a single column from a SQLite database. I’ve also seen horrific “SQL in Python” ETL scripts which take hours to do, in pandas, on the client side, what could be done in minutes on the database server with SQL.

I don’t want to complain about performance today, though. Instead I want to complain about dynamic typing and its abuses. The problem I find when trying to make sense of pandas documentation is that none of the functions are “typed” in a meaningful way. They return objects of various runtime types depending upon the arguments passed, the shape of the dataframe, and the phase of the moon.

I’m a huge fan of “type-driven development”: contrary to popular belief, most static typing aficionados don’t think they’re smarter than other programmers. To quote Dijkstra, we’re “fully aware of the limited size of [our] own skull[s]” and want the compiler to take care of as much tedious bookkeeping, and as much error checking, as possible! With a well-typed API, we can piece together software like a Lego set—a simple type signature can sometimes be more useful than reams of “operational” documentation like pandas provides.

Data frames are tricky beasts for most statically typed languages. Your common-or-garden type systems don’t excel at typing heterogeneous collections, and less so at computing (say) the type of a resulting data frame from a slicing operation. We can do better, though, by taking a step outside the garden. In a few hundred lines of code, we’ll have a “prototype-quality” data frame implementation which statically checks:

Into the Mirror Universe

Most readers of this blog will have learned to program in the “procedural” or “imperative” fashion: programs are essentially laundry lists of instructions which manipulate the state of various mutable cells containing values. There are infinite variations on this theme (object-oriented programming, structured programming, On Error GoTo HELL) but they share a common, mechanistic philosophy of programming. Programming, in the imperative mode, is about a machine: we understand that the machine has a memory, and that it operates sequentially, loading instructions and executing them. A Turing machine, or something not unlike one, in short.

There are other ways to model computation, equivalent in “power” (this is the crux of the Church-Turing thesis. One such model is drawn from Alonzo Church’s work on the lambda calculus: a formal system for computing only with functions, in the mathematical sense.

In imperative languages, we write “functions” but they often break the assumptions we can make of well-behaved mathematical functions: they may have side-effects like writing to a file, or altering mutable global state. As a consequence, we can’t indulge in equational reasoning about our programs. In an imperative language, x == x does not imply f(x) == f(x): perhaps f uses and increments a global counter, say.

The discipline of functional programming elaborates on the lambda calculus. We construct programs from pure “mathematical” functions. By pure, we mean that our functions do not have side effects (yes, this raises questions about how we ever “do” anything; we’ll return to this thought), but produce results which depend only and explicitly on their inputs.

As a consequence, we can easily apply equational reasoning to our programs: if x == x then surely f(x) == f(x). This has some benefits for compiler code optimization, but the primary benefit is to the programmer. We reduce the number of things we have to keep in our heads.

Functional programming tends to operate on immutable structures (like prepend-only linked lists) rather than mutable structures (like contiguous mutable arrays). This poses some challenges for performance, but we generally have ways of working around them. When absolutely needed, most functional languages provide an “escape hatch” for executing imperative code—or better yet, for wrapping imperative code behind a safe functional interface.

This is related to the answer to our previous question about interacting with the real world. One way or another, functional languages provide some loophole to “make things happen”: I/O and other side effects. Haskell’s (in)famous monads are an example of a principled solution to this problem.

“Well-Typed Programs Don’t Go Wrong”

Functional languages have been hotbeds of type theory research and type system development since Robin Milner’s work on ML in the 1970s. ML was, as the old saying goes, “an improvement on many of its successors”: it pioneered type inference for polymorphic functions, algebraic data types, and the use of pattern matching as a concise notation for working with inductively-defined data.

Milner famously coined the slogan “well-typed programs can’t go wrong”, which is a bit misleading. Of course programs of any sort can “go wrong”—but Milner means “go wrong” in a specific technical sense. In a sound type system, programs which typecheck can’t exhibit the sort of “undefined behavior” we find in languages like C: they are nasal-demon free even when they contain logic errors of the “do what I mean, not what I say” variety.

On that note, how can we enhance our ability to “say what we mean” with types? ML kicked off a miniature Cambrian explosion of functional languages and type systems. O’Caml (a somewhat “conservative” evolution of ML with some object-oriented features) and Haskell (a more avant-garde “research language” which has begun to see more serious industrial use) are the most commonly known modern relatives.

However, a lesser-known branch of the family tree permits the use of dependent types. Dependent on what? Henk Barendregt’s “lambda cube” is my favorite way to answer this, but I’ll provide a short answer.

Every language with a type system to speak of (even the ones that don’t realize it) has types and terms. Terms are the pieces of code we construct programs from, and types classify them. We say that a term “has” a type (although that type may not be unique!): the literal 3 is an integer.

It’s common for terms to depend on other terms: we do this every time we write a function!

Languages with polymorphic types (the imperative world tends to call these “generics”) make it possible for terms to depend on types as well: for instance, we might be able to write a polymorphic “pair-with-self” function dup x = (x, x) which is typed as something like forall t . t -> (t, t), where t is a “type variable” which stands in for any argument type the function is called with.

These same languages also often provide polymorphic (or “generic”) type constructors, which let types depend on types. For a possibly more familiar example, we might have

class List<T> {
    ...
    private T[] buffer;
}

Type constructors can be thought of as functions from types to types. Some languages, like Haskell, permit the use of higher-kinded types which allow polymorphism over type constructors themselves.

However, there’s a combination we don’t usually see: what if types could depend on terms? This would let us write types like “the type of arrays of exactly 5 integers”, where 5 is just an ordinary term and not some wacky type-level emulation. This feature is exactly what we mean by dependent types. In a sense, dependent type systems erase the type vs. term distinction. We can freely intermix the two in our programs, even apply term-level functions inside type definitions.

We’re going to be using a dependently-typed functional programming language called Idris. If you’d like to follow along, make sure you download and use the latest release of Idris 1 (not Idris 2, which is still an exciting work in progress).

Idris has a very similar syntax to Haskell, with a few syntactic refinements and the addition of true dependent types. I highly recommend it as a relatively accessible first step into the dependently-typed world, especially if you have previous Haskell or even ML or Lisp experience. Edwin Brady’s Type-Driven Development with Idris is a great resource: not as a language reference (that’s online), but rather to see an expert use dependent types “in anger” to solve surprising problems.

A quick primer on Idris syntax, especially for non-Haskellers:

-- this is a comment line

-- "x has type Integer" (a type signature)
x : Integer
-- define x
x = 3

-- function types are denoted by arrows
add3 : Integer -> Integer
-- this is a function definition; definitions are by pattern matching
--   and thus mirror use
add3 x = x + 3

-- function application is denoted by adjacency (not parentheses-they're just
--   for grouping), and is left-associative: f g x means (f g) x, not f (g x)
y : Integer
y = add3 x

-- multiple-argument functions are "curried": they really take one argument at
--   a time and returned the "partially applied" function
fusedMultAdd : Integer -> Integer -> Integer -> Integer
fusedMultAdd a b c = a + (b * c)

multAdd5 : Integer -> Integer -> Integer
multAdd5 = fusedMultAdd 5

-- polymorphic terms and types use type variables, which we can create
--   "implicitly" by mentioning them
-- tuple types are written (ty1, ty2, ...)
dup : a -> (a, a)
dup x = (x, x)

-- terms which denote types have type Type (this is what I meant
--   about erasing distinctions)
ty : Type
-- List is a type constructor of type Type -> Type
ty = List Integer

-- we can write algebraic data types with potentially dependent
--   constructor type signatures
data Either : Type -> Type -> Type where
  Left : a -> Either a b
  Right : b -> Either a b

Apple Pie from Scratch

If you wish to make an apple pie from scratch, you must first create the universe.
–Carl Sagan

We’re going to do this the hard way. We’ll take full advantage of Idris’ type system, but make only minimal use of the standard library: everything with an “interesting” type, we’ll (re-)implement ourselves. We will make use of ordinary, non-dependent linked lists and a couple of functions on them in places: these are defined the “usual” way and re-implementation wouldn’t be germane or interesting.

Many of the types we are about to implement appear in the Idris standard library: Nat, Fin, Vect, and HVect are all available, and for the most part should be implemented nearly identically. This isn’t because we’re going to copy-and-paste code! In fact, we’re going to let the compiler write most of the code for us, following the type signatures. Given the types involved, there’s often “one right way” to do it, which the compiler can efficiently guide us toward. This is a fun feature of Idris which really has to be experienced to be believed; you can watch it in action to get a taste in Edwin Brady’s conference talk on type-driven development.

Building these types from scratch will give us the opportunity to ease into the shallow end of dependent typing before encountering the more complex types and definitions which will be involved in defining data frames.

Peano Recital

The few of you who have seen dependently-typed programs before probably can guess what’s coming next. We’re going to need to talk about structures of known length, and reason about them inductively. In order to do this, we’re going to need to construct the natural numbers (that is, the non-negative integers) first!

Following the Peano style, we’ll define naturals inductively: a natural number is either zero (Z), or the successor (S n) of (“one more than”) a natural number.

Our program proper begins here—all subsequent code snippets, except REPL interactions (you’ll see a *tafra> prompt), are part of the same Idris module.

%hide Nat
-- we'll roll our own, this is Carl Sagan's apple pie

-- natural numbers a la Peano
data Nat : Type where
  Z : Nat
  S : Nat -> Nat

-- addition, by recursion on the left
plus : Nat -> Nat -> Nat
plus Z m = m
plus (S n) m = S (plus n m)

This is a verbose representation: for instance, 3 is represented as S (S (S Z)). That’s annoying from a human perspective when dealing with larger numbers, but there are workarounds. We’ll use raw Peano representations, though, for clarity and consistency. It’s also problematic from a performance standpoint: 3 is represented by a linked list of 3 elements! One can see that this won’t necessarily scale. Performance is a theme we’ll return to; for now we’ll remark that in many cases we’ll be able to “erase” these numbers so that they don’t need to exist at runtime, but this is an imperfect art and performance is a valid concern.

We can now talk about lengths or counts (cardinal numbers), but what about indices (ordinal numbers)? Let’s introduce the notion of “finite sets” of natural numbers less than a given number. For a given n : Nat, values of type Fin n will exactly correspond to valid indices into a collection of length n (assuming zero-based indexing, as all decent folk should—and you thought that was arbitrary convention).

-- "finite sets" - naturals less than a given number
data Fin : Nat -> Type where
  FZ : Fin (S n) -- zero is less than any successor
  FS : Fin n -> Fin (S n) -- by induction

We will build these indices the same way as our naturals, and the same caveats apply: to request the third element of a collection we’ll need FS (FS FZ).

Now we need something to index into! We’ll start by building homogeneous vectors: these are exactly the representation we’ll use for each column of our data frames. In dependent-typing parlance, vector types are parameterized by their length (a Nat) and their element type (a Type).

We can use pattern matching to provide checked inductive definitions for access, concatenation, and mapping—tracking the result types with “type-level arithmetic”. Of course, in Idris, our type-level arithmetic is just arithmetic!

-- "vectors" - known-length homogeneous sequences
namespace Vect
  data Vect : Nat -> Type -> Type where
    Nil : Vect Z a
    (::) : a -> Vect n a -> Vect (S n) a

  -- bounds-checked element access
  index : Fin n -> Vect n a -> a
  index FZ (x :: _) = x
  index (FS i) (_ :: xs) = index i xs

  -- length-tracking concatenation
  (++) : Vect n a -> Vect m a -> Vect (plus n m) a
  [] ++ ys = ys
  (x :: xs) ++ ys = x :: xs ++ ys

  -- we can map over vectors, preserving type
  Functor (Vect n) where
    map f [] = []
    map f (x :: xs) = f x :: map f xs

  -- we'll want this later to implement "slicing"
  vslice : Vect k (Fin n) -> Vect n a -> Vect k a
  vslice [] _ = []
  vslice (i :: is) xs = index i xs :: vslice is xs

  -- we'll want this to walk two parallel vectors in "lock-step"
  zip : Vect n a -> Vect n b -> Vect n (a, b)
  zip [] [] = []
  zip (x::xs) (y::ys) = (x, y)::zip xs ys

We’re now equipped to model known-length homogeneous collections. Idris has a nice syntax sugar: it will translate [x, y, z] into x :: y :: z :: [], for any type with constructors :: (“cons”) and Nil.

This lets us write linked lists, vectors, and even the more interesting types we’ll be constructing shortly using the same convenient notation.

Let’s define an example Vect of 3 strings, and another of 4 strings.

exVect1 : Vect (S (S (S Z))) String
exVect1 = ["first", "second", "third"]

exVect2 : Vect (S (S (S (S Z)))) String
exVect2 = ["uno", "dos", "tres", "catorce"]

We can try out our functions at the Idris REPL. For example, we can index a Vect with an appropriate Fin, but not an invalid one!

*tafra> index (FS (FS (FS FZ))) exVect2
"catorce" : String

*tafra> index (FS (FS (FS FZ))) exVect1
(input):1:1-36:When checking an application of function Main.Vect.index:
        Type mismatch between
                Vect (S (S (S Z))) String (Type of exVect1)
        and
                Vect (S (S (S (S n)))) iType (Expected type)

        Specifically:
                Type mismatch between
                        Z
                and
                        S n

That error is saying we “ran out of valid indices” (in other words, hit Z) before reaching our requested index. Note that it’s not happening “at runtime”—this is the compiler carrying out static type checking.

vslice is a “vector slice” operation which we’ll want later. It fetches multiple elements from a Vect, using a collection of indices which is itself a Vect of Fin n, where n is the length of the Vect we’re indexing in to. This gives us safe, checked slicing:

*tafra> vslice [FZ, FZ, (FS (FS (FS FZ)))] exVect2
["uno", "uno", "catorce"] : Vect (S (S (S Z))) String

*tafra> vslice [FZ, FZ, (FS (FS (FS FZ)))] exVect1
(input):1:1-42:When checking an application of function Main.Vect.vslice:
        Type mismatch between
                Vect (S (S (S Z))) String (Type of exVect1)
        and
                Vect (S (S (S (S n)))) a (Expected type)

        Specifically:
                Type mismatch between
                        Z
                and
                        S n

map, ++ (concatenation), and zip work just like their equivalents for lists, but with length checking.

*tafra> zip exVect1 exVect1
[("first", "first"), ("second", "second"), ("third", "third")]
  : Vect (S (S (S Z))) (String, String)

*tafra> exVect1 ++ exVect2
["first", "second", "third", "uno", "dos", "tres", "catorce"]
  : Vect (S (S (S (S (S (S (S Z))))))) String

-- OK, a little cheat here: the result of length is a built-in Nat, not our Nat
*tafra> map length exVect1
[5, 6, 5] : Vect (S (S (S Z))) Nat

We won’t go into as much detail regarding all the functions which operate on our next types, but hopefully that provides a taste of interactive development with Idris and some intuition for how our types are working.

Modeling Heterogeneity

Here’s where things get interesting, and where we start to move beyond what can be done straightforwardly with anything short of full dependent types. Everything previous to this point can be written, perhaps tediously, in not-quite-fully-dependent systems: we can construct things like Nat at the type level; or, better, we can take advantage of “lightweight dependent types” features like GADTs.

In large part that’s because the types resulting from, say, indexing a Vect are easily determined. Since a Vect is homogeneous, the only interesting check is for index validity; the type of any element is fixed.

To represent “rows” of our data frames, we’re going to need a heterogeneous structure: something like a tuple. We’ll parameterize these HVect (heterogeneous vector) types by a Vect n Type—in other words, by the vector of their element types.

-- heterogeneous vectors - we'll need these to represent "rows"
namespace HVect
  data HVect : Vect n Type -> Type where
    Nil : HVect []
    (::) : a -> HVect as -> HVect (a::as)

  -- bounds-checked element access, with appropriate return type
  index : {n : Nat} ->
          {as : Vect n Type} ->
          (i : Fin n) ->
          HVect as ->
          index i as
  index FZ [] impossible
  index (FS _) [] impossible
  index FZ (x :: _) = x
  index (FS i) (_ :: xs) = index i xs

  -- HVects can be compared for equality, when their components
  --   are comparable - we'll need this for "group by"

  Eq (HVect []) where
    [] == [] = True

  (Eq a, Eq (HVect as)) => Eq (HVect (a::as)) where
    (x :: xs) == (y :: ys) = x == y && xs == ys

  -- length- and type-tracking concatenation
  (++) : HVect as -> HVect bs -> HVect (as ++ bs)
  [] ++ ys = ys
  (x :: xs) ++ ys = x :: xs ++ ys

  -- first element only
  head : HVect (a::as) -> a
  head (x :: _) = x

  -- all but first element
  tail : HVect (a::as) -> HVect as
  tail (_ :: xs) = xs

Let’s create an example HVect and interact with it at the REPL. Remember, because our data constructors are called :: and Nil, we can use [x, y, z] notation.

exHVect : HVect [Integer, String, Nat, String]
exHVect = [2, "fast", (S (S Z)), "furious"]

Now we’ll see the resulting values and types from manipulating it:

*tafra> index (FS FZ) exHVect
"fast" : String

*tafra> exHVect ++ exHVect
[2, "fast", S (S Z), "furious", 2, "fast", S (S Z), "furious"]
  : HVect [Integer,
           String,
           Nat,
           String,
           Integer,
           String,
           Nat,
           String]

*tafra> head exHVect
2 : Integer

*tafra> tail exHVect
["fast", S (S Z), "furious"] : HVect [String, Nat, String]

We now have everything we need to build our minimalist data frames (as before, let’s call this type a “tafra”, plural “tafrae”), because it’s the “innards” of a proper data frame).

namespace Tafra
  -- a "tafra" is a list of same-length vectors of various types
  -- the type is indexed by the row count and a vector of column types
  data Tafra : (rows : Nat) -> Vect cols Type -> Type where
    Nil : Tafra rows []
    (::) : Vect rows a -> Tafra rows as -> Tafra rows (a::as)

  -- we can concatenate tafras with the same column types and different
  --   row counts "vertically"
  (++) : Tafra n as -> Tafra m as -> Tafra (plus n m) as
  [] ++ [] = []
  (topFirst :: topRest) ++ (botFirst :: botRest) =
    Vect.(++) topFirst botFirst :: topRest ++ botRest

  -- we can concatenate tafras with different column types but the same row
  --   counts "horizontally"
  hcat : Tafra n as -> Tafra n bs -> Tafra n (as ++ bs)
  hcat [] y = y
  hcat (x :: xs) y = x :: hcat xs y

  -- indexing by columns is simple, and returns a column vector
  index : {cols: Nat} ->
          {as : Vect cols Type} ->
          (i : Fin cols) ->
          Tafra rows as ->
          Vect rows (index i as)
  index FZ [] impossible
  index (FS _) [] impossible
  index FZ (x :: _) = x
  index (FS i) (_ :: xs) = index i xs

  -- indexing by rows is trickier - we have to return a heterogeneous "tuple"
  rindex : Fin rows -> Tafra rows as -> HVect as
  rindex _ [] = []
  rindex i (x :: xs) = index i x :: rindex i xs

  -- we can also "slice" by multiple row or column indices
  -- we'll do this a little inefficiently

  -- "column" slices
  slice : {cols : Nat} ->
          {as : Vect cols Type} ->
          (ixs : Vect k (Fin cols)) ->
          Tafra rows as ->
          Tafra rows (vslice ixs as)
  slice [] _ = []
  slice (i :: is) xs = index i xs :: slice is xs

  -- "row" slices
  rslice : {rows : Nat} ->
           (ixs : Vect k (Fin rows)) ->
           Tafra rows as ->
           Tafra k as
  rslice _ [] = []
  rslice [] (_ :: xs) = [] :: rslice [] xs
  rslice is (x :: xs) = vslice is x :: rslice is xs

  -- just the first row
  rhead : Tafra (S n) as -> HVect as
  rhead [] = []
  rhead ((x :: _) :: ys) = x :: rhead ys

  -- all rows but the first
  rtail : Tafra (S n) as -> Tafra n as
  rtail [] = []
  rtail ((_ :: xs) :: ys) = xs :: rtail ys

  -- combine the "head" and the "tail" (we'll need this to
  --   correctly type "group by"
  rcons : HVect as -> Tafra n as -> Tafra (S n) as
  rcons [] [] = []
  rcons (x :: xs) (r :: rs) = (x :: r) :: rcons xs rs

  -- transpose to a row-oriented representation
  transpose : {rows : Nat} -> Tafra rows as -> Vect rows (HVect as)
  transpose {rows = Z} _ = []
  transpose {rows = (S m)} xs = rhead xs :: transpose (rtail xs)

  -- rebuild a tafra from a transposed representation
  untranspose : {rows : Nat} ->
                {cols : Nat} ->
                {as : Vect cols Type} ->
                Vect rows (HVect as) ->
                Tafra rows as
  untranspose {rows = Z} {as = []} [] = []
  untranspose {rows = Z} {as = (t :: ts)} [] = [] :: untranspose []
  untranspose {rows = (S m)} {as = []} _ = []
  untranspose {rows = (S m)} {as = (t :: ts)} (xs :: yss) =
    (head xs :: map head yss) :: untranspose (tail xs :: map tail yss)

  -- a zero-row tafra of any type
  empty : Tafra Z as
  empty {as = []} = []
  empty {as = (_ :: _)} = [] :: empty

There’s some new syntax here. Recall that previously we mentioned that type variables can be created implicitly. In general, Idris will create implicit arguments whenever a previously unmentioned lowercase name appears in a type signature. The {n : Nat} notation makes these “implicts” explicit, so that we can name and type them ourselves, as well as pattern-match on them. We need this in a few spots to correctly type Tafra functions.

Let’s create a couple of example tafrae, and interact with them:

someData : Tafra (S (S (S Z))) [Double, Bool, List String]
someData =
  [ [23.4, 45.6, 78.9]
  , [True, False, True]
  , [ ["yo", "dawg"]
    , ["i heard you", "like structures"]
    , ["so i put structures in your structures"]
    ]
  ]

otherData : Tafra (S (S Z)) [Int, String]
otherData =
  [ [99, 45]
  , ["some", "text"]
  ]

Indexing is possible, and checked, by column or row index. A column is a Vect and a row is an HVect.

*tafra> index FZ someData
[23.4, 45.6, 78.9] : Vect (S (S (S Z))) Double

*tafra> rindex FZ someData
[23.4, True, ["yo", "dawg"]] : HVect [Double, Bool, List String]

*tafra> rindex (FS (FS FZ)) otherData
(input):1:1-29:When checking an application of function Main.Tafra.rindex:
        Type mismatch between
                Tafra (S (S Z)) [Int, String] (Type of otherData)
        and
                Tafra (S (S (S n))) as (Expected type)

        Specifically:
                Type mismatch between
                        Z
                and
                        S n

Slicing can be done by rows or by columns, producing sub-tafrae:

*tafra> slice [FZ, (FS FZ)] someData
[[23.4, 45.6, 78.9], [True, False, True]] : Tafra (S (S (S Z))) [Double, Bool]

*tafra> rslice [FZ, (FS (FS FZ))] someData
[[23.4, 78.9],
 [True, True],
 [["yo", "dawg"], ["so i put structures in your structures"]]]
   : Tafra (S (S Z)) [Double, Bool, List String]

We have rhead (first row) and rtail (all-but-first-row) functions (this is the “functional programming” meaning for these names, not the “pandas” meaning), as well as rcons which inverts the rhead/rtail transformation.

*tafra> rhead otherData
[99, "some"] : HVect [Int, String]

*tafra> rtail otherData
[[45], ["text"]] : Tafra (S Z) [Int, String]

*tafra> rcons (rhead otherData) (rtail otherData)
[[99, 45], ["some", "text"]] : Tafra (S (S Z)) [Int, String]

We can convert to and from row-major (Vect of HVect) representation with transpose and untranspose, which are probably not ideally named (it is a handy REPL-only variable referring to the value of the last evaluated expression, like _ in the Python REPL):

*tafra> transpose otherData
[[99, "some"], [45, "text"]] : Vect (S (S Z)) (HVect [Int, String])

*tafra> untranspose it
[[99, 45], ["some", "text"]] : Tafra (S (S Z)) [Int, String]

Finally, just like pandas, we can concatenate tafrae either “vertically” (appending rows) or “horizontally” (appending columns), but ours are checked! For vertical concatenation we need identical column types; for horizontal concatenation we need identical row counts.

*tafra> someData ++ someData
[[23.4, 45.6, 78.9, 23.4, 45.6, 78.9],
 [True, False, True, True, False, True],
 [["yo", "dawg"],
  ["i heard you", "like structures"],
  ["so i put structures in your structures"],
  ["yo", "dawg"],
  ["i heard you", "like structures"],
  ["so i put structures in your structures"]]] : Tafra (S (S (S (S (S (S Z))))))
                                                       [Double, Bool, List String]

*tafra> hcat otherData otherData
[[99, 45], ["some", "text"], [99, 45], ["some", "text"]]
  : Tafra (S (S Z)) [Int, String, Int, String]

*tafra> someData ++ otherData
(input):1:1-29:When checking an application of function Main.Tafra.++:
        Type mismatch between
                Tafra (S (S Z)) [Int, String] (Type of otherData)
        and
                Tafra m [Double, Bool, List String] (Expected type)

        Specifically:
                Type mismatch between
                        Z
                and
                        S Z

*tafra> hcat someData otherData
(input):1:1-23:When checking an application of function Main.Tafra.hcat:
        Type mismatch between
                Tafra (S (S Z)) [Int, String] (Type of otherData)
        and
                Tafra (S (S (S Z))) bs (Expected type)

        Specifically:
                Type mismatch between
                        Z
                and
                        S Z

We’ve already achieved much of what we set out to do. We now have a statically checked data frame implementation which correctly models a heterogeneous collection of same-length homogeneous columns.

Typing Group-by

As a final goal, let’s answer the question posed previously: what is the correct (or at least a correct) type for a SQL-inspired group by operation?

Just like we did in Python, we’ll execute our “group by” by accumulating rows sharing a unique tuples of “group-by values” into an associative collection. We don’t have easy access to a hash table implementation, so we’ll use a list of (key, value) tuples as a purely functional implementation of associative arrays.

On the way to “group by”, we’ll define helper functions to collect tuples by unique value; rowsByUnique could be useful on its own but also forms a solid building block for groupBy.

-- instead of maps, we'll use "association lists" for associative lookup
-- the prelude helpfully provides
-- lookup : Eq a => a -> List (a, b) -> Maybe b
-- we'll define a helpful type synonym

Map : Type -> Type -> Type
Map k v = List (k, v)

-- and some additional helpers
insert : Eq k => k -> v -> Map k v -> Map k v
insert k v [] = [(k, v)]
insert k v (x@(k', _) :: xs) =
  if k == k'
     then (k, v) :: xs
     else x :: insert k v xs

-- we're almost ready to write "group by"
-- first, let's prove we can collect rows by unique combination of group-by
--   column values

-- an "accumulating" function we'll use to collect rows by unique combination
private
accumRows : Eq k =>
            Map k (List v) ->
            Vect n (k, v) ->
            Map k (List v)
accumRows map [] = map
accumRows map ((b, g) :: rest) = case lookup b map of
  Nothing => accumRows (insert b [g] map) rest
  -- for performance, we're going to build these "backward"
  (Just xs) => accumRows (insert b (g::xs) map) rest

-- this is also the only example I have ever seen of explicitly passing
--   a "locally named" interface implementation
rowsByUnique : (is : Vect groups (Fin cols)) -> -- indices of group-by columns
               (js : Vect aggs (Fin cols)) -> -- indices of aggregated-over columns
               {auto prf: Eq (HVect (vslice is as))} ->
               Tafra rows as ->
               Map (HVect (vslice is as)) (List (HVect (vslice js as)))
rowsByUnique {prf} is js xs =
  accumRows @{prf} [] (zip (transpose (slice is xs)) (transpose (slice js xs)))

-- ok, let's type group by
groupBy : {cols : Nat} ->
          {rows : Nat} ->
          {as : Vect cols Type} ->
          {groups : Nat} ->
          {aggs : Nat} ->
          -- indices of group-by columns
          (is : Vect groups (Fin cols)) ->
          -- indices of aggregated-over columns
          (js : Vect aggs (Fin cols)) ->
          -- aggregation function
          (aggFn : List (HVect (vslice js as)) -> HVect bs) ->
          -- a proof that group-by columns can be compared for equality
          --   (automatically found by compiler)
          {auto prf: Eq (HVect (vslice is as))} ->
          -- data frame to group-by
          Tafra rows as ->
          -- result type: an existentially quantified tafra of some
          --   row count, with columns from the group-by columns and
          --   the result of the aggregation function
          (k : Nat ** Tafra k (vslice is as ++ bs))
groupBy is js fn xs = applyAggregation fn (rowsByUnique is js xs) where
  applyAggregation : (List (HVect gs) -> HVect rs) ->
                     Map (HVect as) (List (HVect gs)) ->
                     (k : Nat ** Tafra k (as ++ rs))
  applyAggregation f [] = (Z ** empty)
  applyAggregation f ((k, rs) :: rest) with (applyAggregation f rest)
    applyAggregation f ((k, rs) :: _) | (n ** restAgg) =
      (S n ** rcons (k ++ f rs) restAgg)

To be fair, that was nearly as exhausting to write as it probably was to read. I take this as a sign that “group by” is a somewhat complicated operation, semantically.

There’s plenty of bookkeeping of indices and types (and of the constraint that we must be able to do equality comparison on our group-by column values), but there is one interesting new complication which requires us to use another “unusual” feature of our type system. Unlike the operations we implemented previously, there’s not a fixed relationship between the number of rows in a tafra and the number of rows resulting from groupBy.

The result’s row count is only known at runtime, after we consolidate unique combinations of the grouping columns. To handle this in our type system, we use an existential type in the form of a dependent pair. If polymorphic types correspond to “for all” (universal quantification), existential types correspond to “there exists” (existential quantification). Object-oriented programmers work with existential types every day in the form of interfaces: they promise that “there exists” a value of some concrete type fulfilling the interface without statically committing to a given implementation.

We use an existential to promise that “there exists” some count of rows in the result, and that the result has that many rows, and that we can produce both bundled together in a dependent pair (the (k : Nat ** Tafra k (vslice is as ++ bs)) type of our result is a dependent pair type).

It’s also noteworthy that types can contain function application! We use this to indicate the correspondence between the columns selected for grouping and aggregation, the aggregation function result type, and the column types of the resulting tafra.

Let’s see groupBy in action. We’ll write the rough equivalent of this SQL:

select
  col2,
  join('\n', str(col1) & join(' ', col3))
from someData
group by col2

In our system:

*tafra> groupBy
  [FS FZ]
  [FZ, FS (FS FZ)]
  ((::[]) . unlines . map (\[num, strs] => unwords (show num :: strs)))
  someData

(S (S Z) **
 [[True, False],
  ["78.9 so i put structures in your structures\n23.4 yo dawg\n",
   "45.6 i heard you like structures\n"]]) : (k : Nat ** Tafra k [Bool, String])

There we have it! We’ve typed groupBy, and it works.

Parting Thoughts

That was a lot of work, especially to build the more complex types involved in dynamic operations like “group by”. However, we’ve gained incredible power to statically check the validity of our data manipulation.

Why is this from “another universe”? Why don’t we see dependent types used for these kinds of systems in practice? I think there are a few reasons: some good, some not.

Functional programming is still unpopular, although less so than it used to be. The same goes for expressive static type systems. We’re seeing a growing trend of “mainstream” imperative languages adopting features (typeclasses or “traits”, algebraic types and pattern matching, etc.) from this world, but languages like Haskell are still pretty far from mainstream.

Data science had the poor luck to come of age during the long slow decline of “dynamic typing”, but before the “static typing renaissance” we’re currently living through. As a consequence, key tools like R or pandas are very much products of their times. At the time, given the mainstream or near-mainstream alternatives, the decision wasn’t a bad one. A REPL is very helpful for this sort of work, and they were hard to come by for statically-typed languages.

A deeper issue is performance. If you were playing close attention, you realized that every data structure we built, from Nat to Vect to Tafra, was an immutable linked list “under the hood”. That’s far from ideal for performance under the best of circumstances, and the kind of code we wrote to slice and manipulate frames was far from the “best-case” scenario for these data structures.

A few years back I worked on a dependent-typing related project for a client. I spent a lot of time researching prior work on optimizing compilers for these languages, to improve the performance of dependently-typed programs. Even with aggressive “erasure” (e.g. so all those Nat indices don’t have to exist at runtime), performance is at best comparable to a “traditional” functional language—which is to say that performance is poor. (One side effect of this research was MinPS, which successfully compiles a dependently-typed core language to fairly crappy Javascript.)

There are some reasons for optimism, though. For example, Sixten is an experimental functional language with both “unboxed” (read: fewer indirections—the difference between a C array and a Python list, or for that matter a Haskell list) data and dependent function types. There doesn’t appear to be support for the kind of dependent pair types we used in typing groupBy, but it’s a promising start!

Comparing Idris, say, to Python, Idris probably comes out ahead in performance when considering just the “core language”. However, Python’s data science ecosystem is mostly built on Python wrappers around fast compiled code (C, C++, or Fortran). There’s no reason we couldn’t build the same sort of “layered” system to provide a dependently-typed interface to a fast native data frame library. True, we’d lose some of the “correct by construction” benefits we got while writing our implementation, but with adequate testing we’d still provide the benefits of static checking to our users: that is, assuming we could get them to try and adopt a dependently-typed language.

Alas, that day is probably quite far off. Or perhaps dependent types are just a stepping stone to, or a detour from, some future approach which manages to combine good performance with static checking and the je ne sais quoi which attracts “average” programmers.

For now, we’ll just have to live with occasional glimpses through the looking glass into a world that might have been.


PS: you can find the entire Idris program in one file (Tafra.idr) here.

PPS: I threw in a very quick-and-dirty parsing module (TafraParser.idr), which will attempt to parse a tafra of a given type (and unknown row count) from (not-quite-fully-to-spec) CSV formatted text. The point here was not to write a production-grade CSV parser, but rather to answer a question we didn't get to address in the code above: how do we handle the fact that when we parse data from the outside world, we may know what column types to expect, but we have no guarantees about the data being "correct" nor about the number of rows. The answer, of course, is the combination of existential types (to quantify over the statically unknown row count) and option (Maybe) types to indicate the possibility of failure during parsing.

PPPS: I've updated a few variable names for clarity, based on some very helpful feedback.