All posts

Last time, we used a minimalist parser combinator library to build a parser for an oddly familiar language called OBAN. The problem with our previous parser is that it produces extremely unhelpful error messages. This is probably fine for a parser which runs as part of an automated toolchain and processes almost-always-valid input, but is completely unacceptable for a user-facing tool.

We’ll address this, while making only minimal changes to the parser’s structure, by tweaking the “base monad” on which it’s built. In other words, we’ll change what it means to chain parsers together.

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
–Jamie Zawinski

The fundamental problem with regular expressions is that they only recognize regular languages. This sounds, and is, tautological, but it has huge implications.
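To make the limitation concrete, here is a minimal sketch in Python. It is not taken from the post itself, and the names `DEPTH_TWO` and `is_balanced` are made up for illustration. A regular expression in the strict, formal sense can only match balanced parentheses up to some fixed nesting depth, while a few lines of ordinary code with a counter handle any depth.

```python
import re

# Illustrative pattern: balanced parentheses nested at most two deep.
DEPTH_TWO = re.compile(r"(?:\((?:\(\))*\))*")


def is_balanced(text: str) -> bool:
    """Check parentheses at any nesting depth with a simple counter."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a close with no matching open
                return False
    return depth == 0


shallow = "(()())"  # nested two deep
deep = "((()))"     # nested three deep

regex_ok_shallow = DEPTH_TWO.fullmatch(shallow) is not None
regex_ok_deep = DEPTH_TWO.fullmatch(deep) is not None
```

The pattern accepts `shallow` but rejects `deep`, even though both are balanced. Supporting depth three means writing a longer pattern, and so on for every extra level, whereas the counter version needs no changes.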

A Dialogue

2020-06-03

SOCRATES: What is “machine learning”?

It’s a nightmare scenario: trapped behind enemy lines with no hope of rescue. For unfathomable reasons of bureaucracy, your access to Turing-complete tools of the trade has been denied. Will you fight? Or will you perish like a dog (in a mire of spreadsheets)?

Ok, that’s a little dramatic. But it’s a common-enough dilemma: due to ill-considered IT policies, you don’t have access to the tools you need to efficiently automate tedious tasks: compilers, interpreters, debuggers. You can accept defeat and turn to memorizing keyboard shortcuts… or you can join the League of Shadows.

It’s important, as a rule of thumb, when operating or investing in a firm which produces a physical commodity, to have the ability to reliably quantify the expected future production of the commodity given the firm’s assets. In the oil and gas industry, we have many different ways to forecast future production from an oil or gas well. Some rely on detailed measurements and explicitly incorporate detailed mathematical models of flow physics. Others use whatever historical data we can scrape together and a bit of curve-fitting.

Today, we’re talking about the second kind.

A couple weekends ago, I found myself with the desire to fetch oil and gas production data for a specific county in New Mexico from the New Mexico Oil Conservation Division (OCD).

Fortunately, the OCD provides access to historical well production via an FTP server. The OCD doesn’t seem to provide a way to query a limited time- or area-based subset of production history data, so we’re stuck with a single ZIP file for “all of New Mexico since the dawn of time”. The result is a whopping 712 MB ZIP file.

Here’s where I knew I was in trouble: the only thing inside was a single 38 GB file called wcproduction.xml.

If you wish to make an apple pie from scratch, you must first create the universe.
–Carl Sagan

A few weeks ago, while I was chatting with some friends who also occupy the tiny intersection between engineers and programmers, the topic of “group by” came up in the context of in-memory data management with Python.

I’m known as a pandas hater and have managed to sway a few others to my view, so we were talking about how to translate the logic of SQL’s (or pandas’) group by into the Python “type system”.
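As a rough sketch of what a “group by” might look like in plain, typed Python (one possible shape, not necessarily the one the post arrives at), the helper below collects items into a dictionary of lists keyed by whatever a `key` function returns. The name `group_by` and the sample rows are invented for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def group_by(items: Iterable[V], key: Callable[[V], K]) -> Dict[K, List[V]]:
    """Collect items into lists keyed by key(item), preserving input order."""
    groups: Dict[K, List[V]] = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)


# Made-up rows: (well name, month, oil volume)
rows = [
    ("A-1", "2020-01", 100.0),
    ("B-2", "2020-01", 50.0),
    ("A-1", "2020-02", 90.0),
]

# Group by well name, then aggregate each group separately.
by_well = group_by(rows, key=lambda row: row[0])
total_oil = {well: sum(r[2] for r in group) for well, group in by_well.items()}
```

Separating the grouping step from the aggregation step keeps each piece an ordinary function over ordinary data structures, so the aggregation can be any expression over a list.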

Many clients, friends, acquaintances, and (in the Before Times) strangers in bars have asked me over the years: what do I think about “business intelligence” (BI) tools? These are applications which make it easy, without any custom programming required, to connect to data sources, visualize data, and create interactive analytics and “dashboards”.

What is object-oriented programming really about? What’s so special about “late binding”? And why do I have to pass self around everywhere in Python? We’ll take a meandering path in today’s post which will try to answer each of these questions, and build our own miniature object system along the way.

Oh, R. I can’t tell you why “data scientists” have switched, for the same reason I can’t tell you how Santa’s reindeer achieve lift: I simply don’t believe in them. But I can tell you why I no longer use R for new projects.

Today’s Paper of the Day continues this week’s theme of “education”; this time, with a research mathematician turned educator’s thoughts on why K-12 math education ends up leaving so many graduates with negative feelings toward math.

I like this paper a lot, because it resonates so strongly with my own experience of middle- and high-school math. I had several great teachers, but somehow it all seemed like an exercise in memorization, at best, or obscurantist puzzle-solving, at worst: a sequence of parlor tricks; when you see this pattern, do this transformation (it’s called “integration by parts”, but you will not be required or encouraged to know where it came from or why it got the name).

We continue our celebration (or perhaps examination) of Computer Science Education Week with today’s POTD.

Many computer science programs include a course on “programming languages”; these take a lot of different forms, from a broad survey of various popular languages to in-depth studies of semantics and interpreter implementation. In today’s POTD, Shivers argues for the importance of studying programming languages not just to computer science students, but to anyone who wants to understand the modern world.

Today’s POTD continues our theme of Computer Science Education Week. Yesterday we saw a paper (and retraction), and meditated on the topic of why it’s so hard to teach programming.

What if part of the reason is that the way computer science topics are typically presented just isn’t that interesting to students?

This week is Computer Science Education Week! It’s an initiative I watch each year with a mix of admiration and dread; I’m dead convinced that our society needs to expose more students to computer science, not fewer; that programming is getting more accessible and relevant, not less; that the only way to build diversity of experience and perspective downstream in industry and academia is to construct education programs that appeal to diverse populations upstream.

That said, I’m not sure that Barack Obama or a movie star spending an hour writing a JavaScript “hello world” program and getting an “I learned code” sticker does much to move that needle. There’s a fundamental tension: we want computer science to be accessible, but it is fundamentally hard: hard like algebra is hard. Everyone takes algebra, and as a society we see the value in teaching every student the power of abstracting from concrete arithmetic to symbolic manipulation. But we don’t, generally, construct flashy multimedia initiatives to get kids to write “y = x + 2” on a sticky note and pretend that’s all there is to it.

Today’s POTD is a well-known classic in the genre of “social sciences research that confirms what we’ve always suspected: people are terrible”.

The Paper of the Day for today comes to us from the distant past: a time before venture capital firms would seemingly hurl trebuchet-loads of money at anyone with a “.ai” domain name and the ability to spell “deep learning”. (I’ve been working on this long enough that my folder of papers on machine learning and data science is labeled “AI”.)

Today’s POTD presents a translation of a great tool from the functional programming community into the more mundane world of 90s-style object-oriented programming languages.

A recent discussion on the SPE discussion board inspired me to jot down some thoughts on how a young (or experienced!) petroleum engineer—or for that matter any other technically skilled non-programmer—might best engage with “learning to code”.