Data comprehension syntaxes

The order of clauses in SQL is as follows:

SELECT expression
FROM data_source
WHERE condition

Python features a similar syntax in its list comprehensions. (In this article I don’t discuss syntax like Haskell’s list comprehensions or Scala’s for-comprehensions.)

expression
for item in data_source
if condition

You can see that these two are roughly parallel. The order of clauses is: the expression, the data source, and then the filtering condition.

But the LINQ syntax in C# orders these clauses differently:

from data_source
where condition
select expression

Why is it different from the established standard? It turns out it’s a deliberate design decision by the C# team.

It’s a better mental model

One reason I’ve heard given by C# team members is that the SQL syntax doesn’t make very much sense. It might be intuitive from a natural-language standpoint, but it fails to accurately model how we can think about the data processing.

When doing the database scan, of course we have to first identify where we are getting data from. Then we have to scan each row and determine if it meets the condition, and then we add some expression in terms of the row to the set of results that we are going to return.

Not exactly

In practice this isn’t quite accurate for SQL. We can use the result of our expression in our filtering condition:

SELECT foo + 1 AS bar
FROM my_table
WHERE bar > 10 -- This uses the expression `bar` we computed.

But we can’t do this in Python:

foo + 1  # Can't name this expression for use in the filter...
for foo in my_table
if foo + 1 > 10  # ...so we have to recompute it.

It’s arguably intuitive to go from more general (the entire aggregate data source) to more specific (the exact transformations on the specific data we want from the data source). This is reflected in how one would write out these loops by hand:

for foo in data_source:
    if foo + 1 > 10:
        yield foo

It’s better for tooling

One problem with the SQL syntax is that when you’ve written this:

SELECT f

you have a hard time of autocompleting the word starting with “f”. It could be a column name from any table in the database!

C# doesn’t have this problem because you only write the filter and expression after you’ve written the data source:

from data_source
where f // easily autocompleted

from data_source
where foo + 1 > 10
select f // likewise

The usefulness of autocomplete in exploratory programming shouldn’t be understated; this alone is a compelling reason to use a different order for these clauses.

The following are hand-curated posts which you might find interesting.

Want to see more of my posts? Follow me on Twitter or subscribe via RSS.

Comments