
SQL doesn't compose all that well.

For example, imagine that you have a complex query that handles a report. If someone says "hey we need the same report but with another filter on X," your options are to copy paste the SQL query with the change, create a view that can optionally have the filter (assuming the field that you'd want to filter on actually is still visible at the view level), or parse the SQL query into its tree form, mutate the tree, then turn it back into SQL.

If you're using something like dplyr, then it's just an if statement when building your pipeline. dbplyr will also generate SQL for you from dplyr statements; it's pretty amazing IMHO.
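A minimal sketch of that "if statement" approach, assuming `con` is a DBI connection and `reports` is a table in it (all names here are made up):

  library(dplyr)
  library(dbplyr)

  q <- tbl(con, "reports") %>%
    select(region, revenue)
  if (add_revenue_filter) {  # the "if statement" when building the pipeline
    q <- q %>% filter(revenue > 10000)
  }
  show_query(q)  # dbplyr renders the whole pipeline as a single SQL query
  collect(q)     # or actually run it against the database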



> SQL doesn't compose all that well.

On that topic, I really enjoy working in Elixir because Ecto [1] lets you write "SQL" with Elixir's composable functional syntax. It sits somewhere between "the language is compiled to SQL" and an ORM. The Ruby-esque syntax took some getting used to, but once I was past that hurdle my productivity skyrocketed. It doesn't have 100% coverage of every feature in all the different SQL dialects, but most of what you'll need is there.

[1] https://github.com/elixir-ecto/ecto


In a similar vein, I really like the LINQ style queries in the Julia package, Query.jl[1].

[1] https://www.queryverse.org/Query.jl/stable/linqquerycommands...


> your options are to copy paste the SQL query with the change

There are things like dbt and LookML that bring templating and macros on top of SQL. It's much better than copy/pasting strings, but worse than a full-fledged programming language.


I agree that dbplyr is a nice way to query databases if you're already familiar with dplyr (actually I think dtplyr is more interesting, for operating on data.tables). However, I'm not sure I really understand your point about the "if" statement.

If the data is already in a dataframe, why not still use the "if" statement, but with one of the packages I mentioned earlier to further modify the data?

E.g.

  if (x == 10) {
    result <- duckdf("SELECT * FROM df WHERE y >= 20")
  }

I guess I was thinking about one query language to rule them all. If SQL were used everywhere, you'd only have to learn one paradigm, and could largely use it across platforms and languages.


I read down the thread as it stands now, and the basic problem is this. Write the following function:

    Compose("SELECT a, b.revenue, b.name 
        FROM table 
        INNER JOIN  b ON whatever 
        WHERE name LIKE 'b%'",
    "revenue > 10000")
to yield a new valid SQL statement that composes these two fragments in the way that is obviously intended. You basically can't. If you can in this small example, I can easily complexify the SQL until you can't anymore. Note how "revenue" in my supplemental WHERE clause may or may not be that "b.revenue": I may be asking for some other revenue entirely, or that fragment may have come from somewhere that has no way to know about the "b" part of the name. Namespacing like this is one of the major issues, though it is far from the only one!

It is almost certainly impossible to sanely implement this function literally in terms of strings. (If such a thing was possible, it would almost certainly have trash performance.) You need a richer set of data types and operations to permit this.

It has been done. You can hold on to a symbolic representation of the SQL in some internal library representation. However, "binding" to SQL in this manner requires binding to every single feature of the SQL you want to use at a pretty deep level (more than just strings; you need to understand the full syntax tree), and speaking from experience, no matter how cleverly you start writing it, it gets more complicated than you think. It is certainly a thing that has been done, but it is a huuuuuuge project. Seriously. It would be a lot easier if we were using an underlying representation designed for this sort of thing from the beginning.
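For what it's worth, dbplyr is one existence proof of this approach. A hedged sketch of the Compose() example from above, built on its symbolic representation instead of strings (assuming `con` is a connection and the two tables share a `whatever` key; all names hypothetical):

  base <- tbl(con, "table") %>%
    inner_join(tbl(con, "b"), by = "whatever") %>%
    filter(substr(name, 1, 1) == "b") %>%  # stands in for LIKE 'b%'; translated to SQL
    select(a, revenue, name)

  # composing the extra predicate is now unambiguous: "revenue" means the
  # column exposed by the previous step, not some other revenue
  composed <- base %>% filter(revenue > 10000)
  show_query(composed)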

I like what you can do with SQL, but I kinda hate the way it was juuust good enough to get ensconced and apparently permanently ensure that no improvement on it can ever get off the ground because no conceivable improvement can overcome the entrenched advantages SQL has.


SQL is a good language, but a terrible API.


I think the point from GP (which aligns with my thinking) is that more programmatic interfaces for querying data allow for control flow in constructing the query, not just in processing it. So you can conditionally add a predicate or whatever, and then run the whole query at once.

Whereas with the SQL packages you mentioned, you either have to conditionally construct your query string (ew) or pull results back and express your conditional as operations on the resulting data frame.

For pure dplyr-on-in-memory-data-frame there isn't much difference between the two, to be sure. For dplyr on dbplyr, or sparklyr, or for SparkR, or PySpark, or Dask, or pick-your-favorite-framework, the difference between operating on (effectively) query AST vs. result sets is huge.
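Concretely (a sketch, assuming `con` and a `users` table), the difference is where the predicate runs:

  lazy <- tbl(con, "users")                          # a query AST; nothing is fetched yet
  lazy %>% filter(friend_count >= 5) %>% collect()   # WHERE runs in the database

  eager <- tbl(con, "users") %>% collect()           # pulls the whole table into memory
  eager %>% filter(friend_count >= 5)                # filter runs in R, after the transfer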


Good points. That makes the larger than memory use case quite clear, and interesting.


It's for more complex queries, e.g. active users per country:

  WITH active_users AS
    (SELECT DISTINCT user_id, user_country FROM ...
     WHERE last_login >= NOW() - INTERVAL '1 month')
  SELECT user_country, COUNT(user_id) AS user_count
    FROM active_users GROUP BY user_country
    ORDER BY user_count DESC
Now imagine someone says "what about users that have at least 5 friends?" If you're using dplyr and want to reuse most of your logic, it's just a matter of doing something like

  active_users_with_friends = active_users %>% filter(friend_count >= 5)
The SQL version is much hairier, since it's just code that's within a string.
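For comparison, here's a sketch of that full pipeline in dbplyr, assuming everything lives in a single `users` table, `con` is a DBI connection, and friend_count is constant per user (so it can ride through the distinct and stay in scope for the later filter):

  cutoff <- Sys.Date() - 30
  active_users <- tbl(con, "users") %>%
    filter(last_login >= cutoff) %>%
    distinct(user_id, user_country, friend_count)

  per_country <- function(users) {
    users %>%
      count(user_country, name = "user_count") %>%
      arrange(desc(user_count))
  }

  per_country(active_users)                                # the original report
  per_country(active_users %>% filter(friend_count >= 5))  # the amended one, still one SQL query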


I think we're talking at cross purposes, and I don't want to belabour the point too much, but there's no need to modify the original SQL query (I'm imagining this pulls data from a database). If the data is already in a dataframe, use one of the libraries which can apply SQL to in-memory dataframes.

e.g.

  active_users_with_friends <- duckdf("SELECT * FROM active_users WHERE friend_count >= 5")


There's an aggregation part that comes after the CTE, at which point the "friend_count" or "last_login" fields are not available anymore. "active_users" isn't a table, it's the result of the CTE in the query that returns distinct users that have a last_login in the last 30 days.

Also, keep in mind that this is a pretty simple example, a more realistic one would probably have a half dozen to a dozen CTEs, some nested, at which point correctly mutating the SQL statement is not trivial.


> "active_users" isn't a table, it's the result of the CTE in the query

Maybe we really are at cross purposes. In the duckdf example above "active_users" is a dataframe. The duckdf library applies SQL queries on dataframes. It's not (necessarily) querying an on-disk database.

If you're querying data using dplyr then it's highly likely that data is already in a dataframe. By the same principle, write the original SQL query with a wildcard SELECT so it returns everything. Then use a relevant library to apply SQL to the dataframes, and it's pretty much the same result as dplyr.

That's not to say complex SQL queries shouldn't be used to only return the minimally required information from the database. Clearly, sometimes there's too much data to fit in memory. My point is that if you're using dplyr on dataframes (or tibbles), then you could just as easily use an SQL based query on the same dataframe.
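For instance, with the duckdb package directly (a sketch; `active_users_df` is the hypothetical in-memory data frame):

  library(DBI)
  library(duckdb)

  con <- dbConnect(duckdb())
  duckdb_register(con, "active_users", active_users_df)  # expose the data frame as a view
  dbGetQuery(con, "SELECT * FROM active_users WHERE friend_count >= 5")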


> Then your dplyr example above doesn't work, because active_users doesn't exist.

You can assume that it would contain the same data that would be in the CTE in the SQL query above. There's no such table in the database, it's a subquery.

If your argument is that chaining multiple SQL queries on data frames is needed, then we're both saying the same thing. SQL doesn't compose well, hence the need for multiple queries. At that point it's not just SQL though, it's querying of dataframes using SQL and composing the dataframes together using non-SQL code.

Luckily, dplyr will actually compose properly, and you can get the actual SQL statement that is equivalent to the dplyr expression, something that would be rather annoying to produce by manipulating strings.
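That is (a sketch, assuming active_users is a lazy dbplyr table as above):

  active_users %>%
    filter(friend_count >= 5) %>%
    show_query()  # prints the single equivalent SQL statement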



