> in particular, its historically limited support for structured data within fie...

jasonhansel · on Dec 22, 2022

The issue is that, to put things in 1NF, you need to fully normalize everything, which has a big performance penalty since every query now has to JOIN a large number of tables together.

Of course, an RDBMS could be designed to do that without a performance penalty, by storing data in a denormalized form and automatically translating queries for the normalized data accordingly.

But SQL doesn't have the features you'd need to control and manage that sort of transparent denormalization. So you'd end up having to extend SQL to support it properly so that the performance penalty in question could be mitigated in all cases.

edit: Rather than "you need to fully normalize everything," I should have said "you need to split all your data across multiple tables to eliminate the need for structured data within records." The performance penalty happens when you need to do this everywhere for sufficiently complex datasets.

goto11 · on Dec 23, 2022

I doubt querying JSON or XML or some other embedded structured format would be faster than querying normalized data. It might be true in some special cases, but certainly not in the general case of ad-hoc queries across nested data structures.

But I totally agree SQL could be improved to make normalization feel like less of a burden. It really highlights a problem when it feels like its more convenient to just dump a JSON array into a field rather than extract to a separate table.

dragonwriter · on Dec 22, 2022

> The issue is that, to put things in 1NF, you need to fully normalize everything

Not unless you redefine “fully normalize” you don’t.

int_19h · on Dec 22, 2022

The boundary between "simple" and "complex" data structures is largely arbitrary, though. It's not unreasonable to consider an integer as a complex data structure, consisting of bits, and in some contexts we do that - but in most, it's obviously more convenient to treat it as a single value. In the same vein, it should be possible to treat a tuple of several numbers (say, point coordinates) as a single value as well, in contexts where this makes sense.

goto11 · on Dec 23, 2022

In the context of the relational model, simple and composite refer to how they are treated by by the relational operators. You can't select individual bits of an integer (without special-purpose operators) which means an integer is a single value.

Presumable an integer is physically stored as a set of bits, but this is not exposed to the logical layer (for good reasons - e.g. whether the machine uses big endian or little endian should not affect the logical layer). If you actually wanted to operate on individual bits using relational operators, you would use a column for each individual bit.

Having XML or JSON fields is also totally fine according to the relational model as long as they are "black box"-values for the logical layer. But Codd observed that if you wanted to treat individual values as composite you end up with a much more complex query language. And indeed this have happened with XPath and JSON-queries and whatnot embedded in SQL. Presumably it should then be possible to have XML inside a JSON struct, and a table inside the XML. If this is even possible, it would be hideously complex. But normalized relations already allows this without any fuss.

int_19h · on Dec 23, 2022

This is an arbitrary distinction in the first place. You can absolutely have a relational database that only allows 0 and 1 as valid values and requires you to build integers up yourself using relational techniques. The relational model itself doesn't care what "values" are.

In practice, simple value tuples make things much more convenient, and the edge cases are minimal. You don't have to allow full-fledged composition like nested tables etc.

goto11 · on Dec 26, 2022

> The relational model itself doesn't care what "values" are.

Maybe I misunderstand what you are arguing, but the relational model is defined in terms of values, sets, and domains (data types), but of course the domains chosen for a particular database schema depends on the business requirements.

ttfkam · on Dec 23, 2022

Codd argued this a long time ago. Unfortunately Codd died twenty years ago, so we can't ask him his current thoughts on the matter. On the other hand, Chris Date dropped this rigid view of the relational model way back in the 1990s.

https://www.youtube.com/watch?v=swR33jIhW8Q

goto11 · on Dec 23, 2022

The question isn't who said what when, the question is what reasoning holds today. Codds argument was simply that if we allow tables embedded inside fields and we want to query across multiple "layers" of embedded tables, we get a much more complex query language and implementation for no additional benefit, since the same relationship can be represented with foreign keys. I haven't seen any reasonable counterargument against this.

If I understand Date correctly, he is just saying that individual values can be arbitrary complex as long as they are treated as "atomic" by the relational operators. I don't disagree, but reality is that very soon after you start storing stuff like XML or JSON in database fields, someone wants to query sub-structures, e.g. filter on individual properties in the JSON. And then you have a mess.