
I think this misses the point of DRY a little bit. DRY isn't about not copy pasting code, it's about ensuring that knowledge isn't repeated. If two parts of the system need to know the same thing (for example, who the currently logged in user is, or what elasticsearch instance to send queries to, etc.), then there should be a single way to "know" that fact. Put that way, DRY violations are repetitions of knowledge and make the system more complex because different parts know the same fact but in different ways and you need to maintain all of them, understand all of them, etc. etc.

Code blocks that look to be syntactically the same are the lowest expression of "this might be the same piece of knowledge" insofar as they express knowledge about "how to do X", but the key is identifying the knowledge that is duplicated and working from there. Sometimes it comes out that the "duplication" is something like "this is a for loop iterating over the elements of this list in this field in this object" and that is the kind of code block that contains very little knowledge in terms of our system. But supposing that that list had a special structure (ie, maybe we've parsed text into tokens and have information about whitespace, punctuation, etc in that list) and we start to notice we're repeating code to iterate over elements of the list and ignore the whitespace, punctuation elements in it, then we've got a piece of knowledge worth DRYing out given that all the clients now need to know what whitespace & punctuation look like even when they'd like to filter them out.
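As a rough sketch of what consolidating that token knowledge could look like (the `Token` type, the kind names, and `content_tokens` are all made up for illustration, not taken from any real tokenizer):

```python
from dataclasses import dataclass
from typing import Iterator, List

# Hypothetical token kinds; a real tokenizer would define its own.
WORD = "word"
WHITESPACE = "whitespace"
PUNCTUATION = "punctuation"


@dataclass(frozen=True)
class Token:
    text: str
    kind: str


def content_tokens(tokens: List[Token]) -> Iterator[Token]:
    """The one place that knows which token kinds count as noise."""
    for token in tokens:
        if token.kind not in (WHITESPACE, PUNCTUATION):
            yield token


tokens = [
    Token("Hello", WORD),
    Token(",", PUNCTUATION),
    Token(" ", WHITESPACE),
    Token("world", WORD),
]
content_texts = [t.text for t in content_tokens(tokens)]
```

Clients iterate over `content_tokens(...)` and never need to learn what whitespace or punctuation tokens look like.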

It's worth pointing out that DRYing out something isn't necessarily "abstracting", it is more like consolidating knowledge into one place.



> ensuring that knowledge isn't repeated

The most fun bug I've encountered as a web developer is of this category. Two pages both check for a logged-in user and redirect to the other if found or not found, respectively. The bug was a subtle difference in how these were calculated, the details of which are unfortunately lost to the sands of time. The end result was that if you sat on one of the pages and waited for your user session to time out, you'd get stuck in a redirect loop between the "logged in" and "please log in" versions of the page.

Anyhow, the point of this is that when you calculate the same fact two different ways, you will occasionally build something that makes an unwarranted assumption that because it's the "same fact" you wind up with the same answer. This is an entire category of easily missed and often subtle bugs.
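I can't reconstruct the original code, but a minimal sketch of the fix would be something like the following, where both pages ask the same function. The `Session` shape, the page functions, and the paths are all hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    user_id: Optional[int]
    expires_at: float  # seconds since the epoch


def is_logged_in(session: Optional[Session], now: float) -> bool:
    """Single definition of "logged in" shared by every page."""
    return (
        session is not None
        and session.user_id is not None
        and now < session.expires_at
    )


def members_page(session: Optional[Session], now: float) -> str:
    return "show members page" if is_logged_in(session, now) else "redirect to /login"


def login_page(session: Optional[Session], now: float) -> str:
    return "redirect to /members" if is_logged_in(session, now) else "show login form"
```

Because both pages derive their answer from one calculation, an expired session can't look "logged in" to one page and "logged out" to the other.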


And in both cases, it was a sign of mismanaged design. I have encountered THAT EXACT bug, and the reason we supported both was because both were released and users began to expect both pages for different reasons. What we needed was a designer to sit down and say, hey, this design seems replicated, how do we mitigate this? This version of DRY becomes a business and resource problem, above the developer, and unfortunately, this means that you often do not have the resources to adequately deal with it.


I haven't seen this "don't repeat knowledge" take before, it's pretty interesting. I see why you don't want various mutated versions of the same information all over the place, but you still have dangers.

Especially if you "overly reduce" your knowledge. If your common recipe is "do A, B, C, D, E" and you reduce that to just "do X," for instance.

I've seen this often turn into "now, instead of the knowledge being repeated in several places, it's hidden in one place and only one person knows it." Everybody else just relies on the library doing its magic, and when someone needs to do something differently, they have this huge mountain to climb to figure out how to modify the code to also do "J" for certain cases without breaking everyone else.


As someone who deals with 15 million lines of code (and many readers of this have bigger systems) I need to trust that do X does X without me having to know how. When I have to learn it, it slows me down from the part of the code I need to know well. If do J is needed, that needs to be someone else's problem who knows the rest of do X. Unless do X is my responsibility of course. But nobody has responsibility for more than a small fraction of the code.


This is a great point often forgotten in this kind of discussion.

Size matters, and depending on the system size we’re dealing with it will have a significant impact on what approach we take. Or how we handle documentation for instance.


There is definitely a spectrum of "knowledge" at play when it comes to these considerations. The most obvious DRY violations are those kinds of things that you go "oh I need to test for this case" because that is usually an indication of some knowledge you need to know when interacting with a piece of code. EG, if you ever use -1 as a sentinel value then the knowledge of "what -1 means" should be consolidated together, otherwise all clients will have to know that -1 is a sentinel, what it means and at best you'll have duplicate code, at worst those interpretations won't align and you might have a subtle bug where that -1 is doing something somewhere (ie it is supposed to mean "No information provided" but somewhere something is keeping an arithmetic mean of this field and those -1s are now screwing up your metrics and you don't really notice).
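To make the -1 case concrete, here's a small sketch. The names `NO_INFO`, `parse_reading` and `mean_reading` are invented for the example, and I'm assuming the raw values arrive as plain ints:

```python
from typing import List, Optional

NO_INFO = -1  # hypothetical sentinel used by a legacy data source


def parse_reading(raw: int) -> Optional[int]:
    """The only place that knows -1 means "no information provided"."""
    return None if raw == NO_INFO else raw


def mean_reading(raw_values: List[int]) -> Optional[float]:
    """Average the known readings; None if nothing was provided."""
    known = [v for v in map(parse_reading, raw_values) if v is not None]
    return sum(known) / len(known) if known else None
```

Anything computing a metric goes through `parse_reading`, so the sentinel never leaks into the arithmetic.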

When we think about the knowledge of "how to do something" that's where things can get confusing. 9/10 times I'd say that the right move is to look for common assumptions or facts. IE it isn't just "doing something" that is important, but the assumptions made in the process of doing it:

As an example, consider finding the average word length in some piece of text. We might start writing that feature like:

  def count_words(text: str) -> int:
      # First statement of "what a word is": split on single spaces.
      return len(text.split(' '))

  def average_word_length(text: str) -> float:
      num_words = count_words(text)
      word_lengths = []
      # Second statement of the same fact, repeated inline.
      for word in text.split(' '):
          word_lengths.append(len(word))
      return sum(word_lengths) / num_words


then the piece of knowledge they share is "what a word is" and the DRY refactoring would pull out that piece of knowledge into its own function

  from typing import List

  def words(text: str) -> List[str]:
      return text.split(' ')

that might be code you write when starting to write a feature and that's the kind of "ding ding ding there's common knowledge here" that should guide refactoring. The system has a concept of a "word" that we've introduced and it's important that knowledge about "what a word is" lives in one place. For DRY things it frequently doesn't make any sense for there to be multiple statements of "what a word is" where the system wants to use the same concept.
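Once "what a word is" has its own function, the two original functions might end up looking roughly like this (a sketch, still assuming a word is nothing more than a space-separated chunk):

```python
from typing import List


def words(text: str) -> List[str]:
    """Single definition of "what a word is" (here: space-separated chunks)."""
    return text.split(' ')


def count_words(text: str) -> int:
    return len(words(text))


def average_word_length(text: str) -> float:
    all_words = words(text)
    return sum(len(word) for word in all_words) / len(all_words)
```

If the definition of a word later changes (say, to handle tabs or punctuation), only `words` has to change and both callers stay consistent.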

Kind of orthogonal to this is abstraction where the focus is on "usefulness" and that is where 100% you can abstract incorrectly, prematurely, get screwed over by requirement changes, write a library that hides everything and makes people angry. The example you provide seems more like an error in abstraction where things that should be close together are too far apart in the system (ie, some "fact" is hidden away and another part of the system wants to know it), but the consolidation and DRYing of those facts, I'd argue, is a lot easier once we've figured out how to identify them


Yeah, I like this approach, because the "what is a word" knowledge is a nice piece of common functionality that doesn't make sense to repeat. It's unlikely to change for just one of those two functions.

In my example, it's less a "core piece of knowledge" that people are trying to DRY, and more just a "common sequence." Someone sees a bunch of different places where we have a sequence of calls like A, B, C, D.. and says "oh this is a shared method I can extract" even if there's plenty of ways that in the future you might want to do A, B, C, E without D. And so then you pass in a bool, then another one, and you have a centralized mess...
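A toy version of what I mean, with made-up steps that only record their names (none of this is real code from that system):

```python
from typing import Callable, List

Step = Callable[[List[str]], None]


def make_step(name: str) -> Step:
    """Build a hypothetical step that just records its name."""
    def step(log: List[str]) -> None:
        log.append(name)
    return step


step_a, step_b, step_c, step_d, step_e = (make_step(n) for n in "ABCDE")


# The "centralized mess": one shared method that grows a flag per variation.
def run_sequence(log: List[str], skip_d: bool = False, add_e: bool = False) -> None:
    step_a(log)
    step_b(log)
    step_c(log)
    if not skip_d:
        step_d(log)
    if add_e:
        step_e(log)


# The alternative: the caller that differs simply spells out its own sequence.
def run_abce(log: List[str]) -> None:
    for step in (step_a, step_b, step_c, step_e):
        step(log)
```

Both routes produce A, B, C, E, but only the first one forces every other caller to live with the extra flags.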


I think the distinction is this: if it would constitute a bug for those two pieces of code to have a different idea of what a word is, then you definitely need both of them to use the same 'how to find words' logic. But if it doesn't really matter if two different pieces of code are using the same exact way to do something, then that's likely 'coincidental' replication. If you need to do word splitting, and someone else has written a word splitter, by all means copy paste their code to get you started, but definitely don't assume the best plan is to pull their code in as a dependency.


Yeah, I saw this approach in a book called "The Pragmatic Programmer", which I highly recommend. Agreed 100%


These things need to be balanced. I live in an ecosystem of DRY gone amok and it's not pleasant.

There's a standard library to connect to databases. There's a huge hierarchy setup just to start an app running.

All of these super dry infrastructure changes have, unfortunately, come with a huge cost. We are still stuck on ubuntu 14.04 because our super dry puppet framework we invented can't be ported to puppet 6.

We are stuck talking to MS-SQL, because our super dry database connection management library can't handle establishing other database interactions.

We are still stuck on Tomcat 7 because our super dry Jersey libraries don't work with newer versions of Jersey (which has locked us into older versions of tomcat!).

Consolidation is a decent goal, but it really needs to be measured. For me, it is FAR more important to consolidate on the how to do things and not the what does things. In other words, rather than making an "elasticsearch connection library" specify "This environment variable is the elasticsearch host/credentials" and let the apps move from there.

That's because, when it comes right down to it, configuration code is super easy to write and it really doesn't matter if it's duplicated. You want your libraries consolidating knowledge to be for things that are easy to get wrong (such as checking who is currently logged in or how to authenticate).
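Something like this is all each app needs. The variable name `ELASTICSEARCH_HOST` is just an example of such a convention, not something we actually standardized on:

```python
import os
from typing import Mapping

# Hypothetical variable name; the shared part is the convention, not the code.
ES_HOST_VAR = "ELASTICSEARCH_HOST"


def elasticsearch_host(env: Mapping[str, str] = os.environ) -> str:
    """Read the host from the agreed-upon environment variable."""
    try:
        return env[ES_HOST_VAR]
    except KeyError:
        raise RuntimeError(f"{ES_HOST_VAR} is not set") from None
```

It's a handful of lines, so duplicating it per app costs very little, and no app is tied to a shared connection library's upgrade schedule.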


> Consolidation is a decent goal, but it really needs to be measured. For me, it is FAR more important to consolidate on the how to do things and not the what does things. In other words, rather than making an "elasticsearch connection library" specify "This environment variable is the elasticsearch host/credentials" and let the apps move from there.

I think we're in agreement here. Config is the most basic kind of knowledge because when something wants to know about the elastic credentials, it almost never makes sense to have it in two places if those two places are supposed to be the same thing.

How to actually connect to elastic -- that's the part that is more iffy. If there is some knowledge we've added there, then it makes sense to DRY it up, but the knowledge of "this is how you pass credentials to this elastic search client" isn't the kind of system knowledge we care about. If, for example, there were some kind of parameters that we had to set on each connection and we claimed it as a piece of knowledge that all of our connections to this service are of this specific TYPE and have these specific parameters, then we've started to add some additional systemic knowledge that might need to get consolidated. If someone were to start working on a piece of code and I feel the need to tell them "Don't forget about X" then that is the kind of situation where DRY comes into play. If it's just a vanilla connection to a database and we don't care about the connections made, then I don't think we have a violation of DRY given that there isn't an important piece of knowledge that's repeated.
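For the "all of our connections have these specific parameters" case, a sketch of the consolidation might look like this. The settings type and the timeout and retry values are invented for the example and don't correspond to any particular client library:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchConnectionSettings:
    host: str
    timeout_seconds: float
    max_retries: int


def search_connection_settings(host: str) -> SearchConnectionSettings:
    """The "don't forget about X" knowledge: every connection gets these values."""
    # Hypothetical values; the point is that they are stated exactly once.
    return SearchConnectionSettings(host=host, timeout_seconds=5.0, max_retries=3)
```

Whoever builds the actual client starts from these settings instead of remembering them.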

At some point, especially when we pay too much attention to copy-pasted code, we end up abstracting. Abstracting is hard, more general, very difficult to do right, almost always done too early. DRYing out knowledge is easier and almost always improves things.


This is a good interpretation. Similar to a "Single Source of Truth" [1].

[1] https://en.m.wikipedia.org/wiki/Single_source_of_truth


IMHO it is not the author who misses the point of DRY, but countless developers who make code less readable only to reduce visible repetition or to avoid copy-n-paste. Maybe DRY is just a bad name.


Yeah, I'd agree. When the principle was introduced it was stated as:

> The DRY principle is stated as "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system"

(from wikipedia)

It feels like the name really took over the intention and it became about code repetition instead of knowledge repetition.


I agree that the name took over. The intention sounds synonymous with bounded contexts of DDD.

I find the vocabulary of DDD to have more explanatory power. Especially with people who don’t grok the difference between removing repetition and consolidating models.

I think repetition is a symptom that a code base may be afflicted with interwoven domains, but the existence of repetition is not sufficient for the diagnosis, IMO.


The only problem is that's hard to speak. Also, what's DDD?


It’s “Domain Driven Design”: https://en.m.wikipedia.org/wiki/Domain-driven_design.

Bounded Contexts is an idea that helps you draw the boundaries between domains. It asks you to be disciplined in your abstractions, and in return it allows you to feel comfortable changing implementations within a domain without fear of cascading second order effects to other domains.

For example, your service/library for managing customers shouldn’t return data about the books they’ve purchased. That comes from the order context, which composes the customer and book contexts.

If your boundaries are well defined, you can change the order process without fear of the book and customer models, and vice versa.

It marries well with service oriented architecture, because you can use the network to help enforce a boundary. You still need some skill to enforce the correct boundary, of course.


One Source of Truth or OSoT doesn't sound as nice as DRY though.


+1000 on that point.

Yes, I've dealt with systems that had bad abstractions. And I've also dealt with systems where knowledge of highly nameable things - like how to authenticate a user, or how to connect to a database, how to obtain a token to the same API server - wasn't centralized.

Systems of the first kind are certainly bad. It takes a lot of time to understand before you can get ahead and start refactoring. If your organization had low code review discipline at any point, abstractions often become hard to refactor with time, since some developers don't understand the abstractions, and instead of fixing them, just work around them with thread locals or lots of branches.

But systems of the second kind are much worse. Here what happens is that duplicated knowledge invariably diverges with time. It can be developers fixing a bug in one place and forgetting the other, or adding a certain feature in one place and a different one in the other. Over time, each implementation of the knowledge has its own unique behavior and bugs, and some parts of the sprawling code base grow to depend on a certain behavior. Or perhaps your code doesn't, but you have other services in other parts of the company consuming your API that do and you just have no idea if they rely upon the implementation difference or not.


If you write it once, you eliminate the chance of a small fix not propagating properly. This is particularly common when handling files and network connections, as those tend to develop edge cases over time.

DRY reduces the number of potential loose ends when you update your code.


Yeah, one important drying question is "if this fact/assumption/value changes, how many places in the code will have to change?".

If it's > 1, you have a moisture problem!


Not all code is knowledge, in this sense. And sometimes repeating knowledge is better, on balance, than unifying it somewhere, when you consider the added costs of coupling, of reification, and of abstraction liability.



