
Of course, it is not generalizable! In my experience though, most minifiers do only the following:

- Whitespace removal, which is trivially invertible.

- Comment removal, which we never expect to recover via unminification.

- Renaming to shorter names, which is tedious to track but still mechanical. And most minifiers have little understanding of underlying types anyway, so they are usually very conservative and rarely reuse the same mangled identifier for multiple uses. (Google Closure Compiler is a significant counterexample here, but it is also known to be much slower.)

- Constant folding and inlining, which is annoying but can still be tracked. Again, most minifiers are too limited in their reasoning to do extensive constant folding and inlining.

- Language-specific transformations, like turning `a; b; c;` into `a, b, c;` and `if (a) b;` into `a && b;` whenever possible. They will be hard to understand if you don't know in advance, but there aren't too many of them anyway.
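To make the last two bullets concrete, here is a small sketch of the same function before and after the kind of rewriting described above. The function names and the `log` parameter are made up for illustration, and the "minified" form is hand-written, not the output of any particular minifier.

```javascript
// Readable form, roughly as an author might write it.
function readable(log, enabled) {
  log.push("a");
  log.push("b");
  if (enabled) log.push("c");
  return 60 * 60 * 1000;
}

// Hand-minified equivalent (an assumption about typical output):
// - statements joined with the comma operator: `a; b;` -> `a, b`
// - `if (x) y;` turned into `x && y`
// - `60 * 60 * 1000` constant-folded into `36e5`
function minified(l, e) {
  return l.push("a"), l.push("b"), e && l.push("c"), 36e5;
}

// Both versions perform the same side effects and return the same value.
const logA = [];
const logB = [];
console.log(readable(logA, true) === minified(logB, true));
console.log(logA.join() === logB.join());
```

Each rewrite is local to one statement or expression, which is why undoing them by hand is tedious but rarely requires understanding the whole program.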

As a result, minified code still remains comparatively human-readable with some note taking and perseverance. And since these transformations are mostly local, I would expect LLMs to be able to pick them up on their own as well.

(But why? Because I do inspect such programs fairly regularly, for example for comments like https://news.ycombinator.com/item?id=39066262)



I feel you’re downplaying the obfuscatory power of name-mangling. Reversing that (giving everything meaningful names) is surely a difficult problem?


JSNice[1] is an academic project that did a pretty good job of this in the 2010s and they give some pointers on how it is accomplished[2].

[1]: http://jsnice.org/

[2]: https://www.sri.inf.ethz.ch/jsnice


I would say the actual difficulty varies greatly. It is generally easy if you have a good guess about what the code actually does. It is much harder if you have nothing to go on, but usually you should have something to start with. Much like debugging, you need a detective mindset to be good at reverse engineering, and name mangling is a relatively easy obstacle to handle on this scale.

Let me give a concrete example from my old comment [1]. The full code in question was as follows, with only whitespace added:

    function smb(){
      var a,b,c,d,e,h,l;
      return t(function(m){
        a=new aj;
        b=document.createElement("ytd-player");
        try{
          document.body.prepend(b)
        }catch(p){
          return m.return(4)
        }
        c=function(){
          b.parentElement&&b.parentElement.removeChild(b)
        };
        0<b.getElementsByTagName("div").length?
          d=b.getElementsByTagName("div")[0]:
          (d=document.createElement("div"),b.appendChild(d));
        e=document.createElement("div");
        d.appendChild(e);
        h=document.createElement("video");
        l=new Blob([new Uint8Array([/* snip */])],{type:"video/webm"});
        h.src=lc(Mia(l));
        h.ontimeupdate=function(){
          c();
          a.resolve(0)
        };
        e.appendChild(h);
        h.classList.add("html5-main-video");
        setTimeout(function(){
          e.classList.add("ad-interrupting")
        },200);
        setTimeout(function(){
          c();
          a.resolve(1)
        },5E3);
        return m.return(a.promise)
      })
    }
Many local variables should be easy to reconstruct: b -> player, c -> removePlayer, d -> playerDiv1, e -> playerDiv2, h -> playerVideo, l -> blob (we don't know which blob it is yet though). We still don't know about non-local names including t, aj, lc, Mia and m, but we are reasonably sure that it builds some DOM tree that looks like `<ytd-player><div><div class="ad-interrupting"><video class="html5-main-video"></video></div></div></ytd-player>` (playerDiv2 is appended to playerDiv1, so the two divs are nested). We can also infer that `removePlayer` would be some sort of a cleanup function, as it eventually gets called in every control flow path visible here.

Given that `a.resolve` is the final function to be executed, even later than `removePlayer`, it will be some sort of "returning" function. You will need some information about how async functions are desugared to fully understand that (and also `m.return`), but such information is not strictly necessary here. In fact, you can safely ignore `lc` and `Mia` because their result only ends up in `playerVideo.src` and we are not that interested in the exact contents here. (Actually, you will fall into a rabbit hole if you are going to dissect `Mia`. Better to assume first and verify later.)

And from there you can conclude that this function constructs a certain DOM tree, sets some class after 200 ms, and then "returns" 0 if the video "ticks" or 1 on timeout, giving my initial hypothesis. I then hardened my hypothesis by looking at the blob itself, which turned out to be a 3-second-long placeholder video and fits with the supposed timeout of 5 seconds. If it were something else, then I would look further to see what I might have missed.
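The "returns 0 if the video ticks or 1 on timeout" shape can be sketched in isolation. This is only my guess at the pattern, not the original code: `Deferred` is a hypothetical stand-in for the mangled `aj` class, `probePlayback` is a made-up name, and a plain `EventTarget` stands in for the `<video>` element so the sketch runs outside a browser.

```javascript
// Hypothetical stand-in for `aj`: a promise whose resolve function is
// exposed, so that callbacks can settle it later.
class Deferred {
  constructor() {
    this.promise = new Promise((resolve) => {
      this.resolve = resolve;
    });
  }
}

// Resolves to 0 if `target` fires "timeupdate" before the timeout,
// otherwise resolves to 1.
function probePlayback(target, timeoutMs) {
  const done = new Deferred();
  const timer = setTimeout(() => done.resolve(1), timeoutMs);
  target.addEventListener(
    "timeupdate",
    () => {
      // Unlike the minified original, the timer is cleared here. The
      // original lets it fire, which is harmless because resolving an
      // already settled promise does nothing.
      clearTimeout(timer);
      done.resolve(0);
    },
    { once: true }
  );
  return done.promise;
}

// Usage: simulate a video that "ticks" before the 5-second timeout.
const fakeVideo = new EventTarget();
probePlayback(fakeVideo, 5000).then((result) => console.log("result:", result));
fakeVideo.dispatchEvent(new Event("timeupdate"));
```

Whichever callback runs first decides the result, which matches the reading that the function tests whether playback actually starts.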

[1] https://news.ycombinator.com/item?id=38346602


I believe the person you're responding to is saying that it's hard to do automatically / programmatically. Yes, a human can decode this trivial example without too much effort, but doing it via API in a fraction of the time and effort with a customizable amount of commentary/explanation is preferable in my opinion.


Indeed that aspect was something I failed to get initially, but I still stand by my opinion because most of my reconstruction had been local. Local "reasoning" can often be done without any actual reasoning, so while it's great that we can automate the local reasoning, it falls short of the full reasoning necessary to do general deobfuscation.


This is, IMO, the better way to approach this problem. Minification applies rules to transform code; if we know the rules, we can reverse the process (but can't recover any lost information directly).

A nice, constrained way to use an LLM here to enhance this solution is to ask it some variation of "what should this function be named?" and feed the output to a rename refactoring function.

You could do the same for variables, or be more holistic and ask it to rename variables and add comments (but risk the LLM changing what the code does).
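A rough sketch of that split, where the model only proposes names and a mechanical step applies them. Everything here is hypothetical: `suggestName` stands in for the LLM call (it just reads a canned table so the sketch runs offline), and `renameIdentifiers` is a deliberately naive, scope-unaware rename. A real tool would rename through a parser's scope analysis, since this version would also touch same-named properties and string contents.

```javascript
// Hypothetical stand-in for asking an LLM "what should this be named?".
// Falls back to the original identifier when it has no suggestion.
const cannedSuggestions = new Map([
  ["b", "player"],
  ["c", "removePlayer"],
]);
function suggestName(identifier, source) {
  return cannedSuggestions.get(identifier) ?? identifier;
}

// Naive rename: rewrites every identifier-like token found in the mapping.
// It is NOT scope-aware; it only illustrates where the rename step sits.
function renameIdentifiers(source, mapping) {
  return source.replace(/[A-Za-z_$][\w$]*/g, (name) => mapping.get(name) ?? name);
}

// The suggester only ever supplies names; the code itself is rewritten
// mechanically, so the suggester cannot alter what the code does.
function unmangle(source, identifiers, suggest) {
  const mapping = new Map();
  for (const id of identifiers) mapping.set(id, suggest(id, source));
  return renameIdentifiers(source, mapping);
}

const snippet = "c=function(){b.parentElement&&b.parentElement.removeChild(b)};";
console.log(unmangle(snippet, ["b", "c"], suggestName));
```

With the canned table above this prints `removePlayer=function(){player.parentElement&&player.parentElement.removeChild(player)};`. The design choice is that a bad suggestion can only produce a bad name, never different behavior, provided the rename step itself is correct.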


How do we end up with you pasting large blocks of code and detailed step-by-step explanations of what it does, in response to someone noting that just because process A is simple, it doesn't mean inverting A is simple?

This thread is incredibly distracting, at least 4 screenfuls to get through.

I'm really tired of the motte/bailey comments on HN on AI, where the motte is "meh the AI is useless, amateurish answer that's easy to beat" and bailey is "but it didn't name a couple global variables '''correctly'''." It verges on trolling at this point, and is at best self-absorbed and making the rest of us deal with it.


Because the original reply missed three explicit adverbs hinting that this is not a general rule (EDIT: and had also mistaken my comment for being dismissive). And I believe it was not in bad faith, so I went on to give more context to justify my reasoning. If you are not interested in that, please just hide it because otherwise I can do nothing to improve the status quo and I personally enjoyed the entire conversation.


> As a result, minified code still remains comparably human-readable with some note taking and perseverance.

At least some of the time, simply taking it and reformatting to be unfolded and on multiple lines is useful enough to be readable/debuggable. FIXING that bug is likely more complex, because you have to find where it is in the original code, which, to my eyes, isn't always easy to spot.


> - Comment removal, which we never expect to recover via unminification.

ChatGPT is quite good at adding meaningful comments back to uncommented code, actually.

Paste some code and add "comment the shit out of this" as a prompt.



