This is awesome. One question though: how does it handle the same resource (e.g. an image) appearing multiple times? Does it store multiple copies, potentially blowing up the file size? If not, how does it link to them in a single HTML file? If so, is there any way to get around it without using MHTML (or have you considered using MHTML in that case)?
Also, side-question about Rust: how do I get rid of absolute file paths in the executable to avoid information leakage? I feel like I partially figured this out at some point, but I forget.
Thank you! It's pretty straightforward: this program just retrieves assets and converts them into data URLs (data:...), then replaces the original href/src attribute value. So if the same image is linked multiple times, monolith will indeed bloat the output with the same base64 data, correct. I haven't looked into MHTML; ashamed to admit it's the first time I'm hearing about that format. I need to do some research, maybe I could improve monolith to overcome issues related to file size. Thank you for the tip!
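For anyone curious what that conversion looks like, here is a minimal sketch of turning asset bytes into a data URL. It is not monolith's actual code: the names `base64_encode` and `to_data_url` are made up for illustration, and the base64 encoder is hand-rolled only so the snippet has no dependencies.

```rust
// Standard base64 alphabet (RFC 4648).
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Hypothetical helper: encode raw bytes as padded base64.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into one 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        // Emit '=' padding for the positions that had no input byte.
        if chunk.len() > 1 {
            out.push(ALPHABET[(n >> 6) as usize & 63] as char);
        } else {
            out.push('=');
        }
        if chunk.len() > 2 {
            out.push(ALPHABET[n as usize & 63] as char);
        } else {
            out.push('=');
        }
    }
    out
}

/// Hypothetical helper: build the string that would replace a src/href value.
fn to_data_url(mime: &str, bytes: &[u8]) -> String {
    format!("data:{};base64,{}", mime, base64_encode(bytes))
}
```

Calling `to_data_url("image/png", &bytes)` once per src attribute is exactly why a repeated image gets embedded repeatedly: each occurrence carries its own full copy of the encoded bytes, and base64 itself adds roughly a third on top of the raw size.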
And about Rust: I think you're way ahead of me here as well, this is my first Rust program. If you're talking about it embedding some debug info into the binary which may include things like /home/dataflow then perhaps there's a compiler option for cargo or a way to strip the binary after it's compiled. ¯\_(ツ)_/¯ Sorry, that's the best I can tell at the moment.
Okay thanks! That was a pretty quick reply :) Regarding MHTML, it's basically the MIME format emails are in (which are basically inherently self-contained HTML documents). Various browsers have had varying degrees of support for it over the years. Chrome recently made it harder to save in MHTML format; I don't know how long they will be able to read the format, so I can't guarantee that if you go in that direction it'll still be useful for a long time, but at the moment there is still some support for it.
One way to dedupe inline image resources while still using HTML rather than MHTML could be to encode each of them in CSS once, and transform each image element into something with the corresponding class.
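A rough sketch of that idea, assuming the data URLs have already been built for every image occurrence. The function name `dedupe_to_css` and the `mono-img-N` class names are invented here, and nothing like this is known to exist in monolith:

```rust
use std::collections::HashMap;

/// Hypothetical helper: takes one data URL per image occurrence (in document
/// order) and returns (css, classes), where `css` holds one rule per *unique*
/// data URL and `classes[i]` is the class to put on the i-th occurrence.
fn dedupe_to_css(data_urls: &[&str]) -> (String, Vec<String>) {
    let mut class_of: HashMap<&str, String> = HashMap::new();
    let mut css = String::new();
    let mut classes = Vec::with_capacity(data_urls.len());

    for &url in data_urls {
        // Number classes in first-seen order.
        let next_id = class_of.len();
        let class = class_of.entry(url).or_insert_with(|| {
            let name = format!("mono-img-{}", next_id);
            // The payload is written into the stylesheet only once.
            css.push_str(&format!(
                ".{} {{ background-image: url(\"{}\"); }}\n",
                name, url
            ));
            name
        });
        classes.push(class.clone());
    }
    (css, classes)
}
```

Each original image element would then be swapped for an element carrying `classes[i]`. One caveat with this approach: a background image does not give its element an intrinsic size the way img does, so the width and height would have to be carried over explicitly, and alt text would need separate handling.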
Good point. I was thinking in the direction of something I'm tinkering with in a similar area. There, getting a static snapshot of the current DOM or a fragment is key (meaning that stripping out scripts is an intentional feature). Tweaking the document contents for efficiency could significantly affect any script work that may be present.