This is awesome. One question though: how does it handle the same resource (e.g....

flatroze · on Aug 23, 2019

Thank you! It's pretty straightforward: this program just retrieves assets, converts them into data URLs (data:...), and replaces the original href/src attribute values. So if the same image is linked multiple times, monolith will indeed bloat the output with repeated copies of the same base64 data, correct. I haven't looked into MHTML; I'm ashamed to admit this is the first time I'm hearing about that format. I need to do some research. Maybe I could improve monolith to overcome the file-size issues. Thank you for the tip!
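To make the mechanism above concrete, here is a minimal sketch of turning asset bytes into a data URL. It is not monolith's actual code: `base64_encode` and `to_data_url` are hypothetical names, and the encoder is written by hand only so the example needs no external crates.

```rust
// Hypothetical helpers for illustration; not taken from monolith's source.
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 with `=` padding, processing the input three bytes at a time.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Missing bytes in the final chunk are treated as zero.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        // Emit padding instead of data for bytes that were not present.
        if chunk.len() > 1 {
            out.push(ALPHABET[(n >> 6) as usize & 63] as char);
        } else {
            out.push('=');
        }
        if chunk.len() > 2 {
            out.push(ALPHABET[n as usize & 63] as char);
        } else {
            out.push('=');
        }
    }
    out
}

/// Builds the string that would replace an href/src attribute value.
fn to_data_url(media_type: &str, bytes: &[u8]) -> String {
    format!("data:{};base64,{}", media_type, base64_encode(bytes))
}
```

Calling `to_data_url` once per attribute, as in this sketch, means an image referenced three times is encoded and written out three times, which is the bloat described above.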

And about Rust: I think you're way ahead of me here as well; this is my first Rust program. If you're talking about it embedding some debug info into the binary, which may include things like /home/dataflow, then perhaps there's a compiler option for cargo or a way to strip the binary after it's compiled. ¯\_(ツ)_/¯ Sorry, that's the best I can tell you at the moment.

dataflow · on Aug 23, 2019

Okay thanks! That was a pretty quick reply :) Regarding MHTML, it's basically the MIME format emails are in (which are basically inherently self-contained HTML documents). Various browsers have had varying degrees of support for it over the years. Chrome recently made it harder to save in MHTML format; I don't know how long they will be able to read the format, so I can't guarantee that if you go in that direction it'll still be useful for a long time, but at the moment there is still some support for it.

dspillett · on Aug 23, 2019

One way to dedupe inline image resources while still using HTML rather than MHTML could be to encode each one in CSS once and transform the image elements into something that uses that class.
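A rough sketch of that idea, under stated assumptions: each distinct data URL gets one CSS rule, and every occurrence is mapped to the class for that rule. The `Deduped` struct, the `dedupe_images` function, and the `img-N` class naming are all hypothetical and not taken from any existing tool.

```rust
use std::collections::HashMap;

/// One CSS rule per unique data URL, plus the class name each original
/// image occurrence should be rewritten to use (in input order).
struct Deduped {
    css: String,
    classes: Vec<String>,
}

fn dedupe_images(data_urls: &[&str]) -> Deduped {
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut css = String::new();
    let mut classes = Vec::with_capacity(data_urls.len());

    for &url in data_urls {
        // Ids are assigned in order of first appearance: 0, 1, 2, ...
        let next_id = seen.len();
        let id = *seen.entry(url).or_insert(next_id);
        if id == next_id {
            // First time this resource is seen: emit its data exactly once.
            css.push_str(&format!(
                ".img-{} {{ background-image: url(\"{}\"); }}\n",
                id, url
            ));
        }
        classes.push(format!("img-{}", id));
    }

    Deduped { css, classes }
}
```

A real rewrite would also have to carry over each image's width and height, since an element with only a background image has no intrinsic size.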

dataflow · on Aug 23, 2019

That'd easily break JavaScript though.

dspillett · on Aug 27, 2019

Good point. I was thinking in the direction of something I'm tinkering with in a similar area. There, getting a static snapshot of the current DOM or a fragment is key (meaning that stripping scripts out is an intentional feature). Tweaking the document contents for efficiency could significantly affect any script work that may be present.

chmod775 · on Aug 23, 2019

https://github.com/Y2Z/monolith/blob/master/src/html.rs#L94

Hope this answers your question (it gets converted to a data URI, and there's apparently no de-duplication).

gildas · on Aug 23, 2019

SingleFile, which can run from the CLI too, handles this by using CSS variables.