
Cool dip into Ruby internals.

If you roll your own ruby, instead of redoing all your strings, you could just change RSTRING_EMBED_LEN_MAX. This would cause more wasted memory if you have a lot of short strings (0 < len << RSTRING_EMBED_LEN_MAX), and probably isn't worth it since there isn't much performance improvement.

The most confusing part of this article was the actual RString struct implementation. Are the anonymous unions and structs used to control structure padding and alignment?



Values in interpreters written in C are frequently implemented as (manually) discriminated unions - i.e. unions that share a field at the start to indicate the type and contents of the remainder - because that's a handy way of implementing the polymorphism required for a straightforward interpreter. It's pretty much necessary to use structs inside unions in order to have more than one field per layout; the struct is just grouping, so it doesn't need a type name.
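
To make the pattern concrete, here is a minimal sketch of a manually discriminated union. All names (`sk_type`, `sk_val`, `sk_val_strlen`) are hypothetical and not taken from MRI or any other interpreter; the sketch only shows why the inner struct is needed for the two-field layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag values, invented for this sketch. */
enum sk_type { SK_INT, SK_STR };

struct sk_val {
    enum sk_type type;        /* shared leading field: the discriminant */
    union {
        long i;               /* layout used when type == SK_INT */
        struct {              /* layout used when type == SK_STR: two   */
            size_t len;       /* fields, so a struct groups them inside */
            const char *ptr;  /* the union                              */
        } str;
    } as;
};

/* Returns the length for string values and 0 for everything else. */
static size_t sk_val_strlen(const struct sk_val *v)
{
    assert(v != NULL);
    return v->type == SK_STR ? v->as.str.len : 0;
}
```

Code that reads a value checks `type` first and only then touches the matching member of `as`; nothing in the language enforces that, which is why the discrimination is "manual".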

So without looking at any of the MRI source, I'd be willing to guess that most, if not all, of its structures representing Ruby values start with a field of type RBasic, and that type contains information necessary to distinguish and interpret the remainder of the value.


Yeah. I just looked at the source because I was confused about how it knew how long the embedded string was (since the length field is in the other half of the union and ruby strings can have embedded \0 bytes), and RBasic is a struct containing a VALUE referring to a class and a VALUE "flags" that tends to have a lot of bit fiddling done to it.

Apparently out of the non-reserved bits, one is used to tell whether the string is embedded or not and five more are combined to give the string's length. Makes sense!
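
A rough sketch of that kind of bit fiddling, in case it helps. The bit positions and the `SK_`/`sk_` names here are assumptions picked for illustration, not copied from ruby.h; only the shape (one "not embedded" bit plus a five-bit length field) follows the description above.

```c
#include <assert.h>

/* Illustrative bit positions, assumed for this sketch only. */
#define SK_NOEMBED         (1UL << 13)
#define SK_EMBED_LEN_SHIFT 14
#define SK_EMBED_LEN_MASK  (0x1FUL << SK_EMBED_LEN_SHIFT) /* five bits: 0..31 */

/* A string counts as embedded when the NOEMBED bit is clear. */
static int sk_is_embedded(unsigned long flags)
{
    return (flags & SK_NOEMBED) == 0;
}

/* Extracts the embedded length from the five-bit field. */
static unsigned long sk_embed_len(unsigned long flags)
{
    return (flags & SK_EMBED_LEN_MASK) >> SK_EMBED_LEN_SHIFT;
}

/* Returns flags with the length field replaced by len. */
static unsigned long sk_set_embed_len(unsigned long flags, unsigned long len)
{
    assert(len <= 0x1F);            /* must fit in five bits */
    flags &= ~SK_EMBED_LEN_MASK;    /* clear the old length */
    return flags | (len << SK_EMBED_LEN_SHIFT);
}
```

Because the length lives in the flags word rather than in the character data, embedded \0 bytes don't confuse anything, and five bits are more than enough for a 23-character maximum.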


The "aux" union contains either a reference to the shared string or the string capacity. The idea must be that shared strings are immutable so the capacity is not meaningful. Conversely, if the string is not shared, that reference is not meaningful.

The outer "as" union either contains the 24 byte char array or the 24 byte (len + ptr + aux) heap string information.
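
For anyone who wants to poke at the layout, here is a stand-alone approximation of it. The `sk_` types are stand-ins invented for this sketch (it does not include ruby.h), and treating `shared` as a reference to another string object is an assumption on my part; the member names follow the ones mentioned in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uintptr_t sk_value;  /* stand-in for VALUE: one machine word */

/* Chosen so the char array spans exactly three machine words. */
#define SK_EMBED_LEN_MAX ((int)(sizeof(sk_value) * 3 - 1))

struct sk_rbasic { sk_value flags; sk_value klass; };

struct sk_rstring {
    struct sk_rbasic basic;
    union {
        struct {
            long len;               /* heap string: length          */
            char *ptr;              /* heap string: malloc'd buffer */
            union {
                long capa;          /* unshared: buffer capacity    */
                sk_value shared;    /* shared: assumed reference to the shared string */
            } aux;
        } heap;
        char ary[SK_EMBED_LEN_MAX + 1];  /* embedded string data plus a trailing NUL */
    } as;
};

/* Copies src into the embedded buffer if it fits; returns 1 on success, 0 if too long. */
static int sk_try_embed(struct sk_rstring *s, const char *src, size_t len)
{
    assert(s != NULL);
    if (len > (size_t)SK_EMBED_LEN_MAX) return 0;
    memcpy(s->as.ary, src, len);
    s->as.ary[len] = '\0';
    return 1;
}
```

On a typical 64-bit build where `long` and pointers are both 8 bytes, `heap` and `ary` each come out to 24 bytes, so the union costs nothing extra for supporting both layouts.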


Thanks! Cool idea about recompiling with a new value for RSTRING_EMBED_LEN_MAX - never thought of that!

Yes, the inner unions/structs are used to tell the C compiler exactly where the different data values should go. They are in fact not anonymous: e.g. "heap" and "aux" - the names appear right after the closing brace.



