Hacker News | dfranke's comments

Allowing purely numeric usernames seems like a terrible idea to me, because it creates ambiguity between what's a username and what's a UID. It's common for tools like ls or ps to display a username when one is found and fall back to displaying a UID if it isn't, and similarly tools like chown will accept either a UID or a username and disambiguate based on whether it's numeric or not. Now suppose there's a numeric username that doesn't match its own UID, but does match some other user's UID. It doesn't take a lot of imagination to see how this would lead to vulnerabilities.
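The failure mode can be sketched with a toy resolver. The user table below is invented for illustration; POSIX specifies that chown tries its operand as a user name first and falls back to treating it as a numeric ID only if no such name exists, which is the lookup order modeled here:

```python
# Toy model of the ambiguity: a purely numeric username shadows a real UID.
# The user database is invented for illustration.
USERS = {
    "alice": 1000,
    "1000": 1042,   # numeric username whose UID does not match its own name
}

def resolve_owner(arg: str) -> int:
    """Mimic chown-style resolution: name lookup first, numeric fallback second."""
    if arg in USERS:          # a matching username wins...
        return USERS[arg]
    if arg.isdigit():
        return int(arg)       # ...the argument is treated as a UID only if no name matches
    raise KeyError(arg)

print(resolve_owner("1000"))  # 1042: someone who meant UID 1000 gets user "1000" instead
print(resolve_owner("42"))    # 42: ordinary numeric fallback
```

So `chown 1000 file` silently hands the file to UID 1042, which is exactly the kind of surprise described above.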


Talk to POSIX:

> A string that is used to identify a user; see also User Database. To be portable across systems conforming to POSIX.1-2017, the value is composed of characters from the portable filename character set. The <hyphen-minus> character should not be used as the first character of a portable user name.

* https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...

The "portable filename character set" is defined as:

    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    0 1 2 3 4 5 6 7 8 9 . _ -
* https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...

So only a hyphen as the first character is forbidden.

Given that you can't necessarily control where usernames come from (e.g., LDAP lookups), properly speaking your system has to handle everything anyway, even if you don't allow local creation.


Yes, I'm aware, and POSIX has many such bugs that make command input or output unavoidably ambiguous if certain unexpected characters are present that they didn't think to prohibit. A lot of the revisions that went into POSIX 2024 were aimed at fixing some of these, such as standardizing find -print0 and xargs -0. The fact that this one got overlooked doesn't mean it's a good idea to make the situation worse and harder for future POSIX revisions to address.


It is time for POSIX to get with the times. Computers are used in more than the US and Canada (for the most generous interpretation of "American" in ASCII I'm including Canada; their French speakers will not be happy with that, not to mention First Nations, of whose written languages I know nothing but imagine they need more than ASCII). UTF-8 has been standard for decades now; just state that as of POSIX 2025 all of UTF-8 is allowed in all string contexts unless there is a specific list of exception characters for that context (that is, never a list of allowed characters, only exceptions). They probably also need to standardize UTF-8 normalization functions and specify when they must be used in string comparisons, and perhaps require that an alternate UTF-8 character entry scheme exist on all keyboards.

The above is a lot of work and will probably take more than a year to put into the standard, much less implement, but anything less is just user hostile. Sometimes committees need to lead from the front, not just write down existing practice.


Some practical concerns I have with UTF-8 are similar (or even identical, depending on the font) characters that can be used in malicious ways (think package names, URLs, etc.), not to mention RTL text and other control characters. Every time I add logging code, I make sure that any "interesting" characters are unambiguously escaped or otherwise signaled out-of-band. Having English as an international writing standard is perfectly fine, and I say that as a non-native speaker with a non-ASCII name.


A good chunk of the world does not speak English or languages based on the Latin alphabet. They should be able to interact with computers completely in their own languages and alphabets, even if those are written right-to-left or top-to-bottom.

Of course, someone has to do the work to make this possible. And no one is obliged to do it. But to suggest that such work should not be done at all does not sit right.


This isn't quite black and white.

Right now, I can set up and use Linux in my language, have my display name in my script, but my username and password are ASCII-only and are available on the standard English keyboard anywhere. If I run into trouble, I can SSH in from any device in the world without any issue. I can just borrow a laptop from anyone, switch to English if needed, and jump right in.

Having a common denominator set of characters for such things is just really, really useful. I’d rather focus on all the other things that need to be localised.


"Without any issue" is a stretch; using a French keyboard is a bad enough experience for passwords, and not everyone uses standard English keyboards.


The French keyboard is the most notable example of anyone using something other than query or quertz. Even Japan and China use an extended querty. But even with the French keyboard the only issue is that everything is in the wrong place, not that the standard 26 "English" letters don't exist or are hard to reach.

Meanwhile using ä, é or ş in a username or password will make your life much harder once you are in a foreign country. Never mind any letter that isn't derived from the Latin alphabet.


> something other than query or quertz. Even Japan and China use an extended querty

qwerty


> A good chunk of the world does not speak english or latin character based languages.

nearly everyone in a first world country knows the English alphabet though. a vast majority of the developing world as well. just look at street view on Google maps in any country, there's going to be a ton of street signs using English characters, even in non-touristy areas.

> They should be able to interact with computers completely in their own languages and alphabet sets, even if those are written right-to-left or top-to-bottom.

if you're a typical android/ios end user you're interacting with a computer in your native language anyway. this discussion only applies to low level power users.

in that case: why? these aren't user-facing features. this is like saying that people should be able to use symbols native to their language rather than greek letters when writing math papers.

it might not be "fair" that English is overrepresented in computing but it also hasn't demonstrably been a barrier to entry. Japan, Korea and China have dominated, particularly in hardware.

if you think it should be fixed why stop at usernames? why represent uids with 1234 instead of 一二三四?


> nearly everyone in a first world country knows the English alphabet though

And not only the first world. Actually, the bigger the country, the more everything is localized: from dubbed films to food packaging labels. In a smaller country one would see more English/Spanish/French etc. because they don't have the resources to localize everything.


> if you're a typical android/ios end user you're interacting with a computer in your native language anyway. this discussion only applies to low level power users.

I don't think you realize how poor this experience is. Part of the reason is that the underlying system is so English-focused that app developers have to do a great deal of work to get things working.

> if you think it should be fixed why stop at usernames? why represent uids with 1234 instead of 一二三四?

I mean, if the computers had first been built in south east asia, they would have been.


it's certainly hard to localize everything but billions of people use ios/android in India, China, SEA, MENA, etc... i think it's fair to say that at the end user level, computers are in fact usable by non-English speakers.

individual apps may not be as usable, but that's on the developers. good counter-example, a lot of japanese games, even made within the past 5 years, require setting the Windows system locale to Japanese to function properly. and as someone who played a fair number of japanese doujin games in the 00s/10s, it used to be every game with this problem.

> I mean, if the computers had first been built in south east asia, they would have been.

debatable as CJK heavily use Arabic numerals everywhere, but even if they did, so what? you'd learn those symbols and get used to it. the same way that if you're a unix sysadmin you get used to only being able to use a small subset of ASCII characters for usernames.


> it's certainly hard to localize everything but billions of people use ios/android in India, China, SEA, MENA, etc... i think it's fair to say that at the end user level, computers are in fact usable by non-English speakers.

It's important to contextualize these discussions in socioeconomics. Computers are not just fun playthings. They are serious tools used for economic activity. Their usage, through their design, has a significant impact on the social systems of society. Non-Latin-language speakers are able to use poorly localized computers, but only less well than Latin-language speakers can. At least in South Asia, there is a huge economic divide between those who can speak English and those who can't, where causality runs both ways, exacerbated in more recent times by the inability of some to use technology. And that economic divide then causes huge sociopolitical problems in societies.

If computers are a means of economic progress, we shouldn't impose the condition that one has to somehow learn English to use them well. But isn't localization sufficient? No, it isn't. Set aside even the fact that localization requires some speakers of your language to be bilingual. The current era of economic progress is characterized by software development, and if the only way you can develop software is to learn a foreign language, then surely we are denying economic progress to some communities.

P.S. I will repeat. Nobody has to do any work to help other communities. But to assert that such work should not happen is plain wrong.


you're confusing "speaking English" with "knowing the English alphabet." these things are orthogonal. 95%+ of people in those countries know the english alphabet. i just threw down google maps street view at a random spot in Phnom Penh and instantly found english letters visible from the street, on advertisements[0]. then i threw it down in a much smaller Thai city that i had never heard of, Nakhon Sawan, and instantly found English on the street.[1] i've been in China, Japan and Korea enough to know english characters are all over the place. the English alphabet is omnipresent everywhere, i think you fail to realize this. nobody who is using a computer in these places is getting confused by the english alphabet.

> But to assert that such work should not happen is plain wrong.

i assert it should not happen because it's not solving an actual problem, the same way that changing "x" and "y" to "ㅋ" and "ㅌ" in algebra doesn't solve a problem, and trying to "solve" it will yield to a monstrous amount of incompatibilities and confusion. here's a really good comparison: ipv6. IPv6 is solving a problem, maybe in a way people disagree with, but definitely a real problem... and yet we still can't make ipv6 fucking work after God knows how many years, and trying to get IPv6 networking at any sort of scale is a massive fucking headache. now we want to go through the same headaches to support... umlauts in usernames? yeah, no thanks.

there's enough real work left to be done in the world that we shouldn't waste time with stupid makework like this.

or maybe in 30 years i'll be able to call up IT support and say "hey i forgot my password, can you reset it? my username is 神王 سعود. ... need me to spell that for you?"

edit: somewhat ironically, HN swallowed a few of the unicode characters in my theoretical future username...

[0] https://i.imgur.com/0WkG0ze.png

[1] https://i.imgur.com/VhDR5Xh.png


I am from Pakistan. At least in South Asia, there are English characters everywhere because the infrastructure is primarily designed for the rich English-speaking classes, while the poor are left behind. A serious political problem.

I have seen many non-English-speaking people interact with computers in English, both poor people and older folks in rich families who don't know English. They kind of recognize the shapes of words, or they go by icons. They don't actually know the meaning of anything. They can only perform a limited set of pre-memorized actions. Scamming them is easy. If they get stuck, they have to beg someone for help.

Again, I will say this. There are two problems here. One for users and one for developers. Users must be able to read in their own language. Developers must be able to develop in their own language.


> They kinda recognize the shape of words, or they go by icons. They don't actually know the meaning of anything.

That's kind of true of a lot of English computer users too.

But more to the point, what you are advocating for is translating the interface. I think nobody is against that, and it is a common thing to do (at least for countries people care about, which sadly excludes a lot of the poorer parts of the world). The username prompt should read "username" in Urdu. That doesn't automatically mean it has to accept non-ASCII input too, as long as you accept Unicode in the display name.

> Developers must be able to develop in their own language.

I learned coding in Pascal before I learned that "if" is an English word. English helps, but in the end keywords in programming languages and shell commands are only mnemonics. Knowing the translation helps but isn't necessary. What's important is having documentation, tutorials, and other resources in a language the developer understands.


I have the impression that people confuse learning English (which is hard unless your native language is a Germanic/Romance one) with learning to recognize and type Latin characters, which is easy; people around the world already use the Latin alphabet without knowing any English. You may escape the Latin alphabet if you have spent your whole life in a remote village, but for people living in cities around the world it should be familiar and not a barrier at all. It's hard to escape Latin characters in the modern world, and this ship has already sailed, like it or not (I mostly do).


Oh no please, I don’t want to have my linux username in Cyrillic. Thanks but no, thanks!

I know enough linux to see 10 ways in which it will make things worse at some point.


> similar (or even the same, depending on font) characters which can be used in malicious ways

These are called "confusables" and boy does that well run deep: https://www.unicode.org/Public/security/16.0.0/confusables.t...
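For instance, Cyrillic "а" (U+0430) is a distinct code point from Latin "a" (U+0061) even though most fonts render them identically, which is exactly the confusables hazard described above:

```python
# Two strings that render identically in most fonts but compare unequal:
# Latin "a" (U+0061) vs Cyrillic "а" (U+0430).
latin = "paypal"
cyrillic = "p\u0430yp\u0430l"  # Cyrillic small a substituted for both a's

print(latin == cyrillic)                  # False: visually identical, byte-distinct
print([hex(ord(c)) for c in cyrillic])    # reveals the 0x430 interlopers
```

This is why package registries and browsers apply confusable-detection or punycode display rules rather than trusting appearances.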


> It is time for POSIX to get with the times.

"Be the change that you wish to see in the world." — Mahatma Gandhi

It's free to join:

* https://www.opengroup.org/austin/lists.html

* https://www.opengroup.org/austin/


[flagged]


> Most useless post so far

Is GP willing to help out? To go through data structures and file formats (like pax[0][1] (née ustar (née tar))) to find places where things will need to be changed?

It's easy to say "Someone else should fix things."

[0] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/p...

[1] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/p...


NO. PLEASE DON'T. This wreaks havoc especially on East Asian users because Unicode is poorly supported in console on top of being binary non-canonical in both entry and display.

Meaning,

  - :potato: OR :potatoh: may display as :eggplant: OR :potato:    
  - isEqual(`:eggplant:`, `:eggplant:`) may fail OR succeed   
  - trying to type :sequence: breaks console until reboot  
  - typing :potato: may work but not :eggplant:  
  - users don't know how to spell :eggplant:  
  - etc. 

If you must, please fix Unicode first so that user entry and display have a 1:1 relationship. I have Han Unification in mind, but I believe the problem isn't unique to that unification or to East Asia.


Almost nobody supports string search and comparison API functions for Unicode. The Unicode security tables for Unicode identifiers are hopelessly broken.

Not even the simplest tools, like grep, support Unicode yet. This didn't happen in the last 15 years, even though there are patches and libs.


Wasn't one way to make grep faster setting LANG=C to avoid using language-aware string comparison? If so, shouldn't Unicode be supported by default or what would, say, de_DE.UTF-8 actually compare to make it slower?


yes it should. but the libunistring variant was too slow. And since LANG is run-time evaluated you cannot really provide pre-compiled, better search patterns.

sometime I'll come up with pre-computed optimized tables, but no time.


It's just a grep bug, ripgrep is fast and supports proper regex.


Sure, go ahead. Write the PR and make sure to test against all other things used in production.

Let's talk again in 30 years when you're done.


Oh, it's been closer to 20 years than 30 for the rest of the world to catch up to Unicode. We aren't at "perfect" now, but we're certainly down to the trickier corner cases, where it's difficult even to see how you'd solve the problems at all, let alone code the solutions; that's just reality's ugly nose sticking into our pristine world of numbers.

But there really isn't any other solution. Yes, there will be an uncomfortable transition. Yes, it blows. But nothing else is going to work other than dealing with it and taking the hits as they come. The software needs to be updated. The presumption that usernames come from some 7-bit ASCII subset is simply unreasonable. We'll be chasing bugs with these features for years, but that's not some optional aspect we can work around; it's just what's coming down the pike. Better to grasp the nettle firmly [1] than shy away from it.

At least this transition can learn a lot from previous transitions, e.g., I would mandate something like NFKC normalization applied at the operating system level on the way in for API calls: https://en.wikipedia.org/wiki/Unicode_equivalence Unicode case folding decisions can also be made at that point. The point here not being these specific suggestions per se, but that previous efforts have already created a world where I can reference these problems and solutions with specific existing terminology and standards, rather than being the bleeding-edge code that is figuring this all out for the first time.

[1]: https://www.phrases.org.uk/meanings/grasp-the-nettle.html
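The NFKC suggestion above is concrete enough to demonstrate. Python's unicodedata is used here only because it exposes the standard normalization algorithms; the point is what normalization at an API boundary would buy:

```python
import unicodedata

# "é" can be one code point (U+00E9) or "e" plus a combining acute (U+0065 U+0301).
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# Byte-for-byte the two usernames differ...
print(composed == decomposed)  # False

# ...but NFKC maps both to the same canonical form, so a lookup keyed on the
# normalized string treats them as the same user.
nfkc = lambda s: unicodedata.normalize("NFKC", s)
print(nfkc(composed) == nfkc(decomposed))  # True
```

Applying this once, on the way into the user database, is what prevents two visually identical usernames from coexisting as distinct accounts.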


Don't get me wrong, I think using UTF-8 everywhere is how things should be.

But this is not a "let's just" or "why don't we" type of endeavor. This is a major undertaking, and as such people are needed who (A) think it is worth the effort and (B) are willing to follow through with all the consequences.

Open Source software lives from contributions, and if you're not willing to do it, why should others spend years of their lives on it?

In the end this is a question of: are the benefits worth the effort? What do we win? Where do things get simpler? Where do they get more complicated? How do you pull it off if half the distributions use UTF-8 and the other half use the legacy way? How would tooling deal with this split? And so on.


To add a little bit of context:

You know what I think would be way worse than today's reduced-character-set usernames with some special rules, or "just" using UTF-8 for them?

Both. Imagine a world where some usernames are UTF-8 some are not and it is hard to figure out which is which. That would be worse than just leaving things as they are.

Avoiding that situation makes pulling the whole thing off even harder, since there needs to be a high amount of coordination between many projects, distros etc.


> Unicode case folding decisions can also be made at that point

Ok, I'll bite. How do you intend to do case folding without knowing the language the string is in? Will every filename or whatever also carry its language as part of the string? I am not sure what the plan is there.


Unicode opens a whole can of worms. The world is already full of software that in theory supports non-ASCII text but in practice breaks for some use cases. It's easy to allow UTF-8; it's hard to test all possible use cases, and to foresee them well enough to know what to test. Nowadays I use mostly English, so I don't see localization bugs, but when I used my native language with software and the internet (~10y ago) I encountered too many bugs and avoided non-ASCII in things like usernames, passwords, file names, and other places where UTF-8 may be allowed but causes problems later. Just allowing UTF-8 is rarely enough. Localization is hard, so better to start with the places where it is important. Usernames, IMHO, are not one of them.


Sounds like lots of work and a lot of new bugs for no real value.


> Computers are used in more than the US and Canada

Even if you speak US (or Canadian) English exclusively, there are still some words that are just impossible to spell correctly in pure ASCII, e.g. résumé, café etc.


“correctly”. I don’t consider it “incorrect” English when someone writes “cafe” or “resume”. It seems to me a little bit pædantic to insist that those words must have the accent marks in order to be correct (when using them in English).


Yeah, loanwords are different words than the original word.

The correct plural of "baby" in German is "babys".


I would say it is not the place of POSIX to prescribe how things should be; the job of POSIX is to describe what is, a common operating environment. This is why POSIX is such a mess and why I feel it is not a big deal to deviate from it. However, POSIX fills an important role in getting everyone on the same page for interoperability.

In my opinion the way to improve this is bottom up, not top down. Start with Linux (these days POSIX is largely "what does Linux do?") and get a patch in that changes the definition of the user name from a subset of ASCII to a subset of UTF-8. What subset? That is a much harder problem with UTF-8 than with ASCII; good luck. Get a similar patch into a few of the BSDs. Then you tell POSIX what the OSes are doing, and fight to get it included.

On the subject of which Unicode subset: perhaps the most enlightened thing to do is the same as the Unix filesystem and punt. One neat thing about the Unix filesystem is that names are defined not in an encoding but as a sequence of bytes. This has problems and has made many people very mad, but it does mean your filesystem can be in whatever encoding you want; transitioning to UTF-8 was easy (mainly due to the clever backwards-compatible nature of UTF-8) and we were not locked into a problematic encoding like on Windows. Perhaps just define that the name is an array of bytes and call it a day. That sounds like the Unix way to me.


"however posix fills an important role in getting everyone on the same page for interoperability."

Isn't that exactly what the posix username rules are doing? Specifying a set of characters which are portable across systems to allow for interoperability between current and legacy unix systems along with most non-unix systems.

"Start with linux"

Which linux? Debian/Ubuntu, Redhat/Fedora, shadow-utils, and systemd all differ.

"get a patch in that changes the defination of the user name from a subset of ascii to a subset of utf-8"

ASCII is a subset of UTF-8 so the POSIX definition already specifies a subset of UTF-8.


Honestly, I just don't care. UTF8 is excessively complicated. ASCII is simple.


> properly speaking your system has to handle everything anyway, even if you don't allow local creation.

Honestly, I try not to be a pessimist, but this sounds like the opening narration to some dystopian doomsday movie. Titled something like You're Not Wrong, I suppose.


At the meatspace level, purely numeric usernames are problematic.

I was working as a contractor at a Fortune 500 firm several years ago when they introduced a new ERP system which apparently encouraged the company to switch to numeric system IDs. Fortunately the technical teams, especially Linux support, objected and it was overruled, but I was just as worried about the communications problems that would result.

When everyone has a system ID that matches a consistent pattern, like “YZ12345”, IDs are easy to recognize in documentation and data. An ID like “1234567” could be practically anything.


I really like the concept of adding some redundancy to IDs, like a prefix. It helps to disambiguate things (kind of like static typing). A good example is also IBAN bank numbers, which, interpreted as an integer, must leave a remainder of 1 when divided by 97, enabling fast client-side validation against typos.


Could you give a reference on this 97 rule? I’m intrigued.


I was also intrigued, so I searched and on wikipedia ( https://en.wikipedia.org/wiki/International_Bank_Account_Num... ), in the section "Validating the IBAN" it is written :

    Interpret the string as a decimal integer and compute the remainder of that number on division by 97
    If the remainder is 1, the check digit test is passed and the IBAN might be valid
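The full check also moves the first four characters to the end and maps letters to numbers (A=10 ... Z=35) before taking the remainder. A short sketch of that standard mod-97 test, using Wikipedia's example IBAN:

```python
def iban_ok(iban: str) -> bool:
    """Mod-97 check from the IBAN standard (ISO 13616)."""
    s = iban.replace(" ", "").upper()
    s = s[4:] + s[:4]  # move country code and check digits to the end
    # Map each character to its numeric value: digits stay, A->10 ... Z->35.
    digits = "".join(str(int(c, 36)) for c in s)
    return int(digits) % 97 == 1

print(iban_ok("GB82 WEST 1234 5698 7654 32"))  # True (Wikipedia's example IBAN)
print(iban_ok("GB82 WEST 1234 5698 7654 33"))  # False: one typo breaks the check
```

Because 97 is prime, any single-digit error (and most transpositions) changes the remainder, which is why the scheme catches typos so reliably.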


It’s pretty common in places that handle tax data.

At the end of the day, pushing opinionated bullshit doesn’t belong in utilities. If there’s a security vulnerability, sell that and push for incorporation into NIST standards.


I am also worried about more subtle bugs caused by usernames that are not strictly only-numeric, such as “10e2” or “0xDEADBEEF”.


It shouldn't be a problem as long as the system disallows a numeric username to be the same as an existing UID (excepting the case where the matching UID is assigned to said username).


Still makes historic data garbage; both usernames and UIDs can be created and destroyed over time.


> Allowing purely numeric usernames seems like a terrible idea to me

"I'm not a number, i am a free man. Ha ha ha ha ha"


“Who is UID 0?”

“You are UID 6.”


You have an off by one error. But I honestly don’t know which you should change to with the spirit of the show.


There’s lots of dumb things that you can do. Where do the safety bumpers stop?


wherever each community puts them?


I retrained myself on Barchowsky last year. Barchowsky and its close cousin Getty-Dubay are italic rather than looped, and a lot easier to read for someone who only ever learned print.


Author of the essay here. I did a double take at seeing it posted here because I thought it was completely forgotten, nearly including by myself. I think the actual date of this essay is 2007 or maybe 2006, because I remember writing it from my university computer lab and I was class of '07. Anyway, there's certainly a lot of water under the bridge since then and the political composition of hackerdom today looks nothing like it did 15 years ago. With the growth of the FAANGs there are far more hackers today than there were then, and the younger ones are a lot more likely to be leftists than libertarians. Still, though, when I travel in libertarian circles it's pretty clear to me that hackers are overrepresented there, so I think the reverse remains true as well, even though it's not as dramatic or obvious as it was in the '00s.


It probably wasn't the only thing affected. It's just flipping bits in encryption keys has much more dramatic and obvious effect than flipping other random bits in memory. Flip a bit in a raster image and you get one funny-looking pixel. Flip a bit in an AES key and you completely corrupt all the data handled by that key.


Steno devices are well-known to the speed-typing community. Some competitions allow them, some don't. You can't bring one to a competition that doesn't, for the same reason you can't bring an F1 car to a stock car race.


That's what mmap is for.
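For readers who haven't used it: mmap exposes a file's contents as ordinary memory, so access becomes slicing and searching rather than explicit reads. A minimal sketch with Python's stdlib mmap (illustrative only; the project under discussion is in Rust):

```python
import mmap
import os
import tempfile

# Write a small file, then map it read-only and access it by slicing.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello mmap")
with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
    head = m[:5]               # b'hello' -- no read() call needed
    offset = m.rfind(b"mmap")  # substring search over the mapped bytes
os.close(fd)
os.unlink(path)
print(head, offset)  # b'hello' 6
```

The OS page cache does the buffering, which is why mmap is the usual answer when a program wants file data to behave like an in-memory structure.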


It might be possible to replace freqfs with mmap on a POSIX OS, but a) you would still have to implement your own read-write lock, and b) you would (I think probably?) lose some consistency in behavior across different host operating systems.


Which OSes does this run on that doesn’t have some kind of mmap operation?


It should work on Windows (because tokio::fs works on Windows) although I have not personally tested this


You can do mmap on Windows, eg. https://github.com/danburkert/memmap-rs


mmaps for read, explicit API for writing, a-la LMDB. Buggy readers can read inconsistent data but cannot corrupt the os.


Corrupt the OS? How might that happen?


Sorry, I meant the DB!


> [Lisp macros] by their nature would be hard to implement properly in a language without turning it into a dialect of Lisp.

Camlp4, Template Haskell, and Rust procedural macros all serve as counterexamples to this claim.


To the best of my knowledge, and I could be wrong, none of those languages/extensions allow for run time generation of code. I think there is a great deal of semantic difference between compile time code execution that enables syntax manipulation and the full scope of Lisp macros. I’m not making a judgment about the value of such capabilities, but I really think they make for distinct expressability classes.


Common Lisp macros are precisely "compile time code execution that enables syntax manipulation". If by "runtime generation of code" you mean to include executing that code after it's been generated, that's not macros, that's `eval`.

Just as use of `eval` tends to be discouraged in Lisp land, it's not something that a Haskell or Rust programmer would often reach for. But in Haskell, if you really want it, GHC has an API and you can have the whole power of the compiler available to you at runtime. This isn't really a language feature per se, it's literally just linking in the whole compiler and calling it like an ordinary library. I'm not aware of anything similar to that in Rust but I haven't really looked. However, if you're only trying to generate code and not JIT-compile or execute it, the same Rust crates that support compile-time AST manipulation (like `syn` and `quote`) can equally well be used at runtime.


Are you saying that they were easy to implement, and that it was done properly?


Implementing a modern, production-quality compiler is not easy as a baseline, but nothing in the design of OCaml, Haskell, or Rust adds any significant obstacles, relative to Common Lisp, to supporting this feature. Slinging an AST around and dropping it into a quasiquoted template is a well-understood problem. The simplicity of Lisp's syntax is not a prerequisite and hasn't been since the parsing techniques that were developed in the 1970s.

Done properly? I can't speak to camlp4, but at least in the case of Haskell and Rust, certainly. Incidentally I just had my first occasion to write a Rust procedural macro last weekend. I had a significantly complex transformation written and working in half a day, learning curve included, and I found it all pretty frictionless.


The litmus test is being able to remove any phrase structure in the language and replace it by a macro.


Haskell and Rust both pass this. The input doesn't even have to be a production in the source language; you can parse it any way you'd like.


Paul Graham is not a complete idiot, and you are not disproving his statement "hard to implement properly in a language without turning it into a dialect of Lisp".

If you have to parse anything, it's not a Lisp-like macro, which is always an object arising from a production in the language.

Parsing can be hard, so he is right there too. If we take out some subset of Rust that has a significant syntax, and implement it in macros, that sounds like being up to the elbows in parsing.


You're not up to your elbows if somebody else has already done the work for you. If the input to your macro is a valid production in Rust, then all the parsing work that's incumbent on the macro author is to write

    let ast = parse_macro_input!(input as Foo);
where `Foo` is some type defined by the `syn` crate, and there's one for every production in Rust's grammar. But neither are you limited to those. You can also extend or replace that grammar as you see fit, but in that case the added parsing is on you.


Also, Scala.


And Elixir.


You should list your side projects if they've been influential in some significant way, e.g., if they've developed a large user base, become a dependency of some other noteworthy project, or changed how other people approach similar problems. If you're merely proud of the code, then don't list them directly; instead, pin those repos on your GitHub profile and link to it from your résumé.


How in the world is conducting behavioral research on kernel maintainers to see how they respond to subtly-malicious patches not "human subject research"?


In the restricted sense of Title 45, Part 46, it's probably not quite human subject research (see https://www.hhs.gov/ohrp/regulations-and-policy/regulations/... ).

Of course, there are other ethical and legal requirements that you're bound to, not just this one. I'm not sure which requirements IRBs in the US look into though, it's a pretty murky situation.


How so?

It seems to qualify per §46.102(e)(1)(i) ("Human subject means a living individual about whom an investigator [..] conducting research: (i) Obtains information [...] through [...] interaction with the individual, and uses, studies, or analyzes the information [...]")

I don't think it'd qualify for any of the exemptions in 46.104(d): 1 requires an educational setting, 2 requires standard tests, 3 requires pre-consent and interactions must be "benign", 4 is only about the use of PII with no interactions, 5 is only about public programs, 6 is only about food, 7 is about storing PII and not applicable and 8 requires "broad" pre-consent and documentation of a waiver.


Rather than arguing about the technical details of the law, let me just clarify: IRBs would actively reject a request to review this. It's not in their (perceived) purview.

It's not worth arguing about this; if you care, you can try to change the law. In the meantime, IRBs will do what IRBs do.


If the law, as written, does actually classify this as human research, it seems like the correct response is to sue the University for damages under that law.

Since IRBs exist to minimize liability, it seems like that would be the fastest route toward change (assuming you have legal standing).


Woah woah woah, no need to whip out the litigation here. You could try that, but I am fairly certain you would be unsuccessful. You would be thrown out with "this does not qualify under the law" before it made it to court and it wouldn't have much bearing except to bolster the university.


It obviously qualifies and the guy just quoted the law at you to prove it.

Frankly universities and academics need to be taken to court far more often. Our society routinely turns a blind eye to all sorts of fraudulent and unethical practices inside academia and it has to stop.


That's still 10 thousand words you're linking to…

I had a look at section §46.104 https://www.hhs.gov/ohrp/regulations-and-policy/regulations/... since it mentioned exemptions, and at (d) (3) inside that. It still doesn't apply: there's no agreement to participate, it's not benign, it's not anonymous.


If there's some deeply legalistic answer explaining how the IRB correctly interpreted their rules to arrive at the exemption decision, I believe it. It'll just go to show the rules are broken.

IRBs are like the TSA. Imposing annoyance and red tape on the honest vast-majority while failing to actually filter the 0.0001% of things they ostensibly exist to filter.


Are you expecting that science and institutions are rational? If I were on the IRB, I wouldn't have considered this, since it's not a sociological experiment on kernel maintainers, it's an experiment to inject vulnerabilities into source code. That's not what IRBs are qualified to evaluate.


> it's an experiment to inject vulnerabilities in a source code

I'm guessing it passed for similar reasoning, along with the reviewers being unfamiliar with how "vulnerabilities are injected." To get the bad code in, the researcher needed to have the code reviewed by a human.

So if you rephrase "inject vulnerability" as "sneak my way past a human checkpoint", you might have a better idea of what they were actually doing, and might be better equipped to judge its ethical merit -- and if it qualifies as research on human subjects.

To my thinking, it is quite clearly human experimentation, even if the subject is the process rather than a human individual. Ultimately, the process must be performed by a human, and it doesn't make sense to me that you would distinguish between the two.

And the maintainers themselves express feeling that they were the subject of the research, so there's that.


Testing airport security by putting dangerous goods in your luggage is not human experimentation. Testing a bank's security is not human experimentation. Testing border security is not.

What makes people reviewing the Linux kernel more 'human' than any of the above?


Tell that to the person on the hook if or when they get caught.


It's not an experiment in computer science; these guys aren't typing code into an editor and testing what the code does after they've compiled it. They're contributing their vulnerabilities to a community of developers and testing whether these people accept them. It is absolutely nothing other than a sociological experiment.


Last time I applied for a job through a headhunter (2010), they ran my LaTeX resume through an automated .doc converter that destroyed all the formatting and then didn't even attempt to fix any of it. Somehow I still managed to get some interviews, and when I saw the printout on the interviewer's desk I shrieked in horror and handed him one of the original paper copies that I'd luckily had the foresight to bring with me.

