
Yes, here's the patent. The independent claims are frustratingly broad if you're trying to think through practical NPC world-sim systems.

https://patents.google.com/patent/US20160279522A1/en


I've recently come off a team that was racking up a huge Splunk bill, with ~70 log events for each request on a high-traffic service, and this is all very resonant (except the bit about sampling, which I never gave much thought - reducing our Splunk bill 70x was ambitious enough for me!).

Hadn't heard the "wide event" name, but I had settled on the same idea myself in that time (called them "top-level events" - i.e. we would gather information over the duration of the request and only log it at the "top" of the stack at the end), and evangelised them internally mostly on the basis that they gave you fantastic correlation ability.

In theory, if you've got a trace id in Splunk you can do correlated queries anyway, but we were working in Spring and forever having issues with losing our MDC after cross-thread dispatch, forgetting to copy the MDC's thread-local state across. This wasn't obvious from the top level, and usually only during an incident would you realise you weren't seeing all the log lines you expected for a given trace. So, absent a better solution there, tracking debug info more explicitly was appealing.

Also used these top-level events to store sub-durations (e.g. for calling downstream services, invoking a model, etc.), and with Splunk, if you record not just the length of a sub-process but also its absolute start, you can reconstruct a hacky waterfall chart of where time was spent in your query.
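
To make that concrete, here's a rough sketch (in Python, with made-up field names - not what we actually ran) of the shape of one of those top-level events: accumulate fields over the request, record each sub-process's offset from the start alongside its duration, and emit a single wide log line at the end.

import json
import time

class TopLevelEvent:
    # Accumulates fields over the life of a request; emitted once at the end.
    def __init__(self, request_id):
        self.start = time.monotonic()
        self.fields = {"request_id": request_id, "spans": []}

    def add(self, **kwargs):
        self.fields.update(kwargs)

    def record_span(self, name, fn):
        # Record both the offset from the request start and the duration,
        # so a hacky waterfall can be reconstructed from the one event.
        span_start = time.monotonic()
        try:
            return fn()
        finally:
            self.fields["spans"].append({
                "name": name,
                "start_ms": round((span_start - self.start) * 1000, 1),
                "duration_ms": round((time.monotonic() - span_start) * 1000, 1),
            })

    def emit(self):
        self.fields["total_ms"] = round((time.monotonic() - self.start) * 1000, 1)
        print(json.dumps(self.fields))  # the single wide event for this request

event = TopLevelEvent(request_id="abc-123")
event.record_span("downstream_call", lambda: time.sleep(0.05))
event.add(user_tier="premium", cache_hit=False)
event.emit()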


What's the right way?


UTF-8

When D was first implemented, circa 2000, it wasn't clear whether UTF-8, UTF-16, or UTF-32 was going to be the winner. So D supported all three.


UTF-8, for essentially the reasons mentioned in this manifesto: https://utf8everywhere.org/


Yep. Notably supported by Go, Python 3, Rust and Swift. And probably all new programming languages created from here on.


I would say anyone mentioning a specific encoding / size just wants to see the world burn. Unicode is variable-length on various levels: how many people want to deal with the fact that the Unicode of their text could be non-normalized, or want the ability to cut out individual "char" elements only to get a nonsensical result because the following elements were logically connected to that char? Give developers a decent high-level abstraction and don't force them to deal with the raw bits unless they ask for it.


I think this is what Rust does, if I remember correctly: it provides APIs on strings to enumerate the characters accurately - meaning, not necessarily byte by byte.


https://pastebin.com/raw/D7p7mRLK

My comment in a pastebin. HN doesn't like unicode.

You need this crate to deal with it in Rust; it's not part of the standard library:

https://crates.io/crates/unicode-segmentation

The languages that have this kind of feature built into the standard library, to my knowledge, are Swift, JavaScript, C# and Java. Swift is the only one of those four that treats operating on graphemes as the default. JavaScript requires Intl.Segmenter, C# requires StringInfo, Java requires BreakIterator.

By the way, Python - the language that caused so much hurt with its 2.x->3.x transition, promising better Unicode support in return for the pain - couldn't even get this right. There is no concept of graphemes in the standard library. So much for the batteries-included bit.

>>> test = " "
>>> [char for char in test]
['', '\u200d', '', '\u200d', '', '\u200d', '']
>>> len(test)
7

In JavaScript REPL (nodejs):

> let test = " "
undefined
> [...new Intl.Segmenter().segment(test)][0].segment;
' '
> [...new Intl.Segmenter().segment(test)].length;
1

Works as it should.

In Python you would need a third-party library.
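
For example, with the third-party regex module (pip install regex; the grapheme package is another option) - and assuming a ZWJ family emoji like the one in my pastebin, since HN strips it:

import regex  # third-party; supports \X for extended grapheme clusters

test = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"  # assumed family emoji
print(len(test))                        # 7 code points
print(len(regex.findall(r"\X", test)))  # 1 grapheme cluster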

Swift is truly the nicest of programming languages as far as strings are concerned. It just works, the way it always should have.

let test = " "
for char in test {
    print(char)
}
print(test.count)

Output:

1

[Execution complete with exit code 0]

I, as a non-Apple user, feel quite the Apple envy whenever I think about Swift. It's such a nice language, but there's little ecosystem outside of Apple UIs.

But man, no third-party libraries, no wrapper segmenter class or iterator. Just use the base string type as is. It. Just. Works.


For context, it looks like you’re talking about iterating by grapheme clusters.

I understand how iterating through a string by grapheme clusters is convenient for some applications. But it's far from obvious to me that doing so should be the language's default. Dealing with grapheme clusters requires a Unicode database, which needs to live somewhere and needs to be updated continuously as Unicode grows. (Should Rust statically link that database into every app that uses it?)

Generally there are 3 ways to iterate a string: by UTF-8 bytes (or UTF-16 code units, as in Java/JS/C#), by Unicode code point, or by grapheme cluster. UTF-8 encoding comes up all the time when encoding / decoding strings - like to JSON, or when sending content over HTTP. Code points are, in my opinion, the correct approach when doing collaborative editing or patching strings. And grapheme clusters are useful in frontend user interfaces - like when building a terminal.

Of those 3 iteration methods, I've personally used UTF-8 encoding the most and grapheme clusters the least. Tell me - why should grapheme clusters be the default way to iterate over a string? I can see the argument in Swift, which is a language built for frontend UI. But in a systems language like Rust? That seems like a terrible default to me. UTF-8 bytes are by far the most useful representation for strings in systems code, since from the point of view of systems code, strings are usually just data.
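
Concretely, the three levels look something like this in Python terms (the grapheme case needs a third-party library such as regex here; the emoji is just a stand-in ZWJ sequence):

import regex  # third-party, only needed for the grapheme-cluster case

s = "e\u0301" + "\U0001F469\u200d\U0001F4BB"  # 'e' + combining acute, plus a ZWJ emoji sequence

utf8_bytes = list(s.encode("utf-8"))   # level 1: UTF-8 bytes
codepoints = list(s)                   # level 2: Unicode code points
graphemes  = regex.findall(r"\X", s)   # level 3: grapheme clusters

print(len(utf8_bytes), len(codepoints), len(graphemes))  # 14 5 2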


This was not meant as criticism of Rust in particular (though, while it shouldn't be the default behaviour of strings in a systems language, surely at least an official implementation of a wrapper should exist?). But high-level languages with a ton of baggage, like Python, should definitely provide the correct way to handle strings. The amount of software I've seen that can't handle strings properly because the language didn't provide the required grapheme handling and the developer was also not aware of the reality of graphemes and Unicode..

You mention terminals - yes, that's one of the areas where graphemes are an absolute must - but pretty much any time you are going to do something to text, like deciding "I am going to put a line break here so that the text doesn't overflow beyond the box, beyond this A4 page I want to print, beyond the browser's window", grapheme handling is involved.

Any time a user is asked to input something, too. I've seen most software take the "iterate over characters" approach to real-time user input, and they break things like those emojis down into individual components whenever you paste something in.

For that matter, backspace doesn't work properly in software you would expect to do better than that. Put the emoji from my pastebin in Microsoft Edge's search/URL bar, then hit backspace and see what happens. While the browser displays the emoji correctly, the input field treats it the way Python segments it in my example: you need to press backspace 7 times to delete it. 7 times! Windows Terminal, on the other hand, has the quirk of showing a lot of extra spaces after the emoji (despite displaying the emoji correctly too) and will also require 11 backspaces to delete it.

Notepad handles it correctly: press backspace once, it's deleted, like any normal character.

> Of those 3 iteration methods, I’ve personally used UTF8 encoding the most and grapheme clusters the least.

This doesn't say anything about grapheme clusters being useless. I've cited examples of popular software doing the wrong thing precisely because, like you, they didn't iterate over grapheme clusters. That you never use grapheme iteration might say more about you than it says about grapheme iteration being unneeded.

The dismissiveness towards saner string handling as a standard is not unlike C++ developers pretending that developers do the right thing with memory management, so we don't need a GC (or Rust's ownership paradigm). Nonsense.


Those are good examples! Notably, all of them are in reasonably low-level, user-facing code.

Your examples are implementing custom text input boxes (Excel, Edge), line breaks while printing, and implementing a terminal application. I agree that in all of those cases, grapheme cluster segmentation is appropriate. But that doesn't make grapheme cluster based iteration "the correct way to handle strings". There's no "correct"! There are at least 3 different ways to iterate through a string, and different applications have different needs.

Good languages should make all of these options easy for programmers to use when they need them. Writing a custom input box? Use grapheme clusters. Writing a text-based CRDT? Treat a string as a list of Unicode code points. Writing an HTTP library? Treat the headers and HTML body as ASCII / opaque bytes. Etc.

I take the criticism that Rust makes grapheme iteration harder than the others. But eh, Rust has truly excellent crates for that within arm's reach. I don't see any advantage in moving grapheme-based segmentation into std. Well, maybe it would make it easier to educate idiot developers about this stuff. But there's no real technical reason. It's situationally useful - but less useful than lots of other third-party crates like rand, tokio and serde.

> like you, they didn't iterate over grapheme clusters. That you never use grapheme iteration might say more about you than it says about grapheme iteration being unneeded.

It says that in 30+ years of programming, I've never programmed a text input field from scratch. Why would I? That's the job of the operating system. Making my own sounds like a huge waste of time.


I've used it in anger on a project with an unreasonable start time in its test suite. Putting an alert sound on VS breakpoint hits let me start a test with the debugger attached, go do something else for the time it would take to start, and not forget what I had originally been doing.


I still haven't settled the nature of free will and determinism, aka who's actually controlling the trains, them or their drivers?


I know the article was about movies that persuaded viewers to _go_ to a certain country, but I am reminded of a former colleague who explained to me that it was actually a TV show ("The Mechanism", a drama about political corruption in Brazil) that finally convinced him he needed to take his young family and _leave_ Brazil for a better future.

Of course I had to give it a watch after a review like that. Good show.


This was the best I could find: https://www.invasive.org/gist/products/handbook/20.triclopyr... (see Mode of Action)


> The Prime Minister of Canada just did a state visit to India, where he was not warmly received, and made to look foolish on the world stage.

Have you considered the possibility that the reason he got a cold reception was that he brought with him the accusation he has just made publicly? Would put the issue with the plane in a whole new light too...


I have fond memories of the early DOS version (1.something). Its tutorial was a masterclass (it included full-screen ASCII graphics illustrating the scenarios), and as a kid playing on my father's laptop I went through it several times. I particularly enjoyed the database tutorial, where you worked on WHODUNIT.WDB to solve the mystery of a murder at a ski slope by narrowing down the customers against clues...


Microsoft Works 2.0 also had the tutorial! I don't remember the ASCII graphics - maybe my IBM PS/2 didn't show them, or maybe I have brain damage. (edit: was there a mountain picture?)

It ruined my expectations for help documents forevermore: nothing was as cool.


The graphics weren't elaborate; I just remember things like the illustration of a storefront before you worked on an inventory spreadsheet, or the ASCII picture of a telephone and modem before the "Communications" tool (i.e. telnet) section.


Somewhat (by a few decades) behind the times here, but I finally found a set of Robert Caro's Years of Lyndon Johnson books at a 2nd-hand shop earlier this year. What a ride. I know it's a famous series, but I'm outside the US and so only recently became aware of its existence, and therefore also went in mostly blind as to the subject (I knew very little of LBJ prior). Am partway through the 2nd book now, at the conclusion of the 1948 Texas Democratic Senate primary, and my mouth just sort of hung open for pages at a time during that.

Caro's a talented writer, but what really shows through is just the sheer years of hard work he clearly put into the books. I don't know how one can focus for so many years on just one writing project.


Just started Master of the Senate. Caro's books have reignited my love of reading. Cannot recommend his books enough for anyone interested in how power in organizations or government is culled and cultivated. His books read like thrillers, but are exhaustively researched and sourced.

Sony Pictures recently released a movie about the relationship and working process of Caro and his editor, Robert Gottlieb, who recently passed away: https://www.youtube.com/watch?v=gv3CRojrbeE


Thanks, was unaware of the movie!


In that genre, his book _The Power Broker_ is masterful as well, and is sometimes considered the superior work, although it's about a much lesser-known person.


I've been on the look-out for that too. Have lined up Mike Royko's "Boss" on Daley in the interim, though the bookshop owner wasn't as fond of Royko's writing style...

