tenaf0's comments | Hacker News

Just a tiny notice: neither the GitHub page nor the website seems to currently contain an “installation” link. The one found by Google returns a ‘page not found’ for the current version.


Well, code size is another interesting aspect here. A JIT compiler can effectively create any number of versions of a hot method, based on even very aggressive assumptions (an easy one would be that a given object is non-null, or that an interface has only a single implementation loaded). The checks guarding these assumptions are cheap (e.g. a null check can be encoded as a trap on an invalid page address), and the cost of invalidation is amortized.
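
To make the speculation part concrete, here is a minimal Java sketch (all names are mine, and the exact compilation strategy is VM-dependent): as long as Circle is the only loaded implementation of Shape, a JIT like HotSpot can treat the interface call in the loop as a direct, inlinable call guarded by a cheap check, and loading a second implementation later merely deoptimizes and recompiles.

```java
interface Shape { double area(); }

final class Circle implements Shape {
    private final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

public class Devirt {
    // While Circle is the only Shape implementation loaded, the JIT can
    // compile this call site as a direct (even inlined) call to
    // Circle.area(). An AOT compiler would have to emit the fully
    // virtual dispatch up front, or specialize and hope.
    static double sum(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) total += s.area();
        return total;
    }

    public static void main(String[] args) {
        Shape[] shapes = new Shape[1_000_000];
        for (int i = 0; i < shapes.length; i++) shapes[i] = new Circle(i);
        System.out.println(sum(shapes));
    }
}
```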

Contrast this with the problem of specialization in AOT languages, which can easily result in bloated binaries (though PGO does help here quite a lot, that much is true). For example, generics might emit a completely new function for every type they are instantiated with; if the function is not that hot, it actually makes more sense to handle more cases with the same code, as the sketch below illustrates.
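
A hedged illustration of that trade-off in Java terms: because of erasure, one generic method body serves every instantiation, which is exactly the “same code for more cases” end of the spectrum, while monomorphizing compilers (C++ templates, Rust generics) sit at the other end and emit a specialized copy per type.

```java
import java.util.List;

// This compiles (and JITs) to a single erased method that handles every
// element type. A monomorphizing AOT compiler would instead emit one
// specialized copy per instantiation, trading binary size for
// per-type optimization.
public class Erasure {
    static <T> T first(List<T> xs) {
        return xs.get(0);
    }

    public static void main(String[] args) {
        System.out.println(first(List.of("a", "b"))); // same compiled code...
        System.out.println(first(List.of(1, 2)));     // ...for both calls
    }
}
```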


I do think that, in the general case, a JIT compiler is required: you can’t make every program fast without the ability to synthesize new code based on information only available at runtime. There are many programs for which AOT is more than enough, but not all of them are like that. Note that this doesn’t preclude AOT/hybrid models, as pjmlp correctly says.

One stereotypical (but not the best) example would be regexes: you basically want to compile an AST into a mini-program. This can also be done with a tiny interpreter, without a JIT, and still be quite competitive in speed. I believe that’s what Rust has, and it’s indeed one of the fastest; the advantage of this problem/domain is that you really can have tiny interpreters that use the caches efficiently, with very little overhead on today’s CPUs. But I am quite sure that a “JITted Rust” with all the other optimizations/memory layouts could potentially fare even better, though of course that’s no trivial amount of additional complexity.
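
To show what “compile the AST into a mini-program” means in practice, here is a toy Java regex VM, hand-compiled for the pattern a+b; the opcode set and encoding are purely illustrative, not any real engine’s:

```java
// A toy regex "mini-program" interpreter, hand-compiled for "a+b".
public class TinyRegexVm {
    static final int CHAR = 0, SPLIT = 1, MATCH = 2;

    // Program for "a+b":
    //   0: CHAR 'a'
    //   1: SPLIT 0, 2   (loop back for another 'a', or fall through)
    //   2: CHAR 'b'
    //   3: MATCH
    static final int[][] PROG = {
        {CHAR, 'a'}, {SPLIT, 0, 2}, {CHAR, 'b'}, {MATCH}
    };

    // A simple backtracking interpreter. Production engines use
    // non-backtracking VMs or DFAs instead, but the "tiny loop over
    // opcodes" shape that plays nicely with caches is the same.
    static boolean run(int pc, String s, int sp) {
        while (true) {
            int[] op = PROG[pc];
            switch (op[0]) {
                case CHAR:
                    if (sp == s.length() || s.charAt(sp) != op[1]) return false;
                    sp++; pc++; break;
                case SPLIT:
                    if (run(op[1], s, sp)) return true;
                    pc = op[2]; break;
                case MATCH:
                    return sp == s.length();
                default:
                    throw new IllegalStateException("bad opcode");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(run(0, "aaab", 0)); // true
        System.out.println(run(0, "ab", 0));   // true
        System.out.println(run(0, "b", 0));    // false
    }
}
```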


Last time I checked, it couldn’t handle expressions that are not just consecutive tokens, for example German separable verbs. I tried fixing it here: https://news.ycombinator.com/item?id=38915786


(Misunderstood the question, please ignore my above comment)


I have been working on a similar project on and off in my spare time. The only remotely interesting feature that other similar software may not have is that it actually tries to parse/analyze sentences (with an NLP lib). It's made specifically for German, and the reason I wanted to make it is that no existing software managed to handle separable verbs properly. For example, learning "Wir fangen jetzt an." is just wrong if you learn 'fangen' and 'an' separately; dictionary-wise, what you actually care about is 'anfangen'.

It unfortunately does have false positives (a complete solution would, I believe, require LLMs rather than the much less complicated NLP algorithms; I just don't want to send whole books to ChatGPT, as that would quickly become expensive), but I found it usable, so I have made it public now: https://github.com/tenaf0/lwt

I don't want to "advertise" it any more than this, as the NLP lib is run by academia as a free service and I don't want to overburden it (I have been planning to host it myself, but haven't gotten there yet).


You have my full support for your project, as I think natural language processing is a very exciting and underutilised technology for language learning. But if you want a low-tech solution, I've found Wiktionary to be ideal. Wiktionary has all the declensions and prefixes for German verbs; to use your example:

https://en.wiktionary.org/wiki/f%C3%A4ngt_an

tells you what the word is, and gives a link back to:

https://en.wiktionary.org/wiki/anfangen#German

I chose to add Wiktionary to Kiwix Android (an 8GB download) for offline use. In addition, I can search by right-clicking or tap-and-holding on a word. All that information is available because of the (mostly manual) work done by Wiktionary contributors, and it reaches a very high standard. There is usually more digression and explanation in the usage notes on Wiktionary than in, say, the Collins German-English dictionary, which is rather a good thing for language learners.


FWIW, English Wiktionary appears (!) to have fewer words than German Wiktionary. I've run into this while trying to extract words from eBooks (then converting them to the "base" form, essentially to de-duplicate). I think it's mostly compound or more niche words, but I imagine you'd still run into them at least occasionally with most written works.

There's a nice project for converting and extracting the data from English Wiktionary into JSON, but it doesn't support any other languages, AFAIK, which is a bit of a shame but also not very surprising: Wiktionary is a lot more complex, technically, than I expected!


Interesting to hear that - I'm still at the level of German where I wouldn't know what I'm missing. For clarification: are you saying that:

- the English Wiktionary has fewer English words than the German Wiktionary has German words, or

- the English Wiktionary has fewer German words than the German Wiktionary does?


The latter. I'm definitely not at that level either, but when I looked up German words from books that couldn't be found on English Wiktionary, I was able to find them on German Wiktionary. One example would be "Weihnachtsfest". I'm not sure it's "officially" a compound word, though if you know "Weihnacht" and "Fest", then the meaning should be clear. In any case, it shows up as a single word, and trying to "split" words made up of other words is an exercise in insanity.

Another example is "krächzender", which might also give some idea of the particular pains of processing German text. It's not in English Wiktionary, but "krächzen" is, as a verb. So "krächzender" is a declined adjectival form of the verb, and if you know "krächzen" and the general rules around adjective formation, it would probably be obvious. But would you rely on a computer to parse those rules (a naive version of that parsing is sketched below), or would you want a table with all the declensions laid out? And if you're building a vocab list for a book, is it a separate entry in the list, or does it fall under the verb?
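
For what it's worth, "parsing those rules" for this particular pattern is not much code, though it generalizes badly. A hypothetical Java sketch (the endings list is nowhere near complete):

```java
// Naive de-inflection for participial adjectives like "krächzender":
// strip the declension ending, then turn the present participle back
// into the infinitive. Real German morphology has far more cases.
public class Deinflect {
    static String toInfinitive(String adjective) {
        String stem = adjective;
        // krächzender/krächzendes/... -> krächzend
        for (String ending : new String[]{"er", "es", "em", "en", "e"}) {
            if (stem.endsWith("end" + ending)) {
                stem = stem.substring(0, stem.length() - ending.length());
                break;
            }
        }
        if (stem.endsWith("end")) {
            // present participle "krächzend" = infinitive "krächzen" + "d"
            return stem.substring(0, stem.length() - 1);
        }
        return adjective; // not a pattern we recognize
    }

    public static void main(String[] args) {
        System.out.println(toInfinitive("krächzender")); // krächzen
    }
}
```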

Obviously, German Wiktionary only has definitions and explanations in German, so it's not great for beginners, but any tool that tries to automatically do stuff with German text would likely benefit from using German Wiktionary.

I have no idea whether the same holds elsewhere, but I wouldn't be surprised if it's also true for other major languages spoken by Wikipedia users (e.g. French or Spanish, though maybe not Chinese).


Interesting! I have a partially built related tool to extract "words" from e-books, so I could build flashcard lists and make sure I knew the majority of the words used; most of them are common words, but every book has a decently sized selection of specialised vocabulary. I did think about trying to get something fancy done with an LLM or NLP for figuring out the separable verbs, but in the end I took a very... brute-force approach: basically grabbing the final word in the "phrase", then prepending it to every other word in the phrase one by one and asking "is this a known separable verb?" (roughly like the sketch below). I'm not sure how well it worked, but that's a different story.
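
Something like this, in Java (heavily simplified: the known-verbs set stands in for a real dictionary, and the conjugated verb would first need lemmatizing so that the concatenation lines up):

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Brute-force separable-verb detection as described above: take the
// clause's final word as a candidate prefix, glue it onto every other
// word, and check the result against a list of known verbs.
public class SeparableVerbs {
    static final Set<String> KNOWN_VERBS = Set.of("anfangen", "aufstehen");

    static Optional<String> findSeparableVerb(List<String> clause) {
        if (clause.size() < 2) return Optional.empty();
        String prefix = clause.get(clause.size() - 1); // e.g. "an"
        for (int i = 0; i < clause.size() - 1; i++) {
            String candidate = prefix + clause.get(i); // e.g. "anfangen"
            if (KNOWN_VERBS.contains(candidate)) return Optional.of(candidate);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // "Wir fangen jetzt an.": "fangen" pre-lemmatized to the
        // infinitive so the string concatenation works out.
        System.out.println(findSeparableVerb(List.of("wir", "fangen", "jetzt", "an")));
        // prints Optional[anfangen]
    }
}
```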


You could potentially use an NLP library like spaCy, or even bundle it with a free fine-tuned LLM like Mistral 7B.

Fine-tuned Mistral models are known to outperform GPT-4 on their specific tasks.


Shameless plug of my similar project: https://github.com/tenaf0/rust-jvm3


I would like to receive one as well if you have any left, at my username@pm.me.


There is also Google’s J2CL, which can compile Java to JS (it is built to integrate with the Closure Compiler).


Thank you very much for checking it out! I was afraid no one would take a look because it’s not particularly interesting on the surface.

Regarding verbosity: the Foreign Function & Memory interface is deliberately very low-level; its designers only want to expose a safe and performant interface on which further, handier abstractions can be built, which is IMO the correct decision here.

I sort of started to build such a “lib” in an ad hoc way (see the CWrapper class), for error handling in particular. Type safety would be another point, but that is also not yet a concern at this level for this JDK project.
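
For a flavor of what such a wrapper can look like, here is a hedged sketch using the Java 22 FFM API: an errno-capturing downcall to open(2) that turns a -1 return into an exception. This is my own simplification, not the project’s actual CWrapper code.

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.VarHandle;

public class CWrapperSketch {
    static final Linker LINKER = Linker.nativeLinker();
    // Layout of the per-call captured state (holds "errno" on Linux).
    static final StructLayout CAPTURED = Linker.Option.captureStateLayout();
    static final VarHandle ERRNO =
        CAPTURED.varHandle(MemoryLayout.PathElement.groupElement("errno"));

    // int open(const char *pathname, int flags)
    static final MethodHandle OPEN = LINKER.downcallHandle(
        LINKER.defaultLookup().find("open").orElseThrow(),
        FunctionDescriptor.of(ValueLayout.JAVA_INT,
                              ValueLayout.ADDRESS, ValueLayout.JAVA_INT),
        Linker.Option.captureCallState("errno"));

    static int openOrThrow(String path, int flags) throws Throwable {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment state = arena.allocate(CAPTURED);
            // With captureCallState, the state segment is prepended to
            // the argument list of the downcall handle.
            int fd = (int) OPEN.invokeExact(state, arena.allocateFrom(path), flags);
            if (fd < 0) {
                throw new RuntimeException("open(" + path + ") failed, errno="
                        + (int) ERRNO.get(state, 0L));
            }
            return fd;
        }
    }

    public static void main(String[] args) throws Throwable {
        System.out.println("fd = " + openOrThrow("/etc/hostname", 0 /* O_RDONLY */));
    }
}
```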

Thanks for the feedback; I will document it better and get back to you! Unfortunately, the Linux user space is criminally under-documented, so I would also have trouble clearly explaining some of the concepts/libs I actually make use of; I had to read a lot of code from wlroots/weston to make it work. So, do not quote me on the following description!

Libseat in particular is responsible for seat and session management, where a seat is roughly one physical point of access to a computer (usually there is just the one), while remote logins, such as an active SSH connection, show up as sessions without a seat. A seat can have multiple sessions attached to it, which happens when you log into a virtual terminal or start up a display manager. You can query these through the `loginctl` utility (part of systemd).

Maybe the most relevant part of this interface is that the event nodes of connected devices (mouse, keyboard, etc.) under /dev/input are only readable by root, and one shouldn’t really be running their DE with sudo and such. In the old days this was solved by the X server itself having elevated permissions, with WMs communicating with it only through an unprivileged protocol; but rootless X servers have been possible for years now, and Wayland compositors (as is this project) likewise run as simple, unprivileged programs. This is made possible by opening these “event files” through libseat, which grants access to them while the given program owns the active session and revokes access when you switch to another session. E.g. when you press Ctrl+Alt+Fn, your DE’s access to the input devices is revoked and handed to the virtual terminal, and vice versa. (Actually, I failed to find any way to query the currently active session; apparently it is only possible through D-Bus, which I didn’t want to depend on for this project, so I make use of loginctl in a quite hacky way...)

Regarding the DRM part: it currently only supports “dumb buffers”, which any video card supports (and which are the replacement for the old fbdev interface, if I’m not mistaken, so on a modern kernel this is how your boot logs are already being displayed). Digging a little deeper into the DRM lib could give one an OpenGL context, and Skia can make use of that, so it would be a relatively small undertaking, but I wanted to get an interactive demo working first.

Actually, I want(ed) to make a whole toy Wayland compositor out of this, even reimplementing libwayland’s server side so that I could test a new virtual-thread-per-application model with Loom (not sure if it would be any better, though), but one step at a time (plus I have plenty of other side-project ideas :D).


Very cool, thanks for the info and submission!


I never liked these arguments. We can’t live without some trust; it is simply impossible, the same way we can’t avoid all risks in life. Putting our heads in the sand doesn’t solve anything.

As for the concrete languages mentioned, Java is probably the safest bet among managed languages. Not only does it have a proper specification (for both the language and the JVM); it is such a critical piece of infrastructure that any one of multiple companies could carry it forward single-handedly. Also, even from an incentives point of view it doesn’t make sense to put backdoors or whatever in it, as these companies themselves use it very heavily, so each big company is in effect “checking” the others.

