
I was super excited about this, but digging through the release [0], one can see the following [1]. While using Bible translations is indeed better than nothing, I don't think the stylistic choices in the Bible are representative of how people actually speak the language, in any of the languages I can speak (i.e., the ones I am able to evaluate personally).

Religious recordings tend to be liturgical, so even the pronunciation might differ from everyday speech. They do address something related, though, to my understanding, more from a vocabulary perspective [2].

So one of their stated goals, enabling people to talk to AI in their preferred language [3], may now be closer, but it is still a stretch to achieve with the chosen dataset.

[0]: https://about.fb.com/news/2023/05/ai-massively-multilingual-...

[1]: > These translations have publicly available audio recordings of people reading these texts in different languages. As part of the MMS project, we created a dataset of readings of the New Testament in more than 1,100 languages, which provided on average 32 hours of data per language. By considering unlabeled recordings of various other Christian religious readings, we increased the number of languages available to more than 4,000. While this data is from a specific domain and is often read by male speakers, our analysis shows that our models perform equally well for male and female voices. And while the content of the audio recordings is religious, our analysis shows that this doesn’t bias the model to produce more religious language.

[2]: > And while the content of the audio recordings is religious, our analysis shows that this doesn’t bias the model to produce more religious language.

[3]: > This kind of technology could be used for VR and AR applications in a person’s preferred language and that can understand everyone’s voice.
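
For anyone who wants to poke at the released models directly: a minimal ASR sketch, assuming the Hugging Face MMS checkpoint facebook/mms-1b-all and a 16 kHz mono waveform in audio_array (the variable name and the choice of French are my assumptions, not from the announcement):

  import torch
  from transformers import Wav2Vec2ForCTC, AutoProcessor

  model_id = "facebook/mms-1b-all"
  processor = AutoProcessor.from_pretrained(model_id)
  model = Wav2Vec2ForCTC.from_pretrained(model_id)

  # Swap in the per-language adapter, e.g. French ("fra")
  processor.tokenizer.set_target_lang("fra")
  model.load_adapter("fra")

  # audio_array: a 16 kHz mono float waveform (assumed to exist)
  inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
  with torch.no_grad():
      logits = model(**inputs).logits
  ids = torch.argmax(logits, dim=-1)[0]
  print(processor.decode(ids))

Loading a small language-specific adapter on top of the shared 1B model is how MMS covers so many languages without shipping a separate full model per language.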



This paragraph justifies the decision, IMHO.

> Collecting audio data for thousands of languages was our first challenge because the largest existing speech datasets cover 100 languages at most. To overcome this, we turned to religious texts, such as the Bible, that have been translated in many different languages and whose translations have been widely studied for text-based language translation research.

I think the choice was simply between having any data at all and not being able to support that language.


While, AFAIU, the UN's Universal Declaration of Human Rights (UDHR) is the most-translated document in the world, there has likely been less subjective translation analysis of the UDHR than of the religious texts used here for training.

Awesome-legal-nlp links to benchmarks like LexGLUE and FairLex, but not yet LegalBench; relevant in re: AI alignment and ethics / regional law: https://github.com/maastrichtlawtech/awesome-legal-nlp#bench...

A "who hath done it" exercise:

[For each of these things, tell me whether God, Others, or You did it:] https://twitter.com/westurner/status/1641842843973976082?

"Did God do this?"


The UN's UDHR likely yields only minutes of audio data in any given language, which is close to useless as training data.
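
Rough numbers (the word counts and read-aloud pace below are my assumptions, not measured figures), comparing the UDHR against the New Testament recordings MMS used, which the announcement says average ~32 hours per language:

  # Back-of-envelope sketch; all constants are assumptions:
  # ~1,800 words for the English UDHR, ~180,000 words for the
  # New Testament, ~140 words per minute read aloud.
  udhr_words, nt_words, wpm = 1_800, 180_000, 140

  print(udhr_words / wpm)       # ~13 minutes of UDHR audio
  print(nt_words / wpm / 60)    # ~21 hours of New Testament audio

Either way you slice it, the UDHR gives minutes where MMS-scale training wants tens of hours per language.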


Compared to, e.g., religious text translation, I don't know how much subjective analysis there is of UDHR translations. The text is pretty cut and dried: e.g., "equal protection of equal rights" is fairly unambiguous.

"About the Universal Declaration of Human Rights Translation Project" https://www.ohchr.org/en/human-rights/universal-declaration/... :

> At present, there are 555 different translations available in HTML and/or PDF format.

Buddhist scriptures, e.g., have also been translated many times over, probably with more coverage in East Asian languages.

Thomas Jefferson, who wrote the US Declaration of Independence, had read into Transcendental Buddhism and, FWIU, is thus significantly responsible for the religious (and nonreligious) freedom we appreciate in the United States today.


Of course, my critique is less "this project shouldn't exist" and more "there seem to be several biases that affect the project's performance in the context it was presented in".

This is a great project and an important stepping stone toward a multilingual AI future.


> I don't think the stylistic choices in the Bible are representative of how people actually speak the language, in any of the languages I can speak

The cadence and intonation sounded a little weird, but I suspect fine-tuning can improve that by a lot. I am really excited to see some low-resource languages finally get mainstream TTS support at all.
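
For TTS specifically, a minimal sketch, assuming the Hugging Face MMS-TTS checkpoints (facebook/mms-tts-eng here; the per-language checkpoints are suffixed with ISO 639-3 codes):

  import torch
  from transformers import VitsModel, AutoTokenizer

  # Per-language VITS checkpoint; swap "eng" for another ISO 639-3 code
  model = VitsModel.from_pretrained("facebook/mms-tts-eng")
  tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

  inputs = tokenizer("Hello from MMS", return_tensors="pt")
  with torch.no_grad():
      waveform = model(**inputs).waveform  # shape (1, num_samples)

  # Sample rate needed for writing/playing the audio
  print(model.config.sampling_rate)

Fine-tuning a per-language checkpoint on a small amount of conversational speech seems like the obvious next step for fixing the liturgical cadence.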



