Can you trust Tatoeba sentences?

To quickly assess the trustworthiness of a sentence, Kanjiverse is introducing a 5-star rating system based on its author's fluency and the sentence's difficulty. No more risk of learning sentences that do not sound natural!

If you are using any of the free dictionary apps or websites out there, chances are the sample sentences all come from the same source: the Tatoeba Project. Unfortunately, these apps usually forget to show an important disclaimer about how those sentences were compiled.

Here is the catch: anyone can freely contribute to Tatoeba, so the quality of the sentences and translations can vary widely depending on their author's fluency and the source of those sentences.

To remedy this issue, Kanjiverse is introducing a new 5-star rating system to help you find out how trustworthy a sentence is! Check it out now in the app or read on to learn more about what to be aware of when learning Japanese with Tatoeba sentences.

Who is using Tatoeba sentences?

I am not going to single out any one app: I have tried a dozen of the most popular ones and they all use Tatoeba as their source. If you are not paying $20~40 for a brand-name dictionary, your app is almost certainly using the trifecta: KANJIDIC for the kanji, JMdict for the definitions, and Tatoeba for the sentences. Kanjiverse is no exception... for now ;)

How are Tatoeba sentences compiled?

Tatoeba is a collection of sentences and translations contributed by volunteers under free-to-reuse licenses (CC BY 2.0 FR or CC0 1.0). The corpus was built and is constantly updated as follows:

  • Anyone is free to submit new sentences or provide translations to existing sentences.
  • Each contributor has a profile page where they can list the languages they know and how fluent they are on a scale from 0 to 5.
  • Members can review, tag, and comment on any sentence.

What are the caveats of the Tatoeba corpus?

While Tatoeba can be a great source of free learning material, there are many pitfalls one should be aware of:

  • Anyone can submit a sentence even if they are not a native speaker of that language. They might have made up the sentence themselves or copied it from an unreliable source without knowing whether it sounds natural to a native speaker.
  • Anyone can translate a sentence even if their skills are limited in either or both the source and target languages.
  • Translations can be indirect – they are translations of other translations – which increases the odds of drifting further away from the original sense.
  • Contributors self-assess their language skills and might overestimate their abilities or not even specify their level.
  • Most of the Japanese sentences came from the Tanaka Corpus. Owing to the way it was compiled – by students translating textbook sentences – it contains a large number of unnatural sentences and unreliable translations. See this article for more details about the issue.

Can we assess the trustworthiness of a sentence?

Although most apps do not include any of this information, the Tatoeba website does provide some ways to mitigate the shortcomings listed above:

  • Sentences have a link to their author's profile where we can see if they are native (5 stars) or fluent (4 stars) in that language.
  • Indirect translations are marked as such.
  • Anonymously published sentences can be adopted by a contributor who can proofread them and confirm if they are natural-sounding.
  • Sentences can be marked with an "OK" tag to indicate that they have been reviewed.

What is the current state of the corpus?

Here are some statistics I have collected from a recent snapshot (August 6, 2022):

  • Number of Japanese sentences: 227,532
  • Tanaka Corpus sentences: 148,983 (65% of all sentences!)
  • Sentences with non-anonymous author: 122,152
  • Sentences with self-assessed user level: 114,456
  • Sentences whose author is fluent or native: 108,816
  • Sentences whose author is native: 105,396
  • Sentences marked as OK: 1,666
  • All English translations: 250,252
  • Direct English translations: 163,902
  • Translations with non-anonymous author: 62,166
  • Translator is fluent in English: 52,176
  • Translator is at least intermediate in English and Japanese: 10,246
  • Translator is at least advanced in English and Japanese: 8,349
  • Translator is fluent in English and Japanese: 4,177

Out of 250,252 pairs of Japanese/English sentences, only 3,705 (about 1.5%) have both an author and a translator fluent in Japanese and English.
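
If you want to double-check some of the proportions behind these numbers, they are easy to recompute from the counts in the list. Here is a minimal sketch in Python; the variable and function names are made up for this example, and the counts are copied as-is from the snapshot above.

```python
# Counts copied from the snapshot statistics (August 6, 2022).
japanese_sentences = 227_532
tanaka_sentences = 148_983
non_anonymous_sentences = 122_152
native_author_sentences = 105_396
english_translations = 250_252
direct_translations = 163_902


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Shares of the Japanese sentences.
tanaka_share = share(tanaka_sentences, japanese_sentences)
non_anonymous_share = share(non_anonymous_sentences, japanese_sentences)
native_share = share(native_author_sentences, japanese_sentences)

# Share of the English translations that are direct.
direct_share = share(direct_translations, english_translations)

print(f"Tanaka Corpus sentences: {tanaka_share}%")
print(f"Non-anonymous author:    {non_anonymous_share}%")
print(f"Native author:           {native_share}%")
print(f"Direct translations:     {direct_share}%")
```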

How is Kanjiverse's trustworthiness score calculated?

Each sentence and translation in Kanjiverse is assigned a 5-star rating based on the following criteria:

  • The self-reported fluency of the author.
  • The self-reported fluency of the translator in both the source and target languages.
  • The difficulty of the sentence – inferred from the rarity of its vocabulary – so that translating an easy sentence would not require the translator to be fluent.
  • Other factors, such as whether the sentence was reviewed, whether it is original or a translation of another sentence, and whether the translation is direct or indirect.
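
To give a rough idea of how criteria like these can be combined, here is a simplified sketch in Python. It is only an illustration: the `RatedPair` fields, the `effective_level` helper, the one-star bonus and penalty, and the 1 to 5 clamp are all assumptions made up for this example, not the actual weights used in the app.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RatedPair:
    # Hypothetical record: all field names are assumptions for this sketch.
    author_level: Optional[int]             # self-reported, 0 to 5, None if unknown
    translator_source_level: Optional[int]  # same scale, in the source language
    translator_target_level: Optional[int]  # same scale, in the target language
    difficulty: int                         # 1 (easy) to 5 (hard), from vocabulary rarity
    reviewed: bool                          # e.g. the sentence carries an "OK" tag
    direct: bool                            # False for a translation of a translation


def effective_level(level: Optional[int], difficulty: int) -> int:
    """Give full credit when the level covers the sentence's difficulty."""
    if level is None:
        return 0  # assumption: an unspecified level earns no credit
    return 5 if level >= difficulty else level


def star_rating(pair: RatedPair) -> int:
    """Combine the criteria into a rating clamped between 1 and 5 stars."""
    author = pair.author_level if pair.author_level is not None else 0
    translator = min(
        effective_level(pair.translator_source_level, pair.difficulty),
        effective_level(pair.translator_target_level, pair.difficulty),
    )
    # The pair is only as trustworthy as its weakest contributor.
    score = min(author, translator)
    if pair.reviewed:
        score += 1  # assumed bonus for a reviewed sentence
    if not pair.direct:
        score -= 1  # assumed penalty for an indirect translation
    return max(1, min(5, score))


easy = RatedPair(5, 3, 5, difficulty=2, reviewed=False, direct=True)
hard = RatedPair(5, 3, 5, difficulty=4, reviewed=False, direct=True)
print(star_rating(easy), star_rating(hard))
```

In this sketch the same translator (level 3 in the source language) gets full credit on the easy sentence but is capped at their own level on the hard one, which is the idea behind taking difficulty into account.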

Want to see it in action?

All sentences and translations in Kanjiverse are now tagged with this 5-star rating so you can instantly assess the quality of a sentence like this one:

Further details, such as the author's level, can be enabled in Tatoeba Sentence Page Settings:

The selection of sample sentences displayed under the definition of a Japanese word like this one can be customized in Tatoeba Sentences Card Settings, where you can filter them by difficulty, rating, and author's fluency:
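
Conceptually, these settings act as a simple filter followed by a sort. The sketch below shows the idea in Python; the `Sentence` record, the `select_sentences` function and its parameter names are hypothetical and only mirror the settings described here (max difficulty, min rating, min author level, sort by rating, and a limit of 100 sentences).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    # Hypothetical record: field names are assumptions for this sketch.
    text: str
    difficulty: int    # 1 (easy) to 5 (hard)
    rating: int        # 1 to 5 stars
    author_level: int  # self-reported, 0 to 5


def select_sentences(
    sentences: List[Sentence],
    max_difficulty: int = 5,
    min_rating: int = 1,
    min_author_level: int = 0,
    sort_by_rating: bool = True,
    limit: int = 100,
) -> List[Sentence]:
    """Keep the sentences matching the criteria, best-rated first."""
    kept = [
        s for s in sentences
        if s.difficulty <= max_difficulty
        and s.rating >= min_rating
        and s.author_level >= min_author_level
    ]
    if sort_by_rating:
        # sort() is stable, so equally rated sentences keep their order.
        kept.sort(key=lambda s: s.rating, reverse=True)
    return kept[:limit]


samples = [
    Sentence("猫が好きです。", difficulty=1, rating=3, author_level=4),
    Sentence("これはペンです。", difficulty=1, rating=5, author_level=5),
    Sentence("彼は経済学を専攻している。", difficulty=4, rating=5, author_level=5),
    Sentence("明日は雨でしょう。", difficulty=2, rating=2, author_level=0),
]

picked = select_sentences(samples, max_difficulty=2, min_rating=3, min_author_level=4)
for s in picked:
    print(s.rating, s.text)
```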

It is also possible to choose a language other than English for the translations:

One last disclaimer...

Please be reminded that Kanjiverse is still in beta and so is this rating system: it might overestimate or underestimate the quality of a sentence. Ratings were not reviewed individually by a human; they were assigned by the algorithm based on the information available.

I hope I did not offend any of the Tatoeba contributors with unwarranted bad ratings ^^; If you think your sentence has been unfairly given a low rating, please contact me so I can rectify it. I also encourage contributors who have not specified their language levels on their Tatoeba profile to do so ;)

I hope you find this rating system useful in your Japanese studies. If so, please share a link to the beta with your fellow Japanese learners and join us on Discord!

Changelog v0.8.1

  • fixed level selection not working
  • fixed back button on Android popping the screen behind modals
  • added 5-star ratings to all Tatoeba cards
  • sorted sentences by rating on Tatoeba Corpus Sentences card
  • added extra languages to Tatoeba Translations: French, Russian, Italian, German, Portuguese,
  • added Tatoeba Sentences Card settings screen to control the search criteria: max difficulty, min
    rating, min author level, sort by rating
  • added Tatoeba Sentence Page settings screen to enable/disable color, rating, difficulty, author
    level, license, translation language
  • added Random Sentence card on the Dashboard
  • added license tag to Tatoeba card
  • added Feedback button on Android
  • replaced radio selections in Settings with bottom sheet menus
  • removed indices in front of sentences in My List
  • removed X button on bottom sheet with handle
  • increased Tatoeba sentences limit to 100
  • added hidden Dev Mode