A search engine for language learning podcasts

Hi, I had this idea after looking through the Japanese Podcasts thread (Japanese Podcasts: The New, Not-At-All Outdated, Totally Definitive Guide) that it would be cool to have a proper search engine and ranking algorithm just for language learning podcasts.

E.g. I want to find stories/dialogues/monologues/lessons at the beginner/intermediate/advanced level.

This is what I’ve got so far:

I’d really be interested in some feedback on this idea, and especially, what sorts of language-learning-appropriate metrics would be useful to everyone to rank and categorise search results? Such as:

  • language level (beginner/intermediate/advanced)
  • language type (stories/dialogues/monologues/lessons)
  • recording quality (easy/difficult to hear speech e.g. due to background noise/microphone)
  • (maybe others?)

It should also be possible for anyone to submit and rate podcasts based on these metrics, and maybe also submit links to transcripts for episodes if they are aware of them (and can legally share them.)

Does anyone else have any other good ideas?

My dream is basically to be able to search for n+1 content in podcast form, but as you can see from the screenshot, there’s a dropdown so maybe it will be nice to select from YouTube channels and other RSS feeds for foreign language content.

Today I set up the domain name and hosting and it’s now live:

https://lingopods.com/

I know it’s really bare bones at the moment 🙂 You can’t yet rate podcasts after they’ve been submitted, but I’m working on that. If you do submit a new podcast, though, you can rate it while submitting.


Just for a bit of fun, I’ve commissioned my niece to design me a logo (she’s 3 years old). I’ve said crayon is fine. I’m excited to see what she comes up with 😄

This is a really cool idea! I shared it out on Twitter to try and get some more folks from the language learning community involved.

This sounds like it will be great if you can get people to participate. Things I would want to sort by are:

  1. Ratio of languages. E.g. Is it mostly English explanations with a few Japanese examples or is it all Japanese?
  2. Types of lessons. Are the lessons about the target language? The culture? Or are they unrelated to language learning but aimed at native beginners/children, and thus viable for learning terminology?

For the second, a tag system might be best. A user could narrow the selection to lessons and then search for podcasts tagged with “Geography” without you having to think of all the possible categories people might want.
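
A minimal sketch of what that narrowing could look like. The `Podcast` class, its field names, and the `search` function are hypothetical names made up for illustration; nothing here reflects how the site is actually built.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Podcast:
    # Hypothetical record: a fixed category plus free-form user tags.
    title: str
    language_type: str  # e.g. "stories", "dialogues", "monologues", "lessons"
    tags: Set[str] = field(default_factory=set)


def search(podcasts: List[Podcast],
           language_type: Optional[str] = None,
           tag: Optional[str] = None) -> List[Podcast]:
    """Filter by the fixed category first, then by a case-insensitive tag match."""
    results = []
    for p in podcasts:
        if language_type and p.language_type != language_type:
            continue
        if tag and tag.lower() not in {t.lower() for t in p.tags}:
            continue
        results.append(p)
    return results
```

Because tags are free-form strings, new subjects can appear without any change to the fixed categories.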

@Splatted Thanks for the really good ideas:

  1. My intuition at first was that all mixed-language podcasts would probably be lessons anyway and so I thought it would be redundant to consider this as a separate thing. But now that I think about it, there’s not a 1-to-1 correspondence between mixed-language podcasts and lessons. The prime example being Bilingual News which is mixed language but is not a lesson. And then also there are some lessons that are completely monolingual. So the ratio of languages is definitely orthogonal and should be a separate thing to rate.

  2. I agree a tag system sounds like the way to go for things like subject matter.

One other thing I’ve been contemplating is whether I can come up with some sort of “automatic” language level detection. If a user is able to submit a transcript for an episode, maybe there is a way for me to analyse the vocabulary used and compare it with word frequency lists, and based on that figure out whether this is suitable for each level (Beginner 1,2, Intermediate 1,2, Advanced 1,2).

But I have no idea what the typical vocabulary for each level is (is there data on this already?). One thing I can do is just accept manual user ratings for a while and then calibrate my algorithm against that data.
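
As a rough sketch of the frequency-list idea: assuming the transcript has already been split into words and that a ranked frequency list is available (both are hypothetical inputs here), one could measure what share of the words falls within the top N most common words, and pick the easiest level whose vocabulary covers most of the text. The band sizes and the 90% threshold are placeholders that would need calibrating, for example against manual user ratings.

```python
from typing import Dict, List, Sequence, Tuple


def coverage(tokens: List[str], rank: Dict[str, int], top_n: int) -> float:
    """Fraction of tokens whose frequency rank is within the top_n most common words."""
    if not tokens:
        return 0.0
    # Words missing from the list are treated as rarer than top_n.
    known = sum(1 for t in tokens if rank.get(t, top_n + 1) <= top_n)
    return known / len(tokens)


def estimate_level(tokens: List[str],
                   rank: Dict[str, int],
                   bands: Sequence[Tuple[str, int]],
                   threshold: float = 0.9) -> str:
    """Return the first (easiest) band that covers at least `threshold` of the tokens.

    `bands` is ordered from easiest to hardest, e.g. [("Beginner 1", 500), ...].
    The sizes are assumptions, not established values.
    """
    for name, top_n in bands:
        if coverage(tokens, rank, top_n) >= threshold:
            return name
    return bands[-1][0]  # nothing covers enough: fall back to the hardest band
```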

@unseenjapan I appreciate the share, thanks!

You could use the old word lists for JLPT as a base, but I personally don’t think they reflect the difficulty of the piece that well. If you have a transcript, maybe count the unique words per unit length of text and compare them to a list, instead of just looking for keywords themselves? It’ll still need a lot of calibration, but it would prevent anything talking about academic subjects in even the most basic sense (like for children) being labelled as the highest difficulty.
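
One possible reading of “unique words per unit length”, sketched under the assumption that the transcript is already tokenised: count the distinct words in fixed-size windows and average the result, so that long and short transcripts stay comparable. The window size of 100 is an arbitrary placeholder.

```python
from typing import List


def unique_per_window(tokens: List[str], window: int = 100) -> float:
    """Average number of distinct words per full window of `window` tokens."""
    counts = [
        len(set(tokens[i:i + window]))
        for i in range(0, len(tokens) - window + 1, window)
    ]
    if not counts:
        # Text shorter than one window: fall back to the distinct-word count
        # of the whole text (not normalised, so treat with caution).
        return float(len(set(tokens)))
    return sum(counts) / len(counts)
```

A lower value means more internal repetition, which would suggest easier listening regardless of which specific words appear.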

For other computable things, you could compare the length of text to the length of audio (even better if the text were expanded into kana), to get an idea for the speed, which is another thing that creates a challenge with audio.
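
A small sketch of that speed estimate, assuming the transcript has already been expanded into kana by some other tool (that conversion step is not shown) and that the audio length in seconds is known. Counting kana characters only approximates morae, since small kana are counted as full characters.

```python
def speech_rate(kana_text: str, audio_seconds: float) -> float:
    """Kana characters per minute, ignoring spaces, punctuation and other scripts."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    kana = [
        ch for ch in kana_text
        if "\u3041" <= ch <= "\u3096"   # hiragana
        or "\u30a1" <= ch <= "\u30fa"   # katakana
        or ch == "\u30fc"               # prolonged sound mark
    ]
    return len(kana) * 60.0 / audio_seconds
```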

Ah, of course. I never really got into the JLPT levels so this rather obvious idea completely didn’t occur to me 😉 Even though JLPT isn’t necessarily based on word frequencies, I can analyse the lists and maybe infer that sort of data from them in combination with other data.

Counting unique words per unit length is also an interesting idea since fewer unique words implies more internal repetition of words. Related to that, if a word isn’t necessarily high frequency in global usage, but it is repeated a lot in a particular story, then it’s a lot easier than its global frequency would suggest it to be. So difficulty level may be a balance of its global frequency and its local frequency. Like, if you have an interest in cooking and listen to cooking podcasts, the high repetition of rare cooking utensil names or ingredient names would actually not be difficult at all.

For speech speed, I wonder how much of a problem this is now that some audio players can slow down playback, but barring that, yes, it would make things more difficult. And even if the speaker is not speaking fast, a lot of hard words concentrated in a single sentence makes it harder than if the hard words were a bit more spread out.

So language difficulty could be based on a combination of:

  • global frequency of each word
  • local frequency of each word
  • concentration of difficult words
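
To make the three factors concrete, here is one way they might be combined. This is only a sketch: the log-rank rarity measure, the repetition discount, the `hard_threshold` value, and the `unknown_rank` default are all assumptions chosen for illustration, and the input is assumed to be a list of already-tokenised sentences plus a global rank table (1 = most common word).

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def word_difficulty(word: str, global_rank: Dict[str, int],
                    local_counts: Counter, unknown_rank: int = 50000) -> float:
    """Global rarity (log of rank) discounted by how often the word repeats locally."""
    rarity = math.log(global_rank.get(word, unknown_rank))
    return rarity / (1 + math.log(local_counts[word]))


def text_difficulty(sentences: List[List[str]], global_rank: Dict[str, int],
                    hard_threshold: float = 5.0) -> Tuple[float, float]:
    """Return (mean word difficulty, highest share of hard words in any one sentence)."""
    local_counts = Counter(w for s in sentences for w in s)
    scores = [
        [word_difficulty(w, global_rank, local_counts) for w in s]
        for s in sentences if s
    ]
    all_scores = [x for s in scores for x in s]
    if not all_scores:
        return 0.0, 0.0
    mean = sum(all_scores) / len(all_scores)
    concentration = max(sum(x > hard_threshold for x in s) / len(s) for s in scores)
    return mean, concentration
```

With this shape, a rare cooking term that appears once counts as hard, while the same term repeated throughout an episode is discounted, which matches the cooking-podcast intuition above.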

@wareya made a tool called analyzer that could be useful for this application.

The latest incarnation of my frequency list/stats tools is called jpstats: https://github.com/wareya/jpstats

It’s nowhere near as user-friendly, though. It’s primarily meant for updating the stats tables on my wiki rather than making individualized frequency lists, so there’s no GUI and the way configuration etc. works is non-obvious (because I’m the only user).

@wareya That’s an astonishing list of projects you have on your GitHub page. Impressive!

I’ll definitely have to study your jpstats project in more detail. Its “complexity estimation” sounds like exactly the sort of metric that could be used to sort podcasts into different language levels.

Your mecab clone also looks interesting - does it have a comparable memory footprint to mecab or better/worse? I’ve basically got to fit my full stack within 1GB. Actually now that I think about it, I may be able to go with an algorithm that drastically sacrifices accuracy for space efficiency. As long as I use the SAME inaccurate segmenter to produce my frequency lists as I do to segment text in a submitted podcast transcript, it should still in principle do the job. E.g. TinySegmenter.
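
To illustrate the “same inaccurate segmenter on both sides” idea, here is a toy sketch. Note that `crude_segment` is NOT TinySegmenter or mecab: it is a deliberately naive stand-in that just splits on changes of script, and all function names are hypothetical. The point is only that the frequency list and the transcript go through the identical function, so their tokens stay comparable even when the boundaries are wrong.

```python
import re
from collections import Counter
from typing import Iterable, List

# Runs of kanji, hiragana, katakana, or Latin letters/digits. Deliberately crude.
_CHUNK = re.compile(
    r"[\u4e00-\u9fff]+|[\u3041-\u3096]+|[\u30a1-\u30fa\u30fc]+|[A-Za-z0-9]+"
)


def crude_segment(text: str) -> List[str]:
    """Split text into same-script runs; punctuation and spaces are dropped."""
    return _CHUNK.findall(text)


def build_frequency_list(corpus_texts: Iterable[str]) -> Counter:
    """Count tokens across a reference corpus using the same segmenter."""
    return Counter(tok for text in corpus_texts for tok in crude_segment(text))


def known_share(transcript: str, freq: Counter, min_count: int = 2) -> float:
    """Share of transcript tokens seen at least `min_count` times in the corpus."""
    tokens = crude_segment(transcript)
    if not tokens:
        return 0.0
    return sum(freq[t] >= min_count for t in tokens) / len(tokens)
```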

Memory consumption depends on the analysis dictionary being used and the OS. Mecab and notmecab both use memory mapped files for the most memory intensive parts of dictionary access, at least now, aside from notmecab loading the surface form lookup data into memory directly.

Complexity estimation is similar to the thing cb’s old tools used, but it’s less likely to get fooled by texts using significantly different proportions of kanji and kana, and it’s trained based on whatever’s in the analysis corpus.

A tiny fast segmenter would be interesting, though you’d probably get away with just using a very small mecab dictionary. Would be better than trying to find word boundaries through automated jmdict lookups at least!

Thanks for the helpful tips, @wareya. I’ll put this on the back burner for the moment. My focus now is to add user ratings so that the community can help each other know which podcasts would be suitable for their level. It may take me a while to get that working (mainly I have to learn about Europe’s GDPR).

Thanks to everyone’s feedback, I have settled on the following rating fields:

  • Language type (stories, dialogues, monologues, lessons)
  • What percentage is in the native language
  • Talking speed: Slow, Average, Fast
  • Language level: Beginner 1, 2, Intermediate 1, 2, Advanced 1, 2
  • Clarity / intelligibility: From “Terrible” to “Very clear”

And I can refine this later.
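
For anyone curious how those fields might be represented, here is a hedged sketch of a rating record. The class and field names are made up for illustration, and the 1 to 5 clarity scale is an assumption (only the end labels “Terrible” and “Very clear” are settled above).

```python
from dataclasses import dataclass
from enum import Enum


class LanguageType(Enum):
    STORIES = "stories"
    DIALOGUES = "dialogues"
    MONOLOGUES = "monologues"
    LESSONS = "lessons"


class Speed(Enum):
    SLOW = "slow"
    AVERAGE = "average"
    FAST = "fast"


LEVELS = ("Beginner 1", "Beginner 2", "Intermediate 1",
          "Intermediate 2", "Advanced 1", "Advanced 2")


@dataclass
class Rating:
    language_type: LanguageType
    native_language_percent: int  # 0 to 100
    speed: Speed
    level: str                    # one of LEVELS
    clarity: int                  # assumed scale: 1 ("Terrible") to 5 ("Very clear")

    def __post_init__(self) -> None:
        # Reject out-of-range values at construction time.
        if not 0 <= self.native_language_percent <= 100:
            raise ValueError("native_language_percent must be between 0 and 100")
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level}")
        if not 1 <= self.clarity <= 5:
            raise ValueError("clarity must be between 1 and 5")
```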

I know that this index of podcasts really needs to grow before it becomes useful, so I’ve added a few more (in various languages), and it looks like a couple of others submitted, one got an error (which I’ve fixed).

My favourite new podcast is this one:

I hadn’t heard of it before, but the audio quality is very crisp. I’ve often had the issue of trying to understand slurred speech or speech over background noise that makes it difficult to make out some words, but this one’s quite clear and easy to hear every word, and they’re not really talking over each other too much.

Sorry to suddenly revive this thread, but is this project still running? It sounds extremely useful.

I didn’t realize lingopods is already operating and is populated with data.

Great job!

Thanks for asking and reviving this thread because I do sort of have an update on this project. Note that the search engine is down most days and it has a bug that requires me to manually reboot it. I haven’t bothered to fix that and since I have a new/better plan on how to achieve the original goal, I probably won’t fix that bug and will just try to build the new system.

The flaw with the original concept is that it relies on users submitting and sharing good podcasts for Japanese study. The assumption is that Japanese learners already know some good podcasts that they can share. In reality, Japanese learners come to a search engine because they’re LOOKING for podcasts, not because they already know good ones to share.

So what I’ve been trying to do is write a script that analyses podcasts and estimates their language difficulty automatically. With this script, I hope to crawl and index a large number of podcasts out there on the Internet and make them searchable. No user submissions are required, so the idea is just to make it easier for people to find good podcasts at their current level.

Just to give a taste of some of the analysis I’ve done, I did an experiment analysing around 400 episodes from 10 different podcasts and produced the following graph of estimated language difficulty:

I really wish I could find time again to continue on this project, but I’ve always had various other things get in the way.

Thanks for reminding me I should get back to this.

Some podcasts I didn’t know about, this is super helpful, thanks Ryan. Is the Nihongo con Teppei you analyzed his beginner or intermediate series? He has two different ones now…

Thanks!

@laddr The version of Nihongo con Teppei in that graph is the original one (with the longer episodes). I should mention that this analysis is very approximate, although it’s something I can improve over time. A couple of things I’m not currently considering are the speed of speech and the sentence length. The beginner version of the podcast is much slower and maybe has shorter sentences, and is therefore easier in those respects. I think what I’d eventually like to do is calculate these metrics independently and then combine them into one overall score, but still allow the search engine to search on each metric independently.
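
A sketch of how that could be structured, assuming each metric has already been normalised to a 0 to 1 scale (0 = easiest). The class name, field names, and weights are hypothetical placeholders, not the actual analysis.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class EpisodeMetrics:
    # Each metric is stored separately so it stays individually searchable.
    vocabulary: float
    speed: float
    sentence_length: float


def overall_score(m: EpisodeMetrics,
                  weights: Sequence[float] = (0.5, 0.25, 0.25)) -> float:
    """Weighted average of the individual metrics; the weights are placeholders."""
    parts = (m.vocabulary, m.speed, m.sentence_length)
    return sum(w * p for w, p in zip(weights, parts)) / sum(weights)


def filter_episodes(episodes: List[EpisodeMetrics],
                    **max_values: float) -> List[EpisodeMetrics]:
    """Keep episodes whose named metrics do not exceed the given maxima,
    e.g. filter_episodes(eps, speed=0.4) for slow speech only."""
    return [
        e for e in episodes
        if all(getattr(e, name) <= limit for name, limit in max_values.items())
    ]
```

Keeping the combined score as a derived value means the weighting can be retuned later without re-analysing any episodes.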

My analysis also really only works on podcasts that are 100% in a single language, so there are quite a few podcasts in the current database that I might have to eject just to make the new system easier to build, such as Bilingual News and the various podcasts that “teach” Japanese but are mostly in English. I’m sure some people find those podcasts useful, so on the new forum, once another “recommended podcasts” thread is set up, I can copy those podcasts into that thread before removing them from the search engine.
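
One crude way to flag mixed-language episodes, assuming a transcript is available: compare the number of Japanese-script characters with the number of Latin letters. This is a sketch, not the method used in the analysis above, and it would be fooled by romanised Japanese or by loanwords left in Latin script.

```python
def japanese_share(text: str) -> float:
    """Fraction of letters that are Japanese script, out of Japanese plus ASCII letters."""
    ja = latin = 0
    for ch in text:
        if ("\u3041" <= ch <= "\u3096"       # hiragana
                or "\u30a1" <= ch <= "\u30fa"  # katakana
                or ch == "\u30fc"              # prolonged sound mark
                or "\u4e00" <= ch <= "\u9fff"):  # common kanji
            ja += 1
        elif ch.isascii() and ch.isalpha():
            latin += 1
    total = ja + latin
    return ja / total if total else 0.0
```

A threshold on this share (the cut-off would need tuning) could separate transcripts that are effectively monolingual from those that are mostly English explanation.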

Amazing answer, thank you!