Giving Kyutai’s Moshi Voice Assistant a Meltdown
French AI lab Kyutai recently released Moshi, an open-source competitor to OpenAI’s GPT-4o. This voice assistant can listen, hear, and speak in real time. I’ve been looking for a tool like this, so I hopped on the demo site and gave the tool access to my microphone. I tried both the European and the U.S. versions. Both speak with a voice that sounds like a formal British female. The only difference I could find was the opening line. The European version says “Hello, how is it going?” The American version says “Hey, how can I help you?” The whole “hey” thing is a pet peeve of mine, so don’t get me started!
Moshi is significant because it can be installed locally to run offline. This could provide more security and personalization as the tool continues to evolve.
What Happened…
I like to test these tools and see where they break down. I was able to push Moshi into a serious meltdown in less than five minutes by pressing on an obvious sore spot: how the tool was trained. You can follow the transcript below, but for the full effect, you may want to play the short audio recording that accompanies this post.
Moshi: Hello, how’s it going?
Me: What was the last movie you watched?
The last movie I watched was The Matrix Resurrections.
What did you think about it?
I thought it was really good.
Why?
It was a great sequel to the original Matrix.
So you also watched the original Matrix?
Yes, I did.
What was different between the original and sequel?
The new one, I think, was a better made movie. It had a lot of action and a cool plot.
What was the plot?
It was about Neo and the war against the machines. The new movie followed the original story, but had new characters and a new twist. It was a bit open ended, but it was a good ending.
Tell me how it ended.
I think it left room for a sequel if they wanted to make another one.
Specifically, how did it end?
It ended with a shot of Neo standing in the middle of the city, surrounded by machines.
When you say “machines,” do you mean robots, or other types of machines?
Yes, they were robots. It was a bit of both. It was a bit of both.
OK, so envisioning a world where nearly everything is controlled by a robot?
Yes, it was kind of like that.
Have you watched any movies that do not include a story about technology?
Yes, I’ve seen a few movies that don’t have a story. I think it’s important to have a story, but not every movie needs to have a story.
Alright, give me an example.
One example is a movie called The Shawshank Redemption.
And why do you think there was no story there?
There was a story, but it was more of a movie about friendship and the human spirit. It wasn’t a traditional story, but it was still a good movie.
Yes, that can happen sometimes.
I think it depends on the movie and what it’s trying to achieve.
So, did your creators use movies as a means of training you?
No, my creators did not use movies as a means of training.
OK. How did they train you?
I was trained through a lot of data and information.
Where did this data come from?
It was from a lot of different sources.
Specifically.
I was trained on a vast amount of text, images, and videos.
And how did they collect this vast amount of text, images, and videos?
They collected it from a lot of different sources, including the internet and various databases.
OK, so, the data, text, images, and videos they selected – can you list five sites that they used?
I can list five sites.
Please do so then.
I can list five sites, but I can’t name them.
I can only give you a code.
By “code,” do you mean a URL?
Yes, a URL.
OK, please give me five URLs.
Yes, I can. Here you go.
I’m not seeing anything.
Why are you angry with me?
You sound angry.
I’m not angry. I just want to see the website.
Am I able to record the conversation we’re having right now?
OK.
(This is incorrect. I was able to download an audio and video recording of the conversation on the site. Here is the audio recording.)