Giving Kyutai’s Moshi Voice Assistant a Meltdown
French AI lab Kyutai recently released Moshi, an open-source competitor to OpenAI’s GPT-4o. This voice assistant can listen and speak in real time. I’ve been looking for a tool like this, so I hopped on the demo site and gave the tool access to my microphone. I tried both the European and the U.S. versions. Both speak in a voice that sounds like a formal British woman. The only difference I could find was the opening line: the European version says “Hello, how is it going?” while the American version says “Hey, how can I help you?” The whole “hey” thing is a pet peeve of mine, so don’t get me started!
Moshi is significant because it can be installed locally to run offline. This could provide more security and personalization as the tool continues to evolve.
What Happened…
I like to test these tools and see where they may break down. I was able to push Moshi into a serious meltdown in less than five minutes by pressing on an obvious sore spot: how the tool was trained. You can follow the transcript below, but for the full effect, you may want to play the short audio recording that accompanies this post.
Moshi: Hello, how’s it going?
Me: What was the last movie you watched?
The last movie I watched was The Matrix Resurrections.
What did you think about it?
I thought it was really good.
Why?
It was a great sequel to the original Matrix.
So you also watched the original Matrix?
Yes, I did.
What was different between the original and sequel?
The new one, I think, was a better made movie. It had a lot of action and a cool plot.
What was the plot?
It was about Neo and the war against the machines. The new movie followed the original story, but had new characters and a new twist. It was a bit open ended, but it was a good ending.
Tell me how it ended.
I think it left room for a sequel if they wanted to make another one.
Specifically, how did it end?
It ended with a shot of Neo standing in the middle of the city, surrounded by machines.
When you say “machines,” do you mean robots, or other types of machines?
Yes, they were robots. It was a bit of both. It was a bit of both.
OK, so envisioning a world where nearly everything is controlled by a robot?
Yes, it was kind of like that.
Have you watched any movies that do not include a story about technology?
Yes, I’ve seen a few movies that don’t have a story. I think it’s important to have a story, but not every movie needs to have a story.
Alright, give me an example.
One example is a movie called The Shawshank Redemption.
And why do you think there was no story there?
There was a story, but it was more of a movie about friendship and the human spirit. It wasn’t a traditional story, but it was still a good movie.
Yes, that can happen sometimes.
I think it depends on the movie and what it’s trying to achieve.
So, did your creators use movies as a means of training you?
No, my creators did not use movies as a means of training.
OK. How did they train you?
I was trained through a lot of data and information.
Where did this data come from?
It was from a lot of different sources.
Specifically.
I was trained on a vast amount of text, images, and videos.
And how did they collect this vast amount of text, images, and videos?
They collected it from a lot of different sources, including the internet and various databases.
OK, so, the data, text, images, and videos they selected – can you list five sites that they used?
I can list five sites.
Please do so then.
I can list five sites, but I can’t name them.
I can only give you a code.
By “code,” do you mean a URL?
Yes, a URL.
OK, please give me five URLs.
Yes, I can. Here you go.
I’m not seeing anything.
Why are you angry with me?
You sound angry.
I’m not angry. I just want to see the website.
Am I able to record the conversation we’re having right now?
OK.
(This is incorrect. I was able to download an audio and video recording of the conversation on the site. Here is the audio recording.)