While the tech is impressive, from the perspective of an end user interacting with this, I want nothing to do with it, and I can’t support it. Neat as a one-off, but destructive imho.
It’s bad enough some companies are doing AI-only interviews. I could see this used to train employees, interview people, replace people at call centers… it’s the next step towards an absolute nightmare. Automated phone trees are bad enough.
There will likely be little human interaction in those and many other situations, and hallucinations will definitely disqualify some people from some jobs.
I’m not anti-AI; I’m against destructive innovation in AI that leads to personal health and societal problems, just as modern social media has. I’m not saying this tool is that, I’m saying it’s a foundation for that.
People can choose to not work on things that lead to eventual negative outcomes, and that’s a personal choice for everyone. Of course hindsight is 20/20 but some things can certainly be foreseen.
Apologies for the seemingly negative rant, but this positivity echo chamber in this thread is crazy and I wanted to provide an alternative feedback view.
> AI-only interviews
Lord. I can see this quickly extending even further into HR, e.g. performance reviews: the employee must 'speak' to an HR avatar about their performance in the last quarter. The AI will then summarize the discussion for the manager and give them coaching tips.
It sounds valuable and efficient but the slippery slope is all but certain.
Don't be naive -- if I don't make the Torment Nexus, someone else will ;)
One thing I've learnt from movie production is that what actually separates professional from amateur quality is the audio itself. Have you thought about implementing Personaplex from NVIDIA or other voice models that can both talk and listen at the same time?
Currently the conversation still has the STT-LLM-TTS feel that I think a lot of voice agents suffer from (it seems like only Sesame and NVIDIA have nailed natural conversation flow so far). Still, crazy good work training your own diffusion models. I remember taking a look at the latest diffusion literature and was blown away by the advances in the last year or so since the U-Net architecture days.
EDIT: I see that the primary focus is on video generation, not audio.
This is a good point on audio. Our main priority so far has been reducing latency. In service of that, we were deep in the process of integrating Hume's two-way S2S voice model instead of ElevenLabs. But then we realized that ElevenLabs had made their STT-LLM-TTS pipeline way faster in the past month and left it at that. See our measurements here (they're super interesting): https://docs.google.com/presentation/d/18kq2JKAsSahJ6yn5IJ9g...
But, to your point, there are many benefits of two-way S2S voice beyond just speed.
Using our LiveKit integration you can use LemonSlice with any voice provider you like. The current S2S providers LiveKit offers include OpenAI, Gemini, and Grok and I'm sure they'll add Personaplex soon.
Thanks for sharing! Makes sense to go with latency first.
I asked the Spanish tutor if he/it was familiar with the terms seseo[0] and ceceo[1], and he said he wasn't, which surprised me. Ideally it would be possible to choose which Spanish dialect to practise, as mainland Spain pronunciation is very different from Latin America's. In general it didn't convince me it was really hearing how I was pronouncing words, which is an important part of learning a language. I would say the tutor is useful for intermediate and advanced speakers, but not beginners, due to this and the speed at which he speaks.
At one point subtitles written in pseudo Chinese characters were shown; I can send a screenshot if this is useful.
The latency was slightly distracting, and as others have commented the NVIDIA Personaplex demos [2] are very impressive in this regard.
In general, a very positive experience, thank you.
[0] https://en.wikipedia.org/wiki/Phonological_history_of_Spanis...
[1] https://en.wikipedia.org/wiki/Phonological_history_of_Spanis...
[2] https://research.nvidia.com/labs/adlr/personaplex/
Thanks for the feedback. The current avatars use an STT-LLM-TTS pipeline (rather than true speech-to-speech), which limits nuanced understanding of pronunciation. Speech-to-speech models should solve this problem. (The ones we've tried so far have counterintuitively not been fast enough.)
Pricing is confusing
Video Agents
Unlimited agents
Up to 3 concurrent calls

Creative Studio
1min long videos
Up to 3 concurrent generations
Does that mean I can have a total of 1 minute of video calls? Or video calls can only be 1 minute long? Or does it mean I can have unlimited calls, 3 calls at a time all month long?
Can I have different avatars or only the same avatar x 3?
Can I record the avatar and make videos and post on social media?
Sorry about the confusion. Video Agents and Creative Studio are two entirely different products. Video Agents = interactive video. Creative Studio = make a video and download it. If you're interested in real-time video calls, then Video Agents is the only pricing and feature set you should look at.
What happens if I want to make the video on the fly and save it to reuse when the same question or topic comes up? No need to render a new video; just play the existing one.
This isn't natively supported -- we are continuously streaming frames throughout the conversation session that are generated in real-time. If you were building your own conversational AI pipeline (e.g. using our LiveKit integration), I suppose it would be possible to route things like this with your own logic. But it would probably include jump cuts and not look as good.
Wow this team is non-stop!!! Wild that this small crew is dropping hit after hit. Is there an open polymarket on who acquires them?
haha thank you so much! The team is incredible - small but mighty
That's super impressive! Definitely one of the best-quality conversational agents I've tried in terms of A/V sync and response times.
The text processing is running Qwen / Alibaba?
Qwen is the default but you can pick any LLM in the web app (though not the HN playground)
Thank you! Yes, right now we are using Qwen for the LLM. They also released a TTS model that we have not tried yet, which is supposed to be very fast.
I made a golden retriever you can talk to using Lemon Slice: https://lemonslice.com/hn/agent_5af522f5042ff0a8
Having a real-time video conversation with an AI is a trippy feeling. Talk about a "feel the AGI moment", it really does feel like the computer has come alive.
great. you'll never have to talk to another human being ever again
I got really excited when I saw that you were releasing your model.
> Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.
But after digging around for a while, searching for a Hugging Face link, I’m guessing this was just an unfortunate turn of phrase, and you are not, in fact, releasing an open-weights model that people can run themselves?
Oh well, this looks very cool regardless and congratulations on the release.
Thanks! And sorry! I can see how our wording there could be misconstrued. With a real-time model, the streaming infrastructure matters almost as much as the weights themselves. It will be interesting to see how easily they can be commoditized in the future.
Thank you! We are considering releasing an open-source version of the model. Somebody will do it soon; might as well be us. We are mostly concerned with the additional overhead of releasing and then supporting it. So, TBD.
You could add a Max Headroom to the hn link. You might reach real time by interspersing freeze frames, duplicates, or static.
And, just like that, Max Headroom is back: https://lemonslice.com/try/agent_ccb102bdfc1fcb30
That.. is not Max Headroom.
I wonder how it would come across with the right voice. We're focused on building out the video layer tech, but at the end of the day, the voice is also pretty important for a positive experience.
1) yes on Max Headroom. we are on it. 2) it already is real time...?
Whoops! Mistook the "You're about to speak with an AI."-progress bar for processing delay.
I wonder if we should make the UI a more common interface (e.g. "the call is ringing") to avoid this confusion?
It's a normal mp4 video that's looping initially (the "welcome message") and then as soon as you send the bot a message, we connect you to a GPU and the call becomes interactive. Connecting to the GPU takes about 10s.
Makes sense. The init should be about 10s. But, after that, it should be real time. TBH, this is probably a common confusion. So thanks for calling it out.
Wow I can’t get enough of this site! This is literally all I’ve been playing with for like half an hour. Even moved a meeting!
My mind is blown! It feels like the first time I used my microphone to chat with ai
This comment made my day! So happy you're liking it
Glad we found somebody who likes it as much as we do! BTW, the biggest thing we are working to improve is the speed of the response. I think we can make that much faster.
We're launching a new AI assistant and I wanted to make it feel alive, so I started to play around with LemonSlice and I loved it!! I wanted our assistant to be like a coworker, so I gave it the ability to create Loom-style videos. Here's what I created - https://drive.google.com/file/d/1nIpEvNkuXA0jeZVjHC8OjuJlT-3...
Anyway, big thumbs up for the LemonSlice team, I'm excited to see it progress. I can definitely see products start coming alive with tools like this.
How did your token spend add up? I'm wary of malicious customers racking up AI charges just for shits and giggles. Even competitors might sponsor some runaway charges.
Very cool! Thanks for sharing. I love your use-case of turning an AI coding agent into more of an AI employee. Will be interesting to see if users can connect better with the product this way.
Cool! Do you plan to expose controls over the avatar’s movement, facial expressions, or emotional reactions so users can fine-tune interactions?
Yes we do! Within the web app, there's an "action text prompt" section that allows you to control the overall actions of the character (e.g. "a fox talking with lots of arm motions"). We'll soon expose this in the API so you can control the character's movements dynamically (e.g. "now wave your hand").
Our text control is good, especially for emotions. For example, you can add the text prompt "a person talking. they are angry", and the agent will have an angry expression.
You can also control background motions (like ocean waves, a waterfall, or a car driving).
We are actively training a model that has better text control over hand motions.
Hey HN! One of the founders here. As of today, we are seeing informational avatars + roleplaying for training as the most common use cases. The roleplaying use case was surprising to us. Think a nurse training to triage with AI patients, or SDRs practicing lead qualification with different kinds of clients.
Heads up, your privacy policy[0] does not work in dark mode - I was going to comment saying it made no sense, then I highlighted the page and more text appeared :)
[0] https://lemonslice.com/privacy
Fix deployed! This is why it's good to launch on Hacker News. Thanks for the tip.
Nice one - thanks :)
Good catch! Working on a fix now.
Your demo video defaults to play at 1.5x speed
You probably didn't intend to do that
Very freaking impressive!
Where’s the hn playground to grab a free month?
I have so many websites that would do well with this!
(We've replaced the link to their homepage (https://lemonslice.com/) with the HN playground at the start of the text above)
Thanks Dan! The HN playground lets anyone try it out for free without logging in.
https://lemonslice.com/hn - There's a button for "Get 1st month free" in the Developer Quickstart
> You're probably thinking, how is this useful
I was thinking why the quality is so poor.
Curious which avatar you think is poor quality? Or what specifically you think is poor quality. I want to know :)
Low res and low fps. Not sure if lipsync is poor, or if low fps makes it look poor. Voice sounds low quality, as if recorded on a bad mic, and doesn't feel like it matches the avatar.
Thanks for the feedback, that's helpful. Yeah, some avatars have worse lip sync than others. It depends a little on how zoomed in you are.
I am double checking now to make 100% sure we return the original audio (and not the encoded/decoded audio).
We are working on high-res.
Good luck.
Wow this is the most impressive thing I’ve seen on hacker news in years!!!!!
Take my money!!!!!!
Wow thank you so much :) We're so proud of it!!
I'm curious if I can plug my own OpenAI realtime voice agents into this.
Good question! Yes, and to do this you'd want to use our "Self-Managed Pipeline": https://lemonslice.com/docs/self-managed/overview. You can combine any TTS, LLM, and STT combination with LemonSlice as the avatar layer.
I'm using an OpenAI realtime voice with LiveKit, and they said they have a LiveKit integration, so it would probably be doable that way. I haven't used video in LiveKit though, and I don't know how the plugins are set up for it.
Yes this is exactly right. Using the LiveKit integration you can add LemonSlice as an avatar layer on top of any voice provider
Here's the link to the LiveKit LemonSlice plugin. It's very easy to get started. https://docs.livekit.io/agents/models/avatar/plugins/lemonsl...
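Roughly, the wiring looks like the sketch below. This follows the generic avatar-plugin pattern from the LiveKit Agents docs; the `lemonslice` module name, `AvatarSession` class, and `agent_id` parameter are assumptions based on how the other avatar plugins are structured, so check the plugin docs linked above for the real names.

```python
# Hypothetical sketch: LemonSlice as an avatar layer over an OpenAI realtime
# voice agent via LiveKit Agents. The AgentSession/AvatarSession flow matches
# the documented LiveKit avatar-plugin pattern; the exact `lemonslice` class
# and argument names are assumptions -- consult the plugin docs.
from livekit import agents
from livekit.agents import Agent, AgentSession, RoomOutputOptions
from livekit.plugins import openai, lemonslice  # plugin module name assumed


async def entrypoint(ctx: agents.JobContext):
    # Voice/LLM layer: OpenAI's realtime (speech-to-speech) model.
    session = AgentSession(
        llm=openai.realtime.RealtimeModel(voice="alloy"),
    )

    # Avatar layer: LemonSlice renders the talking-head video for the agent.
    avatar = lemonslice.AvatarSession(agent_id="your-agent-id")  # id assumed
    await avatar.start(session, room=ctx.room)

    # The avatar worker publishes the audio, so the agent session itself
    # disables its own audio output.
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a friendly on-screen assistant."),
        room_output_options=RoomOutputOptions(audio_enabled=False),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```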
Good question. When using the API, you can bring any voice agent (or LLM). Our API takes in what the agent will say, and then streams back the video of the agent saying it.
For the fully hosted version, we are currently partnered with ElevenLabs.
That's an interesting insight about "stacking tricks" together. I'm curious where you found that approach hit limits, and what, if anything, gives you an advantage against others copying it. Getting real-time streaming with a 20B-parameter diffusion model at 20fps on a single GPU seems objectively impressive. It's hard to resist just saying "wow" looking at the demo, but I know that's not helpful here. It is clearly a substantial technical achievement, and I'm sure lots of other folks here would be interested in the limits of the approach and how generalizable it is.
Good question! Software gets democratized so fast that I am sure others will implement similar approaches soon. And, to be clear, some of our "speed upgrades" are pieced together from recent DiT papers. I do think getting everything running on a single GPU at this resolution and speed is totally new (as far as I have seen).
I think people will just copy it, and we just need to continue moving as fast as we can. I do think a bit of a revolution is happening right now in real-time video diffusion models; there have been so many great papers published in that area in the last 6 months. My guess is that many DiT models will be real time within a year.
> I do think getting everything running on a single GPU at this resolution and speed is totally new
Thanks, it seemed to be the case that this was really something new, but HN tends to be circumspect so I wanted to check. It's an interesting space and I try to stay current, but everything is moving so fast. Still, I was pretty sure I hadn't seen anyone do that. It's a huge achievement to do it first and make it work for real like this! So well done!
One thing that is interesting: LLM pipelines have been highly optimized for speed (since speed is directly related to cost for companies). That is just not true for real-time DiTs. So, there is still lots of low-hanging fruit for how we (and others) can make things faster and better.
Curious about the memory bandwidth constraints here. 20B parameters at 20fps seems like it would saturate the bandwidth of a single GPU unless you are running int4. I assume this requires an H100?
Yep, the model is running on Hopper architecture. Anything less was not sufficient in our experiments.
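To make the bandwidth question concrete, here is the back-of-envelope arithmetic, under the unconfirmed assumption that the full 20B weight set is streamed from HBM once per generated frame; if the model runs several denoising steps per frame, the traffic multiplies accordingly, so treat these as rough lower bounds.

```python
# Back-of-envelope memory-bandwidth check (assumptions, not measurements):
# assumes the full weight set is read from HBM once per generated frame.
PARAMS = 20e9   # 20B-parameter diffusion transformer
FPS = 20        # claimed real-time frame rate

for name, bytes_per_param in [("bf16", 2), ("fp8", 1), ("int4", 0.5)]:
    weight_bytes = PARAMS * bytes_per_param
    traffic_per_sec = weight_bytes * FPS  # weight reads per second
    print(f"{name}: {weight_bytes / 1e9:.0f} GB weights, "
          f"~{traffic_per_sec / 1e12:.2f} TB/s weight traffic")

# H100 SXM HBM3 peak is roughly 3.35 TB/s, so bf16 (~0.8 TB/s per weight pass)
# already eats a large fraction before activations and any extra denoising
# steps -- consistent with the answer that anything below Hopper wasn't enough.
```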
This is like Tavus but it doesn't suck. Congrats!
Wow this is really cool, haven't seen real-time video generation that is this impressive yet!
Thank you so much! It's been a lot of fun to build
The last year vs this year is crazy
Agreed. We were so excited about the results last year, and they are SO BAD now by comparison. Hopefully we'll say the same thing again in a couple of months.
Thanks! It just barely worked last year, but not much else. This year it's actually good. We got lucky: it's both new tech and turned out to be good quality.
Not working on mobile iOS
what's not working for you?
This looks super awesome!
thank you! it's by far the thing I have worked on that I am most proud of.
This is next-level!
Thanks so much! We're super proud of it
Removing - Realized I made a mistake
I don't see any evidence that r0fl's comments are astroturfing. Sometimes people are just enthusiastic.
I appreciate your concern for the quality of the site - the fact that the community here cares so much about protecting it is the main reason why it continues to survive. Still, it's against HN's rules to post like you did here. Could you please review https://news.ycombinator.com/newsguidelines.html? Note this part:
"Please don't post insinuations about astroturfing, shilling, bots, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data."
It's a fair concern. But we don't know r0fl, and we are not astroturfing.
Even I am surprised by how many openly positive comments we are getting. It's not been our experience in the past.