In part 1, Ryan chats with the co-founder and CEO of Inception, Stefano Ermon, about diffusion language models and how their parallel, multi-token generation compares to traditional LLMs (spoiler: they’re faster, at comparable accuracy). In the second half of the episode, Ryan and the chairman of Roomie, Aldo Luévano, dive into Roomie’s purpose-built models for both physical and software AI, and how their ROI-first approach helps companies track the impact of their robotics and AI implementations.
Inception researches and builds diffusion language models for faster and more efficient AI.
Roomie is a robotics and enterprise AI company with an ROI-first platform that tracks how well their AI solutions are actually working.
Connect with Stefano on LinkedIn.
Connect with Aldo on LinkedIn.
[Intro Music]
Ryan Donovan: Hello, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I’m your host, Ryan Donovan, and today we have two interviews that I recorded on the floor at AWS re:Invent back in December. The first is with Stefano Ermon of Inception, and the second is with Aldo Luévano at Roomie. Enjoy, and we’ll talk to you next time.
Ryan Donovan: I’m here talking to Stefano Ermon, CEO and co-founder of Inception, and today we’re talking about diffusion LLMs. Now, I’ve heard of ‘diffusion models’ for image generation. How do diffusion LLMs work?
Stefano Ermon: Yeah, that’s a great question. So, diffusion models, as you said, are basically the best way right now to generate images, video, audio. So far, they’ve not really been able to work well on text and code generation, and what we’re doing at Inception is we’re pioneering the first large-scale, commercial-grade diffusion language models. Diffusion language models work very differently compared to traditional large language models that everybody else is building. So, everybody else is building autoregressive models, and basically, the way they work is when you ask a question to ChatGPT or Gemini, they will provide an answer one token at a time, left to right. And that’s pretty slow; it’s like a structural bottleneck that is very hard to accelerate. It’s a very sequential kind of computation. A diffusion language model instead generates multiple tokens, essentially in parallel. So, you start with a rough guess of what the answer should be, and then you iteratively refine it. The key difference is that we still use a big neural network under the hood, but each neural network evaluation is not just producing a single token; it’s basically able to modify multiple tokens at the same time, and that’s why diffusion language models are significantly faster than autoregressive models. We’re looking at 5 to 10x faster compared to autoregressive models of similar quality.
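To make that contrast concrete, here’s a minimal sketch of the two decoding loops. The `model` call, tensor shapes, and argmax-style updates are illustrative assumptions, not Inception’s actual decoder:

```python
# Illustrative sketch: autoregressive vs. diffusion-style decoding loops.
import torch

def autoregressive_decode(model, prompt_ids, num_new_tokens):
    """One full network evaluation per token: inherently sequential."""
    ids = prompt_ids
    for _ in range(num_new_tokens):
        logits = model(ids)                      # full forward pass...
        next_id = logits[:, -1:].argmax(dim=-1)  # ...yields a single token
        ids = torch.cat([ids, next_id], dim=-1)
    return ids

def diffusion_decode(model, prompt_ids, num_new_tokens, num_steps, vocab_size):
    """Each evaluation refines all answer tokens at once."""
    # Start from a rough guess: random tokens appended to the prompt.
    noise = torch.randint(vocab_size, (prompt_ids.size(0), num_new_tokens))
    ids = torch.cat([prompt_ids, noise], dim=-1)
    for _ in range(num_steps):                   # num_steps << num_new_tokens
        logits = model(ids)                      # one pass updates...
        ids = ids.clone()
        ids[:, -num_new_tokens:] = logits[:, -num_new_tokens:].argmax(dim=-1)  # ...many tokens
    return ids
```

The speedup falls out of the loop counts: `num_steps` denoising passes instead of `num_new_tokens` strictly sequential passes.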
Ryan Donovan: For diffusion models I’ve seen for images, they start with a seed and noise. Do you also start with seeds and noise, and by reusing the same seed, would that make it a more deterministic process?
Stefano Ermon: Yeah, so the process is very similar. So, just like in a typical– you’ve probably seen videos showing how a diffusion model works. You start with pure noise and then you refine it until you get a crisp, nice image or video at the end. We do something very similar. We start with basically random tokens, and then as we go through this diffusion process, we gradually remove noise, which means that we are able to correctly guess what the token values should be. And so, you can see the process where we go through this refinement chain until at the end we get a high-quality answer that we can output and we can give to the user.
Ryan Donovan: So, in refining that noise, is there a sort of built-in eval process there? Is that part of the neural network? I know with image generation there’s all sorts of convolution, deconvolution, all that sort of stuff. Is it anything like that?
Stefano Ermon: So, even for image diffusion models these days, people mostly use transformers. I was one of the pioneers of diffusion models for image generation back in 2019. My lab at Stanford kinda invented the whole idea of using a diffusion model. Back then, yes, we were using basically ConvNets because that made sense. We were doing dense image prediction, and that was the best thing that existed back then. Since then, people have switched mostly to diffusion transformers. So, it’s still a transformer-based neural network that is trained on the same kind of objective. Basically, you take an image, you add noise, and then you train the transformer to predict the noise, or effectively remove the noise from the image. And we do something similar: under the hood, we’re also using transformers, so our neural network is still a large transformer. We start with clean text or clean code. We intentionally add some mistakes, so we essentially destroy some of the structure in the data, and then we train the neural network to reconstruct the original clean signal. It’s basically trained to correct mistakes. Instead of being trained to predict the next token, the neural network is actually trained to correct mistakes. And at inference time, the process is not, ‘predict one token at a time;’ the process is, ‘let’s try to fix as many mistakes as you can as you go through this denoising process until the output is sufficiently clean,’ and then we just output it and give it to the user.
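A minimal sketch of that training objective, assuming a simple random-token corruption scheme (the actual recipe isn’t public):

```python
# Sketch of the 'corrupt, then learn to reconstruct' objective described above.
import torch
import torch.nn.functional as F

def denoising_loss(model, clean_ids, vocab_size):
    # Sample a corruption level, playing the role of a diffusion timestep.
    corrupt_prob = torch.rand(())
    corrupt = torch.rand(clean_ids.shape) < corrupt_prob
    # Destroy structure: replace the chosen positions with random tokens.
    noise = torch.randint_like(clean_ids, vocab_size)
    noisy_ids = torch.where(corrupt, noise, clean_ids)
    # Train the transformer to recover the original clean signal.
    logits = model(noisy_ids)                      # (batch, seq, vocab)
    return F.cross_entropy(logits[corrupt], clean_ids[corrupt])
```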
Ryan Donovan: That’s interesting. So, it’s almost like trying to reproduce the training data. How does that work for stuff that is peripheral or not exactly the training data?
Stefano Ermon: Yeah, so it’s still a statistical model, just like an autoregressive model. We’re trying to learn basically a probability distribution – the data distribution, the data-generating process – and it depends on what you’re training the model on. It could be code, it could be text, typically large data sets that we scrape from the internet. And we’re learning a generative model in the sense that it’s not just trying to remember all the training data; it needs to generalize. So, there’s still a machine learning component where you need to be able to generate new content, otherwise it wouldn’t be particularly useful. But under the hood, yes, it’s a generative model. It’s a statistical model. When you give it a new prompt that the model has perhaps not seen during training, it will try to generalize. It’ll try to come up with a reasonable answer, which is based on this probability distribution that the model has learned during training.
Ryan Donovan: Is it still prone to hallucinations?
Stefano Ermon: That’s a great question. I get this a lot, and unfortunately, yes, it’s not perfect. We like this idea of denoising because basically the model is trained to fix mistakes, and so we hope that it’s gonna help us reduce the number of errors – basically, there is built-in error correction. If you think about an autoregressive model, once it outputs a token, it can never take it back. There is no mechanism to fix a mistake. Our models are trained to fix mistakes, and so we’re hoping that eventually we’ll get to the point where they’re 100% reliable, but we’re not there yet. Right now, they still do make mistakes, but if you measure it in terms of the quality, the accuracy that you get on question answering or programming, they do really well. They’re comparable to some of the best speed-optimized models from the frontier labs – the Mini models, Flash, the Haiku models from Anthropic – we’re comparable to those models in terms of accuracy.
Ryan Donovan: With a diffusion model, you could iterate forever. Is there a sort of optimal sweet spot you’ve found?
Stefano Ermon: Diffusion language models enable different ways of using compute at inference time, at test time. And we have our own methods that we’ve devised at the company to figure out the optimal trade-off on compute, but it’s an area that we’re actively researching. We’re building up reasoning capabilities, where the model can automatically decide how many steps, how much it needs to think. And this approach to reasoning is completely different from how it’s approached in traditional autoregressive models, like the o1 models or the reasoning models from DeepSeek and other players. And so, we’re very excited about the kind of capabilities that we’re seeing because it’s something completely different, completely new.
Ryan Donovan: So, what is the sort of change in thinking people have to do when they’re thinking about inference, or any of the other evals?
Stefano Ermon: Yeah, so it’s basically a very different stack. Under the hood, we still use transformers, so the neural nets don’t change too much, but the training objective is different, and the inference is completely different. So, it’s not just predicting the next token over and over. You need to figure out how to efficiently process many tokens at the same time, denoise them as efficiently as possible, figure out how many denoising steps you wanna use, and we’ve built all of that technology internally at Inception. We’ve built our own serving engine that we can use to basically serve these models at scale, handle continuous batching, handle different kinds of caching that you might wanna do to make the model as efficient as possible. So, we’ve internally built a pretty complex serving engine that we’re able to use, and that we’re using right now to serve a lot of different customers.
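As a toy illustration of continuous batching for a step-based decoder – the scheduling below is a simplification, not Inception’s serving engine:

```python
# Toy continuous batching: requests join the in-flight batch between
# denoising steps and leave as soon as they finish.
from collections import deque

class Request:
    def __init__(self, rid, steps_needed):
        self.rid = rid
        self.steps_left = steps_needed

def serve(incoming: deque, max_batch: int = 8):
    in_flight = []
    while incoming or in_flight:
        # Admit new requests whenever a slot frees up.
        while incoming and len(in_flight) < max_batch:
            in_flight.append(incoming.popleft())
        # One denoising step advances every request in the batch at once.
        for req in in_flight:
            req.steps_left -= 1
        for req in [r for r in in_flight if r.steps_left == 0]:
            print(f"request {req.rid} finished")
        in_flight = [r for r in in_flight if r.steps_left > 0]

serve(deque(Request(i, steps_needed=3 + i % 2) for i in range(5)))
```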
Ryan Donovan: So, you’re still using tokens. Does the way that you tokenize and chunk and approach a piece of text change with the diffusion model?
Stefano Ermon: Yeah, that’s another great question, and it’s pointing towards the fact that a lot of the design choices that people make are optimal for autoregressive models, but because we’re using something very different, what we’re seeing is that a lot of those design choices are not optimal. And so, there is still a lot of room for improvement because we can just play around with a lot of those design choices, and I think we’re still very far from a local optimum. So, I’m very optimistic about the future of diffusion language models, because in a very short time we were able to catch up with some of the best autoregressive models out there, and there’s still a lot of room for improvement, exactly because of what you said: there are so many design choices that are suboptimal.
Ryan Donovan: So, with the diffusion image models, there was like the six-finger problem. What’s the six-finger problem for the text version?
Stefano Ermon: That’s a good question, and one thing that happens, and I don’t understand why, but it happens, is degeneracies: the models can go on and on repeating similar things over and over.
Ryan Donovan: Kind of like a recursive loop.
Stefano Ermon: Google Gemini released the Gemini diffusion model a while back. It’s not available in production, but there was a little demo that you could play with, and I saw that it had the same kind of problem. So, I wonder whether it’s something that is somehow special and related to the diffusion approach. We fixed it mostly, but I think it’s an interesting kind of finding.
Ryan Donovan: Almost sounds like with the image you have a set boundary. With text, you don’t have a set boundary. Is that related to the problem?
Stefano Ermon: Yeah, that’s another great question. One of the challenges of building diffusion language models is how to handle variable-length content. It’s something we’ve not disclosed publicly how we do, but it’s definitely one of the big technical challenges. One is figuring out how to generalize the diffusion math, which is inherently very continuous. If you think about a diffusion process, it all revolves around partial differential equations and Fokker-Planck equations. It’s a very continuous type of math, and we had to figure out how to convert that mathematics to something that is very discrete. Like [how] you’re talking about tokens, there is only a finite number of tokens. There is no way to interpolate between the meaning of two different words, right? There is nothing in between. And so, we had to develop the right math to figure out how to do it, but still there are additional challenges, like handling variable-length generation, figuring out how to actually serve these models efficiently in practice, and also building the actual systems-level optimizations to make these models as efficient as they can be.
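For reference, the continuous math he’s alluding to versus its discrete counterpart, in standard textbook form (not Inception’s specific formulation): a forward diffusion SDE whose density evolves by a Fokker-Planck equation, and, over a finite vocabulary, a token-corruption Markov chain in its place:

```latex
% Continuous forward diffusion and its Fokker-Planck equation:
\mathrm{d}x_t = -\tfrac{1}{2}\beta(t)\,x_t\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}W_t,
\qquad
\partial_t p_t(x) = \tfrac{1}{2}\beta(t)\,\nabla\cdot\bigl(x\,p_t(x)\bigr)
                  + \tfrac{1}{2}\beta(t)\,\Delta p_t(x).

% Discrete analogue over a vocabulary of size V: no interpolation between
% tokens, just a corruption matrix mixing probability mass across them.
q(x_t \mid x_{t-1}) = \mathrm{Cat}\bigl(x_t;\, Q_t\, x_{t-1}\bigr),
\qquad Q_t \in \mathbb{R}^{V \times V}.
```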
Ryan Donovan: With traditional LLMs, running them in a sort of cloud space, they take up a lot of memory – they’re like three, four, or five terabytes in memory. Do diffusion models have the same issue?
Stefano Ermon: Yeah. So, we still, unfortunately, need to use pretty large neural networks, and so the amount of memory that we need is still pretty large. The key benefit is that they’re much less bottlenecked by memory bandwidth. So, if you think about autoregressive generation, the key bottleneck is that the memory is not only finite, but it’s also pretty slow to move the data around across the different memory hierarchy. If you wanna move the weights all the way from the slowest, large memory all the way down to where you can actually do the computation, that’s very slow. That’s the bottleneck: memory bandwidth, more than compute. Diffusion language models are basically designed to be much more efficient in terms of memory bandwidth, because you can essentially process multiple things in parallel, so you can move the weights once and then use them across many tokens at the same time. And so, the arithmetic efficiency is significantly higher compared to traditional autoregressive models. And that’s why our models are so much faster in practice.
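A back-of-the-envelope version of that bandwidth argument, with made-up numbers (not Inception’s specs):

```python
# Count full weight reads, since streaming weights from memory is the bottleneck.
params = 14e9                  # hypothetical 14B-parameter model
weight_gb = params * 2 / 1e9   # fp16: ~28 GB streamed per forward pass

tokens = 512                   # length of the answer we want to generate
diffusion_steps = 50           # fixed number of denoising passes (assumed)

ar_traffic = tokens * weight_gb             # autoregressive: one pass per token
diff_traffic = diffusion_steps * weight_gb  # diffusion: one pass per step

print(f"autoregressive: {ar_traffic:,.0f} GB of weight traffic")
print(f"diffusion:      {diff_traffic:,.0f} GB of weight traffic")
# Roughly tokens / diffusion_steps = ~10x less memory traffic in this toy case.
```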
Ryan Donovan: Are there any plans to move away from the transformer backbone?
Stefano Ermon: Yeah, that’s another good question, and in fact, our production models are still based on transformers, but we do have some prototype models that are based on different backbones. In my lab at Stanford, I work quite a bit on state-space models, and these are alternative architectures that do not scale quadratically with respect to the length of the context. And what we’ve shown is that indeed we can actually use those architectures as the neural network layers and train them in a diffusion way. So, instead of training them to predict the next token, you can train them to be a denoiser, and everything works. And so, these are very orthogonal ways of improving the efficiency. One is an architectural improvement. The other one is more like an algorithmic improvement, where we’re training the network to do something different, and we’re using a different, more parallel algorithm to do inference. The benefits of both can actually be combined.
Ryan Donovan: And one of the hot new things is world models. Is it possible to have a diffusion-style world model? It seems like diffusion is a more holistic sort of approach, so is that possible?
Stefano Ermon: Yeah, as far as I know, a lot of the best world models that exist out there are actually based on diffusion. And so, it’s a key technology that people use whenever they think about world models, for several reasons. One is that they tend to provide higher accuracy, but then they’re also significantly faster. And if you think about a world model, you typically wanna use it to predict what’s gonna happen next, use that information to make better decisions, and figure out how you wanna drive your car. And so, that’s why diffusion models are so popular in that space, ’cause they can be made to be so efficient. And that’s the same reason we are using that technology for language generation, because of course, inference is a key bottleneck. The demand for inference is gonna be there, and that’s why people are building all these data centers. There’s gonna be a huge need for more and more inference. And yeah, the hardware will get you some improvement in the number of tokens that you can serve, but software will have to come in, and diffusion language models are a way to get a 10x improvement, which is independent from the improvements that we get from the hardware. I’d love for your audience to try out our models. If you wanna try the latest and greatest language models, our models are available through our API, so you can use them in an OpenAI-compatible way. So, it’s all backwards compatible. Whatever application you were building on top of traditional autoregressive models, you can now use a diffusion language model, and you can get much better speeds and reduced costs, especially if you’re working on an application that is a little bit latency-sensitive; maybe it’s a coding IDE, or an AI voice agent, or a customer support agent. We’re seeing our customers get significant benefits by switching over to diffusion language models. And so yeah, I encourage you all to give it a try. My name is Stefano Ermon, one of the co-founders and the CEO of Inception Labs. The website is inceptionlabs.ai. I’m also on LinkedIn. If you look for my name, it’s pretty unique, so you’re gonna find me.
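The OpenAI-compatible pattern he describes would look roughly like this; the base URL and model name below are placeholders, so check Inception’s docs for the real values:

```python
# Hypothetical usage of an OpenAI-compatible endpoint via the standard client.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-inception.ai/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="diffusion-model-placeholder",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize this function in one line."}],
)
print(response.choices[0].message.content)
```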
Ryan Donovan: I’m here with Aldo Luévano, Chairman of Roomie, and we’re gonna be talking about purpose-built models for both physical AI and software AI. So, tell me, what is the breadth of the models that you have?
Aldo Luévano: Actually, we have an enterprise AI platform, so we’re using different LLMs, and basically we’re targeting top accounts and mid-market accounts. Our solution can automate different back-office processes, right? So, we have a huge range of solutions in our enterprise AI. Actually, I think that one of the most important differentiators that we have is our ‘ROI first’ approach. At this moment, we have a lot of ‘first’ conversations. I think that basically they are buzzwords on LinkedIn and in the industry: ‘AI first’, ‘data first’. But the reality is that organizations are looking for an ROI-first approach, right? We need to deliver real value for our customers. So, that’s why we decided to create a core module in our platform that can track the return on investment of each dollar invested in artificial intelligence. So, the relationship with physical AI is because one of our modules is a physical AI module. Basically, we can integrate the agentic world with physical use cases, with humanoid robots, bipedal robots, or also edge AI smart devices that can interact with different kinds of processes, in factories, for example, distribution centers, and other CPG organizations.
Ryan Donovan: It’s interesting. Is that an LLM or is that more of a traditional machine learning model?
Aldo Luévano: Actually, it’s LLM-based. It’s interesting because basically the model can, first of all, calculate what the current TCO of a manual process or a semi-automated process is gonna be – maybe one running on a legacy technology, for example. Based on this calculation, then we can forecast what the future state of that TCO is gonna be after our enterprise AI implementation. And based on these TCOs, then we can calculate the final ROI for that organization. And basically, it’s a GPT. It’s a conversation that our consultants have with the clients, with our customers, understanding their business needs, their strategy, the operation of their companies, and based on this understanding, we can estimate what the ROI associated with the implementation is gonna be, right? So yeah, every module that we have in the platform is gonna start with the implementation of our ROI-first core module.
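The TCO-to-ROI arithmetic he outlines reduces to something like the sketch below; the exact formula is an assumption, not Roomie’s actual model:

```python
def estimate_roi(current_tco: float, projected_tco: float,
                 implementation_cost: float) -> float:
    """ROI = (TCO savings - cost of the AI implementation) / that cost."""
    savings = current_tco - projected_tco
    return (savings - implementation_cost) / implementation_cost

# e.g. a manual process costing $1.2M/yr, forecast at $400k after automation,
# with a $250k implementation:
print(f"{estimate_roi(1_200_000, 400_000, 250_000):.0%}")  # -> 220%
```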
Ryan Donovan: How do you train the thing? Do you have successful ROI use cases, like documents, and then say, ‘it should look like this’? Or, for any given business implementation, it’s gonna look a little different, right?
Aldo Luévano: We’re a company with 11 years on the market, right? So, we used to develop a lot of project-based offerings – projects, tailor-made solutions for our clients – and the reality is that we learned a lot. Across all these implementations, we have clients in financial services, in banking, in CPG, in retail, in the public sector, so after all of these years of experience, we could understand very well the business of our customers, and based on this understanding, we could train our model with all this data, right? So, at this moment, we can just calculate the ROI associated with the use cases that we know very well, but eventually, the idea is to integrate more use cases into our platform, and also calculate the ROI associated with those use cases. So, again, basically the idea to train the model is understanding the business needs. We have historic information from all the implementations that we have with our clients, and nowadays we have a powerful solution to calculate it.
Ryan Donovan: Could you just do, like, an automation of a business? Just attach some agents to that, ‘here’s the ROI,’ and say, ‘go make a business, AI’?
Aldo Luévano: Yeah, it’s a good question, too. Actually, as I was mentioning, we have eight use cases, eight modules in the platform. So, for example, we have a module that is associated with legacy systems. You know, at this moment in organizations, we have a lot of vibe-coding tools like Lovable, for example, Replit, or even Cursor, but the reality is that all these tools are for the creation of new stuff, right? New applications, new mobile apps, enterprise apps with the latest architecture, right? Like JavaScript, for example. But the reality is that if you look at the complete market of software development, the majority of the market is on legacy systems. The big dinosaurs are in organizations – mainframes, for example, COBOL-based systems that are basically supporting millions of transactions every single day. So, it’s crazy, right? But that’s why we identified this opportunity with one of our models: we can take the legacy system and create new functionalities based on natural language, as a vibe-coding tool. So, it’s a verticalized LLM or SLM in this case, because I think that one of the issues in adopting this use case is, you know, that in general the LLMs are trained with public data, right? With Wikipedia, for example, in the case of ChatGPT. But the reality is that if you wanna create code in a natural language fashion, you need to have access to that private code – so this is the main challenge, and in our case, because we have had access to our clients and to their source code over the last 11 years, that’s why we could create this enterprise AI for legacy systems.
Ryan Donovan: We’re here at re:Invent and saw the Matt Garman keynote. He talked about a similar thing – moving off of legacy systems. And you talk about how you have the legacy code, the mainframe stuff, COBOL. Do you also have the sort of end result of the conversion? Like, what the new stuff should look like?
Aldo Luévano: Yeah, it’s a very good point, and maybe I have a different approach, because when you have a monolith, in this case, you have very fast applications. The speed of the application is amazing, right? Because if you compare the monolith with service-oriented applications or cloud-based applications, you have different layers, right? So, the application could be slower. So, for this reason, I believe that at this moment, it’s possible that big banks and financial service organizations are gonna decide to continue their operation on the legacy system – that’s basically the decision that a lot of CTOs are taking right now. That’s why I think that our main functionality in the enterprise AI solution is that we are delivering maintenance and support for these legacy systems. It’s our clients’ decision whether to migrate the solution to a new architecture, to new code, but they can keep the solution on the legacy system and just create new functionalities, plus maintenance and support of the current modules, based on natural language.
Ryan Donovan: For some migrations, I’ve seen companies break the code down to an abstract syntax tree and then build that up in another language. Have you looked at those kinds of approaches?
Aldo Luévano: Yes. I think that it’s important to clarify that we have this option in our enterprise platform, for sure. I think that it’s a powerful use case. Again, it depends on the business needs, right? If they decide to move all the stuff to a new architecture, for sure we can do it, but I think that the main advantage that we have in our platform is doing the maintenance of the legacy system. Also, a big problem that, in general, the industry is facing right now is that COBOL developers are old now. So, for this reason-
Ryan Donovan: Nobody’s learning COBOL.
Aldo Luévano: Nobody’s learning COBOL right now. So, for this reason, we need to have tools like our ROI-first enterprise platform to accelerate the development of new legacy code on this mainframe infrastructure.
Ryan Donovan: You mentioned some of your models are for physical AI. Can you talk about those?
Aldo Luévano: Yeah, for sure. We have a module that is 100% based on physical AI, and it’s also important to explain a little bit more about the story of the company. The company started as a robotics startup for B2B. At some point, we decided to pivot our business model a little bit, and we started creating artificial intelligence solutions, because in general, I think that robotics is at an early stage. Right now, even if we have humanoids and they can walk, the reality is there’s not a lot of adoption in terms of robotics. So, for this reason, the company is growing through our line of business in AI development, let’s say. But we are still investing in robotics because we see that in the future [there] are gonna be huge market opportunities in robotics. And one of our modules in the physical AI space is basically an integrator. Okay? So, we can connect our ROI-first platform, all the use cases that we have in the platform, with physical use cases, and in order to do that, you need to operate, you need to control the physical devices, in this case, humanoids. For example, we have a module that basically is a computer vision model for doing self-checkout, or the picking inside factories, or in CPG organizations. So basically, this capability of the humanoid robot can increase, can extend the scope of our agentic space in computer vision, because obviously, the robot can walk. Then we can have a huge range of opportunities in terms of anomalous pattern detection, or out-of-stocks, or checkout solutions.
Ryan Donovan: Obviously, computer vision has been a research topic for a long time. What sort of approach are you all taking? I know convolutional [and] deconvolutional neural networks were sort of in vogue for a while. What’s the approach that you’re taking?
Aldo Luévano: Yeah, I think that at this moment we have a lot of models already in our industry, and you can just download them from GitHub. So, the main approach is CNNs, for sure. I think that the main advantage that we have is that we’re not just developing computer vision to alert personnel at organizations, the workforce. The idea of the agentic approach is to have an action after the identification of a specific pattern. So, for this reason, we are connecting our computer vision, the inference of the CNN, with the agentic space to enable different activities, tasks that can solve a specific use case in the industry.
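A toy sketch of that ‘inference triggers an action’ pattern; the model, labels, and dispatch hook are hypothetical stand-ins, not Roomie’s platform API:

```python
# CNN inference followed by an agentic action, not just an alert.
import torch
import torchvision.models as models

OUT_OF_STOCK_CLASSES = {0, 3}  # hypothetical label ids from a fine-tuned model

def dispatch_restock_task(shelf_id: int) -> None:
    print(f"agent task queued: restock shelf {shelf_id}")  # stand-in for a task queue

detector = models.resnet18(weights=None).eval()  # stand-in CNN

def handle_frame(frame: torch.Tensor, shelf_id: int) -> None:
    with torch.no_grad():
        pred = detector(frame.unsqueeze(0)).argmax(dim=-1).item()
    if pred in OUT_OF_STOCK_CLASSES:
        dispatch_restock_task(shelf_id)

handle_frame(torch.rand(3, 224, 224), shelf_id=7)
```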
Ryan Donovan: These and others are separate models, but they’re on a platform. Do they connect? Can you make a robot that does your taxes?
Aldo Luévano: Actually, at this moment, no. It’s funny because at this moment, I think that the robotics industry is more oriented to the general-purpose approach. So, there are a lot of VCs investing in humanoids, but there isn’t any specific use case. We know what the main use cases are gonna be, for example, picking in factories, or maybe in households, to have a companion in your house. But the reality is that the potential of the technology is so big because the robot can replicate our human capabilities, right? To pick up an object, to walk, to have a conversation, you know, exactly like a human. So, for this reason, I believe that we don’t have specific use cases right now. We are in a moment where it’s complicated to create an ROI, or a business case with unit economics for a deployment, or something like that. And the play right now is to raise venture capital for the humanoids, and also to present a technology that could be deployed, let’s say, in 5-10 years, but that guarantees this general-purpose capability.
Ryan Donovan: On the plane over, I watched the movie Companion. Speaking of robotic capabilities, do you worry about the sort of ethical implications of having robots doing human-like things? Obviously, Terminator is a touchstone for that sort of worry.
Aldo Luévano: I think that it’s a very good question, because on the ethical side, for sure, there is a threat associated with the implementation of humanoid robots. Because at the end of the day, companies, and the industry in general, are looking for hyper-automation, you know. Hyper-automation means cost reduction, payroll reduction, and workforce reduction. But the reality is that it’s a transformation that humankind is facing right now. So, for that reason, I think that we need to understand that there are a lot of different kinds of technologies that are hyper-automating the workforce of organizations. The agentic space is an excellent example, right? Now, you can automate a role in the organization, or a job in the organization, using agents, and it doesn’t mean that you need to have a physical worker operating the process, but I believe that eventually we need to manage this. And in our case, it’s important to clarify our point of view: one of the dimensions of our ROI-first approach is to reduce the workforce. I think that it’s a complicated topic, because a lot of people say, ‘no, we’re not gonna do away with jobs, or we’re not gonna decrease jobs in the organization – that is not true; we’re gonna improve the capabilities of the workforce using artificial intelligence.’ Our approach at Roomie is a little bit different. We’re gonna come out and say that we reduce part of the payroll of the organization, and this is true, and we need to manage this.
Ryan Donovan: Automation is a key concern of computer science in general, but I’m sure there’s a lot of nerves out there, a lot of anxiety about the future.
Aldo Luévano: And also, I believe that it’s because when you think of artificial intelligence, normally you think of a robot. I think that’s the physical representation of artificial intelligence, of AI. So, that’s why a lot of people are worried about the situation, partly just because of this ethical framing from sci-fi. Sci-fi feeds that fear. But at the end of the day, we need to improve our human capabilities and learn more skills, and if we’re gonna use robots, for example, right now, if you wanna have a real use case, you need to teleoperate the robot. So, you need to understand ‘how can [I] manage the user interface of the robot,’ ‘how can [I] train the robot?’ And there are gonna be new jobs associated with robotics deployment, right? So, right now, obviously we are gonna lose a lot of jobs. I think that is part of the evolution of our industry, the evolution of society, but at the same time, we’re gonna create more jobs associated with this technology. I think that at this moment, it’s complicated to have just one use case as your main product. The no-code capabilities, the democratization of AI, are so important right now, because you as a company or a startup can create different use cases in less time. So, that’s why our thesis at Roomie is to support an enterprise layer, right? We are not verticalizing just one use case as our growth driver. In fact, we have different use cases, and this is the idea, right? We wanna democratize access to artificial intelligence, mainly for Latin American companies, and eventually, the idea is to have exactly the same approach for United States companies, because when we’re talking with some opinion leaders, they say, ‘hey, how can you have all the muscle to develop physical AI and, at the same time, enterprise AI?’ Because normally this is strange, right? On the other side, it’s common to just pay attention to physical AI, or just develop enterprise AI. We have a different thesis, because right now, you can create technology very fast with this new approach to software development, and that’s why we wanna have a huge range of use cases for our clients. Yeah, my name is Aldo Luévano. I’m the chairman and co-founder of Roomie. We are a robotics and enterprise AI company based in Boston, and our delivery center is in Mexico City. So, we are growing very fast. We have more than 11 years in the market. We are, obviously, partners of AWS; we’re an ISV. You can find more information about us at roomie-it.com.