Even your voice is a data problem

Ryan and Scott cover how Deepgram is improving speech-to-text and text-to-speech capabilities with deep learning to take on the challenges posed by dialects and noisy environments, as well as the moral and ethical considerations voice AI companies face around voice cloning and training on synthetic data.

Deepgram builds accurate, scalable, and affordable large-scale voice AI for speech recognition, speech generation, and AI agents.

Connect with Scott on LinkedIn or Twitter, or email him at Scott@Deepgram.com.

[Intro Music]

Ryan Donovan: Hello, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I’m your host, Ryan Donovan, and today we’re talking about voice AI. And my guest [is] Scott Stephenson, founder and CEO of Deepgram. So, welcome to the show, Scott.

Scott Stephenson: Thanks for having me.

Ryan Donovan: Top of the show, we like to get to know our guests. How did you get involved in software and technology?

Scott Stephenson: So, I was a particle physicist. I built deep underground dark matter detectors, and it was in my physics training that I first came across serious coding and development work. From a physicist's perspective, though, everything is just a tool, so it's, 'okay, I'm learning this thing in order to accomplish tasks.' So, that's when I first got the deep experience. But it was TI-83 calculators and programming in BASIC while you're bored in the back of your math class that got me going in that mindset. But then the real exposure to how the sausage is made was in physics training, and physicists, because of this tool mentality, think, 'yeah, just hurry through, and whatever.' And so, they get this reputation for writing bad code, which is mostly true, but this is something that my co-founder, who is also a physicist, and I rebel against a little. We think it's better to understand the error handling and all this stuff, because otherwise you do all this work going down one path, and then you can't reuse it somewhere else. And so, that was the first deep experience I had with it, when I was in my PhD program working with my now co-founder, who's CTO of Deepgram. And now, of course, we build AI models for speech-to-text, text-to-speech, and voice agents, and all of that development [and] engineering mindset really comes in handy for that.

Ryan Donovan: So, you founded Deepgram 10 years ago or so. Where was the line between that and speech-to-text/text-to-speech? Why that problem?

Scott Stephenson: So, I was, not kidding, in a James Bond lair deep underground in a government-controlled region of China; this is just what it was. And I was the sole American graduate student there because the US was about to partner—this is like 2010-2011. Relations were good. Capitalism was flourishing in China, et cetera. It was on the upswing, but then Xi Jinping took over as ruler of China, and then things soured. But in that brief window where things were looking good, I was a graduate student, and we were cooking up this idea of a particle physics experiment in China, because we had heard about the world's tallest dam being built. It's called the Jinping Dam, and it still is. They have the Three Gorges Dam, but then they have the Jinping Dam. It's a lesser-known one in Western China, but it's the tallest dam in the world. It's also a unique dam: it's a standard dam, but then it has a secondary dam where it diverts a river through a mountain, and then the mountain is the dam. Okay?

Ryan Donovan: Oh, okay.

Scott Stephenson: This is important because when you’re diverting a river through a mountain, that means that there’s a tunnel going through, and now you have all this rock above you. And so, in particle physics, you’re always trying to run away from cosmic radiation. We’re constantly bombarded by radiation. If you were to build a detector on the surface of the Earth, it would light up like a Christmas tree. And so, you try to find a shield, and for most cosmic radiation, you have to go deep underground to have a sufficient shield. But we hear about this marble mountain in China, where it’s two miles underground, and we’re like, ‘oh my God, that’s such a great place to do it;’ and we somehow convinced the Chinese government and US government that this is a good idea. And it was, but the relationships soured, but I still stayed as a graduate student there. So, I was the sole American graduate student working with this extremely fast-moving, basically, startup—I know now, [but] I didn’t know then, but a startup detector. We were starting from nothing, and then it was like, ’25 million, 25 people, four years, go.’ That’s a startup, basically.

Ryan Donovan: Yeah.

Scott Stephenson: We were building this detector that has waveform digitization in it, and these are extremely sensitive detectors called photomultiplier tubes. These are in PET scans and that type of thing; they can sense individual photons, but they're an analog device. And what you do is you digitize the waveform, and that waveform is at one-nanosecond or 10-nanosecond time steps. So, it's extremely fast, and there's a ton of data coming through, and it's super noisy. But if you do it the right way, with the right models, with everything the right way, then you can determine in the detector where a particle came in, bashed off something, and scattered, deposited energy, deposited light and charge, that type of thing. And then, you can figure out what type of interaction it was. Was it a background radiation thing? Was it dark matter? This is what we were looking for. And it turns out that setup works extremely– that way of thinking, having real-time models that are looking at waveforms at massive scale with extremely low latency, works really well for audio. But this isn't what we were thinking at the time. It's like [this] pessimistic scientist– physicist mindset. We were like, 'hey, let's just keep going down the path that we're going down.' But I just thought, man, it was so cool what we got to do. We were deep underground, we were in this James Bond lair, we're whatever– why isn't there a documentary crew here? Why isn't somebody recording this? We're gonna look back years from now and say we wish we had some recollection of this. So, we built these devices to make backup copies of our lives, just recording audio all day, every day. And so, we ended up with over a thousand hours doing that, just recording all the time. And after we did the data-taking run for the experiment, we were also taking data for that, basically. And we came back to it. So, we were uploading all that data to an S3 bucket. It was just accruing. But if you've ever tried to listen to a long-form recording of your life, it's extremely boring. Not a lot going on. And so, you actually wanna find the hits, you want the highlights, you want the highlight reel.

Ryan Donovan: Yeah.

Scott Stephenson: And so, this is where we were looking for a tool, to say, 'hey, is there some tool out there in the world that can do a highlight reel of audio?' And since we had so much success with the physics experiment, we assumed that yes, there probably would be. We were using end-to-end deep learning in order to do this; why wouldn't there be something to do it for audio? Seems easy. And so, we went out and looked. Nothing existed. But then, we also went to the frontier labs, like Google, Microsoft. They had speech teams in 2015 that everybody thought were the best of the best. And we got meetings with them, probably just with some clickbait title on the email like, 'Dark Matter Physicist,' whatever, and they're like, 'fine, we'll meet these people.' But we asked them, 'hey, we'll give you this weirdo data, but can you give us access to your next-gen, end-to-end deep learning-based speech recognition system?' And they're like, 'end-to-end deep learning is never going to work for voice. It's never gonna work for conversation. What are you talking about? We've tried this for years and years. All these people have PhDs in this. It doesn't work. You can't do it. You guys are gonna fail. That's cute that you wanna do that, but–' anyway. We couldn't get off the phone fast enough with these people, and we just said, 'these people don't get it. They don't get it. We should start a company.' And so, that's what got us into that mode where we're like, 'hey, we're gonna take this type of thinking, we're gonna apply it to audio, we're gonna build a demo.' We put it on Hacker News, and it went to the top of Hacker News; we were allowing people to search through YouTube videos. This was like 10 years ago. Now, you can just do it on Google. But that actually pushed us into doing B2B too, because we just assumed, hey, we built a consumer use case, but Google's just gonna do that; what Google's not gonna do is massive-scale B2B. And so, this is where we focused as a company. But anyway, that's how we got started: we had this pile of data that we were interested in, and there wasn't a solution out there. So, we built a solution, and now we sell that to people in the world.

Ryan Donovan: Yeah, there are a number of companies that are like, ‘oh, we need this, let’s build it. Other people need this. Let’s sell this.’ And the sort of approach of one waveform is like another, right? But of course, with speech, there is a whole raft of complications, right? There’s dialects, there’s slang, there’s speech impediments. How much work had to go in after the first pass to get it to be universally reliable?

Scott Stephenson: Yeah. There's a double-edged sword here, which is along the lines of 'all models are wrong, but some are useful,' that kind of way of thinking. So, what you do is, initially, you bite off a use case that isn't being well served, but it's the lowest-hanging fruit. It's still pretty high up there, but it's the lowest-hanging fruit. And so, for us it was customer service calls, doing analytics on them in English. And then, there are certain regulated industries in the US, like banking or insurance or others, where they just have to record everything, and they have to use a reasonable system to be able to search it, make sure that people aren't doing fraudulent activity in banking, or that type of thing. And there was already a business out there that Nuance and IBM had built around that. And we were like, 'okay, this is a good place for us to attack first.' Just batch mode, English, these areas, and go after Nuance, go after IBM. These are partners now, right? All these are partners of Deepgram now. One of the best ways to make somebody a great partner is to become a great competitor to them. And then they say, 'actually, you know what? You should probably do that, and then we'll do what we do best.' And then that works out really well. But yeah, at the time we were like, hey, let's go serve the need that these customers are looking for but not getting. And we did that by building cutting-edge models that were faster. We're the ones who brought down the price for speech-to-text. This is typically thought of differently: the hyperscalers are the ones doing that. Not for AI. It's typically the startups that are saying, 'actually, it costs too much to do it this way,' and that's actually a reason that people aren't using it at massive scale. So, you need to increase the efficacy of the model, but you need to drop the price. For us, speech-to-text was around $3 an hour at the time, like 2015-2016, and we were like, 'no. This has to drop at least 10x.' The reason is when you try to build a voice agent, it'll be speech-to-text, plus some type of language model (at the time, you wouldn't call them large language models; you'd just say a language model), plus TTS. And all of them together need to be less than $2 an hour. All of them. Everything that you're doing. Because you're competing with somebody that you hire in India or the Philippines, you're competing with the training of them, and all of that too. And they're typically being paid like $2-5 an hour, something like that. So, if you wanna have a product that's even close, you have to be in that range. And if it's $3 for speech-to-text, $5 for the language model, $5 for the text-to-speech, you're nowhere near that. So, anyway, this is just our mindset when we were coming into it, and the type of differentiation that we were bringing.
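
To make that budget math concrete, here is a rough back-of-the-envelope sketch in Python. The per-hour figures are only the illustrative circa-2015 numbers Scott mentions, not anyone's actual pricing.

```python
# Rough voice-agent cost-per-hour budget, using the illustrative
# figures from the conversation (not current pricing).

HUMAN_AGENT_COST_PER_HOUR = 2.00   # target ceiling: offshore human agent, ~$2-5/hr

# Circa-2015 component prices Scott cites
legacy_stack = {
    "speech_to_text": 3.00,
    "language_model": 5.00,
    "text_to_speech": 5.00,
}

def total_cost(stack: dict) -> float:
    """Sum the per-hour cost of every component in the pipeline."""
    return sum(stack.values())

print(f"Legacy stack: ${total_cost(legacy_stack):.2f}/hr "
      f"(budget: ${HUMAN_AGENT_COST_PER_HOUR:.2f}/hr)")

# Dropping each component roughly 10x is what brings the whole
# pipeline under the human-agent price point.
target_stack = {name: price / 10 for name, price in legacy_stack.items()}
print(f"~10x cheaper stack: ${total_cost(target_stack):.2f}/hr")
```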

Ryan Donovan: And so, the model itself – you say it’s deep learning. Is it traditional neural net, is it some other architecture? How does it work?

Scott Stephenson: Yeah. It isn’t very different than what we were thinking about at the time; myself for sure, but also the other team members at Deepgram. In the early days, if we had lowered our ambitiousness, then we wouldn’t do it at all.

Ryan Donovan: Yeah.

Scott Stephenson: And so, it wasn’t an iteration on the previous generation. It was a full rewrite. It was like, ‘we’re gonna do that, full end-to-end deep learning, or bust.’ And we had great investors early on that supported that, as well. So, the idea was build with a full end-to-end deep learning system. Even at the time, there were labs that were claiming that they had end-to-end deep learning for speech, but they actually didn’t. They would have a portion of it that was essentially a really good, strong acoustic model—is what people would call it at the time—but then they would tack on a traditional, more statistical heuristic language model onto the end of that, rescore it, do a beam search, et cetera. This is not an N10 deeplearning approach, but they would call it that, and we were saying we’re gonna get rid of all of that. And then what it’s gonna bring is extreme low latency, much higher throughput, then you can actually drop the price and still make great margin. But the promise also is that you’ll be able to train bigger, better models, but you’ll also be able to adapt the models. So, right now, or 10 years ago with the current thinking, if you want it to adapt a model, not kidding, it would be $500,000 to $2 million with IBM or Nuance, didn’t matter. And their accuracy rate might be 65% on phone call audio, or something. So, really bad. And then, after two years, they would get to 68%. The reason for this is every single one of those pieces, you’d have something that tried to de-noise the audio, something that tried to guess the phone names, something that tried to guess candidate words, something that tried to rank the words, something that did the beam search on those, et cetera—they’re just lossy, lossy, lossy. But when you say, we’re gonna get rid of literally everything that’s gonna be full end-to-end deeplearning, and the data is gonna write the model, then you actually have a hope now of saying, ‘I can go adopt this for you. All we have to do is label a little bit of data from you.’ So, this is now a key part of the Deepgram system. We have the best general models in the world, but we also allow folks to adapt models so that they get better over time, and that type of thing. So, we use full N10, deep learning. We use what people might call ‘dense layers’ or ‘ fully connected layers.’ Yep. We use those. That’s not the whole thing. If you tried to build a system like that, it wouldn’t work. We use a convolutional neural network, but if you built the whole system that way, it wouldn’t be world-class. We use recurrent type systems, but if you built the whole system that way, it wouldn’t work. And we use attention-based systems, and self-attention, et cetera, but if you built a system the whole way, that way, it wouldn’t work. So, what you have to do is figure out where does the fully connected make the most sense? Where does the convolutional make the most sense? Where does the recurrent part make the most sense? And then where does attention make the most sense? And I think about this from the physicist mindset. The fully connected layer is God mode, but it’s a really dumb God mode, because if it had all connections to everything in the world, then it actually would be like God mode, but it’s so narrowed down that actually it doesn’t work out that way. So, you have that, but typically you use that as some kind of adapter to move from CNNs to RNNs, or RNNs to attention, or whatever it is. 
Or to go from pixel space to some embedding space, or something, you would use it maybe as a portion of that. But the convolutional neural networks, though, I think are more like space. So, I think of it like space, time, omniscience, and then the ability to focus. Space is the convolutional neural way of thinking about it, and then the recurrent-type models are figuring out the time relation. But if you do that, then you're overloading– if you just say, 'okay, I'm gonna have fully connected layers with convolutional neural networks, and with recurrent neural networks, and then I'm just gonna feed that into the system,' it's actually like information overload. And this is where the attention comes in and says, 'no, just like with humans, I'm gonna look over there. I'm gonna do that. I'm gonna move my attention around.' And that was the thing that kind of pulled everything together and got through another hurdle. But I think of this a little bit like you are finding the elements of intelligence, like the periodic table for chemistry – we are finding it for intelligence now, and you're going down that path. So, it's just another fundamental science, but you're finding the natural laws of intelligence.
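
As a loose illustration of that 'space, time, focus, plus adapters' framing, here is a minimal PyTorch sketch of a hybrid acoustic encoder. It is a generic toy under arbitrary assumptions (layer sizes, a CTC-style output head), not Deepgram's actual architecture.

```python
import torch
import torch.nn as nn

class HybridSpeechEncoder(nn.Module):
    """Toy encoder mixing the building blocks discussed above.

    Convolutions capture local ("space") structure in the spectrogram,
    a GRU tracks time, self-attention lets the model focus, and a small
    fully connected layer acts as an adapter between stages.
    """

    def __init__(self, n_mels: int = 80, hidden: int = 256, vocab: int = 32):
        super().__init__()
        # "Space": local patterns across time/frequency
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        # Adapter: fully connected bridge from conv features to the RNN
        self.adapter = nn.Linear(hidden, hidden)
        # "Time": recurrence over the (downsampled) frame sequence
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        # "Focus": self-attention over the recurrent features
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        # Output projection, e.g. to character/BPE units for a CTC head
        self.head = nn.Linear(2 * hidden, vocab)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, n_mels, frames)
        x = self.conv(mels).transpose(1, 2)   # (batch, frames/4, hidden)
        x = torch.relu(self.adapter(x))
        x, _ = self.rnn(x)                    # (batch, frames/4, 2*hidden)
        x, _ = self.attn(x, x, x)             # self-attention over the sequence
        return self.head(x)                   # (batch, frames/4, vocab)

logits = HybridSpeechEncoder()(torch.randn(2, 80, 400))
print(logits.shape)  # torch.Size([2, 100, 32])
```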

Ryan Donovan: Yeah. I would go back to some of the other ones [that] do the phoneme detection–

Scott Stephenson: Yeah.

Ryan Donovan: The word, sort of statistical guessing, and it sounds like you're just processing the raw waveform. You figure it out. With all those neural nets running over it, is it more reliable? Is it less? Because every recording system I have does transcription, and they don't get it exactly right, especially when it's unusual jargon. Are there advantages, disadvantages to the raw waveform approach?

Scott Stephenson: It’s mostly a data problem, and not that sort of input transduction. So, we’ve done a lot of studies on this. Do you use a raw waveform? Do you use two DFFTs? Do you use log mail spectrogram components? Do you do psen? Is another common one. Do you do them in combination? And it doesn’t seem to matter that much. Basically, if you take each one of them and optimize them, it doesn’t matter all that much. Even if you combine them, it doesn’t matter all that much, which is very interesting. The thing that matters more is the temporal attention combining, and that type of thing. So, you need to have an input transduction that maintains the information, but as long as it is doing it in a reasonable way via some transformation, or you’re allowing the model to create its own transformation giving it enough degrees of freedom to do that, then it does a really good job with it. The way I would look at that is more a failing of the data manifold coverage. And when you have enough of that, and you probably have to tweak the size of the models a little bit, larger, but I don’t think it needs to be 100 times larger or whatever; but tweak the size of the models, three times larger, five times larger, or something like that, but with a 100 x data manifold coverage, then those little things that they’re talking about would be way better covered. There’s another way to do it too, which is through a more active learning system design, which is not common right now. So, pretty much what everybody uses now, it’s just what it is, it is what you get, and it’s not gonna get better over time. And, maybe every year or something like that, when a new model comes out, it might, but that’s also a unique product that we offer to customers where we say, ‘if you want, you can turn on model improvement.’ And then over time, the model is going to identify areas where it’s doing poorly, and then that will go through a whole data training process, and then it’ll get better at those type of things. But I’ll also be honest about the state of that right now. That happens on the week scale, or months scale, or something like that, it’s not instantly, but a human would be able to do it instantly, right? You could say, ‘ whoa. Hey, actually it’s pronounced this way, or it’s this other word, or let me type it out to you.’ And then you would say, ‘oh, okay.’ And then it would instantly get it. The systems aren’t in that state yet.

Ryan Donovan: You say it’s largely a data problem. Is this something you can use synthetic data to help with simulations?

Scott Stephenson: Absolutely. And it's– man, we subscribe to the same way of thinking that Ilya does, where the world is just compression, basically, and compression leads the way. I'm going somewhere with this, which is that I think the intuitive idea that synthetic data generation is a path to making these models much better is absolutely true. I think the way you generate the synthetic data matters so much, though. So, we could generate 10 million hours of people saying hello, greeting each other, whatever it is, but you have to think about the practicalities of how you're actually doing it. I don't even mean the cost. I just mean, what is the prompt that you're sending to an LLM? Which LLM? Why an LLM? Is it that you're trying to generate the text? Okay, so you generate the text of it, and now you send that to a text-to-speech model. What kind of context are you giving that text-to-speech model? Does that text-to-speech model even have an ability to absorb the context? Because when you're trying to simulate these environments, you need to simulate a noisy room, you need to simulate being in the car, you need to simulate slurring the words. You need to simulate all sorts of things. The current state of that is way too clean. If you just take the standard models that are there now, say to an LLM, 'generate something that people would say,' and then take those outputs and feed them to a TTS, it's probably actually not gonna make the model much better right now. It might for certain terminology; if it was a clean situation with certain terminology, it'll make it better. I've already confirmed that. That does work. But when you're talking about the longer-tail, edge-case-type scenarios, then the synthetic data generation systems have to get so much better.
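
A minimal sketch of the kind of pipeline being described: draft utterance text with an LLM, synthesize it with a TTS model, then layer in the messiness (background noise, low SNR, and so on) that real call audio has. The `generate_text` and `synthesize` functions and the noise bank are hypothetical placeholders, not a real API.

```python
import random
import numpy as np

SAMPLE_RATE = 16_000

def generate_text(prompt: str) -> str:
    """Placeholder for an LLM call that drafts an utterance for a domain."""
    return "Hi, yeah, can I get a number four with no onions?"

def synthesize(text: str) -> np.ndarray:
    """Placeholder for a TTS call; returns mono float audio."""
    return np.zeros(SAMPLE_RATE * 3, dtype=np.float32)

def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix background noise into clean speech at a target signal-to-noise ratio."""
    noise = np.resize(noise, clean.shape)
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def make_sample(domain_prompt: str, noise_bank: list) -> tuple:
    """One synthetic training pair: transcript plus degraded audio."""
    text = generate_text(domain_prompt)
    clean = synthesize(text)
    # The augmentation step is where most of the realism has to come from.
    noisy = add_noise(clean, random.choice(noise_bank), snr_db=random.uniform(0, 15))
    return text, noisy

noise_bank = [np.random.randn(SAMPLE_RATE * 3).astype(np.float32)]
transcript, audio = make_sample("drive-through order, casual speech", noise_bank)
print(transcript, audio.shape)
```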

Ryan Donovan: Yeah.

Scott Stephenson: And it won’t be just a standard TTS model that’s doing it, it’ll be models that are built just for synthetic data, and the way to make them work extremely well is gonna make them extremely good compressors, and what that means is make them extremely good world models. So, then you could describe what you’re looking for to that model, yes, but you could also just show it. You could go record audio in a drive-through or something, and record 10 different people using it. And then now you say, ‘okay, in order to make the model better, I need 10,000 of those interactions.’ You’re like, ‘oh, great. I don’t wanna leave all of those,’ et cetera. But if you show those 10 to a world model, that doesn’t exist yet today announced—but if you show that to a world model and then you say, ‘think what this, get into that mode of thinking and that embedding space, and now generate audio.’ That’s gonna work extremely well.

Ryan Donovan: So, with any audio, a lot of it has been a sort of throughput capacity problem. I know, as announced at the keynote, you are being integrated into the Bedrock AgentCore system. Can you talk about the sort of capacity assist that's happened and how that came about?

Scott Stephenson: Yeah, so there's a grandiose-vision thing, and it's definitely there, where partners like AWS say they want to offer customer choice, and they want to have coverage, and compete extremely well in the cloud market. And so, they're thinking that way. But then there's also a very practical thought process here, which is, okay, how do we do it? And the general idea there is you find a common customer that has this need, you look at the offerings that are there, and then you say, 'why is it not meeting them?' And then you say, 'okay, what needs to be built in order to meet it?' And then, will this also help at least 10 other customers? And then, can it be built in a way that it helps thousands, right? So, this is the process that we go through working with the AWS partnership team: identify different customers like Salesforce, or Cigna, or others that are out there that are big AWS shops, identify this gap or this need that they have, and then we say, okay, how can we help there? And so, yeah, that process is actually really fun. It's also– you have to know how to navigate it, and we've now had two years of partnership with AWS, so we have figured out how to do that. And so, when the time was right, when voice AI was going mainstream, et cetera, then we were able to just jump in and do it. But we had this joint need for Deepgram to be available on SageMaker, in Connect, in their different agent capacities. We were working on several of them with the AWS team. And so, it just all came together because, basically, voice AI went mainstream in the last year, and now all this demand is happening. And then the idea there is, okay, who has actually built scalable systems that have low latency, that are reliable, that are the right kind of topology for how this is gonna work? Deepgram's a good partner for that, for AWS. And so, then we help 'em do it, and, yeah, we just did the release and announcement of those in the last couple weeks, and we already have some of the first users extremely happy and successful. And I think that, in general, from a developer perspective, there was a missing primitive in SageMaker, or just in the way of thinking in the ecosystem, in how AWS was thinking about things in the AI sense, which is: there was no streaming in, but there was streaming out. And this is because everything was built for LLMs. LLMs don't stream in. They take a huge chunk of context, and then they operate from that, and then they generate token by token, so it just streams out. And so, it's built that way. But if you're trying to build any type of real-time AI, voice included (and voice is the first real-time AI use case, but there are gonna be many more), you need bidirectional streaming. You need streaming in, you need streaming out. You need it to have high throughput, you need it to have low jitter, et cetera. And so, we brought all those requirements to the team, and then worked with an amazing AWS team, who really cooked it up in a short period of time. When they identify something that is, 'this is real, this is serious, this is a good need, or this is a good product to address that need,' then they can act really quickly and get it released. And so, yeah, that's how all that materialized.
I don’t know if that helps on the sausage making, or helps understand how the sausage is made, but it’s just developers, and teams, and product people, and different companies just saying, ‘wait a minute, this doesn’t exist. It should exist. Why don’t we make it exist?’

Ryan Donovan: Yeah. I expect a lot of the folks listening are sausage makers themselves.

Scott Stephenson: Yeah.

Ryan Donovan: You mentioned AI voice has gone mainstream. With that, we've seen some news bumps around fraud: people cloning a voice and then making fake phone calls.

Scott Stephenson: Yeah.

Ryan Donovan: Are there ways to catch that, or prevent that, at the model level?

Scott Stephenson: Yes, and one thing that we do– so, we offer speech-to-text and text-to-speech, and a full voice agent, but with our text-to-speech, we don't allow voice cloning, and this is one of the reasons why. I don't want my grandma scammed by my voice being cloned to call her and say I'm in desperate need of something, or whatever. So, I think just unfettered access to voice cloning, as it currently sits, is not a net productivity gain to the world. So, this is why Deepgram doesn't release it. There are other companies that do, and it does increase the imagination of what can happen. But also, every tool, if held appropriately, is a weapon, basically, and I think next year we can release voice cloning in a way that is responsible enough, where it's watermarked, and where we also sell a companion product that tells you if somebody is using the system to generate this. But then you need that to be more widespread, and you're like an arms dealer selling to both sides. But it's actually the thing that needs to happen. I feel bad doing it, but it's also, yeah– if you wanna get the productivity gain of this type of technology, then you need to do both, essentially. Right? So this is the type of product I think needs to be released. And we think about things from this B2B perspective, this massive scale. In a few years we're gonna have a billion simultaneous connections of us talking to machines. There are 8 billion people in the world. You're gonna be peaking at a billion simultaneous voice agent conversations, or ambient listening, or whatever it is, right? It's gonna be an extreme level of that. And then, okay, if that's the type of technology that is released into the world, is it actually the type of technology that we want released into the world? And we're the trusted, big-scale partner for many of these companies out there, so we wanna release the tech that works in that way, and then set the standard for how that's gonna go, and say, 'hey, here's how to do it responsibly.'

Ryan Donovan: Yeah. The billion simultaneous connections with passive listening sounds like surveillance forever.

Scott Stephenson: Yeah.

Ryan Donovan: What is the unsolved problem that you’re looking at now with voice AI?

Scott Stephenson: So, just a little bit of history of the problems. First: do we have good enough perception at all? This is the 2015-2018 era, when that was roughly solved to the point of, 'okay, we have good enough perception to provide value in many circumstances, not all, but we have good enough perception to get going.' And so, that was essentially 2018 to 2020. Then, okay, fine, when you have good perception, you can do things like QA, you can do things like– I don't even wanna say note-taking, but transcribing an entire conversation. But it doesn't do the summarization and that type of thing; you need to have more things developed in order to do that. But the perception was getting good enough to do that. But then, you needed other pieces to the system, what we would now call LLMs and humanlike text-to-speech, right?

Ryan Donovan: Text understanding.

Scott Stephenson: Yeah. You needed text understanding. You needed all these other pieces to come together. And so, that's where, at that point in time, we as a research team started thinking, 'okay, great.' We knew that OpenAI and others were working on the LLMs. Those were gonna be good enough. We could already see it. GPT-1, 2, 3, et cetera. These are gonna be good enough. It's coming. I still can't tell you exactly which day, but in the next few years, this is absolutely gonna happen. Therefore, we should start our low-latency, high-reliability text-to-speech research at that time, et cetera, and get it lined up for the launch of LLMs going massive scale. And eventually, this is something that we're working on in our research team now: we have a white paper out there about an architecture we call Neuroplex, which is a way to think about combining all these systems. So, we start out talking about, man, these systems are so stupid. Look at all these modular systems, and anytime you wanna change anything, you have to do this, or whatever. It sounds like, 'you're describing that now with voice agents, right?' And, yes, in some ways I am.

Ryan Donovan: Yeah.

Scott Stephenson: But also, I’ll just be honest about the state of the world right now, and I like to use this example: in the phone world, people still have PBX systems installed in their basements.

Ryan Donovan: Sure.

Scott Stephenson: These exchange systems; they could be using VoIP, which came along 25 years ago, but they still don't have that.

Ryan Donovan: People still have mainframes, right?

Scott Stephenson: Yeah, exactly. Mainframes, et cetera. And why do they have that? Because they had a need at the time, and they figured out a way to make that system work for it. Right now, there are many use cases out there that have a need right now, and they're gonna figure out a way to make the current system work for it. So, speech-to-text, plus an LLM, plus maybe RAG, tool use, et cetera, and TTS. And then, to reschedule your dentist appointment or something, that's gonna be totally fine. There are many other use cases, though, where it's not gonna satisfy all the different pieces for that use case. This is where I see the next generation going, where you're passing full context through the system. I believe we can still do this in a modular way, because one thing that you lose with a speech-to-speech system is the ability to inspect it, the ability to put guardrails on it. Again, this is extremely important for B2B, and you might ask the question, in this speech-to-speech system, what did it think I said? So, okay, that's basically saying you need a transcription. And I think of this a little bit like a circuit, a PCB that you've designed that has test points on it. You're like, oh, I can see the logic that's happening here, I can see it there, et cetera, too. With a full speech-to-speech system, those test points aren't there. But in this modular way of thinking about it, which is modeled after the human brain, we have regions in our brain that understand what's going on, or that do the transduction from the raw signals that are coming in into some other ontology that's in there. Then, that gets reworked and transformed in an understanding piece, like an LLM, basically. Our motor cortex changes our body, makes us do stuff or whatever, to go deliver that. And so, that generation system is also like a separate system, and then they're all connected with white matter in the brain. Gray matter is what does the compute, and white matter is what does the connection. And the Neuroplex system is modeled after that, which is why it has its name: you have this modularity, but you have full connection between it. And then, you have context being passed through the entire system, and you can use it as a full end-to-end system, or you can use the different modules on their own and output context; or you can use the full system in a way where you have the test points enabled, and you can inject guardrails into it, et cetera. And I think this is a framework for thinking about these problems that will last a decade, or five, like mainframes, et cetera, but I think it will be the next evolution. We're just at the very beginning of the voice revolution, but really it's the intelligence revolution, is how I think about it. We had an agricultural revolution for 1,500 years. We had an industrial revolution. And the agricultural revolution is more about getting calories into humans, and then that increases productivity.
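
As a loose illustration of the 'test points' idea (and not the Neuroplex architecture itself), here is a sketch of a modular voice-agent pipeline where every stage writes its intermediate output into a shared context so it can be inspected, and a guardrail check runs between stages. All the stage functions are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class TurnContext:
    """Context passed through the whole pipeline, with inspectable test points."""
    audio: bytes
    test_points: Dict[str, Any] = field(default_factory=dict)

Stage = Callable[[TurnContext], TurnContext]
Guardrail = Callable[[str, TurnContext], None]

def perceive(ctx: TurnContext) -> TurnContext:
    ctx.test_points["transcript"] = "cancel my appointment"    # stand-in STT output
    return ctx

def understand(ctx: TurnContext) -> TurnContext:
    ctx.test_points["intent"] = "cancel_appointment"           # stand-in LLM step
    return ctx

def generate(ctx: TurnContext) -> TurnContext:
    ctx.test_points["reply_text"] = "Sure, which appointment?" # stand-in response
    return ctx

def run_pipeline(ctx: TurnContext, stages: List[Stage],
                 guardrail: Guardrail) -> TurnContext:
    for stage in stages:
        ctx = stage(ctx)
        guardrail(stage.__name__, ctx)   # inspect every "test point" as we go
    return ctx

def no_card_numbers_guardrail(stage_name: str, ctx: TurnContext) -> None:
    text = str(ctx.test_points.get("reply_text", ""))
    if "credit card" in text.lower():
        raise ValueError(f"guardrail tripped after {stage_name}")

result = run_pipeline(TurnContext(audio=b""),
                      [perceive, understand, generate],
                      no_card_numbers_guardrail)
print(result.test_points)
```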

Ryan Donovan: Right?

Scott Stephenson: And then the industrial revolution was removing these big, menial, large workloads: 'dig this hole with a shovel' versus 'use machinery to do it.' So, that increases the productivity of humanity. And then we had an information revolution, where it was storage of data, communicating that data at light speed, filtering that data, et cetera. But now we have a new revolution. Let's just call it what it is. It's topologically different than the ones before. The new piece used to be storage of information and transmitting it, that type of thing. Before that, it was just raw work, or calories, et cetera. The new thing that we're automating here is intelligence, and that intelligence is–

Ryan Donovan: Creating information, right?

Scott Stephenson: Yeah, exactly. And so, we're unlocking something that previously was not able to be done in any kind of concerted way. You had to do it in this bespoke way through humans and whatnot. And so, it's gonna be a revolution just like anything else. I think if you look at the timescales there, 1,500 years, to 250 years, to 75 years, to– I would bet this next revolution, the intelligence revolution, is probably gonna be like 25 years. And we would probably look at it like we're three or five years or something into it right now. And so, anybody who's thinking, 'man, tech companies move fast,' intelligence companies have to move three times faster. And so, I think my message to the world there is: get used to it. I don't know what else to say. This is a new mode of operation. Every company needs to be an intelligence company, or be outcompeted, basically. And so, how can you adapt and do that? And there's a reason for it. It's a new revolution. That's what's happening. And then, a lot of people ask me, 'okay, what comes after that?' It's probably a biological revolution. We have another 10 or 15 years to worry about the intelligence revolution right now until we get to that.

Ryan Donovan: It is that time of the show again. We're not gonna shout out a badge. We are recording from re:Invent today, so we're just shouting out re:Invent in general. I have been Ryan Donovan. I edit the blog and host the podcast here at Stack Overflow. If you have comments, questions, concerns, or wanna suggest a topic to us, email me at podcast@stackoverflow.com; and if you wanna reach out to me directly, you can find me on LinkedIn.

Scott Stephenson: And I’m Scott Stephenson. I’m CEO and Co-founder of Deepgram. And you can just find me on LinkedIn. I’ll add anybody. Or send me a message on Twitter. Or just find the Deepgram, @DeepgramAI on X. Shout out to my great EA. She manages my email inboxes. Hit me at Scott@Deepgram.com. It’s a pretty easy email. I would love to talk to anybody.

Ryan Donovan: Another live email address in the wild. Thanks for listening, everybody, and we’ll talk to you next time.
