Prompt Fidelity: Measuring How Much of Your Intent an AI Agent Actually Executes

Spotify just shipped "Prompted Playlists" in beta. I built a few playlists and discovered that the LLM behind the agent tries to fulfill your request, but fails because it doesn't know enough and won't admit it. Here's what I mean: one of my first playlist prompts was "songs in a minor key within rock". The playlist was swiftly created. I then added the caveat "and no song should have more than 10 million plays". The AI agent bubbled up an error explaining that it didn't have access to total play counts. Surprisingly, it also explained that it didn't have access to a few other things, like musical keys, even though it had claimed to use that information in the playlist's construction. The agent had been relying on its LLM's recollection of what key a given song was in and adding songs accordingly. A close inspection of the playlist showed a few songs that were not in a minor key at all. The LLM had, of course, hallucinated this information and proudly presented it as a valid match to the playlist's prompt.

All images, unless otherwise noted, are by the author.

Obviously, a playlist creator is a fairly low-stakes AI agent capability. The playlist it made was great! The trouble is it only really used about 25% of my constraints as validated input. The remaining 75% of my constraints were just guessed by the LLM and the system never told me until I dug in deeper. This is not a Spotify problem; it’s an every-agent problem. 

Three Propositions

To demonstrate this concept of prompt fidelity more broadly, I'll start from three propositions:

  1. Any AI agent's verified data layer has finite capacity. An agent can only query the tools it's been given, and those tools expose a fixed set of fields with finite resolution. You can enumerate every field in the schema and measure how much each one narrows the search. A popularity score eliminates some fraction of candidates. A release date eliminates another. A genre tag eliminates more. Add up how much narrowing all the fields can do together and you get a hard number: the maximum amount of filtering the agent can prove it did. I'll call that number I_max.
  2. User intent expressed in natural language is effectively unbounded. A person can write a prompt of arbitrary specificity. “Create a playlist with songs that are bass-led in minor key, post-punk from Manchester, recorded in studios with analog equipment between 1979 and 1983 that influenced the gothic rock movement but never charted.” Every clause narrows the search. Every adjective adds precision. There is no ceiling on how specific a user’s request can be, because natural language wasn’t designed around database schemas.
  3. Following directly from the first two: for any AI agent, there exists a point where the user’s prompt asks for more than the data layer can verify. Once a prompt demands more narrowing than the verified fields can provide, the remaining work has to come from somewhere. That somewhere is the LLM’s general knowledge, pattern matching, and inference. The agent will still deliver a confident result. It just can’t prove all of it. Not because the model is poorly built, but because the math doesn’t allow anything else.

This isn't a quality problem; it's a structural one. A better model doesn't raise the ceiling; it only gets better at inferring and filling in the rest of the user's needs. Only adding more verified data fields raises the ceiling, and even then each new field offers diminishing returns because fields are correlated (genre and energy aren't independent; release date and tempo trends aren't independent). The gap between what language can express and what data can verify is permanent.
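To make the first proposition concrete, here is a minimal sketch of how you might estimate I_max from a tool schema. The field names and survival fractions below are illustrative assumptions, not measured Spotify values, and because fields are correlated the simple sum is only an upper bound.

```python
import math

# Illustrative (assumed) survival fractions: the share of the catalog a typical
# filter on each verified field leaves behind. These are guesses for the sake
# of demonstration, not measured Spotify values.
FIELD_SELECTIVITY = {
    "duration_ms": 0.80,              # e.g. "under 4 minutes"
    "release_date": 0.30,             # e.g. "before 2005"
    "popularity": 0.40,               # e.g. "popularity below 40"
    "genre": 0.10,                    # e.g. "rock"
    "tempo": 0.50,
    "energy": 0.50,
    "explicit": 0.90,
    "language_of_performance": 0.70,
}

def bits(p: float) -> float:
    """Information contributed by a filter that keeps fraction p of candidates."""
    return -math.log2(p)

# Summing treats the fields as independent, so this is an upper bound;
# correlated fields (genre vs. energy, release date vs. tempo) push the real
# capacity lower.
i_max = sum(bits(p) for p in FIELD_SELECTIVITY.values())
print(f"I_max (upper bound): {i_max:.1f} bits")
```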

The Problem: Agents Don’t Report Their Compression Ratio

Every AI agent with access to tools and skills does the same thing: it takes your request, decomposes it into a set of actions, executes those actions, infers from the output of those actions, and then presents a unified response.

The Minor Bass Melodies Prompted Playlist

This decomposition from request to action erodes the meaning between what you asked for and what the AI agent responds with. The agent's narration layer flattens what you requested and what was inferred into a single response.

The problem is that, as a user of an AI agent, you have no way to know what fraction of your input was used to trigger an action, what fraction of the response was grounded in real data, and what fraction was inferred from the actions the agent took. This matters for playlists because mine contained songs in a major key when I had explicitly asked for only minor-key songs. It matters even more when your AI agent is classifying financial receipts and transactions.

We need a metric for measuring this. I’m calling it Prompt Fidelity. 

The Metric: Prompt Fidelity

Prompt Fidelity for AI agents is defined by the constraints you give the agent when asking it to perform some action. Each constraint within a prompt narrows the possible paths the agent can take by some measurable amount. A naïve approach to calculating fidelity would be to count the constraints and tally the verifiable ones against the inferred ones. The problem with that approach is that every constraint gets the same weight, while real-life data is often heavily skewed. A constraint that eliminates 95% of the catalog is doing vastly more work than one that eliminates 20%. Counting each constraint the same is wrong.

Therefore, we need to weight each constraint according to the work it does filtering the dataset. Logarithms achieve that weighting. A constraint contributes −log₂(p) bits of information, where p is the fraction of candidates that survives the filter being applied.

In each agent action, a constraint can be either a) verified by tool calls or b) inferred by the LLM. Prompt fidelity is the fraction of the total bits that fall into the verified bucket.

Prompt Fidelity has a range of 0 to 1. A perfect 1.0 means that every part of your request was backed by real data. A fidelity of 0.0 means that the entire output of the AI agent was driven by its internal reasoning or vibes. 
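As a minimal sketch of that definition, assume each constraint carries an estimated surviving fraction p and a flag for whether a tool call backed it; fidelity is then the verified share of the total bits. The Constraint structure here is my own illustration, not any agent's internal representation.

```python
import math
from dataclasses import dataclass

@dataclass
class Constraint:
    name: str
    p: float          # estimated fraction of candidates that survive this filter
    verified: bool    # True if a tool call backed it, False if the LLM inferred it

def prompt_fidelity(constraints: list[Constraint]) -> float:
    """Verified bits divided by total bits: 1.0 = fully grounded, 0.0 = pure inference."""
    bits = lambda c: -math.log2(c.p)
    total = sum(bits(c) for c in constraints)
    if total == 0:
        return 1.0  # nothing narrowed the search, so there is nothing to verify
    verified = sum(bits(c) for c in constraints if c.verified)
    return verified / total
```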

While updating a Prompted Playlist, the agent shows its thoughts. Here it's "Defining mood and key"

Spotify's system above implicitly reports a perfect 1.0: every constraint is narrated as satisfied. In reality, the prompt fidelity of the playlist creation was around 25%: two constraints (under 4 minutes and recorded before 2005) were verified by the agent's tools, while the rest were inferred from the LLM's existing (and potentially faulty) knowledge and recall. At scale, and applied to more impactful problems, falsely reporting a high prompt fidelity becomes a big problem.

What Fidelity Actually Means (and Doesn’t Mean)

In audio systems, “fidelity” is a measure of how faithfully the system reproduces the original signal. High fidelity does not guarantee that the music itself is good. High fidelity only guarantees that the music sounds how it did when it was recorded. Prompt fidelity is the same idea: how much of your original intent (signal) was faithfully fulfilled by the agentic system.

High prompt fidelity means that the system did what you asked and you can PROVE it. A low prompt fidelity means the system probably did something close to what you wanted, but you’ll have to review it (listening to the whole playlist) to ensure that it’s true. 

Prompt Fidelity is NOT an accuracy score. It cannot tell you that "75% of the songs in a playlist match your prompt". A playlist with a 0.25 fidelity could be 100% perfect; the LLM might have nailed every single inference about each song it added. Or half the songs could be wrong. You don't know, and you can't know until you listen to all the songs. That uncertainty is exactly what a measurable prompt fidelity makes explicit.

Instead, prompt fidelity measures how much of the result you can TRUST WITHOUT CHECKING. In a financial audit, if 25% of the line items have receipts and 75% of the line items are estimates, the total bill might still be 100% accurate, but your CONFIDENCE in that total is fundamentally different from an audit with every single line item supported by a receipt. The distinction matters because there are domains where "just trust the vibes" is fine (music) and domains where it isn't (medical advice, financial guidance, legal compliance).

Prompt fidelity is closer to a documentation rate for your constraints than an error rate for the response itself.

Practically, in our Spotify example: as you add more constraints to your playlist prompt, the prompt fidelity drops and the playlist becomes less of a precise report and more of a recommendation. That's totally fine, but the user should be told which one they're getting. Is this playlist exactly what I asked for, or did you make something close enough to fulfill the goal I gave you? Surfacing that metric to the user is essential for building trust in these agentic systems.

The Case Study: Reverse-Engineering Spotify’s AI Playlist Agent

Spotify’s Prompted Playlists feature is what started this exploration into prompt fidelity. Let’s dive deeper into how these work and what I did to explore this capability just from the standard prompt input field.

Prompted Playlists let you describe what you want in natural language. For example, in this playlist, the prompt is simply “rock songs in minor keys, under 4 minutes, recorded before 2005, featuring bass lines as a lead melodic element”. 

Normally, to make a playlist, you’d need to comb through hours of music to land on exactly what you wanted to make. This playlist is 52 minutes long and took only a minute to generate. The appeal here is obvious and I really enjoy this feature. Without having to know all the key rock artists, I can be introduced to the music and explore it more quickly and more easily. 

Unfortunately, the official documentation from Spotify is very light. There are almost no details about what the system can or can’t do, what metadata it keys off of, nor is there any data mapping available. 

Using a simple technique, however, I was able to map what I believe is the full data contract available to the agent over the course of one evening (all from my couch watching the Sopranos, naturally).

The Technique: Impossible Constraints as a Forcing Function

As a result of how Spotify architected this playlist-building agent, when it cannot satisfy a request, its error messages can be coaxed into revealing architectural details that aren't otherwise available. When you find a constraint the agent can't build on, it errors, and you can leverage that error to understand what it CAN do. I'll use this as the lever to probe the system.

In our example playlist above, Minor Keys & Bass Lines, adding the unlock phrase "with less than 10 million streams" acts as a circuit breaker: the agent signals that it cannot fulfill the user's request. With that phrase in place, you can vary other aspects of the prompt over and over and observe what the agent says it has access to. Collecting the responses, asking overlapping questions, and comparing the answers lets you build a foundational understanding of what is available to the agent.

A prompt with 10 million Spotify streams triggers an error from the agent

What I Found: The Three-Tier Architecture

Spotify's Prompted Playlist agent has a wealth of data available to it. I've separated it into three tiers: musical metadata, user-based data, and LLM inference. Beyond that, it appears Spotify has excluded various data sources from its agent, either as a product choice or as a "get this out the door" choice.

  • Tier 1: verified track metadata, including duration, release date, popularity, tempo, energy, explicit flag, genre, and language
  • Tier 2: verified user behavioral data, including play counts, skip counts, timestamps, recency flags, ms played, source, and period analytics (40+ fields total)
  • Tier 3: LLM inference for key/mode, danceability, valence, acousticness, mood, and instrumentation, all inferred from general knowledge but narrated as if verified
  • Deliberate exclusion: Spotify's public API exposes audio features (danceability, valence, etc.), but the agent doesn't have access to them. Perhaps a product choice rather than a technical limitation.

A full list of available fields is included at the bottom of this post. 

Another error, this time with more details about what is available to use

The Behavioral Findings

The agent demonstrated surprisingly resilient behavior in the face of ambiguous requests and conflicting instructions. It commonly reported that it was double-checking various constraints and fulfilling the user's request. However, whether those constraints were actually checked against a validated dataset was never exposed.

Making interesting playlists that would otherwise be difficult to make

When the playlist agent can get a close, but not exact, match to the constraints listed in the prompt, it runs a “related” query and silently substitutes the results from that query as valid results for the original request. This dilutes the trust in the system since a prompt requesting ONLY bass-driven rock music in a playlist might gather non-bass-driven rock music in a playlist, likely dissatisfying the user.

There does appear to be a "certainty threshold" the agent is not comfortable crossing. This entire exploration was built on the "less than 10 million plays" unlock phrase, and each time it triggered, the agent would divulge a handful of fields it had access to. That list of fields changed from run to run, even for identical prompts. This is classic LLM non-determinism. To boost trust in the system, exposing what the agent DOES have access to in a straightforward way would tell the human exactly what they can and cannot ask about.

Finally, when these two types of data are mixed, the agent is not clear about which songs were chosen using verified data and which were chosen using inferred data. Both kinds of decisions are presented with identical authority in the playlist notes. For example, if you craft a prompted playlist about your own listening data ("songs I've skipped more than 30 times with a punchy bass-driven melody"), the agent will place real data ("you skipped this song 83 times last year!") right next to inferred knowledge ("John Deacon's bass line commands attention throughout this song"). To be clear, I've not skipped any Queen songs 83 times, to my knowledge. And the agent doesn't have a "bass_player" field anywhere in its available data to query against. The LLM knows that Queen's songs commonly feature a strong bass line, and its knowledge of John Deacon as Queen's bass guitarist lets it infer that his bass line is why the song was added to the playlist.

Applying the Math: Two Playlists, Two Fidelity Scores

Let's apply this prompt fidelity concept to two example playlists. I don't have full access to the Spotify music catalog, so I'll use estimated survival fractions for each criterion in the fidelity bit computations. The formula is the same at every step: bits = −log₂(p), where p is the estimated fraction of the catalog that survives the filter being applied.

“Minor Bass Melodies” — The Confident Illusion

This playlist is the one with Queen. “A playlist of rock music, all in minor key, under 4 minutes of playtime, released pre-2005, and bass-led”. I’ll apply our formula and use the bits of information I have from each step to help compute the prompt fidelity.

Duration < 4 minutes

  • Estimate: ~80% of tracks are under 4 minutes → p = 0.80
  • This barely narrows anything, which is why it contributes so little

Release date before 2005

  • Estimate: ~30% of Spotify’s catalog is pre-2005 (the catalog skews heavily toward recent releases) → p = 0.30
  • More selective — eliminates 70% of the catalog

Minor key

  • Estimate: ~40% of popular music is in a minor key → p = 0.40
  • Moderate selectivity, but this is entirely inferred — the agent confirmed key/mode is not a verified field

Bass-led melodic element

  • Estimate: ~5% of tracks feature bass as the lead melodic element → p = 0.05
  • By far the most selective constraint. This single filter does more work than the other three combined. And it’s 100% inferred.

Totals:

  • Verified bits: 0.32 (duration) + 1.74 (release date) ≈ 2.06
  • Inferred bits: 1.32 (minor key) + 4.32 (bass-led) ≈ 5.64
  • Total: ≈ 7.70 bits
  • Prompt fidelity ≈ 2.06 / 7.70 ≈ 0.27

These survival fractions are estimates. However, the structural point holds regardless of exact numbers: the most selective constraint is the least verifiable, and that’s not a coincidence. The things that make a prompt interesting are almost always the things an agent has to guess at.
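Plugging the four estimates above into the formula (restated here so the snippet runs on its own) reproduces the roughly 25% figure quoted earlier:

```python
import math

# (constraint, estimated surviving fraction p, verified by a tool call?)
constraints = [
    ("duration < 4 min",     0.80, True),
    ("released before 2005", 0.30, True),
    ("minor key",            0.40, False),  # inferred; the agent admits key/mode isn't a field
    ("bass-led melody",      0.05, False),  # inferred
]

bits = {name: -math.log2(p) for name, p, _ in constraints}
verified = sum(bits[name] for name, _, v in constraints if v)
total = sum(bits.values())
print(f"verified ≈ {verified:.2f} bits, total ≈ {total:.2f} bits, fidelity ≈ {verified / total:.2f}")
# verified ≈ 2.06 bits, total ≈ 7.70 bits, fidelity ≈ 0.27
```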

The agent thinks it has access to song download status, but only some songs are downloaded (the green arrow icon pointing down indicates offline availability)

“Skipped Songs” — The Honest Playlist

This prompt is very straightforward: "A playlist of songs I've skipped more than 5 times". This is very easy to verify, and the agent will lean into the data it has access to.

Skip count > 5

  • Estimate: ~10% of tracks in your library have been skipped more than 5 times → p = 0.10
  • This is the only constraint, and it’s a verified field (user_skip_count)

Totals:

  • Verified bits: 3.32 (skip count)
  • Inferred bits: 0
  • Prompt fidelity = 3.32 / 3.32 = 1.0

The Structural Insight

The interesting part about prompt fidelity is apparent in each playlist: the "most interesting" prompt is the least verifiable. A playlist of all my skipped songs is trivially easy to implement, but it's not one Spotify is eager to surface; after all, these are songs I generally don't prefer to listen to, hence the skips. Similarly, a release date before 2005 is very easy to verify, but the resulting playlist is unlikely to be interesting to the average user.

The bass-line constraint though is very interesting for a user. Constraints like these are where the Prompted Playlist concept will shine. Already today I’ve created and listened to two such playlists generated from just a concept of a song that I wanted to hear more of. 

However, the concept of a “bass-driven” song is hard to quantify, especially at Spotify’s scale. Even if they did quantify it, I’d ask for “clarinet jazz” the next day and they’d all have to get back to work finding and labeling those songs. And this is of course the magic of the Prompted Playlist feature.

Validation: A Controlled Agent

The Spotify examples are compelling, but I don’t have direct access to the schema, the tools, and the agentic harness itself. So I built a movie recommendation agent in order to test this theory within a more controlled environment.

https://github.com/Barneyjm/prompt-fidelity 

The movie recommendation agent is built on the TMDB API, which provides the verified layer. The fields in the schema are genre, year, rating, runtime, language, cast, and director. All the other constraints, like mood, tone, and pacing, are not verified data and are instead sourced from the LLM's own knowledge of movies. As the agent fulfills a user's request, it records each constraint's data source as either verified or inferred and scores its own response.

The author used the TMDB API in this example but this example is not endorsed or certified by TMDB.
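The repository contains the full implementation; the sketch below is my own simplification of the scoring idea rather than the exact code in the repo, and the parsed-constraint format is assumed for illustration.

```python
import math

# Fields the TMDB-backed tool layer can actually filter on (the verified layer).
VERIFIED_FIELDS = {"genre", "year", "rating", "runtime", "language", "cast", "director"}

def score_response(parsed_constraints: list[dict]) -> float:
    """
    parsed_constraints is the (assumed) output of the LLM's prompt decomposition,
    e.g. {"field": "genre", "value": "action", "p": 0.15}
         {"field": "mood",  "value": "rainy Sunday afternoon", "p": 0.05}
    Constraints whose field exists in VERIFIED_FIELDS count as grounded; the rest
    come from the LLM's general knowledge of movies.
    """
    total = verified = 0.0
    for c in parsed_constraints:
        b = -math.log2(c["p"])            # selectivity-weighted contribution
        total += b
        if c["field"] in VERIFIED_FIELDS:
            verified += b
    return verified / total if total else 1.0

# "Action movies from the 1980s rated above 7.0"    -> every constraint verified -> F = 1.0
# "Movies that feel like a rainy Sunday afternoon"  -> nothing verified          -> F = 0.0
```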

The Boring Prompt (F = 1.0)

We’ll start with a “boring” prompt: “Action movies from the 1980s rated above 7.0”. This offers the agent three constraints to work with: genre, date range, and rating. All these constraints correspond to verified data values within the database. 

If I run this through the test agent, the high fidelity falls out naturally because each constraint is tied to verified data.

Prompting the movie agent with a high fidelity prompt

Every result here is verifiably correct. The LLM made zero judgement calls because it had data it could base its response on for each constraint.

The Vibes Prompt (F = 0.0)

In this case, I’ll look for “movies that feel like a rainy Sunday afternoon”. No constraints in this prompt align to any verified data in our dataset. The work required of the agent falls entirely on its LLM reasoning off its existing knowledge of movies.

Prompting the agent with a low fidelity prompt

The recommendations are defensible and are certainly good movies but they are not verifiable according to the data we have access to. With no verified constraints to anchor the search, the candidate pool was the entire TMDb catalog, and the LLM had to do all the work. Some picks are great; others are the model reaching for obscure films it isn’t confident about.

The Takeaway

This test movie recommendation agent validates the prompt fidelity framework as a powerful way to expose how an agent's interpretation of a user's intent pushes its response toward being either a precision tool or a recommendation engine. Where a response lands between those two options is critical for informing users and building trust in agentic systems.

The Fidelity Frontier

To make this concrete: Spotify's catalog contains roughly 100 million tracks. I'll call the total amount of information your prompt needs to carry to narrow that catalog down to your playlist I_required.

To select a 20-song playlist from that catalog, you need approximately 22 bits of selectivity (log₂ of 100 million divided by 20).

The verified fields (duration, release date, popularity, tempo, energy, genre, explicit flag, language, and the full suite of user behavioral data) have a combined capacity that tops out at roughly 10 to 12 bits, depending on how you estimate the selectivity of each field. After that, the verified layer is exhausted. Every additional bit of specificity your prompt demands has to come from LLM inference. I'll call this maximum I_max.

That gives you a fidelity ceiling for any prompt:

F_max = min(1, I_max / I_required)

And the fidelity ceiling for a fully specified playlist from that catalog:

F_max ≈ I_max / 22 bits ≈ 12 / 22 ≈ 0.55

For the Spotify agent, a maximally specific prompt that fully defines a playlist cannot exceed roughly 55% fidelity. The other 45% is structurally guaranteed to be inference. For simpler prompts that don’t push past the verified layer’s capacity, fidelity can reach 1.0. But as prompts get more specific, fidelity drops, not gradually but by necessity.

A screenshot of an interactive chart for exploring the fidelity frontier

This defines what I’m calling the fidelity frontier: the curve of maximum achievable fidelity as a function of prompt specificity. Every agent has one. It’s computable in advance from the tool schema. Simple prompts sit on the left of the curve where fidelity is high. Creative, specific, interesting prompts sit on the right where fidelity is structurally bounded below 1.0.
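As a minimal sketch, here is how you could trace that frontier for the Spotify numbers above; the I_max value is the rough 10-to-12-bit estimate, so treat the curve as illustrative rather than exact.

```python
import math

CATALOG_SIZE = 100_000_000
PLAYLIST_SIZE = 20
I_MAX = 12.0  # estimated verified-layer capacity in bits (roughly 10-12)

# Bits needed to fully pin down a 20-track playlist from the catalog (~22.3).
i_required_full = math.log2(CATALOG_SIZE / PLAYLIST_SIZE)

def fidelity_ceiling(i_required: float, i_max: float = I_MAX) -> float:
    """Maximum achievable fidelity for a prompt demanding i_required bits of narrowing."""
    return min(1.0, i_max / i_required) if i_required > 0 else 1.0

# Tracing the frontier: simple prompts can reach 1.0, maximally specific ones cannot.
for demanded in (4, 8, 12, 16, 20, i_required_full):
    print(f"{demanded:5.1f} bits demanded -> ceiling {fidelity_ceiling(demanded):.2f}")
# The last line lands around 0.54, the ~55% ceiling described above.
```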

The uncomfortable implication is that the prompts users care about most (the ones that feel personal, specific, and tailored) are exactly the ones that push past the verified layer’s capacity. The most interesting outputs come from the least faithful execution. And the most boring prompts are the most trustworthy. That tradeoff is baked into the math. It doesn’t go away with scale, better models, or bigger databases. It only shifts.

For anyone building agents, the practical takeaway is this: you can compute your own I_max by auditing your tool schema. You can estimate the typical specificity of your users’ prompts. The ratio tells you how much of your agent’s output is structurally guaranteed to be inference. That’s a number you can put in front of a product team or a risk committee. And for agents handling policy questions, medical information, or financial advice, it means there is a provable lower bound on how much of any response cannot be grounded in retrieved data. You can shrink it. You cannot eliminate it.

The Broader Application: Every Agent Has This Problem

This is not a Spotify problem. This is a problem for any system where an LLM orchestrates tool calls to answer a user’s question.

Consider Retrieval Augmented Generation (RAG) systems, which power most enterprise AI knowledge-base deployments today. When an employee asks an internal assistant a policy question, part of the answer comes from retrieved documents and part comes from the LLM synthesizing across them, filling gaps, and smoothing the language into something readable. The retrieval is verified. The synthesis is inferred. And the response reads as one seamless paragraph with no indication of where the seams are. A compliance officer reading that answer has no way to know which sentence came from the enterprise policy document and which sentence the model invented to connect two paragraphs that didn’t quite fit together. The fidelity question is identical to the playlist question, just with higher stakes.

Coding agents face the same decomposition. When an AI generates a function, some of it may reference established patterns from its training data or documentation lookups, and some of it is novel generation. As more production code is written by AI, surfacing that ratio becomes a real engineering concern. A function that’s 90% grounded in well-tested patterns carries different risks than one that’s 90% novel generation, even if both pass the same test suite today.

Customer service bots may be the highest-stakes example. When a bot tells a customer what their refund policy is, that answer should be drawn directly from policy documents, full stop. Any inferred or synthesized content in that response is a liability. The silent substitution behavior observed in Spotify (where the agent ran a nearby query and narrated it as if it fulfilled the original request) would be genuinely dangerous in a customer service context. Imagine a bot confidently stating a return window or coverage term that it inferred rather than retrieved.

The general form of prompt fidelity applies to all of these:

Fidelity = bits of response grounded in tool calls / total bits of response

The hard part, and increasingly the core challenge of AI engineering work, is defining what “bits” means in each context. For a playlist with discrete constraints, it’s clean. For free-text generation, you’d need to decompose a response into individual claims and assess each one, which is closer to what factuality benchmarks already try to do, just reframed as an information-theoretic measure. That’s a hard measurement problem, and I don’t claim to have solved it here.
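One crude approximation, assuming you can get a response decomposed into atomic claims and each claim tagged with its source, is to count grounded claims instead of bits. It's a placeholder for the harder information-theoretic measure, not a solution to it.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    grounded: bool  # True if traceable to a retrieved document or tool call

def claim_level_fidelity(claims: list[Claim]) -> float:
    """Unweighted approximation: fraction of atomic claims backed by retrieval or tools."""
    if not claims:
        return 1.0
    return sum(c.grounded for c in claims) / len(claims)

# Example: a RAG answer decomposed into three claims, two cited, one synthesized.
answer = [
    Claim("The refund window is 30 days", grounded=True),
    Claim("The policy applies to online orders", grounded=True),
    Claim("Store credit is issued after 30 days", grounded=False),  # the model's gap-filler
]
print(round(claim_level_fidelity(answer), 2))  # 0.67
```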

But I think the framework has value even when exact measurement is impractical. If the people building these systems are thinking about fidelity as a design constraint (what fraction of this response can I ground in tool calls, and how do I communicate that to the user?) the outputs will be more trustworthy whether or not anyone computes a precise score. The goal isn’t a number on a dashboard. The goal is a mental model that shapes how we build. 

The Complexity Ceiling

Every agent has a complexity ceiling. Simple lookups (what’s the play count for this track?) are essentially free. Filtering the catalog against a set of field-level predicates (show me everything under 4 minutes, pre-2005, popularity below 40) scales linearly and runs fast. But the moment a prompt requires cross-referencing entities against each other (does this track appear in more than three of my playlists? was there a year-long gap somewhere in my listening history?) the cost jumps quadratically, and the agent either refuses outright or silently approximates.
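A toy illustration of that boundary, using a hypothetical track schema rather than Spotify's real one: field-level predicates compose into a single linear pass, while a cross-referencing question forces an aggregation over a second table before the filter can even be evaluated.

```python
from collections import Counter

# Toy library: per-track fields plus a separate playlist-membership table
# (hypothetical schema, not Spotify's).
tracks = [
    {"id": 1, "duration_s": 210, "year": 1998, "popularity": 35},
    {"id": 2, "duration_s": 300, "year": 2012, "popularity": 80},
    {"id": 3, "duration_s": 180, "year": 2001, "popularity": 22},
]
playlist_membership = [  # (track_id, playlist_name)
    (1, "gym"), (1, "focus"), (1, "road trip"), (1, "90s"), (3, "focus"),
]

# Cheap: independent WHERE-style predicates, answered in one linear pass.
cheap = [t for t in tracks
         if t["duration_s"] < 240 and t["year"] < 2005 and t["popularity"] < 40]

# Expensive: "tracks appearing in more than three of my playlists" needs an
# aggregation over the membership table before the filter can run at all.
appearances = Counter(track_id for track_id, _ in playlist_membership)
expensive = [t for t in tracks if appearances[t["id"]] > 3]

print([t["id"] for t in cheap])      # [1, 3]
print([t["id"] for t in expensive])  # [1]
```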

That silent approximation is the interesting failure mode. The agent follows a kind of principle of least computational action: when the exact query is too expensive, it relaxes your constraints until it finds a version it can afford to run. You asked for a specific valley in the search space; it rolled downhill to the nearest one instead. The result is a local minimum, close enough to look right, cheap enough to serve, but it’s not what you asked for, and it doesn’t tell you the difference.

This ceiling isn’t unique to Spotify. Any agent built on indexed database lookups will hit the same wall. The boundary sits right where queries stop being decomposable into independent WHERE clauses and start requiring joins, full scans, or aggregations across your entire history. Below that line, the agent is a precision tool. Above it, it’s a recommendation engine wearing a precision tool’s clothes. The question for anyone building these systems isn’t whether the ceiling exists (it always does) but whether your users know where it is.

What to Do About It: Design Recommendations

If prompt fidelity is a real and measurable property of agentic systems, the natural question is what to do about it. Here are five recommendations for anyone building or deploying AI agents with tool access.

  • Report fidelity, even approximately. Spotify already shows audio quality as a simple indicator (low, normal, high, very high) when you're streaming music. The same pattern works for prompt fidelity. You don't need to show the user a decimal score. A simple label ("this playlist closely matches your prompt" versus "this playlist is inspired by your prompt") would be enough to set expectations correctly. The difference between a precision tool and a recommendation engine is fine, as long as the user knows which one they're holding. A minimal sketch of such a mapping follows this list.
  • Distinguish grounded claims from inferred ones in the UX. This can be subtle. A small icon, a slight color shift, a footnote. When Spotify’s playlist notes say “86 skips” that’s a fact from a database. When they say “John Deacon’s bass line drives the whole track” that’s the LLM’s general knowledge. Both are presented identically today. Even a minimal visual distinction would let users calibrate their trust per claim rather than trusting or distrusting the entire output as a block.
  • Disclose substitutions explicitly. When an agent can’t fulfill a request exactly but can get close, it should say so. “I couldn’t filter on download status, so I found songs from albums you’ve saved but haven’t liked” preserves trust far more than silently serving a nearby result and narrating it as if the original request was fulfilled. Users are forgiving of limitations. They are much less forgiving of being misled.
  • Provide deterministic capability discovery. When I asked the Spotify agent to list every field it could filter on, it produced a different answer each time depending on the context of the prompt. The LLM was reconstructing the field list from memory rather than reading from a fixed reference. Any agent that exposes filtering or querying capabilities to users should have a stable, deterministic way to discover those capabilities. A “show me what you can do” command that returns the same answer every time is table stakes for user trust.
  • Audit your own agent with this technique before your users do. The methodology in this piece (pairing impossible constraints with target fields to force informative refusals) is a general-purpose audit technique that works on any agent with tool access. It took one evening and about a dozen prompts to map Spotify’s full data contract. Your users will do the same thing, whether you invite them to or not. The question is whether you understand your own system’s boundaries before they do.
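For the first recommendation, here is a minimal sketch of a score-to-label mapping; the thresholds are arbitrary choices for illustration.

```python
def fidelity_label(f: float) -> str:
    """Map a 0-1 fidelity score to user-facing language; cutoffs are illustrative."""
    if f >= 0.9:
        return "This playlist closely matches your prompt"
    if f >= 0.5:
        return "This playlist mostly matches your prompt"
    return "This playlist is inspired by your prompt"

print(fidelity_label(0.27))  # "This playlist is inspired by your prompt"
```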

Closing

Every AI agent has a fidelity score. Most are lower than you’d expect. None of them report it.

The methodology here (using impossible constraints to force informative refusals) isn’t specific to music or playlists. It works on any agent that calls tools. If the system can refuse, it can leak. If it can leak, you can map it. A dozen well-crafted prompts and an evening of curiosity is all it takes to understand what a production agent can actually do versus what it claims to do.

The math generalizes too. Weighting constraints by their selectivity rather than just counting them reveals something that a naïve audit misses: the constraints that make a prompt feel personal and specific are almost always the ones the system can’t verify. The most interesting outputs come from the least faithful execution. That tension doesn’t go away with better models or bigger databases. It’s structural.

As AI agents become the primary way people interact with data systems (their music libraries today, their financial accounts and medical records tomorrow) users will probe boundaries. They’ll find the gaps between what was promised and what was delivered. They’ll discover that the confident, well-narrated response was partially grounded and partially invented, with no way to tell which parts were which.

The question isn’t whether your agent’s fidelity will be measured. It’s whether you measured it first.

Bonus: Prompts Worth Trying (If You Have Spotify Premium)

Once you know the schema, you can write prompts that surface genuinely surprising things about your listening history. These all worked for me with varying degrees of tweaking:

The Relationship Autopsy

  • “Songs where my skip count is higher than my play count”
  • Fair warning: this one may cause existential discomfort (you skip these songs for a reason!)

Love at First Listen

  • “Songs where I saved them within 24 hours of my first play, sorted by oldest first”
  • A chronological timeline of tracks that grabbed you immediately

The Lifecycle

  • “Songs I first ever played, sorted by most plays”
  • Your origin story on the platform

The Marathon

  • “Songs where my total ms_played is highest, convert to hours”
  • Not most plays — most total time. A different and often surprising list

The Longest Relationship

  • “Songs with the smallest gap between first play and most recent play, with at least 50 plays, ordered by earliest first listen”

The One-Week Obsessions

  • “Songs I played more than 10 times in a single week and then never touched again”
  • Your former obsessions, fossilized. This was like a time machine for me.

The Time Capsule

  • “One song from each year I’ve been on Spotify — the song with the most plays from that year”

The Before and After

  • “Two sets: my 10 most-played songs in the 6 months before [milestone date] and my 10 most-played in the 6 months after”
  • Plug in any date that mattered — a move, a new job, a breakup, or even Covid-19 lockdown

The Soundtrack to a Year

  • “Pick the year where my total ms_played was highest. Build a playlist of my top songs from that year”

What Didn’t Work (and Why)

  • Comeback Story (year-long gap detection): “Songs I rediscovered after a year-long gap in listening”
    • agent can’t scan full play history for gaps. Snapshot queries work, timeline scans don’t.
  • Seasonal patterns (only played in December): “Songs I only played in December but never any other month”
    • proving universal negation requires full scan. Same fundamental limitation.
  • Derived math (ms_played / play_count): “Songs where my average listen time is under 30 seconds per play”
    • agent struggles with computed fields. Stick to raw comparisons.
  • These failures map directly to the complexity ceiling — they require O(n²) or full-scan operations the agent can’t or isn’t allowed to perform.

Tips

  • Reference field names directly when the agent misinterprets natural language
  • Start broad and tighten. Loose constraints succeed more often
  • “If you can’t do X, tell me what you CAN do” is the universal audit prompt

Track Metadata

Field Status Description
album ✅ Verified Album name
album_uri ✅ Verified Spotify URI for the album
artist ✅ Verified Artist name
artist_uri ✅ Verified Spotify URI for the artist
duration_ms ✅ Verified Track length in milliseconds
release_date ✅ Verified Release date, supports arbitrary cutoffs
popularity ✅ Verified 0–100 index. Proxy for streams, not a precise count
explicit ✅ Verified Boolean flag for explicit content
genre ✅ Verified Genre tags for track/artist
language_of_performance ✅ Verified Language code. “zxx” (no linguistic content) used as instrumentalness proxy

Audio Features (Partial)

Field Status Description
energy ✅ Verified Available as filterable field
tempo ✅ Verified BPM, available as filterable field
key / mode ❌ Unavailable “Would have to infer from knowledge; no verified field”
danceability ❌ Unavailable Not exposed despite existing in Spotify’s public API
valence ❌ Unavailable Not exposed despite existing in Spotify’s public API
acousticness ❌ Unavailable Not exposed despite existing in Spotify’s public API
speechiness ❌ Unavailable Not exposed despite existing in Spotify’s public API
instrumentalness ❌ Unavailable Replaced by language_of_performance == “zxx” workaround

User Behavioral Data

Field Status Description
user_play_count ✅ Verified Total plays per track. Observed: 122, 210, 276
user_ms_played ✅ Verified Total milliseconds streamed per track, album, artist
user_skip_count ✅ Verified Total skips per track. Observed: 64, 86
user_saved ✅ Verified Whether track is in Liked Songs
user_saved_album ✅ Verified Whether the album is saved to library
user_saved_date ✅ Verified Timestamp of when the track/album was saved
user_first_played ✅ Verified Timestamp of first play
user_last_played ✅ Verified Timestamp of most recent play
user_days_since_played ✅ Verified Pre-computed convenience field for recency filtering
user_streamed_track ✅ Verified Boolean: ever streamed this track
user_streamed_track_recently ✅ Verified Boolean: streamed in approx. last 6 months
user_streamed_artist ✅ Verified Boolean: ever streamed this artist
user_streamed_artist_recently ✅ Verified Boolean: streamed this artist recently
user_added_at ✅ Verified When a track was added to a playlist

Source & Context

Field Status Description
source ✅ Verified Play source: playlist, album, radio, autoplay, etc.
source_index ✅ Verified Position within the source
matched_playlist_name ✅ Verified Which playlist a track belongs to. No cross-playlist aggregation.

Period Analytics (Time-Windowed)

Field Status Description
period_ms_played ✅ Verified Milliseconds played within a rolling time window
period_plays ✅ Verified Play count within a rolling time window
period_skips ✅ Verified Skip count within a rolling time window
period_total ✅ Verified Total engagement metric within a rolling time window

Query / Search Fields

Field Status Description
title_query ✅ Verified Fuzzy text matching on track titles
artist_query ✅ Verified Fuzzy text matching on artist names

Confirmed Unavailable

Field Status Notes
Global stream counts ❌ Unavailable Cannot filter by exact play count (e.g., “under 10M streams”)
Cross-playlist count ❌ Unavailable Cannot count how many playlists a track appears in
Family/household data ❌ Unavailable Cannot access other users’ listening data
Download status ⚠️ Unreliable Agent served results but most tracks lacked download indicators. Likely device-local.
