Voice AI on smart glasses | RBKAVIN. Immersive Studio

The assumption that shapes every bad brief

When a brand team hears "smart glasses campaign," they picture a visual AR experience. A 3D product hovering in front of the user. An overlay on a logo. A world that gains a digital layer. The mental model is phones, but worn on your face.

Meta Ray-Ban smart glasses (the ones with 40 million sold and real cultural penetration) have no visual display at all. No lens, no overlay, no AR layer. The only output is audio, through open-ear speakers. The only input is voice, via "Hey Meta," plus a touchpad on the right arm for basic playback controls. There is no screen to design for. No visual UI. No spatial layout to compose.

This is not a limitation. It is a different design medium, and for certain brand experiences it is a better one. The problem is that almost no brand brief acknowledges it, because almost no brand team knows it is an option.

Meta Ray-Ban Display smart glasses worn on a person's face, showing the discreet form factor — The Meta Ray-Ban family: from left, the no-display Gen 2 and the Display variant with its in-lens screen. Form factor is identical; the interaction model is completely different. © Karissa Bell / Engadget

What Meta Ray-Ban actually is

The Meta Ray-Ban Gen 2 (the current standard model) looks like a pair of Wayfarer sunglasses. Inside the frame are four microphones, open-ear speakers positioned near the ear canal, a 12MP camera on the right lens, and the Meta AI assistant running on device and in the cloud via the MetaAI platform.

The interaction is simple: say "Hey Meta" followed by a question or instruction. The glasses respond through the speakers. The camera can be used for visual questions: point at a menu, ask what the calorie count is; point at a sign, ask for a translation. Meta AI handles Japanese, Mandarin, Arabic, and a growing list of other languages for real-time translation. The session is hands-free and screen-free, which is the point.

Battery life runs to around four hours of mixed use. The glasses connect to a phone via Bluetooth for internet access, but the core AI features work without pulling out the phone. That ambient, always-available quality is what makes them interesting for brand experiences: the user is in the world, not staring at a screen, and the brand can meet them there.

Three things brands can actually build

These are the use cases we deliver on the platform, not theoretical concepts:

Voice AI companion

A branded AI assistant with a personality, domain knowledge, and a tone of voice. In-store product advisor, event guide, always-on brand companion. Triggered hands-free, available anywhere the user has a signal.

Creator capture campaigns

The 12MP camera captures first-person POV at the same quality as a phone camera. Authentic, unposed, shot from the perspective the audience actually wants to see. Built for brand content programmes, athlete partnerships, and event coverage.

Audio brand experience

Branded soundscapes, narrated experiences, audio guides tied to a location or moment. Museum tours, festival activations, retail audio journeys. The spatial audio system creates presence even without a visual layer.

Voice AI companion: what the brief actually covers

A voice AI companion is not an FAQ chatbot with a name. Done well, it has a consistent personality across every interaction: a recognisable cadence, opinions on the brand's territory, and the ability to move between topics naturally. Done badly, it reads like a product manual out loud.

The production work breaks into three areas. First, model customisation: Meta's Llama models can be given system prompts, persona definitions, and domain knowledge bases via the Meta Llama API. This is where brand voice is established. Second, conversation design: mapping out likely request patterns, defining fallback responses, setting boundaries on what the assistant will and won't engage with. Third, audio production: the voice used for responses, confirmation tones, and any branded sound moments that open or close a session.

The result is an experience that feels like talking to someone who knows the brand well, not someone reading from a script. At events, in-store, or as a daily utility for loyal customers, that quality shift matters.

Creator capture: the brief nobody writes but everyone wants

First-person content from wearable cameras has been a content format for over a decade, but it has mostly required a GoPro strapped to something. Meta Ray-Ban changes this: the camera sits in a pair of glasses that look normal, the footage is stable and well-exposed, and the user is not visibly "filming" in a way that changes how people around them behave.

For brands running influencer or creator programmes, the brief is straightforward: seed the glasses to a roster of creators, define the moments to capture, brief the content direction, and receive first-person footage that audiences have consistently found more compelling than polished set-piece production. The footage shoots to the connected phone automatically. Post-production is handled by the creator or the studio.

This format works particularly well for sports sponsorships, music and festival content, and food or hospitality brands where the experience itself is the story. The first-person perspective puts the audience into a moment they could not otherwise access, which is the brief most content programmes aim for anyway.

Conversational UX: designing without a screen

Every design decision that normally lives in visual UI (hierarchy, flow, feedback, error states) has to be rebuilt in language and sound. This is conversational UX, and it is a distinct discipline from visual or spatial design.

The primary skill is response design: writing AI responses that feel like conversation, not documentation. Short sentences. Active voice. A single idea per turn. No lists, because lists require visual scanning that audio cannot support. The mental model is closer to radio writing than interface design: every word has to work acoustically, and the rhythm of a response matters as much as its content.

The secondary skill is error handling. When a user asks something outside the assistant's domain, or when a request is ambiguous, the response has to redirect gracefully without making the user feel stupid. "I'm not the right person for that, but here is what I can help with" is more useful than a generic "I don't understand." Mapping these edge cases in advance is most of the production work on a voice AI project.

Feedback without a screen

On a visual interface, you always know whether your input was received: a button changes state, a spinner appears, a confirmation pops up. On an audio-only interface, the gap between action and response feels longer, even when it is not. The fix is a quick audio acknowledgement before the full response: a short tone, or a single word ("Looking...") that confirms the request was heard. This is the audio equivalent of a loading state, and it removes most of the friction.

Sound design matters more here than most brand teams expect. The tone that confirms a request, the sound that signals a session is starting, the ambient audio that plays when an experience loads. These carry brand weight in the same way that colour or typography does in visual design. Treating them as functional details rather than creative decisions is the most common mistake we see in voice-first briefs.

The platform family: three different briefs

The Meta wearables line now covers three distinct platforms, and each one is a different production brief:

No display

Meta Ray-Ban Gen 2

Voice in, audio out. The AI assistant platform. Design discipline: conversational UX, audio branding, response writing. Best for: brand companions, creator capture, always-on utility.

In-lens screen

Meta Ray-Ban Display

Small in-lens screen, controlled by Neural Band EMG wristband. Design discipline: glanceable UI, notification design, wrist gesture interaction. Best for: quick information surfaces, captions, navigation.

Full AR display

Snap Spectacles

46-degree stereo AR display, full hand tracking, 6DoF. The spatial UX platform. Design discipline: spatial composition, hand interaction, session architecture. Best for: immersive brand experiences, interactive installations.

The mistake is treating these as the same thing at different price points. They are not. A brief written for Spectacles will not map onto Meta Ray-Ban Gen 2 without a complete redesign. The first question in any smart glasses brief should be: which platform, and what does that platform actually output? For visual AR, that is Spectacles. For voice-first AI, that is Meta Ray-Ban Gen 2. For a glanceable display layer, that is the Meta Ray-Ban Display. See designing for smart glasses for a deeper look at how the brief changes across each platform's capabilities.

What a project brief looks like in practice

A voice AI companion project typically runs over eight to twelve weeks from brief to live. The main phases are: persona and knowledge base definition (two to three weeks), conversation design and scripting (two to three weeks), integration and testing (two to three weeks), and deployment and measurement setup (one to two weeks).

The brief itself needs to answer five questions before production can start:

Who is the user? What do they already know about the brand, and what do they need from this interaction?
Where does the experience happen? In-store, at an event, in daily life? Context shapes the response cadence and the topics the assistant needs to handle.
What is the brand voice? Tone, vocabulary, what the assistant will and will not say. This needs to be documented before conversation design begins.
What does success look like? Session length, question completion rate, user return rate. Audio experiences need different measurement frameworks than visual campaigns.
How does it end? Every interaction needs an off-ramp: a call to action, a referral, or a clean close. "Anything else?" is not a strategy.

The medium shapes the message

Voice-first experiences succeed when the brief is written for the medium, not translated from a visual campaign. A visual AR activation and a voice AI companion are both smart glasses projects. They are not the same brief, the same production process, or the same measure of success. The brands that build the most compelling voice experiences are the ones that start by understanding what audio can do that visuals cannot: ambient presence, hands-free utility, and the directness of a conversation with something that actually knows about your world.

We build voice-first experiences on Meta Ray-Ban alongside visual spatial work on Snap Spectacles. If you want to understand which format fits your next campaign, the brief is the best place to start.

Insights newsletter

Smart glasses, AR campaigns, spatial computing.

Straight to your inbox. No noise.

Frequently asked questions

Can Meta Ray-Ban smart glasses run branded AI experiences?

Yes. The Meta AI assistant on Ray-Ban smart glasses can be given a persona, domain knowledge, and a tone of voice through the Meta Llama API and integration tools. A brand can build an always-on voice companion that knows its products, answers questions, and speaks in the brand's voice, all without the user reaching for their phone. The experience is triggered by "Hey Meta" and runs hands-free.

What is the difference between Meta Ray-Ban Gen 2 and Meta Ray-Ban Display?

Meta Ray-Ban Gen 2 (the standard model) has no visual display. Input is voice via "Hey Meta" or a touchpad on the frame. Output is audio through open-ear speakers. Meta Ray-Ban Display is a separate product that adds a small in-lens screen controlled by the Neural Band EMG wristband. They are different design briefs: Gen 2 is conversational UX, Display is glanceable information design.

What kind of brand experiences work on a voice-only smart glasses platform?

Three categories work well: voice AI companions (brand assistants, product advisors, in-store guides), creator capture campaigns (using the glasses' 12MP camera for authentic first-person brand content), and live event audio companions (guided experiences, artist or exhibitor information, wayfinding by voice). Each requires a different production approach but all run natively on Meta Ray-Ban without any additional hardware.

How is audio branding different from visual branding for smart glasses?

Visual branding is about what you see: logo, colour, typography, motion. Audio branding is about what you hear: the brand voice, the confirmation tone, the ambient sound that plays at the start of an experience. On a no-display platform, audio is the only output channel, so the quality of the voice, the cadence of responses, and the sound design all carry the brand weight that visuals would normally carry. Brands that treat the voice as a production detail rather than a creative decision miss most of the medium.

Building a voice-first brand experience?

We would like to hear the brief.

Start a project

Voice-first AI on smart glasses:what brands can build without a display