cortis · engineering · on-device AI · Siri

How We Built an AI That Works Inside Siri

How Cortis runs a real language model on your iPhone and feeds answers back through Siri — no app opens, no internet needed. A technical look at headless on-device inference.

5 min read · NERON
Also in: Türkçe

Most AI apps are just chat windows with an API call behind them. You type a prompt, it hits a server, the server runs inference, and you get a response. Cortis does something different.

When you say "Hey Siri, ask Cortis: explain quantum computing", a real language model runs entirely on your iPhone's CPU and GPU, generates a response, and feeds it back through Siri. No app opens. No internet. No server call. The answer comes from the silicon in your pocket.

Here's how we built it.

The Problem: Siri Can't Think

Siri is great at setting timers and playing music. Ask it something that requires actual reasoning — "explain the difference between TCP and UDP" — and you get a web search link. Apple Intelligence is improving this, but it requires iPhone 15 Pro or newer, and it's still limited in scope.

We wanted to give every iPhone 12+ user access to a real LLM through Siri, offline, with zero cloud dependency. (If you're curious how this approach stacks up against Google's on-device effort, we wrote an honest comparison of Cortis and Google AI Edge Gallery — two very different takes on on-device AI.)

Dual Inference Architecture

Cortis has two separate inference paths:

Path 1: Foreground (React Native)
The chat UI uses llama.rn — React Native JSI bindings for llama.cpp. This runs when the app is open and you're typing in the chat.

Path 2: Headless (Native Swift/ObjC++)
Siri and Shortcuts use a completely separate native inference engine. This is the interesting part.

Siri → AppIntent.perform() → LlamaEngine (Swift) → LlamaBridge (ObjC++) → llama.cpp C API
                                    ↑                                            ↓
                         App Group UserDefaults                         String result returned
                         (model path, settings)                         (app never opens)

The headless engine is a Swift class called LlamaEngine that reads the active model path from an App Group shared container, loads the model through an ObjC++ bridge (LlamaBridge), and runs inference directly against the llama.cpp C API.
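The App Group handoff is the only channel between the two worlds. A minimal sketch of the idea (the group identifier, key, and path here are illustrative, not Cortis's actual values):

```swift
import Foundation

// Sketch: how a headless engine can read shared settings from an
// App Group container. Identifier, key, and path are illustrative.
let appGroupID = "group.com.example.cortis"
let shared = UserDefaults(suiteName: appGroupID)

// The app writes the active model path when the user picks a model...
shared?.set("/models/llama-3.2-1b-q4.gguf", forKey: "activeModelPath")

// ...and the Siri-invoked engine reads it later, with no app process running.
let modelPath = shared?.string(forKey: "activeModelPath")
```

Because both the app and the intent extension are entitled to the same App Group, the background process needs no running JavaScript or IPC to find out which model to load.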

Why Two Engines?

React Native can't run in the background. When Siri invokes an App Intent, the React Native runtime isn't loaded — there's no JavaScript context, no bridge, no JSI. Pure native code is required.

The ObjC++ bridge (LlamaBridge.mm) is a singleton that wraps the raw llama.cpp functions: model loading, tokenization, decoding, and sampling. The Swift layer (LlamaEngine.swift) handles the higher-level logic: reading configuration from the App Group, progressive memory fallback, and stop-string detection.
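Stop-string detection, one piece of that higher-level logic, can be sketched in pure Swift (function names are ours, not the actual LlamaEngine API): as tokens stream in, the accumulated text is cut at the earliest occurrence of any stop sequence.

```swift
import Foundation

// Illustrative sketch of stop-string detection (hypothetical names,
// not the real LlamaEngine API). The accumulated output is truncated
// at the earliest stop sequence, if one appears.
func truncateAtStop(_ output: String, stopStrings: [String]) -> String {
    let earliest = stopStrings
        .compactMap { output.range(of: $0) }
        .min { $0.lowerBound < $1.lowerBound }
    guard let stop = earliest else { return output }
    return String(output[..<stop.lowerBound])
}
```

In a streaming setup this check runs after every decoded token, so generation can stop early instead of burning the full token budget.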

Progressive Memory Fallback

Background processes on iOS get far less memory than foreground apps. A model that loads fine in the app might crash when invoked from Siri. Our solution: progressive fallback.

  1. Try GPU + 2048 context
  2. Fall back to GPU + 1024
  3. Fall back to GPU + 512
  4. Fall back to CPU-only + 512

Each step reduces memory requirements. The engine tries the most capable configuration first and degrades gracefully.
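The ladder above reduces to a short loop. A sketch under assumed names (the real loader calls into llama.cpp through LlamaBridge; `tryLoad` stands in for that call):

```swift
// Sketch of the progressive fallback ladder described above.
struct LoadConfig {
    let useGPU: Bool
    let contextTokens: Int
}

let ladder: [LoadConfig] = [
    LoadConfig(useGPU: true,  contextTokens: 2048),
    LoadConfig(useGPU: true,  contextTokens: 1024),
    LoadConfig(useGPU: true,  contextTokens: 512),
    LoadConfig(useGPU: false, contextTokens: 512),
]

// `tryLoad` stands in for the real model-loading call; it returns false
// when a configuration fails (typically out of memory in the background).
func loadWithFallback(tryLoad: (LoadConfig) -> Bool) -> LoadConfig? {
    for config in ladder where tryLoad(config) {
        return config
    }
    return nil  // even CPU + 512 failed; surface an error to the user
}
```

The ordering matters: the most capable configuration is attempted first, so foreground-sized memory budgets still get the full 2048-token context.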

17 App Intents with String Returns

This is what makes Cortis uniquely powerful for Shortcuts. Every headless intent returns a result conforming to ReturnsValue<String> — meaning the AI's response comes back as a plain string that Shortcuts can pipe into the next action.

Example Shortcuts chain:

  • Safari → Get Page Content → Cortis: Summarize → Notes: Create Note
  • Voice Memo → Transcribe → Cortis: Extract Action Items → Reminders: Add

No other on-device AI app returns string values from Shortcuts. This enables fully offline AI automation pipelines. If you want to see these chains in action, we've written a step-by-step guide to five practical Cortis Shortcuts you can build in under a minute each.
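Here's roughly what such an intent looks like. This is an illustrative sketch, not Cortis's actual source: the `LlamaEngine.shared.generate` call is an assumed API, and the code requires the AppIntents framework (iOS 16+), so it only compiles on-device.

```swift
import AppIntents

struct SummarizeTextIntent: AppIntent {
    static let title: LocalizedStringResource = "Summarize Text"

    @Parameter(title: "Text")
    var text: String

    func perform() async throws -> some IntentResult & ReturnsValue<String> {
        // Assumed engine API: runs headless llama.cpp inference natively.
        let summary = try await LlamaEngine.shared.generate(
            prompt: "Summarize in two sentences: \(text)"
        )
        return .result(value: summary)  // plain String, pipeable in Shortcuts
    }
}
```

The `.result(value:)` return is what Shortcuts exposes as the intent's output variable, so the next action in the chain receives the raw text.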

Token Limit and Siri-Optimized Prompts

Siri has a ~15-second timeout for intent responses. At roughly 20 tokens/second on an iPhone 14, that's about 300 tokens in theory; after model loading and prompt processing eat into the window, around 200 are actually usable. Every headless intent caps output at 200 tokens.
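The arithmetic behind the cap can be sketched directly (the 5-second overhead figure is our assumption for model load plus prompt processing, not a measured Cortis number):

```swift
// Back-of-envelope token budget for a Siri intent.
// The overhead figure is an assumption, not a measured value.
func usableTokens(tokensPerSecond: Double,
                  timeoutSeconds: Double,
                  overheadSeconds: Double) -> Int {
    max(0, Int((timeoutSeconds - overheadSeconds) * tokensPerSecond))
}

// ~15 s Siri timeout, ~5 s reserved for load + prompt processing,
// ~20 tok/s on an iPhone 14 → a 200-token output cap.
let cap = usableTokens(tokensPerSecond: 20, timeoutSeconds: 15, overheadSeconds: 5)
```

Faster devices raise the budget, but the cap stays fixed at the floor so a response never gets cut off mid-sentence by the timeout.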

We also use special system prompts for Siri that enforce:

  • Plain conversational text (no markdown, no emoji)
  • No preamble ("Sure! Here's..." — Siri reads this aloud and it sounds robotic)
  • Cortis identifies itself as a private on-device assistant

What This Means for Users

  1. Ask Siri anything, get a real answer — not a web search link
  2. Build AI automations with Shortcuts — summarize, translate, rewrite, extract — all offline
  3. Share from any app — select text in Safari, share to Cortis, get a summary
  4. Search past conversations from Spotlight — every chat is indexed

This isn't a gimmick. It's a fundamentally different architecture from cloud AI apps, and it enables use cases that ChatGPT and Claude simply can't offer — because they need a server. It's also where the Siri-first approach really starts to pay off: for a broader look, see our guide to ten practical things you can do with on-device AI when you have no internet at all, from drafting emails on a flight to offline translation abroad.

Try It

Cortis is available on the App Store. The Siri integration, Shortcuts support, and Share Extension all work on the free tier — no Pro purchase needed.

The free model (Llama 3.2 1B) is small but capable enough for Siri queries. Pro unlocks larger models, remote server inference, and custom personas. Want to see Cortis in action? Try our guide to 5 AI Shortcuts that work without internet or 10 things you can do with AI offline.


NERON LLC builds AI tools that respect your privacy. Cortis is our on-device AI assistant for iOS and Android.