
Is "Agentic Cooking" Possible?

Apr 2026  ·  AI  ·  Computer Vision  ·  Wearables

When you cook something for the first time, the hardest part is not the steps; it is not knowing what to ask. Recipes are written for people who already get it, and your phone is not very useful when your hands are covered in flour. Most of the help you actually need ("is this the right colour?", "what does fold in mean?", "is this small enough?") only really works if someone is standing next to you, watching the pan. Glasses can see what you see, and a model on the other end can answer.

How it all started

At Imperial we are given a long list of topics to choose from for the Final Year Project (FYP). I looked through 100+ of them, but one title, "Real-Time Gaze-Contingent Contextual AI on Meta Aria Glasses", intrigued me so much that I was at the supervisor's office 30 minutes after the list was released.

Part of why it grabbed me was that the pieces have only recently lined up. AI glasses are showing up more and more as a real form factor, vision models that can actually understand what is on a chopping board are still pretty new, and local LLMs got good enough to run the whole loop without paying for API calls every few seconds.

First Prototype

Plumbing interfaces together

The system has three components: the Meta Aria glasses, a laptop, and an LLM backend.

  • The Meta Aria glasses have cameras and a microphone; both need to be streamed in real time.
  • The laptop ties everything together: it receives the stream, makes API requests to the backend, and converts speech to text and text to speech (see the sketch after this list).
  • The LLM backend can be a simple API call to a model provider, or a self-hosted LLM (my case).
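
To make the plumbing concrete, here is a minimal sketch of the laptop side, assuming an ollama-style backend on its default port. The function and model names are placeholders, not the project's actual code; only the backend call is fleshed out, since the streaming, speech-to-text and text-to-speech parts depend on the Aria SDK and whichever speech libraries end up being used.

```python
# A minimal sketch of the laptop-side plumbing, not the project's actual code.
import base64
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"   # ollama's default port

def ask_backend(question: str, frame_jpeg: bytes, model: str = "llama3.2-vision") -> str:
    """Send the user's question plus the current POV frame to the LLM backend."""
    payload = {
        "model": model,                                     # placeholder model name
        "prompt": question,
        "images": [base64.b64encode(frame_jpeg).decode()],  # ollama accepts base64 images
        "stream": False,
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["response"]

# The loop around it: frame + audio from the glasses -> speech-to-text ->
# ask_backend(text, frame) -> text-to-speech back to the user.
```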

The glasses can stream images and audio really well when connected via a cable, but you cannot cook much with the glasses tethered a metre from the laptop.

The glasses also have their own Wi-Fi hotspot, so streaming over Wi-Fi is possible. The problem is that the hotspot has no internet, so the moment the laptop joins it, it loses the route to the backend.

The workaround I found was to keep the laptop on the glasses' hotspot for the stream and tether my phone over USB so the backend traffic goes through that instead. Two separate connections at the same time, one for each job.
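
A quick way to convince yourself both links are actually up at once is to try to reach something on each of them. The addresses and ports below are hypothetical stand-ins for the glasses' hotspot gateway and the backend reached through the phone's tether.

```python
# Hypothetical sanity check that both links are up at the same time.
# Replace the hosts/ports with the real hotspot gateway and backend address.
import socket

def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("glasses stream:", reachable("192.168.0.1", 8888))     # hotspot side (placeholder port)
print("LLM backend:   ", reachable("203.0.113.10", 11434))   # via the USB tether (placeholder IP)
```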

I have access to a PC with a Radeon RX 9070 XT GPU (sadly not Nvidia, that would have been much easier). To make ollama actually use the GPU instead of the CPU you need ROCm drivers, and getting ROCm working on a brand new card took me a few install/reinstall cycles. Once it is set up though, it works surprisingly well.

After that I set up a reverse SSH tunnel from the PC to my laptop that forwards ollama's port. From the laptop's side, ollama looks like it is running on localhost, even though it is actually on the other machine.
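
For anyone curious about the tunnel itself, something along these lines runs on the PC; the user and hostname are placeholders, and the only ollama-specific part is its default port, 11434.

```python
# Runs on the PC with the GPU. Keeps a reverse SSH tunnel open so the
# laptop's localhost:11434 is forwarded to ollama on this machine.
# "me@laptop.example" is a placeholder for the actual user/host.
import subprocess
import time

CMD = [
    "ssh", "-N",                       # no remote command, just forward the port
    "-o", "ServerAliveInterval=30",    # notice dead connections quickly
    "-R", "11434:localhost:11434",     # laptop's port 11434 -> local ollama
    "me@laptop.example",
]

while True:
    subprocess.run(CMD)                # blocks while the tunnel is up
    time.sleep(5)                      # back off a little before reconnecting
```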

What's next

There are two things I still want to figure out before this feels good to actually cook with: when the agent should talk on its own, and how to make it fast enough that the wait does not break the flow.

User Experience

The initial prototype handles very basic requests: "Tell me what to do next", "How do I mix the batter together?", "Where should I move the vegetables?". However, all of the interactions are user-driven: you have to know what to ask, which is often not the case while cooking something for the first time.

To tackle this, a system that watches the current POV on its own, keeps updating its understanding of the progress, and then suggests what to do next would be an amazing addition. The simplest version would check periodically, say once every few seconds. It could still miss very subtle actions, so this is a direction to explore.
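
As a rough illustration of the periodic version: none of this is implemented yet, the prompt is made up, and grab_frame/ask_backend/speak are the same stand-ins as in the earlier sketch.

```python
# Rough sketch of the periodic "should I say something?" loop. Not implemented;
# grab_frame/ask_backend/speak are stand-ins, and the prompt is only illustrative.
import time

CHECK_PROMPT = (
    "You are watching someone cook. Given the current frame and the recipe step "
    "'{step}', reply DONE if the step looks finished, OK if everything looks fine, "
    "or one short suggestion otherwise."
)

def watch(recipe_steps, interval_s: float = 5.0):
    step = 0
    while step < len(recipe_steps):
        frame = grab_frame()
        reply = ask_backend(CHECK_PROMPT.format(step=recipe_steps[step]), frame).strip()
        if reply.upper() == "DONE":
            step += 1                  # model thinks this step is finished
        elif reply.upper() != "OK":
            speak(reply)               # only interrupt when there is something to say
        time.sleep(interval_s)
```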

Latency

The current pipeline takes seconds before the user hears anything, and longer answers stretch the silence further. The user just stands there not doing anything.

The first thing to try is streaming the speech in chunks (streaming TTS). The first chunk plays while the next one is being generated, so with the right threading the user only waits for the first chunk instead of the whole answer.
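
A sketch of what that threading could look like, assuming the LLM output arrives as a token stream; synthesize() and play() are stand-ins for whatever TTS library ends up being used.

```python
# Sketch of chunked TTS: split the streaming LLM output at sentence boundaries,
# hand each chunk to a worker, and play chunks as they finish.
# synthesize() and play() are stand-ins for the actual TTS library.
import queue
import re
import threading

def speak_streaming(token_stream):
    chunks = queue.Queue()

    def tts_worker():
        while (chunk := chunks.get()) is not None:
            play(synthesize(chunk))            # chunk N plays while chunk N+1 is still being generated

    worker = threading.Thread(target=tts_worker, daemon=True)
    worker.start()

    buffer = ""
    for token in token_stream:                 # tokens as they come back from the LLM
        buffer += token
        # flush whenever we have a full sentence, so the first chunk plays early
        while (m := re.search(r"[.!?]\s", buffer)):
            chunks.put(buffer[: m.end()])
            buffer = buffer[m.end():]
    if buffer.strip():
        chunks.put(buffer)
    chunks.put(None)                           # signal the worker to stop
    worker.join()
```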

The input side is more interesting. Since the GPU is mine alone and the recipe + system prompt stay the same for the whole session, I can keep the KV cache for that prefix warm and only pay for the new tokens of each query. Also, while the user is still talking, I can start decoding on the partial transcript: if the final question matches what I predicted, the answer is already half done; if not, the time is about the same as before.
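
A sketch of the "decode on the partial transcript" half, reusing the ask_backend stand-in from earlier; whether the server actually reuses the warm prefix is up to the backend, and the similarity threshold is a guess that would need tuning.

```python
# Sketch of speculative decoding on the partial transcript. ask_backend is the
# stand-in from the earlier sketch; the threshold is a guess, not a tuned value.
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

executor = ThreadPoolExecutor(max_workers=1)

def answer_with_guess(partial_text: str, final_text: str, frame) -> str:
    # start generating as soon as a partial transcript is available
    guess = executor.submit(ask_backend, partial_text, frame)
    # ...the user finishes talking and final_text arrives...
    similarity = SequenceMatcher(None, partial_text, final_text).ratio()
    if similarity > 0.9:
        return guess.result()      # the speculative answer is already underway
    guess.cancel()                 # best effort; if it already started we just ignore it
    return ask_backend(final_text, frame)
```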

The most interesting idea is speculative answers: basically a small set of canned responses for the questions every cook asks. "How long do I simmer the pasta?" → "11 minutes, until al dente." If the incoming question is close enough to one of these, skip the model and answer in milliseconds.
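
A sketch of that shortcut, using plain string similarity from difflib so it stays dependency-free; a real version would probably match on embeddings instead, and the canned entries here are just examples.

```python
# Sketch of the canned-answer shortcut. difflib keeps it dependency-free;
# a real version would probably use embedding similarity instead.
from difflib import SequenceMatcher

CANNED = {
    "how long do i simmer the pasta": "11 minutes, until al dente.",
    "what does fold in mean": "Gently turn the mixture over with a spatula instead of stirring.",
}

def canned_answer(question: str, threshold: float = 0.85) -> str | None:
    q = question.lower().strip(" ?")
    best, score = None, 0.0
    for key, reply in CANNED.items():
        s = SequenceMatcher(None, q, key).ratio()
        if s > score:
            best, score = reply, s
    return best if score >= threshold else None   # None -> fall through to the LLM
```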

Afterword

A lot of these are just hunches and ideas; they need to be made more rigorous, with existing research backing them up or informing the decisions. One thing is for sure: it is going to be a lot of work, and almost all of it will look different by the end. Cooking is a nice first thing to try this on, but the same setup would work for anything where your hands are busy and someone watching what you are doing would help.

Stay tuned for the next post in this series, where I will go through the Context Engineering needed to make any of this actually work.

Thanks for reading my first blog post.