Apple AI research reveals a model that could make giving commands to Siri faster and more efficient by converting any given context into text, which is easier for a Large Language Model to parse.
Having a computer program perform a task based on ambiguous language inputs, such as a user saying "this" or "that," is called reference resolution. It's a difficult problem to solve since computers can't interpret images the way humans can, but Apple may have found a streamlined solution using LLMs.
When speaking to smart assistants like Siri, users might reference any number of contextual entities, such as background tasks, on-display data, and other non-conversational information. Traditional parsing methods rely on incredibly large models and reference materials like images, but Apple has streamlined the approach by converting everything to text.
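To make that idea concrete, here is a minimal sketch, not Apple's actual implementation, of how on-screen entities might be flattened into a text block an LLM can read. The entity types, fields, and prompt wording are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class OnScreenEntity:
    """A hypothetical on-screen item the assistant might be asked about."""
    entity_type: str   # e.g. "phone_number", "address", "button"
    text: str          # the visible text of the element
    top: int           # rough vertical position, used for ordering
    left: int          # rough horizontal position

def serialize_screen(entities: list[OnScreenEntity]) -> str:
    """Flatten on-screen entities into plain text for an LLM.

    Entities are sorted top-to-bottom, left-to-right so the text roughly
    preserves screen layout, and each one is tagged with an index the model
    can point back to when resolving "this" or "that".
    """
    ordered = sorted(entities, key=lambda e: (e.top, e.left))
    return "\n".join(
        f"[{i}] ({e.entity_type}) {e.text}" for i, e in enumerate(ordered)
    )

# Example: a contact page with a phone number and an address on screen.
screen = [
    OnScreenEntity("business_name", "Joe's Pizza", top=10, left=5),
    OnScreenEntity("phone_number", "415-555-0123", top=40, left=5),
    OnScreenEntity("address", "123 Main St, San Francisco", top=70, left=5),
]

prompt = (
    "Screen contents:\n"
    + serialize_screen(screen)
    + "\nUser: call this place\n"
    + "Which entity index does 'this place' refer to?"
)
print(prompt)
```

Because the screen is now just text, the same small language model that handles the conversation can also resolve what "this place" points to, without any image processing.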
Apple found that its smallest ReALM models performed similarly to GPT-4 with far fewer parameters, making them better suited for on-device use. Increasing the parameters used in ReALM made it substantially outperform GPT-4.
One reason for this performance gap is GPT-4's reliance on image parsing to understand on-screen information. Much of its image training data is built on natural imagery, not artificial, code-based web pages filled with text, so direct image-based reading of a screen is less efficient.
Converting an image into text allows ReALM to skip these advanced image recognition parameters, making it smaller and more efficient. Apple also avoids issues with hallucination by constraining decoding or using simple post-processing.
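As a rough illustration of the post-processing idea, and not the mechanism Apple describes, the decoded answer can be checked against the list of candidate entities so the assistant never acts on something that was never on screen. The index format and function names here are assumptions for the sketch.

```python
import re

def resolve_reference(model_output: str, num_entities: int) -> int | None:
    """Post-process a model's answer so it can only name a real on-screen entity.

    The model is asked to answer with an index like "[2]"; anything that is
    not a valid index into the entity list is rejected, which blocks the
    assistant from hallucinating an entity that does not exist.
    """
    match = re.search(r"\[(\d+)\]", model_output)
    if match is None:
        return None          # no parsable index: treat as unresolved
    index = int(match.group(1))
    if 0 <= index < num_entities:
        return index         # a real entity: safe to act on
    return None              # out-of-range index: reject as a hallucination

# Example: three entities were on screen, so only indexes 0-2 are accepted.
print(resolve_reference("The user means [1]", num_entities=3))  # -> 1
print(resolve_reference("The user means [7]", num_entities=3))  # -> None
```

Constrained decoding takes the same idea further by restricting the model's output to valid entity indexes while it generates, rather than filtering afterward.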