Apple’s use of fake data to train AI is not as weird as it sounds


Last weekend, Bloomberg’s Mark Gurman and Drake Bennett published a comprehensive look into what went wrong with Apple Intelligence.

The piece details everything from years-long oversights to a deep misunderstanding of AI’s potential at the company’s highest levels. But more importantly, it also outlines what Apple is doing now to catch up. One of those efforts? A push into synthetic data.

As Gurman and Bennett put it:

All this has left Apple’s researchers more heavily reliant on datasets it licenses from third parties and on so-called synthetic data—artificial data created expressly to train AI.

And:

Thanks to a recent software update, iPhones have also been enlisted to help improve Apple’s synthetic data. The fake data is assessed and enhanced by comparing it with the language in user emails on their phones, providing real-world reference points for AI training without feeding actual user information into the models.

If this idea sounds weird, here’s the first thing you should know: Apple is hardly the first company to lean on computer-generated “fake” data to train AI models.

Companies like OpenAI, Microsoft, and Meta have all successfully trained models relying on this technique. But Bloomberg’s report has put the method under the spotlight for Apple enthusiasts.

In short, synthetic data lets engineers create enormous, perfectly labeled, privacy-safe datasets on demand. It allows them to cover edge cases that rarely appear in the wild, and iterate far faster than if they waited for real-world samples to trickle in.

Here’s how OpenAI detailed the use of synthetic data to reduce hallucinations while training GPT-4, back in March 2023:

For closed-domain hallucinations, we are able to use GPT-4 itself to generate synthetic data. Specifically, we design a multi-step process to generate comparison data:

  1. Pass a prompt through GPT-4 model and get a response
  2. Pass prompt + response through GPT-4 with an instruction to list all hallucinations
    (a) If no hallucinations are found, continue
  3. Pass prompt + response + hallucinations through GPT-4 with an instruction to rewrite the response without hallucinations
  4. Pass prompt + new response through GPT-4 with an instruction to list all hallucinations
    (a) If none are found, keep (original response, new response) comparison pair
    (b) Otherwise, repeat up to 5x

This process produces comparisons between (original response with hallucinations, new response without hallucinations according to GPT-4), which we also mix into our RM dataset. We find that our mitigations on hallucinations improve performance on factuality as measured by evaluations such as TruthfulQA and increase accuracy to around 60% as compared to 30% for an earlier version.
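Translated into (heavily simplified) code, the loop quoted above looks something like the sketch below. The `ask` helper is a hypothetical stand-in for a call to the model's API, and the prompt wording is my own; this is an illustration of the described process, not OpenAI's implementation.

```python
from collections.abc import Callable

MAX_RETRIES = 5

def find_hallucinations(ask: Callable[[str], str], prompt: str, response: str) -> list[str]:
    """Steps 2 and 4: ask the model to list hallucinated claims in `response`."""
    answer = ask(
        f"Prompt: {prompt}\nResponse: {response}\n"
        "List every claim in the response that is hallucinated. "
        "Reply NONE if there are none."
    )
    return [] if answer.strip() == "NONE" else answer.splitlines()

def make_comparison_pair(ask: Callable[[str], str], prompt: str) -> tuple[str, str] | None:
    """Produce an (original, rewritten) comparison pair for the reward-model
    dataset, or None if no usable pair is found."""
    original = ask(prompt)                                    # step 1
    issues = find_hallucinations(ask, prompt, original)       # step 2
    if not issues:
        return None                                           # step 2a: nothing to fix
    rewritten = original
    for _ in range(MAX_RETRIES):                              # step 4b: repeat up to 5x
        rewritten = ask(                                      # step 3: rewrite without them
            f"Prompt: {prompt}\nResponse: {rewritten}\n"
            f"Hallucinations: {issues}\n"
            "Rewrite the response so it contains none of these hallucinations."
        )
        issues = find_hallucinations(ask, prompt, rewritten)  # step 4
        if not issues:
            return original, rewritten                        # step 4a: keep the pair
    return None
```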

As for Microsoft, its small language model Phi-4, from December 2024, was trained on 55% synthetic data, with the remaining 45% split across other sources. Of course, it did help that Phi-4 was an SLM with just 14 billion parameters, instead of the trillions of parameters found in today’s frontier LLMs.

Yet the model (which is open, by the way) outperformed bigger models like GPT-4o and Gemini 1.5 Pro on math and reasoning tasks.

Chart: Average performance of different models on the November 2024 AMC-10 and AMC-12 tests, from Microsoft’s Phi-4 Technical Report.
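As an aside, here’s what a fixed mixture like “55% synthetic” can mean in practice: at training time, each document is drawn from one of several corpora with fixed probabilities. The corpus names and the breakdown of the remaining 45% below are made-up placeholders, not Microsoft’s actual recipe.

```python
import random

# Illustrative only: realizing a fixed data mixture via weighted sampling.
# The 0.55 weight mirrors the figure cited above; everything else is a
# placeholder.
CORPORA = {
    "synthetic":    (["synthetic doc A", "synthetic doc B"], 0.55),
    "filtered_web": (["web doc A"], 0.25),
    "books_code":   (["book excerpt A"], 0.20),
}

def sample_document(rng: random.Random) -> str:
    names = list(CORPORA)
    weights = [CORPORA[name][1] for name in names]
    source = rng.choices(names, weights=weights, k=1)[0]
    return rng.choice(CORPORA[source][0])

rng = random.Random(0)
batch = [sample_document(rng) for _ in range(8)]  # ~55% will be synthetic
```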

But what exactly is “synthetic data”?

Synthetic data is information generated by an algorithm (often another AI model), or even written by hand, rather than collected from the real world. And because it’s created in-house, engineers can (as the toy sketch after this list illustrates):

  • Guarantee perfect label accuracy;
  • Adjust for rare scenarios;
  • Avoid including personally identifiable or copyrighted material in the dataset.
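To make those bullets concrete, here’s a toy Python example: because the generator constructs both the input and the label, the label is correct by definition, and a rare case can be oversampled at will. This is purely illustrative and not drawn from any company’s pipeline.

```python
import random
from datetime import date, timedelta

def random_date(rng: random.Random) -> date:
    """Pick a date uniformly from a 30-year window."""
    start = date(2000, 1, 1)
    return start + timedelta(days=rng.randrange(0, 366 * 30))

def make_example(d: date) -> dict:
    # The "input" is a natural phrasing; the "label" is exact by construction,
    # so label accuracy is guaranteed (bullet 1), with no PII (bullet 3).
    return {"input": d.strftime("%B %d, %Y"), "label": d.isoformat()}

rng = random.Random(42)
dataset = [make_example(random_date(rng)) for _ in range(1000)]

# Oversample a rare edge case that seldom appears in organic data (bullet 2):
dataset += [make_example(date(y, 2, 29)) for y in (2004, 2008, 2012, 2016, 2020)]
```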

Apple’s own research blog gives a concrete example of its use of synthetic data. In a nutshell, the company fabricates thousands of sample emails (“Want to play tennis tomorrow at 11:30 a.m.?”) on device, compares them to real messages locally, and only sends back an anonymized signal about which synthetic samples look most relevant.

Apple's synthetic data generation pipeline.
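A rough sketch of that on-device selection step is below. The embedding comparison, the noise mechanism (a crude randomized response controlled by `flip_p`), and all the names are simplified stand-ins I’ve assumed for illustration; they are not Apple’s published differential-privacy algorithm.

```python
import math
import random

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def best_candidate(user_embs: list[list[float]],
                   candidate_embs: list[list[float]],
                   rng: random.Random,
                   flip_p: float = 0.1) -> int:
    """Return the index of the synthetic candidate that best matches the
    user's local messages, occasionally swapped for a random index so that
    no single report can be trusted (plausible deniability)."""
    scores = [max(cosine(c, u) for u in user_embs) for c in candidate_embs]
    winner = max(range(len(scores)), key=scores.__getitem__)
    if rng.random() < flip_p:
        return rng.randrange(len(candidate_embs))
    return winner

# The device would report only this index, never any message text; the
# server aggregates many noisy reports to rank the synthetic samples.
```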

For once, being late to the game is paying off

The reason so many AI behemoths are turning to synthetic data is simple: they have already gobbled up virtually all the available data in the world, and they need more.

This, in turn, has driven significant research investment into synthetic data over the last two years, along with real performance improvements in models trained on it.

In Apple’s case, this might turn out to be kind of perfect. The company was sound asleep as the entire market allegedly infringed on copyrighted material left and right. And when it finally woke up, it (mostly) stuck to its privacy convictions. At that point, synthetic data generation for AI model training was starting to take off, and Apple finally joined in.

It’s obviously not that simple, but you get the idea.

But won’t this just collapse the models?

In a word, no. In a few words, not if done correctly.

In the past, it was widely believed that the entire internet would turn into AI-generated slop trained on AI-generated slop, and that the whole thing would unavoidably be done for.

Slowly but surely, a few studies began to suggest that the partial use of carefully curated synthetic data could actually improve model performance, more so, in fact, than relying solely on raw, “organic” data. Microsoft’s Phi-4 went on to prove the point and push the idea even further.

As for Apple, training its models on synthetic data may prove a multi-fold win: it might speed up Siri’s reboot and accelerate support for more languages and regions, all while requiring fewer GPUs (which is good, because the company decided it didn’t need those for AI), since the training corpora are smaller.

Bottom line

Of course, as with any tech-related decision, this one comes with important tradeoffs. For one, gathering clean, human-curated synthetic data is slower and far more expensive than the “traditional” alternatives.

Also, while using an LLM to generate synthetic data may theoretically keep personally identifiable or copyrighted material out of the dataset, there is always the possibility of the generating model spitting out, verbatim, something that was in its own “organic” training data.
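One common mitigation for that risk (a general technique, not something the article attributes to any specific company) is to reject synthetic samples that reproduce long word sequences from the source corpus. A minimal n-gram overlap check might look like this; production pipelines would use scalable structures such as Bloom filters or suffix arrays rather than a plain Python set.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All runs of n consecutive words in the text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(corpus: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """Index every n-gram that appears anywhere in the source corpus."""
    index: set[tuple[str, ...]] = set()
    for doc in corpus:
        index |= ngrams(doc, n)
    return index

def leaks_verbatim(sample: str, index: set[tuple[str, ...]], n: int = 8) -> bool:
    """True if the sample shares any n consecutive words with the corpus."""
    return not ngrams(sample, n).isdisjoint(index)

# Usage: keep only synthetic samples that don't copy an 8-word span verbatim.
# index = build_index(training_docs)
# clean = [s for s in synthetic if not leaks_verbatim(s, index)]
```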

And finally (at least for the purposes of this piece), having humans in the loop means introducing bias, as much as they might try to avoid it.

Still, Apple’s investment in synthetic data for Apple Intelligence is good news. Well, any news of Apple investing in AI is good news. For all the leaks, reports, and (justified) finger-pointing of the last few weeks, Apple might finally be ready to turn the page and start talking about what it’ll actually do to pull itself out of the AI-shaped hole it has spent the last few years digging itself into.
