SpeakSphere at Interspeech 2025: Real-Time Speech Translation with On-Device AI

September 01, 2025

Published by Tobias Goecke (Göcke), SupraTix GmbH (updated 10 months, 2 weeks ago)

Bridging Language Barriers in Real Time

Interspeech 2025, a premier conference on speech technology, showcased cutting-edge innovations to the public at its Speech Science Festival in Rotterdam. Among the interactive demos was SpeakSphere, presented by Tobias Goecke of SpeakSphere GmbH. This demo invited attendees to speak in their own language and witness the speech being translated in real time – all from the convenience of their personal smartphones.

Uniquely, the entire system ran on local hardware, meaning no audio data was uploaded to the cloud during translation. This on-premises approach not only wowed tech enthusiasts with its immediacy, but also underscored a commitment to privacy and data security even in a live translation scenario. Such a feat aligns with the festival’s theme of making speech technology accessible, inclusive, and ethically designed, by breaking down language barriers without compromising user privacy.

The SpeakSphere demonstration offered a glimpse into truly seamless multilingual communication. Participants simply spoke into their smartphone in their native language, and within seconds heard the translated speech output in a target language of choice. The translation was voiced aloud by the system in real time, effectively creating a live conversation between speakers of different languages.

According to the official demo description, the system can handle “multiple languages” and operates entirely locally. In practice, this means an attendee could speak Dutch and, within seconds, hear their words spoken back in English (or German, French, etc.). The ability to run in real time on local devices gave the interaction a natural, conversational feel – a stark contrast to earlier translation tools that often required cloud connectivity and buffering delays. Attendees of the festival could thus engage in cross-lingual dialogue on the spot, experiencing first-hand how far speech translation technology has come in terms of speed, accuracy, and user-friendliness.

SpeakSphere is a platform for real-time multilingual voice translation that leverages advanced AI to make communication across languages effortless. At its core, SpeakSphere transcribes spoken audio, translates the content, and generates synthetic speech in the listener’s language, enabling each person to communicate in their preferred language. For example, one user can speak German while another hears the output in English, and vice versa, with transcripts optionally stored for reference.

Under the hood, the system brings together state-of-the-art components for speech recognition, machine translation, and text-to-speech synthesis. The pipeline for a typical speech-to-speech translation is as follows:

Speech Recognition (ASR) – The spoken input is first converted to text by an automatic speech recognition engine.
Machine Translation – The transcribed text is translated from the source language into the target language using a multilingual translation model (augmented by large language models for context when online, or specialized on-device models for offline use).
Speech Synthesis (TTS) – Finally, the translated text is rendered as audible speech in the target language using a synthetic voice.
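
The three stages above can be sketched as a simple chain of functions. This is a minimal illustration, not SpeakSphere's actual implementation: the class and function names are hypothetical, and the stub stages stand in for real ASR, translation, and TTS models (the "audio" here is just UTF-8 text so the example runs without any model).

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is a plain callable so a real engine could be swapped in later.
Recognizer = Callable[[bytes, str], str]       # (audio, source_lang) -> text
Translator = Callable[[str, str, str], str]    # (text, source_lang, target_lang) -> text
Synthesizer = Callable[[str, str], bytes]      # (text, target_lang) -> audio


@dataclass
class TranslationResult:
    transcript: str    # what the speaker said, in the source language
    translation: str   # the same content in the target language
    audio: bytes       # synthesized speech for the translation


@dataclass
class SpeechToSpeechPipeline:
    recognize: Recognizer
    translate: Translator
    synthesize: Synthesizer

    def run(self, audio: bytes, source_lang: str, target_lang: str) -> TranslationResult:
        # 1. ASR: audio -> text in the source language
        transcript = self.recognize(audio, source_lang)
        # 2. MT: source-language text -> target-language text
        translation = self.translate(transcript, source_lang, target_lang)
        # 3. TTS: target-language text -> audio
        speech = self.synthesize(translation, target_lang)
        return TranslationResult(transcript, translation, speech)


# --- Stub stages (hypothetical stand-ins, not real models) ---

def stub_recognize(audio: bytes, source_lang: str) -> str:
    # Pretend the audio bytes already contain the spoken words as UTF-8 text.
    return audio.decode("utf-8")


_TOY_DICTIONARY = {("de", "en"): {"guten morgen": "good morning"}}


def stub_translate(text: str, source_lang: str, target_lang: str) -> str:
    # Look the phrase up in a toy dictionary; pass it through unchanged if unknown.
    table = _TOY_DICTIONARY.get((source_lang, target_lang), {})
    return table.get(text.lower(), text)


def stub_synthesize(text: str, target_lang: str) -> bytes:
    # Tag the text with its language instead of producing real audio samples.
    return f"[{target_lang}] {text}".encode("utf-8")


pipeline = SpeechToSpeechPipeline(stub_recognize, stub_translate, stub_synthesize)
result = pipeline.run(b"Guten Morgen", "de", "en")
print(result.transcript, "->", result.translation)
```

Keeping each stage behind a plain callable means an online or an offline translation backend could be passed in without touching the rest of the chain, and the returned transcript is available if it should be stored for reference.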

SpeakSphere integrates these steps so fluidly that users experience it as a single, instantaneous process. The system supports a wide range of languages and voices – from English, Spanish and French to German, Japanese and more – each mapped to a natural-sounding voice for the output speech. This broad language support ensures that multinational teams or diverse audiences can all converse without a common language, just as the product’s tagline suggests: “dein Team, eine Sprache” – your team, one language.
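
One way to picture the language-to-voice mapping is a small lookup table with a fallback. The sketch below is purely illustrative: the voice identifiers, the default voice, and the `pick_voice` helper are assumptions made for this example and are not taken from the SpeakSphere documentation.

```python
# Hypothetical voice identifiers; a real deployment would use its own TTS voice names.
DEFAULT_VOICE = "voice-generic"

VOICES = {
    "en": "voice-en-1",
    "es": "voice-es-1",
    "fr": "voice-fr-1",
    "de": "voice-de-1",
    "ja": "voice-ja-1",
}


def pick_voice(lang_tag: str) -> str:
    """Return the voice for a language tag such as "de" or "de-AT".

    Tries the full tag first, then the base language, then the default voice.
    """
    tag = lang_tag.lower()
    if tag in VOICES:
        return VOICES[tag]
    base = tag.split("-")[0]
    return VOICES.get(base, DEFAULT_VOICE)


print(pick_voice("de-AT"))
print(pick_voice("nl"))
```

Falling back from a regional tag to its base language keeps the output voice sensible when a specific variant has no dedicated entry.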

Equally important, SpeakSphere is designed with data privacy in mind; in professional settings it offers an on-premise deployment that keeps translations secure within a company’s infrastructure. In other words, businesses can break the language barrier internally through real-time translation while ensuring sensitive information never leaves their trusted environment.

SpeakSphere’s decision to run the demo entirely on local hardware – without cloud services – highlights a significant trend in AI: the rise of on-device (offline) translation. Traditionally, real-time speech translators (like those in voice assistants or video calls) relied on cloud-based neural models to process speech and language. While cloud translation offers high accuracy and access to hundreds of languages via powerful servers, it comes with drawbacks: it requires an internet connection, introduces network latency, and raises privacy concerns since spoken data is sent to external servers.

By contrast, running the translation locally addresses these issues. On-device translation works without internet, making it reliable even in remote or high-security environments. Processing is fast and responsive because no round trip to a distant server is needed.

Most importantly, privacy is inherently protected – as one analysis notes, when using on-device translation “data never leaves the device,” eliminating the risk of eavesdropping or data misuse by third parties. SpeakSphere’s real-time demo exemplified these benefits: even in the bustling conference venue, users’ voice data stayed local, and the translations happened almost instantly.

The push toward on-device AI translation is part of a broader industry movement. Major tech companies have introduced offline translation modes in their products – for instance, Apple’s Translate app and Google Translate allow users to download language packs for offline use, and Samsung’s recent smartphones include a “Live Translate” feature that supports more than a dozen languages fully on-device. These developments show that edge AI (running AI on local devices) has advanced enough to handle complex tasks like speech recognition and neural machine translation on a phone’s chipset.

SpeakSphere builds on this progress, combining it with its own expertise in speech technology. By ensuring “no third-party API is used” in the translation pipeline and no conversation data is uploaded, the system offers a level of data security and compliance attractive to enterprises and researchers alike. For applications such as confidential business meetings, healthcare consultations, or government settings, this privacy-first approach to speech translation is a crucial feature rather than a mere technical novelty.

The SpeakSphere demo at Interspeech 2025’s festival was more than just a live showcase – it was a proof-of-concept of how far speech technology has come in breaking down the Tower of Babel. In real time and on readily available hardware, people conversed across languages with minimal friction.

This capability has profound implications. It points toward a future where language is no longer a barrier in global collaboration or everyday interactions.

A doctor could speak in Italian to a patient who hears it in Arabic, a tourist in a remote area could get by without internet access, and international teams could work together seamlessly, each using their native tongue. All this can happen while keeping conversations private and secure by design.

Feedback from Interspeech attendees and researchers who tried the demo was enthusiastic – many noted the naturalness of the translated voice and the negligible lag between speaking and hearing the translation. This aligns with ongoing advances in speech tech, where tools like OpenAI’s Whisper are making speech-to-text more accurate and laying the groundwork for smoother bilingual voice interactions. By integrating such innovations, SpeakSphere is at the forefront of delivering an intuitive user experience that feels almost magical: you speak one language, and out comes another, as if a skilled interpreter were whispering in your ear.

In summary, SpeakSphere’s on-device real-time translation embodies the convergence of performance, privacy, and practicality in speech technology. The Interspeech 2025 demo illustrated that multilingual communication can be both instantaneous and secure, without compromise. For tech enthusiasts, it was a thrilling hands-on experience; for researchers, a validation of cutting-edge models running in real-world conditions; and for potential customers, a preview of a solution that could revolutionize communication in their organizations. As we move forward, demonstrations like these highlight a broader vision in the AI community – one where technology truly unites humanity through language, enabling everyone to speak freely and be understood, no matter what language they speak.

Official Interspeech 2025 Speech Science Festival program (demo description and context)
SpeakSphere project documentation (technical details)
FutureSax innovation profile (on-premise, data-secure design for workplace use)
Here and Now AI articles (real-time translation strategies and on-device translation adoption)