This article is part of our exclusive IEEE Journal Watch series in partnership with IEEE Xplore.
With the rise of AI-generated content online, it's becoming more difficult, and more important, to help the public identify whether an image, audio clip, or video is real or fake. To combat the problem, a team of researchers from Microsoft; Northwestern University in Evanston, Ill.; and Witness, a nonprofit organization that assists activists and journalists in addressing the challenges of AI-generated content, has created a novel dataset of AI-generated media to help build more robust detection systems.
The researchers describe their new dataset, called the Microsoft-Northwestern-Witness (MNW) deepfake detection benchmark, in a study published 10 April in IEEE Intelligent Systems. The dataset was intentionally built from diverse samples of AI-generated media to reflect the current AI-generation landscape as closely as possible.
Thomas Roca is a principal research scientist at Microsoft who researches security around generative AI. He says that the quality of media produced by generative AI is constantly improving, and virtually anyone can now use something as simple as an app on their phone to generate a voice message reproducing a person’s voice, or an image or video mimicking someone’s appearance.
The harm of such fake media can be profound, ranging from identity fraud and scams to the generation of non-consensual intimate imagery and even child sexual abuse material.
But AI generators are not perfect. When they produce video, imagery, or audio, they leave behind artifacts: tiny signals or traces that can confirm the media is fake. "Artifacts can include noise distributions, inconsistencies between pixel patches, gaps in audio signals, and other irregularities," says Roca.
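The study doesn't detail how such artifacts are measured, but a toy Python sketch conveys the flavor of one crude check: comparing high-frequency noise levels across patches of an image, since an unusually uneven spread can hint at synthesis. Real detectors learn far subtler cues, and the file name sample.jpg here is just a placeholder.

```python
import numpy as np
from PIL import Image
from scipy.ndimage import median_filter

def patch_noise_stats(image_path, patch=64):
    """Estimate per-patch noise from the high-frequency residual.

    A toy heuristic: generated images sometimes show noise statistics
    that vary oddly between patches, so a large spread across patches
    is one (very rough) signal worth inspecting.
    """
    gray = np.asarray(Image.open(image_path).convert("L"), dtype=np.float32)
    residual = gray - median_filter(gray, size=3)  # keep high-frequency component
    h, w = residual.shape
    stds = np.array([
        residual[y:y + patch, x:x + patch].std()
        for y in range(0, h - patch + 1, patch)
        for x in range(0, w - patch + 1, patch)
    ])
    return stds.mean(), stds.std()

mean_noise, spread = patch_noise_stats("sample.jpg")  # placeholder file
print(f"mean patch noise: {mean_noise:.2f}, spread across patches: {spread:.2f}")
```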
Improving Deepfake Detection Systems
Research groups around the world have been creating detectors: essentially, AI models trained to identify artifacts in AI-generated media. The result has been an arms race between detectors and generators, and unfortunately the generators remain in the lead.
“Asserting the authenticity of video, images, and audio has become crucial for society, but detection systems are not yet up to the challenge,” says Roca. “We believe this is partly due to how these systems are evaluated.”
For example, researchers may train their detector on many examples of AI content drawn from only a small handful of generators. That is likely to produce a detector that does not generalize well to new content, and because generative AI is evolving so fast, this quickly becomes a real problem.
As a result, these detection systems can perform well when tested against their training dataset or well-established benchmarks, but then perform poorly in the real world. "AI in the lab is not AI in the wild," Roca says.
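One common way to expose that gap is a leave-one-generator-out evaluation: train the detector with one generator's output excluded entirely, then test only on it. The sketch below illustrates the protocol, not any detector from the study; the features are synthetic stand-ins, and the generator names (gen_a, gen_b, gen_c) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Toy stand-in for learned features: each generator's fakes cluster
# differently in feature space; real media forms its own cluster.
fakes = {g: rng.normal(loc=i, scale=1.0, size=(200, 16))
         for i, g in enumerate(["gen_a", "gen_b", "gen_c"])}
reals = rng.normal(loc=-2.0, scale=1.0, size=(600, 16))

# Leave one generator out of training entirely, then test on it alone.
held_out = "gen_c"
X_train = np.vstack([fakes[g] for g in fakes if g != held_out] + [reals[:300]])
y_train = np.array([1] * 400 + [0] * 300)   # 1 = fake, 0 = real
X_test = np.vstack([fakes[held_out], reals[300:]])
y_test = np.array([1] * 200 + [0] * 300)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy on unseen generator:", accuracy_score(y_test, clf.predict(X_test)))
```

A detector that scores well on generators it has seen but drops sharply on the held-out one is exactly the "lab versus wild" failure Roca describes.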
These AI-generated images are part of the Microsoft-Northwestern-Witness benchmark, which aims to provide a wider variety of AI media for testing detectors. Thomas Roca, Marco Postiglione, et al.
To get a more well-rounded view of the challenges, experts from Microsoft, Northwestern, and Witness worked together on the new MNW benchmark. "Together, these perspectives—academia, industry, and field-oriented non-profit—create a more complete approach. None of us could achieve this alone," says Marco Postiglione, a postdoctoral researcher at Northwestern University.
The new dataset aims to include a very diverse sample of AI-generated material from different generators to boost detectors’ applicability in real-world settings.
Postiglione says that fake videos, audio, and images online have often undergone post-processing procedures, such as resizing, cropping, and compressing. People may also intentionally manipulate content to make it harder to detect.
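A benchmark that wants to reflect this can run its samples through the same kinds of transformations before including them. The following Pillow sketch shows one plausible pipeline of that sort; it is an illustration under assumed file names, not the MNW team's actual procedure.

```python
import io
from PIL import Image

def post_process(img: Image.Image) -> Image.Image:
    """Degrade an image the way online sharing often does:
    downscale, crop the borders, and re-compress as JPEG."""
    img = img.convert("RGB")                                    # JPEG has no alpha
    img = img.resize((img.width // 2, img.height // 2))         # downscale
    img = img.crop((10, 10, img.width - 10, img.height - 10))   # trim borders
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=60)                    # lossy compression
    buf.seek(0)
    return Image.open(buf)

# Placeholder file names for illustration.
processed = post_process(Image.open("generated_sample.png"))
processed.save("generated_sample_processed.jpg")
```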
The MNW team hopes to provide the most comprehensive set of examples possible from different generators and subjected to different post-processing manipulations, to ensure that the dataset is a good representation of the current generative AI landscape. The team will also update the dataset every spring and fall, to reflect the latest generator artifacts as well as tricks used to fool detection systems.
The researchers acknowledge that while the dataset was created to help developers benchmark their detectors, there's always the chance it could be used to try to develop new ways to evade detection. But they see the need to address deepfake content as critical in spite of that risk.
“Our goal with MNW is to contribute to that shared effort—raising standards, encouraging transparency, and helping ensure that as generative AI advances, our ability to assess authenticity keeps pace,” says Roca.