
Microsoft Trained MAI Models on Unlicensed Web Data

Microsoft trained its MAI models on unlicensed web data despite claiming to use only enterprise-grade, clean data, according to a technical paper.


Microsoft trained its MAI models on unlicensed web data, contradicting its earlier claims that it used only enterprise-grade, clean data. According to the company's technical paper, the training data included Common Crawl and other sources, as Simon Willison pointed out. That puts Microsoft in line with many AI firms, which train on publicly available content even as they market their data as especially clean.

Microsoft's data pipeline for AI training includes a proprietary crawler that respects the Robots Exclusion Protocol and related meta-tag and HTML controls, which let site owners manage how content on their sites is accessed and used. That opt-out design puts the burden of protecting content on site owners, much like assuming that anyone who doesn't lock their door consents to a break-in. Microsoft nonetheless continues to market its training data as especially 'clean,' which raises questions about the accuracy of that claim.
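For site owners wondering what that opt-out looks like in practice, here is a minimal sketch of how a crawler that honors the Robots Exclusion Protocol might check a site's rules before fetching a page. It uses Python's standard `urllib.robotparser` module. The user-agent name `ExampleAIBot`, the sample rules, and the URLs are hypothetical placeholders; the article does not name Microsoft's crawler, and this is not a description of its actual implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish.
# "ExampleAIBot" is a made-up crawler name used only for illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# The AI crawler is blocked from the whole site.
print(allowed(parser, "ExampleAIBot", "https://example.com/articles/1"))
# Other crawlers fall back to the wildcard rules.
print(allowed(parser, "SomeOtherBot", "https://example.com/articles/1"))
print(allowed(parser, "SomeOtherBot", "https://example.com/private/x"))
```

The sketch illustrates the point made above: nothing is blocked unless the site owner has published a rule for that specific crawler or a wildcard rule that covers it.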

Like other AI companies that scrape the web, Microsoft also draws on the open internet for training data. The company says it relies on fair use, a contested area of law whose boundaries courts are still sorting out. The technical paper describes the training data as a 'mixture of publicly available and licensed human-generated data.'

Source: The Decoder

Key points

Microsoft trained its MAI models on unlicensed web data despite claiming to use only enterprise-grade, clean data.
Microsoft used Common Crawl and other sources for training its MAI models, as noted by Simon Willison.
Microsoft’s data pipeline includes a proprietary crawler that respects the Robots Exclusion Protocol and related meta-tag and HTML controls.
The technical paper describes the data as a 'mixture of publicly available and licensed human-generated data.'
Microsoft’s actions reflect a common industry practice of using unlicensed web data for AI training.


WRITTEN BY

Priya Anand

Emerging AI & Applications

Priya covers emerging AI applications and the wider impact of AI across industries.
