Microsoft trained its MAI models on unlicensed web data, contradicting its earlier claims that it used only enterprise-grade, clean data. A technical paper shows that the company drew on Common Crawl and other sources, as Simon Willison noted. This is how many AI firms operate: they train on publicly available content while pitching their data as especially clean.
Microsoft's training-data pipeline includes a proprietary crawler that respects the Robots Exclusion Protocol and related meta-tag and HTML controls, which let site owners manage how content on their sites is accessed and used. This opt-out model puts the burden of protecting content on site owners, much like assuming that anyone who leaves a door unlocked consents to a break-in. Microsoft nevertheless continues to market its training data as especially 'clean,' which raises questions about the accuracy of that claim.
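For readers unfamiliar with the Robots Exclusion Protocol, the following is a minimal sketch of the kind of check a compliant crawler performs before fetching a page. It uses Python's standard `urllib.robotparser` module. The robots.txt rules, the `ExampleAIBot` user agent, and the `is_allowed` helper are hypothetical illustrations; nothing here describes Microsoft's actual crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site owner might publish.
# "ExampleAIBot" is an invented crawler name, used only for illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would download them first.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The site owner has opted this specific bot out of the whole site.
    print(is_allowed("ExampleAIBot", "https://example.com/article"))
    # Other crawlers are only kept out of /private/.
    print(is_allowed("OtherBot", "https://example.com/article"))
    print(is_allowed("OtherBot", "https://example.com/private/notes"))
```

The sketch shows why the model is opt-out: a crawler is blocked only if the site owner has written a rule that covers it, and any page without such a rule is treated as fair game.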
Like other AI companies that scrape the web, Microsoft also taps the open internet for training data. The company says it relies on fair use, but that remains contested legal territory, and courts are still working out its boundaries. The technical paper describes the data as a 'mixture of publicly available and licensed human-generated data,' a phrase that makes clear licensed material is only part of the mix.
Source: thedecoder