A man peers through a glass partition, seeking transparency.
The Open Source Initiative (OSI) recently released an updated draft definition of “open source AI,” aiming to clarify the term's ambiguous use in a rapidly changing field. The move comes as some companies, such as Meta, apply the “open source” label to trained AI language models whose weights and code they release with usage restrictions. This has sparked a fierce debate among free software advocates about what truly constitutes “open source” in the context of AI.
For example, Meta's Llama 3 model is freely available, but it does not meet traditional open source standards as defined by the OSI for software because its license restricts how the model can be used depending on the size of the company deploying it and the type of content created with it. The AI image generator Flux is another “open” model that is not truly open source. Because of this ambiguity, AI models released with only partial code or weights, or without accompanying training data, are typically described with alternative terms such as “open weights” or “source available.”
To formally address the issue, OSI, a well-known advocate of open software standards, convened a group of about 70 participants, including researchers, lawyers, policymakers, and activists. Representatives from major technology companies, including Meta, Google, and Amazon, also joined the effort. The group's current draft definition of open source AI (version 0.0.9) highlights “four fundamental freedoms” reminiscent of those in the free software definition: the freedom to use the AI system for any purpose without asking permission, to study how it works, to modify it for any purpose, and to share it with or without modifications.
By establishing a clear standard for open source AI, the organization hopes to provide a benchmark for evaluating AI systems, helping developers, researchers, and users make more informed decisions about the AI tools they create, study, and use.
Truly open source AI also makes it easier to uncover potential software vulnerabilities in AI systems, because researchers can see how the models work behind the scenes. Contrast this approach with an opaque system such as OpenAI's ChatGPT: it's not just a GPT-4o large language model with a fancy interface; it's a proprietary system of interlocking models and filters whose exact architecture is a closely guarded secret.
According to OSI's project timeline, a stable version of the “open source AI” definition is expected to be announced at the All Things Open 2024 event in Raleigh, North Carolina, in October.
“Innovation without permission”
In a May press release, OSI emphasized the importance of defining what open source AI really means. “AI is different from regular software and forces all stakeholders to review how open source principles apply to this space,” said Stefano Maffulli, executive director of OSI. “OSI believes that everyone deserves to maintain agency and control of the technology. We also recognize that markets flourish when clear definitions promote transparency, collaboration, and permissionless innovation.”
The organization's latest draft definition goes beyond just the AI model and its weights to encompass the entire system and its components.
For an AI system to qualify as open source, it must provide access to what OSI calls the “preferred form for making modifications.” This includes detailed information about the training data, the full source code used to train and run the system, and the model weights and parameters. All of these elements must be available under an OSI-approved license or terms.
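To make that checklist concrete, here is a minimal sketch of the draft's requirements as a data structure. It is purely illustrative: the OpenSourceAIRelease class, its field names, and the meets_draft_requirements check are our own invention for this article, not an official OSI schema.

```python
# Hypothetical sketch of the components OSI's draft (v0.0.9) says an open
# source AI system must make available. Names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class OpenSourceAIRelease:
    data_information: str  # metadata describing training data and methods
    training_code: str     # full source code used to train the system
    inference_code: str    # full source code used to run the system
    weights: str           # model weights and parameters
    licenses: dict[str, str] = field(default_factory=dict)  # component -> license


def meets_draft_requirements(release: OpenSourceAIRelease,
                             approved_licenses: set[str]) -> bool:
    """Return True only if every component is present and under approved terms."""
    components = [release.data_information, release.training_code,
                  release.inference_code, release.weights]
    return (all(components)
            and bool(release.licenses)
            and all(lic in approved_licenses
                    for lic in release.licenses.values()))
```

Under this reading, a release that ships weights without training code, or under a license outside the approved set, fails the check, which is why “open weights” models would not qualify.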
Notably, the draft does not mandate the publication of raw training data, but instead calls for “data information” – detailed metadata about the training data and methods, including information about data sources, selection criteria, pre-processing techniques, and other relevant details that would allow a skilled person to reproduce a similar system.
The “data information” approach aims to provide transparency and reproducibility without requiring publication of the actual datasets, ostensibly addressing privacy and copyright concerns while upholding open source principles, though that compromise remains one of the draft's more contested points.
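What qualifying “data information” might look like in practice is still open to interpretation. As a purely hypothetical illustration (the structure and every field name below are ours, not OSI's), such a record might resemble the following:

```python
# Hypothetical "data information" record accompanying a model release.
# The OSI draft prescribes no format; keys and values are illustrative.
data_information = {
    "sources": [
        {"name": "public web crawl", "notes": "English-language pages only"},
        {"name": "permissively licensed code repositories"},
    ],
    "selection_criteria": "documents scoring above a quality-classifier threshold",
    "preprocessing": ["deduplication", "PII removal", "BPE tokenization"],
    "stated_goal": ("enough detail for a skilled person to assemble a "
                    "substantially similar training set"),
}
```

The draft's yardstick is the last field: the description counts only if a skilled person could use it to reproduce a similar system.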
“The most interesting thing about [this definition] is that it doesn't require the publication of training data,” independent AI researcher Simon Willison said in a brief interview with Ars about the OSI proposal. “It's an extremely pragmatic approach. If it did, there would be very few capable 'open source' models.”