Suddenly, "open source" has become the latest buzzword in the field of artificial intelligence. Meta has committed to creating open-source artificial general intelligence, and Elon Musk is suing OpenAI for "no longer caring" about open-source artificial intelligence models.
At the same time, more and more technology leaders and companies are positioning themselves as "open-source role models."
However, there is a fundamental problem: no one has reached a consensus on what "open-source artificial intelligence" actually means.
On the surface, open-source artificial intelligence promises a future where anyone can participate in the development of technology.
This could accelerate innovation, improve transparency, and give users greater control over artificial intelligence systems that may soon reshape every aspect of our lives.
But what exactly is open source? What characteristics make an artificial intelligence model open source, and what characteristics disqualify it?
The answers may have a significant impact on the future of this technology. Until the tech industry settles on a recognized definition, powerful companies can easily bend the concept to meet their own needs, and it may become a tool for consolidating the dominant position of today's leading companies.
The debate was sparked by the Open Source Initiative (OSI), a non-profit organization dedicated to promoting and protecting open source.
The organization was founded in 1998 and is the recognized steward of the definition of open source in the software field. Many of the rules and definitions it has established have been widely accepted by developers around the world as the test of whether a piece of software can be considered open source.
Now, the organization has convened more than 70 researchers, lawyers, policymakers, activists, and representatives from large technology companies such as Meta, Google, and Amazon to try to propose a clear definition of open-source artificial intelligence.
However, the open-source community is made up of a diverse array of individuals and organizations, ranging from hacktivists (those who carry out cyber-attacks for political purposes) to Fortune 500 companies.
Stefano Maffulli, the executive director of the Open Source Initiative, stated that while participants have reached a broad consensus on general principles, it is becoming increasingly apparent that many details are difficult to finalize.
With so many competing interest groups involved, finding a solution that satisfies everyone while ensuring the participation of the largest companies is by no means an easy task.
Standards Ambiguity
The absence of a fixed definition has not prevented technology companies from adopting the term "open source."
In July last year, Meta made its so-called "open source" Llama 2 model available to the public for free, and it has a history of publicly releasing artificial intelligence technology.
Jonathan Torres, Meta's Deputy General Counsel for Artificial Intelligence, Open Source, and Licensing, told us: "We support the efforts of the Open Source Initiative to define open source artificial intelligence and look forward to continuing to participate in their leadership process to benefit the open source community around the world."
This is in stark contrast to its rival OpenAI, which has become increasingly reluctant to share technical details of its advanced models in recent years, citing security concerns.
An OpenAI spokesperson said: "We can only open source powerful artificial intelligence models after carefully weighing the benefits and risks (including misuse and acceleration)."
Other leading artificial intelligence companies, such as Stability AI and Aleph Alpha, have also released what they call open-source models. Hugging Face hosts a vast library of free artificial intelligence models.
Although Google has taken a more closed approach to its most powerful models, such as Gemini and PaLM 2, the Gemma model released last month is freely accessible and is designed to compete with Meta's Llama 2. However, the company describes it as "open" rather than "open-source."
But there is considerable disagreement over whether all of these models can truly be described as open source. First, the licenses for Llama 2 and Gemma restrict the ways in which users can use these models.
This fundamentally contravenes the traditional principles of open-source: a key tenet of the open-source definition is the prohibition of imposing any restrictions based on use cases.
Even for models that do not have these conditions, the standards are ambiguous. The concept of open source is intended to ensure that developers can use, study, modify, and share software without restrictions. But the way artificial intelligence works is fundamentally different, Maffulli said, and many core concepts do not transfer well from software to artificial intelligence.
One of the biggest obstacles is that artificial intelligence models contain too many "components." Maffulli said that for software, you only need to modify its underlying source code.
But depending on the goal, modifying an artificial intelligence model may require access to the trained model, its training data, the code for preprocessing this data, the code for managing the training process, the underlying architecture of the model, and many more minor details.
Which "components" people need in order to study and modify models meaningfully is still open to discussion. Maffulli said: "We have identified the basic freedoms or basic rights we hope to exercise. But the mechanism for exercising these rights is still unclear."
Maffulli said that if the artificial intelligence community wants to benefit from open source as much as software developers have, resolving this debate will be crucial.
He said: "Having a definition that is respected and adopted by most of the industry can convey a clear and unambiguous message. With clarity, compliance costs are lower, friction is less, and people's understanding of the same thing is the same."
So far, the biggest sticking point has been data. All major artificial intelligence companies have only released pre-trained models without disclosing their training datasets.
Maffulli said that for those who want a stricter definition of open source artificial intelligence, withholding the data severely limits efforts to modify and study the models, so such models cannot be considered open source.
Maffulli said that others believe a simple description of the data is usually sufficient to understand a model, and that the adjustments you need to make do not necessarily require retraining from scratch.
Pre-trained models are typically fine-tuned, a process in which they undergo partial retraining on a smaller, usually application-specific dataset.
Roman Shaposhnik, CEO of open-source AI company Ainekko and Vice President of Legal Affairs at the Apache Software Foundation, says that Meta's Llama 2 is a great example of this.
Although Meta has only released a pre-trained model, a thriving developer community has been downloading and using it, sharing the modifications they have made.
He said, "People are using it in various projects and have built a complete ecosystem around it. So, we had to give it a name. Is it semi-open?"
Zuzanna Warso, Research Director at the non-profit organization Open Future, which is taking part in the Open Source Initiative's discussions, says that while it is technically possible to modify the model without the original training data, restricting access to key "components" does not truly align with the spirit of open source.
Whether people can truly exercise the freedom to study models when they do not know what information the models were trained on is also controversial.
"This is a key part of the whole process," she said. "If we care about openness, we should also care about the openness of data."
Having one's cake and eating it too.
It is worth understanding why companies that call their models "open source" are unwilling to hand over training data. Warso said that obtaining high-quality training data is a major bottleneck in artificial intelligence research and a competitive advantage that large companies want to keep in their hands.
At the same time, open source has brought many benefits, and companies hope to see these benefits extended to the field of artificial intelligence.
Warso said that, on the surface, the term "open source" has a positive connotation for many people, making it easy to win public favor through what is known as "open washing."
This can also have a significant impact on a company's bottom line. Economists at Harvard Business School in the United States recently found that open source software allows companies to build products on top of high-quality free software, rather than writing from scratch themselves, saving these companies nearly $9 trillion in development costs.
Warso said that for large companies, making their software open source, so that it can be reused and modified by other developers, helps to build a strong ecosystem around their products.
A typical example is Google's open sourcing of its Android mobile operating system, which solidified its dominant position in the smartphone revolution.
Meta's Mark Zuckerberg clearly stated this motive during an earnings call, saying, "Open source software tends to become the industry standard, and when companies build on our technology stack in a standardized way, it becomes easier to integrate new innovations into our products."
Crucially, Warso pointed out, open source artificial intelligence also seems to be treated favorably in some regulatory areas.
She noted that the European Union's newly passed Artificial Intelligence Act exempts certain open source projects from some of its stricter requirements.
Warso said that, overall, sharing pre-trained models while restricting access to the data needed to build the models makes sense from a business perspective.
But she added that it does seem a bit like companies want to have their cake and eat it too. If this strategy helps to consolidate the dominant position that large technology companies already occupy, it is difficult to see how it aligns with the fundamental concept of open source.
Warso said: "We believe that openness is one of the tools to challenge the concentration of power. If this definition helps to challenge the issue of power concentration, then the data issue becomes even more important."
Shaposhnik believes that compromise is possible. Much of the vast amount of data used to train the largest models comes from open repositories such as Wikipedia and Common Crawl, the latter of which scrapes data from the web and shares it for free.
He said that companies could simply share the open resources used to train their models, making it possible for people to recreate a similar dataset and thereby better study and understand the model.
Aviya Skowron, the policy and ethics officer of EleutherAI, a non-profit artificial intelligence research organization, is also involved in the discussions led by the Open Source Initiative.
Skowron said that it is unclear whether art or writing scraped from the internet for use as training data infringes on the rights of creators, a question that could become legally very complex. This makes developers cautious about making their data public.
Stefano Zacchiroli, a professor of computer science at the Institut Polytechnique de Paris in France, has also contributed to the definition being developed by the Open Source Initiative.
He understands the need for pragmatism. His personal view is that a complete description of a model's training data is the minimum requirement for it to count as open source, but he also recognizes that a stricter definition of open source artificial intelligence may not have broad appeal.
Zacchiroli said that ultimately, the community needs to decide what it wants to achieve: "Do you just want to go with the flow, let the market develop naturally, and eventually see that companies will not fundamentally recognize the term 'open source artificial intelligence'? Or do you want to strive to push the market to be more open, providing users with more freedom?"
What does open source mean?
Sarah Myers West, Co-Executive Director of the AI Now Institute, stated that no matter how open-source artificial intelligence is ultimately defined, the extent to which it can create a fair competitive environment remains a controversial topic.
She co-authored a paper published in August 2023, which revealed that many open-source artificial intelligence projects lack openness.
However, it also emphasized that no matter how open the model is, the large amount of data and computing power required to train cutting-edge artificial intelligence will bring deeper structural barriers to smaller participants.
Myers West believes that there is also a lack of clarity about what people hope to achieve through open-source artificial intelligence.
She asked, "Is it safety? Is it the ability to conduct academic research? Is it trying to promote more competition? We need to understand more accurately what the goal is, and how opening a system can change the pursuit of the goal."
The Open Source Initiative appears keen to avoid these conversations. The draft definition mentions autonomy and transparency as key advantages, but when asked to explain why the organization places such a high value on these concepts, Maffulli was unwilling to answer.
The document also contains a "scope exclusion" section, explicitly stating that the definition will not involve issues of "ethical, trustworthy, or responsible" artificial intelligence.
Maffulli said that historically, the open-source community has focused on achieving seamless sharing of software and avoiding getting caught up in debates about what the software should be used for. "It's not our job," he said.
But Warso said that no matter how hard people have tried for decades, these issues cannot be ignored. She added that the idea that "technology is neutral and ethics and other topics are beyond the scope of discussion" is a fantasy.
She suspects that this is a fairy tale that has to be maintained to prevent the already loose alliance of the open-source community from breaking apart. Warso said: "I think people realize that this is not true (a fairy tale), but we need this consensus to move forward."
Beyond the Open Source Initiative, others have adopted different approaches. In 2022, a group of researchers introduced the Responsible AI Licenses (RAIL), which are similar to open-source licenses but include terms that can restrict specific use cases.
Danish Contractor, an AI researcher who co-created the license, said that the goal is to allow developers to prevent their work from being used for purposes they consider inappropriate or unethical.
He said, "As a researcher, I hate it when my work is used in harmful ways." He is not the only one; he and his colleagues recently conducted an analysis on the model hosting platform of AI startup Hugging Face and found that 28% of the models used the RAIL license.
Google has also adopted a similar approach in the license attached to its Gemma model. The company stated in a recent blog post that its terms of use list various prohibited use cases considered "harmful," reflecting its commitment to "responsible AI development."
The Allen Institute for AI has also formulated its own open licensing policy, with its ImpACT license restricting redistribution based on the potential risks of the model and data.
Luis Villa, co-founder and legal head of open source software management company Tidelift, stated that given the differences between artificial intelligence and traditional software, a certain degree of experimentation with varying levels of openness is inevitable and may even benefit the field.
However, he is concerned that the increasing number of incompatible "open" licenses may undermine the cooperation that has made open source so successful, slow down innovation in artificial intelligence, reduce transparency, and make it more difficult for smaller participants to innovate based on each other's work.
Ultimately, Villa believes that the entire community needs to unite around a standard, otherwise industry participants will ignore it and decide the meaning of "open" for themselves.
However, he does not envy the work of the Open Source Initiative. When it proposed the definition of open source software, it had a lot of time and faced little outside scrutiny. Today, artificial intelligence has become the focus of attention for large enterprises and regulatory agencies.
But if the open source community cannot quickly settle on a definition, others will come up with one that suits their own needs.
Villa said: "They will fill this gap. Mark Zuckerberg will tell us what he thinks 'open' means, and his voice is loud, which will be heard by many people."