The Linux Foundation on Monday introduced the Community Data License Agreement, a new framework for sharing large sets of data required for research, collaborative learning and other purposes.
CDLAs will allow both individuals and groups to share data sets in the same way they share open source software code, the foundation said.
"As systems require data to learn and evolve, no one organization can build, maintain and source all data required," noted Mike Dolan, VP of strategic programs at The Linux Foundation.
"Data communities are forming around artificial intelligence and machine learning use cases, autonomous systems, and connected civil infrastructure," he told LinuxInsider. "The CDLA license agreements enable sharing data openly, embodying best practices learned over decades of sharing source code."
The agreement could help foster an increase in data sharing across a variety of industries, supporting collaboration in climate modeling, automotive safety, energy consumption, building permit processes, water use management and other functions.
Uniform Guidelines
The agreement calls for two main sets of licenses, which are designed to help data contributors and consumers work with a uniform set of guidelines that clarify the rules of the road and mitigate risks.
The Sharing license encourages contributions of data to the community. The Permissive license does not require any additional sharing of data.
Among the commercial and creative implications of the licenses:
- Data producers can be more specific regarding what recipients can do with data. Data producers can choose between the Sharing and Permissive licenses, depending on which model better aligns with their needs. Either type of license gives them greater clarity about the terms of the agreement, and provides greater protection with respect to liability and warranties.
- The licenses allow communities to share data on equal terms that balance the needs of data users and producers. Data communities can add their own rules and requirements for curating data, particularly involving personally identifiable information.
- A data user looking for information to train an artificial intelligence system, or for another use, will have access to data shared under a known license model with clearly spelled-out terms.
The agreements are agnostic with regard to data privacy, and it will be up to publishers and curators of data to create their own governance structure, taking into account applicable laws.
Higher Learning
The agreement comes at a time when technologies like machine learning and artificial intelligence are capable of analyzing data sets in ways that previously were not possible. The licensing agreements provide a framework to make data repositories uniform enough to allow accurate and replicable analysis.
"The critical issues for deep learning are verification and transparency -- and is the training replicable?" said Paul Teich, principal analyst at Tirias Research.
Organizations often share data so that other groups can try to replicate their results, he told LinuxInsider. In addition, organizations might publish data sets speculatively for other groups to process -- and potentially pick a vendor for advanced analytics, depending on how well different algorithms worked on a particular data set.
"The new Community Data License from The Linux Foundation reflects the growing importance of information as a resource for big data analytics, machine learning and artificial intelligence," said Charles King, principal analyst at Pund-IT.
"In essence, data provides the fuel required for processes, including 'teaching' systems to accurately perform complex functions and analyze ongoing occurrences," he told LinuxInsider.
Rising Demand
Interest in data sets has surged in recent years, noted Mark Radcliffe, global chair of the FOSS Global Practice Group at DLA Piper.
For example, connected cars can provide a wealth of data, including GPS, miles per hour and music playlist information, he told LinuxInsider. Internet of Things devices could provide information like boiler temperatures, or wind speeds from wind farms.
CDLAs will encourage a more uniform process for sharing such data.
"These license agreements [could be] very, very helpful," Radcliffe said, "because in many cases people are doing this on an ad hoc basis."
The legal protection available for data is very fragmented and very uncertain, he pointed out. "It's not an area that has had [much] case law involved. In many cases you have a very uncertain background in which to work."
The Open Transport Partnership, which is backed by the World Bank, has been working since 2016 to collect GPS streams in order to research traffic congestion, particularly during peak commute times.
The partnership launched an effort last year with a number of organizations, including the World Resources Institute, the National Association of City Transportation Officials, ride-sharing firms like Grab and Easy Taxi, open mapping firm Mapzen, and data platforms like MDrive.
The World Bank collaborated with Grab, with backing from the Korea Green Growth Trust Fund, to use anonymized GPS data from 500,000 Grab drivers to map out peak congestion times in Manila. The program was scheduled to expand to other countries like Brazil, Malaysia and Colombia.