A Large-scale Dataset of (Open Source) License Text Variants

Stefano Zacchiroli

doi:10.1145/3524842.3528491

Conference paper, Year: 2022

Stefano Zacchiroli

Role: Author
PersonId: 15184
IdHAL: stefano-zacchiroli
ORCID: 0000-0002-4576-136X
IdRef: 176946942

Institut Polytechnique de Paris

Autonomic and Critical Embedded Systems

Département Informatique et Réseaux

Abstract

We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it, we collected from the Software Heritage archive (the largest publicly available archive of FOSS source code with accompanying development history) all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, train automated license classifiers, perform natural language processing (NLP) analyses of legal texts, and carry out historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files is also provided, making the dataset ready to use in various contexts; it includes file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), and the oldest public commit in which the license appeared. The dataset is released as open data in the form of an archive file containing all deduplicated license files, plus several portable CSV files for metadata, referencing files via cryptographic checksums.
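
The idea of deduplicating files by content and linking them to metadata through checksums can be illustrated with a minimal sketch. It is not the dataset's actual tooling or layout: the abstract does not name the hash algorithm or the CSV schema, so the use of SHA-1 and the column names `sha1`, `mime_type`, and `spdx` are assumptions made purely for illustration, as are the toy file paths and contents.

```python
import csv
import hashlib
import io


def sha1_hex(content: bytes) -> str:
    """Return the hex SHA-1 digest of a file's raw content (assumed hash)."""
    return hashlib.sha1(content).hexdigest()


def deduplicate(files: dict[str, bytes]) -> dict[str, bytes]:
    """Map checksum -> content, keeping one copy per distinct content."""
    unique: dict[str, bytes] = {}
    for content in files.values():
        unique.setdefault(sha1_hex(content), content)
    return unique


def load_metadata(csv_text: str) -> dict[str, dict[str, str]]:
    """Index metadata rows by the hypothetical 'sha1' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["sha1"]: row for row in reader}


# Toy inputs: two paths share identical content and collapse into one entry.
files = {
    "repo-a/LICENSE": b"MIT License\n",
    "repo-b/COPYING": b"MIT License\n",
    "repo-c/LICENSE": b"Apache License 2.0\n",
}
unique = deduplicate(files)

# Hypothetical metadata table keyed by checksum (column names are assumptions).
csv_text = "sha1,mime_type,spdx\n" + "".join(
    f"{digest},text/plain,unknown\n" for digest in unique
)
metadata = load_metadata(csv_text)

# Join each unique file to its metadata row through the checksum.
for digest, content in unique.items():
    print(digest[:12], len(content), metadata[digest]["mime_type"])
```

Keying both the stored files and the metadata rows by a content checksum means the same license text found in many repositories is kept once, and any metadata table can be joined to it without relying on file paths.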

Keywords

dataset, open source, software license, copyright, intellectual property, software engineering, natural language processing

Domains

Software engineering [cs.SE]

Main file

main.pdf (537.48 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-03624198

Submitted on: Thursday, March 31, 2022, 14:49:10

Last modified on: Monday, October 9, 2023, 12:49:40

Dates and versions

hal-03624198, version 1 (31-03-2022)

Identifiers

HAL Id: hal-03624198, version 1
arXiv: 2204.00256
DOI: 10.1145/3524842.3528491

Cite

Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. The 2022 Mining Software Repositories Conference, May 2022, Pittsburgh, Pennsylvania, United States. ⟨10.1145/3524842.3528491⟩. ⟨hal-03624198⟩

Collections

INSTITUT-TELECOM LTCI INFRES ACES IP_PARIS

651 views

276 downloads