Bits of Freedom

Tinkerer and thinker on everything free and open. Exploring possibilities and engaging with new opportunities to instigate change.

Jonas Öberg
Author

Jonas is a dad, husband, tinkerer, thinker and traveler. He's passionate about the future and bringing people together, from all fields of free and open.


A REUSE compliant Curl

Jonas Öberg

The REUSE initiative aims to make free and open source software licenses computer readable. We do this by introducing our three REUSE best practices, all of which seek to make it possible for a computer program to read which licenses apply to a specific software package.

In this post, I'll be introducing you to the steps I took to make cURL REUSE compliant. The work is based on a branch made about three weeks ago from the main curl Git repository. The intent here is to show the work involved in making a mid-sized software project compliant. You can read this post, and reference the Git repository (GitHub mirror) with its reuse-compliant branch to see what this looks like in practice.

The reason we decided to work on the curl code base for this demonstration is that it's a reasonably homogeneous code base, has a good size for this demonstration, and has an award-winning maintainer!

REUSE curl

The first two of the REUSE practices, which are often the only ones relevant, introduce some clarity around the licenses applicable to each file in a repository. They ensure that for each file, regardless of what kind of file it is, there's a definite and unambiguous license statement, either in the file itself or, if that's not possible, in a standardised location where it's easy to find.

If the practices are implemented, it's possible to create utilities which easily retrieve the license applicable to a particular source code file, assemble a list of all licenses used in a source code repository, create a list of all attributions which need to go into a binary distribution, and so on.
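As a rough illustration of the second kind of utility, here is a shell sketch which tallies the SPDX identifiers declared in a source tree. It assumes the SPDX-License-Identifier tag introduced further down is already present in the files, and that a grep supporting -r, -o, -I and --exclude-dir (such as GNU grep) is available. The name list_licenses is made up for this example and isn't part of any REUSE tool.

```shell
# list_licenses DIR: print each SPDX identifier found under DIR, prefixed
# by the number of times it is declared. Only the first identifier of a
# compound expression (e.g. "MIT OR curl") is picked up.
list_licenses() {
    # -r recurse, -h omit file names, -o print only the match,
    # -I skip binary files, -E extended regular expressions.
    grep -rhoIE --exclude-dir=.git \
        'SPDX-License-Identifier: [A-Za-z0-9.+-]+' "$1" 2>/dev/null |
        sed 's/^SPDX-License-Identifier: //' |
        sort | uniq -c | sort -rn
}
```

Running list_licenses . at the top of a repository would give a quick overview of which licenses are declared and how often.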

Here are the practices, one by one:

1. Provide the exact text of each license used

The curl repository includes code licensed under a variety of licenses, including several BSD variants. The primary license of the software is a permissive license inspired by the MIT license. REUSE practices mandate that when a piece of software includes multiple licenses, these are all included in a directory called LICENSES/.

This practice intends to make sure each license is included in the source code, such that it can be referenced from the individual source code files. In the current curl repository, only the principal license for curl is included as a separate file. All other licenses are included in individual copyright headers.

However, the intent of the REUSE practices here is to make sure a computer can understand what the license snippet is. Merely leaving the license information in the headers doesn't really suffice. We still need a way to identify which text constitutes the license.

Adding them explicitly in the LICENSES/ folder would work for this, as we would then use the License-Filename tag (see later) to reference the explicit license relevant for a file. Another way, which is easier and cleaner in this case, is to copy over the relevant license statement to a DEP5/copyright file. The DEP5/copyright format is designed to be computer readable, and can include custom license texts, which we copy from the individual files.

So for curl, we will leave the curl license where it is, but add ancillary licenses in a computer readable way later on.

REUSE practices give the filename for the license file as LICENSE, and not COPYING. This has been amended in the next release version of the REUSE practices to allow for both common variants, and so we opt here to not change the name of the file but to leave it as COPYING.

2. Include a copyright notice and license in each file

curl is exemplary in that almost all files have a consistent header, which looks like this:

Copyright (C) 1998 - 2017, Daniel Stenberg, daniel@haxx.se, et al.

This software is licensed as described in the file COPYING, which you should have received as part of this distribution. The terms are also available at https://curl.haxx.se/docs/copyright.html.

You may opt to use, copy, modify, merge, publish, distribute and/or sell copies of the Software, and permit persons to whom the Software is furnished to do so, under the terms of the COPYING file.

This software is distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY KIND, either express or implied.

The REUSE practices are explicit in that we should not change the header, but we can (and in this case should) add information to it: a reference to the license file, and an SPDX license identifier.

SPDX license identifiers aren't new, but they're starting to make inroads into larger code bases (such as the Linux kernel) for one important reason: it's far easier to parse and understand what a well-known tag with a well-known content means, than to parse a license file.

For the SPDX license identifier, curl is a special case. While the license is MIT inspired, it is not an exact copy of the MIT license. It's a free and open source software license, but we can not use the default MIT license identifier. Had the curl license not been included in the SPDX license list, we would have opted to not include an SPDX license identifier.

However, the curl license has been explicitly included in the SPDX license list with the name curl. So we use this reference in our identifier:

SPDX-License-Identifier: curl

The REUSE practices also state that we should include a reference to the license file. The reference is already there, but it doesn't make use of the REUSE practices License-Filename tag, and as such, it isn't computer readable. Adding the License-Filename tag with the name of the license file will ensure tools supporting REUSE compliant source code can understand the reference to the license filename without previously having encountered the format of the curl headers.

License-Filename: COPYING

This makes the license, and the reference to the license file, very clear. Making these two additions to the copyright headers resolves the situation for the majority of files in the repository.

It's worth noting that adding both is relevant. The License-Filename tag is more specific than the SPDX-License-Identifier and doesn't depend on an external repository to convey information, but including the SPDX-License-Identifier tag also means generic tools working with SPDX can parse the source code, regardless of whether they support the full REUSE practices or not.

We fix up the headers with the following two sed scripts (improvements welcome!):

/^# This software is distributed/,/^# KIND, either express or implied./c\
# This software is distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY\
# KIND, either express or implied.\
#\
# License-Filename: COPYING\
# SPDX-License-Identifier: curl

and

/^ \* This software is distributed/,/^ \* KIND, either express or implied./c\
@*@ This software is distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY\
@*@ KIND, either express or implied.\
@*@\
@*@ License-Filename: COPYING\
@*@ SPDX-License-Identifier: curl

We run these with:

$ find . -type f -exec sed -i -f sed-hash.script {} \;
$ find . -type f -exec sed -i -f sed-star.script {} \;
$ find . -type f -exec sed -i 's/^@\*@/ */' {} \;

(The trick with @*@ is to preserve proper formatting since sed has a tendency to want to strip spaces. Unfortunately for us, there are plenty of files with other types of comments, some starting with .\" * for man pages, others with # * and yet others with rem *, so some manual work is needed for this.)

A good way to find problems is to do a git diff and look for lines removed. Since we never intend to remove any information, but only add to it, anytime a git diff flags a line as having been removed, there's a fair chance we've done something wrong.
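This check can be partly automated. A minimal sketch, assuming it's run inside the working tree after the sed scripts: git diff --numstat prints the number of added and deleted lines per file, and the small filter below keeps only the files where something was deleted. The function name removed_from_numstat is made up for this example.

```shell
# Read `git diff --numstat` output on stdin ("added<TAB>deleted<TAB>path")
# and print the paths where at least one line was deleted.
# Binary files are reported with "-" instead of a count and are skipped.
removed_from_numstat() {
    awk -F '\t' '$2 != "-" && $2 + 0 > 0 { print $3 }'
}

# Usage, from inside the repository:
#   git diff --numstat | removed_from_numstat
```

Any file this prints deserves a closer look with a regular git diff.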

The curl repo includes 2783 files. Adding the SPDX license identifier and license filename to the headers leaves us with 1693 files remaining.
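One way to keep track of what remains is to list the files which still lack the tag. The following is a sketch rather than the exact command used for the numbers above. It assumes a grep with -r, -L and --exclude-dir (GNU grep has all three), and untagged_files is a made-up name.

```shell
# untagged_files DIR: list files under DIR (skipping .git) which contain
# no SPDX-License-Identifier tag.
untagged_files() {
    # -L prints the names of files WITHOUT a match.
    grep -rL --exclude-dir=.git 'SPDX-License-Identifier:' "$1"
}

# Usage: untagged_files . | wc -l
```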

A lot of the remaining files concern test cases (files in tests/data) and documentation which can not include copyright headers.

The REUSE practices offer two ways of resolving this. One is to add a supplementary file for each file which can not include a copyright header, named FILENAME.license and containing the standard copyright header. We don't want to do this, as it would add some 1693 additional files to the repository!

The other way is to make a single file, in this case in the DEP5/copyright file format, which documents the license of each file which can not in itself include a license header.

In a debian/copyright file, we can include license information such as:

Files: tests/data/*  
Copyright: Copyright (c) 1996 - 2017, Daniel Stenberg, <daniel@haxx.se>  
License: curl  

This allows us to get rid of a large chunk of files which can not have a header. This gets us down to about 289 files remaining, which do in one way or another require some manual processing.
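To give an idea of how a tool could read such a file, here is a simplified lookup in shell. It's a sketch under several assumptions: each Files: field fits on one line, License: comes after Files: within a paragraph, and when several paragraphs match, the last one wins (which is how the Debian copyright format describes it). The function name dep5_license is made up, and a real parser would need to handle more of the format.

```shell
# dep5_license COPYRIGHT_FILE PATH: print the short license name of the
# last paragraph whose "Files:" patterns match PATH.
dep5_license() {
    file=$1 path=$2 patterns='' result=''
    set -f    # don't expand the glob patterns against the current directory
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            '')           patterns='' ;;              # blank line: new paragraph
            [[:space:]]*) ;;                          # continuation, e.g. license text
            Files:*)      patterns=${line#Files:} ;;
            License:*)
                for p in $patterns; do
                    # Unquoted $p is used as a glob; "*" also matches "/".
                    case $path in
                        $p) set -- ${line#License:}; result=$* ;;
                    esac
                done ;;
        esac
    done < "$file"
    set +f
    [ -n "$result" ] && printf '%s\n' "$result"
}
```

With the paragraph above in debian/copyright, dep5_license debian/copyright tests/data/test1 would print curl.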

For many, they can include headers, but for various reasons, this has been forgotten. This is the case for winbuild/Makefile.vc which was committed at the same time as winbuild/MakefileBuild.vc. I didn't look deeper at the commit history, but the latter includes a proper header; the former does not.

For most files which can include a copyright header, we've added the SPDX-License-Identifier and License-Filename tags to the header, but we did not add the full curl header. It would be up to the curl developers to determine whether a file should have a curl header, and if so, what to include in the header in terms of copyright information.

The case of Public Domain

lib/md4.c is in the public domain, or in the absence of this under a very simplified BSD license. There are excellent reasons why public domain doesn't have an SPDX license identifier, so the file itself is left untouched. Debian has opted, in their repository, to explicitly mark the file as in the public domain. We do the same. But as the public domain is a concept which differs by jurisdiction, it is up to the final recipient to make the judgement about whether the file can be used.

Important lesson: do pick a license, even if it's a simple one, which does the same thing as dedicating a file to the public domain. Don't just slap "public domain" on a file and hope all is well.

Why we need source-level information

tests/python_dependencies/impacket/smbserver.py and related files serve a good example of why our principles ask for as much information as possible to be included in the source code files themselves. These files have the following copyright header:

# Copyright (c) 2003-2016 CORE Security Technologies
#
# This software is provided under a slightly modified version
# of the Apache Software License. See the accompanying LICENSE file
# for more information.

Unfortunately, as often happens, these files have been copied without being accompanied by the corresponding LICENSE file. In fact, the curl repository contains no file at all called LICENSE, which can lead one to wonder: what does the "slightly modified" version look like?

The answer can be found by looking up the original repository from where these files were taken. It's mainly an Apache 1.1 license with "Apache" replaced by "CORE Security Technologies".

This is one situation where it is warranted to add this obviously missing license information to the repository, and update the header with a License-Filename indicating the right license file. We can not add an SPDX license identifier as there are modifications to the original license (even if they are minor).

Do note that for consistency with the header, I add the license file from the original repository in the impacket directory, and not in the top level LICENSES/ directory which the REUSE practices recommend. The location of the licenses is a SHOULD requirement, however, so we can violate it here, as long as we follow the MUST requirement of actually including all license files.

The original repository is somewhat inconsistent in its licensing though. Two files, smb.py and nmb.py, are indicated in the LICENSE file as being licensed under a custom license, and not the modified Apache license.

However, the individual files have headers which indicate the license is the modified Apache license, with a reference to the LICENSE file. This would ideally be clarified upstream, but since the LICENSE file includes both licenses and an explanation of the situation, referencing it from the copyright header at least ensures the recipient receives as much information as is available upstream.

OpenEvidence licensed files

curl contains a small number of files licensed by the OpenEvidence Project, using a license inspired by the OpenSSL license, but using different advertisement clauses. Specifically, in one of the files docs/examples/curlx.c (which, admittedly, is not included in the builds), the license advertisement clause is given as:

 * 6. Redistributions of any form whatsoever must retain the following
 *    acknowledgments:
 *    "This product includes software developed by the OpenEvidence Project
 *    for use in the OpenEvidence Toolkit (http://www.openevidence.org/)
 *    This product includes software developed by the OpenSSL Project
 *    for use in the OpenSSL Toolkit (https://www.openssl.org/)"
 *    This product includes cryptographic software written by Eric Young
 *    (eay@cryptsoft.com).  This product includes software written by Tim
 *    Hudson (tjh@cryptsoft.com)."

While the license is very similar to the OpenSSL license, we can not use the OpenSSL SPDX identifier in this case, since the obligations are different. While the same can happen with the BSD licenses as well, SPDX deals with the two differently.

In BSD-4-Clause, as one example, the text representation of the license included in SPDX has a variable for the attribution requirement:

This product includes software developed by the <<var;name=organizationClause3;original=the organization;match=.+>>.  

This should then, in theory, enable the license to be matched regardless of what organisation is specified in the license, and license scanners would know to expect an organisation name in this place. The same isn't true of the OpenSSL entry in SPDX, which means that OpenSSL means precisely that: the OpenSSL license text, without any variables or deviations. So it would not match against the OpenEvidence license.
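To make the idea of a variable in the license text concrete, here is a deliberately naive shell sketch. It assumes the template contains exactly one variable marker and that its match rule is .+ (one or more characters), as in the line quoted above. Real license scanners do considerably more than this, and matches_template is a made-up name.

```shell
# matches_template TEMPLATE TEXT: succeed if TEXT equals TEMPLATE with the
# single <<var;...>> marker replaced by one or more arbitrary characters.
matches_template() {
    prefix=${1%%'<<var;'*}    # literal text before the marker
    suffix=${1#*'>>'}         # literal text after the marker
    case $2 in
        "$prefix"?*"$suffix") return 0 ;;   # "?*" stands for ".+"
        *)                    return 1 ;;
    esac
}
```

Given the BSD-4-Clause line above as the template, a sentence naming the OpenEvidence Project in place of the organisation would match, while a text with nothing in that place would not.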

For this file, we'll use, as Debian does, the convention of specifying the license "other" in the DEP5/copyright file and including the license header license text in full.

Finding the copyright holder

It's worth noting that some files in curl have no author or copyright information given. Such is the case of packages/vms/curl_release_note_start.txt and related files. We can infer from the Git log who the author might be, but the REUSE practices should not be interpreted as an archaeological expedition! You have to decide for yourself the length you go to in this.

From a project perspective, it might sometimes be useful to document this, but for a project like curl, whose list of contributors is upwards of 1600 people, untangling this becomes a project in itself, and might not even be relevant.

So the priority becomes identifying the right license for the files included. If some files are under a different license from the one covering most of the rest of the distribution, this would be important to note. But do solve one problem at a time. Digging through to identify every single copyright holder would be time consuming, prone to errors, and in most cases would not solve a problem anyone has.

3. Provide an inventory for included software, but only if you can generate it automatically.

For curl, we will deal with this practice in an easy way: we simply won't do it. Ideally, we should ship, together with curl, or generated at build time, a bill of materials of included software with their copyrights and licenses marked. There are some initiatives and tooling which would be helpful in this, but currently, providing a complete inventory would be more trouble than it's worth.

If we did provide an inventory, the likelihood of it not being updated and maintained is significant. So since we can't do it automatically right now, we will not.

Parsing a REUSE compliant repository

Having passed through the REUSE practices, added the appropriate license headers and the DEP5/copyright file, where does this leave us? It leaves us in a state where finding the license of a source file included in curl is easy and can be automated.

  1. If the file includes the SPDX-License-Identifier tag, then the tag value corresponds to the license from the SPDX license list.
  2. If the file includes the License-Filename tag, then the tag value corresponds to the file containing the actual license text in the repository. This tag takes precedence over the SPDX license identifier.
  3. If there are no SPDX or License-Filename tags, look for a file with the same name with the suffix .license. If it exists and contains the tags in (1) and (2), parse them the same way as if they were included in the file itself.
  4. If there's a debian/copyright file, match the filename against it, and if found, extract the license indicated.
  5. If none of the above works, the repository is not REUSE compliant.
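The steps above can be sketched as a small shell function. This is an illustration of the lookup order, not a finished tool. It assumes tag values are single words, that each Files: field in debian/copyright fits on one line with License: following it, and that the last matching paragraph wins. All function names are made up for this example.

```shell
# first_tag FILE TAG: print the first word after "TAG:" in FILE.
first_tag() {
    [ -f "$1" ] || return 1
    v=$(sed -n "s/^.*$2:[[:space:]]*//p" "$1" | head -n 1)
    v=${v%%[[:space:]]*}          # drop trailing text such as " */"
    [ -n "$v" ] && printf '%s\n' "$v"
}

# tags_in FILE: steps 1 and 2. License-Filename takes precedence.
tags_in() {
    if v=$(first_tag "$1" License-Filename); then printf 'file:%s\n' "$v"
    elif v=$(first_tag "$1" SPDX-License-Identifier); then printf 'spdx:%s\n' "$v"
    else return 1
    fi
}

# dep5_lookup COPYRIGHT_FILE PATH: step 4, simplified.
dep5_lookup() {
    [ -f "$1" ] || return 1
    cfile=$1 target=$2 pats='' hit=''
    set -f                        # keep glob patterns unexpanded
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            '')           pats='' ;;
            [[:space:]]*) ;;
            Files:*)      pats=${line#Files:} ;;
            License:*)
                for p in $pats; do
                    case $target in $p) set -- ${line#License:}; hit=$1 ;; esac
                done ;;
        esac
    done < "$cfile"
    set +f
    [ -n "$hit" ] && printf 'dep5:%s\n' "$hit"
}

# license_of REPO PATH: PATH is relative to the repository root REPO.
license_of() {
    tags_in "$1/$2" && return 0                          # steps 1 and 2
    tags_in "$1/$2.license" && return 0                  # step 3
    dep5_lookup "$1/debian/copyright" "$2" && return 0   # step 4
    echo "$2: no license information, not REUSE compliant" >&2   # step 5
    return 1
}
```

The output is prefixed with where the answer came from (file:, spdx: or dep5:), so for a curl source file carrying both tags, license_of . lib/url.c would be expected to print file:COPYING.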

Where to next?

This has been an example and demonstration of the work involved in making a repository REUSE compliant. We will continue to review the REUSE practices and release further guidance in the future, but more importantly: we hope others will pick up this work and include support for REUSE compliant repositories in tools which serve to understand software licensing.

We're also looking forward to seeing more tools being built in general. One of our interns, Carmen, is currently working on a tool which would lead to the generation of a lint checker for REUSE compliance. That's one of many tools needed to help us on the way towards making copyrights and licenses computer readable. And computer understandable.
