Annual Reports – Software Heritage
https://www.softwareheritage.org

Software Heritage in 2023: a perspective
https://www.softwareheritage.org/2024/02/01/software-heritage-annual-report-2023/
Thu, 01 Feb 2024 09:01:11 +0000

As we enter 2024, we publish, as usual, our annual report on the past year, and like last year this is now available as a standalone document, making it easier to grasp the breadth of the mission, follow the progress made and share it with a broader audience.

The start of 2023 witnessed the Software Heritage symposium and summit held at UNESCO’s headquarters in Paris, France. Organized in collaboration with UNESCO, the event centered on an international conference themed “Software Source Code as documentary heritage and an enabler for sustainable development.” The program explored five main dimensions:

  • Understanding software source code as documentary heritage and its role in digital skills education;
  • Considering software source code as a research object in open science;
  • Examining software source code’s impact on innovation and sharing in industry and administration;
  • Discussing long-term preservation perspectives; and
  • Reviewing technological advances in software source code analysis.

UNESCO – Paris | © Inria / Photo B. Fourrier

The event gathered our community, including team members, ambassadors, grantees, partners, and contributors who discussed the Software Heritage Archive and various aspects of its mission. The dedicated blog post offers a summary of the workshop’s key points, and our annual report, presented as a standalone document for the first time, gives an overview of our progress.

We suggest reading UNESCO’s article, “Positioning software source code as digital heritage for sustainable development”; the complete transcript is accessible in PDF format.

The event’s recording is also available online for those who couldn’t attend.

In 2023, we welcomed 10 new ambassadors to our cause, 5 women and 5 men, bringing the count of our team of ambassadors to 33 worldwide. We featured several ambassador articles this year: one by Simon Phipps titled “Open Source ensures code remains a part of culture”, advocating for the preservation of software as a cultural element through Open Source; one by Agustin Bethencourt titled “Why did I become a Software Heritage Ambassador?”, which delves into the significance of Software Heritage within the industry; and one titled “Viewpoints on software in research at the Gustave Eiffel University”, an interview with Céline Rousselot and Joenio Marques da Costa.

Throughout the year, the ambassador community held two plenary sessions, in close contact with the Software Heritage core team. One key topic has been software metadata, a complex but essential issue that is detailed in the article “Deep Dive into the archival of Software Metadata”. A special effort has been made to present the broad lines of the 2023 Software Heritage technical roadmap, which was published in the first quarter of 2023.

Supporting Open Source

At Software Heritage, we remain committed to advocating for the importance of open-source software and its role in shaping the future of technology. This is why we co-signed an open letter with the Eclipse Foundation on the Cyber Resilience Act. The objective of this new regulation is to ensure the safety and security of our digital infrastructure, including software, but we must make sure that it does not hinder the progress and innovation of open-source software as an unintended side effect. You can read the open letter and learn more about this important topic on the Eclipse Foundation’s website.

Building a collaboration infrastructure

We know that to succeed in the humbling mission we have undertaken we need to enable a large community to contribute and collaborate. This year we are happy to report several key advances in this direction.

We concluded a multi-year effort, conducted with help from Open Tech Strategies, to transition our development and operations from our previous system to our own GitLab instance, which is more familiar to external contributors.

We opened a new documentation landing page at docs.softwareheritage.org to make it easier for newcomers to find their way in the vast amount of documentation available.

We have been working to make it easier for developers to regularly archive their software in Software Heritage by introducing dedicated save code webhooks in the API for several popular forges and technologies: Bitbucket, Gitea, GitHub, GitLab and SourceForge.

Last, but not least, we have introduced a GraphQL API that greatly simplifies programmatic access to the archive: users can play with it using the Software Heritage GraphQL Explorer. This is an addition to the traditional Software Heritage REST API that will enable clients to craft robust queries and seamlessly retrieve server data.

SWHID sees growing adoption and becomes the Software Hash Identifier

A key part of the Software Heritage infrastructure is the set of persistent identifiers known as SWHIDs, which make it possible to guarantee the integrity of software artefacts without relying on third parties, enabling better scientific reproducibility.

This year, SWHID adoption has been growing in academia. A close collaboration with CCSD and IES-INRIA led to opening up SWHID deposit on HAL to all French researchers since January 2023, massively simplifying the referencing of research software in French institutional portals, and the generation of the many reports often requested in an academic career. At an international level, the Computer Graphic Replicability Stamp Initiative (GRSI) now uses Software Heritage to archive software associated with research articles, and uses SWHIDs to reference it: when code is accepted for the Replicability Stamp, it relies on Software Heritage to create a snapshot of the project and references the accepted version with the corresponding SWHID.

The SWHID identifier has been developed at Software Heritage, where it has been in use in our archive for almost a decade. Since it can be computed independently, and used in a variety of other applications, the time has come to create an independent specification, to ensure that all stakeholders can benefit from it. To this end, after almost two years of intense work, an open working group has released the publicly available specification of the SWHID, which is now spelled “Software Hash Identifier” and no longer “Software Heritage Identifier” (pronounce it /ˈswɪd/).
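To make "computed independently" concrete, here is a minimal sketch of how a SWHID for a single file's content can be derived with nothing but a hash function. It assumes the content identifier uses the same construction as a Git blob identifier (SHA-1 over a short header followed by the raw bytes); the function name `content_swhid` is only illustrative, and the published specification remains the authoritative reference, in particular for directories, revisions, releases and snapshots, which are not covered here.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the core SWHID of a file's content (object type "cnt").

    Assumption: the identifier is the SHA-1 of a Git-style blob header
    ("blob <length>\\0") followed by the raw bytes of the file.
    """
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    # "swh" is the scheme, "1" the version, "cnt" the object type.
    return f"swh:1:cnt:{digest}"


# The empty file is a convenient check value.
print(content_swhid(b""))
# → swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the identifier depends only on the bytes being identified, anyone holding a copy of the file can recompute it and check integrity without contacting the archive.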

Software Heritage in European Research Projects

At Software Heritage, we have a long tradition of participating in collaborative research projects when we can help improve the way research software is archived, referenced, described and cited. On the infrastructural side of Open Science, groundbreaking work is ongoing in a dedicated work package in the FAIRCORE4EOSC European project, to connect scholarly infrastructures with the Software Heritage archive. The first visible outcome is the partnership initiated with the swMATH portal to bridge mathematical publications with comprehensive software records, enriching the scholarly landscape. This year, we also contributed to a collaborative effort by two such projects, FAIR-IMPACT and FAIRCORE4EOSC, during the RDA P20 plenary in Gothenburg.

Software Heritage is also part of the SoFAIR project, recently awarded through the CHISTERA Open Research Data & Software Call, whose goal is to elevate the discoverability and reusability of open research software, aligning with our commitment to advancing the accessibility of software source code artifacts.

Research on Software Heritage

Campus Cyber – Paris | © Inria / Photo B. Fourrier

Software Heritage is an archive, but also an exceptional infrastructure to enable research on software development. This year, we embarked on the SWHSec project, announced during the launch of a new national research and innovation program on cybersecurity – PTCC. This groundbreaking initiative brings together eight expert research teams specializing in security, software engineering, and open-source software to harness the power of Software Heritage’s robust infrastructure and create cutting-edge tools for cybersecurity.

Software Heritage and Large Language Models for Code

We acknowledge the huge potential of the Software Heritage archive for the training of machine learning models, particularly large language models (LLMs) that can automatically generate code to assist with software development tasks. We advocate for a transparent and respectful approach to the development of these models, aligned with our mission, as detailed in our statement for acceptable machine learning use of the Software Heritage archive.

Saving Inria’s software legacy

In the pursuit of safeguarding Inria’s software legacy, we started a collaboration with the Inria alumni network and the Direction of Culture and Scientific Information (DCIS) to reach out to people who formerly worked at Inria and invite them to help enrich the inventory of software heritage created at Inria since its inception.

Leveraging the Software Stories interface, created in 2021 in collaboration with the Science Stories team and the University of Pisa with UNESCO’s support, a first result of this effort is the publication of the story of the web browser and editor Amaya.

Software, a pillar of Open Science

Software, and its source code, is a pillar of Open Science, and Software Heritage has been recognized by the Global Sustainability Coalition for Open Science Services (SCOSS) for its key role in ensuring continuous access to software as a research output. We look forward to seeing many new members join the newly created Archives and Libraries Interest Group (ALIG) that will bring together academic stakeholders worldwide.

Thanks to our sponsors

We’re grateful to our sponsors, including our new additions Hugging Face, ServiceNow, and Scanoss: it is their continued support that enables us to make progress in this long term mission.

 

First international mirror

And we finished this intense year with the launch of the first international mirror of the Software Heritage Mirror Network by ENEA, the Italian National Agency for New Technologies, Energy and Sustainable Economic Development.  This is a key milestone in the long-term preservation strategy of all our software commons, and is the result of long years of technical and organisational development efforts that will make it much easier for the other forthcoming mirrors to go into production.

 

Roberto Di Cosmo
Director, Software Heritage

Software Heritage in 2022: full speed ahead
https://www.softwareheritage.org/2023/02/14/software-heritage-annual-report-2022/
Tue, 14 Feb 2023 13:19:58 +0000

As we enter 2023, we publish, as usual, our annual report on the past year.

We are excited to announce that this is now available as a standalone document, making it easier to grasp the breadth of the mission, follow the progress made and share it with a broader audience.

In 2022, we continued to expand our collection of source code, reaching a total of over 13 billion unique source files from over 200 million origins. This represents a significant increase over the previous year, as we have been working hard on increasing ingestion and archival efficiency.

Numerous user-facing improvements are now visible, with the addition of features like “Add Forge Now” and the release of the updateswh browser extension.

We have organised the SWHAP Days event to raise awareness of the importance of source code preservation and the role that Software Heritage plays in this effort. And we are organising and supporting the growing community of ambassadors and contributors.

We want to extend our gratitude to our sponsors and members, as well as to all the supporters and volunteers, without whom none of this would be possible. We are thankful for your continued support, and we look forward to working together to pursue Software Heritage’s long-term mission.

Software Heritage in 2021: five years already!
https://www.softwareheritage.org/2022/01/05/software-heritage-in-2021-five-years-already/
Wed, 05 Jan 2022 13:00:46 +0000

Despite the health crisis we have been experiencing since March 2020, Software Heritage pursues its mission of collecting, preserving and sharing software source code, which started many years ago and was unveiled to the public on June 30th 2016.

Expanding the archive: a collective effort

The archive is growing steadily, now counting over 11 billion unique source files from more than 170 million projects, from a growing number of origins: we are grateful to the volunteers who submitted over 100,000 Save Code Now requests, and to the many expert contributors who helped expand the coverage of the archive (more on this below).

Overall, there has been significant technical progress, described in the Software Heritage 2021 technical roadmap that provides an overview of what we are working on in a number of areas: Collect, Preserve, Share, Organize, Measure, Documentation, Community and Tooling. Many of these tasks involve important improvements to the Software Heritage infrastructure and software stack that may go unnoticed, despite being very important: for example, we overhauled the archive counters, added a breakdown of the archived origins, and a Save Code Now request is now handled in a matter of minutes, which is a real game-changer for the users.

Documentation has been improved significantly, both for developers and for users, and we invite you to visit the brand new Features Page that provides precious insight into the many new functionalities that are now available, as well as a peek under the hood.

Building the universal source code archive is not an easy task: we need to cope with a variety of technical challenges, and in order to succeed in the long term mission of Software Heritage, a collective effort is needed.

This year, we pursued our partnership with the Alfred P. Sloan Foundation and the NLnet Foundation to provide grants for experts that are willing to get involved and build the many connectors needed to expand the coverage of the archive.

A grant was awarded to Octobus to work on archiving SourceForge and adding Bazaar to the list of version control systems supported by the Software Heritage ingestion pipeline; a grant was also awarded to OCamlPro to help increase the coverage of the Software Heritage archive by integrating it with the OCaml ecosystem; another one to Easter-Eggs to help us build the next-generation object storage for Software Heritage; and a grant was awarded to Castalia Solutions to develop the Maven Repositories connector to archive the Maven ecosystem.

We are still calling on all experts to step up and express their interest in participating! Please fill in this simple form if you are interested.

Growing an international community 

The Software Heritage infrastructure now offers a wealth of stable features that can be used in a variety of applications, ranging from cultural heritage to science, industry and public administration, and the time has come to foster adoption broadly. To this end, we launched the Ambassador program, and in 2021 we were delighted to welcome 17 volunteer ambassadors willing to contribute to community engagement, and accelerate the adoption of Software Heritage in the many fields where it brings groundbreaking benefits. You can contact them to learn more about Software Heritage, and you can become an ambassador too.

We welcomed Google Summer of Code students again, giving more student developers access to open source software development.

Recognizing Software as a key pillar of Open Science

The year 2021 has been a turning point for software in Open Science, with several highly relevant events. On July 6th, we were excited to see software fully recognized as a key pillar of Open Science in the second national plan for Open Science, unveiled by the French Ministry of Research. In this landmark official government document, a groundbreaking strategy for software in research has been laid out, and Software Heritage plays an important role in it.

On September 29th, the European Open Science Cloud (EOSC) established several task forces, dedicated to improving the infrastructures supporting Open Science in Europe, and we are happy to co-chair the one focused on infrastructures for quality research software.

In November, the UNESCO member states approved the recommendation on Open Science, which now explicitly mentions Open Source software as a key component of Open Science. It also states that open science infrastructures should be “based, as far as possible, on open source software stacks” and “organized and financed upon an essentially not-for-profit and long-term vision”, which is exactly the approach we have taken in Software Heritage since the very beginning of our journey.

Software Heritage for cybersecurity

This year, awareness has been rising significantly about the increasing impact of cybersecurity threats on society as a whole, and the executive order issued in May by the President of the United States has a full section dedicated to “Enhancing Software Supply Chain Security”, which includes a call for “ensuring and attesting, to the extent practicable, to the integrity and provenance of open source software used within any portion of a product”. This gave us the opportunity to show how the Software Heritage archive can contribute to improving the software supply chain, by ensuring availability, guaranteeing integrity, and enabling traceability of all publicly available software source code.

Celebrating five years at UNESCO: our first International Conference

On November 30th, a special event took place at UNESCO’s headquarters to celebrate the five years of Software Heritage. It was the opportunity to take stock of the achievements and status of Software Heritage, and to highlight the relevance of building a universal software source code archive in the context of today’s dynamic digital innovation landscape.

Unesco – Paris | © Inria / Photo B. Fourrier

On this occasion, we brought together the growing international community of Software Heritage, welcomed CEA as our first diamond sponsor, and unveiled Software Stories, a novel approach to present the history of landmark software projects, developed in a joint collaboration with the sciencestories.io team and the University of Pisa.

A detailed account of the event can be found on the UNESCO dedicated web page, as well as on our blog. A five-minute celebratory video that recaps the key milestones of these first five years was also unveiled.

Thanking our sponsors

 

We are very grateful to all of our sponsors and partners for maintaining, and even increasing, their support of our mission, despite the difficult year we all went through again in 2021. This is essential to ensure that Software Heritage can continue to develop its core infrastructure, and roll out the services that will make it useful to all the stakeholders!

Looking ahead

In the year to come, our top priority will again be to ensure that the key functionalities that Software Heritage offers are rock solid, completing the work already started to improve key components of the infrastructure under the hood. We look forward to rolling out the first operational mirrors, improving the resilience of the preservation effort started over five years ago, and to continue expanding the archive coverage, as well as integrating extrinsic metadata sources that will help better describe its contents.

And most importantly, we will continue to expand the international community around Software Heritage: collecting, preserving and sharing all the software source code is a humbling undertaking that requires institutional support, with sponsors and partners (take a look at the different sponsorship program possibilities), as well as individual engagement, ranging from collaborators to contributors, from ambassadors to the donors that are answering the call of the ongoing end of year fundraising campaign (yes, you should join too!).

We look forward to working with all interested parties: let’s work together to preserve our past, improve our present, and prepare a better future.

Software Heritage in 2020: looking beyond the crisis
https://www.softwareheritage.org/2021/01/07/software-heritage-in-2020-looking-beyond-the-crisis/
Thu, 07 Jan 2021 12:54:20 +0000

The year that has passed since we posted our last activity report is a very special one: humankind has been confronted with a global crisis the like of which young generations have never seen before, but it is not the first time we are confronted with such a challenge, and it will not be the last.

The epidemic we are facing today is a powerful yet mindless adversary that knows no border and has no political agenda. This means that we cannot sit down with this enemy and negotiate our way out of the danger that it poses to humankind as a whole: our only hope is to tap into all our collective knowledge to find a cure.

At Software Heritage, we committed ourselves over five years ago to the long term mission of collecting, preserving and making available to all the source code of all software publicly available, as it contains a growing amount of our collective knowledge.

We strongly believe that building the universal source code archive as a common non profit infrastructure will help humankind be better prepared for the next global crisis, contributing to answer the WHO, UNESCO and UNHR joint appeal for Open Science.

It is the commitment to this mission that kept the Software Heritage team working at full speed despite the difficulties of a year 2020 spent almost entirely in lockdown. And we are happy to report that it has been a very productive year!

Preserving the web of knowledge: archival

A quick look at the source code archive main page shows that it now contains almost 10 billion unique source files from more than 2 billion unique commits coming from over 150 million projects collected worldwide. We have put to good use the collaboration with GitHub to ease its archival, salvaged hundreds of thousands of endangered repositories from Bitbucket, processed tens of thousands of save code now requests, and established collaborations with academic journals in life sciences and computer science to archive research software associated with published articles.

Preserving the web of knowledge: reference

Ensuring that past and present software source code is collected and safely archived is one part of the mission, but to fully reap the benefits of this effort, it is necessary to also make sure that all the software artifacts we archive can be referenced now and in the long term.

This is why we put significant effort in formalising the Software Heritage intrinsic identifiers, aka SWHID, that the Software Heritage archive provides for the tens of billions of software artifacts that it preserves, and in working with Industry and Academia towards their widespread adoption.

We are happy to report that the full specification of SWHIDs is available, the swh prefix used in SWHIDs is now registered with IANA, the Software Package Data Exchange (SPDX) industry standard specification includes SWHIDs in its recently published version 2.2, and SWHIDs are now clearly present in the scholarly landscape for software source code identification.
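For readers who have not yet met a SWHID in the wild, the sketch below shows the general shape of the identifier and how a tool might split it into its parts. It assumes the core syntax `swh:1:<type>:<40 hex digits>`, with the object types `cnt`, `dir`, `rev`, `rel` and `snp`, optionally followed by `;key=value` qualifiers; the `parse_swhid` helper, the `Swhid` tuple and the example values are hypothetical, and the specification should be consulted for the exact grammar and the list of valid qualifiers.

```python
import re
from typing import NamedTuple

# Core identifier: scheme, version 1, object type, 40 lowercase hex digits.
_CORE = re.compile(r"swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})")


class Swhid(NamedTuple):
    object_type: str   # cnt, dir, rev, rel or snp
    object_id: str     # hexadecimal hash of the object
    qualifiers: dict   # optional context, e.g. origin or path


def parse_swhid(text: str) -> Swhid:
    """Split a SWHID into object type, hash and qualifiers (illustrative only)."""
    core, *raw_qualifiers = text.split(";")
    match = _CORE.fullmatch(core)
    if match is None:
        raise ValueError(f"not a valid core SWHID: {core!r}")
    qualifiers = {}
    for item in raw_qualifiers:
        key, sep, value = item.partition("=")
        if not key or not sep:
            raise ValueError(f"malformed qualifier: {item!r}")
        qualifiers[key] = value
    return Swhid(match.group(1), match.group(2), qualifiers)


# Hypothetical example: the empty file, seen at a made-up origin and path.
example = (
    "swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
    ";origin=https://example.org/repo.git;path=/EMPTY"
)
print(parse_swhid(example))
```

The qualifiers only add context about where an object was found; the core part alone is enough to identify it.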

Preserving the web of knowledge: description and citation

Our engagement to preserve the web of knowledge goes beyond ensuring that software is safely archived and identified for the long term. We also care about the metadata that describes it, and the way it is cited in documentation and research articles. As part of our effort to support Open Science and reproducibility of research, we contribute to community initiatives to describe research software with proper metadata, we have released the first bibliographic style ever designed to cite software, and we have introduced the Software Heritage badges (swh-badges), that you can use to link to the archived source code.

There are three types of badges:

These are important steps forward in our years-long engagement to raise awareness about the importance of software in general, and as a key ingredient for academic research, on a par with articles and data.

Building the community

The scope of our mission is broad, and humbling. We know well that to succeed in the long term we need help from a broad community, ranging from industry to academia, from governments to international organisations, from private foundations to cultural institutions, from passionate individuals and contributors all over the world.

This year has seen the first bold step forward to foster the emergence of such a broad community.

Grants

We partnered with the Alfred P. Sloan Foundation and the NLnet foundation to provide grants for experts who are willing to engage with the long term mission of Software Heritage and build adaptors for each of the platforms and version control systems out there, in order to collect and archive their contents properly.

The NLnet foundation supported Octobus to rescue 250,000 endangered public Mercurial repositories, and Tweag to develop an adapter that allows Software Heritage to archive more than 20,000 source code tarballs used to build the Nix package collection.

The cascading grant received from the Sloan Foundation has enabled us to award two subgrants already. The first is supporting Cottage Labs to build a connector that will allow all instances of InvenioRDM to safely and efficiently archive in Software Heritage the source code of all research projects that will be deposited in them, and to provide the corresponding intrinsic identifiers (SWHID) to the research community. The second will fund the work of Stefan Sperling to improve the current Subversion loader and develop a CVS loader.

The Software Heritage website now has a dedicated page that details the grant programs: some are still open, so do not hesitate to apply!

Open Science and Research Infrastructures

We want to help carry the voice of software developers, researchers and research engineers in the Open Science movement: we actively participate in the Research Data Alliance (RDA), carry out various activities to promote software recognition in the FAIR ecosystem, and take part in the FAIRsFAIR and EOSC-Pillar European projects.

EOSC SIRS Architecture

This year we moved one step forward by coordinating the EOSC SIRS task force, which brought together 9 scholarly infrastructures and produced an official report on the basic building blocks to support software source code in the scholarly ecosystem. An essential infrastructure in the global architecture is the Software Heritage universal software archive.

Sponsors

During the meeting organized at UNESCO’s headquarters in February 2020, 30 participants, representing the expanding network that supports our mission, met to contribute to the discussion on the next steps and strategic directions for the next years.

 

We are very grateful to them for continuing to support our mission despite the difficult year we all went through.

And we have been delighted to welcome three new important supporters: the CNRS as platinum sponsor and Sorbonne Université and Université de Paris as gold sponsors are now working with us to build the software pillar of Open Science.

Sharing and spreading the news

 

We have spent years working hard to accomplish our mission. The time has come to share and spread the news better than what we have been doing up to now.  This is why this year we launched the Newsletter and the YouTube channel. Now you can stay up to date with Software Heritage news by subscribing to the newsletter and find all our presentations in one place on our YouTube channel!

Introducing the Software Heritage ambassador programme

Last but not least, we are now launching the Software Heritage ambassador programme, designed to welcome enthusiastic organizations and individuals that want to help spread the word about Software Heritage and the services it provides to society as a whole. There are many reasons to engage and you can apply to become an Ambassador for Software Heritage right now!

 

Preparing for the long haul

Our top priority will again be to ensure that the key functionalities that Software Heritage offers are rock solid: browsing, referencing, and saving source code. We have also been actively working on many exciting developments that are not really visible right now, and we hope to roll them out progressively in the coming months: mirrors are coming, as is integration with extrinsic metadata sources that will help better describe the contents of the archive, to name a few.

And most importantly, the time has come to start working on setting up the independent, international, non profit, open organization that will host Software Heritage for the long term, with an exclusive focus on its mission of building and maintaining the universal source code archive, for the benefit of society as a whole.

We look forward to working with all interested parties to build this essential infrastructure that will contribute to preserve our software commons and provide the reference archive and knowledge base for all use cases, from industry to research, from cultural heritage to governments, from individuals to organizations.

Let’s join forces to preserve our past, improve our present, and, looking beyond the current crisis, prepare better for the future.

— Roberto Di Cosmo

 


Software Heritage in 2019: a progress report
https://www.softwareheritage.org/2019/12/31/software-heritage-in-2019-a-progress-report/
Tue, 31 Dec 2019 22:29:14 +0000

Pursuing our mission to collect, preserve and share the source code of all software ever written, 2019 was a year of great achievements for Software Heritage.

Today is a good time to look back and talk about what has been accomplished in 2019 since our last activity report, and give some perspective on the future.

Policy and Awareness

An important part of our mission is to raise awareness about the importance of software, and software source code, in all aspects of human activity.

This year we pursued and intensified our collaboration with UNESCO, leading to the publication of the Paris Call on Software Source Code as Heritage for Sustainable Development, signed by more than 40 international experts.

The Paris Call provides a strong basis to support a variety of policy actions, ranging from source code preservation, to sustainability of Free and Open Source Software communities.

Pursuing our efforts to have software source code recognised as a pillar of Open Science, we welcomed the French Ministry of Research and Innovation, and started our work in the FAIRsFAIR and EOSC-Pillar European research projects.

On the key issue of attributing and referencing research software, we intensified our engagement with the RDA and FORCE international communities, and published key recommendations leveraging Inria’s 50 years of experience in the field.

In order to help researchers in all disciplines to improve the reproducibility of their work, and enhance their articles using the Software Heritage intrinsic identifiers, we published detailed and actionable guidelines for saving and referencing research software source code and reached out to artefact evaluation committees of major international conferences.
We also joined forces with GNU Guix to enable long term reproducibility.

Last, but not least, together with many international organisations, we played an essential albeit inconspicuous role in protecting software development from extremely damaging provisions in the European copyright reform adopted on April 15th 2019.

Progress on the roadmap

A significant effort went into making progress in our technical and strategic roadmap, continuing our work to collect, preserve and share an ever growing part of the source code of all software ever written.

Collect

The mission of Software Heritage is to collect the source code of all software ever written, and this is a complex undertaking: some software is easily available (online), some is not (offline), and while a growing part of it is open, a lot is still behind closed doors.

We have laid out the different strategies we plan to adopt, depending on the kind of source code at stake, as summarised in the diagram shown here: automation, crowdsourcing, focused search and escrow.

This year we made progress on the automation side by adding npm and PyPI to the list of software origins that are harvested systematically. In order to support crowdsourcing, Software Heritage now allows you to issue save code now requests to archive version control systems on forges it does not yet harvest systematically. This allows you to ensure that work you cherish is preserved, and you can trigger new archivals whenever you want.
A remarkable use of this new feature is made by the French Digital Directorate, which maintains a list of public sector software source code and leverages the save code now API to ensure that each and every project on it is archived in Software Heritage.
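
As a rough illustration of what such an automated request can look like, the sketch below builds and, optionally, submits a save code now request from Python. The endpoint layout `/api/1/origin/save/<visit_type>/url/<origin_url>/` is an assumption based on the public web API documentation and may have changed; the helper names and the example repository URL are made up for illustration.

```python
import json
import urllib.request

# Assumed base URL of the Software Heritage web API.
API_ROOT = "https://archive.softwareheritage.org/api/1"


def save_request_url(origin_url: str, visit_type: str = "git") -> str:
    """Build the URL of a save code now request for one origin."""
    return f"{API_ROOT}/origin/save/{visit_type}/url/{origin_url}/"


def request_save(origin_url: str, visit_type: str = "git") -> dict:
    """Submit the request with an HTTP POST and return the decoded JSON reply."""
    req = urllib.request.Request(
        save_request_url(origin_url, visit_type), method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical repository; printing the URL keeps the example offline.
    print(save_request_url("https://example.org/alice/project.git"))
```

A list of repositories, such as the one mentioned above, could then be archived by calling `request_save` once per entry, ideally with a pause between calls to respect the API rate limits.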

The Software Heritage Acquisition Process

We also made an important step forward to kickstart the focused search work needed to collect and curate landmark legacy source code written by pioneers of the digital age, many of which are still around and willing to contribute their knowledge.

In collaboration with Unesco and the University of Pisa we developed the Software Heritage Acquisition Process (SWHAP), intended to support and empower all those who are interested in contributing to this effort.

The first set of SWHAP guidelines is available, providing concrete, actionable instructions, as well as a detailed walkthrough of the process on a medium-sized landmark legacy software developed over twenty years ago at the Department of Computing of the University of Pisa.

You can contribute to this important mission: use the SWHAP process, and start your curation journey today!

Preserve

An important part of our long-term strategy to ensure that the precious source code we collect is preserved and passed over to future generations is the development of a geographically distributed network of mirrors, implemented using a variety of storage technologies, running in various administrative domains, controlled by different institutions, and located in different jurisdictions.

We are delighted to report that the mirror network has grown: after the first industry member from Sweden, this year we have been thrilled to welcome ENEA, from Italy, as the first institutional partner. This is an important step forward for the Software Heritage mirror network, which we hope to see starting to operate next year.

Another part of the long-term strategy is to establish collaborations with institutional archives to store regular offline snapshots of the archive contents: this year we made a first step in this direction by partnering with Cines in the framework of the EOSC-Pillar European research project.

Share

This year has also seen significant progress in our efforts to make the contents of the archive easily accessible and referenceable for a variety of users.

Software Heritage intrinsic identifiers are showcased in a blog post published on the anniversary of the first manned landing on the moon.

The permalinks tab that provides these identifiers for the tens of billions of software artifacts in the archive has been improved:

it now offers badges that you can use to enhance your web pages, and point to the archived version of the artifact you are interested in.

We also made the whole graph of Software Heritage available to researchers, on both AWS and Azure: this dataset is used for the mining challenge of the MSR 2020 international conference.

Looking ahead

There are so many exciting areas of development and collaboration that will keep us busy in the coming years, so it’s now time to set some priorities.

Saving massively endangered source code will always be a topmost priority: we already know that time and energy will need to be devoted to salvaging the 250K+ Mercurial repositories that Bitbucket is planning to remove by June 2020.

After that, our primary goal will be to ensure that the key functionalities that Software Heritage offers are rock solid: browsing, referencing, and saving source code.

Then, we will focus on scaling up, and deploying the mirror network, to cope with the growing amount of source code that will need to be harvested. We count on the recent partnership established with GitHub to improve the efficiency of archiving the software projects hosted on GitHub, through the GitHub Archive Program and dedicated support from GitHub’s teams, and hope to see more forges following GitHub’s example and establishing partnerships that ease the archival of their contents.

We’ll also work on several new exciting functionalities to make the archive even more usable.

Last, but not least, after some first steps made with GSoC and a few other collaborators, we look forward to fostering the emergence of a broad community of contributors to complement the effort of the Software Heritage core team: it is essential that everybody interested and concerned steps up, if we collectively want to take up the huge challenge that underlies the mission we have undertaken!

— Roberto Di Cosmo

]]>
Opening up to the world, and full speed ahead https://www.softwareheritage.org/2018/12/28/activity-report-2018/ Fri, 28 Dec 2018 14:37:56 +0000 https://www.softwareheritage.org/2018/12/28/activity-report-2018/ Our ambition is to collect, preserve, and share the source code of all software ever written.

In the very intense year 2018 that has passed since our last activity report we have moved forward at a steady pace, and now is a good moment to recall the key accomplishments, putting them in perspective in the framework of our mission, at the service of cultural heritage, science, industry, and society as a whole.

Let’s start by looking at the progress made along these three dimensions, and then we’ll wrap up by looking at how awareness is finally rising.

Collect: expanding the coverage

This last year the archive has grown quite a lot and we have made steady progress in expanding the coverage of our collection process. We have automated the tracking of Debian sources and added PyPI, GitLab.com and Inria’s own GitLab instance to the list of tracked forges.

For the source code hosted on forges that we do not track yet, we are always eager to receive contributions to our listers, but we are also opening up an initial version of a save code now service that lets anyone request the ingestion of source code even if it is not hosted on one of the already tracked forges.

And last, but not least, for research software, we have opened up a moderated software source code deposit service.

Preserve: kickstarting the mirror network

We acknowledge that there are many threats that might endanger long-term source code preservation: from technical failures, to mere economic decisions, from dispersion of efforts in uncoordinated initiatives, to changes in the legal framework, like the EU Copyright reform that absorbed a great amount of our time this year.

We know very well that we cannot entirely avoid them if we work alone. 

That’s why an important part of our long-term strategy is the development of a geographically distributed network of mirrors, implemented using a variety of storage technologies, running in various administrative domains, controlled by different institutions, and located in different jurisdictions.

We have spent a great deal of effort to set up the legal and ethical framework for the mirror network, and we are now really thrilled to have recently welcomed the first member of the network, which we hope to see grow steadily in the future.

Share: making the contents accessible and referenceable

Collecting and preserving software source code have been our top priorities from the very beginning, and for a while very little of the huge amount of work we did was visible from outside. This year has been a turning point, with a lot of visible progress on the share dimension!

Opening the doors of the archive

On June 7th 2018, just a bit more than a year after we established a landmark agreement on preserving and sharing the knowledge embedded in software source code, we were proud to be back at Unesco headquarters for the grand opening of the doors of the Software Heritage archive to the public. This is a major step forward in the sharing dimension: it is now possible to explore the largest collection of software source code in the world taking advantage of the many features detailed in the archive guided tour. And we will keep adding new ones!

Intrinsic identifiers for digital objects

When you find something interesting in the archive, you want to share it, and for this you need an identifier that will point to what you found. Hence we must provide identifiers for each and every one of the over 10 billion digital objects in the Software Heritage archive. It is not an easy task, because it is not just a technological challenge: we knew that whatever choice we made, it would end up setting a standard in the medium term, and this is a serious responsibility.

Our iPres2018 research article provides a full account of why and how we designed the system of identifiers that is now deployed across all the Software Heritage archive (and that stands behind the “Permalinks” red vertical tab available in all views of the webapp used to browse the code).
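
To make the idea of an intrinsic identifier concrete, here is a minimal sketch for the simplest case, the contents of a single file. It assumes the `swh:1:cnt:` identifier form, in which the hexadecimal part is the same SHA-1 that Git computes for a blob (a `blob <length>` header, a NUL byte, then the file bytes). The function name is chosen for illustration, and identifiers for directories, revisions, releases and snapshots are computed differently and are not covered here.

```python
import hashlib


def content_identifier(data: bytes) -> str:
    """Return a swh:1:cnt: identifier computed from the bytes alone."""
    header = b"blob %d\x00" % len(data)  # Git-style blob header
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    print(content_identifier(b"hello world\n"))
```

Because the identifier depends only on the bytes, anyone holding a copy of the file can recompute it and check that it matches, without asking a registry or resolver.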

Awareness is rising

The importance of software in general, and software source code in particular, in our modern societies has long been underestimated, overshadowed by the more visible aspects of the digital revolution, and that is one of the major reasons why Software Heritage was not created earlier.

Thanks to patient and steady efforts, the situation is now slowly changing, and the past year has been a real turning point.

An international expert meeting that was held at Unesco headquarters in November produced a detailed report on the relevance of Software Source Code, the blockers and the enablers, and a call for action that we hope will be heard all around the world.

Research Software, with its source code, is a long forgotten essential pillar for Open Science, along with research articles and research data. Things are now changing: Software Heritage has been included in the French national plan for Open Science, and research software can be deposited in Software Heritage via an open access portal, a stepping stone for software citation.

We also reached out to the broad computing community to ask for help with the many technical and scientific challenges lying ahead.

According to our plan, we have recently established a not-for-profit foundation for Software Heritage, currently hosted by the Inria foundation. That’s why everybody can now contribute to our mission, joining the growing list of sponsors and partners: donations of any size are now accepted, and we do welcome even very small ones, as they indicate that you care.

Looking ahead

There are very many exciting areas of development and collaboration that will keep us busy in this coming year.

The coverage of the Software Heritage archive will be increased: more forges will be tracked, and more version control systems will be supported. We will also work on improving access to the contents of the archive, progressively adding metadata and provenance information to the webapp. We will welcome new members in the mirror network, and collaborate with industry on new interesting use cases.

Research software deposit in the Software Heritage archive will be made available through more open access portals, software source code will be promoted as a pillar for Open Science, and the intrinsic identifiers pointing into the archive will be promoted in the framework of software citation and reproducibility of research.

The content of the Software Heritage archive will be made available as a dataset to researchers in all fields, enabling the emergence of new tools and techniques that will help everyone make better use of the software heritage of humankind.

And we hope to attract even more users and contributors: we have an exciting mission, and everybody is welcome onboard!

— Roberto Di Cosmo

]]>
Getting up to speed, clear sky ahead https://www.softwareheritage.org/2018/01/08/yearly-anniversary-report/ Mon, 08 Jan 2018 07:24:59 +0000 https://www.softwareheritage.org/?p=13479 One year has passed since we posted our first activity report, and it is now a good time to look at what was accomplished in 2017, and give some perspective on the future.

Our mission and our principles

Here at Software Heritage, we are taking over the mission of collecting, preserving, and sharing the source code of all the software available.

We do this for multiple reasons. To preserve the scientific and technological knowledge embedded in software source code, that is a precious part of our heritage. To allow better software development and reuse for society and industry, by building the largest and open software knowledge database, enabling the development of a broad range of value added applications. To foster better science, by assembling the largest curated archive for software research, and building the infrastructure for preserving and sharing research software.

We do this now, because we are at a turning point: on the one hand, most founding fathers of computer technology are still around, and willing to contribute their knowledge, but only for a limited time. On the other hand, we seem to be at increasing risk of massive loss of source code developed by the Free and Open Source community, in particular due to code hosting sites that shut down when their popularity decreases.

This challenging and humbling undertaking must be carried out with a long-term perspective, hence we have published a white paper that states a first set of principles on which we base our work, including transparency, openness, and collaboration.

And we are well aware of the fact that the success of our mission requires widespread recognition, proper resources and good scientific and technical foundations.

Raising awareness at the political level

2017 has been an exceptional year for raising awareness of the importance of software source code at the highest levels.

In early January, an intense week in Chile led to headline news and official support after meeting with the president of the republic.

On April 3rd, months of patient preparatory work culminated in the signature of a landmark agreement with Unesco on software source code preservation and access, in the presence of the president of the French Republic, ambassadors from many countries, and four hundred personalities.

On September 28th, we took part in the Unesco working groups for the International Day on Universal Access to Information, and the outcome document, the Mauritius call, advocates policies, standards and legislation to enable access to and preservation of information, including, explicitly, our software heritage.

Growing support

When we unveiled the Software Heritage initiative exactly one year and a half ago, on June 30th, 2016, we were happy to count on Microsoft and DANS as early sponsors, and over twenty endorsers.

This year, we have welcomed six more sponsors, Société Générale, Intel, Huawei, Nokia Bell Labs, the University of Bologna and, freshly arrived, GitHub.

In November, representatives from all the sponsors were invited to Paris for a first face to face meeting, which was a great occasion to exchange ideas on the progress that has been made and discuss future developments.

Connecting with diverse communities

We have reached out to many communities, explaining what we do, and inviting them to join forces.

The keynote at FOSDEM in February was a first great opportunity to connect with our fellow Free and Open Source software developers, followed by the OSCON presentation in May, and the EclipseCon keynote in October, among others.

We kept our own Computer Science community up to date on Software Heritage on many occasions, in particular at Inria’s 50th anniversary celebration, the European Computer Science Summit and the ACM working group on reproducibility. Early contacts have been made with several research teams around the world that are interested in the unique potential offered by the Software Heritage archive.

To address the scientific community at large, we joined forces with DANS and the SSI to spark interest in software source code in the Research Data Alliance, which led to the creation of a group specifically dedicated to it. The relevance of Software Heritage for streamlining scientific software citation has been noticed, and we launched a fruitful collaboration on researcher-initiated software deposit with the French national open access repository, HAL.

Last, but not least, we made contact with many other preservation initiatives: Computer History Museum, BNF, Software Preservation Network, Persist, Living Computers: Museum + Labs, and CINES.

Many of the talks given on these occasions are available online, some with accompanying videos.

Development

A large amount of work has been invested in Software Heritage’s own infrastructure: most of it was under the hood, but some results are visible already.

A top priority is still to expand our collection of source code. We are quite proud to have crossed the 4 billion unique source code files mark, harvesting over 70 million origins, but we know that there is a long queue waiting out there. So we spent quite a bit of effort to make it easy for external contributors to add support for more code hosting platforms, with special thanks to a great collaborator: you can now write your own lister in a few lines of code!

And yes, we know that a crowd is queuing up in front of the doors of the great library of source code we are building, eager to have a look at its contents. A first step has been made by opening up a public API to access a part of the archive, and thanks to the new members of the team, that has grown significantly over the past months, we will be making progress faster.

As we prepare to fully open the doors to the archive, the question of the terms of use for the data we collect required special care and attention: we are quite happy to have made a significant step forward by publishing the terms of use for our public API.

Looking ahead

Being committed to a long term mission, we will patiently tackle one after the other the many challenges that lie in front of us, building paths for collaboration with all kindred spirits around the world.

This coming year we will prioritize developments that will enable you to browse and download the contents of the Software Heritage archive. A software deposit mechanism will also be made available for connecting with specific platforms, in particular for research software.

We will also focus on the technology, process and legal framework for establishing the first mirrors of the Software Heritage archive.

On the organisational side, a very important step will be the establishment of the Software Heritage Foundation, after many long months of preparatory work, providing the official non-profit structure that will oversee the development of Software Heritage in the long term, and provide the proper vehicle for accepting contributions and donations of any size from all those that share our vision.

There is clear sky in front of us: it’s time to take off for another year.

— Roberto Di Cosmo

]]>
We have come a long way, and the road ahead https://www.softwareheritage.org/2016/12/30/we-have-come-a-long-way/ Fri, 30 Dec 2016 07:28:40 +0000 https://www.softwareheritage.org/?p=8630 We unveiled the Software Heritage initiative exactly six months ago, on June 30th, 2016. Now it seems a good time to look back at the origin of the project, where we started, what we have accomplished up to now, and get a glimpse of the future.

A (not so long) time ago…

The first informal discussions on what has now become Software Heritage started back in the spring of 2014 (as often happens, around a coffee machine and a tea pot) at IRILL in Paris. In the months that followed, some serious preliminary work had to be done: exploring the state of the art, charting related initiatives, elaborating the vision we now proudly share, estimating the effort required, and finding the right host and the initial support for starting the project.

It was an extremely intense period, but if one has to pick a symbolic starting date, that would definitely be October 21st 2014: on that day, Inria’s director, Antoine Petit, after an in-depth analysis of the initial case statement, encouraged me to go forward and propose the project to the Inria decision bodies.

And indeed, after going through some serious scrutiny, Inria’s support for the project quickly became a reality: in March 2015 we got the go-ahead, the advisory committee was put in place, and by September 2015 we had our logo, the current team was already complete and we were working at full speed.

… we started collecting the source code of the World …

During the summer of 2015, after a much needed inaugural dinner, our initial infrastructure was set up, and the operations for collecting the source code repositories were started.

The graphs on our website trace the steady growth of the archive since then, and we have started providing some insight on what is going on under the hood.

We are especially happy to have arrived just in time to collect the contents of Gitorious and Google Code before it was too late, thanks to some wonderful people that were eager to help.

What makes our mission special is that we do not just check out a copy of a given software project when we crawl it: we trace all of its development history, as contained in its version control system, and we do this again every time we visit it, building a sort of Wayback Machine of software development.
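
The difference between a one-off checkout and repeated visits can be sketched with a toy model. Everything below is hypothetical, class and method names included, and it is not the actual Software Heritage data model: it leaves out commits and version control history entirely, and only illustrates that each visit records a snapshot of an origin while file contents are stored once, keyed by their hash, however many visits contain them.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyArchive:
    # Content store: hash -> bytes, shared by all origins and visits.
    contents: dict = field(default_factory=dict)
    # Visit log: origin URL -> list of snapshots (path -> content hash).
    visits: dict = field(default_factory=dict)

    def visit(self, origin: str, files: dict) -> None:
        """Record one visit of an origin, given its files as path -> bytes."""
        snapshot = {}
        for path, data in files.items():
            key = hashlib.sha1(data).hexdigest()
            self.contents.setdefault(key, data)  # stored only if new
            snapshot[path] = key
        self.visits.setdefault(origin, []).append(snapshot)

    def read(self, origin: str, visit_index: int, path: str) -> bytes:
        """Return a file as it was at a given visit."""
        return self.contents[self.visits[origin][visit_index][path]]


if __name__ == "__main__":
    archive = ToyArchive()
    origin = "https://example.org/project"  # hypothetical origin
    archive.visit(origin, {"README": b"v1", "LICENSE": b"MIT"})
    archive.visit(origin, {"README": b"v2", "LICENSE": b"MIT"})
    print(len(archive.visits[origin]), "visits,",
          len(archive.contents), "unique contents")
```

In this toy, the unchanged `LICENSE` file is stored only once across both visits, while either version of `README` can still be read back by visit index.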

…looking for partners and spreading the word

For Software Heritage to keep its promise of long term source code preservation, it is essential to bring together a broad set of partners, from cultural heritage to education, from research to industry, build support for the mission, and pave the way for collaboration. Naturally, we opened up our source code, all of which is Free/Open Source Software.

In parallel with the technical development, an intense effort was dedicated to the presentation of the vision of the project to a great many people, institutions, industries and organisations: the talks given in the course of this journey are available on our wiki.

The breadth of the scope of Software Heritage means that there are different aspects that will appeal differently to different stakeholders. Hence we gave different presentations focusing on the digital preservation issues, the science crisis and software reproducibility, the scientific challenges and the technical architecture of this unprecedented archive, as well as industry applications, like compliance.

Where we stand today…

After unveiling Software Heritage to the world six months ago, we are happy to see quite a lot of media coverage, which is traced in our public git annex repository, and broad support for our mission and our way of implementing it with openness and transparency.

We were delighted to be invited to deliver a keynote at OSCON London and receive our first award on the occasion of the Paris Open Source Summit this November.

Several major IT players are already sponsoring the project, and we hope to see more coming, especially from the Free/Open Source world, which is the most directly concerned by our initiative.

More infrastructure is now available for the collection and archival effort, and we are working hard to put in place a first mirror on the cloud.

…and a glimpse of the future

Looking back, we made a lot of progress, but looking forward, there is a lot more left to be done… here is a glimpse of what is on our roadmap for the future.

Of course, our top priority now is to collect more and more existing source code: mainstream development platforms started closing down last year, and we need to focus on endangered content first. This means discovering more software sources and ingesting their contents. But we hope to be able next year to open up the doors of this great library of source code for all to read, and for scientists to analyze: that’s why we are hiring.

We would also like to start working on connecting Software Heritage with existing metadata information and in particular with Open Access and Open Data repositories. And we might explore ways of enriching the storage layer with a replicated and distributed infrastructure that will ease data access.

If you share our vision and like what we do, please explore our website, follow the pointers, and you will find many ways of lending a hand: it can be as simple as spreading the word.

— Roberto Di Cosmo

]]>