NIH Software Discovery Index Meeting Report

This website was created to complement the Software Discovery workshop held in May 2014, which explored the challenges and opportunities associated with citing, tracking, and sharing biomedical software. When the domain expired, all this information was no longer accessible except through its archived pages. When I discovered the domain had become available, I immediately bought it with the intent of restoring as much of its original content as possible from its 2014 archived pages. Unfortunately, not all content was available, but at least I was able to post the Software Discovery Index Meeting Report along with some comments from interested people.

My background is in science and I worked for a number of years in the field during research for green technologies. I'm glad to see inroads into the green market at such a large online site. But there is still much more that needs to be done.

INTRODUCTION

The National Institutes of Health (NIH) (website: datascience.nih.gov/bd2k), through the Big Data to Knowledge (BD2K) initiative, held a workshop in May 2014 to explore challenges facing the biomedical research community in locating, citing, and reusing biomedical software. The workshop participants examined these issues and prepared this report summarizing their findings.

The constituents with the potential to benefit from improved software discoverability include software users, developers, journal publishers, and funders. Software developers face challenges disseminating their software and measuring its adoption. Software users have difficulty identifying the most appropriate software for their work. Journal publishers lack a consistent way to handle software citations or to ensure reproducibility of published findings. Funding agencies struggle to make informed funding decisions about which software projects to support, while reviewers have a hard time understanding the relevancy and effectiveness of proposed software in the context of data management plans and proposed analysis.

This document summarizes recommendations generated from an NIH Software Discovery Meeting held in May 2014. We are now requesting comments from the larger community. We have contacted a broad set of constituents who represent software users, software developers, NIH staff, electronic repositories, and journal publishers.

Though numerous changes are needed to address all these gaps, the workshop identified one fundamental prerequisite for success: an automated, broadly accessible system enabling comprehensive identification of biomedical software.

The objectives of this “Software Discovery Index” would be:

to assign standard and unambiguous identifiers to reference all software,
to track specific metadata features that describe that software, and
to enable robust querying of all relevant information for users.

If broadly used, this Software Discovery Index will form a cornerstone in a software ecosystem that benefits software developers, software users, journal publishers, and funding agencies.

The workshop attendees agreed that technical resources exist to create both this ecosystem and the needed tools to leverage it. The success of such efforts, however, depends on their acceptance by the scientific community: software developers must obtain identifiers for their software; users must cite software in their publications; journals must leverage and expose these citations; and funding agencies, provided the system is used properly and judiciously, should use this new wealth of information to shape funding decisions and long-term planning. It is only when each constituency sees benefits of engaging in this effort that significant progress can be made.

The ultimate goal of this effort is to ensure all publicly funded biomedical software is highly accessible to the research community. Making software easier to find, easier to cite, and easier to reuse are all necessary steps. It is also critical, however, to support the continued development and availability of software tools. Without access to both the tools and the scientific literature describing their use, the research community will not be able to select and use the best tools. Without tools maintained in common, open-access repositories such as GitHub and SourceForge, improvements to existing tools will be hampered. In all these areas, better support for software can help maximize the impact of NIH’s investment in biomedical research.

A. FRAMEWORK SUPPORTING THE SOFTWARE DISCOVERY INDEX

The workshop identified many potential characteristics and features for an ecosystem in which users can locate, cite, and reuse software. As discussed at the workshop, one prerequisite for such an ecosystem is the use of unique identifiers that are obtained by developers and linked to software wherever it is hosted.

Unique identifiers

Unique identifiers for biomedical software are critical for all that follows. The specific system of identifiers used is of far less importance than the adoption of those identifiers among software developers, software users, and publishers. Even so, the choice of identifiers could make it easier or harder to meet the needs of each of these communities.

The temporally dynamic nature of software development makes unambiguous identification difficult. Individual software packages may have many versions, may be branched along different development paths, and may be bundled into collections with other packages. Identifiers must operate across all of these cases, both disambiguating and linking related tools.

The system of identifiers should also enable the association of metadata to software. The metadata so associated should facilitate the identification of scientifically relevant software packages. Collecting this information as a static catalog or set of web pages runs the significant risk of perpetuating stale metadata. In facing that challenge, the open-source software community has developed multiple ways to capture metadata on projects with minimal duplication. The most common approach is to define a format in which the project metadata can be stored as part of the project itself and then scraped by any interested parties. This means that the software developers only have to provide the metadata once, enabling the Software Discovery Index and other interested parties to scrape and use it. It also ensures that updates by the software developers are reflected in all repositories. In this effort, controlled vocabularies and ontologies may prove to be useful, but should not be the primary focus of the initial effort.
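The "store once, scrape anywhere" approach described above can be illustrated with a minimal Python sketch. It assumes each project keeps a JSON metadata file at its top level; the file name software-metadata.json, the field names, and the function names are hypothetical and are not prescribed by the report.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical file name; the report does not prescribe a format or a name.
METADATA_FILENAME = "software-metadata.json"


def scrape_metadata(projects_root: Path) -> dict[str, dict]:
    """Collect metadata from every project directory that ships a metadata file."""
    index = {}
    for project_dir in sorted(p for p in projects_root.iterdir() if p.is_dir()):
        metadata_path = project_dir / METADATA_FILENAME
        if not metadata_path.is_file():
            continue  # this project has not published metadata
        with metadata_path.open(encoding="utf-8") as handle:
            index[project_dir.name] = json.load(handle)
    return index


def demo() -> dict[str, dict]:
    """Build two throwaway projects, only one with metadata, and scrape them."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "aligner").mkdir()
        (root / "scratch-scripts").mkdir()
        record = {"title": "aligner", "version": "1.2.0", "license": "MIT"}
        (root / "aligner" / METADATA_FILENAME).write_text(
            json.dumps(record), encoding="utf-8"
        )
        return scrape_metadata(root)


if __name__ == "__main__":
    print(demo())
```

Because the file lives with the code, a developer who updates the version or license in one place would see that change picked up by every scraper on its next pass.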

Connections to publishers

There is increasing recognition within the scientific community that recording how software is used is a critical part of the scientific record. The dissemination of scientific results, however performed, must unambiguously describe the software used to generate those results and the steps performed. With publications currently the lingua franca for disseminating biomedical research results, connections with journal publishers will be essential for this effort.

Comprehensively and efficiently tracking the use of software in research requires a new standard for software citations. At present, most software is cited indirectly by citing either a publication or a URL where the software is described. Citing publications leverages the existing publication citation infrastructure, but it only enables citation of software described in publications. Even software described in publications, if actively developed, is likely to cycle through many more released versions than publications. URLs pointing at descriptions of software not only fail to provide a standardized mechanism for tracking versions and metadata, but also frequently break as documentation and source code move. To support reproducibility and archiving, a persistent mechanism for citing software, even software no longer being actively developed, is critical.

A consistent system of unique identifiers for software and an API for querying those identifiers will enable a better system for citing software. When used in publications, these identifiers would make it possible to identify all publications using a particular software tool. Retrieving the citations from publications can be accomplished through direct submission from journals, extraction by MEDLINE, and full-text mining.
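As a rough illustration of the full-text mining route, the Python sketch below builds a reverse index from software identifiers to the publications that mention them. The "SWID:" identifier syntax, the sample texts, and the function name are invented for this example; a real deployment would use whatever identifier scheme is adopted.

```python
import re
from collections import defaultdict

# Hypothetical identifier syntax, invented for this sketch.
IDENTIFIER_PATTERN = re.compile(r"\bSWID:[A-Za-z0-9_]+")


def index_citations(publications: dict[str, str]) -> dict[str, set[str]]:
    """Map each software identifier to the publication IDs that mention it."""
    cited_by = defaultdict(set)
    for publication_id, full_text in publications.items():
        for identifier in IDENTIFIER_PATTERN.findall(full_text):
            cited_by[identifier].add(publication_id)
    return dict(cited_by)


# Made-up publication texts used only to exercise the function.
PUBLICATIONS = {
    "paper-1": "Reads were aligned with ExampleAligner (SWID:EX_0001).",
    "paper-2": "We used ExampleAligner (SWID:EX_0001) and ExampleStats (SWID:EX_0002).",
    "paper-3": "No software identifiers were reported.",
}

CITED_BY = index_citations(PUBLICATIONS)
```

Looking up an identifier in CITED_BY then answers the question posed above: which publications used a particular software tool.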

One major initiative that aims to address this issue is the use of Research Resource Identifiers (RRIDs), currently underway at FORCE11 and led by a partnership between the University of California, San Diego, and Oregon Health & Science University. The RRID project makes it easier to track key research resources within the biomedical literature by ensuring that authors provide unique identifiers (RRIDs) for each resource used to produce the results of a published study. The initial pilot project was launched in February 2014 with over 30 journals agreeing to ask authors during submission to provide RRIDs for antibodies, for genetically modified animals, and for software tools/databases. The project established a centralized portal (http://scicrunch.com/resources) that aggregates accession numbers from authoritative registries for each of these types of resources and enables searching by identifiers.

Digital Object Identifiers (DOIs) represent another broadly used class of identifier. DOIs are widely used in publishing and there are ongoing efforts to leverage their capabilities for software citations. One significant initiative is a collaboration between Mozilla, figshare, GitHub, and Zenodo that allows software developers to easily mint DOIs for their tools.

Ultimately, successful use of unique identifiers for software requires not only a structure, but also social adoption. No tracking system will work unless authors begin properly citing the software used in their research. Funding agencies have set a precedent in driving public access among grantees that may be useful in encouraging adoption. Pilot projects have shown that both journals and authors recognize the need for such a system and are willing to adjust workflows to accommodate it.

Use cases

The combination of unique identifiers for software and the use of those identifiers in publications will enable the creation of a rich dataset of software relevant to the research community. This dataset will be captured through the Software Discovery Index, consolidating data on software packages and their use. The Index would not be a new repository for software code, but rather a resource collecting data from many repositories, publishers, and other sources. This Index, though a highly tangible aspect of the effort, is likely to be highly susceptible to feature creep. It is essential, therefore, to select a subset of relevant features for the next phase of implementation; not all of the features described here are likely to be appropriate for that phase.

One obvious function of the Discovery Index is to index metadata describing software. The metadata selected for inclusion must be carefully considered for relevance to various users. A selection of metadata fields is listed in Appendix 1. Appropriate metadata, identified by a broad community, should be useful for a range of systems both within and beyond NIH.

Aggregating data across multiple sources will help make software from multiple sources more comparable, enabling the calculation of software ratings and utility scores. These scores should depend on a range of criteria, including such factors as citations in the scientific literature, documentation, codebase activity, and user community vitality. Though capturing metrics on these attributes will require a significant development effort, this information has the potential to be of tremendous value to the scientific community. Much additional metadata is likely to be of value to the research community even if it is not currently amenable to quantitative comparisons between different software. Even without quantitative measures of software reliability, there are many reliability indicators that would be worth tracking. Completeness of documentation can be measured at several different levels, including the ability to install the code, or understand the impact of changing run-time variables. The presence of unit or integration tests is critically important, but currently impractical to rigorously measure. Inclusion of benchmarking results can help describe the operation of software. Having metrics such as these available, whether or not they are factored into a reliability score, will help researchers selecting software.
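One simple way such a utility score could be combined from several criteria is a weighted average, sketched below in Python. The report does not define a formula: the weights, the indicator names, and the assumption that each indicator has already been normalized to the range 0 to 1 are all hypothetical choices made for illustration.

```python
# Hypothetical weights; the report names the criteria but not their importance.
WEIGHTS = {
    "citations": 0.4,
    "documentation": 0.2,
    "codebase_activity": 0.2,
    "community_vitality": 0.2,
}


def utility_score(indicators: dict[str, float]) -> float:
    """Weighted average of indicators, each assumed to be normalized to [0, 1]."""
    for name in WEIGHTS:
        value = indicators[name]  # raises KeyError if an indicator is missing
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return sum(WEIGHTS[name] * indicators[name] for name in WEIGHTS)


# Made-up indicator values for a single software package.
EXAMPLE_INDICATORS = {
    "citations": 0.5,
    "documentation": 1.0,
    "codebase_activity": 0.0,
    "community_vitality": 0.5,
}

EXAMPLE_SCORE = utility_score(EXAMPLE_INDICATORS)
```

Keeping the individual indicators alongside the combined score matters here, since the text notes that the underlying measures are useful to researchers whether or not they feed into a single number.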

As a secondary benefit, exposing the results of various measures may encourage developers to invest in documentation, unit tests, benchmarking, and other best practices. With benchmarking in particular, the Index has the potential to simplify benchmarking by providing gold standard datasets against which benchmarks could be run. Ultimately, if the Index captures multiple measures of software utility and quality, it should be possible to offer certification levels for software to signal compliance with various best practices. Such recognition, already being examined, could both help users wishing to find software and encourage software developers to follow those best practices.

Supporting reproducibility is a critical need and an area where the Index is likely to grow significantly with time. While initially citations alone would be unlikely to provide enough information to enable full reproducibility of published results, they do provide a framework for documenting not only the software used, but also the environment and parameters. In time, one of the major contributions of the Index may be in improving the reproducibility of published analysis.

Finally, to maximize the utility of this index, it is critical that its information be exposed via both a website and an API. The website should provide a convenient point of entry that allows various faceted searches and browsing software tools. This website is expected to evolve significantly as this effort matures, but it is important that the first iteration is usable and streamlined. Over the long term, however, the API is likely to be at least as important as the website. It is likely that websites serving specific research communities will wish to provide their own filtered views of the data, and this should be encouraged through an API. Moreover, other resources such as Synapse, GitHub, Zenodo, figshare, SciCrunch/NIF, and others may wish to expose their software to this index, and the API should enable this as well. A thoroughly documented and usable API for both providing data to this aggregator and retrieving data from it will be critical to its long-term success.
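To make the idea of faceted search and community-specific filtered views concrete, here is a small in-memory Python sketch. It is not the Index's API, which did not exist at the time of the report; the record fields, sample records, and function names are assumptions chosen for illustration.

```python
from collections import Counter

# Made-up records standing in for entries returned by an index.
RECORDS = [
    {"title": "ExampleAligner", "license": "MIT", "platform": "Linux"},
    {"title": "ExampleStats", "license": "GPL-3.0", "platform": "Linux"},
    {"title": "ExampleViewer", "license": "MIT", "platform": "Windows"},
]


def search(records: list[dict], **facets: str) -> list[dict]:
    """Return the records whose fields match every requested facet value."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in facets.items())
    ]


def facet_counts(records: list[dict], field: str) -> Counter:
    """Count how many records fall under each value of one facet."""
    return Counter(record[field] for record in records if field in record)


MIT_TOOLS = search(RECORDS, license="MIT")
PLATFORM_COUNTS = facet_counts(RECORDS, "platform")
```

A community website could call a function like search with its own fixed facets to present a filtered view, while facet_counts supplies the per-category totals that faceted browsing interfaces typically display.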

The use cases briefly outlined here provide only a small slice of the likely eventual functionality for a successful Software Discovery Index. As the index and citation patterns expand, the novel datasets generated should enable further uses not yet planned.

Complementarity with the Data Discovery Index

The Data Discovery Index (DDI) is an NIH Big Data to Knowledge (BD2K) project. The DDI will enable investigators to discover, access, and cite biomedical big data. The DDI aims to cut across disciplines and provide an index that will broadly serve all NIH investigators. We expect that the Software Discovery Index will be fully compatible with the DDI, with the goal of allowing DDI and Software Index objects to appear in electronic journal articles, and enabling comprehensive retrieval of both data and the software that is used to analyze or produce these datasets.

B. CHALLENGES AND REMAINING QUESTIONS

The framework proposed above consists of endorsing identifiers, collaborating with publishers, and developing a Software Discovery Index. Each of these tasks carries particular challenges. Some of these challenges must be solved now, while others should be considered and possible solutions proposed in the first iteration of this effort.

Two important early needs are to define the scope of this project and to provide a help desk. A key element of defining the scope of this initial project will be deciding what software should be covered. Though ultimately the system should not limit itself to NIH-supported software, a limited scope could be useful in the early stages. The help desk should help users navigate this system. Software developers, in particular, will need guidance on obtaining unique identifiers for their software and in crafting useful metadata files. Other users are also likely to benefit from assistance locating, citing, and tracking software.

Defining relevant software

Defining relevant software will be a challenge for this effort. Biomedical researchers use a tremendous amount of software that does not need to be captured in this effort. It is relatively clear that no citation is necessary for a text editor, even if its features may have greatly helped a researcher. Likewise, it is relatively clear that the statistical analysis package used to analyze a dataset should be cited. A great deal of biomedical software, however, falls between these two extremes. Many researchers use a few lines of script to store parameters for command line programs. Sometimes, those simple scripts grow and become tools in their own right. In other cases, exploratory tools may have been critical for generating initial hypotheses, but not have been used to generate or analyze the published data. It will be important to balance completeness with navigability. Multiple avenues exist for achieving this and it will be important to select with care the approach for the project.

Integrating with other repositories

Aggregating data from multiple sources, though it opens major opportunities to improve software development and design, also requires integration with multiple repositories. The goal of the Software Discovery Index is not to replace existing software repositories, but rather to pull as much information from them as possible and to present that information in a consistent and useful form. This is similar to the role that PubMed plays for journals – PubMed aggregates the results and provides them in a consistent form, but does not oversee curation or peer review. Similarly, the repositories will likely be the ones that ensure standard metadata and provide some degree of curation. This will require strong relationships with multiple existing repositories and a willingness to work with new repositories that contain relevant software.

Evaluating progress and distinguishing this from other efforts

This system is not the first attempt to create a Software Index for NIH-supported software, and we should learn from prior efforts. For example, NIH dedicated significant support to the BioSiteMaps effort. Numerous researchers have created lists of significant software in their own fields, for example curated projects like the Neuroimaging Tools and Resource Clearinghouse (NITRC). Finally, the RRID project and underlying Neuroscience Information Framework Resource Registry have been broadly populated and are used by a broad community. It will be critical to consider what distinguishes this effort from previous efforts as well as any overlap in order to define metrics for success. Some of the key features of this effort that distinguish it from prior efforts include the automated indexing of software, the integration with multiple registries, and the provision of APIs enabling the creation of community-specific user interfaces.

C. IMPLEMENTATION ROADMAP

Below is a preliminary list of milestones involved in implementing a Software Discovery Index:

Define a checklist for the Minimal information about software (see Appendix 1).
Develop and implement methods to assign unique identifiers to software systems, leveraging existing approaches where possible.
Establish and maintain an API for searching, browsing, entering data, and interacting with journals.
Establish and maintain a facile and streamlined website with search and browsing capability for software tools (see Appendix 2 for a selection of use cases).
Partner with journal editors to implement the selected unique identifiers in electronic publications. At a minimum this would require identifying relevant journals, developing file formats and APIs for data exchange, and developing documentation for authors.
Establish and maintain an Advisory Working Group of international members of the user community, software developers, software repositories, other relevant electronic repositories, and journals.
Engage effectively with the Data Discovery Index.
Implement performance metrics for the Software Discovery Index (see Appendix 3). Summary results should be made available publicly while detailed results should be made available to funders and advisory groups.
Promote the Software Discovery Index in journal editorials, conferences, social media, and scientific publications.

D. CONCLUSIONS

The Software Discovery Index Meeting of May 12-13, 2014 was a valuable forum for many important discussions. Participants agreed that, with the unprecedented abundance of electronically encoded information such as ’omic data, imaging, and EHR, the software required to manage and understand this data has also become increasingly critical to biomedical research. Properly documenting this software is critical.

Meeting attendees also agreed that software is no longer incidental to the data; the systems used to produce and analyze raw data must be indexed to support analysis reproducibility. Due diligence towards reviewing existing projects, including the RRID project, will provide valuable insights into how this system should operate. Assigning universal locators to software enables significant improvements in the processes for finding, citing, and reusing software. This workshop proposed the use of unique identifiers for software packages, the formation of collaborations with publishers to track software used in publications, and the creation of a Software Discovery Index to provide information on software packages. If successful, an implementation of these three efforts would benefit software developers, software users, journal publishers, and funding agencies.

E. APPENDIXES

Appendix 1: Minimal information about software (MIAS)

A common set of metadata fields is critical for useful indexing. If this effort only provides refined free-text searching capabilities, it will not be a major improvement over currently available resources. It is necessary, therefore, to define a key set of minimal fields that can provide maximum value. At the workshop, the following fields were described as candidates for inclusion in this list:

Persistent identifier
Software title
Software version
Software license
Links to code repository
Human-readable synopsis
Author names and affiliations
Terms to describe software objectives or functions, and/or the following two bullets (controlled by an appropriate ontology)
Formats for data inputs and outputs
Platform, environment, and dependencies
Associated grants and publications
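A checklist like this lends itself to a simple automated completeness check. The Python sketch below reports which candidate fields are absent or empty in a record; the key names are hypothetical spellings of the fields listed above, and treating every candidate field as required is an assumption made only for the example.

```python
# Hypothetical key names for the candidate fields listed above.
CANDIDATE_FIELDS = (
    "persistent_identifier",
    "title",
    "version",
    "license",
    "code_repository",
    "synopsis",
    "authors",
    "functions",
    "data_formats",
    "platform",
    "grants_and_publications",
)


def missing_fields(record: dict) -> list[str]:
    """Return the candidate fields that are absent or empty in a record."""
    return [field for field in CANDIDATE_FIELDS if not record.get(field)]


# Made-up, deliberately incomplete record.
PARTIAL_RECORD = {
    "persistent_identifier": "example-id-0001",
    "title": "ExampleAligner",
    "version": "1.2.0",
    "license": "MIT",
    "code_repository": "https://example.org/example-aligner",
    "synopsis": "",
}

MISSING = missing_fields(PARTIAL_RECORD)
```

A help desk or submission form could show a list like MISSING to developers so that they know which fields still need attention before a record is indexed.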

Appendix 2: Use cases

Developer: A developer registers her software. She is then able to track and quantify all use of her software in scientific publications, through comprehensive and accurate citation of the index-associated identifier. With the ability to find similar types of software packages (e.g., other assembly programs), she would also identify benchmarking data sets and other related software development efforts.
User: An NIH-funded researcher is seeking software for analysis. They are able to identify the most appropriate software for their study, data, computer systems, and objectives, and are provided with all information necessary to locate, obtain, and deploy the software.
NIH: A program officer can identify both the creation and the use of all software funded by a grant they have awarded, analogous to how they can track all papers and citations to those papers funded by a grant they have awarded. They can also identify similar or overlapping products. Review panels can assess software choices in funding proposals and data management plans.
Publisher: A publisher can associate software with their publications during and for peer review, and upon publication for citation. They can also pull and display metrics related to all the research objects surrounding the article, including software, based on the software identifier.

Appendix 3: Metrics and milestones

It is critical to define metrics for this effort. These metrics should be evaluated both in absolute terms and in relative terms, monitoring the growth with time. These metrics are particularly significant because this is not the first effort to make biomedical software more accessible to researchers. This effort will face many of the same challenges faced by previous efforts and it is critical to closely monitor whether it is accomplishing its purpose. Specific metrics proposed for the initial effort included:

Number of developers contributing software
Number of software records created
Software identifiers appearing in and extracted from publications
Links from publications to software records
Links between indexed software and other resources, people, and data
Annotation of existing collections of software packages (e.g., Bioconductor)
The number of interoperating resources, including repositories, aggregation resources, and user forums
The use of the APIs to re-package the data for specific use cases
The proportion of NIH-supported software tracked by the software discovery index

Tracking these metrics will provide insight into the progress of this effort. Progress on these metrics should, wherever possible, be evaluated against milestones. Specific milestones could include the fraction of NIH-supported software included in the first year, the time for machine-actionable links to software in PubMed, and the time for API establishment.
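The absolute and relative views of a metric mentioned above reduce to simple arithmetic. The Python sketch below shows one possible way to compute coverage (the proportion of software tracked) and period-over-period growth; the function names and the sample counts are made up for illustration.

```python
def coverage(tracked: int, total: int) -> float:
    """Fraction of known software packages that the index tracks."""
    if total <= 0:
        raise ValueError("total must be positive")
    return tracked / total


def relative_growth(previous: int, current: int) -> float:
    """Change in a count between two periods, relative to the earlier period."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous


# Made-up counts: 250 of 1,000 packages tracked, up from 200 a year earlier.
EXAMPLE_COVERAGE = coverage(250, 1000)
EXAMPLE_GROWTH = relative_growth(200, 250)
```

Reporting both numbers side by side distinguishes an index that is large but stagnant from one that is small but growing quickly.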

Appendix 4: Existing software indexes

There are numerous existing software indexes serving specific communities, many of them unrelated to biomedical research. Some of the challenges that this effort will face have also been addressed by these indexes.

There are existing package management systems, notably RPM and dpkg for Linux. These tools facilitate the installation, upgrading, and uninstallation of software packages. Both systems have ways to unambiguously track software packages and ways to aggregate data on those packages, significant requirements articulated at this workshop. It is also interesting to note that these are both low-level tools and that users typically interact with them via higher level interfaces. This sort of modular model fits well with what was described at the workshop, with its focus on providing an extendable framework that others can leverage. This model differs from that famously employed in the Apple App Store and Google Play, where the software is directly hosted and managed on the index itself. The index, as described here, would be a lower-level construct that supports various package management functions but does not itself perform those functions.

SciCrunch/NIF/RRID: The Neuroscience Information Framework (http://neuinfo.org) is a project of the NIH Blueprint Consortium that has been surveying, cataloging and federating data and resources (tools, materials, services) of relevance to neuroscience since 2008. It has maintained and populated the NIF Registry, a high level metadata catalog of research resources, currently comprising over 11,000 resources, and tracked them over the past 6 years. Through its unique data ingestion and query platform, it has created a search engine for data that searches across over 200 independent databases comprising over 800 million records. Although NIF was developed for neuroscience, it has expanded well beyond primary neuroscience resources to biomedical resources as a whole. Thus, the software sitting behind NIF was rechristened SciCrunch. SciCrunch allows different communities, e.g., NIF, to use the same infrastructure and data sources to create their own communities. The SciCrunch Registry provides the unique identifiers for software tools and databases for the Resource Identification Initiative (RRID project; http://scicrunch.com/resources). SciCrunch aggregates software tools from multiple repositories, e.g., NITRC. It utilizes authoritative identifiers where possible, and assigns an identifier when the source repository does not.

Participation in the RRID project was voluntary, i.e., not a condition of publication, and was requested by the journals through an email request to the author. The project deliberately did not require journals to modify their journal submission system in order to allow broad participation in the project. To date, ~50 papers have appeared from 11 different journals that use RRIDs. Over 200 RRIDs have been reported. The FORCE11 working group is collecting data regarding the use of RRIDs in the literature and is making it freely available.

The error rate to date is ~7%. Papers using RRIDs can be retrieved from Google Scholar by searching for a particular ID (Figure 2). A resolving service has also been developed so that third-party tools can use the RRIDs to link to a resolvable record as well as to map identifiers where needed. Automated routines based on NLP are being developed to recognize RRIDs and to suggest appropriate RRIDs based on the resources described. Currently, RRIDs are only assigned at the level of the software tool or data resource; that is, they do not specify versioning information. This was a calculated decision, as the primary objective of the RRID pilot was to determine if unique identification of software and other resources would be achievable by publishers and authors. In the future, the RRID project will aim to include more detailed and machine-actionable information as per outcomes of these and other related community discussions.

Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC): Since 2006, the Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC) has provided a comprehensive support infrastructure for resources, including software, in the neuroimaging domain (including MRI, PET, EEG, MEG, SPECT, CT and optical neuroimaging tools and resources). NITRC fosters a user-friendly clearinghouse environment for the neuroimaging informatics community. NITRC’s goal is to support researchers dedicated to enhancing, adopting, distributing, and contributing to the evolution of previously funded neuroimaging analysis tools and resources for broader community use. Located at www.nitrc.org, NITRC promotes software tools, workflows, resources, vocabularies, test data, and now, pre-processed, community-generated data sets (1000 Functional Connectomes, ADHD-200) through its Image Repository (NITRC-IR). NITRC gives researchers greater and more efficient access to the tools and resources they need, including: better categorizing and organizing existing tools and resources via a controlled vocabulary; facilitating interactions between researchers and developers through forums, direct email contact, ratings and reviews; and promoting better use through enhanced documentation.

nanoHUB.org: Starting in 2002, the NSF-sponsored Network for Computational Nanotechnology established a web site at nanoHUB.org to support the National Nanotechnology Initiative. Any user within the community can contribute a simulation/modeling or analysis tool to this platform. Tools are not only cataloged, but hosted, so that any user can run the tool through the web via the click of a button–without having to download or install any software. In 2013, more than 13,100 users launched some 500,000 simulation jobs using more than 340 different simulators contributed by the community and installed on nanoHUB. These tools have been used by 22,649 students across 1,165 courses at 185 institutions. nanoHUB also hosts more than 4,000 other resources—including seminars, tutorials, animations, and even complete courses—that help to document the tools and educate new users. In the last 12 months alone, nanoHUB served more than 300,000 unique users with this content, and that number has been doubling every 18 months. In June 2011, the National Science and Technology Council’s Materials Genome Initiative for Global Competitiveness highlighted nanoHUB as an exemplar of “open innovation” that is critical for global competitiveness. The HUBzero software that powers nanoHUB.org is available as open source, and more than 60 other projects have used the same software to create similar “hubs” for their own scientific community.

****

Comments

• Martin Fenner says:

October 7, 2014 at 7:57 pm

The workshop and the report are very timely, as the scholarly community is working on numerous initiatives to make scientific software more discoverable and to give software authors due credit. The importance of persistent identifiers and a central searchable index with metadata can’t be stressed enough, and I applaud this group for this activity. As always, the devil is in the details, and I have a number of comments/concerns regarding the report.

Persistent identifiers are a social contract. For them to work you need a set of agreed principles between the organization issuing the identifiers and the person or organization using the identifier with the software. For journal articles (publishers) and datasets (data centers) these roles are clear, but it is not clear to me from the report who would do that for software. A software developer, or a code repository (at least popular commercial repositories such as SourceForge or GitHub), is the wrong partner for this, as they have other priorities than thinking about how to keep the software available for the next 20 years or more.

One possible approach is that data repositories such as figshare or Zenodo take that role, building on the work they have already done. Or we see more specialized software repositories evolve that cater to those needs and with additional features compared to traditional data repositories. I see one of the biggest gaps right now in repositories for long-term preservation of software, using an intelligent integration with code repositories. Without this partner with a long-term perspective persistent identifiers don’t work.

Another consideration is duplication of effort. I would argue that it would be a bad idea to invent a new persistent identifier just for software. I personally think that DOIs are perfect for this, and I wish the report used stronger language to support DOIs for this activity. (DataCite and CrossRef) DOIs have required metadata (e.g. version, license, contributors, related identifiers) stored in a central, searchable metadata store, and in particular DataCite has considered software in its metadata schema. DOIs also work with a lot of existing infrastructure; the extra effort to include special considerations for software would be much smaller than building a new infrastructure for persistent identifiers for software.

Lastly, software should be cited the same way as books, articles and data, with the citation appearing in the references list (the citation to the software, not only the citation to a paper describing the software). This is not only something that authors, readers, publishers are familiar with, but this makes it much easier to actually find these software citations. Not only are software citations within the body text very difficult to find if the content is not open access, but the variations in formatting (software name only or also link, in many different places of the manuscript, etc.) make it very hard for automated tools to find these software citations.

Something that DataCite is not providing is a software citation index, and a service built on DOIs should be developed to fill that void. This service should also cooperate with CrossRef, as we hopefully will see an increasing number of software citations in the reference lists of scholarly papers. These software citations should not be extracted only from journal articles, but wherever they happen, most importantly other software packages and in association with datasets. Another important activity that is missing, and is mentioned in the report, is a listing of appropriate software repositories (both code repositories and repositories for long-term preservation). Databib and re3data, together with DataCite, are doing this for data repositories, and a software repository discovery index can either be a part of those services, or a separate activity.

• Vijayaraj Nagarajan says:

October 8, 2014 at 12:59 pm

Excellent point about the potential problems with relying on a third-party repository.

NCBI has the ability and is already successfully performing archiving of genome-scale data. If not for all of the software, why not create a small facility under the hood of NCBI to archive the publicly funded software?

Such a system could be more robust, with easy integration into other data components of NCBI. This would also leverage the expertise that NCBI has acquired all these years in creating an excellent archiving ecosystem. I don’t think infrastructure resources would be a concern for such an in-house repository, considering its scope and size in relation to the existing archives of NCBI.

• Steven Salzberg says:

October 8, 2014 at 9:09 pm

This is a good suggestion: NCBI is already set up to handle repositories, and it would be a simple matter for them to create a small software repository. I see no need for anything more elaborate, if we even need this at all. Most good software gets published and we have papers to cite already.

• Istvan Albert says:

October 7, 2014 at 7:59 pm

What seems to be missing from the MIAS fields is any mention of minimal user support. Shouldn’t there be some requirement that ensures that there are avenues available for assisting someone having problems or needing advice?

This is why Biostars exists (https://www.biostars.org/), and the hundreds of thousands of visitors that we get monthly indicate that the problem of software support is just as important as that of categorizing it properly.

• W. Trevor King says:

October 7, 2014 at 9:26 pm

I think encouraging researchers to cite software they used to conduct their research and analysis is important, but I’m happy leaving the threshold up to the researchers (e.g. not listing their text editor, but listing their experiment-controls and stats packages with version numbers). It also makes sense to require publicly funded software to be archived somewhere to ensure it will be accessible in the future. Beyond that, I don’t see much need for additional tooling. Won’t users be able to find software by looking through the papers written by folks doing similar research or through a generic search engine?

I don’t see how any of this will help “Review panels can assess software choices in funding proposals and data management plans.” Assessing software seems much more complicated and subjective than something you can condense into some indexed metadata. Is there an active community using and developing this software? How responsive are the maintainers to bug reports and patch submissions? I think assessing that sort of thing needs a more organic touch.

• Vijayaraj Nagarajan says:

October 8, 2014 at 3:05 pm

This is why it might be of great help to create our own first-of-its-kind in-house biomedical software repository. This would enable us to add that “organic touch”, as is being done in the NCBI GEO repository and indexing. NCBI does an amazing job of giving this organic touch to all of the submissions made to this repository, which has made GEO an overwhelming success.

More Background On SoftwareDiscoveryIndex.org

SoftwareDiscoveryIndex.org represents a historically important effort to strengthen the reproducibility, discoverability, and long-term sustainability of scientific software—particularly within the biomedical research ecosystem. Although the domain today functions mainly as an archival repository rather than a fully operational index, its origins lie in a major National Institutes of Health (NIH) initiative under the Big Data to Knowledge (BD2K) program. The site preserves the Software Discovery Index Meeting Report, a foundational document outlining the strategy, goals, and challenges involved in establishing a universal, metadata-rich, fully queryable index of biomedical software.

This article describes SoftwareDiscoveryIndex.org: its origins, ownership context, goals, use cases, audience, expected impact, historical role, and cultural significance in scientific research. It also examines the lasting influence of the Software Discovery Index concept on today’s discussions about reproducibility, software citation, FAIR principles, and the infrastructure needed to support scientific innovation.

Origins of SoftwareDiscoveryIndex.org

The website was created to accompany a workshop convened by the NIH in May 2014, during which research leaders, journal editors, software developers, funders, and data-infrastructure experts examined a central challenge: the scientific community lacked a universal, systematic method for locating, citing, evaluating, and reusing biomedical software.

As documented in the report, the workshop highlighted a structural problem that had been growing for decades: scientific software had become central to data analysis and discovery, yet it was rarely cited consistently and almost never indexed in a unified, machine-readable way. Without an index, researchers struggled to locate appropriate tools, developers struggled to gain visibility, and funders lacked metrics to assess which software projects were thriving or fading.

SoftwareDiscoveryIndex.org was launched to share the workshop’s findings, propose a roadmap for action, and invite feedback from the broader community. After the original domain expired, an independent individual recovered the archived materials and restored the site, so the report remains accessible.

Ownership and Restoration Effort

While the original concept was developed under the NIH BD2K initiative, the current domain ownership is private. After the domain expired, a science professional with a background in green technology and research discovered it was available and purchased it with the explicit purpose of restoring the original 2014 content as completely as possible.

Although not all pages could be recovered, the site now preserves:

The NIH Software Discovery Index Meeting Report
The workshop’s recommendations
Community comments and responses
Historical discussions about identifiers, metadata, publishers, and indexing strategies

The restoration ensures that this important piece of biomedical software-infrastructure history remains findable—a fitting outcome for a site whose mission was to improve discoverability.

Goals of the Software Discovery Index Initiative

The workshop and resulting report identify several primary goals that the Software Discovery Index should fulfill:

1. Assign Unique, Persistent Identifiers to All Biomedical Software

Researchers need a standard, unambiguous reference for software—similar to DOIs for publications. Identifiers must:

Distinguish between versions
Persist even if software is moved or archived
Connect to metadata, publications, and repositories
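
As a rough illustration of these three requirements, the sketch below models an identifier service as an in-memory registry. It is not taken from the report: the class names, the `sdi:` identifier scheme, and the example URLs are all hypothetical, and a real service would more likely build on an existing identifier system.

```python
from dataclasses import dataclass, field


@dataclass
class IdentifierRecord:
    """One persistent identifier for one specific software version."""
    identifier: str   # hypothetical identifier string, not a real DOI
    name: str
    version: str
    location: str     # current download or landing-page URL
    related: list = field(default_factory=list)  # linked papers, repositories


class IdentifierRegistry:
    """In-memory stand-in for an identifier service (illustrative only)."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        # Each version gets its own identifier, so re-use is an error.
        if record.identifier in self._records:
            raise ValueError(f"identifier already assigned: {record.identifier}")
        self._records[record.identifier] = record

    def relocate(self, identifier, new_location):
        # The identifier stays fixed; only the location it resolves to changes.
        self._records[identifier].location = new_location

    def resolve(self, identifier):
        return self._records[identifier]


registry = IdentifierRegistry()
registry.register(IdentifierRecord(
    "sdi:example-aligner/1.0", "example-aligner", "1.0",
    "https://old.example.org/aligner-1.0.tar.gz"))
registry.register(IdentifierRecord(
    "sdi:example-aligner/2.0", "example-aligner", "2.0",
    "https://old.example.org/aligner-2.0.tar.gz",
    related=["paper:example-2014"]))

# Version 1.0 is moved to an archive; citations of its identifier still resolve.
registry.relocate("sdi:example-aligner/1.0",
                  "https://archive.example.org/aligner-1.0.tar.gz")
print(registry.resolve("sdi:example-aligner/1.0").location)
```

Giving each version its own identifier keeps a citation tied to the exact code that was run, while `relocate` shows why a citation should point at the identifier rather than at a URL.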

2. Collect and Maintain a Unified Set of Software Metadata

Metadata enables software to be searchable and comparable. Key fields include:

Title, version, license
Authors and affiliations
Dependencies and environment
Input/output formats
Links to code repositories
Related grants and publications
Purpose, objectives, and domain
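
One possible way to picture such a record is as a small typed structure that serializes to JSON. The field names below are assumptions chosen to mirror the list above, not a schema defined by the report, and the sample values are invented.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SoftwareMetadata:
    """Hypothetical metadata record; field names are illustrative only."""
    title: str
    version: str
    license: str
    authors: list      # e.g. [{"name": ..., "affiliation": ...}]
    purpose: str
    dependencies: list = field(default_factory=list)
    input_formats: list = field(default_factory=list)
    output_formats: list = field(default_factory=list)
    repository_url: str = ""
    grants: list = field(default_factory=list)
    publications: list = field(default_factory=list)


record = SoftwareMetadata(
    title="example-aligner",
    version="2.0",
    license="MIT",
    authors=[{"name": "A. Researcher", "affiliation": "Example University"}],
    purpose="Align short sequencing reads to a reference genome",
    dependencies=["python>=3.8"],
    input_formats=["FASTQ"],
    output_formats=["SAM"],
    repository_url="https://code.example.org/example-aligner",
)

# A plain JSON document is easy for an index to store, exchange, and search.
record_json = json.dumps(asdict(record), indent=2, sort_keys=True)
print(record_json)
```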

3. Enable Researchers to Query Software and Metadata Easily

The Index would function not as a code repository, but as a central aggregator that draws metadata from multiple locations—GitHub, SourceForge, institutional repos, journals, and NIH databases. A robust API would allow community-specific interfaces and integration with other platforms.
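
The aggregator idea can be sketched in a few lines. The example below merges records from several made-up sources into one index and runs a keyword query over it; the source names, record fields, and matching rule (name plus version) are assumptions for illustration, and a real aggregator would harvest from each platform’s own interface rather than from in-memory lists.

```python
def aggregate(sources):
    """Merge records from several sources, keyed by (lowercased name, version)."""
    index = {}
    for source_name, records in sources.items():
        for rec in records:
            key = (rec["name"].lower(), rec["version"])
            entry = index.setdefault(
                key,
                {"name": rec["name"], "version": rec["version"], "found_in": []},
            )
            entry["found_in"].append(source_name)
            # Keep the first non-empty value seen for every other field.
            for field_name, value in rec.items():
                if field_name not in entry and value:
                    entry[field_name] = value
    return index


def search(index, keyword):
    """Return entries whose name or description contains the keyword."""
    keyword = keyword.lower()
    matches = [
        entry for entry in index.values()
        if keyword in entry["name"].lower()
        or keyword in entry.get("description", "").lower()
    ]
    return sorted(matches, key=lambda e: (e["name"].lower(), e["version"]))


# Hypothetical harvested records; only metadata is stored, never the code itself.
sources = {
    "code_host": [
        {"name": "example-aligner", "version": "2.0",
         "repository": "https://code.example.org/example-aligner"},
    ],
    "journal": [
        {"name": "Example-Aligner", "version": "2.0",
         "description": "Sequence alignment tool",
         "publication": "paper:example-2014"},
    ],
    "institutional_repo": [
        {"name": "example-viewer", "version": "0.3",
         "description": "Image viewer"},
    ],
}

index = aggregate(sources)
hits = search(index, "alignment")
for hit in hits:
    print(hit["name"], hit["version"], hit["found_in"])
```

Here the code host and the journal each hold part of the picture for the same tool, and the merged entry carries both the repository link and the publication.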

4. Support Reproducibility in Published Research

By connecting tools to publications through persistent identifiers, the Index would create a traceable link showing exactly which tools (and versions) were used in an experiment or analysis.
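
A minimal sketch of such a traceable link, assuming each usage is recorded as a (publication, software identifier, version) triple. All identifiers below are made up, and the report does not prescribe this representation.

```python
from collections import defaultdict

# (publication id, software id, version) triples; every identifier is invented.
usage_links = [
    ("paper:A", "sdi:example-aligner", "1.0"),
    ("paper:B", "sdi:example-aligner", "2.0"),
    ("paper:B", "sdi:example-stats", "0.9"),
]


def tools_used_by(links):
    """Map each publication to the exact (tool, version) pairs it used."""
    by_paper = defaultdict(list)
    for paper, tool, version in links:
        by_paper[paper].append((tool, version))
    return dict(by_paper)


def papers_using(links, tool, version=None):
    """List publications that used a tool, optionally one specific version."""
    return sorted({
        paper for paper, used_tool, used_version in links
        if used_tool == tool and (version is None or used_version == version)
    })


print(tools_used_by(usage_links)["paper:B"])
print(papers_using(usage_links, "sdi:example-aligner"))
print(papers_using(usage_links, "sdi:example-aligner", version="1.0"))
```

The same links answer both questions: what a given paper ran, and which papers would be affected if a bug were found in a given version.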

5. Provide Insight for Funders and Publishers

Funding agencies could assess:

Which tools have been widely adopted
Which projects need support
How software impacts scientific output

Publishers could incorporate software citation seamlessly into editorial workflows.
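
To make the funder-side assessment concrete, the sketch below counts distinct citing papers per tool and lists the result alongside the grant that funded each tool. The citation data, tool names, and grant numbers are invented, and counting citing papers is only one crude proxy for adoption.

```python
from collections import Counter

# (citing paper, software name) pairs; a paper may mention a tool more than once.
citations = [
    ("paper:A", "example-aligner"),
    ("paper:B", "example-aligner"),
    ("paper:B", "example-aligner"),
    ("paper:C", "example-stats"),
]

# Hypothetical mapping from funded tools to grant numbers.
funded_by = {
    "example-aligner": "GRANT-001",
    "example-stats": "GRANT-002",
    "example-viewer": "GRANT-002",
}


def adoption_report(citations, funded_by):
    """Return (tool, grant, distinct citing papers), most cited first."""
    distinct = set(citations)  # count each paper once per tool
    counts = Counter(tool for _, tool in distinct)
    ordered = sorted(funded_by, key=lambda tool: (-counts[tool], tool))
    return [(tool, funded_by[tool], counts[tool]) for tool in ordered]


report = adoption_report(citations, funded_by)
for tool, grant, n_papers in report:
    print(f"{tool:16} {grant}  cited by {n_papers} paper(s)")
```

A funded tool with no recorded citations, such as `example-viewer` here, still appears in the report, which is the kind of signal a program officer might want to follow up on.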

Software Discoverability Challenges Highlighted by the Workshop

The report outlines numerous obstacles that have long hindered software discoverability:

Ambiguous or Missing Citations

Scientists often cite a paper about software instead of citing the software itself—a problem when versions evolve or code moves to new locations.

Broken URLs

Links to documentation or download pages frequently die as software is reorganized.

Lack of Standard Research Metadata

Without consistent metadata, comparing tools becomes nearly impossible.

Dynamic Nature of Software

Software evolves rapidly, branching into versions, patches, forks, and bundled suites.

Fragmented Repositories

No single repository captures all biomedical software. Instead, code lives across platforms:

GitHub
SourceForge
Institutional archives
Personal websites
Journal supplementary material

Insufficient Community Adoption

Even when infrastructure exists, researchers must use it for it to succeed. The report underscores that widespread adoption across journals, funders, and developers is essential.

Minimal Information About Software (MIAS)

A major contribution of the report is the proposed Minimal Information About Software (MIAS) checklist. This minimal metadata set is critical to ensure tools remain findable and reusable. Fields proposed include:

Persistent identifier
Title and version
License
Synopsis
Author details
Input/output specifications
Environment and dependencies
Publications and grants associated with the software

These fields form the intellectual backbone of a standardized software registry.
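
A checklist of this kind lends itself to a simple completeness check at submission time. The sketch below is one hypothetical way to do that; the key names are assumptions mapped from the fields listed above, and treating any empty value as missing is a simplification (a tool with no associated publication yet would be flagged).

```python
# Hypothetical key names, one per checklist field listed above.
MIAS_FIELDS = (
    "identifier",
    "title",
    "version",
    "license",
    "synopsis",
    "authors",
    "inputs_outputs",
    "environment",
    "publications_grants",
)


def missing_fields(record):
    """Return checklist fields that are absent or empty in a record."""
    return [name for name in MIAS_FIELDS if not record.get(name)]


submission = {
    "identifier": "sdi:example-aligner/2.0",
    "title": "example-aligner",
    "version": "2.0",
    "license": "MIT",
    "synopsis": "Aligns short sequencing reads to a reference genome.",
    "authors": ["A. Researcher"],
    "inputs_outputs": {"input": ["FASTQ"], "output": ["SAM"]},
    "environment": "",  # left blank by the submitter
}

gaps = missing_fields(submission)
print("Missing:", gaps)
```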

Use Cases and Who Benefits from the Index

The site describes how each constituency in the research ecosystem benefits from a successful Software Discovery Index.

Developers

Gain measurable metrics of software adoption
Receive more credit in publications
Access benchmarking datasets
Discover competing or complementary tools

Researchers and End-Users

Quickly find software appropriate to their data, goals, and computing environment
Compare tools using metadata
Retrieve installation instructions, licensing terms, and documentation

NIH Program Officers and Funders

Track software funded by their programs
Assess usage, community impact, and redundancy across grants

Publishers

Incorporate identifiers into submission and review workflows
Enhance reproducibility through standardized citations

Repositories

Integrate with the Index through APIs
Receive unified traffic from a central search portal

Collectively, these use cases illustrate why software discoverability forms a cornerstone of modern scientific infrastructure.

Relationship to the BD2K Data Discovery Index

A major theme in the report is the need for compatibility between the Software Discovery Index and the NIH’s Data Discovery Index. Because software and data are deeply intertwined, the SDI would ensure that:

Published datasets can be linked to the exact tools used to analyze them
Researchers can trace analysis pipelines
Journals can surface both data and software identifiers within articles

This “dual-index” approach supports a more cohesive scientific record.

Cultural and Scientific Significance

The Software Discovery Index proposal arrived at a pivotal moment in modern science. By 2014, biomedical research had entered an era of massive datasets—genomics, imaging, proteomics, and electronic health records. The scientific community increasingly recognized software not merely as a convenience, but as:

A critical research object
A source of scientific truth
A key contributor to reproducibility

The SDI report helped accelerate movements such as:

FAIR principles (Findable, Accessible, Interoperable, Reusable) for software
Software citation standards within major journals
Long-term preservation efforts in institutional repositories
Metadata and identifier frameworks, including DOIs for software

While the Index itself did not become a live global service, the site stands today as a historical blueprint that influenced many subsequent initiatives.

Limitations and Remaining Questions

The report openly acknowledges several unsolved issues:

Who should manage software identifiers long-term?
How should the Index avoid duplication with other registries?
What qualifies as relevant software?
How will the scientific community adopt new citation practices?
What metrics should evaluate the Index’s success?

These questions remain relevant today across disciplines.

Why SoftwareDiscoveryIndex.org Remains Important Today

Even if the Index itself was not fully implemented, the principles behind it continue to shape:

Research software registries
Journal citation requirements
Metadata standards
Funding agency expectations

The site is a time capsule of a moment when the scientific world confronted the reality that reproducibility depends on treating software as a primary, citable research product.
