diff --git a/_metadata.yml b/_metadata.yml index dbb37dc..caf846b 100644 --- a/_metadata.yml +++ b/_metadata.yml @@ -9,12 +9,18 @@ authors: - name: Vani Mandava affiliation: University of Washington roles: writing - corresponding: true orcid: 0000-0003-3592-9453 - name: Nicoleta Cristea affiliation: University of Washington roles: writing - corresponding: true orcid: 0000-0002-9091-0280 + - name: Anshul Tambay + affiliation: University of Washington + roles: writing + orcid: 0009-0004-9010-1223 + - name: Andrew J. Connolly + affiliation: University of Washington + roles: writing + orcid: 0000-0001-5576-8189 bibliography: references.bib diff --git a/references.bib b/references.bib index 6a222a3..dfc4eb2 100644 --- a/references.bib +++ b/references.bib @@ -1,3 +1,172 @@ +@MISC{Van-Tuyl2023-vp, + title = "Hiring, managing, and retaining data scientists and Research + Software Engineers in academia: A career guidebook from {ADSA} + and {US}-{RSE}", + author = "Van Tuyl, Steve (ed )", + doi = {https://doi.org/10.5281/zenodo.8329337}, + url = {https://zenodo.org/records/8329337}, + abstract = "The importance of data, software, and computation has long been + recognized in academia and is reflected in the recent rise of job + opportunities for data scientists and research software + engineers. Big data, for example, created a wave of novel job + descriptions before the term Data Scientist (DS) was widely used. + And even though software has become a major driver for research + (Nangia and Katz, 2017), Research Software Engineer (RSE) as a + formal role has lagged behind in terms of job openings, + recognition, and prominence within the community. Despite their + importance in the academic research ecosystem, the value of DS + and RSE roles is not yet widely understood or appreciated in the + academic community, and research data, software, and workflows + are, in many domains, still regarded as by-products of research. 
+ Data Scientists and Research Software Engineers (DS/RSEs) face + similar challenges when it comes to career paths in academia - + both are non-traditional academic professions with few incentives + and a lack of clear career trajectories. This guidebook presents + the challenges and suggestions for solutions to improve the + situation and to reach a wide community of stakeholders needed to + advance career paths for DS/RSEs.", + year = 2023, + keywords = "data science; research software engineering; career guidebook" +} + + +@ARTICLE{Adler-Milstein2017-id, + title = "Information blocking: Is it occurring and what policy strategies + can address it?", + author = "Adler-Milstein, Julia and Pfeifer, Eric", + journal = "Milbank Q.", + publisher = "John Wiley \& Sons, Ltd", + volume = 95, + number = 1, + pages = "117--135", + abstract = "Policy Points: Congress has expressed concern about electronic + health record (EHR) vendors and health care providers knowingly + interfering with the electronic exchange of patient health + informatio...", + month = mar, + year = 2017, + keywords = "electronic health records; health policy; incentives; + interoperability", + language = "en" +} + +@ARTICLE{Barker2024-ox, + title = "A national survey of digital health company experiences with + electronic health record application programming interfaces", + author = "Barker, Wesley and Maisel, Natalya and Strawley, Catherine E and + Israelit, Grace K and Adler-Milstein, Julia and Rosner, Benjamin", + journal = "J. Am. Med. Inform. Assoc.", + publisher = "Oxford Academic", + volume = 31, + number = 4, + pages = "866--874", + abstract = "OBJECTIVES: This study sought to capture current digital health + company experiences integrating with electronic health records + (EHRs), given new federally regulated standards-based application + programming interface (API) policies. 
MATERIALS AND METHODS: We + developed and fielded a survey among companies that develop + solutions enabling human interaction with an EHR API. The survey + was developed by the University of California San Francisco in + collaboration with the Office of the National Coordinator for + Health Information Technology, the California Health Care + Foundation, and ScaleHealth. The instrument contained questions + pertaining to experiences with API integrations, barriers faced + during API integrations, and API-relevant policy efforts. + RESULTS: About 73\% of companies reported current or previous use + of a standards-based EHR API in production. About 57\% of + respondents indicated using both standards-based and proprietary + APIs to integrate with an EHR, and 24\% worked about equally with + both APIs. Most companies reported use of the Fast Healthcare + Interoperability Resources standard. Companies reported that + standards-based APIs required on average less burden than + proprietary APIs to establish and maintain. However, companies + face barriers to adopting standards-based APIs, including high + fees, lack of realistic clinical testing data, and lack of data + elements of interest or value. DISCUSSION: The industry is moving + toward the use of standardized APIs to streamline data exchange, + with a majority of digital health companies using standards-based + APIs to integrate with EHRs. However, barriers persist. + CONCLUSION: A large portion of digital health companies use + standards-based APIs to interoperate with EHRs. 
Continuing to + improve the resources for digital health companies to find, test, + connect, and use these APIs ``without special effort'' will be + crucial to ensure future technology robustness and durability.", + month = apr, + year = 2024, + keywords = "application programming interface; digital health; electronic + health record; industry", + language = "en" +} + +@ARTICLE{Gillon2024-vu, + title = "{ODIN}: Open Data In Neurophysiology: Advancements, Solutions + \& Challenges", + author = "Gillon, Colleen J and Baker, Cody and Ly, Ryan and Balzani, + Edoardo and Brunton, Bingni W and Schottdorf, Manuel and + Ghosh, Satrajit and Dehghani, Nima", + journal = "arXiv [q-bio.NC]", + abstract = "Across the life sciences, an ongoing effort over the last 50 + years has made data and methods more reproducible and + transparent. This openness has led to transformative insights + and vastly accelerated scientific progress. For example, + structural biology and genomics have undertaken systematic + collection and publication of protein sequences and + structures over the past half-century, and these data have + led to scientific breakthroughs that were unthinkable when + data collection first began. We believe that neuroscience is + poised to follow the same path, and that principles of open + data and open science will transform our understanding of the + nervous system in ways that are impossible to predict at the + moment. To this end, new social structures along with active + and open scientific communities are essential to facilitate + and expand the still limited adoption of open science + practices in our field. Unified by shared values of openness, + we set out to organize a symposium for Open Data in + Neuroscience (ODIN) to strengthen our community and + facilitate transformative neuroscience research at large. In + this report, we share what we learned during this first ODIN + event. 
We also lay out plans for how to grow this movement, + document emerging conversations, and propose a path toward a + better and more transparent science of tomorrow.", + month = jul, + year = 2024, + archivePrefix = "arXiv", + primaryClass = "q-bio.NC" +} + +@INCOLLECTION{Hermes2023-aw, + title = "How can intracranial {EEG} data be published in a standardized + format?", + author = "Hermes, Dora and Cimbalnek, Jan", + booktitle = "Studies in Neuroscience, Psychology and Behavioral Economics", + publisher = "Springer International Publishing", + address = "Cham", + pages = "595--604", + abstract = "Sharing data or code with publications is not something new and + licenses for public sharing have existed since the late 20s + century. More recent worldwide efforts have led to an increase in + the amount of data shared: funding agencies require that data are + shared, journals request that data are made available, and some + journals publish papers describing data resources. For + intracranial EEG (iEEG) data, considering how and when to share + data does not happen only at the stage of publication. Human + subjects’ rights demand that data sharing is something that + should be considered when writing an ethical protocol and + designing a study before data are collected. At that moment, it + should already be considered what levels of data will be + collected and potentially shared. This includes levels of data + directly from the amplifier, reformatted or processed data, + clinical information and imaging data. 
In this chapter we will + describe considerations and scholarship behind sharing iEEG data, + to make it easier for the iEEG community to share data for + reproducibility, teaching, advancing computational efforts, + integrating iEEG data with other modalities and allow others to + build on previous work.", + year = 2023, + language = "en" +} + + @ARTICLE{Hanisch2015-cu, title = "The Virtual Astronomical Observatory: Re-engineering access to astronomical data", diff --git a/sections/01-introduction.qmd b/sections/01-introduction.qmd index c3ba8f3..d50954b 100644 --- a/sections/01-introduction.qmd +++ b/sections/01-introduction.qmd @@ -30,7 +30,7 @@ non-profit organization that was founded in the 1990s developed a set of guidelines for licensing of OSS that is designed to protect the rights of developers and users. On the technical side, tools such as the Git Source-code management system support complex and distributed open-source workflows that -accelerate, streamline, and robustify OSS development. Governance approaches +accelerate, streamline, and make OSS development more robust. Governance approaches have been honed to address the challenges of managing a range of stakeholder interests and to mediate between large numbers of weakly-connected individuals that contribute to OSS. When these social and technical innovations are put @@ -44,9 +44,9 @@ Data and metadata standards that use tools and practices of OSS ("open-source standards" henceforth) reap many of the benefits that the OSS model has provided in the development of other technologies. The present report explores how OSS processes and tools have affected the development of data and metadata -standards. 
The report will triangulate common features of a variety of use -cases; it will identify some of the challenges and pitfalls of this mode of -standards development, with a particular focus on cross-sector interactions; -and it will make recommendations for future developments and policies that can -help this mode of standards development thrive and reach its full potential. +standards. The report will survey common features of a variety of use cases; it +will identify some of the challenges and pitfalls of this mode of standards +development, with a particular focus on cross-sector interactions; and it will +make recommendations for future developments and policies that can help this +mode of standards development thrive and reach its full potential. diff --git a/sections/02-use-cases.qmd b/sections/02-use-cases.qmd index d699199..96119b6 100644 --- a/sections/02-use-cases.qmd +++ b/sections/02-use-cases.qmd @@ -4,7 +4,7 @@ To understand how OSS development practices affect the development of data and metadata standards, it is informative to demonstrate this cross-fertilization through a few use cases. As we will see in these examples, some fields, such as astronomy, high-energy physics and earth sciences have a relatively long -history of shared data resources from organizations such as LSST, CERN, and +history of shared data resources from organizations such as SDSS, CERN, and NASA, while other fields have only relatively recently become aware of the value of data sharing and its impact. These disparate histories inform how standards have evolved and how OSS practices have pervaded their development. @@ -27,14 +27,14 @@ that break backward compatibility. Among the features that make FITS so durable is that it was designed originally to have a very restricted metadata schema. That is, FITS records were designed to be the lowest common denominator of word lengths in computer systems at the time. 
However, while FITS is compact, its -ability to encode the coordinate frame and pixels, means that data from -different observational instruments can be stored in this format and -relationships between data from different instruments can be related, rendering -manual and error-prone procedures for conforming images obsolete. Nevertheless, -the stability has also raised some issues as the field continues to adapt to -new measurement methods and the demands of ever-increasing data volumes and -complex data analysis use-case, such as interchange with other data and the use -of complex data bases to store and share data [@Scroggins2020-ut]. Another +ability to encode a coordinate frame for pixels means that data from different +observational instruments can be stored in this format and relationships +between data from different instruments can be defined, rendering manual and +error-prone procedures for conforming images obsolete. Nevertheless, the +stability has also raised some issues as the field continues to adapt to new +measurement methods and the demands of ever-increasing data volumes and complex +data analysis use cases, such as interchange with other data and the use of +complex databases to store and share data [@Scroggins2020-ut]. Another prominent example of the use of open-source processes to develop standards in Astronomy is in the tools and protocols developed by the International Virtual Observatory Alliance (IVOA) and its national implementations, e.g., in the US @@ -56,11 +56,11 @@ have been established and the adoption of these standards in data analysis has high penetration [@Basaglia2023-dq]. A top-down approach is taken so that within every large collaboration, standards are enforced, and this adoption is centrally managed. Access to raw data is essentially impossible because of its -large volume, and making it publicly available is both technically very hard -and potentially ill-advised. 
Therefore, analysis tools are tuned specifically -to the standards of the released data. Incentives to use the standards are -provided by funders that require data management plans that specify how the -data is shared (i.e., in a standards-compliant manner). +large volume, and making it publicly available would be technically very +difficult. Therefore, analysis tools are tuned specifically to the standards of +the released data. Incentives to use the standards are provided by funders that +require data management plans that specify how the data is shared (i.e., in a +standards-compliant manner). ## Earth sciences @@ -73,22 +73,20 @@ as the Network Common Data Form (NetCDF) developed by the University Corporation for Atmospheric Research (UCAR), and the Hierarchical Data Format (HDF), a set of file formats (HDF4, HDF5) that are widely used, particularly in climate research. The GeoTIFF format, which originated at NASA in the late -1990s, is extensively used to share image data. In the 1990s, open web mapping -also began with MapServer (https://mapserver.org) and continued later with -other projects such as OpenStreetMap (https://www.openstreetmap.org). The -following two decades, the 2000s-2020s, brought an expansion of open standards -and integration with web technologies developed by OGC, as well as other -standards such as the Keyhole Markup Language (KML) for displaying geographic -data in Earth browsers. Formats suitable for cloud computing also emerged, such -as the Cloud Optimized GeoTIFF (COG), followed by Zarr and Apache Parquet for -array and tabular data, respectively. In 2006, the Open Source Geospatial -Foundation (OSGeo, https://www.osgeo.org) was established, demonstrating the -community's commitment to the development of open-source geospatial -technologies. 
While some standards have been developed in the industry (e.g., -Keyhole Markup Language (KML) by Keyhole Inc., which Google later acquired), -they later became international standards of the OGC, which now encompasses -more than 450 commercial, governmental, nonprofit, and research organizations -working together on the development and implementation of open standards +1990s, is extensively used to share image data. The following two decades, the +2000s-2020s, brought an expansion of open standards and integration with web +technologies developed by OGC, as well as other standards such as the Keyhole +Markup Language (KML) for displaying geographic data in Earth browsers. Formats +suitable for cloud computing also emerged, such as the Cloud Optimized GeoTIFF +(COG), followed by Zarr and Apache Parquet for array and tabular data, +respectively. In 2006, the Open Source Geospatial Foundation (OSGeo, +https://www.osgeo.org) was established, demonstrating the community's +commitment to the development of open-source geospatial technologies. While +some standards have been developed in the industry (e.g., Keyhole Markup +Language (KML) by Keyhole Inc., which Google later acquired), they later became +international standards of the OGC, which now encompasses more than 450 +commercial, governmental, nonprofit, and research organizations working +together on the development and implementation of open standards (https://www.ogc.org). ## Neuroscience @@ -123,25 +121,30 @@ wide range of stakeholders and tap a broad base of expertise. ## Community science Another interesting use case for open-source standards is community/citizen -science. This approach, which has grown in the last 20 years, has many benefits -for both the research field that harnesses the energy of non-scientist members -of the community to engage with scientific data, as well as to the community -members themselves who can draw both knowledge and pride in their participation -in the scientific endeavor. 
It is also recognized that unique broader benefits -are accrued from this mode of scientific research, through the inclusion of -perspectives and data that would not otherwise be included. To make data -accessible to community scientists, and to make the data collected by community -scientists accessible to professional scientists, it needs to be provided in a -manner that can be created and accessed without specialized instruments or -specialized knowledge. Here, standards are needed to facilitate interactions -between an in-group of expert researchers who generate and curate data and a -broader set of out-group enthusiasts who would like to make meaningful -contributions to the science. This creates a particularly stringent constraint -on transparency and simplicity of standards. Creating these standards in a -manner that addresses these unique constraints can benefit from OSS tools, with -the caveat that some of these tools require additional expertise. For example, -if the standard is developed using git/GitHub for versioning, this would -require learning the complex and obscure technical aspects of these system that -are far from easy to adopt, even for many professional scientists. +science. An early example of this approach is OpenStreetMap +(https://www.openstreetmap.org), which allows users to contribute to the +project development with code and data and freely use the maps and other +related geospatial datasets. But this example is not unique. Overall, this +approach has grown in the last 20 years and has been adopted in many different +fields. It has many benefits for both the research field that harnesses the +energy of non-scientist members of the community to engage with scientific +data, as well as to the community members themselves who can draw both +knowledge and pride in their participation in the scientific endeavor. 
It is +also recognized that unique broader benefits are accrued from this mode of +scientific research, through the inclusion of perspectives and data that would +not otherwise be included. To make data accessible to community scientists, and +to make the data collected by community scientists accessible to professional +scientists, it needs to be provided in a manner that can be created and +accessed without specialized instruments or specialized knowledge. Here, +standards are needed to facilitate interactions between an in-group of expert +researchers who generate and curate data and a broader set of out-group +enthusiasts who would like to make meaningful contributions to the science. +This creates a particularly stringent constraint on transparency and simplicity +of standards. Creating these standards in a manner that addresses these unique +constraints can benefit from OSS tools, with the caveat that some of these +tools require additional expertise. For example, if the standard is developed +using git/GitHub for versioning, this would require learning the complex and +obscure technical aspects of these systems that are far from easy to adopt, even +for many professional scientists. diff --git a/sections/03-challenges.qmd b/sections/03-challenges.qmd index 8ddaa32..96993e3 100644 --- a/sections/03-challenges.qmd +++ b/sections/03-challenges.qmd @@ -9,50 +9,52 @@ mitigated. One of the defining characteristics of OSS is its dynamism and its rapid evolution. Because OSS can be used by anyone and, in most cases, contributions can be made by anyone, innovations flow into OSS in a bottom-up fashion from -user/developers. 
Pathways to contribution by members of the community are often -well-defined: both from the technical perspective (e.g., through a pull request -on GitHub, or other similar mechanisms), as well as from the social perspective -(e.g., whether contributors need to accept certain licensing conditions through -a contributor licensing agreement) and the socio-technical perspective (e.g., -how many people need to review a contribution, what are the timelines for a -contribution to be reviewed and accepted, what are the release cycles of the -software that make the contribution available to a broader community of users, -etc.). Similarly, open-source standards may also find themselves addressing use -cases and solutions that were not originally envisioned through bottom-up -contributions of members of a research community to which the standard -pertains. However, while this dynamism provides an avenue for flexibility it -also presents a source of tension. This is because data and metadata standards -apply to already existing datasets, and changes may affect the compliance of -these existing datasets. Similarly, analysis technology stacks that are -developed based on an existing version of a standard have to adapt to the -introduction of new ideas and changes into a standard. Dynamic changes of this -sort therefore risk causing a loss of faith in the standard by a user -community, and migration away from the standard. Similarly, if a standard -evolves too rapidly, users may choose to stick to an outdated version of a -standard for a long time, creating strains on the community of developers and -maintainers of a standard who will need to accommodate long deprecation cycles. 
-On the other hand, in cases in which some forms of dynamic change is prohibited --- as in the case of the FITS file format, which prohibits changes that break -backwards-compatibility -- there is also a cost associated with the stability -[@Scroggins2020-ut]: limiting adoption and combinations of new types of -measurements, new analysis methods or new modes of data storage and data -sharing. +users/developers. Pathways to contribution by members of the community are +often well-defined: both from the technical perspective (e.g., through a pull +request on GitHub, or other similar mechanisms), as well as from the social +perspective (e.g., whether contributors need to accept certain licensing +conditions through a contributor licensing agreement) and the socio-technical +perspective (e.g., how many people need to review a contribution, what are the +timelines for a contribution to be reviewed and accepted, what are the release +cycles of the software that make the contribution available to a broader +community of users, etc.). Similarly, open-source standards may also find +themselves addressing use cases and solutions that were not originally +envisioned through bottom-up contributions of members of a research community +to which the standard pertains. However, while this dynamism provides an avenue +for flexibility it also presents a source of tension. This is because data and +metadata standards apply to already existing datasets, and changes may affect +the compliance of these existing datasets. These existing datasets may have a +lifespan of decades, making continued compatibility crucial. Similarly, +analysis technology stacks that are developed based on an existing version of a +standard have to adapt to the introduction of new ideas and changes into a +standard. Dynamic changes of this sort therefore risk causing a loss of faith +in the standard by a user community, and migration away from the standard. 
+Similarly, if a standard evolves too rapidly, users may choose to stick to an +outdated version of a standard for a long time, creating strains on the +community of developers and maintainers of a standard who will need to +accommodate long deprecation cycles. On the other hand, in cases in which some +forms of dynamic change are prohibited -- as in the case of the FITS file +format, which prohibits changes that break backwards-compatibility -- there is +also a cost associated with the stability [@Scroggins2020-ut]: limiting +adoption and combinations of new types of measurements, new analysis methods or +new modes of data storage and data sharing. ## Mismatches between standards developers and user communities -Open-source standards often entail an inherent gap in both interest and ability -to engage with the technical details undergirding standards and their -development between the core developers of the standard and the users of the -standard, which are members of the broader research field to which the standard -pertains. This gap, in and of itself, creates friction on the path to broad -adoption and best utilization of the standards. In extreme cases, the interests -of researchers and standards developers may even seem at odds, as developers -implement sophisticated mechanisms to automate the creation and validation of -the standard or advocate for more technically advanced mechanisms for evolving -the standard. These advanced capabilities offer more robust development -practices and consistency in cases where the standards are complex and -elaborate. They can also ease the maintenance burden of the standard. On the -other hand, they may end up leaving potential users sidelined in the +Open-source standards often entail an inherent gap between the core developers +of the standard and the users of the standard. 
The former may possess a greater +ability to engage with the technical details undergirding standards and their +development, while the latter still have a high level of interest as members of +the broader research field to which the standard pertains. This gap, in and of +itself, creates friction on the path to broad adoption and best utilization of +the standards. In extreme cases, the interests of researchers and standards +developers may even seem at odds, as developers implement sophisticated +mechanisms to automate the creation and validation of the standard or advocate +for more technically advanced mechanisms for evolving the standard. These +advanced capabilities offer more robust development practices and consistency +in cases where the standards are complex and elaborate. They can also ease the +maintenance burden of the standard. On the other hand, they may end up leaving +potential experimental researchers and data providers sidelined in the development of the standard, and limiting their ability to provide feedback about the practical implications of changes to the standards. One example of this (already mentioned above in @sec-use-cases) is the use of git/GitHub for @@ -101,14 +103,22 @@ directed towards commercial solutions) there is an incentive to create data formats and data analysis platforms that are proprietary. This may drive innovative applications of scientific measurements, but also creates sub-fields where scientific observations are generated by proprietary instrumentation, due -to these commercialization or other profit-driven incentives. There is a lack -of regulatory oversight to adhere to available standards or evolve common -tools, limiting integration across different measurements. In cases where a -significant amount of data is already stored in proprietary formats, -significant data transformations may be required to get data to a state that is -amenable to open-source standards. 
In these sub-fields there may also be a lack -of incentive to set aside investment or resources to invest in establishing -open-source data standards, leaving these sub-fields relatively siloed. +to these commercialization or other profit-driven incentives. FTIR spectroscopy +is one such example, wherein use of Bruker instrumentation necessitates +downstream analysis of the resulting measurements using proprietary binary +formats necessary for the OPUS software. Another example is the proliferation +of proprietary file formats in electrophysiological measurements of brain +signals [@Gillon2024-vu; @Hermes2023-aw]. Yet another is the use of proprietary +application programming interfaces (APIs) in electronic health records +[@Barker2024-ox; @Adler-Milstein2017-id]. In most cases, there is a lack of +regulatory oversight to adhere to available standards or evolve common tools, +limiting integration across different measurements. In cases where a +significant amount of data is already stored in proprietary formats, or where +access is limited by proprietary APIs, significant data transformations may be +required to get data to a state that is amenable to open-source standards. In +these sub-fields there may also be a lack of incentive to set aside investment +or resources to invest in establishing open-source data standards, leaving +these sub-fields relatively siloed. ### Harnessing new computing paradigms and technologies @@ -125,8 +135,9 @@ because cloud data access patterns are fundamentally different from the ones that are used in local posix-style file-systems. Suspicion of cloud computing comes in two different flavors: the first by researchers and administrators who may be wary of costs associated with cloud computing, and especially with the -difficulty of predicting these costs. Projects such as NSF's Cloud Bank seek to -mitigate some of these concerns, by providing an additional layer of +difficulty of predicting these costs. 
This can particularly affect scenarios +where long-term preservation is required. Projects such as NSF's Cloud Bank +seek to mitigate some of these concerns, by providing an additional layer of transparency into cloud costs [@Norman2021CloudBank]. The other type of objection relates to the fact that cloud computing services, by their very nature, are closed ecosystems that resist portability and interoperability. diff --git a/sections/05-recommendations.qmd b/sections/05-recommendations.qmd index 75b316c..63452ca 100644 --- a/sections/05-recommendations.qmd +++ b/sections/05-recommendations.qmd @@ -31,12 +31,12 @@ existing standards is that there is significant knowledge that exists across fields and domains and that informs the development of standards within each field, but that could be surfaced to the level where it may be adopted more widely in different domains and be more broadly useful. One approach to this is -a comparative approach: In this approach, a readiness and/or maturity model can +a comparative approach: in this approach, a readiness and/or maturity model can be developed that assesses the challenges and opportunities that a specific standard faces at its current phase of development. Developing such a maturity model, while it goes beyond the scope of the current report, could lead to the eventual development of a meta-standard or a standard-of-standards. This would -encompass a succinct description of cross-cutting best-practices that can be +facilitate a succinct description of cross-cutting best-practices that can be used as a basis for the analysis or assessment of an existing standard, or as guidelines to develop new standards. For instance, specific barriers to adopting a data standard that take into account the size of the community and @@ -57,10 +57,10 @@ expertise typical of this community, and so forth -- could help guide the standards-development process towards more effective adoption and use. 
A set of meta-standards and high-level descriptions of the standards-development process -- some of which is laid out in this report -- could help standard developers -avoid known pitfalls, such as the dreaded proliferation of standards, or such -as complexity-impeded adoption. Surveying and documenting the success and -failures of current standards for a specific dataset / domain can help -disseminate knowledge about the standardization process. Resources such as +avoid known pitfalls, such as the dreaded proliferation of standards, or +complexity-impeded adoption. Surveying and documenting the successes and failures +of current standards for a specific dataset / domain can help disseminate +knowledge about the standardization process. Resources such as [Fairsharing](https://fairsharing.org/) or [Digital Curation Center](https://www.dcc.ac.uk/guidance/standards) can help guide this process. @@ -92,7 +92,7 @@ community efforts and tools for this. The OSS model is seen as a particularly promising avenue for an investment of resources, because it builds on previously-developed procedures and technical infrastructure and because it provides avenues for the democratization of development processes and for -community input along the way. At the same time, there is significant +community input along the way. At the same time, there are significant challenges associated with incentives to engage, ranging from the dilution of credit to individual contributors, and ranging through the burnout of maintainers and developers. The clarity offered by procedures for enhancement @@ -119,17 +119,31 @@ training so that this crucial role is encouraged. Initial proposals for the curriculum and scope of the role have already been proposed (e.g., in [@Mons2018DataStewardshipBook]), but we identify here also a need to connect these individuals directly to the practices that exemplify open-source -standards.
Thus, it will be important for these individuals to be facile in the -methodology of OSS. This does not mean that they need to become software +standards. Thus, it will be important for these individuals to be conversant in +the methodology of OSS. This does not mean that they need to become software engineers -- though for some of them there may be some overlap with the role of research software engineers [@Connolly2023Software] -- but rather that they need to become familiar with those parts of the OSS development life-cycle that are specifically useful for the development of open-source standards. For example, tools for version control, tools for versioning, and tools for -creation and validation of compliant data and metadata. +creation and validation of compliant data and metadata. Stakeholder +organizations should invest in training grants to establish curriculum for data +and metadata standards education. + +Ultimately, efficient use of data stewards and their knowledge will have to be +applied. It is evident that not every project and every lab that produces data +requires a full-time data steward. Instead, data stewardship could be +centralized within organizations such as libraries, data science, or software +engineering cores of larger research organizations. This would be akin to +recent models for research software engineering that are becoming common in +many research organizations [@Van-Tuyl2023-vp]. Efficiency considerations also +suggest that the development of data standards would not have its intended +purpose unless funds are also allocated to the implementation of the standard +in practice. Mandating standards without appropriate funding for their +implementation by data producers and data users could risk hampering science +and could lead to researchers doing the bare minimum to make their data +“open”. -Stakeholder organizations should invest in training grants to establish -curriculum for data and metadata standards education.
### Review open-source standards pathways @@ -141,11 +155,13 @@ development of open-source standards, and to build on prior experience, the documentation and dissemination of lifecycles should be seen as an integral step of the work of standards creators and granting agencies. In the meanwhile, it would be good to also retroactively document the lifecycle of existing -standards that are seen as success stories. Research on the principles that -underlie successful open-source standards development can be used to formulate -new standards and iterate on existing ones. Data management plans should -promote the sharing of not only data, but also metadata and descriptions of how -to use it. +standards that are seen as success stories, and to foster awareness of +these standards. In addition, fostering research projects on the principles +that underlie successful open-source standards development will help formulate +new standards and iterate on existing ones. Accordingly, data management +plans should promote the sharing of not only data, but also metadata and +descriptions of how to use it. + ### Manage Cross Sector alliances @@ -153,6 +169,6 @@ Encourage cross-sector and cross-domain alliances that can impact successful standards creation. Invest in robust program management of these alliances to align pace and create incentives (for instance via Open Source Program Offices at Universities or other research organizations). Similar to program officers -at funding agencies, standards evolution need sustained PM efforts. -Multi-company partnerships should include strategic initiatives for standard -establishment such as the Pistoia Alliance (https://www.pistoiaalliance.org/). +at funding agencies, standards evolution needs sustained PM efforts. Multi-party +partnerships should include strategic initiatives for standard establishment +such as the Pistoia Alliance (https://www.pistoiaalliance.org/). \ No newline at end of file