3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)

With great scientific breakthrough comes solid engineering and open communities. The Natural Language Processing (NLP) community has benefited greatly from the open culture in sharing knowledge, data, and software. The primary objective of this workshop is to further the sharing of insights on the engineering and community aspects of creating, developing, and maintaining NLP open source software (OSS), which we seldom talk about in scientific publications. Our secondary goal is to promote synergies between different open source projects and encourage cross-software collaborations and comparisons.

We use Natural Language Processing OSS as an umbrella term that covers not only traditional syntactic, semantic, phonetic, and pragmatic applications but also task-specific applications (e.g., machine translation, information retrieval, question-answering systems), low-level string processing that contains valid linguistic information (e.g., Unicode creation for new languages, language-based character set definitions), and machine learning/artificial intelligence frameworks with functionalities focusing on text applications.

There are many workshops focusing on the creation and curation of open language resources and annotations (e.g., BUCC, GWN, LAW, LOD, WAC). Moreover, we have the flagship LREC conference dedicated to linguistic resources. However, the engineering aspects of NLP-OSS are overlooked and under-discussed within the community. There are open source conferences and venues (such as FOSDEM, OSCON, Open Source Summit) where discussions range from operating system kernels to air traffic control hardware, but the representation of NLP-related presentations is limited. In the Machine Learning (ML) field, the Journal of Machine Learning Research - Machine Learning Open Source Software (JMLR-MLOSS) is a forum for discussions and dissemination of ML OSS topics. We envision that the Workshop for NLP-OSS becomes a similar avenue for NLP-OSS discussions.

A decade ago, there was also the SETQA-NLP (Software Engineering, Testing, and Quality Assurance for Natural Language Processing) workshop that raised awareness of the need for good software engineering practices in NLP. In the earlier days of NLP, linguistic software was often monolithic and the learning curve to install, use, and extend the tools was steep and frustrating. More often than not, NLP-OSS developers/users interact in siloed communities within the ecologies of their respective projects. In addition to engineering aspects of NLP software, the open source movement has brought a community aspect that we often overlook in building impactful NLP technologies.

{{Also add reference to https://bigscience.huggingface.co/acl-2022}}

{{Newer examples of synergy in NLP-OSS needed, see old proposals for reference https://github.com/nlposs/NLP-OSS/blob/master/nlposs-2020/Proposal-2020.md}}

The first and second NLP-OSS workshops, the first of which was co-located with ACL 2018, were the first workshops in recent years that focused more on building quality software for NLP, open sourcing, and developing useful engineering practices, and less on scientific novelty or state-of-the-art development. We hope that the 3rd NLP-OSS workshop can also be hosted at an *ACL conference, as an intellectual forum to collate this type of knowledge, announce new software/features, and promote an open source culture and best practices that go beyond the conferences.

Call for Papers

We invite full papers (8 pages) or short papers (4 pages) on topics related to NLP-OSS broadly categorized into (i) software development, (ii) scientific contribution and (iii) NLP-OSS case studies.

  • Software Development

    • Designing and developing NLP-OSS
    • Licensing issues in NLP-OSS
    • Backwards compatibility and stale code in NLP-OSS
    • Growing, maintaining and motivating an NLP-OSS community
    • Best practices for NLP-OSS documentation and testing
    • Contribution to NLP-OSS without coding
    • Incentivizing OSS contributions in NLP
    • Commercialization and Intellectual Property of NLP-OSS
    • Defining and managing NLP-OSS project scope
    • Issues in API design for NLP
    • NLP-OSS software interoperability
    • Analysis of the NLP-OSS community
  • Scientific Contribution

    • Surveying OSS for specific NLP task(s)
    • Demonstrations, introductions and/or tutorials of NLP-OSS
    • Small but useful NLP-OSS
    • NLP components in ML OSS
    • Citations and references for NLP-OSS
    • OSS and experiment replicability
    • Gaps between existing NLP-OSS
    • Task-generic vs task-specific software
  • Case studies

    • Case studies of how a specific bug is fixed or feature is added
    • Writing wrappers for other NLP-OSS
    • Writing open-source APIs for open data
    • Teaching NLP with OSS
    • NLP-OSS in the industry

Invited Speakers

Demographic Diversity

Organizers: We have five organizers with representation from industrial NLP/ML labs, government organizations, and academic institutes.

PC members: We strive to have a balance of academic and industrial PC members from diverse gender and geolocation demographics. On top of our existing PC members from the NLP-OSS 2018 and 2020 editions, we've invited new PC members by (i) recommendation of previous PC members and (ii) extending our invitation to a subset of WiNLP members on the BIG directory (we tried to contact as many as possible and in total we've sent XXXX invitations to WiNLP members); those who have not replied and accepted the invitation are not listed in the PC list below.

Invited speakers: Our invited speakers come from various backgrounds and have been involved with NLP open source development, though they may not have been that active within the ACL community. We strive to continue the tradition of inviting a diverse range of speakers to talk about NLP OSS, with a balance between OSS development, scientific contribution and community building around OSS. All three speakers have confirmed their participation to give an invited talk (depending on their availability after the exact conference assignment).

Misc

Estimated no. of Attendees: 50

Shared Task: No

Special Requirements / Technical Needs: No

Preferred Venue:

  1. EACL / ACL / EMNLP
  2. EACL / ACL / EMNLP
  3. EACL / ACL / EMNLP

Previous Workshop:

Expected no. of submissions: 30-40 submissions

Organizers

  • Geeticka Chauhan, Massachusetts Institute of Technology

    Geeticka Chauhan is a Ph.D. student at MIT, working on NLP for healthcare advised by Prof. Peter Szolovits. Her master's thesis focused on revealing the reproducibility and generalizability problems in Relation Extraction, and experimentally showed the importance of streamlining evaluation methods in NLP challenges.

  • Dmitrijs Milajevs, Grayscale AI.

    Dmitrijs Milajevs is a data scientist at KPMG. Previously, he evaluated information retrieval systems at the National Institute of Standards and Technology (NIST). He has defended a Ph.D. thesis on evaluation of compositional models in distributional semantics.

  • Elijah Rippeth, University of Maryland.

    Elijah Rippeth is a Ph.D. student at the University of Maryland in the Department of Computer Science. His work covers natural language processing broadly, with an emphasis on multilingual NLP, cross-lingual transfer, and machine translation.

  • Jeremy Gwinnup, Air Force Research Laboratory.

    Jeremy Gwinnup is a Research Computer Scientist in the Airman Systems Directorate of the Air Force Research Laboratory located in Dayton, Ohio, USA. His research focuses on multimodal machine translation, which is also the topic of his studies as a Doctor of Engineering student at Johns Hopkins University.

  • Liling Tan, Rakuten Institute of Technology

    Liling is a research scientist at Rakuten Institute of Technology working on Machine Translation and developing applications using language technologies. He has been actively involved in corpora creation/maintenance, Asian NLP and machine translation. He co-organized the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2014-16).

Programme Committee