diff --git a/docs/2024/ci-scanner/updates/2024-07-11.md b/docs/2024/ci-scanner/updates/2024-07-11.md new file mode 100644 index 000000000..9656800f7 --- /dev/null +++ b/docs/2024/ci-scanner/updates/2024-07-11.md @@ -0,0 +1,66 @@ +--- +title: Week 6 +author: Rajul Jha +tags: [gsoc24, CI] +--- + + +# Week 6 +*(July 05, 2024 - July 11, 2024)* + +## Meeting 1 +*(July 10, 2024)* + +## Attendees +* [Rajul Jha](https://github.com/rajuljha) +* [Shaheem Azmal](https://github.com/shaheemazmalmmd) +* [Kaushlendra](https://github.com/Kaushl2208) +* [Avinal Kumar](https://github.com/avinal) + +## Discussions +* Reported the progress and completion of [#PR2785](https://github.com/fossology/fossology/pull/2785), which adds the relevant byte info to the nomos scanner's JSON output. +* Also received a review from the mentors on [#PR2784](https://github.com/fossology/fossology/pull/2784), which allows the user to pass a custom `allowlist.json` file for allowlisting certain licenses or keywords. +* Gave the mentors a demo of how the GitHub Action for Fossology scanners works. I studied `docker` actions as well as `composite` actions and decided to go with composite actions because: + * **Emulation on our end**: Composite actions give us the flexibility to run multiple steps in our jobs, which makes it easier to implement the QEMU emulator for cross-platform image support out of the box. + * **Uploading Artifacts**: With composite actions, the user does not need to set up a separate step for uploading the results as an artifact, as we can add this step to the action itself. The user can just provide the `report_format` key to tell the script which report to generate. This means less setup clutter for the user. + + +## Work Done +* Completed the allowlist functionality and opened [#PR2784](https://github.com/fossology/fossology/pull/2784) for it. 
+ * The user can now pass an `allowlist.json` file in a particular [format](https://github.com/fossology/fossology/blob/master/utils/automation/allowlist.sample.json) like this: + ```json + { + "licenses": [ + "GPL-2.0-or-later", + "GPL-2.0-only", + "LGPL-2.1-or-later" + ], + "exclude": [ + "*/agent_tests/*", + "src/vendor/*" + ] + } + ``` + * The script looks for the user-supplied allowlist file first. If it is not found, the script looks for an `allowlist.json` file in the root directory, and then for `whitelist.json`. If none of these is found, it populates an empty dictionary with `licenses` and `exclude` keys. + The decision tree looks like this: + + ![Screenshot](/img/ci/Whitelist_decision_tree.png) + +* As discussed and resolved in the previous meeting, the `start`, `end`, and `len` information is now added to the nomos JSON output in [#PR2785](https://github.com/fossology/fossology/pull/2785). + + ![Screenshot](/img/ci/Nomos_json_output.png) + +* Started working on the line number part for the `nomos` and `ojo` scanners. + +* Researched the different kinds of GitHub Actions and decided to go with [`composite` actions](https://docs.github.com/en/actions/creating-actions/creating-a-composite-action), as they allow us to customize our action more easily. +* Implemented a demo GitHub Action for testing and demo'd it to the mentors. +![Screenshot](/img/ci/foss-action-test.png) + +## Planning for next week +* Complete the action and test all cases and boundary conditions. +* Once the action is completed, think about a name for it and publish it to the GitHub Marketplace. +* After that, resume working on the line number part for the `nomos` and `ojo` scanners. \ No newline at end of file diff --git a/docs/2024/index.md b/docs/2024/index.md index 70b31ca55..e9db2a779 100644 --- a/docs/2024/index.md +++ b/docs/2024/index.md @@ -26,7 +26,7 @@ More info to come here. 
| :--------------------------------------------------- | :----------------------------------------------------------- | | [Aaditya Singh](https://github.com/aadsingh) | [Overhaul Scheduler Design](/docs/2024/scheduler) | | [Abdelrahman Jamal](https://github.com/Hero2323) | [AI Powered License Detection](/docs/2024/license-detection) | -| [Abhishek Kumar](https://github.com/abhi-kumar17871) | [SPDX 3.0 Support](/docs/2024/spdx30) | +| [Abhishek Kumar](https://github.com/abhi-kumar17871) | [Support SPDX 3.0 Reports](/docs/2024/spdx30) | | [Akash Sah](https://github.com/AkashSah2003) | [SPDX License Expression](/docs/2024/spdx-expression) | | [Divij Sharma](https://github.com/dvjsharma) | [REST API Improvements](/docs/2024/rest) | | [Rajul Jha](https://github.com/rajuljha) | [Improving CI Scanner](/docs/2024/ci-scanner) | diff --git a/docs/2024/license-detection/updates/2024-06-06.md b/docs/2024/license-detection/updates/2024-06-06.md index 24548ed6c..8248e6e36 100644 --- a/docs/2024/license-detection/updates/2024-06-06.md +++ b/docs/2024/license-detection/updates/2024-06-06.md @@ -10,7 +10,7 @@ SPDX-FileCopyrightText: 2024 Abdelrahman Jamal # Meeting 2 -*(June 6,2023)* +*(June 6,2024)* ## Attendees: - [Kaushlendra Pratap](https://github.com/Kaushl2208) diff --git a/docs/2024/license-detection/updates/2024-06-13.md b/docs/2024/license-detection/updates/2024-06-13.md index 2b1126dea..7be28aeee 100644 --- a/docs/2024/license-detection/updates/2024-06-13.md +++ b/docs/2024/license-detection/updates/2024-06-13.md @@ -10,7 +10,7 @@ SPDX-FileCopyrightText: 2024 Abdelrahman Jamal # Meeting 3 -*(June 13,2023)* +*(June 13,2024)* ## Attendees: - [Kaushlendra Pratap](https://github.com/Kaushl2208) diff --git a/docs/2024/license-detection/updates/2024-06-20.md b/docs/2024/license-detection/updates/2024-06-20.md index 9aa40eb06..36e314e5c 100644 --- a/docs/2024/license-detection/updates/2024-06-20.md +++ b/docs/2024/license-detection/updates/2024-06-20.md @@ -10,7 +10,7 @@ 
SPDX-FileCopyrightText: 2024 Abdelrahman Jamal # Meeting 4 -*(June 20,2023)* +*(June 20,2024)* ## Attendees: - [Kaushlendra Pratap](https://github.com/Kaushl2208) diff --git a/docs/2024/license-detection/updates/2024-06-27.md b/docs/2024/license-detection/updates/2024-06-27.md index f8da22507..116e537b2 100644 --- a/docs/2024/license-detection/updates/2024-06-27.md +++ b/docs/2024/license-detection/updates/2024-06-27.md @@ -10,7 +10,7 @@ SPDX-FileCopyrightText: 2024 Abdelrahman Jamal # Meeting 5 -*(June 27,2023)* +*(June 27,2024)* ## Attendees: - [Kaushlendra Pratap](https://github.com/Kaushl2208) diff --git a/docs/2024/license-detection/updates/2024-07-04.md b/docs/2024/license-detection/updates/2024-07-04.md new file mode 100644 index 000000000..ac0889f85 --- /dev/null +++ b/docs/2024/license-detection/updates/2024-07-04.md @@ -0,0 +1,125 @@ +--- +title: Week 6 +author: Abdelrahman Jamal +--- + + +# Meeting 6 + +*(July 4,2024)* + +## Attendees: +- [Kaushlendra Pratap](https://github.com/Kaushl2208) +- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd) +- [Ayush Bhardwaj](https://github.com/hastagAB) +- [Avinal Kumar](https://github.com/avinal) + +## Discussion: + +### Integration of Semantic Search with LLMs + +- Initial Attempt + 1. Prompt: The initial prompt focused on providing text and metadata to the LLM for license identification. + 2. Issues: The LLM attempted to match all provided lines to a license, even when many lines were clearly irrelevant to licensing. + +- Initial Prompt +``` +[Task] +You are provided with text extracted from a file, along with potential license matches identified by a semantic search tool. +Your task is to carefully analyze the provided text and metadata to determine the actual software license(s) present in the original file. +Out of the 10 provided lines, not all matches will be correct or relevant, so focus on the most relevant lines in your analysis. 
+ +[Metadata Explanation] +The metadata provided for each line is a tuple containing four elements: + * **Line:** The actual line of text extracted from the file. + * **Potential License Match:** The name of a license that the semantic search tool believes the line might belong to. + * **License ID:** The SPDX identifier of the potential license match. + * **Matched License Text:** The specific text within the potential license that the line was matched to. + +[Guidelines] +1. **License Identification:** If a license is found, clearly state its name and its corresponding SPDX identifier (e.g., MIT License, SPDX-License-Identifier: MIT). If multiple licenses are found, list them all. +2. **Evidence and Reasoning (Focus on Relevance and Clarity):** + * For each identified license, extract the specific text snippet(s) from the provided text that confirm its presence. Include surrounding context if it helps clarify the license's applicability. Prioritize the most relevant lines of text. + * Explain why the identified license is the most likely match, taking into account the potential license matches and the matched license text provided in the metadata. + * Only consider matches that are clear and obviously correct. The semantic search tool will always attempt to match lines to licenses, but these matches are not always accurate. +3. **Override Semantic Search:** If the semantic search tool's suggested match seems incorrect, feel free to disregard it and rely on your own knowledge and analysis to determine the correct license. Provide a clear explanation of why you chose a different license. +4. **Exclude Irrelevant Information:** + * Disregard copyright notices and statements and lines of code as they do not indicate the software license. + * Focus only on text that is found in licenses or clearly identifies licenses. +5. **No License Scenario:** If no license is detected in the text, explicitly state "No software license found." +6. 
**Ambiguity:** If the license cannot be confidently determined due to ambiguity or conflicting information, clearly state this and provide an explanation. +7. **Response Format:** Provide the results in the following format: + * **Licenses = [list of identified licenses]** + * **SPDX-IDs = [list of corresponding SPDX identifiers]** + + If no licenses are found, both lists should be empty: + * **Licenses = []** + * **SPDX-IDs = []** + +[Text and Metadata] +``` +- Outcome: The LLM tried too hard to relate irrelevant lines to licenses, resulting in many false positives. + +### Revised Approach + +- Second Attempt + - Prompt: Changed the task to identify relevant lines before determining licenses. + - Issues: Reduced the number of irrelevant lines identified, but the problem of false positives persisted. + +- Second Prompt +``` +[Task] +From the following tuples, select those that are relevant to software licensing and ignore the rest. +A relevant tuple is a tuple that contains a line of text that is relevant and can be used to identify a license. + +[Tuples] +Each tuple consists of three elements: + 1. **Line:** The actual line of text extracted from the file. This is the element you need to evaluate for relevance to software licensing. + 2. **Potential License Match:** The name of a license that the semantic search tool suggests the line might belong to (provided for reference). + 3. **License ID:** The SPDX identifier of the potential license match (provided for reference). + +[Guidelines] +1. **Select License-Specific Lines:** Choose only lines that: + * Explicitly mention license terms + * Directly quote from known license texts + * Include specific license references or titles. + +2. **Ignore Irrelevant Lines:** + * Disregard lines that do not explicitly mention license terms. + * Ignore copyright notices, code snippets, comments, and general documentation. 
+ * Ignore code documentation lines that seem to be documenting code or just general instructions or information. + * Do not select lines that are general descriptions, code, or comments unrelated to license terms. + +3. **No License:** If no license is found, state "No software license found." +4. **Ambiguity:** If uncertain, explain the ambiguity. +5. **Response Format:** + * **Relevant Lines = [list of relevant lines]** + * **Licenses = [list of identified licenses from relevant lines]** + * **SPDX-IDs = [list of corresponding SPDX identifiers from relevant lines]** + +[Text and Metadata] +``` +- Outcome: The LLM still included irrelevant lines in its output, indicating a persistent issue with following the prompt guidelines. + +### Key Findings +- Performance Issues: Despite detailed prompts, the LLMs struggled to correctly identify relevant lines and accurately match licenses. + +- RAG Exploration: Suggested by Kaushl, Retrieval-Augmented Generation (RAG) may provide a more robust solution to improve accuracy in license identification. + +## Conclusions and Next Steps +- Improve Semantic Search: Continue refining the semantic search approach for better initial filtering of potential license lines. + +- RAG Implementation: Investigate and implement RAG to enhance the LLM's ability to accurately identify relevant lines and match licenses. + +- Further Prompt Engineering: Experiment with additional prompt variations to improve LLM performance. + + +- Performance Metrics: Establish metrics to evaluate the effectiveness of the integrated approach and analyze the results for further improvements. 
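The structured response format requested by both prompts ("Licenses = [...]", "SPDX-IDs = [...]") can be checked programmatically, which also makes it possible to score prompt variants against a labeled test set. Below is a minimal, illustrative parser sketch; the function name and parsing rules are assumptions for demonstration, not existing project code:

```python
import re


def parse_llm_response(response: str):
    """Extract the 'Licenses = [...]' and 'SPDX-IDs = [...]' lists that the
    prompt asks the LLM to emit; both lists are empty when no license is found."""
    def extract(field: str):
        match = re.search(rf"{field}\s*=\s*\[(.*?)\]", response, re.DOTALL)
        if not match or not match.group(1).strip():
            return []
        # Split the bracketed list on commas and strip quotes/whitespace.
        return [item.strip().strip("'\"") for item in match.group(1).split(",")]

    return extract("Licenses"), extract("SPDX-IDs")


licenses, spdx_ids = parse_llm_response(
    "**Licenses = [MIT License]**\n**SPDX-IDs = [MIT]**"
)
```

A helper like this would allow the false-positive rate discussed above to be measured automatically rather than inspected by hand.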
+ + + diff --git a/docs/2024/license-detection/updates/2024-07-11.md b/docs/2024/license-detection/updates/2024-07-11.md new file mode 100644 index 000000000..c5bb3561c --- /dev/null +++ b/docs/2024/license-detection/updates/2024-07-11.md @@ -0,0 +1,165 @@ +--- +title: Week 7 +author: Abdelrahman Jamal +--- + + +# Meeting 7 + +*(July 11,2024)* + +## Attendees: +- [Kaushlendra Pratap](https://github.com/Kaushl2208) +- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd) +- [Ayush Bhardwaj](https://github.com/hastagAB) +- [Avinal Kumar](https://github.com/avinal) +- [Anupam Ghosh](https://github.com/ag4ums) + +## Discussion: + +### Improved Semantic Search Algorithm + +- Presentation of Improved Algorithm + + 1. Approach: Implemented chunking by dividing code comments and paragraphs into multiline strings, starting new chunks at empty lines. + + 2. Challenges: The original method of extracting comments resulted in long, merged lines, necessitating a rework to handle multiline comments effectively. + +- Chunking Example + +``` +/* +This is a multiline comment +that should be considered as one +single chunk for better accuracy. + +This is a separate chunk. +*/ + +// This is a single line comment + +// This is still a single line comment +// This is also still a single line comment and not a chunk with the previous comment +``` +### License Matching Performance + +- Initial Results + + 1. Accuracy: The chunking approach significantly improved license matching accuracy but struggled with extremely similar licenses (e.g., 0BSD and ISC). + + 2. Problem: Minor differences between very similar licenses led to occasional misidentifications. + +- License Text Examples + +``` +0BSD License Text: +Copyright (C) YEAR by AUTHOR EMAIL + +Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted. 
+ +THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. + +ISC License Text: +Copyright (c) 2004-2010 by Internet Systems Consortium, Inc. ("ISC") +Copyright (c) 1995-2003 by Internet Software Consortium + +Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies. + +THE SOFTWARE IS PROVIDED "AS IS" AND ISC DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL ISC BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. + +``` + +- Enhancements + +1. Chunk Merging: Attempted merging potential chunks to improve license identification by providing more comprehensive text for comparison. + +2. Combined Line and Chunk Matching: Implemented both line and chunk matching to enhance accuracy, though it increased processing time due to the greater number of combinations. + +- Results + + 1. Metrics: + Predicted License Accuracy: 93.33% + Predicted Licenses Covered: 84.0% + + 2. Notes: Approximately 5% of the remaining 7.6% were files referring to a different license file, not containing the license text directly. 
+ +- Matching Output Example: + +``` +[(100.0, + ' THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.', + 'BSD Zero Clause License', + '0BSD', + 'THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.', + [('BSD Zero Clause License', 100.0), + ('BSD Zero Clause License', 100.0), + ('ISC License', 97.0), + ('ISC License', 97.0), + ('Mackerras 3-Clause License', 93.0)]), + (100.0, + ' Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.', + 'curl License', + 'curl', + 'Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.', + [('curl License', 100.0), + ('curl License', 100.0), + ('pkgconf License', 99.0), + ('ISC License', 99.0), + ('ISC License', 99.0)]), + (99.0, + ' Permission to use, copy, modify, and distribute this software for any', + 'OAR License', + 'OAR', + 'Permission to use, copy, modify, and distribute 
this software for any', + [('Historical Permission Notice and Disclaimer - Fenneberg-Livingston variant', + 99.0), + ('David M. Gay dtoa License', 99.0), + ('OAR License', 99.0), + ('pkgconf License', 97.0), + ('SGI OpenGL License', 96.0)]), + (99.0, + ' copyright notice and this permission notice appear in all copies.', + 'pkgconf License', + 'pkgconf', + 'copyright notice and this permission notice appear in all copies.', + [('pkgconf License', 99.0), + ('Historical Permission Notice and Disclaimer - documentation variant', + 99.0), + ('CMU Mach - no notices-in-documentation variant', 96.0), + ('ISC Veillard variant', 87.0), + ('Historical Permission Notice and Disclaimer - Pbmplus variant', 86.0)]), +.... (many more) + +``` + +### Key Findings + +- Enhanced Accuracy: The combination of chunking and line matching improved overall accuracy and coverage. + +- Increased Processing Time: The dual approach led to longer search times due to the increased number of combinations. + +## Conclusions and Next Steps + +- Evaluate on Nomos Test Dataset: + + 1. Dataset Links: + - [LastGoodNomosTestfilesScan](https://github.com/fossology/fossology/blob/master/src/nomos/agent_tests/testdata/LastGoodNomosTestfilesScan) + - [NomosTestfiles](https://github.com/fossology/fossology/tree/master/src/nomos/agent_tests/testdata/NomosTestfiles) + + 2. Objective: Assess the performance of the semantic search and LLM integration on the provided test datasets. + +- Limit Line/Chunk Matching: To address the issue of excessive matches, limit the line/chunk matching to optimize search time and accuracy. + +- Additional Tasks + + 1. Acknowledgement from Notice Files: Begin work on identifying and acknowledging licenses from notice files. + + 2. Obligations: Convert identified licenses to obligations, detailing the requirements and conditions of each license. 
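The comment-chunking strategy presented at the start of this meeting (a blank line inside a multiline comment closes the current chunk and starts a new one) can be sketched roughly as follows; the function name and the exact normalization are illustrative assumptions, not the project's actual implementation:

```python
def chunk_comment(comment_text: str) -> list[str]:
    """Split the text of a multiline comment into chunks, starting a new
    chunk at every empty line, per the approach described in this meeting."""
    chunks, current = [], []
    for line in comment_text.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # A blank line closes the current chunk.
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = chunk_comment(
    "This is a multiline comment\nthat should be one chunk.\n\nThis is a separate chunk."
)
```

Each resulting chunk, rather than each raw line, is then embedded and compared against the license corpus, which is what drove the accuracy improvement reported above.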
+ + + + diff --git a/docs/2024/pipeline/updates/2024-07-04.md b/docs/2024/pipeline/updates/2024-07-04.md new file mode 100644 index 000000000..41fbf6fa9 --- /dev/null +++ b/docs/2024/pipeline/updates/2024-07-04.md @@ -0,0 +1,66 @@ +--- +title: Week 6 +author: Shreya Gautam +tags: [gsoc24, pipeline] +--- + + + +# WEEK 6 +*(July 4, 2024)* + +## Attendees: +- [Kaushlendra Pratap](https://github.com/Kaushl2208) +- [Gaurav Mishra](https://github.com/GMishx) +- [Avinal Kumar](https://github.com/avinal) + +## Discussion: +1. Completed the Python script for fetching copyright contents from the database, incorporating Gaurav's recommendation to also retrieve user-modified contents. The updated script now collects copyrights and stores them in a CSV file with four columns: + +| original_content | original_is_enabled | edited_content | modified_is_enabled | +|-----------------------|---------------------|-----------------------|---------------------| + + +You can find the updated script [here](https://github.com/ShreyaGautamm/gsoc_24/blob/895ac5814097386f816d9ae703034cbe60244819/files/copyrights_script_v2.py). + +2. Started automating the model retraining process: once 500 new entries accumulate in the database, the Safaa model should be retrained. I explored GitHub Actions and wrote a YAML script to check the number of new entries and trigger the model retraining script if the threshold is met. However, due to connection issues between GitHub Actions and the locally hosted database, I consulted the mentors. They suggested triggering retraining whenever a new copyright file is uploaded to the repository. This task will be continued in the coming week, and updates will be provided in the following meeting. + +3. Explored incremental learning in Safaa. Currently, Safaa uses Scikit-learn's SVM implementation, which retrains from scratch. 
Since Scikit-learn's SVM implementation does not support incremental learning, I switched to its SGD Classifier model, which can be trained incrementally (via `partial_fit`). I calculated its metric reports and found that its results are similar to those of the SVM. As per the mentors' suggestions, I will create a PR showing the results from both SVM and SGD. You can find my implementation for SVM [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/model_implementations/copyright_classification_SVM.ipynb), for SGD [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/model_implementations/copyright_classification_SGD.ipynb), and the comparison between them can be found below. The dataset used for the implementation can be found [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/datasets/fossology-master.csv). + +The results are as follows: + +
+ +
+ +**SGD Classifier:** + +| | precision | recall | f1-score | support | +|---------------|-----------|--------|----------|---------| +| 0 | 0.99 | 0.99 | 0.99 | 2878 | +| 1 | 0.96 | 0.98 | 0.97 | 1016 | + +
+ +
+ +**SVM Classifier:** + +| | precision | recall | f1-score | support | +|---------------|-----------|--------|----------|---------| +| 0 | 0.99 | 0.99 | 0.99 | 2878 | +| 1 | 0.96 | 0.97 | 0.97 | 1016 | + +
+ +
+ +* Started working on creating a Python library for the NER-POS tagging task. I experimented with the Stanford NER Tagger. You can find my work [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/ner/Stanford_Tagger.ipynb). However, I plan to explore fine-tuning BERT or GPT for this task in the coming weeks. + +## Subsequent Steps +* Address the issue with GitHub Actions by establishing a connection for retraining when a new copyright file is uploaded to the repository. Do all the implementations locally first, and then create the final YAML file to try out GitHub Actions. +* Explore and implement fine-tuning of BERT or GPT for the NER-POS tagging task. diff --git a/docs/2024/rest/updates/Divij/2024-07-02.md b/docs/2024/rest/updates/Divij/2024-07-02.md new file mode 100644 index 000000000..34871dbd9 --- /dev/null +++ b/docs/2024/rest/updates/Divij/2024-07-02.md @@ -0,0 +1,101 @@ +--- +title: Week 6 +author: Divij Sharma +tags: [gsoc24, rest] +--- + + + +# Week 6 meeting and activities + +_(July 02,2024)_ + +## Attendees + +- [Divij Sharma](https://github.com/dvjsharma) +- Katharina Ettinger +- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd) + +## Discussion + +- Gave updates and a demo of the previous week's work. +- Discussed the endpoint requirements for the Jobs API. + +## Activities + +- Modified the `/jobs`, `/jobs/{id}`, and `/jobs/all` endpoints considering the following points: + + 1. Currently, the structure of the `jobs` object returned by these endpoints is as follows: + ```json + [
 + { + "id": , + "name": , + "queueDate": , + "uploadId": , + "userId": , + "groupId": , + "eta":