Refinements #97
Made it a week to guarantee a good job. This is a core feature that needs to work at least as well as before.
Seems like testing with known samples is a good next step here.
/start
Tips:
Unfortunately you or I would just have to manually check old completed tasks and see their rewards. None in particular come to mind, but I would pay attention to those posted by "ubiquibot" instead of "ubiquity-os" as those used an older version of conversation rewards that seemed more accurate.
It is under the "formatting score" or "quantitative scoring" section. You might be able to search for these keywords in the codebase. I am on mobile, so pointing to code is not feasible. @gentlementlegen perhaps you can help with this point.
@sshivaditya2019, this task has been idle for a while. Please provide an update.
@gentlementlegen Really nice to see this finally working as expected. Except the revision hash in the metadata is undefined. This should be fixed!
I tried with this prompt. For the models, I would suggest a better approach would be to reduce the temperature.

{
  "1": 0.1,
  "2": 0.2,
  "3": 0.0
}

Explanation:
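For illustration, here is a minimal sketch of how a lowered-temperature relevance call producing that JSON shape might look. The model name, prompt wording, and helper function are assumptions for the sketch, not the plugin's actual code:

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // assumes OPENAI_API_KEY is set in the environment

// Hypothetical helper: ask the model to rate each comment's relevance to the spec.
async function scoreRelevance(spec: string, comments: string[]): Promise<Record<string, number>> {
  const response = await client.chat.completions.create({
    model: "gpt-4o", // assumption; any chat model with JSON output would do
    temperature: 0.2, // lowered, but not zero, to avoid the repeat/crash behavior discussed below
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "Rate how relevant each numbered comment is to the issue specification. " +
          'Reply with a JSON object mapping each comment number to a score between 0 and 1, e.g. {"1": 0.1}.',
      },
      {
        role: "user",
        content: `Specification:\n${spec}\n\nComments:\n${comments.map((c, i) => `${i + 1}. ${c}`).join("\n")}`,
      },
    ],
  });
  return JSON.parse(response.choices[0].message.content ?? "{}");
}
```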
Great idea except if temp is set too low I know it repeats and crashes. I'm pretty sure I played with these settings in my original implementation (see the repo called comment-incentives)
I'm pretty sure it's implemented this way. I know for a fact in my original implementation I had them all evaluate in one shot.
So I tested a few examples
Depends what is meant by "all together". Now it is all together, but by user, not all comments from the whole issue / PR in one block. The original implementation was the same except that there was a batch of 10 attempts averaged.
I see. Given the extended context windows of the latest models, perhaps we should do it all in one shot?
If that enhances precision and gives more context for better results, it is nice; however, I wonder if we would easily burst through the max tokens doing so for long reviews.
Context windows are so long these days, I am pretty sure it will be fine.
The problem is that this test runs on real data, which is subject to change. What could be done is including this test but excluding it from the test run. However, there is this issue that should eventually get that covered; I didn't have time to look into it.
Let me know if you’re not working on it right now, I can take a look at it then.
@0x4007 I think the issue spec for this one is a bit vague. I think the relevance issue should be fixed with the PR. The img credit is a configuration issue. But I am not sure about the outcomes for this ticket. Is the goal for the ticket to rewrite the entire
For vague specs, which happen occasionally, we are to share research and concerns here. We all get credited for it.
Some recent observations:
@0x4007 Embeddings will not solve the problem; I think the present relevance scoring is the best technique. In my opinion, a better approach would be to use something like a Bag of Words model with hierarchical labeling, and assign scores according to the depth of the concept. Let me know; I can put together a small write-up on this.
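To make that concrete, a minimal sketch of the bag-of-words-plus-depth idea might look like the following. The `conceptDepth` map is a hand-written stand-in for the hierarchical labels that would really come from an LLM/ML model:

```typescript
// Hypothetical concept hierarchy: deeper (more specific) concepts earn more credit.
const conceptDepth: Record<string, number> = {
  scoring: 1,
  relevance: 2,
  embeddings: 3,
  "tf-idf": 3,
};

// Plain bag of words: lowercase, split on non-word characters, count term frequencies.
function bagOfWords(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const token of text.toLowerCase().match(/[a-z0-9-]+/g) ?? []) {
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  return counts;
}

// Score a comment by summing depth-weighted hits on known concepts.
function depthWeightedScore(comment: string): number {
  let score = 0;
  for (const [term, count] of bagOfWords(comment)) {
    const depth = conceptDepth[term];
    if (depth !== undefined) score += depth * count;
  }
  return score;
}
```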
I think the present implementation focuses only on formatting. Effectively, this would mean that the entire
I am very skeptical about embeddings in this use case. As I mentioned before, embeddings provide local context and references, and on their own would not mean anything. I created my own script to visualize the embeddings and perform PCA to extract cluster centers.

Original comments: (embeddings plot omitted)
Embeddings plot with comments: (plot omitted)

Here you can see three distinctive cluster centers. In the embeddings plot with comments I have added a new comment
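The plotting/PCA script itself is not reproduced here, but the "cluster centers" part of that experiment can be sketched roughly as follows, assuming the comment embeddings have already been grouped into clusters (PCA and plotting omitted):

```typescript
type Vec = number[];

// Average the vectors of a cluster to get its center.
function centroid(vectors: Vec[]): Vec {
  const sum = vectors.reduce((a, b) => a.map((v, i) => v + b[i]));
  return sum.map((v) => v / vectors.length);
}

function cosineSimilarity(a: Vec, b: Vec): number {
  const dot = a.reduce((s, v, i) => s + v * b[i], 0);
  const norm = (v: Vec) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Find which cluster center a new comment's embedding is closest to,
// e.g. to check whether an off-topic comment lands far from all existing clusters.
function nearestCluster(clusters: Vec[][], newEmbedding: Vec): { index: number; similarity: number } {
  let best = { index: -1, similarity: -Infinity };
  clusters.forEach((cluster, index) => {
    const similarity = cosineSimilarity(centroid(cluster), newEmbedding);
    if (similarity > best.similarity) best = { index, similarity };
  });
  return best;
}
```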
My peer suggested some search-engine-results-related algorithm. I'm asking him now to clarify which. This should help us see how on-topic a comment is for the specification. We could consider adding this as one of several dimensions we evaluate the comments by.

Starting to wonder if sub-plugins are realistic, or if we should just make npm modules (or use something like git modules).

ChatGPT is recommending:
This seems a lot more interesting compared to word count, but we should test. The idea is that we can develop a proprietary algorithm that combines several strategies. Ideally we should make a playground that we can plug these different modules into and run tests against live GitHub issues to tweak it.

Strategy ideas:
My peer got back to me regarding the search engine recommendation.

TF-IDF (Term Frequency-Inverse Document Frequency) is a classic algorithm used in search and information retrieval to evaluate how important a word is to a document relative to a collection of documents (often referred to as a "corpus"). It helps identify which terms are most relevant to the context of a specific document.

In the Context of Your Goals: Evaluating GitHub Comments

Given your objective to measure the value of GitHub comments in relation to problem-solving, TF-IDF could be a useful tool to assess the relevance and informational density of individual comments with respect to the overall issue or conversation. Here's how TF-IDF might be applied in your scenario:

1. How TF-IDF Works
2. Applying TF-IDF to Evaluate Comment Relevance:
3. Enhancing Your Continuum-Based Scoring System:
Practical Steps for Implementation:
Benefits for Your Goals:
Using TF-IDF will give you an effective way to measure the informational value and relevance of comments, aligning well with your goal of continuum-based scoring. Let me know if you’d like to dive deeper into any specific aspect of this approach!
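To ground the suggestion, a minimal TF-IDF relevance sketch over a single thread, treating the spec plus its comments as the corpus and scoring each comment by cosine similarity to the spec, could look like this (illustrative only, not part of the plugin):

```typescript
type TermVector = Map<string, number>;

const tokenize = (text: string): string[] => text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// Term frequency for one document.
function termFrequency(tokens: string[]): TermVector {
  const tf = new Map<string, number>();
  for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1 / tokens.length);
  return tf;
}

// Inverse document frequency over the whole thread (spec + comments as the corpus).
function inverseDocumentFrequency(documents: string[][]): TermVector {
  const idf = new Map<string, number>();
  for (const doc of documents) {
    for (const term of new Set(doc)) idf.set(term, (idf.get(term) ?? 0) + 1);
  }
  for (const [term, docCount] of idf) idf.set(term, Math.log(documents.length / docCount));
  return idf;
}

function tfidfVector(tokens: string[], idf: TermVector): TermVector {
  const vec = new Map<string, number>();
  for (const [term, f] of termFrequency(tokens)) vec.set(term, f * (idf.get(term) ?? 0));
  return vec;
}

function cosine(a: TermVector, b: TermVector): number {
  let dot = 0;
  for (const [term, v] of a) dot += v * (b.get(term) ?? 0);
  const norm = (v: TermVector) => Math.sqrt([...v.values()].reduce((s, x) => s + x * x, 0));
  return norm(a) && norm(b) ? dot / (norm(a) * norm(b)) : 0;
}

// Relevance of each comment = cosine similarity of its TF-IDF vector to the spec's.
function relevanceScores(spec: string, comments: string[]): number[] {
  const docs = [spec, ...comments].map(tokenize);
  const idf = inverseDocumentFrequency(docs);
  const specVec = tfidfVector(docs[0], idf);
  return docs.slice(1).map((doc) => cosine(tfidfVector(doc, idf), specVec));
}
```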
In this case, I am not sure how this is relevant. Here, we are assigning scores within a
Just for context, TF-IDF is a transformation technique; it would give out a real-valued vector, to which we would then apply some distance metric like cosine similarity. This is very similar to embeddings and vector search.
These are not fixed. In the linked issue spec, comments were relevant to the topic but were flagged irrelevant. This will not be an issue, as we would either way implement stemming or lemmatization of the input phrases and tag for POS (parts of speech).
I don't think this is possible. We would need to have some dictionary or something (WordNet) to assign values for words. This would not cater to specific words in comments like, for example:
TF-IDF is a good starting point, but I don't believe it suits this problem well. We need to assign scores or relevances to comments, and since no two comment threads will have the same set of high-TF-IDF words, this could penalize terms that are highly relevant to the context individually but not as a whole across multiple comment threads.
I came up with a new approach to categorize comments into topic bins (the topic would be added using an LLM/ML model). We can then perform a similarity search using the

Next, we can assess user engagement for each comment based on various roles, such as reactions and replies. The weight assigned to different types of engagement can vary depending on the role (e.g.,

Additionally, we’ll incorporate a credibility score to evaluate whether a comment was made by a verified member of the organization, a regular collaborator, or an unknown user.

The overall score could be calculated using the following formula:

Where:
This will allow us to effectively evaluate the quality and relevance of comments.
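The exact formula did not survive in this thread, so the following is only one plausible shape of the weighted combination described above; the weights, role names, and normalization are assumptions:

```typescript
type Role = "author" | "collaborator" | "member" | "unknown";

// Assumed engagement weights per role of the person reacting or replying.
const ROLE_WEIGHT: Record<Role, number> = { author: 1.0, collaborator: 0.7, member: 0.5, unknown: 0.2 };

// Assumed mixing weights for the three signals.
const WEIGHTS = { relevance: 0.5, engagement: 0.3, credibility: 0.2 };

interface CommentSignals {
  relevance: number; // 0..1, from the topic-bin similarity search
  reactions: Role[]; // roles of users who reacted
  replies: Role[]; // roles of users who replied
  credibility: number; // 0..1, org member > collaborator > unknown author
}

function overallScore(c: CommentSignals): number {
  const rawEngagement = [...c.reactions, ...c.replies].reduce((sum, role) => sum + ROLE_WEIGHT[role], 0);
  // Squash engagement into 0..1 so a single highly-reacted comment cannot dominate.
  const engagement = rawEngagement / (1 + rawEngagement);
  return WEIGHTS.relevance * c.relevance + WEIGHTS.engagement * engagement + WEIGHTS.credibility * c.credibility;
}
```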
Credibility score we can adjust to the author no matter their position/relationship with the organization. The spec author generally has the clearest vision of the task, so if what is commented aligns with them (i.e. they agree) then more credit should be offered (this of course is only in the context of funded tasks).

Reactions we usually have a very limited amount of, but I think reactions from the author and core team could be a positive indicator.

If we can attribute block quotes, that can be interesting. The problem there is that I generally comment from mobile and block quotes can be inconvenient, but sometimes I make sure to use them in order to enhance clarity. I would be more curious to experiment with attributing block quote crediting.
Otherwise, are the method and scoring criteria fine? @Keyrxng rfc, I think this should be good enough.
@sshivaditya2019, this task has been idle for a while. Please provide an update.
Sure, yes, let's try. I have a feeling it might make sense to prototype strategies with a command line tool and then live-test against real examples before we fully integrate it into our system. Some strategies might prove to be bad, and it would be unfortunate to invest in building and integrating them and then scrapping them right away.
Is there an organization-wide execution time limit on plugins? I tried a version of this, but it's not very efficient since most processing is happening locally. Is this plugin designed to run on GitHub Actions or on Cloudflare Workers?
The conversation-rewards plugin is designed to run on GitHub Actions. This is because we need to generate a virtual DOM for every comment, which can take anywhere between 100 and 500 ms in my testing long ago. For long issues and linked pulls you can imagine that the rendering time can be quite substantial. With Cloudflare Workers we are limited to ~500 ms as I recall. We essentially have no limits for GitHub Actions (6 hours per job, 20 concurrent jobs allowed, unlimited per day).
I think timing would not be much of an issue then. Right now, a very crude version takes around 150 to 600 ms depending on thread length. I can get the exact performance benchmark, but roughly, for around 50 messages it takes around 200 ms (avg).
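For anyone wanting to reproduce that benchmark, a rough sketch is below; it assumes the comments are already rendered to HTML strings and uses jsdom only as a stand-in for whatever DOM library the plugin actually uses:

```typescript
import { JSDOM } from "jsdom";
import { performance } from "node:perf_hooks";

// Measure how long DOM construction takes per rendered comment,
// to compare against the 100-500 ms per-comment figure mentioned above.
function benchmarkRendering(commentHtml: string[]): { perComment: number[]; total: number } {
  const perComment = commentHtml.map((html) => {
    const start = performance.now();
    new JSDOM(html); // build a DOM for the rendered comment body
    return performance.now() - start;
  });
  return { perComment, total: perComment.reduce((a, b) => a + b, 0) };
}
```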
Scoring Criteria to Incorporate
Footnotes
I think the updated prompt works fine as well. You could check that out in the linked PR.

Footnotes
I think what might make the most sense for next steps:
I guess our plugins are down, but I anticipated a 300 USD price.
Note: This output has been truncated due to the comment length limit.
| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Task | 1 | 300 |
| Issue | Comment | 22 | 56.382 |
| Review | Comment | 9 | 0 |

[ 142.738 WXDAI ]

@0x4007

Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Specification | 1 | 5.85 |
| Issue | Comment | 21 | 131.142 |
| Review | Comment | 6 | 5.746 |

[ 29.612 WXDAI ]

@gentlementlegen

Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Comment | 8 | 29.612 |
@0x4007 I noticed the comment reward tends to be truncated a lot. Maybe we should look into optimizing the contents somehow.
Sure you can make a normal priority task for it |
Qualitative and quantitative analysis have unexpected results compared to how I implemented them in v1. Research and refine.
Originally posted by @0x4007 in ubiquity-os-marketplace/command-start-stop#14 (comment)