-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Major Refactoring for Enhanced Flexibility and Performance #90
base: main
Are you sure you want to change the base?
Conversation
# Changes - Now support multiple configs - New options for a more flexible config - ExcludeSelectors : Allow to remove content from selector. - MaxConcurrency : Allow crawler to be concurrent - Name : Define the name of the config. Useful when multi config. - Config now has default value set in the parser. - Output now is in it's own folder `output/` - Now display a multi progressbar instead of logging every page. # Known issues - INFO, ERROR, are not displayed correctly now that the multibar feature is implemented - Progressbar is append to next line if terminal width is too small.
toArray now make empty Array if string empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice @maxime4000! I see lots of value to these changes! I'm willing to partake to resolve the logging 🤗
My first take would be on the prettier config. Would you mind keeping the tab width 2 or 1 (I'm not sure what's the default)? It occurs to me that it would make the PR review and evolution easier as formatting changes wouldn't be present 🤗
export async function sequentialAsyncMap<T, R>( | ||
array: T[], | ||
asyncFunction: (value: T, index: number, array: T[]) => Promise<R>, | ||
): Promise<R[]> { | ||
return array.reduce( | ||
async (previousPromise: Promise<R[]>, currentItem: T, index, array) => { | ||
const accumulator = await previousPromise; | ||
const result = await asyncFunction(currentItem, index, array); | ||
return [...accumulator, result]; | ||
}, | ||
Promise.resolve([] as R[]), | ||
); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not being used anywhere, right?
export async function sequentialAsyncMap<T, R>( | |
array: T[], | |
asyncFunction: (value: T, index: number, array: T[]) => Promise<R>, | |
): Promise<R[]> { | |
return array.reduce( | |
async (previousPromise: Promise<R[]>, currentItem: T, index, array) => { | |
const accumulator = await previousPromise; | |
const result = await asyncFunction(currentItem, index, array); | |
return [...accumulator, result]; | |
}, | |
Promise.resolve([] as R[]), | |
); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exact! I had few issues with data being saved in the wrong config. I use that one to make it sequential and see what is happening. I found out that the problem was the Dataset/RequestQueue that didn't exist, so I change it back to a parallel afterward. Will remove it as it is not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, after reviewing the code I got to a conclusion:
I'm in for multiple configurations and enhancing the overall performance and flexibility of the crawler, but my suggestion is to address the multi progress bar feature separately, on a different pull request, specially because it introduces possibly undesired behaviours on logging and needs further attention.
This way we can release the multi-config feature and work on enhancing the logging apart.
What do you think @maxime4000?
@marcelovicentegc Thanks I appreciate :)
Caveat of Reverting the LoggerWhen running multiple crawlers simultaneously, log outputs will mix, making real-time tracking challenging. This issue mainly arises with multiple configurations. But I'm also guessing that if you use this tools with multiple config, you kinda know what you are doing. 😉 Timeline for ChangesI plan to complete thoses changes by tomorrow or the next day. However, I'd prefer to have others review these changes before proceeding further. The proposed changes are substantial, and I've made some important decisions independently. It's prudent to wait for feedback, especially from someone at BuilderIO, to ensure there's no disagreement or need for discussion about these changes before doing changes. |
Maybe the executor should also be changed. |
@subframe7536 While you do have a point about the speed of tsc and esbuild/swc would be a better transpiller, we could also says that when we run While your point is an improvement, this PR is already doing too much and thus, I won't make that change here. Though, I suggest you to make a separate PR with this change as it would be faster if you are interested to bring that to the project. |
I'm sorry @marcelovicentegc, I couldn't make it... I was optimistic about my own time and deadlines to do the things I promise to do. And I blinked, and it's been a month and couldn't put the time to finish it... I think about it, I'm anxious about the promise of doing it but can't find the time to do it. Also, the facts are that my customs GPTs using the output produced by this PR wasn't any better than without it. In facts, chatgpt was slower, more prone to "network error" and didn't give better results than without "the knowledge". Open AI did update their model to use more recent data and it made it feels like this PR was procrastination with my borrowed times as I should be using this time to build the things I want to build instead of improving a "random" tools that will scrape data to build knowledge files that will slightly help an AI to be more "accurate" when the AI is already able to help me without this data and also the output generated from the tools and the improvement I made didn't end up used because the experience was worse when those files were added to the custom GPT... Motivation has been low as the beneficial gains of using the tools weren't substantial. I came here with a "take it or leave it" mentality, I promise to do better and support my work, but the urgency of finishing the work fall off short as I didn't used that work in the end... I'm still planning to finish this PR, but I wanted to be transparent about this. I'm on borrowed times with my current life situation, I want to use that time to get the most of it and hopefully secure an independent prosper future. This was procrastination to my current goals. I wish it would have been more useful to me. I will take time to finish it someday, but it's not my priority at the moment. Reviewing changes is not the same mental load as doing the changes and I'm still open about reviewing contributions to merge those improvements the sooner, but can't give an estimate of when will I do the promised changes. On that note, I wish everyone who read this a good holiday season, an happy new years and to take time to take care of yourself 😊 |
🚨 Warning
This is fully functionnal and shouldn't break anythings that weren't breaking, but it is not fully "finished/stable". I would like to have someone finishing the PR. Open to revert some possibly "controversial" changes, but not interested to fix the multi progressbar as I don't have deep knowledge into playwright and the logger.
Overview
This PR introduces a major refactoring of the codebase, adding support for multiple configurations and enhancing the overall performance and flexibility of the crawler.
Was talked briefly in #82
Key Changes
output/
, improving data management.output/data.json
.npm run prettier
ornpm run fmt
fmt
Partially Completed Features and Known Issues
Current State and Request for Collaboration
The changes made significantly improve the tool's capabilities. However, they are not entirely stable, particularly regarding terminal output. I seek collaboration from the community to resolve these issues, especially the log display discrepancies and progress bar wrapping.
Note :
I am open to make those changes as they might be controversial: