I recommend to speed-up with the clickhouse-git-import tool. #2

Open
alexey-milovidov opened this issue Jan 29, 2023 · 0 comments

Installation:

curl https://clickhouse.com/ | sh

Usage:

./clickhouse git-import --help

This prints the documentation and usage of the tool.

Then run the tool directly inside the git repository.
It collects data such as commits, file changes, and changes to every
line in every file for further analysis.
It works well even on the largest repositories, such as Linux or Chromium.

Example of a trivial query:

SELECT author AS k, count() AS c
FROM line_changes
WHERE file_extension IN ('h', 'cpp')
GROUP BY k
ORDER BY c DESC LIMIT 20

Example of a non-trivial query: for each author, how much code they
wrote and how much of it was later removed by other authors:

SELECT k, written_code.c, removed_code.c,
    round(removed_code.c * 100 / written_code.c) AS remove_ratio
FROM (
    SELECT author AS k, count() AS c
    FROM line_changes
    WHERE sign = 1 AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
    GROUP BY k
) AS written_code
INNER JOIN (
    SELECT prev_author AS k, count() AS c
    FROM line_changes
    WHERE sign = -1 AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
        AND author != prev_author
    GROUP BY k
) AS removed_code USING (k)
WHERE written_code.c > 1000
ORDER BY c DESC LIMIT 500
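
To make the logic of the query above easier to follow, here is a minimal Python sketch that applies the same filters and the same join to a handful of in-memory rows. The sample rows, the author names, the 'Code' line type label, and the remove_ratios helper with its min_written parameter are all made up for illustration; only the column names and filter values come from the query. Note also that Python's round() may differ from ClickHouse's round() on exact .5 values.

```python
from collections import Counter

# Hypothetical sample rows shaped like line_changes. Only the columns used
# by the query are included:
# (author, prev_author, sign, file_extension, line_type)
ROWS = [
    ("alice", "", 1, "cpp", "Code"),
    ("alice", "", 1, "cpp", "Code"),
    ("alice", "", 1, "cpp", "Code"),
    ("alice", "", 1, "cpp", "Code"),
    ("alice", "", 1, "cpp", "Empty"),      # skipped: line_type is 'Empty'
    ("bob", "", 1, "h", "Code"),
    ("bob", "", 1, "h", "Code"),
    ("carol", "", 1, "py", "Code"),        # skipped: extension not in ('h', 'cpp')
    ("bob", "alice", -1, "cpp", "Code"),   # bob removes a line written by alice
    ("alice", "alice", -1, "cpp", "Code"), # skipped: author removes own line
    ("alice", "bob", -1, "h", "Code"),     # alice removes a line written by bob
]

CODE_EXTENSIONS = {"h", "cpp"}
SKIPPED_LINE_TYPES = {"Punct", "Empty"}


def remove_ratios(rows, min_written=0):
    """Return {author: (written, removed_by_others, remove_ratio_percent)}."""
    written = Counter()   # mirrors the written_code subquery (sign = 1)
    removed = Counter()   # mirrors the removed_code subquery (sign = -1)
    for author, prev_author, sign, extension, line_type in rows:
        if extension not in CODE_EXTENSIONS or line_type in SKIPPED_LINE_TYPES:
            continue
        if sign == 1:
            written[author] += 1
        elif sign == -1 and author != prev_author:
            # The removed line is credited to whoever wrote it originally.
            removed[prev_author] += 1

    result = {}
    # INNER JOIN ... USING (k): keep authors present in both counters.
    for k in written.keys() & removed.keys():
        if written[k] > min_written:  # the query uses written_code.c > 1000
            ratio = round(removed[k] * 100 / written[k])
            result[k] = (written[k], removed[k], ratio)
    return result


if __name__ == "__main__":
    for k, (w, r, ratio) in sorted(remove_ratios(ROWS).items()):
        print(k, w, r, ratio)
```

On real data the threshold would be 1000 as in the query; the default of 0 here only keeps the toy rows from being filtered out.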