Files which are never created or modified #42

Open
maelick opened this issue Jun 11, 2013 · 14 comments

@maelick
Contributor

maelick commented Jun 11, 2013

Many files are never created. For example, in Tomboy there are 5143 entries in the files table, but only 4514 are actually referenced in actions (i.e. 629 are never touched). Moreover, only 3202 have been created (added or copied) at least once. With a bigger repository like Evolution the gap becomes enormous: out of 4941692 file entries, only 19672 have been created!

Here are the queries I've used:

  • SELECT COUNT(DISTINCT f.id) FROM files f, repositories r WHERE f.repository_id = r.id AND r.name = ?;
  • SELECT COUNT(DISTINCT f.id) FROM files f, repositories r, actions a WHERE a.file_id = f.id AND a.type IN ("a", "c") AND f.repository_id = r.id AND r.name = ?;

I have crawled through the code looking for the source of the problem, but I haven't found it yet. In general, too many entries are created in the files table (like the enormous number of entries in Evolution), and my intuition is that this is related to branches. For example, if a file is created in the master branch, then a new branch is created and the file is modified in this new branch, a new file entry will be created (and also one for each of the parent directories).

This might be related to issue #3, as I have also seen 5 files in Tomboy for which two entries are associated with the same commits. For one of them, the file is renamed in one branch but was created in another, so new entries are created in files and file_links here, and then a second file_links entry is created for the rename action here

@andygrunwald
Contributor

Hey @maelick,

can you provide more information on this topic?
E.g. which CVSAnaly2 + repository version do you use? What are the commands you executed? Which extensions were activated?

@maelick
Contributor Author

maelick commented Nov 27, 2013

Using the latest version of CVSAnalY
(3d67e70) on the latest version of Tomboy
(https://git.gnome.org/browse/tomboy/,
cea2c730f3fe135067a26aafd6dd258348932662), without extensions, in an
empty MySQL database:

git clone https://github.com/MetricsGrimoire/CVSAnalY.git
cd CVSAnalY
git checkout 3d67e700902e54d8ac3cfac60169e80b68d70f4e
cd ..
git clone git://git.gnome.org/tomboy
cd tomboy
git checkout cea2c730f3fe135067a26aafd6dd258348932662
cd ..
./CVSAnalY/cvsanaly2 -u <user> -p <pass> -d <db> tomboy/

Then I used the following SQL queries to compute the total number of files
and the number of files that have been created (an add or copy action on them).

SELECT COUNT(*) FROM files;
SELECT COUNT(DISTINCT f.id) FROM files f, actions a
WHERE f.id = a.file_id AND a.type IN ("C", "A");

This gives 3218 files that have been added or copied out of 5181
files.

For example, let's have a look at the README file at the root of
Tomboy's repository. First I created views to get the IDs of the files that
were never created and the number of file links for each file:

CREATE VIEW not_created_files AS
SELECT * FROM files WHERE id NOT IN (
  SELECT DISTINCT f.id FROM files f, actions a
  WHERE f.id = a.file_id
  AND a.type IN ("C", "A")
);

CREATE VIEW file_links_count AS
SELECT f.id, COUNT(*) n FROM files f, file_links fl
WHERE f.id = fl.file_id GROUP BY f.id;

Then I looked at those files and the different actions that were done
on them:

SELECT f.id, fl.file_path, flc.n, a.type, a.branch_id, l.date
FROM files f, file_links fl, file_links_count flc, actions a, scmlog l
WHERE fl.file_id = f.id AND flc.id = f.id AND f.id = a.file_id
AND f.file_name = "README" AND fl.file_path = "README" AND a.commit_id = l.id
ORDER BY l.date;

This gives the following result:

+------+-----------+---+------+-----------+---------------------+
| id   | file_path | n | type | branch_id | date                |
+------+-----------+---+------+-----------+---------------------+
|  595 | README    | 1 | A    |         1 | 2004-09-20 09:13:55 |
|   64 | README    | 1 | A    |         2 | 2004-09-20 09:13:55 |
|  595 | README    | 1 | M    |         1 | 2004-10-08 12:56:20 |
|   64 | README    | 1 | M    |         2 | 2004-10-08 12:56:20 |
|  595 | README    | 1 | M    |         1 | 2005-04-04 17:38:32 |
|   64 | README    | 1 | M    |         2 | 2005-04-04 17:38:32 |
|  595 | README    | 1 | M    |         1 | 2006-01-26 08:32:15 |
|   64 | README    | 1 | M    |         2 | 2006-01-26 08:32:15 |
|   64 | README    | 1 | M    |         2 | 2006-11-16 15:44:57 |
|  595 | README    | 1 | M    |         1 | 2006-11-16 15:44:57 |
|   64 | README    | 1 | M    |         2 | 2007-01-04 21:35:47 |
|  595 | README    | 1 | M    |         1 | 2007-01-04 21:35:47 |
|   64 | README    | 1 | D    |         2 | 2007-03-05 17:11:31 |
| 1636 | README    | 1 | M    |         5 | 2007-05-12 13:37:36 |
|  595 | README    | 1 | M    |         1 | 2007-05-12 13:37:36 |
| 3694 | README    | 1 | M    |         9 | 2007-09-14 22:27:18 |
|  595 | README    | 1 | M    |         1 | 2007-09-14 22:27:18 |
| 2909 | README    | 1 | M    |         8 | 2007-12-07 21:34:48 |
|  595 | README    | 1 | M    |         1 | 2009-04-22 09:57:31 |
| 4729 | README    | 1 | M    |        14 | 2009-05-10 18:29:59 |
+------+-----------+---+------+-----------+---------------------+

The "n" column is used to ensure that the file has only one file link
(was never renamed) and that we don't miss some actions.

First, a README file was added at the same time on two different
branches (this is actually an issue in the Tomboy repository with tags
exported from SVN to Git), but the other file entries were each modified
only once, on a different branch.

Another example, the files never created on the master branch:

SELECT f.id, fl.file_path, flc.n, a.type, a.branch_id, l.date
FROM not_created_files f, file_links fl, file_links_count flc,
actions a, scmlog l, branches b
WHERE fl.file_id = f.id AND flc.id = f.id AND f.id = a.file_id
AND a.branch_id = b.id AND b.name = "master" AND a.commit_id = l.id
ORDER BY l.date;
+------+-----------------------------------------------------------------------------+---+------+-----------+---------------------+
| id   | file_path                                                                   | n | type | branch_id | date                |
+------+-----------------------------------------------------------------------------+---+------+-----------+---------------------+
| 4360 | Tomboy/TagEntry.cs                                                          | 1 | M    |         1 | 2009-01-19 23:30:07 |
| 4375 | Mono.Addins/Mono.Addins/Mono.Addins/AddinLocalizer.cs                       | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4368 | Mono.Addins/Mono.Addins/Mono.Addins.Localization/GettextLocalizer.cs        | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4376 | Mono.Addins/Mono.Addins/Mono.Addins/IAddinInstaller.cs                      | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4369 | Mono.Addins/Mono.Addins/Mono.Addins.Localization/IAddinLocalizer.cs         | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4361 | Mono.Addins/Mono.Addins.Gui/Mono.Addins.Gui/AddinInstaller.cs               | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4377 | Mono.Addins/Mono.Addins/Mono.Addins/InstanceExtensionNode.cs                | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4362 | Mono.Addins/Mono.Addins.Gui/Mono.Addins.Gui/AddinInstallerDialog.cs         | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4370 | Mono.Addins/Mono.Addins/Mono.Addins.Localization/IAddinLocalizerFactory.cs  | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4363 | Mono.Addins/Mono.Addins.Gui/gtk-gui/Mono.Addins.Gui.AddinInstallerDialog.cs | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4371 | Mono.Addins/Mono.Addins/Mono.Addins.Localization/IPluralAddinLocalizer.cs   | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4364 | Mono.Addins/Mono.Addins.Gui/icons/system-software-update.png                | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4372 | Mono.Addins/Mono.Addins/Mono.Addins.Localization/NullLocalizer.cs           | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4365 | Mono.Addins/Mono.Addins.Setup/AssemblyInfo.cs                               | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4373 | Mono.Addins/Mono.Addins/Mono.Addins.Localization/StringResourceLocalizer.cs | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4366 | Mono.Addins/Mono.Addins.Setup/Mono.Addins.Setup/ConsoleAddinInstaller.cs    | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4374 | Mono.Addins/Mono.Addins/Mono.Addins.Localization/StringTableLocalizer.cs    | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4367 | Mono.Addins/Mono.Addins/Mono.Addins.Localization/GettextDomain.cs           | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 4360 | Tomboy/TagEntry.cs                                                          | 1 | D    |         1 | 2009-01-31 09:31:33 |
| 4378 | Tomboy/WrapBox.cs                                                           | 1 | D    |         1 | 2009-01-31 09:31:33 |
| 4947 | help/C/tomboy.xml                                                           | 1 | D    |         1 | 2010-06-28 12:20:12 |
| 4967 | help/C/tomboy-C.omf                                                         | 1 | D    |         1 | 2010-06-28 12:20:18 |
+------+-----------------------------------------------------------------------------+---+------+-----------+---------------------+

Here most of them are files that were only deleted.

My first intuition was that when a file is created in branch X, then
an action (M, D, V) is done in branch Y on that file, a new file entry
is added in branch Y although it shouldn't.

But looking at the file
"Mono.Addins/Mono.Addins/Mono.Addins/AddinLocalizer.cs" shows more
weird things and I really have no idea what happens:

SELECT f.id, fl.file_path, flc.n, a.type, a.branch_id, l.date
FROM files f, file_links fl, file_links_count flc, actions a, scmlog l
WHERE fl.file_id = f.id AND flc.id = f.id AND f.id = a.file_id
AND fl.file_path = "Mono.Addins/Mono.Addins/Mono.Addins/AddinLocalizer.cs"
AND a.commit_id = l.id ORDER BY l.date;
+------+-------------------------------------------------------+---+------+-----------+---------------------+
| id   | file_path                                             | n | type | branch_id | date                |
+------+-------------------------------------------------------+---+------+-----------+---------------------+
| 3262 | Mono.Addins/Mono.Addins/Mono.Addins/AddinLocalizer.cs | 1 | C    |         1 | 2008-03-02 21:23:19 |
| 3929 | Mono.Addins/Mono.Addins/Mono.Addins/AddinLocalizer.cs | 1 | C    |         9 | 2008-03-02 21:23:19 |
| 4375 | Mono.Addins/Mono.Addins/Mono.Addins/AddinLocalizer.cs | 1 | D    |         1 | 2009-01-21 18:45:31 |
| 3262 | Mono.Addins/Mono.Addins/Mono.Addins/AddinLocalizer.cs | 1 | D    |         1 | 2009-01-21 18:45:31 |
+------+-------------------------------------------------------+---+------+-----------+---------------------+

I have found this problem to be recurrent across most GNOME git repositories
(with the most extreme case being Evolution). While there are many issues
with those repositories because of the conversion from CVS to SVN
and then from SVN to Git, I experienced similar problems with other
"pure" git repositories when using CVSAnalY.

@andygrunwald
Contributor

Thanks for this detailed description.
Sadly I won't have time to have a look now (or until February), but maybe some other team members can.

@ghost ghost assigned sduenas Nov 27, 2013
@maelick
Contributor Author

maelick commented Nov 27, 2013

Here's more information.

I created a simple test git repository as follows:

mkdir test
cd test
git init
echo "hello" > file
git add file
git commit -m "added"
git checkout -b test
echo "hello world" > file
git commit -am "modified"
git checkout master
echo "hello foo" > file
git commit -am "modified master"
cd ..
./CVSAnalY/cvsanaly2 -u <user> -p <pass> -d <db> ./test

It contains a single file and two branches, each with one commit modifying the file.
This gives the following content for table actions:

+----+------+---------+-----------+-----------+
| id | type | file_id | commit_id | branch_id |
+----+------+---------+-----------+-----------+
|  1 | A    |       1 |         1 |         1 |
|  2 | M    |       2 |         2 |         2 |
|  3 | M    |       1 |         3 |         1 |
+----+------+---------+-----------+-----------+
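The three rows above can be loaded into an in-memory SQLite database to check the "never created" count by hand. This is only a minimal sketch: it assumes the files table holds exactly the two rows those actions reference, and the reduced schema (including the `file_name` column) is a simplification for illustration, not CVSAnalY's full schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, file_name TEXT);
CREATE TABLE actions (
    id INTEGER PRIMARY KEY, type TEXT, file_id INTEGER,
    commit_id INTEGER, branch_id INTEGER
);
""")
# Assumed content of the files table: one row per file_id seen in actions.
conn.executemany("INSERT INTO files VALUES (?, ?)", [(1, "file"), (2, "file")])
# The three action rows reported for the test repository.
conn.executemany(
    "INSERT INTO actions VALUES (?, ?, ?, ?, ?)",
    [(1, "A", 1, 1, 1), (2, "M", 2, 2, 2), (3, "M", 1, 3, 1)],
)

total = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
created = conn.execute(
    "SELECT COUNT(DISTINCT f.id) FROM files f "
    "JOIN actions a ON f.id = a.file_id WHERE a.type IN ('A', 'C')"
).fetchone()[0]
never_created = [row[0] for row in conn.execute(
    "SELECT id FROM files WHERE id NOT IN "
    "(SELECT file_id FROM actions WHERE type IN ('A', 'C')) ORDER BY id"
)]
print(total, created, never_created)  # 2 1 [2]
```

File id 2 is the entry that has a modification but no add or copy action.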

Running CVSAnalY in debug mode shows that when the second commit (the
one on the "test" branch) is added to the database, the file searched for
in the cache is "<branch_id>://file". Thus for this second commit it
searches for "2://file", which doesn't exist, and adds a new entry in files:

DBG: DBContentHandler: commit: 2 rev: bcfe4087a2830c60fae43e59ead58e4204ebfa0c
DBG: DBContentHandler: Action: M
DBG: DBContentHandler: ensure_branch test
DBG: SELECT id from branches where name = %s
DBG: INSERT INTO branches (id, name) values (%s, %s)
DBG: DBContentHandler: Looking for path 2://file in cache
DBG: DBContentHandler: looking for path 2://file in moves cache
DBG: DBContentHandler: ensure_path 2://file

I wonder why the branch id is added to the file path. Is this
related to the way CVS treats branches, and thus needed to
handle CVS branches? Extracting a CVS repository
(http://sourceforge.net/p/tyrant/code/) also yields files that are never
created. This doesn't make sense when using git, and I don't know if it does for CVS or even SVN...

Does anyone know if adding the branch id to the file path is needed? And if it is, why?

@andygrunwald
Contributor

Hey @maelick,

a big thank you for the detailed test case.
I can reproduce this behaviour.
And I've tested your change. At first look it seems to work. This is the next big thank you :)

Would you be so kind as to create a pull request with your changes?
If you don't have time for this, I can do it for you.
After that we will wait for another reviewer.

What do you think :) ?

@maelick
Contributor Author

maelick commented Dec 9, 2013

Pull request sent ;)

However this solves the problem only partially.

For example, there is still a problem with this dummy repository:

mkdir test
cd test
git init
echo "hello" > file
git add file
git commit -m "added"
git branch test
git rm file
sleep 1
git commit -m "file removed"
git checkout test
echo "hello world" > file
sleep 1
git commit -am "modified"
git checkout master
cd ..
./CVSAnalY/cvsanaly2 -u <user> -p <pass> -d <db> ./test

In this case the file is removed from the master branch but was modified on the second branch. CVSAnalY first processes the commits from the master branch, and then the commits from the second branch. This means that when the commit on the second branch is processed, CVSAnalY finds no file matching the path in the cache (see here) and thus adds a new file entry although it shouldn't. This is more difficult to fix because it will probably require knowledge of the commit DAG.
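A small simulation may make the ordering problem concrete. It is a sketch under stated assumptions: the commit names `c1` to `c3` are made up, a delete is assumed to drop the path from the cache, and `dag_aware` is one hypothetical way of using parent links, not an existing CVSAnalY feature.

```python
from itertools import count

# (commit, parent, action type, path), in the processing order described
# above: master's commits first, then the commit on the second branch.
commits = [
    ("c1", None, "A", "file"),
    ("c2", "c1", "D", "file"),  # master: file removed
    ("c3", "c1", "M", "file"),  # second branch: file modified
]


def flat_cache(commits):
    """One global path cache; a delete drops the path (assumed behaviour)."""
    ids, cache, out = count(1), {}, {}
    for commit, _parent, kind, path in commits:
        if path not in cache:
            cache[path] = next(ids)
        out[commit] = cache[path]
        if kind == "D":
            del cache[path]
    return out


def dag_aware(commits):
    """Each commit starts from a copy of its parent's path -> id snapshot."""
    ids, snapshots, out = count(1), {None: {}}, {}
    for commit, parent, kind, path in commits:
        tree = dict(snapshots[parent])  # parents must be processed first
        if path not in tree:
            tree[path] = next(ids)
        out[commit] = tree[path]
        if kind == "D":
            del tree[path]
        snapshots[commit] = tree
    return out


print(flat_cache(commits))  # {'c1': 1, 'c2': 1, 'c3': 2}
print(dag_aware(commits))   # {'c1': 1, 'c2': 1, 'c3': 1}
```

Keeping one snapshot per commit is simple but memory-hungry; a real implementation would need something more economical, and merge commits with several parents are not handled here.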

I suppose other weird things could happen even in a repository without explicit branches, because of implicit branching. For example, what happens if the "test" branch is merged back into master? The result may depend on the order in which the commits are made...

I also wonder if this is also an issue in a centralized VCS.

One solution that could partially resolve, or at least minimize, the problem is to parse only the master branch. I have already written code that adds a "--no-ref" option to the CVSAnalY CLI, which removes the "--all" option from the git CLI call made by RepositoryHandler. I will create a pull request for this soon.
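The idea can be sketched roughly as follows. This is not the code from the pull request: `build_git_log_command` is a hypothetical helper, and the real git invocation in RepositoryHandler carries more arguments than shown. Only the `--all` flag of `git log`, which walks every ref instead of just the current branch, is taken as given.

```python
import argparse


def build_git_log_command(argv):
    """Return a git log argument list, dropping --all when --no-ref is set."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--no-ref",
        action="store_true",
        help="only follow the current branch instead of every ref",
    )
    opts = parser.parse_args(argv)

    cmd = ["git", "log"]
    if not opts.no_ref:
        cmd.append("--all")  # default: include commits reachable from any ref
    return cmd


print(build_git_log_command([]))            # ['git', 'log', '--all']
print(build_git_log_command(["--no-ref"]))  # ['git', 'log']
```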

@sduenas
Member

sduenas commented Dec 9, 2013

This is a matter of how CVSAnalY was designed. I'm going to use the last case that @maelick posted to show you what the rationale is behind this. It's not really easy to explain (even in Spanish! and my English is not so good) so please, ask me anything that you don't understand.

First of all, you have to take into account that our main source is the repository log. The key idea is to track the changes in the repository using just the log. Every action stored in the database is there because it was found in the log. We don't guess or invent anything (with minor exceptions, of course ;) ). This means that if you don't find an action for a file, it is because it doesn't exist in the log.

When a branch is created there aren't actions about which 'branched files' were added (SVN is the exception, read below). You will only find 'A' (add) actions for those files that are new on that branch.

When we were coding how to track branches and their files in the database, we were tempted to create add actions for the 'branched files', but in the end we considered that extremely inefficient in terms of memory and database performance. To add those 'branched files' we would have to keep in memory the directory structure of every branch (tracking their changes) and store in the database thousands of file entries for files that will never be modified, deleted, copied, etc. We rejected that idea and followed another approach.

We decided to consider that a file in a branch is new the first time that there is an action over it in that branch. Take into account that at this point CVSAnalY doesn't know anything about which files are on the tree and which file is which in another branch, either. When this happens, a new file_id is created for that file in that branch.

Let's move to @maelick example to see how this is done:

+----+------+---------+-----------+-----------+
| id | type | file_id | commit_id | branch_id |
+----+------+---------+-----------+-----------+
|  1 | A    |       1 |         1 |         1 |
|  2 | M    |       2 |         2 |         2 |
|  3 | M    |       1 |         3 |         1 |
+----+------+---------+-----------+-----------+

The first commit (id 1) creates the file "file" on master branch (branch_id 1). Then, a new branch 'test' (branch_id 2) is created and "file" is modified. When the branch is created there aren't 'A' actions in the log, so no new file is created. Then, 'file' is modified. It's the first time that CVSAnalY knows anything about this file, so the new file_id 2 for this file is created and added to the actions table.

And now: what the hell happens with SVN? Well, SVN is our nemesis, the mother of all evil... the reason for our workarounds and tricks in CVSAnalY.

In SVN there aren't real branches. In SVN a branch is a directory that someone says is a branch. Creating a branch in SVN means copying the trunk directory to another place. That's the reason why you can find 'A' actions for branch files: there were explicit add actions on those files (svn add commands). But you may also find none of these actions, because if instead of adding files you just add the directory containing those files, the SVN log only records that a directory was added. Damn it!

Why do we need a branch_id for the files and actions? The documentation is really clear about that.
(See https://github.com/MetricsGrimoire/CVSAnalY/blob/master/help/doc/cvsanaly.texi)

While it's logical to think that a commit is always associated to a
single branch, that's not true in SVN repositories. The fact that
branches don't really exist in SVN (they are just paths in the
repository), makes possible to find commits involving files from different
branches for the same revision. It happens, indeed, more often
than expected. So, in most of the cases, all actions referencing the
same commit will reference the same branch too, but we need to keep the
relationship between action and branch in order to support all other cases. 
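To illustrate the case the documentation describes, here is a minimal sketch that groups the paths of one SVN revision by branch. It assumes the conventional trunk/branches/tags layout, which a repository is free not to follow; `svn_branch` and the sample paths are hypothetical.

```python
from collections import defaultdict


def svn_branch(path):
    """Map a path to a branch name, assuming a trunk/branches/tags layout."""
    parts = path.strip("/").split("/")
    if parts[0] == "trunk":
        return "trunk"
    if parts[0] in ("branches", "tags") and len(parts) > 1:
        return f"{parts[0]}/{parts[1]}"
    return "(unknown)"


# Hypothetical revision that touches files under two branch directories.
revision_paths = ["/trunk/src/main.c", "/branches/stable/src/main.c"]

by_branch = defaultdict(list)
for p in revision_paths:
    by_branch[svn_branch(p)].append(p)

print(sorted(by_branch))  # ['branches/stable', 'trunk']
```

Because one revision maps to two branches here, the branch has to be recorded per action rather than per commit.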

Keep in mind that CVSAnalY was first designed for CVS and SVN. Git was added later. Nowadays CVS and SVN are deprecated. Rethinking how to do these things can lead us to a better design.

@maelick
Contributor Author

maelick commented Dec 10, 2013

Thanks for your clarification. I understand why there is a branch_id field in the actions table. However, my concern was not about that field. Moreover, you also mentioned a branch_id field in the files table, but there isn't one (only repository_id). When I mentioned the branch id, I was not referring to the database schema but to the Python source code of CVSAnalY. For example, those lines add the branch id before the file path and thus create a file entry for each branch when there is an action on that branch and file.

Again, I clearly understand why actions need to be related to branches, but why do files? The only answer I can find to this question is that it can lead to problems when files are deleted on one branch and not on the other... which is exactly the kind of problem I describe here. The main related problem I encountered is that it is impossible to reconstruct the list (or number) of files that existed in a repository at a given time, even for a centralized VCS.

Maybe one way of solving it would be to actually add a branch_id field to the files table.

@maelick
Contributor Author

maelick commented Dec 10, 2013

Talking of design, I think the problem is tightly tied to the difference between files and file_links (see). I think it's a really nice feature, but unfortunately I fear it causes a lot of problems. Thinking about the Evolution case that I mentioned earlier, there are a little less than 5 million entries in the files table. Actually there shouldn't be that many (there are fewer than 20000 unique absolute file paths). I think that when working with big repositories with a lot of branches it becomes impossible to fulfill the original goal of the files/file_links feature: "Assigning identifiers to the files instead of the paths we can follow the history of any given file even if it's renamed or moved." On the contrary, it makes it impossible...

@sduenas
Member

sduenas commented Dec 13, 2013

Files need to be related to branches because the content of a file can differ between branches, and a file can be deleted (as you wrote) in one branch and not in the others, replaced by other files, etc. This is useful for extensions like Metrics. The Metrics extension retrieves the contents of files (to calculate SLOC and other metrics), and you have to specify from which branch you get that file.

Files are linked to branches via actions table. If I remember well, it is some kind of improvement to avoid replication of data among tables. branch_id can be also included in file_links table but as common queries go through actions table you can get the id from there.

Regarding why branch_id is added to the path of a file it's because CVSAnalY stores a cache of files (the class DBContentHandler that you mentioned) and this path is used as key to know whether the file exists in the cache or not.

@sduenas
Member

sduenas commented Dec 13, 2013

I forgot to mention that you can reconstruct the file tree of a repository for a given revision... but it's very tricky. CVSAnalY was never designed to do that in an easy way, because the fastest and most reliable way of getting it is to use the source code repository itself. Repositories were designed for it :)

@maelick
Contributor Author

maelick commented Dec 17, 2013

I still don't get why a branch id is needed in both actions and files. What bothers me even more is that the branch_id is not in the files table but in the cache. You said that files need to be specific to each branch because SVN can modify the same file on many branches in the same commit. OK, but then why not put it into the files table? Moreover, if the goal is also to avoid data replication, it would make more sense to have the branch_id in files rather than in actions. A branch id in actions makes sense if there is one file entry for all branches. A branch id in files makes sense if there is one file entry for each branch. If it's important to keep files for each branch, I think we should move the branch_id field from actions to files.

Regarding reconstructing the file tree, this is not what I want to do. I am rather interested in things like counting how many code files were in the repository at a given time, for example to get an idea of the size of the repository. Reconstructing the file tree is easy with a VCS (in particular git). Counting the number of code files is not as trivial, and it should be as easy and efficient as: SELECT r.name, COUNT(DISTINCT f.id) FROM repositories r, files f WHERE r.id = f.repository_id AND f.file_name LIKE "%.py" GROUP BY r.id

If you do that now on a repository with a lot of branches, you'll get a completely biased result. You could argue that then maybe you can simply restrict to the main branch. You'll miss files that have been created in the main branch but maybe it's not important. But then in this case we should have the branch_id in files rather than actions.
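One possible workaround, sketched here with made-up data, is to count by path instead of by file id: replay the actions of a single branch in date order and track which paths are alive. `count_files_at` is a hypothetical helper; it ignores renames and moves, and it treats any non-delete action as evidence that the path exists, which sidesteps entries that have no add action.

```python
from datetime import datetime


def count_files_at(actions, when, suffix=""):
    """Replay (date, type, path) actions up to `when` and count live paths."""
    alive = set()
    for date, kind, path in sorted(actions):
        if date > when:
            break
        if kind == "D":
            alive.discard(path)
        else:  # A, C, M, ...: the path exists after the action
            alive.add(path)
    return sum(1 for p in alive if p.endswith(suffix))


# Hypothetical actions on one branch; lib/util.py has no recorded add.
actions = [
    (datetime(2004, 9, 20), "A", "README"),
    (datetime(2005, 1, 1), "A", "setup.py"),
    (datetime(2006, 1, 1), "M", "lib/util.py"),
    (datetime(2007, 3, 5), "D", "README"),
]

print(count_files_at(actions, datetime(2006, 6, 1)))         # 3
print(count_files_at(actions, datetime(2006, 6, 1), ".py"))  # 2
print(count_files_at(actions, datetime(2008, 1, 1)))         # 2
```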

@andygrunwald
Contributor

@sduenas @maelick Any updates here?

@linzhp
Contributor

linzhp commented Jun 19, 2015

Regarding @sduenas's comment that "Files need to be related to branches", I would say yes and no.

Yes, file content needs to be related to branches, and with the content table created by the Content extension, this can easily be achieved by relating commit_id and file_id to the actions table.

And no, file id should not be related to branches. If we want to analyze how a specific file evolves over time, we would like to see its history starting with an 'A' action, not an 'M' action. Having several file_ids for the same file in different branches essentially makes it impossible to perform the change analysis on branches other than the "master" branch (assuming "master" is the oldest one).
