Feature request: Possibility to store everything in one file instead of one file per folder #22

Closed
AxelPetermann opened this issue Oct 24, 2024 · 28 comments

Comments

@AxelPetermann

I would like the option to store everything in a single ".chkbit" file instead of one file per folder. Why? On Windows I use software that monitors some folders, and I can't tell it to ignore the ".chkbit" files.

@alexkallai

I'd also welcome such a change; it would make this tool really versatile. In its current state I wouldn't use the application, since I don't want to "litter" my whole data structure with all the index files. (It also makes backing up the data even slower, since copying small files is rather slow.)

Though in my opinion the approach should be different:
Instead of JSON, the single file should be an SQLite3 file, which would make all transactions faster on a bigger dataset.

@laktak
Owner

laktak commented Nov 5, 2024

I've updated the FAQ with more details for this question:

  • when you make a backup the index is also backed up, a central index would need to be backed up separately
  • if the index is just a plain file it can't be damaged as easily
  • if it is damaged, only one directory is affected
  • if you split up your files over backups, the relevant index is always included
  • when updating an index, only the index in one directory is affected, reducing the risk of errors
  • also useful, when you move a directory the index moves with it

That's why I did not consider a central index.

@huyz

huyz commented Nov 26, 2024

https://github.com/ambv/bitrot takes the centralized approach but it doesn't seem very actively maintained

laktak added a commit that referenced this issue Nov 26, 2024

@laktak
Owner

laktak commented Nov 26, 2024

I had an idea for a simple solution that does not require a lot of changes to the code.

It's missing some cleanup to remove deleted directories from the index, but you can test the binaries from the prerelease-artifacts in https://github.com/laktak/chkbit/actions/runs/12039051476

  --index-db                use a index database instead of index files

This places a single .chkbitdb in the current directory.

@AxelPetermann
Author

Thanks for still having a look into this.

I did a quick test and got the following error:

EXC .chkbitdb: Binary was compiled with 'CGO_ENABLED=0', go-sqlite3 requires cgo to work. This is a stub

I've tested with the following command:

chkbit.exe "D:\test for chkbit" --update --index-db

@laktak
Owner

laktak commented Nov 27, 2024

I removed the flag from the build process. It should work now, though I can't test on Windows. Please use https://github.com/laktak/chkbit/actions/runs/12052712326

@AxelPetermann
Author

Thanks again, but the exact same error occurs.

$ chkbit.exe --version
github.com/laktak/chkbit
a3af97f8a4d7c1faa927eb5bee21b3586e1d2010

@laktak
Owner

laktak commented Nov 28, 2024

Thanks for your help. I was unfamiliar with cgo, but the problem comes from cross-compiling: Go enables cgo automatically when building for the host OS, which is why I never saw this on Linux. I will look for an alternative to SQLite because it won't allow me to cross-compile.

@gstjee

gstjee commented Nov 29, 2024

  • if it is damaged, only one directory is affected

To keep the benefit above and still get rid of the multiple index files in subdirectories, a central index could consist of 3 index files in the root dir, as below, along with some other suggestions:
index1: current index file
index2: backup of the current index file
index3: last (older) index file (generated whenever the index is updated, as a safety net in case we update the index accidentally)

  • To force the central index we could supply a flag such as --use-central-index.
  • Is the current .chkbit in JSON format? If so, could the content of that file flow vertically instead of horizontally? In a vertical layout it is easier to navigate and to compare two index files.

Some suggested code flow:
To ensure the indexes are not corrupted or altered, during validation the program can:
a. check internally whether index1 and index2 are equal and, if not, issue a warning and stop validating further files.
b. offer a flag, e.g. --index-file, so that the user can force validation against a specific index file. In that case check a is not needed.
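
Check (a) could look roughly like the Python sketch below. This only illustrates the suggestion; it is not anything chkbit does, and the file names index1 and index2 are the hypothetical names used in this comment. The sample file contents are made up.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def indexes_match(primary: Path, backup: Path) -> bool:
    """True when both index copies have identical content."""
    return file_digest(primary) == file_digest(backup)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    index1 = root / "index1"  # hypothetical: current index
    index2 = root / "index2"  # hypothetical: backup of the current index
    index1.write_text('{"a.txt": "hash-1"}')
    index2.write_text('{"a.txt": "hash-1"}')
    same_before = indexes_match(index1, index2)

    # Simulate damage to the backup copy.
    index2.write_text('{"a.txt": "hash-X"}')
    same_after = indexes_match(index1, index2)

print(same_before, same_after)
```

A validator following this suggestion would warn and stop as soon as indexes_match returns False.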

@laktak
Owner

laktak commented Nov 29, 2024

If anyone wants to give it a try, I have a new version that can be tested:

https://github.com/laktak/chkbit/actions/runs/12089414524

  • using the bbolt database; as it is written in pure Go it should run as well on Windows and Mac as on Linux
  • you can specify one target directory per run
  • .indexdb will be placed into the target
  • not yet done: handling of the database file/backup etc.

@laktak
Owner

laktak commented Nov 29, 2024

@gstjee

In regards of central index file we can have 3 index files in root dir like below and some other suggestions

The indexdb will be placed in the directory that is to be checked. There will be backups available and a json export for long term storage.

* and is current .chkbit is JSON format ? and if possible can we have linear vertical flow of content in that file instead of horizontal ? in vertical fashion it is easy to navigate and compare two index files.

This is not really on topic, and I'm not sure what you mean by vertical flow, but if you want to view the file differently you can use jq to extract data.

@gstjee

gstjee commented Nov 29, 2024

If anyone wants to give it a try, I have a new version that can be tested:

https://github.com/laktak/chkbit/actions/runs/12089414524

Checked on Windows 10 for update, validate and append. Seems to work fine.

1. What if the process gets killed by the system or the user while the index is being updated? Will that corrupt the index database file? Does the same question apply to the older index-file implementation? Is there any safety mechanism?
2. With the index database, can we also check individual subdirectories, as we can in the original code?

.... not sure what you mean with vertical flow...

In Notepad++, the entire content of a .chkbit file is shown on one single horizontal line, so it is difficult to edit, check, compare, etc.
image

@laktak
Owner

laktak commented Nov 29, 2024

Thanks for the quick feedback. Good to know that bbolt works on Windows.

1 - The plan is to write to a new database on each run and, once finished, move the old one to a backup and move the new one into place.
2 - yes, just not yet

For Notepad++: you are asking for formatted JSON. You can get that by running jq < .chkbit, see https://github.com/jqlang/jq
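
For readers without jq, the same kind of reformatting can be sketched with Python's standard json module. The sample document is made up for illustration; the real .chkbit layout is not described in this thread.

```python
import json


def pretty(compact_json: str) -> str:
    """Re-serialize JSON with indentation, one key or item per line."""
    return json.dumps(json.loads(compact_json), indent=2)


# Made-up sample, not the actual .chkbit schema.
sample = '{"files":[{"name":"a.txt","hash":"abc"}],"note":"example"}'
formatted = pretty(sample)
print(formatted)
```

The data is unchanged; only the whitespace differs, so the formatted copy is easier to read and diff in an editor.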

laktak added a commit that referenced this issue Dec 2, 2024
@laktak
Owner

laktak commented Dec 2, 2024

Of course this is suddenly more work than I thought ... I made a new iteration for testing

To use a chkbit database you need to initialize it first with chkbit --init-db PATH.
Then you can run chkbit --db on anything below PATH and it will be tracked in the database.

https://github.com/laktak/chkbit/actions/runs/12127789704
https://github.com/laktak/chkbit/actions/runs/12127789704/artifacts/2263878673

@laktak
Owner

laktak commented Dec 18, 2024

I mostly finished the db feature (note, the default filename is now .chkbit-db)

  • updates are written to a new database
  • when chkbit finishes it moves the old db to a backup and moves the new db into place
  • the only thing unfinished is exporting the db to a big json
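
The write-then-swap steps above can be sketched as follows. This is a minimal Python illustration of the general pattern, not chkbit's actual Go code; the ".new" and ".bak" suffixes and the function name are assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path


def replace_with_backup(db: Path, new_content: bytes) -> None:
    """Write a complete new file first, then swap it in and keep the old one."""
    new = db.with_name(db.name + ".new")  # hypothetical suffix
    bak = db.with_name(db.name + ".bak")  # hypothetical suffix
    with open(new, "wb") as f:
        f.write(new_content)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before swapping
    if db.exists():
        os.replace(db, bak)  # old database becomes the backup
    os.replace(new, db)  # new database moves into place


with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / ".chkbit-db"
    replace_with_backup(db, b"run 1")
    replace_with_backup(db, b"run 2")
    current = db.read_bytes()
    backup = db.with_name(db.name + ".bak").read_bytes()

print(current, backup)
```

In this sketch, a process killed while writing leaves the old database untouched. A process killed between the two renames leaves the old data in the ".bak" file and the complete new data in the ".new" file, so a recovery step would still be needed.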

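If you'd like to test:

  • run chkbit --init-db PATH to create the db
  • run chkbit --db -u PATH[s] or in any subdir of PATH use chkbit --db . to use the db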

If you have feedback, now would be a good time.

@gstjee

gstjee commented Dec 18, 2024

  • run chkbit --init-db PATH to create the db
  • run chkbit --db -u PATH[s] or in any subdir of PATH use chkbit --db . to use the db

I think these two steps make things unnecessarily complex.

To let the program know the db path and where we want to operate, we could simply do:
chkbit -u -db primaryPath secondaryPath

If both paths are given, the primary path should be treated as the location of the db and the secondary path as the place where we want to operate.
If only one path is given, the db location and the directory to operate on are the same.

@y653

y653 commented Dec 19, 2024

Maybe the command chkbit --db -u PATH[s] should be enough. If a db does not exist, then it will be initialised. The --db argument indicates that the user wants to use the db instead of separate files, so if a db is there it will be updated, else it will be initialised.

On the other hand, a separate command to initialise the db is not a big deal. It will be done once in a directory tree that we want to be checksummed.

Thank you for the new very useful feature!

@laktak
Owner

laktak commented Dec 19, 2024

@gstjee I actually thought quite a bit about how to make this easy for the user without explaining how it works.

The database needs to keep relative paths to the files. If you specify the database as a parameter the user may think he can just move it without any consequence, however this would break all paths. Making it clear that the database is responsible for all files in its subfolders avoids this. You also can't specify the wrong database file when you have multiple instances since chkbit will find it automatically (you can run operations on any subfolder).

It's also harder to forget to back up the database when it's located in the same path.

Also, when you run an update that specifies just the database folder, I can write to a new database to get rid of obsolete indexes. This makes any maintenance or shrink operations unnecessary.

@y653 by making it a separate command the user is informed where the database file is located.
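
The idea of finding the database automatically from any subfolder and keying files by relative path can be sketched as below. This is an illustrative Python sketch, not chkbit's implementation; the helper names are hypothetical, and ".chkbit-db" is the default filename mentioned earlier in this thread.

```python
import tempfile
from pathlib import Path
from typing import Optional

DB_NAME = ".chkbit-db"  # default filename mentioned earlier in the thread


def find_db_root(start: Path) -> Optional[Path]:
    """Walk from start towards the filesystem root; return the first dir holding DB_NAME."""
    start = start.resolve()
    for directory in (start, *start.parents):
        if (directory / DB_NAME).is_file():
            return directory
    return None


def index_key(root: Path, file: Path) -> str:
    """Key a file by its path relative to the database root."""
    return file.resolve().relative_to(root).as_posix()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / DB_NAME).write_text("{}")
    sub = root / "photos" / "2024"
    sub.mkdir(parents=True)
    (sub / "img.jpg").write_bytes(b"data")

    found = find_db_root(sub)
    key = index_key(found, sub / "img.jpg")

print(found == root, key)
```

Because the key is relative to the directory that holds the database, moving the whole tree keeps every key valid, while moving only the database file would make them point at the wrong places.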

@vredesbyyrd

vredesbyyrd commented Dec 28, 2024

The database needs to keep relative paths to the files. If you specify the database as a parameter the user may think he can just move it without any consequence, however this would break all paths.

I'm a little confused on the expected workflow when using the database. I assumed since the database keeps relative paths, we should include the database in our backup, and then run chkbit --db check $DiR on the backup to verify everything went as expected. A pseudo-code example with rsync and chkbit

SOURCE="/mnt/A/Media"
DEST="/mnt/B"

# update & verify $SOURCE integrity before backup
chkbit --init-db "$SOURCE" # only done once to create db
chkbit --db update "$SOURCE"

# backup $SOURCE (no trailing slash on $SOURCE, so rsync creates $DEST/Media)
rsync -a "$SOURCE" "$DEST"

# post backup, verify $DEST in read-only mode
chkbit --db check /mnt/B/Media

chkbit --db check /mnt/B/Media returns an error:

Using indexdb in /mnt/B/Media
EXC index: no such device
Processed 0 files in readonly mode.
chkbit ran into errors:
index: no such device

Do we need to initialize (--init-db) on the backup as well? I feel like I must be overlooking something obvious, any pointers would be appreciated.

UPDATE:

@laktak I may have found the issue. I tried to run chkbit init-db /mnt/CASE and received the same error: error: no such device. In this instance, /mnt/CASE is a mergerfs mount. Mergerfs is a union filesystem which allows you to pool multiple disks under a single mount point. Mergerfs uses FUSE under the hood.

It's the first time I have had any issues using mergerfs, so it did not come to mind initially. Any thoughts on why mergerfs + chkbit are not playing nicely?

UPDATE 2:

Figured out the issue, although I am not smart enough to fully understand/explain it. My understanding is that mergerfs (or any?) FUSE filesystem will not work with programs that use mmap (I assume BoltDB does) without workarounds. Using said workarounds incurs a not insignificant performance penalty. More info here if anyone is interested.

I may just forgo the db and use the .chkbit files instead.
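
A quick way to probe whether a mount accepts memory mapping is sketched below in Python. It assumes the "no such device" message is the ENODEV error that mmap reports when the underlying filesystem does not support memory mapping. The function name and probe file are hypothetical, and the probe is a general check that does not reproduce what BoltDB does internally.

```python
import errno
import mmap
import os
import tempfile


def can_mmap(path: str) -> bool:
    """Try to memory-map a non-empty file read-only; False if the filesystem refuses."""
    with open(path, "rb") as f:
        try:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ):
                return True
        except OSError as exc:
            if exc.errno == errno.ENODEV:  # reported as "no such device"
                return False
            raise


with tempfile.TemporaryDirectory() as tmp:
    probe = os.path.join(tmp, "probe.bin")
    with open(probe, "wb") as f:
        f.write(b"probe")  # mmap needs a non-empty file
    supported = can_mmap(probe)

print(supported)
```

To test a specific mount, create the probe file inside that mount instead of the system temp directory.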

@laktak
Owner

laktak commented Dec 29, 2024

Thanks for reporting the issue and for the detailed follow up! I'll make sure to improve the error message when opening the DB.

Since the db can be exported to a simple JSON file, I'll think about allowing checks against the plain JSON instead of the db. Or maybe I can come up with a better solution. I'll need to do some performance tests...

@vredesbyyrd

I should add, after more thorough testing with the described workarounds I am not seeing a performance hit. I think there was some funny business going on during my initial tests. And as far as I understand performance penalties can really vary depending on the system, program implementation etc.

Since the db can be exported to a simple json I'll think about allowing to check against the plain json instead of the db. Or maybe I can come up with a better solution. I'll need to do some performance tests...

It would be interesting to see how simple JSON performs vs the db on various systems. I could also see it being a good option for people using the db on FUSE mounts. There may not be a ton of users yet, but chkbit is such a nicely done tool that I could see it getting wide adoption down the road. At least it should!

@laktak
Owner

laktak commented Dec 30, 2024

I wasn't happy with having a database that isn't easily parsable anyways. JSON is better in this case because it's just text and can be read by other tools in the future.

@laktak
Owner

laktak commented Dec 30, 2024

I've pushed an update that switches the database from bolt to a single json. bolt is still used but only as a cache.

If you want to convert the bolt index from the beta to json, get the binary from this branch first (updated) and run:

chkbit export-db .
rm .chkbit-db
mv .chkbit-db.json .chkbit-db

Then switch to the latest binary.

@vredesbyyrd please let me know how this performs, thanks.

@vredesbyyrd

Very cool.

I followed your instructions to convert the BoltDB to json with the binary from this link, then switched to the latest binary.

It appears that upon running chkbit --db check /foo/bar or chkbit --db update /foo/bar, chkbit does not see items already in the index and adds them as new, which results in duplicate items. If there is any further testing/info you want from me regarding this let me know.

I deleted the old index and re-initialized, chkbit --db init /foo/bar, and ran an update followed by a check. Everything went smoothly. I cannot comment on speed/performance yet, other than the fact that the index being JSON will be a bonus on any fuse filesystems.

@laktak
Owner

laktak commented Dec 31, 2024

@vredesbyyrd sorry, fixed, please use this to convert

@laktak
Owner

laktak commented Dec 31, 2024

I've pushed an update that removes the --db flag.

Instead, you now specify the mode when running chkbit init, and chkbit will automatically detect which one to use:

Usage: chkbit init <mode> <path> [flags]

initialize a new index at the given path

Arguments:
  <mode>    the mode defines if you wish to store one index per directory
            (split) or a single index in the given path (atom)
  <path>    directory for the store root

@vredesbyyrd

All works great. And I prefer the new way to define which mode to use.

@laktak
Owner

laktak commented Jan 1, 2025

just released v6 :)

@laktak laktak closed this as completed Jan 1, 2025

7 participants