For a quick start, take a look at QUICKSTART or the German SCHNELLSTART.
DedupFS is a file system for storing many backups of large collections of files - for example your photo collection. The stored files are deduplicated, meaning: If you store the same files multiple times, the storage (almost) doesn't grow! For example, today you can store a backup of all your documents in DedupFS in the /documents/2022.12.30 directory. If next week you store a backup of all your documents in DedupFS again, this time in the /documents/2023.01.06 directory, this needs almost no additional space on the drive where your DedupFS data folder is located. So, in general, you can regard DedupFS as a backup storage drive where you can store considerably more files than on an ordinary drive. And if you like to script your backups, DedupFS comes with nice utilities for that, too.
Technically speaking, DedupFS is a file system with transparent file content deduplication: If you store multiple files containing exactly the same sequence of bytes, DedupFS stores the contents only once and references them multiple times. Of course, you can still update the contents of any file without impact on the contents of other files.
- Status Of The DedupFS Software
- What DedupFS Is Best For
- Why Is DedupFS Better Than ...
- What DedupFS Should Not Be Used For
- Caveats
- System Requirements
- Basic Steps To Use DedupFS
- How To ...
- How I Do It
- Upgrading And Version History
- Storage Format
- License
DedupFS is provided "as is", without any warranty. That being said, I have been using DedupFS since 2018 for backing up my private files. At the end of 2022, my backup repository contained 4.5 million files/directories with 630,000 stored file contents comprising 1.7 TB of data. The largest file stored has a size of about 7.5 GB.
Summary: For my personal use it is mature. Decide for yourself.
The DedupFS dedup file system is good for keeping a backup archive of your files. Its main advantage is that you don't have to worry about storing the same file multiple times, because additional copies of a file need very little space.
For example, each week you can copy "everything" into the dedup file system, into a directory that is named by the current date. That way, you have a nice backup and can access earlier versions of your files by date.
Whether DedupFS is better than any other backup software depends mostly on how you like it and how you use it. Here are the main reasons why I like DedupFS better than all alternatives I have found so far:
- DedupFS is fast at writing and reading backups (at least for personal requirements).
- DedupFS is lightweight, meaning it's easy to install and to run, and it needs little RAM compared to other deduplication software.
- DedupFS uses a simple storage format, so you know that if something goes horribly wrong there is still a good chance to retrieve most of the stored data.
- "Delete" in DedupFS is a two-step process, so if you accidentally deleted important files from your backups, they are not lost until you explicitly run the "reclaim space" utility.
- DedupFS automatically creates and keeps backups of the file tree and metadata database, so if necessary you can restore the dedup file system to earlier states.
- DedupFS is designed to make it fast and easy to keep a second offline copy of your backup repository up-to-date, even if the repository is terabytes in size.
- DedupFS is open source, see its GitHub repository. It consists of ~2000 lines of production code.
Don't use DedupFS as your everyday file system. It is not fully POSIX compatible. Locking a file, for example, will probably not work at all. If you change file contents often, that leads to a large number of unused data entries that eat up space unless you use the "reclaim space" utility.
Don't use DedupFS for security critical things. One reason for that: DedupFS uses MD5 hashes to find duplicate content, and there is no safeguard implemented against hash collisions. Note that this is not a problem when you store backups of your holiday photos...
- DedupFS only supports regular directories and regular files. It does not support soft or hard links or sparse files. Support for soft links is planned for future versions.
- Deleting files in DedupFS is a two-step process. Don't expect that the repository size shrinks if you delete files. Even if you run the "reclaim space" utility, the repository size will not shrink. Instead, it will not grow further for some time if you store new files.
- DedupFS uses MD5 hashes to find duplicate content, and there is no safeguard implemented against hash collisions.
- Since DedupFS has been used less on Linux, there might be additional issues there.
- To support a special operation mode, DedupFS replaces the missing bytes more or less silently with zero values if data files go missing.
- On Linux (e.g. WSL/Debian), CTRL-C might not correctly unmount the file system. The cause might be /etc/mtab not being available. In this case, the following might help: sudo ln -s /proc/self/mounts /etc/mtab
DedupFS needs a Java 21 runtime. The application comes bundled with a suitable Java runtime for Windows x64 and Linux x64.
DedupFS needs disk space for its repository. If you back up lots of data, it will need lots of space. Keep an eye on the available disk space when using DedupFS.
DedupFS runs fine with approximately 128 MB RAM assigned to its process. See below for details.
Tested on Windows 10 64-bit.
Download and install a WinFsp release to make fuse (Filesystem in Userspace) available. For DedupFS, only the WinFsp "Core" package is needed. I use WinFsp 2023 a.k.a. winfsp-2.0.23075 for running DedupFS. Until October 2023, I used WinFsp 2021 a.k.a. winfsp-1.9.21096, which also worked fine for me.
Tested on Ubuntu 64-bit.
DedupFS needs fuse (Filesystem in Userspace) to create a filesystem in userspace. If DedupFS doesn't find it, install it using sudo apt install libfuse2. fuse3 is not yet supported.
The following are the basic steps needed to use DedupFS. For details, see the How To ... section of this document.
- Make sure the system requirements are fulfilled.
- Initialize the DedupFS data repository, for example on an external drive.
- Mount the file system, then use it to store backups of your files.
- Stop the file system.
- If required, update the copy of the DedupFS repository that is stored on a different external drive.
- If required, reclaim space by trashing files that have been marked deleted in the dedup file system.
The dedup file system stores all its data in a repository directory, inside the subdirectories fsdb and data. Before the dedup file system can be used, the repository needs to be initialized:
- Create a repository directory for the dedup file system data, for example on an external backup drive.
- Unpack the DedupFS archive to that repository directory. The repository directory now should contain this:
[directory] dedupfs-[version]
[ file ] QUICKSTART.html
[ file ] README.html
[ file ] SCHNELLSTART.html
- Now, the DedupFS utility scripts like repo-init and dedupfs should be located in the dedupfs-[version] directory in the repository directory.
- Start the DedupFS repo-init utility in the dedupfs-[version] directory, for example by double-clicking.
- Check the log output printed to the console where repo-init is executed.
- If successful, this command creates the database directory fsdb in the repository directory and the log files directory logs in the dedupfs-[version] directory.
Note:
- By default, repo-init and all other DedupFS utilities regard the parent of the current working directory as the DedupFS repository directory. If you run the script from the command line, you can add a repo=<target directory> parameter in order to point the utilities to a different repository directory, as shown in the example below. DedupFS always creates its logs directory in the directory containing the DedupFS utility scripts.
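For example (the repository path is only a placeholder, adjust it to your setup; on Windows the script may carry a .bat extension), pointing repo-init to a specific repository directory from the command line could look like this:
repo-init repo=D:\backup\dedup-repo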
If you want to write, update, or read files in the dedup file system, you have to "mount" it first. Note that the dedup file system must be initialized before you can mount it, see above. Here are the steps to mount the dedup file system:
- If you have installed DedupFS in the repository directory as recommended, start the dedup file system by running gui-dedupfs in the dedupfs directory, for example by double-clicking.
- After some time, the DedupFS GUI will open, showing log entries.
- Some time later, a log entry will tell you that the dedup file system is started: Mounting the dedup file system now...
- In the log entries you see, among others, which repository directory is used and where the dedup file system is mounted.
Notes:
- The default mount point on Windows is J:\, on Linux /mnt/dedupfs. To mount the file system somewhere else, call the script with a mount=<mount point> parameter (the mount= part of the parameter can be omitted).
- On Windows, mount the dedup file system to a file system root like J:\ or to a directory like C:\myFiles\dedupfs, where C:\myFiles must be an existing directory and C:\myFiles\dedupfs must not exist yet.
- On Linux, mount the dedup file system to an existing empty writable directory.
- Don't mount more than one dedup file system if you can avoid it. If you cannot avoid it, make sure the dedup file systems have different mount=<mount point> mount points configured.
- gui-dedupfs creates a database backup before mounting the file system, so you can restore the previous state of the file system if something goes wrong. To suppress database backup creation, call the script with the dbBackup=false parameter.
- By default, gui-dedupfs uses the parent of the current working directory as DedupFS repository. If you run the script from the command line, you can add a repo=<target directory> parameter in order to use a different repository directory.
- For additional options, see the paragraphs below and the example following this list.
If you want to mount the dedup file system without a GUI, run dedupfs. This behaves like gui-dedupfs except that it does not start a GUI, so see above for more details on how dedupfs works. Note that to stop the file system you may have to press CTRL-C twice in the dedupfs console.
If you want to mount the dedup file system read-only, use the gui-readonly or readonly utility. These utilities work the same way the write-enabled utilities do, except that they don't create a database backup before starting.
Why mount read-only? This can be handy if for example you want to look up files in your backups while making sure that you cannot accidentally add, change or delete files in the backup.
The stats utility allows you to read basic file system statistics. Like the other utilities, it accepts the optional repo=<target directory> parameter.
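For example (the repository path is a placeholder), reading the statistics of a repository on an external drive might look like this:
stats repo=E:\backup\dedup-repo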
The following rules apply for command parameters of the DedupFS utilities:
- The keys of <key>=<value> parameters are evaluated case-insensitively.
- A backslash '\' before an equals sign '=' is interpreted as an escape character (possibly preventing interpretation of the parameter as a <key>=<value> pair) and is removed before the actual parameter is evaluated.
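As a sketch of the escape rule (the file name is made up): to pass a find pattern that itself contains an equals sign without having it interpreted as a <key>=<value> pair, escape the equals sign:
fsc find report\=final.txt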
A number of command line utilities for the dedup file system are available through the fsc command. Like the other utilities, fsc accepts the optional repo=<target directory> parameter. fsc is not meant to be run without additional parameters.
Use fsc backup <source> <target> [reference] to copy files and directories to the dedup file system without having to mount it first.
Syntax:
fsc [repo=<repository directory>] [dbBackup=true] backup <source> [<source2> [<source...N>]] <target> [reference=<reference>] [forceReference=true]
Example:
fsc backup /docs /notes/* /backup/?[yyyy]/![yyyy.MM.dd_HH.mm] reference=/backup/????/????.??.??*
On Windows:
fsc backup "C:/my files/documents" "C:/my files/notes/*" /backup/?[yyyy]/![yyyy.MM.dd_HH.mm] reference=/backup/????/????.??.??*
The source parameters:
In each source parameter's last path element, the wildcards "?" and "*" are resolved to a list of matching files / directories. The resolved sources must be readable files or directories on your computer.
The target parameter:
The target parameter specifies the DedupFS directory to copy the source files and directories to. Only the forward slash "/" is interpreted as path separator. The backslash "\" is used as escape character.
In the target parameter's path elements, everything enclosed by square brackets [...] is interpreted as a java.text.SimpleDateFormat pattern for formatting the current date/time, unless the opening square bracket is escaped with a backslash "\".
If a target path element starts with the question mark "?", the question mark is removed and the corresponding target directory and its children are created if missing.
If a target path element starts with the exclamation mark "!", the exclamation mark is removed. It is ensured that the corresponding target directory does not exist, then it and its children are created. The exclamation mark can be escaped with a backslash "\".
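To illustrate these rules with concrete values (the date and time are made up): if the backup runs on 2023-01-06 at 14:30, the target /backup/?[yyyy]/![yyyy.MM.dd_HH.mm] resolves to /backup/2023/2023.01.06_14.30; the directory /backup/2023 is created if it is missing, while /backup/2023/2023.01.06_14.30 must not exist yet and is then created.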
Note that on Linux / Bash, ! and \ are special characters. To use them, escape them with a backslash \.
The reference and forceReference parameters:
If a directory containing many and / or large files has been stored in DedupFS before and most files have not been changed in the meantime, creating another backup can be significantly accelerated by using the reference parameter. This parameter tells the backup utility to first compare the file size and time stamp of each file to store with those of the reference file stored previously. If they are found to be the same, the backup tool creates a copy of the reference file in the target location instead of copying the source contents there. Note that when using a reference, if a file's contents have changed but its size and time stamp have not, the changed contents are not stored in the backup.
Example:
- Create a first backup of docs: fsc backup /docs /backup/?[yyyy]/![yyyy.MM.dd_HH.mm]/
- Provide a reference for subsequent backups of docs: fsc backup /docs /backup/?[yyyy]/![yyyy.MM.dd_HH.mm]/ reference=/backup/????/????.??.??_*
When a reference is provided, the backup utility looks for the reference directory in the DedupFS, resolving '*' and '?' wildcards with the alphabetically last match. Then the backup utility checks whether the source and the reference directory "look similar". This is to reduce the probability of accidentally specifying an incorrect reference. Use the forceReference=true parameter to skip this check.
Excluding files / directories from the backup:
To exclude files or directories from the backup, proceed as follows:
- Either put an empty file .backupignore into a source directory that should be ignored.
- Or put into a source directory a text file .backupignore containing 'ignore' rules, one per line. These rules define which entries the backup utility will ignore.
  - Lines are trimmed; empty lines and lines starting with "#" are ignored.
  - * is the "anything" wildcard.
  - ? is the "any single character" wildcard.
  - Rules for directories end with "/", rules for files don't.
  - The rules are matched against the relative source path.
Example for a .backupignore file:
# Do not store the .backupignore itself in the backup.
.backupignore
# Do not store the 'temp' subdirectory in the backup.
temp/
# In any subdirectories named 'log*', do not store
# files in the backup that are named '*.log'.
log*/*.log
Use fsc db-backup to create a database backup. The database backup is created in the repository's fsdb directory.
Use fsc db-restore to restore a previous version of the DedupFS database, thus effectively resetting the dedup file system to an earlier state. This command overwrites the current database without further confirmation.
If run without an additional [file name] parameter, it restores the database to the way it was before the last write operation was started. Provide a [file name] parameter to point the utility to an earlier database backup zip file located in the fsdb subdirectory of the repository.
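For example (the file name is a placeholder; use the name of a backup zip that actually exists in your fsdb subdirectory):
fsc db-restore dedupfs_[timestamp].zip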
Use fsc db-compact to compact the database file.
Use fsc find <pattern> to find files matching the name or path fragment pattern provided. You can search for file names like java.exe, directory names like bin, or for path fragments like jre/bin/java. The patterns support '*' as wildcard for any number of characters and '?' as wildcard for a single character. The find behavior of patterns containing '%' or '_' is not fully defined and may yield unexpected results.
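For example, using the path fragment and wildcards mentioned above, a search might look like this:
fsc find jre/bin/java*
which would match, among others, an entry like jre/bin/java.exe.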
Use fsc list <path> to list the contents of the directory denoted by <path>.
Use fsc del <path> to delete the file or recursively delete the directory denoted by <path>. This utility does not create a database backup. If required, use fsc db-backup before using fsc del <path>.
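For example (the path is a placeholder), to create a database backup and then recursively delete an old backup directory:
fsc db-backup
fsc del /backup/2020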
The DedupFS utilities come with reasonable default memory settings. You can change these by editing the utility scripts. Let's start with some rules of thumb:
- It does not hurt to assign plenty of RAM to the DedupFS utilities, as long as the operating system and other software running on the same computer still have enough free RAM.
- repo-init does not need more than ~64 MB RAM.
- gui-dedupfs and dedupfs need at least ~96 MB RAM for good operation. When storing large files or using a slow storage device, additional RAM improves performance.
- db-restore might need more than 64 MB RAM, depending on the database size.
- gui-readonly and readonly work fine with ~80 MB RAM. Assigning more will not improve performance.
- The reclaim-space utility needs about ((number of data entries) * 64 B + 64 MB) RAM.
To change the RAM assignment of a utility, open it in a text editor. After the $JAVA or %JAVA% call, change the -Xmx maximum heap memory setting.
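For example, if the line after the %JAVA% or $JAVA call contains -Xmx96m (the exact value differs between utilities and versions), changing it to -Xmx512m assigns 512 MB of maximum heap memory to that utility.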
When large files are written to the dedup file system and DedupFS cannot cache them in memory, it caches them in a "temp" directory. By default, it uses a subdirectory of the "temp" directory configured for the user in the operating system.
To get maximum write performance, make sure the temp directory is on a fast (SSD) drive and is not on the same physical drive as either the repository or the source from which you copy the files. The write utilities accept a temp=<temp directory> option, so you can override the default.
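For example (the path is a placeholder), to put the write cache on a fast local SSD while the repository is on an external drive:
gui-dedupfs temp=C:\fast-ssd\dedupfs-temp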
The dedup file system can be operated in a "copy when moving" state. In this state, whenever a command is issued to move a file (not a directory) within the dedup file system from directory A to directory B, the file is not moved but copied. What the heck?
Moving files within the dedup file system is very fast. Copying files is much slower, because the operating system reads the file contents and writes them to the new file. In the "copy when moving" state, copy-moving files is just as fast as moving files.
This can be used to first create in the dedup file system a copy of an existing backup (fast) and then only to update those files that have changed (fast if few files have changed).
To enable the "copy when moving" state, either tick the checkbox in the GUI or specify the copyWhenMoving=true
option when starting the dedup file system.
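For example, to start the file system without a GUI and with this state enabled:
dedupfs copyWhenMoving=true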
DedupFS writes log files that contain all log entries visible on the console and additionally DEBUG level log entries. DedupFS always creates its logs directory in the directory containing the DedupFS utility scripts.
The db-backup and db-restore utilities are convenience entry points for the fsc db-backup and fsc db-restore commands. Unlike the plain commands, these utilities ask for confirmation before execution.
The blacklist utility is for blacklisting files that should be removed from the dedup file system if they are currently stored and that should not be stored even when added to the file system later. Reading a blacklisted file yields [file length] zeros. In addition to the usual repo=<target directory> parameter, the utility accepts the following parameters (see the example after this list):
- dbBackup=false (optional, default true): Create a database backup before starting the blacklisting process.
- blacklistDir=<directory name> (optional, default blacklist): If the subdirectory of repo with this name contains files, those are added to the blacklist in the repository, in a directory named with the current timestamp.
- deleteFiles=false (optional, default true): If true, the files in blacklistDir are deleted once they have been added to the blacklist in the repository.
- dfsBlacklist=<directory name> (optional, default blacklist): Name of the base blacklist directory in the dedup file system, resolved against root.
- deleteCopies=true (optional, default false): If true, mark deleted all blacklisted occurrences except for the original entries in the dfsBlacklist directory.
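For example (the repository path is a placeholder; the other values come from the list above), a blacklisting run that also marks already stored copies of the blacklisted files as deleted might look like this:
blacklist repo=D:\backup\dedup-repo deleteCopies=true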
After running the blacklist utility, as long as you do not store new files in the dedup file system, you can still restore previous file system states by restoring the database from backups. Once new files are stored, restoring a database backup from before the blacklisting process will result in partial data corruption.
When you delete a file in the dedup file system, internally the file is marked as "deleted" and nothing more. This means that the dedup file system will not free that file's storage space, and that you can make the file available again by restoring a previous state of the database from backup.
If you want to re-use the space deleted files take up, run the reclaim-space utility. Note that this will not shrink the repository size. Instead, the repository size will not increase for some time if you store new files.
The reclaim-space utility purges deleted and orphan entries from the database. After running it, as long as you do not store new files in the dedup file system, you can still restore previous file system states by restoring the database from backups. Once new files are stored, restoring a database backup from before the reclaim process will result in partial data corruption.
In addition to the usual repo=<target directory> parameter, the reclaim-space utility accepts an optional keepDays=[number] parameter (the keepDays= part can be omitted) that can be used to specify that recently deleted files should not be reclaimed. Without this parameter, all deleted files are reclaimed.
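For example, to reclaim only the space of files deleted more than 30 days ago (the number of days is just an illustration):
reclaim-space keepDays=30
or, since the keepDays= part can be omitted:
reclaim-space 30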
When the reclaim process is finished, the reclaim-space utility compacts the database file, then exits.
In the logs subdirectory of the installation directory, up to 1 GB of log files are stored. They are useful for getting insights into how DedupFS was used. You can delete them if you don't need them.
In the fsdb subdirectory of the repository directory, DedupFS stores database backup zip files that can be used to reset the dedup file system to an earlier state. You can delete older database backup zip files if you don't need them.
This is an advanced usage pattern. Only use if you understand what you are doing!
If you have the repository on a removable device, you can create a copy of it on a local device without the data files in the data directory. This "shallow" copy of the file system allows you to browse the directories and files, but it will show '0' bytes instead of the actual content when you open files.
If you
- create a "database only" copy of the file system,
- copy (at least) the last active data file to the right location in the data directory,
- be careful: If you ran reclaim-space some time ago, the last active data file might not be the last data file. In case of doubt, use the stats utility to check the data storage size,
- only use the shallow copy of the file system for some time (and not the original),
- don't use the reclaim-space utility on the shallow copy of the file system,
then you can merge the shallow repository back to the original file system using standard file sync tools.
Hopefully, you will never see an EnsureFailed exception in the logs. However, if you are reproducibly blocked by an EnsureFailed exception, you can tell the dedup file system not to stop processing when the offending condition occurs. (Do this at your own risk!) For this, add -Dsuppress.[marker] to the java options in the script, where [marker] is the marker string prepended to the actual exception message, e.g. -Dsuppress.cache.keep or -Dsuppress.utility.restore.
For demonstration, investigation or debugging it might be interesting to examine the DedupFS database while the DedupFS file system is mounted. For this, run an H2 TCP server locally, e.g.
java -cp "h2-2.1.214.jar" org.h2.tools.Server -tcp -tcpPort 9876
and add -DH2.TcpPort=<TCP port> to the java options in the script. When running in this mode, the JDBC connection information used is logged on WARN level at startup.
- At home, I use a laptop computer with a 1 TB SSD. My private collection of photos, videos, audio and document files is larger than 1 TB. Duh.
- Although I try to avoid it, many photos and videos are stored in two places in my collection. This also bloats backups. Urg.
- The laptop actually is our family computer. Sometimes, somebody accidentally deletes or overwrites files we would have liked to keep, or moves them to a place where we will never find them again. Gah.
- One backup isn't enough for me to feel good; I like to keep two backups, and the second backup should be synchronized with the first, at least from time to time. Acg.
- Large media files that I don't need regularly only reside in the backups.
- DedupFS takes care of deduplicating duplicate files.
- "I know that in our photos of the 2010 summer holidays, we had this lovely picture of..." - Let's look in the backup I stored two years ago.
- For synchronizing the second backup, standard file copy tools are enough. I only need to copy those parts of the repository that have changed, and that is easy.
I have two large external storage drives, drive A and drive B.
Every few weeks,
- I store "everything" from the family laptop to a new directory in the dedup file system residing on drive A. This new directory I name by current date, for example "backups_laptop/2020-10-21".
- To speed things up, before actually creating the new backup, I create a duplicate of the previous backup in a directory with the current date, with the copywhenmoving=true option enabled. This is fast. After this, I use a tree sync software that mirrors the source to the target, overwriting only those files that have changed (changed size or last modified time) and deleting files in the target that are not present in the source. If few things have changed, this is also fast.
Every few months,
- I use a tree sync software that mirrors the source to the target to update drive B with the new contents of drive A. If few things have changed, this is pretty fast.
Additionally,
- in the dedup file system there is a "current" directory where I keep (mostly media) files that are just too big for my laptop and that I need only occasionally.
For maximum safety,
- TODO shallow copy of the repository that is auto-synchronized to a cloud with the last increments, so I can back up my files even more often without having to go fetch the external drive each time.
To upgrade a DedupFS installation to a newer version:
- Unpack the new DedupFS apps archive to the DedupFS repository directory, next to the existing DedupFS apps installation, possibly overwriting existing *.html documentation files in the installation directory. The repository directory now should look like this:
[directory] data
[directory] dedupfs-[old version]
[directory] dedupfs-[new version]
[directory] fsdb
[ file ] QUICKSTART.html
[ file ] README.html
[ file ] SCHNELLSTART.html
- Follow any release specific upgrade instructions (see below).
- Check whether everything works as expected. If yes, you can delete the old dedupfs-<version> installation directory.
- Ensure the backup script can be gracefully interrupted.
- In some way give access to deleted files and directories.
- Development: Try out scoverage instead of jacoco (a spike 2022.10 didn't work well).
- Support for soft links.
- Optionally store packed (gz or similar).
- The reclaim utility finds & cleans up data entry duplicates.
- Replace ! and \ by something else for fsc backup because they are special characters on Linux / Bash.
- Support for fuse3 on Linux systems.
- On Linux, don't hang if the mount point exists but is not writable or owned by mounting user. Possibly check for directory write permission first?
- Only log stack traces for unexpected exceptions.
- Backup tool: Ensure that the last file listed as "stored" is actually stored when interrupting.
- Upgrade H2 database from 2.1.212 to 2.3.232.
- Add database migration script.
- Fixed: stats utility on Linux didn't handle failures correctly.
Migration from 5.x to 6.x:
- Unpack the new version of DedupFS apps in the repository directory next to the old version.
- Run the db-migrate[.bat] utility to upgrade the database for use with DedupFS 6.x.
- When the database upgrade has finished successfully, delete the directory containing the old DedupFS file system apps (but not the data/ and fsdb/ directories).
Migration from earlier versions to 6.x:
Do a step-by-step migration. See below for details on how to upgrade to DedupFS 5.x.
- Read-only apps do not write to logfile.
- Fixed: Linux apps reported incompatible Java version.
- Fixed: Bad Java version compatibility error message on Windows.
- Improve error messages and failure handling when apps terminate with failure code.
- Add explaining text with confirmation to the reclaim-space utility.
- Update the jnr-fuse library to version 0.5.8.
- Update Scala to 3.3.3 and Java to 21.0.4+7 together with SBT and some dependencies.
- Documentation updates.
- fsc find syntax changed, now allows searching for path fragments, not only for single entry names.
- Mark database backups created by blacklist because restoring from them can be dangerous.
- Fixed handling of illegal file names in blacklist.
- Updated Scala to 3.3.1 and Java to 21+35 together with SBT and some dependencies.
- Tested operations with WinFSP 2023 which is now the recommended version.
- fsc backup doesn't log 'interrupted' when terminating normally.
- In read-only mode, don't look for free areas during startup - it's not needed and might require much memory and/or time.
- In fsc backup, check whether a file reference is neither file nor directory to avoid a possible exception.
- In fsc backup, handle exceptions when reading files to back up.
- Updated Scala to 3.2.1 and Java to 17.0.7 together with SBT and some dependencies.
- The fsc backup command is mature enough now.
- Faster file system start: No waiting for the zip database backup because that is running in parallel.
- Updated Java to 17.0.6_10, some libraries, and SBT.
- Added QUICKSTART.html and German SCHNELLSTART.html documentation.
- Rewritten experimental fsc backup command, changed functionality.
- Rewritten write cache handling for improved maintainability.
- Added option to run with an H2 TCP server.
- Separate Windows and Linux packages, with a minified JRE, to reduce package size from about 100MB to less than 50 MB.
- Experimental fsc backup command for backing up directories without needing to mount the dedup file system.
- Added explicit memory settings to all start scripts.
- Unified and cleaner logging and console output; showing log level WARN and ERROR on console for the utilities.
- Performance improvement: Start the file system or run the write command while in parallel creating script database backup from the plain backup.
- Updated Scala to 3.2.1, Java to 17.0.5_8, libraries, SBT and sbt-updates plugin.
- Full database compaction only when running reclaim-space or fsc db-compact, for other write commands compact the database for at most 2 seconds.
- Added fsc db-compact command.
- When running db-backup, print to console the corresponding restore command.
- Added db-backup utility script.
- Introduced fsc db-restore command.
- Renamed fsc backup to fsc db-backup.
- Fixed a failing test for the blacklisting code that probably is a bug, although the exact faulty behavior has not been researched.
- Check for children before marking entries deleted. Mainly prevents replacing non-empty directories.
- Fixed bug that mounting the file system read-only failed.
- Updated Scala to version 3.2 and Java to 17.0.4.1_1, updated many libraries.
- Updated H2 database from 2.1.212 to 2.1.214 - binary compatible, no migration necessary.
- Upgrade H2 database from 1.4.200 to 2.1.212.
Migration from 4.x to 5.x:
- Direct migration of repositories from versions prior to 4.x is not supported.
- Use the 4.x dedupfs software to run the command fsc backup in order to make sure a current database backup exists. Note that in later releases, this utility is renamed fsc db-backup.
- Unpack the 5.x dedupfs software into the repository.
- Use the 5.x db-restore utility like this: db-restore from=dedupfs_[timestamp].zip to point the utility to the database backup zip file created above. Look for the exact timestamp to use in the fsdb subdirectory of the repository.
- The migration is complete. Don't use dedupfs versions previous to 5.0.0 anymore with the repository.
- Eventually, manually delete the final version of the 4.x database, that is, the files dedupfs.mv.db and dedupfs.mv.db.backup in the fsdb subdirectory of the repository.
- New blacklist utility for blacklisting files that should not be stored at all.
- Simplified reclaim space process: After running the reclaim-space utility, freed up space is automatically used when storing new files.
- Compact the database when unmounting the dedup file system and after blacklist and reclaim-space. (git 548f1803)
- db-restore, mount and reclaim-space accept an unnamed parameter.
- Updated Java 11 to Java 17.
- Updated dedupfs database version from 2 to 3 to prepare upgrading H2 database to 2.0.202, which will come with dedupfs 5.
- Added fsc utilities to create a database backup or to list, find, or delete files without mounting the repository.
Migration from 3.x to 4.x:
- Start dedupfs 4.x with repository write access once to migrate the repository database.
- After migration, dedupfs versions prior to 4.x can't be used anymore with the repository.
- To fall back to an earlier software version, restore the repository database from a backup created before upgrading to 4.x.
- On Windows, fixed dedupfs problems caused by missing executable flag. (git 1c69bf58)
Named "3" because this release is a Scala 3 re-write of dedupfs.
- On Windows, named the dedup file system volume "DedupFS". (git 46a076d)
- Fixed occasional out-of-memory or deadlock condition (rewrite of parallel handling).
- Restructured installation directory, ".." instead of "." is the default repository path.
- Fixed deadlock when moving files in the dedup file system along with the dirty corner cases when moving files. (git 7e6d858)
Known problems:
- Occasional deadlock and out-of-memory conditions.
- New stats utility. (git a11229b)
- In read-only mode, log no error when trying to update the last modified time of a file. (git 1bbdcc7)
- Support running the dedup file system with data files missing (for the "Shallow Copy" feature). (git 013fc7d)
- On Windows, the utilities support space characters in the app path. (git 1632819)
- Fixed that the background process could go on for a long time.
Known problems:
- Deadlock for example when moving files in the dedup file system.
- Dirty corner cases when moving a file that currently is just written to.
First public release.
Known problems:
- If the drive the temp directory resides on is faster than the target drive, then when copying many files, the background write-to-target process may still take a long time after the foreground copy-to-DedupFS process has finished.
- Dirty corner cases when moving a file that currently is just written to.
DedupFS stores the tree and file metadata in an H2 SQL database. Use the H2 JDBC driver lib/h2-<version>.jar in case you want to inspect the database contents. The database schema is approximately:
CREATE TABLE TreeEntries (
id BIGINT PRIMARY KEY,
parentId BIGINT NOT NULL,
name VARCHAR(255) NOT NULL,
time BIGINT NOT NULL,
deleted BIGINT NOT NULL DEFAULT 0,
dataId BIGINT DEFAULT NULL
);
CREATE TABLE DataEntries (
id BIGINT PRIMARY KEY,
seq INTEGER NOT NULL,
length BIGINT NULL,
start BIGINT NOT NULL,
stop BIGINT NOT NULL,
hash BINARY NULL
);
TreeEntries: The tree root entry has id 0. The tree structure is defined by providing the parentId for each node. time is the last modified unix timestamp in milliseconds. deleted is set to the deletion unix timestamp in milliseconds; for existing (= not deleted) files it is 0. dataId is null for directories. For files, it is a reference to DataEntries.id. If no matching DataEntries.id exists, the file size is zero.
DataEntries: Each sequence of bytes that can be referenced as a file content is assigned a unique id. The bytes are stored in 1 to n non-contiguous parts, where seq of the first part is 1 and so on. start and stop define the storage position of the part, where stop points to the position after the respective part. length is the combined length of all parts, hash the MD5 hash of the full sequence, that is, of all parts concatenated. length and hash are only set for the first part of each data entry, for subsequent parts they are null.
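As an illustration of how the two tables work together (a sketch assuming the schema above), the following query lists the non-deleted children of the root directory together with the stored length of their contents:
SELECT t.name, d.length
FROM TreeEntries t
LEFT JOIN DataEntries d ON d.id = t.dataId AND d.seq = 1
WHERE t.parentId = 0 AND t.id <> 0 AND t.deleted = 0;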
DedupFS stores the file contents in the data subdirectory of the repository. The data is distributed to files of 100,000,000 bytes each. The file names of the data files denote the position of the first byte stored in the respective file. These positions are referenced by DataEntries.start and DataEntries.stop, see above.
MIT License, see file LICENSE.