Script which converts the fdupes output into a csv table with a fixed number of columns #26

Open: wants to merge 2 commits into master
106 changes: 106 additions & 0 deletions fdupes2table.py
@@ -0,0 +1,106 @@
#! /usr/bin/python3
#
# Author : Jérôme Bouat jerome<dot>bouat<at>laposte<dot>net
#
# This script transforms the output of fdupes
# (run with the '--recurse --size' options)
# into a CSV table with duplicates sorted by decreasing size.
#
# A spreadsheet application can then filter and sort on the columns:
# - size of each file
# - occurrences of the duplicate file
# - possible saving with the duplicate file
# - duplicate number
# - path
#
# Examples of command combinations:
#----------
# fdupes --recurse --size . | fdupes2table.py > duplicates.csv
#----------
# fdupes --recurse --size . > duplicates.txt
# cat duplicates.txt | fdupes2table.py > duplicates.csv
#----------
#
# Example of input, as produced by "fdupes --recurse --size ." :
#----------------------
# 16265216 byte(null)each:
# ./titi.DOC
# ./toto.DOC
#
# 5527 byte(null)each:
# ./titi.gif
# ./toto.gif
#
# 560149 byte(null)each:
# ./titi.pdf
# ./toto.pdf
#----------------------
#
# Example of this script's output for the previous input:
#----------------------
# max possible saving:;16 MB
#
# size (kB);occurrences;possible saving (kB);duplicate;path
#
# 15884;2;15884;1;titi.DOC
# 15884;2;15884;1;toto.DOC
#
# 547;2;547;2;titi.pdf
# 547;2;547;2;toto.pdf
#
# 5;2;5;3;titi.gif
# 5;2;5;3;toto.gif
#----------------------
#

import sys, re

# fdupes prefixes each path with "./" when run on the current directory;
# strip that prefix so the CSV shows bare relative paths.
beginPath = re.compile(r'^\./')

sizes = {}      # size in bytes -> duplicate group numbers having that size
paths = {}      # duplicate group number -> file paths in that group
dupNumber = 1
getSize = True

# Parse the fdupes output: each group starts with a size line, followed by
# one path per line; a blank line ends the group.
line = sys.stdin.readline()
while len(line) > 0:
    if getSize:
        size = line.split()[0]
        if size.isdigit():
            size = int(size)
            if not (size in sizes):
                sizes[size] = []
            sizes[size].append(dupNumber)
            paths[dupNumber] = []
            getSize = False
        else:
            raise ValueError("could not find the size of the duplicate files")
    elif len(line) == 1:
        # blank line: the current duplicate group is complete
        dupNumber += 1
        getSize = True
    else:
        line = line[:-1]
        line = beginPath.sub('', line)
        paths[dupNumber].append(line)
    line = sys.stdin.readline()

# Total space that could be reclaimed by keeping a single copy per group,
# reported in MB.
totalSaving = 0
for size in sizes.keys():
    nbDupForSize = 0
    for dupNumber in sizes[size]:
        nbDupForSize += (len(paths[dupNumber]) - 1)
    totalSaving += size * nbDupForSize
totalSaving = round(totalSaving * 2**-20)
print('max possible saving:;%d MB' % (totalSaving))
print()
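# Worked example with the header sample (illustrative only): each group has one
# redundant copy, so the reclaimable space is
# 16265216 + 560149 + 5527 = 16830892 bytes, and 16830892 / 2**20 rounds to 16 MB.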

# One CSV row per file, largest duplicates first; groups are renumbered in
# decreasing size order and separated by a blank line.
print('size (kB);occurrences;possible saving (kB);duplicate;path')
newDupNum = 0
for size in sorted(sizes.keys(), reverse=True):
    sizeKB = round(size * 2**-10)
    for dupNumber in sizes[size]:
        newDupNum += 1
        occurrences = len(paths[dupNumber])
        print()
        possibleSavingKB = round((occurrences - 1) * size * 2**-10)
        for path in sorted(paths[dupNumber]):
            print('%d;%d;%d;%d;%s' % (sizeKB, occurrences, possibleSavingKB, newDupNum, path))
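
A minimal smoke test (a sketch, not part of this PR) that feeds the sample groups from the header comment to fdupes2table.py and prints the resulting CSV; the parser only looks at the leading number on each size line, so the exact wording after the size does not matter:

#!/usr/bin/python3
# Smoke test sketch for fdupes2table.py (assumes the script sits in the
# current directory; adjust the path as needed).
import subprocess

sample = (
    "16265216 bytes each:\n"
    "./titi.DOC\n"
    "./toto.DOC\n"
    "\n"
    "5527 bytes each:\n"
    "./titi.gif\n"
    "./toto.gif\n"
    "\n"
    "560149 bytes each:\n"
    "./titi.pdf\n"
    "./toto.pdf\n"
)

result = subprocess.run(
    ["python3", "fdupes2table.py"],
    input=sample, capture_output=True, text=True, check=True,
)
print(result.stdout, end="")

With this sample the output should match the CSV shown in the header comment: groups of 15884 kB, 547 kB and 5 kB, and 16 MB of possible saving.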