I did a data cleanup over the weekend. I doubt this is interesting
for anyone else; I just wanted to capture my own notes.

Our family photos are stored in Google Photos. We have about
100k photos and short movies that take up 0.5TB. I wanted to back
up to local storage, but I also liked the idea of trying the
mdisc format for long-term storage.

I ran Takeout and unzipped everything. I was surprised to see
so many duplicates. Some are copies in the same folders with suffixes
like "(1)", but most are multiple copies in different folders.
Takeout seems to just store a copy for each album a photo is in.

Some poking at the data shows 88% of 302,755 files / 72% of
the bytes were unique (histogram at the bottom of the post). Removing
dups can save 200+GB. Sure, not really worth the trouble, but why
not.

Steps

1. On the NAS, gather file checksums (md5sum) and sizes for all
the files under Takeout/Google Files. Mostly variations on
find -print0 | xargs -0.

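For reference, the commands were roughly this shape (the output
file names are placeholders, and the sizes one assumes GNU find
with -printf is available on the NAS):

find Takeout -type f -print0 | xargs -0 md5sum > hashes.txt
find Takeout -type f -printf '%s\t%p\n' > sizes.txt
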
2. Data cleanup to get nice file paths and clean delimiters,
mostly interactively with vi.

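The target was simple delimited files, one line per file, roughly
like this (the hash, path, and "|" delimiter here are made up for
illustration):

d41d8cd98f00b204e9800998ecf8427e|Takeout/Google Files/Photos from 2019/IMG_1234.JPG
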
3. Insert into sqlite using .separator and .import.

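The import itself is just a couple of dot-commands in the sqlite3
shell, something like this (database and file names are
placeholders, the tables below are created first, and I'm assuming
the cleaned files ended up pipe-delimited as in the example above):

sqlite3 photos.db
.separator "|"
.import hashes.txt Photos
.import sizes.txt Sizes
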
Here's what the database ended up looking like: Photos and Sizes
are inputs; HashCounts, Duplicates, and Candidates are
outputs.

CREATE TABLE Photos(hash TEXT,path TEXT);
CREATE TABLE Sizes(size INT,path TEXT);
CREATE TABLE HashCounts(hash TEXT,found INT);
CREATE TABLE Duplicates(path STRING,hash STRING,found INT,size INT);
CREATE TABLE Candidates(path STRING,hash STRING,size INT,pos INT);

4. Find the dups. I'm surprised to find some with 50 or more copies,
but it turns out some family favorites end up in lots of albums.

insert into HashCounts
select hash, sum(1) as found from Photos group by hash;

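Spotting the worst offenders is just a sort on that table,
something like:

select hash, found from HashCounts order by found desc limit 20;
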
5. Figure out candidates for deletion.

insert into Duplicates
select Photos.path, Photos.hash, HashCounts.found, Sizes.size
from Photos, HashCounts, Sizes
where HashCounts.hash=Photos.hash and Photos.path=Sizes.path
and HashCounts.found > 1;

The first of each set is the one we'll keep.

insert into Candidates
select *
from (select path, hash, size,
        row_number() over (partition by hash order by path desc) as row_number
      from (select * from Duplicates order by hash, path desc)
) where row_number > 1;

I was surprised that sqlite supports window functions, nice.

The reverse-alpha sort on "path" takes care of two cases.
I tend to prefer keeping photos with names that start with
years, and those come first alphabetically (nice). Also, within
folders there are often many copies with "(1)" and "(2)" suffixes
that are generally cruft and most worthy of removing, and those
also sort last alphabetically (nice).

6. Dump out the "Candidates" using .output. Copy back to the NAS.
Do lots of spot checks. Convert to a bash script of rm commands,
run very carefully.

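One way to do the dump and build the script (file names are
placeholders, I'm using a plain shell redirect rather than .output
here, and the sed quoting is naive -- it assumes no double quotes
or dollar signs in the paths):

sqlite3 photos.db "select path from Candidates;" > candidates.txt
sed 's|.*|rm -- "&"|' candidates.txt > delete_dups.sh
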
Tools

Sqlite is my go-to tool for ad-hoc work like this. It's fast
and simple, but only for small jobs -- this one is MB-scale.

select name, sum(pgsize) from dbstat group by name;
name           sum(pgsize)
-------------  -----------
Candidates     3006464
Duplicates     5541888
HashCounts     11051008
Photos         23240704
Sizes          13807616
sqlite_schema  4096

My Synology is a pretty good place for storage, with the ability to
ssh in and run local commands. But if I had to do this again, I
would have just bought a large locally-connected SSD.
All the transfers to and from the NAS were a hassle. Looking now, I'm
stunned that you can get a 4 TB external SSD for under $300.

Some private notes here.

(source)