
First steps on migrating to a new version of mongo. #3651

Draft · wants to merge 7 commits into master
Conversation

@zenhack (Collaborator) commented Jul 31, 2022

Per discussion on IRC, we're going to write a helper binary for this in Go, because:

  • The currently maintained driver package supports mongo versions back
    to 2.6 (what we're using), which is not true for most other languages.
  • We already have Go in our toolchain for boringssl's test suite.
  • The build is unlikely to break due to bitrot in Go's toolchain.
  • The generated binary is static, so if all else fails we can just
    bundle the executable, though I don't anticipate needing that.
  • I will be much more productive in Go than in something else.

Right now all this does is bundle up a hello-world Go binary with Sandstorm's build system. Marking as a draft.
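
For context, a minimal sketch of where the helper is headed, assuming the currently maintained official driver (go.mongodb.org/mongo-driver); the address and database name below are placeholder assumptions, not Sandstorm's actual configuration:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Placeholder URI; the real helper would get Sandstorm's mongo
	// address and credentials from its environment.
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://127.0.0.1:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// First smoke test: list the collections in the database.
	// "meteor" is an assumed database name here; adjust as needed.
	names, err := client.Database("meteor").ListCollectionNames(ctx, bson.D{})
	if err != nil {
		log.Fatal(err)
	}
	for _, name := range names {
		fmt.Println(name)
	}
}
```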

zenhack added 5 commits August 1, 2022 01:32
- I'm getting a permissions error trying to just run this from my local
  dev directory; need to figure out what's going on.
- We're successfully listing the collections.
@xet7 (Contributor) commented Aug 2, 2022

Could be related: listing MongoDB collections and exporting them to JSON.

https://github.com/wekan/wekan/wiki/Export-from-Wekan-Sandstorm-grain-.zip-file#11b-dump-database-to-json-text-files

@ocdtrekkie (Collaborator) commented

@xet7 I am hoping this particular project may also yield a way to upgrade meteor-spk-built applications to the newest Mongo version.

@xet7 (Contributor) commented Aug 2, 2022

@ocdtrekkie

I also wondered whether it would make any sense to convert raw MongoDB database files directly to another format, without starting any MongoDB server, and whether there would be a similar way to handle the mongodump file format. But I presume it may not be so useful.

https://www.percona.com/blog/2021/05/18/wiredtiger-file-forensics-part-1-building-wt/

It would be nice if all conversion of Sandstorm MongoDB databases were scheduled to happen at night, so that it would not disturb daytime use of the apps.

These conversions also need checks that there is enough free disk space for the conversion.
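
For illustration, a minimal sketch of such a free-space check in Go, assuming Linux and the golang.org/x/sys/unix package; the path and the 2x headroom factor are arbitrary examples, not a Sandstorm policy:

```go
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

// enoughSpace reports whether the filesystem holding path has at least
// need bytes available to unprivileged writers.
func enoughSpace(path string, need uint64) (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return false, err
	}
	return st.Bavail*uint64(st.Bsize) >= need, nil
}

func main() {
	// e.g. require twice the current database size before converting.
	const dbSize = 400 << 30 // 400 GiB, the size mentioned below
	ok, err := enoughSpace("/var", 2*dbSize)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("enough space:", ok)
}
```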

@xet7 (Contributor) commented Aug 2, 2022

A mongodump (or mongorestore) of a 400 GB MongoDB database takes about 4 hours. Some Snap and Docker WeKan users have databases of that size.

@zenhack (Collaborator, Author) commented Aug 2, 2022

FWIW, the dumping half of this is already written. I'd be wary of dumping it to json though, since mongo's native format is bson, which supports a couple extra data types, like timestamps and binary blobs -- so dumping to json loses information.
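
To make the lossiness point concrete, a small sketch using the driver's bson package (go.mongodb.org/mongo-driver): a plain encoding/json dump drops the BSON type tags, while canonical Extended JSON (bson.MarshalExtJSON) keeps them so the dump can round-trip back to BSON:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/bson/primitive"
)

func main() {
	// A document with the BSON types mentioned above: a datetime
	// and a binary blob.
	doc := bson.D{
		{Key: "when", Value: primitive.NewDateTimeFromTime(time.Now())},
		{Key: "blob", Value: primitive.Binary{Data: []byte{0xde, 0xad, 0xbe, 0xef}}},
	}

	// Plain JSON: the values are still printable, but the BSON type
	// tags are gone, so the dump can't be mapped back to the original
	// types unambiguously.
	plain, _ := json.Marshal(doc.Map())
	fmt.Println(string(plain))

	// Canonical Extended JSON: types are preserved as tagged values
	// like {"$date": ...} and {"$binary": ...}.
	ext, _ := bson.MarshalExtJSON(doc, true, false)
	fmt.Println(string(ext))
}
```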

I suspect for sandstorm itself it won't be too slow; the database isn't that huge, since most data is stored in grains' storage. Though I'd be curious to know how big the database on alpha is (@kentonv?)

Note that on my local dev instance the on-disk usage of /opt/sandstorm/var/mongo was around 128MiB, while the exported data was less than 512KiB, so I assume it's doing some pre-allocation or something.

@xet7 (Contributor) commented Aug 2, 2022

@zenhack

> I'd be wary of dumping it to json though, since mongo's native format is bson, which supports a couple extra data types, like timestamps and binary blobs -- so dumping to json loses information.

Really? Binary blobs are exported in base64-encoded format, like GridFS attachments. Is there more info about this somewhere?

@zenhack (Collaborator, Author) commented Aug 2, 2022 via email

@xet7 (Contributor) commented Aug 2, 2022

@zenhack

That bash script exports each collection/table to a separate JSON file. By opening each file in a text editor, I can see the JSON structure, whether it is nested, etc. For attachments, each part of an attachment has an ID and a base64 string. If a file is divided into many parts, other info in the JSON shows the filename, size, md5, part IDs, etc.

A more useful way would be to save attachments to binary files, using those unique file IDs as filenames. It's not so useful to use the real filenames, because there are many attachments with the same names, special characters get urlencoded, etc.

Another way would be to name attachments by their sha256 or some other hash, and in that way deduplicate them and save disk space.
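
For illustration, a minimal sketch of that content-addressed naming idea in Go; the helper name and flat directory layout here are hypothetical examples:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// storeAttachment writes data into dir under the hex SHA-256 of its
// contents and returns that name. Storing identical bytes twice reuses
// the existing file, so duplicates cost no extra disk space.
func storeAttachment(dir string, data []byte) (string, error) {
	sum := sha256.Sum256(data)
	name := hex.EncodeToString(sum[:])
	path := filepath.Join(dir, name)
	if _, err := os.Stat(path); err == nil {
		return name, nil // already stored: deduplicated
	}
	if err := os.WriteFile(path, data, 0o600); err != nil {
		return "", err
	}
	return name, nil
}

func main() {
	name, err := storeAttachment(os.TempDir(), []byte("example attachment"))
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```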

I have also thought about encrypting files and other data, but I have not coded that yet.

I also have not yet coded scripts to convert the JSON etc. to other formats.

@zenhack (Collaborator, Author) commented Aug 2, 2022 via email
