mongojoin

A package that allows a continuous non-blocking read of large batches of documents from a MongoDB database (remote or local), with some action performed on each batch.

Installation instructions:

Download the mongojoin package

   git clone https://github.com/knowbodynos/mongojoin.git

Navigate into the main directory

   cd mongojoin

Install mongojoin

   python setup.py install

Using mongojoin:

Queries are given as a Python list of lists, one inner list per collection:

   [['<COLLECTION_1>',<JSON_QUERY_1>,<JSON_PROJECTIONS_1>,<OPTIONS_1>], ['<COLLECTION_2>',<JSON_QUERY_2>,<JSON_PROJECTIONS_2>,<OPTIONS_2>], ...]

where

<COLLECTION_#> is the name of the collection in the database.
<JSON_QUERY_#> is a query of the form {'_id': 10}.
<JSON_PROJECTIONS_#> is a projection of the form {'_id': 0}.
<OPTIONS_#> is a dictionary of options such as HINT, SKIP, SORT, LIMIT, and COUNT, of the form {'HINT': {'<FIELD_1>': 1}, 'SKIP': 5, 'SORT': {'<FIELD_2>': 1}, 'LIMIT': 10, 'COUNT': True}.
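
As an illustration of the query format above, the following sketch builds a two-entry query list. The collection names (users, orders) and field names are hypothetical, and the check_queries helper is not part of mongojoin; it only verifies the four-element shape described above.

```python
# Hypothetical collections and fields, used only to illustrate the list-of-lists shape.
queries = [
    ["users", {"age": {"$gte": 18}}, {"_id": 0, "name": 1}, {"SORT": {"name": 1}, "LIMIT": 100}],
    ["orders", {"status": "shipped"}, {"_id": 0}, {"SKIP": 5, "LIMIT": 10}],
]


def check_queries(queries):
    """Illustrative helper (not part of mongojoin): verify the shape of each entry."""
    for entry in queries:
        if len(entry) != 4:
            raise ValueError("expected [collection, query, projection, options]")
        collection, query, projection, options = entry
        if not isinstance(collection, str):
            raise TypeError("collection name must be a string")
        for part in (query, projection, options):
            if not isinstance(part, dict):
                raise TypeError("query, projection and options must be dictionaries")
    return True


check_queries(queries)
```
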

The main function is dbcrawl:

   dbcrawl(db,queries,statefilepath,statefilename="querystate",inputfunc=lambda x:{"nsteps":1},inputdoc={"nsteps":1},action=printasfunc,readform=lambda x:eval(x),writeform=lambda x:x,timeleft=lambda:1,counters=[1,1],counterupdate=lambda x:None,resetstatefile=False,limit=None,limittries=10,toplevel=True,initdoc={})

where

db is a pymongo database object.
queries is a list of queries in the form described above.
statefilepath is a path to where an intermediate file will be stored, and statefilename is its filename.
inputfunc is a function that returns a dictionary with information that will be used for reading in documents. inputdoc is the first dictionary that is preloaded. nsteps refers to the number of documents that will be read in each batch.
action is a function that performs an action on each batch of documents.
readform and writeform allow you to alter the format in which processed documents are stored in the intermediate file statefilename.
timeleft is a function that returns how much time (in seconds) is left before some limit is reached (default: no limit).
counters is a list containing a batch counter and a document counter. Both are initialized to 1 by default.
resetstatefile is True or False depending on whether the intermediate file statefilename should be overwritten.
limit is a limit on the total number of documents to process. If there is no limit, set it to None (default).
limittries is a limit on how many times a read should be attempted before giving up.
toplevel and initdoc are internal recursive variables and should not be customized.
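
As a sketch of what a custom timeleft function might look like, the helper below returns a zero-argument function reporting the seconds remaining in a fixed time budget. The name make_timeleft and the one-hour budget are assumptions for illustration, not part of mongojoin, and how dbcrawl reacts to the returned value is not described here.

```python
import time


def make_timeleft(budget_seconds, start=None):
    """Illustrative helper (not part of mongojoin).

    Returns a zero-argument function that reports how many seconds remain
    in a budget of `budget_seconds`, measured from `start`.
    """
    if start is None:
        start = time.monotonic()

    def timeleft():
        elapsed = time.monotonic() - start
        return budget_seconds - elapsed

    return timeleft


# Assumed budget of one hour.
timeleft = make_timeleft(3600)
print(timeleft() > 0)

# The function would then be passed by keyword, for example:
# dbcrawl(db, queries, statefilepath, timeleft=timeleft)
```
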

Some useful actions are:
1. To print batches of documents to the screen, set action = printasfunc
2. To add batches of documents to a list of batches, set action = lambda x,y,z: my_list.append(z)
3. To add batches of documents to a list of documents, set action = lambda x,y,z: my_list.extend(z)
4. To write batches of documents to a file, set action = lambda x,y,z: writeasfunc("<FILE_PATH>",z)
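
The list-collecting actions above can also be written as named functions. In this sketch the first two arguments are left as placeholders, since only the third argument (the batch) is used in the examples above; the sample batches and the loop are a stand-in for the calls dbcrawl would make, not mongojoin code.

```python
all_batches = []  # one entry per batch
all_docs = []     # one entry per document


def collect_batches(x, y, batch):
    """Append the whole batch as a single element."""
    all_batches.append(batch)


def collect_docs(x, y, batch):
    """Append each document of the batch individually."""
    all_docs.extend(batch)


# Stand-in for dbcrawl: hypothetical batches passed as the third argument.
sample_batches = [[{"_id": 1}, {"_id": 2}], [{"_id": 3}]]
for batch in sample_batches:
    collect_batches(None, None, batch)
    collect_docs(None, None, batch)

print(len(all_batches), len(all_docs))
```

With append, the result keeps the batch boundaries; with extend, it is a flat list of documents.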
