A package that allows a continuous non-blocking read of large batches of documents from a MongoDB database (remote or local), with some action performed on each batch.
Installation instructions:
- Download the mongojoin package
git clone https://github.com/knowbodynos/mongojoin.git
- Navigate into the main directory
cd mongojoin
- Install mongojoin
python setup.py install
Using mongojoin:
- Queries are given in the form of a Python list of lists:
[['<COLLECTION_1>',<JSON_QUERY_1>,<JSON_PROJECTIONS_1>,<OPTIONS_1>], ['<COLLECTION_2>',<JSON_QUERY_2>,<JSON_PROJECTIONS_2>,<OPTIONS_2>], ...]
where
  - <COLLECTION_#> is the name of the collection in the database.
  - <JSON_QUERY_#> is a query of the form {'_id': 10}.
  - <JSON_PROJECTIONS_#> is a projection of the form {'_id': 0}.
  - <OPTIONS_#> is a dictionary of options such as HINT, SKIP, SORT, LIMIT and COUNT, of the form {'HINT': {'<FIELD_1>': 1}, 'SKIP': 5, 'SORT': {'<FIELD_2>': 1}, 'LIMIT': 10, 'COUNT': True}.
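As an illustration of the query format above, here is a minimal sketch of a two-entry query list. The collection names (users, orders) and field names are hypothetical and chosen only for the example, and check_queries is a hypothetical helper that is not part of mongojoin; it only verifies the four-element shape described above.

```python
# Hypothetical collections and fields, for illustration only.
queries = [
    ["users", {"age": {"$gte": 18}}, {"_id": 1, "name": 1},
     {"SORT": {"_id": 1}, "LIMIT": 100}],
    ["orders", {"status": "shipped"}, {"_id": 0, "user_id": 1, "total": 1},
     {"SKIP": 5}],
]


def check_queries(queries):
    """Hypothetical helper: check that each entry has the documented shape
    [collection_name, query_dict, projection_dict, options_dict]."""
    for entry in queries:
        if len(entry) != 4:
            raise ValueError("each query needs exactly 4 elements: %r" % (entry,))
        name, query, projection, options = entry
        if not isinstance(name, str):
            raise TypeError("collection name must be a string: %r" % (name,))
        for part in (query, projection, options):
            if not isinstance(part, dict):
                raise TypeError("expected a dictionary, got: %r" % (part,))
    return True


check_queries(queries)
```

Checking the shape up front is optional, but it can catch a missing projection or options entry before any database call is made.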
- The main function is dbcrawl:
dbcrawl(db,queries,statefilepath,statefilename="querystate",inputfunc=lambda x:{"nsteps":1},inputdoc={"nsteps":1},action=printasfunc,readform=lambda x:eval(x),writeform=lambda x:x,timeleft=lambda:1,counters=[1,1],counterupdate=lambda x:None,resetstatefile=False,limit=None,limittries=10,toplevel=True,initdoc={})
where
  - db is a pymongo database object.
  - queries is a list of queries of the form described above.
  - statefilepath is a path to where an intermediate file will be stored, and statefilename is its filename.
  - inputfunc is a function that returns a dictionary with information that will be used for reading in documents. inputdoc is the first such dictionary, which is preloaded. nsteps refers to the number of documents that will be read in each batch.
  - action is a function that performs an action on each batch of documents.
  - readform and writeform allow you to alter the format in which processed documents are stored in the intermediate file statefilename.
  - timeleft is a function that returns how much time (in seconds) is left before some limit is reached (default: no limit).
  - counters is a list containing a batch counter and a document counter. Both are initialized to 1 by default.
  - resetstatefile is True or False depending on whether the intermediate file statefilename should be overwritten.
  - limit is a limit on how many documents should be processed in total. If there is no limit, set it to None (default).
  - limittries is a limit on how many times a read should be attempted before giving up.
  - toplevel and initdoc are internal recursive variables and should not be customized.
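The timeleft parameter described above can be sketched as a closure over a deadline. This is only one possible way to build such a function: make_timeleft and its clock argument are hypothetical names that are not part of mongojoin, and the sketch assumes that dbcrawl stops reading once timeleft() is no longer positive, which the description above suggests but does not state outright.

```python
import time


def make_timeleft(budget_seconds, clock=time.monotonic):
    """Hypothetical helper: return a zero-argument function that reports
    how many seconds remain of the given time budget."""
    deadline = clock() + budget_seconds

    def timeleft():
        # Positive while time remains, zero or negative once the budget is spent.
        return deadline - clock()

    return timeleft


# One-hour budget, counted from the moment this line runs.
timeleft = make_timeleft(3600)

# Hypothetical call, assuming db and queries are already defined and
# dbcrawl has been imported from the package:
# dbcrawl(db, queries, "<STATE_FILE_PATH>", timeleft=timeleft, limit=1000)
```

Using a monotonic clock keeps the remaining time unaffected by system clock adjustments during a long crawl.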
- Some useful actions are:
  - To print batches of documents to the screen, set
    action = printasfunc
  - To add batches of documents to a list of batches, set
    action = lambda x,y,z: my_list.append(z)
  - To add batches of documents to a list of documents, set
    action = lambda x,y,z: my_list.extend(z)
  - To write batches of documents to a file, set
    action = lambda x,y,z: writeasfunc("<FILE_PATH>",z)
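The difference between the append and extend actions above can be seen without a database. The sketch below calls two actions by hand on simulated batches; it assumes that the third argument is a list of document dictionaries, and it passes None for the first two arguments because their meaning is not described here. The function names and the sample documents are hypothetical.

```python
batches = []    # will hold one entry per batch
documents = []  # will hold one entry per document


def collect_batches(x, y, z):
    # Equivalent to: lambda x,y,z: batches.append(z)
    batches.append(z)


def collect_documents(x, y, z):
    # Equivalent to: lambda x,y,z: documents.extend(z)
    documents.extend(z)


# Simulated batches standing in for what dbcrawl would pass as the third argument.
simulated = [[{"_id": 1}, {"_id": 2}], [{"_id": 3}]]
for batch in simulated:
    collect_batches(None, None, batch)
    collect_documents(None, None, batch)

print(len(batches))    # 2 (two batches)
print(len(documents))  # 3 (three documents in total)
```

With append the batch boundaries are preserved, while extend flattens everything into a single list of documents.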