Twofishes indexer does not handle MongoDB query failures properly. #53
Subsequent execution of parse.py with the same input data, and without deleting the output directory, completed successfully, but the index appears to be corrupted. Twofishes starts and responds to queries without crashing, but throws exceptions on each query:
Preface this 'solution' by saying that I'm not a Scala programmer ... so there may be a better way to handle this. However, I was able to temporarily get past this build error, and avoid the corrupted index, by editing `fsqio/src/jvm/io/fsq/twofishes/indexer/output/PrefixIndexer.scala` to add a try/catch block around the section of code where the error occurs (example below). In the current version of this file, that means enclosing lines 115 through 140.

```scala
for {
  (prefix, index) <- sortedPrefixes.zipWithIndex
} {
  if (index % 1000 == 0) {
    log.info("done with %d of %d prefixes".format(index, numPrefixes))
  }
  try {
    val records = getRecordsByPrefix(prefix, PrefixIndexer.MaxNamesToConsider)

    val (woeMatches, woeMismatches) = records.partition(r =>
      bestWoeTypes.contains(r.woeTypeOrThrow))

    val (prefSortedRecords, unprefSortedRecords) =
      sortRecordsByNames(woeMatches.toList)

    val fids = new HashSet[StoredFeatureId]
    //roundRobinByCountryCode(prefSortedRecords).foreach(f => {
    prefSortedRecords.foreach(f => {
      if (fids.size < PrefixIndexer.MaxFidsToStorePerPrefix) {
        fids.add(f.fidAsFeatureId)
      }
    })

    if (fids.size < PrefixIndexer.MaxFidsWithPreferredNamesBeforeConsideringNonPreferred) {
      //roundRobinByCountryCode(unprefSortedRecords).foreach(f => {
      unprefSortedRecords.foreach(f => {
        if (fids.size < PrefixIndexer.MaxFidsToStorePerPrefix) {
          fids.add(f.fidAsFeatureId)
        }
      })
    }

    prefixWriter.append(prefix, fidsToCanonicalFids(fids.toList))
  } catch {
    case e: Exception => println("Skipping due to error processing prefixes")
  }
}
```

If you already built the Mongo DB, you'll obviously want to throw that out and start from scratch.
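The skip-on-error pattern used in that patch can be illustrated in isolation. The sketch below is a generic reconstruction, not the Twofishes code: `processPrefix` is a hypothetical stand-in for the per-prefix indexing work, and the loop counts and skips failing items so that one bad record cannot abort the whole run.

```scala
object SkipOnError {
  // Hypothetical stand-in for the per-prefix indexing work; throws on "bad" input.
  def processPrefix(prefix: String): Int =
    if (prefix.isEmpty) throw new IllegalArgumentException("empty prefix")
    else prefix.length

  // Process every prefix, skipping (and counting) the ones that fail.
  def processAll(prefixes: Seq[String]): (Seq[Int], Int) = {
    var skipped = 0
    val results = prefixes.flatMap { p =>
      try {
        Some(processPrefix(p))
      } catch {
        case e: Exception =>
          println(s"Skipping prefix due to error: ${e.getMessage}")
          skipped += 1
          None
      }
    }
    (results, skipped)
  }
}
```

The trade-off is the same as in the patch above: errors are silenced rather than fixed, so any prefix that fails is simply missing from the output.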
To address issue foursquare#53, I have wrapped prefix processing in a try/catch. I am not a Scala programmer, so this may not be the best implementation. However, if we do not contain regex errors related to prefix processing, the processing fails and corrupts the database/indices. Therefore, I feel it's worthwhile to trap the error here and simply skip over the failing prefixes.
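For a reviewer looking for a more idiomatic Scala alternative to the explicit try/catch, `scala.util.Try` turns each failure into a value that can be logged and dropped. This is only a sketch of the idea, again using a hypothetical `processPrefix` in place of the real indexing work.

```scala
import scala.util.{Failure, Success, Try}

object TrySkip {
  // Hypothetical per-prefix work; fails on empty input.
  def processPrefix(prefix: String): Int =
    if (prefix.isEmpty) throw new IllegalArgumentException("empty prefix")
    else prefix.length

  // Try wraps each call; failures become values we can log and filter out.
  def processAll(prefixes: Seq[String]): Seq[Int] =
    prefixes.flatMap { p =>
      Try(processPrefix(p)) match {
        case Success(n) => Some(n)
        case Failure(e) =>
          println(s"Skipping prefix '$p': ${e.getMessage}")
          None
      }
    }
}
```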
While importing the data into the DB with ``./src/jvm/io/fsq/twofishes/scripts/parse.py -w `pwd`/data/``, the indexer crashed with the following error message. After the crash the indexer was stuck doing nothing, and no more records were processed. The processed data was downloaded with `./src/jvm/io/fsq/twofishes/scripts/download-world.sh`.