first pass on import/export #174

Ch4s3 · 2018-03-02T02:04:49Z

Trying to address #172
I'm still working through this, but dumping a plain Ruby hash seems like the way to go.

TODO:

return data structure from export to support multiple formats
works on in memory
works with redis
figure out how to work with old and new code
update docs
write simple data migration guide

Ch4s3 · 2018-03-02T02:06:24Z

@ibnesayeed what are your thoughts? Is object serialization the wrong direction? I'm trying to think through how to make this work with the pre-backend code.

Ch4s3 · 2018-03-02T02:39:33Z

@parkr is the best practice for releasing a point release, e.g. 2.1.1, to release it from this branch? I want to release this functionality along side a 2.2.1 so that we can make it easy to migrate data. It seems that from a data marshaling perspective, 2.2 doesn't work with data from older versions due to the new backend api.

parkr · 2018-03-02T02:48:36Z

@Ch4s3 I create a branch like 2.1-stable or 2.0-stable (the major+minor version you're targeting) usually off of the tag of the latest patch version in that series if the branch doesn't already exist. I merge the PR into master, then I backport the PR to the old branch. I never release from a topic/PR branch, always from either master or a X.Y-stable branch, since they're long-lived and people rely on the git tags and such for debugging.

Ch4s3 · 2018-03-02T02:50:43Z

@parkr thanks! That makes a lot of sense, I was having trouble finding a good reference for this.

ibnesayeed · 2018-03-03T20:10:56Z

The hash approach is good, that's how I anticipated it when I described in the sample example in #172 and said that rather than passing YAML files as argument, we should use objects (I meant hashes) in the backend and deal with the serialization outside to allow other formats not just YAML.

Ch4s3 · 2018-03-04T16:14:49Z

Ok. I’ll figure out a roll out strategy later today or tomorrow.

Ch4s3 · 2018-03-06T04:10:03Z

This is turning out to be a bit tricky to get working with new and old versions. Stay tuned.

Ch4s3 · 2018-03-06T05:28:10Z

@ibnesayeed can you take a look at this. I have it working with the in memory backend for the new code, but not redis. I'm not sure what I'm missing though.

Once this works with redis, the back port should be simple.

ibnesayeed · 2018-03-06T15:52:08Z

I would envision it organized something like this:

Have an independent DataHandler module along the lines of ClassifierValidator
The main task of that module is to read (import) from various formats of files (primarily YAML) and load them as a ruby hash and given a ruby hash serialize (export) it to various formats
How exactly that hash is formed or loaded is a concern of backends, but the DataHandler will not talk to backends directly, instead it will use the Bayes class as the proxy

# classifier-reborn/lib/classifier-reborn/data_handler/bayes_data_handler.rb
module ClassifierReborn
  module DataHandler
    def import!(classifier_obj, data_file)
      data = YAML::load_file(data_file)
      classifier_obj.import!(data)
    end

    def export(classifier_obj, output_file)
      data = classifier_obj.export
      File.write(output_file, data.to_yaml)
    end
  end
end

# classifier-reborn/lib/classifier-reborn/bayes.rb
module ClassifierReborn
  class Bayes
    def import!(data)
      @backend.import!(data)
    end

    def export
      @backend.export
    end
  end
end

# classifier-reborn/lib/classifier-reborn/backends/bayes_memory_backend.rb
module ClassifierReborn
  class BayesMemoryBackend
    def import!(data)
      reset
      # Iterate over the data and populate the backend
    end

    def export
      # Return a data hash based on the current state of the backend
    end
  end
end

# classifier-reborn/lib/classifier-reborn/backends/redis_memory_backend.rb
module ClassifierReborn
  class BayesRedisBackend
    def import!(data)
      reset
      # Iterate over the data and populate the backend
    end

    def export
      # Return a data hash based on the current state of the backend
    end
  end
end

Once you have it working on the current version, it should be easier to backport in older version.

Ch4s3 · 2018-03-06T16:29:18Z

@ibnesayeed I like that organization a lot. My primary concern at the moment is that I'm a bit stuck with the implementation on the Redis backend.

ibnesayeed · 2018-03-06T17:06:44Z

My primary concern at the moment is that I'm a bit stuck with the implementation on the Redis backend.

If I guess it right, you might have trouble fetching all the values (when exporting) that are stored as hashes or nested hashes in the Redis backend (such as :category_training_count). For that you can use the HSCAN command of Redis that works using Cursors. You start scanning from the cursor 0 which will return the next cursor identifier and a subset of the hash that you can immediately use in your data hash which is being prepared for export. Then repeat scanning with the returned next cursor and update the data hash until the next cursor is 0 again. This will guarantee that you have iterated over all the entries that were not modified in the interim. Each step of the cursor fetch might return a different number of items in it, but they will be roughly around a bucket size decided by the Redis engine (that can be customized, but don't need to worry about that).

Following is an untested pseud-code to illustrate what I mean, which can be tested in the Pry terminal to see what is going on.

data[:category_training_count] = {}
next_cursor = "0"
loop do
  next_cursor, records = @redis.hscan(:category_training_count, next_cursor)
  records.each do |k, v|
    data[:category_training_count][k] = v.to_i
  end
  break if next_cursor == "0"
end

ibnesayeed · 2018-03-06T18:54:21Z

You can use the following private helper method to fetch stored hashes of categories and words in each category.

# classifier-reborn/lib/classifier-reborn/backends/bayes_redis_backend.rb
module ClassifierReborn
  class BayesRedisBackend

    private
    def fetch_hash(name)
      obj = {}
      next_cursor = "0"
      loop do
        next_cursor, records = @redis.hscan(name, next_cursor)
        obj.merge!(records.map{|k, v| [k, v.to_i]}.to_h)
        break if next_cursor == "0"
      end
      obj
    end

  end
end

ibnesayeed · 2018-03-06T20:56:49Z

It turns out that HGETALL is more straightforward command for hashes in Redis. So the following code should work just as good (no need to scan through cursor and merging subsets):

data[:category_training_count] = @redis.hgetall(:category_training_count).transform_values(&:to_i)

Ch4s3 · 2018-03-07T16:10:37Z

Thanks. I'll work on this a bit more in the next day or so.

khataev · 2019-11-16T03:45:14Z

any progress on this PR?

Ch4s3 mentioned this pull request Mar 2, 2018

Migrating classifier data from an older classifier-reborn structure #172

Open

Ch4s3 mentioned this pull request Mar 2, 2018

Fix the corner cases #173

Merged

Ch4s3 changed the base branch from master to 2.1-stable March 2, 2018 02:56

Ch4s3 changed the base branch from 2.1-stable to master March 2, 2018 02:57

Ch4s3 added 2 commits March 5, 2018 22:58

first pass on import/export

2ca9dfd

make import/export work with simple hashes

74fd30d

Ch4s3 force-pushed the import_export branch from 441e9c7 to 74fd30d Compare March 6, 2018 04:01

Ch4s3 added 2 commits March 5, 2018 23:03

move tests

9524dc6

missing import

da7beb9

working backend import/exp for in mem only

36d0532

Ch4s3 mentioned this pull request Jun 20, 2018

ability to serialize model? #178

Open

Ch4s3 closed this Apr 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

first pass on import/export #174

first pass on import/export #174

Ch4s3 commented Mar 2, 2018 •

edited

Loading

Ch4s3 commented Mar 2, 2018

Ch4s3 commented Mar 2, 2018

parkr commented Mar 2, 2018

Ch4s3 commented Mar 2, 2018

ibnesayeed commented Mar 3, 2018

Ch4s3 commented Mar 4, 2018

Ch4s3 commented Mar 6, 2018

Ch4s3 commented Mar 6, 2018

ibnesayeed commented Mar 6, 2018

Ch4s3 commented Mar 6, 2018

ibnesayeed commented Mar 6, 2018 •

edited

Loading

ibnesayeed commented Mar 6, 2018 •

edited

Loading

ibnesayeed commented Mar 6, 2018 •

edited

Loading

Ch4s3 commented Mar 7, 2018

khataev commented Nov 16, 2019

first pass on import/export #174

first pass on import/export #174

Conversation

Ch4s3 commented Mar 2, 2018 • edited Loading

Ch4s3 commented Mar 2, 2018

Ch4s3 commented Mar 2, 2018

parkr commented Mar 2, 2018

Ch4s3 commented Mar 2, 2018

ibnesayeed commented Mar 3, 2018

Ch4s3 commented Mar 4, 2018

Ch4s3 commented Mar 6, 2018

Ch4s3 commented Mar 6, 2018

ibnesayeed commented Mar 6, 2018

Ch4s3 commented Mar 6, 2018

ibnesayeed commented Mar 6, 2018 • edited Loading

ibnesayeed commented Mar 6, 2018 • edited Loading

ibnesayeed commented Mar 6, 2018 • edited Loading

Ch4s3 commented Mar 7, 2018

khataev commented Nov 16, 2019

Ch4s3 commented Mar 2, 2018 •

edited

Loading

ibnesayeed commented Mar 6, 2018 •

edited

Loading

ibnesayeed commented Mar 6, 2018 •

edited

Loading

ibnesayeed commented Mar 6, 2018 •

edited

Loading