# Publishing: Manual
There are two ways you can manually publish large amounts of data.

## TableSync::Publishing::Batch

Easier to use, but does a lot of DB queries. May filter out invalid PK values.

Pros:

- requires less work
- automatically uses the methods for data customization (`#attributes_for_sync`, `#attributes_for_destroy`), so you can be sure the data you publish is valid (more or less)

Cons:

- queries the database for each entity in a batch (for `create` and `update`)
- may also do a lot of queries if `#attributes_for_sync` contains additional data from other connected entities
- serializes values for `#publish_later`; if your PK contains invalid values (ex. `Date`), they will be filtered out
## TableSync::Publishing::Raw

More complex to use, but requires very few DB queries. You are responsible for the data you send!

Pros:

- very customizable; the only requirement is that `model_name` must exist
- you can send whatever data you want

Cons:

- you have to manually prepare the data for publishing
- you have to be really sure you are sending valid data
## Tips

- **Don't make data batches too large!** It may result in failure to process them. A good rule of thumb is to send approx. 5000 rows in one batch.
- **Make pauses between publishing data batches!** Publishing without pauses may overwhelm the receiving side: either their background job processor (ex. Sidekiq) may clog up with jobs, or their consumers may not be able to fetch messages from the Rabbit server fast enough. 1 or 2 seconds is a good wait period.
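The two tips above can be sketched with plain Ruby's `Enumerable#each_slice`. Everything here is a hypothetical stand-in: `rows` replaces your real dataset, and the commented-out `publish_batch` call marks where the actual TableSync publishing would go.

```ruby
# Hypothetical in-memory rows; in reality this would come from your DB.
rows = (1..12_000).map { |id| { id: id } }

# Chunk into batches of at most 5000 rows each.
batches = rows.each_slice(5000).to_a

batches.each_with_index do |batch, i|
  # publish_batch(batch) # e.g. TableSync::Publishing::Raw.new(...).publish_now

  # Pause between batches; no need to wait after the last one.
  sleep 1 unless i == batches.size - 1
end
```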
- **Do not use `TableSync::Publishing::Single` to send millions or even thousands of rows of data.** The same goes for simply calling `update` on rows with automatic sync. It WILL overwhelm the receiving side, especially if it has some additional receiving logic.
- **On the receiving side, don't create a job with custom logic (if it exists) for every row in a batch.** It is better to process the batch as a whole; otherwise three batches of 5000 rows will result in 15000 new jobs.
- **Send one test batch before publishing the rest of the data.** Make sure it was received properly. This way you won't send a lot of invalid messages.
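The test-batch tip boils down to a small pattern, sketched here in plain Ruby. The `publish` lambda is a hypothetical stand-in for the real `publish_now` call, and `batches` for your prepared data.

```ruby
published = []
# Hypothetical publisher; replace with TableSync::Publishing::Raw.new(...).publish_now.
publish = ->(batch) { published << batch }

batches = [[{ id: 1 }], [{ id: 2 }], [{ id: 3 }]]
test_batch, *remaining = batches

# Send one batch first...
publish.call(test_batch)
# ...verify on the receiving side that it was processed correctly...

# ...and only then send the rest.
remaining.each { |batch| publish.call(batch) }
```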
- **Check the other arguments.**
  - Ensure the `routing_key` is correct if you are using a custom one. Remember that batch publishing ignores `#attributes_for_routing_key` on a model.
  - Ensure that `object_class` is correct and that it belongs to the entities you want to send.
  - Ensure that you chose the correct event.
  - If you have some logic depending on headers, ensure they are also correct. Remember that batch publishing ignores `#attributes_for_headers` on a model.
- You can check what you will send before publishing with:

```ruby
TableSync::Publishing::Batch.new(...).message.message_params
TableSync::Publishing::Raw.new(...).message.message_params
```
```ruby
# For Sequel

# gem 'sequel-batches' or an equivalent that will allow you to chunk data somehow,
# or #each_page from Sequel

# This is a simple query.
# Yours can be much more complex, with joins and other things.
# Just make sure that it results in the set of data you expect.
data = User.in_batches(of: 5000).naked.select(:id, :name, :email)

data.each_with_index do |batch, i|
  TableSync::Publishing::Raw.new(
    model_name: "User",
    original_attributes: batch,
    event: :create,
    table_name: "users",        # optional, for notifications
    schema_name: "public",      # optional, for notifications
    routing_key: :custom_key,   # optional
    headers: { type: "admin" }, # optional
  ).publish_now

  # Make a pause between batches.
  sleep 1

  # For when you are sending from a terminal:
  # allows you to keep an eye on progress.
  # You can create more complex output.
  puts "Batch #{i} sent!"
end
```
If you don't want to create a data query (maybe it's too complex), but there is a lot of querying in `#attributes_for_sync` and you are willing to trade a little bit of performance, you can try the following.
```ruby
class User < Sequel::Model
  one_to_many :user_info

  # For example, our receiving side wants to know the IPs the user logged in under,
  # but doesn't want to sync the user_info itself.
  def attributes_for_sync
    attributes.merge(
      ips: user_info.ips
    )
  end
end

# To prevent the need to query for every piece of additional data, we can use eager
# loading and construct the published data by calling #attributes_for_sync.
# Don't forget to chunk it into more manageable sizes before trying to send.
data = User.eager(:user_info).map { |user| user.attributes_for_sync }
```
This way it will not make unnecessary queries. Remember though, it will still query for or initialize each row.
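To make the shape of the resulting payload concrete, here is a plain-Ruby mimic of what `#attributes_for_sync` from the model above effectively returns (the row values and IPs are made-up sample data, no Sequel required):

```ruby
# Hypothetical row attributes and eager-loaded association data.
attributes = { id: 1, name: "Ann", email: "ann@example.com" }
ips = ["10.0.0.1", "10.0.0.2"]

# The merge mirrors attributes.merge(ips: user_info.ips) in the model:
payload = attributes.merge(ips: ips)
# payload now carries both the row's own columns and the extra association data.
```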
```ruby
# You can just send ids; Batch will query the database for the rest.
data = User.in_batches(of: 5000).naked.select(:id)

data.each_with_index do |batch, i|
  TableSync::Publishing::Batch.new(
    object_class: "User",
    original_attributes: batch,
    event: :create,
    routing_key: :custom_key,   # optional
    headers: { type: "admin" }, # optional
  ).publish_now

  sleep 1
  puts "Batch #{i} sent!"
end
```