
Refactoring, Dynamic prefix and AWS v2 #102

Closed
wants to merge 8 commits into master from feature/cleaner-code

Conversation

Contributor

@ph ph commented Sep 16, 2016

Motivation
One of the most requested features was a way to define dynamic prefixes, using the fieldref
syntax, for the files on the bucket, along with the pipeline changes needed to support the shared delegator.
The S3 output was by nature always single-threaded for writes, but had multiple workers to process the upload; the code was only threadsafe when used in the :single concurrency mode.

This PR addresses a few problems and provides shorter, more structured code (a sketch of an example configuration follows the list):

  • This plugin now uses v2 of the AWS SDK; this makes sure we receive the latest updates and changes.
  • We now use S3's upload_file instead of reading chunks; this method is more efficient and will use multipart uploads with threads if the file is too big.
  • You can now use the fieldref syntax in the prefix to dynamically change the target based on the events it receives.
  • The upload queue is now a bounded list; this is necessary to allow back pressure to be communicated back to the pipeline, and the bound is configurable by the user.
  • If the queue is full, the plugin will start the upload in the current thread.
  • The plugin is now threadsafe and supports the shared concurrency model.
  • The rotation strategy can be selected; the recommended one is size_and_time, which checks both of the configured limits (size and time are also available).
  • The restore option will now use a separate threadpool with an unbounded queue.
  • The restore option will not block the launch of Logstash and will use fewer resources than the real-time path.
  • The plugin now uses multi_receive_encode, which optimizes the writes to the files.
  • Rotate operations are now batched to reduce the number of IO calls.
  • Empty files will not be uploaded by any rotation strategy.
  • We now use Concurrent-Ruby for the implementation of the Java executor.
  • If you have finer-grained permissions on prefixes or want a faster boot, you can disable the credentials check with validate_credentials_on_root_bucket.
  • The credentials check will no longer fail if we can't delete the file.
  • We now have a full suite of integration tests for all the defined rotation strategies.

Fixes: #4 #81 #44 #59 #50
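To make the feature set concrete, here is a rough sketch of a configuration exercising the new options. It is not taken from the PR itself: the bucket name and values are illustrative, and option names such as rotation_strategy are inferred from the descriptions above.

```
output {
  s3 {
    bucket => "my-app-logs"                       # illustrative bucket name
    region => "us-east-1"
    prefix => "%{type}/%{+YYYY}/%{+MM}/%{+dd}/"   # fieldref/date syntax, resolved per event
    rotation_strategy => "size_and_time"          # "size" and "time" are also available
    size_file => 5242880                          # ~5 MB, matching the new default of 1024 * 1024 * 5
    time_file => 15                               # intended as minutes per the docs (see the minutes-vs-seconds discussion below)
    restore => true                               # re-upload files left over from a previous run
    validate_credentials_on_root_bucket => false  # skip the root-bucket credentials check
  }
}
```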

@tkampour

tkampour commented Sep 20, 2016

Great work, guys.
How likely is this to be merged soon? Seeing the other pull request from Oct 2015, I am a bit worried about its timing.

@ph
Contributor Author

ph commented Sep 20, 2016

@tkampour Since this is a big change, in fact a complete rewrite, it needs to go through the review process like any other PR. Since you have commented on this issue, you will get notified when it happens. This is one of the most requested features, so I am also eager to get it merged in.

@lremurphy

Awesome work! Can't wait to try this!

@suyograo suyograo assigned suyograo and unassigned jsvd and jordansissel Sep 20, 2016
@vistorve

I too am very excited for this change, so excited that I am attempting to create a local build to get it working, but I keep running into the following:

Couldn't find any output plugin named 's3'. Are you sure this is correct? Trying to load the s3 output plugin resulted in this error: no such file to load -- concurrent/map", :level=>:error, :file=>"logstash/agent.rb", :line=>"448", :method=>"create_pipeline"}
{:timestamp=>"2016-09-20T16:29:51.292000+0000", :message=>"starting agent", :level=>:info, :file=>"logstash/agent.rb", :line=>"213", :method=>"execute"}

I built the gem and installed it with --no-verify. Logstash version 2.4; I am doing this inside a Docker container, if that makes a difference.

@suyograo

@vistorve How did you build this gem? Are you following the steps here? https://github.com/logstash-plugins/logstash-output-s3#2-running-your-unpublished-plugin-in-logstash

@vistorve

vistorve commented Sep 20, 2016

@suyograo I did the steps in 2.2, under "or you can build the gem and install it using". That seemed like the easier approach since it was in a Docker container.
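For reference, that build-and-install route is roughly the following; the gem path is illustrative, and on some 2.x installs the plugin script is bin/plugin rather than bin/logstash-plugin:

```sh
# From a checkout of the plugin source:
gem build logstash-output-s3.gemspec

# From the Logstash installation directory:
bin/logstash-plugin install --no-verify /path/to/logstash-output-s3-<version>.gem
```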

@ph
Contributor Author

ph commented Sep 20, 2016

@vistorve Nice find, this is a bug in this PR. I am using concurrent-ruby's Map implementation, which is only available in >= 1.0.0. Logstash 5.0/master ships with it, but 2.4 ships with 0.9.2. The dependency is not strict enough in the gemspec; this is why you were able to install it but it failed to run.

logstash-core depends on a pinned version of it in https://github.com/elastic/logstash/blob/2.4/Gemfile.jruby-1.9.lock#L162, so if the plugin requires a newer version, core won't permit it.

@suyograo If we want to make this change work in both 2.4 and 5.0, we have the following options (a sketch of option 2 follows the list):

  1. When running on 2.4, use the ThreadSafe::Hash that already ships with Logstash; otherwise use concurrent-ruby's Map.
  2. Replace concurrent-ruby's Map with java.util.concurrent.ConcurrentMap (https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ConcurrentMap.html), which is part of the Java standard library.
  3. Release a 2.4.1 with an updated concurrent-ruby.

The other features I use from concurrent-ruby, ScheduledTask and the threadpool, are available in 0.9.2.
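A minimal JRuby sketch of option 2 (the variable names are illustrative; ConcurrentHashMap is the standard-library implementation of the ConcurrentMap interface):

```ruby
require "java"

java_import java.util.concurrent.ConcurrentHashMap

# Ships with every JVM, so no gem dependency and no version conflict
# between Logstash 2.4 (concurrent-ruby 0.9.2) and 5.0 (1.0.x).
prefixes = ConcurrentHashMap.new

# Atomic get-or-create via put_if_absent (Java 5+): if another thread
# won the race, adopt its value instead of ours.
candidate = "factory-for-logs"
previous  = prefixes.put_if_absent("logs/2016/", candidate)
value     = previous || candidate
```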

@suyograo

@ph can we do #2 since it looks like a safe option across 2.4 and 5.0?

@ph
Contributor Author

ph commented Sep 21, 2016

@suyograo #2? We have had the temporary_directory option for quite a while now?

@suyograo

@ph I meant point 2:

Replace concurrent-ruby's map with https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ConcurrentMap.html which is part of the standard library.

@ph
Contributor Author

ph commented Sep 21, 2016

/facepalm ☕ ☕

@andrewvc
Contributor

FWIW I generally prefer the java util concurrent stuff to the concurrent ruby objects. IMHO the main point of concurrent ruby is to be portable across ruby implementations' concurrency models, which is not a concern for Logstash.

Contributor

@andrewvc andrewvc left a comment

Absolutely amazing work here, @ph!

This is an initial pass at the review. I have not yet covered the test suite. I'll do another round once the changes here have been discussed and made.


 # S3 bucket
-config :bucket, :validate => :string
+config :bucket, :validate => :string, :required => true

# Set the size of file in bytes, this means that files on bucket when have dimension > file_size, they are stored in two or more file.
Contributor

This needs some grammatical fixes; I don't completely understand what these comments are trying to say.

I believe it is something like:

"Set the target size of uploaded files in bytes. This will result in multiple files being created once the data exceeds that size threshold."

The note regarding the local file I don't understand.


 # Set the size of file in bytes, this means that files on bucket when have dimension > file_size, they are stored in two or more file.
 # If you have tags then it will generate a specific size file for every tags
 ##NOTE: define size of file is the better thing, because generate a local temporary file on disk and then put it in bucket.
-config :size_file, :validate => :number, :default => 0
+config :size_file, :validate => :number, :default => 1024 * 1024 * 5
Contributor

Maybe this should be named target_file_size, since it is a goal but not a guarantee. Alternatively, max_file_size would make sense.

Contributor Author

+1, I have wanted to change these names for quite a while. I will keep the old ones working and mark them as deprecated.


# Set the time, in MINUTES, to close the current sub_time_section of bucket.
Contributor

This also needs to be edited for clarity.


 # Set the time, in MINUTES, to close the current sub_time_section of bucket.
 # If you define file_size you have a number of files in consideration of the section and the current tag.
 # 0 stay all time on listerner, beware if you specific 0 and size_file 0, because you will not put the file on bucket,
 # for now the only thing this plugin can do is to put the file when logstash restart.
-config :time_file, :validate => :number, :default => 0
+config :time_file, :validate => :number, :default => 15 * 60
Contributor

Maybe this should be renamed max_upload_delay or something similar. From the comments above, I'm not sure whether this setting controls the max time from file creation or the max time from the last write.

Contributor Author

It is the max time from file creation in this case.

@@ -102,7 +121,7 @@ class LogStash::Outputs::S3 < LogStash::Outputs::Base
config :restore, :validate => :boolean, :default => false
Contributor

As we discussed, this will be changed to default to true.

  # Ensure that all access or work done
  # on a factory is threadsafe
  class PrefixedValue
    def initialize(factory, stale_time)
Contributor

factory lacks context. Have you considered renaming this to file_factory or similar?

#
# Since the UUID should be fairly unique I can destroy the whole path when an upload is complete.
# I do not have to mess around to check if the other directory have file in it before destroying them.
class TemporaryFileFactory
Contributor

@andrewvc andrewvc Sep 21, 2016

Is this really a factory? It seems more like a proxy. I would expect a factory to make and return new objects; this abstracts over an underlying set of files. Consider renaming it to RotatingTemporaryFile or something else that's more descriptive.

Contributor Author

I agree, it feels more like a proxy in this case.

    @size_strategy.rotate?(file) || @time_strategy.rotate?(file)
  end

  def need_periodic?
Contributor

s/need_periodic/needs_periodic/

  extend Forwardable
  DELEGATES_METHODS = [:path, :write, :close, :size, :fsync]

  def_delegators :@fd, *DELEGATES_METHODS
Contributor

AFAICT we only use a constant here so a later spec can test that these methods are defined. I would just define them inline and remove the spec; I don't think there's much value in specs that check that methods merely exist, since at that point we're testing that Forwardable isn't broken.

I don't think we need any testing here, FWIW, and we can simplify this.
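A sketch of the inline form being suggested, assuming a TemporaryFile-style wrapper class (the class name and constructor are illustrative):

```ruby
require "forwardable"

class TemporaryFile
  extend Forwardable

  # Delegate the IO-style calls straight to the wrapped descriptor; no
  # DELEGATES_METHODS constant is left around for a spec to assert on.
  def_delegators :@fd, :path, :write, :close, :size, :fsync

  def initialize(fd)
    @fd = fd
  end
end
```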

  end

  def generate_name
    filename = "ls.s3.#{Socket.gethostname}.#{current_time}"
Contributor

We need to document that there is a (remote) possibility of collisions between LS instances if this naming scheme is used. If users run multiple Logstash instances, they should probably give them unique prefixes.

Alternatively, we could inject some randomness in here as well.

Contributor Author

@ph ph Sep 22, 2016

What are you suggesting for randomness?

We have:

  • the host
  • a date

SecureRandom.uuid?
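A sketch of what that could look like; the final format was settled during review, so this only illustrates adding a SecureRandom.uuid segment to the existing host + timestamp name (current_time is the helper already used in the quoted snippet):

```ruby
require "securerandom"
require "socket"

def generate_name
  # Host + timestamp alone can collide when several Logstash instances
  # write to the same bucket; a UUID segment makes collisions negligible.
  "ls.s3.#{Socket.gethostname}.#{SecureRandom.uuid}.#{current_time}"
end
```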

@s4nch3z s4nch3z left a comment

Great work! I've been waiting for something like this for at least a year!
I tried to run this branch on a test dataset and there are a couple of things:

  1. encoding => "gzip" is not working
  2. time_file says it is in minutes, but it is implemented in seconds

  end

  def rotate?(file)
    file.size > 0 && Time.now - file.ctime >= time_file

The docs say time_file is in MINUTES, but that's not what happens here.
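One possible shape of the fix being asked for, converting the documented minutes into the seconds that the Time arithmetic produces (the default would then be 15 rather than 15 * 60):

```ruby
def rotate?(file)
  # Time.now - file.ctime is in seconds; time_file is documented in
  # minutes, so scale it before comparing.
  file.size > 0 && Time.now - file.ctime >= time_file * 60
end
```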

nevins-b pushed a commit to nevins-b/logstash-output-s3 that referenced this pull request Nov 8, 2016
@nevins-b

nevins-b commented Nov 8, 2016

There hasn't been any activity on this pull request in a while, so I made a fork and modified the code to address most of the issues raised in this review.

I haven't tested this code in production yet, but all the tests pass.

@ph
Contributor Author

ph commented Nov 9, 2016

@nevins Thanks! Can you make a PR with your changes? I will help move it forward :)

@nevins-b

nevins-b commented Nov 9, 2016

@ph thanks! opened #105

@ph
Contributor Author

ph commented Nov 14, 2016

I seem to be running into a weird JRuby thread/autoload issue when running the tests. To be clear, the bug was there before, but the new logger infrastructure exposes it.

jruby/jruby#3920 and elastic/logstash#6201

@ph ph unassigned suyograo Nov 14, 2016
  # if the file is close we will use the File::size
  begin
    @fd.size
  rescue
Contributor

Is there a better way to do this rescue? Is there a specific error we can catch instead?
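One more targeted shape this could take, assuming the failure mode is a closed descriptor (Ruby raises IOError for operations on a closed stream, and the comment above describes falling back to File::size):

```ruby
def size
  # While the descriptor is open, ask it directly.
  @fd.size
rescue IOError
  # The stream is closed; stat the file on disk instead.
  ::File.size(path)
end
```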

@andrewvc
Contributor

andrewvc commented Dec 8, 2016

Left a few more comments; it's almost good to go!

@ph
Contributor Author

ph commented Dec 8, 2016

@andrewvc I have addressed your latest comments.

@ph ph closed this Dec 12, 2016
@ph ph reopened this Dec 12, 2016
@ph
Contributor Author

ph commented Dec 12, 2016

Rekicking the Travis job.

@andrewvc
Contributor

@ph approved! Great work :)

ph and others added 8 commits December 14, 2016 20:43
@ph ph force-pushed the feature/cleaner-code branch from 176ad7c to 458de28 Compare December 15, 2016 01:46
@elasticsearch-bot

Pier-Hugues Pellerin merged this into the following branches!

Branch: master
Commits: bcdb0df, 586210b, dc7b3fd, 6799d03, d8c1d98, 9b16097, 350ba7f, dbc474b

elasticsearch-bot pushed commits that referenced this pull request Dec 15, 2016
@ph ph deleted the feature/cleaner-code branch December 15, 2016 14:32

Successfully merging this pull request may close these issues.

Dynamic Prefix: Support configuring uploaded file paths/names within a bucket