-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Significantly update MMDB geoip enrichment data returns #19995
Comments
Thanks for the detailed request @johnhtodd ! I'm not terribly familiar with the mmdb format, but my quick investigation of the format seems to validate what you've written: records can have arbitrary fields. I think the difficulty would be that it appears that the names of the fields aren't embedded the database structure itself, from what I can see, so I think this would need to be provided to Vector. All that the database has to indicate the structure of each record is the the The mmdb reader we use has the geoip database record structures "hardcoded" for each GeoIP database type (city, ISP, etc.): https://github.com/oschwald/maxminddb-rust/blob/9045bfe1ac1fa3e7a29258692a03ea6ab2e01069/src/maxminddb/geoip2.rs#L5-L90. That's how we can pull out the fields with names like
The benchmark mentioned in that issue was just demonstrating that executing lookups in parallel with rayon is faster than serial lookups. This actually should already be happening by virtue of Vector executing |
I saw that comment from earlier in a prior MMDB thread about field names not being included in MMDB files - I don't think that is correct. The MMDB files I'm using here certainly do have the field names embedded in them, so I don't know where that impression came from. The parsers we wrote (as well as mmdbctl) create the structure of the file from the embedded field names for output. Example proofs (these are field names of an example GeoIP2-City database):
...and for a custom MMDB that we built with mrt2mmdb, using an unmodified version of mmdbctl to export the data - first line is the field names, and there is no magic location where mrt2mmdb is getting this field name data, since we arbitrarily created the file with field names "path" and "prefix" which don't appear in any MMDB libraries - those names are being pulled from the MMDB file (easier to show this than the code that we used to generate the file, which I linked above):
On the rayon topic: OK, great - so rayon is not relevant since Vector is already parallelizing these events. I didn't have context on that. Thanks! In short: I think this can be done entirely self-contained with just the MMDB file as the single input, which could then define a full object of results including all key names. |
This impression comes from the spec doc they have: https://maxmind.github.io/MaxMind-DB/#:~:text=a%20double%20instead.-,Data%20Field%20Format,-Each%20field%20starts Maybe the geoip databases are all storing a "map" as the record though which would allow for key/values. For the custom databases is that what you are doing? Using a "map" as the record? |
I'm not sure how that works (can't really dig into the code right now; I didn't write our parser) but I know that mmdbctl and our tools can both suss out the field names from the MMDB files without any difficulty or sidecar information, from both MMDB files that we create with custom names, as well as the files provided by MaxMind. I'm hopeful that this is enough to allow single-file consumption/mapping of names without a sidecar map file or schema definition built into the config in Vector, but we'll figure that out as we build the patch. |
This adds support for custom MMDB types. It will just return whatever is stored in the database, without further modifications. Test data was generated using the official go example: https://github.com/maxmind/mmdbwriter/blob/main/examples/asn-writer/main.go Fixes: vectordotdev#19995
A comment on configuration: I am very suspect of code that uses the hardcoded filename provided as the method to determine what kind of file is being accessed. Perhaps making those assumptions based on a dot-separated three letter suffix, but basing parsing logic on the starting part of the name seems very fragile and highly mis-interpretable. Currently, the patch looks at the filename and says "If the file doesn't contain the starting characters of "GeoLite2-ASN", "GeoIP2-ISP", "GeoIP2-Connection-Type", or "GeoIP2-City" then it must be a special custom file and we'll treat it differently. Even though this new special method should be able to ingest any of those other types equally well, or be more complete than the current method. Could we create a new "type" indicator that doesn't care about mmdb filenames? Perhaps saying "type: custom-mmdb" would allow us to use this new parsing method on any file, no matter what it has for a name, and would make documentation more easily explicable as well. You asked about the intent of the original request. I haven't (yet) tested the patch and I'm somewhere that I can write comments but not run code, so I'll spend some time doing that instead to explain the intent a bit more by way of examples, and you can see if what has been written matches the hoped-for outcome. Based on your example file, which I'll simplify here:
...I would expect this configuration:
...to result exactly in this data:
Here is a snippet of one of our custom MMDB files that has BGP data in it, along with some path data that we customized:
So using this MMDB file in the "CustomStuff" enrichment, I'd get this when Iooked up 1.0.0.1:
and another, more complex example from Maxmind with the same lookup, if I specified the GeoIP2-City.mmdb file as a custom-mmdb input (note that this is a slightly different result than what happens today with the filename-specific ingestion model):
I would expect a lookup on 1.0.0.1 in that database to produce results that look like this (non-exhaustive example - the actual object would contain ALL results from the lookup but I only show a few for brevity):
|
I don't think the patch looks at the file name. What it does is read the I do see your point about reading the record as a map being more general than specialized code for each database though. However, I don't think we could replace the current per-database-type handling with it as it is not simply reading a map, but it does some additional processing. For example, finding the last subdivision and renaming that to "region": We could, however, let the user configure if they just want to read the structure as-is though and not do any mapping, even for known database types. A couple of ways we could do this:
I'm partial to the new enrichment table type since it keeps things nicely separated, but I worry about users being confused about whether to use |
Correct. It used to check that before as well, I have just changed it to treat unknown types as custom, instead of city.
I think first option makes more sense. If we go for that, maybe it would make sense to make it an error if any unsupported type is used for |
I'm all for the first option. It would permit non-breaking use of the existing model, but would also allow new custom mmdb files to be read, and would also thirdly allow standard Maxmind files to be read and interpreted in the same way as custom files if that was the intent of the administrator. |
I'm 👍 on pursuing the first option to introduce a new enrichment table type:
I tend to agree. |
* feat(enrichment_tables): add support for custom MMDB types This adds support for custom MMDB types. It will just return whatever is stored in the database, without further modifications. Test data was generated using the official go example: https://github.com/maxmind/mmdbwriter/blob/main/examples/asn-writer/main.go Fixes: #19995 * Add changelog entry * Change `hostname` in tests to an actual word to avoid spellcheck * Update `enrichment_tables` docs * Update docs Co-authored-by: Ursula Chen <[email protected]> * Add separate `mmdb` enrichment table type * Update docs * Remove todos * Update comment on geoip `DatabaseKind` * Update changelog entry * Fix mmdb docs * Add benches for mmdb enrichment_tables --------- Co-authored-by: Ursula Chen <[email protected]>
…dev#20054) * feat(enrichment_tables): add support for custom MMDB types This adds support for custom MMDB types. It will just return whatever is stored in the database, without further modifications. Test data was generated using the official go example: https://github.com/maxmind/mmdbwriter/blob/main/examples/asn-writer/main.go Fixes: vectordotdev#19995 * Add changelog entry * Change `hostname` in tests to an actual word to avoid spellcheck * Update `enrichment_tables` docs * Update docs Co-authored-by: Ursula Chen <[email protected]> * Add separate `mmdb` enrichment table type * Update docs * Remove todos * Update comment on geoip `DatabaseKind` * Update changelog entry * Fix mmdb docs * Add benches for mmdb enrichment_tables --------- Co-authored-by: Ursula Chen <[email protected]>
A note for the community
Use Cases
The current methodology around using GeoIP lookups with MMDB files seems to be quite limited in capability, and does not allow for custom files which contain field names that are not hard-coded in Vector source code. We perform several MMDB lookups on our pipeline (three in some cases) and having this decreased to just one lookup event would be ideal. To do that, we would need to "merge" the two MMDB files with each other, and then Vector would need to be flexible enough to return an arbitrary set of fields with the single lookup. Also, we find that the existing MMDB lookups do not return a full set of data when that would be sometimes useful.
The existing MMDB lookup method is quite inflexible, and forces specific choices of what fields are returned, and even looks at file names to make assumptions about what kind of fields should be sent back in the response. This does not allow us to extend existing MMDB tables with more information without really ugly overloading of field names, and even that is not workable past a certain point.
Attempted Solutions
Tried overloading field names; this is awful and has a limited ramp as more complexity encroaches. Also, there is no other way to get other fields without code changes.
Proposal
We would propose a much more generic method of doing MMDB lookups and data returns, which uses the format of the MMDB file to define the data structure handed back. This seems like it would be less work than the existing method, so I'm not sure what was the intent of the original effort that resulted in the way it works today. There are field names implicit within the MMDB file, and this could simply be exported as a Vector "default" object set into memory. This would be much more flexible by allowing arbitrary database fields to be named, and then the management and parsing of those fields would be done within Vector as all fields are currently managed.
Here would be an example future configuration idea:
So a lookup for (as an example) 128.151.224.17 would result in an object from the MMDB lookup in the GeoIP2-City.mmdb file inserted into the .GeoData object that resembles this:
(Note: a developer of ours wrote some python code to produce this output directly from MMDB files - https://github.com/sbng/mrt2mmdb/ - use the "lookup.py" script like this: "python3 lookup.py --mmdb /etc/vector/maxmind/GeoIP2-City_20231117/GeoIP2-City.mmdb --ipaddress 128.151.224.17") - you may need to alter single quotes to double quotes. You an also use mmdbctl (https://github.com/ipinfo/mmdbctl) for queries.)
This would be a breaking change, and is different enough to the old method that it would seem to warrant a new enrichment method entirely. I proposed "geoip2" but perhaps "mmdb" needs to start appearing in the name to be less ambiguous.
References
Additionally: there was discussion of the use of "rayon" to increase performance in the original issue (#847) - was that used? It was more than 3x improvement in speed. Our reason for looking at custom MMDB files is to summarize multiple lookups into one, or at most two lookups for the sake of speed, and anything else to make MMDB lookups faster would be appreciated.
Historical:
Original PR: #1015
Further work: #1372
Version
vector 0.37.0 (x86_64-unknown-linux-gnu e2d8ad4 2024-03-02 04:01:47.120115067)
The text was updated successfully, but these errors were encountered: