-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue #5: Support for DATE/TIME extraction. #9
base: master
Are you sure you want to change the base?
Conversation
This commit adds some support for parsing fields with the DATE, TIME, or DATETIME formats from .sas7bdat files. This is a big change, involving a couple of different elements: * new dependencies, including to Joda Time (for date/time formatting) and a log4j-slfj4 dependency for logging. * Adding two new types (DATE, TIME) to SasColumnType, * carrying through those changes to the parsing in the SasReader class, including use of date time conversion methods, * which are in the new DateTimeConverter class, that uses the Joda Time library. A couple of notes: first, there is some bounds-checking for date/time values in SasReader, which are meant to replicate some of the observations on date/time value parsing from the Python sas7bdat library. In particular, we replicate the bounds on values from the Python datetime library, so that the results _should_ be replicable across libraries. See the comments in SasReader and DateTimeConverter. Also, we note that some "excessively large" values for date9 are actually 'datetime' (i.e. seconds, not days, from Jan 1 1960), an observation taken from reading the Python code itself. I've actually aimed, throughout the code in places where the bounds or behavior were underdocumented, to match as much as possible the behavior of the Python sas7bdat module (https://pypi.python.org/pypi/sas7bdat). Probably we should recognize an additional, explicit DATETIME format type in the future. NULL values (represented as NaN in NUMERIC format) are now returned as straight Java nulls, from SasReader, for Date/Time values. Furthermore, the actual values returned are DateTime and Period values (from the Joda Time library), so downstream libraries will need to recognize those values (by RTTI?) and provide their own date/time formatting. Right now, testing only covers the DATE formatting, using the example DATE-containing data file provided in the discussion/gist to Issue datacleaner#5. However, at the moment, I don't have an example of TIME values. I've tested these against a privately-available file, but I don't have a public test that I can share so this feature should still be considered UNTESTED.
@kaspersorensen this is a big change -- and I'm still going to add a commit to it, with unit tests based on the new files that Gagravarr had passed along. But it is still ready for review, in particular whether my approach (extending the SysColumnType, and use of Joda Time objects) is appropriate and reasonable. So it's not ready for merging now, but hopefully soon with some review. Thanks! |
It's failing because of verbose logging. let me turn that off. |
|
||
<dependency> | ||
<groupId>org.slf4j</groupId> | ||
<artifactId>slf4j-log4j12</artifactId> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should not ship with any specific logging framework except for the facade provided by slf4j. It would be fine to include this as a test-scoped dependency if needed, but I would like to prevent that we make logging framework decisions for the end users of the library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll scope it as 'test', thanks.
Could we avoid returning Joda time values? I don't mind using Joda time internally, but would prefer to return java.util.Date values. I know they suck on many behalfs and I cannot wait to start using java 8's new date API. I just feel that returning Joda time objects to the user is leaking our dependency onto the users of the library. |
Yeah, I'm up for it -- but how do you want to represent TIME values (which are, as I understand it, periods measured in seconds from the start of the day)? I'm genuinely asking, does the Java standard library have a class or representation of these already? |
Actually, the more I think about it, I think we should use java.sql.Time and java.sql.Date... They are both subclasses of java.util.Date so casting will work, but at the same time they indicate to be "just" a Date or "just" a Time field. |
@@ -0,0 +1,18 @@ | |||
#!/usr/bin/env python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this file a leftover of testing compatibility with the original Python project? It doesn't seem to be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was used to generate the .tsv for date_time testing, and it's been useful for generating other test output comparison files...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, better keep it for later, then :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, although maybe moving it to a different location wouldn't be a bad idea.
@kaspersorensen I will convert over to using the java.util classes -- as soon as I get some time from the work responsibilities that are consuming me at the moment. (I apologize, but if this is holding up a release, you shouldn't block on me for the next week or two.) |
formatOffset = IO.readShort(rawData, 36) + 4; | ||
formatLen = IO.readShort(rawData, 38); | ||
|
||
if(formatOffset > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably ought to be if(formatLen > 0) {
.
This isn't completely valid. I've found that that sometimes the format location is kept in the colName header (offset at byte 14, length at byte 16). My suspicion is that if there's no label for the column then it makes the column name subheader do extra work. I'll take this up with Matt to see if he's ran into this before. |
Hi guys, I was just stumbling over this PR again ... I'm worried if we're forgetting it because it went a bit silent. Anything I can do to help? What's the status? |
He mentioned that he was pretty busy, and didn't have time to check. I've looked quite a bit, but nothing ever came of it. Just seems some files are using something other than a string identifier in an unknown location, which seems odd to me. |
This commit adds some support for parsing fields with the DATE, TIME, or
DATETIME formats from .sas7bdat files.
This is a big change, involving a couple of different elements:
and a log4j-slfj4 dependency for logging.
class, including use of date time conversion methods,
Joda Time library.
A couple of notes: first, there is some bounds-checking for date/time
values in SasReader, which are meant to replicate some of the
observations on date/time value parsing from the Python sas7bdat
library. In particular, we replicate the bounds on values from the
Python datetime library, so that the results should be replicable
across libraries. See the comments in SasReader and DateTimeConverter.
Also, we note that some "excessively large" values for date9 are
actually 'datetime' (i.e. seconds, not days, from Jan 1 1960), an
observation taken from reading the Python code itself. I've actually
aimed, throughout the code in places where the bounds or behavior were
underdocumented, to match as much as possible the behavior of the Python
sas7bdat module (https://pypi.python.org/pypi/sas7bdat).
Probably we should recognize an additional, explicit DATETIME format
type in the future.
NULL values (represented as NaN in NUMERIC format) are now returned as
straight Java nulls, from SasReader, for Date/Time values.
Furthermore, the actual values returned are DateTime and Period values
(from the Joda Time library), so downstream libraries will need to
recognize those values (by RTTI?) and provide their own date/time
formatting.
Right now, testing only covers the DATE formatting, using the example
DATE-containing data file provided in the discussion/gist to Issue #5.
However, at the moment, I don't have an example of TIME values. I've
tested these against a privately-available file, but I don't have a
public test that I can share so this feature should still be considered
UNTESTED.