Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue #5: Support for DATE/TIME extraction. #9

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

tdanford
Copy link
Contributor

This commit adds some support for parsing fields with the DATE, TIME, or
DATETIME formats from .sas7bdat files.

This is a big change, involving a couple of different elements:

  • new dependencies, including to Joda Time (for date/time formatting)
    and a log4j-slfj4 dependency for logging.
  • Adding two new types (DATE, TIME) to SasColumnType,
  • carrying through those changes to the parsing in the SasReader
    class, including use of date time conversion methods,
  • which are in the new DateTimeConverter class, that uses the
    Joda Time library.

A couple of notes: first, there is some bounds-checking for date/time
values in SasReader, which are meant to replicate some of the
observations on date/time value parsing from the Python sas7bdat
library. In particular, we replicate the bounds on values from the
Python datetime library, so that the results should be replicable
across libraries. See the comments in SasReader and DateTimeConverter.

Also, we note that some "excessively large" values for date9 are
actually 'datetime' (i.e. seconds, not days, from Jan 1 1960), an
observation taken from reading the Python code itself. I've actually
aimed, throughout the code in places where the bounds or behavior were
underdocumented, to match as much as possible the behavior of the Python
sas7bdat module (https://pypi.python.org/pypi/sas7bdat).

Probably we should recognize an additional, explicit DATETIME format
type in the future.

NULL values (represented as NaN in NUMERIC format) are now returned as
straight Java nulls, from SasReader, for Date/Time values.

Furthermore, the actual values returned are DateTime and Period values
(from the Joda Time library), so downstream libraries will need to
recognize those values (by RTTI?) and provide their own date/time
formatting.

Right now, testing only covers the DATE formatting, using the example
DATE-containing data file provided in the discussion/gist to Issue #5.
However, at the moment, I don't have an example of TIME values. I've
tested these against a privately-available file, but I don't have a
public test that I can share so this feature should still be considered
UNTESTED.

This commit adds some support for parsing fields with the DATE, TIME, or
DATETIME formats from .sas7bdat files.

This is a big change, involving a couple of different elements:
  * new dependencies, including to Joda Time (for date/time formatting)
    and a log4j-slfj4 dependency for logging.
  * Adding two new types (DATE, TIME) to SasColumnType,
  * carrying through those changes to the parsing in the SasReader
    class, including use of date time conversion methods,
  * which are in the new DateTimeConverter class, that uses the
    Joda Time library.

A couple of notes: first, there is some bounds-checking for date/time
values in SasReader, which are meant to replicate some of the
observations on date/time value parsing from the Python sas7bdat
library.  In particular, we replicate the bounds on values from the
Python datetime library, so that the results _should_ be replicable
across libraries. See the comments in SasReader and DateTimeConverter.

Also, we note that some "excessively large" values for date9 are
actually 'datetime' (i.e. seconds, not days, from Jan 1 1960), an
observation taken from reading the Python code itself.  I've actually
aimed, throughout the code in places where the bounds or behavior were
underdocumented, to match as much as possible the behavior of the Python
sas7bdat module (https://pypi.python.org/pypi/sas7bdat).

Probably we should recognize an additional, explicit DATETIME format
type in the future.

NULL values (represented as NaN in NUMERIC format) are now returned as
straight Java nulls, from SasReader, for Date/Time values.

Furthermore, the actual values returned are DateTime and Period values
(from the Joda Time library), so downstream libraries will need to
recognize those values (by RTTI?) and provide their own date/time
formatting.

Right now, testing only covers the DATE formatting, using the example
DATE-containing data file provided in the discussion/gist to Issue datacleaner#5.
However, at the moment, I don't have an example of TIME values.  I've
tested these against a privately-available file, but I don't have a
public test that I can share so this feature should still be considered
UNTESTED.
@tdanford
Copy link
Contributor Author

@kaspersorensen this is a big change -- and I'm still going to add a commit to it, with unit tests based on the new files that Gagravarr had passed along. But it is still ready for review, in particular whether my approach (extending the SysColumnType, and use of Joda Time objects) is appropriate and reasonable.

So it's not ready for merging now, but hopefully soon with some review.

Thanks!

@tdanford
Copy link
Contributor Author

It's failing because of verbose logging. let me turn that off.


<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not ship with any specific logging framework except for the facade provided by slf4j. It would be fine to include this as a test-scoped dependency if needed, but I would like to prevent that we make logging framework decisions for the end users of the library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll scope it as 'test', thanks.

@kaspersorensen
Copy link
Member

Could we avoid returning Joda time values? I don't mind using Joda time internally, but would prefer to return java.util.Date values. I know they suck on many behalfs and I cannot wait to start using java 8's new date API. I just feel that returning Joda time objects to the user is leaking our dependency onto the users of the library.

@tdanford
Copy link
Contributor Author

Yeah, I'm up for it -- but how do you want to represent TIME values (which are, as I understand it, periods measured in seconds from the start of the day)? I'm genuinely asking, does the Java standard library have a class or representation of these already?

@kaspersorensen
Copy link
Member

Actually, the more I think about it, I think we should use java.sql.Time and java.sql.Date... They are both subclasses of java.util.Date so casting will work, but at the same time they indicate to be "just" a Date or "just" a Time field.

@@ -0,0 +1,18 @@
#!/usr/bin/env python
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this file a leftover of testing compatibility with the original Python project? It doesn't seem to be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was used to generate the .tsv for date_time testing, and it's been useful for generating other test output comparison files...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, better keep it for later, then :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, although maybe moving it to a different location wouldn't be a bad idea.

@tdanford
Copy link
Contributor Author

@kaspersorensen I will convert over to using the java.util classes -- as soon as I get some time from the work responsibilities that are consuming me at the moment.

(I apologize, but if this is holding up a release, you shouldn't block on me for the next week or two.)

formatOffset = IO.readShort(rawData, 36) + 4;
formatLen = IO.readShort(rawData, 38);

if(formatOffset > 0) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably ought to be if(formatLen > 0) {.

@ghost
Copy link

ghost commented Nov 9, 2015

This isn't completely valid. I've found that that sometimes the format location is kept in the colName header (offset at byte 14, length at byte 16). My suspicion is that if there's no label for the column then it makes the column name subheader do extra work.

I'll take this up with Matt to see if he's ran into this before.

@kaspersorensen
Copy link
Member

Hi guys,

I was just stumbling over this PR again ... I'm worried if we're forgetting it because it went a bit silent. Anything I can do to help? What's the status?

@ghost
Copy link

ghost commented Feb 12, 2016

He mentioned that he was pretty busy, and didn't have time to check. I've looked quite a bit, but nothing ever came of it.

Just seems some files are using something other than a string identifier in an unknown location, which seems odd to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants