Issue #5: Support for DATE/TIME extraction. #9

tdanford · 2015-10-14T17:05:39Z

This commit adds some support for parsing fields with the DATE, TIME, or
DATETIME formats from .sas7bdat files.

This is a big change, involving a couple of different elements:

* New dependencies: Joda Time (for date/time formatting) and
  slf4j-log4j12 (for logging).
* Two new types (DATE, TIME) added to SasColumnType.
* Those changes carried through to the parsing in the SasReader
  class, including the use of date/time conversion methods.
* The conversion methods live in the new DateTimeConverter class,
  which uses the Joda Time library.

A couple of notes: first, there is some bounds-checking for date/time
values in SasReader, which are meant to replicate some of the
observations on date/time value parsing from the Python sas7bdat
library. In particular, we replicate the bounds on values from the
Python datetime library, so that the results should be replicable
across libraries. See the comments in SasReader and DateTimeConverter.

Also, we note that some "excessively large" values for date9 are
actually 'datetime' (i.e. seconds, not days, from Jan 1 1960), an
observation taken from reading the Python code itself. I've actually
aimed, throughout the code in places where the bounds or behavior were
underdocumented, to match as much as possible the behavior of the Python
sas7bdat module (https://pypi.python.org/pypi/sas7bdat).
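
For illustration, the day/second fallback described above could look roughly like the sketch below. This is not the code in SasReader or DateTimeConverter: the class and method names are hypothetical, it uses java.time from the standard library instead of Joda Time, and the bounds (years 1 through 9999, the range of Python's datetime) are an assumption about what "excessively large" means.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch, not the PR's DateTimeConverter.
final class SasEpochSketch {
    // SAS counts dates in days, and datetimes in seconds, from 1960-01-01.
    static final LocalDate SAS_EPOCH = LocalDate.of(1960, 1, 1);

    // Assumed bounds: the range of Python's datetime type (years 1 to 9999).
    static final long MIN_DAYS = ChronoUnit.DAYS.between(SAS_EPOCH, LocalDate.of(1, 1, 1));
    static final long MAX_DAYS = ChronoUnit.DAYS.between(SAS_EPOCH, LocalDate.of(9999, 12, 31));

    // Returns null for a missing value (NaN), a LocalDate when the number
    // fits the day range, and otherwise a LocalDateTime that reads the
    // same number as seconds since the SAS epoch.
    static Object convertDate(double raw) {
        if (Double.isNaN(raw)) {
            return null;
        }
        long whole = (long) raw; // fractional part is dropped in this sketch
        if (whole >= MIN_DAYS && whole <= MAX_DAYS) {
            return SAS_EPOCH.plusDays(whole);
        }
        // Too large (or too small) to be a day count: treat it as seconds.
        // Infinite or extreme inputs are not handled here and would throw.
        return SAS_EPOCH.atStartOfDay().plusSeconds(whole);
    }
}
```

One consequence of a heuristic like this is that a genuine datetime within the first few weeks of 1960 is small enough to be read as a day count, which is one reason an explicit DATETIME type would be cleaner.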

Probably we should recognize an additional, explicit DATETIME format
type in the future.

NULL values (represented as NaN in NUMERIC format) are now returned as
straight Java nulls, from SasReader, for Date/Time values.

Furthermore, the actual values returned are DateTime and Period values
(from the Joda Time library), so downstream libraries will need to
recognize those values (by RTTI?) and provide their own date/time
formatting.
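
As a rough idea of what the runtime type check might look like downstream, here is a minimal sketch. It assumes java.time's LocalDateTime and Duration as stand-ins for the Joda DateTime and Period values (so that it runs without extra dependencies), and the class name and output formats are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical downstream formatter; the stand-in types are an assumption.
final class CellFormatterSketch {
    static String format(Object cell) {
        if (cell == null) {
            return ""; // missing value (NaN in the file) arrives as null
        }
        if (cell instanceof LocalDateTime) {
            // DATE-like value: print only the calendar date.
            return ((LocalDateTime) cell).format(DateTimeFormatter.ISO_LOCAL_DATE);
        }
        if (cell instanceof Duration) {
            // TIME-like value: seconds since the start of the day.
            long s = ((Duration) cell).getSeconds();
            return String.format(Locale.ROOT, "%02d:%02d:%02d", s / 3600, (s % 3600) / 60, s % 60);
        }
        // Anything else (numbers, strings) falls back to its default text.
        return String.valueOf(cell);
    }
}
```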

Right now, testing only covers the DATE formatting, using the example
DATE-containing data file provided in the discussion/gist to Issue #5.
However, at the moment, I don't have an example of TIME values. I've
tested these against a privately-available file, but I don't have a
public test that I can share so this feature should still be considered
UNTESTED.


tdanford · 2015-10-14T17:06:36Z

@kaspersorensen this is a big change -- and I'm still going to add a commit to it, with unit tests based on the new files that Gagravarr had passed along. But it is still ready for review, in particular on whether my approach (extending SasColumnType, and using Joda Time objects) is appropriate and reasonable.

So it's not ready for merging now, but hopefully soon with some review.

Thanks!

tdanford · 2015-10-14T17:08:54Z

It's failing because of verbose logging. Let me turn that off.

kaspersorensen · 2015-10-14T17:54:14Z

pom.xml

+
+			<dependency>
+				<groupId>org.slf4j</groupId>
+				<artifactId>slf4j-log4j12</artifactId>


I think we should not ship with any specific logging framework except for the facade provided by slf4j. It would be fine to include this as a test-scoped dependency if needed, but I would like to avoid making logging framework decisions for the end users of the library.

tdanford · 2015-10-14

I'll scope it as 'test', thanks.

kaspersorensen · 2015-10-14T17:58:50Z

Could we avoid returning Joda Time values? I don't mind using Joda Time internally, but would prefer to return java.util.Date values. I know they suck in many respects, and I cannot wait to start using Java 8's new date API. I just feel that returning Joda Time objects to the user leaks our dependency onto the users of the library.

tdanford · 2015-10-14T19:26:42Z

Yeah, I'm up for it -- but how do you want to represent TIME values (which are, as I understand it, periods measured in seconds from the start of the day)? I'm genuinely asking, does the Java standard library have a class or representation of these already?

kaspersorensen · 2015-10-15T09:38:22Z

Actually, the more I think about it, I think we should use java.sql.Time and java.sql.Date... They are both subclasses of java.util.Date so casting will work, but at the same time they indicate to be "just" a Date or "just" a Time field.
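
A minimal sketch of that idea, assuming Java 8 is available for the `valueOf` conversions; the class and method names are hypothetical and this is not code from the PR:

```java
import java.time.LocalDate;
import java.time.LocalTime;

// Hypothetical sketch of returning java.sql types instead of Joda types.
final class SqlTemporalSketch {
    static final LocalDate SAS_EPOCH = LocalDate.of(1960, 1, 1);

    // DATE: days since 1960-01-01, returned as a date-only value.
    static java.sql.Date toSqlDate(long sasDays) {
        return java.sql.Date.valueOf(SAS_EPOCH.plusDays(sasDays));
    }

    // TIME: seconds since the start of the day, returned as a time-only value.
    // LocalTime.ofSecondOfDay rejects values outside 0..86399, so a TIME
    // value of 24 hours or more would throw in this sketch.
    static java.sql.Time toSqlTime(long secondsOfDay) {
        return java.sql.Time.valueOf(LocalTime.ofSecondOfDay(secondsOfDay));
    }
}
```

Both return types can be assigned to a `java.util.Date` variable, so callers that only know about `java.util.Date` keep working.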

LosD · 2015-10-16T07:42:26Z

sas/src/test/resources/python_test.py

@@ -0,0 +1,18 @@
+#!/usr/bin/env python


Is this file a leftover of testing compatibility with the original Python project? It doesn't seem to be used.

tdanford · 2015-10-20

It was used to generate the .tsv for date_time testing, and it's been useful for generating other test output comparison files...

LosD · 2015-10-21

Ah, better keep it for later, then :)

tdanford · 2015-10-22

Yeah, although maybe moving it to a different location wouldn't be a bad idea.

tdanford · 2015-10-22T17:00:33Z

@kaspersorensen I will convert over to using the java.util classes -- as soon as I get some time from the work responsibilities that are consuming me at the moment.

(I apologize, but if this is holding up a release, you shouldn't block on me for the next week or two.)

ghost · 2015-11-09T01:52:33Z

sas/src/main/java/org/eobjects/metamodel/sas/SasReader.java

+							formatOffset = IO.readShort(rawData, 36) + 4;
+							formatLen = IO.readShort(rawData, 38);
+
+							if(formatOffset > 0) {


This probably ought to be if(formatLen > 0) {.
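
To spell out why the length is the safer thing to test: with the `+ 4` adjustment, the offset comes out positive even when both fields are zero, so `formatOffset > 0` never filters out a missing format. A minimal sketch follows, assuming little-endian data and a hypothetical `textBlock` array holding the format string; `readShort` here is a stand-in for the project's `IO.readShort`, not its actual implementation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names and the textBlock layout are assumptions.
final class FormatGuardSketch {
    // Stand-in for IO.readShort, assuming little-endian byte order.
    static int readShort(byte[] data, int offset) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getShort(offset);
    }

    // Returns the format name, or null when the subheader records no format.
    static String readFormat(byte[] rawData, byte[] textBlock) {
        int formatOffset = readShort(rawData, 36) + 4; // at least 4 for non-negative input
        int formatLen = readShort(rawData, 38);
        if (formatLen > 0) { // guard on the length, not the offset
            return new String(textBlock, formatOffset, formatLen, StandardCharsets.US_ASCII);
        }
        return null;
    }
}
```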

ghost · 2015-11-09T03:04:32Z

This isn't completely valid. I've found that sometimes the format location is kept in the colName header (offset at byte 14, length at byte 16). My suspicion is that if there's no label for the column, then the column name subheader does extra work.

I'll take this up with Matt to see if he's run into this before.

kaspersorensen · 2016-02-12T19:27:12Z

Hi guys,

I just stumbled over this PR again ... I'm worried that we're forgetting it because it went a bit silent. Anything I can do to help? What's the status?

ghost · 2016-02-12T19:45:52Z

He mentioned that he was pretty busy, and didn't have time to check. I've looked quite a bit, but nothing ever came of it.

Just seems some files are using something other than a string identifier in an unknown location, which seems odd to me.

Setting the logging back to WARN (instead of DEBUG)

d77e9b7


