Sketch of the library #1

Open · lheagy opened this issue Apr 24, 2017 · 9 comments

lheagy commented Apr 24, 2017

There are 3 main modules we are considering at the moment: Time Series, Spectral and Transfer Functions. We will focus on Natural Source EM data to start. Below I have added notes from the conversation and Google Doc with @kujaku11, @grosenkj, @rowanc1, @sgkang, @dougoldenburg. In addition, it would be helpful to outline a sketch of pseudo-code: what should a workflow look like?

Time Series

  • would like to get rid of spikes, periodic noise and powerline noise in pre-processing (well, maybe: we could instead estimate the response only for periods outside the powerline harmonics); a notch-filter sketch follows this list
    • 60 Hz + harmonics
    • Pipeline noise (cathodic protection): an electric square wave, ~2-3 sec in length, every ~12 seconds
  • fill gaps in time series if something happened, like a cow started to chew on some wires
  • remove extremely noisy sections (though this can be taken care of with clever statistics later)
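
For the powerline item above, here is a minimal sketch of what the notch-filter pre-processing step could look like, assuming the raw channel is a plain NumPy array sampled at sample_rate Hz; the function name and default parameters are illustrative, not a committed API.

import numpy as np
from scipy import signal


def remove_powerline(data, sample_rate, base_freq=60.0, n_harmonics=3, quality=30.0):
    """Suppress base_freq (e.g. 60 Hz) and its harmonics with cascaded IIR notch filters."""
    filtered = np.asarray(data, dtype=float)
    for k in range(1, n_harmonics + 1):
        freq = k * base_freq
        if freq >= sample_rate / 2.0:
            break  # skip harmonics at or above the Nyquist frequency
        b, a = signal.iirnotch(freq, quality, fs=sample_rate)
        filtered = signal.filtfilt(b, a, filtered)  # zero-phase, so features are not shifted in time
    return filtered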

Spectral Analysis

  • What is the most efficient way to estimate the Fourier coefficients? Cascade decimation seems to be the traditional way; is there something better? Wavelet transform? (A windowed-FFT sketch follows this list.)
  • estimate source field parameters à la Egbert (principal component analysis or independent component analysis)
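
As one possible (non-decimating) answer to the first question above, here is a rough sketch of estimating Fourier coefficients per overlapping window with a short-time Fourier transform; the window length, taper and overlap are illustrative choices only.

import numpy as np
from scipy import signal


def windowed_fourier_coefficients(data, sample_rate, window_length=1024):
    """Return frequencies, window times and complex Fourier coefficients per window."""
    freqs, times, coeffs = signal.stft(
        np.asarray(data, dtype=float),
        fs=sample_rate,
        window="hann",                 # taper to reduce spectral leakage
        nperseg=window_length,
        noverlap=window_length // 2,   # 50% overlap between windows
    )
    return freqs, times, coeffs        # coeffs has shape (n_freqs, n_windows)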

Transfer Functions

  • Coherency sorting to remove obvious outliers
  • Robust estimator (M-estimator, median estimator, some other fancy estimator); a plain least-squares baseline is sketched after this list for comparison
  • Bounded influence
  • Remote reference
  • Interstation transfer functions
  • The big one: error estimation
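
For reference, the baseline that the robust estimators above improve on is a plain least-squares fit of E = Z H at each frequency, using the binned Fourier coefficients. A minimal sketch, with illustrative array names (ex, hx, hy are per-window complex coefficients at one frequency):

import numpy as np


def least_squares_impedance(ex, hx, hy):
    """Estimate (Zxx, Zxy) from Ex = Zxx * Hx + Zxy * Hy by complex least squares."""
    H = np.column_stack([hx, hy])                        # (n_windows, 2) design matrix
    z, residuals, rank, sv = np.linalg.lstsq(H, ex, rcond=None)
    return z                                             # [Zxx, Zxy]
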
kujaku11 commented:

A general workflow would look something like this:

  • Time Series
    • Align all time series data at synchronous stations
    • Apply any time series filters (notch filters for power lines)
    • Can remove instrument responses here, or more easily in the frequency domain, especially the magnetometer response.
  • Spectral Estimation
    • pre-whiten data (detrend, remove spectral bias --> usually done by an auto-regression filter)
    • All current processing codes now do a cascade decimation (a rough sketch is included at the end of this comment):
      • Window the data into chunks whose length is proportional to the Nyquist frequency; the windows should overlap, usually by half.
      • Apply a window function (Hanning, Slepian, Hamming) to reduce spectral leakage and the Gibbs phenomenon
      • Calculate the FFT, pick out the first few harmonics, and put those estimates of the Fourier coefficients in a bin for later analysis.
      • Decimate the time series so that the window stays the same size but now covers half the frequency range (e.g. a time series sampled at 100 samples/sec is now sampled at 50 samples/sec, but the window is still the same length, say 1024 points). Repeat the previous 2 steps until the decimated data is shorter than the window. This gives you Fourier coefficients that span the full spectrum. It can be slow.
    • According to Dave Myer this is archaic and there are better ways of doing it, I leave it to him to expound on that.
  • Transfer Function Estimation
    • With all the Fourier coefficients nicely binned by time and frequency, statistical analysis can begin. At each frequency you should have a cloud of Fourier coefficients in the complex plane. If the data are perfect you will have a nice Gaussian distribution and you can easily pick out the center of mass and a variance, usually with a least-squares method. But MT data are never that easy, and you usually end up with a cloud of points that looks more like a Rorschach ink blot, so you need better ways of removing or down-weighting the bad data and up-weighting the good data.
    • To remove obviously bad data, there is usually some sort of coherency threshold, typically between the cross powers of induced channels (Ex and Hy, or Ey and Hx), where it is assumed that if the coherence between the two is high it must be good signal, and if it is below a given threshold it can be thrown out as bad data or down-weighted.
    • Chave computes leverage points and down-weights them to get a better statistical estimate; it is an iterative scheme.
    • Egbert uses an M-estimate, also an iterative scheme.
    • Both use remote referencing of the magnetic channels to remove noise in the local magnetic channels. Egbert goes a step further and uses multiple stations, including electric channels, to compute the number of sources (there should only be 2 in MT) using principal component analysis, and computes an interstation transfer function, which gives you a better idea of data quality.

That's the general idea; there are areas to improve and different things to try out.
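
For concreteness, here is a rough sketch of the cascade-decimation loop described above, assuming a single NumPy channel; the window length, taper, overlap and number of retained harmonics are illustrative, and production codes differ in the details.

import numpy as np
from scipy import signal


def cascade_decimation(data, sample_rate, window_length=1024, n_harmonics=8):
    """Yield (frequencies, per-window Fourier coefficients) for each decimation level."""
    level_data = np.asarray(data, dtype=float)
    level_rate = float(sample_rate)
    taper = signal.windows.hann(window_length)
    step = window_length // 2                                  # 50% overlap
    while len(level_data) >= window_length:
        n_windows = (len(level_data) - window_length) // step + 1
        coeffs = []
        for i in range(n_windows):
            chunk = level_data[i * step:i * step + window_length]
            spectrum = np.fft.rfft(signal.detrend(chunk) * taper)
            coeffs.append(spectrum[1:n_harmonics + 1])         # keep the first few harmonics
        freqs = np.fft.rfftfreq(window_length, d=1.0 / level_rate)[1:n_harmonics + 1]
        yield freqs, np.array(coeffs)
        level_data = signal.decimate(level_data, 2)            # anti-alias filter + downsample by 2
        level_rate /= 2.0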


lheagy commented Apr 30, 2017

Thanks @kujaku11. In terms of workflow: how iterative a process would this typically be? Do you often iterate on a single step, or do a few steps in time series processing and then perhaps go back and remove a few more outliers?


lheagy commented Apr 30, 2017

Here are a couple thoughts based on the ppt and notes above. There are 3 main classes:
TimeSeries --> Spectral --> TransferFunctions. Each is instantiated with a dataset from the previous (e.g. spectral = Spectral(time_series))

Here are a couple thoughts on what a TimeSeries class could look like - (very much a sketch at this point!)

A couple key points about the below:

  • I think we want to always have access to the source data and leave it untouched. We can be smart about this if it is a large data set
  • I think we also want some way of logging the steps that are taken (including saving tuning parameters, etc). This would allow you to set up a processing workflow on a subset of data, write out the steps and then re-use / apply that same workflow on the whole data set. There is likely a better way of doing this than what I have sketched below, but hopefully it gives you a feel for the idea.
class TimeSeries(BaseTimeSeries):  # top level this might be processing.NSEM.TimeSeries

    def __init__(self, hx=None, hy=None, hz=None, ex=None, ey=None, **kwargs):
        super().__init__(**kwargs)
        self._source_data = utils.create_data_frame(hx, hy, hz, ex, ey)  # if large, we can be smart about writing this to disk

    def __getitem__(self, key):
        return self.current_data[key]  # so self['ex'] looks at the current data

    @property
    def source_data(self):
        """
        Original data set. Remains untouched.
        """
        return self._source_data

    @property
    def current_data(self):
        if getattr(self, '_current_data', None) is None:
            self._current_data = self._source_data  # no processing steps taken yet, so this is just the source data
        return self._current_data

    @property
    def processing_log(self):
        """
        This returns an ordered dict (or similar - needs to be serializable)
        of the processing steps taken, including their args, kwargs
        """
        if getattr(self, '_processing_log', None) is None:
            self._processing_log = {}
        return self._processing_log

    def register_processing_step(self, name, kwargs):
        # record the step name and its tuning parameters so the workflow can be re-applied later
        self.processing_log[name] = kwargs

    def remove_pipeline_noise(self, period):
        self._current_data = utils.remove_pipeline_noise(self.current_data, period)
        self.register_processing_step('remove_pipeline_noise', {'period': period})
        return self.current_data

A couple other comments

  • We likely want the ability to work with large data sets (e.g. ones we can't simply load into memory): Dask DataFrames could be a good option for this: http://dask.pydata.org/en/latest/dataframe.html
  • the processing log needs to be serializable, and we should be able to load it and apply it to any appropriate data set (e.g. create it for a subset of the data, save it, load it and apply it on a large data set, or use it as a starting point on a completely different data set); a sketch of this follows below
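
To make the second bullet concrete, here is a hypothetical sketch of saving and replaying the processing log from the TimeSeries sketch above; it assumes each log entry stores the method name and its keyword arguments, as register_processing_step does there.

import json


def save_processing_log(time_series, path):
    """Write the (serializable) processing log to disk."""
    with open(path, "w") as f:
        json.dump(time_series.processing_log, f, indent=2)


def apply_processing_log(time_series, path):
    """Re-run each logged step, in order, on a (possibly different) data set."""
    with open(path) as f:
        log = json.load(f)
    for step_name, kwargs in log.items():
        getattr(time_series, step_name)(**kwargs)  # e.g. time_series.remove_pipeline_noise(period=...)
    return time_series.current_data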


kujaku11 commented May 1, 2017

Good outline of the time series data @lheagy.

Probably should also add a save_data() function so you can reuse filtered data.

The logging is a good idea. What about something like
return_dict = {'step_01': processing step 1, 'step_02': processing step 2}

where each processing step is the code snippet that produced that step? Not sure how you do that, but it would be handy. You could then have a function that writes those steps out to a script file. The return dictionary could also be its own object with methods like write_script_file, get_step, etc.
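
Something like the following could work for the "return dictionary as its own object" idea; everything here (class name, methods, the generated script format) is just an illustrative sketch.

class ProcessingLog:
    """Ordered record of processing steps that can be rendered as a script."""

    def __init__(self):
        self.steps = []                          # ordered list of (name, kwargs)

    def add_step(self, name, **kwargs):
        self.steps.append((name, kwargs))

    def get_step(self, index):
        return self.steps[index]

    def write_script_file(self, path, variable="ts"):
        """Write the logged steps out as a small, re-runnable Python script."""
        lines = ["# auto-generated processing script"]
        for name, kwargs in self.steps:
            args = ", ".join("{}={!r}".format(k, v) for k, v in kwargs.items())
            lines.append("{}.{}({})".format(variable, name, args))
        with open(path, "w") as f:
            f.write("\n".join(lines) + "\n")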

As for iterative processing on the time series, this is usually a one-off, or trial and error. Most of the time, if you do a pretty good job of removing the obvious noise in the time domain, the frequency-domain processing will take care of the rest; that's where most of the hard work is done. Time series processing is usually minimal, unless you have something like pipeline noise where there is an obvious periodic structure.


lheagy commented May 4, 2017

Also, for keeping track of units: https://pint.readthedocs.io/en/0.7.2/

ahartikainen commented:

pint and pandas do not mix well.

If you want to use pandas and pint, then maybe the 'read-in' function should convert everything to SI units.


lheagy commented May 5, 2017

Ah, thanks @ahartikainen, I haven't played around with them together before. In this case, I don't think we will be making heavy use of pint; in a lot of ways, I see it being used more as a property on a class so we can do validation (not necessarily deeply integrating by attaching it to values or embedding it in a data frame). Agreed, though, that getting the data into SI units on construction of a data class is a good approach.
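
For example, one way to keep pint at the boundary (validate and convert to SI on construction, store plain arrays afterwards) could look like the sketch below; the ElectricChannel name and the mV/km default are illustrative only.

import numpy as np
import pint

ureg = pint.UnitRegistry()


class ElectricChannel:
    def __init__(self, values, units="mV/km"):
        # validate / convert with pint once, then keep plain SI magnitudes
        quantity = ureg.Quantity(np.asarray(values, dtype=float), units)
        self.values = quantity.to("V/m").magnitude   # plain NumPy array, SI units
        self.units = "V/m"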


lheagy commented May 5, 2017

Here are a couple of points that came up in the meeting today (see: https://www.youtube.com/watch?v=-X4HbcedBUY). @sgkang and @kujaku11, please correct me where I am off base, and feel free to add!

  • data & processing steps should be independent classes with a well-defined way to interface between them. In this way, reading data in, handling large data sets, plotting, etc. is handled by the data class, and none of that complexity needs to be introduced in the processing steps

  • processing steps are small, lightweight classes, much like mappings in SimPEG (http://docs.simpeg.xyz/content/api_core/api_Maps.html). They can be composed (e.g. remove_pipeline_noise * notch_filter) and serialized. All of the tuning parameters are properties and can be serialized. A minimal sketch follows this list.
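
A minimal sketch of what such composable, serializable processing steps could look like (class names, the composition order and the example parameters are all illustrative, loosely following the SimPEG mapping pattern):

from scipy import signal


class ProcessingStep:
    """A lightweight, serializable filter applied to a single channel."""

    def __call__(self, data):
        raise NotImplementedError

    def __mul__(self, other):
        # a * b applies b first, then a, mirroring function composition
        return ComposedStep(self, other)


class ComposedStep(ProcessingStep):
    def __init__(self, outer, inner):
        self.outer, self.inner = outer, inner

    def __call__(self, data):
        return self.outer(self.inner(data))


class Detrend(ProcessingStep):
    def __call__(self, data):
        return signal.detrend(data)


class NotchFilter(ProcessingStep):
    def __init__(self, frequency, sample_rate, quality=30.0):
        # tuning parameters are plain attributes so the step can be serialized
        self.frequency, self.sample_rate, self.quality = frequency, sample_rate, quality

    def __call__(self, data):
        b, a = signal.iirnotch(self.frequency, self.quality, fs=self.sample_rate)
        return signal.filtfilt(b, a, data)


# composition, as in the bullet above: apply the notch filter, then detrend
pipeline = Detrend() * NotchFilter(frequency=60.0, sample_rate=256.0)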


grosenkj commented May 5, 2017

Thanks, @lheagy, @kujaku11 and @ahartikainen for your comments.

I want to add a couple of things to be considered for the design of the library. I like the direction of making things object-oriented and leveraging tools in Python to deal with the large amount of data that the time series can be. I also agree with the discussion from today's meeting on separating the processing steps into a SimPEG.Maps kind of structure; I think that would be a good way of sharing processing steps between different stations/locations. It also makes it easy to build a custom filter out of multiple predefined filters.

My suggestions on the classes in the module:

  • Build up the code with a single TimeSeries class as the base element that:

    • Contains metadata like:
      • E or H [in SI units] (Connection to a Storage class? Could have different backends)
      • Location (in real world coordinates)
      • Orientation (with a clearly defined convention)
    • Methods to easily access information about the time series (maybe as lazy properties):
      • Sampling rate, start time, duration ...
  • TimeSeriesMaps for the time series filters like powerline (DC/AC), pipe, de-spike, etc filters

  • Then the Spectral classes connect with a TimeSeries object to calculate the Fourier coefficients.

    • Can be easily extended to accommodate the different ways of doing the time to frequency transform
    • Keep track of when and from which TimeSeries object the coefficients were obtained
  • SpectralMaps for spectral filtering

  • TransferFunctions objects calculate the MT data.

    • Should know what kind of transfer to calculate (impedance, tipper, ZTEM, ...)
    • Which Spectral and TimeSeries objects to use for the calculations
      • Could be cool to use the properties module to check the types for the given transfers before calculating the data
    • Different algorithms to evaluate the transfers (Chave, Egbert, Larson, ...)
    • Plotting methods for the transfer over time and "final" data
    • Have some editing/selection tools to down-weight data or unselect parts of the transfer series from the calculation of the data
  • Maybe build some of the weighting and coefficient removal for the transfer calculation as TransferFunctionsMaps?

  • Having a Batch class/extension where a collection of TimeSeries <--> Spectral <--> Transfer objects can be processed in the same way

    • Higher level API
    • This has the flexibility to deal with anything from time series at a single station (Ex, Ey, Hx, Hy) to a survey of multiple stations, and things in between.

A bit of my reasoning: having a single time series as the base element makes the library much more versatile. As I understand the procedure, until the calculation of the transfer functions, most of the calculations/operations are done individually on each component. So, for performance, designing the library this way should make parallelization easier, since a list of base objects can be sent out individually for the number crunching. Designing the filters to be SimPEG.Maps-like would make it relatively simple and transparent to share filter steps between time series recorded at the same location. As well, with most data I have worked with, there are several (some non-continuous) time series of different sampling rates for each component, measured simultaneously (for example, ~1000 samples/sec for 1 min every hour; 100 samples/sec for 10 min every hour; 10 samples/sec for the whole time). Different frequency ranges are then obtained from the different series, so being able to accommodate these kinds of time series structures would be important.

As has been discussed, bookkeeping of the procedure (both what is done and also where the transfers are calculated from with respect to the spectral and time series) is important. Being able to identify where in the time series an outlier comes from would be an effective tool to have.

Using pandas/dask data frames sounds like a great avenue for managing the time, spectral and transfer series making linking between the different objects effective.
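
To summarize the structure proposed above in code, here is a very light sketch of the single-component TimeSeries base element and a Batch container; all names are illustrative, and the storage backend, Spectral and TransferFunctions pieces are omitted.

import numpy as np


class TimeSeries:
    """One component (e.g. ex or hy) recorded at one location and sampling rate."""

    def __init__(self, values, sample_rate, start_time, location, orientation, component):
        self.values = np.asarray(values, dtype=float)   # E or H in SI units
        self.sample_rate = sample_rate                  # samples per second
        self.start_time = start_time
        self.location = location                        # real-world coordinates
        self.orientation = orientation                  # clearly defined convention
        self.component = component

    @property
    def duration(self):
        return len(self.values) / self.sample_rate      # seconds


class Batch:
    """Higher-level container: one station (ex, ey, hx, hy) up to a whole survey."""

    def __init__(self, series):
        self.series = list(series)

    def apply(self, step):
        # components are processed independently, so this loop could be
        # dispatched in parallel (e.g. with dask) without changing the API
        return [step(ts.values) for ts in self.series]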
