Update all scripts and data #1

Open
jlehtoma opened this issue Jul 9, 2017 · 18 comments
@jlehtoma
Member

jlehtoma commented Jul 9, 2017

Things have changed at Kapsi and this repo should be updated accordingly.

@jlehtoma jlehtoma self-assigned this Jul 9, 2017
@antagomir
Member

If only the location has changed then this may not be such a big deal. Otherwise it might be.

@jlehtoma
Member Author

jlehtoma commented Jul 9, 2017

The URLs listed in https://github.com/avoindata/mml/blob/master/rscripts/Kapsi/kapsi2rdata.R are still kosher, so fortunately there is no need for a bigger update. It would be good to have all data in dirs per year; right now there seems to be some redundancy in the repo, e.g. years 2012 and 2016 are kept separately, and then there is e.g. Yleiskartta-1000, which is also found in 2012.

@jlehtoma
Member Author

jlehtoma commented Jul 9, 2017

I see, so e.g. Yleiskartta-1000 is the current version and anything in 2012 is an archive?

@antagomir
Member

antagomir commented Jul 9, 2017

Yes, the 2012 folder was added after someone requested the old versions after I had removed them. But this was years ago. I doubt that anyone really needs the 2012 folder any more, so it could be removed for clarity as well. I am not aware of any apps other than the ropengov/louhos pkgs that use this data resource, so I think even the file/folder structure can be improved/changed if necessary.

@antagomir
Member

Is the consensus now that the data will be fetched directly from github, or shall we find another host or even release a separate data package?

@jlehtoma
Member Author

jlehtoma commented Jul 9, 2017

I would keep the old versions if there's space. In fact, it's a shame we haven't actively collected these. AFAIK, no instance keeps a (public) record of changing datasets, which may actually be very useful in studies.

@antagomir
Member

Right. We can try to collect these from now on. Not sure how often the data is updated. Collecting annually might be sufficient.

@jlehtoma
Member Author

jlehtoma commented Jul 9, 2017

You may already know my take on the hosting issue 😉 I don't know about the consensus. I don't think a separate data package is really necessary, although hosting one using drat would simplify certain things.

@antagomir
Member

A data package would potentially reduce network traffic and speed up execution in some cases, but I'm not sure how essential this would be. GitHub is certainly fine until proven otherwise.

@jlehtoma
Member Author

Since the data doesn't update so often (max annually I guess), a data package would simplify things at least because:

  1. Less need for downloading/caching as user would just install the (data) package once until an update is issued.
  2. Versioning and updating becomes easier.
  3. Documenting the data becomes easier.

However, there would be a bit of a conceptual shift for the packages using mml should it transform from a data store to a data package. Packages depending on it (such as gisfin) would be less "API packages" and more like conventional packages (this is not an issue as such). We would also be packaging somebody else's data. Don't know, need to think about it more.

@antagomir
Member

The MML license should perfectly allow data packaging as far as I see. Even the current solution is not really an "API package" since it is based on our pre-processed and independently distributed RData files rather than the MML service. To achieve conceptual clarity, the package functions should download the data straight from Kapsi/MML and perform preprocessing on the fly. But that is not practical. If we rely on our own preprocessed data anyway then I am not sure if it makes a big difference whether the data is hosted in Github or in a data package. Any pragmatic solution is fine.

@jlehtoma
Member Author

jlehtoma commented Jul 10, 2017

> The MML license should perfectly allow data packaging as far as I see.

Yes, it does. Ideally the data provider deals with packaging the data, but in this case we could do just as well (or better!).

> Even the current solution is not really an "API package" since it is based on our pre-processed and independently distributed RData files rather than the MML service.

Yep, but the data is still loaded on an as-needed basis. It's worth noting that Kapsi is not an MML service either, which makes this even less API-like.

> To achieve conceptual clarity, the package functions should download the data straight from Kapsi/MML and perform preprocessing on the fly. But that is not practical. If we rely on our own preprocessed data anyway then I am not sure if it makes a big difference whether the data is hosted in Github or in a data package. Any pragmatic solution is fine.

I agree that downloading the data every time is not practical. However, as a user I would like the package to do as little filtering as possible (i.e. subsetting data). Value-adding pre-processing (fixing strings, setting types, etc.) is great, as long as it's clear what was done. In this sense a data package might be a very good solution, as it enables good documentation, versioning and provenance (i.e. distributing the code). Currently, it's a bit unclear where the data is coming from (the mml repo; the original references are well handled) and what has been done to it.
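
To make the "clear what was done" point concrete, here is a minimal sketch in base R of pre-processing that only adds value (no rows or columns are dropped) and records its own steps alongside the data. The table, the column names, the `preprocess()` helper and the placeholder source URL are all made up for illustration; none of this is taken from the existing mml scripts.

```r
# Made-up stand-in for an attribute table read from the source data.
raw <- data.frame(
  name = c(" A", "B ", "C"),
  area = c("1.5", "2.0", "3.25"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fixes strings and sets types, but never subsets,
# and attaches a record of the source and of each step to the result.
preprocess <- function(x, source_url) {
  steps <- character(0)

  x$name <- trimws(x$name)
  steps <- c(steps, "trimmed whitespace in 'name'")

  x$area <- as.numeric(x$area)
  steps <- c(steps, "converted 'area' from character to numeric")

  attr(x, "provenance") <- list(source = source_url, steps = steps)
  x
}

# The URL is a placeholder, not a real data source.
clean <- preprocess(raw, source_url = "https://example.org/placeholder")

# Inspect what was done to the data.
attr(clean, "provenance")$steps
```

In a data package the same information would more naturally live in the dataset's help page and in the scripts shipped with the package; the attribute is just the smallest way to show the idea.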

@jlehtoma
Member Author

If we package the data using drat, it will still be hosted on Github.
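
A rough sketch of what installing from such a repository could look like on the user's side, using only base R. It assumes the usual drat convention of serving the repository from GitHub Pages at `https://<account>.github.io/drat`; the `drat_repo_url()` helper and the package name `mmldata` are hypothetical, and whether the repository would live under the `avoindata` account is also an assumption.

```r
# Hypothetical helper: build the URL that drat conventionally uses for a
# repository served from a GitHub account's Pages site.
drat_repo_url <- function(account) {
  sprintf("https://%s.github.io/drat", account)
}

# Add the drat repository next to whatever repositories are already set.
# "avoindata" comes from the repo URL earlier in this thread; hosting the
# drat repo under that account is an assumption.
repos <- c(getOption("repos"), avoindata = drat_repo_url("avoindata"))

# Hypothetical package name; commented out because no such package exists yet.
# install.packages("mmldata", repos = repos)
```

With the drat package installed, `drat::addRepo("avoindata")` would register the same kind of repository for the session, after which a plain `install.packages()` call would find the data package.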

@jlehtoma
Member Author

+1 for data package, in other words.

@antagomir
Member

Yes, an R data package is starting to seem like a good solution. I do not know whether a CRAN data package or a GitHub drat package is better. If there is no added value from hosting at CRAN then drat might be optimal, as updates will be easier.

Or we could ask Kapsi to add our scripts in their pipeline and host the data files/packages (for MML it is not realistic I think). But that would add some overhead and any changes/updates would be heavier to make.

As a side note, Feather would work great for sharing data frames but not in this case (shapefiles) I guess.

@jlehtoma
Member Author

> Yes an R data package starts to seem a good solution. I do not know whether CRAN data package or Github drat package is better. If there is no added value from hosting at CRAN then drat might be optimal as updates will be easier.

Obviously having the package on CRAN wouldn't hurt as long as 1) the size limit (~5 MB) is not an issue, and 2) one is willing to get an angry email from BDR 😄.

> Or we could ask Kapsi to add our scripts in their pipeline and host the data files/packages (for MML it is not realistic I think). But that would add some overhead and any changes/updates would be heavier to make.

Might be a bit of an overkill, yes.

> As a side note, Feather would work great for sharing data frames but not in this case (shapefiles) I guess.

Well, if we switch completely to sf objects (which I think we should), then they are in essence data.frames. If Feather files compress well, this might be a good option. AFAIK, Feather is mostly meant for interoperability between e.g. R and Python and not for long-term storage, so there's that.

@antagomir
Member

R data packages hosted on CRAN can exceed the typical 5 MB size limit if we can justify the need to BDR. Perhaps it would be good to start with GitHub + drat and move to CRAN later if it seems useful.

Not sure if RData or Rds are any better for long-term storage than Feather, except that Feather is still under development and may hence be less stable. OK, perhaps Rds files would be the best here for now (saveRDS / readRDS).
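
For reference, a minimal sketch of the saveRDS / readRDS round trip in base R. The data frame is made up and the file goes to a temporary path; it only illustrates the mechanics and says nothing about how the real files would be laid out.

```r
# Made-up data frame standing in for one pre-processed dataset.
dataset <- data.frame(
  name = c("A", "B"),
  value = c(1.5, 2.5),
  stringsAsFactors = FALSE
)

# Write a single object to an .rds file in a temporary location.
path <- tempfile(fileext = ".rds")
saveRDS(dataset, path)

# readRDS returns the object, so the caller chooses the variable name.
restored <- readRDS(path)

# Check that the round trip preserved the object.
identical(dataset, restored)
```

Compared with RData files read via `load()`, which restore objects under the names they were saved with, an Rds file holds one object and leaves naming to the reader, which tends to make package code easier to follow.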

@jlehtoma
Member Author

jlehtoma commented Jul 10, 2017

+1 to everything.


2 participants