-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding output_format options for InferenceObjects #60
Comments
Hi Seth (@sethaxen), I'll add that. Initially maybe leave the current :dimarray version of DimensionalDate in StanSample and add a new option Have been looking at PosteriorDB the last couple of days, in particular the current read functions, but also what it would take to have say a "Statistical Rethinking" and "Regression and Other Stories" database. How stable is PosteriorDB's API? Are you planning to register it? Mainly asking because I work mostly in Pluto where it is easier to use registered packages. Admittedly, that also means frequent updates. |
Yes, I think it makes sense to leave
PosteriorDB's registration went through this morning, so you should be able to use it now in Pluto. What do you mean about frequent updates?
I wouldn't call it stable right now for a few reasons:
If I had to guess, most changes in the future will be 1-to-1 renaming of functions and of expanding the API around loading model implementations (currently
The great thing is that if you can construct a database of models that follows the posteriordb schema, then not only can PosteriorDB.jl work with it just fine, but so can the corresponding R and Python packages, so you potentially have a larger base of users and contributors. Then you could have a Julia package that just reexports PosteriorDB.jl but handles downloading of that database as an artifact (similar to what PosteriorDB does) and a convenience function for loading it. I'd imagine a number of those models are either in posteriordb already or might be welcome additions to it. In terms of conforming to the posteriordb spec, currently only the R posteriordb package has functionality for adding to the database. I may add API functions for doing this to PosteriorDB.jl as well, but this is currently low priority. |
Absolutely, before merging the inferencedata branch into StanSample master I’ll list you as a reviewer. On the frequent updates, I’ve just found that when using Pluto I merge into master asap. Which is typically more often then I used to. I agree a `read_samples(m::SampleModel, :inferencedata) for a single group is trivial. I wonder if InferenceObjects have a method to add multiple groups? E.g. from your link to the PyStan examples they use
In the eight schools example I would have:
and then could use NamedTupleTools to split this object, e.g.:
but then expect I would need an object update method like:
Definitely would prefer to pass in a single NT and separately specify which form groups and dimensions in groups. Update: Guess |
Another thought I had is about the name of the method. Currently |
In the branch
Using The code is in `./test/test_inferencedata/test_inferencedata.jl. The next step is including warmups and then turn it into a proper function/method. |
Yes, julia> idata = from_namedtuple(stan_nts; posterior_predictive=:y_hat, log_likelihood=:log_lik)
InferenceData with groups:
> posterior
> posterior_predictive
> log_likelihood
julia> keys(idata.posterior)
(:mu, :theta_tilde, :theta, :tau)
julia> keys(idata.posterior_predictive)
(:y_hat,)
julia> keys(idata.log_likelihood)
(:log_lik,) In this case, what you'd really want is to rename idata = from_namedtuple(stan_nts; posterior_predictive=:y_hat=>:y, log_likelihood=:log_lik=>:y)
idata = from_namedtuple(stan_nts; posterior_predictive=(y=:y_hat,), log_likelihood=(y=:log_lik,)) The former is inspired by DataFrames's selection syntax.
Yes, actually, this is what we generally do for converting different types. So each type that is convertible to an inference data usually has a In terms of design, I would like to make
Currently |
Great! I'll take a look in the next few days. If you open a draft PR, I can comment directly on the code.
Currently EDIT: issue opened at arviz-devs/InferenceObjects.jl#28 |
Thanks for your comments. The main parts shown below work. I added a few other options as a comment.
|
Hi Seth, I need to study DimensionalData further. Above ends up with the wrong dimensions in InferenceData and also drops most of log_lik. I'll dig a bit deeper the next couple of days. Simply calling |
@goedman looks great so far! Yeah I see that the dimensions are wrong. This seems to be due to how julia> stan_nts.theta_tilde |> size
(8, 2000, 4)
julia> post.theta_tilde |> size
(1000,) So the function is discarding the chains dimension and the leading dimension(s). In InferenceData, for sample-based groups, a vector is interpreted as a single draw with many chains. This should probably change (we recently changed the default dimension ordering in InferenceObjects to be more Julian), hence why you're getting a single This can absolutely be worked around without DimensionalData; ideally no DimensionalData knowledge is needed to work with InferenceObjects; knowing it just unlocks more functionality. Since you know the draws are indexed starting at 1, you can do this to separate the posterior draws from the warmup: julia> idata2 = let
idata_warmup = idata[draw=1:1000]
idata_postwarmup = idata[draw=1001:2000]
idata_warmup_rename = InferenceData(NamedTuple(Symbol("warmup_$k") => idata_warmup[k] for k in keys(idata_warmup)))
merge(idata_postwarmup, idata_warmup_rename)
end
InferenceData with groups:
> posterior
> posterior_predictive
> log_likelihood
> sample_stats
> warmup_posterior
> warmup_posterior_predictive
> warmup_sample_stats
> warmup_log_likelihood
julia> idata2.posterior
Dataset with dimensions:
Dim{:theta_tilde_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points,
Dim{:draw} Sampled{Int64} 1001:2000 ForwardOrdered Regular Points,
Dim{:chain} Sampled{Int64} Base.OneTo(4) ForwardOrdered Regular Points,
Dim{:theta_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points
and 4 layers:
:theta_tilde Float64 dims: Dim{:theta_tilde_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
:mu Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
:tau Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
:theta Float64 dims: Dim{:theta_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
with metadata Dict{String, Any} with 1 entry:
"created_at" => "2022-11-03T14:46:48.145"
julia> idata2.warmup_posterior
Dataset with dimensions:
Dim{:theta_tilde_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points,
Dim{:draw} Sampled{Int64} 1:1000 ForwardOrdered Regular Points,
Dim{:chain} Sampled{Int64} Base.OneTo(4) ForwardOrdered Regular Points,
Dim{:theta_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points
and 4 layers:
:theta_tilde Float64 dims: Dim{:theta_tilde_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
:mu Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
:tau Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
:theta Float64 dims: Dim{:theta_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
with metadata Dict{String, Any} with 1 entry:
"created_at" => "2022-11-03T14:46:48.145" I would suggest resetting the One more note is that the InferenceData schema does have some rules about what certain sampling statistics should be named. This list includes all of the usual Stan sample statistics, so to completely comply with the spec, these parameters should be renamed. ArviZ.jl already defines this map: https://github.com/arviz-devs/ArviZ.jl/blob/1f642377ec01b9f5ef4d6ebd164604f65edf79de/src/mcmcchains.jl#L9-L17, so you could do: julia> stan_key_map = (
n_leapfrog__=:n_steps,
treedepth__=:tree_depth,
energy__=:energy,
lp__=:lp,
stepsize__=:step_size,
divergent__=:diverging,
accept_stat__=:acceptance_rate,
);
julia> sample_stats_rekey = InferenceObjects.Dataset((; (stan_key_map[k] => idata.sample_stats[k] for k in keys(idata.sample_stats))...));
julia> idata2 = merge(idata, InferenceData(; sample_stats=sample_stats_rekey))
InferenceData with groups:
> posterior
> posterior_predictive
> log_likelihood
> sample_stats
julia> idata2.sample_stats
Dataset with dimensions:
Dim{:draw} Sampled{Int64} Base.OneTo(2000) ForwardOrdered Regular Points,
Dim{:chain} Sampled{Int64} Base.OneTo(4) ForwardOrdered Regular Points
and 7 layers:
:lp Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
:tree_depth Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
:step_size Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
:n_steps Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
:energy Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
:diverging Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
:acceptance_rate Float64 dims: Dim{:draw}, Dim{:chain} (2000×4) A few remaining things you could do would be:
I'd prefer foregoing the specialized Alternatively, the syntax in #60 (comment) would capture this as well as allow for renaming of variables:
Do you have a preference for either of these syntaxes? |
Here's how to do this: julia> using DimensionalData
julia> idata3 = InferenceData(map(NamedTuple(idata2)) do ds
DimensionalData.set(ds; draw=axes(ds, :draw))
end)
InferenceData with groups:
> posterior
> posterior_predictive
> log_likelihood
> sample_stats
> warmup_posterior
> warmup_posterior_predictive
> warmup_sample_stats
> warmup_log_likelihood
julia> idata3.posterior
Dataset with dimensions:
Dim{:theta_tilde_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points,
Dim{:draw} Sampled{Int64} Base.OneTo(1000) ForwardOrdered Regular Points,
Dim{:chain} Sampled{Int64} Base.OneTo(4) ForwardOrdered Regular Points,
Dim{:theta_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points
and 4 layers:
:theta_tilde Float64 dims: Dim{:theta_tilde_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
:mu Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
:tau Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
:theta Float64 dims: Dim{:theta_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
with metadata Dict{String, Any} with 1 entry:
"created_at" => "2022-11-03T19:29:57.427"
julia> idata3.warmup_posterior
Dataset with dimensions:
Dim{:theta_tilde_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points,
Dim{:draw} Sampled{Int64} Base.OneTo(1000) ForwardOrdered Regular Points,
Dim{:chain} Sampled{Int64} Base.OneTo(4) ForwardOrdered Regular Points,
Dim{:theta_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points
and 4 layers:
:theta_tilde Float64 dims: Dim{:theta_tilde_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
:mu Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
:tau Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
:theta Float64 dims: Dim{:theta_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
with metadata Dict{String, Any} with 1 entry:
"created_at" => "2022-11-03T19:29:57.427" |
Oh, also, it might be good to allow users to specify |
@goedman I took some time tonight to rethink the conversion pipeline to With this pipeline, you would implement |
Hi @sethaxen Implemented your first 2 suggestions and that works great! By default, if a model contains the warmup samples I'll use the renaming option to rename y_hat to y and log_lik to y as you suggested earlier and also look into the posterior indices (to start from 1). I can easily add kwargs to |
Hi @sethaxen Just pushed a very first attempt at Are there any rules/recommendations on what should be stored in inferencedata? In addition to warmup values, many models might not have y_hat and/or log_lik generated in generated_quantities. Currently For y_hat and log_lik I think I can use merge as you showed above (and used in the current version of On your question above, I think I prefer:
but for now I've used your setup for remapping keys. |
Great! Is that at #61? If so I'll take a closer look.
That makes sense!. The best argument I can think of for not loading warmup is that for models with many parameters, this might double the memory requirements of loading the draws, since even though the work is done, until loading, the draws are stored on disk. Also, only in rare cases will users need to inspect the warm-up draws for diagnostic purposes.
The InferenceData spec gives some instructions. The non-MCMC groups are There's also |
On my flight back from Amsterdam to Colorado I made several more changes to Inferencedata.jl in StanSample/utils. All in #61 on the InferenceData branch. I'll go though the InferenceData spec today. For now, I have removed PosteriorDB from the StanSample.jl inference data branch. I think it belongs on the level of the Stan.jl package and the StatisticalRethinking and RegressionAndOtherStories projects. And I would love to use it there. |
Thanks for the feedback @sethaxen ! I pushed a couple of updates to the inferencedata branch that use your suggestions. It furthermore simplified handling of the names used for the posterior_predictive and log_likelihood. This way I don't think we need to use remap on the inferencedata object. The new inferencedata function looks like:
|
Hi Seth, Think I'm doing something wrong here. After creating above InferenceData object, I'm trying to add the observed _data:
From the InferenceObjects code I see
or
All of these fail, often with stack overflow error. But haven't figured out how to do it correctly. |
Hi @goedman, it seems you found an InferenceObjects bug. When arviz-devs/InferenceObjects.jl#36 is finished, it should fix this. |
Great. I think I did try replacing Is my line of thought correct? I.e. Edit: I think when I updated it to [4] it complained about different length of draw or chain dimension. I read somewhere that check is not done for observed_data and constant_data. |
It's better to pass the observed data directly, e.g. using
That's right, there are 3 groups that are special-cased (also IIRC |
Hi Seth
_It's better to pass the observed data directly, e.g. using from_namedtuple(; observed_data=nt). The first argument of from_namedtuple corresponds to the posterior, so by passing nt in first, you're informing the function that it should expect chain and draw dimensions on all passed parameters, which is why it raises a dimension error. Passing keys is only safe for parameter derived from the prior or posterior._
Aah, I’d missed that. With a workaround for issue #36:
```
nt = namedtuple(data)
# Until InferenceObjects issue #36 is merged
ntu=(sigma=nt.sigma, J=[4], y=nt.y)
idata = merge(idata, from_namedtuple(; observed_data = ntu))
```
works fine.
I’ll merge the branch inferencedata into master. I’m sure we’ll make more changes over the coming weeks but I would like to have the current version available in Pluto to see how that works out. Earlier a quick test showed DimensionalData displays well.
Rob
Rob J Goedman
***@***.***
|
Hi Seth, I created a slightly modified version on inferencedata (
Currently I'm testing this version in Pluto notebooks. |
c7a425a Allow JSON data as a string (#60) 50fe2ec Update README examples section e380e82 Reword docstrings fb513e3 Reorganize and update documentation (#57) 96805fd Compile methods: split stanc args from make args, add default include path (#58) f62bf46 Fix broken link 676db6b Bump actions/setup-python from 2 to 4 (#55) 34f10dd Remove CmdStan Dependency (#51) 821883f Prefix any C-exposed symbols with `bs_` (#54) 81129b0 Add the option for autodiff Hessian calculations (#52) git-subtree-dir: deps/data/bridgestan git-subtree-split: c7a425aac54120bafa643b722ed24b2a32111782
Hi @goedman, sorry, I missed this comment. I'll take a look at the latest implementation tonight and open a PR with any potential improvements I see. |
Hi Seth ( @sethaxen ) the current Dict based version ( |
Hi Seth (@sethaxen) Sorry for the late reply but your updates and v0.2.6 seem to work great including scalar values in the data. Got distracted by roualdes/bridgestan#62 (comment). I don't think arviz-devs/InferenceObjects.jl#40 (comment) has been merged yet. |
No problem, and great!
It's now been merged. RE whether to use |
Hi Seth (@sethaxen) Works fine, just merged in StanSample.jl v6.13.8, example notebook will be in Stan v9.10.5. |
Great! I'll open a PR in the next few days to support extracting |
InferenceObjects.jl implements the
Dataset
andInferenceData
contains for outputs of Bayesian inference. These are the storage types used by ArviZ.jl (Python ArviZ also implements anInferenceData
using the same spec, which is used by PyMC), and there's discussion about ultimately usingInferenceData
as an official container for sampling results in Turing (see TuringLang/MCMCChains.jl#381). It would be convenient if StanSample supported this as an output format:inferencedata
. From the current output formats, this would be fairly straightforward. e.g.Here's what these outputs look like:
Dataset
is aDimensionalData.AbstractDimStack
, and its variables areDimArray
s.Unlike the other output formats,
InferenceData
can store sampling statistics (like divergences and tree depth), data, predictions, and warmup draws, so this information could also be unpacked fromstan_model
into theInferenceData
object to be more easily used in downstream analyses (here's an example with PyStan and the Python implementation ofInferenceData
: https://python.arviz.org/en/latest/getting_started/CreatingInferenceData.html#from-pystan)There are more options that might be convenient for users (e.g. specifying dimension names and coordinates) that wouldn't fit into the
read_samples
interface, so it would probably still be a good idea to have a method with more options live somewhere, e.g. here, in ArviZ itself, or in a glue package (e.g. StanInferenceObjects.jl).The text was updated successfully, but these errors were encountered: