Merge 'develop' gtools-1.5.1; matasave (gtop, glevelsof), greshape @
Features

- `greshape` supports `@` syntax for wide and long. The string to be
  matched can be changed via `match()`.

- `greshape` supports Stata varlist syntax for long to wide (may not be
  combined with `@` within a stub).

- `greshape` does not support varlist syntax for wide to long, but can
  use `match(regex)` for complex wide to long matches (see examples).

- Closes #57

- `glevelsof, mata[(name)]` saves the levels to mata. The levels are _not_
  stored in `r(levels)` and option `local()` is not allowed. With `silent`,
  the levels are additionally not formatted.

- `glevelsof, mata numfmt()` requires `numfmt` to be a mata print format
  instead of a C print format.

- `gtop, ntop(.)` and `gtop, ntop(-.)` now allow printing all the levels
  from largest to smallest or the converse.

- `gtop, alpha` sorts the top levels in variable order. If `gtop -var, alpha`
  is passed then they are sorted in reverse order.

- `gtop, mata` uses temporary files on disk to read the levels from C
  via mata. Matrices and locals are not used, meaning `r(levels)`,
  `r(toplevels)`, and the results stored via the option `matrix()`,
  ``r(`matrix')``, are no longer available. The user can access each
  of these via the mata object `GtoolsByLevels` (the user can change
  the name of this object via `mata(name)`). The levels are stored raw
  in `GtoolsByLevels.charx` and `GtoolsByLevels.numx`; the levels are
  stored formatted in `GtoolsByLevels.printed`; the frequencies are
  stored in `GtoolsByLevels.toplevels`.

- `r(matalevels)` stores the name of the mata object with the levels and frequencies.

- `gtop` also stores `r(ntop)`, `r(nrows)`, and `r(alpha)` as return scalars,
  for the number of top levels (if `.`, this will be `r(J)`), the number of
  rows in the `toplevels` matrix (it may or may not include a row for "other" and
  a row for "missing"), and whether the top levels are sorted by their values.

- `gtop, mata numfmt()` requires `numfmt` to be a mata print format instead of
  a C print format.
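
As a rough illustration of the `gtop` and `glevelsof` options listed above, the following sketch uses Stata's bundled `auto` dataset. The dataset and the object name `replevels` are assumptions made for this example and do not come from the commit; the options and mata members are the ones named in the list above.

```stata
* Sketch only: assumes gtools 1.5.1 or later is installed.
sysuse auto, clear

* Print all the levels of rep78, largest to smallest, then the converse
gtop rep78, ntop(.)
gtop rep78, ntop(-.)

* Sort the top levels in variable order, then in reverse order
gtop rep78, alpha
gtop -rep78, alpha

* Save levels and frequencies to mata; r(levels) and r(toplevels) are not set
gtop rep78, mata
return list
mata: GtoolsByLevels.numx
mata: GtoolsByLevels.printed
mata: GtoolsByLevels.toplevels

* Save the levels of rep78 to a mata object with a hypothetical name
glevelsof rep78, mata(replevels)
```

After `gtop, mata`, `return list` should show `r(matalevels)` along with `r(ntop)`, `r(nrows)`, and `r(alpha)`.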
mcaceresb committed Mar 24, 2019
2 parents 4cfd6dc + 944ca4c commit 458c95e
Showing 92 changed files with 4,562 additions and 1,984 deletions.
2 changes: 1 addition & 1 deletion .appveyor.yml
@@ -1,4 +1,4 @@
version: "generic-1.4.1-{build}"
version: "generic-1.5.1-{build}"

environment:
matrix:
44 changes: 23 additions & 21 deletions README.md
@@ -10,10 +10,10 @@

Faster Stata for big data. This package uses C plugins and hashes
to provide massive speed improvements to common Stata commands,
including: collapse, reshape, winsor, pctile, xtile, contract, egen,
isid, levelsof, duplicates, and unique/distinct.
including: collapse, reshape, xtile, tabstat, isid, egen, pctile,
winsor, contract, levelsof, duplicates, and unique/distinct.

![Dev Version](https://img.shields.io/badge/stable-v1.4.1-blue.svg?longCache=true&style=flat-square)
![Stable Version](https://img.shields.io/badge/stable-v1.5.1-blue.svg?longCache=true&style=flat-square)
![Supported Platforms](https://img.shields.io/badge/platforms-linux--64%20%7C%20osx--64%20%7C%20win--64-blue.svg?longCache=true&style=flat-square)
[![Travis Build Status](https://img.shields.io/travis/mcaceresb/stata-gtools/master.svg?longCache=true&style=flat-square&label=linux)](https://travis-ci.org/mcaceresb/stata-gtools)
[![Travis Build Status](https://img.shields.io/travis/mcaceresb/stata-gtools/master.svg?longCache=true&style=flat-square&label=osx)](https://travis-ci.org/mcaceresb/stata-gtools)
@@ -59,8 +59,8 @@ __*Gtools commands with a Stata equivalent*__
| gquantiles | xtile | 10 to 30 / 13 to 25 (-) | | `by()`, various (see [usage](https://gtools.readthedocs.io/en/latest/usage/gquantiles)) |
| | pctile | 13 to 38 / 3 to 5 (-) | | Ibid. |
| | \_pctile | 25 to 40 / 3 to 5 | | Ibid. |
| gstats tab | tabstat | 10 to 60 / 5 to 40 (-) | See [remarks](#remarks) | various (see [usage](https://gtools.readthedocs.io/en/latest/usage/gstats_summarize)) |
| gstats sum | sum, detail | 10 to 40 / 5 to 10 | See [remarks](#remarks) | various (see [usage](https://gtools.readthedocs.io/en/latest/usage/gstats_summarize)) |
| gstats tab | tabstat | 10 to 50 / 5 to 30 | See [remarks](#remarks) | various (see [usage](https://gtools.readthedocs.io/en/latest/usage/gstats_summarize)) |
| gstats sum | sum, detail | 10 to 20 / 5 to 10 | See [remarks](#remarks) | various (see [usage](https://gtools.readthedocs.io/en/latest/usage/gstats_summarize)) |

<small>(+) The upper end of the speed improvements are for quantiles
(e.g. median, iqr, p90) and few groups. Weights have not been
@@ -296,8 +296,9 @@ allow weights).

Hence both should be able to replicate all of the functionality of their
Stata counterparts. Last, `gstats tab` allows every statistic allowed
by `tabstat` as well as any statistic allowed by `gcollapse`, and the
syntax for the statistics specified via `statistics()` is also the same.
by `tabstat` as well as any statistic allowed by `gcollapse`; the
syntax for the statistics specified via `statistics()` is the same
as in `tabstat`.

The following are implemented internally in C:

@@ -324,7 +325,7 @@ The following are implemented internally in C:
| min | X | X | X |
| range | X | X | X |
| select | X | X | X |
| rawselect | X | X | X |
| rawselect | X | | X |
| percent | X | X | X |
| first | X | X (+) | X |
| last | X | X (+) | X |
@@ -349,7 +350,7 @@ gegen target = pctile(var), by(varlist) p(#)
```

where # is a "percentile" with arbitrary decimal places (e.g. 2.5 or 97.5).
`gtools` also supports selecting the `#`th smallest or largest non-missing value:
`gtools` also supports selecting the `#`th smallest or largest value:
```stata
gcollapse (select#) target = var [(select-#) target = var ...] , by(varlist)
gegen target = select(var), by(varlist) n(#)
@@ -385,13 +386,13 @@ Differences from `collapse`
- `rawstat` allows selectively applying weights.
- `rawselect` ignores weights for `select` (analogously to `rawsum`).
- Option `wild` allows bulk-rename. E.g. `gcollapse mean_x* = x*, wild`
- `gcollapse (nansum)` and `gcollapse (rawnansum)` outputs a missing
value for sums if all inputs are missing (instead of 0).
- `gcollapse, merge` merges the collapsed data set back into memory. This is
much faster than collapsing a dataset, saving, and merging after. However,
Stata's `merge ..., update` functionality is not implemented, only replace.
(If the targets exist the function will throw an error without `replace`).
- `gcollapse, labelformat` allows specifying the output label using placeholders.
- `gcollapse (nansum)` and `gcollapse (rawnansum)` outputs a missing
value for sums if all inputs are missing (instead of 0).
- `gcollapse, sumcheck` keeps integer types with `sum` if the sum will not overflow.
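
A minimal sketch of three of the `gcollapse` differences listed above (`merge`, `nansum`, and `sumcheck`). The `auto` dataset, the all-missing helper variable `x`, and every target name are assumptions chosen for illustration.

```stata
* Sketch only: assumes gtools is installed.
sysuse auto, clear

* nansum gives missing, rather than 0, when every input in a group is missing
preserve
generate double x = .
gcollapse (sum) sum_x = x (nansum) nansum_x = x, by(foreign)
list
restore

* sumcheck keeps an integer type for the sum when it will not overflow
preserve
gcollapse (sum) total_price = price, by(foreign) sumcheck
describe total_price
restore

* merge adds the group-level result to the data in memory instead of collapsing it
gcollapse (mean) mean_price = price, by(foreign) merge
describe mean_price
```

The `preserve`/`restore` pairs are only there so each call starts from the full dataset.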

Differences from `greshape`
@@ -413,7 +414,7 @@ Differences from `greshape`
with this functionality.
- For that same reason, "advanced" syntax is not supported, including
the subcommands: clear, error, query, i, j, xij, and xi.
- `@` syntax is not (yet) supported but is planned for a future release.
- `@` syntax can be modified via `match()`
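
The `@` placeholder marks where the key levels sit inside a stub, as in Stata's `reshape`. The sketch below is hedged: the toy data and variable names are invented for illustration, and the option names `by()` and `keys()` are taken from recollection of the gtools documentation and are not shown in this diff.

```stata
* Sketch only: toy data, assumes gtools 1.5.1 or later is installed.
clear
input id inc80r inc81r
1 5000 5500
2 2000 2200
end

* Wide to long: the level (80, 81) is matched where @ appears in the stub
greshape long inc@r, by(id) keys(year)
list

* Long to wide: the levels of year are placed back where @ appears
greshape wide inc@r, by(id) keys(year)
list
```

If the variable names themselves contain `@`, `match()` is the option the list above points to for changing the placeholder string.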

Differences from `xtile`, `pctile`, and `_pctile`

@@ -453,28 +454,30 @@ Differences from `tabstat`

- Saving the output is done via `mata` instead of `r()`. No matrices
are saved in `r()` and option `save` is not allowed. However, option
`matasave` saves the output and `by()` info in `GstatsOutput`. See
`mata GstatsOutput.desc()` after `gstats tab, matasave` for details.
`matasave` saves the output and `by()` info in `GstatsOutput` (the object
can be named via `matasave(name)`). See `mata GstatsOutput.desc()` after
`gstats tab, matasave` for details.
- `GstatsOutput` provides helpers for extracting rows, columns, and levels.
- Multiple groups are allowed.
- Options `casewise`, `longstub` are not supported.
- Option `nototal` is on by default; `total` is planned for a future release.
- Option `pooled` pools the source variables into one.

Differences from `summarize, detail`

- The behavior of `summarize` and `summarize, meanonly` can be
recovered via options `nodetail` and `meanonly`. These two
options are mainly for use with `by()`
- Option `matasave` saves output and `by()` info in `GstatsOutput`,
a mata class object. See `mata GstatsOutput.desc()` after
`gstats sum, matasave` for details.
a mata class object (the object can be named via `matasave(name)`).
See `mata GstatsOutput.desc()` after `gstats sum, matasave` for details.
- Option `noprint` saves the results but omits printing output.
- Option `tab` prints statistics in the style of `tabstat`
- Option `pooled` pools the source variables and computes summary
- Option `pooled` pools the source variables and computes summary
stats as if it was a single variable.
- `pweights` are allowed.
- Largest and smallest observations are weighted.
- `rolling:`, `statsby`, and `by:` are not allowed. To use `by` pass
- `rolling:`, `statsby:`, and `by:` are not allowed. To use `by` pass
the option `by()`
- `display options` are not supported.
- Factor and time series variables are not allowed.
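
The `gstats sum` options above can be combined roughly as follows. This is a sketch on the `auto` dataset, which is an assumption of the example; only options named in the list are used.

```stata
* Sketch only: assumes gtools 1.5.1 or later is installed.
sysuse auto, clear

* Save detailed summary stats by group to mata without printing them
gstats sum price, by(foreign) matasave noprint
mata: GstatsOutput.desc()

* Print statistics in the style of tabstat
gstats sum price mpg, tab

* Pool the source variables and summarize them as a single variable
gstats sum price mpg, pooled

* Recover the behavior of plain summarize, by group
gstats sum price, by(foreign) nodetail
```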
@@ -560,9 +563,9 @@ TODO
----

- [ ] Update benchmarks for all commands. Still on 0.8 benchmarks.
- [ ] Allow keeping both variable names and labels in `greshape spread/gather`
- [ ] Implement `collapse()` option for `greshape`.
- [ ] Implement variable group syntax for `greshape`.
- [ ] `geomean` for geometric mean (`exp(mean(log(x)))` for gcollapse, gstats tab, gegen).
- [ ] Allow keeping both variable names and labels in `greshape spread/gather`
- [ ] Implement `selectoverflow(missing|closest)`
- [ ] Add totals row for `J > 1` in gstats

@@ -577,7 +580,6 @@ have an ETA for them:
- [ ] Create a Stata C hashing API with thin wrappers around core functions.
- [ ] This will be a C library that other users can import.
- [ ] Some functionality will be available from Stata via gtools, api()
- [ ] Add option to `gtop` to display top X results in alpha order
- [ ] Improve debugging info.
- [ ] Improve code comments when you write the API!
- [ ] Have some type of coding standard for the base (coding style)
