libGH(1)

NAME

libgh - GitHub scraping tool

SYNOPSIS

libgh [--days|-d DAYS] [--force|-f] [--from] [--json|-j] [--prune|-p] [--topics] [--xml|-x] [--debug] [--help|-?] [--version] [--] account_or_repository [...]

The alias lgh is also available to shorten the command name.

DESCRIPTION

The libgh command-line utility scrapes data from a list of GitHub accounts (either personal or organizational) or repositories (in account/repository form).

By default this data is returned as pretty-printed text, or JSON data if the --json|-j option is used, or XML data if the --xml|-x option is used.

As data is retrieved in unauthenticated mode, some of it may be missing. For instance, organizational accounts will not mention the origin of forked repositories and may come with partial repository topics. The --from and --topics options enable additional repository scraping in order to provide this information.

The GitHub Web site applies rate limiting rules. To comply with its policies, the tool performs no more than 60 requests per hour, with at least a 1 second interval between requests. It also maintains a caching directory of request results, which it reuses for 7 days by default, or for the number of days given with the --days|-d option. A value of 0 instructs the tool not to use caching, while the --force|-f option forces reloading of the requested resources. The --debug option shows whether a resource comes from the cache or the Web, as well as the number of Web requests made to GitHub per day, hour and minute.
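The two limits described above (60 requests per rolling hour, at least 1 second apart) can be sketched as follows. This is a minimal illustration and not libgh's actual code: the RequestThrottle class, its method names and the injectable clock are all hypothetical.

```python
import time
from collections import deque


class RequestThrottle:
    """Hypothetical helper: cap requests per rolling hour and space them out."""

    def __init__(self, max_per_hour=60, min_interval=1.0, clock=time.monotonic):
        self.max_per_hour = max_per_hour
        self.min_interval = min_interval
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.sent = deque()  # timestamps of the requests made during the last hour

    def delay_needed(self):
        """Return how many seconds to wait before the next request is allowed."""
        now = self.clock()
        # Forget requests that are at least one hour old
        while self.sent and now - self.sent[0] >= 3600:
            self.sent.popleft()
        delay = 0.0
        if self.sent:
            # Keep the minimum interval since the most recent request
            delay = max(delay, self.min_interval - (now - self.sent[-1]))
        if len(self.sent) >= self.max_per_hour:
            # Hourly quota used up: wait until the oldest request leaves the window
            delay = max(delay, 3600 - (now - self.sent[0]))
        return delay

    def record(self):
        """Register that a request has just been sent."""
        self.sent.append(self.clock())
```

A caller would typically run time.sleep(throttle.delay_needed()) before each Web request and throttle.record() right after sending it.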

The cache can be trimmed to 7 days, or to the --days|-d parameter value, with the --prune|-p option. Pages are stored as XZ compressed files in order to reduce disk usage.
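As an illustration of the caching behaviour described above, here is a minimal sketch using Python's standard lzma module. It is not taken from libgh: the function names, the use of file modification times to decide staleness, and the .xz file suffix are assumptions.

```python
import lzma
import os
import time


def save_page(path, text):
    """Write a page to disk as an XZ compressed file."""
    with lzma.open(path, "wt", encoding="utf-8") as file:
        file.write(text)


def load_page(path, days=7, now=None):
    """Return the cached page, or None if it is missing, stale, or caching is off."""
    if days == 0 or not os.path.isfile(path):
        return None  # days=0 means "don't use cache"
    now = time.time() if now is None else now
    if now - os.path.getmtime(path) > days * 86400:
        return None  # older than the allowed number of caching days
    with lzma.open(path, "rt", encoding="utf-8") as file:
        return file.read()


def prune(directory, days=7, now=None):
    """Delete cached .xz files older than the given number of days; return their names."""
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(".xz") and now - os.path.getmtime(path) > days * 86400:
            os.remove(path)
            removed.append(name)
    return removed
```

The sketch leaves out the cache index; a real prune step would presumably also drop the matching index entries.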

OPTIONS

Options          Use
--days|-d DAYS   Set number of caching days (0=don't use cache)
--force|-f       Force fetching URL instead of using cache
--from           Load repositories when forked_from is blank
--json|-j        Switch to JSON output instead of plain text
--prune|-p       Prune cache items older than DAYS and cache index
--topics         Load repositories when there are missing topics
--xml|-x         Switch to XML output instead of plain text
--debug          Enable debug mode
--help|-?        Print usage and a short help message and exit
--version        Print version and exit
--               Options processing terminator

ENVIRONMENT

The LOCALAPPDATA and TMP environment variables under Windows, and HOME, TMPDIR and TMP environment variables under other operating systems can influence the caching directory used.

FILES

The libgh utility will attempt to maintain a caching directory for the web requests it makes.

This directory will be located in one of the following places:

Unix:
    ${HOME}/.cache/libgh
    ${TMPDIR}/.cache/libgh
    ${TMP}/.cache/libgh
Windows:
    %LOCALAPPDATA%\cache\libgh
    %TMP%\cache\libgh

An index.txt file in this directory maps each URL to its cached file.
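The lookup of the caching directory could work along these lines. This is only a sketch: it assumes the locations listed above are tried in order and that the first one whose environment variable is set wins, which the text does not state explicitly. The cache_directory function is hypothetical.

```python
import os


def cache_directory(environ, windows=False):
    """Return the first candidate caching directory whose variable is set, else None."""
    if windows:
        candidates = [("LOCALAPPDATA", "cache"), ("TMP", "cache")]
    else:
        candidates = [("HOME", ".cache"), ("TMPDIR", ".cache"), ("TMP", ".cache")]
    separator = "\\" if windows else "/"
    for variable, subdirectory in candidates:
        base = environ.get(variable)
        if base:  # assumption: unset or empty variables are skipped
            return separator.join([base.rstrip("/\\"), subdirectory, "libgh"])
    return None


if __name__ == "__main__":
    # Resolve the directory for the current platform and environment
    print(cache_directory(os.environ, windows=(os.name == "nt")))
```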

EXIT STATUS

The libgh utility exits 0 on success, and >0 if an error occurs.
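Since a non-zero exit status signals an error, a calling script can rely on it when driving the tool. The sketch below uses only options listed in the SYNOPSIS; the lgh_command and fetch_json helpers are hypothetical, and the structure of the JSON output is not described here, so it is returned as-is.

```python
import json
import subprocess


def lgh_command(targets, output="text", debug=False):
    """Build the argument list for an lgh call using options from the SYNOPSIS."""
    command = ["lgh"]
    if debug:
        command.append("--debug")
    if output == "json":
        command.append("--json")
    elif output == "xml":
        command.append("--xml")
    elif output != "text":
        raise ValueError(f"unknown output format: {output}")
    # "--" ends option processing, so targets are never mistaken for options
    return command + ["--"] + list(targets)


def fetch_json(targets):
    """Run lgh and decode its JSON output; raises CalledProcessError on exit status > 0."""
    result = subprocess.run(
        lgh_command(targets, output="json"),
        capture_output=True,
        text=True,
        check=True,  # turn a non-zero exit status into an exception
    )
    return json.loads(result.stdout)
```

For instance, fetch_json(["HubTou"]) would run lgh --json -- HubTou, provided the lgh command is installed and on the PATH.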

EXAMPLES

To extract data from a personal GitHub account named HubTou in all possible output formats, do:

$ lgh --debug HubTou > libgh.txt
$ lgh --debug --json HubTou > libgh.json
$ lgh --debug --xml HubTou > libgh.xml

Results for this example are available there:

SEE ALSO

fetch(1), curl(1)

STANDARDS

The libgh utility is not a standard UNIX command.

This implementation tries to follow the PEP 8 style guide for Python code.

PORTABILITY

To be tested under Windows.

HISTORY

This implementation was made for the PNU project.

It's intended as the scraping engine for my topgh tool.

LICENSE

It is available under the 3-clause BSD license.

AUTHORS

Hubert Tournier

CAVEATS

Some information is not available in unauthenticated mode and the hourly rate limit is quite low, but this should be fine for most usages.