Commit

Merge pull request #6 from scholarsportal/release/v0.1.1
Release/v0.1.1
kenlhlui authored Jan 28, 2025
2 parents 3ef6dab + 92543a3 commit 7c60b04
Showing 12 changed files with 461 additions and 127 deletions.
6 changes: 5 additions & 1 deletion .github/workflows/poetry-export_dependencies.yml
@@ -2,11 +2,15 @@ name: Poetry export requirements.txt
on:
push:
branches:
- '*' # Trigger on any push to any branch
- 'main' # Trigger only on pushes to the main branch
paths:
- 'requirements.txt'
- 'pyproject.toml'
- 'poetry.lock'

# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:

jobs:
poetry-export_dependencies:
strategy:
180 changes: 180 additions & 0 deletions .gitignore
@@ -0,0 +1,180 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# UV
# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
#uv.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# Ruff stuff:
.ruff_cache/

# PyPI configuration file
.pypirc

# exported_files folder
exported_files/

# test.ipynb
test.ipynb
6 changes: 3 additions & 3 deletions CITATION.cff
@@ -1,10 +1,10 @@
cff-version: 0.1.0
cff-version: 0.1.1
message: "If you use this software, please cite it as below."
authors:
- family-names: "Lui"
given-names: "Lok Hei"
orcid: "https://orcid.org/0000-0001-5077-1530"
title: "Dataverse Metadata Crawler"
version: 0.1.0
date-released: 2025-01-16
version: 0.1.1
date-released: 2025-01-28
url: "https://github.com/scholarsportal/dataverse-metadata-crawler"
57 changes: 33 additions & 24 deletions README.md
@@ -1,16 +1,16 @@
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![Licnese: MIT](https://img.shields.io/badge/Licnese-MIT-blue)](https://opensource.org/license/mit)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue)](https://opensource.org/license/mit)
[![Dataverse](https://img.shields.io/badge/Dataverse-FFA500?)](https://dataverse.org/)
[![Code Style: Black](https://img.shields.io/badge/code_style-black-black?)](https://github.com/psf/black)

# Dataverse Metadata Crawler
![Screencapture of the CLI tool](res/screenshot.png)

## 📜Description
A Python CLI tool for extracting and exporting metadata from [Dataverse](https://dataverse.org/) repositories. It supports bulk extraction of dataverses, datasets, and data file metadata from any chosen level of dataverse collection (whole Dataverse repository/sub-Dataverse), with flexible export options to JSON and CSV formats.
A Python CLI tool for extracting and exporting metadata from [Dataverse](https://dataverse.org/) repositories. It supports bulk extraction of dataverses, datasets, and data file metadata from any chosen level of dataverse collection (an entire Dataverse repository/sub-Dataverse), with flexible export options to JSON and CSV formats.
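Under the hood, tools like this rely on Dataverse's native REST API. As a rough illustration (this is not code from the repository, and the function names here are invented for the sketch), listing the children of a collection uses the standard `/api/dataverses/{alias}/contents` endpoint with the API token passed in the `X-Dataverse-key` header:

```python
import json
import urllib.request


def contents_url(base_url: str, alias: str) -> str:
    # Native Dataverse API endpoint listing a collection's children
    return f"{base_url.rstrip('/')}/api/dataverses/{alias}/contents"


def list_contents(base_url: str, alias: str, api_key: str = "") -> list:
    req = urllib.request.Request(contents_url(base_url, alias))
    if api_key:
        # Dataverse reads the authentication token from this header
        req.add_header("X-Dataverse-key", api_key)
    with urllib.request.urlopen(req, timeout=30) as resp:
        # "data" holds the child entries: sub-dataverses and datasets
        return json.load(resp)["data"]
```

A crawler walks this endpoint recursively for each child of type `dataverse` to cover an entire sub-tree of the repository.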

## ✨Features
1. Bulk metadata extraction from Dataverse repositories from any chosen level of collection (top level or selected collection)
1. Bulk metadata extraction from Dataverse repositories at any chosen level of collection (top level or selected collection)
2. JSON & CSV file export options

## 📦Prerequisites
@@ -29,7 +29,7 @@ A Python CLI tool for extracting and exporting metadata from [Dataverse](https:/
cd ./dataverse-metadata-crawler
```

3. Create an environment file (.env)
3. Create an environment file (`.env`)
```sh
touch .env # For Unix/MacOS
nano .env # or vim .env, or your preferred editor
@@ -38,11 +38,16 @@ A Python CLI tool for extracting and exporting metadata from [Dataverse](https:/
notepad .env
```

4. Configure environment file using your text editor at your choice
4. Configure the environment (`.env`) file using the text editor of your choice.
```sh
# .env file
BASE_URL = "TARGET_REPO_URL" # e.g., "https://demo.borealisdata.ca/"
API_KEY = "YOUR_API_KEY" # Find in your Dataverse account settings. You may also specify it in the CLI interface (with -a flag)
BASE_URL = "TARGET_REPO_URL" # Base URL of the repository; e.g., "https://demo.borealisdata.ca/"
API_KEY = "YOUR_API_KEY" # Found in your Dataverse account settings. Can also be specified in the CLI interface using the -a flag.
```
Your `.env` file should look like this:
```sh
BASE_URL = "https://demo.borealisdata.ca/"
API_KEY = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX"
```
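For illustration, a file in this `KEY = "value"` format can be parsed with a few lines of Python. This is only a sketch of how such values might be consumed, not the repository's actual loader (which may well use a library such as `python-dotenv` instead):

```python
def load_env(path: str = ".env") -> dict:
    """Parse KEY = "value" lines, ignoring blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.split("#", 1)[0].strip()  # drop trailing comments
            if not line or "=" not in line:
                continue
            key, _, val = line.partition("=")
            values[key.strip()] = val.strip().strip('"')
    return values
```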

5. Set up virtual environment (recommended)
@@ -68,15 +73,15 @@ python3 dvmeta/main.py [-a AUTH] [-l] [-d] [-p] [-f] [-e] [-s] -c COLLECTION_ALI

| **Option** | **Short** | **Type** | **Description** | **Default** |
|--------------------|-----------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|
| --collection_alias | -c | TEXT | Name of the collection to crawl. <br/> **[required]** | None |
| --collection_alias | -c | TEXT | The alias of the collection to crawl. <br/> See the guide [here](https://github.com/scholarsportal/dataverse-metadata-crawler/wiki/Guide:-How-to-find-the-COLLECTION_ALIAS-of-a-Dataverse-collection) to learn how to look for the collection alias. <br/> **[required]** | None |
| --version | -v | TEXT | The Dataset version to crawl. Options include: <br/> • `draft` - The draft version, if any <br/> • `latest` - Either a draft (if exists) or the latest published version <br/> • `latest-published` - The latest published version <br/> • `x.y` - A specific version <br/> **[required]** | None (required) |


**Optional arguments:**

| **Option** | **Short** | **Type** | **Description** | **Default** |
|----------------------|-----------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------|
| --auth | -a | TEXT | Authentication token to access the Dataverse repository. <br/> If | None |
| --auth | -a | TEXT | Authentication token to access the Dataverse repository. <br/> | None |
| --log <br/> --no-log | -l | | Output a log file. <br/> Use `--no-log` to disable logging. | `log` (unless `--no-log`) |
| --dvdfds_metadata | -d | | Output a JSON file containing metadata of Dataverses, Datasets, and Data Files. | |
| --permission | -p | | Output a JSON file that stores permission metadata for all Datasets in the repository. | |
@@ -101,36 +106,40 @@ python3 dvmeta/main.py -c demo -v 1.0 -d -s -p -a xxxxxxxx-xxxx-xxxx-xxxx-xxxxxx

| File | Description |
|-------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| ds_metadata_yyyymmdd-HHMMSS.json | Datasets' their data files' metadata in JSON format. |
| ds_metadata_yyyymmdd-HHMMSS.json | Datasets representation & data files metadata in JSON format. |
| empty_dv_yyyymmdd-HHMMSS.json | The IDs of empty dataverse(s) in list format. |
| failed_metadata_uris_yyyymmdd-HHMMSS.json | The URIs (URLs) of datasets that failed to download. |
| permission_dict_yyyymmdd-HHMMSS.json | The permission metadata of datasets with their dataset IDs. |
| pid_dict_yyyymmdd-HHMMSS.json | Datasets' basic info with hierarchical information dictionary. Only exported if -p (permission) flag is used without -d (metadata) flag. |
| pid_dict_dd_yyyymmdd-HHMMSS.json | The hierarchical information of deaccessioned/draft datasets. |
| ds_metadata_yyyymmdd-HHMMSS.csv | Datasets' their data files' metadata in CSV format. |
| ds_metadata_yyyymmdd-HHMMSS.csv | Datasets and their data files' metadata in CSV format. |
| log_yyyymmdd-HHMMSS.txt | Summary of the crawling work. |

```sh
exported_files/
├── json_files/
│ └── ds_metadata_yyyymmdd-HHMMSS.json # With -d flag enabled
│ └── empty_dv_yyyymmdd-HHMMSS.json # With -e flag enabled
│ └── failed_metadata_uris_yyyymmdd-HHMMSS.json
│ └── permission_dict_yyyymmdd-HHMMSS.json # With -p flag enabled
│ └── pid_dict_yyyymmdd-HHMMSS.json # Only exported if -p flag is used without -d flag
│ └── pid_dict_dd_yyyymmdd-HHMMSS.json # Hierarchical information of deaccessioned/draft datasets
│ └── failed_metadata_uris_yyyymmdd-HHMMSS.json # With -f flag enabled
│ └── permission_dict_yyyymmdd-HHMMSS.json # With only -p flag enabled
│ └── pid_dict_yyyymmdd-HHMMSS.json # With only -p flag enabled
│ └── pid_dict_dd_yyyymmdd-HHMMSS.json # Hierarchical information of deaccessioned/draft datasets.
├── csv_files/
│ └── ds_metadata_yyyymmdd-HHMMSS.csv # with -s flag enabled
└── logs_files/
└── log_yyyymmdd-HHMMSS.txt # Exported by default, without specifying --no-log
```
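Since the export filenames embed a `yyyymmdd-HHMMSS` timestamp, they sort lexically in chronological order, which makes post-processing straightforward. A minimal sketch (the helper name and directory default are illustrative, following the tree above):

```python
from pathlib import Path


def latest_export(export_dir: str = "exported_files/json_files") -> Path:
    """Return the newest ds_metadata_*.json export.

    The yyyymmdd-HHMMSS suffix sorts lexically, so sorted() order
    matches chronological order.
    """
    files = sorted(Path(export_dir).glob("ds_metadata_*.json"))
    if not files:
        raise FileNotFoundError(f"no ds_metadata_*.json in {export_dir}")
    return files[-1]
```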

## ⚠️Disclaimer
> [!WARNING]
> To retrieve data about unpublished datasets or information that is not available publicly (e.g. collaborators/permissions), you will need to have necessary access rights. **Please note that any publication or use of non-publicly available data may require review by a Research Ethics Board**.
## ✅Tests
No tests have been written yet. Contributions welcome!

## 💻Development
1. Dependencies managment: [poetry](https://python-poetry.org/) - Update the pyproject.toml dependencies changes
2. Linter: [ruff](https://docs.astral.sh/ruff/) - Linting rules are outlined in the pyproject.toml
1. Dependencies management: [poetry](https://python-poetry.org/) - Use `poetry` to manage dependencies and reflect changes in the `pyproject.toml` file.
2. Linter: [ruff](https://docs.astral.sh/ruff/) - Follow the linting rules outlined in the `pyproject.toml` file.

## 🙌Contributing
1. Fork the repository
@@ -148,18 +157,18 @@ If you use this software in your work, please cite it using the following metada

APA:
```
Lui, L. H. (2025). Dataverse Metadata Crawler (Version 0.1.0) [Computer software]. https://github.com/scholarsportal/dataverse-metadata-crawler
Lui, L. H. (2025). Dataverse Metadata Crawler (Version 0.1.1) [Computer software]. https://github.com/scholarsportal/dataverse-metadata-crawler
```

BibTeX:
```
@software{Lui_Dataverse_Metadata_Crawler_2025,
author = {Lui, Lok Hei},
month = jan,
title = {{Dataverse Metadata Crawler}},
url = {https://github.com/scholarsportal/dataverse-metadata-crawler},
version = {0.1.0},
year = {2025}
author = {Lui, Lok Hei},
month = {jan},
title = {Dataverse Metadata Crawler},
url = {https://github.com/scholarsportal/dataverse-metadata-crawler},
version = {0.1.1},
year = {2025}
}
```

