Adding data movement notebook.
Jitendra (Jitu) Kumar committed Jun 11, 2019
1 parent b3ef3d6 commit 69dd167
Showing 3 changed files with 245 additions and 23 deletions.
4 changes: 3 additions & 1 deletion 0_Introduction/TutorialIntroduction.ipynb
@@ -155,7 +155,9 @@
"\n",
"All materials used in today's tutorial are available at https://github.com/ARM-DOE/armhpc-tutorial\n",
"\n",
"Get a copy: `git clone https://github.com/ARM-DOE/armhpc-tutorial`"
"Get a copy: `git clone https://github.com/ARM-DOE/armhpc-tutorial`\n",
"\n",
"Login: https://arm-jupyter.ornl.gov"
]
},
{
220 changes: 220 additions & 0 deletions 2_DataMovement/DataMovement.ipynb
@@ -0,0 +1,220 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# **Data Movement**\n",
"- Big data science using ARM data requires access to large volumes of observations.\n",
"- Data search and download can be an daunting and time taking task.\n",
"- ARM provides custom developed services to allow seamless access to ARM data archives.\n",
"- These services are designed for easy use in a programmatic environment.\n",
"- Our custom tools alleviates some of these by:\n",
" - integrating search and download\n",
" - handling authentication, security etc.\n",
" - command line and python API's to enable fully automated workflows"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# **Tools for data movement**\n",
"\n",
"|ARM Live Data Web Service | \"stage_arm_data\"|\n",
"|--------------------|-----------------|\n",
"| Web service |Globus based transfer|\n",
"| Works from any where in the world | Custom tool available only on ARM HPC clusters |\n",
"| Access to all user accessible data | Access to **ALL** ARM data, including raw |\n",
"| Download only | Also allow data movement within ARM clusters and ADC |\n",
"| HTTP download driven by your internet speed|Fast transfers over dedicated Infiniband network|\n",
"| Allow search and download based on simple queries | Allow search and download based on simple queries |\n",
"| Command line, Python API | Command line, Python API |\n",
"| Enable fully portable application | Applications would be limited to ARM cluster |"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# **ARM Live Data Web Service**\n",
"\n",
"**URL: https://adc.arm.gov/armlive**\n",
"\n",
"Developed and maintained by: Ranjeet Devarakonda and ADC Web Tools Team\n",
"\n",
"Provides: REST API, Command line (Wget, Curl), Bash scripting, Python API"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## **Using REST API**\n",
"`https://adc.arm.gov/armlive/data/query?user=USER_ID:ACCESS_TOKEN&ds=DATASTREAM&start=START_DATE&end=END_DATE`\n",
"\n",
"## **Command line download using Curl**\n",
"`$ wget 'https://adc.arm.gov/armlive/data/saveData?user=example:abcd1234&file=sgpmetE11.b1.20180101.000000.cdf'`\n",
"\n",
"## **Command line download using Wget**\n",
"`$ curl 'https://adc.arm.gov/armlive/data/saveData?user=example:abcd1234&file=sgpmetE11.b1.20180101.000000.cdf'`\n",
"\n",
"## **Using Python API**\n",
"\n",
"Installation: `$ pip install git+https://code.ornl.gov/ofg/armlive_getfiles.git`\n",
"\n",
"Execution: `$ getARMFiles -u example:abcd1234 -ds sgpmetE11.b1 -s 2018-01-01 -e 2018-02-01`\n",
"\n",
"OR \n",
"\n",
"Installation: `$ git clone https://code.ornl.gov/ofg/armlive_getfiles.git`\n",
"\n",
"Execution: `$ python armlive_getfiles/src/getFiles.py -u example:abcd1234 -ds sgpmetE11.b1 -s 2018-01-01 -e 2018-02-01`\n",
"\n",
"## **Using bash scripting**\n",
"Bash script: `https://adc.arm.gov/armlive/scripts/getFiles.sh`\n",
"\n",
"Execution: `$ bash getFiles.sh example:abcd1234 sgpmetE11.b1 2018-01-01 2018-02-01`"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## **This Python function parses the JSON blob and downloads the responsive files into the output directory.**\n",
"The ARMLive web service returns a JSON blob with download links for archive files based on the datastream, start, and end dates provided. \n",
"\n",
"\n",
"```python\n",
"def download_arm_files(user, token, datastream, start, end, output_directory):\n",
" params = {\n",
" 'user': f'{user}:{token}',\n",
" 'ds': datastream,\n",
" 'start': start,\n",
" 'end': end,\n",
" 'wt': 'json',\n",
" }\n",
"\n",
"# print(params)\n",
" response = requests.get('https://adc.arm.gov/armlive/livedata/query', params=params)\n",
"# print(response.url)\n",
" response = response.json()\n",
"# print(response)\n",
" downloaded_files = []\n",
" for filename in response['files']:\n",
" download_url = f'https://adc.arm.gov/armlive/livedata/saveData?user={user}:{token}&file={filename}'\n",
" file_path = Path(output_directory, filename)\n",
" file_path.parent.mkdir(parents=True, exist_ok=True)\n",
" with requests.get(download_url, stream=True) as r:\n",
" with open(file_path, 'wb') as f:\n",
" shutil.copyfileobj(r.raw, f)\n",
" downloaded_files.append(file_path)\n",
" return downloaded_files\n",
"```"
]
},
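{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"When a download is rerun, it can help to skip files that are already on disk. The sketch below pairs each file name in a query response with its download URL and local path, leaving out files that already exist. `plan_downloads` is a hypothetical helper and the sample response is made up for illustration; only the `files` key and the `saveData` URL are taken from the function above.\n",
"\n",
"```python\n",
"import tempfile\n",
"from pathlib import Path\n",
"\n",
"SAVE_URL = 'https://adc.arm.gov/armlive/livedata/saveData'\n",
"\n",
"\n",
"def plan_downloads(response, user, token, output_directory):\n",
"    # Hypothetical helper: list (url, local path) pairs for files not yet on disk.\n",
"    plan = []\n",
"    for filename in response.get('files', []):\n",
"        file_path = Path(output_directory, filename)\n",
"        if file_path.exists():\n",
"            continue\n",
"        url = f'{SAVE_URL}?user={user}:{token}&file={filename}'\n",
"        plan.append((url, file_path))\n",
"    return plan\n",
"\n",
"\n",
"# Made-up response for illustration only.\n",
"sample = {'files': ['sgpmetE11.b1.20180101.000000.cdf', 'sgpmetE11.b1.20180102.000000.cdf']}\n",
"with tempfile.TemporaryDirectory() as tmp:\n",
"    # Pretend the first file was downloaded on an earlier run.\n",
"    Path(tmp, sample['files'][0]).touch()\n",
"    plan = plan_downloads(sample, 'example', 'abcd1234', tmp)\n",
"    print(len(plan))\n",
"```"
]
},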
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# stage_arm_data\n",
"\n",
"Developed and maintained by: Zach Price, Jitu Kumar and ARM HPC Team\n",
"\n",
"- stage_arm_data uses Globus protocols to\n",
" - query ARM database\n",
" - identify the files as per search criteria\n",
" - download from ARM archive\n",
" - handles all security and authentications\n",
"- data is always stages in project shared area on Lustre filesystem\n",
"- Currently stage_arm_data is only avaliable from the login and compute nodes of Stratus. \n",
"- They are not available from Jupyter notebook. The recommended workaround in the meantime is to open a web terminal in Jupyter, ssh to stratus.ornl.gov, and stage data.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# stage_arm_data\n",
"\n",
"Add to your environment: `module load stage_arm_data`\n",
"\n",
"Data transfer command: `stage_arm_data --to Stratus --datastream corkasacrcfrhsrhiM1.a1 --start 2019-01-01 00:00:00 --end 2019-02-01 00:00:00`\n",
"\n",
"Files will be staged at: `/lustre/or-hydra/cades-arm/proj-shared/data_transfer/cor/corkasacrcfrhsrhiM1.a1/`\n",
"\n",
"From within a Python script:\n",
"```python\n",
"from stage_arm_data.core import transfer_files\n",
"from stage_arm_data.endpoints import Stratus\n",
"\n",
"constraints = {\n",
" 'start_time': 1552608000,\n",
" 'end_time': 1552694400,\n",
" 'datastream': 'corkasacrcfrhsrhiM1.a1'\n",
"}\n",
"\n",
"transfer_files(constraints, Stratus)\n",
"```\n",
"\n",
"Use `--dry-run=True` option in constraints to list available files without transferring."
]
},
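{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The `start_time` and `end_time` values in the constraints are integer timestamps. Read as seconds since 1970-01-01 UTC, the two values in the example correspond to 2019-03-15 and 2019-03-16 at midnight. The sketch below builds them from readable date strings; `to_epoch_seconds` is a hypothetical helper, and treating the times as UTC is an assumption rather than documented behavior of `stage_arm_data`.\n",
"\n",
"```python\n",
"from datetime import datetime, timezone\n",
"\n",
"\n",
"def to_epoch_seconds(date_string):\n",
"    # Hypothetical helper: parse 'YYYY-MM-DD HH:MM:SS' as UTC (an assumption)\n",
"    # and return integer seconds since 1970-01-01.\n",
"    parsed = datetime.strptime(date_string, '%Y-%m-%d %H:%M:%S')\n",
"    return int(parsed.replace(tzinfo=timezone.utc).timestamp())\n",
"\n",
"\n",
"constraints = {\n",
"    'start_time': to_epoch_seconds('2019-03-15 00:00:00'),\n",
"    'end_time': to_epoch_seconds('2019-03-16 00:00:00'),\n",
"    'datastream': 'corkasacrcfrhsrhiM1.a1',\n",
"}\n",
"print(constraints['start_time'], constraints['end_time'])\n",
"```"
]
},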
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}