Add simple Spark notebook and environment
mrocklin committed Apr 20, 2024
1 parent 9f0e7ef commit 2fc20ee
Showing 2 changed files with 177 additions and 0 deletions.
161 changes: 161 additions & 0 deletions spark.ipynb
@@ -0,0 +1,161 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "f8eafed2-77a1-4691-8e8b-aeb1187ce8f5",
"metadata": {
"tags": []
},
"source": [
"\n",
"Spark on Coiled\n",
"===============\n",
"\n",
"<img src=\"https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLsyvblPuU1h0NRGoZiODTKqIYbTpCu3hrHoM1rXzt1A&s\"\n",
" align=\"right\"\n",
" width=\"40%\"/>\n",
"\n",
"Coiled can run Spark Jobs.\n",
"\n",
"You get all the same Coiled ease of use features:\n",
"\n",
"1. Quick startup\n",
"2. Copies all of your local packages and code\n",
"3. Runs in any region on any hardware\n",
"4. Runs from your local notebook\n",
"\n",
"But now rather than just Dask you can run Spark too."
]
},
{
"cell_type": "markdown",
"id": "0d130128-ac72-4ce6-87b1-b7a20337fd2a",
"metadata": {},
"source": [
"### Read a little bit of data with pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "09728a96-0c84-4198-ab52-4dcdfd704606",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_parquet(\n",
" \"s3://coiled-data/uber/part.0.parquet\",\n",
")\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "3148ad8d-3de7-47b6-91a3-1d1f5a393f64",
"metadata": {},
"source": [
"## Start Spark cluster to read lots of data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a9e6076-c8b3-4282-90a4-0fe3ab49440d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import coiled\n",
"\n",
"cluster = coiled.Cluster(\n",
" n_workers=10,\n",
" worker_memory=\"16 GiB\",\n",
" region=\"us-east-2\",\n",
")\n",
"\n",
"spark = cluster.get_spark()"
]
},
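{
"cell_type": "markdown",
"id": "c4e8f0a2-3b5d-4c7e-9a1f-6d8b0c2e4f5a",
"metadata": {},
"source": [
"`get_spark()` returns a live Spark session backed by the Coiled cluster (presumably over Spark Connect, which is what the `pyspark==3.4.1`, `grpcio`, and `protobuf` pins in `spark.yml` support). From here on, `spark` behaves like any ordinary `SparkSession`."
]
},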
{
"cell_type": "code",
"execution_count": null,
"id": "33b598a4-fe0a-43c5-8007-0e955ac193f9",
"metadata": {},
"outputs": [],
"source": [
"df = spark.read.parquet(\"s3a://coiled-data/uber\")\n",
"df.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56e2b982-af1b-4140-8cc1-414343ba1f0a",
"metadata": {},
"outputs": [],
"source": [
"df.count()"
]
},
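{
"cell_type": "markdown",
"id": "7d3f2a10-5c4e-4b8a-9f21-3e6a8c0d1b42",
"metadata": {},
"source": [
"The same question can be asked directly of Spark. The cell below is a small sketch that assumes this dataset exposes the `tips` and `hvfhs_license_num` columns used by the Dask version further down:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b9c8e4f-6a1d-4f3b-8c57-9d0e1f2a3b4c",
"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql import functions as F\n",
"\n",
"# Share of rides with a nonzero tip, per license number\n",
"# (column names assumed to match the uber-lyft-tlc schema)\n",
"(\n",
"    df.withColumn(\"tipped\", (F.col(\"tips\") != 0).cast(\"double\"))\n",
"    .groupBy(\"hvfhs_license_num\")\n",
"    .agg(F.avg(\"tipped\").alias(\"tip_rate\"))\n",
"    .show()\n",
")"
]
},
{
"cell_type": "markdown",
"id": "5e7a9b2c-4d6f-4a8e-b1c3-0f2e4d6a8b9c",
"metadata": {},
"source": [
"## Run the same workload with Dask\n",
"\n",
"The cluster also speaks Dask. The next cell is a stress test: it restarts the workers and repeats a Dask version of the tip-rate computation in an endless loop."
]
},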
{
"cell_type": "code",
"execution_count": null,
"id": "829de2bc-ed09-4e10-b06f-268aa79ead59",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import dask\n",
"import dask.dataframe as dd\n",
"dask.config.set({\"dataframe.convert-string\": True}) # use PyArrow strings by default\n",
"\n",
"while True:\n",
" client.restart()\n",
"\n",
" df = dd.read_parquet(\n",
" \"s3://coiled-datasets/uber-lyft-tlc/\",\n",
" storage_options={\"anon\": True},\n",
" ).persist()\n",
"\n",
" for _ in range(10):\n",
" df[\"tipped\"] = df.tips != 0\n",
"\n",
" df.groupby(\"hvfhs_license_num\").tipped.mean().compute()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:spark]",
"language": "python",
"name": "conda-env-spark-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
16 changes: 16 additions & 0 deletions spark.yml
@@ -0,0 +1,16 @@
name: spark
channels:
- conda-forge
dependencies:
- python=3.11
- dask
- coiled
- ipykernel
- pyspark==3.4.1
- pyarrow
- grpcio
- grpcio-status
- openjdk~=11.0
- protobuf
- jupyterlab
- s3fs
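
A likely local setup, assuming conda is installed: `conda env create -f spark.yml`, then `conda activate spark` and `jupyter lab`. The `ipykernel` dependency lets the environment serve as a Jupyter kernel (the notebook's metadata expects a kernel named `conda-env-spark-py`, the name that nb_conda_kernels-style discovery assigns), and Coiled's package sync replicates the same environment onto the cluster workers.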
