Commit

docs-action committed Nov 26, 2024
1 parent 1be9a14 commit fe6e61b
Showing 4 changed files with 56 additions and 15 deletions.
17 changes: 12 additions & 5 deletions assets/js/search-data.json
@@ -288,7 +288,7 @@
},"41": {
"doc": "Try it Out",
"title": "Table of contents",
"content": ". | Prerequisites | Trying lakeFS for Databricks with Sample Data | Importing real data into lakeFS for Databricks | Summary | . ",
"content": ". | Prerequisites | Trying lakeFS for Databricks with Sample Data | Importing your data into lakeFS for Databricks | Importing data from an existing lakeFS repository | Summary | . ",
"url": "/getstarted/try-it-out.html#table-of-contents",

"relUrl": "/getstarted/try-it-out.html#table-of-contents"
@@ -308,12 +308,19 @@
"relUrl": "/getstarted/try-it-out.html#trying-lakefs-for-databricks-with-sample-data"
},"44": {
"doc": "Try it Out",
"title": "Importing real data into lakeFS for Databricks",
"content": "Note Before using lakeFS for Databricks with your data, please review the Product Limitations page and ensure you follow the provided guidelines. If you’ve made it this far - great! You should now understand how to operate lakeFS and use it alongside Unity Catalog and your Databricks workspace. Now that you’re more comfortable with the system, let’s try it out by importing our existing data into lakeFS. This is safe to do: lakeFS would never modify or change imported data in any way. Importing works similarly to how we created a branch before: lakeFS would simply create Delta tables whose data resides in its original location - no data is copied or moved, and is never modified. Once imported, any changes made to the resulting table are isolated to it: the imported source is never modified. Let’s see this in action: . Step 1: Importing existing data . Let’s import data from our production catalog into our sample catalog we created earlier: . lakefs-databricks import \\ --from \"production_catalog.default.tpch_customer\" \\ --to \"lakefs_databricks_tutorial.dev.tpch_customer\" . We should now see that table appear in our dev schema! . Let’s commit this change: . lakefs-databricks commit lakefs_databricks_tutorial dev . Step 2: Interacting with data . Let’s query the data from its import destination branch: . USE dev; SELECT * FROM tpch_customer LIMIT 10; . Should return the data that we just imported: . Step 3: Branching out . Now, just like with our sample data, we can branch out and modify this table in isolation. Our source branch should not be affected by any modification made to our new branch, with the same zero-copy branching approach: . lakefs-databricks branch create lakefs_databricks_tutorial ozk_test dev . Which should look something like this: . Step 4: Modify table . To wrap things up, let’s modify both tables (sample data and the one we just imported) to show how our source branch remains unmodified: . USE ozk_test; -- delete some arbitrary data DELETE FROM tpch_customer WHERE c_nationkey = 9; . Again, we can use SQL to look a the difference between our branched table and our source, to ensure isolation: . Step 5: Creating new table . In some cases, we want to not only modify existing data and tables, but to add new tables as well. To do so, we’ll need to create a table which is located within the storage namespace of the repository we’re using to version our data. This allows lakeFS to properly access and version this table as part of the repository. To do so, first copy the storage namespace location from your lakeFS repository’s settings: . CREATE EXTERNAL TABLE my_additional_table ( id INT, value TEXT, ) LOCATION 's3://<REPOSITORY_STORAGE_NAMESPACE>/my_additional_table'; . Replacing <REPOSITORY_STORAGE_NAMESPACE> with the value we took from the repository settings (s3://my-bucket-name/repositories/lakefs-databricks-repo/ in my case). ",
"url": "/getstarted/try-it-out.html#importing-real-data-into-lakefs-for-databricks",
"title": "Importing your data into lakeFS for Databricks",
"content": "Note Before using lakeFS for Databricks with your data, please review the Product Limitations page and ensure you follow the provided guidelines. If you’ve made it this far - great! You should now understand how to operate lakeFS and use it alongside Unity Catalog and your Databricks workspace. Now that you’re more comfortable with the system, let’s try it out by importing our existing data into lakeFS for Databricks. This is safe to do: lakeFS would never modify or change imported data in any way. Importing works similarly to how we created a branch before: lakeFS would simply create Delta tables whose data resides in its original location - no data is copied or moved, and is never modified. Once imported, any changes made to the resulting table are isolated to it: the imported source is never modified. With lakeFS for Databricks you can import existing data from other catalogs within the workspace Metastore. Alternatively, if you are already using lakeFS, you can import data directly from your existing lakeFS repositories. Our example imports data from another catalog within the workspace. Let’s see it in action: . Step 1: Importing existing data . Let’s import data from our production catalog into our sample catalog we created earlier: . lakefs-databricks import \\ --from \"production_catalog.default.tpch_customer\" \\ --to \"lakefs_databricks_tutorial.dev.tpch_customer\" . We should now see that table appear in our dev schema! . In the lakeFS UI, you’ll see a new commit that reflects the imported table. This commit is automatically created at the end of the import operation. Step 2: Interacting with data . Let’s query the data from its import destination branch: . USE dev; SELECT * FROM tpch_customer LIMIT 10; . Should return the data that we just imported: . Step 3: Branching out . Now, just like with our sample data, we can branch out and modify this table in isolation. Our source branch should not be affected by any modification made to our new branch, with the same zero-copy branching approach: . lakefs-databricks branch create lakefs_databricks_tutorial ozk_test dev . Which should look something like this: . Step 4: Modify table . To wrap things up, let’s modify both tables (sample data and the one we just imported) to show how our source branch remains unmodified: . USE ozk_test; -- delete some arbitrary data DELETE FROM tpch_customer WHERE c_nationkey = 9; . Again, we can use SQL to look a the difference between our branched table and our source, to ensure isolation: . Step 5: Creating new table . In some cases, we want to not only modify existing data and tables, but to add new tables as well. To do so, we’ll need to create a table which is located within the storage namespace of the repository we’re using to version our data. This allows lakeFS to properly access and version this table as part of the repository. To do so, first copy the storage namespace location from your lakeFS repository’s settings: . CREATE EXTERNAL TABLE my_additional_table ( id INT, value TEXT, ) LOCATION 's3://<REPOSITORY_STORAGE_NAMESPACE>/my_additional_table'; . Replacing <REPOSITORY_STORAGE_NAMESPACE> with the value we took from the repository settings (s3://my-bucket-name/repositories/lakefs-databricks-repo/ in my case). ",
"url": "/getstarted/try-it-out.html#importing-your-data-into-lakefs-for-databricks",

"relUrl": "/getstarted/try-it-out.html#importing-real-data-into-lakefs-for-databricks"
"relUrl": "/getstarted/try-it-out.html#importing-your-data-into-lakefs-for-databricks"
},"45": {
"doc": "Try it Out",
"title": "Importing data from an existing lakeFS repository",
"content": "If you’re already using lakeFS Cloud to manage Delta Lake tables, you can seamlessly import those tables into lakeFS for Databricks. By doing so, the tables become part of the versioned catalog, enabling enhanced dataset management. Run the following command to import a table from lakeFS into lakeFS for Databricks: . lakefs-databricks import \\ --from \"lakefs://my-lakefs-repo/main/famous-people\" \\ --to \"lakefs_databricks_tutorial.dev.famous-people\" . After running this command, the table will appear in the dev schema! . In the lakeFS UI, you’ll see a new commit that reflects the imported table. This commit is automatically created at the end of the import operation: . Once imported, you can interact with the table directly through Unity Catalog: . ",
"url": "/getstarted/try-it-out.html#importing-data-from-an-existing-lakefs-repository",

"relUrl": "/getstarted/try-it-out.html#importing-data-from-an-existing-lakefs-repository"
},"46": {
"doc": "Try it Out",
"title": "Summary",
"content": "Congratulations! If you’ve made it this far, you should now be familiar with all the building blocks to achieve: . | Zero-copy environments for your data lakehouse, allowing you to test and modify your tables with full isolation without having to create and maintain copies of data | Be able to implement the write-audit-publish pattern in production, allowing you to safely test and validate data before releasing it to consumers. | Be able to see how your data changes over time by looking at a detailed commit log of transformations, covering the who, what and how of changes to datasets | . ",
54 changes: 44 additions & 10 deletions getstarted/try-it-out.html
@@ -373,7 +373,8 @@ <h2 class="no_toc text-delta" id="table-of-contents">
<ol id="markdown-toc">
<li><a href="#prerequisites" id="markdown-toc-prerequisites">Prerequisites</a></li>
<li><a href="#trying-lakefs-for-databricks-with-sample-data" id="markdown-toc-trying-lakefs-for-databricks-with-sample-data">Trying lakeFS for Databricks with Sample Data</a></li>
<li><a href="#importing-real-data-into-lakefs-for-databricks" id="markdown-toc-importing-real-data-into-lakefs-for-databricks">Importing real data into lakeFS for Databricks</a></li>
<li><a href="#importing-your-data-into-lakefs-for-databricks" id="markdown-toc-importing-your-data-into-lakefs-for-databricks">Importing your data into lakeFS for Databricks</a></li>
<li><a href="#importing-data-from-an-existing-lakefs-repository" id="markdown-toc-importing-data-from-an-existing-lakefs-repository">Importing data from an existing lakeFS repository</a></li>
<li><a href="#summary" id="markdown-toc-summary">Summary</a></li>
</ol>

@@ -614,10 +615,10 @@ <h3 id="step-6-comparing-branches">
</p>

<p>Looking at the amount of dropoff zipcodes, we also see there are less of those, because this was our predicate to delete rows by.</p>
<h2 id="importing-real-data-into-lakefs-for-databricks">
<h2 id="importing-your-data-into-lakefs-for-databricks">


<a href="#importing-real-data-into-lakefs-for-databricks" class="anchor-heading"><svg viewBox="0 0 16 16" aria-hidden="true"><use xlink:href="#svg-link"></use></svg></a> Importing real data into lakeFS for Databricks
<a href="#importing-your-data-into-lakefs-for-databricks" class="anchor-heading"><svg viewBox="0 0 16 16" aria-hidden="true"><use xlink:href="#svg-link"></use></svg></a> Importing your data into lakeFS for Databricks


</h2>
@@ -629,12 +630,16 @@ <h2 id="importing-real-data-into-lakefs-for-databricks">

<p>If you’ve made it this far - great! You should now understand how to operate lakeFS and use it alongside Unity Catalog and your Databricks workspace.</p>

<p>Now that you’re more comfortable with the system, let’s try it out by importing our existing data into lakeFS.</p>
<p>Now that you’re more comfortable with the system, let’s try it out by importing our existing data into lakeFS for Databricks.</p>

<p><em>This is safe to do</em>: lakeFS would never modify or change imported data in any way. Importing works similarly to how we created a branch before: lakeFS would simply create Delta tables whose data resides in its original location - no data is copied or moved, and is never modified.</p>

<p>Once imported, any changes made to the resulting table are isolated to it: the imported source is never modified.
Let’s see this in action:</p>
<p>Once imported, any changes made to the resulting table are isolated to it: the imported source is never modified.</p>

<p>With lakeFS for Databricks you can import existing data from other catalogs within the workspace <a href="https://docs.databricks.com/en/data-governance/unity-catalog/index.html#metastores">Metastore</a>.
Alternatively, if you are already using lakeFS, you can <a href="#importing-data-from-an-existing-lakefs-repository">import data directly from your existing lakeFS repositories</a>.</p>

<p>Our example imports data from another catalog within the workspace. Let’s see it in action:</p>
<h3 id="step-1-importing-existing-data">


@@ -653,10 +658,8 @@ <h3 id="step-1-importing-existing-data">

<p>We should now see that table appear in our <code class="language-plaintext highlighter-rouge">dev</code> schema!</p>

<p>Let’s commit this change:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lakefs-databricks commit lakefs_databricks_tutorial dev
</code></pre></div></div>
<p>In the lakeFS UI, you’ll see a new commit that reflects the imported table. This commit is automatically created at the
end of the import operation.</p>
<h3 id="step-2-interacting-with-data-1">


@@ -750,6 +753,37 @@ <h3 id="step-5-creating-new-table">
</code></pre></div></div>

<p>Replacing <code class="language-plaintext highlighter-rouge">&lt;REPOSITORY_STORAGE_NAMESPACE&gt;</code> with the value we took from the repository settings (<code class="language-plaintext highlighter-rouge">s3://my-bucket-name/repositories/lakefs-databricks-repo/</code> in my case).</p>
<h2 id="importing-data-from-an-existing-lakefs-repository">


<a href="#importing-data-from-an-existing-lakefs-repository" class="anchor-heading"><svg viewBox="0 0 16 16" aria-hidden="true"><use xlink:href="#svg-link"></use></svg></a> Importing data from an existing lakeFS repository


</h2>


<p>If you’re already using lakeFS Cloud to manage Delta Lake tables, you can seamlessly import those tables into lakeFS for
Databricks. By doing so, the tables become part of the versioned catalog, enabling enhanced dataset management.</p>

<p>Run the following command to import a table from lakeFS into lakeFS for Databricks:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lakefs-databricks import <span class="se">\</span>
<span class="nt">--from</span> <span class="s2">"lakefs://my-lakefs-repo/main/famous-people"</span> <span class="se">\</span>
<span class="nt">--to</span> <span class="s2">"lakefs_databricks_tutorial.dev.famous-people"</span>
</code></pre></div></div>

<p>After running this command, the table will appear in the <code class="language-plaintext highlighter-rouge">dev</code> schema!</p>

<p>In the lakeFS UI, you’ll see a new commit that reflects the imported table. This commit is automatically created at the
end of the import operation:</p>
<p>
<img src="/assets/img/getstarted/try-20-import-commit-imported-from-lakefs.png" />
</p>

<p>Once imported, you can interact with the table directly through Unity Catalog:</p>
<p>
<img src="/assets/img/getstarted/try-21-query-imported-from-lakefs.png" />
</p>
<h2 id="summary">


