From 3da4286c8d725ae5c736627f6bbe4e316603dc5e Mon Sep 17 00:00:00 2001 From: "C. E. Brasseur" Date: Fri, 29 Mar 2024 16:05:14 +0000 Subject: [PATCH 1/2] first draft --- tutorials/FITS-large/FITS-large.ipynb | 934 +++++++++++++++++++++++++- 1 file changed, 928 insertions(+), 6 deletions(-) diff --git a/tutorials/FITS-large/FITS-large.ipynb b/tutorials/FITS-large/FITS-large.ipynb index fe55e342..2451558e 100644 --- a/tutorials/FITS-large/FITS-large.ipynb +++ b/tutorials/FITS-large/FITS-large.ipynb @@ -7,6 +7,10 @@ "source": [ "# Working with large FITS files\n", "\n", + "This tutorial builds on this guide to [Create a very large FITS file from scratch](https://docs.astropy.org/en/stable/generated/examples/io/skip_create-large-fits.html) and shows how to buid a large fits file with multiple HDUs and with the big one not being the last. It is aimed at users already quite familiar with the FITS format.\n", + "\n", + "\n", + "\n", "## Authors\n", "C. E. Brasseur\n", "\n", @@ -22,22 +26,123 @@ "LINK TO FITS DOCUMENTATION\n", "\n", "## Summary\n", - "In this tutorial, we will download a data file, do something to it, and then\n", - "visualize it." + "\n", + "This is an advanced tutorial. If you don't want to know about the inner workings of the FITS format, just stop here. If you don't want to know but nevertheless neeed to, proceed with caution, that's how I started and now here I am writing this tutorial.\n", + "\n" ] }, { "cell_type": "markdown", - "id": "b1e73414-7111-4f27-9478-0cbc015276ca", + "id": "7dfd952d-56e3-4963-8540-f43020a7ea9e", "metadata": {}, - "source": [] + "source": [ + "## Building a large FITS file\n", + "\n", + "1. [Imports](#Imports)\n", + "2. [Primary HDU](#Primary-HDU)\n", + "3. [Large Image HDU](#Large-Image-HDU)\n", + "4. [Large Table HDU](#Large-Table-HDU)\n", + "5. [Adding an Extra Small HDU](#Adding-an-Extra-Small-HDU)\n", + "6. [Cleanup](#Cleanup)\n" + ] }, { "cell_type": "markdown", - "id": "7dfd952d-56e3-4963-8540-f43020a7ea9e", + "id": "5e0b583d", + "metadata": {}, + "source": [ + "https://docs.astropy.org/en/stable/generated/examples/io/skip_create-large-fits.html\n", + "\n", + "https://fits.gsfc.nasa.gov/fits_standard.html\n", + "\n", + "https://docs.python.org/3/library/mmap.html#mmap.mmap.madvise\n", + "\n", + "https://docs.python.org/3/library/mmap.html#madvise-constants\n", + "\n", + "https://man7.org/linux/man-pages/man2/madvise.2.html\n", + "\n", + "https://github.com/astropy/astropy/issues/1380\n", + "\n", + "https://github.com/astropy/astropy/pull/7597\n", + "\n", + "https://github.com/astropy/astropy/pull/7926" + ] + }, + { + "cell_type": "markdown", + "id": "e51c21c1", + "metadata": {}, + "source": [ + "## Imports" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "670f28e1", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "from time import time\n", + "\n", + "import numpy as np\n", + "\n", + "from astropy.io import fits\n", + "from astropy.table import Table\n", + "\n", + "from mmap import MADV_SEQUENTIAL" + ] + }, + { + "cell_type": "markdown", + "id": "97374113", + "metadata": {}, + "source": [ + "And since we're building a huge file, we'll write a little function to give us the file size in a whatever units we want." 
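(A note on the imports above: `MADV_SEQUENTIAL` only exists on platforms that expose the `madvise()` system call, so the `mmap` import can fail elsewhere. If you want the notebook to run everywhere, one option is a guarded import; a minimal sketch, where `HAVE_MADVISE` is just an illustrative name:)

```python
# Sketch: tolerate platforms without madvise() support.
try:
    from mmap import MADV_SEQUENTIAL
    HAVE_MADVISE = True
except ImportError:
    HAVE_MADVISE = False  # skip the madvise() call later on
```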
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab165db6", + "metadata": {}, + "outputs": [], + "source": [ + "def print_file_size(path, unit=\"B\"):\n", + " \n", + " size = os.path.getsize(path)\n", + " \n", + " if unit==\"KB\":\n", + " size /= 1e3\n", + " fmt = '.1f'\n", + " elif unit==\"MB\":\n", + " size /= 1e6\n", + " fmt = '.1f'\n", + " elif unit==\"GB\":\n", + " size /= 1e9\n", + " fmt = '.1f'\n", + " elif unit==\"FITS\":\n", + " size //= 2880\n", + " unit = \"FITS block(s)\"\n", + " fmt = 'd'\n", + " \n", + " else:\n", + " unit = \"Bytes\"\n", + " fmt = 'd'\n", + " \n", + " print(f\"{size:{fmt}} {unit}\")" + ] + }, + { + "cell_type": "markdown", + "id": "fc642cac", "metadata": {}, "source": [ - "## Building a large FITS file" + "## Primary HDU\n", + "\n", + "We're going to build this up as a properly formated multi-extension FITS file, so before we get into the matter of greating a masive FITS file we will build a basic primary header and write that to the file that will become our monster FITS file." ] }, { @@ -46,6 +151,823 @@ "id": "5f301e48-ab85-422d-817e-7926e4c1d6c8", "metadata": {}, "outputs": [], + "source": [ + "# Make some header entries for important information\n", + "primary_header_cards = [(\"ORIGIN\", 'Fancy Archive', \"Where the data came from\"),\n", + " (\"DATE\", '2024-03-05', \"Creation date\"),\n", + " (\"MJD\", 60374, \"Creation date in MJD\"),\n", + " (\"CREATOR\", 'Me', \"Who created this file\")] \n", + "\n", + "# Build the Primary HDU object and put it in an HDU list\n", + "primary_hdu = fits.PrimaryHDU(header=fits.Header(primary_header_cards))\n", + "hdu_list = fits.HDUList([primary_hdu])\n", + "\n", + "# Write the HDU list to file\n", + "big_fits_fle = \"./patagotitan.fits\"\n", + "hdu_list.writeto(big_fits_fle, overwrite=True)" + ] + }, + { + "cell_type": "markdown", + "id": "049580b1", + "metadata": {}, + "source": [ + "Before we continue let's verify our (currently tiny) FITS file is valid." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6c38858f", + "metadata": {}, + "outputs": [], + "source": [ + "with fits.open(big_fits_fle) as hdu_list:\n", + " hdu_list.info()" + ] + }, + { + "cell_type": "markdown", + "id": "8e14e379", + "metadata": {}, + "source": [ + "Checking out the current size, we see it's one FITS block." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "714965ad", + "metadata": {}, + "outputs": [], + "source": [ + "print_file_size(big_fits_fle)\n", + "print_file_size(big_fits_fle, \"FITS\")" + ] + }, + { + "cell_type": "markdown", + "id": "71cf02d6", + "metadata": {}, + "source": [ + "## Large Image HDU\n", + "\n", + "Here we are going to expand out FITS file to fit a large (40,000 x 40,000 pixel) image. This will cause the file to grow to ~13 GB in size. If that is too large for your system, adjust `array_dims` below. All of the steps still work as expected with smaller data, just there are simpler ways to do this if the whole FITS file fits in memory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "042b5a63", + "metadata": {}, + "outputs": [], + "source": [ + "array_dims = [40_000, 40_000]" + ] + }, + { + "cell_type": "markdown", + "id": "41ef407f", + "metadata": {}, + "source": [ + "First we build an ImageHDU object with a small data array. The data in the array does not matter because we won't be using it, but the data type needs to be correct, and you need to know how many bytes per entry goes with that data type. 
In this case we are maing a `float64` array, so each entry uses 8 bytes of memory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f9c8df42", + "metadata": {}, + "outputs": [], + "source": [ + "data = np.zeros((100, 100), dtype=np.float64)\n", + "hdu = fits.ImageHDU(data)" + ] + }, + { + "cell_type": "markdown", + "id": "43bac0e2", + "metadata": {}, + "source": [ + "Now we pull out just the header, and adjust the NAXIS keywords to matach our desired large aray dimensions. We also set an EXTNAME which is optional, but helpful because it allows us to refer to that extention by name as well as index. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c8d874e3", + "metadata": {}, + "outputs": [], + "source": [ + "header = hdu.header\n", + "\n", + "header[\"NAXIS2\"] = array_dims[0]\n", + "header[\"NAXIS1\"] = array_dims[1]\n", + "\n", + "header[\"EXTNAME\"] = 'BIG_IMG' " + ] + }, + { + "cell_type": "markdown", + "id": "943335ed", + "metadata": {}, + "source": [ + "Now we write just the header to the end of our soon to balloon FITS file (at the end of this step it is temporarily NOT a valid FITS file)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "93568d65", + "metadata": {}, + "outputs": [], + "source": [ + "with open(big_fits_fle, 'ab') as FITSFLE: # 'ab' means open to append bytes\n", + " FITSFLE.write(bytearray(header.tostring(), encoding=\"utf-8\"))" + ] + }, + { + "cell_type": "markdown", + "id": "1704c93b", + "metadata": {}, + "source": [ + "Now we calculate the number of bytes we need for our large array, remembering the result needs to be a multiple of 2880 bytes to conform to the FITS standard. Note the multiplication by 8 because our array is of type `float64` this would be adjusted for different data types." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0a494a5a", + "metadata": {}, + "outputs": [], + "source": [ + "arraysize_in_bytes = ((np.prod(array_dims) * 8 + 2880 - 1) // 2880) * 2880" + ] + }, + { + "cell_type": "markdown", + "id": "4a970eb1", + "metadata": {}, + "source": [ + "Now we need to expand the file by that many bytes. To do this we seek to the desired new end of the file and write a null byte." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f99cd09b", + "metadata": {}, + "outputs": [], + "source": [ + "filelen = os.path.getsize(big_fits_fle) \n", + " \n", + "with open(big_fits_fle, 'r+b') as FITSFLE:\n", + " FITSFLE.seek(filelen + arraysize_in_bytes - 1)\n", + " FITSFLE.write(b'\\0')" + ] + }, + { + "cell_type": "markdown", + "id": "5c168f65", + "metadata": {}, + "source": [ + "Now lets see how big our FITS file has become." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea9e1dd7", + "metadata": {}, + "outputs": [], + "source": [ + "print_file_size(big_fits_fle, \"GB\")\n", + "print_file_size(big_fits_fle, \"FITS\")" + ] + }, + { + "cell_type": "markdown", + "id": "c0c1e191", + "metadata": {}, + "source": [ + "So just about 13 GB as expected, and a lot more FITS blocks." + ] + }, + { + "cell_type": "markdown", + "id": "0910723e", + "metadata": {}, + "source": [ + "### Filling the big array\n", + "\n", + "Now we have a big ol' empty array, so lets put some stuff in it.\n", + "\n", + "Add some stuff about he different ways of openeing fits files and how memmap is now critical etc. 
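As an aside on the file-growing step above: instead of seeking past the end and writing a single null byte, you can reach the same end state with `truncate()`. This is a sketch of an equivalent alternative, assuming POSIX semantics where the extended region reads back as zeros; run it in place of the seek-and-write cell, not in addition to it:

```python
# Alternative way to grow the file to its new size in one call.
with open(big_fits_fle, 'r+b') as FITSFLE:
    FITSFLE.truncate(os.path.getsize(big_fits_fle) + arraysize_in_bytes)
```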
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bd09b429", + "metadata": {}, + "outputs": [], + "source": [ + "hdu_list = fits.open(big_fits_fle, mode='update', memmap=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6426bc2d", + "metadata": {}, + "outputs": [], + "source": [ + "hdu_list.info()" + ] + }, + { + "cell_type": "markdown", + "id": "a98664fb", + "metadata": {}, + "source": [ + "That's what we expect, and also this is a point where you find out if you've messed up this operations. FITS files don't have indices up front so the computer just has to scan through it (in chunks of 2880) looking for more extensions. By default the Astropy fits module does not do this until necessary (LINK TO DOCS), so it's at the point where we call the info function that we find out if out FITS file is still valid. If this operations hangs, most likely the array size calculation is wrong." + ] + }, + { + "cell_type": "markdown", + "id": "8ac94786", + "metadata": {}, + "source": [ + "We'll pull out the large data array, and then fill it in a loop. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3be4e66f", + "metadata": {}, + "outputs": [], + "source": [ + "data_array = hdu_list[1].data" + ] + }, + { + "cell_type": "markdown", + "id": "42698c9a", + "metadata": {}, + "source": [ + "If you are on a system with the `madvise` call (you're on your own figuring that out), you can set madvise to MADV_SEQUENTIAL for the data_array. This tells the memory mapping that you are going to be accessing the array in a sequential manner and allows it to be more efficient in how it handles memory allocation based on that. (Obviously don't set this if you aren't going to be accessing the array sequentially)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "85eb00fe", + "metadata": {}, + "outputs": [], + "source": [ + "mm = fits.util._get_array_mmap(data_array)\n", + "mm.madvise(MADV_SEQUENTIAL)" + ] + }, + { + "cell_type": "markdown", + "id": "306ca830", + "metadata": {}, + "source": [ + "Now we fill the large array in blocks. We want the block size to comfortably fit in memory. The block_size I am using yields an ~1.3 GB array, adjust as your system requires." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "266125bb", + "metadata": {}, + "outputs": [], + "source": [ + "block_size = 4000\n", + "\n", + "it = time()\n", + "for i,j in enumerate(range(0, array_dims[0], block_size)):\n", + " sub_arr = np.ones((block_size,array_dims[1]))*i\n", + " data_array[j:j+block_size,:] = sub_arr\n", + " print(f\"{i}: {time()-it:.0f} sec\")\n", + " it = time()" + ] + }, + { + "cell_type": "markdown", + "id": "900aa92e", + "metadata": {}, + "source": [ + "Note the differing times for the loops. DO I KNOW WHY????" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "274cee30", + "metadata": {}, + "outputs": [], + "source": [ + "hdu_list.close()" + ] + }, + { + "cell_type": "markdown", + "id": "54875a56", + "metadata": {}, + "source": [ + "### Checking the file contents\n", + "\n", + "So now we've theoretically filled the elephantine array, but we want to make sure it actually got filled and save. So we'll open the file in a non-editable mode and check." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c452f3c6", + "metadata": {}, + "outputs": [], + "source": [ + "hdu_list = fits.open(big_fits_fle, mode='denywrite', memmap=True)\n", + "\n", + "data_array = hdu_list[1].data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ba09b5a", + "metadata": {}, + "outputs": [], + "source": [ + "it = time()\n", + "for i,j in enumerate(range(0, array_dims[0], block_size)):\n", + " print(f\"{i}: Data match is {(data_array[j:j+block_size,:] == i).all()}: {time()-it:.0f} sec\")\n", + " it = time()\n", + " \n", + "hdu_list.close()" + ] + }, + { + "cell_type": "markdown", + "id": "4a7298d9", + "metadata": {}, + "source": [ + "## Large Table HDU\n", + "\n", + "In the last section we expanded our FITS file to add a colossal image extension, in this section we will do the same for a table extension. The method is similar, but with a few key differences.\n", + "\n", + "As with the image the data is not important but the data types are. In particular, the maximum string length for columns cannot be changed one the fly (since the memory has been allocated and is fixed)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f1b4c31e", + "metadata": {}, + "outputs": [], + "source": [ + "small_tbl = Table(names=[\"Name\", \"Population\", \"Prince\", \"Years since fall\", \"Imports\", \"Exports\"],\n", + " dtype=['U128', int, 'U128', np.float64, 'U2048', 'U2048'],\n", + " rows=[[\"Vangaveyave\", 1297382, \"Oriana\", 34.6, \"wine, cheese\", \"ahalo cloth, pearls, foamwork\"],\n", + " [\"Azilint\", 50000, \"n/a\", 92.3, \"none\", \"none\"],\n", + " [\"Amboloyo\", 50937253, \"Rufus\", 1504.2, \"pears, textiles, spices\", \"wine, timber\"]])\n", + "\n", + "table_hdu = fits.BinTableHDU(data=small_tbl)\n", + "table_hdu.header[\"EXTNAME\"] = \"BIG_TABLE\"" + ] + }, + { + "cell_type": "markdown", + "id": "6be82b91", + "metadata": {}, + "source": [ + "The header for this table HDU gives us the information to determine how many bytes we need for our mammoth table." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "528aa250", + "metadata": {}, + "outputs": [], + "source": [ + "table_hdu.header" + ] + }, + { + "cell_type": "markdown", + "id": "535ee86c", + "metadata": {}, + "source": [ + "The `NAXIS#` keywords hold the dimensions of the table where `NAXIS1` is the length of a single table row in bytes and `NAXIS2` is the number of rows in the table. So to get the total size of the jumbo table in bytes we simply multiply `NAXIS1` by the number of rows desired (adjusting for FITS blocksize). I'm choosing a million rows which is about 4GB, adjust as necessary for your system." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b42bae4b", + "metadata": {}, + "outputs": [], + "source": [ + "num_rows = 1_000_000\n", + "tablesize_in_bytes = ((table_hdu.header[\"NAXIS1\"]*num_rows + 2880 - 1) // 2880) * 2880" + ] + }, + { + "cell_type": "markdown", + "id": "95a95ad0", + "metadata": {}, + "source": [ + "Now we adjust the `NAXIS2` keyword to match our new table length and write just the header to the end of our towering FITS file." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "13c3c36e", + "metadata": {}, + "outputs": [], + "source": [ + "table_hdu.header[\"NAXIS2\"] = num_rows\n", + "\n", + "with open(big_fits_fle, 'ab') as FITSFLE:\n", + " FITSFLE.write(bytearray(table_hdu.header.tostring(), encoding=\"utf-8\"))" + ] + }, + { + "cell_type": "markdown", + "id": "a8ea704c", + "metadata": {}, + "source": [ + "Before we expand the file, lets remind ourself of the current filesize." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d86193af", + "metadata": {}, + "outputs": [], + "source": [ + "print_file_size(big_fits_fle, \"GB\")" + ] + }, + { + "cell_type": "markdown", + "id": "e9579e19", + "metadata": {}, + "source": [ + "Now, just as for the vast data array, we seek `tablesize_in_bytes` beyond the current end of the file and write a null byte." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b51a1aec", + "metadata": {}, + "outputs": [], + "source": [ + "filelen = os.path.getsize(big_fits_fle)\n", + "\n", + "with open(big_fits_fle, 'r+b') as FITSFLE:\n", + " FITSFLE.seek(filelen + tablesize_in_bytes - 1)\n", + " FITSFLE.write(b'\\0')" + ] + }, + { + "cell_type": "markdown", + "id": "1180aa5d", + "metadata": {}, + "source": [ + "And we can see that the filesize has indeed increased by about 4GB." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8f9a7371", + "metadata": {}, + "outputs": [], + "source": [ + "print_file_size(big_fits_fle, \"GB\")" + ] + }, + { + "cell_type": "markdown", + "id": "b44ff84f", + "metadata": {}, + "source": [ + "### Adding data to the titanic table\n", + "\n", + "We can now open the prodigeous FITS file in update mode and fill in our table. Note that this time we don't advise the memory mapper we will be accessing the memory in sequential order, because we are not doing that.\n", + "\n", + "CHANGE THIS NOW I KNOW IT'S STORED ROW BY ROW\n", + "\n", + "\n", + "ALSO CAN FILL ROW BY ROW\n", + "\n", + "In [6]: hdu.data\n", + "Out[6]: \n", + "FITS_rec([(1, 1., 'c'), (2, 2., 'd'), (3, 3., 'e')],\n", + " dtype=(numpy.record, [('a', ' Date: Fri, 5 Apr 2024 16:10:20 +0100 Subject: [PATCH 2/2] big_fits tutorial --- tutorials/FITS-large/FITS-large.ipynb | 412 +++++++++++++------------- tutorials/FITS-large/requirements.txt | 2 + 2 files changed, 209 insertions(+), 205 deletions(-) create mode 100644 tutorials/FITS-large/requirements.txt diff --git a/tutorials/FITS-large/FITS-large.ipynb b/tutorials/FITS-large/FITS-large.ipynb index 2451558e..e5deb248 100644 --- a/tutorials/FITS-large/FITS-large.ipynb +++ b/tutorials/FITS-large/FITS-large.ipynb @@ -7,8 +7,7 @@ "source": [ "# Working with large FITS files\n", "\n", - "This tutorial builds on this guide to [Create a very large FITS file from scratch](https://docs.astropy.org/en/stable/generated/examples/io/skip_create-large-fits.html) and shows how to buid a large fits file with multiple HDUs and with the big one not being the last. It is aimed at users already quite familiar with the FITS format.\n", - "\n", + "This tutorial, built on the [Create a very large FITS file from scratch](https://docs.astropy.org/en/stable/generated/examples/io/skip_create-large-fits.html) guide, works through building a very large (too large to fit in memory) FITS file with multiple HDUs. It covers creating both large Image and Table extensions. 
It is aimed at users already quite familiar with the FITS format.\n", "\n", "\n", "## Authors\n", @@ -16,19 +15,24 @@ "\n", "## Learning Goals\n", "* Build a *large* FITS file (*large* means is too large to fit in memory all at once)\n", - "* Access data from a *large* FITS file\n", - "* Modify a *large* FITS file\n", + "* Make a *large* FITS Image extension\n", + "* Make a *large* FITS Table extension\n", "\n", "## Keywords\n", - "Example, example, example\n", + "FITS, file input/output, memory mapping\n", "\n", "## Companion Content\n", - "LINK TO FITS DOCUMENTATION\n", + "- [FITS Standard](https://fits.gsfc.nasa.gov/fits_standard.html)\n", + "- [Astropy FITS Documentation](https://docs.astropy.org/en/stable/io/fits/)\n", + "- [Python madvise documentation](https://docs.python.org/3/library/mmap.html#mmap.mmap.madvise)\n", + "- [Madvise system call documentation](https://man7.org/linux/man-pages/man2/madvise.2.html)\n", + "- [Create a very large FITS file from scratch](https://docs.astropy.org/en/stable/generated/examples/io/skip_create-large-fits.html)\n", "\n", "## Summary\n", "\n", - "This is an advanced tutorial. If you don't want to know about the inner workings of the FITS format, just stop here. If you don't want to know but nevertheless neeed to, proceed with caution, that's how I started and now here I am writing this tutorial.\n", - "\n" + "This is an advanced tutorial. We will be building a very large multi-extension FITS file from scratch, going through both how to create Images and Arrays too large to fit in memory, and how to fill those structures once created.\n", + "\n", + "If you don't want to know about the inner workings of the FITS format, just stop here. If you don't want to know but nevertheless neeed to, proceed with caution, that's how I started and now here I am writing this tutorial. " ] }, { @@ -43,29 +47,18 @@ "3. [Large Image HDU](#Large-Image-HDU)\n", "4. [Large Table HDU](#Large-Table-HDU)\n", "5. [Adding an Extra Small HDU](#Adding-an-Extra-Small-HDU)\n", - "6. [Cleanup](#Cleanup)\n" - ] - }, - { - "cell_type": "markdown", - "id": "5e0b583d", - "metadata": {}, - "source": [ - "https://docs.astropy.org/en/stable/generated/examples/io/skip_create-large-fits.html\n", - "\n", - "https://fits.gsfc.nasa.gov/fits_standard.html\n", - "\n", - "https://docs.python.org/3/library/mmap.html#mmap.mmap.madvise\n", - "\n", - "https://docs.python.org/3/library/mmap.html#madvise-constants\n", + "6. [Cleanup](#Cleanup)\n", "\n", - "https://man7.org/linux/man-pages/man2/madvise.2.html\n", + "### But before we begin...\n", "\n", - "https://github.com/astropy/astropy/issues/1380\n", + "There are a few things we need to know about the FITS format so I will collect them there. And if you are either now, or in the future, wondering \"why is it like that\" the answer is that the FITS format was originally designed and optimised for magnetic tape. This means that the FITS format was originally designed for sequentual reading rather than random access, so the FITS format has no index (listing at what bytes various parts of the file start) but instead is formatted in 2880 byte chunks so that a tape reader head can simply skip forward by 2880 bytes repeatedly and check if a new section has begun. This is mostly trivia for us as modern users, but there are a few implications. Firstly, when reading FITS files the Astropy FITS module by default reads them \"lazily\" meaning that it does not tabulate all the extensions until it needs to (i.e. 
when the user requests a specific extension or calls the `info()` function). Secondly, and most crucially for this tutorial, when creating FITS extension manually, the most critical part of creating a new *valid* FITS extension is making sure the number of bytes is a multiple of 2880. These are of course but a few of the quirks of the FITS format, to read about all of them in their full and eccentric glory, see the [FITS standard](https://fits.gsfc.nasa.gov/fits_standard.html) document.\n", "\n", - "https://github.com/astropy/astropy/pull/7597\n", - "\n", - "https://github.com/astropy/astropy/pull/7926" + "**A short review of terminology:**\n", + "- The basic block of a FITS file is called a Header Data Unit (HDU).\n", + "- Each HDU contains two elements, the header and the data.\n", + "- A FITS file consistes of one or more HDUs.\n", + "- Astropy represents a FITS file as an HDUList, where each extension is an HDU of a specific type (i.e PrimaryHDU, ImageHDU, etc).\n", + "- For more details see the [Astropy FITS Documentation](https://docs.astropy.org/en/stable/io/fits/)." ] }, { @@ -73,7 +66,9 @@ "id": "e51c21c1", "metadata": {}, "source": [ - "## Imports" + "## Imports\n", + "\n", + "We don't need many modules for this. The central one is of course `astropy.io.fits`; the `mmap` import helps with efficiency, but is not available on all systems, and is ultimately not essential. So if yours is a system without it never fear, you can just comment out those lines." ] }, { @@ -100,7 +95,9 @@ "id": "97374113", "metadata": {}, "source": [ - "And since we're building a huge file, we'll write a little function to give us the file size in a whatever units we want." + "### A little helper function\n", + "\n", + "Because this tutorial is all about building a huge file, we'll write a little function to print the file size in a a variety of units." ] }, { @@ -124,9 +121,9 @@ " size /= 1e9\n", " fmt = '.1f'\n", " elif unit==\"FITS\":\n", - " size //= 2880\n", + " size /= 2880\n", " unit = \"FITS block(s)\"\n", - " fmt = 'd'\n", + " fmt = '.1f'\n", " \n", " else:\n", " unit = \"Bytes\"\n", @@ -142,7 +139,7 @@ "source": [ "## Primary HDU\n", "\n", - "We're going to build this up as a properly formated multi-extension FITS file, so before we get into the matter of greating a masive FITS file we will build a basic primary header and write that to the file that will become our monster FITS file." + "In this tutorial we are going to build a properly formated multi-extension FITS file, so before we get into the matter of creating a massive FITS file we will build a basic Primary HDU, and write it to file where it will be the basis for our monstrous FITS file." ] }, { @@ -172,7 +169,7 @@ "id": "049580b1", "metadata": {}, "source": [ - "Before we continue let's verify our (currently tiny) FITS file is valid." + "Before we continue let's verify our (currently tiny) FITS file is valid. We will do this by calling the `info()` function which will hang if the file is not a valid FITS format." ] }, { @@ -191,7 +188,7 @@ "id": "8e14e379", "metadata": {}, "source": [ - "Checking out the current size, we see it's one FITS block." + "Let's also look at the file size in bytes and FITS blocks. Note that it is exactly one FITS block." ] }, { @@ -212,7 +209,9 @@ "source": [ "## Large Image HDU\n", "\n", - "Here we are going to expand out FITS file to fit a large (40,000 x 40,000 pixel) image. This will cause the file to grow to ~13 GB in size. If that is too large for your system, adjust `array_dims` below. 
All of the steps still work as expected with smaller data, just there are simpler ways to do this if the whole FITS file fits in memory." + "Now we get to the meat of the tutorial. We are going to expand out the FITS file to fit a 40,000 x 40,000 pixel image (~13 GB).\n", + "\n", + "*Note*: If this is problamatically big for your system, adjust `array_dims` below. All of the steps will still work as expected with smaller data, it's simply an unnessesarily complex method when dealing with data sizes that fit in memory." ] }, { @@ -230,7 +229,7 @@ "id": "41ef407f", "metadata": {}, "source": [ - "First we build an ImageHDU object with a small data array. The data in the array does not matter because we won't be using it, but the data type needs to be correct, and you need to know how many bytes per entry goes with that data type. In this case we are maing a `float64` array, so each entry uses 8 bytes of memory." + "First we build an ImageHDU object with a small data array. The data in the array does not matter because we won't be using it, but the data type needs to be correct, and you need to take note of how many bytes per element goes with that data type. In this example we are building a `float64` array, so each element uses 8 bytes of memory." ] }, { @@ -249,7 +248,9 @@ "id": "43bac0e2", "metadata": {}, "source": [ - "Now we pull out just the header, and adjust the NAXIS keywords to matach our desired large aray dimensions. We also set an EXTNAME which is optional, but helpful because it allows us to refer to that extention by name as well as index. " + "Now we pull out just the header, and adjust the NAXIS keywords to match our desired giant-array dimensions. This is a critical step, it is telling the FITS file how large the data array is and must match the data array size we add.\n", + "\n", + "We also set an EXTNAME which is optional, but helpful because it allows us to refer to that extension by name as well as index. " ] }, { @@ -272,7 +273,9 @@ "id": "943335ed", "metadata": {}, "source": [ - "Now we write just the header to the end of our soon to balloon FITS file (at the end of this step it is temporarily NOT a valid FITS file)." + "The next step is to write *just the header* to the end of our soon to balloon FITS file.\n", + "\n", + "*Note:* At the end of this step our file is temporarily NOT a valid FITS file." ] }, { @@ -291,7 +294,9 @@ "id": "1704c93b", "metadata": {}, "source": [ - "Now we calculate the number of bytes we need for our large array, remembering the result needs to be a multiple of 2880 bytes to conform to the FITS standard. Note the multiplication by 8 because our array is of type `float64` this would be adjusted for different data types." + "Now we calculate the number of bytes we need for our gargantuan array, remembering that the result *must* to be a multiple of 2880 bytes to conform to the FITS standard. \n", + "\n", + "*Note:* The `astype(np.int64)` is not necessary on all systems, but some still default to int32 and therefor throw an overflow error. " ] }, { @@ -301,7 +306,8 @@ "metadata": {}, "outputs": [], "source": [ - "arraysize_in_bytes = ((np.prod(array_dims) * 8 + 2880 - 1) // 2880) * 2880" + "elt_size = 8 # Bytes needed for an array element\n", + "arraysize_in_bytes = ((np.prod(array_dims).astype(np.int64) * elt_size + 2880 - 1) // 2880) * 2880" ] }, { @@ -350,7 +356,7 @@ "id": "c0c1e191", "metadata": {}, "source": [ - "So just about 13 GB as expected, and a lot more FITS blocks." + "So just about 13 GB as expected. 
And a lot more FITS blocks, but still an exact multiple of the FITS block size which is what we want to see." ] }, { @@ -360,9 +366,15 @@ "source": [ "### Filling the big array\n", "\n", - "Now we have a big ol' empty array, so lets put some stuff in it.\n", + "Now we have a big ol' empty array that we want to put some stuff in. At this point we are working with a gigantic file, so we need to start being careful we don't ask for it to be loaded wholesale into memory (if you *do* try to do that, it won't break anything you will just get a `Cannot allocate memory` error). \n", + "\n", + "What this means is that the memmap argument must be set to True. This is usually the default (unless you have changed it in your astropy confuguration settings). There are also a number of modes the file can be opened with:\n", + "- `readonly`: Default behavior. Opens the file in readonly mode, meaning that to save any changes you need to write to a whole new file (or overwrite the existing one). This means that while with `memmap=True` the entire FITS file is not loaded into memory, the system is prepared to load it all in memory if the user changes something, and so will still throw an error if it is not *possible* to allocate memory for the whole file even as it does not allocate it at the minute. For big files where this is not possible it will fall back to `denywrite` mode and produce a warning.\n", + "- `denywrite`: This is similar to readonly except that it does not allow the FITS object to be altered and then written to a new file. For our tremendous FITS file we will use this mode when we want to access but not change the file.\n", + "- `update`: This mode allows a file to be updated in place.\n", + "- `append`: This allows more extensions to be added to an existing FITS file, but doe *not* allow changing data already in the file when it is opened. \n", "\n", - "Add some stuff about he different ways of openeing fits files and how memmap is now critical etc. " + "We will use the `update` mode to fill our immense array in place." ] }, { @@ -376,21 +388,21 @@ ] }, { - "cell_type": "code", - "execution_count": null, - "id": "6426bc2d", + "cell_type": "markdown", + "id": "a98664fb", "metadata": {}, - "outputs": [], "source": [ - "hdu_list.info()" + "Before we comence we will call the `info()` function to verify that our FITS file is valid and the enormous array is the size we expect." ] }, { - "cell_type": "markdown", - "id": "a98664fb", + "cell_type": "code", + "execution_count": null, + "id": "6426bc2d", "metadata": {}, + "outputs": [], "source": [ - "That's what we expect, and also this is a point where you find out if you've messed up this operations. FITS files don't have indices up front so the computer just has to scan through it (in chunks of 2880) looking for more extensions. By default the Astropy fits module does not do this until necessary (LINK TO DOCS), so it's at the point where we call the info function that we find out if out FITS file is still valid. If this operations hangs, most likely the array size calculation is wrong." + "hdu_list.info()" ] }, { @@ -398,7 +410,7 @@ "id": "8ac94786", "metadata": {}, "source": [ - "We'll pull out the large data array, and then fill it in a loop. " + "Pulling out the majestic array for convenience." ] }, { @@ -416,7 +428,9 @@ "id": "42698c9a", "metadata": {}, "source": [ - "If you are on a system with the `madvise` call (you're on your own figuring that out), you can set madvise to MADV_SEQUENTIAL for the data_array. 
This tells the memory mapping that you are going to be accessing the array in a sequential manner and allows it to be more efficient in how it handles memory allocation based on that. (Obviously don't set this if you aren't going to be accessing the array sequentially)." + "If you are on a system with the `madvise` call (you're on your own figuring that out), you can set madvise to `MADV_SEQUENTIAL` for the `data_array`. This tells the memory mapping that you are going to be accessing the array in a sequential manner and allows it to be more efficient in how it handles memory allocation based on that. How much this actually affects the time it takes to perform the filling operation will depend on your specific system and the array sizes you are working with.\n", + "\n", + "*Note:* If you change how you fill your array to something not sequential, don't set this." ] }, { @@ -435,42 +449,38 @@ "id": "306ca830", "metadata": {}, "source": [ - "Now we fill the large array in blocks. We want the block size to comfortably fit in memory. The block_size I am using yields an ~1.3 GB array, adjust as your system requires." + "Now we fill the jumbo array block by block. We want the block size to comfortably fit in memory. The `block_size` I am using yields an ~1.3 GB array, adjust as your system requires.\n", + "\n", + "We also print the time it take for every block. If you have the `MADV_SEQUENTIAL` flag set, the individual block fill operation will generally take longer, and the close operation quite fast, while if the `MADV_SEQUENTIAL` flag is not set the reverse is generally true. This is because in the first case, the data is being flushed to disk at once, while in the second it builds up untill the system needs more memory or the file is closed and it writes it all at once. Which is more efficent on your setup will vary with block and file size." ] }, { "cell_type": "code", "execution_count": null, - "id": "266125bb", + "id": "bb2a0903", "metadata": {}, "outputs": [], "source": [ "block_size = 4000\n", "\n", - "it = time()\n", + "tt = 0\n", "for i,j in enumerate(range(0, array_dims[0], block_size)):\n", + " it = time()\n", + " \n", " sub_arr = np.ones((block_size,array_dims[1]))*i\n", " data_array[j:j+block_size,:] = sub_arr\n", - " print(f\"{i}: {time()-it:.0f} sec\")\n", - " it = time()" - ] - }, - { - "cell_type": "markdown", - "id": "900aa92e", - "metadata": {}, - "source": [ - "Note the differing times for the loops. DO I KNOW WHY????" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "274cee30", - "metadata": {}, - "outputs": [], - "source": [ - "hdu_list.close()" + " \n", + " tm = time()-it\n", + " print(f\"{i}: {tm:.0f} sec\")\n", + " tt += tm\n", + " \n", + "it = time()\n", + "hdu_list.close()\n", + "tm = time()-it\n", + "print(f\"Closing: {tm:.0f} sec\")\n", + "tt += tm\n", + " \n", + "print(f\"Total fill time: {tt/60:.1f} min\")" ] }, { @@ -480,7 +490,7 @@ "source": [ "### Checking the file contents\n", "\n", - "So now we've theoretically filled the elephantine array, but we want to make sure it actually got filled and save. So we'll open the file in a non-editable mode and check." + "We've filled our outsize array, but we want to make sure that it is correct. So we'll open the elephantine file in `denywrite` mode and check." 
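(A related trick for everyday access: if you only ever need a small piece of the behemoth image rather than a full scan, the `section` attribute of an image HDU reads just that slice from disk. A quick sketch, with arbitrary indices:)

```python
# Read a small cutout without loading or scanning the whole 13 GB array.
with fits.open(big_fits_fle, mode='denywrite', memmap=True) as hdus:
    cutout = hdus['BIG_IMG'].section[8000:8010, 0:10]  # 'BIG_IMG' is the EXTNAME set earlier
    print(cutout)
```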
] }, { @@ -491,7 +501,6 @@ "outputs": [], "source": [ "hdu_list = fits.open(big_fits_fle, mode='denywrite', memmap=True)\n", - "\n", "data_array = hdu_list[1].data" ] }, @@ -519,7 +528,7 @@ "\n", "In the last section we expanded our FITS file to add a colossal image extension, in this section we will do the same for a table extension. The method is similar, but with a few key differences.\n", "\n", - "As with the image the data is not important but the data types are. In particular, the maximum string length for columns cannot be changed one the fly (since the memory has been allocated and is fixed)." + "As with the mighty array, we start by making a small table where the specific data is not important but the data types are. In particular, the maximum string length for columns cannot be changed on the fly, so any string columns must be given the maximum number of characters needed." ] }, { @@ -532,8 +541,11 @@ "small_tbl = Table(names=[\"Name\", \"Population\", \"Prince\", \"Years since fall\", \"Imports\", \"Exports\"],\n", " dtype=['U128', int, 'U128', np.float64, 'U2048', 'U2048'],\n", " rows=[[\"Vangaveyave\", 1297382, \"Oriana\", 34.6, \"wine, cheese\", \"ahalo cloth, pearls, foamwork\"],\n", - " [\"Azilint\", 50000, \"n/a\", 92.3, \"none\", \"none\"],\n", - " [\"Amboloyo\", 50937253, \"Rufus\", 1504.2, \"pears, textiles, spices\", \"wine, timber\"]])\n", + " [\"Tkinele\", 50000, \"n/a\", 92.3, \"none\", \"none\"],\n", + " [\"Amboloyo\", 50937253, \"Rufus\", 1504.2, \"pears, textiles, spices\", \"wine, timber\"],\n", + " [\"Xiputl\", 3627373, \"Anastasiya\", 346.8, \"silk, perfumes, pigments\", \"stone, cotton\"],\n", + " [\"Old Damara\", 437226732, \"Melissa Damara\", 25.3, \"wool, timber\", \"spices, silk\"],\n", + " [\"Western Dair\", 8045728302, \"Belu\", 876.3, \"shellfish, salt\", \"cured meat, wool\"]])\n", "\n", "table_hdu = fits.BinTableHDU(data=small_tbl)\n", "table_hdu.header[\"EXTNAME\"] = \"BIG_TABLE\"" @@ -562,7 +574,7 @@ "id": "535ee86c", "metadata": {}, "source": [ - "The `NAXIS#` keywords hold the dimensions of the table where `NAXIS1` is the length of a single table row in bytes and `NAXIS2` is the number of rows in the table. So to get the total size of the jumbo table in bytes we simply multiply `NAXIS1` by the number of rows desired (adjusting for FITS blocksize). I'm choosing a million rows which is about 4GB, adjust as necessary for your system." + "The `NAXIS1` keyword gives the length of a single table row in bytes, and the `NAXIS2` keyword holds the number of rows in the table. So to get the total size of the humongous table in bytes we simply multiply `NAXIS1` by the number of rows desired (adjusting for FITS block size). Here I choose a million rows which is about 4GB, adjust as necessary for your system." ] }, { @@ -581,7 +593,7 @@ "id": "95a95ad0", "metadata": {}, "source": [ - "Now we adjust the `NAXIS2` keyword to match our new table length and write just the header to the end of our towering FITS file." + "Now we adjust the `NAXIS2` keyword to match our new table length and write just the header to the end of our towering FITS file, as we did for the oversize array extension." ] }, { @@ -602,7 +614,7 @@ "id": "a8ea704c", "metadata": {}, "source": [ - "Before we expand the file, lets remind ourself of the current filesize." + "Before we expand the file, we'll look at the current file size." 
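(To see the fixed column widths just discussed, you can print each column's FITS format code; the 'U128' columns should appear as fixed-width '128A' character fields, following astropy's usual unicode-to-ASCII conversion:)

```python
# On-disk layout of each column in the binary table extension.
for col in table_hdu.columns:
    print(f"{col.name}: {col.format}")
```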
] }, { @@ -620,7 +632,7 @@ "id": "e9579e19", "metadata": {}, "source": [ - "Now, just as for the vast data array, we seek `tablesize_in_bytes` beyond the current end of the file and write a null byte." + "Now, just as for the vast array, we seek `tablesize_in_bytes` beyond the current end of the file and write a null byte." ] }, { @@ -662,67 +674,28 @@ "source": [ "### Adding data to the titanic table\n", "\n", - "We can now open the prodigeous FITS file in update mode and fill in our table. Note that this time we don't advise the memory mapper we will be accessing the memory in sequential order, because we are not doing that.\n", - "\n", - "CHANGE THIS NOW I KNOW IT'S STORED ROW BY ROW\n", - "\n", - "\n", - "ALSO CAN FILL ROW BY ROW\n", - "\n", - "In [6]: hdu.data\n", - "Out[6]: \n", - "FITS_rec([(1, 1., 'c'), (2, 2., 'd'), (3, 3., 'e')],\n", - " dtype=(numpy.record, [('a', '