Spacewatch cropped images #54

Closed
araichoor opened this issue Aug 11, 2023 · 15 comments · Fixed by #58

@araichoor

If possible, it would be nice if the spacewatch images could be transferred to NERSC.
Note that there is no rush for that.

I use those images to create per-night spacewatch movies of what desi observes.
see the "Per-night spacewatch" tab here: https://data.desi.lbl.gov/desi/users/raichoor/main-status/main-status.html.

For a given night YYYYMMDD, I currently download those images on my laptop from:
https://varuna.kpno.noirlab.edu/allsky-all/images/cropped/$YYYY/$MM/$DD/
and then scp them to:
https://data.desi.lbl.gov/desi/users/raichoor/images/cropped/$YYYY/$MM/$DD/

Technical details:

  • there is a new image every two minutes;
  • a given image is typically ~100-200 KB;
  • a full year of images should take ~50 GB (see e.g. $DESI_ROOT/users/raichoor/images/cropped/2022, though I didn't download images for Jul.-Aug.);
  • frequency: I typically use the images when I run my status page, so the download would be on a daily basis; I have to wait for the pipeline afterburner to run, so any time before that would be great; for instance, one could make a daily download, just before or after launching the afterburner, of the images from the last 24 or 48 hours;
  • of course, one needs to properly handle the time zone differences (Pacific, MST, UTC).
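
As a rough cross-check of the numbers in the list above, the sketch below estimates the yearly volume from the stated cadence and file size, and builds the per-day source URL. The 150 KB midpoint, the assumption that images arrive around the clock, and the helper names are illustrative choices, not part of any existing code.

```python
from datetime import date

BASE_URL = "https://varuna.kpno.noirlab.edu/allsky-all/images/cropped"


def images_per_day(cadence_minutes: int = 2) -> int:
    """Number of images in 24 hours at the stated cadence."""
    return 24 * 60 // cadence_minutes


def yearly_volume_gb(mean_size_kb: float = 150.0, days: int = 365) -> float:
    """Approximate yearly volume in GB (1 GB = 1e6 KB), assuming images around the clock."""
    return images_per_day() * mean_size_kb * days / 1e6


def source_url(day: date) -> str:
    """Directory URL for one calendar day, following the $YYYY/$MM/$DD layout."""
    return f"{BASE_URL}/{day:%Y/%m/%d}/"


print(images_per_day())
print(yearly_volume_gb(100.0), yearly_volume_gb(200.0))
print(source_url(date(2023, 8, 13)))
```

With 720 images per day, file sizes of 100 to 200 KB give roughly 26 to 53 GB per year, which is consistent with the ~50 GB quoted above.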

Thanks!

@weaverba137 weaverba137 self-assigned this Aug 14, 2023
@weaverba137 weaverba137 added the enhancement New feature or request label Aug 14, 2023
@weaverba137
Member

@araichoor, the good news is that we don't have to open a new firewall port for this, since varuna.kpno.noirlab.edu appears to be completely public anyway. However, since this download would not be based on rsync, it's not as simple as adding additional configuration to existing code. I have several questions:

  1. $DESI_ROOT/users/raichoor/images/cropped/2022 doesn't appear to exist. Could you please check that path?
  2. For the sake of data integrity, I have to ask, when you downloaded these images, did you preserve modification time?
  3. Is there an expectation that we would not download images during certain times of the year like in 2022? If so, how do we express that programmatically?
  4. My experience is that the time at which the pipeline afterburner runs is not necessarily predictable, and it is especially not predictable right now. Would it be acceptable to simply set a certain fixed time of day, 12:00 MST, for example?

@sbailey, please also comment, especially on timing with respect to the pipeline.

@weaverba137
Member

Notes for parsing the Last-Modified HTTP header.

Python 3.10.12 (main, Jun 10 2023, 10:51:02) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> r = requests.head('https://varuna.kpno.noirlab.edu/allsky-all/images/cropped/2023/08/13/20230813_002605.jpg')
>>> r.headers
{'Date': 'Mon, 14 Aug 2023 16:11:37 GMT', 'Server': 'Apache/2.4.57 (Fedora Linux) OpenSSL/1.1.1q', 'Last-Modified': 'Sun, 13 Aug 2023 00:26:24 GMT', 'ETag': '"274ff-602c2fef3949c"', 'Accept-Ranges': 'bytes', 'Content-Length': '161023', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'image/jpeg'}
>>> import datetime
>>> r.headers['Last-Modified']
'Sun, 13 Aug 2023 00:26:24 GMT'
>>> try:
...     utc = datetime.UTC
... except AttributeError:
...     # datetime.UTC was added in Python 3.11
...     import pytz
...     utc = pytz.UTC
>>>
>>> datetime.datetime.strptime(r.headers['Last-Modified'], '%a, %d %b %Y %H:%M:%S %Z').replace(tzinfo=utc)
datetime.datetime(2023, 8, 13, 0, 26, 24, tzinfo=<UTC>)
>>> 
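
For comparison, here is a minimal sketch of the same parsing step using only the standard library. It relies on `email.utils.parsedate_to_datetime`, which avoids both the `pytz` fallback and the locale-dependent `%a`/`%b` directives of `strptime`. The function name `parse_last_modified` is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_last_modified(value: str) -> datetime:
    """Parse an HTTP date such as 'Sun, 13 Aug 2023 00:26:24 GMT' into an aware UTC datetime."""
    parsed = parsedate_to_datetime(value)
    if parsed.tzinfo is None:
        # Assumption: a value without zone information is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


stamp = parse_last_modified("Sun, 13 Aug 2023 00:26:24 GMT")
print(stamp.isoformat())
```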

@weaverba137
Member

Further code snippets.

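import os
from datetime import datetime, timezone
timestamp = int(datetime.strptime(url_time, last_updated_pattern).replace(tzinfo=timezone.utc).timestamp())  # treat the parsed time as UTC, not local
os.utime(file_name, (timestamp, timestamp))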
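
To make the snippet above concrete, the following sketch wraps it in a function and exercises it on a temporary file. The names `apply_last_modified` and `LAST_MODIFIED_PATTERN` are assumptions for illustration. The pattern is the one from the interactive session earlier in this thread, and the code assumes an English or C locale for the `%a` and `%b` directives.

```python
import os
import tempfile
from datetime import datetime, timezone

LAST_MODIFIED_PATTERN = "%a, %d %b %Y %H:%M:%S %Z"


def apply_last_modified(file_name: str, last_modified: str) -> int:
    """Set atime and mtime of file_name from a Last-Modified value; return the POSIX timestamp."""
    parsed = datetime.strptime(last_modified, LAST_MODIFIED_PATTERN)
    # strptime returns a naive datetime; attach UTC explicitly so that
    # .timestamp() does not interpret it in the local time zone.
    timestamp = int(parsed.replace(tzinfo=timezone.utc).timestamp())
    os.utime(file_name, (timestamp, timestamp))
    return timestamp


# Demonstration on an empty placeholder file, not a real image.
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as handle:
    demo_file = handle.name
demo_timestamp = apply_last_modified(demo_file, "Sun, 13 Aug 2023 00:26:24 GMT")
demo_mtime = int(os.stat(demo_file).st_mtime)
os.remove(demo_file)
print(demo_timestamp, demo_mtime)
```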

@araichoor
Author

Thanks for looking at this @weaverba137 !

First: oops, sorry, my correct path is $DESI_ROOT/users/raichoor/spacewatch/images/cropped/2022 (a "spacewatch" got cropped along the way...)

Timestamps:
My command sequence, from my laptop, is `wget -nc -r -nd https://varuna.kpno.noirlab.edu/allsky-all/images/cropped/$YYYY/$MM/$DD/`; then I tar+scp the result to NERSC and untar it there
(I attached the script to the email I sent you on 8/10/23, 10:06 PM Pacific).
From spot-checking a few files, this apparently preserves the modification time of the images, but not of the index files.
But I cannot guarantee that I've proceeded this way since the beginning. I can run more checks if you'd like, just let me know.

What to download:
As I'm launching the downloads manually, I didn't launch any when we had no data (e.g. during the summer shutdown).
I guess we could cook up some recipe based e.g. on exposures-daily.csv to only download images for nights where we've been on-sky.
Though I suspect it could be a bit tedious.
Given that the overall size is not that large (~50 GB for one year), maybe it's simpler to just download everything?
If we worry about disk space, maybe we could even tar+gzip those per night (provided that one can then still read individual images from the archives)?
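
On the question of reading individual images from a per-night archive: Python's `tarfile` module can return a single member without unpacking the rest. The sketch below uses a plain (uncompressed) tar, on the assumption that gzip would save little on JPEG files, which are already compressed, and would force sequential decompression to reach a member. The function names and the demo file contents are hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_night(image_dir: Path, tar_path: Path) -> None:
    """Bundle all .jpg files of one directory into an uncompressed tar archive."""
    with tarfile.open(tar_path, "w") as archive:
        for image in sorted(image_dir.glob("*.jpg")):
            archive.add(image, arcname=image.name)


def read_image(tar_path: Path, name: str) -> bytes:
    """Return the bytes of a single member without unpacking the others."""
    with tarfile.open(tar_path, "r") as archive:
        member = archive.extractfile(name)
        if member is None:
            raise FileNotFoundError(name)
        return member.read()


# Demonstration with placeholder files (not real images).
with tempfile.TemporaryDirectory() as tmp:
    night_dir = Path(tmp) / "13"
    night_dir.mkdir()
    for stem in ("20230813_002405", "20230813_002605"):
        (night_dir / f"{stem}.jpg").write_bytes(stem.encode())
    tar_file = Path(tmp) / "20230813.tar"
    pack_night(night_dir, tar_file)
    payload = read_image(tar_file, "20230813_002605.jpg")
print(payload)
```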

When to download:
You're right, we can surely set a time here that is independent of the afterburner.
Ideally that would be somewhere between the end of observations and the time the afterburner is usually run (10am Pacific?); so something like 9am MST?

@weaverba137
Member

Is there documentation on when spacewatch does their own rollover to a new day? After that rollover would be the best time to download.

@weaverba137
Member

Please ignore the previous question; it's clear they roll over at 00:00 UTC.
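
Because the source directories roll over at 00:00 UTC, one observing night can span two of them. The sketch below lists the UTC date directories that overlap a given night. It assumes that a night YYYYMMDD runs from local noon on that date to local noon on the next date, and it uses a fixed UTC-7 offset for MST (Arizona does not observe daylight saving time). Both the night definition and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

MST = timezone(timedelta(hours=-7), "MST")  # fixed offset, no daylight saving time


def utc_directories_for_night(night: str) -> list[str]:
    """Return the YYYY/MM/DD directories (UTC dates) overlapping one observing night.

    Assumption: the night runs from local noon on `night` to local noon the next day.
    """
    start = datetime.strptime(night, "%Y%m%d").replace(hour=12, tzinfo=MST)
    end = start + timedelta(days=1)
    first = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    span = (last - first).days
    return [(first + timedelta(days=i)).strftime("%Y/%m/%d") for i in range(span + 1)]


print(utc_directories_for_night("20230813"))
```

Under these assumptions, night 20230813 maps to the directories 2023/08/13 and 2023/08/14, so downloading "the last 48 hours" of directories would cover a complete night.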

@akremin
Member

akremin commented Aug 14, 2023

Pre-sneakernet, the afterburners were run as a scronjob at 9am each day (this time has varied over the years, but this is the current setting). That was sometimes paused if there were issues on a night, but in smooth operations it was consistently done by roughly 9:20am or 9:30am Pacific time.

In sneakernet it is much more variable.

@weaverba137
Member

@sbailey, @araichoor, @akremin, I'd like to revisit this in the near future. Some remaining questions:

  1. If these data are important to operations, where is the best place to store them for access by the full data team? They currently live within $DESI_ROOT/users/raichoor.
  2. When and how often do these data need to be updated? We've talked about "between the end of observations and before the afterburner is usually run", but can we be more precise than that, and how can automation determine that it is running during this period of time? In other words, if I start a script, and it detects that observations have not ended, it can sleep until they are done. If I start a script and it detects that the afterburner is already running, what happens then?

@weaverba137
Member

  3. How far back into the past do we need these images?

@araichoor
Author

thanks for pushing on this.

  1. I leave it to @sbailey.

2a. How often? I feel that a daily update would make sense, no? And provided that the occupied disk space is OK, for the sake of simplicity we could just run it every day, whether or not we go on sky (i.e. we would download images for the few weeks per year when there are no observations, but that should be a few GB, so no big deal);
2b. if I read exposures-daily.csv and the various GFA offline_matched_coadd_ccds files correctly, it looks like the latest in the night that an exposure was taken was MJD % 1 ~ 0.58, which is 0.58 * 24 - 7 = 6.92 h, i.e. 6:55am MST; so it sounds safe to say that by 8am MST we expect science observations to be done;
2c. I think that, in normal mode, the afterburner is launched at 9am Pacific (16:00 UTC);
so, if the download is launched at 8am MST, it should work; also note that the afterburner (obviously) does not currently use those images; if in the future we use them in the afterburner, we can surely add some handling in the code for missing images; I mean: I think the download process can be fully decoupled from the afterburner.
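
The timing arithmetic in 2b and 2c can be checked with a short sketch. It uses fixed offsets (MST = UTC-7, PDT = UTC-7, PST = UTC-8) rather than a time zone database, and the function names are hypothetical. The date passed to `datetime` is arbitrary, since only the time of day matters.

```python
from datetime import datetime, time, timedelta, timezone

UTC = timezone.utc
MST = timezone(timedelta(hours=-7), "MST")
PDT = timezone(timedelta(hours=-7), "PDT")
PST = timezone(timedelta(hours=-8), "PST")


def mjd_fraction_to_mst(fraction: float) -> time:
    """Convert the fractional part of an MJD (a UTC day fraction) to an MST wall-clock time."""
    midnight_utc = datetime(2023, 1, 1, tzinfo=UTC)  # arbitrary date
    return (midnight_utc + timedelta(days=fraction)).astimezone(MST).time()


def utc_hour(hour: int, zone: timezone) -> int:
    """UTC hour corresponding to a local wall-clock hour in a fixed-offset zone."""
    return datetime(2023, 1, 1, hour, tzinfo=zone).astimezone(UTC).hour


latest_exposure = mjd_fraction_to_mst(0.58)
download_utc = utc_hour(8, MST)
afterburner_utc = (utc_hour(9, PDT), utc_hour(9, PST))
print(latest_exposure.strftime("%H:%M"), download_utc, afterburner_utc)
```

An 8am MST download corresponds to 15:00 UTC, while a 9am Pacific afterburner corresponds to 16:00 UTC during daylight time and 17:00 UTC during standard time, so the download would start one to two hours earlier.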

  3. same here, @sbailey would have a more relevant answer, I guess; my 2 cents: since the beginning of sv1 (20201214)?

@sbailey
Contributor

sbailey commented Sep 25, 2023

Apologies for joining late on this.

  1. Location: let's put these in $DESI_ROOT/external/spacewatch, following the same directory structure that Anand has in $DESI_ROOT/users/raichoor/spacewatch (though it looks like the subdirs have some leftover cruft like 'index.html?C=D;O=A', where the filename includes the quotes). I've seen that before when others have used wget for recursive bulk downloads.
  2. How often and timing: daily at 8am sounds good.
  3. How far back: let's migrate the files that Anand has already downloaded, and then set up automated transfers for the future, but not worry about filling in any backlog except if needed for particular nights.
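
One way to avoid the 'index.html?C=D;O=A' cruft mentioned in item 1 is to parse the directory index and keep only links that look like image files. The sketch below does this with the standard library. The file name pattern is inferred from the single example `20230813_002605.jpg` earlier in this thread, and the sample index is a made-up fragment in the style of an Apache listing, so both are assumptions.

```python
import re
from html.parser import HTMLParser

# Assumed naming scheme: YYYYMMDD_HHMMSS.jpg
IMAGE_NAME = re.compile(r"^\d{8}_\d{6}\.jpg$")


class LinkCollector(HTMLParser):
    """Collect href values from anchor tags in a directory index page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


def image_names(index_html: str) -> list[str]:
    """Return only image file names, dropping sort links such as '?C=D;O=A'."""
    collector = LinkCollector()
    collector.feed(index_html)
    return sorted(name for name in collector.links if IMAGE_NAME.match(name))


# Made-up index fragment for demonstration only.
SAMPLE_INDEX = """
<a href="?C=D;O=A">Description</a>
<a href="/allsky-all/images/cropped/2023/08/">Parent Directory</a>
<a href="20230813_002605.jpg">20230813_002605.jpg</a>
<a href="20230813_002405.jpg">20230813_002405.jpg</a>
"""
print(image_names(SAMPLE_INDEX))
```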

AFAIK, there is nothing in the standard automated pipeline that requires these files, but they are a useful debugging tool when the daily QA reveals something odd and Anand checks if a satellite track went through our focal plane.

@weaverba137
Member

@araichoor, do you have any further comments on this? If not, I'll move forward based on @sbailey's last set of comments.

@araichoor
Author

no, nothing to add, thanks!

@araichoor
Author

hi @weaverba137:

for info, I got this message today when running my wget command:

--2023-11-02 09:33:34--  https://varuna.kpno.noirlab.edu/allsky-all/images/cropped//2023/11/02/
Resolving varuna.kpno.noirlab.edu (varuna.kpno.noirlab.edu)... 140.252.86.23
Connecting to varuna.kpno.noirlab.edu (varuna.kpno.noirlab.edu)|140.252.86.23|:443... connected.
ERROR: cannot verify varuna.kpno.noirlab.edu's certificate, issued by ‘CN=InCommon RSA Server CA,OU=InCommon,O=Internet2,L=Ann Arbor,ST=MI,C=US’:
  Issued certificate has expired.
To connect to varuna.kpno.noirlab.edu insecurely, use `--no-check-certificate'.

I re-ran with the --no-check-certificate argument, and it worked fine.
I'm just reporting this in case it's useful information for #58.

@weaverba137
Member

@araichoor, thank you, I noticed that too. It's simply an expired certificate, and we'll have to wait for it to be fixed.
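
For a Python-based download, the rough equivalent of wget's `--no-check-certificate` is an SSL context with verification switched off. The sketch below is one possible shape, using only the standard library. The function names are hypothetical, and disabling verification should stay an explicit, temporary opt-in while the certificate is expired.

```python
import ssl
import urllib.request


def make_ssl_context(verify: bool = True) -> ssl.SSLContext:
    """Return a default SSL context, optionally with certificate checks disabled."""
    context = ssl.create_default_context()
    if not verify:
        # check_hostname must be disabled before verify_mode can be relaxed.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def fetch(url: str, verify: bool = True, timeout: float = 30.0) -> bytes:
    """Download a URL; pass verify=False only as a deliberate workaround."""
    context = make_ssl_context(verify)
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()


strict = make_ssl_context()
relaxed = make_ssl_context(verify=False)
print(strict.verify_mode, relaxed.verify_mode)
```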
