Issues with latest satsearch #6

Open
TomAugspurger opened this issue Oct 16, 2020 · 19 comments

@TomAugspurger
Member

A collection of issues with the latest versions of satsearch / intake-stac.

  1. The search URL needs to be specified. I think https://earth-search.aws.element84.com/v0 is the recommended URL, and you pass it like results = Search.search(url=api_url, collection='landsat-8-l1', ...) (see the sketch after this list).
  2. If that's the right URL, the search returns many more items (34,675, limited to 10,000), compared to 25 before:

     There are more items found (34675) than the limit (100) provided.

     34675 items
  3. eo:bands isn't present in the GeoDataFrame: band_info = pd.DataFrame(ast.literal_eval(gf.iloc[0]['eo:bands'])). I think it's available in the Item assets instead?

     eo_bands = [items[0].assets[f'B{i}']['eo:bands'] for i in range(1, 12)]

     so repeat that for each band?
  4. The hardcoded scene ID isn't present in the catalog:

     sceneid = 'LC80470272019096'
  5. The ValueError from intake-stac: setup and error on aws search example intake/intake-stac#64.
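
Putting item 1 into code, a minimal sketch of the satsearch call with an explicit endpoint, following the call signature described above (the bbox and datetime values here are placeholders, not the notebook's):

from satsearch import Search

# pass the Earth Search endpoint explicitly (item 1 above); bbox/datetime are placeholders
api_url = 'https://earth-search.aws.element84.com/v0'
results = Search.search(
    url=api_url,
    collection='landsat-8-l1',
    bbox=[-124.0, 45.0, -120.0, 49.0],
    datetime='2019-01-01/2019-12-31',
)
print('%s items' % results.found())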
@scottyhq
Member

Thanks for the detailed issue @TomAugspurger. There have been a lot of changes to sat-search (>0.3) and intake-stac (>0.3) in the last couple of months, with a new version of intake-stac released just yesterday. Long story short, the notebook needs some updating once those new versions are in the environment.

@ZihengSun

ZihengSun commented Apr 2, 2021

Hi @scottyhq and @TomAugspurger, I was just running the landsat8 notebook on a newly installed Dask cluster on Azure K8S. I used the exact same version of satsearch (==0.2.3), but still cannot get it to run. Here are the details of the error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-10-47751b4b061a> in <module>
     10                'landsat:tier=T1'] 
     11 
---> 12 results = Search.search(collection='landsat-8-l1', 
     13                         bbox=bbox,
     14                         datetime=timeRange,

/srv/conda/envs/notebook/lib/python3.8/site-packages/satsearch/search.py in search(cls, **kwargs)
     61             del kwargs['sort']
     62             kwargs['sort'] = sorts
---> 63         return Search(**kwargs)
     64 
     65     def found(self):

/srv/conda/envs/notebook/lib/python3.8/site-packages/satsearch/search.py in __init__(self, **kwargs)
     26         """ Initialize a Search object with parameters """
     27         self.kwargs = kwargs
---> 28         for k in self.kwargs:
     29             if k == 'datetime':
     30                 self.kwargs['time'] = self.kwargs['datetime']

RuntimeError: dictionary keys changed during iteration

Any idea?

@TomAugspurger
Member Author

Hmm, I'm not sure. That looks like a bug in satsearch. I believe that development focus is shifting from satsearch to https://github.com/stac-utils/pystac-api-client, but I'm not sure how mature pystac-api-client is yet.

@ZihengSun

I think this would be a simple fix by changing line 28 to:

for k in list(self.kwargs):

But I cannot access and edit the file /srv/conda/envs/notebook/lib/python3.8/site-packages/satsearch/search.py, as I am using a dummy user on the Dask JupyterLab.
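
For context, a minimal standalone illustration (with placeholder values, not satsearch's actual code) of why the original loop fails on Python 3.8+ and why the list() copy fixes it:

# Adding a key to a dict while iterating over it raises
# "RuntimeError: dictionary keys changed during iteration" on Python 3.8+.
kwargs = {'datetime': '2019-01-01/2019-12-31', 'bbox': [0, 0, 1, 1]}

# for k in kwargs:          # would raise the RuntimeError seen above
for k in list(kwargs):      # iterating over a copy of the keys is safe
    if k == 'datetime':
        kwargs['time'] = kwargs['datetime']

print(kwargs)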

@scottyhq
Member

@TomAugspurger @ZihengSun, yes, taking a step back, this example really needs to be updated to use a different L8 dataset. See #8 for some alternatives, including accessing Harmonized Landsat Sentinel-2 (HLS) via NASA's CMR STAC endpoint (a rough sketch follows below).

It would be awesome if all public datasets in AWS, Azure, Google had up-to-date STAC metadata and search endpoints, but that is still very much a work in progress...
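
For reference, searching HLS through a STAC client against the CMR STAC endpoint looks roughly like the sketch below; the endpoint path and collection id are assumptions on my part, so check them against NASA's documentation (and note that downloading the data still requires NASA URS credentials):

import pystac_client

# sketch only: the "LPCLOUD" provider path and "HLSL30.v2.0" collection id are assumptions
catalog = pystac_client.Client.open("https://cmr.earthdata.nasa.gov/stac/LPCLOUD")
search = catalog.search(
    collections=["HLSL30.v2.0"],
    bbox=[-105.0, 40.0, -104.0, 41.0],      # placeholder bounding box
    datetime="2021-06-01/2021-06-30",       # placeholder time range
)
print(search.matched(), "items found")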

@RichardScottOZ

Yes, it's all a work in progress, the magic Stack of STACs.

@ricardobarroslourenco

Hi everyone! I was trying to reproduce this notebook, and in the third cell the search throws an error that seems to be related to the search call:


gaierror Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connection.py in _new_conn(self)
159 conn = connection.create_connection(
--> 160 (self._dns_host, self.port), self.timeout, **extra_kw
161 )

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
60
---> 61 for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
62 af, socktype, proto, canonname, sa = res

/srv/conda/envs/notebook/lib/python3.7/socket.py in getaddrinfo(host, port, family, type, proto, flags)
751 addrlist = []
--> 752 for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
753 af, socktype, proto, canonname, sa = res

gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

NewConnectionError Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
676 headers=headers,
--> 677 chunked=chunked,
678 )

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
380 try:
--> 381 self._validate_conn(conn)
382 except (SocketTimeout, BaseSSLError) as e:

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connectionpool.py in _validate_conn(self, conn)
977 if not getattr(conn, "sock", None): # AppEngine might not have .sock
--> 978 conn.connect()
979

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connection.py in connect(self)
308 # Add certificate verification
--> 309 conn = self._new_conn()
310 hostname = self.host

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connection.py in _new_conn(self)
171 raise NewConnectionError(
--> 172 self, "Failed to establish a new connection: %s" % e
173 )

NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7ffae1f56f10>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

MaxRetryError Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
448 retries=self.max_retries,
--> 449 timeout=timeout
450 )

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
726 retries = retries.increment(
--> 727 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
728 )

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
438 if new_retry.is_exhausted():
--> 439 raise MaxRetryError(_pool, url, error or ResponseError(cause))
440

MaxRetryError: HTTPSConnectionPool(host='earth-search-legacy.aws.element84.com', port=443): Max retries exceeded with url: /stac/search (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ffae1f56f10>: Failed to establish a new connection: [Errno -2] Name or service not known'))

During handling of the above exception, another exception occurred:

ConnectionError Traceback (most recent call last)
in
17 )
18
---> 19 print('%s items' % results.found())
20 items = results.items()
21 items.save('subset.geojson')

/srv/conda/envs/notebook/lib/python3.7/site-packages/satsearch/search.py in found(self)
73 }
74 kwargs.update(self.kwargs)
---> 75 results = self.query(**kwargs)
76 return results['meta']['found']
77

/srv/conda/envs/notebook/lib/python3.7/site-packages/satsearch/search.py in query(cls, url, **kwargs)
80 """ Get request """
81 logger.debug('Query URL: %s, Body: %s' % (url, json.dumps(kwargs)))
---> 82 response = requests.post(url, data=json.dumps(kwargs))
83 # API error
84 if response.status_code != 200:

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/api.py in post(url, data, json, **kwargs)
117 """
118
--> 119 return request('post', url, data=data, json=json, **kwargs)
120
121

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/api.py in request(method, url, **kwargs)
59 # cases, and look like a memory leak in others.
60 with sessions.Session() as session:
---> 61 return session.request(method=method, url=url, **kwargs)
62
63

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
528 }
529 send_kwargs.update(settings)
--> 530 resp = self.send(prep, **send_kwargs)
531
532 return resp

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/sessions.py in send(self, request, **kwargs)
641
642 # Send the request
--> 643 r = adapter.send(request, **kwargs)
644
645 # Total elapsed time of the request (approximately)

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
514 raise SSLError(e, request=request)
515
--> 516 raise ConnectionError(e, request=request)
517
518 except ClosedPoolError as e:

ConnectionError: HTTPSConnectionPool(host='earth-search-legacy.aws.element84.com', port=443): Max retries exceeded with url: /stac/search (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ffae1f56f10>: Failed to establish a new connection: [Errno -2] Name or service not known'))

To me, it seems there is an issue with the endpoint being called. Can anyone give a hint on how to solve this? I have already installed satsearch from its GitHub repo, but the problem persists.

@TomAugspurger
Member Author

I'd recommend trying pystac-client (and updating the notebook if that works). Something like

catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v0")

Docs are at https://pystac-client.readthedocs.io/.
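
Expanding that into a rough sketch of a full search (the collection id, bbox, and datetime here are values I've assumed for illustration, not the notebook's):

import pystac_client

catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v0")
search = catalog.search(
    collections=["landsat-8-l1-c1"],        # assumed collection id; confirm via catalog.get_collections()
    bbox=[-124.0, 45.0, -120.0, 49.0],      # placeholder bounding box
    datetime="2019-01-01/2019-12-31",       # placeholder time range
)
items = search.get_all_items()
print(f"{len(items)} items")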

@ricardobarroslourenco

Thanks @TomAugspurger! I will try it, and if it works, I will fix the notebook and submit a PR, OK?

@rabernat
Member

rabernat commented Mar 2, 2022

I was hoping to use this in a demo and found the same issue described in #6 (comment). I was able to run

catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v0")

but could not figure out how to refactor the rest of the search in cell 3 to use the new API.

This gallery is a very important demonstration of Pangeo's capabilities in geospatial analysis. Let's get it working again!

@ricardobarroslourenco

I got dragged in lab affairs these last days but will be going over this later this week.

@rabernat would the refactoring basically be due to the method change, or are there new "errors"?

@TomAugspurger
Member Author

I'll take a quick look.

@TomAugspurger
Member Author

Gotta move on, but I have a start at https://gist.github.com/6051aa1705dc6797beccc9ac6e321ef3.

  1. The STAC items in AWS changed their IDs and structure.
  2. The pystac-client API differs a bit from sat-search.
  3. The returned STAC items don't seem to work with whatever intake-stac is expected to do with them. I halfway switched over to using stackstac instead (see the sketch below), but the output structure is at odds with what the hvplot code was expecting (which is where I have to leave it for now).
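
A rough sketch of the stackstac route from item 3 (the collection id, asset names, EPSG code, resolution, bbox, and dates here are assumptions, not the notebook's values):

import pystac_client
import stackstac

catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v0")
items = catalog.search(
    collections=["landsat-8-l1-c1"],
    bbox=[-122.5, 47.0, -121.5, 48.0],
    datetime="2021-06-01/2021-09-01",
).get_all_items()

# lazily stack the items into a (time, band, y, x) dask-backed DataArray
stack = stackstac.stack(items, assets=["B4", "B3", "B2"], epsg=32610, resolution=30)
print(stack)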

I'll pick things up later if I have a chance.

@rsignell-usgs
Member

rsignell-usgs commented Mar 2, 2022

The Landsat example here may also be of interest; it uses pystac-client and odc-stac: https://github.com/Element84/geo-notebooks/blob/main/notebooks/odc-landsat.ipynb

Here's the rendered version
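
The core pattern in that notebook looks roughly like the following sketch (the collection id, band names, bbox, dates, and chunk sizes are assumptions, not values taken from the notebook):

import odc.stac
import pystac_client

catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v0")
items = catalog.search(
    collections=["landsat-8-l1-c1"],
    bbox=[-122.5, 47.0, -121.5, 48.0],
    datetime="2021-06-01/2021-09-01",
).get_all_items()

# load lazily into an xarray Dataset backed by dask
ds = odc.stac.load(items, bands=["B4", "B3", "B2"], chunks={"x": 2048, "y": 2048})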

@TomAugspurger
Member Author

Hmm, now I'm having issues with accessing the data (e.g. the link at https://landsat-pds.s3.us-west-2.amazonaws.com/c1/L8/047/027/LC08_L1TP_047027_20210630_20210708_01_T1/LC08_L1TP_047027_20210630_20210708_01_T1_thumb_large.jpg is giving a 4xx error). Did that bucket recently change to requester-pays? For some reason I thought it had been requester-pays for a while, but now I'm not sure.

I could change the source to the Planetary Computer's landsat collection in Azure, but would want an OK from @scottyhq before doing that. Or I could put that in a separate notebook.
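
A rough sketch of that Planetary Computer alternative (the collection id "landsat-c2-l2" and the search parameters are assumptions on my part):

import planetary_computer
import pystac_client

catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(
    collections=["landsat-c2-l2"],
    bbox=[-122.5, 47.0, -121.5, 48.0],
    datetime="2021-06-01/2021-09-01",
)
# Planetary Computer assets need short-lived SAS tokens appended before reading
items = [planetary_computer.sign(item) for item in search.get_all_items()]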

@scottyhq
Member

Thanks for taking the time to try to fix the now-dated example, @TomAugspurger.

Hmm, now I'm having issues with accessing the data

Yes, I think that s3://landsat-pds has finally been retired! See pydata/xarray#6363 (comment)

There are now at least 4 options for cloud-hosted Landsat data (for better or worse)!

version | cloud | cloud region | authentication
--- | --- | --- | ---
NASA HLS v2 | AWS | us-west-2 | NASA URS
USGS collection 2 | AWS | us-west-2 | requester-pays
Landsat collection 2 | Azure | West Europe | SAS token
Landsat collection 1 | Google | US multi-region | public

@rabernat @TomAugspurger Happy to merge a PR for an updated notebook... But in my mind Pangeo Gallery is primarily meant to 1. illustrate large-scale examples of moving compute to the data, and 2. actually execute those large examples as big integration tests, so we see when data or software problems come up (as they have here!). So my reluctance to continue maintaining this example with AWS datasets is due to the following:

  • Does binderbot still work, since the AWS BinderHub requires GitHub login?
  • How do we automatically execute code requiring credentials (needed for NASA URS or requester-pays)?

@martibosch

martibosch commented Jul 14, 2022

Hello,

I have also been browsing a bunch of online materials to get started with intake for landsat data and have not yet been able to reproduce a single notebook.

I have signed up for Microsoft's Planetary Computer (since, correct me if I am mistaken, it seems to be more aligned with open source principles than Google's Earth Engine), and in the meantime I have been trying to use the AWS datasets with an AWS account. However, when I try to convert to dask/xarray:

import boto3
import intake
import pystac_client
import rasterio as rio
from rasterio import session

# requester-pays session for the USGS Landsat bucket
aws_session = session.AWSSession(boto3.Session(profile_name="aws"), requester_pays=True)
stac_uri = "https://landsatlook.usgs.gov/stac-server"
collections = ["landsat-c2l1"]

client = pystac_client.Client.open(stac_uri)

results = client.search(
    collections=collections,
    bbox=...,
    datetime=...,
)
items = results.get_all_items()
catalog = intake.open_stac_item_collection(items)
with rio.Env(aws_session):
    ds = catalog[list(catalog)[0]]["blue"].to_dask()

I obtain:

  KeyError                                  Traceback (most recent call last)
  File ~/path/to/conda/env/lib/python3.10/site-packages/xarray/backends/file_manager.py:199, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
      198 try:
  --> 199     file = self._cache[self._key]
      200 except KeyError:

  File ~/path/to/conda/env/lib/python3.10/site-packages/xarray/backends/lru_cache.py:53, in LRUCache.__getitem__(self, key)
       52 with self._lock:
  ---> 53     value = self._cache[key]
       54     self._cache.move_to_end(key)

  KeyError: [<function open at 0x7fe6b9d5ab00>, ('https://landsatlook.usgs.gov/data/collection02/level-1/standard/oli-tirs/2020/165/062/LC08_L1TP_165062_20201231_20210308_02_T1/LC08_L1TP_165062_20201231_20210308_02_T1_B2.TIF',), 'r', ()]

  During handling of the above exception, another exception occurred:

  CPLE_AppDefinedError                      Traceback (most recent call last)
  File rasterio/_base.pyx:302, in rasterio._base.DatasetBase.__init__()

  File rasterio/_base.pyx:213, in rasterio._base.open_dataset()

  File rasterio/_err.pyx:217, in rasterio._err.exc_wrap_pointer()

  CPLE_AppDefinedError: Line 49: </head> doesn't have matching <head>.

  During handling of the above exception, another exception occurred:

  RasterioIOError                           Traceback (most recent call last)
  Input In [33], in <cell line: 1>()
        1 with rio.Env(aws_session):
  ----> 2     ds = catalog[list(catalog)[0]]["blue"].to_dask()

  File ~/path/to/conda/env/lib/python3.10/site-packages/intake_xarray/base.py:69, in DataSourceMixin.to_dask(self)
       67 def to_dask(self):
       68     """Return xarray object where variables are dask arrays"""
  ---> 69     return self.read_chunked()

  File ~/path/to/conda/env/lib/python3.10/site-packages/intake_xarray/base.py:44, in DataSourceMixin.read_chunked(self)
       42 def read_chunked(self):
       43     """Return xarray object (which will have chunks)"""
  ---> 44     self._load_metadata()
       45     return self._ds

  File ~/path/to/conda/env/lib/python3.10/site-packages/intake/source/base.py:236, in DataSourceBase._load_metadata(self)
      234 """load metadata only if needed"""
      235 if self._schema is None:
  --> 236     self._schema = self._get_schema()
      237     self.dtype = self._schema.dtype
      238     self.shape = self._schema.shape

  File ~/path/to/conda/env/lib/python3.10/site-packages/intake_xarray/raster.py:102, in RasterIOSource._get_schema(self)
       99 self.urlpath, *_ = self._get_cache(self.urlpath)
      101 if self._ds is None:
  --> 102     self._open_dataset()
      104     ds2 = xr.Dataset({'raster': self._ds})
      105     metadata = {
      106         'dims': dict(ds2.dims),
      107         'data_vars': {k: list(ds2[k].coords)
     (...)
      110         'array': 'raster'
      111     }

  File ~/path/to/conda/env/lib/python3.10/site-packages/intake_xarray/raster.py:90, in RasterIOSource._open_dataset(self)
       88     self._ds = self._open_files(files)
       89 else:
  ---> 90     self._ds = xr.open_rasterio(files, chunks=self.chunks,
       91                                 **self._kwargs)

  File ~/path/to/conda/env/lib/python3.10/site-packages/xarray/backends/rasterio_.py:302, in open_rasterio(filename, parse_coordinates, chunks, cache, lock, **kwargs)
      293     lock = RASTERIO_LOCK
      295 manager = CachingFileManager(
      296     rasterio.open,
      297     filename,
     (...)
      300     kwargs=kwargs,
      301 )
  --> 302 riods = manager.acquire()
      303 if vrt_params is not None:
      304     riods = WarpedVRT(riods, **vrt_params)

  File ~/path/to/conda/env/lib/python3.10/site-packages/xarray/backends/file_manager.py:181, in CachingFileManager.acquire(self, needs_lock)
      166 def acquire(self, needs_lock=True):
      167     """Acquire a file object from the manager.
      168 
      169     A new file is only opened if it has expired from the
     (...)
      179         An open file object, as returned by ``opener(*args, **kwargs)``.
      180     """
  --> 181     file, _ = self._acquire_with_cache_info(needs_lock)
      182     return file

  File ~/path/to/conda/env/lib/python3.10/site-packages/xarray/backends/file_manager.py:205, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
      203     kwargs = kwargs.copy()
      204     kwargs["mode"] = self._mode
  --> 205 file = self._opener(*self._args, **kwargs)
      206 if self._mode == "w":
      207     # ensure file doesn't get overridden when opened again
      208     self._mode = "a"

  File ~/path/to/conda/env/lib/python3.10/site-packages/rasterio/env.py:442, in ensure_env_with_credentials.<locals>.wrapper(*args, **kwds)
      439     session = DummySession()
      441 with env_ctor(session=session):
  --> 442     return f(*args, **kwds)

  File ~/path/to/conda/env/lib/python3.10/site-packages/rasterio/__init__.py:277, in open(fp, mode, driver, width, height, count, crs, transform, dtype, nodata, sharing, **kwargs)
      274 path = _parse_path(raw_dataset_path)
      276 if mode == "r":
  --> 277     dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
      278 elif mode == "r+":
      279     dataset = get_writer_for_path(path, driver=driver)(
      280         path, mode, driver=driver, sharing=sharing, **kwargs
      281     )

  File rasterio/_base.pyx:304, in rasterio._base.DatasetBase.__init__()

  RasterioIOError: Line 49: </head> doesn't have matching <head>.

Since the error seems to come from rasterio, I tried:

import boto3
import rasterio as rio
from rasterio import session

aws_session = session.AWSSession(boto3.Session(profile_name="aws"), requester_pays=True)

uri = "https://landsatlook.usgs.gov/data/collection02/level-1/standard/oli-tirs/2020/165/062/LC08_L1TP_165062_20201231_20210308_02_T1/LC08_L1TP_165062_20201231_20210308_02_T1_B2.TIF"

with rio.Env(aws_session):
    with rio.open(uri) as src:
        print(src.profile)

but I obtain the same error. However, if I change the https URI to its s3 version, i.e., "s3://usgs-landsat/collection02/level-1/standard/oli-tirs/2020/165/062/LC08_L1TP_165062_20201231_20210308_02_T1/LC08_L1TP_165062_20201231_20210308_02_T1_B2.TIF", everything works.

Also note that I tried setting the CURL_CA_BUNDLE environment variable to "/etc/ssl/certs/ca-certificates.crt" as suggested by @mdmaas (https://www.matecdev.com/posts/landsat-sentinel-aws-s3-python.html), but it did not work either.

Is this normal? How can I make the to_dask method work for Earth on AWS?

For info: rasterio version is 1.3.0, intake version is 0.6.5, and pystac_client version is 0.4.0.

Thank you in advance. Best,
Martí

@scottyhq
Member

scottyhq commented Jul 14, 2022

RasterioIOError: Line 49: </head> doesn't have matching <head>.

This is suggestive of reading an HTML form rather than the actual file. I ran into this issue recently and found a solution here: https://gis.stackexchange.com/questions/430026/gdalinfo-authenticate-for-remote-file. Because of the authentication required for the HTTPS links, it seems best to stick with the s3:// URLs, as you've discovered.

Here is a more up-to-date example I'd suggest following for landsatlook.usgs.gov on AWS: pangeo-data/cog-best-practices#12 (comment)
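
For completeness, the working s3:// variant of the earlier rasterio snippet (same pattern, with the URI swapped; the AWS profile name is whatever holds your credentials):

import boto3
import rasterio as rio
from rasterio.session import AWSSession

# requester-pays access to the usgs-landsat bucket via the s3:// URI
aws_session = AWSSession(boto3.Session(profile_name="aws"), requester_pays=True)
uri = (
    "s3://usgs-landsat/collection02/level-1/standard/oli-tirs/2020/165/062/"
    "LC08_L1TP_165062_20201231_20210308_02_T1/"
    "LC08_L1TP_165062_20201231_20210308_02_T1_B2.TIF"
)
with rio.Env(aws_session):
    with rio.open(uri) as src:
        print(src.profile)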

@martibosch

Thanks @scottyhq for your response. I have managed to stream Landsat data from AWS into xarray thanks to the example that you sent. If I am understanding the situation correctly, this means that intake should be updated to read s3 URLs rather than https, right?
