Refactor level-2 branch to use remote db #13

Open: wants to merge 16 commits into base: level2

Conversation

zhik
Collaborator

@zhik zhik commented Mar 20, 2024

#12

Still needs to be tested on Kubernetes. Writing from my local server to the RDS instance takes 4h59min8s (35691 XML lines).
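For scale, the timing above works out to roughly two XML lines per second. A small sketch of the arithmetic, using only the two figures quoted in this comment (the function name is made up for the example):

```python
def lines_per_second(lines: int, hours: int, minutes: int, seconds: int) -> float:
    """Average throughput for a run that processed `lines` in the given wall time."""
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return lines / total_seconds


# 4h59min8s is 17948 seconds in total.
rate = lines_per_second(35691, hours=4, minutes=59, seconds=8)
print(f"{rate:.2f} lines/s, {1 / rate:.2f} s per line")
# prints "1.99 lines/s, 0.50 s per line"
```

About half a second per line would be consistent with one network round trip per record, though the thread does not confirm where the time actually goes.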

@austensen
Member

I'm getting this error when running it on kubernetes.

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/requests_toolbelt/_compat.py", line 48, in <module>
    from requests.packages.urllib3.contrib import appengine as gaecontrib
ImportError: cannot import name 'appengine' from 'requests.packages.urllib3.contrib' (/usr/local/lib/python3.11/site-packages/urllib3/contrib/__init__.py)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/app/oca_update.py", line 7, in <module>
    from lib.etl import oca_etl
  File "/app/lib/etl.py", line 24, in <module>
    from .geocode_record import geocode_record, geocode_using_census_batch
  File "/app/lib/geocode_record.py", line 7, in <module>
    import censusgeocode as cg
  File "/usr/local/lib/python3.11/site-packages/censusgeocode/__init__.py", line 10, in <module>
    from .censusgeocode import CensusGeocode
  File "/usr/local/lib/python3.11/site-packages/censusgeocode/censusgeocode.py", line 26, in <module>
    from requests_toolbelt.multipart.encoder import MultipartEncoder
  File "/usr/local/lib/python3.11/site-packages/requests_toolbelt/__init__.py", line 12, in <module>
    from .adapters import SSLAdapter, SourceAddressAdapter
  File "/usr/local/lib/python3.11/site-packages/requests_toolbelt/adapters/__init__.py", line 12, in <module>
    from .ssl import SSLAdapter
  File "/usr/local/lib/python3.11/site-packages/requests_toolbelt/adapters/ssl.py", line 16, in <module>
    from .._compat import poolmanager
  File "/usr/local/lib/python3.11/site-packages/requests_toolbelt/_compat.py", line 50, in <module>
    from urllib3.contrib import appengine as gaecontrib
ImportError: cannot import name 'appengine' from 'urllib3.contrib' (/usr/local/lib/python3.11/site-packages/urllib3/contrib/__init__.py)

Sounds like this can be fixed by specifying these package versions:

urllib3==1.26.15 
requests-toolbelt==0.10.1

https://stackoverflow.com/a/76177575/7051239
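One way to confirm that the pins actually took effect inside the image is to compare the installed versions against them at startup. A minimal sketch, assuming the two pins above; `PINNED` and `find_mismatches` are names made up for this example, and only the standard library's `importlib.metadata` is used:

```python
from importlib import metadata

# The two pins suggested above.
PINNED = {"urllib3": "1.26.15", "requests-toolbelt": "0.10.1"}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that deviates from its pin.

    `found` is None when the package is not installed at all.
    `get_version` is injectable so the check can be exercised without
    touching the real environment.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Calling `find_mismatches(PINNED)` before the first `import censusgeocode` and logging a non-empty result would surface a drifted dependency as a readable message rather than a nested ImportError.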

@austensen
Member

That worked. Now I'm getting a new error: it can't find the Geosupport files.

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/geosupport/geosupport.py", line 67, in __init__
    self.geolib = cdll.LoadLibrary("libgeo.so")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/ctypes/__init__.py", line 454, in LoadLibrary
    return self._dlltype(name)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/ctypes/__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libgeo.so: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/app/oca_update.py", line 7, in <module>
    from lib.etl import oca_etl
  File "/app/lib/etl.py", line 24, in <module>
    from .geocode_record import geocode_record, geocode_using_census_batch
  File "/app/lib/geocode_record.py", line 12, in <module>
    g = Geosupport()
        ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/geosupport/geosupport.py", line 75, in __init__
    raise GeosupportError(
geosupport.error.GeosupportError: libgeo.so: cannot open shared object file: No such file or directory
You are currently using a 64-bit Python interpreter. Is the installed version of Geosupport 64-bit?

@austensen
Member

austensen commented Mar 24, 2024

Apparently they removed the files for the minor version of Geosupport we were using, and also changed the filename structure. I updated that and it's now installing correctly.

New error though with the SFTP connection:

getaddrinfo {host}: Name or service not known
Traceback (most recent call last):
  File "/app/oca_update.py", line 39, in <module>
    main()
  File "/app/oca_update.py", line 36, in main
    oca_etl(db_args, sftp_args, s3_args, mode, remote_db_args)
  File "/app/lib/etl.py", line 199, in oca_etl
    sftp = Sftp(**sftp_args)
           ^^^^^^^^^^^^^^^^^
  File "/app/lib/sftp.py", line 13, in __init__
    self.sftp = pysftp.Connection(host=host, username=user, password=pswd)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pysftp/__init__.py", line 116, in __init__
    self._cnopts = cnopts or CnOpts()
                             ^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pysftp/__init__.py", line 64, in __init__
    raise HostKeysException('No Host Keys Found')
pysftp.exceptions.HostKeysException: No Host Keys Found
Exception ignored in: <function Connection.__del__ at 0x7f4e9b9b5d00>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/pysftp/__init__.py", line 1013, in __del__
    self.close()
  File "/usr/local/lib/python3.11/site-packages/pysftp/__init__.py", line 784, in close
    if self._sftp_live:
       ^^^^^^^^^^^^^^^
AttributeError: 'Connection' object has no attribute '_sftp_live'

Not sure why this is happening when it has always worked fine locally.
From what I can see, the pysftp package is old and unmaintained, and it has some known issues related to host keys. People seem to suggest dropping it and using paramiko directly, which is what pysftp wraps. https://stackoverflow.com/questions/48434941/pysftp-vs-paramiko

@zhik have you ever run into an issue like this when running it locally?

@austensen
Member

It seems that, for some reason, this line in sftp.py, which creates the known_hosts file, was not working correctly.

os.system("ssh-keyscan -t dsa {host} >> ~/.ssh/known_hosts")

I edited the Kubernetes cron job definition to run that command first and then python oca_update.py, and that is working.

It seems like everything is going fine now, so I'll update again when the job is done.

@austensen
Member

The initial XML parsing took about 2-3 hours. It then reached the "Inserting from staging to main.." step and stalled: after 7 hours it was still stuck there, and eventually the k8s pod shut down. I'm not sure whether the DELETE FROM or the INSERT INTO was the slow statement. I'll have to look more closely at the code, but maybe we could refactor things and avoid this step entirely by using an "upsert" for the parsed records, rather than inserting into the staging table and then deleting and moving the rows over afterwards.
One complication would be the extra SQL work needed to create the appearance_outcomes records.
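To make the upsert idea concrete, here is a minimal sketch of the pattern. It uses an in-memory SQLite database from the standard library so it runs anywhere; the table and column names (cases, case_id, status, updated_at) are hypothetical and not taken from the real schema. PostgreSQL accepts the same INSERT ... ON CONFLICT ... DO UPDATE form, but the placeholder style depends on the driver.

```python
import sqlite3

# Insert a row, or update the existing row when the key is already present.
# "excluded" refers to the row that was proposed for insertion.
UPSERT_SQL = """
INSERT INTO cases (case_id, status, updated_at)
VALUES (?, ?, ?)
ON CONFLICT (case_id) DO UPDATE SET
    status = excluded.status,
    updated_at = excluded.updated_at
"""


def upsert_cases(conn: sqlite3.Connection, rows) -> None:
    """Upsert a batch of (case_id, status, updated_at) tuples in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT_SQL, rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cases (case_id TEXT PRIMARY KEY, status TEXT, updated_at TEXT)"
    )
    # First load: two new records.
    upsert_cases(conn, [("A1", "open", "2024-03-01"), ("B2", "open", "2024-03-01")])
    # Second load: A1 changes, C3 is new, B2 is untouched.
    upsert_cases(conn, [("A1", "closed", "2024-03-20"), ("C3", "open", "2024-03-20")])
    return conn.execute("SELECT case_id, status FROM cases ORDER BY case_id").fetchall()
```

With this shape each parsed batch lands directly in the main table, so there is no separate DELETE FROM and INSERT INTO pass at the end. Whether the appearance_outcomes records can be derived the same way is left open here.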

@zhik zhik deleted the branch level2 November 17, 2024 15:01
@zhik zhik closed this Nov 17, 2024
@zhik zhik reopened this Nov 17, 2024
@zhik
Collaborator Author

zhik commented Nov 17, 2024

I think we still want to keep the original process of writing to a remote Postgres instance in case we want to move off AWS. So I would not want to merge use-remote-all-steps into level2.

2 participants