Error when trying to reproduce #8
Comments
It sounds like the pickle file got corrupted. As a first step, could you try clearing out the data folder of any of the intermediate files? This should be in the same directory where your notebook is located. Reply here if that doesn't work.
Hi Otto,
I went ahead and re-cloned the repo. It turned up a different error on the same line of code. Then I deleted all of the checkpoints and __pycache__ directories and ran through the code again, and it turned up the error from before, "EOFError: Ran out of input". Should I try the hair-dryer CSV instead? Let me know the next steps.
Alex
```
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-29-8d1ee1ab3c7f> in <module>()
----> 1 subset_l1_pub_nums, l1_texts, padded_abstract_embeddings, refs_one_hot, cpc_one_hot = expander.sample_for_inference(td, 0.2)

~\Documents\GitHub\patents-public-data\models\landscaping\expansion.py in sample_for_inference(self, train_data_util, sample_frac)
    535                 pickle.dump(
    536                     (subset_l1_pub_nums, l1_texts, padded_abstract_embeddings, refs_one_hot, cpc_one_hot),
--> 537                     outfile)
    538             else:
    539                 print('Loading inference data from filesystem at {}'.format(inference_data_path))

OverflowError: cannot serialize a bytes object larger than 4 GiB
```
I ran it and got the same issue, which comes from trying to pickle an intermediate dataset larger than pickle's 4 GiB limit. In the short term, you can comment out lines 532-544 of expansion.py and then try to rerun the notebook. Also looping in @seinberg for comments on how to modify the code so the intermediate dataset can be stored locally.
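For reference, that 4 GiB cap applies to pickle protocols below 4; Python 3.4+ supports protocol 4, which lifts the per-object size limit. A minimal sketch of the round trip (the payload and filename here are illustrative, not what expansion.py actually serializes):

```python
import os
import pickle
import tempfile

# Illustrative payload; in expansion.py this would be the tuple of
# pub numbers, texts, embeddings, and one-hot matrices.
data = {"embeddings": b"\x00" * 1024, "pub_nums": ["US-1234-A1"]}

path = os.path.join(tempfile.mkdtemp(), "inference_data.pkl")

# protocol=4 removes the 4 GiB limit on serialized bytes objects
# that older protocols impose.
with open(path, "wb") as outfile:
    pickle.dump(data, outfile, protocol=4)

with open(path, "rb") as infile:
    restored = pickle.load(infile)

assert restored == data
```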
The issue you're seeing is something I've seen from time to time, and I think the cause is that one (or more) of the patents in the L1 result set contain a very large blob of text. This may actually be an issue with the BigQuery dataset, though I've not investigated closely. One simple workaround is to update the sample size on this line:
to be something smaller than 0.2 (20% of the L1 patents). Try 5% (0.05) to start with, and dial it up until you see the problem again. This isn't ideal if you want the full set of L1 patents for inference (e.g., to build the full patent landscape), but it can get you a good portion of the way there.
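The dial-up approach can be sketched with a hypothetical stand-in for `sample_for_inference` (the real method also builds embeddings and one-hot matrices; this stub only shows the fraction-based sampling):

```python
import random

def sample_l1_stub(items, sample_frac):
    """Hypothetical stand-in for expander.sample_for_inference:
    returns a random subset of the L1 patents at the given fraction."""
    k = int(len(items) * sample_frac)
    return random.sample(items, k)

l1_patents = [f"US-{n}-A1" for n in range(1000)]

# Start at 5% and dial the fraction up until serialization fails again.
subset = sample_l1_stub(l1_patents, 0.05)
print(len(subset))  # 50 of the 1000 L1 patents
```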
Hi @aalbracht, do you think you can help on this one? #47
I have been trying to recreate the patent landscape code in a Jupyter notebook. Everything runs perfectly until I get to loading the inference data. I get "EOFError: Ran out of input" at line 541 of expansion.py.
```
Loading inference data from filesystem at data\video_codec\landscape_inference_data.pkl
---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 subset_l1_pub_nums, l1_texts, padded_abstract_embeddings, refs_one_hot, cpc_one_hot = expander.sample_for_inference(td, 0.2)

~\Documents\Python Scripts\expansion.py in sample_for_inference(self, train_data_util, sample_frac)
    539             print('Loading inference data from filesystem at {}'.format(inference_data_path))
    540             with open(inference_data_path, 'rb') as infile:
--> 541                 inference_data_deserialized = pickle.load(infile)
    542 
    543             subset_l1_pub_nums, l1_texts, padded_abstract_embeddings, refs_one_hot, cpc_one_hot = \

EOFError: Ran out of input
```
Appreciate the help. I can put this all on my GitHub repo if that would help.
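An "EOFError: Ran out of input" from `pickle.load` usually means the file is empty or truncated (e.g., a dump that failed partway through). One defensive pattern, sketched below (this is not the project's code; the function name is hypothetical), is to treat such a file as missing so the caller regenerates it:

```python
import os
import pickle

def load_inference_data(path):
    """Sketch: guard against an empty or truncated intermediate pickle.
    An empty file makes pickle.load raise EOFError ('Ran out of input')."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return None  # caller should regenerate the intermediate data
    try:
        with open(path, "rb") as infile:
            return pickle.load(infile)
    except (EOFError, pickle.UnpicklingError):
        os.remove(path)  # drop the corrupt file so it gets rebuilt
        return None
```

With this guard, deleting or corrupting the .pkl no longer crashes the notebook; it just falls back to recomputing the inference data.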