Error when trying to reproduce #8
Comments
It sounds like the pickle file got corrupted. As a first step, could you try clearing out the data folder of any of the intermediate files? This should be in the same directory where your notebook is located. Reply here if that doesn't work.
Hi Otto,
I went ahead and re-cloned the repo. It turned up a different error on the same line of code. Then I deleted all of the checkpoints and __pycache__ directories and ran through the code again, and it turned up the error from before, "EOFError: Ran out of input". Should I try the hair-dryer CSV instead? Let me know the next steps.
Alex
```
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-29-8d1ee1ab3c7f> in <module>()
----> 1 subset_l1_pub_nums, l1_texts, padded_abstract_embeddings, refs_one_hot, cpc_one_hot = expander.sample_for_inference(td, 0.2)

~\Documents\GitHub\patents-public-data\models\landscaping\expansion.py in sample_for_inference(self, train_data_util, sample_frac)
    535                 pickle.dump(
    536                     (subset_l1_pub_nums, l1_texts, padded_abstract_embeddings, refs_one_hot, cpc_one_hot),
--> 537                     outfile)
    538             else:
    539                 print('Loading inference data from filesystem at {}'.format(inference_data_path))

OverflowError: cannot serialize a bytes object larger than 4 GiB
```
I ran it and got the same issue, which comes from trying to pickle an intermediate dataset larger than pickle's 4 GiB limit. In the short term, you can comment out lines 532-544 of expansion.py and then try to rerun the notebook. Also looping in @seinberg for comments on how to modify the code so the intermediate dataset can be stored locally.
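For reference, that 4 GiB cap applies to pickle protocols below 4; Python 3.4+ supports protocol 4, which lifts the per-object size limit. A minimal sketch of the round trip (the payload and filename here are illustrative, not what expansion.py actually serializes):

```python
import os
import pickle
import tempfile

# Illustrative payload; in expansion.py this would be the tuple of
# pub numbers, texts, embeddings, and one-hot matrices.
data = {"embeddings": b"\x00" * 1024, "pub_nums": ["US-1234-A1"]}

path = os.path.join(tempfile.mkdtemp(), "inference_data.pkl")

# protocol=4 removes the 4 GiB limit on serialized bytes objects
# that older protocols impose.
with open(path, "wb") as outfile:
    pickle.dump(data, outfile, protocol=4)

with open(path, "rb") as infile:
    restored = pickle.load(infile)

assert restored == data
```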
The issue you're seeing is something I've seen from time to time, and I think the cause is that one (or more) of the patents in the L1 result set contain a very large blob of text. This may actually be an issue with the BigQuery dataset, though I've not investigated closely. One simple workaround is to update the sample size on this line:
to be something smaller than 0.2 (20% of the L1 patents). Try 5% (0.05) to start with, and dial it up until you see the problem again. This isn't ideal if you want the full set of L1 patents for inference (e.g., to build the full patent landscape), but it can get you a good portion of the way there.
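The dial-up approach can be sketched with a hypothetical stand-in for `sample_for_inference` (the real method also builds embeddings and one-hot matrices; this stub only shows the fraction-based sampling):

```python
import random

def sample_l1_stub(items, sample_frac):
    """Hypothetical stand-in for expander.sample_for_inference:
    returns a random subset of the L1 patents at the given fraction."""
    k = int(len(items) * sample_frac)
    return random.sample(items, k)

l1_patents = [f"US-{n}-A1" for n in range(1000)]

# Start at 5% and dial the fraction up until serialization fails again.
subset = sample_l1_stub(l1_patents, 0.05)
print(len(subset))  # 50 of the 1000 L1 patents
```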
Hi @aalbracht, do you think you can help on this one? #47
I have been trying to recreate the patent landscape code in a Jupyter notebook. Everything runs perfectly until I get to loading the inference data. I get "EOFError: Ran out of input" at line 541 of expansion.py.
```
Loading inference data from filesystem at data\video_codec\landscape_inference_data.pkl
---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 subset_l1_pub_nums, l1_texts, padded_abstract_embeddings, refs_one_hot, cpc_one_hot = expander.sample_for_inference(td, 0.2)

~\Documents\Python Scripts\expansion.py in sample_for_inference(self, train_data_util, sample_frac)
    539             print('Loading inference data from filesystem at {}'.format(inference_data_path))
    540             with open(inference_data_path, 'rb') as infile:
--> 541                 inference_data_deserialized = pickle.load(infile)
    542 
    543             subset_l1_pub_nums, l1_texts, padded_abstract_embeddings, refs_one_hot, cpc_one_hot = \

EOFError: Ran out of input
```
Appreciate the help. I can put this all on my GitHub repo if that would help.
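An "EOFError: Ran out of input" from `pickle.load` usually means the file is empty or truncated (e.g., a dump that failed partway through). One defensive pattern, sketched below (this is not the project's code; the function name is hypothetical), is to treat such a file as missing so the caller regenerates it:

```python
import os
import pickle

def load_inference_data(path):
    """Sketch: guard against an empty or truncated intermediate pickle.
    An empty file makes pickle.load raise EOFError ('Ran out of input')."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return None  # caller should regenerate the intermediate data
    try:
        with open(path, "rb") as infile:
            return pickle.load(infile)
    except (EOFError, pickle.UnpicklingError):
        os.remove(path)  # drop the corrupt file so it gets rebuilt
        return None
```

With this guard, deleting or corrupting the .pkl no longer crashes the notebook; it just falls back to recomputing the inference data.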