Can not import the dataset into python #20

Open
tiechengsu opened this issue Feb 23, 2016 · 8 comments

@tiechengsu

with open('yelp_dataset_challenge_academic_dataset', encoding='utf-8') as f:
    jsondata = json.load(f)

I tried to import the dataset into Python with the code above, but it failed. The error is that the 'utf-8' codec can't decode byte 0xb5. I also tried encoding='charmap', but that didn't work either. Can anyone tell me how to import the data?
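
The decode failure is a hint that the download is not UTF-8 JSON text at all; as later comments in this thread confirm, it is a tar archive. A minimal diagnostic sketch, assuming the file name used in the comment above:

import tarfile

# json.load expects a UTF-8 text file containing a single JSON document.
# The raw Yelp download is actually an archive, which is why decoding fails.
path = 'yelp_dataset_challenge_academic_dataset'  # name taken from the comment above
print(tarfile.is_tarfile(path))  # prints True if the download is a tar archive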

@bngksgl

bngksgl commented Mar 2, 2016

@tiechengsu I am having the same problem; were you able to solve the issue?

@tiechengsu
Author

@bngksgl No, I used the previous dataset instead, which you can find here:
https://app.dominodatalab.com/mtldata/yackathon/browse/yelp_dataset_challenge_academic_dataset
It's easier to import. The latest data combines several categories together, and I have no idea how to import it.

@Hank-JSJ

It's a .tar file, just decompress it again
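
In other words, the download has to be un-archived twice before any JSON is visible. A rough sketch with Python's tarfile module, assuming the outer file name from the original comment and an inner .tar name that should be checked against what actually gets extracted:

import tarfile

# First pass: extract the raw download (outer archive).
with tarfile.open('yelp_dataset_challenge_academic_dataset') as outer:
    outer.extractall('yelp_raw')

# Second pass: the first extraction produces another tar file; extract it too.
# The inner name below is an assumption -- list yelp_raw/ and adjust it.
with tarfile.open('yelp_raw/yelp_dataset_challenge_academic_dataset.tar') as inner:
    inner.extractall('yelp_json')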

@HongxuChenUQ

"The latest data combines several categories together, and I have no idea how to import it."

Does that mean reviews.json, business.json, etc. are all stored mixed together in the same file?

@CAVIND46016

Not really sure where you people are running into errors. I have edited the code to accept .json files explicitly and convert them to .csv. I have specified the file paths in the main method explicitly instead of using argparse as in the original code. Let me know if this helps.

Reference:

https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py

"""Convert the Yelp Dataset Challenge dataset from json format to csv.
import argparse
import collections
import csv
import json
def read_and_write_file(json_file_path, csv_file_path, column_names):
"""Read in the json dataset file and write it out to a csv file, given the column names."""
with open(csv_file_path, 'w') as fout:
csv_file = csv.writer(fout)
csv_file.writerow(list(column_names))
with open(json_file_path, encoding = 'utf8') as fin:
for line in fin:
line_contents = json.loads(line)
csv_file.writerow(get_row(line_contents, column_names))
def get_superset_of_column_names_from_file(json_file_path):
"""Read in the json dataset file and return the superset of column names."""
column_names = set()
with open(json_file_path, encoding = 'utf8') as fin:
for line in fin:
line_contents = json.loads(line)
column_names.update(
set(get_column_names(line_contents).keys())
)
return column_names
def get_column_names(line_contents, parent_key=''):
"""Return a list of flattened key names given a dict.
Example:
line_contents = {
'a': {
'b': 2,
'c': 3,
},
}
will return: ['a.b', 'a.c']
These will be the column names for the eventual csv file.
"""
column_names = []
for k, v in line_contents.items():
column_name = "{0}.{1}".format(parent_key, k) if parent_key else k
if isinstance(v, collections.MutableMapping):
column_names.extend(
get_column_names(v, column_name).items()
)
else:
column_names.append((column_name, v))
return dict(column_names)
def get_nested_value(d, key):
"""Return a dictionary item given a dictionary d and a flattened key from get_column_names.

Example:
    d = {
        'a': {
            'b': 2,
            'c': 3,
            },
    }
    key = 'a.b'
    will return: 2

"""
if '.' not in key:
    if key not in d:
        return None
    return d[key]
base_key, sub_key = key.split('.', 1)
if base_key not in d:
    return None
sub_dict = d[base_key]
return get_nested_value(sub_dict, sub_key)

def get_row(line_contents, column_names):
"""Return a csv compatible row given column names and a dict."""
row = []
for column_name in column_names:
line_value = get_nested_value(
line_contents,
column_name,
)
if isinstance(line_value, str):
row.append('{0}'.format(line_value.encode('utf-8')))
elif line_value is not None:
row.append('{0}'.format(line_value))
else:
row.append('')
return row
if(name == 'main'):
"""Convert a yelp dataset file from json to csv."""
json_file = []
json_file.append('D:\YELP Dataset\yelp_academic_dataset_business.json'); #args.json_file
json_file.append('D:\YELP Dataset\yelp_academic_dataset_checkin.json');
json_file.append('D:\YELP Dataset\yelp_academic_dataset_review.json');
json_file.append('D:\YELP Dataset\yelp_academic_dataset_tip.json');
json_file.append('D:\YELP Dataset\yelp_academic_dataset_user.json');
csv_file = []
for i in range(5):
csv_file.append('{}.csv'.format((json_file[i])[0:len(json_file[i])-5]))
column_names = get_superset_of_column_names_from_file(json_file[i])
read_and_write_file(json_file[i], csv_file[i], column_names)
print('{} converted to {} successfully.'.format(json_file[i], csv_file[i]))

@HongxuChenUQ

YES! SOLVED! Once you have decompressed it from *.tar, do it again on the generated file, and then you will see the different JSON files.
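
Once the per-category files are visible, note that each one is newline-delimited JSON (one object per line), which is why the converter above calls json.loads per line instead of json.load on the whole file. A small reading sketch, with the review file name assumed from the converter's path list:

import json

# Read one JSON object per line; json.load on the whole file would fail
# because the file is a sequence of objects, not a single document.
reviews = []
with open('yelp_academic_dataset_review.json', encoding='utf-8') as fin:
    for line in fin:
        reviews.append(json.loads(line))
print('loaded', len(reviews), 'reviews')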

@tootrackminded

@CAVIND46016 are you able to post your code as a formatted snippet? Pasting it as-is produces indentation errors. Thank you!

@CAVIND46016

@dotdose: Have a look at the code here; this should work better.
https://github.com/CAVIND46016/Yelp-Reviews-Dataset-Analysis/blob/master/json_to_csv_converter.py
