Can not import the dataset into python #20

Open
tiechengsu opened this issue Feb 23, 2016 · 8 comments

@tiechengsu

with open('yelp_dataset_challenge_academic_dataset', encoding='utf-8') as f:
    jsondata = json.load(f)

I tried to import the dataset into Python with the code above, but it failed. The error is that the 'utf-8' codec can't decode byte 0xb5. I also tried encoding='charmap', but that didn't work either. Can anyone tell me how to import the data?
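
The decode failure is a hint that the download is not UTF-8 JSON text at all; as later comments in this thread confirm, it is a tar archive. A minimal diagnostic sketch, assuming the file name used in the comment above:

import tarfile

# json.load expects a UTF-8 text file containing a single JSON document.
# The raw Yelp download is actually an archive, which is why decoding fails.
path = 'yelp_dataset_challenge_academic_dataset'  # name taken from the comment above
print(tarfile.is_tarfile(path))  # prints True if the download is a tar archive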

@bngksgl

bngksgl commented Mar 2, 2016

@tiechengsu I am having the same problem; were you able to solve the issue?

@tiechengsu
Author

@bngksgl No, I used the previous dataset instead, which you can find here:
https://app.dominodatalab.com/mtldata/yackathon/browse/yelp_dataset_challenge_academic_dataset
It's easier to import. The latest data combines several categories together, and I have no idea how to import it.

@Hank-JSJ

It's a .tar file, just decompress it again
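
In other words, the download has to be un-archived twice before any JSON is visible. A rough sketch with Python's tarfile module, assuming the outer file name from the original comment and an inner .tar name that should be checked against what actually gets extracted:

import tarfile

# First pass: extract the raw download (outer archive).
with tarfile.open('yelp_dataset_challenge_academic_dataset') as outer:
    outer.extractall('yelp_raw')

# Second pass: the first extraction produces another tar file; extract it too.
# The inner name below is an assumption -- list yelp_raw/ and adjust it.
with tarfile.open('yelp_raw/yelp_dataset_challenge_academic_dataset.tar') as inner:
    inner.extractall('yelp_json')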

@HongxuChenUQ

"The latest data combines several categories together, and I have no idea how to import it."

Does that mean reviews.json, business.json, etc. are all stored mixed together in the same file?

@CAVIND46016

Not really sure where you people are running into errors. I have edited the code to accept .json files explicitly and convert them to .csv. I have specified the file paths in the main method explicitly instead of using argparse as in the original code. Let me know if this helps.

Reference:

https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py

"""Convert the Yelp Dataset Challenge dataset from json format to csv.
import argparse
import collections
import csv
import json
def read_and_write_file(json_file_path, csv_file_path, column_names):
"""Read in the json dataset file and write it out to a csv file, given the column names."""
with open(csv_file_path, 'w') as fout:
csv_file = csv.writer(fout)
csv_file.writerow(list(column_names))
with open(json_file_path, encoding = 'utf8') as fin:
for line in fin:
line_contents = json.loads(line)
csv_file.writerow(get_row(line_contents, column_names))
def get_superset_of_column_names_from_file(json_file_path):
"""Read in the json dataset file and return the superset of column names."""
column_names = set()
with open(json_file_path, encoding = 'utf8') as fin:
for line in fin:
line_contents = json.loads(line)
column_names.update(
set(get_column_names(line_contents).keys())
)
return column_names
def get_column_names(line_contents, parent_key=''):
"""Return a list of flattened key names given a dict.
Example:
line_contents = {
'a': {
'b': 2,
'c': 3,
},
}
will return: ['a.b', 'a.c']
These will be the column names for the eventual csv file.
"""
column_names = []
for k, v in line_contents.items():
column_name = "{0}.{1}".format(parent_key, k) if parent_key else k
if isinstance(v, collections.MutableMapping):
column_names.extend(
get_column_names(v, column_name).items()
)
else:
column_names.append((column_name, v))
return dict(column_names)
def get_nested_value(d, key):
"""Return a dictionary item given a dictionary d and a flattened key from get_column_names.

Example:
    d = {
        'a': {
            'b': 2,
            'c': 3,
            },
    }
    key = 'a.b'
    will return: 2

"""
if '.' not in key:
    if key not in d:
        return None
    return d[key]
base_key, sub_key = key.split('.', 1)
if base_key not in d:
    return None
sub_dict = d[base_key]
return get_nested_value(sub_dict, sub_key)

def get_row(line_contents, column_names):
"""Return a csv compatible row given column names and a dict."""
row = []
for column_name in column_names:
line_value = get_nested_value(
line_contents,
column_name,
)
if isinstance(line_value, str):
row.append('{0}'.format(line_value.encode('utf-8')))
elif line_value is not None:
row.append('{0}'.format(line_value))
else:
row.append('')
return row
if(name == 'main'):
"""Convert a yelp dataset file from json to csv."""
json_file = []
json_file.append('D:\YELP Dataset\yelp_academic_dataset_business.json'); #args.json_file
json_file.append('D:\YELP Dataset\yelp_academic_dataset_checkin.json');
json_file.append('D:\YELP Dataset\yelp_academic_dataset_review.json');
json_file.append('D:\YELP Dataset\yelp_academic_dataset_tip.json');
json_file.append('D:\YELP Dataset\yelp_academic_dataset_user.json');
csv_file = []
for i in range(5):
csv_file.append('{}.csv'.format((json_file[i])[0:len(json_file[i])-5]))
column_names = get_superset_of_column_names_from_file(json_file[i])
read_and_write_file(json_file[i], csv_file[i], column_names)
print('{} converted to {} successfully.'.format(json_file[i], csv_file[i]))

@HongxuChenUQ

YES! SOLVED! Once you have decompressed it from *.tar, do it again on the generated file, and then you will see the different JSON files.
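
Once the per-category files are visible, note that each one is newline-delimited JSON (one object per line), which is why the converter above calls json.loads per line instead of json.load on the whole file. A small reading sketch, with the review file name assumed from the converter's path list:

import json

# Read one JSON object per line; json.load on the whole file would fail
# because the file is a sequence of objects, not a single document.
reviews = []
with open('yelp_academic_dataset_review.json', encoding='utf-8') as fin:
    for line in fin:
        reviews.append(json.loads(line))
print('loaded', len(reviews), 'reviews')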

@tootrackminded

@CAVIND46016 are you able to post your code as a formatted snippet? Pasting it as-is produces indentation errors. Thank you!

@CAVIND46016

@dotdose: Have a look at the code here; this should work better.
https://github.com/CAVIND46016/Yelp-Reviews-Dataset-Analysis/blob/master/json_to_csv_converter.py
