Convert to human readable format #6

Open
caiqi opened this issue Jun 24, 2024 · 1 comment

caiqi commented Jun 24, 2024

For anyone interested, here is a simple snippet to convert the TFRecord file to JSON format:

import base64
import json

import tensorflow as tf

file_path = "dev.tfrecord"


def parse_tfrecord(record):
    example = tf.train.Example()
    example.ParseFromString(record.numpy())
    return example


def read_tfrecord_file(file_path):
    raw_dataset = tf.data.TFRecordDataset(file_path)
    parsed_records = []

    for raw_record in raw_dataset:
        example = parse_tfrecord(raw_record)
        record = {}
        for key, value in example.features.feature.items():
            if value.bytes_list.value:
                try:
                    # Try to decode as UTF-8 string
                    record[key] = value.bytes_list.value[0].decode('utf-8')
                except UnicodeDecodeError:
                    # If decoding fails, store the bytes base64-encoded so they stay JSON-serializable
                    record[key] = base64.b64encode(value.bytes_list.value[0]).decode('utf-8')
            elif value.float_list.value:
                record[key] = value.float_list.value[0]
            elif value.int64_list.value:
                record[key] = value.int64_list.value[0]
        parsed_records.append(record)

    return parsed_records


records = read_tfrecord_file(file_path)
json_records = json.dumps(records, indent=4)

with open('output.json', 'w') as json_file:
    json_file.write(json_records)

print("TFRecord has been converted to JSON and saved as output.json")
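
The snippet above keeps only the first value of each feature (`value[0]`) and silently skips features whose lists are empty. Where a feature may hold several values, a variant along the following lines could preserve all of them. This is a minimal sketch, not part of the original script: `feature_to_json_value` and `decode_bytes` are hypothetical helper names, and the code assumes the object passed in behaves like `tf.train.Feature`, whose `kind` oneof is one of `bytes_list`, `float_list`, or `int64_list`.

```python
import base64


def decode_bytes(raw):
    """Return UTF-8 text if the bytes decode cleanly, else a base64 string."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return base64.b64encode(raw).decode("ascii")


def feature_to_json_value(feature):
    """Convert one tf.train.Feature-like object (hypothetical helper) to a
    JSON-serializable list holding every value, or None if nothing is set."""
    # WhichOneof reports which of the three list types is populated, if any.
    kind = feature.WhichOneof("kind")
    if kind is None:
        return None
    values = getattr(feature, kind).value
    if kind == "bytes_list":
        return [decode_bytes(v) for v in values]
    if kind == "float_list":
        return [float(v) for v in values]
    return [int(v) for v in values]
```

Inside the loop of `read_tfrecord_file`, the `if`/`elif` chain could then be replaced by `record[key] = feature_to_json_value(value)`. Every field becomes a list, which changes the shape of the JSON compared with the original output.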

leebird (Collaborator) commented Jun 24, 2024

Thanks for providing the code! We have added a simple script that shows how to retrieve the labels from the dataset at https://github.com/google-research/google-research/blob/master/richhf_18k/parse_tfrecord_file.py. It can be used together with this snippet to convert the dataset to JSON or other formats.
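
As one example of such another format, the parsed records could be written as JSON Lines (one object per line) instead of a single indented JSON array, so that large files can be read back line by line. This is only a sketch: `write_jsonl` is a hypothetical helper name, and it assumes `records` is an iterable of dicts like the one returned by `read_tfrecord_file` above.

```python
import json


def write_jsonl(records, path):
    """Write an iterable of dicts to `path`, one JSON object per line.

    Hypothetical helper; returns the number of records written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the output.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A call such as `write_jsonl(records, "output.jsonl")` would take the place of the `json.dumps` and file-write steps in the snippet above.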
