diff --git a/challenge/README.md b/challenge/README.md
index 92d5444..5b69d00 100644
--- a/challenge/README.md
+++ b/challenge/README.md
@@ -190,11 +190,34 @@ python evaluation.py --root_path1 ./output.json --root_path2 ./test_eval.json
 ### Results
 The zero-shot results of baseline on the sampled data are as follows:
 ```
-accuracy:  0.0
-chatgpt:  65.11111111111111
-match:  28.25
-language score:  {'val/Bleu_1': 0.0495223110147729, 'val/Bleu_2': 0.00021977465683011536, 'val/Bleu_3': 3.6312541763196866e-05, 'val/Bleu_4': 1.4776149283286042e-05, 'val/ROUGE_L': 0.08383567940883102, 'val/CIDEr': 0.09901486412073952}
-final score:  0.3240234750718823
+"accuracy":  0.0
+"chatgpt":  65.11111111111111
+"match":  28.25
+"language score":  {
+  'val/Bleu_1': 0.0495223110147729,
+  'val/Bleu_2': 0.00021977465683011536,
+  'val/Bleu_3': 3.6312541763196866e-05,
+  'val/Bleu_4': 1.4776149283286042e-05,
+  'val/ROUGE_L': 0.08383567940883102,
+  'val/CIDEr': 0.09901486412073952
+}
+"final_score":  0.3240234750718823
+```
+
+The zero-shot results of the baseline on the test data are as follows:
+```
+"accuracy": 0.0
+"chatgpt": 67.7535896248263
+"match": 18.83
+"language score": {
+  "test/Bleu_1": 0.2382764794460423,
+  "test/Bleu_2": 0.09954243471154352,
+  "test/Bleu_3": 0.03670697545241351,
+  "test/Bleu_4": 0.011298629095627342,
+  "test/ROUGE_L": 0.1992858115225957,
+  "test/CIDEr": 0.0074352082312374385
+}
+"final_score": 0.32843094354141145
 ```
 
 ## Submit to Test Server