Skip to content

Using StringFormat for one‐off parsing

Ben Yu edited this page Apr 4, 2024 · 21 revisions

I had a program, it reads a bunch of timeline data from a database table, and applies some business rules to generate output billing data.

For simplicity, let's assume the method looks like:

BillingResult runBilling(List<BillingTimelineRow> input);

One day in production we observed some unexpected result. And we narrowed it down to a certain time range (let's say it's 10 minutes of data).

In the principle of TDD, it's best if I can reproduce the problem in a unit test; do some local debugging to find where the bug is; then fix the bug. The new unit test will be submitted to prevent future regression.

The first thing I did was running a SQL query against the prod db: SELECT * FROM BillingTimeline WHERE timestamp BETWEEN ... AND ... (this table has no PII data)

The commandline SQL tool printed some dozens of rows that looked like this:

+---------+--------------------+------------+---------------------------+--------+---------------------+--------------------+-------------+----------------------------+
| ShardId |  PayloadTaskId     | Tier       | StartTimestamp            | PlanId | RedactedUserGroupId | ProfileName        | StartSecond |      Timeline              |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 27      | tttsk1             | PREMIUM    | 2023-12-05 17:51:00-08:00 | 234    | 123                 | testonlyaccount-2  | 21          | [1334, 3923]               |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 30      | tttsk2             | PREMIUM    | 2023-12-05 17:43:00-08:00 | 125    | 123                 | testonlyaccount-2  | 39          | [161, 54888, 77364, 14468] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 34      | tttsk1             | PREMIUM    | 2023-12-05 17:42:00-08:00 | 789    | 999                 | testonlyaccount-1  | 40          | [129, 8877]                |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 47      | tttsk1             | PREMIUM    | 2023-12-05 17:42:00-08:00 | 110    | 123                 | testonlyaccount-3  | 28          | [1739, 6026]               |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 22      | tttsk7             | PREMIUM    | 2023-12-05 17:35:00-08:00 | 226    | 456                 | testonlyaccount-24 | 1           | [26031, 42307, 13820]      |
+---------+--------------------+------------+---------------------------+--------+------------------------------+--------------------+-------------+-------------------+
...
51 row(s)

I need to feed this input data to the unit test.

Do I do some vim data masssaging to make it a csv file? Or maybe that commandline tool has an option to tweak the output format? Or maybe find another SQL GUI tool with richer support?

On a second thought, why yak shaving? why can't I just use this nice ASCII-arted data directly?

With StringFormat, parsing is a pretty easy thing to do.

First I need to define the format of each data row. The ShardId and Tier aren't needed so I just ignore them in the format string:

StringFormat inputFormat = new StringFormat(
  "|{...}|{payload_task_id}|{...}|{start_timestamp}|{plan_id}|{user_group_id}|{profile_name}|{start_second}|{timeline}|");

I'll also need to parse the start_timestamp into Instant objects as expected by the code. This requires the right DateTimeFormatter that matches the format in db. Luckily DateTimeFormats gets me that without me having to check some cheatsheet.

Then I just need to read the text file, and scan through it for all valid rows:

import com.google.mu.time.DateTimeFormats;

String input = readTheTextFile(...);
List<BillingTimelineRow> timelines = inputFormat.scan(
    input,
    (taskId, startTime, planId, userGroup, profile, startSecond, timeline) -> {
      try {
        return BillingTimelineRow.builder()
            .setUserGroupId(Long.parseLong(userGroup.trim()))
            .setStartTimestamp(DateTimeFormats.parseToInstant(startTime.trim()))
            ...
            .build();
      } catch (NumberFormatException notValidRow) {
        return null; // skip
      }
    }).collect(toList());

The parser is one-off. But in the future if I need to do similar debugging and tests, I can always just copy-paste from the commandline tool's output without worrying about data conversion. And it's nice to submit the pretty-printed input file to the repo so anyone can easily read it.