Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(parquet-exporter): export blocks, txs and data items to parquet #213

Merged
merged 9 commits into from
Sep 27, 2024

Conversation

karlprieb
Copy link
Collaborator

  • Add schema for blocks and transactions tables
  • Update schema for tags table
  • Refactor ParquetExporter to import blocks, transactions and tags from sqlite to duckdb
  • Refactor ParquetExporter to export blocks, transactions and tags to parquet

* Add schema for blocks and transactions tables
* Update schema for tags table
* Refactor ParquetExporter to import blocks, transactions and tags from sqlite to duckdb
* Refactor ParquetExporter to export blocks, transactions and tags to parquet
Copy link
Contributor

coderabbitai bot commented Sep 26, 2024

📝 Walkthrough
📝 Walkthrough

Walkthrough

The changes encompass modifications to the database schema and enhancements to the data export functionality. The schema.sql file has been updated to alter the tags table and introduce new transactions and blocks tables. Additionally, the ParquetExporter class has been enhanced with improved input validation, new methods for importing data, and updates to the export process, including error handling and logging.

Changes

File Change Summary
src/database/duckdb/schema.sql - Modified tags table: changed created_at to indexed_at (nullable).
- Added new tables: transactions and blocks.
src/routes/ar-io.ts - Enhanced input validation for parameters.
- Updated database paths for ParquetExporter.
- Renamed method from exportDataItemTagsParquet to export.
src/workers/parquet-exporter.ts - Updated constructor to accept new parameters.
- Added methods for importing data.
- Enhanced error handling and logging.
- Modified export process and data type handling.
package.json - Updated duckdb-async package version from ^1.0.0 to ^1.1.0.

Possibly related PRs

Suggested reviewers

  • djwhitt

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

codecov bot commented Sep 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.69%. Comparing base (6242301) to head (1afc1e9).
Report is 13 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #213      +/-   ##
===========================================
- Coverage    70.19%   68.69%   -1.51%     
===========================================
  Files           32       32              
  Lines         7905     7905              
  Branches       430      430              
===========================================
- Hits          5549     5430     -119     
- Misses        2356     2474     +118     
- Partials         0        1       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (4)
src/database/duckdb/schema.sql (1)

11-36: Approve new transactions table with a minor suggestion

The new transactions table structure looks comprehensive and well-designed, covering various aspects of a transaction with appropriate data types. This addition aligns well with the PR objectives.

Consider renaming the "offset" column to avoid using a SQL keyword:

- "offset" UBIGINT,
+ transaction_offset UBIGINT,

This change would eliminate the need for quotes and make the schema more consistent with other column names.

src/routes/ar-io.ts (1)

232-242: Improved input validation, consider enhancing error messages.

The added input validation is a good improvement, ensuring that all required parameters are present and of the correct type. This helps prevent potential issues caused by invalid input.

Consider enhancing the error message to specify which parameter(s) failed validation. This would make it easier for API consumers to identify and correct issues. For example:

const invalidParams = [];
if (typeof outputDir !== 'string' || outputDir.trim() === '') invalidParams.push('outputDir');
if (!Number.isInteger(startHeight) || startHeight < 0) invalidParams.push('startHeight');
if (!Number.isInteger(endHeight) || endHeight < 0) invalidParams.push('endHeight');
if (!Number.isInteger(maxFileRows) || maxFileRows < 0) invalidParams.push('maxFileRows');

if (invalidParams.length > 0) {
  res.status(400).send(`Invalid or missing parameters: ${invalidParams.join(', ')}`);
  return;
}

This approach provides more specific feedback about which parameters need attention.

src/workers/parquet-exporter.ts (2)

69-69: Log error objects to include stack traces

When logging errors, ensure that you log the error objects instead of just the messages to include stack traces for better debugging.

Update your error logging statements as follows:

- logger.error('Error creating DuckDB database', error);
+ logger.error('Error creating DuckDB database', { error });

This change ensures that the full error details are captured in the logs.

Also applies to: 84-84, 126-126, 181-181, 236-236, 273-273, 312-312, 366-366, 452-452


41-44: Initialize 'isExporting' inside the constructor for consistency

Initializing class properties inside the constructor ensures consistency and clarity.

Apply this diff to move the initialization:

- private isExporting = false;

constructor({
  log,
  db,
  duckDbPath,
}: {
  log: winston.Logger;
  db: Database;
  duckDbPath: string;
}) {
  this.log = log;
  this.db = db;
  this.duckDbPath = duckDbPath;
+ this.isExporting = false;
}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between d12f1f9 and 117dfda.

⛔ Files ignored due to path filters (1)
  • flake.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • src/database/duckdb/schema.sql (1 hunks)
  • src/routes/ar-io.ts (1 hunks)
  • src/workers/parquet-exporter.ts (3 hunks)
🔇 Additional comments (5)
src/database/duckdb/schema.sql (1)

38-46: Approve new blocks table with a question

The new blocks table structure looks good, covering essential information about blocks with appropriate data types. This addition aligns well with the PR objectives.

Could you clarify why indep_hash is nullable? As it appears to be a unique identifier for blocks, it might be better to make it NOT NULL. If there's a specific reason for it to be nullable, please explain.

- indep_hash BLOB,
+ indep_hash BLOB NOT NULL,

If this change is appropriate, please update the schema accordingly. If there's a reason for it to be nullable, consider adding a comment in the schema to explain the rationale.

src/routes/ar-io.ts (1)

248-250: Updated ParquetExporter configuration and usage.

The changes to the ParquetExporter configuration and usage align with the expanded functionality described in the PR objectives. The exporter now seems capable of handling blocks and transactions in addition to tags.

To ensure the new functionality works as intended, please verify the following:

  1. The new sqliteCoreDbPath is correctly set up and accessible.
  2. The export method in ParquetExporter correctly handles the expanded dataset (blocks, transactions, and tags).

Run the following script to check the ParquetExporter implementation:

This script will help confirm that the ParquetExporter class has been properly updated to handle the new export functionality and that it's using the new sqliteCoreDbPath.

Also applies to: 253-258

✅ Verification successful

src/database/duckdb/schema.sql Show resolved Hide resolved
await this.db.exec(`INSERT INTO blocks ${query}`);
log.info('Blocks inserted into DuckDB');
} catch (error) {
throw `Error importing blocks: ${error}`;
Copy link
Contributor

@coderabbitai coderabbitai bot Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Throw Error objects instead of strings for better error handling

In the catch blocks, you're currently throwing strings. Throwing Error objects preserves the stack trace and is a best practice for consistent error handling.

Apply this diff to fix the error throwing:

- throw `Error importing blocks: ${error}`;
+ throw new Error(`Error importing blocks: ${error}`);

Repeat this change in all similar instances throughout the methods.

Also applies to: 180-180, 235-235, 272-272, 311-311, 365-365, 474-474, 493-493

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karlprieb, I think this is a good comment by @coderabbitai. I asked ChatGPT about the best way to catch and rethrow errors and this is what it suggested: https://chatgpt.com/share/e/66f6d1d2-8f7c-8004-9d74-c86d298a6225

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@djwhitt, I'm glad you found the suggestion helpful!

(_/)
(^‿^)

src/workers/parquet-exporter.ts Show resolved Hide resolved
Copy link
Collaborator

@djwhitt djwhitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. This looks good. Only thing I think need adjustment is how we're rethrowing errors.

await this.db.exec(`INSERT INTO blocks ${query}`);
log.info('Blocks inserted into DuckDB');
} catch (error) {
throw `Error importing blocks: ${error}`;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karlprieb, I think this is a good comment by @coderabbitai. I asked ChatGPT about the best way to catch and rethrow errors and this is what it suggested: https://chatgpt.com/share/e/66f6d1d2-8f7c-8004-9d74-c86d298a6225

await this.db.exec(`INSERT INTO transactions ${query}`);
log.info('Transactions inserted into DuckDB');
} catch (error) {
throw `Error importing transactions: ${error}`;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as above re error handling.

await this.db.exec(`INSERT INTO transactions ${query}`);
log.info('Data items inserted into DuckDB');
} catch (error) {
throw `Error importing data items: ${error}`;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as above re error handling.

await this.db.exec(`INSERT INTO tags ${query}`);
log.info('Transaction tags inserted into DuckDB');
} catch (error) {
throw `Error importing transaction tags: ${error}`;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as above re error handling.

minHeight = height + 1n;
rowCount = 0n;
} catch (error) {
throw `Error exporting Parquet file ${fileName}: ${error}`;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as above re error handling.

await this.db.exec(`CHECKPOINT ${tableName};`);
return result[0].count;
} catch (error) {
throw `Error getting row count for height ${height} in ${tableName}: ${error}`;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as above re error handling.

maxHeight: result[0].max_height,
};
} catch (error) {
throw `Error getting height range for ${tableName}: ${error}`;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as above re error handling.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)
src/workers/parquet-exporter.ts (1)

345-347: Simplify height range calculations for clarity

The current calculations for minHeight and maxHeight can be simplified to enhance readability.

Consider refactoring as follows:

- minHeight =
-   minHeight > BigInt(startHeight) ? minHeight : BigInt(startHeight);
- maxHeight = maxHeight < BigInt(endHeight) ? maxHeight : BigInt(endHeight);
+ minHeight = BigInt(Math.max(Number(minHeight), startHeight));
+ maxHeight = BigInt(Math.min(Number(maxHeight), endHeight));

This refactoring uses Math.max and Math.min to make the intention clearer.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 117dfda and 8b0fcdf.

⛔ Files ignored due to path filters (1)
  • yarn.lock is excluded by !**/yarn.lock, !**/*.lock
📒 Files selected for processing (2)
  • package.json (1 hunks)
  • src/workers/parquet-exporter.ts (3 hunks)
🔇 Additional comments (2)
package.json (1)

24-24: Approve the update of duckdb-async, but verify compatibility.

The update of duckdb-async from ^1.0.0 to ^1.1.0 is a minor version bump, which typically indicates new features or non-breaking improvements. This update is likely beneficial for the project, potentially bringing performance improvements or bug fixes.

However, it's important to ensure that this update doesn't introduce any breaking changes or compatibility issues with other dependencies in the project.

To verify the compatibility and impact of this update, please run the following script:

After running these checks, please review the results to ensure there are no unexpected issues or breaking changes introduced by this update.

src/workers/parquet-exporter.ts (1)

Line range hint 350-374: Verify correctness of minHeight adjustment within loop

Updating minHeight inside the loop might lead to unexpected behavior in subsequent iterations.

Please ensure that modifying minHeight within the loop does not cause any off-by-one errors or skip any heights. Verify that all intended heights are processed correctly.

Comment on lines +125 to +128
const newError = new Error('Error importing blocks');
newError.stack = error.stack;
throw newError;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Preserve original error messages when re-throwing

In the catch blocks, you're creating new Error objects with a generic message and assigning the original error's stack to the new error. This approach loses the original error's message and any additional context, which can be valuable for debugging.

Consider using the cause property of the Error object to wrap the original error. This way, you preserve the original error's message and stack trace.

Here's how you can modify the catch blocks:

- const newError = new Error('Error importing blocks');
- newError.stack = error.stack;
- throw newError;
+ throw new Error('Error importing blocks', { cause: error });

Apply this change to all similar instances throughout the methods:

  • importBlocks
  • importTransactions
  • importDataItems
  • importTransactionTags
  • importDataItemTags
  • exportToParquet
  • getHeightRange
  • getRowCountForHeight

Also applies to: 182-185, 239-242, 278-281, 319-322, 376-379, 489-490, 508-509

Comment on lines +398 to +414
const log = this.log.child({ method: 'export' });

if (this.isExporting) {
log.error('An export is already in progress');
return;
}
this.isExporting = true;

if (startHeight > endHeight) {
log.error('startHeight must be less than or equal to endHeight');
return;
}

if (maxFileRows <= 0) {
log.error('maxFileRows must be a positive number');
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Throw errors for invalid input parameters to alert calling code

In the export method, when input validations fail, the function logs an error and returns. This might not sufficiently inform the calling code about the failure.

Consider throwing an error after logging to ensure the issue is propagated:

if (startHeight > endHeight) {
  log.error('startHeight must be less than or equal to endHeight');
- return;
+ throw new Error('Invalid height range: startHeight must be less than or equal to endHeight');
}

if (maxFileRows <= 0) {
  log.error('maxFileRows must be a positive number');
- return;
+ throw new Error('Invalid parameter: maxFileRows must be positive');
}

This change ensures that the calling code is aware of the invalid parameters and can handle the exceptions appropriately.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const log = this.log.child({ method: 'export' });
if (this.isExporting) {
log.error('An export is already in progress');
return;
}
this.isExporting = true;
if (startHeight > endHeight) {
log.error('startHeight must be less than or equal to endHeight');
return;
}
if (maxFileRows <= 0) {
log.error('maxFileRows must be a positive number');
return;
}
const log = this.log.child({ method: 'export' });
if (this.isExporting) {
log.error('An export is already in progress');
return;
}
this.isExporting = true;
if (startHeight > endHeight) {
log.error('startHeight must be less than or equal to endHeight');
throw new Error('Invalid height range: startHeight must be less than or equal to endHeight');
}
if (maxFileRows <= 0) {
log.error('maxFileRows must be a positive number');
throw new Error('Invalid parameter: maxFileRows must be positive');
}

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🧹 Outside diff range and nitpick comments (2)
src/workers/parquet-exporter.ts (2)

Line range hint 338-385: Improve security and error handling in exportToParquet method

The exportToParquet method has good use of BigInt for handling large numbers, but requires improvements:

  1. SQL Injection vulnerability:

Use parameterized queries to prevent SQL injection. Replace the query execution with:

await this.db.exec(`
  COPY (
    SELECT * FROM ${tableName}
    WHERE height >= ? AND height <= ?
  ) TO ? (FORMAT PARQUET, COMPRESSION 'zstd')
`, [minHeight, height, filePath]);
  1. Error handling:

Preserve the original error message when re-throwing. Replace the catch block with:

} catch (error) {
  throw new Error(`Error exporting Parquet file ${fileName}`, { cause: error });
}

The use of BigInt for height comparisons is a good practice for handling potentially large block heights.


Line range hint 1-515: Overall assessment: Good functionality with areas for improvement

The ParquetExporter class provides comprehensive functionality for exporting data to Parquet format. The use of BigInt for height values is commendable, ensuring compatibility with large block heights. However, there are consistent areas for improvement across the methods:

  1. SQL Injection prevention: Most methods need to implement parameterized queries or input validation to prevent SQL injection vulnerabilities.

  2. Error handling: The error handling can be improved by preserving original error messages when re-throwing errors.

  3. Input validation: Some methods, particularly the export method, could benefit from stricter input validation with error throwing instead of just logging.

Addressing these issues will significantly enhance the security and robustness of the code. Consider applying the suggested changes consistently across all methods in the class.

As a general architectural suggestion, consider implementing a separate SQL query builder or data access layer. This could help centralize SQL injection prevention measures and make the code more maintainable.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 8b0fcdf and 1afc1e9.

📒 Files selected for processing (1)
  • src/workers/parquet-exporter.ts (3 hunks)
🔇 Additional comments (2)
src/workers/parquet-exporter.ts (2)

18-26: LGTM: Import statements and class properties look good

The new imports and class properties are appropriate for the added functionality. The isExporting flag is a good addition to prevent concurrent exports.


31-39: LGTM: Constructor updates are consistent

The addition of the duckDbPath parameter and its initialization in the constructor are consistent with the new class property.

Comment on lines +45 to +92
sqliteBundlesDbPath,
sqliteCoreDbPath,
}: {
log: winston.Logger;
duckDbPath: string;
sqliteDbPath: string;
sqliteBundlesDbPath: string;
sqliteCoreDbPath: string;
}) {
const logger = log.child({ class: this.constructor.name });
const db = await Database.create(duckDbPath);
const logger = log.child({ class: 'ParquetExporter' });
let db: Database;

const duckDbSchema = readFileSync(
'./src/database/duckdb/schema.sql',
'utf8',
);
await db.exec(duckDbSchema);
logger.debug('DuckDB schema created');
try {
db = await Database.create(duckDbPath);

const duckDbSchema = readFileSync(
'./src/database/duckdb/schema.sql',
'utf8',
);
await db.exec(duckDbSchema);

logger.debug('DuckDB schema created');
} catch (error) {
logger.error('Error creating DuckDB database', error);
throw error;
}

try {
await db.exec(`INSTALL sqlite; LOAD sqlite;`);
await db.exec(
`ATTACH '${sqliteBundlesDbPath}' AS sqlite_bundles_db (TYPE sqlite);`,
);
await db.exec(
`ATTACH '${sqliteCoreDbPath}' AS sqlite_core_db (TYPE sqlite);`,
);

await db.exec(`INSTALL sqlite; LOAD sqlite;`);
await db.exec(`ATTACH '${sqliteDbPath}' AS sqlite_db (TYPE sqlite);`);
logger.debug('SQLite loaded');
return new ParquetExporter({ log: logger, db, duckDbPath, sqliteDbPath });
logger.debug('SQLite databases loaded');
} catch (error) {
logger.error('Error loading SQLite databases', error);
throw error;
}

logger.info('DuckDB database created!');

return new ParquetExporter({
log: logger,
db,
duckDbPath,
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

LGTM with a security suggestion: Use parameterized queries for ATTACH statements

The improvements in error handling and logging are good. However, there's a potential SQL injection vulnerability in the ATTACH statements.

Consider using parameterized queries for the ATTACH statements to prevent SQL injection. For example:

await db.exec(`ATTACH ? AS sqlite_bundles_db (TYPE sqlite);`, [sqliteBundlesDbPath]);
await db.exec(`ATTACH ? AS sqlite_core_db (TYPE sqlite);`, [sqliteCoreDbPath]);

This approach treats the file paths as parameters rather than part of the SQL string, enhancing security.

Comment on lines +95 to +129
private async importBlocks({
startHeight,
endHeight,
}: {
startHeight: number;
endHeight: number;
}) {
const log = this.log.child({ method: 'importBlocks' });
const query = `
SELECT
indep_hash,
height,
previous_block,
nonce,
hash,
block_timestamp,
tx_count,
block_size
FROM
sqlite_core_db.stable_blocks
WHERE
height BETWEEN ${startHeight} AND ${endHeight}
ORDER BY
height ASC;
`;

try {
await this.db.exec(`INSERT INTO blocks ${query}`);
log.info('Blocks inserted into DuckDB');
} catch (error: any) {
const newError = new Error('Error importing blocks');
newError.stack = error.stack;
throw newError;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Improve security and error handling in importBlocks method

The importBlocks method has two areas for improvement:

  1. SQL Injection vulnerability:

Use parameterized queries to prevent SQL injection. Replace the query string with:

const query = `
  SELECT
    indep_hash,
    height,
    previous_block,
    nonce,
    hash,
    block_timestamp,
    tx_count,
    block_size
  FROM
    sqlite_core_db.stable_blocks
  WHERE
    height BETWEEN ? AND ?
  ORDER BY
    height ASC;
`;
await this.db.exec(`INSERT INTO blocks ${query}`, [startHeight, endHeight]);
  1. Error handling:

Preserve the original error message when re-throwing. Replace the catch block with:

} catch (error) {
  throw new Error('Error importing blocks', { cause: error });
}

This approach maintains the original error information while adding context.

Comment on lines +131 to +186
private async importTransactions({
startHeight,
endHeight,
maxFileRows,
outputDir,
}: {
startHeight: number;
endHeight: number;
maxFileRows: number;
outputDir: string;
}) {
const sqliteQuery = `
SELECT
sdit.height,
sdit.data_item_id AS id,
sdit.data_item_tag_index AS tag_index,
sdi.indexed_at AS created_at,
tn.name AS tag_name,
tv.value AS tag_value,
1 AS is_data_item
FROM
sqlite_db.stable_data_item_tags sdit
JOIN
sqlite_db.tag_names tn ON sdit.tag_name_hash = tn.hash
JOIN
sqlite_db.tag_values tv ON sdit.tag_value_hash = tv.hash
JOIN
sqlite_db.stable_data_items sdi ON sdit.data_item_id = sdi.id
WHERE
sdit.height BETWEEN ${startHeight} AND ${endHeight};
const log = this.log.child({ method: 'importTransactions' });
const query = `
SELECT
st.id,
NULL AS indexed_at,
st.block_transaction_index,
0 AS is_data_item,
st.target,
st.quantity,
st.reward,
st.last_tx as anchor,
st.data_size,
st.content_type,
st.format,
st.height,
st.owner_address,
st.data_root,
NULL AS parent,
st."offset",
NULL AS size,
NULL AS data_offset,
NULL AS owner_offset,
NULL AS owner_size,
CASE
WHEN octet_length(w.public_modulus) <= 64 THEN w.public_modulus
ELSE NULL
END AS owner,
NULL AS signature_offset,
NULL AS signature_size,
NULL AS signature_type
FROM
sqlite_core_db.stable_transactions st
LEFT JOIN
sqlite_core_db.wallets w ON st.owner_address = w.address
WHERE
st.height BETWEEN ${startHeight} AND ${endHeight}
ORDER BY
st.height ASC;
`;

await this.db.exec(`INSERT INTO tags ${sqliteQuery}`);
try {
await this.db.exec(`INSERT INTO transactions ${query}`);
log.info('Transactions inserted into DuckDB');
} catch (error: any) {
const newError = new Error('Error importing transactions');
newError.stack = error.stack;
throw newError;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Apply security and error handling improvements to importTransactions method

The importTransactions method has the same issues as importBlocks:

  1. SQL Injection vulnerability:

Use parameterized queries to prevent SQL injection. Replace the query execution with:

await this.db.exec(`INSERT INTO transactions ${query}`, [startHeight, endHeight]);

Ensure that ${startHeight} and ${endHeight} in the query string are replaced with ?.

  1. Error handling:

Preserve the original error message when re-throwing. Replace the catch block with:

} catch (error) {
  throw new Error('Error importing transactions', { cause: error });
}

These changes will improve security and maintain better error context.

Comment on lines +188 to +243
private async importDataItems({
startHeight,
endHeight,
}: {
startHeight: number;
endHeight: number;
}) {
const log = this.log.child({ method: 'importDataItems' });
const query = `
SELECT
sdi.id,
sdi.indexed_at,
NULL AS block_transaction_index,
1 AS is_data_item,
sdi.target,
NULL AS quantity,
NULL AS reward,
sdi.anchor,
sdi.data_size,
sdi.content_type,
NULL AS format,
sdi.height,
sdi.owner_address,
NULL AS data_root,
sdi.parent_id AS parent,
sdi."offset",
sdi.size,
sdi.data_offset,
sdi.owner_offset,
sdi.owner_size,
CASE
WHEN octet_length(w.public_modulus) <= 64 THEN w.public_modulus
ELSE NULL
END AS owner,
sdi.signature_offset,
sdi.signature_size,
sdi.signature_type
FROM
sqlite_bundles_db.stable_data_items sdi
LEFT JOIN
sqlite_bundles_db.wallets w ON sdi.owner_address = w.address
WHERE
sdi.height BETWEEN ${startHeight} AND ${endHeight}
ORDER BY
sdi.height ASC;
`;

this.log.debug('Data inserted into DuckDB');
try {
await this.db.exec(`INSERT INTO transactions ${query}`);
log.info('Data items inserted into DuckDB');
} catch (error: any) {
const newError = new Error('Error importing data items');
newError.stack = error.stack;
throw newError;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Implement security and error handling improvements in importDataItems method

The importDataItems method requires the same improvements as the previous import methods:

  1. SQL Injection vulnerability:

Use parameterized queries to prevent SQL injection. Replace the query execution with:

await this.db.exec(`INSERT INTO transactions ${query}`, [startHeight, endHeight]);

Ensure that ${startHeight} and ${endHeight} in the query string are replaced with ?.

  1. Error handling:

Preserve the original error message when re-throwing. Replace the catch block with:

} catch (error) {
  throw new Error('Error importing data items', { cause: error });
}

These changes will enhance security and provide better error context.

Comment on lines +245 to +282
private async importTransactionTags({
startHeight,
endHeight,
}: {
startHeight: number;
endHeight: number;
}) {
const log = this.log.child({ method: 'importTransactionTags' });
const query = `
SELECT
stt.height,
stt.transaction_id AS id,
stt.transaction_tag_index AS tag_index,
NULL AS indexed_at,
tn.name AS tag_name,
tv.value AS tag_value,
0 AS is_data_item
FROM
sqlite_core_db.stable_transaction_tags stt
JOIN
sqlite_core_db.tag_names tn ON stt.tag_name_hash = tn.hash
JOIN
sqlite_core_db.tag_values tv ON stt.tag_value_hash = tv.hash
WHERE
stt.height BETWEEN ${startHeight} AND ${endHeight}
ORDER BY
stt.height ASC;
`;

try {
await this.db.exec(`INSERT INTO tags ${query}`);
log.info('Transaction tags inserted into DuckDB');
} catch (error: any) {
const newError = new Error('Error importing transaction tags');
newError.stack = error.stack;
throw newError;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Apply security and error handling improvements to importTransactionTags method

The importTransactionTags method needs the same improvements as the previous import methods:

  1. SQL Injection vulnerability:

Use parameterized queries to prevent SQL injection. Replace the query execution with:

await this.db.exec(`INSERT INTO tags ${query}`, [startHeight, endHeight]);

Ensure that ${startHeight} and ${endHeight} in the query string are replaced with ?.

  1. Error handling:

Preserve the original error message when re-throwing. Replace the catch block with:

} catch (error) {
  throw new Error('Error importing transaction tags', { cause: error });
}

These changes will improve security and maintain better error context.

Comment on lines +284 to +322
private async importDataItemTags({
startHeight,
endHeight,
}: {
startHeight: number;
endHeight: number;
}) {
const log = this.log.child({ method: 'importDataItemTags' });
const query = `
SELECT
sdit.height,
sdit.data_item_id AS id,
sdit.data_item_tag_index AS tag_index,
sdi.indexed_at,
tn.name AS tag_name,
tv.value AS tag_value,
1 AS is_data_item
FROM
sqlite_bundles_db.stable_data_item_tags sdit
JOIN
sqlite_bundles_db.tag_names tn ON sdit.tag_name_hash = tn.hash
JOIN
sqlite_bundles_db.tag_values tv ON sdit.tag_value_hash = tv.hash
JOIN
sqlite_bundles_db.stable_data_items sdi ON sdit.data_item_id = sdi.id
WHERE
sdit.height BETWEEN ${startHeight} AND ${endHeight}
ORDER BY
sdit.height ASC;
`;

await this.db.close();
try {
await this.db.exec(`INSERT INTO tags ${query}`);
log.info('Data item tags inserted into DuckDB');
} catch (error: any) {
const newError = new Error('Error importing data item tags');
newError.stack = error.stack;
throw newError;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Implement security and error handling improvements in importDataItemTags method

The importDataItemTags method requires the same improvements as the previous import methods:

  1. SQL Injection vulnerability:

Use parameterized queries to prevent SQL injection. Replace the query execution with:

await this.db.exec(`INSERT INTO tags ${query}`, [startHeight, endHeight]);

Ensure that ${startHeight} and ${endHeight} in the query string are replaced with ?.

  1. Error handling:

Preserve the original error message when re-throwing. Replace the catch block with:

} catch (error) {
  throw new Error('Error importing data item tags', { cause: error });
}

These changes will enhance security and provide better error context.

Comment on lines +387 to +470
async export({
outputDir,
startHeight,
endHeight,
maxFileRows,
}: {
outputDir: string;
startHeight: number;
endHeight: number;
maxFileRows: number;
}): Promise<void> {
const log = this.log.child({ method: 'export' });

if (this.isExporting) {
log.error('An export is already in progress');
return;
}
this.isExporting = true;

if (startHeight > endHeight) {
log.error('startHeight must be less than or equal to endHeight');
return;
}

if (maxFileRows <= 0) {
log.error('maxFileRows must be a positive number');
return;
}

if (!existsSync(outputDir)) {
mkdirSync(outputDir, { recursive: true });
}

try {
// Import data into DuckDB
await this.importBlocks({ startHeight, endHeight });
await this.importTransactions({ startHeight, endHeight });
await this.importDataItems({ startHeight, endHeight });
await this.importTransactionTags({ startHeight, endHeight });
await this.importDataItemTags({ startHeight, endHeight });

// Export data to Parquet files
await this.exportToParquet({
outputDir,
tableName: 'blocks',
startHeight,
endHeight,
maxFileRows,
});

await this.exportToParquet({
outputDir,
tableName: 'transactions',
startHeight,
endHeight,
maxFileRows,
});

await this.exportToParquet({
outputDir,
tableName: 'tags',
startHeight,
endHeight,
maxFileRows,
});

log.info('Parquet export complete');
} catch (error) {
log.error('Error exporting Parquet files', {
error: error instanceof Error ? error.message : String(error),
stack: error instanceof Error ? error.stack : undefined,
});
} finally {
await this.db.close();

// Delete the duckdb file
try {
rmSync(this.duckDbPath, { recursive: true, force: true });
rmSync(`${this.duckDbPath}.wal`, { force: true });
} catch (error) {
log.error(`Error deleting duckdb file ${this.duckDbPath}:`, error);
}
this.isExporting = false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Good overall structure, with suggestions for input validation

The export method has a good structure for orchestrating the export process. The use of a flag to prevent concurrent exports is a good practice. However, there's room for improvement in input validation:

Consider throwing errors for invalid input parameters instead of just logging. This will allow the calling code to handle these issues more effectively. For example:

if (startHeight > endHeight) {
  throw new Error('startHeight must be less than or equal to endHeight');
}

if (maxFileRows <= 0) {
  throw new Error('maxFileRows must be a positive number');
}

The error handling in the catch block and the cleanup process in the finally block are well implemented. The method effectively manages the entire export process, including proper resource cleanup.

Comment on lines +475 to +492
): Promise<{ minHeight: bigint; maxHeight: bigint }> {
const query = `
SELECT MIN(height) as min_height, MAX(height) as max_height
FROM ${tableName}
`;
const result = await this.db.all(query);

return {
minHeight: result[0].min_height,
maxHeight: result[0].max_height,
};
try {
const result = await this.db.all(query);

return {
minHeight: result[0].min_height,
maxHeight: result[0].max_height,
};
} catch (error: any) {
const newError = new Error(`Error getting height range for ${tableName}`);
newError.stack = error.stack;
throw newError;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Improve security and error handling in getHeightRange method

The getHeightRange method makes good use of BigInt for height values, but requires improvements:

  1. SQL Injection vulnerability:

Use a whitelist of allowed table names to prevent SQL injection. For example:

const allowedTables = ['blocks', 'transactions', 'tags'];
if (!allowedTables.includes(tableName)) {
  throw new Error(`Invalid table name: ${tableName}`);
}
  1. Error handling:

Preserve the original error message when re-throwing. Replace the catch block with:

} catch (error) {
  throw new Error(`Error getting height range for ${tableName}`, { cause: error });
}

The use of BigInt for height values is appropriate for handling potentially large block heights.

Comment on lines +497 to +515
height: bigint,
): Promise<bigint> {
const query = `
SELECT COUNT(*) as count
FROM ${tableName}
WHERE height = ${height}
`;
const result = await this.db.all(query);

return Number(result[0].count);
}
try {
const result = await this.db.all(query);

private async truncateTable(tableName: string): Promise<void> {
await this.db.exec(`TRUNCATE TABLE ${tableName};`);
await this.db.exec(`CHECKPOINT ${tableName};`);
return result[0].count;
} catch (error: any) {
const newError = new Error(
`Error getting row count for height ${height} in ${tableName}`,
);
newError.stack = error.stack;
throw newError;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Enhance security and error handling in getRowCountForHeight method

The getRowCountForHeight method appropriately uses BigInt for the height value, but needs improvements:

  1. SQL Injection vulnerability:

Use a whitelist for table names and parameterized query for the height value:

const allowedTables = ['blocks', 'transactions', 'tags'];
if (!allowedTables.includes(tableName)) {
  throw new Error(`Invalid table name: ${tableName}`);
}

const query = `
  SELECT COUNT(*) as count
  FROM ${tableName}
  WHERE height = ?
`;

const result = await this.db.all(query, [height]);
  1. Error handling:

Preserve the original error message when re-throwing. Replace the catch block with:

} catch (error) {
  throw new Error(`Error getting row count for height ${height} in ${tableName}`, { cause: error });
}

The use of BigInt for the height value is appropriate for handling potentially large block heights.

@djwhitt djwhitt merged commit de28367 into develop Sep 27, 2024
4 checks passed
@djwhitt djwhitt deleted the PE-6733-blocks-txs-di-tags branch September 27, 2024 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants