Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-28555 Add blocked reader for unfiltered serial index reading #17551

Conversation

jakesmith
Copy link
Member

@jakesmith jakesmith commented Jul 4, 2023

Allow indexes to be read with a large granularity.
This is particularly important if reading from storage that
charges per read access.

Signed-off-by: Jake Smith [email protected]

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

@@ -360,6 +360,7 @@ class CRegistryServer : public CSimpleInterface
msg.append(THOR_VERSION_MAJOR).append(THOR_VERSION_MINOR);
processGroup->serialize(msg);
globals->serialize(msg);
getGlobalConfigSP()->serialize(msg);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this and related changes are so that workers have access to global config, and in particular the planes.

@github-actions
Copy link

github-actions bot commented Jul 4, 2023

Copy link
Member

@richardkchapman richardkchapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit title suggests this is for serial index reads, but my reading of the code is that it will be used for random index reads too.

current.setown(f->open(IFOread));
if (blockedIndexIOSize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There doesn't appear to be any code to use the plane settings in Roxie?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's true, mainly because I thought use case tends to be different in roxie, where unlikely to have some indexes on a fast/non-blob plane and others on a slow/blob plane.. - and therefore a per plane configuration setting would not be required as much.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now added

}

static constexpr size32_t defaultBlockedIndexIOKB = 1024;
size32_t getBlockedIndexIOSize(const char *planeName, int configUseBlockedIndexIO, size32_t configBlockedIndeIOSize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling: configBlockedIndeIOSize should be configBlockedIndexIOSize

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will correct

blockedSize = configBlockedIndeIOSize;
if (0 == blockedSize)
{
// could cache, but not sure it's worth it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like it might be worth it to cache, if Roxie ever called this - it's opening a lot of index files.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

caching added.

@@ -2046,9 +2047,15 @@ class CLazyFileIO : public CInterfaceOf<IFileIO>
return iFileIO.getClear();
}
public:
CLazyFileIO(CFileCache &_cache, const char *_filename, const char *_id, IActivityReplicatedFile *_repFile, bool _compressed, IExpander *_expander, const StatisticsMapping & _statMapping=diskLocalStatistics)
: cache(_cache), filename(_filename), id(_id), repFile(_repFile), compressed(_compressed), expander(_expander), fileStats(_statMapping)
CLazyFileIO(CFileCache &_cache, const char *_filename, const char *_id, IActivityReplicatedFile *_repFile, bool _compressed, IExpander *_expander, const StatisticsMapping & _statMapping, size32_t _blockedFileIOSize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it will potentially be used for non-index files too. If so, some names may be inappropriate. But not sure you really want to use for non-index files except possibly in FETCH activities, as you are probably already using large buffers anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hadn't considered fetch case, but only use case (in hthor and thor) is where it is for index read, where it is expected to potentially be read serially.
name here (in CLazyFileIO) ctor and factory are "blockedFileIOSize", so not specifically for indexes, so I think naming is ok(?), albeit it may never be used for anything but indexes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only used/passed through for [unfiltered] index reads at the moment.

@@ -7643,3 +7643,110 @@ IAPICopyClient * createApiCopyClient(IStorageApiInfo * source, IStorageApiInfo *
}
return nullptr;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not review this bit in detail as it looks like it's copied from my prior PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, your original commit is the 1st of the 2 commits in this PR, I made minor changes only to CBlockedFileIO

Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this is quite right. There are two issues
i) The size that is read when reading something from an index
ii) The size of reads when reading unfiltered indexes

It is not clear what effect (i) will have on random reading. For branches it would be good, - but they tend to get loaded quickly and stay in memory. We would need some significant tests before were were convinced it was worthwhile. It may also have a significant memory impact if multiple parts are open at once.

Really (ii) is a special case of "what is the most efficient size to read from this storage". Possibly this should be generalised so there was a blockedReadSize on a plane which was used for unfiltered indexes and files.

I'll finish reviewing code and add any specific comments.

return io->read(pos, len, data);
if (!blockedFileIOBuffer)
{
blockedFileIOBuffer = malloc(blockSize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has problems if there are multiple CBlockedFileIO objects with different blockSizes. Currently it would potentially corrupt memory. If a thread local size was used instead you could prevent the corruption, but would the block size would be determined by whichever file was first read on that thread.
I think the idea may be fundamentally flawed for multithreaded access.

@jakesmith jakesmith changed the title HPCC-28555 Add blocked IFileIO reader for serial index reading HPCC-28555 Add blocked IFileIO reader for index reading Jul 6, 2023
@jakesmith jakesmith force-pushed the HPCC-28555-indexread-lookahead branch from 9d59f2b to 2f26e7d Compare July 6, 2023 12:06
@jakesmith jakesmith changed the title HPCC-28555 Add blocked IFileIO reader for index reading HPCC-28555 Add blocked reader for unfiltered serial index reading Jul 6, 2023
@jakesmith jakesmith requested a review from ghalliday July 6, 2023 13:03
@jakesmith jakesmith changed the base branch from candidate-9.2.x to candidate-9.0.x July 6, 2023 13:53
@jakesmith jakesmith force-pushed the HPCC-28555-indexread-lookahead branch 2 times, most recently from 9a00180 to 11d18c7 Compare July 6, 2023 14:53
Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakesmith generally looks good. A few minor comments.

@@ -8878,7 +8878,7 @@ void executeConfigUpdaterCallbacks()
{
if (!configFileUpdater) // NB: executeConfigUpdaterCallbacks should always be called after configFileUpdater is initialized
return;
configFileUpdater->executeCallbacks(componentConfiguration.getLink(), globalConfiguration.getLink());
configFileUpdater->executeCallbacks(componentConfiguration, globalConfiguration);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a separate fix for a leak. Should it be in a separate PR? (It depends if it also affects 8.12.x)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right, it is a separate leak fix, and it does effect 8.12.x.
I'll release a separate JIRA/PR: https://track.hpccsystems.com/browse/HPCC-29915

if (blockedFileIOSize) // enabled
{
if (compressed || expander)
throw makeStringExceptionV(0, "CLazyFileIO(%s): blockedFileIO cannot be used with compressed files", filename.get());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be ignored rather than throwing an error - in case we start using that default for all sequential file access.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakesmith hasn't been changed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, I made the change somewhere, I'll redo.

@@ -7579,3 +7589,103 @@ IAPICopyClient * createApiCopyClient(IStorageApiInfo * source, IStorageApiInfo *
}
return nullptr;
}


class CBlockedFileIO : public CSimpleInterfaceOf<IFileIO>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have a comment that this is not thread safe, so can only be used from a single thread.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will add comment

}
virtual size32_t read(offset_t pos, size32_t len, void *data) override
{
if (len > blockSize || pos+len > fileSize || len==0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need the fileSize checking? It means that you may perform an extra stat call which will add latency if it is on blob storage. It should be possible to change the code so it copes with reading at the end of the file.
Also len==0 test is not needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, refactored to avoid need for fileSize

@jakesmith jakesmith force-pushed the HPCC-28555-indexread-lookahead branch from 0e866b8 to c652d32 Compare July 10, 2023 10:46
@jakesmith jakesmith requested a review from ghalliday July 10, 2023 10:47
@jakesmith
Copy link
Member Author

@ghalliday - please review new commit changes.

Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakesmith one remaining change. Please squash after fixing so it is ready for merge.

if (blockedFileIOSize) // enabled
{
if (compressed || expander)
throw makeStringExceptionV(0, "CLazyFileIO(%s): blockedFileIO cannot be used with compressed files", filename.get());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakesmith hasn't been changed.

Allow indexes to be read with a large granularity.
This is particularly important if reading from storage that
charges per read access.

Signed-off-by: Jake Smith <[email protected]>
@jakesmith jakesmith force-pushed the HPCC-28555-indexread-lookahead branch from c652d32 to 87d12a6 Compare July 11, 2023 10:43
@jakesmith
Copy link
Member Author

@jakesmith one remaining change. Please squash after fixing so it is ready for merge.

addressed and squashed.
Please review/merge.

@ghalliday
Copy link
Member

@jakesmith better not to also rebase at the same time as squashing - it makes it hard to spot the changes.

@ghalliday ghalliday merged commit 4a0099b into hpcc-systems:candidate-9.0.x Jul 11, 2023
50 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants