[FLINK-30481][FLIP-277] GlueCatalog Implementation #47

Open
wants to merge 1 commit into
base: main
from

Conversation

Samrat002
Contributor

@Samrat002 Samrat002 commented Jan 12, 2023

Purpose of the change

Add Glue catalog feature

Verifying this change

Tested in EMR cluster

  • Unit test cases

Significant changes

(Please check any boxes [x] if the answer is "yes" )

  • Dependencies have been added or upgraded
  • Public API has been changed (Public API is any class annotated with @Public(Evolving))
  • Serializers have been changed
  • New feature has been introduced
    • If yes, how is this documented? (not applicable / docs / JavaDocs / not documented)

@Samrat002 Samrat002 marked this pull request as draft January 12, 2023 10:54
@Samrat002 Samrat002 force-pushed the FLINK-30481 branch 8 times, most recently from 665271b to 4ad9f96 Compare January 21, 2023 06:50
@Samrat002 Samrat002 force-pushed the FLINK-30481 branch 2 times, most recently from 43308aa to ccf35dc Compare January 24, 2023 12:57
@Samrat002 Samrat002 marked this pull request as ready for review January 31, 2023 10:58
@Samrat002 Samrat002 changed the title [FLINK-30481][DRAFT][FLIP-277] GlueCatalog Implementation [FLINK-30481][FLIP-277] GlueCatalog Implementation Jan 31, 2023
@Samrat002 Samrat002 force-pushed the FLINK-30481 branch 3 times, most recently from 2a82242 to 9117402 Compare February 13, 2023 05:54
@Samrat002
Contributor Author

@dannycranmer please review the changes when you have time.


@luoyuxia luoyuxia left a comment


@Samrat002 Thanks for bringing this feature. I'm not familiar with GlueCatalog, but I left some comments from the Flink and Flink Catalog side.

@Samrat002 Samrat002 force-pushed the FLINK-30481 branch 2 times, most recently from 0e2dc19 to 10dfc99 Compare March 16, 2023 12:21
@Samrat002 Samrat002 requested review from luoyuxia and removed request for guptashailesh92 March 16, 2023 13:22
@Samrat002
Contributor Author

@dannycranmer @vahmed-hamdy please review whenever you have time

@Samrat002
Contributor Author

@dannycranmer @vahmed-hamdy please review whenever you have time 🙏🏻

@Samrat002
Contributor Author

[gentle ping]
@dannycranmer, I have addressed all the review comments. Please review the PR whenever you have time.

@Samrat002
Contributor Author

@PrabhuJoseph please review the PR whenever you have time

@wckdman

wckdman commented Mar 12, 2024

Any news here? :)
I'm really looking forward to seeing this feature 🙏

@vikramsinghchandel

Any updates here? This feature is already available in Flink on EMR; it would be great to have it here too.

@Samrat002
Contributor Author

@vikramsinghchandel @wckdman, thank you for your interest.
I will find a reviewer and get this feature finished.

@Samrat002
Contributor Author

@dannycranmer @fapaul Please review whenever you have time


@foxus foxus left a comment


Many thanks for your efforts with this one @Samrat002 - I've made a few suggestions.

@Samrat002
Contributor Author

Thank you @foxus for the review.
This PR was opened long ago. Let me try to address all the comments and test it again e2e.
The PR will be ready for review within the next 2 days.

@foxus

foxus commented Jul 29, 2024

> Thank you @foxus for the review.
>
> This PR was opened long ago. Let me try to address all the comments and test it again e2e.
>
> The PR will be ready for review within the next 2 days.

  • Many thanks, let me know when it's ready and I'll have another look. I noticed there were a couple of previous unaddressed comments from other reviewers - it will be great to get this one merged in.

@foxus

foxus commented Aug 7, 2024

> Thank you @foxus for the review.
>
> This PR was opened long ago. Let me try to address all the comments and test it again e2e.
>
> The PR will be ready for review within the next 2 days.

Hey, let me know if I can help progress this? I'm happy to assist if you're willing and able to give me write access to the branch.

@Samrat002
Contributor Author

@foxus
Apologies for the delay. I have addressed most of the comments and am wrapping up validation.
I will ping you once the PR is ready for review.

@Samrat002
Contributor Author

> let me know if I can help progress this? I'm happy to assist if you're willing and able to give me write access to the branch.

Yes, send the invite for collaboration.

@Samrat002
Contributor Author

@foxus Please review whenever you have time

@foxus

foxus commented Aug 14, 2024

> @foxus Please review whenever you have time

Thanks for the effort in responding to the comments. There's only one that has been marked as resolved but still appears not to have been addressed.

@Samrat002
Contributor Author

Samrat002 commented Aug 15, 2024

Fixed the missed comment.
It may have slipped through unintentionally.

@foxus please review


@foxus foxus left a comment


Apologies @Samrat002, I've spotted two more blocking bugs that I should have spotted before.

return DataTypes.TIMESTAMP(5);
case "array":
// Example: array<string> -> DataTypes.ARRAY(DataTypes.STRING())
String elementType = glueType.substring(6, glueType.length() - 1);

Forgive me for not spotting this earlier - this will always cause an out of bounds exception. The only way this block of code runs is if glueType is equal to the string literal array but in this block you're attempting to parse the part of the string which cannot exist (array<string>). It looks like you're expecting case here to behave as a startsWith. Please address this and consider adding unit tests for this method.
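To illustrate the point above: a Java `switch` on a string matches whole values only, so `case "array":` is reached only when `glueType` is exactly `array`, and `"array".substring(6, 4)` then throws `StringIndexOutOfBoundsException`. A minimal sketch of a prefix-based check follows. The class and method names are hypothetical and are not taken from the PR's `TypeMapper`; it returns the element type as a plain string rather than a Flink `DataType`, so the recursive mapping is left out.

```java
import java.util.Locale;

/** Hypothetical sketch, not the PR's TypeMapper: prefix-based detection of Glue array types. */
final class GlueArrayTypeSketch {

    private static final String ARRAY_PREFIX = "array<";

    private GlueArrayTypeSketch() {}

    /** Returns true for parameterized array types such as "array<string>", false for a bare "array". */
    static boolean isArrayType(String glueType) {
        String normalized = glueType.trim().toLowerCase(Locale.ROOT);
        return normalized.startsWith(ARRAY_PREFIX) && normalized.endsWith(">");
    }

    /** Extracts the element type, e.g. "array<string>" -> "string". */
    static String arrayElementType(String glueType) {
        if (!isArrayType(glueType)) {
            // Fail with a descriptive message instead of an index error from substring.
            throw new IllegalArgumentException("Not a Glue array type: " + glueType);
        }
        String trimmed = glueType.trim();
        // Drop the "array<" prefix and the closing ">"; nested types stay intact.
        return trimmed.substring(ARRAY_PREFIX.length(), trimmed.length() - 1).trim();
    }
}
```

A caller would check `isArrayType(glueType)` before entering the `switch` on primitive type names, then map the result of `arrayElementType` recursively.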

// Example: map<string, string> -> DataTypes.MAP(DataTypes.STRING(),
// DataTypes.STRING())
int commaIndex = glueType.indexOf(",");
String keyType = glueType.substring(4, commaIndex);

Forgive me for not spotting this earlier - this will always cause an out of bounds exception. The only way this block of code runs is if glueType is equal to the string literal map but in this block you're attempting to parse the part of the string which cannot exist (map<string, string>). It looks like you're expecting case here to behave as a startsWith. Please address this and consider adding unit tests for this method. I would consider using regex here to parse the key and value types - it will allow you to be more permissive about whitespace (e.g. map<string, string> and map<string,string>) and also allow you to play with named groups (a personal favourite).
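As a rough sketch of the regex suggestion above, the following uses named groups to pull out the key and value types while tolerating whitespace around the comma. The class name, method name, and pattern are assumptions for illustration, not code from the PR. The pattern also assumes the key is a simple (non-nested) type, so it contains no comma or angle bracket; the value may be nested.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch, not the PR's TypeMapper: regex-based parsing of Glue map types. */
final class GlueMapTypeSketch {

    // Assumption: the key type is simple (no ',', '<' or '>'); the value type may be nested.
    private static final Pattern MAP_TYPE =
            Pattern.compile(
                    "map<\\s*(?<key>[^,<>]+?)\\s*,\\s*(?<value>.+?)\\s*>",
                    Pattern.CASE_INSENSITIVE);

    private GlueMapTypeSketch() {}

    /** Parses "map<string, int>" or "map<string,int>" into a (key type, value type) pair. */
    static Map.Entry<String, String> parseMapType(String glueType) {
        Matcher matcher = MAP_TYPE.matcher(glueType.trim());
        // matches() requires the whole input to fit the pattern, so a bare "map" is rejected.
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Not a Glue map type: " + glueType);
        }
        return new AbstractMap.SimpleImmutableEntry<>(
                matcher.group("key"), matcher.group("value"));
    }
}
```

Both extracted strings would then be mapped recursively to Flink types before building the map type.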

@foxus

foxus commented Aug 18, 2024

@Samrat002, thanks for agreeing to collaborate. I've resolved some of the comments but left the changes as a separate commit, so you can review and squash them if you're happy while you address the TypeMapper issue.

@Samrat002
Contributor Author

@foxus thank you for adding the change. 👍🏻
Please review again; I have added tests and tried to address your suggested change.

Contributor

@vahmed-hamdy vahmed-hamdy left a comment


I have partially reviewed the PR and will try to resume this week. The PR has been hanging for a very long time and is huge. Do you think it might be easier to break it down into a couple of PRs, so that it is easier to review and, more importantly, so that each part can be tested thoroughly?
I suggest starting by laying out the code structure with the minimum possible set of features, exposing a minimal configuration and defaulting the rest.
We can add features like functions, partitions, and http-client configuration in a follow-up PR.

</parent>

<artifactId>flink-catalog-aws-glue</artifactId>
<name>Flink : Catalog : AWS : Glue</name>

Suggested change
<name>Flink : Catalog : AWS : Glue</name>
<name>Flink : Catalog : AWS : Glue Data Catalog</name>

* <p>For more details, see <a
* href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-databases.html">...</a>
*/
public static final String GLUE_CATALOG_ID = "aws.glue.id";

Could we use aws.glue.catalog-id, since we also have account-id?

}

public static Set<ConfigOption<?>> getRequiredConfigOptions() {
return new HashSet<>();

Why not have GLUE_CATALOG_ID and REGION as required options?

public static final ConfigOption<String> REGION =
ConfigOptions.key(AWSConfigConstants.AWS_REGION)
.stringType()
.defaultValue(Region.US_WEST_1.toString());

Why us-west-1? I would prefer no default, making it required as in the connectors.

import static org.apache.flink.table.catalog.glue.GlueCatalog.DEFAULT_DB;

/** Collection of {@link ConfigOption} used in GlueCatalog. */
@Internal

This is not Internal; I suggest going with PublicEvolving.

class GlueCatalogOptionsUtilsTest {

@Test
void testGetValidatedConfigurations() {}

Please remove if not needed.

* href="https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/http/apache/ApacheHttpClient.Builder.html">...</a>
*/
public static final String HTTP_CLIENT_CONNECTION_TIMEOUT_MS =
"http-client.connection-timeout-ms";

bump

* <p>For more details, see <a
* href="https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/http/apache/ApacheHttpClient.Builder.html">...</a>
*/
public static final String HTTP_CLIENT_SOCKET_TIMEOUT_MS = "http-client.socket-timeout-ms";

same as above

Comment on lines +40 to +43
private static final String CLIENT_HTTP_PROTOCOL_VERSION_OPTION = "protocol.version";
private static final String CLIENT_HTTP_MAX_CONNECTION_TIMEOUT_MS = "connection-timeout-ms";
private static final String CLIENT_HTTP_MAX_SOCKET_TIMEOUT_MS = "socket-timeout-ms";
private static final String APACHE_MAX_CONNECTIONS = "apache.max-connections";

bump

@@ -0,0 +1,7 @@
flink-catalog-aws-glue
Copyright 2014-2023 The Apache Software Foundation

The end year should be 2024.

* @see org.apache.flink.table.catalog.CatalogTable
* @see org.apache.flink.table.catalog.ResolvedCatalogTable
*/
public class TypeMapper {

annotation missing

public class TypeMapperTest {

@Test
public void testMapFlinkTypeToGlueType_Primitives() {

nit: JUnit 5 conventions don't call for a test prefix in test names
https://junit.org/junit5/docs/current/user-guide/

}

@Test
public void testMapFlinkTypeToGlueType_Array() {

nit: We can use parameterized tests for all of these tests

@Test
public void testGlueTypeToFlinkType_Array() {
LogicalType arrayType = new ArrayType(new VarCharType(255));
assertEquals("array<string>", TypeMapper.mapFlinkTypeToGlueType(arrayType));

This is not Glue to Flink, as the test name suggests.

@Test
public void testGlueTypeToFlinkType_Map() {
LogicalType mapType = new MapType(new VarCharType(255), new IntType());
assertEquals("map<string,int>", TypeMapper.mapFlinkTypeToGlueType(mapType));

This is not Glue to Flink, as the test name suggests.

</dependency>

</dependencies>


IIUC, this should be distributed as an uberjar, right?
Do we need to relocate the sdk dependencies?

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

This project bundles the following dependencies under the Apache Software License 2.0. (http://www.apache.org/licenses/LICENSE-2.0.txt)

Is this file needed? Unless we bundle and relocate any dependencies I don't think we need it.

* @param properties Map of properties.
* @return fully qualified owner name.
*/
public static String extractTableOwner(Map<String, String> properties) {

nit: Ideally we return an Optional and add it to the builder only if present.
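A minimal sketch of what the Optional-returning variant could look like. The property key `glue.table.owner` and the class name are hypothetical, since the excerpt does not show the key the PR actually uses. Unlike the method under review, this version only reads from the map and does not remove the entry.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch, not the PR's GlueUtils: owner lookup that returns an Optional. */
final class TableOwnerSketch {

    // Assumption: illustrative property key; the PR's real key is not shown in this excerpt.
    static final String TABLE_OWNER_KEY = "glue.table.owner";

    private TableOwnerSketch() {}

    /** Returns the table owner if the property is present and non-blank, otherwise empty. */
    static Optional<String> extractTableOwner(Map<String, String> properties) {
        return Optional.ofNullable(properties.get(TABLE_OWNER_KEY))
                .map(String::trim)
                .filter(owner -> !owner.isEmpty());
    }
}
```

The call site could then set the owner conditionally, for example `extractTableOwner(properties).ifPresent(builder::owner)`, where `builder` stands for the table input builder seen elsewhere in the diff.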

* @param properties Map of properties.
* @return fully qualified owner name.
*/
public static String extractTableOwner(Map<String, String> properties) {

  • Why do we remove it?
  • The removal is not tested.

.tableType(table.getTableKind().name())
.lastAccessTime(Instant.now())
.owner(tableOwner)
.viewExpandedText(GlueUtils.getExpandedQuery(table))

Since this is not a mandatory parameter in the builder and we don't really support these yet, why do we set them?


8 participants