-
Notifications
You must be signed in to change notification settings - Fork 13
DataLakeCatalog namespace filter #1337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: antalya-25.8
Are you sure you want to change the base?
Conversation
|
@codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8b92a48e5b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| DB::Names RestCatalog::getTables(const std::string & base_namespace, size_t limit) const | ||
| { | ||
| if (!allowed_namespaces.isNamespaceAllowed(base_namespace, /*nested*/ false)) | ||
| throw DB::Exception(DB::ErrorCodes::OUT_OF_SCOPE, | ||
| "Namespace {} is filtered by `namespaces` database parameter", base_namespace); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Skip filtered namespaces when probing emptiness
This new OUT_OF_SCOPE throw means RestCatalog::getTables() now fails for any namespace excluded by the namespaces filter. RestCatalog::empty() still iterates all namespaces returned by the catalog and calls getTables() in its stop condition without checking the filter, so any filtered namespace will now throw and bubble up. That makes DatabaseDataLake::empty() (used by DatabaseCatalog::detachDatabase when dropping/detaching DBs) fail whenever the catalog contains namespaces outside the filter (e.g., namespaces='dev' but prod exists). Consider skipping filtered namespaces in empty() or swallowing OUT_OF_SCOPE there.
Useful? React with 👍 / 👎.
|
@codex review |
|
Codex Review: Didn't find any major issues. 🎉 ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
| DataLakeStorageSettingsPtr settings_, | ||
| std::optional<std::string> catalog_namespaces_ = std::nullopt) | ||
| : settings(settings_) | ||
| , catalog_namespaces(catalog_namespaces_ ? *catalog_namespaces_ : std::string("*")) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider std::optional::value_or()
| { | ||
| return allowed_namespaces.contains("*") || allowed_namespaces.contains(namespace_); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal, but may be make this method virtual and move this implementation to ICatalog to avoid duplication?
Feel free to ignore if you've considered this option and decided that it is better to keep it simple.
| glue_client = std::make_unique<Aws::Glue::GlueClient>(chain, endpoint_provider, client_configuration); | ||
| } | ||
|
|
||
| boost::split(allowed_namespaces, settings.namespaces, [](char c){ return c == ','; }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider boost::is_any_of instead of lambda.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, am I right that 'aaa, bbb' would not work because of space after comma?
Is it ok?
To fix it one can
...split ... is_any_of(", "), boost::token_compress_on)
| M(754, UDF_EXECUTION_FAILED) \ | ||
| M(755, TOO_LARGE_LIGHTWEIGHT_UPDATES) \ | ||
| M(756, CANNOT_PARSE_PROMQL_QUERY) \ | ||
| M(757, OUT_OF_SCOPE) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, though may be it is better to either use existing error code (e.g. DATALAKE_DATABASE_ERROR) or make up a more specific name, e.g. CATALOG_NAMESPACE_DISABLED ?
| | `aws_access_key_id` | AWS access key ID for S3/Glue access (if not using vended credentials) | | ||
| | `aws_secret_access_key` | AWS secret access key for S3/Glue access (if not using vended credentials) | | ||
| | `region` | AWS region for the service (e.g., `us-east-1`) | | ||
| | `namespaces` | Comma-separated list of namespaces, supported types: `rest`, `glue` and `unity` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially I thought that rest, glue and unity are supported namespaces. Probably target audience of this feature would not have this problem, but may be '... implemented for catalog types: ..'
| boost::split(list_of_nested_namespaces, ns, [](char c){ return c == '.'; }); | ||
|
|
||
| size_t len = list_of_nested_namespaces.size(); | ||
| if (!len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure that I understand what it actually means.
That 'ns' is an empty string? Is it possible?
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
DataLakeCatalog namespace filter
Documentation entry for user-facing changes
New setting
namespacesfor DataLakeCatalog with comma-separated list of namespaces.Supports
rest,glueandunitytypes.namepsaces='foo,bar'resttype can have nested namespaces, supports next rules:foo- tables from namespacefoo, but not from nested namespacesfoo.bar- tables from nested namespace, but not from base namespacefoo.*- tables from al; nested namespaces, but not from base namespaceWhen tables from both namespaces (base and nested) are required, need to use both:
foo,foo.*CI/CD Options
Exclude tests:
Regression jobs to run: