Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate Unqualified Field Name #14799

Open
LiaCastaneda opened this issue Feb 20, 2025 · 1 comment
Open

Duplicate Unqualified Field Name #14799

LiaCastaneda opened this issue Feb 20, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@LiaCastaneda
Copy link

LiaCastaneda commented Feb 20, 2025

Describe the bug

I have encountered an error while doing a query with a join:

failed to optimize plan: Internal error: Input field name count(Int64(1)) does not match with the projection expression count(Int64(1)):1

This occurs when I do a join and the resulting schema has more than 2 fields with a count(*) (or any aggregator), I understand that datafusion internally tries to rename the column names to be able to create a DFSchema, since arrow does not support duplicate names. While debugging I saw that what it does internally is:

  1. The name tracker sees count(Int64(1))
  2. The name traccker sees count(Int64(1))again and renames it to count(Int64(1)):1
  3. The name traccker sees count(Int64(1))again and renames it to count(Int64(1)):2 or count(Int64(1)):1 again?

I have another related error but this one arises when trying to create the logical plan that is:
failed to translate modified plan to DataFusion: Schema error: Schema contains duplicate unqualified field name "count(Int64(1)):1"

On my local clone, for the second error, I managed to make datafusion generate a logical plan by tweaking this function so in case it encounters count(Int64(1)):1 for the second time, its renamed to count(Int64(1)):2 and there are no duplicate names, but then when trying to build the physical plan I get the error I mentioned first.

To Reproduce

CREATE TABLE left_table (
    id INT PRIMARY KEY,
    category TEXT,
    timestamp TIMESTAMP
);

CREATE TABLE right_table (
    id INT PRIMARY KEY,
    category TEXT,
    timestamp TIMESTAMP
);

INSERT INTO left_table (id, category, timestamp) VALUES
(1, 'business_logic', '2024-02-15 10:00:00'),
(2, 'attack_attempt', '2024-02-15 10:05:00');

INSERT INTO right_table (id, category, timestamp) VALUES
(1, 'info', '2024-02-15 10:10:00'),
(2, 'low', '2024-02-15 10:15:00');

WITH first_agg AS (
    SELECT id, COUNT(*) AS count_first FROM left_table GROUP BY id
),
second_agg AS (
    SELECT id, COUNT(*) AS count_second FROM right_table GROUP BY id
),

third_agg AS (
    SELECT id, COUNT(*) AS count_third FROM right_table GROUP BY id
),
fourth_random_table AS (
    SELECT id, category FROM right_table GROUP BY id, category
)

select count_first, category, count_second, count_third from first_agg 
  LEFT JOIN fourth_random_table using (id) 
  LEFT JOIN  second_agg using (id)
  LEFT JOIN third_agg using  (id) `

Expected behavior

No response

Additional context

I'm using version 43.0

@LiaCastaneda LiaCastaneda added the bug Something isn't working label Feb 20, 2025
@kosiew
Copy link
Contributor

kosiew commented Feb 21, 2025

The error "Schema error: Ambiguous reference to unqualified field id" occurs because multiple tables in your query contain a column named id, and you're using USING (id), which requires id to be unambiguous in all participating tables.

In your sql, second_agg, third_agg, and fourth_random_table all originate from right_table, id becomes ambiguous when SQL tries to determine which one to use in further joins.

Here's how you can resolve the ambiquity:

WITH first_agg AS (
    SELECT id, COUNT(*) AS count_first FROM left_table GROUP BY id
),
second_agg AS (
    SELECT id, COUNT(*) AS count_second FROM right_table GROUP BY id
),
third_agg AS (
    SELECT id, COUNT(*) AS count_third FROM right_table GROUP BY id
),
fourth_random_table AS (
    SELECT id, category FROM right_table GROUP BY id, category
)
SELECT 
    fa.count_first, 
    frt.category, 
    sa.count_second, 
    ta.count_third 
FROM first_agg fa
LEFT JOIN fourth_random_table frt ON fa.id = frt.id
LEFT JOIN second_agg sa ON fa.id = sa.id
LEFT JOIN third_agg ta ON fa.id = ta.id;

Results

+-------------+----------+--------------+-------------+
| count_first | category | count_second | count_third |
+-------------+----------+--------------+-------------+
| 1           | info     | 1            | 1           |
| 1           | low      | 1            | 1           |
+-------------+----------+--------------+-------------+
2 row(s) fetched.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants