⚡️ Speed up function `validate_gantt` by 58x #5386

misrasaurabh1 · 2025-10-30T06:23:32Z

📄 5,759% (57.59x) speedup for `validate_gantt` in `plotly/figure_factory/_gantt.py`

⏱️ Runtime : 154 milliseconds → 2.63 milliseconds (best of 246 runs)
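The percentages in this report appear to follow the convention `(old - new) / new` for speedups and `(new - old) / new` for slowdowns. That convention is inferred from the figures shown here, not from codeflash documentation, and the helper name `percent_change` is hypothetical. A small sketch that reproduces one of the per-test figures under that assumption:

```python
def percent_change(old_seconds: float, new_seconds: float) -> tuple[float, str]:
    """Return (percentage, "faster" | "slower"), relative to the new runtime.

    Assumed convention, inferred from the numbers in this report.
    """
    if new_seconds <= 0:
        raise ValueError("new runtime must be positive")
    delta = (old_seconds - new_seconds) / new_seconds * 100.0
    return abs(delta), ("faster" if delta >= 0 else "slower")


# Large-DataFrame test from the report: 35.9 ms -> 429 μs
pct, direction = percent_change(35.9e-3, 429e-6)
print(f"{pct:.0f}% {direction}")  # prints "8268% faster"
```

Applied to the rounded headline runtimes (154 ms and 2.63 ms), the same formula gives roughly 5,756%, which matches the reported 5,759% to within the rounding of the displayed values.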

📝 Explanation and details

The optimization achieves a 58x speedup by eliminating the major performance bottleneck in pandas DataFrame processing.

Key optimizations:

1. **Pre-fetch column data as numpy arrays**: The original code used `df.iloc[index][key]` for each cell access, which triggers pandas' slow row-based indexing mechanism. The optimized version extracts all column data upfront using `df[key].values`, stores it in a dictionary, and then uses direct numpy array indexing `columns[key][index]` inside the loop.
2. **More efficient key validation**: The loop checking for missing keys is replaced with a single list comprehension, `missing_keys = [key for key in REQUIRED_GANTT_KEYS if key not in df]`.
3. **Use actual DataFrame columns**: Instead of iterating over the DataFrame object itself (which also yields the column labels, only less explicitly), the code now uses `list(df.columns)` to get the column names.
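The change described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function names `rows_via_iloc` and `rows_via_column_arrays` are hypothetical, `ValueError` stands in for plotly's own exception type, and only the access patterns (`df.iloc[index][key]` versus `df[key].values` with `columns[key][index]`) are taken from the description.

```python
import pandas as pd

REQUIRED_GANTT_KEYS = ["Task", "Start", "Finish"]


def rows_via_iloc(df):
    """Original pattern: one df.iloc[index][key] lookup per cell."""
    chart = []
    for index in range(len(df.index)):
        chart.append({key: df.iloc[index][key] for key in df})
    return chart


def rows_via_column_arrays(df):
    """Optimized pattern: extract each column once, then index the arrays."""
    missing_keys = [key for key in REQUIRED_GANTT_KEYS if key not in df]
    if missing_keys:
        # Stand-in exception; the real function raises a plotly error.
        raise ValueError(f"missing required columns: {missing_keys}")
    column_names = list(df.columns)
    columns = {key: df[key].values for key in column_names}
    return [
        {key: columns[key][index] for key in column_names}
        for index in range(len(df.index))
    ]


df = pd.DataFrame(
    [
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02"},
        {"Task": "B", "Start": "2020-01-03", "Finish": "2020-01-04"},
    ]
)
# Both patterns should produce the same list of row dictionaries.
assert rows_via_iloc(df) == rows_via_column_arrays(df)
```

The second function does a fixed amount of pandas work per column, while the first does pandas work per cell, which is why the gap grows with the number of rows.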

Why this is dramatically faster:

- `df.iloc[index][key]` creates a temporary pandas Series for the row on every cell access and goes through pandas' indexing logic each time.
- Direct numpy array indexing, `columns[key][index]`, is orders of magnitude faster.
- The line profiler shows the original `df.iloc` line consumed 96.8% of execution time (523 ms), while the optimized dictionary comprehension takes only 44.9% (4.2 ms).
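A rough way to see the difference between the two access patterns locally is a small `timeit` comparison. This is only a sketch: the 200-row DataFrame and the helper names are made up for illustration, absolute timings depend on the machine and pandas version, and it does not attempt to reproduce the line-profiler percentages quoted above.

```python
import timeit

import pandas as pd

df = pd.DataFrame(
    {
        "Task": [f"Task{i}" for i in range(200)],
        "Start": ["2020-01-01"] * 200,
        "Finish": ["2020-01-02"] * 200,
    }
)
n_rows = len(df.index)
keys = list(df.columns)


def per_cell_iloc():
    # Each df.iloc[index] builds a temporary row Series before [key] is applied.
    return [[df.iloc[index][key] for key in keys] for index in range(n_rows)]


def per_column_arrays():
    # Each column is extracted once; the loop then does plain array indexing.
    columns = {key: df[key].values for key in keys}
    return [[columns[key][index] for key in keys] for index in range(n_rows)]


# Take the best of three single runs for each pattern.
t_iloc = min(timeit.repeat(per_cell_iloc, number=1, repeat=3))
t_arrays = min(timeit.repeat(per_column_arrays, number=1, repeat=3))
print(f"iloc per cell: {t_iloc * 1e3:.2f} ms, column arrays: {t_arrays * 1e3:.2f} ms")
```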

Performance characteristics:

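- **Large DataFrames see massive gains**: 8000%+ speedup on 1000-row DataFrames
- **Small DataFrames**: roughly 10-50% faster in the generated tests (about 10% for one-row DataFrames, 40-50% for most others)
- **List inputs**: Slight slowdown (about 3-14% in most generated tests, 22.8% in one) due to additional validation overhead, but still microsecond-level performance
- **Empty DataFrames**: Slower (about 73-75% in the generated tests, roughly 25 μs → 95 μs) due to upfront column extraction, but still fast in absolute terms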

This optimization is most beneficial for DataFrame inputs with many rows, where the repeated iloc calls created a severe performance bottleneck.

✅ Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 39 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

import pytest
# function to test
from plotly import exceptions, optional_imports
from plotly.figure_factory._gantt import validate_gantt

pd = optional_imports.get_module("pandas")

REQUIRED_GANTT_KEYS = ["Task", "Start", "Finish"]

# --- BASIC TEST CASES ---

def test_valid_list_of_dicts():
    # Test a valid list of dictionaries with required keys
    input_data = [
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02"},
        {"Task": "B", "Start": "2020-01-03", "Finish": "2020-01-04"}
    ]
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 1.88μs -> 1.95μs (3.54% slower)

def test_valid_dataframe():
    # Test a valid pandas DataFrame with required keys
    import pandas as pd
    df = pd.DataFrame([
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02"},
        {"Task": "B", "Start": "2020-01-03", "Finish": "2020-01-04"}
    ])
    codeflash_output = validate_gantt(df); result = codeflash_output # 142μs -> 99.9μs (42.6% faster)

def test_valid_list_with_extra_keys():
    # Test list of dicts with extra keys
    input_data = [
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02", "Resource": "X"},
        {"Task": "B", "Start": "2020-01-03", "Finish": "2020-01-04", "Resource": "Y"}
    ]
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 1.57μs -> 1.70μs (7.77% slower)

def test_valid_dataframe_with_extra_keys():
    # Test DataFrame with extra columns
    import pandas as pd
    df = pd.DataFrame([
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02", "Resource": "X"},
        {"Task": "B", "Start": "2020-01-03", "Finish": "2020-01-04", "Resource": "Y"}
    ])
    codeflash_output = validate_gantt(df); result = codeflash_output # 160μs -> 109μs (46.6% faster)

# --- EDGE TEST CASES ---

def test_missing_required_key_in_list():
    # Test list of dicts missing a required key
    input_data = [
        {"Task": "A", "Start": "2020-01-01"},  # Missing "Finish"
    ]
    # Should NOT raise: list input is not validated for keys
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 1.54μs -> 1.67μs (7.83% slower)

def test_missing_required_key_in_dataframe():
    # Test DataFrame missing a required key
    import pandas as pd
    df = pd.DataFrame([
        {"Task": "A", "Start": "2020-01-01"}  # Missing "Finish"
    ])
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(df) # 27.2μs -> 27.1μs (0.402% faster)

def test_empty_list():
    # Test empty list input
    input_data = []
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(input_data) # 2.39μs -> 2.40μs (0.292% slower)


def test_input_is_not_list_or_dataframe():
    # Test input that is neither a list nor a pandas DataFrame
    input_data = "Not a list or DataFrame"
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(input_data) # 2.58μs -> 2.64μs (2.27% slower)

def test_dataframe_with_no_rows():
    # Test DataFrame with correct columns but no rows
    import pandas as pd
    df = pd.DataFrame(columns=["Task", "Start", "Finish"])
    codeflash_output = validate_gantt(df); result = codeflash_output # 27.0μs -> 99.0μs (72.8% slower)

def test_dataframe_with_extra_rows_and_missing_keys():
    # Test DataFrame with extra columns, but missing one required key
    import pandas as pd
    df = pd.DataFrame([
        {"Task": "A", "Start": "2020-01-01", "Resource": "X"},
        {"Task": "B", "Start": "2020-01-03", "Resource": "Y"}
    ])
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(df) # 26.3μs -> 26.8μs (2.13% slower)

def test_list_with_dict_missing_all_keys():
    # Test list of dicts missing all required keys
    input_data = [
        {"Resource": "X"}
    ]
    # Should NOT raise: list input is not validated for keys
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 1.61μs -> 1.87μs (13.9% slower)

def test_dataframe_with_only_required_keys():
    # Test DataFrame with only required keys
    import pandas as pd
    df = pd.DataFrame([
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02"}
    ])
    codeflash_output = validate_gantt(df); result = codeflash_output # 108μs -> 98.6μs (9.92% faster)

# --- LARGE SCALE TEST CASES ---

def test_large_list_of_dicts():
    # Test a large list of dicts (1000 elements)
    input_data = [
        {"Task": f"Task{i}", "Start": f"2020-01-{i%30+1:02d}", "Finish": f"2020-02-{i%28+1:02d}"}
        for i in range(1000)
    ]
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 2.30μs -> 2.47μs (6.69% slower)

def test_large_dataframe():
    # Test a large DataFrame (1000 rows)
    import pandas as pd
    df = pd.DataFrame([
        {"Task": f"Task{i}", "Start": f"2020-01-{i%30+1:02d}", "Finish": f"2020-02-{i%28+1:02d}"}
        for i in range(1000)
    ])
    codeflash_output = validate_gantt(df); result = codeflash_output # 35.9ms -> 429μs (8268% faster)

def test_large_dataframe_missing_key():
    # Test a large DataFrame missing one required key
    import pandas as pd
    df = pd.DataFrame([
        {"Task": f"Task{i}", "Start": f"2020-01-{i%30+1:02d}"}  # Missing "Finish"
        for i in range(1000)
    ])
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(df) # 31.1μs -> 30.0μs (3.66% faster)

def test_large_list_with_non_dict_first_element():
    # Test large list with first element not a dict
    input_data = ["Not a dict"] + [
        {"Task": f"Task{i}", "Start": f"2020-01-{i%30+1:02d}", "Finish": f"2020-02-{i%28+1:02d}"}
        for i in range(999)
    ]
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(input_data) # 2.91μs -> 2.96μs (1.69% slower)

def test_large_list_with_non_dict_later_element():
    # Test large list where a later element is not a dict (should NOT raise)
    input_data = [
        {"Task": f"Task{i}", "Start": f"2020-01-{i%30+1:02d}", "Finish": f"2020-02-{i%28+1:02d}"}
        for i in range(999)
    ] + ["Not a dict"]
    # Should NOT raise: only first element is checked
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 2.18μs -> 2.34μs (7.01% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
# imports
import pytest  # used for our unit tests
# function to test
from plotly import exceptions, optional_imports
from plotly.figure_factory._gantt import validate_gantt

# pd is None when pandas is not installed
pd = optional_imports.get_module("pandas")

REQUIRED_GANTT_KEYS = ["Task", "Start", "Finish"]

# unit tests


# --- Basic Test Cases ---

def test_valid_list_of_dicts():
    # Valid input: list of dictionaries
    input_data = [
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02"},
        {"Task": "B", "Start": "2023-01-02", "Finish": "2023-01-03"}
    ]
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 1.87μs -> 1.94μs (3.86% slower)

def test_valid_dataframe():
    # Valid input: DataFrame with required columns
    df = pd.DataFrame([
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02"},
        {"Task": "B", "Start": "2023-01-02", "Finish": "2023-01-03"}
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 150μs -> 106μs (42.1% faster)

# --- Edge Test Cases ---

def test_missing_required_keys_in_dataframe():
    # DataFrame missing "Finish" column
    df = pd.DataFrame([
        {"Task": "A", "Start": "2023-01-01"},
        {"Task": "B", "Start": "2023-01-02"}
    ])
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(df) # 26.0μs -> 25.9μs (0.424% faster)

def test_missing_required_keys_in_list_of_dicts():
    # List of dicts missing "Finish" key
    input_data = [
        {"Task": "A", "Start": "2023-01-01"},
        {"Task": "B", "Start": "2023-01-02"}
    ]
    # This should not raise, as the function does not check keys for lists
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 1.53μs -> 1.75μs (12.5% slower)

def test_empty_list():
    # Empty list should raise
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt([]) # 1.76μs -> 1.81μs (2.70% slower)


def test_non_list_non_dataframe_input():
    # Input is neither a list nor a DataFrame
    input_data = "not a list or dataframe"
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(input_data) # 1.68μs -> 1.64μs (2.56% faster)

def test_dataframe_with_extra_columns():
    # DataFrame with extra columns should still work
    df = pd.DataFrame([
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02", "Extra": 123}
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 173μs -> 117μs (48.3% faster)

def test_list_of_dicts_with_extra_keys():
    # List of dicts with extra keys should pass
    input_data = [
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02", "Extra": 123}
    ]
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 1.55μs -> 1.76μs (12.2% slower)

def test_dataframe_with_wrong_column_types():
    # DataFrame with columns named correctly but with wrong types in values
    df = pd.DataFrame([
        {"Task": None, "Start": 123, "Finish": []}
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 137μs -> 91.8μs (49.7% faster)

def test_list_with_first_dict_rest_non_dicts():
    # Only the first element is checked for being a dict
    input_data = [{"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02"}, 123, "string"]
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 1.58μs -> 1.75μs (9.48% slower)

def test_dataframe_with_no_rows():
    # DataFrame with correct columns but no rows
    df = pd.DataFrame(columns=REQUIRED_GANTT_KEYS)
    codeflash_output = validate_gantt(df); output = codeflash_output # 23.9μs -> 93.8μs (74.5% slower)

def test_list_of_dicts_with_empty_dict():
    # List with an empty dictionary as first element
    input_data = [{}]
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 1.43μs -> 1.85μs (22.8% slower)

# --- Large Scale Test Cases ---

def test_large_list_of_dicts():
    # Large list of dicts (1000 elements)
    input_data = [
        {"Task": f"Task{i}", "Start": f"2023-01-{i:02d}", "Finish": f"2023-01-{i+1:02d}"}
        for i in range(1, 1001)
    ]
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 2.02μs -> 2.25μs (10.2% slower)

def test_large_dataframe():
    # Large DataFrame (1000 rows)
    df = pd.DataFrame([
        {"Task": f"Task{i}", "Start": f"2023-01-{i:02d}", "Finish": f"2023-01-{i+1:02d}"}
        for i in range(1, 1001)
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 35.7ms -> 433μs (8135% faster)

def test_large_dataframe_with_extra_columns():
    # Large DataFrame with extra columns
    df = pd.DataFrame([
        {"Task": f"Task{i}", "Start": f"2023-01-{i:02d}", "Finish": f"2023-01-{i+1:02d}", "Extra": i}
        for i in range(1, 1001)
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 81.0ms -> 511μs (15734% faster)

def test_large_list_with_non_dict_first_element():
    # Large list, first element not a dict
    input_data = [0] + [{"Task": f"Task{i}", "Start": f"2023-01-{i:02d}", "Finish": f"2023-01-{i+1:02d}"} for i in range(1, 999)]
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(input_data) # 3.19μs -> 3.32μs (3.82% slower)

def test_large_empty_dataframe():
    # Empty DataFrame: correct columns but zero rows
    df = pd.DataFrame(columns=REQUIRED_GANTT_KEYS)
    codeflash_output = validate_gantt(df); output = codeflash_output # 25.0μs -> 96.7μs (74.2% slower)

# --- Determinism and Robustness ---

def test_determinism_multiple_calls():
    # Multiple calls with same input should give same output
    input_data = [
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02"},
        {"Task": "B", "Start": "2023-01-02", "Finish": "2023-01-03"}
    ]
    codeflash_output = validate_gantt(input_data); output1 = codeflash_output # 1.61μs -> 1.88μs (14.0% slower)
    codeflash_output = validate_gantt(input_data); output2 = codeflash_output # 523ns -> 586ns (10.8% slower)

def test_dataframe_column_order():
    # DataFrame with columns in different order
    df = pd.DataFrame([
        {"Finish": "2023-01-02", "Start": "2023-01-01", "Task": "A"}
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 107μs -> 96.7μs (10.7% faster)

def test_dataframe_with_index():
    # DataFrame with custom index
    df = pd.DataFrame([
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02"},
        {"Task": "B", "Start": "2023-01-02", "Finish": "2023-01-03"}
    ], index=["x", "y"])
    codeflash_output = validate_gantt(df); output = codeflash_output # 137μs -> 91.4μs (50.1% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
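
Several of the tests above rely on the list branch being lenient: only the container type, emptiness, and the type of the first element are checked, and required keys are never validated. A minimal stand-in that mirrors the behavior those test comments describe is sketched below. It is not plotly's implementation: `check_gantt_list` is a hypothetical name, `ValueError` stands in for `exceptions.PlotlyError`, and returning the input unchanged is an assumption.

```python
def check_gantt_list(tasks):
    """Stand-in for the list branch, as described by the test comments."""
    if not isinstance(tasks, list):
        raise ValueError("input must be a DataFrame or a list of dictionaries")
    if len(tasks) == 0:
        raise ValueError("list is empty; it must contain at least one dictionary")
    if not isinstance(tasks[0], dict):
        raise ValueError("list must only include dictionaries")
    # Keys and later elements are deliberately not inspected.
    # Returning the input as-is is an assumption made for this sketch.
    return tasks


# Passes: required keys are not validated for list input.
check_gantt_list([{"Task": "A", "Start": "2020-01-01"}])
# Passes: only the first element is checked for being a dict.
check_gantt_list([{"Task": "A"}, 123, "string"])
```

Because nothing iterates over the list, the cost stays constant regardless of length, which is consistent with the 1000-element list tests running in about 2 μs.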

To edit these changes, run `git checkout codeflash/optimize-validate_gantt-mhcxyu68` and push.

camdecoster · 2025-10-30T15:27:59Z

Thanks for the PR! Could you please add test coverage or demonstrate that test coverage is already provided? Some tests failed CI, but I think that's unrelated to your changes.

misrasaurabh1 · 2025-10-30T19:37:17Z

@camdecoster just added a test for it. fixing the formatting issue now

codeflash-ai bot and others added 3 commits October 30, 2025 04:46

Apply suggestion from @misrasaurabh1

6be6284

Apply suggestion from @misrasaurabh1

9e2a2f0

misrasaurabh1 changed the title ~~⚡️ Speed up function validate_gantt by 58x~~ ⚡️ Speed up function validate_gantt by 58x Oct 30, 2025

adding validate_gantt tests file

7ddb02b

mashraf-222 added 2 commits October 30, 2025 22:40

fix formatting

666dcc2

fixing formatting

ef98a70
