@misrasaurabh1

📄 5,759% (57.59x) speedup for validate_gantt in plotly/figure_factory/_gantt.py

⏱️ Runtime : 154 milliseconds → 2.63 milliseconds (best of 246 runs)

📝 Explanation and details

The optimization achieves a 58x speedup by eliminating the major performance bottleneck in pandas DataFrame processing.

Key optimizations:

  1. Pre-fetch column data as numpy arrays: The original code used df.iloc[index][key] for each cell access, which triggers pandas' slow row-based indexing mechanism. The optimized version extracts all column data upfront using df[key].values and stores it in a dictionary, then uses direct numpy array indexing columns[key][index] inside the loop.

  2. Use explicit DataFrame columns: The key list now comes from list(df.columns) instead of iterating over the DataFrame object. Iterating a DataFrame already yields its column labels (not metadata), so this makes the intent explicit rather than changing which keys are visited.
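
The two access patterns described above can be sketched as follows. This is a minimal illustration rather than the actual plotly source: the helper names `rows_via_iloc` and `rows_via_columns` are hypothetical, and only the `df.iloc[index][key]` and `df[key].values` expressions come from the description.

```python
import pandas as pd


def rows_via_iloc(df):
    """Original pattern: one df.iloc[index][key] lookup per cell."""
    rows = []
    for index in range(len(df)):
        # df.iloc[index] builds a temporary row Series on every access.
        rows.append({key: df.iloc[index][key] for key in df})
    return rows


def rows_via_columns(df):
    """Optimized pattern: extract each column once, then index the arrays."""
    keys = list(df.columns)
    # One extraction per column, done before the row loop.
    columns = {key: df[key].values for key in keys}
    return [{key: columns[key][index] for key in keys} for index in range(len(df))]


df = pd.DataFrame(
    [
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02"},
        {"Task": "B", "Start": "2020-01-03", "Finish": "2020-01-04"},
    ]
)

# Both helpers should produce the same list of row dictionaries.
assert rows_via_iloc(df) == rows_via_columns(df)
```

The second helper does a fixed amount of pandas work per column and leaves only plain array indexing inside the row loop, which is where the speedup on large inputs is expected to come from.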

Why this is dramatically faster:

  • df.iloc[index][key] creates temporary pandas Series objects and involves complex indexing logic for each cell
  • Direct numpy array indexing columns[key][index] is orders of magnitude faster
  • The line profiler shows the original df.iloc line consumed 96.8% of execution time (523ms), while in the optimized version the dictionary comprehension accounts for 44.9% of a much smaller total (4.2ms for that line)
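
A rough way to observe the per-cell cost difference locally is a small `timeit` comparison such as the one below. This is a sketch under assumptions: the 1000-row frame and the function names `read_cells_iloc` and `read_cells_array` are made up for illustration, and the absolute timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame(
    {
        "Task": [f"Task{i}" for i in range(1000)],
        "Start": ["2020-01-01"] * 1000,
        "Finish": ["2020-01-02"] * 1000,
    }
)


def read_cells_iloc():
    # Each df.iloc[i] materializes a row Series before the label lookup.
    return [df.iloc[i]["Task"] for i in range(len(df))]


def read_cells_array():
    # One column extraction, then plain positional indexing.
    tasks = df["Task"].values
    return [tasks[i] for i in range(len(df))]


slow = timeit.timeit(read_cells_iloc, number=3)
fast = timeit.timeit(read_cells_array, number=3)
print(f"iloc per cell: {slow:.4f}s, array indexing: {fast:.4f}s")
```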

Performance characteristics:

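  • Large DataFrames see massive gains: 8000%+ speedup on 1000-row DataFrames (35.9ms -> 429μs), and 15734% with one extra column (81.0ms -> 511μs)
  • Small DataFrames (1-2 rows): roughly 10-50% faster
  • List inputs: Slight slowdown (3-23% in the generated tests) due to additional validation overhead, but still microsecond-level performance (about 2μs per call)
  • Empty DataFrames: About 73-75% slower (for example 27.0μs -> 99.0μs) due to upfront column extraction, but still under 100μs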

This optimization is most beneficial for DataFrame inputs with many rows, where the repeated iloc calls created a severe performance bottleneck.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 39 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
import pytest

# function to test
from plotly import exceptions, optional_imports
from plotly.figure_factory._gantt import validate_gantt

pd = optional_imports.get_module("pandas")

REQUIRED_GANTT_KEYS = ["Task", "Start", "Finish"]

# --- BASIC TEST CASES ---

def test_valid_list_of_dicts():
    # Test a valid list of dictionaries with required keys
    input_data = [
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02"},
        {"Task": "B", "Start": "2020-01-03", "Finish": "2020-01-04"}
    ]
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 1.88μs -> 1.95μs (3.54% slower)

def test_valid_dataframe():
    # Test a valid pandas DataFrame with required keys
    import pandas as pd
    df = pd.DataFrame([
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02"},
        {"Task": "B", "Start": "2020-01-03", "Finish": "2020-01-04"}
    ])
    codeflash_output = validate_gantt(df); result = codeflash_output # 142μs -> 99.9μs (42.6% faster)

def test_valid_list_with_extra_keys():
    # Test list of dicts with extra keys
    input_data = [
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02", "Resource": "X"},
        {"Task": "B", "Start": "2020-01-03", "Finish": "2020-01-04", "Resource": "Y"}
    ]
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 1.57μs -> 1.70μs (7.77% slower)

def test_valid_dataframe_with_extra_keys():
    # Test DataFrame with extra columns
    import pandas as pd
    df = pd.DataFrame([
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02", "Resource": "X"},
        {"Task": "B", "Start": "2020-01-03", "Finish": "2020-01-04", "Resource": "Y"}
    ])
    codeflash_output = validate_gantt(df); result = codeflash_output # 160μs -> 109μs (46.6% faster)

# --- EDGE TEST CASES ---

def test_missing_required_key_in_list():
    # Test list of dicts missing a required key
    input_data = [
        {"Task": "A", "Start": "2020-01-01"},  # Missing "Finish"
    ]
    # Should NOT raise: list input is not validated for keys
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 1.54μs -> 1.67μs (7.83% slower)

def test_missing_required_key_in_dataframe():
    # Test DataFrame missing a required key
    import pandas as pd
    df = pd.DataFrame([
        {"Task": "A", "Start": "2020-01-01"}  # Missing "Finish"
    ])
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(df) # 27.2μs -> 27.1μs (0.402% faster)

def test_empty_list():
    # Test empty list input
    input_data = []
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(input_data) # 2.39μs -> 2.40μs (0.292% slower)


def test_input_is_not_list_or_dataframe():
    # Test input that is neither a list nor a pandas DataFrame
    input_data = "Not a list or DataFrame"
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(input_data) # 2.58μs -> 2.64μs (2.27% slower)

def test_dataframe_with_no_rows():
    # Test DataFrame with correct columns but no rows
    import pandas as pd
    df = pd.DataFrame(columns=["Task", "Start", "Finish"])
    codeflash_output = validate_gantt(df); result = codeflash_output # 27.0μs -> 99.0μs (72.8% slower)

def test_dataframe_with_extra_rows_and_missing_keys():
    # Test DataFrame with extra columns, but missing one required key
    import pandas as pd
    df = pd.DataFrame([
        {"Task": "A", "Start": "2020-01-01", "Resource": "X"},
        {"Task": "B", "Start": "2020-01-03", "Resource": "Y"}
    ])
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(df) # 26.3μs -> 26.8μs (2.13% slower)

def test_list_with_dict_missing_all_keys():
    # Test list of dicts missing all required keys
    input_data = [
        {"Resource": "X"}
    ]
    # Should NOT raise: list input is not validated for keys
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 1.61μs -> 1.87μs (13.9% slower)

def test_dataframe_with_only_required_keys():
    # Test DataFrame with only required keys
    import pandas as pd
    df = pd.DataFrame([
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02"}
    ])
    codeflash_output = validate_gantt(df); result = codeflash_output # 108μs -> 98.6μs (9.92% faster)

# --- LARGE SCALE TEST CASES ---

def test_large_list_of_dicts():
    # Test a large list of dicts (1000 elements)
    input_data = [
        {"Task": f"Task{i}", "Start": f"2020-01-{i%30+1:02d}", "Finish": f"2020-02-{i%28+1:02d}"}
        for i in range(1000)
    ]
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 2.30μs -> 2.47μs (6.69% slower)

def test_large_dataframe():
    # Test a large DataFrame (1000 rows)
    import pandas as pd
    df = pd.DataFrame([
        {"Task": f"Task{i}", "Start": f"2020-01-{i%30+1:02d}", "Finish": f"2020-02-{i%28+1:02d}"}
        for i in range(1000)
    ])
    codeflash_output = validate_gantt(df); result = codeflash_output # 35.9ms -> 429μs (8268% faster)

def test_large_dataframe_missing_key():
    # Test a large DataFrame missing one required key
    import pandas as pd
    df = pd.DataFrame([
        {"Task": f"Task{i}", "Start": f"2020-01-{i%30+1:02d}"}  # Missing "Finish"
        for i in range(1000)
    ])
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(df) # 31.1μs -> 30.0μs (3.66% faster)

def test_large_list_with_non_dict_first_element():
    # Test large list with first element not a dict
    input_data = ["Not a dict"] + [
        {"Task": f"Task{i}", "Start": f"2020-01-{i%30+1:02d}", "Finish": f"2020-02-{i%28+1:02d}"}
        for i in range(999)
    ]
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(input_data) # 2.91μs -> 2.96μs (1.69% slower)

def test_large_list_with_non_dict_later_element():
    # Test large list where a later element is not a dict (should NOT raise)
    input_data = [
        {"Task": f"Task{i}", "Start": f"2020-01-{i%30+1:02d}", "Finish": f"2020-02-{i%28+1:02d}"}
        for i in range(999)
    ] + ["Not a dict"]
    # Should NOT raise: only first element is checked
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 2.18μs -> 2.34μs (7.01% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
# imports
import pytest  # used for our unit tests

# function to test
from plotly import exceptions, optional_imports
from plotly.figure_factory._gantt import validate_gantt

pd = optional_imports.get_module("pandas")

REQUIRED_GANTT_KEYS = ["Task", "Start", "Finish"]

# unit tests

if pd is None:
    # Assumption: the DataFrame tests below need pandas, so skip this module without it.
    pytest.skip("pandas is required for these tests", allow_module_level=True)

# --- Basic Test Cases ---

def test_valid_list_of_dicts():
    # Valid input: list of dictionaries
    input_data = [
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02"},
        {"Task": "B", "Start": "2023-01-02", "Finish": "2023-01-03"}
    ]
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 1.87μs -> 1.94μs (3.86% slower)

def test_valid_dataframe():
    # Valid input: DataFrame with required columns
    df = pd.DataFrame([
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02"},
        {"Task": "B", "Start": "2023-01-02", "Finish": "2023-01-03"}
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 150μs -> 106μs (42.1% faster)

# --- Edge Test Cases ---

def test_missing_required_keys_in_dataframe():
    # DataFrame missing "Finish" column
    df = pd.DataFrame([
        {"Task": "A", "Start": "2023-01-01"},
        {"Task": "B", "Start": "2023-01-02"}
    ])
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(df) # 26.0μs -> 25.9μs (0.424% faster)

def test_missing_required_keys_in_list_of_dicts():
    # List of dicts missing "Finish" key
    input_data = [
        {"Task": "A", "Start": "2023-01-01"},
        {"Task": "B", "Start": "2023-01-02"}
    ]
    # This should not raise, as the function does not check keys for lists
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 1.53μs -> 1.75μs (12.5% slower)

def test_empty_list():
    # Empty list should raise
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt([]) # 1.76μs -> 1.81μs (2.70% slower)


def test_non_list_non_dataframe_input():
    # Input is neither a list nor a DataFrame
    input_data = "not a list or dataframe"
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(input_data) # 1.68μs -> 1.64μs (2.56% faster)

def test_dataframe_with_extra_columns():
    # DataFrame with extra columns should still work
    df = pd.DataFrame([
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02", "Extra": 123}
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 173μs -> 117μs (48.3% faster)

def test_list_of_dicts_with_extra_keys():
    # List of dicts with extra keys should pass
    input_data = [
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02", "Extra": 123}
    ]
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 1.55μs -> 1.76μs (12.2% slower)

def test_dataframe_with_wrong_column_types():
    # DataFrame with columns named correctly but with wrong types in values
    df = pd.DataFrame([
        {"Task": None, "Start": 123, "Finish": []}
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 137μs -> 91.8μs (49.7% faster)

def test_list_with_first_dict_rest_non_dicts():
    # Only the first element is checked for being a dict
    input_data = [{"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02"}, 123, "string"]
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 1.58μs -> 1.75μs (9.48% slower)

def test_dataframe_with_no_rows():
    # DataFrame with correct columns but no rows
    df = pd.DataFrame(columns=REQUIRED_GANTT_KEYS)
    codeflash_output = validate_gantt(df); output = codeflash_output # 23.9μs -> 93.8μs (74.5% slower)

def test_list_of_dicts_with_empty_dict():
    # List with an empty dictionary as first element
    input_data = [{}]
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 1.43μs -> 1.85μs (22.8% slower)

# --- Large Scale Test Cases ---

def test_large_list_of_dicts():
    # Large list of dicts (1000 elements)
    input_data = [
        {"Task": f"Task{i}", "Start": f"2023-01-{i:02d}", "Finish": f"2023-01-{i+1:02d}"}
        for i in range(1, 1001)
    ]
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 2.02μs -> 2.25μs (10.2% slower)

def test_large_dataframe():
    # Large DataFrame (1000 rows)
    df = pd.DataFrame([
        {"Task": f"Task{i}", "Start": f"2023-01-{i:02d}", "Finish": f"2023-01-{i+1:02d}"}
        for i in range(1, 1001)
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 35.7ms -> 433μs (8135% faster)

def test_large_dataframe_with_extra_columns():
    # Large DataFrame with extra columns
    df = pd.DataFrame([
        {"Task": f"Task{i}", "Start": f"2023-01-{i:02d}", "Finish": f"2023-01-{i+1:02d}", "Extra": i}
        for i in range(1, 1001)
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 81.0ms -> 511μs (15734% faster)

def test_large_list_with_non_dict_first_element():
    # Large list, first element not a dict
    input_data = [0] + [{"Task": f"Task{i}", "Start": f"2023-01-{i:02d}", "Finish": f"2023-01-{i+1:02d}"} for i in range(1, 999)]
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(input_data) # 3.19μs -> 3.32μs (3.82% slower)

def test_large_empty_dataframe():
    # Empty DataFrame: correct columns but zero rows
    df = pd.DataFrame(columns=REQUIRED_GANTT_KEYS)
    codeflash_output = validate_gantt(df); output = codeflash_output # 25.0μs -> 96.7μs (74.2% slower)

# --- Determinism and Robustness ---

def test_determinism_multiple_calls():
    # Multiple calls with same input should give same output
    input_data = [
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02"},
        {"Task": "B", "Start": "2023-01-02", "Finish": "2023-01-03"}
    ]
    codeflash_output = validate_gantt(input_data); output1 = codeflash_output # 1.61μs -> 1.88μs (14.0% slower)
    codeflash_output = validate_gantt(input_data); output2 = codeflash_output # 523ns -> 586ns (10.8% slower)

def test_dataframe_column_order():
    # DataFrame with columns in different order
    df = pd.DataFrame([
        {"Finish": "2023-01-02", "Start": "2023-01-01", "Task": "A"}
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 107μs -> 96.7μs (10.7% faster)

def test_dataframe_with_index():
    # DataFrame with custom index
    df = pd.DataFrame([
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02"},
        {"Task": "B", "Start": "2023-01-02", "Finish": "2023-01-03"}
    ], index=["x", "y"])
    codeflash_output = validate_gantt(df); output = codeflash_output # 137μs -> 91.4μs (50.1% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-validate_gantt-mhcxyu68 and push.

codeflash-ai bot and others added 3 commits October 30, 2025 04:46
The optimization achieves a **58x speedup** by eliminating the major performance bottleneck in pandas DataFrame processing. 

**Key optimizations:**

1. **Pre-fetch column data as numpy arrays**: The original code used `df.iloc[index][key]` for each cell access, which triggers pandas' slow row-based indexing mechanism. The optimized version extracts all column data upfront using `df[key].values` and stores it in a dictionary, then uses direct numpy array indexing `columns[key][index]` inside the loop.

2. **More efficient key validation**: Replaced the nested loop checking for missing keys with a single list comprehension `missing_keys = [key for key in REQUIRED_GANTT_KEYS if key not in df]`.

3. **Use explicit DataFrame columns**: The key list now comes from `list(df.columns)` instead of iterating over the DataFrame object. Iterating a DataFrame already yields its column labels (not metadata), so this makes the intent explicit rather than changing which keys are visited.
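
As an illustration of the key check in item 2, the comprehension can be exercised on its own. This is a minimal sketch: `find_missing_keys` is a hypothetical wrapper name, while `REQUIRED_GANTT_KEYS` and the comprehension itself are taken from the description above.

```python
import pandas as pd

REQUIRED_GANTT_KEYS = ["Task", "Start", "Finish"]


def find_missing_keys(df):
    # `key in df` tests membership in the DataFrame's column labels.
    return [key for key in REQUIRED_GANTT_KEYS if key not in df]


df = pd.DataFrame([{"Task": "A", "Start": "2020-01-01", "Resource": "X"}])
print(find_missing_keys(df))  # ['Finish']
```

Collecting every missing key in one pass also makes it straightforward to report all of them in a single error message, if the caller chooses to.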

**Why this is dramatically faster:**
- `df.iloc[index][key]` creates temporary pandas Series objects and involves complex indexing logic for each cell
- Direct numpy array indexing `columns[key][index]` is orders of magnitude faster
- The line profiler shows the original `df.iloc` line consumed 96.8% of execution time (523ms), while in the optimized version the dictionary comprehension accounts for 44.9% of a much smaller total (4.2ms for that line)

**Performance characteristics:**
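- **Large DataFrames see massive gains**: 8000%+ speedup on 1000-row DataFrames (35.9ms -> 429μs), and 15734% with one extra column (81.0ms -> 511μs)
- **Small DataFrames** (1-2 rows): roughly 10-50% faster
- **List inputs**: Slight slowdown (3-23% in the generated tests) due to additional validation overhead, but still microsecond-level performance (about 2μs per call)
- **Empty DataFrames**: About 73-75% slower (for example 27.0μs -> 99.0μs) due to upfront column extraction, but still under 100μs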

This optimization is most beneficial for DataFrame inputs with many rows, where the repeated `iloc` calls created a severe performance bottleneck.
@misrasaurabh1 misrasaurabh1 changed the title ⚡️ Speed up function validate_gantt by 58x ⚡️ Speed up function validate_gantt by 58x Oct 30, 2025
@camdecoster
Contributor

Thanks for the PR! Could you please add test coverage or demonstrate that test coverage is already provided? Some tests failed CI, but I think that's unrelated to your changes.

@misrasaurabh1
Author

@camdecoster just added a test for it. fixing the formatting issue now

3 participants