[source-file] Multiple sheets in XLSX #47445

Open
@ex0ns

Description

Connector Name

source-file

Connector Version

0.5.13

What step the error happened?

None

Relevant information

I was trying to load an Excel (XLSX) file containing multiple sheets and noticed that in the output all my headers were mixed up and no information about the sheets themselves was kept.

I was expecting an outcome similar to what we get when loading data from a Google Sheet: a single source is created, and within that source there is one table (i.e. stream) for each sheet of the document.

This seems related to this part of the code:

def openpyxl_chunk_reader(self, file, **kwargs):
    """Use openpyxl lazy loading feature to read excel files (xlsx only) in chunks of 500 lines at a time"""
    work_book = load_workbook(filename=file)
    user_provided_column_names = kwargs.get("names")
    for sheetname in work_book.sheetnames:
        work_sheet = work_book[sheetname]
        data = work_sheet.values
        end = work_sheet.max_row
        if end == 1 and not user_provided_column_names:
            message = "Please provide column names for table in reader options field"
            logger.error(message)
            raise AirbyteTracedException(
                message="Config validation error: " + message,
                internal_message=message,
                failure_type=FailureType.config_error,
            )
        cols, start = (next(data), 1) if not user_provided_column_names else (user_provided_column_names, 0)
        step = 500
        while start <= end:
            df = pd.DataFrame(data=(next(data) for _ in range(start, min(start + step, end))), columns=cols)
            yield df
            start += step

Is there a reason it was done that way? Would it be possible to keep information about each of the existing sheets of the document?
I don't have any experience with the Airbyte source code, so I wanted to make sure I was looking at the right place and maybe get a few pointers on where to start in order to contribute and improve the Excel reader. But first I wanted to understand why it was done this way.
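
To make the request more concrete, here is a rough sketch of the per-sheet behaviour described above. It is not the connector's code: `iter_sheet_chunks` and its parameters are hypothetical names, and it deliberately works on plain iterables of rows so that it stays independent of openpyxl and pandas. The only idea it illustrates is tagging every chunk with the name of the sheet it came from, so that each sheet could later be mapped to its own stream.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple

Row = Sequence[object]


def iter_sheet_chunks(
    sheets: Iterable[Tuple[str, Iterable[Row]]],
    names: Optional[Sequence[str]] = None,
    step: int = 500,
) -> Iterator[Tuple[str, List[object], List[Row]]]:
    """Yield (sheet_name, columns, rows) chunks, one sheet at a time.

    Hypothetical helper: `sheets` is any iterable of (sheet name, rows) pairs.
    If `names` is given it is used as the header for every sheet and the first
    row is treated as data; otherwise the first row of each sheet is its header.
    """
    for sheet_name, rows in sheets:
        row_iter = iter(rows)
        if names:
            cols = list(names)
        else:
            header = next(row_iter, None)
            if header is None:
                # Completely empty sheet: nothing to emit for it.
                continue
            cols = list(header)
        while True:
            # Take at most `step` rows; an empty chunk means the sheet is done.
            chunk = list(islice(row_iter, step))
            if not chunk:
                break
            yield sheet_name, cols, chunk
```

With openpyxl, the input could presumably be built as `((name, work_book[name].values) for name in work_book.sheetnames)`, and each chunk turned into a DataFrame with `pd.DataFrame(chunk, columns=cols)`. How the sheet name would then be turned into a separate stream inside the connector is exactly the part I am unsure about.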

Thanks!

Relevant log output

No response

Contribute

  • Yes, I want to contribute
