How to open XLSB files in Pentaho Kettle
Pentaho PDI/Kettle is not able to open Excel XLSB (binary format) files by default. However because Pentaho is a fantastic bridge between different technologies, you can use Python script step to accomplish this task. This short tutorial demonstrates how to open xlsb file and how to generate rows for further steps in Pentaho transformation.
CPython Script Executor plugin
To execute Python code in Pentaho you need CPython Script Executor plugin. This plugin can be easily downloaded from Pentaho Marketplace [menu: Tools => Marketplace]. Start again Pentaho after installing this plugin.
Yes, you need Python to be installed on you PC but also some Python packages like pandas, scikit-learn, matplotlib and pyxlsb. Some of these libraries are not required to handle XLSB files yet because CPython plugin was written to work with "Machine Learning" projects it is required to download them.
Pentaho Data Integration (PDI) version 9.4 or later does not have a Marketplace for several key reasons, mainly stemming from the development direction of this tool and Hitachi Vantara's (the owner of Pentaho) approach to managing the ecosystem of plugins and extensions. You have to manually download and install this plugin. Download latest version (quite old yet still working) from link. Unzip to 'plugins' directory. Restart Pentaho PDI.
Pyxlsb Python package
Pyxlsb parser [homepage] is needed to open XLSB file. Just install it with pip install pyxlsb command. Example of simple code to read XLSB file:
import pandas as pd
from pyxlsb import open_workbook as open_xlsb
df = []
with open_xlsb('some.xlsb') as wb:
with wb.get_sheet(1) as sheet:
for row in sheet.rows():
df.append([item.v for item in row])
df = pd.DataFrame(df[1:], columns=df[0])
Pentaho transformation
In real scenario our file names will be dynamic; paths comes from previous steps or they are read with 'Get file names' step. Here, in this example, we use 'Generate rows' to create paths - where source xlsb file is and where to save result. CPython Script Executor opens Excel file and save content into temp file [can be text file or xlsx file]. Next step, 'Text file input' is used to generate rows (from content of xlsb file)
Generate rows
We define two variables here pointing to source and result files. This is test only so static paths.
CPython Script Executor
This step reads variables delivered by previous step [paths], reads content of XLSB file and save it to text file. 'Configure' tab is set that each row triggers this step. 'Input Frames' is used to define pandas DataFrame - all variables delivered by previous step will be stored in DataFrame.
Python Script pane is the window where you write your code. 'Python Variables to Get' is to define what variables will be transferred to the next step.
The code:
import pandas as pd
from pyxlsb import open_workbook as open_xlsb
#Files paths - variable 'pframe' is pandas DataFrame
plik_czytaj = f'''{pframe["nazwapliku_czytaj"][0].strip()}'''
plik_zapisz = f'''{pframe["nazwapliku_zapisz"][0].strip()}'''
df = []
with open_xlsb(plik_czytaj) as wb:
with wb.get_sheet('Arkusz1') as sheet:
for row in sheet.rows():
df.append([item.v for item in row])
#If you want to read collumn 1 and 3
#df.append([row[0].v, row[2].v])
df = pd.DataFrame(df[1:], columns=df[0])
df.to_csv(plik_zapisz, sep=';', index=False)
Why we save result to file [df.to_csv(plik_zapisz, sep=';', index=False)]? We do this to generate rows for next Pentaho steps.
Last tab, 'Output Fields' of CPython Script Executor is mainly used to format data types leaving this step. It is not required to defined them.
Text file input
This step will simply read result file and create rows for next transformation steps.
CPython Script Executor bugs
Known issue is that sometimes this plugin loses connection with Python. This happens when you execute script with errors in code. After restarting Pentaho it works again. Keep this in mind if you are sure that you code is right.