PDF import from J&T Banka (Slovak)

Hi,
I saw that the latest version of PP added PDF import for J&T Direktbank, which is a German bank of J&T FINANCE GROUP.
I have an account at the Slovak branch of J&T Banka.
I would like to create a log with the PDF import debugger, but the extracted text looks strange.
First of all, the PDF report is in Slovak.
Some characters are shifted before or after their neighbours, so they appear out of order.
Here is the text that the text extractor produced from the PDF:

PDFBox Version: 1.8.17
Portfolio Performance Version: 0.67.2
-----------------------------------------
VÝPIS Z ÚČTU
VKLADOVÝ ÚČET S VÝOP VDENOU LHE OTOU
IBAN: SK53 8 032 000 0 309 0 2049  2781
Názov účtu: VFsd7, KMRhQ              ToA. 
ZkU vVMfs
Mena účtu: EUR  
 H hB Fx u   4/5 1 1  
Za obdobie: 31. 12. 2023 - 02. 01. 2024 1 2190 Piešťany
Dátum výpisu: 02. 01. 2024
Číslo výpisu: 1/2024
Adresa trvalého
pobytu/sídla KlientaA.:  STZFra 82/41
92101, Piešťany
Súhrn
Aktuálna úroková sadzba (p.a.):    ,3 50 %
Výpovedná lehota: 41  dní
Počiatočný zostatok  3 000,00 EUR
Kreditné obraty (+)    3,74UE R
z toho pripísané hrubé úroky (+)    3,7E4UR
Debetné obraty (-) -   0,71UE R
z toho poplatky (-)    0,0E0UR
Konečný zostatok  3 ,300 03 EUR
Rozpis transakcií
Dátum Typ transakcie Číslo protiúčtu Suma (EUR)
Valuta Správa pre príjemcu Názov protiúčtu Poplatok
Detail transakcie
02. 01. 4022 Vysporiadanie úroku vkladu 3,74
31. 12. 2023
31. 12. 202aD 3ň z úrokov -0,71
Váš vklad je chránený v rámci zákonného systému poistenia pohľadávok z vkladov. Informácie o poistenia môžete
nájst na https://www.jtbanka.sk/poistenia-pohladavok.
Strana č.:  1/1 IBAN: SK5383200000003990242781
J&T BANKA, a.s., Sokolovská 700/113a, 186 00 Praha 8, Česká republika, IČ: 471 15 378, zapísaná v obchodnom registri vedenom Mestským súdom v Prahe, spis. zn.: B 1731,
podnikajúca na území Slovenskej republiky prostredníctvom organizačnej zložky J&T AB NKA, a.s., pobočka zahraničnej banky, vD ořákovo nábrežie 8, 811 02 rB atislava,
Slovenská republika, IČO: 35 964 693, zapísaná v obchodnom registri vedenom Mestským súdom rB atislava III, oddiel OP , vložka 1320/B.
Komfort linka : 0800 90 0 500, zo zahraničia : +421 223 607 187, info@jtbanka.sk, www.jtbanka.sk.

And here is the original PDF (I highlighted some of the differing strings):

It looks like this is a problem with the older version of PDFBox used in PP.
I extracted the text from the PDF with PDFBox 1.8.17 and with newer versions (2.0.30 and 3.0.1); the newer versions do not show the problem.

PDFBox 1.8.17

VÝPIS Z ÚČTU
VKLADOVÝ ÚČET S VÝOP VDENOU LHE OTOU
IBAN: SK53 8 032 000 0 309 0 2049  2781

PDFBox 2.0.30

VÝPIS Z ÚČTU
VKLADOVÝ ÚČET S VÝPOVEDNOU LEHOTOU
IBAN: SK53 8320 0000 0039 9024 2781 Dfčýv Dnežs

Original PDF

Okay, that is a reason to consider updating PDFBox.

So far I have tried to avoid that, because we only have the text output of the test documents, not the original PDFs. If a new PDFBox version generates slightly different text, then a) we have no easy way to notice, b) we have no way to update the test documents, and c) we can easily break the existing importers.

I have seen other changes as well: Deutsche Bank PDF documents produce slightly different output with the new PDFBox version.
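
Where the original PDF is available, as with the Slovak statement in this thread, the two text outputs can at least be compared line by line to see what a PDFBox update would change. The following is only a minimal sketch in plain Java: the class name `ExtractionDiff` and its `diff` method are hypothetical and not part of PP, and the two strings are assumed to have been produced beforehand by the old and the new PDFBox version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Portfolio Performance.
final class ExtractionDiff {

    /** Returns one message per line that differs between the two extractions. */
    static List<String> diff(String oldText, String newText) {
        // \R matches any line terminator; -1 keeps trailing empty lines
        String[] a = oldText.split("\\R", -1);
        String[] b = newText.split("\\R", -1);
        List<String> out = new ArrayList<>();
        int n = Math.max(a.length, b.length);
        for (int i = 0; i < n; i++) {
            String left = i < a.length ? a[i] : "<missing>";
            String right = i < b.length ? b[i] : "<missing>";
            if (!left.equals(right)) {
                out.add("line " + (i + 1) + ": [" + left + "] -> [" + right + "]");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // IBAN line from this thread (trailing anonymized name omitted)
        String v1 = "IBAN: SK53 8 032 000 0 309 0 2049  2781";
        String v2 = "IBAN: SK53 8320 0000 0039 9024 2781";
        diff(v1, v2).forEach(System.out::println);
    }
}
```

This compares strictly by line position, so an inserted or dropped line makes every following line show up as different. That is crude, but enough to notice that the output changed at all.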

I think a conservative approach could be to include both PDFBox versions with PP and have the importer decide which format to use. This way we can also run experiments before we kick out the old PDFBox version.
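
To make the idea concrete, here is a rough sketch of how an importer could decide which text format to use. Everything in it is an assumption for illustration: the names `DualExtraction`, `PdfBoxVersion`, `Importer` and `select` are hypothetical and do not reflect how PP is actually structured. Each importer declares which PDFBox output it was written against, and the text is only extracted with a given version when some importer asks for it.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch, not the actual PP design.
final class DualExtraction {

    enum PdfBoxVersion { LEGACY, CURRENT }

    /** An importer that knows which text format it was developed against. */
    static final class Importer {
        final String name;
        final PdfBoxVersion expects;
        final String marker; // string that identifies the importer's documents

        Importer(String name, PdfBoxVersion expects, String marker) {
            this.name = name;
            this.expects = expects;
            this.marker = marker;
        }

        boolean accepts(String text) {
            return text.contains(marker);
        }
    }

    /**
     * Picks the first importer that recognizes the document in the text format
     * it expects. The suppliers stand in for the two PDFBox extractions and are
     * assumed to be present for both versions; each runs at most once.
     */
    static Optional<Importer> select(List<Importer> importers,
                    Map<PdfBoxVersion, Supplier<String>> texts) {
        Map<PdfBoxVersion, String> cache = new EnumMap<>(PdfBoxVersion.class);
        for (Importer importer : importers) {
            String text = cache.computeIfAbsent(importer.expects, v -> texts.get(v).get());
            if (importer.accepts(text)) {
                return Optional.of(importer);
            }
        }
        return Optional.empty();
    }
}
```

With this shape, existing importers would keep `LEGACY` and stay untouched, while a new importer such as one for the Slovak statements could opt into `CURRENT`.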

Looking at the current output from the Slovak document, such as 0,0E0UR or, even worse, 3 ,300 03 EUR (3.003,03 EUR), it doesn’t make sense to try to parse it in the importer. @segi Are the numbers okay with the later versions?
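
To illustrate why the old output cannot be parsed reliably, here is a small sketch of a strict parser for amounts in the Slovak layout (space as thousands separator, comma as decimal separator). The class `SlovakAmount` is hypothetical and not taken from PP, and it assumes ordinary spaces rather than non-breaking ones. Well-formed amounts parse, while the scrambled strings from the 1.8.17 output are rejected instead of being silently misread.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Portfolio Performance.
final class SlovakAmount {

    // Optional minus, then either space-grouped thousands or plain digits,
    // then a comma and exactly two decimals.
    private static final Pattern AMOUNT =
                    Pattern.compile("-?(\\d{1,3}( \\d{3})*|\\d+),\\d{2}");

    /** Returns the parsed amount, or empty if the text is not a clean amount. */
    static Optional<BigDecimal> parse(String raw) {
        String s = raw.trim();
        if (!AMOUNT.matcher(s).matches()) {
            return Optional.empty();
        }
        return Optional.of(new BigDecimal(s.replace(" ", "").replace(',', '.')));
    }

    public static void main(String[] args) {
        // Values from the statement in this thread
        System.out.println(parse("3 000,00"));  // clean amount
        System.out.println(parse("3 ,300 03")); // scrambled closing balance
        System.out.println(parse("0,0E0"));     // scrambled fee
    }
}
```

The statement itself also offers a plausibility check: the opening balance plus credits minus debits, 3 000,00 + 3,74 - 0,71, gives the closing balance of 3 003,03 EUR, so a correctly extracted document should add up.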

Hi @AndreasB ,
now I understand how complicated it is to update PDFBox.

Maybe a good solution would be to ship both versions of PDFBox, use the new one for the text extractor in the debugger, and over time move more and more importers to the new version. I expect that some old reports will be replaced by new ones and will need to be debugged again anyway. Unless I am wrong :)

And yes, I have checked it: with the latest version the numbers are correct.


[Left side - old version, Right side - new version]
The old version's output is saved as ANSI and the latest version's as UTF-8, so some words are highlighted only because of the encoding.

If I can help in any way, please write to me.
