PDF import from SelfWealth

Nirus · August 3, 2021, 6:17am

Video tutorial:
Extract PDF documents for debugging

flywire · August 3, 2021, 11:51pm

Why post this topic when the wiki clearly says create new Github Issue or post in the forum (not both)? A consolidated post of PDFExtractors available or in development would be more useful.

That link is not much use when the files need to be desensitised and they are full of non-breaking spaces (&nbsp;), which can’t normally be seen in a text editor. I passed on the recommendation from the PDFBox developer to post-process the extracted text and replace every &nbsp; with a regular space.
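To illustrate the point about invisible characters, here is a minimal sketch in Java that makes non-breaking spaces (U+00A0) visible in a line of extracted text and then replaces them with regular spaces. The class and method names are made up for this example and are not part of Portfolio Performance or PDFBox.

```java
// Hypothetical helper, for illustration only.
final class NbspInspector {
    // U+00A0 is the non-breaking space that looks like a normal space in most editors.
    private static final char NBSP = '\u00A0';

    /** Returns the text with each non-breaking space shown as the visible marker "<NBSP>". */
    static String reveal(String text) {
        return text.replace(String.valueOf(NBSP), "<NBSP>");
    }

    /** Returns the text with each non-breaking space replaced by a regular space. */
    static String normalize(String text) {
        return text.replace(NBSP, ' ');
    }

    public static void main(String[] args) {
        String line = "Trade\u00A0Date:\u00A01\u00A0Jul\u00A02021";
        System.out.println(reveal(line));
        System.out.println(normalize(line));
    }
}
```

Running `reveal` over a debug file before posting it is one way to check whether the non-breaking spaces survived the copy and paste.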

AndreasB · August 4, 2021, 5:36am

And this is the thread to post (update/new) text extracted from SelfWealth PDFs to.

Nirus · August 9, 2021, 2:48pm

Hello @flywire

I have fixed a few minor bugs.
Greetings
Alex

flywire · August 10, 2021, 1:55pm

From Create SelfWealthPDFExtractor.java by flywire · Pull Request #2340 · portfolio-performance/portfolio · GitHub

PDF author: ''
PDFBox Version: 1.8.16
-----------------------------------------
SelfWealth Limited ABN: 52 154 324 428 AFSL 421789 W: www.selfwealth.com.au E: support@selfwealth.com.au
This trade was executed and cleared by OpenMarkets Australia Ltd ABN 38 090 472 012,
AFSL 246 705, Market Particpant of ASX, CHIX and NSX.
Buy Confirmation
MR JOHN DOE  Account Number: 1234567
JOHN DOE A/C Reference No: T202107011234561
1 LONG ROAD Trade Date: 1 Jul 2021
SYDNEY NSW Settlement Date: 5 Jul 2021
2000, AUS Market: ASX
WE HAVE BOUGHT ON YOUR ACCOUNT
Quantity Security Code Security Description Price +1 Consideration Currency
25 UMAX BETA S&P500 YIELDMAX 12.40 $312.50 AUD
Brokerage* $9.50 AUD
Adviser Fee* $3.12 AUD
Net Value $325.12 AUD
GST included in this invoice is $1.14
The confirmation is a tax invoice  please retain for tax purposes. If this confirmation does not correspond with your records please contact us immediately at
support@selfwealth.com.au
Settlement Instructions All consideration and any information or documents required by OpenMarkets must be provided to OpenMarkets by 9am AEST on the Settlement
Date. This transaction will be settled from your linked cash account or in accordance with your instructions on the Settlement Date.
Contract Comments
Ex Dividend
* Inclusive of GST
+1 Standard Financial Rounding Applied (if applicable)
This confirmation is provided to you by each of SelfWealth and OpenMarkets. The Brokerage and Adviser fees set out in this confirmation are charged by SelfWealth.
OpenMarkets has not charged you any fees for the above transaction(s). The above transaction(s) and this confirmation are issued subject to the directions, decisions
and requirements of the operator of the relevant Market, ASIC Market Integrity Rules, the operating rules of the relevant Market, and, where relevant, the Clearing
Rules of the relevant Clearing Facility and the Settlement Rules of the relevant Settlement Facility, the customs and usages of the relevant Market and the correction of
errors and omissions. If this confirmation relates to multiple transactions, those transactions may have been completed on ASX or CHIX.”
Generated At: 5 Jul 2021 16:30:01 PM Page: 1 of 1

Nirus · August 10, 2021, 2:17pm

Ahh… now I see the problem.
As a workaround: before you import the PDFs, deactivate the historical quotes, import, and then re-enable them.
I’ll see what can be done to solve the problem.
I’ll talk to Andreas, but it will take some time.

Nirus · August 10, 2021, 7:20pm

Hello @flywire
I think I have worked out a solution to the problem.
If this is ok for @Andreas, both importers should work without problems.

github.com/buchen/portfolio

Modify PDF-Importer duplicate detection to support ticker symbol

buchen:master ← Nirus2000:Modify-PDF-Importer-duplicate-detection-to-support-ticker-symbol

opened 07:13PM - 10 Aug 21 UTC

Nirus2000

+24 -1

Hello @buche…n, a modification is needed for the two importers [Commonwealth Securities](https://github.com/buchen/portfolio/pull/2383) and [SelfWealth](https://github.com/buchen/portfolio/pull/2384).

**The problem:** Test purchase 1 is imported and the security is created. When the historical quotes are then fetched, the ticker symbol of the security from test purchase 1 changes from UMAX to UMAX.AX (e.g. for the Australian exchange). When test purchase 2 is then imported, the security is not recognised because the ticker symbol has changed, so it is created a second time. For these importers the PDF debug contains no ISIN or WKN; the securities are identified by their ticker symbols instead. This change fixes that. I hope this is acceptable.

Test purchases 1 and 2 for testing: [first buy.pdf](https://github.com/buchen/portfolio/files/6963945/first.buy.pdf) [second buy.pdf](https://github.com/buchen/portfolio/files/6963946/second.buy.pdf)

Regards, Alex

Greetings
Alex
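The UMAX versus UMAX.AX mismatch described in the pull request can be sketched as follows. This is not the code from the pull request, only an illustration under the assumption that the exchange suffix is everything after the first dot; the class and method names are hypothetical.

```java
// Hypothetical sketch, not the Portfolio Performance implementation.
final class TickerMatch {
    /** Strips an exchange suffix such as ".AX"; returns the input unchanged if there is no dot. */
    static String baseSymbol(String ticker) {
        int dot = ticker.indexOf('.');
        return dot < 0 ? ticker : ticker.substring(0, dot);
    }

    /** Treats two ticker symbols as the same security if their base symbols are equal. */
    static boolean sameSecurity(String a, String b) {
        return baseSymbol(a).equalsIgnoreCase(baseSymbol(b));
    }

    public static void main(String[] args) {
        System.out.println(sameSecurity("UMAX", "UMAX.AX"));
        System.out.println(sameSecurity("UMAX", "WPL.AX"));
    }
}
```

A comparison this simple would also merge symbols where the dot is part of the ticker itself, so a real fix would need a stricter rule.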

flywire · August 10, 2021, 10:57pm

Note the pdf debug in this post is different to the original file linked from GitHub. The process has replaced the non-breaking spaces with spaces.

flywire · August 15, 2021, 10:31pm

github.com/portfolio-performance/portfolio

Comment by buchen to Create SelfWealthPDFExtractor.java

portfolio-performance:master ← flywire:SelfWealthPDFExtractor

Alright, thanks @flywire for the contribution and thanks @Nirus2000 for commenting on the change and helping. I have now picked up the code and fixed it as well as I could understand it. A couple of remarks:

* I removed the GST line - it apparently was already included in the other line items
* The "Sell" file was somehow corrupted - it had strange space characters which led to the patterns not working. I did remove them, but if the PDF actually creates these space characters, then we need updated regex patterns.
* The reason I fixed the sample file and not the regex expressions was that I had the impression the "sell" file was manually created. At least in Germany - if you sell something - you get the proceeds **without** brokerage fees. :smile: Therefore I have updated the file accordingly. If this is different in Australia, then PP cannot treat "brokerage fees" as fees; we would have to add them to the original proceeds.

```
Sell Confirmation
[...]
397 WPL WOODSIDE PETROLEUM 21.88 $8,686.36 AUD
Brokerage* $9.50 AUD
Adviser Fee* $0.00 AUD
Net Value $8,695.86 AUD
```

* I registered the importer in the ```PDFImportAssistent``` because otherwise it is not applied to PDF imports

I am happy to merge more contributions. Please make sure the Github Actions workflow compiles.
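The arithmetic behind the remark on sell proceeds can be sketched like this, using the figures from the contract notes quoted in this thread. It assumes the convention buchen describes, where a buyer pays consideration plus fees and a seller receives consideration minus fees; whether SelfWealth sell confirmations really follow that convention is not settled here. The class name is hypothetical.

```java
import java.math.BigDecimal;

// Hypothetical sketch of the net-value arithmetic, not importer code.
final class NetValue {
    /** Buy: the amount paid is the consideration plus all fees. */
    static BigDecimal buy(BigDecimal consideration, BigDecimal... fees) {
        BigDecimal total = consideration;
        for (BigDecimal fee : fees)
            total = total.add(fee);
        return total;
    }

    /** Sell (assumed convention): the proceeds are the consideration minus all fees. */
    static BigDecimal sell(BigDecimal consideration, BigDecimal... fees) {
        BigDecimal total = consideration;
        for (BigDecimal fee : fees)
            total = total.subtract(fee);
        return total;
    }

    public static void main(String[] args) {
        // Buy sample: 312.50 + 9.50 + 3.12
        System.out.println(buy(new BigDecimal("312.50"), new BigDecimal("9.50"), new BigDecimal("3.12")));
        // Sell sample: 8686.36 - 9.50 - 0.00
        System.out.println(sell(new BigDecimal("8686.36"), new BigDecimal("9.50"), new BigDecimal("0.00")));
    }
}
```

For comparison, the Net Value of $8,695.86 in the quoted sell sample equals consideration plus brokerage, which is the buy-side calculation.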

The “Sell” file was somehow corrupted - it had strange space characters which led to the patterns not working. I did remove them, but if the PDF actually creates this space characters, then…

Feel free to read on through thread yourself.

github.com/portfolio-performance/portfolio

Comment by flywire to Create SelfWealthPDFExtractor.java

portfolio-performance:master ← flywire:SelfWealthPDFExtractor

The SelfWealth PDF files have been decoded and the issue investigated: [PDFBOX-5247] Space in pdf returns c2 a0 characters instead of 20 - ASF JIRA

Indeed every space is a [Non-breaking_space](https://en.wikipedia.org/wiki/Non-breaking_space)

> if you don’t like it, do your own postprocessing. It’s unusual but valid.

@buchen I think this is a good case for post-processing rather than having hidden codes in the test files. What do you think?

Add to SelfWealthPDFExtractor.java something like:

```java
// U+00A0 is the non-breaking space; pdfstream is assumed to be the extracted text as a String.
String nbsp = "\u00A0";
pdfstream = pdfstream.replace(nbsp, " ");
```

With non-breaking spaces, the test files are very fiddly and error-prone to create.

Updated: [SelfWealthBuy01.txt](https://github.com/buchen/portfolio/files/6897113/SelfWealthBuy01.txt) [SelfWealthSell01.txt](https://github.com/buchen/portfolio/files/6897114/SelfWealthSell01.txt)

@AndreasB

From my point of view, there are two options:

[deleted]

do a post-processing and replace the non-breaking spaces with regular spaces - would we do this on the text output before passing it to any converter, or would this be a pre-processing specific to SelfWealth?

[and update supplied test files.]

Nirus · August 16, 2021, 4:35pm

Hello @flywire
Now the PDF debugs show the problem… the fix is done.
The previous PDF debugs did not include the characters. What is not there cannot be recognised by PP. This also means that a test case will pass even though the PDF debug was extracted incorrectly. If it is extracted correctly, as in the video tutorial, then we see these characters and can handle them in the PDF importer.


Bye
Alex

flywire · November 5, 2021, 5:30am

The SelfWealth PDF-Importer sets the Note to the Reference No yet it is being overwritten by the filename. Why does that happen?

Test file: SelfWealthBuy01.pdf (16.6 KB) (from tests converted to pdf with LibreOffice Writer).

Nirus · November 5, 2021, 12:11pm

Done.
Alex

flywire · November 6, 2021, 5:54am

Thanks Alex. The reason I asked the question is I was trying to understand the code and I couldn’t see why the filename was stored instead of the selected text. I thought it might be a standard note being introduced.

Rafa · November 6, 2021, 6:48am

Maybe this is the root cause:

github.com

portfolio-performance/portfolio/blob/fecc54b771b688e078bc5058889ea3ad575e70ea/name.abuchen.portfolio/src/name/abuchen/portfolio/datatransfer/pdf/AbstractPDFExtractor.java#L113-L118


      
          if (subject instanceof Transaction)
              ((Transaction) subject).setSource(filename);
          else if (subject.getNote() == null || subject.getNote().trim().length() == 0)
              item.getSubject().setNote(filename);
          else
              item.getSubject().setNote(item.getSubject().getNote().trim().concat(" | ").concat(filename)); //$NON-NLS-1$

Nirus · November 6, 2021, 7:06am

Hello @Rafa
No, the regular expression for the note was wrong.
See my changes in the pull request.

Match note before:
.match(" Reference No: (?<note>.*)$")

after:
.match("^.* Reference No: (?<note>.*)$")

Greetings
Alex
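The difference between the two patterns can be reproduced with plain `java.util.regex`, using the Reference No line from the debug text earlier in this thread. The sketch assumes the importer requires a pattern to match the whole line, as `Matcher.matches()` does; that is my reading and is not confirmed in this thread. The class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo, not importer code.
final class ReferenceNoMatch {
    static final Pattern BEFORE = Pattern.compile(" Reference No: (?<note>.*)$");
    static final Pattern AFTER = Pattern.compile("^.* Reference No: (?<note>.*)$");

    /** Returns the captured note if the pattern matches the entire line, otherwise null. */
    static String note(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.matches() ? m.group("note") : null;
    }

    public static void main(String[] args) {
        String line = "JOHN DOE A/C Reference No: T202107011234561";
        // The old pattern cannot cover the text before " Reference No:", so no note is captured.
        System.out.println(note(BEFORE, line));
        // The leading "^.*" absorbs the account name, so the whole line matches.
        System.out.println(note(AFTER, line));
    }
}
```

Under that assumption the old pattern never sets a note, which would leave only the filename in the Note field.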

flywire · November 6, 2021, 7:22am

lol that would do it. I assume the default is:

if Note = "" then Note = filename
   else Note = Note + " | " + filename

That would probably be good enough because the note = TransNo and the filename is just userid+TransNo+.pdf.
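The assumed default above, written out as a small standalone Java method. It mirrors the pseudocode and the note branches of the snippet Rafa linked, but the class and method names are hypothetical and it is only a sketch.

```java
// Hypothetical standalone version of the note/filename fallback.
final class NoteSource {
    /** Returns the filename alone if the note is blank, otherwise "note | filename". */
    static String withFilename(String note, String filename) {
        if (note == null || note.trim().isEmpty())
            return filename;
        return note.trim() + " | " + filename;
    }

    public static void main(String[] args) {
        System.out.println(withFilename(null, "SelfWealthBuy01.pdf"));
        System.out.println(withFilename("T202107011234561", "SelfWealthBuy01.pdf"));
    }
}
```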

Rafa · November 6, 2021, 9:06am

Well, the RegEx is one thing, but to my mind, falling back to the file name when the note cannot be parsed feels a little bit weak.

Nirus · November 6, 2021, 11:20am

Hello @Rafa
We have made many new changes in PP. In the future, the note and the source (file name) will be separated from each other. There will be a separate column in PP for this purpose. So stay tuned for the new release.

Greetings
Alex

flywire · October 5, 2022, 12:31am

New contract note variant SelfWealth → Selfwealth - Selfwealth by flywire · Pull Request #2998 · buchen/portfolio · GitHub

@Nirus Can you adjust the RegEx?

Nirus · October 5, 2022, 2:31am

Hello @flywire
I need a small PDF debug with one or two transactions.

Greetings
Alex