2019-10-08
- Place book-scan PDFs in Scrapbox
- https://www.facebook.com/toshiyukimasui/posts/10157675595687498
- Split the PDF into page images, then upload them to Gyazo Pro with a script.
- Gyazo Pro uses Google Cloud Platform’s Cloud Vision API for OCR.
- OCR takes time, so the OCR data is fetched some time after the upload.
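Since the OCR text is not ready right after the upload, the script has to come back for it later. A minimal sketch of that waiting step follows. `fetch_ocr` is a hypothetical callable standing in for whatever call retrieves the OCR text of an uploaded image; it is not part of the Gyazo API or of Book2Scrapbox, and the attempt count and delay are arbitrary.

```python
import time
from typing import Callable, Optional


def wait_for_ocr(
    fetch_ocr: Callable[[str], Optional[str]],
    image_id: str,
    attempts: int = 5,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until OCR text is available or the attempts run out.

    fetch_ocr is a hypothetical callable (an assumption of this sketch):
    it returns the OCR text for image_id, or None while OCR is pending.
    """
    for attempt in range(attempts):
        text = fetch_ocr(image_id)
        if text is not None:
            return text
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return None
```

Passing `sleep` as a parameter keeps the function easy to try out without actually waiting.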
Notes from reading https://github.com/masui/Book2Scrapbox
- Images are extracted from ScanSnap scan PDFs with pdfimages.
- Related: PDF to PNG conversion.
- This works for PDFs made by cutting up a book and scanning it.
- It does not work for PDFs of slides and the like.
- Locally, the files are stored in folders named by MD5 hash.
- Sync it to AWS.
- The AWS Command Line Interface (CLI, a unified tool to manage AWS services) must be installed.
- Installing the AWS CLI - AWS Command Line Interface
- It is very clearly written.
- AWS CLI Configuration - AWS Command Line Interface
- aws s3 sync
- Deleting a file locally does not delete it on S3.
- Syncing to AWS is not strictly required,
- because the contents of the file are sent to Gyazo directly.
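The local steps above (extract images with pdfimages, store them under an MD5-named folder, optionally sync to S3) can be sketched as follows. This is not the code from masui/Book2Scrapbox: hashing the PDF's contents to name the folder, the `page` file prefix, and all function names are assumptions made for illustration. The functions only build the command lines, so nothing is executed until they are passed to something like `subprocess.run`.

```python
import hashlib
from pathlib import Path
from typing import List


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def output_dir_for(pdf: Path, root: Path) -> Path:
    """Folder named after the PDF's MD5 hash (assumed naming scheme)."""
    return root / md5_of_file(pdf)


def pdfimages_command(pdf: Path, out_dir: Path) -> List[str]:
    """Build a pdfimages call that writes page-NNN.png files into out_dir."""
    return ["pdfimages", "-png", str(pdf), str(out_dir / "page")]


def s3_sync_command(local_dir: Path, s3_uri: str, delete: bool = False) -> List[str]:
    """Build an `aws s3 sync` call.

    Without --delete, files removed locally stay on S3.
    """
    cmd = ["aws", "s3", "sync", str(local_dir), s3_uri]
    if delete:
        cmd.append("--delete")
    return cmd
```

Because `--delete` is off by default, the sketch matches the behaviour noted above: removing a local file leaves the S3 copy in place.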
- https://github.com/nishio/Book2Scrapbox
- Use pdftocairo, since slides cannot be converted to images with pdfimages.
$ pdftocairo -r 200 -f 0 -jpeg <pdf> pages
- Multiple PDFs are now combined into a single JSON
- pdfstojson.rb calls makejson.rb
- I looked into how to do it in Python, but running makejson.rb as a child process turned out to be enough.
- Some time after the JSON is ready, download the OCR results from Gyazo and add them.
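A rough sketch of the "child process, then combine into one JSON" idea, written in Python. It assumes each child prints a JSON object with a top-level `pages` list to standard output, as in Scrapbox's import format. How makejson.rb is really invoked and what it prints are not stated in these notes, so the command in the usage note is hypothetical.

```python
import json
import subprocess
from typing import Any, Dict, Iterable, List


def run_json_child(command: List[str]) -> Dict[str, Any]:
    """Run a child process and parse its stdout as JSON.

    Raises CalledProcessError if the child exits with a non-zero status.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def merge_pages(documents: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Concatenate the "pages" lists of several documents into one."""
    pages: List[Any] = []
    for doc in documents:
        # Documents without a "pages" key contribute nothing.
        pages.extend(doc.get("pages", []))
    return {"pages": pages}
```

Under those assumptions, usage could look like `merge_pages(run_json_child(["ruby", "makejson.rb", pdf]) for pdf in pdfs)`, where the argument list is a guess and not the script's documented interface.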
This page is auto-translated from /nishio/書籍スキャンPDFをScrapboxに置く2019 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.