2023-09-12 There are many situations where you want to use PDF as a starting point for intellectual production support
- Iâve tried many things in my life.
- As of 2023, the development of LLM has led to a feeling of âI want to unify all the things Iâve been doing by patching things together, and I want to make it easier to update.
- Cutting and scanning from books (hereafter: scanned PDF)
- This includes the images on each page in high resolution
- Lecture materials such as PowerPoint (hereafter: slide PDF)
- This contains images of the material, not of each page
PDF to page image
.
- Some say that using pdfimages.
- But pdfimages is a command to retrieve images in PDF in lossless mode, so if you use it for a slide PDF, you will not get an image of the page
- In that case, you can use [pdftocairo
- Reference: PDF to PNG conversion.
PDF to text
.
- Text can be obtained by OCR from an image.
- Slide PDFs can be extracted with PDFMiner, etc. with cleaner text than with OCR.
- However, the same text is obtained from âbook PDFs with embedded OCR resultsâ as well
- This is of lower quality than todayâs OCR, depending on the level of technology at the time of embedding.
- Even if you have an old OCR-embedded PDF, it would be better to ignore it and OCR it with the new Google Cloud Vision.
- Google Cloud Vision is $1.5 for 1,000 cases in the high zone.
- Gyazo Pro costs 490 yen/month
- Internally, Google Cloud Vision
text and images are paired and managed
.
- Even if there is a hit in a search for a text, there are times when you want to see the paper because you canât make sense of the text alone or you want to see a diagram, etc.
- In the past, a text file with one line of text per page was prepared for each PDF and searched in a grep-like manner, and the number of hit lines was displayed, which was used to open the PDF.
- As of 2023, I do not think it is good that intellectual production can only be done in front of a PC. I want to use it while taking a walk or taking a bath. When I think about it, I want to put it in a cloud environment, not in a local environment on my PC.
- 2023-09-13 Gyazo Pro, I thought, why not?
- 2023-09-14 I actually tried it and got
Too many requests
after about 10,000. - Since the acquisition of OCR results is also included in the API call, only about 6,000 pages can actually be processed per day.
- 2023-09-14 I actually tried it and got
1:N correspondence between books and images
.
- Machine-readable data is spit out as JSON and sent to Dropbox
- Make it possible to get it via API if you feel the need.
- Data on Scrapbox is edited by humans as they wish.
Static content to dynamic process
.
- PDF to [Scrapbox
Scrapbox and content correspondence
.
- Put all PDF pages in one Scrapbox page
- A page should be a unit of meaning, especially a page in a book is not a unit of meaning
- Put all PDFs in one project
- Unexpected connections are the source of value.
BELOW_IS_AI_GENERATED
PDFăăScrapboxă¸
2023-09-12 23:26
Summary of notes
. We are exploring ways to utilize PDF for intellectual production support. there are two types of PDF, scanned PDF and slide PDF, and it is important to understand the characteristics of each. pdftocairo] is useful for retrieving page images from PDF, and OCR and PDFMiner are available for text extraction. PDFMiner] can be used. However, the accuracy of OCR has been improving due to technological advances, and OCR results with the newer Google Cloud Vision are of better quality than older OCR results. By managing text and images as a pair, it is possible to refer to the paper from the search results. Gyazo] is effective for this. We are also looking to move from static content to a dynamic process.
Relation to Fragment
. The fragment âOCRâ is related to the âPDF to Textâ section of the note and confirms the excellent performance of Google Cloud Visionâs OCR. Also, âInterference with bylinesâ is an important factor affecting the accuracy of OCR. The âFigure Extractionâ section is relevant to the Note in that it explores ways to extract figures from scanned PDFs and use them for intellectual production. The â20180408Note on task abstraction sticky noteâ is related to the âPairing and managing text and imagesâ section of the notebook and highlights the importance of page-by-page text extraction and retrieval.
deep thinking
Through Notes and Fragments, we were able to think deeply about how PDFs can be used to support intellectual production. In particular, we felt that it is important to pursue the improvement of efficiency and quality of intellectual production in terms of text and image pair management, improved text extraction through advances in OCR, and the transition from static content to a dynamic process.
summary of thoughts and title
. âThe Evolution and Potential of PDF-Based Intellectual Production Support.â The use of PDF for intellectual production support continues to evolve, and it is important to explore its potential from a variety of perspectives, including text and image pair management, advances in OCR, and the transition from static to dynamic.
extra info
TITLES: ["OCR", "Figure Extraction", "20180408 task abstraction sticky note", "What to do when knowledge output is required that cannot be copied and pasted", "Hatena2009-10-02"]
generated: 2023-09-12 23:26
This page is auto-translated from /nishio/PDFăăScrapbox㸠using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. Iâm very happy to spread my thought to non-Japanese readers.