国产 无码 综合区,色欲AV无码国产永久播放,无码天堂亚洲国产AV,国产日韩欧美女同一区二区

Nougat:結(jié)合光學神經(jīng)網(wǎng)絡,引領學術(shù)PDF文檔的智能解析、挖掘?qū)W術(shù)論文PDF的價值

這篇具有很好參考價值的文章主要介紹了Nougat:結(jié)合光學神經(jīng)網(wǎng)絡,引領學術(shù)PDF文檔的智能解析、挖掘?qū)W術(shù)論文PDF的價值。希望對大家有所幫助。如果存在錯誤或未考慮完全的地方,請大家不吝賜教,您也可以點擊"舉報違法"按鈕提交疑問。

Nougat:結(jié)合光學神經(jīng)網(wǎng)絡,引領學術(shù)PDF文檔的智能解析、挖掘?qū)W術(shù)論文PDF的價值

這是Nougat的官方存儲庫,Nougat是一種學術(shù)文檔PDF解析器,可以理解LaTeX數(shù)學和表格。

Project page: https://facebookresearch.github.io/nougat/

1.安裝

From pip:

pip install nougat-ocr

From repository:

pip install git+https://github.com/facebookresearch/nougat

Note, on Windows: If you want to utilize a GPU, make sure you first install the correct PyTorch version. Follow instructions here

如果您想從API調(diào)用模型或生成數(shù)據(jù)集,則會有額外的依賴項。
安裝通過

pip install "nougat-ocr[api]" or pip install "nougat-ocr[dataset]"

1.2 獲取PDF的預測

1.2.1 CLI

To get predictions for a PDF run

$ nougat path/to/file.pdf -o output_directory

目錄或文件的路徑(其中每行都是PDF的路徑)也可以作為位置參數(shù)傳遞

$ nougat path/to/directory -o output_directory
usage: nougat [-h] [--batchsize BATCHSIZE] [--checkpoint CHECKPOINT] [--model MODEL] [--out OUT]
              [--recompute] [--markdown] [--no-skipping] pdf [pdf ...]

positional arguments:
  pdf                   PDF(s) to process.

options:
  -h, --help            show this help message and exit
  --batchsize BATCHSIZE, -b BATCHSIZE
                        Batch size to use.
  --checkpoint CHECKPOINT, -c CHECKPOINT
                        Path to checkpoint directory.
  --model MODEL_TAG, -m MODEL_TAG
                        Model tag to use.
  --out OUT, -o OUT     Output directory.
  --recompute           Recompute already computed PDF, discarding previous predictions.
  --full-precision      Use float32 instead of bfloat16. Can speed up CPU conversion for some setups.
  --no-markdown         Do not add postprocessing step for markdown compatibility.
  --markdown            Add postprocessing step for markdown compatibility (default).
  --no-skipping         Don't apply failure detection heuristic.
  --pages PAGES, -p PAGES
                        Provide page numbers like '1-4,7' for pages 1 through 4 and page 7. Only works for single PDFs.

The default model tag is 0.1.0-small. If you want to use the base model, use 0.1.0-base.

$ nougat path/to/file.pdf -o output_directory -m 0.1.0-base

In the output directory every PDF will be saved as a .mmd file, the lightweight markup language, mostly compatible with Mathpix Markdown (we make use of the LaTeX tables).

Note: On some devices the failure detection heuristic is not working properly. If you experience a lot of [MISSING_PAGE] responses, try to run with the --no-skipping flag. Related: #11, #67

1.2.2 API

With the extra dependencies you use app.py to start an API. Call

$ nougat_api

通過向http://127.0.0.1:8503/ predict/發(fā)出POST請求來獲得PDF文件的預測。它還接受參數(shù)“start”和“stop”,以限制計算選擇頁碼(包括邊界)。

響應是一個帶有文檔標記文本的字符串。

curl -X 'POST' \
  'http://127.0.0.1:8503/predict/' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@<PDFFILE.pdf>;type=application/pdf'

To use the limit the conversion to pages 1 to 5, use the start/stop parameters in the request URL: http://127.0.0.1:8503/predict/?start=1&stop=5

2.Dataset

2.1 生成數(shù)據(jù)集

To generate a dataset you need

  1. A directory containing the PDFs
  2. A directory containing the .html files (processed .tex files by LaTeXML) with the same folder structure
  3. A binary file of pdffigures2 and a corresponding environment variable export PDFFIGURES_PATH="/path/to/binary.jar"

Next run

python -m nougat.dataset.split_htmls_to_pages --html path/html/root --pdfs path/pdf/root --out path/paired/output --figure path/pdffigures/outputs

Additional arguments include

Argument Description
--recompute recompute all splits
--markdown MARKDOWN Markdown output dir
--workers WORKERS How many processes to use
--dpi DPI What resolution the pages will be saved at
--timeout TIMEOUT max time per paper in seconds
--tesseract Tesseract OCR prediction for each page

Finally create a jsonl file that contains all the image paths, markdown text and meta information.

python -m nougat.dataset.create_index --dir path/paired/output --out index.jsonl

For each jsonl file you also need to generate a seek map for faster data loading:

python -m nougat.dataset.gen_seek file.jsonl

The resulting directory structure can look as follows:

root/
├── images
├── train.jsonl
├── train.seek.map
├── test.jsonl
├── test.seek.map
├── validation.jsonl
└── validation.seek.map

Note that the .mmd and .json files in the path/paired/output (here images) are no longer required.
This can be useful for pushing to a S3 bucket by halving the amount of files.

2.2Training

To train or fine tune a Nougat model, run

python train.py --config config/train_nougat.yaml

2.3 Evaluation

Run

python test.py --checkpoint path/to/checkpoint --dataset path/to/test.jsonl --save_path path/to/results.json

To get the results for the different text modalities, run

python -m nougat.metrics path/to/results.json

2.4 FAQ

  • Why am I only getting [MISSING_PAGE]?

    Nougat was trained on scientific papers found on arXiv and PMC. Is the document you’re processing similar to that?
    What language is the document in? Nougat works best with English papers, other Latin-based languages might work. Chinese, Russian, Japanese etc. will not work.
    If these requirements are fulfilled it might be because of false positives in the failure detection, when computing on CPU or older GPUs (#11). Try passing the --no-skipping flag for now.

  • Where can I download the model checkpoint from.

    They are uploaded here on GitHub in the release section. You can also download them during the first execution of the program. Choose the preferred preferred model by passing --model 0.1.0-{base,small}

參考鏈接:
https://github.com/facebookresearch/nougat

更多優(yōu)質(zhì)內(nèi)容請關(guān)注公號:汀丶人工智能;會提供一些相關(guān)的資源和優(yōu)質(zhì)文章,免費獲取閱讀。文章來源地址http://www.zghlxwxcb.cn/news/detail-757918.html

到了這里,關(guān)于Nougat:結(jié)合光學神經(jīng)網(wǎng)絡,引領學術(shù)PDF文檔的智能解析、挖掘?qū)W術(shù)論文PDF的價值的文章就介紹完了。如果您還想了解更多內(nèi)容,請在右上角搜索TOY模板網(wǎng)以前的文章或繼續(xù)瀏覽下面的相關(guān)文章,希望大家以后多多支持TOY模板網(wǎng)!

本文來自互聯(lián)網(wǎng)用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權(quán),不承擔相關(guān)法律責任。如若轉(zhuǎn)載,請注明出處: 如若內(nèi)容造成侵權(quán)/違法違規(guī)/事實不符,請點擊違法舉報進行投訴反饋,一經(jīng)查實,立即刪除!

領支付寶紅包贊助服務器費用

相關(guān)文章

覺得文章有用就打賞一下文章作者

支付寶掃一掃打賞

博客贊助

微信掃一掃打賞

請作者喝杯咖啡吧~博客贊助

支付寶掃一掃領取紅包,優(yōu)惠每天領

二維碼1

領取紅包

二維碼2

領紅包