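Background
I travel a lot for work, there is no internet on the train, and annotating on an e-book reader is awkward. So I needed to batch-download the page images of several books from a certain website and automatically package them into PDF files.
Analysis
1. While trying to get the image URLs, I found that F12 was disabled.
Solution: in Chrome, click the three dots in the upper-right corner to open the menu, then choose "More tools" -> "Developer tools".
Alternatively, press Ctrl+Shift+C or Ctrl+Shift+I.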
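2. Inspecting the elements shows that the image URLs follow a very regular pattern:
Inside the div with class side-image there is an img whose src is ../files/mobile/1.jpg?220927153454. Dropping the part from the question mark onward gives /files/mobile/1.jpg. From observation, a book has exactly as many .jpg files as it has pages.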
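As a rough illustration of the clean-up described in step 2, the following Java sketch strips the query string and resolves the relative src against the URL of the page that contains it. The page URL https://example.com/book1/mobile/index.html and the class name ImageUrlCleaner are placeholders assumed for this example; they are not taken from the actual site.

```java
import java.net.URI;

final class ImageUrlCleaner {

    // Drops everything from the first '?' onward, e.g. "1.jpg?220927153454" -> "1.jpg".
    static String stripQuery(String src) {
        int q = src.indexOf('?');
        return q < 0 ? src : src.substring(0, q);
    }

    // Resolves a (possibly relative) src against the URL of the page that contains it.
    static String toAbsolute(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(stripQuery(src)).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page URL, assumed only for this example
        String pageUrl = "https://example.com/book1/mobile/index.html";
        String src = "../files/mobile/1.jpg?220927153454";
        System.out.println(stripQuery(src));
        System.out.println(toAbsolute(pageUrl, src));
    }
}
```

Resolving through java.net.URI also collapses the leading "../", so the result can be requested directly.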
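3. Back on the category page, the base directory of each book can be obtained. So the rough plan for batch scraping is: take the base directory from the category page, then keep incrementing a page number until the server answers the jpg request with a 404 error. The last page that downloaded successfully is the final page of the book.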
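The "count up until the first 404" idea from step 3 can be sketched in Java as below. The existence check is modeled as an IntPredicate so the loop can be tried without a network. In a real run it would issue the HTTP request and return false on a 404. The class name PageCounter and the maxPages safety cap are assumptions added for this example.

```java
import java.util.function.IntPredicate;

final class PageCounter {

    // Returns how many consecutive pages, starting at page 1, exist according to pageExists.
    // maxPages is a safety cap so a misbehaving server cannot keep the loop running forever.
    static int countPages(IntPredicate pageExists, int maxPages) {
        int page = 1;
        while (page <= maxPages && pageExists.test(page)) {
            page++;
        }
        // The loop stops at the first missing page, so the last existing one is page - 1
        return page - 1;
    }

    public static void main(String[] args) {
        // Stand-in for the real HTTP check: pretend the book has 12 pages
        IntPredicate fakeServer = page -> page <= 12;
        System.out.println(countPages(fakeServer, 1000));
    }
}
```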
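4. How do we get the base directory from the category page?
On inspection, every page_pc_btm_book_body element contains two a tags: the first wraps the cover image and the second is the "Read online" button. The list is paginated, though, so we need a variable that accumulates the results, and we collect once per page. A collector function can be written as follows: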
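let books = [];
function catchBook() {
  // Each book card holds two <a> tags; the first one carries both the title and the link
  const cards = document.getElementsByClassName("page_pc_btm_book_body");
  // for...of visits only the elements; for...in would also visit "length", "item", etc.
  for (const card of cards) {
    if (!card.children || card.children.length < 2) continue;
    const title = card.children[0].title;
    const link = card.children[0].href;
    books.push({ title, link });
  }
}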
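Then, for every page of the list in the browser, run catchBook() once in the console. This collects the titles and base directories of all the books.
5. How do we export the JSON?
Run JSON.stringify(books) in the console, copy the result, and convert it with any online JSON-to-Excel tool. Make sure the first row is the header and the data starts on the second row.
6. The last step is to write a program that reads the data from Excel and downloads all the images in bulk. The rest of this post covers that program.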
Required dependencies
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>4.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>4.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>4.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>4.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>ooxml-schemas</artifactId>
    <version>1.4</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.13.3</version>
</dependency>
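From Excel to entity
First define an entity. I added an extra column, type, for the category. name is the title collected above, and link is the link property collected above. The header row of the sheet therefore has to read type, name, link, because the reader below matches columns by these names.
import lombok.Data;

@Data
public class Book {
    private String type;
    private String name;
    private String link;
}
Then write an ExcelReader: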
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ExcelReader {

    public static List<Book> readXlsxToList(String filePath) {
        List<Book> bookList = new ArrayList<>();
        try (FileInputStream fileInputStream = new FileInputStream(filePath);
             Workbook workbook = new XSSFWorkbook(fileInputStream)) {
            Sheet sheet = workbook.getSheetAt(0);
            Iterator<Row> rowIterator = sheet.iterator();
            // An empty sheet has no header row, so there is nothing to read
            if (!rowIterator.hasNext()) {
                return bookList;
            }
            // Read the header (first row) as an array of property names
            Row headerRow = rowIterator.next();
            String[] headers = getRowDataAsStringArray(headerRow);
            // Walk through every data row (starting from the second row)
            while (rowIterator.hasNext()) {
                Row row = rowIterator.next();
                Book book = new Book();
                // For each cell, set the entity property named by the column header
                for (Cell cell : row) {
                    int columnIndex = cell.getColumnIndex();
                    // Skip columns without a header (blank header cells stay null)
                    if (columnIndex < headers.length && headers[columnIndex] != null) {
                        String headerValue = headers[columnIndex];
                        String cellValue = getCellValueAsString(cell);
                        setBookProperty(book, headerValue, cellValue);
                    }
                }
                bookList.add(book);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return bookList;
    }

    private static String[] getRowDataAsStringArray(Row row) {
        // getLastCellNum() returns -1 for a row without cells
        String[] rowData = new String[Math.max(0, row.getLastCellNum())];
        for (Cell cell : row) {
            int columnIndex = cell.getColumnIndex();
            rowData[columnIndex] = getCellValueAsString(cell);
        }
        return rowData;
    }

    private static String getCellValueAsString(Cell cell) {
        String cellValue = "";
        if (cell != null) {
            switch (cell.getCellType()) {
                case STRING:
                    cellValue = cell.getStringCellValue();
                    break;
                case NUMERIC:
                    cellValue = String.valueOf(cell.getNumericCellValue());
                    break;
                case BOOLEAN:
                    cellValue = String.valueOf(cell.getBooleanCellValue());
                    break;
                case FORMULA:
                    cellValue = cell.getCellFormula();
                    break;
                default:
                    cellValue = "";
            }
        }
        return cellValue;
    }

    private static void setBookProperty(Book book, String propertyName, String propertyValue) {
        switch (propertyName) {
            case "type":
                book.setType(propertyValue);
                break;
            case "name":
                book.setName(propertyValue);
                break;
            case "link":
                book.setLink(propertyValue);
                break;
            // Add other properties here
            default:
                // Unknown property; handle it as needed
                break;
        }
    }
}
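From the entity list to a batch of jpg files
Next comes the batch download itself. One thing to watch: sorting file names by character code (ASCII order) puts 10.jpg before 2.jpg, so the page number is zero-padded to three digits.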
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class ImageDownloader {

    public static void downloadImages(List<Book> bookList, String targetDir) {
        for (Book book : bookList) {
            String type = book.getType();
            String name = book.getName();
            String link = book.getLink();
            String basePath = targetDir + "/" + type + "/" + name;
            int count = 1;
            boolean continueDownload = true;
            // Create the folder for this book if it does not exist yet
            if (!new File(basePath).exists()) {
                new File(basePath).mkdirs();
            }
            while (continueDownload) {
                // link is expected to be the book's base directory, ending with "/"
                String imgUrl = link + "files/mobile/" + count + ".jpg";
                // Zero-pad the page number so the files sort in page order
                String outputPath = String.format("%s/%03d.jpg", basePath, count);
                if (!imageExists(outputPath)) {
                    try {
                        downloadImage(imgUrl, outputPath);
                        System.out.println("Downloaded: " + outputPath);
                    } catch (IOException e) {
                        // Any failure (normally the 404 after the last page) ends this book
                        System.out.println("Error downloading image: " + imgUrl);
                        e.printStackTrace();
                        continueDownload = false;
                    }
                } else {
                    System.out.println("Image already exists: " + outputPath);
                }
                count++;
            }
        }
    }

    private static boolean imageExists(String path) {
        Path imagePath = Paths.get(path);
        return Files.exists(imagePath);
    }

    private static void downloadImage(String imageUrl, String outputPath) throws IOException {
        URL url = new URL(imageUrl);
        HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
        // Timeouts keep a stalled connection from blocking the whole run
        httpConn.setConnectTimeout(10_000);
        httpConn.setReadTimeout(30_000);
        try {
            int responseCode = httpConn.getResponseCode();
            if (responseCode != HttpURLConnection.HTTP_OK) {
                throw new IOException("Server returned response code " + responseCode);
            }
            // Write to a ".part" file first so an interrupted download is not
            // mistaken for a finished image on the next run
            Path partial = Paths.get(outputPath + ".part");
            try (InputStream inputStream = httpConn.getInputStream()) {
                Files.copy(inputStream, partial, StandardCopyOption.REPLACE_EXISTING);
            }
            Files.move(partial, Paths.get(outputPath), StandardCopyOption.REPLACE_EXISTING);
        } finally {
            httpConn.disconnect();
        }
    }
}
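To see why the page number is padded, here is a small self-contained Java sketch that sorts unpadded and padded file names with plain string comparison. The class name PageFileNames is made up for this example. Note that three digits only keep the order correct for books of up to 999 pages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class PageFileNames {

    // Formats a page number as a zero-padded file name, e.g. 2 -> "002.jpg".
    static String padded(int page) {
        return String.format("%03d.jpg", page);
    }

    // Returns a copy of the names sorted by plain string comparison (String.compareTo).
    static List<String> sortedCopy(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Unpadded names: "10.jpg" sorts before "2.jpg"
        System.out.println(sortedCopy(Arrays.asList("2.jpg", "10.jpg", "1.jpg")));
        // Padded names: string order matches page order
        System.out.println(sortedCopy(Arrays.asList(padded(2), padded(10), padded(1))));
    }
}
```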
Starting the batch download
import java.util.List;

public class Test {
    public static void main(String[] args) {
        List<Book> books = ExcelReader.readXlsxToList("C:\\Users\\Administrator\\Desktop\\某某書庫.xlsx");
        String targetDir = "D:\\書庫\\";
        ImageDownloader.downloadImages(books, targetDir);
    }
}
Once it is written, start it and go get some sleep.
Batch-converting the jpg images to PDF
Once everything has been downloaded, the images can be batch-converted to PDF.
import com.itextpdf.text.*;
import com.itextpdf.text.pdf.PdfWriter;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class ImageToPdfConverter {

    public static void convertToPdf(String folderPath, String outputFilePath) {
        try {
            // Collect all jpg files in the folder
            File folder = new File(folderPath);
            File[] files = folder.listFiles((dir, name) -> name.toLowerCase().endsWith(".jpg"));
            // listFiles returns null if the path is not a readable directory
            if (files == null || files.length == 0) {
                return;
            }
            // listFiles guarantees no order; sort by name so 001.jpg, 002.jpg, ... stay in page order
            Arrays.sort(files);
            // Read the first image ahead of time to get the page size
            Image firstImage = Image.getInstance(files[0].getAbsolutePath());
            Rectangle rect = new Rectangle(firstImage.getWidth(), firstImage.getHeight());
            // Create the PDF document object
            Document document = new Document(rect);
            document.setMargins(0, 0, 0, 0);
            // Create the PDF writer
            PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(outputFilePath));
            writer.setStrictImageSequence(true);
            // Open the PDF document
            document.open();
            // Add every image file to the PDF document
            for (File file : files) {
                Image image = Image.getInstance(file.getAbsolutePath());
                document.add(image);
            }
            // Close the PDF document
            document.close();
            System.out.println("PDF generated: " + outputFilePath);
        } catch (FileNotFoundException | DocumentException e) {
            e.printStackTrace();
        } catch (IOException e) {
            // Also covers MalformedURLException, which is a subclass of IOException
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String startDir = "D:\\書庫\\開發技術\\";
        File[] subdirs = new File(startDir).listFiles();
        if (subdirs == null) {
            return;
        }
        // Each subfolder holds one book and becomes <folder name>.pdf next to it
        for (File subdir : subdirs) {
            if (subdir.isDirectory()) {
                convertToPdf(subdir.getAbsolutePath(), subdir.getAbsolutePath() + ".pdf");
            }
        }
    }
}
Wrapping up
Finally, upload the PDF files to a cloud drive. They can then be downloaded on a phone, tablet, or computer and read offline at any time, which is very comfortable.
Note: scraping books for your own reading is no big deal, but sharing them over the internet infringes other people's copyright.