1. How to crawl your own CSDN blog article list (manual + Python code)
2. Fetching your own CSDN article list and sorting it by quality score, lowest first (Alibaba Cloud API authentication)
Steps
Open Google Chrome
Enter the URL
https://dontla.blog.csdn.net/?type=blog
Press F12 to open the developer tools
Click the Network tab and clear the existing entries
Press F5 to refresh the page
Find the endpoint (community/home-api/v1/get-business-list)
https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username=Dontla
Understanding the endpoint
https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username=Dontla
This is an HTTP GET endpoint that returns a "business list" from the CSDN blog site; here it is used to fetch one user's list of blog articles. The query parameters are:
- page=1: the page number to request; 1 asks for the first page.
- size=20: the number of records per page, 20 here.
- businessType=blog: the kind of content to list; "blog" selects blog articles.
- orderby=: presumably the sort order. It is empty in this request, so the server applies some default, possibly newest first.
- noMore=false: apparently a flag for whether more records remain; false suggests there may be more to fetch.
- year= and month=: presumably filters for a specific year and month. Both are empty here, so articles from all periods should be returned.
- username=Dontla: the user whose article list is requested, "Dontla" in this case.
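The parameter list above can be assembled in code rather than typed by hand. The following is a minimal sketch using only the standard library; the helper name build_list_url is hypothetical, and the parameter values are simply the ones seen in the captured request.

```python
from urllib.parse import urlencode

BASE_URL = "https://blog.csdn.net/community/home-api/v1/get-business-list"


def build_list_url(username, page=1, size=20):
    """Return the article-list URL for one page of a user's blog."""
    params = {
        "page": page,
        "size": size,
        "businessType": "blog",
        "orderby": "",       # left empty, as in the captured request
        "noMore": "false",
        "year": "",
        "month": "",
        "username": username,
    }
    # urlencode keeps the insertion order and emits empty values as "key="
    return f"{BASE_URL}?{urlencode(params)}"


print(build_list_url("Dontla"))
```

Keeping the parameters in a dict makes it easy to change page and size later without editing a long URL string.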
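Writing code to fetch the blog list
First, check the format of the response
Copy the following URL into the browser's address bar and open it:
https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=1&businessType=blog&orderby=&noMore=false&year=&month=&username=Dontla
Select everything, copy it, then paste the text into an editor and format it: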
{"code":200,"message":"success","traceId":"47d3f9ad-bfc0-4604-b386-48b0e0b40c8d","data":{"list":[{"articleId":132295415,"title":"shellcheck警告:Declare and assign separately to avoid masking return values.shellcheck(SC2155)","description":"ShellCheck的SC2155警告是關于在shell腳本中正確處理命令返回值的一個重要提示。通過將聲明和賦值分開進行,我們可以確保命令的返回值不會被誤導,并且在命令執行失敗時,腳本能夠正確地捕獲并處理錯誤。","url":"https://dontla.blog.csdn.net/article/details/132295415","type":1,"top":false,"forcePlan":false,"viewCount":8,"commentCount":0,"editUrl":"https://editor.csdn.net/md?articleId=132295415","postTime":"2023-08-15 13:16:23","diggCount":0,"formatTime":"8 小時前","picList":["https://img-blog.yssmx.com/a0eb894421994488a27fd20a767d00de.png"],"collectCount":0}],"total":2557}}
{
    "code": 200,
    "message": "success",
    "traceId": "47d3f9ad-bfc0-4604-b386-48b0e0b40c8d",
    "data": {
        "list": [
            {
                "articleId": 132295415,
                "title": "shellcheck警告:Declare and assign separately to avoid masking return values.shellcheck(SC2155)",
                "description": "ShellCheck的SC2155警告是關于在shell腳本中正確處理命令返回值的一個重要提示。通過將聲明和賦值分開進行,我們可以確保命令的返回值不會被誤導,并且在命令執行失敗時,腳本能夠正確地捕獲并處理錯誤。",
                "url": "https://dontla.blog.csdn.net/article/details/132295415",
                "type": 1,
                "top": false,
                "forcePlan": false,
                "viewCount": 8,
                "commentCount": 0,
                "editUrl": "https://editor.csdn.net/md?articleId=132295415",
                "postTime": "2023-08-15 13:16:23",
                "diggCount": 0,
                "formatTime": "8 小時前",
                "picList": [
                    "https://img-blog.yssmx.com/a0eb894421994488a27fd20a767d00de.png"
                ],
                "collectCount": 0
            }
        ],
        "total": 2557
    }
}
What is known so far: a type value of 1 means original, and 2 means reposted.
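JSON field reference
This is the JSON body of the HTTP response. Its fields are:
- code: a status code reported inside the response body; 200 here indicates success, following the usual HTTP convention.
- message: a description of the result; "success" means the request was handled successfully.
- traceId: probably a unique identifier for this request, used for tracing and debugging.
- data: the actual payload, with these fields:
  - list: an array holding the requested entries. Because the request specified size=1, it contains a single object with these properties:
    - articleId: the unique identifier of the article.
    - title: the article title.
    - description: the article description.
    - url: the link to the article.
    - type: the article type; so far 1 is known to mean original and 2 reposted.
    - top: whether the article is pinned; false means it is not.
    - forcePlan: meaning unclear; it would need the API documentation or the API provider to confirm.
    - viewCount: the number of views.
    - commentCount: the number of comments.
    - editUrl: the link for editing the article.
    - postTime: the publication time.
    - diggCount: the number of likes.
    - formatTime: the publication time as a human-readable relative string (here "8 小時前", that is, "8 hours ago").
    - picList: the list of images in the article.
    - collectCount: the number of times the article was bookmarked.
  - total: the total number of records matching the request conditions (username, business type and so on).
In short, this response shows that the article list of user "Dontla" was fetched successfully (only one result is returned because size=1 was set). The user has 2557 blog articles in total, and the response gives the newest article's title, description, link, type, view count, comment count, edit link, publication time, like count, image list and bookmark count.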
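To make the field descriptions above concrete, here is a small sketch that parses a trimmed copy of the sample response and pulls out a few fields. The helper name summarize and the TYPE_NAMES mapping are hypothetical; the mapping only encodes the two type values identified so far, and any other value is reported as "unknown".

```python
import json

# Trimmed copy of the sample response shown earlier in the article
SAMPLE = """
{"code": 200, "message": "success",
 "data": {"list": [{"articleId": 132295415,
                    "url": "https://dontla.blog.csdn.net/article/details/132295415",
                    "type": 1, "viewCount": 8, "diggCount": 0,
                    "postTime": "2023-08-15 13:16:23"}],
          "total": 2557}}
"""

# Only these two values are known from observation
TYPE_NAMES = {1: "original", 2: "reposted"}


def summarize(payload):
    """Return (total, rows) where rows holds a few fields per article."""
    if payload.get("code") != 200:
        raise ValueError(f"unexpected code: {payload.get('code')}")
    data = payload["data"]
    rows = [
        {
            "articleId": item["articleId"],
            "url": item["url"],
            "kind": TYPE_NAMES.get(item["type"], "unknown"),
            "likes": item["diggCount"],
        }
        for item in data["list"]
    ]
    return data["total"], rows


total, rows = summarize(json.loads(SAMPLE))
print(total, rows[0]["kind"], rows[0]["url"])
```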
Why is a like called "digg"?
In online communities, "digg" is often used to mean a like or an upvote. The word comes from Digg, an American news site where users could "digg" (vote for) the articles they liked, and the most popular articles were promoted to the front page. Many sites and apps have since used "digg" for the like or vote action.
Testing the endpoint with Apipost
GET https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=1&businessType=blog&orderby=&noMore=false&year=&month=&username=Dontla
(Apipost endpoint metadata)
{
"parent_id": "0",
"project_id": "-1",
"target_id": "fdb84824-e558-48f1-9456-219ea5e9950e",
"target_type": "api",
"name": "新建接口",
"sort": 1,
"version": 0,
"mark": "developing",
"create_dtime": 1692028800,
"update_dtime": 1692109242,
"update_day": 1692028800000,
"status": 1,
"modifier_id": "-1",
"method": "GET",
"mock": "{}",
"mock_url": "/community/home-api/v1/get-business-list",
"url": "https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username=Dontla",
"request": {
"url": "https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username=Dontla",
"description": "",
"auth": {
"type": "noauth",
"kv": {
"key": "",
"value": ""
},
"bearer": {
"key": ""
},
"basic": {
"username": "",
"password": ""
},
"digest": {
"username": "",
"password": "",
"realm": "",
"nonce": "",
"algorithm": "",
"qop": "",
"nc": "",
"cnonce": "",
"opaque": ""
},
"hawk": {
"authId": "",
"authKey": "",
"algorithm": "",
"user": "",
"nonce": "",
"extraData": "",
"app": "",
"delegation": "",
"timestamp": "",
"includePayloadHash": -1
},
"awsv4": {
"accessKey": "",
"secretKey": "",
"region": "",
"service": "",
"sessionToken": "",
"addAuthDataToQuery": -1
},
"ntlm": {
"username": "",
"password": "",
"domain": "",
"workstation": "",
"disableRetryRequest": 1
},
"edgegrid": {
"accessToken": "",
"clientToken": "",
"clientSecret": "",
"nonce": "",
"timestamp": "",
"baseURi": "",
"headersToSign": ""
},
"oauth1": {
"consumerKey": "",
"consumerSecret": "",
"signatureMethod": "",
"addEmptyParamsToSign": -1,
"includeBodyHash": -1,
"addParamsToHeader": -1,
"realm": "",
"version": "1.0",
"nonce": "",
"timestamp": "",
"verifier": "",
"callback": "",
"tokenSecret": "",
"token": ""
}
},
"body": {
"mode": "none",
"parameter": [],
"raw": "",
"raw_para": [],
"raw_schema": {
"type": "object"
}
},
"event": {
"pre_script": "",
"test": ""
},
"header": {
"parameter": []
},
"query": {
"parameter": [
{
"description": "",
"is_checked": 1,
"key": "page",
"type": "Text",
"not_null": 1,
"field_type": "String",
"value": "1"
},
{
"description": "",
"is_checked": 1,
"key": "size",
"type": "Text",
"not_null": 1,
"field_type": "String",
"value": "20"
},
{
"description": "",
"is_checked": 1,
"key": "businessType",
"type": "Text",
"not_null": 1,
"field_type": "String",
"value": "blog"
},
{
"description": "",
"is_checked": 1,
"key": "orderby",
"type": "Text",
"not_null": 1,
"field_type": "String",
"value": ""
},
{
"description": "",
"is_checked": 1,
"key": "noMore",
"type": "Text",
"not_null": 1,
"field_type": "String",
"value": "false"
},
{
"description": "",
"is_checked": 1,
"key": "year",
"type": "Text",
"not_null": 1,
"field_type": "String",
"value": ""
},
{
"description": "",
"is_checked": 1,
"key": "month",
"type": "Text",
"not_null": 1,
"field_type": "String",
"value": ""
},
{
"description": "",
"is_checked": 1,
"key": "username",
"type": "Text",
"not_null": 1,
"field_type": "String",
"value": "Dontla"
}
]
},
"cookie": {
"parameter": []
},
"resful": {
"parameter": []
}
},
"response": {
"success": {
"raw": "",
"parameter": [],
"expect": {
"name": "成功",
"isDefault": 1,
"code": 200,
"contentType": "json",
"verifyType": "schema",
"mock": "",
"schema": {}
}
},
"error": {
"raw": "",
"parameter": [],
"expect": {
"name": "失敗",
"isDefault": -1,
"code": 404,
"contentType": "json",
"verifyType": "schema",
"mock": "",
"schema": {}
}
}
},
"is_first_match": 1,
"ai_expect": {
"list": [],
"none_math_expect_id": "error"
},
"enable_ai_expect": -1,
"enable_server_mock": -1,
"is_example": -1,
"is_locked": -1,
"is_check_result": 1,
"check_result_expectId": "",
"is_changed": -1,
"is_saved": -1
}
Writing the Python code (the site has an anti-crawler policy, so request headers are needed) (works)
Anti-crawler policy: some sites inspect the request headers (User-Agent) to decide whether a request comes from a bot. The fix is to send a suitable header:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
}
response = requests.get(url, headers=headers)
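By default, HTTP libraries identify themselves in the User-Agent header (requests sends python-requests/<version>, urllib sends Python-urllib/<version>), which is what such a check can pick up. The sketch below shows the same header override with the standard library's urllib.request instead of requests, purely so that it runs without third-party packages. It only builds the request object and does not contact the server.

```python
from urllib.request import Request

URL = ("https://blog.csdn.net/community/home-api/v1/get-business-list"
       "?page=1&size=1&businessType=blog&orderby=&noMore=false"
       "&year=&month=&username=Dontla")
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36")

# Build the request object only; nothing is sent over the network here
req = Request(URL, headers={"User-Agent": BROWSER_UA})

# urllib stores header names via str.capitalize(), hence "User-agent"
print(req.get_method(), req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen would actually send it; whether the server accepts a given User-Agent string is something to verify by experiment.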
Complete code:
import json

import requests

# Holds the information collected for every article
articles = []
# Start from the first page
page = 1
# Number of articles requested per page
page_size = 50
# Browser-like User-Agent to get past the anti-crawler check
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
}

while True:
    # Build the request URL
    url = f"https://blog.csdn.net/community/home-api/v1/get-business-list?page={page}&size={page_size}&businessType=blog&orderby=&noMore=false&year=&month=&username=Dontla"
    # Send the GET request; without the headers the response comes back empty
    response = requests.get(url, headers=headers, timeout=10)
    # If the request succeeded
    if response.status_code == 200:
        # Check that the response body is not empty
        if response.text:
            # Parse the JSON response
            try:
                data = response.json()
            except ValueError:  # JSON decode errors are subclasses of ValueError
                print(f"Error parsing JSON: {response.text}")
                break
            # Walk through the articles on this page
            for article in data['data']['list']:
                print(f"page: {page}, {article['url']}")
                # Keep only the fields we need
                articles.append({
                    'title': article['title'],
                    'url': article['url'],
                    'type': article['type'],
                    'postTime': article['postTime']
                })
            # Fewer articles than page_size means this was the last page
            if len(data['data']['list']) < page_size:
                break
            # Move on to the next page
            page += 1
        else:
            print("Response is empty")
            break
    else:
        print(f"Error: {response.status_code}")
        break

# Save the result as a JSON file (UTF-8, since ensure_ascii=False keeps Chinese titles as-is)
with open('articles.json', 'w', encoding='utf-8') as f:
    json.dump(articles, f, ensure_ascii=False, indent=4)
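The stopping rule of the loop (a page shorter than page_size is the last page) can be checked without touching the network. The sketch below isolates that rule in a hypothetical helper, collect_all, and feeds it a fake fetch function in place of the HTTP call; the seven fake articles are made-up test data.

```python
def collect_all(fetch_page, page_size):
    """Call fetch_page(page, page_size) until a short page signals the end."""
    items = []
    page = 1
    while True:
        batch = fetch_page(page, page_size)
        items.extend(batch)
        # A page shorter than page_size (including an empty one) is the last page
        if len(batch) < page_size:
            return items
        page += 1


# Hypothetical stand-in for the HTTP call: 7 fake articles served in slices
FAKE_ARTICLES = [{"articleId": n} for n in range(1, 8)]


def fake_fetch(page, page_size):
    start = (page - 1) * page_size
    return FAKE_ARTICLES[start:start + page_size]


result = collect_all(fake_fetch, page_size=3)
print(len(result))
```

When the total is an exact multiple of page_size, this rule costs one extra request that returns an empty page, which is harmless here.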
Note that a single query returns at most 100 items. I first set page_size to 200 and it did not work; any value of 100 or below was fine, so to keep things fast I settled on 100.
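The effect of the page size on the number of requests is simple arithmetic. A minimal sketch, assuming the cap of 100 observed above (the helper name pages_needed and its max_size default are hypothetical):

```python
def pages_needed(total, page_size, max_size=100):
    """Number of requests needed to list `total` articles."""
    # Sizes above the cap did not work in the test described above, so clamp
    size = min(page_size, max_size)
    # Ceiling division without floating point
    return -(-total // size)


print(pages_needed(2557, 100))  # 26
print(pages_needed(2557, 50))   # 52
```

For the 2557 articles reported by the total field, a page size of 100 halves the number of requests compared with 50.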
Running the script prints the URL of each article page by page and writes the generated JSON file, articles.json. It contains 2557 elements in total, which matches the number of posts on my blog.