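Contents
5.05 update: added FMF and SSA data download (see GitHub)
4.10 update: downloading the target data with curl, wget, and similar tools
Getting the download URL
Using Tools to Save Web Output as a File
Wget
Curl
AERONET AOD data download
Automated download of Chinese station data with Python + Selenium
Getting the list of station URLs
Getting the time range of station data
Downloading the data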
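Summary up front: search for the dynamically generated resources first, then match the tags with regular expressions.
Full code: SakuraSong001/spider4remotedata (github.com)
The project also includes a batch download tool for Sentinel-5.
7.21: added multi-threaded parallel download and running Selenium in the background.
5.05 update: added FMF and SSA data download (see GitHub)
The page parsing was adapted to the pages of products such as SSA and FMF, so that they can also be downloaded automatically in batches.
4.10 update: downloading the target data with curl, wget, and similar tools
The AErosol RObotic NETwork (AERONET) download tool provides ready-made rules for composing download URLs, shown below. Query results are returned as HTML; download links can be generated as needed, and the result needs one more format conversion.
AOD AND SDA: Data - Aerosol Robotic Network (AERONET) Homepage (nasa.gov)
Getting the download URL
The parameters of the download URL for AOD, SDA, TOT, and other data are listed in the tables below. Using the site's map tool and BeautifulSoup, you can get the list of stations in the target region, generate a list of download URLs from the URL rules, and then batch-download them with curl or wget.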
Table 1: Explanation and Values for Mandatory and Optional Web Service Parameters
Mandatory Parameters | Explanation | Values |
---|---|---|
year,month,day | Starting time moment (year= 1992 to present), (month=1 to 12), (day = 1 to max num, depends on month) | Year: 1993 to present (must be 4-digits) Month: 1 to 12 Day: 1 to max_day_of_month |
AVG | Data Format | All points: AVG=10 Daily average: AVG=20 |
[data_type] | Data Types (See Table 2) | [data_type]=1 |
Optional Parameters | | |
year2,month2,day2 | Ending time moment** | Year: 1993 to present (must be 4-digits) Month: 1 to 12 Day: 1 to max_day_of_month **if year2,month2, and day2 are omitted, then the current day is assumed |
hour, hour2 | Specified beginning (hour) and ending hour (hour2) | Hour: 0 to 23 if not specified, then the hour is set to zero; time2 is incremented to next day and hour2=0 |
site | AERONET site name | Exact match of AERONET database name. If none specified, then all sites are searched for data during the time interval specified (see the AERONET Site Name List) |
lat1,lon1,lat2,lon2 | Bounding Box ** | lat1,lon1 - Lower Left; lat2,lon2 - Upper Right **values must be in decimal degrees (including the decimal) |
lunar_merge | Enable Lunar AOD (Provisional) Only Download | 0 - No Lunar |
if_no_html | Determine whether html formatting is printed | 0 - HTML formatting printed (default) 1 - No HTML formatting printed |
Table 2: Explanation of Data Types for the Web Service
Data Types | Explanation |
---|---|
AOD10 | Aerosol Optical Depth Level 1.0 |
AOD15 | Aerosol Optical Depth Level 1.5 |
AOD20 | Aerosol Optical Depth Level 2.0 |
SDA10 | SDA Retrieval Level 1.0 |
SDA15 | SDA Retrieval Level 1.5 |
SDA20 | SDA Retrieval Level 2.0 |
TOT10 | Total Optical Depth based on AOD Level 1.0 (all points only) |
TOT15 | Total Optical Depth based on AOD Level 1.5 (all points only) |
TOT20 | Total Optical Depth based on AOD Level 2.0 (all points only) |
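The parameters in Tables 1 and 2 can be combined into a request URL programmatically. The following is a minimal sketch, not the author's code: the helper name `build_aeronet_url` and its argument names are made up for illustration, and only parameters listed in the tables above are used.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "https://aeronet.gsfc.nasa.gov/cgi-bin/print_web_data_v3"


def build_aeronet_url(site, start, end, data_type="AOD15", avg=10, no_html=True):
    """Assemble a print_web_data_v3 URL (hypothetical helper, for illustration).

    site      -- AERONET site name (exact database name)
    start/end -- datetime.date objects for year/month/day and year2/month2/day2
    data_type -- one of the Table 2 data types, e.g. "AOD15" or "SDA20"
    avg       -- 10 for all points, 20 for daily averages
    no_html   -- if True, add if_no_html=1 to suppress HTML formatting
    """
    params = {
        "site": site,
        "year": start.year, "month": start.month, "day": start.day,
        "year2": end.year, "month2": end.month, "day2": end.day,
        data_type: 1,
        "AVG": avg,
    }
    if no_html:
        params["if_no_html"] = 1
    return BASE_URL + "?" + urlencode(params)


# Reproduces the request used in the wget/curl examples of this article
url = build_aeronet_url("Cart_Site", date(2000, 6, 1), date(2000, 6, 14), no_html=False)
print(url)
```

Calling the helper in a loop over site names gives the list of download URLs that can then be passed to curl or wget.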
Using Tools to Save Web Output as a File
Wget
wget --no-check-certificate -q -O test.out "https://aeronet.gsfc.nasa.gov/cgi-bin/print_web_data_v3?site=Cart_Site&year=2000&month=6&day=1&year2=2000&month2=6&day2=14&AOD15=1&AVG=10"
Curl
curl -s -k -o test.out "https://aeronet.gsfc.nasa.gov/cgi-bin/print_web_data_v3?site=Cart_Site&year=2000&month=6&day=1&year2=2000&month2=6&day2=14&AOD15=1&AVG=10"
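The same request can also be saved from Python using only the standard library. This is a rough equivalent of the two commands above, with one difference: it keeps certificate verification on, whereas `--no-check-certificate` and `-k` turn it off. The function name `save_web_output` and the generic User-Agent value are assumptions made for this sketch.

```python
import shutil
import urllib.request
from pathlib import Path


def save_web_output(url, out_path, timeout=60):
    """Fetch url and stream the response body into out_path (illustrative helper)."""
    # A generic User-Agent is sent as a precaution; whether the server needs it is an assumption.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response, \
            open(out_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return Path(out_path)


# Example call (needs network access):
# save_web_output("https://aeronet.gsfc.nasa.gov/cgi-bin/print_web_data_v3?site=Cart_Site"
#                 "&year=2000&month=6&day=1&year2=2000&month2=6&day2=14&AOD15=1&AVG=10",
#                 "test.out")
```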
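AERONET AOD data download
The AErosol RObotic NETwork (AERONET) is a ground-based aerosol remote sensing network established jointly by NASA and LOA-PHOTONS (CNRS). It provides globally distributed observations of spectral aerosol optical depth (AOD), inversion products, and precipitable water under diverse aerosol regimes. The current Version 3 AOD data are provided at three levels: Level 1.0 (unscreened), Level 1.5 (cloud-screened and quality-controlled), and Level 2.0 (quality-assured).
Official site: Aerosol Robotic Network (AERONET) Homepage (nasa.gov)
The AERONET website supports flexible filters, so data can be downloaded for a specific time, level, and station, or narrowed down further with the filters offered on the page.
For example, to download the 2012 AOD Level 1.5 data for the Alboran station, click 2012 - Level 1.5 - Alboran to open the station's data page, click AOD Level 1.5 to open the download request page, and click Accept to download the data.
Automated download of Chinese station data with Python + Selenium
Getting the list of station URLs
First, use the map filter tool on the website to roughly select the area covering China and save the page as a local HTML file. Parse the saved page with BeautifulSoup to get the station list: filter the links with a regular expression to collect all station URLs, and read the station name, latitude/longitude, and other information from each link's parent node.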
import re

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

session = requests.Session()
# Placeholder request header (assumption: the original header definition is not shown)
header = {'User-Agent': 'Mozilla/5.0'}


def get_stations(area_file):
    """Collect name, location, download page, and data years of each station and save them as CSV."""
    result = []
    pattern = r'https://aeronet\.gsfc\.nasa\.gov/cgi-bin/data_display_aod_v3\?site=.+'
    # Locally saved map page
    chinaAreaPage = r'AERONET Data Display Interface - WWW DEMONSTRAT.html'
    with open(chinaAreaPage, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    for item in soup.find_all('a'):
        sHref = item.get('href')
        if not re.match(pattern, str(sHref)):
            continue
        station = re.sub(r'\n', '', item.get_text())
        geoInfo = re.sub(r'\n+.+\(\s', r'(', item.parent.get_text())
        # Open the station page to find its download page and the years it covers
        response = session.get(sHref, headers=header)
        stationSoup = BeautifulSoup(response.text, 'html.parser')
        pageUrl = stationSoup.find('a', string=re.compile(r'More AERONET Downloadable Products\.{3}')).get('href')
        date = stationSoup.find(string=re.compile(r'Start Date.+')).split('-')
        start_year = re.sub(r'\;.+', '', date[2])
        latest_year = date[4]
        result.append([station, geoInfo, pageUrl, start_year, latest_year])
    columns = ['station', 'geoInfo', 'pageUrl', 'start_year', 'latest_year']
    dataframe = pd.DataFrame(result, columns=columns)
    dataframe.to_csv(area_file, index=False, sep=',', encoding='utf-8')
    return np.array(result)
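The regular-expression filter can be checked on its own before running it against the saved page. The sketch below uses only the standard library; the sample hrefs are invented for illustration, and `site_names` is a hypothetical helper that is not part of the original project.

```python
import re
from urllib.parse import parse_qs, urlparse

# Same idea as the pattern in get_stations: keep only links to a station's AOD display page
STATION_LINK = re.compile(
    r"https://aeronet\.gsfc\.nasa\.gov/cgi-bin/data_display_aod_v3\?site=.+"
)


def site_names(hrefs):
    """Return the value of the site= query parameter for every matching href."""
    names = []
    for href in hrefs:
        if href and STATION_LINK.match(href):
            names.append(parse_qs(urlparse(href).query)["site"][0])
    return names


# Invented sample links: two station pages, one unrelated page, one anchor without href
sample = [
    "https://aeronet.gsfc.nasa.gov/cgi-bin/data_display_aod_v3?site=Beijing&level=1",
    "https://aeronet.gsfc.nasa.gov/new_web/index.html",
    None,
    "https://aeronet.gsfc.nasa.gov/cgi-bin/data_display_aod_v3?site=XiangHe",
]
print(site_names(sample))  # ['Beijing', 'XiangHe']
```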
Getting the time range of station data
Filter the stations by the given time range and get the first and last year for which each station actually has data.
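import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumption: the original driver setup is not shown
stations = get_stations('china_stations.csv')  # hypothetical output file name

for stat, geoInfo, statHref, first, latest in stations:
    # Skip stations whose record does not overlap 2005-2012
    if not ('2005' <= first <= '2012' or '2005' <= latest <= '2012' or (first <= '2005' and latest >= '2012')):
        continue
    print('\n')
    begin = end = '0'
    # Build the absolute URL of the station's download page
    statUrl = 'https://aeronet.gsfc.nasa.gov/cgi-bin/' + statHref.replace('?', '&re')
    driver.get(statUrl)
    time.sleep(3)
    # The Year1 drop-down lists the years for which the station has data
    ele = driver.find_element(By.XPATH, '//*[@id="Year1"]')
    options = ele.find_elements(By.TAG_NAME, 'option')
    for option in options:
        if option.text >= '2013':
            break
        if begin == '0' and '2005' <= option.text:
            begin = option.text  # first available year from 2005 on
        if end < option.text:
            end = option.text  # last available year before 2013
    if begin == '0' or end == '0':
        print('wrong time', stat, first, latest, begin, end, statUrl)
        continue
    print(stat, first, latest, begin, end, statUrl)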
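The year filtering above can be separated from the browser automation and tested in isolation. The following is a minimal sketch under two assumptions: years are compared as integers rather than strings, and the option list need not be sorted. The names `overlaps` and `clip_years` are illustrative only.

```python
def overlaps(first, latest, lo=2005, hi=2012):
    """True if a station record [first, latest] overlaps the target window [lo, hi]."""
    return int(first) <= hi and int(latest) >= lo


def clip_years(option_years, lo=2005, hi=2012):
    """Return (begin, end) of the available years inside [lo, hi], or None if there are none."""
    in_range = [int(year) for year in option_years if lo <= int(year) <= hi]
    if not in_range:
        return None
    return min(in_range), max(in_range)


print(overlaps('2001', '2019'))                          # True
print(overlaps('1998', '2004'))                          # False
print(clip_years(['2001', '2006', '2009', '2014']))      # (2006, 2009)
```

Working with integers avoids surprises from stray whitespace or non-numeric entries that plain string comparison would silently accept.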
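Downloading the data
Because of how the AERONET website works, some data must first be searched for on the site before they can be downloaded; otherwise even the correct download URL returns HTTPError: 404 Not Found.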
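import time

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

# Simulate the search on the download page for each year
for year in range(int(begin), int(end) + 1):
    statUrl = 'https://aeronet.gsfc.nasa.gov/cgi-bin/' + statHref.replace('?', '&re')
    driver.get(statUrl)
    time.sleep(3)
    # Request a single year: set both the start and the end year to it
    select1 = Select(driver.find_element(By.XPATH, '//*[@id="Year1"]'))
    select1.select_by_visible_text(str(year))
    select2 = Select(driver.find_element(By.XPATH, '//*[@id="Year2"]'))
    select2.select_by_visible_text(str(year))
    try:
        aod15_checkbox = driver.find_element(By.NAME, 'AOD15')
        aod15_checkbox.click()
    except NoSuchElementException:
        # The station offers no AOD Level 1.5 product
        print('No such aod 1.5 ', year, statUrl)
        break
    submit = driver.find_element(By.NAME, 'Submit')
    submit.click()
    time.sleep(30)  # wait for the request to be processed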
import os
import urllib.error
import urllib.request

out_dir = r'F:\WORKSPACE\DBN-PARASOL\aeronet data 1.5'

# Install an opener that sends browser-like headers with every request
opener = urllib.request.build_opener()
opener.addheaders = [
    ('User-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36 Edg/99.0.1150.46'),
    ('Cookie', '_ga=GA1.2.479127296.1609393316; _ga=GA1.4.479127296.1609393316'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'),
]
urllib.request.install_opener(opener)

# Download one zip archive per year for the current station
for year in range(int(begin), int(end) + 1):
    filename = '{0}0101_{0}1231_{1}.zip'.format(year, stat)
    filepath = os.path.join(out_dir, filename)
    url = 'https://aeronet.gsfc.nasa.gov/zip_files_v3/{}'.format(filename)
    if os.path.exists(filepath):
        print('exist ', year, url)
        continue
    try:
        urllib.request.urlretrieve(url, filepath)
        print('Done ', year, url)
    except urllib.error.HTTPError:
        print('Not Found', year, url)
        # Record the archive that could not be downloaded
        with open(os.path.join(out_dir, 'notfound.txt'), 'a', encoding='utf-8') as log:
            print(stat, year, url, file=log)
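The 7.21 update mentions multi-threaded parallel download, but that code is not shown in this article. One possible shape for it, using `concurrent.futures` from the standard library, is sketched below. The function name `download_all`, the `(url, path)` job format, and the worker count are assumptions, not the project's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(jobs, fetch, max_workers=4):
    """Run fetch(url, path) for every (url, path) pair in jobs on a thread pool.

    Returns (done, failed): the URLs that succeeded, and (url, exception)
    pairs for those that raised.
    """
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url, path): url for url, path in jobs}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()
                done.append(url)
            except Exception as exc:  # e.g. urllib.error.HTTPError for a 404
                failed.append((url, exc))
    return done, failed
```

Passing `urllib.request.urlretrieve` as `fetch` would download each archive to its path. Because the archives only become available after the search step, the Selenium part presumably still has to run first; only the final file transfers are parallelised here.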
[Figure: result of the automated batch download]