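Earlier I described how to extract Wikipedia infoboxes with BeautifulSoup; site content can likewise be fetched with a Spider. Having recently learned Selenium + PhantomJS, I am going to use them to scrape the infoboxes (InfoBox) of tourist attractions on Baidu Baike. This is also early corpus preparation for my graduation project on entity alignment and attribute alignment. I hope the article helps.
Source code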
# coding=utf-8
"""
Created on 2015-09-04 @author: Eastmount
"""

import time
import re
import os
import sys
import codecs
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.action_chains import ActionChains

# Open PhantomJS (raw string so the backslashes in the Windows path stay literal)
driver = webdriver.PhantomJS(executable_path=r"G:\phantomjs-1.9.1-windows\phantomjs.exe")
# driver = webdriver.Firefox()
wait = ui.WebDriverWait(driver, 10)
info = None  # global variable: output file handle, opened in getInfobox()

# Get the infobox of 5A tourist spots
def getInfobox(name):
    try:
        # create the output directory and txt file
        global info
        basePathDirectory = "Tourist_spots_5A"
        if not os.path.exists(basePathDirectory):
            os.makedirs(basePathDirectory)
        baiduFile = os.path.join(basePathDirectory, "BaiduSpider.txt")
        if not os.path.exists(baiduFile):
            info = codecs.open(baiduFile, 'w', 'utf-8')
        else:
            info = codecs.open(baiduFile, 'a', 'utf-8')

        # locate the search input; notice: 1. visit url by unicode 2. write files
        print name.rstrip('\n')  # delete char '\n'
        driver.get("http://baike.baidu.com/")
        elem_inp = driver.find_element_by_xpath("//form[@id='searchForm']/input")
        elem_inp.send_keys(name)
        elem_inp.send_keys(Keys.RETURN)
        # codecs.open() does no newline translation, so write '\r\n' explicitly
        info.write(name.rstrip('\n') + '\r\n')
        time.sleep(2)
        print driver.current_url
        print driver.title

        # load the infobox: div with class "basic-info cmn-clearfix"
        elem_name = driver.find_elements_by_xpath("//div[@class='basic-info cmn-clearfix']/dl/dt")
        elem_value = driver.find_elements_by_xpath("//div[@class='basic-info cmn-clearfix']/dl/dd")
        for e in elem_name:
            print e.text
        for e in elem_value:
            print e.text

        # create dictionary key-value
        # A dict is a hash table: entries are hashed on insertion and the original
        # order is not recorded; use tuples if the order matters
        elem_dic = dict(zip(elem_name, elem_value))
        for key in elem_dic:
            print key.text, elem_dic[key].text
            info.writelines(key.text + " " + elem_dic[key].text + '\r\n')
        time.sleep(5)

    except Exception as e:  # e.g. 'utf8' codec can't decode byte
        print "Error: ", e
    finally:
        print '\n'
        info.write('\r\n')

# Main function
def main():
    global info
    # read each name from the file and fetch its infobox
    source = open("Tourist_spots_5A_BD.txt", 'r')
    for name in source:
        name = unicode(name, "utf-8")
        if u'故宮' in name:  # else add a '?'
            name = u'北京故宮'
        getInfobox(name)
    print 'End Read Files!'
    source.close()
    info.close()
    driver.quit()  # quit() also ends the PhantomJS process; close() only closes the window

main()
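As a side note, the name file that main() reads could be loaded along the following lines in Python 3 (the listing above is Python 2). This is only a sketch: the helper name read_names and the demo file are hypothetical, and choosing 'utf-8-sig' assumes the file may start with a byte order mark, which is a guess about where a stray leading character could come from.

```python
import io
import os
import tempfile


def read_names(path):
    """Return the non-empty lines of a UTF-8 text file, stripped of whitespace."""
    # 'utf-8-sig' decodes UTF-8 and drops a leading byte order mark if present.
    with io.open(path, encoding="utf-8-sig") as source:
        return [line.strip() for line in source if line.strip()]


# Demo with a throwaway file; the real script reads Tourist_spots_5A_BD.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "names_demo.txt")
with io.open(demo_path, "wb") as f:
    f.write(u"\ufeff故宮\n\n北京故宮\n".encode("utf-8"))

demo_names = read_names(demo_path)
print(len(demo_names))
```

Stripping each line up front also means the trailing '\n' no longer has to be removed separately before printing or writing the name.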
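Results
The script reads the names of China's national 5A-level scenic spots from a txt file on the F: drive, then drives the PhantomJS browser (phantomjs.exe) to visit each one in turn and collect its InfoBox values. If you hit the encoding error "'ascii' codec can't encode characters", you can set the interpreter's default encoding to utf-8 with the following code: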
# set the default encoding to utf-8 (Python 2 only)
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# show the current default encoding
print sys.getdefaultencoding()
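Corresponding page source
The corresponding Baidu Baike InfoBox source is shown in the figure below. For the basics used in the code, see my earlier posts or my Python crawler column. Selenium is not only good at automated testing; it is equally suited to writing simple crawlers.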
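As a rough text stand-in for that structure, the sketch below runs the same kind of path over a tiny hand-written sample. Only the class name "basic-info cmn-clearfix" and the dl/dt/dd nesting are taken from the XPath expressions in the script; the sample values and the function name extract_infobox are made up for illustration. It is written for Python 3 and uses the standard library's ElementTree, which only handles well-formed markup, so it is not a replacement for Selenium on the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample of the infobox markup (values are invented).
SAMPLE_HTML = """
<html><body>
<div class="basic-info cmn-clearfix">
  <dl>
    <dt>Chinese name</dt><dd>Example Scenic Area</dd>
    <dt>Location</dt><dd>Example City</dd>
  </dl>
  <dl>
    <dt>Level</dt><dd>5A</dd>
  </dl>
</div>
</body></html>
"""


def extract_infobox(markup):
    """Return (attribute, value) pairs in document order."""
    root = ET.fromstring(markup.strip())
    box = ".//div[@class='basic-info cmn-clearfix']/dl/"
    names = [dt.text.strip() for dt in root.findall(box + "dt")]
    values = [dd.text.strip() for dd in root.findall(box + "dd")]
    return list(zip(names, values))


pairs = extract_infobox(SAMPLE_HTML)
for attribute, value in pairs:
    print(attribute + ": " + value)
```

Each dt holds an attribute name and the dd that follows it holds the value, which is why the script collects the two lists separately and then zips them.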
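Encoding issues
At this point you may still run into the "'ascii' codec can't encode characters" error.
It occurs because a txt file created with the plain open() is written through Python 2's default ascii codec, while your text is actually unicode that has to be stored as 'utf-8', so open the file with an explicit encoding as follows.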
import os
import codecs

# Use the open() provided by codecs to specify the file's encoding; it converts
# to the internal unicode type automatically when reading
if not os.path.exists(baiduFile):
    info = codecs.open(baiduFile, 'w', 'utf-8')
else:
    info = codecs.open(baiduFile, 'a', 'utf-8')

# codecs.open() does no newline translation, so the line ending is written as '\r\n'
info.writelines(key.text + ":" + elem_dic[key].text + '\r\n')
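For comparison, here is a minimal sketch of the same idea with io.open, which exists in Python 2.6+ and is the built-in open in Python 3. The helper name append_line and the demo path are hypothetical. Mode 'a' creates the file when it is missing, so the os.path.exists() check is not strictly needed, and newline="" turns off newline translation so that '\r\n' is written exactly as given.

```python
import io
import os
import tempfile


def append_line(path, text):
    """Append one UTF-8 encoded line ending in '\\r\\n' to path."""
    # 'a' creates the file if needed and appends otherwise.
    # newline="" disables translation, so '\r\n' is not turned into '\r\r\n' on Windows.
    with io.open(path, "a", encoding="utf-8", newline="") as f:
        f.write(text + u"\r\n")


# Demo in a throwaway directory; the script itself writes to BaiduSpider.txt.
demo_file = os.path.join(tempfile.mkdtemp(), "BaiduSpider_demo.txt")
append_line(demo_file, u"故宮")
append_line(demo_file, u"Level:5A")

with io.open(demo_file, "rb") as f:
    demo_bytes = f.read()
print(len(demo_bytes))
```

Opening and closing the file inside the helper also avoids keeping a global file handle open across calls.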
Summary
From the code you can learn a basic approach to automated crawling, and also how to display key-value pairs with a for loop; the pairs correspond to the attributes and attribute values shown in the infobox. This is done with the following code:
elem_dic = dict(zip(elem_name,elem_value))
However, the final output is not in the same order as the infobox. Why?
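The likely answer is the one hinted at in the code comment: in Python 2 a dict is a hash table and does not remember insertion order (ordering is only guaranteed from Python 3.7 on). A minimal sketch of two ways to keep the page order follows, written for Python 3; the function name pair_in_order and the sample strings are hypothetical stand-ins for the e.text values of the dt and dd elements.

```python
from collections import OrderedDict


def pair_in_order(names, values):
    """Pair attribute names with values, keeping the order they were found in."""
    return list(zip(names, values))


# Hypothetical texts standing in for the dt / dd elements of one infobox.
names = ["Chinese name", "Location", "Level"]
values = ["Example Scenic Area", "Example City", "5A"]

# Option 1: iterate over the zipped pairs directly, no dict involved.
for key, value in pair_in_order(names, values):
    print(key + " " + value)

# Option 2: an OrderedDict keeps insertion order and still allows lookup by key.
ordered = OrderedDict(pair_in_order(names, values))
print(list(ordered.keys()))
```

In the script this would mean looping over zip(elem_name, elem_value) instead of building elem_dic first.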
Finally, I hope this article helps. There is also an article that introduces the basics.