Python Web Crawler Internship Report (Word format)

Document ID: 21297716
Uploaded: 2023-01-29
Format: DOCX
Pages: 5
Size: 112.54 KB

V. Data Scraping in Practice (Scraping Movie Data from Douban)

1 Analyzing the Web Page

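# Imports needed by the functions in this section (not shown in the source)
import urllib.request
from bs4 import BeautifulSoup

# Fetch the HTML source of every listing page
def __getHtml():
    data = []
    pageNum = 1
    pageSize = 0
    try:
        while (pageSize <= 125):
            # headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
            #            'Referer': None  # Note: if the page still cannot be fetched, set this to the host of the target site
            #            }
            # opener = urllib.request.build_opener()
            # opener.addheaders = [headers]
            # The base URL was lost from the source; only the query suffix survives
            url = "..." + str(pageSize) + "&filter=" + str(pageNum)
            # data['html%s' % i] = urllib.request.urlopen(url).read().decode("utf-8")
            data.append(urllib.request.urlopen(url).read().decode("utf-8"))
            pageSize += 25  # each page lists 25 movies
            pageNum += 1
            print(pageSize, pageNum)
    except Exception as e:
        raise e
    return data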
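The commented-out lines above try to attach a browser-like User-Agent by assigning a dict inside a list to `opener.addheaders`, but `addheaders` expects a list of `(name, value)` tuples. One alternative, sketched below, is to pass the headers to `urllib.request.Request`. The helper names `build_request` and `fetch` and the `example.com` URL are hypothetical and do not come from the report.

```python
import urllib.request

# User-Agent string taken from the commented-out code in the report
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 "
              "(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11")


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download url and decode it as UTF-8 (this performs a real network request)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Building the request does not touch the network, so it is safe to inspect.
req = build_request("https://example.com/list?start=0")  # placeholder URL
print(req.get_header("User-agent"))
```

Calling `fetch(url)` in place of `urllib.request.urlopen(url).read().decode("utf-8")` would send the header with every page request. Whether the target site accepts it is not something this sketch can guarantee.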

2 Scraping the Data

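def __getData(html):
    title = []  # movie titles
    # rating_num = []  # ratings
    range_num = []  # ranks
    # rating_people_num = []  # number of people who rated
    movie_author = []  # directors
    data = {}
    # Parse the HTML with bs4
    soup = BeautifulSoup(html, "html.parser")
    for li in soup.find("ol", attrs={'class': 'grid_view'}).find_all("li"):
        title.append(li.find("span", class_="title").text)
        # rating_num.append(li.find("div", class_='star').find("span", class_='rating_num').text)
        range_num.append(li.find("div", class_='pic').find("em").text)
        # spans = li.find("div", class_='star').find_all("span")
        # for x in range(len(spans)):
        #     if x <= 2:
        #         pass
        #     else:
        #         rating_people_num.append(spans[x].string[-len(spans[x].string):-3])
        # Credit line, e.g. "导演: ... 主演: ..."; renamed from `str` to avoid shadowing the built-in
        info = li.find("div", class_='bd').find("p").text.lstrip()
        index = info.find("主")  # start of "主演" (cast)
        if (index == -1):
            index = info.find("...")  # truncated line without a cast section
        print(li.find("div", class_='pic').find("em").text)
        # .text is a string, so compare with "210" (the original compared with the int 210, which never matches)
        if (li.find("div", class_='pic').find("em").text == "210"):
            index = 60  # hard-coded cut-off for the entry ranked 210
        # print("aaa")
        # print(info[4:index])
        movie_author.append(info[4:index])  # skip the 4-character prefix "导演: "
    data['title'] = title
    # data['rating_num'] = rating_num
    data['range_num'] = range_num
    # data['rating_people_num'] = rating_people_num
    data['movie_author'] = movie_author
    return data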
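The director is taken from the credit line by slicing between the 4-character prefix "导演: " and the first "主" (the start of "主演", the cast). The following is a minimal sketch of that slicing as a standalone function, so it can be tried without downloading a page. The function name `extract_director` and the sample line are illustrative assumptions, not part of the report.

```python
def extract_director(info):
    """Return the director part of a Douban-style credit line.

    Assumes the line starts with the 4-character prefix "导演: ".
    """
    info = info.lstrip()
    end = info.find("主")       # start of "主演" (cast), if present
    if end == -1:
        end = info.find("...")  # line truncated before the cast section
    if end == -1:
        end = len(info)         # neither marker found: keep the rest of the line
    return info[4:end].strip()  # strip() also removes non-breaking spaces


# Made-up sample in the style of the page's credit line
sample = "导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /..."
print(extract_director(sample))
```

A director name that itself contains the character "主" would be cut short by this approach, which may be why the report hard-codes an index for one entry.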

3 Organizing and Converting the Data

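import time

# Assumption: nowtime is used below but never defined in the source
nowtime = time.strftime('%Y-%m-%d %H:%M:%S')

# Write the scraped data into an HTML table
def __getMovies(datas):  # the original parameter was named data, but the body iterates over datas
    f = open('F://douban_movie.html', 'w', encoding='utf-8')
    f.write("<html>")
    f.write("<head><meta charset='UTF-8'><title>Insert title here</title></head>")
    f.write("<body>")
    f.write("<h1>Scraped Douban Movies</h1>")
    f.write("<h4>Author: 刘文斌</h4>")
    f.write("<h4>Time: " + nowtime + "</h4>")
    f.write("<hr>")
    f.write("<table width='800px' border='1' align=center>")
    f.write("<thead>")
    f.write("<tr>")
    f.write("<th><font size='5' color=green>Movie</font></th>")
    # f.write("<th width='50px'><font size='5' color=green>Rating</font></th>")
    f.write("<th width='50px'><font size='5' color=green>Rank</font></th>")
    # f.write("<th width='100px'><font size='5' color=green>Number of Ratings</font></th>")
    f.write("<th><font size='5' color=green>Director</font></th>")
    f.write("</tr>")
    f.write("</thead>")
    f.write("<tbody>")
    for data in datas:
        for i in range(0, 25):  # 25 movies per page
            f.write("<tr>")
            f.write("<td style='color:orange;text-align:center'>%s</td>" % data['title'][i])
            # f.write("<td style='color:blue;text-align:center'>%s</td>" % data['rating_num'][i])
            f.write("<td style='color:red;text-align:center'>%s</td>" % data['range_num'][i])
            # f.write("<td style='color:blue;text-align:center'>%s</td>" % data['rating_people_num'][i])
            f.write("<td style='color:black;text-align:center'>%s</td>" % data['movie_author'][i])
            f.write("</tr>")
    f.write("</tbody>")
    f.write("</table>")
    f.write("</body>")
    f.write("</html>")
    f.close()

if __name__ == '__main__':
    datas = []
    htmls = __getHtml()
    for i in range(len(htmls)):
        data = __getData(htmls[i])
        datas.append(data)
    __getMovies(datas)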
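The table rows above are built by formatting scraped text straight into HTML. If a title or name contains characters such as `&` or `<`, the generated page can break. The sketch below shows one way to escape each cell with the standard-library `html.escape`. The function `render_rows` and the sample tuple are hypothetical and only illustrate the idea.

```python
import html


def render_rows(movies):
    """Build <tr> rows from (title, rank, director) tuples, escaping every cell."""
    rows = []
    for title, rank, director in movies:
        cells = "".join(
            "<td style='text-align:center'>%s</td>" % html.escape(str(value))
            for value in (title, rank, director)
        )
        rows.append("<tr>%s</tr>" % cells)
    return "\n".join(rows)


# Made-up row containing characters that need escaping
print(render_rows([("A & B", 1, "Some <Director>")]))
```

The returned string could be passed to a single `f.write(...)` call between the `<tbody>` tags.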

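4 Saving and Displaying the Data

The result is shown in the figure below:

5 Technical Difficulties and Key Points

Data Scraping in Practice (Scraping Housing Data from SouFun)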

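from bs4 import BeautifulSoup
import requests

rep = requests.get('...')  # the URL was lost from the source
rep.encoding = "gb2312"  # set the encoding
html = rep.text
soup = BeautifulSoup(html, 'html.parser')
f = open('F://fang.html', 'w', encoding='utf-8')
f.write("<html>")
f.write("<head><meta charset='UTF-8'><title>Insert title here</title></head>")
f.write("<body>")
f.write("<center><h1>Top 3 New Home Sales</h1></center>")
f.write("<table border='1px' width='1000px' height='800px' align=center><tr>")
f.write("<th><h2>Address</h2></th>")
f.write("<th><h2>Sales Volume</h2></th>")
f.write("<th><h2>Average Price</h2></th></tr>")
# Tag names in the find() calls below were lost in extraction, so elements are matched by class only
for li in soup.find("ul", class_="ul02").find_all("li"):
    name = li.find(class_="pbtext").text
    chengjiaoliang = li.find(class_="red-f3").text  # sales volume
    junjia = li.find(class_="ohter").find(class_="gray-9")  # average price; .text.replace('�O', '平方米')
    f.write("<tr><td align=center><font size='5px' color=red>%s</font></td>" % name)
    f.write("<td align=center><font size='5px' color=blue>%s</font></td>" % chengjiaoliang)
    f.write("<td align=center><font size='5px' color=green>%s</font></td></tr>" % junjia)
    print(name)
f.write("</table>")
f.write("</body>")
f.write("</html>")
f.close()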
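The commented-out `replace('�O', ...)` call suggests that the square-metre sign on the page was being decoded incorrectly. One plausible cause, offered here as an assumption and not as something the report states, is that the page is GBK-encoded while the script decodes it as the stricter `gb2312`, which has no code for "㎡". The sketch below reproduces that effect with the standard-library codecs only; the sample text is made up.

```python
# Bytes as a GBK-encoded page might send them (made-up sample text)
raw = "120㎡".encode("gbk")

# Decoding with the stricter gb2312 codec cannot map the bytes of "㎡"
as_gb2312 = raw.decode("gb2312", errors="replace")

# Decoding with gbk, a superset of gb2312, recovers the original text
as_gbk = raw.decode("gbk")

print(repr(as_gb2312))
print(as_gbk)
```

If this is indeed the cause, setting `rep.encoding = "gbk"` in the script above might make the replacement workaround unnecessary.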

VI. Summary

Instructor's comments:

Grade:

Supervisor:
