Data Cleaning: Filtering Chinese Text from Web Pages

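…filter out all other characters.

2. Keep the <td>, </td>, <tr>, and </tr> tags together with the Chinese characters and punctuation inside them, and filter out all other characters.

Keep the <h1> and </h1> heading tags together with the Chinese characters and punctuation inside them, and filter out all other characters.

3. Return the final result as an <html> document encoded in UTF-8.

4. Create a rule table that stores the retention rules.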

III. Implementation

2.1 Table descriptions

Each table is listed with its name, a description of its purpose, and its DDL.

UTF8_NOFILTER_CH
Description: table of characters and strings that are not filtered out.
Code:
create table UTF8_NOFILTER_CH
(
  CH NVARCHAR2(10),
  SM NVARCHAR2(100),
  BZ VARCHAR2(20)
);

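UTF8_WRITE_CH_TMP
Description: stores the web page content one character per row.
Code:
create table UTF8_WRITE_CH_TMP
(
  V  NVARCHAR2(1),
  V2 VARCHAR2(10),
  R  INTEGER,
  XH INTEGER  -- missing from the source listing; inferred from the package body, which inserts (xh, r, v)
);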

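UTF8
Description: holds the test data. Column V stores the web page content, column V2 stores the processed content, and XH is the sequence number and primary key; the number of rows is the number of pages to process.
Code:
create table UTF8
(
  V  NCLOB,
  V2 NCLOB,
  XH INTEGER primary key
);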

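UTF8_WRITE_LINE_TMP
Description: holds the characters of UTF8_WRITE_CH_TMP merged into lines.
Code:
create table UTF8_WRITE_LINE_TMP
(V NVARCHAR2(4000),
 R INTEGER, XH INTEGER  -- missing from the source listing; inferred from the package body
);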

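UTF8_HANDLE_LINE_TMP
Description: holds each line of UTF8_WRITE_LINE_TMP after it has been processed according to the filter rules.
Code:
create table UTF8_HANDLE_LINE_TMP
(The column list is not legible in the source; the package body reads and writes columns XH, R, and V.)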

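2.2 Initialization

Rule table: UTF8_NOFILTER_CH
Description: loads the non-filtered character data. At present it contains only full-width and half-width punctuation marks and the subset of HTML keywords that must be kept.
Script: .\scripts\insert\UTF8_NOFILTER_CH.sql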
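The contents of the script are not reproduced here. As a rough sketch, its rows might look like the ones below. This assumes that BZ = 'BDFH' marks punctuation and BZ = 'HTML' marks retained tags, as the package body suggests, and that non-ASCII punctuation is stored in its ASCIISTR form because is_sign_filter compares t.ch with asciistr(P_str). The specific rows and the SM texts are illustrative assumptions.

```sql
-- Hypothetical sample rows; the real script may contain different entries.
-- Punctuation rows: CH holds the ASCIISTR form of the character.
insert into utf8_nofilter_ch (ch, sm, bz) values (',',     'half-width comma',      'BDFH');
insert into utf8_nofilter_ch (ch, sm, bz) values ('\FF0C', 'full-width comma',      'BDFH');
insert into utf8_nofilter_ch (ch, sm, bz) values ('\3002', 'ideographic full stop', 'BDFH');

-- HTML keyword rows: CH holds the tag exactly as it should be kept.
insert into utf8_nofilter_ch (ch, sm, bz) values ('<td>',  'table cell start tag',  'HTML');
insert into utf8_nofilter_ch (ch, sm, bz) values ('</td>', 'table cell end tag',    'HTML');

commit;
```

Keeping the rules in a table means a new punctuation mark or tag can be retained by adding a row, without recompiling the package.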

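2.3 Package description

Package specification subprograms (name, description, interface code)

filter
Description: batch-processes, according to the filter rules, the web page content in column "V" of table "UTF8".
Interface: procedure filter;

filter(n_xh int)
Description: processes, according to the filter rules, the web page content in column "V" of a single row of table "UTF8".
Interface: procedure filter(n_xh int);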

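Package body subprograms

read_ch_from_nclob
Description: reads the NCLOB character by character into table utf8_write_ch_tmp.
Code: see the read_ch_from_nclob procedure in the package body.

read_line_from_ch
Description: assembles the characters in table utf8_write_ch_tmp into lines.
Code: see the read_line_from_ch procedure in the package body.

write_line_to_nclob_init
Description: initializes the NCLOB before it is written for the first time.
Code: see the write_line_to_nclob_init procedure in the package body.

write_line_to_nclob
Description: writes a processed line into the NCLOB column.
Code: see the write_line_to_nclob procedure in the package body.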

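is_ch_filter
Purpose: checks whether a character is a UTF8-encoded Chinese character.
Input: a single character.
Output: 1 if yes, 0 if no.
Code: see the is_ch_filter function in the package body.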
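The range test at the heart of is_ch_filter can be tried on its own. The query below is a minimal sketch that assumes the character 中 (U+4E2D) as sample input; the column aliases are illustrative and not part of the package.

```sql
-- ASCIISTR renders a non-ASCII character as \XXXX (its UTF-16 code unit in hex).
-- Stripping the backslash and parsing the hex digits gives the code point,
-- which is then compared with the range 4E00..9FA5.
select asciistr(unistr('\4E2D'))                                   as escaped,
       to_number(replace(asciistr(unistr('\4E2D')), '\'), 'XXXX') as code_point,
       case
         when to_number(replace(asciistr(unistr('\4E2D')), '\'), 'XXXX')
              between to_number('4E00', 'XXXX') and to_number('9FA5', 'XXXX')
         then 1
         else 0
       end                                                         as is_chinese
  from dual;
```

For this input the query returns \4E2D, 20013, and 1. ASCIISTR leaves plain ASCII characters unescaped, so the hex conversion is only meaningful for multi-byte input, which is presumably why the function checks LENGTHB first.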

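is_sign_filter
Purpose: checks whether a character is a UTF8-encoded full-width or half-width punctuation mark.
Input: a single character.
Output: 1 if yes, 0 if no.
Code: see the is_sign_filter function in the package body.

is_valid_filter
Purpose: checks whether a string is a UTF8-encoded HTML keyword. The supported keywords are the rows of table "UTF8_NOFILTER_CH" where BZ = 'HTML'.
Input: a string.
Output: a positive number if yes, 0 if no.
Code: see the is_valid_filter function in the package body.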
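The keyword lookup can be illustrated with a stand-alone query. This is a sketch under assumptions: the sample line and the position 7 are made up, and a row for '<td>' with BZ = 'HTML' is assumed to exist in the rule table.

```sql
-- Which retained tag, if any, starts at position 7 of the sample line?
-- Position 7 is the '<' that follows the six characters of '<html>'.
select t.ch,
       length(t.ch) as tag_len
  from utf8_nofilter_ch t
 where t.bz = 'HTML'
   and upper(substr('<html><TD>abc</TD>', 7, length(t.ch))) = upper(t.ch);
```

The comparison is case-insensitive, mirroring the UPPER calls in is_valid_filter, and the returned length is what the caller uses to skip past the whole tag.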

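process
Purpose: applies the filter rules to one line of table "UTF8_WRITE_LINE_TMP".
Input: the line before processing.
Output: the line after filtering.
Code: see the process function in the package body.

handle_line
Description: loops over all lines in "UTF8_WRITE_LINE_TMP" and processes each one according to the filter rules.
Code: see the handle_line procedure in the package body.

handle_exception
Description: post-processes the filtered lines by deleting invalid lines or correcting them.
Code: see the handle_exception procedure in the package body.

merge_into_nclob
Description: merges the lines in table "UTF8_HANDLE_LINE_TMP" and writes them to column "V2" of table "UTF8".
Code: see the merge_into_nclob procedure in the package body.

filter
Code: see the filter procedure in the package body.

filter(n_xh int)
Code: see the filter(n_xh int) procedure in the package body.

init
Description: initialization performed before filtering.
Code: see the init procedure in the package body.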

IV. Source code

3.1 Package specification

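create or replace package dp_filter_ch is
  /*
    Method:  filter
    Purpose: batch-process, according to the filter rules, the web page
             content in column "V" of table "UTF8".
    Rules:
      1. Keep Chinese characters.
      2. Keep full-width and half-width punctuation marks.
      3. Keep HTML reserved words; see the rule table utf8_nofilter_ch.
  */
  procedure filter;

  /*
    Method:  filter(n_xh int)
    Purpose: process, according to the same rules, the web page content
             in column "V" of a single row of table "UTF8".
  */
  procedure filter(n_xh int);
end dp_filter_ch;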
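A caller would use the two entry points roughly as follows. This is a sketch only: the sample row, its XH value of 1, and the page text are made up for illustration, and the tables and package are assumed to be installed.

```sql
-- Load one test page (hypothetical content).
insert into utf8 (xh, v)
values (1, n'<html><td>数据清洗 abc 123，测试。</td></html>');
commit;

-- Process only the row with XH = 1.
begin
  dp_filter_ch.filter(1);
end;
/

-- Alternatively, process every row of table UTF8 in one call:
-- begin
--   dp_filter_ch.filter;
-- end;
-- /

-- Inspect the filtered result.
select xh, v2 from utf8 where xh = 1;
```

The single-row overload is convenient for testing one page at a time before running the batch version over the whole table.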

3.2 Package body

create or replace package body dp_filter_ch is

  p_xh int;  -- XH of the page currently being processed

  procedure read_ch_from_nclob is
    directions nclob;
    buffer     nvarchar2(1);
    amount     int := 1;
    n_len      int;
    n_exists   int;
  begin
    select count(0) into n_exists from utf8_write_ch_tmp where xh = p_xh;
    -- load the characters only if this page has not been loaded yet
    if n_exists = 0 then
      delete from utf8_write_ch_tmp where xh = p_xh;
      commit;
      select v, length(v) into directions, n_len from utf8 where xh = p_xh;
      dbms_lob.open(directions, dbms_lob.lob_readonly);
      for i in 1 .. n_len loop
        -- read one character at offset i and store it with its position
        dbms_lob.read(directions, amount, i, buffer);
        insert into utf8_write_ch_tmp (xh, r, v) values (p_xh, i, buffer);
        if mod(i, 500) = 0 then
          commit;
        end if;
      end loop;
      dbms_lob.close(directions);
      commit;
    end if;
  exception
    when others then
      null;  -- errors are silently ignored, as in the original
  end;

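  procedure read_line_from_ch is
    v_buffer nvarchar2(4000);
    n_offset int := 1;
  begin
    delete from utf8_write_line_tmp where xh = p_xh;
    commit;
    v_buffer := '';
    for i in (select xh, v, r, ascii(v) ch
                from utf8_write_ch_tmp t
               where xh = p_xh
               order by r) loop
      -- nvl guards the empty buffer: length('') is NULL in Oracle
      if i.ch <> 10 and nvl(length(v_buffer), 0) < 500 then
        v_buffer := v_buffer || i.v;
      else
        -- line feed or 500 characters reached: flush the buffer as one line
        n_offset := n_offset + 1;
        insert into utf8_write_line_tmp (xh, r, v) values (p_xh, n_offset, v_buffer);
        -- start the next line with the line feed, or with the current
        -- character when the break was forced by length, so nothing is lost
        if i.ch = 10 then
          v_buffer := chr(10);
        else
          v_buffer := i.v;
        end if;
        if mod(n_offset, 20) = 0 then
          commit;
        end if;
      end if;
    end loop;
    -- flush the last line under its own row number
    insert into utf8_write_line_tmp (xh, r, v) values (p_xh, n_offset + 1, v_buffer);
    commit;
  end;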

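  procedure write_line_to_nclob_init is
    directions nclob;
    -- the initial content is not legible in the source; a single space is assumed
    buffer     nvarchar2(100) := ' ';
    amount     int := length(buffer);
    offset     int := 1;
    n_len      int;
  begin
    update utf8 set v2 = empty_clob() where xh = p_xh;
    -- lock the row so that the LOB can be written
    select v2, length(v2) into directions, n_len from utf8 where xh = p_xh for update;
    dbms_lob.open(directions, dbms_lob.lob_readwrite);
    dbms_lob.write(directions, amount, offset, buffer);
    dbms_lob.close(directions);
    update utf8 set v2 = directions where xh = p_xh;
    commit;
  end;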

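  procedure write_line_to_nclob(v_tmp nvarchar2) is
    directions nclob;
    buffer     nvarchar2(4000) := v_tmp || chr(10);
  begin
    select v2 into directions from utf8 where xh = p_xh;
    if directions is null then
      write_line_to_nclob_init;
    end if;
    -- re-read the locator with the row locked; appending through an
    -- unlocked or NULL locator fails
    select v2 into directions from utf8 where xh = p_xh for update;
    dbms_lob.writeappend(directions, length(buffer), buffer);
    update utf8 set v2 = directions where xh = p_xh;
    commit;
  end;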

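  function is_ch_filter(P_str nvarchar2) return int is
    v_set    varchar2(10) := 'UTF8';
    n_begin  int := to_number('4E00', 'XXXXXXXX');  -- first code point kept
    n_end    int := to_number('9FA5', 'XXXXXXXX');  -- last code point kept
    n_ret    int := 0;
    n_exists int;
  begin
    -- a Chinese character occupies 3 bytes when the national character set is UTF8
    if lengthb(P_str) = 3 then
      -- confirm that DUMP reports the UTF8 character set
      select instr(dump(P_str, 1010), v_set) into n_exists from dual;
      if n_exists > 0 then
        -- asciistr yields \XXXX; strip the backslash and compare the code point
        if to_number(replace(asciistr(P_str), '\'), 'XXXX') between n_begin and n_end then
          n_ret := 1;
        end if;
      end if;
    end if;
    return n_ret;
  end;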

  function is_sign_filter(P_str nvarchar2) return int is
    n_exists int := 0;
  begin
    -- look the character up among the punctuation rows (BZ = 'BDFH')
    select count(0)
      into n_exists
      from dual
     where exists (select 1
                     from utf8_nofilter_ch t
                    where t.bz = 'BDFH'
                      and t.ch = asciistr(P_str));
    return sign(n_exists);
  end;

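  function is_valid_filter(P_ch nvarchar2, P_pos int, P_str nvarchar2) return int is
  begin
    -- only a '<' can start a retained HTML keyword
    if P_ch = '<' then
      -- length (characters) is used instead of lengthb (bytes) because
      -- substr counts characters; the two agree for ASCII tags in UTF8
      for i in (select asciistr(t.ch) ch, length(t.ch) len
                  from utf8_nofilter_ch t
                 where t.bz = 'HTML') loop
        if asciistr(upper(substr(P_str, P_pos, i.len))) = asciistr(upper(i.ch)) then
          return i.len;  -- length of the matched keyword
        end if;
      end loop;
    end if;
    return 0;
  end;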

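  function process(P_str nvarchar2) return nvarchar2 is
    v_tmp nvarchar2(20);
    n_pos int := 1;
    n_len int := 1;
    v_ret nvarchar2(4000);
  begin
    v_ret := '';
    loop
      v_tmp := substr(P_str, n_pos, n_len);
      -- stop when no character is left
      if v_tmp is null then
        exit;
      end if;
      -- Chinese character or punctuation: keep it and advance one character
      if is_ch_filter(v_tmp) = 1 or is_sign_filter(v_tmp) = 1 then
        v_ret := v_ret || v_tmp;
        n_pos := n_pos + 1;
      -- retained character or string: advance by its length in the rule table
      elsif is_valid_filter(v_tmp, n_pos, P_str) > 0 then
        n_len := is_valid_filter(v_tmp, n_pos, P_str);
        v_ret := v_ret || substr(P_str, n_pos, n_len);
        n_pos := n_pos + n_len;
        n_len := 1;  -- read single characters again
      -- anything else: drop it and advance one character
      else
        n_pos := n_pos + 1;
      end if;
    end loop;
    return v_ret;
  end;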

  procedure handle_line is
    v_buffer nvarchar2(4000);
  begin
    delete utf8_handle_line_tmp where xh = p_xh;
    commit;
    for i in (select r, v from utf8_write_line_tmp t where xh = p_xh order by r) loop
      v_buffer := process(i.v);  -- filter one line
      insert into utf8_handle_line_tmp (xh, r, v) values (p_xh, i.r, v_buffer);
    end loop;
    commit;
  end;

  procedure merge_into_nclob is
  begin
    for i in (select r, v from utf8_handle_line_tmp where xh = p_xh order by r) loop
      write_line_to_nclob(i.v);
    end loop;
  end;

  procedure handle_exception is
  begin
    -- delete source lines that consist of a single line feed (ASCII 10)
    delete from utf8_write_line_tmp t where length(v) = 1 and ascii(v) = 10;
    -- delete processed lines that are empty or only 1 or 2 bytes long
    delete from utf8_handle_line_tmp t where ascii(t.v) is null or lengthb(t.v) in (1, 2);
    -- remove pairs of consecutive punctuation characters; only '..' and '.,'
    -- are legible in the source listing
    update utf8_handle_line_tmp set v = replace(v, '..') where xh = p_xh;
    update utf8_handle_line_tmp set v = replace(v, '.,') where xh = p_xh;
    -- strip a leading '.' or ','
    update utf8_handle_line_tmp set v = substr(v, 2)
     where xh = p_xh and (v like ',%' or v like '.%');
    -- strip a trailing '.'
    update utf8_handle_line_tmp set v = substr(v, 1, length(v) - 1)
     where xh = p_xh and v like '%.';
    -- delete lines that are empty or contain only the characters below
    delete from utf8_handle_line_tmp where xh = p_xh and v is null;
    delete from utf8_handle_line_tmp where xh = p_xh and v in ('.', '...');
    commit;
  end;

  procedure init(n_xh int) is
  begin
    p_xh := n_xh;  -- remember which page is being processed
    update utf8 set v2 = null where xh = p_xh;

pr
