Journal of Textile Research ›› 2018, Vol. 39 ›› Issue (10): 156-161. doi: 10.13475/j.fzxb.20171010106

• Management and Informatization •

Extracting method of household textile resources from Web

  • Received: 2017-10-30 Revised: 2018-05-28 Online: 2018-10-15 Published: 2018-10-17

Abstract:

Current methods for collecting household textile resources from the Web are inefficient when processing massive Web resources, particularly resources hidden in the deep Web. To address this problem, an automatic approach to extracting household textile resources from the Web is proposed. Based on the finiteness and convergence of query interface attributes, a domain model is first built to identify deep Web query interfaces. The identified interfaces are then filled in automatically with keywords from the household textile domain, and the deep Web household textile resources are extracted. To filter topic-irrelevant noise from the returned result pages, each page is divided into visual blocks, a block importance model is trained on labeled block samples, and the model is used to filter out the noise. Experimental results show that, compared with rule-based approaches, the domain model improves the positive predictive value and accuracy of query interface identification by 8% and 6%, respectively. The block importance model improves the harmonic mean of precision and recall for noise filtering by an average of 12.90% across three levels.

Key words: household textile, resource database, deep Web, information extraction
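The evaluation measures named in the abstract (positive predictive value, accuracy, and the harmonic mean of precision and recall) can be illustrated with a minimal sketch. The function names and the confusion-matrix counts below are hypothetical and are not taken from the paper; they only show how such measures are conventionally computed.

```python
def positive_predictive_value(tp: int, fp: int) -> float:
    """Precision: share of items predicted positive that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of truly positive items that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all items that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def harmonic_mean(precision: float, rec: float) -> float:
    """Harmonic mean of precision and recall (the F1 measure)."""
    denom = precision + rec
    return 2 * precision * rec / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical counts for a query-interface classifier (illustration only).
    tp, fp, fn, tn = 45, 5, 15, 35
    p = positive_predictive_value(tp, fp)
    r = recall(tp, fn)
    print(f"PPV={p:.3f} recall={r:.3f} "
          f"accuracy={accuracy(tp, tn, fp, fn):.3f} F1={harmonic_mean(p, r):.3f}")
```

The harmonic mean is low whenever either precision or recall is low, which is presumably why it is reported for noise filtering: a filter that keeps everything, or one that discards everything, scores poorly.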
