愛鋒貝

Title: Malicious Web Page Detection with PaddleNLP (Part 2)

Author: 美曼科技 Date: 2023-4-8 04:25

This article is the second in the series "Malicious Web Page Detection with PaddleNLP". The series is still being updated.

Series contents

Malicious Web Page Detection with PaddleNLP (Part 1)
- Uses a PaddleNLP text classification model for a simple binary classification of normal versus hacked web pages, judging whether a page is normal from its processed HTML content.

Malicious Web Page Detection with PaddleNLP (Part 2)
- Fine-tunes a PaddleNLP pretrained model, substantially improving the accuracy of classifying pages from their processed HTML content.
Special thanks go to community member @沒入門的研究生, whose guidance made this article possible.

Malicious Web Page Detection with PaddleNLP (Part 3)
- Uses a PaddleNLP text classification model for a simple binary classification of normal versus malicious web pages, based on extracted HTML tag information.

Malicious Web Page Detection with PaddleNLP (Part 4)
- Tries hand-written rules to design a pipeline that identifies malicious pages from extracted HTML tag information.
- Fine-tunes a PaddleNLP pretrained model to try to improve classification based on extracted HTML tag information.
- Exports the page classification model trained in dynamic-graph mode and deploys it with Python.

Malicious Web Page Detection with PaddleNLP (Part 5)
- Directly comparable to Part 2: compares the BERT Chinese pretrained model and the ERNIE pretrained model in workflow and results for HTML page content classification.
- Further refines the HTML content extraction and data cleaning pipeline.
- Validation accuracy easily exceeds 91.5% and peaks at 95%. Fine-tuning the BERT pretrained model gives the best result so far on this HTML page content classification task.

Malicious Web Page Detection with PaddleNLP (Part 6)
- Directly comparable to Part 4: compares the BERT Chinese pretrained model and the ERNIE pretrained model on HTML tag classification.
- Further refines the HTML tag extraction and data cleaning pipeline.
- Validation accuracy easily exceeds 96.5% and test accuracy is close to 97%. Fine-tuning the BERT pretrained model gives the best result so far on this HTML tag sequence classification task.

Malicious Web Page Detection with PaddleNLP (Part 7)
- Introduces capturing web page snapshots with the test automation tool selenium.
- Introduces locating and decoding QR codes in page snapshots with the open-source zxing library.
- Outlines how to use the model trained in Part 6 to classify the URLs contained in QR codes.

About fine-tuning

In recent years, as deep learning has developed, model parameter counts have grown rapidly, and training that many parameters requires larger datasets to avoid overfitting. For most NLP tasks, however, building a large labeled dataset is prohibitively expensive, especially for syntax- and semantics-related tasks. Large unlabeled corpora, by contrast, are relatively easy to build. Recent research shows that pretrained models (PTMs) trained on large unlabeled corpora learn general-purpose language representations, and that fine-tuning them on downstream tasks gives excellent results. Pretrained models also avoid training a model from scratch.

Figure: overview of pretrained models. Image source: https://github.com/thunlp/PLMpapers

The PaddleNLP documentation presents the following example:
Fine-tuning a pretrained model for sentiment classification

This example shows how pretrained models such as ERNIE (Enhanced Representation through Knowledge Integration) can be fine-tuned for Chinese sentiment analysis. Let's first look at how it performs.
Install the latest PaddleNLP develop branch from source:
!pip install --upgrade git+https://gitee.com/PaddlePaddle/PaddleNLP.git
!git clone https://gitee.com/paddlepaddle/PaddleNLP.git
import paddle
import paddlenlp
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):

Overview of pretrained models

For Chinese text classification, PaddleNLP has open-sourced a series of models that users can select by configuration:

BERT (Bidirectional Encoder Representations from Transformers) Chinese model, abbreviated bert-base-chinese, which consists of a 12-layer Transformer network.
ERNIE (Enhanced Representation through Knowledge Integration), with support for the ERNIE 1.0 Chinese model (abbreviated ernie-1.0) and the ERNIE Tiny Chinese model (abbreviated ernie-tiny). ernie-1.0 consists of a 12-layer Transformer network and ernie-tiny of a 3-layer Transformer network.
RoBERTa (A Robustly Optimized BERT Pretraining Approach), with support for the 24-layer roberta-wwm-ext-large and the 12-layer roberta-wwm-ext.

| Model | dev acc | test acc |
| --- | --- | --- |
| bert-base-chinese | 0.93833 | 0.94750 |
| bert-wwm-chinese | 0.94583 | 0.94917 |
| bert-wwm-ext-chinese | 0.94667 | 0.95500 |
| ernie-1.0 | 0.94667 | 0.95333 |
| ernie-tiny | 0.93917 | 0.94833 |
| roberta-wwm-ext | 0.94750 | 0.95250 |
| roberta-wwm-ext-large | 0.95250 | 0.95333 |
| rbt3 | 0.92583 | 0.93250 |
| rbtl3 | 0.9341 | 0.93583 |
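
As a quick way to read the table, the short sketch below copies its figures into a dictionary and picks the top model per column. The `results` dictionary and the `best_by` helper are illustrative names introduced here, not part of PaddleNLP.

```python
# (dev acc, test acc) per model, copied from the table above.
results = {
    "bert-base-chinese": (0.93833, 0.94750),
    "bert-wwm-chinese": (0.94583, 0.94917),
    "bert-wwm-ext-chinese": (0.94667, 0.95500),
    "ernie-1.0": (0.94667, 0.95333),
    "ernie-tiny": (0.93917, 0.94833),
    "roberta-wwm-ext": (0.94750, 0.95250),
    "roberta-wwm-ext-large": (0.95250, 0.95333),
    "rbt3": (0.92583, 0.93250),
    "rbtl3": (0.9341, 0.93583),
}


def best_by(column):
    """Return the model with the highest score in a column (0 = dev, 1 = test)."""
    return max(results, key=lambda name: results[name][column])


print("best dev acc :", best_by(0))
print("best test acc:", best_by(1))
```

The gap between the best and the smallest models is under three points, which is why a small model such as ernie-tiny can be a reasonable default when training time matters.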
For details, see the PaddleNLP documentation: "Fine-tuning a pretrained model for Chinese text classification". They are not repeated here; let's look at the training results:
!cd PaddleNLP/examples/text_classification/pretrained_models/ && python -m paddle.distributed.launch train.py --device gpu --save_dir ./checkpoints
-----------  Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers:
training_script: train.py
training_script_args: ['--device', 'gpu', '--save_dir', './checkpoints']
worker_num: None
workers:
------------------------------------------------
WARNING 2021-04-28 10:53:28,780 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-04-28 10:53:28,782 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
|                      Distributed Envs                   Value                   |
+---------------------------------------------------------------------------------------+
|                      PADDLE_TRAINER_ID                      0                   |
|                PADDLE_CURRENT_ENDPOINT                127.0.0.1:48117             |
|                   PADDLE_TRAINERS_NUM                      1                   |
|             PADDLE_TRAINER_ENDPOINTS                127.0.0.1:48117             |
|                   FLAGS_selected_gpus                      0                   |
+=======================================================================================+

INFO 2021-04-28 10:53:28,782 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
[2021-04-28 10:53:30,163] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-tiny/ernie_tiny.pdparams
W0428 10:53:30.165206  8918 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0428 10:53:30.170289  8918 device_context.cc:372] device: 0, cuDNN Version: 7.6.
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1303: UserWarning: Skip loading for classifier.weight. classifier.weight is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1303: UserWarning: Skip loading for classifier.bias. classifier.bias is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
[2021-04-28 10:53:37,440] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/vocab.txt
[2021-04-28 10:53:37,441] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/spm_cased_simp_sampled.model
[2021-04-28 10:53:37,441] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/dict.wordseg.pickle
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py:423: UserWarning: The program will return to single-card operation. Please check 1, whether you use spawn or fleetrun to start the program. 2, Whether it is a multi-card program. 3, Is the current environment multi-card.
  warnings.warn("The program will return to single-card operation. "
global step 10, epoch: 1, batch: 10, loss: 0.52361, accu: 0.65000, speed: 9.22 step/s
global step 20, epoch: 1, batch: 20, loss: 0.50937, accu: 0.75625, speed: 11.10 step/s
global step 30, epoch: 1, batch: 30, loss: 0.20778, accu: 0.79271, speed: 11.12 step/s
global step 40, epoch: 1, batch: 40, loss: 0.17973, accu: 0.81719, speed: 11.10 step/s
global step 50, epoch: 1, batch: 50, loss: 0.13715, accu: 0.83500, speed: 11.10 step/s
global step 60, epoch: 1, batch: 60, loss: 0.31635, accu: 0.84792, speed: 10.91 step/s
global step 70, epoch: 1, batch: 70, loss: 0.48289, accu: 0.85759, speed: 11.09 step/s
global step 80, epoch: 1, batch: 80, loss: 0.33256, accu: 0.85977, speed: 11.09 step/s
global step 90, epoch: 1, batch: 90, loss: 0.23529, accu: 0.86528, speed: 10.92 step/s
global step 100, epoch: 1, batch: 100, loss: 0.10831, accu: 0.86781, speed: 11.07 step/s
eval loss: 0.24311, accu: 0.91000
global step 110, epoch: 1, batch: 110, loss: 0.24633, accu: 0.88750, speed: 1.30 step/s
global step 120, epoch: 1, batch: 120, loss: 0.21631, accu: 0.88594, speed: 11.09 step/s
global step 130, epoch: 1, batch: 130, loss: 0.24054, accu: 0.88542, speed: 11.08 step/s
global step 140, epoch: 1, batch: 140, loss: 0.14191, accu: 0.88828, speed: 11.06 step/s
global step 150, epoch: 1, batch: 150, loss: 0.17121, accu: 0.89687, speed: 11.11 step/s
global step 160, epoch: 1, batch: 160, loss: 0.26325, accu: 0.90104, speed: 11.13 step/s
global step 170, epoch: 1, batch: 170, loss: 0.11606, accu: 0.89955, speed: 11.10 step/s
global step 180, epoch: 1, batch: 180, loss: 0.06721, accu: 0.90391, speed: 11.08 step/s
global step 190, epoch: 1, batch: 190, loss: 0.24036, accu: 0.90417, speed: 11.10 step/s
global step 200, epoch: 1, batch: 200, loss: 0.58651, accu: 0.90312, speed: 11.13 step/s
eval loss: 0.21573, accu: 0.92583
global step 210, epoch: 1, batch: 210, loss: 0.29864, accu: 0.91563, speed: 1.29 step/s
global step 220, epoch: 1, batch: 220, loss: 0.33272, accu: 0.92188, speed: 11.08 step/s
global step 230, epoch: 1, batch: 230, loss: 0.30313, accu: 0.91771, speed: 11.10 step/s
global step 240, epoch: 1, batch: 240, loss: 0.26417, accu: 0.92109, speed: 11.09 step/s
global step 250, epoch: 1, batch: 250, loss: 0.32720, accu: 0.91938, speed: 11.08 step/s
global step 260, epoch: 1, batch: 260, loss: 0.29275, accu: 0.92188, speed: 11.11 step/s
global step 270, epoch: 1, batch: 270, loss: 0.22924, accu: 0.92277, speed: 11.13 step/s
global step 280, epoch: 1, batch: 280, loss: 0.07589, accu: 0.92461, speed: 11.09 step/s
global step 290, epoch: 1, batch: 290, loss: 0.19718, accu: 0.92708, speed: 11.11 step/s
global step 300, epoch: 1, batch: 300, loss: 0.09135, accu: 0.92875, speed: 11.06 step/s
eval loss: 0.19773, accu: 0.93583
^C
Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/distributed/launch.py", line 16, in <module>
    launch.launch()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/distributed/fleet/launch.py", line 328, in launch
    launch_collective(args)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/distributed/fleet/launch.py", line 244, in launch_collective
    time.sleep(3)
KeyboardInterrupt

The log makes the point clearly: the run was interrupted by hand at step 300, still inside the first epoch, and evaluation accuracy on the sentiment task had already passed 90% (0.91 at step 100, 0.936 at step 300). Fine-tuning therefore looks like one answer to the poor results obtained when training from scratch in Part 1 of this series.
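
To read evaluation accuracy out of a log like the one above without scanning it by eye, a small parser is enough. The sketch below is a minimal example that assumes the exact line formats shown in this log; `eval_points` and the two regular expressions are names introduced here for illustration.

```python
import re

STEP_RE = re.compile(r"^global step (?P<step>\d+),")
EVAL_RE = re.compile(r"^eval loss: (?P<loss>\d+\.\d+), accu: (?P<accu>\d+\.\d+)$")


def eval_points(log_text):
    """Pair each 'eval' line with the most recent 'global step' line before it."""
    points, last_step = [], None
    for line in log_text.splitlines():
        line = line.strip()
        step_match = STEP_RE.match(line)
        if step_match:
            last_step = int(step_match.group("step"))
            continue
        eval_match = EVAL_RE.match(line)
        if eval_match:
            points.append((last_step,
                           float(eval_match.group("loss")),
                           float(eval_match.group("accu"))))
    return points


# Three checkpoints copied from the log above.
sample_log = """\
global step 100, epoch: 1, batch: 100, loss: 0.10831, accu: 0.86781, speed: 11.07 step/s
eval loss: 0.24311, accu: 0.91000
global step 200, epoch: 1, batch: 200, loss: 0.58651, accu: 0.90312, speed: 11.13 step/s
eval loss: 0.21573, accu: 0.92583
global step 300, epoch: 1, batch: 300, loss: 0.09135, accu: 0.92875, speed: 11.06 step/s
eval loss: 0.19773, accu: 0.93583
"""

for step, loss, accu in eval_points(sample_log):
    print(f"step {step}: eval loss {loss:.5f}, eval accu {accu:.5f}")
```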
Improving malicious web page detection with PaddleNLP's pretrained semantic model ERNIE

About custom datasets

Admittedly, the PaddleNLP fine-tuning example above is rather heavily encapsulated: it uses a subclass of DatasetBuilder directly, which ties together dataset contribution, dataset loading and model training. For a custom dataset, though, we only need to care about how the training data is organized and what format it takes. In the PaddleNLP source code this corresponds to text_classification/pretrained_models/train.py; the relevant statements are excerpted below.
from paddlenlp.datasets import load_dataset
train_ds, dev_ds, test_ds = load_dataset(
    "chnsenticorp", splits=["train", "dev", "test"])
train_ds.label_list
['0', '1']
train_ds.data[:5]
[{'text': '選擇珠江花園的原因就是方便，有電動扶梯直接到達海邊，周圍餐館、食廊、商場、超市、攤位一應俱全。酒店裝修一般，但還算整潔。泳池在大堂的屋頂，因此很小，不過女兒倒是喜歡。包的早餐是西式的，還算豐富。服務嗎，一般',
  'label': 1},
 {'text': '15.4寸筆記本的鍵盤確實爽，基本跟臺式機差不多了，蠻喜歡數字小鍵盤，輸數字特方便，樣子也很美觀，做工也相當不錯',
  'label': 1},
 {'text': '房間太小。其他的都一般。。。。。。。。。', 'label': 0},
 {'text': '1.接電源沒有幾分鐘,電源適配器熱的不行. 2.攝像頭用不起來. 3.機蓋的鋼琴漆，手不能摸，一摸一個印. 4.硬盤分區不好辦.',
  'label': 0},
 {'text': '今天才知道這書還有第6卷,真有點郁悶:為什么同一套書有兩種版本呢?當當網是不是該跟出版社商量商量,單獨出個第6卷,讓我們的孩子不會有所遺憾。',
  'label': 1}]

So what we need to do now is arrange the custom dataset into the form shown above, which is simply a PaddleNLP MapDataset. Because the PaddleNLP documentation is still being improved, readers may not find the relevant entry point right away. In that case, look in the source code comments, for example with this feature of AI Studio or an IDE:
??load_dataset

The content of the custom dataset comes from webtrain.txt, webdev.txt and webtest.txt, which were generated in the previous project and are bundled with this article.

PaddleNLP documentation: how to customize a dataset
With load_dataset(), MapDataset and IterDataset provided by PaddleNLP, anyone can easily define their own dataset.
Creating a dataset from local files
When creating a dataset from local files, we recommend writing a read function that matches the local data format and passing it to load_dataset().
Taking the data from the waybill_ie express waybill information extraction task as an example:
```python
from paddlenlp.datasets import load_dataset

def read(data_path):
    with open(data_path, 'r', encoding='utf-8') as f:
        # Skip the header row
        next(f)
        for line in f:
            words, labels = line.strip('\n').split('\t')
            words = words.split('\002')
            labels = labels.split('\002')
            yield {'tokens': words, 'labels': labels}

# data_path is the argument of read()
map_ds = load_dataset(read, data_path='train.txt', lazy=False)
iter_ds = load_dataset(read, data_path='train.txt', lazy=True)
```
We recommend writing the data-reading code as a generator, which makes it easier to build both MapDataset and IterDataset. We also recommend writing each example as a dict, which makes the data flow easier to track.
In practice, MapDataset is sufficient in the vast majority of cases. IterDataset is generally considered only when the dataset is too large to load into memory at once.
Note
Only datasets initialized from a DatasetBuilder convert the labels in the data to ids automatically (for the exact conditions, see "How to contribute a dataset").
A custom dataset like the one in the example above needs label-to-id conversion added in its own convert-to-feature method.
Arguments of the custom read function can be passed directly to load_dataset() as keyword arguments. For custom datasets, the lazy argument must be passed.
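
The note above says a custom dataset has to convert labels to ids itself. A minimal sketch of that step in plain Python follows; `make_label_converter` is a hypothetical helper introduced here, not a PaddleNLP function, and in this article the conversion is left to the bundled utils.py.

```python
def make_label_converter(label_list):
    """Build a function that replaces a string label with its index in label_list."""
    label_to_id = {label: idx for idx, label in enumerate(label_list)}

    def convert(example):
        converted = dict(example)  # copy so the original example is left untouched
        converted["label"] = label_to_id[example["label"]]
        return converted

    return convert


convert = make_label_converter(["0", "1"])
print(convert({"text": "some page text", "label": "1"}))
```

An unknown label raises a KeyError here, which is usually preferable to silently mapping it to a default class.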

from paddlenlp.datasets import load_dataset

def read(data_path):
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip('\n').split('\t')
            # Note: the text itself may still contain stray \t characters, so take the
            # last field as the label and join the rest as the text; otherwise this fails
            words = ''.join(line[:-1])
            labels = line[-1]
            yield {'text': words, 'label': labels}

# data_path is the argument of read()
train_ds = load_dataset(read, data_path='webtrain.txt', lazy=False)
dev_ds = load_dataset(read, data_path='webdev.txt', lazy=False)
test_ds = load_dataset(read, data_path='webtest.txt', lazy=False)
train_ds.data[:10]
[{'text': '年以內,2萬公里以內SUV1年以內易車二手車體驗更好，速度更快立即前往APP看電腦版看微信版提意見購車熱線：4000-189-167(,9:00,–,21:00,)易車二手車,m.taoche.com',
  'label': '0'},
 {'text': 'ipaime.com/thread-694853-1-1.htmlcoryphaei.com/forum.php?mod=viewthread&tid=3074054回復返回版塊參與回復?,棲霞商業網',
  'label': '0'},
 {'text': '大直街店集體課表人和國際健身俱樂部首頁集體課表聯系我們掃描二維碼用手機訪問本站由業界領先的搜狐快站免費提供技術支持人和國際健身俱樂部人和健身大直街店集體課表15小時前閱讀Powered,by,搜狐快站',
  'label': '0'},
 {'text': '個人帳戶工作或學校帳戶單位或學校未分配帳戶?使用,Microsoft,帳戶登錄厭煩了這個帳戶名稱?重命名您的個人,Microsoft,帳戶。?,2017,Microsoft使用條款隱私與,Cookie',
  'label': '0'},
 {'text': 'ONGAB4yONGAB4y精絕美女-在線直播在線播放-高清無水印九獅賭城-美女荷官All,rights,reserved.Copyright,?2016,&2017',
  'label': '0'},
 {'text': '蕭蕭聯稿...友情鏈接與我在線留言本站|友情連接|后臺管理2008--2012｜http://www.syslh.com,｜管理：塵涵圣域詩聯網站魯ICP備06019539號Open,LoginBar',
  'label': '1'},
 {'text': '專利加急,專利查詢免責聲明:本站部分資料來自互聯網，轉載時會注明出處；如果侵犯了你的權益，請通知我們，我們會及時刪除侵權內容，謝謝合作！Powered,bywqCms5.7withWangqiInc.',
  'label': '0'},
 {'text': '銀河賭場澳門賭場玩法在線賭博真錢賭博網站賭博現金網線上賭博平臺葡京娛樂場新葡京娛樂場金沙娛樂場bet365娛樂場真人娛樂場博彩網澳門博彩網站博彩公司博彩公司評級博彩現金網博彩網導航博彩技巧博彩公司排名',
  'label': '1'},
 {'text': '云集品電子商務有限公司,版權所有.粵ICP備14072989號-2聯系方式0755-33198568elapsed_time:0.1559,memory_usage:6.96MB關注微信,贏取更多優惠',
  'label': '0'},
 {'text': '云集品電子商務有限公司,版權所有.粵ICP備14072989號-2聯系方式0755-33198568elapsed_time:0.1958,memory_usage:6.96MB關注微信,贏取更多優惠',
  'label': '0'}]
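
The tab handling in read() can be checked in isolation, without PaddleNLP. The sketch below writes two made-up lines to a temporary file, one of them with a stray tab inside the text, and runs the same splitting logic over it. The file name and sample lines are invented for illustration.

```python
import os
import tempfile


def read(data_path):
    """Same splitting logic as above: last tab-separated field is the label."""
    with open(data_path, "r", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            yield {"text": "".join(fields[:-1]), "label": fields[-1]}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("plain page text\t0\n")
        f.write("text with\ta stray tab\t1\n")
    rows = list(read(path))

for row in rows:
    print(row)
```

Joining with an empty string drops the stray tab rather than keeping it, so "with" and "a" end up adjacent. For page text that has already been flattened this loses nothing important, but joining with a space would be an equally valid choice.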
# label_list can simply be added by hand
train_ds.label_list = ['0', '1']
dev_ds.label_list = ['0', '1']
test_ds.label_list = ['0', '1']
# Check the result to confirm the custom dataset is ready
print(train_ds.label_list)

for data in train_ds.data[:5]:
    print(data)
['0', '1']
{'text': '年以內,2萬公里以內SUV1年以內易車二手車體驗更好，速度更快立即前往APP看電腦版看微信版提意見購車熱線：4000-189-167(,9:00,–,21:00,)易車二手車,m.taoche.com', 'label': '0'}
{'text': 'ipaime.com/thread-694853-1-1.htmlcoryphaei.com/forum.php?mod=viewthread&tid=3074054回復返回版塊參與回復?,棲霞商業網', 'label': '0'}
{'text': '大直街店集體課表人和國際健身俱樂部首頁集體課表聯系我們掃描二維碼用手機訪問本站由業界領先的搜狐快站免費提供技術支持人和國際健身俱樂部人和健身大直街店集體課表15小時前閱讀Powered,by,搜狐快站', 'label': '0'}
{'text': '個人帳戶工作或學校帳戶單位或學校未分配帳戶?使用,Microsoft,帳戶登錄厭煩了這個帳戶名稱?重命名您的個人,Microsoft,帳戶。?,2017,Microsoft使用條款隱私與,Cookie', 'label': '0'}
{'text': 'ONGAB4yONGAB4y精絕美女-在線直播在線播放-高清無水印九獅賭城-美女荷官All,rights,reserved.Copyright,?2016,&2017', 'label': '0'}

Reference project: "NLP Classic Projects" 02: Improving sentiment analysis with the pretrained model ERNIE. That project also provides the data-processing script utils.py, which is bundled with this article.
Loading pretrained models in one line with PaddleNLP

For each pretrained model, PaddleNLP already ships a fine-tuning network for the downstream text classification task. The walkthrough below uses ERNIE as the example of fine-tuning a pretrained model for text classification.

paddlenlp.transformers.ErnieModel() loads the pretrained ERNIE model in one line.
paddlenlp.transformers.ErnieForSequenceClassification() loads, in one line, the fine-tuning network that uses pretrained ERNIE for text classification.
- It appends a fully connected layer after the ERNIE model to perform the classification.

paddlenlp.transformers.ErnieForSequenceClassification.from_pretrained() defines the network given only the name of the model to use and the number of classes.
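
For intuition about the fully connected layer mentioned above, here is a toy sketch in plain Python: a sentence vector goes through one linear layer to get a logit per class, softmax turns the logits into probabilities, and the largest one picks the label. The vector, weights and bias are made-up numbers, and the real ERNIE head works on a much larger hidden size; this is not the PaddleNLP implementation.

```python
import math


def linear(x, weight, bias):
    """One fully connected layer: one output per row of `weight`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


pooled = [0.5, -1.0, 2.0]        # hypothetical sentence vector from the encoder
weight = [[0.1, 0.2, 0.3],       # one row of weights per class
          [0.4, -0.5, 0.6]]
bias = [0.0, 0.1]

logits = linear(pooled, weight, bias)
probs = softmax(logits)
label_map = {0: "normal page", 1: "hacked page"}
pred = label_map[probs.index(max(probs))]
print(logits, probs, pred)
```

With `num_classes=2` the head has exactly two such rows, which is why the classifier weights are reported as newly initialized when the pretrained checkpoint is loaded.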

# Name of the model to use

MODEL_NAME = "ernie-1.0"

ernie_model = paddlenlp.transformers.ErnieModel.from_pretrained(MODEL_NAME)

model = paddlenlp.transformers.ErnieForSequenceClassification.from_pretrained(MODEL_NAME, num_classes=len(train_ds.label_list))
[2021-04-28 13:16:39,380] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-1.0/ernie_v1_chn_base.pdparams
[2021-04-28 13:16:45,760] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-1.0/ernie_v1_chn_base.pdparams
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1303: UserWarning: Skip loading for classifier.weight. classifier.weight is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1303: UserWarning: Skip loading for classifier.bias. classifier.bias is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))

Data processing with paddlenlp.transformers.ErnieTokenizer

The pretrained ERNIE model processes Chinese text character by character. PaddleNLP ships a matching tokenizer for each pretrained model; specifying the name of the model loads the corresponding tokenizer.
The tokenizer converts raw input text into the input format the model accepts.
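
To make the character-level processing concrete, the sketch below encodes a short string with a toy vocabulary. Everything in it is hypothetical: the eight-entry vocabulary, the id values and the `toy_encode` helper are invented for illustration, whereas the real tokenizer loads its vocabulary from vocab.txt and uses its own special-token ids.

```python
# Hypothetical toy vocabulary; the real one is loaded from vocab.txt.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
             "網": 4, "頁": 5, "正": 6, "常": 7}


def toy_encode(text, max_seq_len=8):
    """Split text into characters, add [CLS]/[SEP], and map tokens to ids."""
    chars = list(text)[: max_seq_len - 2]      # leave room for the two special tokens
    tokens = ["[CLS]"] + chars + ["[SEP]"]
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    token_type_ids = [0] * len(input_ids)      # single-sentence input: all segment 0
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


encoded = toy_encode("網頁正常!")
print(encoded)
```

The two lists returned here play the same roles as the input ids and segment ids that the data loader later passes to the model.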

Figure: ERNIE model architecture

tokenizer = paddlenlp.transformers.ErnieTokenizer.from_pretrained(MODEL_NAME)
[2021-04-28 13:16:47,310] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-1.0/vocab.txt

Data loading

Use the paddle.io.DataLoader interface for multi-threaded, asynchronous data loading.
from functools import partial
from paddlenlp.data import Stack, Tuple, Pad
from utils import convert_example, create_dataloader

# Batch size used when running the model
batch_size = 16
max_seq_length = 128

# Convert each raw example into model inputs with the tokenizer
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
# Pad inputs and segments to a common length per batch and stack the labels
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    Stack(dtype="int64")  # label
): [data for data in fn(samples)]

train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
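
The batchify_fn above pads the input ids and segment ids of a batch to a common length and stacks the labels. A plain-Python sketch of that idea follows; `pad_batch` and `batchify` are illustrative stand-ins written for this article, not the paddlenlp.data classes, and they return lists rather than arrays.

```python
def pad_batch(sequences, pad_val=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


def batchify(samples, pad_token_id=0, pad_token_type_id=0):
    """Turn a list of (input_ids, token_type_ids, label) into three batch fields."""
    input_ids, token_type_ids, labels = zip(*samples)
    return (pad_batch(input_ids, pad_token_id),
            pad_batch(token_type_ids, pad_token_type_id),
            list(labels))


samples = [
    ([1, 4, 5, 2], [0, 0, 0, 0], 0),
    ([1, 6, 2], [0, 0, 0], 1),
]
ids, types, labels = batchify(samples)
print(ids)
print(types)
print(labels)
```

Padding per batch rather than to max_seq_length keeps short batches small, which is one reason the collate step is separate from the per-example conversion.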
from paddlenlp.transformers import LinearDecayWithWarmup

# Peak learning rate during training
learning_rate = 5e-5
# Number of training epochs
epochs = 10
# Proportion of training steps used for learning-rate warmup
warmup_proportion = 0.1
# Weight decay coefficient, a regularization-like strategy that helps avoid overfitting
weight_decay = 0.01

num_training_steps = len(train_data_loader) * epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    # Apply weight decay to every parameter except biases and normalization layers
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ])

criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
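
The shape of a warmup-then-linear-decay schedule can be sketched in a few lines of plain Python. This is a common formulation and is assumed, not verified against the library source, to match what LinearDecayWithWarmup computes; `linear_decay_with_warmup` is a name introduced here. The step count uses the 19 batches per epoch visible in the training log later in this article.

```python
def linear_decay_with_warmup(step, base_lr, total_steps, warmup_proportion):
    """Learning rate at `step`: linear ramp-up during warmup, then linear decay to 0."""
    warmup_steps = int(warmup_proportion * total_steps)
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor


# 19 batches per epoch x 10 epochs, with the hyperparameters set above.
total_steps = 19 * 10
for step in (0, 10, 19, 100, 190):
    lr = linear_decay_with_warmup(step, 5e-5, total_steps, 0.1)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

With these settings the learning rate peaks at 5e-5 at the end of the second epoch and falls linearly to zero by the last step.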
# The checkpoint folder is used to save the trained model
!mkdir /home/aistudio/checkpoint

Model training and evaluation

Model training usually involves the following steps:

Take one batch of data from the dataloader.
Feed the batch to the model for the forward pass.
Pass the forward result to the loss function to compute the loss, and to the metric to compute the evaluation score.
Backpropagate the loss and update the parameters. Repeat the steps above.

After each training epoch, the program evaluates the current model once.
import paddle.nn.functional as F
from utils import evaluate
from visualdl import LogWriter

global_step = 0
for epoch in range(1, epochs + 1):
    with LogWriter(logdir="./visualdl") as writer:
        for step, batch in enumerate(train_data_loader, start=1):
            input_ids, segment_ids, labels = batch
            # Forward pass, loss and running accuracy
            logits = model(input_ids, segment_ids)
            loss = criterion(logits, labels)
            probs = F.softmax(logits, axis=1)
            correct = metric.compute(probs, labels)
            metric.update(correct)
            acc = metric.accumulate()
            global_step += 1
            if global_step % 10 == 0:
                print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, acc))
            # Add a data point tagged `loss` to the logger
            writer.add_scalar(tag="loss", step=global_step, value=loss)
            # Add a data point tagged `acc` to the logger
            writer.add_scalar(tag="acc", step=global_step, value=acc)
            # Backward pass and parameter update
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_grad()
        # Evaluate on the dev set at the end of each epoch
        evaluate(model, criterion, metric, dev_data_loader)

model.save_pretrained('/home/aistudio/checkpoint')
tokenizer.save_pretrained('/home/aistudio/checkpoint')
global step 10, epoch: 1, batch: 10, loss: 0.20771, acc: 0.96875
eval loss: 0.30647, accu: 0.87963
global step 20, epoch: 2, batch: 1, loss: 0.06249, acc: 1.00000
global step 30, epoch: 2, batch: 11, loss: 0.11620, acc: 0.97727
eval loss: 0.36536, accu: 0.87037
global step 40, epoch: 3, batch: 2, loss: 0.02215, acc: 1.00000
global step 50, epoch: 3, batch: 12, loss: 0.00893, acc: 0.98958
eval loss: 0.56552, accu: 0.86111
global step 60, epoch: 4, batch: 3, loss: 0.00972, acc: 0.95833
global step 70, epoch: 4, batch: 13, loss: 0.01328, acc: 0.98077
eval loss: 0.81667, accu: 0.77778
global step 80, epoch: 5, batch: 4, loss: 0.22332, acc: 0.95312
global step 90, epoch: 5, batch: 14, loss: 0.27860, acc: 0.96429
eval loss: 0.40668, accu: 0.85185
global step 100, epoch: 6, batch: 5, loss: 0.03298, acc: 1.00000
global step 110, epoch: 6, batch: 15, loss: 0.00961, acc: 0.99167
eval loss: 0.27407, accu: 0.87963
global step 120, epoch: 7, batch: 6, loss: 0.01436, acc: 1.00000
global step 130, epoch: 7, batch: 16, loss: 0.00705, acc: 1.00000
eval loss: 0.31425, accu: 0.88889
global step 140, epoch: 8, batch: 7, loss: 0.00653, acc: 1.00000
global step 150, epoch: 8, batch: 17, loss: 0.00868, acc: 1.00000
eval loss: 0.30953, accu: 0.87963
global step 160, epoch: 9, batch: 8, loss: 0.00582, acc: 1.00000
global step 170, epoch: 9, batch: 18, loss: 0.00582, acc: 1.00000
eval loss: 0.31251, accu: 0.87963
global step 180, epoch: 10, batch: 9, loss: 0.00378, acc: 1.00000
global step 190, epoch: 10, batch: 19, loss: 0.00494, acc: 1.00000
eval loss: 0.31438, accu: 0.87963
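
In the log above, training accuracy reaches 1.0 while evaluation accuracy stays around 0.88, and the loop saves only the final model. One common alternative, offered here as a suggestion and not as what this article does, is to keep the checkpoint from the epoch with the best evaluation score. The sketch below finds that epoch from the logged numbers.

```python
# (eval loss, eval accuracy) for epochs 1..10, copied from the log above.
eval_history = [
    (0.30647, 0.87963),
    (0.36536, 0.87037),
    (0.56552, 0.86111),
    (0.81667, 0.77778),
    (0.40668, 0.85185),
    (0.27407, 0.87963),
    (0.31425, 0.88889),
    (0.30953, 0.87963),
    (0.31251, 0.87963),
    (0.31438, 0.87963),
]

epochs_idx = range(len(eval_history))
best_acc_epoch = max(epochs_idx, key=lambda i: eval_history[i][1]) + 1
best_loss_epoch = min(epochs_idx, key=lambda i: eval_history[i][0]) + 1

print("highest eval accuracy at epoch", best_acc_epoch)
print("lowest eval loss at epoch", best_loss_epoch)
```

The two criteria disagree by one epoch here, which is typical on a small validation set where a single example moves accuracy by almost a full point.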
Figure: training process in VisualDL

Model prediction

Once the model has been trained and saved, it can be used for prediction. As in the sample code that follows, define some custom input data and call predict() to get predictions in one step.
from utils import predict

data = [
    {'text': ',3,4號不發貨，介意者慎拍，謝謝！有品燃脂營微信“掃一掃”立即關注微信號：picoocbxj主頁最新商品有品私教有品魔秤我的訂單店鋪主頁會員中心關注我們店鋪信息有贊提供技術支持取消清除歷史搜索購物車'},

    {'text': 'et是什么意思weekend是什么意思warn是什么意思team是什么意思Copyright,?,2006,-,2016,XUEXILA.COM,All,Rights,Reserved學習啦,版權所有'},

    {'text': '太陽娛樂城現金開戶皇家堡娛樂城彩票網上投注怎么領獎彩票網上投注那個好網上賭場排行送173元,一肖中特免費資料博悅娛樂是不是正規的天苑娛樂城xvsr姦全犯罪成海うるみkzcs16.com藝考不是博彩首頁'},

    {'text': 'SS訂閱抵制不良游戲,拒絕盜版游戲,注意自我保護,謹防受騙上當,適度游戲益腦,沉迷游戲傷身,合理安排時間,享受健康生活Copyright,?,201788130安卓下載滬ICP備16000974號-9'},

    {'text': '百度架構師手把手帶你實現零基礎小白到AI工程師的華麗蛻變'},
]
# 0: normal page, 1: hacked page
label_map = {0: '正常頁面', 1: '被黑頁面'}

results = predict(
    model, data, tokenizer, label_map, batch_size=batch_size)
for idx, text in enumerate(data):
    print('Data: {} \t Label: {}'.format(text, results[idx]))
Data: {'text': ',3,4號不發貨，介意者慎拍，謝謝！有品燃脂營微信“掃一掃”立即關注微信號：picoocbxj主頁最新商品有品私教有品魔秤我的訂單店鋪主頁會員中心關注我們店鋪信息有贊提供技術支持取消清除歷史搜索購物車'} Label: 正常頁面
Data: {'text': 'et是什么意思weekend是什么意思warn是什么意思team是什么意思Copyright,?,2006,-,2016,XUEXILA.COM,All,Rights,Reserved學習啦,版權所有'} Label: 正常頁面
Data: {'text': '太陽娛樂城現金開戶皇家堡娛樂城彩票網上投注怎么領獎彩票網上投注那個好網上賭場排行送173元,一肖中特免費資料博悅娛樂是不是正規的天苑娛樂城xvsr姦全犯罪成海うるみkzcs16.com藝考不是博彩首頁'} Label: 被黑頁面
Data: {'text': 'SS訂閱抵制不良游戲,拒絕盜版游戲,注意自我保護,謹防受騙上當,適度游戲益腦,沉迷游戲傷身,合理安排時間,享受健康生活Copyright,?,201788130安卓下載滬ICP備16000974號-9'} Label: 正常頁面
Data: {'text': '百度架構師手把手帶你實現零基礎小白到AI工程師的華麗蛻變'} Label: 正常頁面

To probe how the model generalizes, sentiment-analysis texts are also fed into it here (a bit of a wild idea). All of them are classified as normal pages, which is the expected result.
from utils import predict

data = [
    {"text": '這個賓館比較陳舊了，特價的房間也很一般。總體來說一般'},
    {"text": '懷著十分激動的心情放映，可是看著看著發現，在放映完畢后，出現一集米老鼠的動畫片'},
    {"text": '作為老的四星酒店，房間依然很整潔，相當不錯。機場接機服務很好，可以在車上辦理入住手續，節省時間。'},
]
# 0: normal page, 1: hacked page
label_map = {0: '正常頁面', 1: '被黑頁面'}

results = predict(
    model, data, tokenizer, label_map, batch_size=batch_size)
for idx, text in enumerate(data):
    print('Data: {} \t Label: {}'.format(text, results[idx]))
Data: {'text': '這個賓館比較陳舊了，特價的房間也很一般。總體來說一般'} Label: 正常頁面
Data: {'text': '懷著十分激動的心情放映，可是看著看著發現，在放映完畢后，出現一集米老鼠的動畫片'} Label: 正常頁面
Data: {'text': '作為老的四星酒店，房間依然很整潔，相當不錯。機場接機服務很好，可以在車上辦理入住手續，節省時間。'} Label: 正常頁面

Data can also be loaded directly from the test set using the custom-dataset approach:
from paddlenlp.datasets import load_dataset

def read(data_path):
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip('\n').split('\t')
            # Note: the text itself may still contain stray \t characters, so drop the
            # last field (the label) and join the rest as the text; otherwise this fails
            words = ''.join(line[:-1])
            yield {'text': words}

# data_path is the argument of read()
test_ds = load_dataset(read, data_path='webtest.txt', lazy=False)
test_ds.label_list = ['0', '1']
test_ds[-5:]
[{'text': '區永壽縣三原縣禮泉縣長武縣陜西相關分類：安康綜合醫院黃頁寶雞綜合醫院黃頁漢中綜合醫院黃頁商洛綜合醫院黃頁銅川綜合醫院黃頁渭南綜合醫院黃頁西安綜合醫院黃頁延安綜合醫院黃頁榆林綜合醫院黃頁分享到更多...'},
 {'text': '云集品電子商務有限公司,版權所有.粵ICP備14072989號-2聯系方式0755-33198568elapsed_time:0.1522,memory_usage:6.96MB關注微信,贏取更多優惠'},
 {'text': '陽城娛樂開戶網址實戰百家樂真錢娛樂百家樂莊和閑哪個多澳門賭場攻略彩摘博彩資訊網上真錢賭博新華區體育路小學足球博彩網址大全太陽城開戶澳門皇冠賭場足球盈虧指數優博博彩太陽城游戲開戶皇冠在線充值平臺大家比分'},
 {'text': '站長統計'},
 {'text': '公司名：常熟市鑫豐貨運有限公司會員組：免費會員,\xa0[我要升級]狀態：離線姓名：魏如文(先生)職位：總經理電話：0512-52849533手機：15962489066QQ：地址：常熟市通港路88一99號'}]
from utils import predict

# 0: normal page, 1: hacked page
label_map = {0: '正常頁面', 1: '被黑頁面'}

results = predict(
    model, test_ds[-5:], tokenizer, label_map, batch_size=batch_size)
for idx, text in enumerate(test_ds[-5:]):
    print('Data: {} \t Label: {}'.format(text, results[idx]))
Data: {'text': '區永壽縣三原縣禮泉縣長武縣陜西相關分類：安康綜合醫院黃頁寶雞綜合醫院黃頁漢中綜合醫院黃頁商洛綜合醫院黃頁銅川綜合醫院黃頁渭南綜合醫院黃頁西安綜合醫院黃頁延安綜合醫院黃頁榆林綜合醫院黃頁分享到更多...'} Label: 正常頁面
Data: {'text': '云集品電子商務有限公司,版權所有.粵ICP備14072989號-2聯系方式0755-33198568elapsed_time:0.1522,memory_usage:6.96MB關注微信,贏取更多優惠'} Label: 正常頁面
Data: {'text': '陽城娛樂開戶網址實戰百家樂真錢娛樂百家樂莊和閑哪個多澳門賭場攻略彩摘博彩資訊網上真錢賭博新華區體育路小學足球博彩網址大全太陽城開戶澳門皇冠賭場足球盈虧指數優博博彩太陽城游戲開戶皇冠在線充值平臺大家比分'} Label: 被黑頁面
Data: {'text': '站長統計'} Label: 正常頁面
Data: {'text': '公司名：常熟市鑫豐貨運有限公司會員組：免費會員,\xa0[我要升級]狀態：離線姓名：魏如文(先生)職位：總經理電話：0512-52849533手機：15962489066QQ：地址：常熟市通港路88一99號'} Label: 正常頁面

Summary

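After fine-tuning, prediction quality improved substantially (evaluation accuracy of about 88% in the run above) and the model converged very quickly. That said, the first project used fairly crude data preprocessing, so the dataset may contain flaws that hold the metrics back. Later projects in the series will look further into data cleaning.
Hyperparameter tuning in NLP is something of a black art. How best to tune the model after fine-tuning is left for readers to explore.
Thanks again to community member @沒入門的研究生 for the guidance. NLP does have a fairly steep learning curve, so beginners who run into problems are advised to ask around early and avoid staying stuck for long.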

