How do I extract feature-vector data from a bag of words for use with an ML algorithm?

def bagOfWords(description, vocabulary):
    bag = np.zeros(len(vocabulary)).astype(int)
    for sw in description:
        for i, word in enumerate(vocabulary):
            if word == sw:
                bag[i] += 1
    print("Bag: ", bag)
    return bag

import pandas as pd
import numpy as np
import warnings
import tkinter as tk
from tkinter import filedialog
from nltk.tokenize import TweetTokenizer

warnings.filterwarnings("ignore", category=FutureWarning)

root = tk.Tk()
canvas1 = tk.Canvas(root, width=300, height=300, bg='lightsteelblue')
canvas1.pack()


def getExcel():
    global df
    vocabularysheet = pd.read_excel(r'Filepath\filename.xlsx')
    vocabularydf = pd.DataFrame(vocabularysheet, columns=['Word'])
    vocabulary = vocabularydf.values.tolist()
    unitlabelsdf = pd.DataFrame(vocabularysheet, columns=['Unit'])
    unitlabels = unitlabelsdf.values.tolist()

    for voc in vocabulary:
        index = vocabulary.index(voc)
        voc = vocabulary[index][0]
        vocabulary[index] = voc

    for label in unitlabels:
        index = unitlabels.index(label)
        label = unitlabels[index][0]
        unitlabels[index] = label

    import_file_path = filedialog.askopenfilename()
    testdatasheet = pd.read_excel(import_file_path)
    descriptiondf = pd.DataFrame(testdatasheet, columns=['Description'])
    descriptiondf = descriptiondf.replace('\n', ' ', regex=True).replace('\xa0', ' ', regex=True).replace('•', ' ', regex=True).replace('u200b', ' ', regex=True)
    description = descriptiondf.values.tolist()
    tokenized_description = tokanize(description)

    for x in tokenized_description:
        index = tokenized_description.index(x)
        tokenized_description[index] = bagOfWords(x, vocabulary)


def tokanize(description):
    for d in description:
        index = description.index(d)
        tknzr = TweetTokenizer()
        tokenized_description = list(tknzr.tokenize((str(d).lower())))
        description[index] = tokenized_description
    return description


def wordFilter(tokenized_description):
    bad_chars = [';', ':', '!', "*", ']', '[', '.', ',', "'", '"']
    if(tokenized_description in bad_chars):
        return False
    else:
        return True


def bagOfWords(description, vocabulary):
    bag = np.zeros(len(vocabulary)).astype(int)
    for sw in description:
        for i, word in enumerate(vocabulary):
            if word == sw:
                bag[i] += 1
    print("Bag: ", bag)
    return bag


browseButton_Excel = tk.Button(text='Import Excel File', command=getExcel, bg='green', fg='white', font=('helvetica', 12, 'bold'))
predictionButton = tk.Button(text='Button', command=getExcel, bg='green', fg='white', font=('helvetica', 12, 'bold'))
canvas1.create_window(150, 150, window=browseButton_Excel)

root.mainloop()

1 answer

User

#1 · Posted 2024-04-26 13:29:56

You already know how to prepare the dataset for training.

Here is an example to illustrate:

from nltk.tokenize import TweetTokenizer

voca = ["java", "spring", "net", "csharp", "python", "numpy", "nodejs", "javascript"]

units = ["MicrosoftTech", "JavaTech", "Pythoneers", "JavascriptRoots"]
desc1 = "Company X is looking for a Java Developer. Requirements: Has worked with Java. 3+ years experience with Java, Maven and Spring."
desc2 = "Company Y is looking for a csharp Developer. Requirements: Has worked with csharp. 5+ years experience with csharp, Net."

# bagOfWords compares whole tokens against the vocabulary, so each
# description is lowercased and tokenized first (as in the question's code).
tknzr = TweetTokenizer()

x_train = []
y_train = []

x_train.append(bagOfWords(tknzr.tokenize(desc1.lower()), voca))
y_train.append(units.index("JavaTech"))
x_train.append(bagOfWords(tknzr.tokenize(desc2.lower()), voca))
y_train.append(units.index("MicrosoftTech"))

This gives us two training samples:

[array([3, 1, 0, 0, 0, 0, 0, 0]), array([0, 0, 1, 3, 0, 0, 0, 0])] [1, 0]

array([3, 1, 0, 0, 0, 0, 0, 0]) => 1 (i.e. JavaTech)
array([0, 0, 1, 3, 0, 0, 0, 0]) => 0 (i.e. MicrosoftTech)
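The same feature vectors can be built without the nested loops. The following is only a sketch: `bag_of_words` and the regex tokenizer are hypothetical stand-ins (not the question's `bagOfWords` or NLTK's `TweetTokenizer`), and the typo in the second description is corrected.

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in text."""
    # Assumed tokenizer: lowercase the text and keep runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that never occurred.
    return [counts[word] for word in vocabulary]


voca = ["java", "spring", "net", "csharp", "python", "numpy", "nodejs", "javascript"]
units = ["MicrosoftTech", "JavaTech", "Pythoneers", "JavascriptRoots"]

# (description, unit) pairs taken from the example above.
labelled = [
    ("Company X is looking for a Java Developer. Requirements: Has worked with Java. "
     "3+ years experience with Java, Maven and Spring.", "JavaTech"),
    ("Company Y is looking for a csharp Developer. Requirements: Has worked with csharp. "
     "5+ years experience with csharp, Net.", "MicrosoftTech"),
]

x_train = [bag_of_words(text, voca) for text, _ in labelled]
y_train = [units.index(unit) for _, unit in labelled]

print(x_train)
print(y_train)
```

Counting once with `Counter` and then looking up each vocabulary word avoids comparing every token against every vocabulary entry.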

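The model has to predict one of the 4 units you defined, so we need a classification model. A multi-class classifier uses "softmax" as the activation function of its output layer and a "cross-entropy" loss function. Below is a very simple deep-learning model written with TensorFlow's Keras API.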
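To make the two terms concrete, here is a small sketch in plain Python of what softmax and the sparse cross-entropy loss compute for a single sample. The function names and the logit values are made up for illustration; Keras performs these computations internally.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy(probs, label):
    """Loss for one sample: negative log of the probability of the true class."""
    return -math.log(probs[label])


logits = [2.0, 0.5, 0.1, -1.0]  # made-up scores for the 4 units
probs = softmax(logits)

loss_if_class_0 = sparse_cross_entropy(probs, 0)  # true class got the highest score
loss_if_class_3 = sparse_cross_entropy(probs, 3)  # true class got the lowest score

print(probs)
print(loss_if_class_0, loss_if_class_3)

# A model that guesses uniformly over 4 classes has loss ln(4).
uniform_loss = sparse_cross_entropy(softmax([0.0, 0.0, 0.0, 0.0]), 2)
print(uniform_loss)
```

The uniform-guess loss, ln 4 ≈ 1.386, is a useful baseline: an untrained 4-class model usually starts near that value.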

import tensorflow as tf
import numpy as np

units = ["MicrosoftTech", "JavaTech", "Pythoneers", "JavascriptRoots"]
x_train = np.array([[3, 1, 0, 0, 0, 0, 0, 0],
                [1, 0, 0, 0, 0, 0, 0, 0],
                [0, 0, 1, 1, 0, 0, 0, 0],
                [0, 0, 2, 0, 0, 0, 0, 0],
                [0, 0, 0, 0, 2, 1, 0, 0],
                [0, 0, 0, 0, 1, 2, 0, 0],
                [0, 0, 0, 0, 0, 0, 1, 1],
                [0, 0, 0, 0, 0, 0, 1, 0]])
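# Each label is an index into units. Note that this toy data labels the
# java/spring rows 0 (MicrosoftTech) and the net/csharp rows 1 (JavaTech),
# the reverse of the earlier example; the prediction shown later follows
# these labels.
y_train = np.array([0, 0, 1, 1, 2, 2, 3, 3])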

The model consists of a hidden layer with 256 units and an output layer with 4 units.
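As a rough size check, the number of trainable parameters follows directly from the layer widths. This sketch assumes the 8-word vocabulary from the example as the input size; `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: one weight per input/unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units


n_features = 8  # length of the vocabulary in the example

hidden = dense_params(n_features, 256)  # 8 * 256 weights + 256 biases
output = dense_params(256, 4)           # 256 * 4 weights + 4 biases
total = hidden + output                 # a dropout layer adds no parameters

print(hidden, output, total)
```

With only 8 training rows, a model of this size can memorize the data easily, which is worth keeping in mind when reading the training accuracy.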

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(len(units), activation=tf.nn.softmax)])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

I set the number of epochs to 50. Watch the loss and accuracy while training runs; in practice, 10 epochs was not enough. Now start training:

model.fit(x_train, y_train, epochs=50)

Next comes the prediction part. newSample is just a sample I made up.

newSample = np.array([[2, 2, 0, 0, 0, 0, 0, 0]])
prediction = model.predict(newSample)
print(prediction)
print(units[np.argmax(prediction)])

Finally, I got the following result:

[[0.96280855 0.00981709 0.0102595  0.01711495]]
MicrosoftTech

These values are the probabilities for each unit. The most likely one is MicrosoftTech.

MicrosoftTech : 0.96280855
JavaTech : 0.00981709
....
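The full mapping from probabilities to unit names can be printed in one go. This is a small sketch that reuses the prediction values printed above as a plain list, so it runs without TensorFlow.

```python
units = ["MicrosoftTech", "JavaTech", "Pythoneers", "JavascriptRoots"]

# Values copied from the printed prediction above (one row per sample).
prediction = [[0.96280855, 0.00981709, 0.0102595, 0.01711495]]

# Pair each unit with its probability and sort from most to least likely.
ranked = sorted(zip(units, prediction[0]), key=lambda pair: pair[1], reverse=True)

for unit, prob in ranked:
    print(f"{unit}: {prob:.4f}")
```

Sorting the pairs also shows the runner-up classes, which helps when the top probability is less clear-cut than in this example.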

And this is the output of the training step. You can see that the loss keeps decreasing, which is why I increased the number of epochs.

Epoch 1/50
8/8 [==============================] - 0s 48ms/step - loss: 1.3978 - acc: 0.0000e+00
Epoch 2/50
8/8 [==============================] - 0s 356us/step - loss: 1.3618 - acc: 0.1250
Epoch 3/50
8/8 [==============================] - 0s 201us/step - loss: 1.3313 - acc: 0.3750
Epoch 4/50
8/8 [==============================] - 0s 167us/step - loss: 1.2965 - acc: 0.7500
Epoch 5/50
8/8 [==============================] - 0s 139us/step - loss: 1.2643 - acc: 0.8750
........
........
Epoch 45/50
8/8 [==============================] - 0s 122us/step - loss: 0.3500 - acc: 1.0000
Epoch 46/50
8/8 [==============================] - 0s 140us/step - loss: 0.3376 - acc: 1.0000
Epoch 47/50
8/8 [==============================] - 0s 134us/step - loss: 0.3257 - acc: 1.0000
Epoch 48/50
8/8 [==============================] - 0s 137us/step - loss: 0.3143 - acc: 1.0000
Epoch 49/50
8/8 [==============================] - 0s 141us/step - loss: 0.3032 - acc: 1.0000
Epoch 50/50
8/8 [==============================] - 0s 177us/step - loss: 0.2925 - acc: 1.0000
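Instead of picking the epoch count by hand, one option is to stop once the loss no longer improves. The sketch below shows that rule on a list of loss values; `stop_epoch`, `patience` and `min_delta` here are hypothetical names for illustration, and the `plateau` values are made up. Keras offers `tf.keras.callbacks.EarlyStopping` for the same purpose.

```python
def stop_epoch(losses, patience=3, min_delta=1e-3):
    """Return the 1-based epoch at which training would stop, or None if it never stalls."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(losses, start=1):
        if best - loss > min_delta:
            # Meaningful improvement: remember it and reset the counter.
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# First five losses from the training log above: still improving every epoch.
logged = [1.3978, 1.3618, 1.3313, 1.2965, 1.2643]
print(stop_epoch(logged))

# Made-up losses that flatten out after the third epoch.
plateau = [0.9, 0.5, 0.3, 0.3, 0.3, 0.3]
print(stop_epoch(plateau))
```

On the logged values the rule never triggers, which is consistent with the observation that the loss was still falling and more epochs were worthwhile.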
