Python Crawler Project 1: Scraping the Answers to School Practice Problems

Inspecting the website

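The page source contains only the site's layout and styling. Capturing the network traffic in the browser's developer tools (F12), however, turns up a response named [practice] that holds two identifiers for every question: "label" and "questionBh".
[Figure: The practice problem home page]
After opening a question and clicking to view the answer, the captured response contains the complete answer under "data".
![Opening a question and clicking the reference answer](/images/3.png)
The headers of this second captured request show that the URL has the form http://[some-domain]={questionBh}
[Figure: Headers of the second captured request]
So the answer can be reached through the chain label -> questionBh -> answer.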

Writing the code

First, the label -> questionBh step:

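import re

import requests

url = 'http://[some-domain]/api/exam/question/practice'

# Send a browser-like User-Agent with the request
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 Edg/130.0.0.0"
}
resp = requests.get(url, headers=headers)

# Capture the whole label up to its closing quote, then the questionBh that follows it;
# trim the label afterwards if it holds more than the question number
obj = re.compile(r'"label":"(?P<num>.*?)".*?"questionBh":"(?P<web>.*?)"', re.S)
result = obj.finditer(resp.text)

# Print each (label, questionBh) pair
for i in result:
    print(i.group("num", "web"))

resp.close()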
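
To check the pattern without touching the network, it can be run against a small sample string. The sketch below is only an illustration: the sample text, the "title" field, the second questionBh value, and the helper name extract_pairs are all made up for the example, and the real response may be laid out differently.

```python
import re

# Hypothetical excerpt of the practice response; the real field layout may differ.
sample = (
    '{"data":[{"label":"5","title":"demo","questionBh":"02ac8e3a76404d668bea36a31c90fc7b"},'
    '{"label":"6","title":"demo","questionBh":"0123456789abcdef0123456789abcdef"}]}'
)

# Capture the label up to its closing quote, then lazily skip to the next questionBh.
pattern = re.compile(r'"label":"(?P<num>[^"]*)".*?"questionBh":"(?P<web>[^"]*)"', re.S)


def extract_pairs(text):
    """Return (label, questionBh) tuples in the order they appear in the text."""
    return [m.group("num", "web") for m in pattern.finditer(text)]


pairs = extract_pairs(sample)
for num, bh in pairs:
    print(num, bh)
```

Using [^"]* instead of .*? inside the quotes keeps each capture from running past its own closing quote, which makes the pattern a little easier to reason about.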

Then tidy up the output and save it in CSV format, for example:

num,questionBh
5,02ac8e3a76404d668bea36a31c90fc7b
*,********************************
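
The conversion to CSV can also be scripted rather than done by hand. This is a minimal sketch, assuming the pairs are already available as (num, questionBh) tuples. The helper name save_mapping is hypothetical, and the utf-16be default is chosen only so the file matches the encoding that the loading code later in this post expects.

```python
import csv
import os
import tempfile


def save_mapping(pairs, file_path, encoding="utf-16be"):
    """Write (num, questionBh) pairs under a num,questionBh header row."""
    # newline="" lets the csv module control the line endings itself.
    with open(file_path, mode="w", encoding=encoding, newline="") as file:
        writer = csv.writer(file)
        writer.writerow(["num", "questionBh"])
        writer.writerows(pairs)


# One pair taken from the sample above; the file goes to a temporary directory.
pairs = [("5", "02ac8e3a76404d668bea36a31c90fc7b")]
demo_path = os.path.join(tempfile.mkdtemp(), "mapping.csv")
save_mapping(pairs, demo_path)

# Read the file back to confirm the header and the row round-trip.
with open(demo_path, encoding="utf-16be", newline="") as file:
    rows = list(csv.reader(file))
print(rows)
```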

Next, the questionBh -> answer step:

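import re

import requests

# Ask for the questionBh to look up
query = input("Enter the questionBh to search for\n")
url = f'http://[some-domain]/api/exam/question/getAnswer?questionBh={query}&studentId=2024011666'

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 Edg/130.0.0.0"
}
resp = requests.get(url, headers=headers)

# Capture everything between "data":" and the closing "}
obj = re.compile(r'"data":"(?P<ans>.*?)"}', re.S)
result = obj.finditer(resp.text)

# Print the captured answer text (escape sequences are still in raw form here)
for i in result:
    print(i.group("ans"))

resp.close()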


Combining the two steps, linked through the CSV file:

import csv
import re

import requests


def load_mapping(file_path):
    """Read the num -> questionBh table from the CSV file."""
    mapping = {}
    # The encoding must match the one mapping.csv was saved with
    with open(file_path, mode='r', encoding='utf-16be') as file:
        reader = csv.reader(file)
        next(reader)  # skip the header row
        for row in reader:
            num, questionBh = row
            mapping[num] = questionBh
    return mapping


def get_question_bh(num, mapping):
    """Look up the questionBh for a question number."""
    return mapping.get(num, "Not Found")


file_path = 'mapping.csv'

k = input("Enter the question number: ")

# Load the num -> questionBh table
mapping = load_mapping(file_path)

# Look up the questionBh for the requested question
question_bh = get_question_bh(k, mapping)
print(f"Resulting questionBh: {question_bh}")

url = f'http://[some-domain]/api/exam/question/getAnswer?questionBh={question_bh}'

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 Edg/130.0.0.0"
}
resp = requests.get(url, headers=headers)

# Capture everything between "data":" and the closing "}
obj = re.compile(r'"data":"(?P<ans>.*?)"}', re.S)
result = obj.finditer(resp.text)

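Using ast to decode escape sequences

While learning, I found that decoding the escape sequences with ast.literal_eval() gets rid of the stray characters in the output, which saves the trouble of pasting it into an AI tool for conversion.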

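import ast

# Continues from the combined script above: result holds the regex matches
ans_list = [i.group("ans") for i in result]

# Join all captured pieces into a single string
a = ''.join(ans_list)

# Use ast.literal_eval() to safely decode the escape sequences
escaped_text = ast.literal_eval(f'"""{a}"""')
print(escaped_text)


resp.close()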
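
The effect of this step is easier to see on a small input. The sketch below wraps the same call in a helper. The name decode_escapes and the sample text are hypothetical stand-ins for a captured answer.

```python
import ast


def decode_escapes(raw):
    """Turn literal escape sequences such as \\n into the characters they stand for."""
    # Triple quotes let the text contain escaped double quotes without ending the literal.
    return ast.literal_eval(f'"""{raw}"""')


# Hypothetical captured text: each backslash sequence arrives as two separate characters.
raw = r'print(\"hi\")\nprint(\"bye\")'
decoded = decode_escapes(raw)
print(decoded)
```

One limitation worth knowing: text that ends in an unescaped double quote, or that contains three double quotes in a row, breaks the triple-quoted wrapper and raises an error.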

Complete code

import ast
import csv
import re

import requests


def load_mapping(file_path):
    """Read the num -> questionBh table from the CSV file."""
    mapping = {}
    # The encoding must match the one mapping.csv was saved with
    with open(file_path, mode='r', encoding='utf-16be') as file:
        reader = csv.reader(file)
        next(reader)  # skip the header row
        for row in reader:
            num, questionBh = row
            mapping[num] = questionBh
    return mapping


def get_question_bh(num, mapping):
    """Look up the questionBh for a question number."""
    return mapping.get(num, "Not Found")


# Example input
file_path = 'mapping.csv'

k = input("Enter the question number: ")

# Load the num -> questionBh table
mapping = load_mapping(file_path)

# Look up the matching questionBh
question_bh = get_question_bh(k, mapping)
print(f"Resulting questionBh: {question_bh}")

url = f'http://[some-domain]/api/exam/question/getAnswer?questionBh={question_bh}'

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 Edg/130.0.0.0"
}
resp = requests.get(url, headers=headers)

# Capture everything between "data":" and the closing "}
obj = re.compile(r'"data":"(?P<ans>.*?)"}', re.S)
result = obj.finditer(resp.text)

ans_list = [i.group("ans") for i in result]

# Join all captured pieces into a single string
a = ''.join(ans_list)

# Use ast.literal_eval() to safely decode the escape sequences
escaped_text = ast.literal_eval(f'"""{a}"""')
print(escaped_text)

resp.close()
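
As an aside, the regular expression above suggests that the response body is JSON. If that assumption holds, a JSON parser could stand in for both the regex and the ast step, since it decodes the escape sequences while parsing. This is a different technique from the one used in this post. The helper name extract_answer and the sample body, including its "code" field, are hypothetical.

```python
import json


def extract_answer(body):
    """Return the decoded "data" string from a JSON response body, or None."""
    payload = json.loads(body)
    data = payload.get("data")
    return data if isinstance(data, str) else None


# Hypothetical response body shaped like the regex pattern: {..., "data": "..."}
sample_body = '{"code":200,"data":"print(\\"hi\\")\\nprint(\\"bye\\")"}'
answer = extract_answer(sample_body)
print(answer)
```

With requests, resp.json() performs the same parsing on a live response, provided the server really does return valid JSON.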

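Result

The output is shown in the figure below. After a few edits, it can be copied straight into the submission window.
[Figure: The generated output]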

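References

[bilibili] Python crawler tutorial: from this video I learned how to capture network traffic, the re module, regular expressions, and the very basics of how client-server communication works
Online regular expression tester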