Getting BeautifulSoup to handle PHP tags correctly, or to ignore them
Question
I currently need to parse a lot of .phtml files, find specific HTML tags, and add a custom data attribute to them. I'm using Python's BeautifulSoup to parse the entire document and add the attribute, and that part works fine.
The problem is that the view files (.phtml) also contain PHP tags, and those get parsed too. Below is an example of the input and the resulting output.
Input
<?php
$stars = $this->getData('sideBarCoStars', []);
if (!$stars) return;
$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?>
<header>
<h3>
<a href="https://www.it1352.com/<?php echo $viewAllUrl; ?>" class="noContentLink white">
<?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
</a>
</h3>
Output
<?php
$stars = $this->
getData('sideBarCoStars', []);
if (!$stars) return;
$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?>
<header>
<h3>
<a href="https://www.it1352.com/<?php echo $viewAllUrl; ?>">
<?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
</a>
</h3>
I tried different approaches but could not make BeautifulSoup ignore the PHP tags. Is it possible to give html.parser, or BeautifulSoup itself, custom rules to ignore them? Thanks!
Accepted answer
Your best bet is to remove all of the PHP elements before giving the document to BeautifulSoup to parse. This can be done with a regular expression that spots every PHP section and replaces it with safe placeholder text.
After carrying out all of your modifications with BeautifulSoup, the original PHP expressions can be put back in place of the placeholders.
As the PHP can be anywhere, including inside a quoted string, it is best to use a simple unique string as the placeholder rather than trying to wrap the PHP in an HTML comment (see php_sig).
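To illustrate why a plain string works as a placeholder, here is a minimal sketch. It assumes the PHP has already been swapped for the placeholder and only checks that the placeholder passes through BeautifulSoup unchanged, both inside a quoted attribute value and as element text. An HTML comment placed inside the href would not become a comment node there; it would simply be part of the attribute text.

```python
from bs4 import BeautifulSoup

php_sig = '!!!PHP!!!'  # same placeholder text as used in the answer

# A fragment whose PHP has already been replaced by the placeholder,
# once inside a quoted attribute and once as element text.
masked = '<a href="https://www.it1352.com/!!!PHP!!!" class="noContentLink white">!!!PHP!!!</a>'

soup = BeautifulSoup(masked, "html.parser")
link = soup.find("a")
serialized = str(soup)

print(link["href"])                 # attribute value still carries the placeholder
print(serialized.count(php_sig))    # number of placeholders left after serializing
```

Because the placeholder contains no angle brackets, quotes, or ampersands, the parser and the serializer have no reason to split or escape it.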
re.sub() can be given a function as its replacement argument. Each time a substitution is made, the original PHP code is stored in a list (php_elements). Afterwards the reverse is done: every instance of php_sig is found and replaced with the next element from php_elements. If all goes well, php_elements should be empty at the end; if it is not, your modifications have caused a placeholder to be removed.
from bs4 import BeautifulSoup
import re
html = """<html>
<body>
<?php
$stars = $this->getData('sideBarCoStars', []);
if (!$stars) return;
$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?>
<header>
<h3>
<a href="https://www.it1352.com/<?php echo $viewAllUrl; ?>" class="noContentLink white">
<?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
</a>
</h3>
</body>"""
php_sig = '!!!PHP!!!'
php_elements = []
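def php_remove(m):
    # Store the matched PHP block and leave the placeholder in its place
    php_elements.append(m.group())
    return php_sig

def php_add(m):
    # Hand back the stored PHP blocks in their original order
    return php_elements.pop(0)

# Pre-parse HTML to remove all PHP elements
html = re.sub(r'<\?php.*?\?>', php_remove, html, flags=re.S | re.M)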
soup = BeautifulSoup(html, "html.parser")
# Make modifications to the soup
# Do not remove any elements containing PHP elements
# Post-parse HTML to replace the PHP elements
html = re.sub(php_sig, php_add, soup.prettify())
print(html)
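The script above leaves the modification step as a comment and does not verify that every placeholder survived. The following is one possible sketch of the full round trip, not the answer author's code: the function name add_attribute_keeping_php, the attribute name data-section, and its value co-stars are all hypothetical. It also uses str(soup) instead of prettify(), on the assumption that you want to keep the file's existing layout, since prettify() re-indents the document.

```python
import re
from bs4 import BeautifulSoup

PHP_BLOCK = re.compile(r'<\?php.*?\?>', flags=re.S)
PHP_SIG = '!!!PHP!!!'  # assumed not to occur anywhere in the source file


def add_attribute_keeping_php(source, tag_name, attr, value):
    """Add attr=value to every tag_name element, leaving PHP blocks untouched."""
    stashed = []

    def stash(match):
        # Remember the PHP block and leave the placeholder behind
        stashed.append(match.group())
        return PHP_SIG

    masked = PHP_BLOCK.sub(stash, source)
    soup = BeautifulSoup(masked, "html.parser")
    for tag in soup.find_all(tag_name):
        tag[attr] = value

    output = str(soup)
    # Fail loudly if an edit dropped or duplicated a placeholder
    if output.count(PHP_SIG) != len(stashed):
        raise ValueError("a PHP placeholder was lost or duplicated during editing")

    restored = iter(stashed)
    return re.sub(re.escape(PHP_SIG), lambda m: next(restored), output)


phtml = '''<?php $viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl'); ?>
<h3>
<a href="https://www.it1352.com/<?php echo $viewAllUrl; ?>" class="noContentLink white">
<?php echo "{$title}"; ?>
</a>
</h3>'''

result = add_attribute_keeping_php(phtml, "a", "data-section", "co-stars")
print(result)
```

Keeping the stored blocks in a local list means the function can be called once per file without resetting any global state. Restoring by position assumes the edits do not reorder elements that contain placeholders.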