
Getting BeautifulSoup to handle PHP tags correctly or ignore them


Question

I currently need to parse a lot of .phtml files, find specific HTML tags, and add a custom data attribute to them. I'm using Python's BeautifulSoup to parse the entire document and add the attributes, and this part works fine.

The problem is that the view files (.phtml) also contain PHP tags, and these get parsed and mangled too. Below is an example of input and output.

Input

<?php

$stars = $this->getData('sideBarCoStars', []);

if (!$stars) return;

$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?>
<header>
    <h3>
        <a href="https://www.it1352.com/<?php echo $viewAllUrl; ?>" class="noContentLink white">
        <?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
        </a>
    </h3>

Output

<?php
$stars = $this->
getData('sideBarCoStars', []);

if (!$stars) return;

$sideBarCoStarsCount = $this-&gt;getData('sideBarCoStarsCount');
$title = $this-&gt;getData('sideBarCoStarsTitle');
$viewAllUrl = $this-&gt;getData('sideBarCoStarsViewAllUrl');
$isDomain = $this-&gt;getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this-&gt;getData('emptyImageData');
?&gt;
<header>
 <h3>
  <a   href="https://www.it1352.com/&lt;?php echo $viewAllUrl; ?&gt;">
   <?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
  </a>
 </h3>

I tried different approaches but could not get BeautifulSoup to ignore the PHP tags. Is it possible to give html.parser custom rules so that it ignores them, or to make BeautifulSoup do so? Thanks!

Accepted answer

#1

Your best bet is to remove all of the PHP elements before giving the document to BeautifulSoup to parse. This can be done with a regular expression that finds every PHP section and replaces it with safe placeholder text.

After carrying out all of your modifications with BeautifulSoup, the placeholders can be swapped back for the original PHP expressions.

As the PHP can be anywhere, including inside a quoted string such as an attribute value, it is best to use a simple unique string as the placeholder rather than trying to wrap it in an HTML comment (see php_sig).
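To illustrate why a plain string is the safer placeholder, here is a minimal sketch that parses the same link twice: once with a comment-style placeholder inside the href, once with the plain php_sig text. The comment-style marker `<!--PHP-->` and the sample markup are assumptions made up for this comparison; they do not come from the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical comment-style placeholder inside a quoted attribute value.
as_comment = '<a href="<!--PHP-->">link</a>'
# The same position filled with the plain-text placeholder (php_sig).
as_plain = '<a href="!!!PHP!!!">link</a>'

# Parse and serialize both variants with the parser used in the question.
comment_out = str(BeautifulSoup(as_comment, "html.parser"))
plain_out = str(BeautifulSoup(as_plain, "html.parser"))

print(comment_out)  # the angle brackets in the attribute come back entity-escaped
print(plain_out)    # the plain placeholder is still there, character for character
```

Inside an attribute the comment is just text, so BeautifulSoup escapes its angle brackets on output and a later search for the literal marker would miss it. The plain string has no characters that need escaping.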

re.sub() accepts a function as the replacement. Each time a substitution is made, the original PHP code is stored in a list (php_elements). Afterwards the reverse is done: every instance of php_sig is replaced with the next element from php_elements. If all goes well, php_elements should be empty at the end. If it is not, your modifications have removed a placeholder.

from bs4 import BeautifulSoup
import re

html = """<html>
<body>

<?php 
$stars = $this->getData('sideBarCoStars', []);

if (!$stars) return;

$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?>

<header>
    <h3>
        <a href="https://www.it1352.com/<?php echo $viewAllUrl; ?>" class="noContentLink white">
        <?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
        </a>
    </h3>

</body>"""

php_sig = '!!!PHP!!!'
php_elements = []

def php_remove(m):
    php_elements.append(m.group())
    return php_sig

def php_add(m):
    return php_elements.pop(0)

# Pre-parse HTML to remove all PHP elements
html = re.sub(r'<\?php.*?\?>', php_remove, html, flags=re.S | re.M)

soup = BeautifulSoup(html, "html.parser")

# Make modifications to the soup
# Do not remove any elements containing PHP elements

# Post-parse HTML to replace the PHP elements
html = re.sub(php_sig, php_add, soup.prettify())

print(html)
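The script above leaves the "php_elements should be empty at the end" check to the reader. As a rough sketch, the same mask-and-restore steps could be wrapped in helpers that raise an error when a placeholder goes missing. The names mask_php, restore_php and edit_phtml are hypothetical, and the BeautifulSoup step is passed in as a callable rather than hard-coded.

```python
import re

PHP_SIG = "!!!PHP!!!"  # same placeholder text as php_sig above
PHP_BLOCK = re.compile(r"<\?php.*?\?>", re.S)  # re.S lets '.' span newlines


def mask_php(source):
    """Swap every PHP block for PHP_SIG and return (masked_text, blocks)."""
    blocks = []

    def stash(match):
        blocks.append(match.group())
        return PHP_SIG

    return PHP_BLOCK.sub(stash, source), blocks


def restore_php(masked, blocks):
    """Put the stored PHP blocks back in order and verify none were lost."""
    remaining = list(blocks)

    def unstash(match):
        if not remaining:
            raise ValueError("found more placeholders than stored PHP blocks")
        return remaining.pop(0)

    restored = re.sub(re.escape(PHP_SIG), unstash, masked)
    if remaining:
        raise ValueError(f"{len(remaining)} PHP block(s) lost their placeholder")
    return restored


def edit_phtml(source, edit):
    """Mask PHP, apply edit() to the PHP-free markup, then restore the PHP."""
    masked, blocks = mask_php(source)
    return restore_php(edit(masked), blocks)
```

Here edit would be a function that parses the masked text with BeautifulSoup, makes its changes, and returns the serialized result. Note that this pattern only matches blocks that open with `<?php` and close with `?>`, so short echo tags (`<?=`) and a block left unclosed at the end of a file would pass through unmasked.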
