Getting BeautifulSoup to handle PHP tags correctly or ignore them

Question

I currently need to parse a lot of .phtml files, find specific HTML tags, and add a custom data attribute to them. I'm using Python's BeautifulSoup to parse the entire document and add the attribute, and this part works just fine.

The problem is that the view files (.phtml) also contain PHP tags, and those get parsed too. Below is an example of the input and the resulting output.

Input

<?php

$stars = $this->getData('sideBarCoStars', []);

if (!$stars) return;

$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?>
<header>
    <h3>
        <a href="https://www.it1352.com/<?php echo $viewAllUrl; ?>" class="noContentLink white">
        <?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
        </a>
    </h3>

Output

<?php
$stars = $this->
getData('sideBarCoStars', []);

if (!$stars) return;

$sideBarCoStarsCount = $this-&gt;getData('sideBarCoStarsCount');
$title = $this-&gt;getData('sideBarCoStarsTitle');
$viewAllUrl = $this-&gt;getData('sideBarCoStarsViewAllUrl');
$isDomain = $this-&gt;getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this-&gt;getData('emptyImageData');
?&gt;
<header>
 <h3>
  <a   href="https://www.it1352.com/&lt;?php echo $viewAllUrl; ?&gt;">
   <?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
  </a>
 </h3>

I tried different approaches, but did not succeed in making BeautifulSoup ignore the PHP tags. Is it possible to give html.parser custom rules so that it ignores them, or to tell BeautifulSoup to do so? Thanks!

Answer 1

Accepted answer

Your best bet is to remove all of the PHP elements before giving the document to BeautifulSoup to parse. This can be done by using a regular expression to find every PHP section and replace it with safe placeholder text.

After carrying out all of your modifications using BeautifulSoup, the original PHP expressions can then be put back in place of the placeholders.
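To see why the PHP has to be masked before parsing, it can help to look at how Python's standard html.parser module (which BeautifulSoup's "html.parser" backend builds on) tokenizes a PHP block. The sketch below uses only the standard library; the Recorder class and the sample string are made up for illustration. As far as I can tell, the parser treats `<?` as the start of a processing instruction and ends it at the first `>`, which here is the one inside `->`.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Hypothetical helper that records the PI and text events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_pi(self, data):
        # Called for "<? ... >" constructs; data excludes the delimiters.
        self.events.append(("pi", data))

    def handle_data(self, data):
        # Called for plain text between tags.
        self.events.append(("data", data))


parser = Recorder()
parser.feed("<?php $x = $this->getData('x'); ?><p>hi</p>")
parser.close()

for kind, text in parser.events:
    print(kind, repr(text))
```

The first event should contain only the part of the PHP block up to `$this-`, with the remainder arriving as ordinary text. That would match the `$this-&gt;` escaping seen in the question's output.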

As the PHP can appear anywhere, including inside a quoted attribute value, it is best to use a simple unique string as the placeholder rather than trying to wrap the PHP in an HTML comment (see php_sig in the code below).

re.sub() can be given a function as its replacement argument. Each time a substitution is made, the original PHP code is stored in a list (php_elements). The reverse is done afterwards: search for all instances of php_sig and replace each one with the next element from php_elements. If all goes well, php_elements should be empty at the end. If it is not, your modifications have caused a placeholder to be removed.

from bs4 import BeautifulSoup
import re

html = """<html>
<body>

<?php 
$stars = $this->getData('sideBarCoStars', []);

if (!$stars) return;

$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?>

<header>
    <h3>
        <a href="https://www.it1352.com/<?php echo $viewAllUrl; ?>" class="noContentLink white">
        <?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
        </a>
    </h3>

</body>"""

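# Placeholder that stands in for each PHP block while BeautifulSoup works
php_sig = '!!!PHP!!!'
# Original PHP blocks, stored in document order
php_elements = []

def php_remove(m):
    # Save the matched PHP block and substitute the placeholder
    php_elements.append(m.group())
    return php_sig

def php_add(m):
    # Restore the PHP blocks in the same order they were removed
    return php_elements.pop(0)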

# Pre-parse HTML to remove all PHP elements
html = re.sub(r'<\?php.*?\?>', php_remove, html, flags=re.S | re.M)

soup = BeautifulSoup(html, "html.parser")

# Make modifications to the soup
# Do not remove any elements containing PHP elements

# Post-parse HTML to replace the PHP elements
html = re.sub(php_sig, php_add, soup.prettify())

print(html)
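The answer notes that php_elements should be empty at the end, but the code above does not check this. A possible variation is sketched below. It is an assumption-laden rewrite, not the answer's own code: the names mask_php, unmask_php, and the PHPMASK token format are hypothetical, the pattern is widened to also catch the short echo form `<?= ... ?>`, and each placeholder carries an index so that restoration does not depend on the placeholders staying in their original order. It uses only the standard library; a plain string replacement stands in for the BeautifulSoup editing step.

```python
import re

# Matches <?php ... ?> and the short echo form <?= ... ?>.
# (Handling <?= is an addition; the original pattern only covers <?php.)
PHP_BLOCK = re.compile(r"<\?(?:php|=).*?\?>", re.S)


def mask_php(source, prefix="PHPMASK"):
    """Replace each PHP block with a numbered token; return (masked, blocks)."""
    blocks = []

    def stash(match):
        blocks.append(match.group())
        return f"{prefix}{len(blocks) - 1}X"

    return PHP_BLOCK.sub(stash, source), blocks


def unmask_php(masked, blocks, prefix="PHPMASK"):
    """Put the PHP blocks back; raise ValueError if any token went missing."""
    seen = set()

    def restore(match):
        index = int(match.group(1))
        seen.add(index)
        return blocks[index]

    result = re.sub(rf"{re.escape(prefix)}(\d+)X", restore, masked)
    missing = set(range(len(blocks))) - seen
    if missing:
        raise ValueError(f"placeholders lost during editing: {sorted(missing)}")
    return result


template = '<a href="<?php echo $url; ?>"><?= $title ?></a>'
masked, blocks = mask_php(template)

# Stand-in for the BeautifulSoup step: add a data attribute to the link.
edited = masked.replace("<a ", '<a data-track="1" ')

restored = unmask_php(edited, blocks)
print(restored)
```

In real use, the string replacement would be swapped for parsing `masked` with BeautifulSoup, modifying the tree, and passing the serialized result to unmask_php. The ValueError makes it obvious when an edit has accidentally dropped an element that contained PHP.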

