Converting an HTML table to CSV in Python

Question

I am trying to convert an HTML table to a CSV file in Python. The table I want to extract is this one:

<table class="tblperiode">
    <caption>Dades de per&iacute;ode</caption>
    <tr>
        <th class="sortable"><span   title="Per&iacute;ode (Temps Universal)">Per&iacute;ode</span><br/>TU</th>                   
            <th><span   title="Temperatura mitjana (&deg;C)">TM</span><br/>&deg;C</th> 
            <th><span   title="Temperatura m&agrave;xima (&deg;C)">TX</span><br/>&deg;C</th>
            <th><span   title="Temperatura m&iacute;nima (&deg;C)">TN</span><br/>&deg;C</th>
            <th><span   title="Humitat relativa mitjana (%)">HRM</span><br/>%</th>
            <th><span   title="Precipitaci&oacute; (mm)">PPT</span><br/>mm</th>
            <th><span   title="Velocitat mitjana del vent (km/h)">VVM (10 m)</span><br/>km/h</th>
            <th><span   title="Direcci&oacute; mitjana del vent (graus)">DVM (10 m)</span><br/>graus</th>
            <th><span   title="Ratxa m&agrave;xima del vent (km/h)">VVX (10 m)</span><br/>km/h</th>
            <th><span   title="Irradi&agrave;ncia solar global mitjana (W/m2)">RS</span><br/>W/m<sup>2</sup></th>
    </tr>
            <tr>
                <th>
                            00:00 - 00:30            
                </th>
                                <td>16.2</td>
                                <td>16.5</td>
                                <td>15.4</td>
                                <td>93</td>
                                <td>0.0</td>
                                <td>6.5</td>
                                <td>293</td>
                                <td>10.4</td>
                                <td>0</td>
            </tr>
            <tr>
                <th>
                            00:30 - 01:00
                </th>
                                <td>16.4</td>
                                <td>16.5</td>
                                <td>16.1</td>
                                <td>90</td>
                                <td>0.0</td>
                                <td>5.8</td>
                                <td>288</td>
                                <td>8.6</td>
                                <td>0</td>
            </tr>
</table>

And I want it to look something like this:

To do this, I parsed the HTML and managed to build a DataFrame containing the data, as follows:

from bs4 import BeautifulSoup
import csv
import pandas as pd

html = open("table.html").read()
soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.tblperiode")

output_rows = []
for table_row in table.find_all('tr'):
    # only the <td> cells of each row are collected here
    columns = table_row.find_all('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)

df = pd.DataFrame(output_rows)
print(df)

However, I would also like to have the column names and a column indicating the time interval. In the HTML example above only two intervals appear, 00:00 - 00:30 and 00:30 - 01:00. My table should therefore have two rows, one with the observations for 00:00 - 00:30 and another with the observations for 00:30 - 01:00.

How could I get this information from my HTML?

Answer 1

Accepted answer

Here's one way of doing it. It's probably not the nicest way, but it works! You can read through the comments to see what the code is doing.

from bs4 import BeautifulSoup
import csv

# read the HTML
with open("table.html") as f:
    html = f.read()
soup = BeautifulSoup(html, 'html.parser')

# get the table from the HTML
table = soup.select_one("table.tblperiode")

# find all rows
rows = table.find_all('tr')

# the first row holds the column headers
headers = rows[0]
header_text = []

# add the text of each header cell to the list
for th in headers.find_all('th'):
    header_text.append(th.text)

# init the list of data rows
row_text_array = []

# loop through the remaining rows and collect the text of each one
for row in rows[1:]:
    row_text = []
    # loop through the cells; the <th> holds the time interval, the <td>s hold the values
    for row_element in row.find_all(['th', 'td']):
        # append the cell's inner text, without newlines or surrounding whitespace
        row_text.append(row_element.text.replace('\n', '').strip())
    # append this row's list of cell texts to the list of rows
    row_text_array.append(row_text)

# output csv (newline="" stops the csv module from writing blank lines on Windows)
with open("out.csv", "w", newline="") as f:
    wr = csv.writer(f)
    wr.writerow(header_text)
    # write each data row
    wr.writerows(row_text_array)
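
If BeautifulSoup is not available, roughly the same extraction can be sketched with only the standard library's html.parser and csv modules. This is a different technique from the code above, not a drop-in replacement. The names SAMPLE_HTML, TableToRows and table_to_csv are made up for this illustration, and the sample is a shortened, three-column version of the table from the question. The sketch also assumes the table has no nested tables and no cells spanning several rows or columns.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical, shortened sample in the same shape as the table in the question.
SAMPLE_HTML = """
<table class="tblperiode">
  <tr>
    <th><span title="Per&iacute;ode (Temps Universal)">Per&iacute;ode</span><br/>TU</th>
    <th><span title="Temperatura mitjana (&deg;C)">TM</span><br/>&deg;C</th>
    <th><span title="Humitat relativa mitjana (%)">HRM</span><br/>%</th>
  </tr>
  <tr>
    <th>
        00:00 - 00:30
    </th>
    <td>16.2</td>
    <td>93</td>
  </tr>
  <tr>
    <th>
        00:30 - 01:00
    </th>
    <td>16.4</td>
    <td>90</td>
  </tr>
</table>
"""


class TableToRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> (illustrative helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, None outside a <tr>
        self._cell = None  # text fragments of the current cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []
        elif tag == "br" and self._cell is not None:
            # treat a line break as a space so "Període" and "TU" stay apart
            self._cell.append(" ")

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # join the fragments and collapse newlines and runs of spaces
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableToRows()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


csv_text = table_to_csv(SAMPLE_HTML)
print(csv_text)
# Període TU,TM °C,HRM %
# 00:00 - 00:30,16.2,93
# 00:30 - 01:00,16.4,90
```

Because the header row and the data rows go through the same cell handling, the time interval in each row's th ends up as the first CSV column, under the first header. Inserting a space only at br keeps a header such as W/m2 in one piece, since the sup tag adds no separator.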
