
Converting an HTML table to CSV in Python


Question

I am trying to convert an HTML table to a CSV file in Python. The table I am trying to extract is this one:

<table class="tblperiode">
    <caption>Dades de per&iacute;ode</caption>
    <tr>
        <th class="sortable"><span   title="Per&iacute;ode (Temps Universal)">Per&iacute;ode</span><br/>TU</th>                   
            <th><span   title="Temperatura mitjana (&deg;C)">TM</span><br/>&deg;C</th> 
            <th><span   title="Temperatura m&agrave;xima (&deg;C)">TX</span><br/>&deg;C</th>
            <th><span   title="Temperatura m&iacute;nima (&deg;C)">TN</span><br/>&deg;C</th>
            <th><span   title="Humitat relativa mitjana (%)">HRM</span><br/>%</th>
            <th><span   title="Precipitaci&oacute; (mm)">PPT</span><br/>mm</th>
            <th><span   title="Velocitat mitjana del vent (km/h)">VVM (10 m)</span><br/>km/h</th>
            <th><span   title="Direcci&oacute; mitjana del vent (graus)">DVM (10 m)</span><br/>graus</th>
            <th><span   title="Ratxa m&agrave;xima del vent (km/h)">VVX (10 m)</span><br/>km/h</th>
            <th><span   title="Irradi&agrave;ncia solar global mitjana (W/m2)">RS</span><br/>W/m<sup>2</sup></th>
    </tr>
            <tr>
                <th>
                            00:00 - 00:30            
                </th>
                                <td>16.2</td>
                                <td>16.5</td>
                                <td>15.4</td>
                                <td>93</td>
                                <td>0.0</td>
                                <td>6.5</td>
                                <td>293</td>
                                <td>10.4</td>
                                <td>0</td>
            </tr>
            <tr>
                <th>
                            00:30 - 01:00
                </th>
                                <td>16.4</td>
                                <td>16.5</td>
                                <td>16.1</td>
                                <td>90</td>
                                <td>0.0</td>
                                <td>5.8</td>
                                <td>288</td>
                                <td>8.6</td>
                                <td>0</td>
            </tr>

And I want it to look something like this:

[image: desired CSV output]

To achieve this, I parsed the HTML and managed to build a DataFrame with the data as follows:

from bs4 import BeautifulSoup
import csv
import pandas as pd

html = open("table.html").read()
soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.tblperiode")

output_rows = []
for table_row in table.findAll('tr'):
    # only the <td> cells are collected here, so the <th> cells
    # (column names and time intervals) are missing from the result
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)

df = pd.DataFrame(output_rows)
print(df)

However, I would also like to have the column names and a column indicating the time interval. In the HTML example above only two intervals appear, 00:00 - 00:30 and 00:30 - 01:00, so my table should have two rows: one with the observations for 00:00 - 00:30 and another with the observations for 00:30 - 01:00.

How can I get this information from my HTML?

Correct answer

#1

Here's one way of doing it. It's probably not the nicest way, but it works! You can read through the comments to see what the code is doing.

from bs4 import BeautifulSoup
import csv

# read the html
with open("table.html") as f:
    html = f.read()
soup = BeautifulSoup(html, 'html.parser')

# get the table from html
table = soup.select_one("table.tblperiode")

# find all rows
rows = table.find_all('tr')

# the first row holds the column headers; join the text pieces of each
# header cell with a space so that e.g. "Període" and "TU" are not run together
header_text = [th.get_text(" ", strip=True) for th in rows[0].find_all('th')]

# init row text array
row_text_array = []

# loop through the data rows; each one starts with a <th> holding the
# time interval, followed by the <td> values
for row in rows[1:]:
    # collect the stripped inner text of every cell in the row
    row_text = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    # append the text array to the row text array
    row_text_array.append(row_text)

# output csv (newline="" is what the csv module expects, and it avoids
# blank lines between rows on Windows)
with open("out.csv", "w", newline="") as f:
    wr = csv.writer(f)
    wr.writerow(header_text)
    # write every data row
    wr.writerows(row_text_array)