
Large, persistent DataFrame in pandas



As a long-time SAS user, I am exploring switching to Python and pandas.

However, while running some tests today, I was surprised that Python ran out of memory when trying to pandas.read_csv() a 128 MB CSV file. It had about 200,000 rows and 200 columns of mostly numeric data.

With SAS, I can import a CSV file into a SAS dataset, and it can be as large as my hard drive.

Is there something analogous in pandas?


I regularly work with large files and do not have access to a distributed computing network.



In principle it shouldn't run out of memory, but read_csv currently has memory problems with large files, caused by some complex Python internal issues (this is vague, but it has been known for a long time: http://github.com/pydata/pandas/issues/407).

At the moment there isn't a perfect solution, but it's a problem I'll be working on in the near future. A tedious workaround is to transcribe the file row by row into a pre-allocated NumPy array or a memory-mapped file (np.memmap). Another is to read the file in smaller pieces (use iterator=True, chunksize=1000) and then concatenate them with pd.concat. The problem arises when you pull the entire text file into memory in one big slurp.
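A minimal sketch of both workarounds, under some assumptions: the helper names read_csv_in_chunks and csv_to_memmap are hypothetical, the file is assumed to be all numeric with a header row, and the row and column counts are assumed to be known in advance for the memory-mapped variant. The second helper copies chunk by chunk rather than strictly row by row, which is a small variation on the idea above. Passing chunksize on its own is enough to get an iterator back from read_csv.

```python
import numpy as np
import pandas as pd


def read_csv_in_chunks(path, chunksize=1000):
    """Read a CSV piece by piece and concatenate the pieces into one DataFrame."""
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of parsing the whole file in a single pass.
    reader = pd.read_csv(path, chunksize=chunksize)
    return pd.concat(list(reader), ignore_index=True)


def csv_to_memmap(path, mmap_path, n_rows, n_cols, chunksize=1000):
    """Copy an all-numeric CSV into a pre-allocated, disk-backed NumPy array.

    n_rows and n_cols must be known up front (an assumption of this sketch).
    """
    # mode="w+" creates (or overwrites) the backing file on disk.
    out = np.memmap(mmap_path, dtype="float64", mode="w+", shape=(n_rows, n_cols))
    start = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        stop = start + len(chunk)
        out[start:stop] = chunk.to_numpy(dtype="float64")
        start = stop
    out.flush()  # make sure the data is written to the backing file
    return out
```

The chunked version still ends with the full DataFrame in memory, so it only avoids the parsing overhead of a single large read. The memory-mapped version keeps the data on disk, which is closer to how a SAS dataset behaves.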

