How to delete historical versions of edited files and retain only the latest version

Dear OnlyOffice Team,

We have integrated OnlyOffice into our application to enable editing and previewing of documents uploaded by users to our server. OnlyOffice works exceptionally well; users find it very smooth, and it fully meets our expectations. The only issue we’ve encountered relates to OnlyOffice’s versioning mechanism.

By examining the OnlyOffice working directory on the server (/var/lib/onlyoffice/documentserver/App_Data/cache/files/data), we observed the following workflow for file editing:

  1. When a user first opens a file via OnlyOffice, an Editor.bin file and a media folder are created under App_Data/cache/files/data/${fileKey}/.
  2. After editing, the changes are saved to a new directory: App_Data/cache/files/data/${fileKey}_\d+/. This directory contains three files: changes.zip, changeHistory.json, and output.${fileType}.
  3. Every edit generates a new directory with these three files.

This poses a problem: if the original file is large, multiple edits (even minor ones) will create numerous copies of output.${fileType}, consuming significant storage space.
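To make the storage cost concrete, here is a small self-contained sketch (mock directories and dummy 1 MiB files, not a real Document Server cache) showing how each version directory carries its own full copy of the output file:

```shell
# Simulate the cache layout described above with throwaway data.
root=$(mktemp -d)
mkdir -p "$root/abc123" "$root/abc123_1" "$root/abc123_2"

# One full output file per saved version (1 MiB of zeros here).
head -c 1048576 /dev/zero > "$root/abc123_1/output.docx"
head -c 1048576 /dev/zero > "$root/abc123_2/output.docx"

# Each ${fileKey}_N directory occupies the full file size again.
du -sk "$root"/abc123_*

count=$(ls -d "$root"/abc123_* | wc -l)
rm -rf "$root"
```

With a large source document, every save multiplies this footprint, which is exactly the growth described above.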

Our Questions:

  1. Can OnlyOffice be configured to retain only the latest version of edits? If so, how?
  2. If configuration isn’t possible, can we manually delete intermediate versions? Based on our analysis of the directory structure before/after edits, we suspect OnlyOffice works as follows:
  • The original file is converted to Editor.bin upon first open and remains unchanged.
  • Each edit generates a changes.zip, which functions like a diff/patch file. The output.${fileType} is generated by applying changes.zip to Editor.bin.

If our understanding is correct, we could safely delete intermediate versions and keep only the latest edit results. We plan to identify the latest version using the last_open_date field in the task_result table of the PostgreSQL database.
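If the numeric suffix in `${fileKey}_\d+` does increase monotonically with each save (an assumption on our part; `last_open_date` in `task_result` would be the authoritative check), the pruning logic could be sketched as follows. The function name and the sample directory names are hypothetical:

```python
import re

def prunable_versions(dir_names, file_key):
    """Return all version directories for file_key except the one with
    the highest numeric suffix, which is assumed to be the newest save."""
    pattern = re.compile(re.escape(file_key) + r"_(\d+)$")
    versions = []
    for name in dir_names:
        m = pattern.fullmatch(name)
        if m:
            versions.append((int(m.group(1)), name))
    if len(versions) <= 1:
        return []  # nothing intermediate to delete
    versions.sort()
    # Keep the last (highest-numbered) entry; everything else is prunable.
    return [name for _, name in versions[:-1]]

dirs = ["abc123", "abc123_1", "abc123_2", "abc123_3", "other_1"]
print(prunable_versions(dirs, "abc123"))  # → ['abc123_1', 'abc123_2']
```

This keeps the original `${fileKey}` directory (holding `Editor.bin`) and the latest `${fileKey}_N` directory, deleting only the intermediate saves — again, assuming our model of the workflow is correct.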

Would there be any side effects if we programmatically delete these intermediate versions?


Hello @isNaN

Document Server does indeed store files in its cache, but clearing them manually can cause various issues with document opening. By default, Document Server performs cache clearing once every 24 hours (I've posted more details about the cleaning process here).

services.CoAuthoring.expire.files is the lifetime of a cached file that was successfully edited and saved back to the storage. However, Document Server does not delete such files immediately after their lifetime ends; it does so on the schedule set in services.CoAuthoring.expire.filesCron. So, by default, at 12 a.m. each day Document Server checks whether its cache contains files that are 24 hours (or more) old and deletes them.

If you want, you can change the cron job timings according to your needs so that cached files are removed more frequently. I'd suggest sticking to this approach because, as I mentioned before, manual removal of items in the cache may cause serious problems with the integration.
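For reference, both settings live under `services.CoAuthoring.expire` in the Document Server configuration (typically /etc/onlyoffice/documentserver/local.json). The values below are illustrative, not the defaults: `files` is the cache lifetime in seconds (here one hour), and `filesCron` is a six-field cron expression including seconds (here, run at the top of every hour):

```json
{
  "services": {
    "CoAuthoring": {
      "expire": {
        "files": 3600,
        "filesCron": "00 00 */1 * * *"
      }
    }
  }
}
```

After editing local.json, restart the Document Server services so the new values take effect.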