HTML to WordPress Import Project

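This project provides a complete workflow for cleaning HTML files and importing them into WordPress. It includes tools that strip unwanted markup and then import the result correctly, preserving nested structures such as audio players, videos, and embedded content.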

Table of Contents

Project Overview
Directory Structure
Requirements
Usage
Processing Steps
Importing into WordPress
Troubleshooting

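Project Overview

The project consists of two main components:

HTML processing script (process_all.py): cleans the HTML files, removing unwanted elements while keeping the content structure
WordPress import script (import-html.php): imports the processed HTML files into WordPress as published posts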

Directory Structure

c:\newseo\SAPSEO\cache\original-html\
├── chinese/          # Original HTML files
├── chinese-fixed/    # Processed HTML files (generated)
├── process_all.py    # HTML processing script
├── import-html.php   # WordPress import script
└── README.md         # This file

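Requirements

1. Python environment

Python 3.6+
No additional Python packages required (standard library only: os, re)

2. PHP environment (for the WordPress import)

PHP 7.0+
A WordPress installation (local or remote)
WordPress must be reachable from the location of the script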

3. WP-CLI installation

WP-CLI is the WordPress command-line tool and makes WordPress easier to manage.

Installation steps:

Download WP-CLI:

```bash
# On Windows, using PowerShell
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/wp-cli/builds/gh-pages/phar/wp-cli.phar" -OutFile "wp-cli.phar"

# On macOS/Linux
curl -O https://raw.githubusercontent.com/wp-cli/builds/gh-pages/phar/wp-cli.phar
```

Verify the installation:

```bash
php wp-cli.phar --info
```

Make it executable (macOS/Linux only):

```bash
chmod +x wp-cli.phar
sudo mv wp-cli.phar /usr/local/bin/wp
```

Create a batch file on Windows:

Create a file named wp.bat with the following content:

```batch
@php "%~dp0wp-cli.phar" %*
```

Put wp.bat and wp-cli.phar in the same directory and add that directory to the system PATH.
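The steps above only install WP-CLI; import-html.php does not call it. As a rough illustration of where it could fit, the sketch below builds a `wp post create` command for one cleaned file and prints it without running it. The helper name `build_wp_post_create` and the sample file name are hypothetical, and passing the whole file as post content is an assumption: unlike import-html.php, this would not extract the entry-content div first.

```python
import shutil
from pathlib import Path


def build_wp_post_create(html_file, title, status="draft"):
    """Return the argument list for a `wp post create` call (hypothetical helper)."""
    return [
        "wp", "post", "create", str(Path(html_file)),
        f"--post_title={title}",
        f"--post_status={status}",
        "--porcelain",  # ask WP-CLI to print only the new post ID
    ]


if __name__ == "__main__":
    # Dry run: show whether wp is on PATH and what would be executed.
    cmd = build_wp_post_create("chinese-fixed/example.html", "Example post")
    print("wp found on PATH:", shutil.which("wp") is not None)
    print(" ".join(cmd))
```

The command would have to be run from the WordPress root directory so that WP-CLI can locate the installation.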

4. File layout

Place the original HTML files in the chinese/ directory
Make sure the chinese-fixed/ directory is writable
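A quick pre-flight check can catch layout problems before the processing script runs. The sketch below is one possible way to do it; the function name `check_layout` is hypothetical, and `os.access` is only an approximation of real write permission on some platforms (notably Windows).

```python
import os
import tempfile
from pathlib import Path


def check_layout(base_dir):
    """Return a list of problems found in the expected directory layout."""
    base = Path(base_dir)
    source = base / "chinese"
    target = base / "chinese-fixed"
    problems = []
    if not source.is_dir():
        problems.append(f"missing source directory: {source}")
    elif not any(source.rglob("*.html")):
        problems.append(f"no .html files under: {source}")
    # chinese-fixed/ is generated, so if it does not exist yet the parent must be writable.
    probe = target if target.is_dir() else base
    if not os.access(probe, os.W_OK):
        problems.append(f"no write permission on: {probe}")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(check_layout(tmp))  # the source directory is missing
        Path(tmp, "chinese").mkdir()
        Path(tmp, "chinese", "a.html").write_text("<p>x</p>", encoding="utf-8")
        print(check_layout(tmp))  # nothing left to report
```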

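Usage

Step 1: Process the HTML files

This step cleans the original HTML files and saves the results to the chinese-fixed/ directory.

# Change to the project directory
cd c:\newseo\SAPSEO\cache\original-html

# Run the processing script
python process_all.py

The script will:

Read every HTML file under chinese/
Clean each file by removing unwanted elements
Save the processed files to chinese-fixed/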

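Step 2: Import into WordPress

This step imports the processed HTML files into WordPress.

# Make sure you are in the WordPress root directory, or adjust the paths in the script
# Run the import script
php import-html.php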

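Processing Steps

The process_all.py script performs the following cleanup operations:

Remove all images
Remove all share blocks
Remove all links (keeping the link text)
Remove embedded content
Remove the specific header logo div
Remove the specific site titles div
Remove the menu button
Remove all YouTube videos
Remove the top navigation menu
Remove oEmbed links
Remove embed.js scripts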
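To make two of these operations concrete, the sketch below applies the image and link rules to a small invented snippet. The sample HTML and the function name `strip_images_and_links` are assumptions for illustration. The `\b` word boundary after the tag name matters here: without it, a pattern such as `<a[^>]*>` would also match tags like `<audio>` or `<article>`.

```python
import re

SAMPLE = (
    '<p>Read <a href="https://example.com/post">this post</a>.</p>'
    '<img src="logo.png" alt="logo">'
    '<audio controls src="clip.mp3"></audio>'
)


def strip_images_and_links(html):
    """Drop <img> tags and unwrap <a> tags, keeping the link text."""
    html = re.sub(r'<img\b[^>]*>', '', html)
    # \b keeps the pattern from matching <audio>, <article>, <aside>, ...
    html = re.sub(r'<a\b[^>]*>(.*?)</a>', r'\1', html, flags=re.DOTALL)
    return html


if __name__ == "__main__":
    print(strip_images_and_links(SAMPLE))
```

On this sample the link text and the audio player survive, while the image and the anchor tags are removed.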

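Importing into WordPress

The import-html.php script:

Scans the chinese-fixed/ directory for HTML files
Extracts the title (from the <h1> tag if present, otherwise from the file name)
Extracts the content (using DOMDocument to handle nested structures)
Creates a WordPress post with:

- Status: published

- Type: post

- Category: Chinese (falls back to ID 1 if not found)

- Author: jacobren (falls back to ID 1 if not found)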
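The title rule can be sketched in a few lines. The version below is a Python rendering of the same idea, not the PHP code itself; the name `extract_title` is hypothetical, and it differs slightly from the import script in that it decodes HTML entities and falls back to the file name when the `<h1>` is empty.

```python
import html
import re
from pathlib import Path


def extract_title(file_path, content):
    """Use the first <h1> text if present, otherwise the file name without .html."""
    match = re.search(r'<h1\b[^>]*>(.*?)</h1>', content, flags=re.IGNORECASE | re.DOTALL)
    if match:
        text = re.sub(r'<[^>]+>', '', match.group(1))  # drop nested tags
        text = html.unescape(text).strip()
        if text:
            return text
    return Path(file_path).stem


if __name__ == "__main__":
    print(extract_title("chinese-fixed/my-post.html", "<h1><span>Hello &amp; welcome</span></h1>"))
    print(extract_title("chinese-fixed/my-post.html", "<p>no heading here</p>"))
```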

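Configuration

HTML processing configuration

In process_all.py you can change:

source_directory: path to the original HTML files
target_directory: path where the processed files are saved

WordPress import configuration

In import-html.php you can change:

$html_dir: directory containing the processed HTML files
$default_category: fallback category ID
$default_author: fallback author ID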
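Editing the script is the only configuration mechanism the project describes. If command-line options were preferred, one possible approach is sketched below; the `--source` and `--target` flags and the `parse_args` helper are hypothetical and do not exist in process_all.py.

```python
import argparse


def parse_args(argv=None):
    """Parse hypothetical --source/--target options for the processing script."""
    parser = argparse.ArgumentParser(description="Clean HTML files before import")
    parser.add_argument("--source", default="chinese",
                        help="directory with the original HTML files")
    parser.add_argument("--target", default="chinese-fixed",
                        help="directory for the processed HTML files")
    return parser.parse_args(argv)
```

With this in place, `main()` could call `args = parse_args()` and pass `args.source` and `args.target` to `process_html_files` instead of the hard-coded paths.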

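Troubleshooting

Common problems

PHP error: require_once(wp-load.php): failed to open stream

- Cause: the script was not run from the WordPress root directory

- Solution: move the script to the WordPress root directory or adjust the path

No files are processed

- Cause: the source directory contains no HTML files

- Solution: make sure the chinese/ directory contains HTML files

Content is truncated during import

- Cause: content extraction based on regular expressions cut off nested structures; this has been fixed by using DOMDocument instead

- Solution: use the current version, which should handle nested structures correctly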
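The truncation problem is easy to reproduce. The source does not show the earlier regex-based version, so the sketch below is only an assumed reconstruction of the failure mode: a lazy regex stops at the first closing `</div>`, while a parser that tracks nesting depth returns the whole block. It uses Python's `html.parser` in place of PHP's DOMDocument, and the sample markup and class name `EntryContentExtractor` are invented for illustration.

```python
import re
from html.parser import HTMLParser

SAMPLE = (
    '<div class="entry-content">'
    '<p>Intro</p><div class="audio"><audio src="a.mp3"></audio></div><p>Outro</p>'
    '</div><div class="footer">footer</div>'
)


class EntryContentExtractor(HTMLParser):
    """Collect the inner HTML of the first div whose class list has entry-content."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.depth = 0      # 0 = outside the target div, 1 = directly inside it
        self.done = False   # True once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if not self.done and tag == "div" and "entry-content" in classes:
                self.depth = 1
            return
        if tag == "div":
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:  # this closes the target div itself
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def extract_with_regex(html):
    """Naive approach: the lazy match ends at the first </div>."""
    match = re.search(r'<div class="entry-content">(.*?)</div>', html, flags=re.DOTALL)
    return match.group(1) if match else ""


def extract_with_parser(html):
    """Depth-aware approach: returns everything up to the matching </div>."""
    parser = EntryContentExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print("regex :", extract_with_regex(SAMPLE))
    print("parser:", extract_with_parser(SAMPLE))
```

On this sample the regex result ends right after the audio element and loses the closing paragraph, while the parser result contains the full inner content.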

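Verification

To verify that processing worked:

Check that files were created in chinese-fixed/
Open a processed file and confirm that the unwanted elements are gone
Run the import script and check the imported posts in WordPress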
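The second check can be partly automated. The sketch below scans processed files for a few kinds of markup that the cleanup is supposed to remove; the pattern set is a small assumed subset of the cleanup operations and the name `find_leftovers` is hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Markup that should no longer appear after cleaning (illustrative subset).
LEFTOVER_PATTERNS = {
    "image": re.compile(r'<img\b', re.IGNORECASE),
    "link": re.compile(r'<a\b', re.IGNORECASE),
    "youtube iframe": re.compile(r'<iframe[^>]*youtube\.com', re.IGNORECASE),
}


def find_leftovers(target_dir):
    """Map each processed file (relative path) to the leftover kinds it contains."""
    report = {}
    for path in sorted(Path(target_dir).rglob("*.html")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        hits = [name for name, pattern in LEFTOVER_PATTERNS.items() if pattern.search(text)]
        if hits:
            report[path.relative_to(target_dir).as_posix()] = hits
    return report


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "clean.html").write_text("<p>ok</p>", encoding="utf-8")
        Path(tmp, "dirty.html").write_text('<p><img src="x.png"></p>', encoding="utf-8")
        print(find_leftovers(tmp))
```

An empty report means none of the listed patterns were found; it does not prove that every cleanup rule was applied.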

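Notes

The import script preserves nested HTML structures such as audio players, videos, and embedded content that remain after cleaning
Processing time grows with the number of files
Always keep a backup of the original HTML files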
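One simple way to follow the backup advice is to zip the source directory before each run. This is a minimal sketch under assumptions: the function name `backup_directory`, the timestamped archive name, and the backups location are all hypothetical, and the backup directory should live outside the directory being archived.

```python
import shutil
import tempfile
import time
from pathlib import Path


def backup_directory(source_dir, backup_root):
    """Zip source_dir into backup_root and return the path of the archive."""
    source = Path(source_dir)
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_base = backup_root / f"{source.name}-{stamp}"
    # make_archive appends the .zip suffix and returns the full archive path.
    return Path(shutil.make_archive(str(archive_base), "zip", root_dir=str(source)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp, "chinese")
        src.mkdir()
        (src / "a.html").write_text("<p>x</p>", encoding="utf-8")
        print(backup_directory(src, Path(tmp, "backups")).name)
```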

License

This project is for internal use only.

PHP code (import-html.php)

<?php
/**
 * WordPress HTML Import Script
 * Modified: 2026-02-06
 */

// Load the WordPress environment
require_once('wp-load.php');

// Look up the category and author IDs automatically
$default_category = get_cat_ID('Chinese');
if ($default_category == 0) {
    $default_category = 1; // fallback
    echo "Warning: Chinese category not found, using ID 1\n";
}

$default_author = 1; // fallback
$user = get_user_by('login', 'jacobren');
if ($user) {
    $default_author = $user->ID;
} else {
    echo "Warning: jacobren user not found, using ID 1\n";
}

// Configuration: directory with the processed HTML files (output of process_all.py)
$html_dir = 'chinese-fixed';

// Scan for HTML files
$files = glob($html_dir . '/*.html');
if (empty($files)) {
    die("Error: No HTML files found in $html_dir directory\n");
}

echo "Found " . count($files) . " HTML files\n\n";

// Process each HTML file
foreach ($files as $file) {
    echo "Processing: " . basename($file) . "\n";
    
    // Read the file; file_get_contents() returns false on failure
    $content = file_get_contents($file);
    if ($content === false || $content === '') {
        echo "Error: Could not read file\n\n";
        continue;
    }
    
    // Extract the title: default to the file name, prefer the first <h1> if present
    $title = basename($file, '.html');
    if (preg_match('/<h1[^>]*>(.*?)<\/h1>/is', $content, $match)) {
        $title = strip_tags($match[1]);
        $title = trim($title);
    }
    
    // Extract the post body
    $post_content = '';
    
    // Parse with DOMDocument so that nested structures are handled correctly
    $dom = new DOMDocument();
    // @ suppresses warnings from malformed HTML; the XML encoding hint asks libxml
    // to read the input as UTF-8 even if the file lacks a charset declaration
    @$dom->loadHTML('<?xml encoding="UTF-8">' . $content);
    
    // Find the entry-content div
    $xpath = new DOMXPath($dom);
    $entries = $xpath->query('//div[contains(@class, "entry-content")]');
    
    if ($entries->length > 0) {
        $entryContent = $entries->item(0);
        // Serialize only the child nodes, which leaves out the outer entry-content div
        foreach ($entryContent->childNodes as $child) {
            $post_content .= $dom->saveHTML($child);
        }
    } elseif (preg_match('/<body[^>]*>(.*?)<\/body>/is', $content, $match)) {
        // Fallback: use everything inside <body>
        $post_content = $match[1];
    } else {
        // Last resort: use the whole file
        $post_content = $content;
    }
    
    // Clean up the content: drop scripts and styles, trim whitespace
    $post_content = preg_replace('/<script.*?<\/script>/is', '', $post_content);
    $post_content = preg_replace('/<style.*?<\/style>/is', '', $post_content);
    $post_content = trim($post_content);
    
    if (empty($post_content)) {
        echo "Error: No content extracted\n\n";
        continue;
    }
    
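    // Create the post
    $post_data = array(
        'post_title'    => $title,
        'post_content'  => $post_content,
        'post_status'   => 'publish',
        'post_type'     => 'post',
        'post_category' => array($default_category),
        'post_author'   => $default_author
    );
    
    // Passing true makes wp_insert_post() return a WP_Error on failure
    $post_id = wp_insert_post($post_data, true);
    
    if (is_wp_error($post_id)) {
        echo "Error: Failed to import (" . $post_id->get_error_message() . ")\n\n";
    } elseif ($post_id) {
        echo "Success: Imported (ID: $post_id)\n\n";
    } else {
        echo "Error: Failed to import\n\n";
    }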
}

echo "Import completed!\n";
?>

Cleaning the HTML so the content is leaner and easier to import (process_all.py)

import os
import re

def process_html_files(source_directory, target_directory):
    """Process HTML files from source directory and save to target directory:
    1. Remove all images
    2. Remove all share blocks
    3. Remove all links (preserve text)
    4. Remove all embed content
    5. Remove specific header logo div
    6. Remove specific site titles div
    7. Remove menu button
    8. Remove all YouTube videos
    9. Remove top-nav-menu
    10. Remove oEmbed links
    11. Remove embed.js scripts
    """
    # Create target directory if it doesn't exist
    os.makedirs(target_directory, exist_ok=True)
    
    for root, _, files in os.walk(source_directory):
        for file in files:
            if file.endswith('.html'):
                source_file_path = os.path.join(root, file)
                # Calculate relative path to preserve directory structure
                relative_path = os.path.relpath(source_file_path, source_directory)
                target_file_path = os.path.join(target_directory, relative_path)
                
                # Create target subdirectory if it doesn't exist
                os.makedirs(os.path.dirname(target_file_path), exist_ok=True)
                
                try:
                    with open(source_file_path, 'r', encoding='utf-8', errors='ignore') as f:
                        content = f.read()
                    
                except Exception as e:
                    print(f"Error reading file {source_file_path}: {e}")
                    continue
                
                # 1. Remove all images
                cleaned_content = re.sub(r'<img\b[^>]*>', '', content)
                
                # 2. Remove share blocks (drops everything from the
                #    Simple Share Buttons Adder comment to the end of the file)
                cleaned_content = re.sub(r'<!-- Simple Share Buttons Adder.*', '', cleaned_content, flags=re.DOTALL)
                
                # 3. Remove all links (preserve text)
                # First handle well-formed links; \b stops the pattern from
                # matching other tags that start with "a", such as <audio>
                cleaned_content = re.sub(r'<a\b[^>]*>(.*?)</a>', r'\1', cleaned_content, flags=re.DOTALL)
                # Then handle link tags that were never closed
                cleaned_content = re.sub(r'<a\b[^>]*>(.*?)(?=<li>|</li>|$)', r'\1', cleaned_content, flags=re.DOTALL)
                
                # 4. Remove embed content (the wp-embed post-3948 div, up to its first </div>)
                cleaned_content = re.sub(r'<div class="wp-embed post-3948.*?</div>', '', cleaned_content, flags=re.DOTALL)
                
                # 5. Remove specific header logo div
                cleaned_content = re.sub(r'<div class="pure-u-1 pure-u-sm-1-6 pure-u-md-1-8">\s*<div class="header-logo icon icon-zhongwen-jiemi"></div>\s*</div>', '', cleaned_content, flags=re.DOTALL)
                
                # 6. Remove specific site titles div
                cleaned_content = re.sub(r'<div class="pure-u-1 pure-u-sm-5-6  pure-u-md-7-8 site-titles">\s*<h2 class="site-title">Hacking Chinese</h2>\s*<h3 class="site-subtitle">A better way of learning Mandarin</h3>\s*</div>', '', cleaned_content, flags=re.DOTALL)
                
                # 7. Remove menu button
                cleaned_content = re.sub(r'<button class="menu-button pure-button" id="menu-button">Menu</button>', '', cleaned_content, flags=re.DOTALL)
                
                # 8. Remove all YouTube videos
                cleaned_content = re.sub(r'<iframe[^>]*youtube\.com[^>]*>.*?</iframe>', '', cleaned_content, flags=re.DOTALL)
                
                # 9. Remove top-nav-menu
                cleaned_content = re.sub(r'<ul id="top-nav-menu"[^>]*>.*?</ul>', '', cleaned_content, flags=re.DOTALL)
                
                # 10. Remove oEmbed links
                cleaned_content = re.sub(r'<link[^>]*oembed[^>]*>', '', cleaned_content, flags=re.DOTALL)
                
                # 11. Remove embed.js scripts
                cleaned_content = re.sub(r'<script src="//downloads\.mailchimp\.com/js/signup-forms/popup/unique-methods/embed\.js"[^>]*></script>', '', cleaned_content, flags=re.DOTALL)
                
                # Write processed content to target file
                try:
                    with open(target_file_path, 'w', encoding='utf-8') as f:
                        f.write(cleaned_content)
                    print(f"Processed file {source_file_path} -> {target_file_path}")
                except Exception as e:
                    print(f"Error writing file {target_file_path}: {e}")
                    continue

def main():
    """Main function to process HTML files"""
    # Source and target directories
    source_directory = r"C:\newseo\SAPSEO\cache\original-html\chinese"
    target_directory = r"C:\newseo\SAPSEO\cache\original-html\chinese-fixed"
    
    print(f"Processing files from {source_directory}...")
    print(f"Saving processed files to {target_directory}...")
    
    process_html_files(source_directory, target_directory)
    print("Processing finished; check the messages above for any read or write errors.")

if __name__ == "__main__":
    main()
