

linux – Using iconv to convert any encoding to UTF-8


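Problem description

I am trying to point iconv at a directory and have every file in it converted to UTF-8, regardless of its current encoding.

I am using the script below, but it requires you to specify the source encoding. How can I make it detect the current encoding automatically?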

#!/bin/bash

ICONVBIN='/usr/bin/iconv' # path to iconv binary

# Require all three arguments; print usage and fail otherwise
if [ $# -lt 3 ]
then
    echo "$0 dir from_charset to_charset"
    exit 1
fi

for f in "$1"/*
do
    if test -f "$f"
    then
        echo -e "\nConverting $f"
        # Keep the original as $f.old and write the converted text to $f
        /bin/mv "$f" "$f.old"
        "$ICONVBIN" -f "$2" -t "$3" "$f.old" > "$f"
    else
        echo -e "\nSkipping $f - not a regular file"
    fi
done

Command line

sudo convert/dir_iconv.sh convert/books CURRENT_ENCODING utf8

Best answer

Perhaps you are looking for enca:

Enca is an Extremely Naive Charset Analyser. It detects character set and encoding of text files and can also convert them to other encodings using either a built-in converter or external libraries and tools like libiconv, librecode, or cstocs.

Currently it supports Belarusian, Bulgarian, Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Russian, Slovak, Slovene, Ukrainian, Chinese, and some multibyte encodings independently on language.

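Note that, in general, automatically detecting the current encoding is a hard problem (the same byte sequence can be valid text in several encodings). enca uses heuristics based on the language you tell it to detect, which limits the number of candidate encodings. You can use enconv to convert text files to a single encoding.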
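
The ambiguity mentioned above can be seen with a single byte. A minimal sketch, assuming an iconv that knows the charset names ISO-8859-1, KOI8-R and CP1251 (decode_as is a helper name made up for illustration): the byte 0xE9 is valid in all three encodings but decodes to a different character in each.

```shell
#!/bin/bash
# Decode the single byte 0xE9 (octal 351) as if it had been written in the
# encoding named by $1. decode_as is a hypothetical helper name.
decode_as() {
    printf '\351' | iconv -f "$1" -t UTF-8
}

# Same input byte, three candidate source encodings
for enc in ISO-8859-1 KOI8-R CP1251
do
    printf '%s: %s\n' "$enc" "$(decode_as "$enc")"
done
```

Each line shows a different character (é for ISO-8859-1, for instance), and nothing in the byte itself says which reading is the intended one. That is why a detector needs a hint such as the language.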

Second answer

You can get what you need with the standard utilities file and awk. Example:

file -bi .xsession-errors gives me: "text/plain; charset=us-ascii"

So file -bi .xsession-errors | awk -F "=" '{print $2}' gives me "us-ascii"
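
The awk step only has to pull out the text after charset=. As a sketch of the same extraction in plain bash (charset_from_mime is a hypothetical helper name, and the input is assumed to have the "type; charset=name" shape shown above):

```shell
#!/bin/bash
# Extract the charset from a string shaped like "text/plain; charset=us-ascii".
# charset_from_mime is a hypothetical helper name, not part of file or awk.
charset_from_mime() {
    local mime="$1"
    case "$mime" in
        *charset=*) printf '%s\n' "${mime##*charset=}" ;;  # keep the text after "charset="
        *)          return 1 ;;                            # no charset field found
    esac
}

charset_from_mime "text/plain; charset=us-ascii"
```

In a script it would be called as charset_from_mime "$(file -bi "$i")". file also has a --mime-encoding option that prints only the encoding name, which avoids the parsing step where that option is available.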

I use it in scripts like this:

CHARSET="$(file -bi "$i" | awk -F "=" '{print $2}')"
if [ "$CHARSET" != utf-8 ]; then
    iconv -f "$CHARSET" -t utf8 "$i" -o outfile
fi
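
One refinement worth considering: file can also report charsets for which conversion is pointless. us-ascii text is already valid UTF-8, and binary means the file is not text at all. A hedged sketch of a guard for the snippet above (needs_conversion is a hypothetical helper name; the list of skipped values is an assumption and may need extending):

```shell
#!/bin/bash
# Succeed (exit status 0) only when a charset reported by file -bi is worth
# passing to iconv. needs_conversion is a hypothetical helper name.
needs_conversion() {
    case "$1" in
        utf-8|us-ascii) return 1 ;;  # already valid UTF-8
        binary|"")      return 1 ;;  # not text, or nothing detected
        *)              return 0 ;;  # anything else: try to convert
    esac
}

for cs in utf-8 us-ascii binary iso-8859-1
do
    if needs_conversion "$cs"; then
        echo "$cs: convert"
    else
        echo "$cs: skip"
    fi
done
```

With this in place the test in the snippet would read if needs_conversion "$CHARSET"; then ... fi.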

Third answer

Putting it all together: go to the directory and create dir2utf8.sh:

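#!/bin/bash
# Convert all regular files in the current directory to UTF-8

for f in *
do
    if test -f "$f"; then
        echo -e "\nConverting $f"
        # file -bi prints e.g. "text/plain; charset=iso-8859-1"
        CHARSET="$(file -bi "$f" | awk -F "=" '{print $2}')"
        if [ "$CHARSET" != utf-8 ]; then
            # Write to a temporary file first: letting iconv read and
            # write the same file can truncate or corrupt it
            iconv -f "$CHARSET" -t utf8 "$f" -o "$f.tmp" && mv "$f.tmp" "$f" || rm -f "$f.tmp"
        fi
    else
        echo -e "\nSkipping $f - not a regular file"
    fi
done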
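
The in-place conversion step can also be factored into a helper that goes through a temporary file, so that a failed conversion leaves the original untouched. A minimal sketch, assuming mktemp is available (convert_in_place is a hypothetical helper name, not part of the answer's script):

```shell
#!/bin/bash
# Convert one file from a given charset to UTF-8 via a temporary file.
# Usage: convert_in_place FROM_CHARSET FILE
# convert_in_place is a hypothetical helper name.
convert_in_place() {
    local from="$1" file="$2" tmp
    tmp="$(mktemp)" || return 1
    if iconv -f "$from" -t UTF-8 "$file" > "$tmp"; then
        # Copy the result back with cat rather than mv, so the original
        # file keeps its permissions
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        rm -f "$tmp"   # conversion failed: leave the original as it was
        return 1
    fi
}
```

Inside the loop it would be called as convert_in_place "$CHARSET" "$f".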

Fourth answer

Here is my solution for converting all files in place:

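#!/bin/bash

# Needs recode and uchardet; installing them requires root privileges
apt-get -y install recode uchardet > /dev/null
# "$1" is the directory to convert
find "$1" -type f | while IFS= read -r FFN
do
    encoding=$(uchardet "$FFN")
    echo "$FFN: $encoding"
    # Workaround: rewrite "x-mac-..." encoding names to "mac..." for recode
    enc=$(echo "$encoding" | sed 's#^x-mac-#mac#')
    recode "$enc..UTF-8" "$FFN"
done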

https://gist.github.com/demofly/25f856a96c29b89baa32

Put it into convert-dir-to-utf8.sh and run:

bash convert-dir-to-utf8.sh /path/to/my/trash/dir

Note that the sed call is a workaround for Mac encodings. Many uncommon encodings need workarounds like this.
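
As more workarounds of this kind accumulate, it may be easier to keep them in one place. A sketch, not taken from the gist: recode_name is a hypothetical helper that holds the x-mac- rule from the script above in a case statement, where further rules could be added as they are discovered.

```shell
#!/bin/bash
# Map an encoding name as reported by the detector to the name passed to
# recode. recode_name is a hypothetical helper; only the x-mac- rule comes
# from the script above.
recode_name() {
    case "$1" in
        x-mac-*) printf 'mac%s\n' "${1#x-mac-}" ;;  # same effect as the sed rule
        *)       printf '%s\n' "$1" ;;              # pass other names through unchanged
    esac
}

recode_name x-mac-cyrillic
recode_name windows-1252
```

In the loop it would replace the sed line: enc=$(recode_name "$encoding").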
