o
�J�h�J � @ s d dl Z d dlZd dlmZ d dlmZmZ d dlmZm Z m
Z
mZmZ ddl
mZmZmZmZ ddlmZmZmZmZ ddlmZ dd lmZmZ dd
lmZmZmZm Z m!Z!m"Z"m#Z# e �$d�Z%e �&� Z'e'�(e �)d��
d&de*de+de+de,dee
e- dee
e- de.de.defdd�Z/
d&de de+de+de,dee
e- dee
e- de.de.defdd�Z0
d&d d!de+de+de,dee
e- dee
e- de.de.defd"d#�Z1
d'd d!de+de+de,dee
e- dee
e- de.defd$d%�Z2dS )(� N)�PathLike)�basename�splitext)�Any�BinaryIO�List�Optional�Set� )�coherence_ratio�encoding_languages�mb_encoding_languages�merge_coherence_ratios)�IANA_SUPPORTED�TOO_BIG_SEQUENCE�TOO_SMALL_SEQUENCE�TRACE)�
mess_ratio)�CharsetMatch�CharsetMatches)�any_specified_encoding�cut_sequence_chunks� iana_name�identify_sig_or_bom�
is_cp_similar�is_multi_byte_encoding�should_strip_sig_or_bom�charset_normalizerz)%(asctime)s | %(levelname)s | %(message)s� � 皙�����?TF� sequences�steps�
chunk_size� threshold�cp_isolation�cp_exclusion�preemptive_behaviour�explain�returnc - C s� t | ttf�std�t| ����|rtj}t�t � t�
t� t| �} | dkrGt�
d� |r;t�t � t�
|p9tj� tt| dddg d�g�S |dur]t�td d
�|�� dd� |D �}ng }|durut�td
d
�|�� dd� |D �}ng }| || kr�t�td||| � d}| }|dkr�| | |k r�t| | �}t| �tk }
t| �tk}|
r�t�td�| �� n|r�t�td�| �� g }|r�t| �nd}
|
dur�|�|
� t�td|
� t� }g }g }d}d}d}t� }t| �\}}|du�r|�|� t�tdt|�|� |�d� d|v�r|�d� |t D �]�}|�r!||v�r!�q|�r+||v �r+�q||v �r2�q|�|� d}||k}|�oCt|�}|dv �rU|�sUt�td|� �qzt|�}W n t t!f�yo t�td|� Y �qw z9|�r�|du �r�t"|du �r�| dtd�� n | t|�td�� |d� nt"|du �r�| n| t|�d� |d�}W n+ t#t$f�y� } zt |t$��s�t�td|t"|�� |�|� W Y d}~�qd}~ww d}|D ]
}t%||��r�d} n�q�|�r�t�td||� �qt&|�s�dnt|�| t| | ��}|�o|du�ot|�| k } | �rt�td|� tt|�d �}!t'|!d �}!d}"d}#g }$g }%z9t(| ||||||||� D ]*}&|$�|&� |%�t)|&|�� |%d! |k�rY|"d7 }"|"|!k�sf|�rh|du �rh n�q?W n! t#�y� } zt�td"|t"|�� |!}"d}#W Y d}~nd}~ww |#�s�|�r�|�s�z| td#�d� j*|d$d%� W n# t#�y� } zt�td&|t"|�� |�|� W Y d}~�qd}~ww |%�r�t+|%�t|%� nd}'|'|k�s�|"|!k�r|�|� t�td'||"t,|'d( d)d*�� |dd|
fv �r|#�st| ||dg |�}(||
k�r|(}n
|dk�r|(}n|(}�qt�td+|t,|'d( d)d*�� |�s2t-|�})nt.|�})|)�rEt�td,�|t"|)��� g }*|dk�re|$D ]}&t/|&d-|)�r[d.�|)�nd�}+|*�|+� �qNt0|*�},|,�rvt�td/�|,|�� |�t| ||'||,|�� ||
ddfv �r�|'d-k �r�t�
d0|� |�r�t�t � t�
|� t|| g� S ||k�r�t�
d1|� |�r�t�t � t�
|� t|| g� S �qt|�dk�r&|�s�|�s�|�r�t�td2� |�r�t�
d3|j1� |�|� n2|�r�|du �s|�r |�r |j2|j2k�s|du�rt�
d4� |�|� n
|�r&t�
d5� |�|� |�r8t�
d6|�3� j1t|�d � nt�
d7� |�rJt�t � t�
|� |S )8ae
Given a raw bytes sequence, return the best possible charsets usable to render str objects.
If there are no results, it is a strong indicator that the source is binary/not text.
By default, the process will extract 5 blocks of 512 bytes each to assess the mess and coherence of a given sequence.
It will give up on a particular code page after 20% of measured mess. Those criteria are customizable at will.
The preemptive behavior DOES NOT replace the traditional detection workflow; it prioritizes a particular code page
but never takes it for granted. It can improve performance.
You may want to focus your attention on some code pages and/or exclude others; use cp_isolation and cp_exclusion for that
purpose.
This function will strip the SIG in the payload/sequence every time except on UTF-16, UTF-32.
By default the library does not set up any handler other than the NullHandler; if you choose to set the 'explain'
toggle to True, it will alter the logger configuration to add a StreamHandler that is suitable for debugging.
A custom logging format and handler can be set manually.
z4Expected object of type bytes or bytearray, got: {0}r z<Encoding detection on empty bytes, assuming utf_8 intention.�utf_8g F� Nz`cp_isolation is set. use this flag for debugging purpose. limited list of encoding allowed : %s.z, c S � g | ]}t |d ��qS �F�r ��.0�cp� r2 �TC:\pinokio\api\whisper-webui.git\app\env\lib\site-packages\charset_normalizer\api.py�
<listcomp>[ � zfrom_bytes.<locals>.<listcomp>zacp_exclusion is set. use this flag for debugging purpose. limited list of encoding excluded : %s.c S r, r- r. r/ r2 r2 r3 r4 f r5 z^override steps (%i) and chunk_size (%i) as content does not fit (%i byte(s) given) parameters.r
z>Trying to detect encoding from a tiny portion of ({}) byte(s).zIUsing lazy str decoding because the payload is quite large, ({}) byte(s).z@Detected declarative mark in sequence. Priority +1 given for %s.zIDetected a SIG or BOM mark on first %i byte(s). Priority +1 given for %s.�ascii> �utf_16�utf_32z[Encoding %s wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.z2Encoding %s does not provide an IncrementalDecoderg ��A)�encodingz9Code page %s does not fit given bytes sequence at ALL. %sTzW%s is deemed too similar to code page %s and was consider unsuited already. Continuing!zpCode page %s is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.� � �����zaLazyStr Loading: After MD chunk decode, code page %s does not fit given bytes sequence at ALL. %sg j�@�strict)�errorsz^LazyStr Loading: After final lookup, code page %s does not fit given bytes sequence at ALL. %szc%s was excluded because of initial chaos probing. Gave up %i time(s). Computed mean chaos is %f %%.�d � )�ndigitsz=%s passed initial chaos probing. Mean measured chaos is %f %%z&{} should target any language(s) of {}g�������?�,z We detected language {} using {}z.Encoding detection: %s is most likely the one.zoEncoding detection: %s is most likely the one as we detected a BOM or SIG within the beginning of the sequence.zONothing got out of the detection process. Using ASCII/UTF-8/Specified fallback.z7Encoding detection: %s will be used as a fallback matchz:Encoding detection: utf_8 will be used as a fallback matchz:Encoding detection: ascii will be used as a fallback matchz]Encoding detection: Found %s as plausible (best-candidate) for content. With %i alternatives.z=Encoding detection: Unable to determine any suitable charset.)4�
isinstance� bytearray�bytes� TypeError�format�type�logger�level�
addHandler�explain_handler�setLevelr �len�debug�
removeHandler�logging�WARNINGr r �log�join�intr r r �append�setr r �addr r �ModuleNotFoundError�ImportError�str�UnicodeDecodeError�LookupErrorr �range�maxr r �decode�sum�roundr r
r r r9 �fingerprint�best)-r! r"