o �J�h�J�@sddlZddlZddlmZddlmZmZddlmZm Z m Z m Z m Z ddl mZmZmZmZddlmZmZmZmZddlmZdd lmZmZdd lmZmZmZm Z m!Z!m"Z"m#Z#e�$d �Z%e�&�Z'e'�(e�)d ��      d&de*de+de+de,de e e-de e e-de.de.defdd�Z/      d&de de+de+de,de e e-de e e-de.de.defdd�Z0      d&d d!de+de+de,de e e-de e e-de.de.defd"d#�Z1     d'd d!de+de+de,de e e-de e e-de.defd$d%�Z2dS)(�N)�PathLike)�basename�splitext)�Any�BinaryIO�List�Optional�Set�)�coherence_ratio�encoding_languages�mb_encoding_languages�merge_coherence_ratios)�IANA_SUPPORTED�TOO_BIG_SEQUENCE�TOO_SMALL_SEQUENCE�TRACE)� mess_ratio)� CharsetMatch�CharsetMatches)�any_specified_encoding�cut_sequence_chunks� iana_name�identify_sig_or_bom� is_cp_similar�is_multi_byte_encoding�should_strip_sig_or_bom�charset_normalizerz)%(asctime)s | %(levelname)s | %(message)s��皙�����?TF� sequences�steps� chunk_size� threshold� cp_isolation� cp_exclusion�preemptive_behaviour�explain�returnc- Cs�t|ttf�std�t|����|rtj}t�t �t� t �t |�} | dkrGt� d�|r;t�t �t� |p9tj�tt|dddgd�g�S|dur]t�t d d �|��d d �|D�}ng}|durut�t d d �|��dd �|D�}ng}| ||kr�t�t d||| �d}| }|dkr�| ||kr�t| |�}t |�tk} t |�tk} | r�t�t d�| ��n | r�t�t d�| ��g} |r�t|�nd} | dur�| �| �t�t d| �t�}g}g}d}d}d}t�}t|�\}}|du�r| �|�t�t dt |�|�| �d�d| v�r| �d�| tD�]�}|�r!||v�r!�q|�r+||v�r+�q||v�r2�q|�|�d}||k}|�oCt|�}|dv�rU|�sUt�t d|��qzt|�}Wnt t!f�yot�t d|�Y�qwz9| �r�|du�r�t"|du�r�|dtd��n |t |�td��|d�nt"|du�r�|n|t |�d�|d�}Wn+t#t$f�y�}zt|t$��s�t�t d|t"|��|�|�WYd}~�qd}~wwd}|D] }t%||��r�d}n�q�|�r�t�t d||��qt&|�s�dnt |�| t| |��}|�o|du�ot |�| k} | �rt�t d|�tt |�d�}!t'|!d �}!d}"d}#g}$g}%z9t(|||||||||� D]*}&|$�|&�|%�t)|&|��|%d!|k�rY|"d7}"|"|!k�sf|�rh|du�rhn�q?Wn!t#�y�}zt�t d"|t"|��|!}"d}#WYd}~nd}~ww|#�s�| �r�|�s�z|td#�d�j*|d$d%�Wn#t#�y�}zt�t d&|t"|��|�|�WYd}~�qd}~ww|%�r�t+|%�t |%�nd}'|'|k�s�|"|!k�r|�|�t�t d'||"t,|'d(d)d*��|dd| fv�r|#�st|||dg|�}(|| k�r|(}n |dk�r|(}n|(}�qt�t d+|t,|'d(d)d*��|�s2t-|�})nt.|�})|)�rEt�t d,�|t"|)���g}*|dk�re|$D]}&t/|&d-|)�r[d.�|)�nd�}+|*�|+��qNt0|*�},|,�rvt�t d/�|,|��|�t|||'||,|��|| ddfv�r�|'d-k�r�t� 
d0|�|�r�t�t �t� |�t||g�S||k�r�t� d1|�|�r�t�t �t� |�t||g�S�qt |�dk�r&|�s�|�s�|�r�t�t d2�|�r�t� d3|j1�|�|�n2|�r�|du�s|�r |�r |j2|j2k�s|du�rt� d4�|�|�n |�r&t� d5�|�|�|�r8t� d6|�3�j1t |�d�nt� d7�|�rJt�t �t� |�|S)8ae Given a raw bytes sequence, return the best possibles charset usable to render str objects. If there is no results, it is a strong indicator that the source is binary/not text. By default, the process will extract 5 blocs of 512o each to assess the mess and coherence of a given sequence. And will give up a particular code page after 20% of measured mess. Those criteria are customizable at will. The preemptive behavior DOES NOT replace the traditional detection workflow, it prioritize a particular code page but never take it for granted. Can improve the performance. You may want to focus your attention to some code page or/and not others, use cp_isolation and cp_exclusion for that purpose. This function will strip the SIG in the payload/sequence every time except on UTF-16, UTF-32. By default the library does not setup any handler other than the NullHandler, if you choose to set the 'explain' toggle to True it will alter the logger configuration to add a StreamHandler that is suitable for debugging. Custom logging format and handler can be set manually. z4Expected object of type bytes or bytearray, got: {0}rz<Encoding detection on empty bytes, assuming utf_8 intention.�utf_8gF�Nz`cp_isolation is set. use this flag for debugging purpose. limited list of encoding allowed : %s.z, cS�g|]}t|d��qS�F�r��.0�cp�r2�TC:\pinokio\api\whisper-webui.git\app\env\lib\site-packages\charset_normalizer\api.py� <listcomp>[�zfrom_bytes.<locals>.<listcomp>zacp_exclusion is set. use this flag for debugging purpose. 
limited list of encoding excluded : %s.cSr,r-r.r/r2r2r3r4fr5z^override steps (%i) and chunk_size (%i) as content does not fit (%i byte(s) given) parameters.r z>Trying to detect encoding from a tiny portion of ({}) byte(s).zIUsing lazy str decoding because the payload is quite large, ({}) byte(s).z@Detected declarative mark in sequence. Priority +1 given for %s.zIDetected a SIG or BOM mark on first %i byte(s). Priority +1 given for %s.�ascii>�utf_16�utf_32z[Encoding %s wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.z2Encoding %s does not provide an IncrementalDecoderg��A)�encodingz9Code page %s does not fit given bytes sequence at ALL. %sTzW%s is deemed too similar to code page %s and was consider unsuited already. Continuing!zpCode page %s is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.�������zaLazyStr Loading: After MD chunk decode, code page %s does not fit given bytes sequence at ALL. %sgj�@�strict)�errorsz^LazyStr Loading: After final lookup, code page %s does not fit given bytes sequence at ALL. %szc%s was excluded because of initial chaos probing. Gave up %i time(s). Computed mean chaos is %f %%.�d�)�ndigitsz=%s passed initial chaos probing. Mean measured chaos is %f %%z&{} should target any language(s) of {}g�������?�,z We detected language {} using {}z.Encoding detection: %s is most likely the one.zoEncoding detection: %s is most likely the one as we detected a BOM or SIG within the beginning of the sequence.zONothing got out of the detection process. Using ASCII/UTF-8/Specified fallback.z7Encoding detection: %s will be used as a fallback matchz:Encoding detection: utf_8 will be used as a fallback matchz:Encoding detection: ascii will be used as a fallback matchz]Encoding detection: Found %s as plausible (best-candidate) for content. 
With %i alternatives.z=Encoding detection: Unable to determine any suitable charset.)4� isinstance� bytearray�bytes� TypeError�format�type�logger�level� addHandler�explain_handler�setLevelr�len�debug� removeHandler�logging�WARNINGrr�log�join�intrrr�append�setrr�addrr�ModuleNotFoundError� ImportError�str�UnicodeDecodeError� LookupErrorr�range�maxrr�decode�sum�roundr r r rr9� fingerprint�best)-r!r"r#r$r%r&r'r(Zprevious_logger_level�lengthZis_too_small_sequenceZis_too_large_sequenceZprioritized_encodingsZspecified_encodingZtestedZtested_but_hard_failureZtested_but_soft_failureZfallback_asciiZ fallback_u8Zfallback_specified�resultsZ sig_encodingZ sig_payloadZ encoding_ianaZdecoded_payloadZbom_or_sig_availableZstrip_sig_or_bomZis_multi_byte_decoder�eZsimilar_soft_failure_testZencoding_soft_failedZr_Zmulti_byte_bonusZmax_chunk_gave_upZearly_stop_countZlazy_str_hard_failureZ md_chunksZ md_ratios�chunkZmean_mess_ratioZfallback_entryZtarget_languagesZ cd_ratiosZchunk_languagesZcd_ratios_mergedr2r2r3� from_bytes#s���    �� �   ����� �   �     �� �� �� ��� � ��  �� �� �� �  � ���� ���� ��  � � �  � �� ����� �   �  � �� � ��      �   ri�fpc Cst|��|||||||�S)z� Same thing than the function from_bytes but using a file pointer that is already ready. Will not close the file pointer. )ri�read)rjr"r#r$r%r&r'r(r2r2r3�from_fp�s�rl�pathz PathLike[Any]c CsDt|d��}t||||||||�Wd�S1swYdS)z� Same thing than the function from_bytes but with one extra step. Opening and reading given file path in binary mode. Can raise IOError. �rbN)�openrl) rmr"r#r$r%r&r'r(rjr2r2r3� from_path�s �$�rpc Cs�t�dt�t|||||||�}t|�}tt|��} t|�dkr'td� |���|� �} | dd| j 7<t d� t |��|d�| ���d��} | �| ���Wd�| S1sZwY| S) zi Take a (text-based) file path and try to create another file next to it, this time using UTF-8. 
z2normalize is deprecated and will be removed in 3.0rz;Unable to normalize "{}", no encoding charset seems to fit.�-z{}r+�wbN)�warnings�warn�DeprecationWarningrpr�listrrN�IOErrorrGrdr9ror[�replacerT�write�output) rmr"r#r$r%r&r'rf�filenameZtarget_extensions�resultrjr2r2r3� normalizes@ ��  ��� ��r})rrr NNTF)rrr NNT)3rQrs�osrZos.pathrr�typingrrrrr �cdr r r rZconstantrrrr�mdr�modelsrr�utilsrrrrrrr� getLoggerrI� StreamHandlerrL� setFormatter� FormatterrErU�floatr[�boolrirlrpr}r2r2r2r3�<module>s�  $ ������ � ��� �G����� � ��� ������ � ��� ������ � ���