o �J�h�J�@sddlZddlZddlmZddlmZmZddlmZm Z m Z mZmZddl mZmZmZmZddlmZmZmZmZddlmZdd lmZmZdd lmZmZmZm Z m!Z!m"Z"m#Z#e�$d�Z%e�&�Z'e'�(e�)d�� d&de*de+de+de,dee e-dee e-de.de.defdd�Z/ d&de de+de+de,dee e-dee e-de.de.defdd�Z0 d&d d!de+de+de,dee e-dee e-de.de.defd"d#�Z1 d'd d!de+de+de,dee e-dee e-de.defd$d%�Z2dS)(�N)�PathLike)�basename�splitext)�Any�BinaryIO�List�Optional�Set�)�coherence_ratio�encoding_languages�mb_encoding_languages�merge_coherence_ratios)�IANA_SUPPORTED�TOO_BIG_SEQUENCE�TOO_SMALL_SEQUENCE�TRACE)� mess_ratio)�CharsetMatch�CharsetMatches)�any_specified_encoding�cut_sequence_chunks� iana_name�identify_sig_or_bom� is_cp_similar�is_multi_byte_encoding�should_strip_sig_or_bom�charset_normalizerz)%(asctime)s | %(levelname)s | %(message)s��皙��?TF� sequences�steps� chunk_size� threshold�cp_isolation�cp_exclusion�preemptive_behaviour�explain�returnc-Cs�t|ttf�std�t|��|rtj}t�t �t� t�t|�} | dkrGt� d�|r;t�t �t� |p9tj�tt|dddgd�g�S|dur]t�td d �|��dd�|D�}ng}|durut�td d �|��dd�|D�}ng}| ||kr�t�td||| �d}| }|dkr�| ||kr�t| |�}t|�tk} t|�tk}| r�t�td�| ��n|r�t�td�| ��g}|r�t|�nd} | dur�|�| �t�td| �t�}g}g}d}d}d}t�}t|�\}}|du�r|�|�t�tdt|�|�|�d�d|v�r|�d�|tD�]�}|�r!||v�r!�q|�r+||v�r+�q||v�r2�q|�|�d}||k}|�oCt|�}|dv�rU|�sUt�td|��qzt|�}Wnt t!f�yot�td|�Y�qwz9|�r�|du�r�t"|du�r�|dtd��n |t|�td��|d�nt"|du�r�|n|t|�d�|d�}Wn+t#t$f�y�}zt|t$��s�t�td|t"|��|�|�WYd}~�qd}~wwd}|D] }t%||��r�d}n�q�|�r�t�td||��qt&|�s�dnt|�| t| |��}|�o|du�ot|�| k} | �rt�td|�tt|�d�}!t'|!d �}!d}"d}#g}$g}%z9t(|||||||||� D]*}&|$�|&�|%�t)|&|��|%d!|k�rY|"d7}"|"|!k�sf|�rh|du�rhn�q?Wn!t#�y�}zt�td"|t"|��|!}"d}#WYd}~nd}~ww|#�s�|�r�|�s�z|td#�d�j*|d$d%�Wn#t#�y�}zt�td&|t"|��|�|�WYd}~�qd}~ww|%�r�t+|%�t|%�nd}'|'|k�s�|"|!k�r|�|�t�td'||"t,|'d(d)d*��|dd| fv�r|#�st|||dg|�}(|| k�r|(}n |dk�r|(}n|(}�qt�td+|t,|'d(d)d*��|�s2t-|�})nt.|�})|)�rEt�td,�|t"|)��g}*|dk�re|$D]}&t/|&d-|)�r[d.�|)�nd�}+|*�|+��qNt0|*�},|,�rvt�td/�|,|��|�t|||'||,|��|| ddfv�r�|'d-k�r�t� d0|�|�r�t�t �t� |�t||g�S||k�r�t� d1|�|�r�t�t �t� |�t||g�S�qt|�dk�r&|�s�|�s�|�r�t�td2�|�r�t� d3|j1�|�|�n2|�r�|du�s|�r |�r |j2|j2k�s|du�rt� d4�|�|�n |�r&t� d5�|�|�|�r8t� d6|�3�j1t|�d�nt� d7�|�rJt�t �t� |�|S)8ae Given a raw bytes sequence, return the best possibles charset usable to render str objects. If there is no results, it is a strong indicator that the source is binary/not text. By default, the process will extract 5 blocs of 512o each to assess the mess and coherence of a given sequence. And will give up a particular code page after 20% of measured mess. Those criteria are customizable at will. The preemptive behavior DOES NOT replace the traditional detection workflow, it prioritize a particular code page but never take it for granted. Can improve the performance. You may want to focus your attention to some code page or/and not others, use cp_isolation and cp_exclusion for that purpose. This function will strip the SIG in the payload/sequence every time except on UTF-16, UTF-32. By default the library does not setup any handler other than the NullHandler, if you choose to set the 'explain' toggle to True it will alter the logger configuration to add a StreamHandler that is suitable for debugging. Custom logging format and handler can be set manually. z4Expected object of type bytes or bytearray, got: {0}rz<Encoding detection on empty bytes, assuming utf_8 intention.�utf_8gF�Nz`cp_isolation is set. use this flag for debugging purpose. limited list of encoding allowed : %s.z, cS�g|]}t|d��qS�F�r��.0�cp�r2�TC:\pinokio\api\whisper-webui.git\app\env\lib\site-packages\charset_normalizer\api.py� <listcomp>[�zfrom_bytes.<locals>.<listcomp>zacp_exclusion is set. use this flag for debugging purpose. limited list of encoding excluded : %s.cSr,r-r.r/r2r2r3r4fr5z^override steps (%i) and chunk_size (%i) as content does not fit (%i byte(s) given) parameters.r z>Trying to detect encoding from a tiny portion of ({}) byte(s).zIUsing lazy str decoding because the payload is quite large, ({}) byte(s).z@Detected declarative mark in sequence. Priority +1 given for %s.zIDetected a SIG or BOM mark on first %i byte(s). Priority +1 given for %s.�ascii>�utf_16�utf_32z[Encoding %s wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.z2Encoding %s does not provide an IncrementalDecoderg��A)�encodingz9Code page %s does not fit given bytes sequence at ALL. %sTzW%s is deemed too similar to code page %s and was consider unsuited already. Continuing!zpCode page %s is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.��zaLazyStr Loading: After MD chunk decode, code page %s does not fit given bytes sequence at ALL. %sgj�@�strict)�errorsz^LazyStr Loading: After final lookup, code page %s does not fit given bytes sequence at ALL. %szc%s was excluded because of initial chaos probing. Gave up %i time(s). Computed mean chaos is %f %%.�d�)�ndigitsz=%s passed initial chaos probing. Mean measured chaos is %f %%z&{} should target any language(s) of {}g��?�,z We detected language {} using {}z.Encoding detection: %s is most likely the one.zoEncoding detection: %s is most likely the one as we detected a BOM or SIG within the beginning of the sequence.zONothing got out of the detection process. Using ASCII/UTF-8/Specified fallback.z7Encoding detection: %s will be used as a fallback matchz:Encoding detection: utf_8 will be used as a fallback matchz:Encoding detection: ascii will be used as a fallback matchz]Encoding detection: Found %s as plausible (best-candidate) for content. With %i alternatives.z=Encoding detection: Unable to determine any suitable charset.)4� isinstance� bytearray�bytes� TypeError�format�type�logger�level� addHandler�explain_handler�setLevelr�len�debug� removeHandler�logging�WARNINGrr�log�join�intrrr�append�setrr�addrr�ModuleNotFoundError�ImportError�str�UnicodeDecodeError�LookupErrorr�range�maxrr�decode�sum�roundrr rrr9�fingerprint�best)-r!r"r#r$r%r&r'r(Zprevious_logger_level�lengthZis_too_small_sequenceZis_too_large_sequenceZprioritized_encodingsZspecified_encodingZtestedZtested_but_hard_failureZtested_but_soft_failureZfallback_asciiZfallback_u8Zfallback_specified�resultsZsig_encodingZsig_payloadZ encoding_ianaZdecoded_payloadZbom_or_sig_availableZstrip_sig_or_bomZis_multi_byte_decoder�eZsimilar_soft_failure_testZencoding_soft_failedZr_Zmulti_byte_bonusZmax_chunk_gave_upZearly_stop_countZlazy_str_hard_failureZ md_chunksZ md_ratios�chunkZmean_mess_ratioZfallback_entryZtarget_languagesZ cd_ratiosZchunk_languagesZcd_ratios_mergedr2r2r3� from_bytes#s�� ri�fpc Cst|��|||||||�S)z� Same thing than the function from_bytes but using a file pointer that is already ready. Will not close the file pointer. )ri�read)rjr"r#r$r%r&r'r(r2r2r3�from_fp�s�rl�pathz PathLike[Any]c CsDt|d��}t||||||||�Wd�S1swYdS)z� Same thing than the function from_bytes but with one extra step. Opening and reading given file path in binary mode. Can raise IOError. �rbN)�openrl) rmr"r#r$r%r&r'r(rjr2r2r3� from_path�s�$�rpc Cs�t�dt�t|||||||�}t|�}tt|��} t|�dkr'td� |��|� �} | dd| j7<td� t |��|d�| ��d��}|�| ��Wd�| S1sZwY| S) zi Take a (text-based) file path and try to create another file next to it, this time using UTF-8. z2normalize is deprecated and will be removed in 3.0rz;Unable to normalize "{}", no encoding charset seems to fit.�-z{}r+�wbN)�warnings�warn�DeprecationWarningrpr�listrrN�IOErrorrGrdr9ror[�replacerT�write�output)rmr"r#r$r%r&r'rf�filenameZtarget_extensions�resultrjr2r2r3� normalizes@�� r})rrr NNTF)rrr NNT)3rQrs�osrZos.pathrr�typingrrrrr �cdrrr rZconstantrrrr�mdr�modelsrr�utilsrrrrrrr� getLoggerrI� StreamHandlerrL�setFormatter� FormatterrErU�floatr[�boolrirlrpr}r2r2r2r3�<module>s�$ �� G��