mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-12-26 14:45:31 +00:00
refactor: separate click wrappers from core evaluation functionality (#1981)
### Summary

Click-decorated functions cannot (properly) be called outside of the click interface. This makes it difficult to reuse the setup functionality in `measure_text_edit_distance` or `measure_element_type_accuracy`. This PR removes the click decoration from those functions and moves it into separate wrapper functions whose only job is to execute the command.

### Technical Details

- Changed as suggested in the response to [this StackOverflow post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command)
- The now distinct functions live in separate locations: the `_command` click-decorated functions stay in `ingest/evaluate.py`, and the core functions `measure_text_edit_distance` and `measure_element_type_accuracy` are moved into the `unstructured/metrics/` folder (which is a more logical location for them).
- Initial test added for `measure_text_edit_distance`

### Test

`sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction` functionality is unchanged.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
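The separation described in the commit message can be sketched roughly as follows. This is a minimal illustration of the pattern, not the code from this commit: `count_unique_words` and `count_unique_words_command` are hypothetical names chosen for the example (the `_command` suffix mirrors the naming mentioned above), and the word-counting logic merely stands in for the real metrics.

```python
import click


def count_unique_words(text: str, ignore_case: bool = False) -> int:
    """Core function (hypothetical): plain Python, importable and testable
    without going through the click interface."""
    if ignore_case:
        text = text.lower()
    return len(set(text.split()))


@click.command(name="count-unique-words")
@click.argument("text")
@click.option("--ignore-case", is_flag=True, help="Treat 'Bank' and 'bank' as one word.")
def count_unique_words_command(text: str, ignore_case: bool) -> None:
    """Click wrapper (hypothetical): parses CLI arguments, then delegates
    to the core function and prints the result."""
    click.echo(count_unique_words(text, ignore_case=ignore_case))
```

With this layout, other code and unit tests call `count_unique_words(...)` directly, while the command-line entry point only ever invokes `count_unique_words_command`.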
parent ad14321016
commit 6db663e7bb
@ -1,3 +1,5 @@
## 0.10.30-dev0

## 0.10.29

### Enhancements
@ -0,0 +1,36 @@
Bank Good Credit
Accredited with IABACTM
(International Association of Business Analytics Certifications)
IABAC International Association of
Business Analytics Certification
DataMitesTM. All Right Reserved

Objective & Background
Classify credit card customers as good / bad, based on information from internal and external sources.
Data provided
Demographic: Base file of with credit card history details. Only one record for every customer.
Account: Contians data for various loans availed by the customer. Not related to credit card. Multiple records for every customer.
Enquiries: Enquired made by customers for different loan purposes. Multiple records for every customer.
DataMitesTM. All Right Reserved

Design
Data to be downloaded using SQL queries.
Required information to be extracted from Account and Enquiry files and converted to one-to-one files.
The columns from the two files should be merged with Demographic file using Left Join with customer no as key column, to create a final file. The final file should contain all the records in demographic and additional columns/features from Account and Enquiry files will get added to Demographic file.
There will be many customers in account and enquiry file who will get left out. This is fine as we anyway dont know their good/bad label for training purpose.
DataMitesTM. All Right Reserved

Analysis of Data
Show using Excel File
DataMitesTM. All Right Reserved

Explain Coding / outcomes
Show using Jupyter
DataMitesTM. All Right Reserved


Thank You
DataMitesTM. All Right Reserved
@ -0,0 +1,205 @@
Page 1
The introductory chapter of Government Auditing Standards (GAGAS)1
outlines five concepts describing how public officials are to provide
functions and services: effectively, efficiently, economically, ethically, and
equitably. When planning, gathering and assessing evidence, and
reporting audit results, auditors may focus on one or more of these
concepts. The following discussion is intended to assist auditors when
developing audit objectives for performance audits of government
programs and activities.2
This discussion is designed to help auditors understand
and apply the concepts cited above for performance audits
conducted in accordance with GAGAS. This discussion
does not contain requirements, does not amend GAGAS,
and is not considered interpretive guidance, as defined in
chapter 2 of GAGAS.
Paragraph 1.02:
The concept of accountability for use of public resources
and government authority is key to our nation’s governing
processes. Management and officials entrusted with public
resources are responsible for carrying out public functions
and providing service to the public effectively, efficiently,
economically, ethically, and equitably within the context
of the statutory boundaries of the specific government
program. [Emphasis added.]
Paragraph 1.03:
As reflected in applicable laws, regulations, agreements,
and standards, management and officials of government
programs are responsible for providing reliable, useful, and
timely information for transparency and accountability of
these programs and their operations. Legislators, oversight
1GAO, Government Auditing Standards: 2018 Revision, GAO-21-368G (Washington,
D.C.: April 2021)
2The concepts cited may also be applicable to other GAGAS engagements, based on the
auditors’ judgments. This discussion is limited to considering these concepts in
performance audits.
GAGAS Performance Audits: Discussion of
Concepts to Consider When Auditing Public
Functions and Services
GAGAS Paragraphs
Page 2
bodies, those charged with governance, and the public
need to know whether (1) management and officials
manage government resources and use their authority
properly and in compliance with laws and regulations; (2)
government programs are achieving their objectives and
desired outcomes; and (3) government services are
provided effectively, efficiently, economically, ethically,
and equitably. [Emphasis added.]
Government administration best serves the collective interest of the public
when it is effective, efficient, economical, ethical, and equitable. Auditors
help inform legislators, oversight bodies, those charged with governance,
and the public about whether public services are being provided
consistent with these concepts. Government auditing can contribute to
accountability and can help improve government administration by
identifying deficiencies and recommending enhancements to achieve
effective, efficient, economical, ethical, and equitable outcomes, when
appropriate within the context of the audit objectives. As such, it is
important for auditors to understand the concepts below as they relate to
administering government programs or activities and how they can
assess or address these expectations of government performance in
conducting their performance audits.
The examples that follow the discussion of each concept illustrate the
distinctions between these concepts. In a performance audit, it is
common practice to incorporate more than one of these concepts when
conducting the audit.
The administration of a government program or activity is effective when
it achieves the intended results. A performance audit that focuses on the
effectiveness of a program or activity seeks to establish a cause-andeffect relationship between the operation of the program or activity and
achieving its stated objectives. Achieving the objectives does not
guarantee that the program or activity was effective unless the auditors
can establish that the program or activity caused, or contributed to, the
desired outcome.
Example: In a performance audit examining how effective a
housing voucher program was in achieving its goal of improving
economic outcomes for recipients, auditors may determine
whether receiving housing vouchers led to better subsequent
economic outcomes for recipients than those of similarly situated
individuals who did not receive vouchers.
Discussion
Effective
Page 3
Example: In a performance audit assessing the effectiveness of
an after-school program targeted at helping students improve their
reading proficiency, auditors may examine the extent to which
participants’ reading levels improved relative to baseline data from
before they joined the program.
The administration of a government program or activity is efficient when
it gets the most value from available resources. When a performance
audit focuses on efficiency, auditors examine whether the resources used
to administer a program or activity have been put to optimal or
satisfactory use, or whether the same or similar results could have been
achieved more timely or with fewer resources.
Example: In a performance audit assessing a disaster relief
agency’s mobilization of resources to respond to a disaster,
auditors may assess the disaster relief agency’s timeliness in
providing relief compared to its own previous performance or the
performance of other similarly situated agencies that have
responded to comparable disasters.
Example: In a performance audit assessing a consumer protection
agency’s response to consumer complaints, auditors may assess
whether the agency’s efforts to streamline its processes resulted
in improved timely resolution of complaints.
Example: In a performance audit assessing the time a state needs
to process unemployment benefits targeted at helping those in
need, auditors may assess how long the process takes from
receipt of the unemployment application to the applicant’s receipt
of the benefit, including steps such as verifying required
information.
The administration of a government program or activity is economical
when it minimizes the costs of resources used in performing its functions
while meeting timeliness and quality considerations for those resources.
When auditing economy, auditors primarily focus on the costs of inputs
rather than on the outcomes achieved.
Example: In a performance audit examining an agency’s
international travel expenses, in addition to assessing the design
of internal controls and compliance with expense guidelines,
auditors may test whether, for a sample of trips, bookings of
Efficient
Economical
Page 4
equivalent airline tickets and hotel rooms could be found at a
lower cost.
Example: In a performance audit assessing an agency’s
acquisition practices, auditors may examine whether the agency’s
decisions regarding purchasing, leasing, or reimbursing
employees for the costs of acquiring various supplies or
equipment achieved the lowest cost while meeting applicable
requirements.
The administration of a government program or activity is ethical when it
advances the collective interest of the public rather than private gain and
is conducted with honesty, integrity, and impartiality. Laws and
regulations often specify rules of ethical conduct. Therefore, audits
examining the ethical administration of a program or activity may involve
assessing compliance with such laws and regulations. Fraud in
administering a government program or activity betrays the public trust
and is, by definition, unethical. In addition, auditors may identify instances
of unethical conduct that result in waste and abuse during testing of
internal controls as part of a performance audit.
Example: In a performance audit assessing agency officials’
compliance with conflict-of-interest requirements, auditors may
compare a sample of financial disclosure reports filed against
requirements in statute or regulation.
Example: In a performance audit assessing potential regulatory
capture related to a particular industry, auditors may assess the
extent to which the regulatory agency has sufficient controls to
reasonably assure its employees’ independence from the entities
subject to the agency’s regulation.
Example: In a performance audit assessing an office’s policies
and procedures for purchase cards, auditors’ testing of the
program’s controls to identify deficiencies may identify fraud,
waste, or abuse in its administration.
The administration of a government program or activity is equitable when
it consistently serves members of the public, distributes public services,
and implements public policy in a manner that promotes fairness, justice,
and equality. Auditing whether the administration of a government
program or activity is equitable may include assessing the
Ethical
Equitable
Page 5
• equality of access to and provision of services;
• procedural fairness and equal treatment of individuals in
government programs and policies;
• causes of disparate outcomes;
• or distributional impacts of public policies, programs, resources,
and services.
Disaggregating data by social groups or communities that share a
particular characteristic (e.g., gender, race, ethnicity, age, or income)
can help illuminate differences. Reporting on such differences, when
appropriate within the context of the audit objectives, can increase
understanding of the effects of policies and programs on issues of
equity.
Example: In a performance audit assessing the granting of
waivers from particular requirements, auditors may use
disaggregated data about waiver recipients to assess whether
different groups or communities were treated fairly and equally in
the process.
Example: In a performance audit assessing a grant program
aimed at expanding internet access, auditors may assess the
extent to which formulas, criteria, or other factors (such as
matching funds or capital requirements) considered in the
distribution of grant funds may be to the specific advantage or
disadvantage of certain groups, regions, or communities, thereby
causing inequities.
Example: In a performance audit assessing scholarship outcomes
in higher education programs, auditors may report on the
distribution of scholarships by race, gender identity, and income to
illuminate potential disparities among scholarship recipients.
These concepts may overlap. For example, efficiency may also be a
component of effectiveness. Similarly, when appropriate within the
context of the program and audit objectives, auditors may disaggregate
the results of performance audits that focus on efficiency or effectiveness
Page 6
issues to illuminate inequities in program administration or in distribution
of public services.
While all of these concepts are important to administering government
programs responsibly, it is up to the professional judgment of the auditors
to determine the specific concepts that are relevant in conducting the
performance audit and reporting the results. Auditors’ professional
judgments are informed by, among other things, the needs of the users of
the audit reports; the nature, context, and objectives of the program or
activity under audit; and the public interest.
To view the current Yellow Book, visit https://www.gao.gov/yellowbook.
For technical assistance, call (202) 512-9535 or email
yellowbook@gao.gov.
For More Information
@ -0,0 +1,164 @@
Code Symbol Name
AED . United Arab Emirates d
AFN Afghan afghani
ALL L Albanian lek
AMD AMD Armenian dram
ANG Netherlands Antillean gu
AOA Kz Angolan kwanza
ARS $ Argentine peso
AUD $ Australian dollar
AWG Afl. Aruban florin
AZN AZN Azerbaijani manat
BAM KM Bosnia and Herzegovina
BBD $ Barbadian dollar
BDT Bangladeshi taka
BGN . Bulgarian lev
BHD .. Bahraini dinar
BIF Fr Burundian franc
BMD $ Bermudian dollar
BND $ Brunei dollar
BOB Bs. Bolivian boliviano
BRL R$ Brazilian real
BSD $ Bahamian dollar
BTC Bitcoin
BTN Nu. Bhutanese ngultrum
BWP P Botswana pula
BYR Br Belarusian ruble (old)'
BYN Br Belarusian ruble
BZD $ Belize dollar
CAD $ Canadian dollar
CDF Fr Congolese franc
CHF CHF Swiss franc
CLP $ Chilean peso
CNY Chinese yuan
COP $ Colombian peso
CRC Costa Rican coln
CUC $ Cuban convertible peso')
CUP $ Cuban peso
CVE $ Cape Verdean escudo
CZK K Czech koruna
DJF Fr Djiboutian franc
DKK DKK Danish krone
DOP RD$ Dominican peso
DZD . Algerian dinar
EGP EGP Egyptian pound
ERN Nfk Eritrean nakfa
ETB Br Ethiopian birr
EUR Euro
FJD $ Fijian dollar
FKP Falkland Islands pound')
GBP Pound sterling
GEL Georgian lari
GGP Guernsey pound
GHS Ghana cedi
GIP Gibraltar pound
GMD D Gambian dalasi
GNF Fr Guinean franc
GTQ Q Guatemalan quetzal
GYD $ Guyanese dollar
HKD $ Hong Kong dollar
HNL L Honduran lempira
HRK kn Croatian kuna
HTG G Haitian gourde
HUF Ft Hungarian forint
IDR Rp Indonesian rupiah
ILS Israeli new shekel
IMP Manx pound
INR Indian rupee
IQD . Iraqi dinar
IRR Iranian rial
IRT Iranian toman
ISK kr. Icelandic krna
JEP Jersey pound
JMD $ Jamaican dollar
JOD . Jordanian dinar
JPY Japanese yen
KES KSh Kenyan shilling
KGS Kyrgyzstani som
KHR Cambodian riel
KMF Fr Comorian franc
KPW North Korean won
KRW South Korean won
KWD . Kuwaiti dinar
KYD $ Cayman Islands dollar
KZT Kazakhstani tenge
LAK Lao kip
LBP . Lebanese pound
LKR Sri Lankan rupee
LRD $ Liberian dollar
LSL L Lesotho loti
LYD . Libyan dinar
MAD .. Moroccan dirham
MDL MDL Moldovan leu
MGA Ar Malagasy ariary
MKD Macedonian denar
MMK Ks Burmese kyat
MNT Mongolian tgrg
MOP P Macanese pataca
MRU UM Mauritanian ouguiya
MUR Mauritian rupee
MVR . Maldivian rufiyaa
MWK MK Malawian kwacha
MXN $ Mexican peso
MYR RM Malaysian ringgit
MZN MT Mozambican metical
NAD N$ Namibian dollar
NGN Nigerian naira
NIO C$ Nicaraguan crdoba
NOK kr Norwegian krone
NPR Nepalese rupee
NZD $ New Zealand dollar
OMR .. Omani rial
PAB B/. Panamanian balboa
PEN S/ Sol
PGK K Papua New Guinean kina')
PHP Philippine peso
PKR Pakistani rupee
PLN z Polish zoty
PRB . Transnistrian ruble
PYG Paraguayan guaran
QAR . Qatari riyal
RON lei Romanian leu
RSD Serbian dinar
RUB Russian ruble
RWF Fr Rwandan franc
SAR . Saudi riyal
SBD $ Solomon Islands dollar')
SCR Seychellois rupee
SDG .. Sudanese pound
SEK kr Swedish krona
SGD $ Singapore dollar
SHP Saint Helena pound
SLL Le Sierra Leonean leone
SOS Sh Somali shilling
SRD $ Surinamese dollar
SSP South Sudanese pound
STN Db So Tom and Prncipe d
SYP . Syrian pound
SZL L Swazi lilangeni
THB Thai baht
TJS Tajikistani somoni
TMT m Turkmenistan manat
TND . Tunisian dinar
TOP T$ Tongan paanga
TRY Turkish lira
TTD $ Trinidad and Tobago doll
TWD NT$ New Taiwan dollar
TZS Sh Tanzanian shilling
UAH Ukrainian hryvnia
UGX UGX Ugandan shilling
USD $ United States (US) dolla
UYU $ Uruguayan peso
UZS UZS Uzbekistani som
VEF Bs F Venezuelan bolvar
VES Bs.S Bolvar soberano
VND Vietnamese ng
VUV Vt Vanuatu vatu
WST T Samoan tl
XAF CFA Central African CFA fr
XCD $ East Caribbean dollar
XOF CFA West African CFA franc
XPF Fr CFP franc
YER Yemeni rial
ZAR R South African rand
ZMW ZK Zambian kwacha
@ -0,0 +1,420 @@
[
  {
    "type": "Title",
    "element_id": "0405351ac64213c7b1e40e31aff7d21b",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "category_depth": 1,
      "languages": [
        "eng"
      ],
      "page_number": 1
    },
    "text": "Bank Good Credit "
  },
  {
    "type": "NarrativeText",
    "element_id": "214987ebee9fd615365185fb3d692253",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "parent_id": "0405351ac64213c7b1e40e31aff7d21b",
      "category_depth": 0,
      "languages": [
        "eng"
      ],
      "page_number": 1
    },
    "text": "Accredited with IABAC\u2122"
  },
  {
    "type": "Title",
    "element_id": "fc3d53b1d173c5c72205914ea331b052",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "category_depth": 1,
      "languages": [
        "eng"
      ],
      "page_number": 1
    },
    "text": "( International Association of Business Analytics Certifications)`"
  },
  {
    "type": "Title",
    "element_id": "b952b3e6d0e34020f1f48b5d9243d0a4",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "category_depth": 1,
      "languages": [
        "eng"
      ],
      "page_number": 1
    },
    "text": "\u00a9 DataMites\u2122. All Rights Reserved | www.datamites.com"
  },
  {
    "type": "Title",
    "element_id": "2dc308bd8d3a5c745dfacc3bdccd81db",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "category_depth": 0,
      "languages": [
        "eng"
      ],
      "page_number": 2
    },
    "text": "Objective & Background"
  },
  {
    "type": "ListItem",
    "element_id": "5a0a7e2a14285297ff3752656cb6df44",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "parent_id": "2dc308bd8d3a5c745dfacc3bdccd81db",
      "category_depth": 1,
      "languages": [
        "eng"
      ],
      "page_number": 2
    },
    "text": "Classify credit card customers as good / bad, based on information from internal and external sources. "
  },
  {
    "type": "ListItem",
    "element_id": "5eb6ec96e6a3493c1ae56747ae457b7f",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "parent_id": "2dc308bd8d3a5c745dfacc3bdccd81db",
      "category_depth": 1,
      "languages": [
        "eng"
      ],
      "page_number": 2
    },
    "text": "Data provided"
  },
  {
    "type": "ListItem",
    "element_id": "adec2b6c75369165b1d87dccdfd2dab8",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "parent_id": "5eb6ec96e6a3493c1ae56747ae457b7f",
      "category_depth": 2,
      "languages": [
        "eng"
      ],
      "page_number": 2
    },
    "text": "Demographic: Base file of with credit card history details. Only one record for every customer."
  },
  {
    "type": "ListItem",
    "element_id": "c26d6dc6982b7f42045f4ffee951f8e0",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "parent_id": "5eb6ec96e6a3493c1ae56747ae457b7f",
      "category_depth": 2,
      "languages": [
        "eng"
      ],
      "page_number": 2
    },
    "text": "Account: Contians data for various loans availed by the customer. Not related to credit card. Multiple records for every customer."
  },
  {
    "type": "ListItem",
    "element_id": "be3fc5cb3da83e6c22d1906330ee9f96",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "parent_id": "5eb6ec96e6a3493c1ae56747ae457b7f",
      "category_depth": 2,
      "languages": [
        "eng"
      ],
      "page_number": 2
    },
    "text": "Enquiries: Enquired made by customers for different loan purposes. Multiple records for every customer.\t"
  },
  {
    "type": "Title",
    "element_id": "b952b3e6d0e34020f1f48b5d9243d0a4",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "parent_id": "2dc308bd8d3a5c745dfacc3bdccd81db",
      "category_depth": 1,
      "languages": [
        "eng"
      ],
      "page_number": 2
    },
    "text": "\u00a9 DataMites\u2122. All Rights Reserved | www.datamites.com"
  },
  {
    "type": "Title",
    "element_id": "0072e6b934945d5ba08f9729e0084739",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "category_depth": 0,
      "languages": [
        "eng"
      ],
      "page_number": 3
    },
    "text": "Design"
  },
  {
    "type": "ListItem",
    "element_id": "af14c0ecaaa7ac1d2bca5cdfbcc32ec7",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "parent_id": "0072e6b934945d5ba08f9729e0084739",
      "category_depth": 0,
      "languages": [
        "eng"
      ],
      "page_number": 3
    },
    "text": "Data to be downloaded using SQL queries."
  },
  {
    "type": "ListItem",
    "element_id": "51d8e67259ab8a11d2fdfc5cb9bcf45e",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "parent_id": "0072e6b934945d5ba08f9729e0084739",
      "category_depth": 0,
      "languages": [
        "eng"
      ],
      "page_number": 3
    },
    "text": "Required information to be extracted from Account and Enquiry files and converted to one-to-one files."
  },
  {
    "type": "ListItem",
    "element_id": "6165e3bc219556f6ec397adc7240386b",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "parent_id": "0072e6b934945d5ba08f9729e0084739",
      "category_depth": 0,
      "languages": [
        "eng"
      ],
      "page_number": 3
    },
    "text": "The columns from the two files should be merged with Demographic file using Left Join with \u201ccustomer no\u201d as key column, to create a final file. The final file should contain all the records in demographic and additional columns/features from Account and Enquiry files will get added to Demographic file."
  },
  {
    "type": "ListItem",
    "element_id": "31930936fc3bad2175b05e324e9923e4",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "parent_id": "0072e6b934945d5ba08f9729e0084739",
      "category_depth": 0,
      "languages": [
        "eng"
      ],
      "page_number": 3
    },
    "text": "There will be many customers in account and enquiry file who will get left out. This is fine as we anyway don\u2019t know their good/bad label for training purpose. "
  },
  {
    "type": "Title",
    "element_id": "b952b3e6d0e34020f1f48b5d9243d0a4",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "parent_id": "0072e6b934945d5ba08f9729e0084739",
      "category_depth": 1,
      "languages": [
        "eng"
      ],
      "page_number": 3
    },
    "text": "\u00a9 DataMites\u2122. All Rights Reserved | www.datamites.com"
  },
  {
    "type": "Title",
    "element_id": "ed83647ab77addbea9e4dca5f7d8f216",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "category_depth": 0,
      "languages": [
        "eng"
      ],
      "page_number": 4
    },
    "text": "Analysis of Data"
  },
  {
    "type": "ListItem",
    "element_id": "d936f750c577a228ebabd9ed2cec9a70",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "parent_id": "ed83647ab77addbea9e4dca5f7d8f216",
      "category_depth": 1,
      "languages": [
        "eng"
      ],
      "page_number": 4
    },
    "text": "Show using Excel File"
  },
  {
    "type": "Title",
    "element_id": "b952b3e6d0e34020f1f48b5d9243d0a4",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "parent_id": "ed83647ab77addbea9e4dca5f7d8f216",
      "category_depth": 1,
      "languages": [
        "eng"
      ],
      "page_number": 4
    },
    "text": "\u00a9 DataMites\u2122. All Rights Reserved | www.datamites.com"
  },
  {
    "type": "Title",
    "element_id": "7207da66fd1c6771ee1a5705dc41c0c7",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "category_depth": 0,
      "languages": [
        "eng"
      ],
      "page_number": 5
    },
    "text": "Explain Coding / outcomes "
  },
  {
    "type": "ListItem",
    "element_id": "815ef1753a8bcb1ce21d59819bdc6834",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "parent_id": "7207da66fd1c6771ee1a5705dc41c0c7",
      "category_depth": 1,
      "languages": [
        "eng"
      ],
      "page_number": 5
    },
    "text": "Show using Jupyter"
  },
  {
    "type": "Title",
    "element_id": "b952b3e6d0e34020f1f48b5d9243d0a4",
    "metadata": {
      "filename": "Bank Good Credit Loan.pptx",
      "file_directory": "tmpdocs",
      "last_modified": "2023-11-02T15:16:14",
|
||||
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
|
||||
"parent_id": "7207da66fd1c6771ee1a5705dc41c0c7",
|
||||
"category_depth": 1,
|
||||
"languages": [
|
||||
"eng"
|
||||
],
|
||||
"page_number": 5
|
||||
},
|
||||
"text": "\u00a9 DataMites\u2122. All Rights Reserved | www.datamites.com"
|
||||
},
|
||||
{
|
||||
"type": "Title",
|
||||
"element_id": "2034ce6155036f8a009ef33985209e88",
|
||||
"metadata": {
|
||||
"filename": "Bank Good Credit Loan.pptx",
|
||||
"file_directory": "tmpdocs",
|
||||
"last_modified": "2023-11-02T15:16:14",
|
||||
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
|
||||
"parent_id": "7207da66fd1c6771ee1a5705dc41c0c7",
|
||||
"category_depth": 1,
|
||||
"languages": [
|
||||
"eng"
|
||||
],
|
||||
"page_number": 6
|
||||
},
|
||||
"text": "Thank You"
|
||||
},
|
||||
{
|
||||
"type": "Title",
|
||||
"element_id": "b952b3e6d0e34020f1f48b5d9243d0a4",
|
||||
"metadata": {
|
||||
"filename": "Bank Good Credit Loan.pptx",
|
||||
"file_directory": "tmpdocs",
|
||||
"last_modified": "2023-11-02T15:16:14",
|
||||
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
|
||||
"parent_id": "7207da66fd1c6771ee1a5705dc41c0c7",
|
||||
"category_depth": 1,
|
||||
"languages": [
|
||||
"eng"
|
||||
],
|
||||
"page_number": 6
|
||||
},
|
||||
"text": "\u00a9 DataMites\u2122. All Rights Reserved | www.datamites.com"
|
||||
}
|
||||
]
|
||||
36
test_unstructured/metrics/test_evaluate.py
Normal file
@ -0,0 +1,36 @@
import os
import pathlib

import pytest

from unstructured.metrics.evaluate import (
    measure_text_edit_distance,
)

is_in_docker = os.path.exists("/.dockerenv")

EXAMPLE_DOCS_DIRECTORY = os.path.join(
    pathlib.Path(__file__).parent.resolve(), "..", "..", "example-docs"
)
TESTING_FILE_DIR = os.path.join(EXAMPLE_DOCS_DIRECTORY, "test_evaluate_files")

UNSTRUCTURED_OUTPUT_DIRNAME = "unstructured_output"
GOLD_CCT_DIRNAME = "gold_standard_cct"


@pytest.mark.skipif(is_in_docker, reason="Skipping this test in Docker container")
def test_text_extraction_takes_list():
    output_dir = os.path.join(TESTING_FILE_DIR, UNSTRUCTURED_OUTPUT_DIRNAME)
    output_list = ["currency.csv.json"]
    source_dir = os.path.join(TESTING_FILE_DIR, GOLD_CCT_DIRNAME)
    export_dir = os.path.join(TESTING_FILE_DIR, "test_evaluate_results_cct")
    measure_text_edit_distance(
        output_dir=output_dir,
        source_dir=source_dir,
        output_list=output_list,
        export_dir=export_dir,
    )
    # check that only the listed files are included
    with open(os.path.join(export_dir, "all-docs-cct.tsv")) as f:
        lines = f.read().splitlines()
    assert len(lines) == len(output_list) + 1  # includes header
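The final assertion in the test above relies on `all-docs-cct.tsv` holding one header row plus one row per evaluated document. The following is a minimal, self-contained sketch of that invariant using only the standard library. `write_tsv` and `count_lines` are hypothetical names standing in for the module's private TSV writer, and the row values are placeholders rather than real evaluation results.

```python
import csv
import os
import tempfile
from typing import Any, List


def write_tsv(directory: str, filename: str, rows: List[List[Any]], headers: List[str]) -> str:
    """Write a header row followed by `rows` to a tab-separated file (hypothetical helper)."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "w", newline="") as tsv:
        writer = csv.writer(tsv, delimiter="\t")
        writer.writerow(headers)
        writer.writerows(rows)
    return path


def count_lines(path: str) -> int:
    """Count lines the same way the test does: read everything, then splitlines()."""
    with open(path) as f:
        return len(f.read().splitlines())


output_list = ["currency.csv.json"]
headers = ["filename", "doctype", "connector", "cct-accuracy", "cct-%missing"]
# Placeholder values only; these are not real evaluation results.
rows = [["currency.csv", "csv", "n/a", 1.0, 0.0] for _ in output_list]

with tempfile.TemporaryDirectory() as export_dir:
    tsv_path = write_tsv(export_dir, "all-docs-cct.tsv", rows, headers)
    line_count = count_lines(tsv_path)

print(line_count == len(output_list) + 1)
```

Writing into a temporary directory, as this sketch does, would also avoid leaving result files under `example-docs`; whether the real test should do that is a design choice the commit does not address.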
@ -12,9 +12,9 @@ mkdir -p "$OUTPUT_DIR"
 EVAL_NAME="$1"

 if [ "$EVAL_NAME" == "text-extraction" ]; then
-  METRIC_STRATEGY="measure-text-edit-distance"
+  METRIC_STRATEGY="measure-text-edit-distance-command"
 elif [ "$EVAL_NAME" == "element-type" ]; then
-  METRIC_STRATEGY="measure-element-type-accuracy"
+  METRIC_STRATEGY="measure-element-type-accuracy-command"
 else
   echo "Wrong metric evaluation strategy given. Expected one of [ text-extraction, element-type ]. Got [ $EVAL_NAME ]."
   exit 1
@ -1,3 +1,3 @@
 strategy average sample_sd population_sd count
-cct-accuracy 0.798 0.083 0.072 4
-cct-%missing 0.089 0.04 0.035 4
+cct-accuracy 0.735 0.069 0.048 2
+cct-%missing 0.086 0.069 0.049 2
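The `sample_sd` and `population_sd` columns correspond to `statistics.stdev` and `statistics.pstdev`, which the helper functions in `unstructured/metrics/evaluate.py` wrap. As a rough check, the sketch below recomputes the `cct-%missing` row from the two per-document values that appear in this commit's per-document table (0.135 and 0.037). `summarize` is a hypothetical name and not part of the library.

```python
import statistics
from typing import List, Optional, Union

Number = Union[int, float]


def summarize(scores: List[float], rounding: int = 3) -> List[Optional[Number]]:
    """Return [mean, sample sd, population sd, count] for a list of scores.

    Both standard deviations are None when fewer than two scores exist,
    since a sample standard deviation is undefined for a single value.
    """
    if not scores:
        return [None, None, None, 0]
    mean = round(statistics.mean(scores), rounding)
    if len(scores) < 2:
        return [mean, None, None, len(scores)]
    return [
        mean,
        round(statistics.stdev(scores), rounding),
        round(statistics.pstdev(scores), rounding),
        len(scores),
    ]


percent_missing = [0.135, 0.037]
print(summarize(percent_missing))  # [0.086, 0.069, 0.049, 2]
```

The result agrees with the updated `cct-%missing` row: with only two documents the sample standard deviation is larger than the population one by a factor of the square root of two.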
@ -1,3 +1,3 @@
-filename connector cct-accuracy cct-%missing
-example-10k.html local 0.686 0.037
-IRS-form-1987.pdf azure 0.783 0.135
+filename doctype connector cct-accuracy cct-%missing
+IRS-form-1987.pdf pdf azure 0.783 0.135
+example-10k.html html local 0.686 0.037
@ -1 +1 @@
-__version__ = "0.10.29"  # pragma: no cover
+__version__ = "0.10.30-dev0"  # pragma: no cover
@ -1,35 +1,10 @@
 #! /usr/bin/env python3

-import csv
-import logging
-import os
-import statistics
-import sys
-from typing import Any, List, Optional, Tuple
+from typing import List, Optional, Tuple

 import click

-from unstructured.metrics.element_type import (
-    calculate_element_type_percent_match,
-    get_element_type_frequency,
-)
-from unstructured.metrics.text_extraction import calculate_accuracy, calculate_percent_missing_text
-from unstructured.staging.base import elements_from_json, elements_to_text
-
-logger = logging.getLogger("unstructured.ingest")
-handler = logging.StreamHandler()
-handler.name = "ingest_log_handler"
-formatter = logging.Formatter("%(asctime)s %(processName)-10s %(levelname)-8s %(message)s")
-handler.setFormatter(formatter)
-
-# Only want to add the handler once
-if "ingest_log_handler" not in [h.name for h in logger.handlers]:
-    logger.addHandler(handler)
-
-logger.setLevel(logging.DEBUG)
-
-
-agg_headers = ["strategy", "average", "sample_sd", "population_sd", "count"]
+from unstructured.metrics.evaluate import measure_element_type_accuracy, measure_text_edit_distance


 @click.group()
@ -39,6 +14,7 @@ def main():

 @main.command()
 @click.option("--output_dir", type=str, help="Directory to structured output.")
+@click.option("--source_dir", type=str, help="Directory to source.")
 @click.option(
     "--output_list",
     type=str,
@ -46,7 +22,6 @@ def main():
     help="Optional: list of selected structured output file names under the \
         directory to be evaluate. If none, all files under directory will be use.",
 )
-@click.option("--source_dir", type=str, help="Directory to source.")
 @click.option(
     "--source_list",
     type=str,
@ -69,80 +44,22 @@ def main():
     help="A tuple of weights to the Levenshtein distance calculation. \
         See text_extraction.py/calculate_edit_distance for more details.",
 )
-def measure_text_edit_distance(
+def measure_text_edit_distance_command(
     output_dir: str,
-    output_list: Optional[List[str]],
     source_dir: str,
+    output_list: Optional[List[str]],
     source_list: Optional[List[str]],
     export_dir: str,
     weights: Tuple[int, int, int],
-) -> None:
-    """
-    Loops through the list of structured output from all of `output_dir` or selected files from
-    `output_list`, and compare with gold-standard of the same file name under `source_dir` or
-    selected files from `source_list`.
-
-    Calculates text accuracy and percent missing. After looped through the whole list, write to tsv.
-    Also calculates the aggregated accuracy and percent missing.
-    """
-    if not output_list:
-        output_list = _listdir_recursive(output_dir)
-    if not source_list:
-        source_list = _listdir_recursive(source_dir)
-
-    if not output_list:
-        print("No output files to calculate to edit distances for, exiting")
-        sys.exit(0)
-
-    rows = []
-    accuracy_scores: List[float] = []
-    percent_missing_scores: List[float] = []
-
-    # assumption: output file name convention is name-of-file.doc.json
-    for doc in output_list:  # type: ignore
-        fn = (doc.split("/")[-1]).split(".json")[0]
-        doctype = fn.rsplit(".", 1)[-1]
-        fn_txt = fn + ".txt"
-        connector = doc.split("/")[0]
-
-        if fn_txt in source_list:  # type: ignore
-            output_cct = elements_to_text(elements_from_json(os.path.join(output_dir, doc)))
-            source_cct = _read_text(os.path.join(source_dir, fn_txt))
-            accuracy = round(calculate_accuracy(output_cct, source_cct, weights), 3)
-            percent_missing = round(calculate_percent_missing_text(output_cct, source_cct), 3)
-
-            rows.append([fn, doctype, connector, accuracy, percent_missing])
-            accuracy_scores.append(accuracy)
-            percent_missing_scores.append(percent_missing)
-
-    headers = ["filename", "doctype", "connector", "cct-accuracy", "cct-%missing"]
-    _write_to_file(export_dir, "all-docs-cct.tsv", rows, headers)
-
-    agg_rows = []
-    agg_rows.append(
-        [
-            "cct-accuracy",
-            _mean(accuracy_scores),
-            _stdev(accuracy_scores),
-            _pstdev(accuracy_scores),
-            len(accuracy_scores),
-        ],
+):
+    return measure_text_edit_distance(
+        output_dir, source_dir, output_list, source_list, export_dir, weights
     )
-    agg_rows.append(
-        [
-            "cct-%missing",
-            _mean(percent_missing_scores),
-            _stdev(percent_missing_scores),
-            _pstdev(percent_missing_scores),
-            len(percent_missing_scores),
-        ],
-    )
-    _write_to_file(export_dir, "aggregate-scores-cct.tsv", agg_rows, agg_headers)
-    _display(agg_rows, agg_headers)


 @main.command()
 @click.option("--output_dir", type=str, help="Directory to structured output.")
+@click.option("--source_dir", type=str, help="Directory to structured source.")
 @click.option(
     "--output_list",
     type=str,
@ -150,7 +67,6 @@ def measure_text_edit_distance(
     help="Optional: list of selected structured output file names under the \
         directory to be evaluate. If none, all files under directory will be used.",
 )
-@click.option("--source_dir", type=str, help="Directory to structured source.")
 @click.option(
     "--source_list",
     type=str,
@ -165,132 +81,16 @@ def measure_text_edit_distance(
     help="Directory to save the output evaluation metrics to. Default to \
         your/working/dir/metrics/",
 )
-def measure_element_type_accuracy(
+def measure_element_type_accuracy_command(
     output_dir: str,
-    output_list: Optional[List[str]],
     source_dir: str,
+    output_list: Optional[List[str]],
     source_list: Optional[List[str]],
     export_dir: str,
 ):
-    """
-    Loops through the list of structured output from all of `output_dir` or selected files from
-    `output_list`, and compare with gold-standard of the same file name under `source_dir` or
-    selected files from `source_list`.
-
-    Calculates element type frequency accuracy and percent missing. After looped through the
-    whole list, write to tsv. Also calculates the aggregated accuracy.
-    """
-    if not output_list:
-        output_list = _listdir_recursive(output_dir)
-    if not source_list:
-        source_list = _listdir_recursive(source_dir)
-
-    rows = []
-    accuracy_scores: List[float] = []
-
-    for doc in output_list:  # type: ignore
-        fn = (doc.split("/")[-1]).split(".json")[0]
-        doctype = fn.rsplit(".", 1)[-1]
-        connector = doc.split("/")[0]
-        if doc in source_list:  # type: ignore
-            output = get_element_type_frequency(_read_text(os.path.join(output_dir, doc)))
-            source = get_element_type_frequency(_read_text(os.path.join(source_dir, doc)))
-            accuracy = round(calculate_element_type_percent_match(output, source), 3)
-            rows.append([fn, doctype, connector, accuracy])
-            accuracy_scores.append(accuracy)
-
-    headers = ["filename", "doctype", "connector", "element-type-accuracy"]
-    _write_to_file(export_dir, "all-docs-element-type-frequency.tsv", rows, headers)
-
-    agg_rows = []
-    agg_rows.append(
-        [
-            "element-type-accuracy",
-            _mean(accuracy_scores),
-            _stdev(accuracy_scores),
-            _pstdev(accuracy_scores),
-            len(accuracy_scores),
-        ],
+    return measure_element_type_accuracy(
+        output_dir, source_dir, output_list, source_list, export_dir
     )
-    _write_to_file(export_dir, "aggregate-scores-element-type.tsv", agg_rows, agg_headers)
-    _display(agg_rows, agg_headers)
-
-
-def _listdir_recursive(dir: str):
-    listdir = []
-    for dirpath, _, filenames in os.walk(dir):
-        for filename in filenames:
-            # Remove the starting directory from the path to show the relative path
-            relative_path = os.path.relpath(dirpath, dir)
-            if relative_path == ".":
-                listdir.append(filename)
-            else:
-                listdir.append(f"{relative_path}/{filename}")
-    return listdir
-
-
-def _write_to_file(dir: str, filename: str, rows: List[Any], headers: List[Any], mode: str = "w"):
-    if mode not in ["w", "a"]:
-        raise ValueError("Mode not supported. Mode must be one of [w, a].")
-    if dir and not os.path.exists(dir):
-        os.makedirs(dir)
-    with open(os.path.join(os.path.join(dir, filename)), mode, newline="") as tsv:
-        writer = csv.writer(tsv, delimiter="\t")
-        if mode == "w":
-            writer.writerow(headers)
-        writer.writerows(rows)
-
-
-def _display(rows, headers):
-    col_widths = [
-        max(len(headers[i]), max(len(str(row[i])) for row in rows)) for i in range(len(headers))
-    ]
-    click.echo(" ".join(headers[i].ljust(col_widths[i]) for i in range(len(headers))))
-    click.echo("-" * sum(col_widths) + "-" * (len(headers) - 1))
-    for row in rows:
-        formatted_row = []
-        for item in row:
-            if isinstance(item, float):
-                formatted_row.append(f"{item:.3f}")
-            else:
-                formatted_row.append(str(item))
-        click.echo(
-            " ".join(formatted_row[i].ljust(col_widths[i]) for i in range(len(formatted_row))),
-        )
-
-
-def _mean(scores: List[float], rounding: Optional[int] = 3):
-    if len(scores) < 1:
-        return None
-    elif len(scores) == 1:
-        mean = scores[0]
-    else:
-        mean = statistics.mean(scores)
-    if not rounding:
-        return mean
-    return round(mean, rounding)
-
-
-def _stdev(scores: List[float], rounding: Optional[int] = 3):
-    if len(scores) <= 1:
-        return None
-    if not rounding:
-        return statistics.stdev(scores)
-    return round(statistics.stdev(scores), rounding)
-
-
-def _pstdev(scores: List[float], rounding: Optional[int] = 3):
-    if len(scores) <= 1:
-        return None
-    if not rounding:
-        return statistics.pstdev(scores)
-    return round(statistics.pstdev(scores), rounding)
-
-
-def _read_text(path):
-    with open(path, errors="ignore") as f:
-        text = f.read()
-    return text


 if __name__ == "__main__":
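The hunks above reduce each click-decorated function to a thin `_command` wrapper that forwards its arguments to an undecorated core function. The sketch below illustrates the same separation with the standard library's `argparse` standing in for click, so that it runs without third-party packages; it is not the commit's code. `measure` and `measure_command` are hypothetical names, and the returned string is a placeholder for the real evaluation work.

```python
import argparse
from typing import List, Optional


def measure(output_dir: str, source_dir: str, export_dir: str = "metrics") -> str:
    """Core logic (hypothetical): a plain function that tests and other code can import."""
    return f"compare {output_dir} against {source_dir}, write to {export_dir}"


def measure_command(argv: Optional[List[str]] = None) -> str:
    """CLI wrapper: parse the arguments, then delegate to the core function."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", type=str, required=True)
    parser.add_argument("--source_dir", type=str, required=True)
    parser.add_argument("--export_dir", type=str, default="metrics")
    args = parser.parse_args(argv)
    return measure(args.output_dir, args.source_dir, args.export_dir)


# Direct call: no CLI machinery is involved, which is what the new unit test relies on.
direct = measure("out", "gold")
# The same work routed through the wrapper, as the shell script would do.
via_cli = measure_command(["--output_dir", "out", "--source_dir", "gold"])
print(direct == via_cli)
```

Keeping defaults such as `export_dir="metrics"` on the core function, as the new `unstructured/metrics/evaluate.py` does, lets callers omit them without going through the command-line layer.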
232
unstructured/metrics/evaluate.py
Executable file
@ -0,0 +1,232 @@
#! /usr/bin/env python3

import csv
import logging
import os
import statistics
import sys
from typing import Any, List, Optional, Tuple

import click

from unstructured.metrics.element_type import (
    calculate_element_type_percent_match,
    get_element_type_frequency,
)
from unstructured.metrics.text_extraction import calculate_accuracy, calculate_percent_missing_text
from unstructured.staging.base import elements_from_json, elements_to_text

logger = logging.getLogger("unstructured.ingest")
handler = logging.StreamHandler()
handler.name = "ingest_log_handler"
formatter = logging.Formatter("%(asctime)s %(processName)-10s %(levelname)-8s %(message)s")
handler.setFormatter(formatter)

# Only want to add the handler once
if "ingest_log_handler" not in [h.name for h in logger.handlers]:
    logger.addHandler(handler)

logger.setLevel(logging.DEBUG)


agg_headers = ["strategy", "average", "sample_sd", "population_sd", "count"]


def measure_text_edit_distance(
    output_dir: str,
    source_dir: str,
    output_list: Optional[List[str]] = None,
    source_list: Optional[List[str]] = None,
    export_dir: str = "metrics",
    weights: Tuple[int, int, int] = (2, 1, 1),
) -> None:
    """
    Loops through the list of structured output from all of `output_dir` or selected files from
    `output_list`, and compare with gold-standard of the same file name under `source_dir` or
    selected files from `source_list`.

    Calculates text accuracy and percent missing. After looped through the whole list, write to tsv.
    Also calculates the aggregated accuracy and percent missing.
    """
    if not output_list:
        output_list = _listdir_recursive(output_dir)
    if not source_list:
        source_list = _listdir_recursive(source_dir)

    if not output_list:
        print("No output files to calculate to edit distances for, exiting")
        sys.exit(0)

    rows = []
    accuracy_scores: List[float] = []
    percent_missing_scores: List[float] = []

    # assumption: output file name convention is name-of-file.doc.json
    for doc in output_list:  # type: ignore
        fn = (doc.split("/")[-1]).split(".json")[0]
        doctype = fn.rsplit(".", 1)[-1]
        fn_txt = fn + ".txt"
        connector = doc.split("/")[0]

        if fn_txt in source_list:  # type: ignore
            output_cct = elements_to_text(elements_from_json(os.path.join(output_dir, doc)))
            source_cct = _read_text(os.path.join(source_dir, fn_txt))
            accuracy = round(calculate_accuracy(output_cct, source_cct, weights), 3)
            percent_missing = round(calculate_percent_missing_text(output_cct, source_cct), 3)

            rows.append([fn, doctype, connector, accuracy, percent_missing])
            accuracy_scores.append(accuracy)
            percent_missing_scores.append(percent_missing)

    headers = ["filename", "doctype", "connector", "cct-accuracy", "cct-%missing"]
    _write_to_file(export_dir, "all-docs-cct.tsv", rows, headers)

    agg_rows = []
    agg_rows.append(
        [
            "cct-accuracy",
            _mean(accuracy_scores),
            _stdev(accuracy_scores),
            _pstdev(accuracy_scores),
            len(accuracy_scores),
        ],
    )
    agg_rows.append(
        [
            "cct-%missing",
            _mean(percent_missing_scores),
            _stdev(percent_missing_scores),
            _pstdev(percent_missing_scores),
            len(percent_missing_scores),
        ],
    )
    _write_to_file(export_dir, "aggregate-scores-cct.tsv", agg_rows, agg_headers)
    _display(agg_rows, agg_headers)


def measure_element_type_accuracy(
    output_dir: str,
    source_dir: str,
    output_list: Optional[List[str]] = None,
    source_list: Optional[List[str]] = None,
    export_dir: str = "metrics",
):
    """
    Loops through the list of structured output from all of `output_dir` or selected files from
    `output_list`, and compare with gold-standard of the same file name under `source_dir` or
    selected files from `source_list`.

    Calculates element type frequency accuracy and percent missing. After looped through the
    whole list, write to tsv. Also calculates the aggregated accuracy.
    """
    if not output_list:
        output_list = _listdir_recursive(output_dir)
    if not source_list:
        source_list = _listdir_recursive(source_dir)

    rows = []
    accuracy_scores: List[float] = []

    for doc in output_list:  # type: ignore
        fn = (doc.split("/")[-1]).split(".json")[0]
        doctype = fn.rsplit(".", 1)[-1]
        connector = doc.split("/")[0]
        if doc in source_list:  # type: ignore
            output = get_element_type_frequency(_read_text(os.path.join(output_dir, doc)))
            source = get_element_type_frequency(_read_text(os.path.join(source_dir, doc)))
            accuracy = round(calculate_element_type_percent_match(output, source), 3)
            rows.append([fn, doctype, connector, accuracy])
            accuracy_scores.append(accuracy)

    headers = ["filename", "doctype", "connector", "element-type-accuracy"]
    _write_to_file(export_dir, "all-docs-element-type-frequency.tsv", rows, headers)

    agg_rows = []
    agg_rows.append(
        [
            "element-type-accuracy",
            _mean(accuracy_scores),
            _stdev(accuracy_scores),
            _pstdev(accuracy_scores),
            len(accuracy_scores),
        ],
    )
    _write_to_file(export_dir, "aggregate-scores-element-type.tsv", agg_rows, agg_headers)
    _display(agg_rows, agg_headers)


def _listdir_recursive(dir: str):
    listdir = []
    for dirpath, _, filenames in os.walk(dir):
        for filename in filenames:
            # Remove the starting directory from the path to show the relative path
            relative_path = os.path.relpath(dirpath, dir)
            if relative_path == ".":
                listdir.append(filename)
            else:
                listdir.append(f"{relative_path}/{filename}")
    return listdir


def _display(rows, headers):
    col_widths = [
        max(len(headers[i]), max(len(str(row[i])) for row in rows)) for i in range(len(headers))
    ]
    click.echo(" ".join(headers[i].ljust(col_widths[i]) for i in range(len(headers))))
    click.echo("-" * sum(col_widths) + "-" * (len(headers) - 1))
    for row in rows:
        formatted_row = []
        for item in row:
            if isinstance(item, float):
                formatted_row.append(f"{item:.3f}")
            else:
                formatted_row.append(str(item))
        click.echo(
            " ".join(formatted_row[i].ljust(col_widths[i]) for i in range(len(formatted_row))),
        )


def _write_to_file(dir: str, filename: str, rows: List[Any], headers: List[Any], mode: str = "w"):
    if mode not in ["w", "a"]:
        raise ValueError("Mode not supported. Mode must be one of [w, a].")
    if dir and not os.path.exists(dir):
        os.makedirs(dir)
    with open(os.path.join(os.path.join(dir, filename)), mode, newline="") as tsv:
        writer = csv.writer(tsv, delimiter="\t")
        if mode == "w":
            writer.writerow(headers)
        writer.writerows(rows)


def _mean(scores: List[float], rounding: Optional[int] = 3):
    if len(scores) < 1:
        return None
    elif len(scores) == 1:
        mean = scores[0]
    else:
        mean = statistics.mean(scores)
    if not rounding:
        return mean
    return round(mean, rounding)


def _stdev(scores: List[float], rounding: Optional[int] = 3):
    if len(scores) <= 1:
        return None
    if not rounding:
        return statistics.stdev(scores)
    return round(statistics.stdev(scores), rounding)


def _pstdev(scores: List[float], rounding: Optional[int] = 3):
    if len(scores) <= 1:
        return None
    if not rounding:
        return statistics.pstdev(scores)
    return round(statistics.pstdev(scores), rounding)


def _read_text(path):
    with open(path, errors="ignore") as f:
        text = f.read()
    return text
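Both core functions derive `fn`, `doctype`, and `connector` from each entry's relative path with plain string splitting. The sketch below isolates those three expressions, copied from the loop bodies above, in a hypothetical helper named `parse_output_name` to show what they yield. The path `azure/IRS-form-1987.pdf.json` is an assumed input chosen to match a row in the per-document table of this commit.

```python
from typing import Tuple


def parse_output_name(doc: str) -> Tuple[str, str, str]:
    """Split a relative output path into (file name, doctype, connector)."""
    # Drop any directory prefix, then everything from ".json" onward.
    fn = (doc.split("/")[-1]).split(".json")[0]
    # The doctype is whatever follows the last dot of the remaining name.
    doctype = fn.rsplit(".", 1)[-1]
    # The connector is the first path component.
    connector = doc.split("/")[0]
    return fn, doctype, connector


nested = parse_output_name("azure/IRS-form-1987.pdf.json")
flat = parse_output_name("currency.csv.json")
print(nested)  # ('IRS-form-1987.pdf', 'pdf', 'azure')
print(flat)    # ('currency.csv', 'csv', 'currency.csv.json')
```

For an entry with no directory component, as in the new unit test, `doc.split("/")[0]` is the whole file name, so the connector column repeats the file name rather than naming a connector.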