refactor: separate click wrappers from core evaluation functionality (#1981)

### Summary
Click-decorated functions cannot (properly) be called outside of the
click interface, which makes it difficult to reuse the setup
functionality in measure_text_edit_distance or
measure_element_type_accuracy. This PR removes the click decoration
from the core functions and moves it into thin wrapper functions whose
only job is to execute the command.

### Technical Details
- Follows the approach suggested in [this StackOverflow
post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command)
- The now-distinct functions live in separate places: the
`_command` click-decorated functions stay in ingest/evaluate.py, while the
core functions measure_text_edit_distance and
measure_element_type_accuracy move into the unstructured/metrics/
folder (a more logical home for them).
- Initial test added for measure_text_edit_distance
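
The wrapper pattern looks roughly like this (a minimal sketch; the real signatures and bodies in ingest/evaluate.py and unstructured/metrics/evaluate.py differ):

```python
import click


def measure_text_edit_distance(output_dir: str, source_dir: str) -> str:
    # Core logic: a plain function that can be imported and called
    # directly from tests or other code, with no click machinery.
    # The return value here is illustrative only.
    return f"evaluated {output_dir} against {source_dir}"


@click.command()
@click.option("--output_dir", type=str, help="Directory to structured output.")
@click.option("--source_dir", type=str, help="Directory to source.")
def measure_text_edit_distance_command(output_dir: str, source_dir: str) -> None:
    # Thin click wrapper: its only job is to run the core function
    # when invoked from the CLI.
    click.echo(measure_text_edit_distance(output_dir=output_dir, source_dir=source_dir))
```

Tests and other metrics code can now import measure_text_edit_distance directly, while the CLI continues to expose the click wrapper.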

### Test
`sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction`
functionality is unchanged.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Committed by shreyanid on 2023-11-07 11:54:22 -08:00 (via GitHub)
parent ad14321016
commit 6db663e7bb
14 changed files with 3162 additions and 221 deletions

@@ -1,3 +1,5 @@
+## 0.10.30-dev0
+
 ## 0.10.29
 ### Enhancements

@@ -0,0 +1,36 @@
Bank Good Credit
Accredited with IABAC™
(International Association of Business Analytics Certifications)
IABAC International Association of
Business Analytics Certification
© DataMites™. All Rights Reserved
Objective & Background
Classify credit card customers as good / bad, based on information from internal and external sources.
Data provided
Demographic: Base file of with credit card history details. Only one record for every customer.
Account: Contians data for various loans availed by the customer. Not related to credit card. Multiple records for every customer.
Enquiries: Enquired made by customers for different loan purposes. Multiple records for every customer.
© DataMites™. All Rights Reserved
Design
Data to be downloaded using SQL queries.
Required information to be extracted from Account and Enquiry files and converted to one-to-one files.
The columns from the two files should be merged with Demographic file using Left Join with customer no as key column, to create a final file. The final file should contain all the records in demographic and additional columns/features from Account and Enquiry files will get added to Demographic file.
There will be many customers in account and enquiry file who will get left out. This is fine as we anyway don't know their good/bad label for training purpose.
© DataMites™. All Rights Reserved
Analysis of Data
Show using Excel File
© DataMites™. All Rights Reserved
Explain Coding / outcomes
Show using Jupyter
© DataMites™. All Rights Reserved
Thank You
© DataMites™. All Rights Reserved

@@ -0,0 +1,205 @@
The introductory chapter of Government Auditing Standards (GAGAS)1
outlines five concepts describing how public officials are to provide
functions and services: effectively, efficiently, economically, ethically, and
equitably. When planning, gathering and assessing evidence, and
reporting audit results, auditors may focus on one or more of these
concepts. The following discussion is intended to assist auditors when
developing audit objectives for performance audits of government
programs and activities.2
This discussion is designed to help auditors understand
and apply the concepts cited above for performance audits
conducted in accordance with GAGAS. This discussion
does not contain requirements, does not amend GAGAS,
and is not considered interpretive guidance, as defined in
chapter 2 of GAGAS.
GAGAS Paragraphs

Paragraph 1.02:
The concept of accountability for use of public resources
and government authority is key to our nation's governing
processes. Management and officials entrusted with public
resources are responsible for carrying out public functions
and providing service to the public effectively, efficiently,
economically, ethically, and equitably within the context
of the statutory boundaries of the specific government
program. [Emphasis added.]

Paragraph 1.03:
As reflected in applicable laws, regulations, agreements,
and standards, management and officials of government
programs are responsible for providing reliable, useful, and
timely information for transparency and accountability of
these programs and their operations. Legislators, oversight
bodies, those charged with governance, and the public
need to know whether (1) management and officials
manage government resources and use their authority
properly and in compliance with laws and regulations; (2)
government programs are achieving their objectives and
desired outcomes; and (3) government services are
provided effectively, efficiently, economically, ethically,
and equitably. [Emphasis added.]

1 GAO, Government Auditing Standards: 2018 Revision, GAO-21-368G (Washington, D.C.: April 2021).
2 The concepts cited may also be applicable to other GAGAS engagements, based on the auditors' judgments. This discussion is limited to considering these concepts in performance audits.

GAGAS Performance Audits: Discussion of Concepts to Consider When Auditing Public Functions and Services
Discussion

Government administration best serves the collective interest of the public
when it is effective, efficient, economical, ethical, and equitable. Auditors
help inform legislators, oversight bodies, those charged with governance,
and the public about whether public services are being provided
consistent with these concepts. Government auditing can contribute to
accountability and can help improve government administration by
identifying deficiencies and recommending enhancements to achieve
effective, efficient, economical, ethical, and equitable outcomes, when
appropriate within the context of the audit objectives. As such, it is
important for auditors to understand the concepts below as they relate to
administering government programs or activities and how they can
assess or address these expectations of government performance in
conducting their performance audits.

The examples that follow the discussion of each concept illustrate the
distinctions between these concepts. In a performance audit, it is
common practice to incorporate more than one of these concepts when
conducting the audit.

Effective

The administration of a government program or activity is effective when
it achieves the intended results. A performance audit that focuses on the
effectiveness of a program or activity seeks to establish a cause-and-effect
relationship between the operation of the program or activity and
achieving its stated objectives. Achieving the objectives does not
guarantee that the program or activity was effective unless the auditors
can establish that the program or activity caused, or contributed to, the
desired outcome.

Example: In a performance audit examining how effective a
housing voucher program was in achieving its goal of improving
economic outcomes for recipients, auditors may determine
whether receiving housing vouchers led to better subsequent
economic outcomes for recipients than those of similarly situated
individuals who did not receive vouchers.

Example: In a performance audit assessing the effectiveness of
an after-school program targeted at helping students improve their
reading proficiency, auditors may examine the extent to which
participants' reading levels improved relative to baseline data from
before they joined the program.
Efficient

The administration of a government program or activity is efficient when
it gets the most value from available resources. When a performance
audit focuses on efficiency, auditors examine whether the resources used
to administer a program or activity have been put to optimal or
satisfactory use, or whether the same or similar results could have been
achieved more timely or with fewer resources.

Example: In a performance audit assessing a disaster relief
agency's mobilization of resources to respond to a disaster,
auditors may assess the disaster relief agency's timeliness in
providing relief compared to its own previous performance or the
performance of other similarly situated agencies that have
responded to comparable disasters.

Example: In a performance audit assessing a consumer protection
agency's response to consumer complaints, auditors may assess
whether the agency's efforts to streamline its processes resulted
in improved timely resolution of complaints.

Example: In a performance audit assessing the time a state needs
to process unemployment benefits targeted at helping those in
need, auditors may assess how long the process takes from
receipt of the unemployment application to the applicant's receipt
of the benefit, including steps such as verifying required
information.

Economical

The administration of a government program or activity is economical
when it minimizes the costs of resources used in performing its functions
while meeting timeliness and quality considerations for those resources.
When auditing economy, auditors primarily focus on the costs of inputs
rather than on the outcomes achieved.

Example: In a performance audit examining an agency's
international travel expenses, in addition to assessing the design
of internal controls and compliance with expense guidelines,
auditors may test whether, for a sample of trips, bookings of
equivalent airline tickets and hotel rooms could be found at a
lower cost.

Example: In a performance audit assessing an agency's
acquisition practices, auditors may examine whether the agency's
decisions regarding purchasing, leasing, or reimbursing
employees for the costs of acquiring various supplies or
equipment achieved the lowest cost while meeting applicable
requirements.
Ethical

The administration of a government program or activity is ethical when it
advances the collective interest of the public rather than private gain and
is conducted with honesty, integrity, and impartiality. Laws and
regulations often specify rules of ethical conduct. Therefore, audits
examining the ethical administration of a program or activity may involve
assessing compliance with such laws and regulations. Fraud in
administering a government program or activity betrays the public trust
and is, by definition, unethical. In addition, auditors may identify instances
of unethical conduct that result in waste and abuse during testing of
internal controls as part of a performance audit.

Example: In a performance audit assessing agency officials'
compliance with conflict-of-interest requirements, auditors may
compare a sample of financial disclosure reports filed against
requirements in statute or regulation.

Example: In a performance audit assessing potential regulatory
capture related to a particular industry, auditors may assess the
extent to which the regulatory agency has sufficient controls to
reasonably assure its employees' independence from the entities
subject to the agency's regulation.

Example: In a performance audit assessing an office's policies
and procedures for purchase cards, auditors' testing of the
program's controls to identify deficiencies may identify fraud,
waste, or abuse in its administration.

Equitable

The administration of a government program or activity is equitable when
it consistently serves members of the public, distributes public services,
and implements public policy in a manner that promotes fairness, justice,
and equality. Auditing whether the administration of a government
program or activity is equitable may include assessing the

• equality of access to and provision of services;
• procedural fairness and equal treatment of individuals in
government programs and policies;
• causes of disparate outcomes; or
• distributional impacts of public policies, programs, resources,
and services.
Disaggregating data by social groups or communities that share a
particular characteristic (e.g., gender, race, ethnicity, age, or income)
can help illuminate differences. Reporting on such differences, when
appropriate within the context of the audit objectives, can increase
understanding of the effects of policies and programs on issues of
equity.
Example: In a performance audit assessing the granting of
waivers from particular requirements, auditors may use
disaggregated data about waiver recipients to assess whether
different groups or communities were treated fairly and equally in
the process.
Example: In a performance audit assessing a grant program
aimed at expanding internet access, auditors may assess the
extent to which formulas, criteria, or other factors (such as
matching funds or capital requirements) considered in the
distribution of grant funds may be to the specific advantage or
disadvantage of certain groups, regions, or communities, thereby
causing inequities.
Example: In a performance audit assessing scholarship outcomes
in higher education programs, auditors may report on the
distribution of scholarships by race, gender identity, and income to
illuminate potential disparities among scholarship recipients.
These concepts may overlap. For example, efficiency may also be a
component of effectiveness. Similarly, when appropriate within the
context of the program and audit objectives, auditors may disaggregate
the results of performance audits that focus on efficiency or effectiveness
issues to illuminate inequities in program administration or in distribution
of public services.
While all of these concepts are important to administering government
programs responsibly, it is up to the professional judgment of the auditors
to determine the specific concepts that are relevant in conducting the
performance audit and reporting the results. Auditors' professional
judgments are informed by, among other things, the needs of the users of
the audit reports; the nature, context, and objectives of the program or
activity under audit; and the public interest.

For More Information

To view the current Yellow Book, visit https://www.gao.gov/yellowbook.
For technical assistance, call (202) 512-9535 or email
yellowbook@gao.gov.

@@ -0,0 +1,164 @@
Code Symbol Name
AED United Arab Emirates dirham
AFN Afghan afghani
ALL L Albanian lek
AMD AMD Armenian dram
ANG Netherlands Antillean guilder
AOA Kz Angolan kwanza
ARS $ Argentine peso
AUD $ Australian dollar
AWG Afl. Aruban florin
AZN AZN Azerbaijani manat
BAM KM Bosnia and Herzegovina convertible mark
BBD $ Barbadian dollar
BDT Bangladeshi taka
BGN Bulgarian lev
BHD Bahraini dinar
BIF Fr Burundian franc
BMD $ Bermudian dollar
BND $ Brunei dollar
BOB Bs. Bolivian boliviano
BRL R$ Brazilian real
BSD $ Bahamian dollar
BTC Bitcoin
BTN Nu. Bhutanese ngultrum
BWP P Botswana pula
BYR Br Belarusian ruble (old)
BYN Br Belarusian ruble
BZD $ Belize dollar
CAD $ Canadian dollar
CDF Fr Congolese franc
CHF CHF Swiss franc
CLP $ Chilean peso
CNY Chinese yuan
COP $ Colombian peso
CRC Costa Rican colón
CUC $ Cuban convertible peso
CUP $ Cuban peso
CVE $ Cape Verdean escudo
CZK Kč Czech koruna
DJF Fr Djiboutian franc
DKK DKK Danish krone
DOP RD$ Dominican peso
DZD Algerian dinar
EGP EGP Egyptian pound
ERN Nfk Eritrean nakfa
ETB Br Ethiopian birr
EUR Euro
FJD $ Fijian dollar
FKP Falkland Islands pound
GBP Pound sterling
GEL Georgian lari
GGP Guernsey pound
GHS Ghana cedi
GIP Gibraltar pound
GMD D Gambian dalasi
GNF Fr Guinean franc
GTQ Q Guatemalan quetzal
GYD $ Guyanese dollar
HKD $ Hong Kong dollar
HNL L Honduran lempira
HRK kn Croatian kuna
HTG G Haitian gourde
HUF Ft Hungarian forint
IDR Rp Indonesian rupiah
ILS Israeli new shekel
IMP Manx pound
INR Indian rupee
IQD Iraqi dinar
IRR Iranian rial
IRT Iranian toman
ISK kr. Icelandic króna
JEP Jersey pound
JMD $ Jamaican dollar
JOD Jordanian dinar
JPY Japanese yen
KES KSh Kenyan shilling
KGS Kyrgyzstani som
KHR Cambodian riel
KMF Fr Comorian franc
KPW North Korean won
KRW South Korean won
KWD Kuwaiti dinar
KYD $ Cayman Islands dollar
KZT Kazakhstani tenge
LAK Lao kip
LBP Lebanese pound
LKR Sri Lankan rupee
LRD $ Liberian dollar
LSL L Lesotho loti
LYD Libyan dinar
MAD Moroccan dirham
MDL MDL Moldovan leu
MGA Ar Malagasy ariary
MKD Macedonian denar
MMK Ks Burmese kyat
MNT Mongolian tögrög
MOP P Macanese pataca
MRU UM Mauritanian ouguiya
MUR Mauritian rupee
MVR Maldivian rufiyaa
MWK MK Malawian kwacha
MXN $ Mexican peso
MYR RM Malaysian ringgit
MZN MT Mozambican metical
NAD N$ Namibian dollar
NGN Nigerian naira
NIO C$ Nicaraguan córdoba
NOK kr Norwegian krone
NPR Nepalese rupee
NZD $ New Zealand dollar
OMR Omani rial
PAB B/. Panamanian balboa
PEN S/ Sol
PGK K Papua New Guinean kina
PHP Philippine peso
PKR Pakistani rupee
PLN zł Polish złoty
PRB Transnistrian ruble
PYG Paraguayan guaraní
QAR Qatari riyal
RON lei Romanian leu
RSD Serbian dinar
RUB Russian ruble
RWF Fr Rwandan franc
SAR Saudi riyal
SBD $ Solomon Islands dollar
SCR Seychellois rupee
SDG Sudanese pound
SEK kr Swedish krona
SGD $ Singapore dollar
SHP Saint Helena pound
SLL Le Sierra Leonean leone
SOS Sh Somali shilling
SRD $ Surinamese dollar
SSP South Sudanese pound
STN Db São Tomé and Príncipe dobra
SYP Syrian pound
SZL L Swazi lilangeni
THB Thai baht
TJS Tajikistani somoni
TMT m Turkmenistan manat
TND Tunisian dinar
TOP T$ Tongan paʻanga
TRY Turkish lira
TTD $ Trinidad and Tobago dollar
TWD NT$ New Taiwan dollar
TZS Sh Tanzanian shilling
UAH Ukrainian hryvnia
UGX UGX Ugandan shilling
USD $ United States (US) dollar
UYU $ Uruguayan peso
UZS UZS Uzbekistani som
VEF Bs F Venezuelan bolívar
VES Bs.S Bolívar soberano
VND Vietnamese đồng
VUV Vt Vanuatu vatu
WST T Samoan tālā
XAF CFA Central African CFA franc
XCD $ East Caribbean dollar
XOF CFA West African CFA franc
XPF Fr CFP franc
YER Yemeni rial
ZAR R South African rand
ZMW ZK Zambian kwacha

@@ -0,0 +1,420 @@
[
{
"type": "Title",
"element_id": "0405351ac64213c7b1e40e31aff7d21b",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"category_depth": 1,
"languages": [
"eng"
],
"page_number": 1
},
"text": "Bank Good Credit "
},
{
"type": "NarrativeText",
"element_id": "214987ebee9fd615365185fb3d692253",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "0405351ac64213c7b1e40e31aff7d21b",
"category_depth": 0,
"languages": [
"eng"
],
"page_number": 1
},
"text": "Accredited with IABAC\u2122"
},
{
"type": "Title",
"element_id": "fc3d53b1d173c5c72205914ea331b052",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"category_depth": 1,
"languages": [
"eng"
],
"page_number": 1
},
"text": "( International Association of Business Analytics Certifications)`"
},
{
"type": "Title",
"element_id": "b952b3e6d0e34020f1f48b5d9243d0a4",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"category_depth": 1,
"languages": [
"eng"
],
"page_number": 1
},
"text": "\u00a9 DataMites\u2122. All Rights Reserved | www.datamites.com"
},
{
"type": "Title",
"element_id": "2dc308bd8d3a5c745dfacc3bdccd81db",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"category_depth": 0,
"languages": [
"eng"
],
"page_number": 2
},
"text": "Objective & Background"
},
{
"type": "ListItem",
"element_id": "5a0a7e2a14285297ff3752656cb6df44",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "2dc308bd8d3a5c745dfacc3bdccd81db",
"category_depth": 1,
"languages": [
"eng"
],
"page_number": 2
},
"text": "Classify credit card customers as good / bad, based on information from internal and external sources. "
},
{
"type": "ListItem",
"element_id": "5eb6ec96e6a3493c1ae56747ae457b7f",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "2dc308bd8d3a5c745dfacc3bdccd81db",
"category_depth": 1,
"languages": [
"eng"
],
"page_number": 2
},
"text": "Data provided"
},
{
"type": "ListItem",
"element_id": "adec2b6c75369165b1d87dccdfd2dab8",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "5eb6ec96e6a3493c1ae56747ae457b7f",
"category_depth": 2,
"languages": [
"eng"
],
"page_number": 2
},
"text": "Demographic: Base file of with credit card history details. Only one record for every customer."
},
{
"type": "ListItem",
"element_id": "c26d6dc6982b7f42045f4ffee951f8e0",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "5eb6ec96e6a3493c1ae56747ae457b7f",
"category_depth": 2,
"languages": [
"eng"
],
"page_number": 2
},
"text": "Account: Contians data for various loans availed by the customer. Not related to credit card. Multiple records for every customer."
},
{
"type": "ListItem",
"element_id": "be3fc5cb3da83e6c22d1906330ee9f96",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "5eb6ec96e6a3493c1ae56747ae457b7f",
"category_depth": 2,
"languages": [
"eng"
],
"page_number": 2
},
"text": "Enquiries: Enquired made by customers for different loan purposes. Multiple records for every customer.\t"
},
{
"type": "Title",
"element_id": "b952b3e6d0e34020f1f48b5d9243d0a4",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "2dc308bd8d3a5c745dfacc3bdccd81db",
"category_depth": 1,
"languages": [
"eng"
],
"page_number": 2
},
"text": "\u00a9 DataMites\u2122. All Rights Reserved | www.datamites.com"
},
{
"type": "Title",
"element_id": "0072e6b934945d5ba08f9729e0084739",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"category_depth": 0,
"languages": [
"eng"
],
"page_number": 3
},
"text": "Design"
},
{
"type": "ListItem",
"element_id": "af14c0ecaaa7ac1d2bca5cdfbcc32ec7",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "0072e6b934945d5ba08f9729e0084739",
"category_depth": 0,
"languages": [
"eng"
],
"page_number": 3
},
"text": "Data to be downloaded using SQL queries."
},
{
"type": "ListItem",
"element_id": "51d8e67259ab8a11d2fdfc5cb9bcf45e",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "0072e6b934945d5ba08f9729e0084739",
"category_depth": 0,
"languages": [
"eng"
],
"page_number": 3
},
"text": "Required information to be extracted from Account and Enquiry files and converted to one-to-one files."
},
{
"type": "ListItem",
"element_id": "6165e3bc219556f6ec397adc7240386b",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "0072e6b934945d5ba08f9729e0084739",
"category_depth": 0,
"languages": [
"eng"
],
"page_number": 3
},
"text": "The columns from the two files should be merged with Demographic file using Left Join with \u201ccustomer no\u201d as key column, to create a final file. The final file should contain all the records in demographic and additional columns/features from Account and Enquiry files will get added to Demographic file."
},
{
"type": "ListItem",
"element_id": "31930936fc3bad2175b05e324e9923e4",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "0072e6b934945d5ba08f9729e0084739",
"category_depth": 0,
"languages": [
"eng"
],
"page_number": 3
},
"text": "There will be many customers in account and enquiry file who will get left out. This is fine as we anyway don\u2019t know their good/bad label for training purpose. "
},
{
"type": "Title",
"element_id": "b952b3e6d0e34020f1f48b5d9243d0a4",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "0072e6b934945d5ba08f9729e0084739",
"category_depth": 1,
"languages": [
"eng"
],
"page_number": 3
},
"text": "\u00a9 DataMites\u2122. All Rights Reserved | www.datamites.com"
},
{
"type": "Title",
"element_id": "ed83647ab77addbea9e4dca5f7d8f216",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"category_depth": 0,
"languages": [
"eng"
],
"page_number": 4
},
"text": "Analysis of Data"
},
{
"type": "ListItem",
"element_id": "d936f750c577a228ebabd9ed2cec9a70",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "ed83647ab77addbea9e4dca5f7d8f216",
"category_depth": 1,
"languages": [
"eng"
],
"page_number": 4
},
"text": "Show using Excel File"
},
{
"type": "Title",
"element_id": "b952b3e6d0e34020f1f48b5d9243d0a4",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "ed83647ab77addbea9e4dca5f7d8f216",
"category_depth": 1,
"languages": [
"eng"
],
"page_number": 4
},
"text": "\u00a9 DataMites\u2122. All Rights Reserved | www.datamites.com"
},
{
"type": "Title",
"element_id": "7207da66fd1c6771ee1a5705dc41c0c7",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"category_depth": 0,
"languages": [
"eng"
],
"page_number": 5
},
"text": "Explain Coding / outcomes "
},
{
"type": "ListItem",
"element_id": "815ef1753a8bcb1ce21d59819bdc6834",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "7207da66fd1c6771ee1a5705dc41c0c7",
"category_depth": 1,
"languages": [
"eng"
],
"page_number": 5
},
"text": "Show using Jupyter"
},
{
"type": "Title",
"element_id": "b952b3e6d0e34020f1f48b5d9243d0a4",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "7207da66fd1c6771ee1a5705dc41c0c7",
"category_depth": 1,
"languages": [
"eng"
],
"page_number": 5
},
"text": "\u00a9 DataMites\u2122. All Rights Reserved | www.datamites.com"
},
{
"type": "Title",
"element_id": "2034ce6155036f8a009ef33985209e88",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "7207da66fd1c6771ee1a5705dc41c0c7",
"category_depth": 1,
"languages": [
"eng"
],
"page_number": 6
},
"text": "Thank You"
},
{
"type": "Title",
"element_id": "b952b3e6d0e34020f1f48b5d9243d0a4",
"metadata": {
"filename": "Bank Good Credit Loan.pptx",
"file_directory": "tmpdocs",
"last_modified": "2023-11-02T15:16:14",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"parent_id": "7207da66fd1c6771ee1a5705dc41c0c7",
"category_depth": 1,
"languages": [
"eng"
],
"page_number": 6
},
"text": "\u00a9 DataMites\u2122. All Rights Reserved | www.datamites.com"
}
]

File diff suppressed because one or more lines are too long

@@ -0,0 +1,36 @@
import os
import pathlib

import pytest

from unstructured.metrics.evaluate import (
    measure_text_edit_distance,
)

is_in_docker = os.path.exists("/.dockerenv")

EXAMPLE_DOCS_DIRECTORY = os.path.join(
    pathlib.Path(__file__).parent.resolve(), "..", "..", "example-docs"
)
TESTING_FILE_DIR = os.path.join(EXAMPLE_DOCS_DIRECTORY, "test_evaluate_files")

UNSTRUCTURED_OUTPUT_DIRNAME = "unstructured_output"
GOLD_CCT_DIRNAME = "gold_standard_cct"


@pytest.mark.skipif(is_in_docker, reason="Skipping this test in Docker container")
def test_text_extraction_takes_list():
    output_dir = os.path.join(TESTING_FILE_DIR, UNSTRUCTURED_OUTPUT_DIRNAME)
    output_list = ["currency.csv.json"]
    source_dir = os.path.join(TESTING_FILE_DIR, GOLD_CCT_DIRNAME)
    export_dir = os.path.join(TESTING_FILE_DIR, "test_evaluate_results_cct")
    measure_text_edit_distance(
        output_dir=output_dir,
        source_dir=source_dir,
        output_list=output_list,
        export_dir=export_dir,
    )
    # check that only the listed files are included
    with open(os.path.join(export_dir, "all-docs-cct.tsv")) as f:
        lines = f.read().splitlines()
    assert len(lines) == len(output_list) + 1  # includes header

@@ -12,9 +12,9 @@ mkdir -p "$OUTPUT_DIR"
 EVAL_NAME="$1"
 if [ "$EVAL_NAME" == "text-extraction" ]; then
-    METRIC_STRATEGY="measure-text-edit-distance"
+    METRIC_STRATEGY="measure-text-edit-distance-command"
 elif [ "$EVAL_NAME" == "element-type" ]; then
-    METRIC_STRATEGY="measure-element-type-accuracy"
+    METRIC_STRATEGY="measure-element-type-accuracy-command"
 else
     echo "Wrong metric evaluation strategy given. Expected one of [ text-extraction, element-type ]. Got [ $EVAL_NAME ]."
     exit 1

@@ -1,3 +1,3 @@
 strategy average sample_sd population_sd count
-cct-accuracy 0.798 0.083 0.072 4
-cct-%missing 0.089 0.04 0.035 4
+cct-accuracy 0.735 0.069 0.048 2
+cct-%missing 0.086 0.069 0.049 2
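
The new aggregate row for cct-accuracy can be sanity-checked from the two per-file scores reported elsewhere in this commit (0.783 for IRS-form-1987.pdf and 0.686 for example-10k.html). A quick check using Python's statistics module, not the library's own aggregation code:

```python
import statistics

# Per-file cct-accuracy scores for the two evaluated documents.
scores = [0.783, 0.686]

average = statistics.mean(scores)          # ~0.735 after rounding
sample_sd = statistics.stdev(scores)       # ~0.069 (n-1 denominator)
population_sd = statistics.pstdev(scores)  # ~0.048 (n denominator)
count = len(scores)                        # 2
```

This matches the aggregate TSV row `cct-accuracy 0.735 0.069 0.048 2` to three decimal places.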


@@ -1,3 +1,3 @@
-filename connector cct-accuracy cct-%missing
-example-10k.html local 0.686 0.037
-IRS-form-1987.pdf azure 0.783 0.135
+filename doctype connector cct-accuracy cct-%missing
+IRS-form-1987.pdf pdf azure 0.783 0.135
+example-10k.html html local 0.686 0.037



@@ -1 +1 @@
__version__ = "0.10.29" # pragma: no cover
__version__ = "0.10.30-dev0" # pragma: no cover


@@ -1,35 +1,10 @@
#! /usr/bin/env python3
import csv
import logging
import os
import statistics
import sys
from typing import Any, List, Optional, Tuple
from typing import List, Optional, Tuple
import click
from unstructured.metrics.element_type import (
calculate_element_type_percent_match,
get_element_type_frequency,
)
from unstructured.metrics.text_extraction import calculate_accuracy, calculate_percent_missing_text
from unstructured.staging.base import elements_from_json, elements_to_text
logger = logging.getLogger("unstructured.ingest")
handler = logging.StreamHandler()
handler.name = "ingest_log_handler"
formatter = logging.Formatter("%(asctime)s %(processName)-10s %(levelname)-8s %(message)s")
handler.setFormatter(formatter)
# Only want to add the handler once
if "ingest_log_handler" not in [h.name for h in logger.handlers]:
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
agg_headers = ["strategy", "average", "sample_sd", "population_sd", "count"]
from unstructured.metrics.evaluate import measure_element_type_accuracy, measure_text_edit_distance
@click.group()
@@ -39,6 +14,7 @@ def main():
@main.command()
@click.option("--output_dir", type=str, help="Directory to structured output.")
@click.option("--source_dir", type=str, help="Directory to source.")
@click.option(
"--output_list",
type=str,
@@ -46,7 +22,6 @@ def main():
help="Optional: list of selected structured output file names under the \
directory to be evaluated. If none, all files under the directory will be used.",
)
@click.option("--source_dir", type=str, help="Directory to source.")
@click.option(
"--source_list",
type=str,
@@ -69,80 +44,22 @@ def main():
help="A tuple of weights to the Levenshtein distance calculation. \
See text_extraction.py/calculate_edit_distance for more details.",
)
def measure_text_edit_distance(
def measure_text_edit_distance_command(
output_dir: str,
output_list: Optional[List[str]],
source_dir: str,
output_list: Optional[List[str]],
source_list: Optional[List[str]],
export_dir: str,
weights: Tuple[int, int, int],
) -> None:
"""
Loops through the list of structured output from all of `output_dir` or selected files from
`output_list`, and compares each with the gold standard of the same file name under `source_dir`
or selected files from `source_list`.
Calculates text accuracy and percent missing. After looping through the whole list, writes the
results to tsv, along with the aggregated accuracy and percent missing.
"""
if not output_list:
output_list = _listdir_recursive(output_dir)
if not source_list:
source_list = _listdir_recursive(source_dir)
if not output_list:
print("No output files to calculate edit distances for, exiting")
sys.exit(0)
rows = []
accuracy_scores: List[float] = []
percent_missing_scores: List[float] = []
# assumption: output file name convention is name-of-file.doc.json
for doc in output_list: # type: ignore
fn = (doc.split("/")[-1]).split(".json")[0]
doctype = fn.rsplit(".", 1)[-1]
fn_txt = fn + ".txt"
connector = doc.split("/")[0]
if fn_txt in source_list: # type: ignore
output_cct = elements_to_text(elements_from_json(os.path.join(output_dir, doc)))
source_cct = _read_text(os.path.join(source_dir, fn_txt))
accuracy = round(calculate_accuracy(output_cct, source_cct, weights), 3)
percent_missing = round(calculate_percent_missing_text(output_cct, source_cct), 3)
rows.append([fn, doctype, connector, accuracy, percent_missing])
accuracy_scores.append(accuracy)
percent_missing_scores.append(percent_missing)
headers = ["filename", "doctype", "connector", "cct-accuracy", "cct-%missing"]
_write_to_file(export_dir, "all-docs-cct.tsv", rows, headers)
agg_rows = []
agg_rows.append(
[
"cct-accuracy",
_mean(accuracy_scores),
_stdev(accuracy_scores),
_pstdev(accuracy_scores),
len(accuracy_scores),
],
):
return measure_text_edit_distance(
output_dir, source_dir, output_list, source_list, export_dir, weights
)
agg_rows.append(
[
"cct-%missing",
_mean(percent_missing_scores),
_stdev(percent_missing_scores),
_pstdev(percent_missing_scores),
len(percent_missing_scores),
],
)
_write_to_file(export_dir, "aggregate-scores-cct.tsv", agg_rows, agg_headers)
_display(agg_rows, agg_headers)
@main.command()
@click.option("--output_dir", type=str, help="Directory to structured output.")
@click.option("--source_dir", type=str, help="Directory to structured source.")
@click.option(
"--output_list",
type=str,
@@ -150,7 +67,6 @@ def measure_text_edit_distance(
help="Optional: list of selected structured output file names under the \
directory to be evaluated. If none, all files under the directory will be used.",
)
@click.option("--source_dir", type=str, help="Directory to structured source.")
@click.option(
"--source_list",
type=str,
@@ -165,132 +81,16 @@ def measure_text_edit_distance(
help="Directory to save the output evaluation metrics to. Defaults to \
your/working/dir/metrics/",
)
def measure_element_type_accuracy(
def measure_element_type_accuracy_command(
output_dir: str,
output_list: Optional[List[str]],
source_dir: str,
output_list: Optional[List[str]],
source_list: Optional[List[str]],
export_dir: str,
):
"""
Loops through the list of structured output from all of `output_dir` or selected files from
`output_list`, and compares each with the gold standard of the same file name under `source_dir`
or selected files from `source_list`.
Calculates element type frequency accuracy and percent missing. After looping through the
whole list, writes the results to tsv, along with the aggregated accuracy.
"""
if not output_list:
output_list = _listdir_recursive(output_dir)
if not source_list:
source_list = _listdir_recursive(source_dir)
rows = []
accuracy_scores: List[float] = []
for doc in output_list: # type: ignore
fn = (doc.split("/")[-1]).split(".json")[0]
doctype = fn.rsplit(".", 1)[-1]
connector = doc.split("/")[0]
if doc in source_list: # type: ignore
output = get_element_type_frequency(_read_text(os.path.join(output_dir, doc)))
source = get_element_type_frequency(_read_text(os.path.join(source_dir, doc)))
accuracy = round(calculate_element_type_percent_match(output, source), 3)
rows.append([fn, doctype, connector, accuracy])
accuracy_scores.append(accuracy)
headers = ["filename", "doctype", "connector", "element-type-accuracy"]
_write_to_file(export_dir, "all-docs-element-type-frequency.tsv", rows, headers)
agg_rows = []
agg_rows.append(
[
"element-type-accuracy",
_mean(accuracy_scores),
_stdev(accuracy_scores),
_pstdev(accuracy_scores),
len(accuracy_scores),
],
return measure_element_type_accuracy(
output_dir, source_dir, output_list, source_list, export_dir
)
_write_to_file(export_dir, "aggregate-scores-element-type.tsv", agg_rows, agg_headers)
_display(agg_rows, agg_headers)
def _listdir_recursive(dir: str):
listdir = []
for dirpath, _, filenames in os.walk(dir):
for filename in filenames:
# Remove the starting directory from the path to show the relative path
relative_path = os.path.relpath(dirpath, dir)
if relative_path == ".":
listdir.append(filename)
else:
listdir.append(f"{relative_path}/{filename}")
return listdir
def _write_to_file(dir: str, filename: str, rows: List[Any], headers: List[Any], mode: str = "w"):
if mode not in ["w", "a"]:
raise ValueError("Mode not supported. Mode must be one of [w, a].")
if dir and not os.path.exists(dir):
os.makedirs(dir)
with open(os.path.join(os.path.join(dir, filename)), mode, newline="") as tsv:
writer = csv.writer(tsv, delimiter="\t")
if mode == "w":
writer.writerow(headers)
writer.writerows(rows)
def _display(rows, headers):
col_widths = [
max(len(headers[i]), max(len(str(row[i])) for row in rows)) for i in range(len(headers))
]
click.echo(" ".join(headers[i].ljust(col_widths[i]) for i in range(len(headers))))
click.echo("-" * sum(col_widths) + "-" * (len(headers) - 1))
for row in rows:
formatted_row = []
for item in row:
if isinstance(item, float):
formatted_row.append(f"{item:.3f}")
else:
formatted_row.append(str(item))
click.echo(
" ".join(formatted_row[i].ljust(col_widths[i]) for i in range(len(formatted_row))),
)
def _mean(scores: List[float], rounding: Optional[int] = 3):
if len(scores) < 1:
return None
elif len(scores) == 1:
mean = scores[0]
else:
mean = statistics.mean(scores)
if not rounding:
return mean
return round(mean, rounding)
def _stdev(scores: List[float], rounding: Optional[int] = 3):
if len(scores) <= 1:
return None
if not rounding:
return statistics.stdev(scores)
return round(statistics.stdev(scores), rounding)
def _pstdev(scores: List[float], rounding: Optional[int] = 3):
if len(scores) <= 1:
return None
if not rounding:
return statistics.pstdev(scores)
return round(statistics.pstdev(scores), rounding)
def _read_text(path):
with open(path, errors="ignore") as f:
text = f.read()
return text
if __name__ == "__main__":

unstructured/metrics/evaluate.py Executable file

@@ -0,0 +1,232 @@
#! /usr/bin/env python3
import csv
import logging
import os
import statistics
import sys
from typing import Any, List, Optional, Tuple
import click
from unstructured.metrics.element_type import (
calculate_element_type_percent_match,
get_element_type_frequency,
)
from unstructured.metrics.text_extraction import calculate_accuracy, calculate_percent_missing_text
from unstructured.staging.base import elements_from_json, elements_to_text
logger = logging.getLogger("unstructured.ingest")
handler = logging.StreamHandler()
handler.name = "ingest_log_handler"
formatter = logging.Formatter("%(asctime)s %(processName)-10s %(levelname)-8s %(message)s")
handler.setFormatter(formatter)
# Only want to add the handler once
if "ingest_log_handler" not in [h.name for h in logger.handlers]:
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
agg_headers = ["strategy", "average", "sample_sd", "population_sd", "count"]
def measure_text_edit_distance(
output_dir: str,
source_dir: str,
output_list: Optional[List[str]] = None,
source_list: Optional[List[str]] = None,
export_dir: str = "metrics",
weights: Tuple[int, int, int] = (2, 1, 1),
) -> None:
"""
Loops through the list of structured output from all of `output_dir` or selected files from
`output_list`, and compares each with the gold standard of the same file name under `source_dir`
or selected files from `source_list`.
Calculates text accuracy and percent missing. After looping through the whole list, writes the
results to tsv, along with the aggregated accuracy and percent missing.
"""
if not output_list:
output_list = _listdir_recursive(output_dir)
if not source_list:
source_list = _listdir_recursive(source_dir)
if not output_list:
print("No output files to calculate edit distances for, exiting")
sys.exit(0)
rows = []
accuracy_scores: List[float] = []
percent_missing_scores: List[float] = []
# assumption: output file name convention is name-of-file.doc.json
for doc in output_list: # type: ignore
fn = (doc.split("/")[-1]).split(".json")[0]
doctype = fn.rsplit(".", 1)[-1]
fn_txt = fn + ".txt"
connector = doc.split("/")[0]
if fn_txt in source_list: # type: ignore
output_cct = elements_to_text(elements_from_json(os.path.join(output_dir, doc)))
source_cct = _read_text(os.path.join(source_dir, fn_txt))
accuracy = round(calculate_accuracy(output_cct, source_cct, weights), 3)
percent_missing = round(calculate_percent_missing_text(output_cct, source_cct), 3)
rows.append([fn, doctype, connector, accuracy, percent_missing])
accuracy_scores.append(accuracy)
percent_missing_scores.append(percent_missing)
headers = ["filename", "doctype", "connector", "cct-accuracy", "cct-%missing"]
_write_to_file(export_dir, "all-docs-cct.tsv", rows, headers)
agg_rows = []
agg_rows.append(
[
"cct-accuracy",
_mean(accuracy_scores),
_stdev(accuracy_scores),
_pstdev(accuracy_scores),
len(accuracy_scores),
],
)
agg_rows.append(
[
"cct-%missing",
_mean(percent_missing_scores),
_stdev(percent_missing_scores),
_pstdev(percent_missing_scores),
len(percent_missing_scores),
],
)
_write_to_file(export_dir, "aggregate-scores-cct.tsv", agg_rows, agg_headers)
_display(agg_rows, agg_headers)
def measure_element_type_accuracy(
output_dir: str,
source_dir: str,
output_list: Optional[List[str]] = None,
source_list: Optional[List[str]] = None,
export_dir: str = "metrics",
):
"""
Loops through the list of structured output from all of `output_dir` or selected files from
`output_list`, and compares each with the gold standard of the same file name under `source_dir`
or selected files from `source_list`.
Calculates element type frequency accuracy and percent missing. After looping through the
whole list, writes the results to tsv, along with the aggregated accuracy.
"""
if not output_list:
output_list = _listdir_recursive(output_dir)
if not source_list:
source_list = _listdir_recursive(source_dir)
rows = []
accuracy_scores: List[float] = []
for doc in output_list: # type: ignore
fn = (doc.split("/")[-1]).split(".json")[0]
doctype = fn.rsplit(".", 1)[-1]
connector = doc.split("/")[0]
if doc in source_list: # type: ignore
output = get_element_type_frequency(_read_text(os.path.join(output_dir, doc)))
source = get_element_type_frequency(_read_text(os.path.join(source_dir, doc)))
accuracy = round(calculate_element_type_percent_match(output, source), 3)
rows.append([fn, doctype, connector, accuracy])
accuracy_scores.append(accuracy)
headers = ["filename", "doctype", "connector", "element-type-accuracy"]
_write_to_file(export_dir, "all-docs-element-type-frequency.tsv", rows, headers)
agg_rows = []
agg_rows.append(
[
"element-type-accuracy",
_mean(accuracy_scores),
_stdev(accuracy_scores),
_pstdev(accuracy_scores),
len(accuracy_scores),
],
)
_write_to_file(export_dir, "aggregate-scores-element-type.tsv", agg_rows, agg_headers)
_display(agg_rows, agg_headers)
def _listdir_recursive(dir: str):
listdir = []
for dirpath, _, filenames in os.walk(dir):
for filename in filenames:
# Remove the starting directory from the path to show the relative path
relative_path = os.path.relpath(dirpath, dir)
if relative_path == ".":
listdir.append(filename)
else:
listdir.append(f"{relative_path}/{filename}")
return listdir
def _display(rows, headers):
col_widths = [
max(len(headers[i]), max(len(str(row[i])) for row in rows)) for i in range(len(headers))
]
click.echo(" ".join(headers[i].ljust(col_widths[i]) for i in range(len(headers))))
click.echo("-" * sum(col_widths) + "-" * (len(headers) - 1))
for row in rows:
formatted_row = []
for item in row:
if isinstance(item, float):
formatted_row.append(f"{item:.3f}")
else:
formatted_row.append(str(item))
click.echo(
" ".join(formatted_row[i].ljust(col_widths[i]) for i in range(len(formatted_row))),
)
def _write_to_file(dir: str, filename: str, rows: List[Any], headers: List[Any], mode: str = "w"):
if mode not in ["w", "a"]:
raise ValueError("Mode not supported. Mode must be one of [w, a].")
if dir and not os.path.exists(dir):
os.makedirs(dir)
with open(os.path.join(os.path.join(dir, filename)), mode, newline="") as tsv:
writer = csv.writer(tsv, delimiter="\t")
if mode == "w":
writer.writerow(headers)
writer.writerows(rows)
def _mean(scores: List[float], rounding: Optional[int] = 3):
if len(scores) < 1:
return None
elif len(scores) == 1:
mean = scores[0]
else:
mean = statistics.mean(scores)
if not rounding:
return mean
return round(mean, rounding)
def _stdev(scores: List[float], rounding: Optional[int] = 3):
if len(scores) <= 1:
return None
if not rounding:
return statistics.stdev(scores)
return round(statistics.stdev(scores), rounding)
def _pstdev(scores: List[float], rounding: Optional[int] = 3):
if len(scores) <= 1:
return None
if not rounding:
return statistics.pstdev(scores)
return round(statistics.pstdev(scores), rounding)
def _read_text(path):
with open(path, errors="ignore") as f:
text = f.read()
return text
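For reference, the aggregation performed by `_mean`, `_stdev`, and `_pstdev` when building each `aggregate-scores-*.tsv` row can be condensed into one self-contained helper (a sketch only; `summarize` is not part of the module):

```python
import statistics
from typing import List, Optional


def summarize(scores: List[float], rounding: int = 3) -> list:
    """Return [mean, sample sd, population sd, count], mirroring the module's
    _mean/_stdev/_pstdev helpers: sd values are None for fewer than two scores."""
    if not scores:
        return [None, None, None, 0]
    mean = round(statistics.mean(scores), rounding)
    sd = round(statistics.stdev(scores), rounding) if len(scores) > 1 else None
    psd = round(statistics.pstdev(scores), rounding) if len(scores) > 1 else None
    return [mean, sd, psd, len(scores)]


# e.g. the two cct-accuracy scores from all-docs-cct.tsv above
row = ["cct-accuracy", *summarize([0.783, 0.686])]
```

This reproduces the shape of the rows written under the `agg_headers` columns (`strategy`, `average`, `sample_sd`, `population_sd`, `count`).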